Tutorial · 12 min read · 2026-07-02

How to Deploy a LangChain Agent API

Wrap a LangChain agent in a FastAPI service and deploy it to production with streaming responses, timeouts, memory, and the cost controls agents always need.

Ajay Kumar
Founder & DevOps, PandaStack

A LangChain agent that runs in a notebook is a demo. A LangChain agent behind an HTTP API that handles concurrent users, streams tokens, enforces timeouts, and doesn't blow your LLM budget is a product. This guide covers the gap.

Wrap the agent in FastAPI

Keep the agent construction outside the request handler so you build it once, not per request:

# app.py
import os
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from langchain_openai import ChatOpenAI
from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain_core.prompts import ChatPromptTemplate
from tools import TOOLS  # your tool definitions

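# Build the model, prompt, and executor once at import time, not per request.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Use tools when needed."),
    ("placeholder", "{chat_history}"),  # optional; stays empty unless the caller passes history
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])
agent = create_tool_calling_agent(llm, TOOLS, prompt)
# max_iterations caps tool-calling loops; max_execution_time is a wall-clock cap in seconds.
executor = AgentExecutor(agent=agent, tools=TOOLS, max_iterations=6, max_execution_time=30)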

app = FastAPI()

class Query(BaseModel):
    input: str

@app.post("/agent")
async def run_agent(q: Query):
    result = await executor.ainvoke({"input": q.input})
    return {"output": result["output"]}

@app.get("/healthz")
def health():
    return {"status": "ok"}

Two settings that save you from runaway agents: max_iterations caps tool-calling loops, and max_execution_time bounds wall-clock time. Without these, a confused agent can loop until it exhausts your token budget.
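To make the two caps concrete, here is a toy loop in plain Python that stops on whichever limit is hit first. It only illustrates the idea and is not LangChain's actual implementation; `run_bounded` and `step` are hypothetical names.

```python
import time
from typing import Callable, Optional


def run_bounded(
    step: Callable[[int], Optional[str]],
    max_iterations: int = 6,
    max_execution_time: float = 30.0,
) -> str:
    """Call step() until it returns an answer or one of the caps is reached.

    step(i) stands in for one think/act cycle of an agent: it returns a
    final answer string, or None when it wants another tool call.
    """
    started = time.monotonic()
    for i in range(max_iterations):
        # Wall-clock cap: checked before starting each new cycle.
        if time.monotonic() - started > max_execution_time:
            return "stopped: time limit"
        answer = step(i)
        if answer is not None:
            return answer
    # Iteration cap: the loop ran out without a final answer.
    return "stopped: iteration limit"


# A "confused" agent that never produces an answer is cut off after 6 cycles.
calls = []
outcome = run_bounded(lambda i: calls.append(i))  # list.append returns None
```

Each cycle in a real agent is at least one LLM call, so the iteration cap is also a rough upper bound on what a single request can spend.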

Stream tokens for better UX

Agents can take seconds. Streaming makes the wait feel shorter:

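@app.post("/agent/stream")
async def stream_agent(q: Query):
    async def gen():
        # executor.astream() yields whole agent steps, so the answer arrives in one chunk.
        # astream_events() surfaces the model's per-token chunks instead.
        # On older LangChain releases you may also need streaming=True on ChatOpenAI.
        async for event in executor.astream_events({"input": q.input}, version="v2"):
            if event["event"] == "on_chat_model_stream":
                token = event["data"]["chunk"].content
                if isinstance(token, str) and token:  # skip empty tool-call chunks
                    yield token
    return StreamingResponse(gen(), media_type="text/plain")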

Don't trust the LLM provider to be fast or up

Production agents need guardrails around the model call:

  • Timeouts: set a request timeout on the LLM client so a hung provider doesn't pin a worker.
  • Retries with backoff: transient 429/503s are common; retry a bounded number of times.
  • Concurrency limits: cap in-flight requests so a traffic spike doesn't open hundreds of provider connections at once.
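The three guardrails above can be sketched with nothing but `asyncio`. This is a minimal, provider-agnostic illustration under stated assumptions: `guarded_call` and `TransientError` are hypothetical names, and the default values are placeholders. `ChatOpenAI` also accepts `timeout` and `max_retries` arguments, which may be all you need for the first two bullets.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Stand-in for a retryable provider failure such as HTTP 429 or 503."""


async def guarded_call(
    call: Callable[[], Awaitable[T]],
    slots: asyncio.Semaphore,
    timeout_s: float = 20.0,
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
) -> T:
    """Run call() with a concurrency cap, a per-attempt timeout, and bounded retries."""
    for attempt in range(1, max_attempts + 1):
        try:
            # Concurrency limit: at most `slots` calls are in flight at once.
            async with slots:
                # Timeout: a hung provider raises asyncio.TimeoutError here.
                return await asyncio.wait_for(call(), timeout=timeout_s)
        except TransientError:
            if attempt == max_attempts:
                raise
            # Exponential backoff with jitter; the slot is released while sleeping.
            delay = base_delay_s * 2 ** (attempt - 1)
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
    raise ValueError("max_attempts must be at least 1")
```

In a handler this might look like `await guarded_call(lambda: executor.ainvoke({"input": q.input}), slots)`, with one shared `asyncio.Semaphore` created at startup. Timeouts are deliberately not retried in this sketch, since retrying a slow agent run can double its cost.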

Memory and state

If your agent needs conversation memory, store it server-side keyed by a session ID — never in process memory, which doesn't survive restarts or span replicas. A managed Postgres or Redis is the right home:

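# Pseudocode: load_history/save_history are your own helpers backed by DATABASE_URL.
# The prompt needs a chat_history placeholder, or the history is silently ignored.
history = load_history(session_id)  # prior messages for this session, oldest first
result = await executor.ainvoke({"input": q.input, "chat_history": history})
save_history(session_id, q.input, result["output"])  # persist the new turn, not the raw result dict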
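One way the two helpers could look is sketched below. To keep it self-contained it uses the standard library's `sqlite3` as a stand-in for the managed Postgres; the table name, the `open_store` helper, and the extra `conn` argument are all assumptions for illustration, and you would swap in a Postgres driver for production.

```python
import sqlite3
from typing import List, Tuple

Turn = Tuple[str, str]  # (role, content), for example ("human", "hi")


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the store and create the (hypothetical) chat_turns table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chat_turns ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " session_id TEXT NOT NULL,"
        " role TEXT NOT NULL,"
        " content TEXT NOT NULL)"
    )
    return conn


def load_history(conn: sqlite3.Connection, session_id: str, limit: int = 20) -> List[Turn]:
    """Return the most recent `limit` messages for a session, oldest first."""
    rows = conn.execute(
        "SELECT role, content FROM ("
        " SELECT id, role, content FROM chat_turns"
        " WHERE session_id = ? ORDER BY id DESC LIMIT ?"
        ") ORDER BY id ASC",
        (session_id, limit),
    ).fetchall()
    return [(role, content) for role, content in rows]


def save_history(conn: sqlite3.Connection, session_id: str, user_input: str, output: str) -> None:
    """Append one human/ai exchange to the session."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO chat_turns (session_id, role, content) VALUES (?, ?, ?)",
            [(session_id, "human", user_input), (session_id, "ai", output)],
        )
```

The `limit` keeps old sessions from growing the prompt without bound, which matters for cost as much as for latency. LangChain generally accepts `(role, content)` tuples where messages are expected, but check that against the version you run.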

Containerize

FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

Deploy on PandaStack

  1. Push to GitHub and create a container app in the [dashboard](https://dashboard.pandastack.io). It detects Python (or uses your Dockerfile), builds with rootless BuildKit, and serves an HTTPS endpoint.
  2. Add OPENAI_API_KEY (or your provider's key) and tool credentials as encrypted env vars.
  3. Provision a managed PostgreSQL for session memory. DATABASE_URL is auto-wired, so your code reads it directly.
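Because the app depends on those variables, it can help to fail at startup rather than on the first request. A minimal sketch, assuming the two variable names from the steps above; `load_settings` is a hypothetical helper, not part of any platform SDK.

```python
import os
from typing import Dict, Iterable, Mapping


def load_settings(
    required: Iterable[str] = ("OPENAI_API_KEY", "DATABASE_URL"),
    env: Mapping[str, str] = os.environ,
) -> Dict[str, str]:
    """Return the required settings, or raise if any are missing or empty."""
    names = list(required)
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in names}
```

Calling `load_settings()` at import time in `app.py` means a misconfigured deploy shows up as a failed start in the logs instead of a 500 on a user's request.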

Sizing and scale-to-zero

Agents are I/O-bound (waiting on the LLM), so they don't need huge CPU — but they do hold connections open while waiting. A couple of considerations:

| Factor | Recommendation |
| --- | --- |
| Cold starts | Free tier scales to zero; first request after idle is slow. For a user-facing agent, use a paid tier to stay warm. |
| Memory | Larger context windows and heavy frameworks want headroom; m1/m2 memory-optimized tiers help if you load big prompts/embeddings. |
| Concurrency | Run multiple uvicorn workers and let the platform scale replicas under load. |

For internal tools or low traffic, the free tier's scale-to-zero is a feature, not a bug — you pay nothing when idle.

Cost control is part of the deployment

LLM cost is your biggest variable. Beyond max_iterations:

  • Log token usage per request (the response includes usage metadata) and aggregate it.
  • Use a smaller/cheaper model for routing and a stronger one only when needed.
  • Cache deterministic tool results so the agent doesn't re-fetch.
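The first and third bullets can be sketched in a few lines. The usage keys below follow the shape of LangChain's `usage_metadata` (`input_tokens`, `output_tokens`); how you pull usage out of an agent run depends on your version, so treat that part as an assumption. `UsageLog`, `TTLCache`, and the prices are hypothetical.

```python
import time
from collections import defaultdict
from typing import Any, Callable, Dict, Hashable, Tuple

# Hypothetical prices in USD per 1M tokens; substitute your provider's real rates.
PRICE_PER_1M = {"input_tokens": 1.00, "output_tokens": 2.00}


class UsageLog:
    """Running totals of token usage across requests."""

    def __init__(self) -> None:
        self.totals: Dict[str, int] = defaultdict(int)

    def record(self, usage: Dict[str, int]) -> float:
        """Add one request's usage and return its estimated cost in USD."""
        cost = 0.0
        for key, price in PRICE_PER_1M.items():
            tokens = usage.get(key, 0)
            self.totals[key] += tokens
            cost += tokens * price / 1_000_000
        self.totals["requests"] += 1
        return cost


class TTLCache:
    """Cache tool results for ttl_s seconds so repeated lookups are not re-fetched."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl_s = ttl_s
        self._clock = clock
        self._items: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_compute(self, key: Hashable, compute: Callable[[], Any]) -> Any:
        now = self._clock()
        hit = self._items.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh: skip the fetch
        value = compute()
        self._items[key] = (now + self._ttl_s, value)
        return value
```

Both objects live in process memory, so each replica keeps its own totals and its own cache. That is fine for a first look at spend; for fleet-wide numbers, write the per-request record to your logs or database and aggregate there.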

PandaStack's server-side metrics give you request rate and latency without instrumenting your app; pair that with your own token logging to watch spend.

Verify in production

curl -X POST https://<app>/agent \
  -H 'Content-Type: application/json' \
  -d '{"input": "What is the weather in Paris, and should I bring an umbrella?"}'

Tail live logs to confirm tool calls fire, iterations stay bounded, and execution time stays under your cap.

References

  • [LangChain: Agents](https://python.langchain.com/docs/concepts/agents/)
  • [LangChain: Streaming](https://python.langchain.com/docs/concepts/streaming/)
  • [FastAPI: Deployment concepts](https://fastapi.tiangolo.com/deployment/concepts/)
  • [OpenAI: Rate limits and retries](https://platform.openai.com/docs/guides/rate-limits)

Shipping an agent means handling timeouts, memory, and cost, not just prompts. PandaStack gives you HTTPS, encrypted secrets, an auto-wired Postgres, and a free tier to start, with paid tiers when you need warm compute. Deploy yours at [dashboard.pandastack.io](https://dashboard.pandastack.io).
