# How to Deploy a LangChain Agent API

A LangChain agent that runs in a notebook is a demo. A LangChain agent behind an HTTP API that handles concurrent users, streams tokens, enforces timeouts, and doesn't blow your LLM budget is a product. This guide covers the gap.

## Wrap the agent in FastAPI

Keep the agent construction outside the request handler so you build it once, not per request:

```python
# app.py
import os
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from langchain_openai import ChatOpenAI
from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain_core.prompts import ChatPromptTemplate
from tools import TOOLS  # your tool definitions

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Use tools when needed."),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])
agent = create_tool_calling_agent(llm, TOOLS, prompt)
executor = AgentExecutor(agent=agent, tools=TOOLS, max_iterations=6, max_execution_time=30)

app = FastAPI()

class Query(BaseModel):
    input: str

@app.post("/agent")
async def run_agent(q: Query):
    result = await executor.ainvoke({"input": q.input})
    return {"output": result["output"]}

@app.get("/healthz")
def health():
    return {"status": "ok"}
```

Two settings that save you from runaway agents: max_iterations caps tool-calling loops, and max_execution_time bounds wall-clock time. Without these, a confused agent can loop until it exhausts your token budget.
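
To make the two caps concrete, here is a toy loop in plain Python. It is not LangChain's implementation; `run_bounded` and `step` are hypothetical names that only illustrate how an iteration cap and a wall-clock cap interact.

```python
import time
from typing import Callable, Optional


def run_bounded(
    step: Callable[[int], Optional[str]],
    max_iterations: int = 6,
    max_execution_time: float = 30.0,
) -> str:
    """Call step() until it returns an answer or one of the caps is hit.

    Illustrative only: step(i) stands in for one think/tool-call round and
    returns a final answer string, or None to request another round.
    """
    started = time.monotonic()
    for i in range(max_iterations):
        # Check the wall-clock budget before starting another round.
        if time.monotonic() - started > max_execution_time:
            return "stopped: time limit"
        answer = step(i)
        if answer is not None:
            return answer
    return "stopped: iteration limit"


# A "confused" step never produces an answer, so the iteration cap ends the run.
print(run_bounded(lambda i: None))
# A healthy step answers on its third round.
print(run_bounded(lambda i: "done" if i == 2 else None))
```

Either cap alone leaves a gap: an iteration cap does not help when a single round hangs, and a time cap does not stop a fast loop from burning tokens until the deadline.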

## Stream tokens for better UX

Agents can take seconds. Streaming makes the wait feel shorter:

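```python
@app.post("/agent/stream")
async def stream_agent(q: Query):
    async def gen():
        # astream_events emits fine-grained events; keep only LLM token chunks.
        # (executor.astream yields whole steps, so "output" arrives in one piece.)
        async for event in executor.astream_events({"input": q.input}, version="v2"):
            if event["event"] == "on_chat_model_stream":
                token = event["data"]["chunk"].content
                if isinstance(token, str) and token:
                    yield token
    return StreamingResponse(gen(), media_type="text/plain")
```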

## Don't trust the LLM provider to be fast or up

Production agents need guardrails around the model call:

- Timeouts: set a request timeout on the LLM client so a hung provider doesn't pin a worker.
- Retries with backoff: transient 429/503s are common; retry a bounded number of times.
- Concurrency limits: cap in-flight requests so a traffic spike doesn't open hundreds of provider connections at once.
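
The three guardrails can be combined in one small wrapper. The following is a minimal sketch using only `asyncio`; `call_with_guardrails` and `TransientError` are hypothetical names, and you would need to map your client's 429/503 exceptions onto `TransientError` yourself. Many client libraries expose their own timeout and retry settings, so check those first.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Stand-in for a provider 429/503; map your client's errors onto this."""


async def call_with_guardrails(
    call: Callable[[], Awaitable[T]],
    limiter: asyncio.Semaphore,
    timeout_s: float = 20.0,
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
) -> T:
    """Run call() under a concurrency cap, a per-attempt timeout, and bounded retries."""
    async with limiter:  # cap the number of in-flight provider calls
        for attempt in range(1, max_attempts + 1):
            try:
                return await asyncio.wait_for(call(), timeout=timeout_s)
            except (TransientError, asyncio.TimeoutError):
                if attempt == max_attempts:
                    raise  # out of attempts: surface the last error
                # Exponential backoff with jitter: base, 2x base, 4x base, ...
                delay = base_delay_s * 2 ** (attempt - 1)
                await asyncio.sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable when max_attempts >= 1")
```

In the handler this would look like `await call_with_guardrails(lambda: executor.ainvoke({"input": q.input}), limiter)`, with one `asyncio.Semaphore` created at startup and shared across requests. The slot is held during backoff on purpose, so retries do not add to the pressure on a struggling provider.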

## Memory and state

If your agent needs conversation memory, store it server-side keyed by a session ID — never in process memory, which doesn't survive restarts or span replicas. A managed Postgres or Redis is the right home:

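```python
# Sketch: load_history/save_history are your own helpers backed by DATABASE_URL.
# The prompt also needs a ("placeholder", "{chat_history}") entry to use the history.
history = load_history(session_id)
result = await executor.ainvoke({"input": q.input, "chat_history": history})
save_history(session_id, q.input, result["output"])
```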
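
One possible shape for those helpers is sketched below. It uses the standard-library `sqlite3` module purely as a stand-in for the managed Postgres so the example runs anywhere; the table layout and both function signatures are assumptions, not part of any library.

```python
import sqlite3
from typing import List, Tuple

# Assumption: sqlite3 stands in for the managed Postgres so the sketch is runnable.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS chat_history ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " session_id TEXT NOT NULL,"
    " role TEXT NOT NULL,"
    " content TEXT NOT NULL)"
)


def load_history(session_id: str, limit: int = 20) -> List[Tuple[str, str]]:
    """Return the most recent (role, content) pairs for a session, oldest first."""
    rows = conn.execute(
        "SELECT role, content FROM chat_history"
        " WHERE session_id = ? ORDER BY id DESC LIMIT ?",
        (session_id, limit),
    ).fetchall()
    return [(role, content) for role, content in reversed(rows)]


def save_history(session_id: str, user_input: str, output: str) -> None:
    """Append one user/assistant exchange to the session's history."""
    conn.executemany(
        "INSERT INTO chat_history (session_id, role, content) VALUES (?, ?, ?)",
        [(session_id, "human", user_input), (session_id, "ai", output)],
    )
    conn.commit()
```

The `limit` argument matters for cost as much as for correctness: replaying an unbounded history on every request grows the prompt, and the bill, with each turn.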

## Containerize

```dockerfile
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```

## Deploy on PandaStack

1. Push to GitHub and create a container app in the [dashboard](https://dashboard.pandastack.io). It detects Python (or uses your Dockerfile), builds with rootless BuildKit, and serves an HTTPS endpoint.
2. Add `OPENAI_API_KEY` (or your provider's key) and tool credentials as encrypted env vars.
3. Provision a managed PostgreSQL for session memory. `DATABASE_URL` is auto-wired, so your code reads it directly.
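
Since both values arrive as environment variables, it can help to read them once at startup and fail fast if one is missing. A minimal sketch follows; `Settings` and `load_settings` are hypothetical names, and the list of required variables is an assumption based on the steps above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    database_url: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read required settings and fail fast, naming every missing variable."""
    required = ("OPENAI_API_KEY", "DATABASE_URL")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return Settings(
        openai_api_key=env["OPENAI_API_KEY"],
        database_url=env["DATABASE_URL"],
    )
```

Calling `load_settings()` at import time in `app.py` turns a misconfigured deploy into a crash at startup, which is easier to spot in the logs than a failure on the first user request.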

## Sizing and scale-to-zero

Agents are I/O-bound (waiting on the LLM), so they don't need much CPU, but they do hold connections open while waiting. A few considerations:

| Factor | Recommendation |
| --- | --- |
| Cold starts | Free tier scales to zero; first request after idle is slow. For a user-facing agent, use a paid tier to stay warm. |
| Memory | Larger context windows and heavy frameworks want headroom; m1/m2 memory-optimized tiers help if you load big prompts/embeddings. |
| Concurrency | Run multiple uvicorn workers and let the platform scale replicas under load. |

For internal tools or low traffic, the free tier's scale-to-zero is a feature, not a bug — you pay nothing when idle.

## Cost control is part of the deployment

LLM cost is your biggest variable. Beyond max_iterations:

- Log token usage per request (the response includes usage metadata) and aggregate it.
- Use a smaller/cheaper model for routing and a stronger one only when needed.
- Cache deterministic tool results so the agent doesn't re-fetch.
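
Two of these points, usage aggregation and tool-result caching, fit in a few lines of plain Python. The sketch below makes several assumptions: the prices are placeholders rather than real rates, the usage dict uses `input_tokens`/`output_tokens` keys (adapt this if your provider reports a different shape), and `record_usage` and `ttl_cache` are hypothetical helpers.

```python
import time
from collections import defaultdict
from typing import Any, Callable, Dict, Hashable, Tuple

# Assumption: placeholder prices in USD per 1M tokens; substitute your provider's rates.
PRICE_PER_MTOK = {"input": 1.0, "output": 2.0}

usage_totals: Dict[str, int] = defaultdict(int)


def record_usage(usage: Dict[str, int]) -> float:
    """Add one request's token counts to the totals and return its estimated cost."""
    input_tokens = usage.get("input_tokens", 0)
    output_tokens = usage.get("output_tokens", 0)
    usage_totals["input_tokens"] += input_tokens
    usage_totals["output_tokens"] += output_tokens
    return (
        input_tokens * PRICE_PER_MTOK["input"]
        + output_tokens * PRICE_PER_MTOK["output"]
    ) / 1_000_000


def ttl_cache(ttl_s: float) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Cache a deterministic tool's results by positional arguments for ttl_s seconds."""

    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        store: Dict[Tuple[Hashable, ...], Tuple[float, Any]] = {}

        def wrapper(*args: Hashable) -> Any:
            now = time.monotonic()
            hit = store.get(args)
            if hit is not None and now - hit[0] < ttl_s:
                return hit[1]  # fresh enough: skip the underlying call
            value = fn(*args)
            store[args] = (now, value)
            return value

        return wrapper

    return decorator
```

The cache here lives in process memory, so each replica keeps its own copy and loses it on restart. That is acceptable for a cache, unlike for conversation history, because a miss only costs one extra fetch.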

PandaStack's server-side metrics give you request rate and latency without instrumenting your app; pair that with your own token logging to watch spend.

## Verify in production

```shell
curl -X POST https://<app>/agent \
  -H 'Content-Type: application/json' \
  -d '{"input": "What is the weather in Paris, and should I bring an umbrella?"}'
```

Tail live logs to confirm tool calls fire, iterations stay bounded, and execution time stays under your cap.
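
If you want the same check as a repeatable script, for example after each deploy, a standard-library version might look like the sketch below. `build_request` and `smoke_test` are hypothetical helpers, and the base URL you pass in is your own app's address.

```python
import json
import urllib.request


def build_request(base_url: str, question: str) -> urllib.request.Request:
    """Build the same POST to /agent that the curl command sends."""
    body = json.dumps({"input": question}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/agent",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def smoke_test(base_url: str, question: str, timeout_s: float = 45.0) -> str:
    """POST one question and return the agent's output, raising on a bad response."""
    request = build_request(base_url, question)
    # urlopen raises HTTPError on non-2xx statuses and on timeouts.
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        payload = json.loads(resp.read().decode("utf-8"))
    output = payload.get("output")
    if not isinstance(output, str) or not output.strip():
        raise AssertionError(f"unexpected response: {payload!r}")
    return output
```

The client timeout is set above the 30-second `max_execution_time` so that a slow but bounded agent run is reported by the server rather than cut off by the script.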

Shipping an agent means handling timeouts, memory, and cost, not just prompts. PandaStack gives you HTTPS, encrypted secrets, an auto-wired Postgres, and a scale-to-zero free tier to start. Deploy yours at [dashboard.pandastack.io](https://dashboard.pandastack.io).
