# How to Deploy a LangChain Agent API
A LangChain agent that runs in a notebook is a demo. A LangChain agent behind an HTTP API that handles concurrent users, streams tokens, enforces timeouts, and doesn't blow your LLM budget is a product. This guide covers the gap.
## Wrap the agent in FastAPI
Keep the agent construction outside the request handler so you build it once, not per request:
```python
# app.py
import os

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from langchain_openai import ChatOpenAI
from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain_core.prompts import ChatPromptTemplate

from tools import TOOLS  # your tool definitions

# Built once at import time and shared across requests.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Use tools when needed."),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])
agent = create_tool_calling_agent(llm, TOOLS, prompt)
executor = AgentExecutor(agent=agent, tools=TOOLS, max_iterations=6, max_execution_time=30)

app = FastAPI()


class Query(BaseModel):
    input: str


@app.post("/agent")
async def run_agent(q: Query):
    result = await executor.ainvoke({"input": q.input})
    return {"output": result["output"]}


@app.get("/healthz")
def health():
    return {"status": "ok"}
```

Two settings save you from runaway agents: `max_iterations` caps tool-calling loops, and `max_execution_time` bounds wall-clock time in seconds. Without them, a confused agent can loop until it exhausts your token budget.
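To make the two caps concrete, here is a plain-Python sketch of a loop with two exits: an iteration count and a wall-clock deadline. It illustrates the idea only and is not how `AgentExecutor` is implemented; `run_bounded` and `step` are hypothetical names.

```python
import time
from typing import Callable, Optional


def run_bounded(
    step: Callable[[int], Optional[str]],
    max_iterations: int = 6,
    max_execution_time: float = 30.0,
) -> str:
    """Call step(i) until it returns an answer or one of the two caps is hit."""
    deadline = time.monotonic() + max_execution_time
    for i in range(max_iterations):
        # The deadline is checked between steps, so it cannot interrupt
        # a single step that hangs.
        if time.monotonic() >= deadline:
            return "stopped: time limit"
        answer = step(i)
        if answer is not None:
            return answer
    return "stopped: iteration limit"


# A "confused" step never produces an answer, so the iteration cap ends the run.
print(run_bounded(lambda i: None, max_iterations=3))
# A well-behaved step answers on its third call (i == 2).
print(run_bounded(lambda i: "done" if i == 2 else None))
```

Because the deadline in this sketch is only checked between steps, a hung call inside a step would still block, which is one reason to also set timeouts on the LLM client itself.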
## Stream tokens for better UX
Agents can take seconds. Streaming makes the wait feel shorter:
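```python
@app.post("/agent/stream")
async def stream_agent(q: Query):
    async def gen():
        # astream yields one chunk per agent step; the final answer arrives
        # in the chunk that carries "output". For token-by-token output,
        # use executor.astream_events and forward the chat-model stream events.
        async for chunk in executor.astream({"input": q.input}):
            if "output" in chunk:
                yield chunk["output"]

    return StreamingResponse(gen(), media_type="text/plain")
```

## Don't trust the LLM provider to be fast or up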
Production agents need guardrails around the model call:
- Timeouts: set a request timeout on the LLM client so a hung provider doesn't pin a worker.
- Retries with backoff: transient 429/503s are common; retry a bounded number of times.
- Concurrency limits: cap in-flight requests so a traffic spike doesn't open hundreds of provider connections at once.
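The three guardrails above can be combined in one small wrapper. This is a minimal sketch using only `asyncio`; `guarded_call`, `TransientError`, and the default limits are hypothetical names and values chosen for illustration, and you would map your provider's 429/503 errors onto the retryable exception yourself.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Stand-in for a retryable provider failure such as HTTP 429 or 503."""


async def guarded_call(
    call: Callable[[], Awaitable[T]],
    slots: asyncio.Semaphore,
    timeout_s: float = 20.0,
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
) -> T:
    """Run call() under a concurrency cap, a per-attempt timeout, and bounded retries."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            # The slot is held only while an attempt is running, not during backoff.
            async with slots:
                return await asyncio.wait_for(call(), timeout=timeout_s)
        except (TransientError, asyncio.TimeoutError):
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Exponential backoff (base, 2x base, 4x base, ...) plus up to 50% jitter.
            delay = base_delay_s * 2 ** (attempt - 1)
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")


async def main() -> None:
    slots = asyncio.Semaphore(2)  # at most two provider calls in flight
    attempts = 0

    async def flaky() -> str:
        nonlocal attempts
        attempts += 1
        if attempts < 3:
            raise TransientError("simulated 429")
        return "ok"

    result = await guarded_call(flaky, slots, base_delay_s=0.01)
    print(result, "after", attempts, "attempts")


if __name__ == "__main__":
    asyncio.run(main())
```

In the request handler this would wrap the agent call, for example `await guarded_call(lambda: executor.ainvoke({"input": q.input}), slots)`. `ChatOpenAI` also accepts `timeout` and `max_retries` arguments, which may cover the first two bullets at the client level; check the behavior of the version you have installed.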
## Memory and state
If your agent needs conversation memory, store it server-side keyed by a session ID — never in process memory, which doesn't survive restarts or span replicas. A managed Postgres or Redis is the right home:
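```python
# Pseudocode: load_history/save_history are placeholders for your own storage
# code, backed by DATABASE_URL. The prompt also needs a
# ("placeholder", "{chat_history}") entry for the history to reach the model.
history = load_history(session_id)
result = await executor.ainvoke({"input": q.input, "chat_history": history})
save_history(session_id, result)
```

## Containerize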
```dockerfile
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```

## Deploy on PandaStack
1. Push to GitHub and create a container app in the [dashboard](https://dashboard.pandastack.io). It detects Python (or uses your Dockerfile), builds with rootless BuildKit, and serves an HTTPS endpoint.
2. Add `OPENAI_API_KEY` (or your provider's key) and tool credentials as encrypted env vars.
3. Provision a managed PostgreSQL for session memory. `DATABASE_URL` is auto-wired, so your code reads it directly.
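Reading those variables once at startup, and failing fast when one is missing, tends to surface misconfiguration at deploy time rather than on the first request. A minimal sketch follows; `Settings` and `load_settings` are hypothetical names, and the values in the example are placeholders, not real credentials.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED = ("OPENAI_API_KEY", "DATABASE_URL")


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    database_url: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read required variables; raise with a clear message if any is missing or empty."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {', '.join(missing)}")
    return Settings(
        openai_api_key=env["OPENAI_API_KEY"],
        database_url=env["DATABASE_URL"],
    )


# Example with an explicit mapping so the sketch runs anywhere.
settings = load_settings({
    "OPENAI_API_KEY": "sk-placeholder",
    "DATABASE_URL": "postgresql://user:pass@host:5432/db",
})
print(settings.database_url)
```

In the app you would call `load_settings()` with no arguments at import time so it reads the real process environment.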
## Sizing and scale-to-zero
Agents are I/O-bound (waiting on the LLM), so they don't need huge CPU — but they do hold connections open while waiting. A couple of considerations:
| Factor | Recommendation |
|---|---|
| Cold starts | Free tier scales to zero; first request after idle is slow. For a user-facing agent, use a paid tier to stay warm. |
| Memory | Larger context windows and heavy frameworks want headroom; m1/m2 memory-optimized tiers help if you load big prompts/embeddings. |
| Concurrency | Run multiple uvicorn workers and let the platform scale replicas under load. |
For internal tools or low traffic, the free tier's scale-to-zero is a feature, not a bug — you pay nothing when idle.
## Cost control is part of the deployment
LLM cost is your biggest variable. Beyond `max_iterations`:
- Log token usage per request (the response includes usage metadata) and aggregate it.
- Use a smaller/cheaper model for routing and a stronger one only when needed.
- Cache deterministic tool results so the agent doesn't re-fetch.
PandaStack's server-side metrics give you request rate and latency without instrumenting your app; pair that with your own token logging to watch spend.
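Token logging and tool-result caching can both start very small. The sketch below aggregates token counts per route and caches a deterministic lookup with `functools.lru_cache`. `TokenLedger`, `lookup_country_code`, and the per-token prices are hypothetical and purely illustrative; how you obtain the counts depends on your setup (for example, usage metadata on the model response or a callback), so the sketch takes them as plain integers.

```python
from collections import defaultdict
from dataclasses import dataclass
from functools import lru_cache

# Illustrative prices in USD per 1M tokens; replace with your provider's real rates.
PRICE_PER_M_INPUT = 1.00
PRICE_PER_M_OUTPUT = 2.00


@dataclass
class Usage:
    requests: int = 0
    input_tokens: int = 0
    output_tokens: int = 0

    @property
    def cost_usd(self) -> float:
        return (
            self.input_tokens * PRICE_PER_M_INPUT
            + self.output_tokens * PRICE_PER_M_OUTPUT
        ) / 1_000_000


class TokenLedger:
    """Aggregates token counts per key, for example a route or a user ID."""

    def __init__(self) -> None:
        self._by_key = defaultdict(Usage)

    def record(self, key: str, input_tokens: int, output_tokens: int) -> None:
        usage = self._by_key[key]
        usage.requests += 1
        usage.input_tokens += input_tokens
        usage.output_tokens += output_tokens

    def usage(self, key: str) -> Usage:
        return self._by_key[key]


@lru_cache(maxsize=1024)
def lookup_country_code(country: str) -> str:
    """Hypothetical deterministic tool: the same input always gives the same output."""
    return {"France": "FR", "Japan": "JP"}.get(country, "??")


ledger = TokenLedger()
ledger.record("/agent", input_tokens=1200, output_tokens=300)
ledger.record("/agent", input_tokens=800, output_tokens=200)
print(ledger.usage("/agent"), ledger.usage("/agent").cost_usd)

lookup_country_code("France")
lookup_country_code("France")  # served from the cache, no second lookup
print(lookup_country_code.cache_info())
```

An in-process ledger and cache reset on restart and are not shared across replicas, so for anything beyond a single instance you would likely push the counts to your logs or database and use a shared cache.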
## Verify in production
```bash
curl -X POST https://<app>/agent \
  -H 'Content-Type: application/json' \
  -d '{"input": "What is the weather in Paris, and should I bring an umbrella?"}'
```

Tail live logs to confirm tool calls fire, iterations stay bounded, and execution time stays under your cap.
## References
- [LangChain: Agents](https://python.langchain.com/docs/concepts/agents/)
- [LangChain: Streaming](https://python.langchain.com/docs/concepts/streaming/)
- [FastAPI: Deployment concepts](https://fastapi.tiangolo.com/deployment/concepts/)
- [OpenAI: Rate limits and retries](https://platform.openai.com/docs/guides/rate-limits)
Shipping an agent means handling timeouts, memory, and cost, not just prompts. PandaStack gives you HTTPS, encrypted secrets, and an auto-wired Postgres, with a free tier to start and paid tiers that stay warm. Deploy yours at [dashboard.pandastack.io](https://dashboard.pandastack.io).