Tutorial · 12 min read · 2026-07-01

How to Deploy a LangChain AI App to Production

Deploy a LangChain app to production: wrap it in an API, manage API keys and rate limits, stream responses, add a vector store, and handle long-running calls reliably.

Ajay Kumar
Founder & DevOps, PandaStack

LangChain makes it easy to prototype an LLM application, but production has different demands: you need an API surface, secure key management, streaming, a real vector store, and sane timeout handling for slow model calls. This guide takes a LangChain prototype to a deployable service.

Wrap the chain in an API

A notebook chain isn't deployable. Expose it through a web framework — FastAPI is a natural fit for async LLM calls:

from fastapi import FastAPI
from pydantic import BaseModel
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

app = FastAPI()
llm = ChatOpenAI(model='gpt-4o-mini', temperature=0)
prompt = ChatPromptTemplate.from_template('Answer concisely: {question}')
chain = prompt | llm

class Query(BaseModel):
    question: str

@app.post('/ask')
async def ask(q: Query):
    result = await chain.ainvoke({'question': q.question})
    return {'answer': result.content}

Note the async ainvoke — LLM calls are I/O-bound, and async lets one worker handle many concurrent requests while waiting on the model.

Manage API keys as secrets

Model-provider keys are billing credentials. Never hardcode them. Read from the environment and set them as platform secrets:

import os
# assert is skipped under `python -O`, so raise explicitly
if not os.environ.get('OPENAI_API_KEY'):
    raise RuntimeError('OPENAI_API_KEY is required')

Fail fast at startup if a required key is missing, rather than on the first user request.
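When a service needs more than one secret, the same fail-fast idea can be generalized so that a single startup error lists everything that is missing. The sketch below is one way to do that; the `REQUIRED_ENV` tuple and the helper names are hypothetical, not part of LangChain or FastAPI.

```python
import os

# Hypothetical list of variables this service needs; adjust to your app.
REQUIRED_ENV = ("OPENAI_API_KEY", "DATABASE_URL")


def missing_env(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


def require_env(names, environ=None):
    """Raise once, listing every missing variable in a single message."""
    missing = missing_env(names, environ)
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
```

Calling `require_env(REQUIRED_ENV)` at the top of the module that creates the FastAPI app makes a misconfigured deploy fail before it accepts traffic.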

Streaming responses

Users expect token-by-token output. Stream from the chain and forward the chunks as server-sent events or as a chunked response:

from fastapi.responses import StreamingResponse

@app.post('/stream')
async def stream(q: Query):
    async def gen():
        async for chunk in chain.astream({'question': q.question}):
            yield chunk.content
    return StreamingResponse(gen(), media_type='text/plain')

Streaming also cuts perceived latency: the first tokens appear as soon as the model emits them, which matters because a full completion can take several seconds.
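If the client is a browser using `EventSource`, the chunks need server-sent event framing rather than raw text. The following is a minimal sketch of that framing; the `token` payload key and the `done` event name are assumptions chosen for illustration, and `sse_stream` expects an async iterator of plain strings.

```python
import json


def sse_event(data, event=None):
    """Format one server-sent event frame."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # JSON-encode the payload so a newline inside a token cannot split the frame.
    lines.append("data: " + json.dumps(data))
    return "\n".join(lines) + "\n\n"


async def sse_stream(chunks):
    """Wrap an async iterator of text chunks as SSE frames, then signal completion."""
    async for text in chunks:
        if text:  # skip empty chunks
            yield sse_event({"token": text})
    yield sse_event({}, event="done")
```

To use it with the endpoint above, adapt the chain output to strings (for example, an async generator that yields `chunk.content`) and return `StreamingResponse(sse_stream(...), media_type='text/event-stream')`.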

Timeouts and retries

Model APIs occasionally hang or rate-limit you. Set explicit timeouts and bounded retries:

llm = ChatOpenAI(model='gpt-4o-mini', timeout=30, max_retries=2)

Without a timeout, a stuck upstream call ties up a worker until the platform kills the request. Surface 429s back to the client rather than retrying forever.
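The same pattern applies to upstream calls that have no built-in timeout or retry settings, such as a custom tool or a retrieval step. Here is a rough sketch using only the standard library; `RateLimited` is a hypothetical stand-in for whatever rate-limit exception your provider's client raises, and the parameter names are illustrative.

```python
import asyncio
import random


class RateLimited(Exception):
    """Hypothetical stand-in for a provider's HTTP 429 error."""


async def call_with_budget(make_call, *, timeout=30.0, max_retries=2, base_delay=0.5):
    """Run make_call() with a per-attempt timeout and a bounded number of retries.

    Only timeouts and RateLimited are retried. After the last attempt the
    error is re-raised so the caller can map it to an HTTP response.
    """
    for attempt in range(max_retries + 1):
        try:
            return await asyncio.wait_for(make_call(), timeout=timeout)
        except (asyncio.TimeoutError, RateLimited):
            if attempt == max_retries:
                raise
            # Exponential backoff with jitter so retries do not arrive in lockstep.
            delay = base_delay * (2 ** attempt)
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
```

In an endpoint, catching the re-raised `RateLimited` and returning a 429 (ideally with a `Retry-After` header) lets the client decide when to try again instead of holding the connection open.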

Adding a vector store

For retrieval-augmented generation you need a persistent vector store. PostgreSQL with the pgvector extension is a pragmatic choice — one database for both relational data and embeddings:

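import os
from langchain_openai import OpenAIEmbeddings
from langchain_postgres import PGVector

# The embedding model name is an example; use the model you index with.
embeddings = OpenAIEmbeddings(model='text-embedding-3-small')

# langchain_postgres uses psycopg 3, so the URL should start with postgresql+psycopg://
store = PGVector(
    connection=os.environ['DATABASE_URL'],
    embeddings=embeddings,
    collection_name='docs',
)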

This avoids running a separate vector database for small-to-medium workloads.
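One practical wrinkle: the langchain-postgres examples use a SQLAlchemy-style connection string with the psycopg 3 driver (`postgresql+psycopg://...`), while many platforms inject a plain `postgres://` or `postgresql://` URL. Whether your injected `DATABASE_URL` needs rewriting is something to verify; if it does, a small helper like this hypothetical `to_psycopg_url` can normalize it.

```python
def to_psycopg_url(url):
    """Rewrite a Postgres URL to the SQLAlchemy psycopg 3 dialect form."""
    scheme, sep, rest = url.partition("://")
    if not sep:
        raise ValueError("not a URL: missing '://'")
    if scheme in ("postgres", "postgresql"):
        return "postgresql+psycopg://" + rest
    if scheme == "postgresql+psycopg":
        return url  # already in the expected form
    raise ValueError(f"unsupported scheme: {scheme}")
```

The result would be passed as the `connection` argument, for example `connection=to_psycopg_url(os.environ['DATABASE_URL'])`.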

Dockerfile

FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Deploying on PandaStack

  1. Connect your repo as a container app in the [dashboard](https://dashboard.pandastack.io). PandaStack auto-detects Python and installs dependencies.
  2. Set OPENAI_API_KEY (or your provider's key) and other secrets as environment variables.
  3. Attach a managed PostgreSQL database for your vector store and chat history; DATABASE_URL is injected automatically, ready for PGVector.
  4. PandaStack deploys via Helm with automatic SSL, live logs, and server-side metrics.

A note on long requests and scale-to-zero

LLM responses can take many seconds. Keep request timeouts generous on the ingress, and prefer streaming so the connection stays active. For free-tier apps with scale-to-zero, the first request after idle cold-starts the container *and then* makes a model call — noticeable latency. For interactive products, run on a tier that keeps an instance warm.

| Production concern | Approach |
| --- | --- |
| Slow model calls | Streaming + explicit timeouts |
| Key safety | Platform env secrets, fail-fast |
| Retrieval | pgvector on managed PostgreSQL |
| Rate limits | Surface 429s, bounded retries |
| Cold start | Warm tier for interactive apps |

Cost control

LLM usage is your biggest variable cost. Cache identical prompts, set max_tokens, and log token usage (LangChain callbacks expose it) so you can attribute spend. The compute to run the app is usually trivial next to the model bill.
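To attribute spend, token counts have to be turned into a cost and grouped by something meaningful, such as a user or a feature. The sketch below assumes you can obtain a usage dict with `input_tokens` and `output_tokens` per call (recent LangChain chat model responses carry this as `usage_metadata`, though availability depends on the provider). The prices and the `example-model` name are placeholders, not real rates.

```python
from collections import defaultdict

# Hypothetical prices in USD per million tokens; substitute your provider's rates.
PRICES = {"example-model": {"input": 0.15, "output": 0.60}}


class SpendTracker:
    """Accumulate estimated model spend per attribution key (user, feature, ...)."""

    def __init__(self, prices):
        self.prices = prices
        self.by_key = defaultdict(float)

    def record(self, key, model, usage):
        """Add one call's cost; usage needs 'input_tokens' and 'output_tokens'."""
        price = self.prices[model]
        cost = (
            usage["input_tokens"] * price["input"]
            + usage["output_tokens"] * price["output"]
        ) / 1_000_000
        self.by_key[key] += cost
        return cost
```

Logging the returned cost alongside the request ID makes it straightforward to spot which endpoints or users drive the bill.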

References

  • [LangChain documentation](https://python.langchain.com/docs/introduction/)
  • [LangChain streaming guide](https://python.langchain.com/docs/concepts/streaming/)
  • [pgvector extension](https://github.com/pgvector/pgvector)
  • [FastAPI documentation](https://fastapi.tiangolo.com/)

Deploy your LangChain service with an auto-wired pgvector-ready database on PandaStack's free tier — start at [dashboard.pandastack.io](https://dashboard.pandastack.io).
