LangChain makes it easy to prototype an LLM application, but production has different demands: you need an API surface, secure key management, streaming, a real vector store, and sane timeout handling for slow model calls. This guide takes a LangChain prototype to a deployable service.
Wrap the chain in an API
A notebook chain isn't deployable. Expose it through a web framework — FastAPI is a natural fit for async LLM calls:
```python
from fastapi import FastAPI
from pydantic import BaseModel
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

app = FastAPI()
llm = ChatOpenAI(model='gpt-4o-mini', temperature=0)
prompt = ChatPromptTemplate.from_template('Answer concisely: {question}')
chain = prompt | llm


class Query(BaseModel):
    question: str


@app.post('/ask')
async def ask(q: Query):
    result = await chain.ainvoke({'question': q.question})
    return {'answer': result.content}
```

Note the async `ainvoke`: LLM calls are I/O-bound, and async lets one worker handle many concurrent requests while waiting on the model.
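To make the concurrency claim concrete, here is a small sketch that simulates the model call with `asyncio.sleep` instead of a real LLM. The name `fake_model_call` and the 0.2-second latency are assumptions for illustration only; the point is that ten waiting requests overlap on a single event loop.

```python
import asyncio
import time


async def fake_model_call(question: str, latency: float = 0.2) -> str:
    """Stand-in for chain.ainvoke: waits like a network call, uses no CPU."""
    await asyncio.sleep(latency)
    return f"answer to: {question}"


async def handle_many(n: int) -> tuple[list[str], float]:
    """Run n simulated requests concurrently; return (answers, elapsed seconds)."""
    start = time.perf_counter()
    answers = await asyncio.gather(*(fake_model_call(f"q{i}") for i in range(n)))
    return list(answers), time.perf_counter() - start


if __name__ == "__main__":
    answers, elapsed = asyncio.run(handle_many(10))
    # Elapsed time stays close to one call's latency rather than ten times it.
    print(f"{len(answers)} answers in {elapsed:.2f}s")
```

A synchronous handler would serve the same ten requests one after another, so total time would grow with the number of requests.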
Manage API keys as secrets
Model-provider keys are billing credentials. Never hardcode them. Read from the environment and set them as platform secrets:
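```python
import os

# Check at import time; unlike assert, this is not stripped by `python -O`.
if not os.environ.get('OPENAI_API_KEY'):
    raise RuntimeError('OPENAI_API_KEY is required')
```

Fail fast at startup if a required key is missing, rather than on the first user request.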
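Once a service needs more than one secret, it can help to check them all together and report every missing name at once. The helper below is a minimal sketch; `require_env` is a hypothetical name, not part of LangChain or FastAPI.

```python
import os


def require_env(*names: str) -> dict[str, str]:
    """Return the named environment variables, or raise listing every missing one."""
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise RuntimeError(
            f"Missing required environment variables: {', '.join(missing)}"
        )
    return {name: os.environ[name] for name in names}
```

Calling something like `require_env('OPENAI_API_KEY', 'DATABASE_URL')` at module import time keeps the failure at startup, and the error message names the variables without printing their values.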
Streaming responses
Users expect token-by-token output. Stream from the chain and forward the chunks as server-sent events or as a chunked response:
```python
from fastapi.responses import StreamingResponse


@app.post('/stream')
async def stream(q: Query):
    async def gen():
        async for chunk in chain.astream({'question': q.question}):
            yield chunk.content

    return StreamingResponse(gen(), media_type='text/plain')
```

Streaming also reduces perceived latency dramatically, which matters because model calls can take seconds.
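The endpoint above sends a plain chunked response. If you want server-sent events instead, each chunk needs SSE framing. The sketch below shows one way to do that with the standard library only. The names `sse_event` and `to_sse` and the closing `done` event are illustrative assumptions, not a LangChain or FastAPI API.

```python
from typing import AsyncIterator


def sse_event(data: str) -> str:
    """Frame one chunk as an SSE event; each line of data gets its own 'data:' field."""
    lines = data.split("\n")
    return "".join(f"data: {line}\n" for line in lines) + "\n"


async def to_sse(chunks: AsyncIterator[str]) -> AsyncIterator[str]:
    """Wrap text chunks as SSE events, then emit a hypothetical 'done' marker."""
    async for chunk in chunks:
        if chunk:  # skip empty chunks so no empty events are sent
            yield sse_event(chunk)
    # Assumed convention: tell the client the stream ended normally.
    yield "event: done\ndata: [DONE]\n\n"
```

In the endpoint you would then return something like `StreamingResponse(to_sse(gen()), media_type='text/event-stream')`. Splitting on newlines matters because a model token can itself contain a line break, and a bare newline inside a `data:` field would end the event early.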
Timeouts and retries
Model APIs occasionally hang or rate-limit you. Set explicit timeouts and bounded retries:
```python
llm = ChatOpenAI(model='gpt-4o-mini', timeout=30, max_retries=2)
```

Without a timeout, a stuck upstream call ties up a worker until the platform kills the request. Surface 429s back to the client rather than retrying forever.
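`ChatOpenAI` applies the timeout and retry limit internally, so you normally don't write this logic yourself. To show what "explicit timeout plus bounded retries" means, here is a standalone sketch of the pattern. `call_with_limits` and `RateLimited` are hypothetical names, and `RateLimited` merely stands in for whatever 429 error your provider's client raises.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for a provider's HTTP 429 error."""


async def call_with_limits(
    make_call: Callable[[], Awaitable[T]],
    timeout: float = 30.0,
    max_retries: int = 2,
    base_delay: float = 0.5,
) -> T:
    """Await make_call() with a per-attempt timeout and bounded, backed-off retries."""
    for attempt in range(max_retries + 1):
        try:
            return await asyncio.wait_for(make_call(), timeout=timeout)
        except (asyncio.TimeoutError, RateLimited):
            if attempt == max_retries:
                raise  # out of retries: let the caller turn this into 504 or 429
            # Exponential backoff: base_delay, then 2x, 4x, ...
            await asyncio.sleep(base_delay * 2**attempt)
    raise AssertionError("unreachable when max_retries >= 0")
```

Because the last failure is re-raised rather than swallowed, a FastAPI handler can catch it and respond with `HTTPException(status_code=429)` for rate limits, which gives the client a clear signal to back off.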
Adding a vector store
For retrieval-augmented generation you need a persistent vector store. PostgreSQL with the pgvector extension is a pragmatic choice — one database for both relational data and embeddings:
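```python
import os

from langchain_openai import OpenAIEmbeddings
from langchain_postgres import PGVector

# Assumption: OpenAI embeddings; any LangChain embeddings model works here.
embeddings = OpenAIEmbeddings()

store = PGVector(
    connection=os.environ['DATABASE_URL'],
    embeddings=embeddings,
    collection_name='docs',
)
```

This avoids running a separate vector database for small-to-medium workloads.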
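One detail worth checking is the connection string's scheme. The `langchain_postgres` examples use URLs of the form `postgresql+psycopg://...` (the psycopg 3 driver), while hosting platforms often inject a plain `postgres://` or `postgresql://` URL. Whether your platform does is an assumption you should verify. The helper below is a minimal sketch, and `to_psycopg3_url` is a hypothetical name.

```python
def to_psycopg3_url(url: str) -> str:
    """Rewrite a plain Postgres URL so SQLAlchemy selects the psycopg (v3) driver."""
    scheme, sep, rest = url.partition("://")
    if not sep:
        # Do not echo the URL: it may contain a password.
        raise ValueError("not a URL: missing '://'")
    if scheme in ("postgres", "postgresql"):
        return f"postgresql+psycopg://{rest}"
    return url  # already names a driver, so leave it untouched
```

You could then pass `to_psycopg3_url(os.environ['DATABASE_URL'])` as the `connection` argument.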
Dockerfile
```dockerfile
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```

Deploying on PandaStack
1. Connect your repo as a container app in the [dashboard](https://dashboard.pandastack.io). PandaStack auto-detects Python and installs dependencies.
2. Set `OPENAI_API_KEY` (or your provider's key) and other secrets as environment variables.
3. Attach a managed PostgreSQL database for your vector store and chat history; `DATABASE_URL` is injected automatically, ready for `PGVector`.
4. PandaStack deploys via Helm with automatic SSL, live logs, and server-side metrics.
A note on long requests and scale-to-zero
LLM responses can take many seconds. Keep request timeouts generous on the ingress, and prefer streaming so the connection stays active. For free-tier apps with scale-to-zero, the first request after idle cold-starts the container *and then* makes a model call — noticeable latency. For interactive products, run on a tier that keeps an instance warm.
| Production concern | Approach |
|---|---|
| Slow model calls | Streaming + explicit timeouts |
| Key safety | Platform env secrets, fail-fast |
| Retrieval | pgvector on managed PostgreSQL |
| Rate limits | Surface 429s, bounded retries |
| Cold start | Warm tier for interactive apps |
Cost control
LLM usage is your biggest variable cost. Cache identical prompts, set max_tokens, and log token usage (LangChain callbacks expose it) so you can attribute spend. The compute to run the app is usually trivial next to the model bill.
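As an illustration of caching identical prompts, here is a minimal in-memory sketch. `PromptCache` is a hypothetical class written for this post; LangChain also ships its own LLM cache (`set_llm_cache` in `langchain_core.globals`), which is probably the better choice in a real service. Caching like this only makes sense when repeating the same answer is acceptable, for example with `temperature=0`.

```python
import hashlib
from typing import Awaitable, Callable


class PromptCache:
    """In-memory cache for identical (model, prompt) pairs; contents are lost on restart."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        # Hash model and prompt together so different models never share entries.
        return hashlib.sha256(f"{model}\x00{prompt}".encode("utf-8")).hexdigest()

    async def get_or_call(
        self, model: str, prompt: str, call: Callable[[str], Awaitable[str]]
    ) -> str:
        """Return the cached answer, or await call(prompt) and store the result."""
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        answer = await call(prompt)
        self._store[key] = answer
        return answer
```

The `hits` and `misses` counters give a rough view of how much model spend the cache avoids, which you can log next to token usage.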
Deploy your LangChain service with an auto-wired pgvector-ready database on PandaStack's free tier — start at [dashboard.pandastack.io](https://dashboard.pandastack.io).