A FastAPI wrapper is the most common way to expose model inference as a production API — whether you're proxying a hosted LLM provider or serving a local model. FastAPI's async model is a great fit for inference workloads, which are I/O-bound when calling an upstream and need careful concurrency control when running locally.
This guide builds and deploys a robust inference API with streaming, backpressure, and health checks.
Two architectures, one API
Your FastAPI service can either:
1. Proxy a hosted LLM (Anthropic, OpenAI, or a self-hosted Ollama endpoint): your service is a thin, async, I/O-bound layer that scales cheaply.
2. Run the model in-process: heavy, GPU/RAM-bound, and scaled by replica with strict concurrency limits.
We'll structure the code so the same API surface works for both, then deploy.
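One way to picture "same API surface, two architectures" is a small backend interface that the route handler depends on. The sketch below is illustrative only: `InferenceBackend`, `ProxyBackend`, `LocalBackend`, and `collect` are hypothetical names, and both backends are stand-ins that fake their output instead of calling an upstream or a model.

```python
import asyncio
from typing import AsyncIterator, List, Protocol


class InferenceBackend(Protocol):
    """Hypothetical interface: anything that can stream text for a prompt."""

    def stream(self, prompt: str, max_tokens: int) -> AsyncIterator[str]: ...


class ProxyBackend:
    """Stand-in for the hosted-LLM path; a real one would call the upstream over HTTP."""

    async def stream(self, prompt: str, max_tokens: int) -> AsyncIterator[str]:
        for word in f"echo: {prompt}".split()[:max_tokens]:
            await asyncio.sleep(0)  # yield to the event loop, as network I/O would
            yield word


class LocalBackend:
    """Stand-in for the in-process path; a real one would call the loaded model."""

    def __init__(self, max_concurrency: int = 1) -> None:
        self._sem = asyncio.Semaphore(max_concurrency)

    async def stream(self, prompt: str, max_tokens: int) -> AsyncIterator[str]:
        async with self._sem:  # strict limit: the real model is GPU/RAM-bound
            for word in prompt.upper().split()[:max_tokens]:
                await asyncio.sleep(0)
                yield word


async def collect(backend: InferenceBackend, prompt: str, max_tokens: int = 512) -> List[str]:
    """Handler-shaped helper: works with either backend unchanged."""
    return [chunk async for chunk in backend.stream(prompt, max_tokens)]
```

A route handler written against `InferenceBackend` does not need to know which architecture sits behind it; only the concurrency limit and the deployment tier change.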
Step 1: Build the API with streaming
Streaming is the single biggest UX win for LLM APIs — users see tokens immediately instead of waiting for the full response.
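```python
# main.py
import json
import os

import httpx
from fastapi import FastAPI, HTTPException  # HTTPException is used in later steps
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

UPSTREAM = os.environ["LLM_UPSTREAM_URL"]  # e.g. an Ollama or provider URL
API_KEY = os.environ.get("UPSTREAM_API_KEY", "")


class ChatRequest(BaseModel):
    prompt: str
    model: str = "llama3.2:3b"
    max_tokens: int = 512


@app.get("/healthz")
async def healthz():
    return {"status": "ok"}


async def token_stream(req: ChatRequest):
    """Relay the upstream's streamed lines to the client as SSE events."""
    headers = {"Authorization": f"Bearer {API_KEY}"} if API_KEY else {}
    payload = {
        "model": req.model,
        "prompt": req.prompt,
        "stream": True,
        "options": {"num_predict": req.max_tokens},  # Ollama's max-token option
    }
    async with httpx.AsyncClient(timeout=120) as client:
        async with client.stream(
            "POST", f"{UPSTREAM}/api/generate", json=payload, headers=headers
        ) as resp:
            if resp.status_code != 200:
                # The 200 response has already started by the time this runs, so
                # raising HTTPException here cannot change the status code.
                # Report the failure in-band instead.
                error = json.dumps({"error": "upstream error", "status": resp.status_code})
                yield f"event: error\ndata: {error}\n\n"
                return
            async for line in resp.aiter_lines():
                if line:
                    yield f"data: {line}\n\n"  # SSE framing: "data: ..." plus a blank line


@app.post("/chat")
async def chat(req: ChatRequest):
    return StreamingResponse(token_stream(req), media_type="text/event-stream")
```

Step 2: Add concurrency control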
Unbounded concurrency is how inference services die. Each in-flight request consumes memory (and GPU if local). Add a semaphore to bound concurrency and return 429 when saturated rather than OOM-crashing.
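```python
import asyncio

MAX_CONCURRENCY = int(os.environ.get("MAX_CONCURRENCY", "8"))
sem = asyncio.Semaphore(MAX_CONCURRENCY)


# Replaces the Step 1 /chat handler. token_stream(req) is the Step 1 generator,
# defined at module level so that it takes the request as an argument.
@app.post("/chat")
async def chat(req: ChatRequest):
    if sem.locked():  # True when every slot is taken; no need to read sem._value
        raise HTTPException(429, "server busy, retry shortly")

    async def guarded():
        # Requests that pass the check at the same moment still queue here,
        # so upstream concurrency never exceeds MAX_CONCURRENCY.
        async with sem:
            async for chunk in token_stream(req):
                yield chunk

    return StreamingResponse(guarded(), media_type="text/event-stream")
```

For in-process model inference, set MAX_CONCURRENCY low (often 1–2 per GPU) to avoid VRAM thrash.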
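For the in-process case, the model call itself is usually blocking, so it also has to be kept off the event loop. The following is a minimal sketch under stated assumptions: `BoundedRunner`, `Busy`, and the `infer` callable are hypothetical names, not part of FastAPI or any model library. It swaps the semaphore for a plain counter, which is safe here because the check and the claim happen with no `await` between them. It needs Python 3.9+ for `asyncio.to_thread`.

```python
import asyncio
from typing import Callable


class Busy(Exception):
    """Raised when every slot is taken; an API layer would map this to HTTP 429."""


class BoundedRunner:
    """Runs a blocking inference call in a worker thread, at most `limit` at a time."""

    def __init__(self, infer: Callable[[str], str], limit: int = 1) -> None:
        self._infer = infer      # blocking callable: prompt -> text (hypothetical)
        self._limit = limit
        self._in_flight = 0      # only touched from the event loop thread

    async def run(self, prompt: str) -> str:
        if self._in_flight >= self._limit:
            raise Busy("server busy, retry shortly")
        self._in_flight += 1     # claimed before any await, so there is no race
        try:
            # to_thread keeps the event loop free (e.g. for /healthz) while the model works
            return await asyncio.to_thread(self._infer, prompt)
        finally:
            self._in_flight -= 1  # always free the slot, even if inference raises
```

With `limit=2` and five simultaneous calls, two run and three are rejected immediately, which is the behavior the 429 path above is meant to give clients.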
Step 3: Containerize for production
Run FastAPI under a production ASGI server such as Uvicorn, with a worker count that matches your architecture:
```dockerfile
FROM python:3.12-slim
WORKDIR /app
ENV PYTHONUNBUFFERED=1
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]
```

```text
# requirements.txt
fastapi==0.115.0
uvicorn[standard]==0.30.6
httpx==0.27.2
pydantic==2.9.2
```

Note on workers: for a proxy service, more uvicorn workers help. For an in-process model, use a single worker per replica (each worker would load its own copy of the model) and scale via replicas instead.
Step 4: Deploy on PandaStack
1. Push the repo to GitHub.
2. Create a container app on [PandaStack](https://dashboard.pandastack.io). It builds the Dockerfile with rootless BuildKit and deploys via Helm.
3. Configure env vars: `LLM_UPSTREAM_URL`, `UPSTREAM_API_KEY` (as a secret), `MAX_CONCURRENCY`.
4. Set the health check path to `/healthz` so the platform only routes traffic to ready replicas.
5. Add a custom domain. SSL is automatic.
For a proxy-style API with bursty traffic, this scales horizontally cheaply. For in-process models, pick a memory- or compute-optimized tier and keep replicas warm.
Step 5: Tune for load
| Concern | Fix |
|---|---|
| Slow first byte | Use streaming responses (shown above) |
| OOM under burst | Semaphore + 429 backpressure |
| Long requests timing out | Raise client + ingress timeouts; stream to keep the connection alive |
| Cold starts on a model server | Keep ≥1 warm replica; don't scale-to-zero in-process models |
| Cost during quiet hours | For proxy services, scale-to-zero is fine — they reload fast |
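The 429 backpressure row only helps if clients back off instead of retrying immediately. Below is a rough client-side sketch of exponential backoff with full jitter. `retry_delay` and its parameters are hypothetical, and honoring a `Retry-After` header is an assumption: the service in this guide does not send one unless you add it.

```python
import random
from typing import Optional


def retry_delay(attempt: int, retry_after: Optional[str] = None,
                base: float = 0.5, cap: float = 30.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based) after a 429."""
    if retry_after is not None:
        try:
            # A numeric Retry-After hint from the server wins, clamped to [0, cap]
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass  # non-numeric value (e.g. an HTTP date): fall back to backoff
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)  # full jitter spreads out retry storms
```

A client would sleep for `retry_delay(attempt)` between attempts and give up after a fixed number of tries, so a saturated service sees a trickle of retries and not a synchronized wave.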
Step 6: Add basic auth and rate limits
Protect your endpoint with an API key dependency and per-key rate limiting (e.g. via a Redis token bucket). Even an internal API should require a token so a leaked URL isn't an open inference budget.
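The per-key rate limit can be pictured with an in-memory token bucket. This is a single-process sketch, not the Redis-backed version mentioned above: `TokenBucketLimiter` is a hypothetical name, and with more than one replica the bucket state would have to live in shared storage such as Redis.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucketLimiter:
    """Per-key token bucket: `rate` tokens refill per second, up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._buckets: Dict[str, Tuple[float, float]] = {}  # key -> (tokens, last_seen)

    def allow(self, key: str) -> bool:
        """Spend one token for `key`; False means the caller should answer 429."""
        now = self._clock()
        tokens, last = self._buckets.get(key, (self.capacity, now))
        tokens = min(self.capacity, tokens + (now - last) * self.rate)  # refill
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._buckets[key] = (tokens, now)
        return allowed
```

A dependency would call `allow(api_key)` after the key has been validated and raise a 429 when it returns `False`. The key check itself is the dependency that follows.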
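```python
import secrets

from fastapi import Depends, Header


async def require_key(authorization: str = Header("")):
    expected = f"Bearer {os.environ['SERVICE_TOKEN']}"
    # Constant-time comparison avoids leaking the token through response timing.
    if not secrets.compare_digest(authorization.encode(), expected.encode()):
        raise HTTPException(401, "unauthorized")


# Attach it to a route: @app.post("/chat", dependencies=[Depends(require_key)])
```

References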
- [FastAPI documentation](https://fastapi.tiangolo.com/)
- [FastAPI StreamingResponse](https://fastapi.tiangolo.com/advanced/custom-response/#streamingresponse)
- [Uvicorn deployment](https://www.uvicorn.org/deployment/)
- [Server-Sent Events (MDN)](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events)
---
A well-built FastAPI inference layer is mostly about streaming and backpressure — get those right and it scales gracefully. PandaStack handles the build, deploy, health checks, and SSL so you focus on the API. Proxy services even fit comfortably in the free tier; start at [dashboard.pandastack.io](https://dashboard.pandastack.io).