A FastAPI wrapper is a common way to expose model inference as a production API, whether you're proxying a hosted LLM provider or serving a local model. FastAPI's async model fits inference workloads well: they are I/O-bound when calling an upstream and need careful concurrency control when the model runs locally.

This guide builds and deploys a robust inference API with streaming, backpressure, and health checks.

Two architectures, one API

Your FastAPI service can either:

1. Proxy an upstream LLM endpoint (Anthropic, OpenAI, or a self-hosted Ollama server): your service is a thin, async, I/O-bound layer that scales cheaply.
2. Run the model in-process: heavy, GPU/RAM-bound, and scaled by replica with strict concurrency limits.

We'll structure the code so the same API surface works for both, then deploy.
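
One way to keep the API surface identical for both architectures is to hide the backend behind a small interface. The sketch below is illustrative only: `InferenceBackend`, `ProxyBackendStub`, `LocalBackendStub`, and `collect` are hypothetical names, and the two stubs fake their output instead of calling an upstream or running a model.

```python
import asyncio
from typing import AsyncIterator, Protocol


class InferenceBackend(Protocol):
    """Hypothetical interface: both architectures expose the same method."""

    def stream(self, prompt: str, max_tokens: int) -> AsyncIterator[str]: ...


class ProxyBackendStub:
    """Stand-in for the proxy architecture (a real one would call the upstream over HTTP)."""

    async def stream(self, prompt: str, max_tokens: int) -> AsyncIterator[str]:
        for word in prompt.split()[:max_tokens]:
            await asyncio.sleep(0)  # simulates waiting on network I/O
            yield word


class LocalBackendStub:
    """Stand-in for the in-process architecture (a real one would run the model)."""

    async def stream(self, prompt: str, max_tokens: int) -> AsyncIterator[str]:
        for word in prompt.upper().split()[:max_tokens]:
            yield word


async def collect(backend: InferenceBackend, prompt: str, max_tokens: int = 512) -> list[str]:
    """A caller (such as a route handler) depends only on the interface."""
    return [token async for token in backend.stream(prompt, max_tokens)]
```

With this shape, switching architectures means choosing a different backend object at startup while the route code stays the same.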

Step 1: Build the API with streaming

Streaming is the single biggest UX win for LLM APIs — users see tokens immediately instead of waiting for the full response.

# main.py
import os
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
import httpx

app = FastAPI()
UPSTREAM = os.environ["LLM_UPSTREAM_URL"]  # e.g. an Ollama or provider URL
API_KEY = os.environ.get("UPSTREAM_API_KEY", "")

class ChatRequest(BaseModel):
    prompt: str
    model: str = "llama3.2:3b"
    max_tokens: int = 512

@app.get("/healthz")
async def healthz():
    return {"status": "ok"}

# Module-level so later steps can wrap it (Step 2 calls token_stream(req)).
async def token_stream(req: ChatRequest):
    payload = {
        "model": req.model, "prompt": req.prompt, "stream": True,
        "options": {"num_predict": req.max_tokens},  # Ollama's max-token option
    }
    async with httpx.AsyncClient(timeout=120) as client:
        async with client.stream(
            "POST", f"{UPSTREAM}/api/generate",
            json=payload,
            headers={"Authorization": f"Bearer {API_KEY}"} if API_KEY else {},
        ) as resp:
            if resp.status_code != 200:
                # The 200 response has already started by now, so raising
                # HTTPException here could not change the status code.
                # Report the failure in-band as an SSE error event instead.
                yield f"event: error\ndata: upstream returned {resp.status_code}\n\n"
                return
            async for line in resp.aiter_lines():
                if line:
                    # SSE framing: a "data:" field terminated by a blank line.
                    yield f"data: {line}\n\n"

@app.post("/chat")
async def chat(req: ChatRequest):
    return StreamingResponse(token_stream(req), media_type="text/event-stream")
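
On the client side, the streamed lines still need to be turned into text. The sketch below assumes the upstream is Ollama, whose `/api/generate` stream emits one JSON object per line with a `response` text fragment and a `done` flag; other providers use different shapes. `iter_tokens` is a hypothetical helper, and it accepts both raw JSON lines and SSE-framed `data:` lines.

```python
import json
from typing import Iterable, Iterator


def iter_tokens(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from a stream of Ollama-style JSON lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("event:"):
            continue  # skip blank separators and SSE event-name fields
        if line.startswith("data:"):
            line = line[len("data:"):].strip()
        try:
            chunk = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a JSON payload, e.g. a plain-text error message
        if not isinstance(chunk, dict):
            continue
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            break  # the upstream marks the final chunk with done=true
```

A client would feed this the lines it reads from the HTTP response and print or append each fragment as it arrives.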

Step 2: Add concurrency control

Unbounded concurrency is how inference services die. Each in-flight request consumes memory (and GPU if local). Add a semaphore to bound concurrency and return 429 when saturated rather than OOM-crashing.

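import asyncio
MAX_CONCURRENCY = int(os.environ.get("MAX_CONCURRENCY", "8"))
sem = asyncio.Semaphore(MAX_CONCURRENCY)

@app.post("/chat")  # replaces the Step 1 handler
async def chat(req: ChatRequest):
    # locked() is True when no slot is free; no need to read the private _value.
    # The check is best-effort: requests that pass it at the same instant
    # wait briefly on the semaphore below instead of being rejected.
    if sem.locked():
        raise HTTPException(429, "server busy, retry shortly")
    async def guarded():
        async with sem:
            async for chunk in token_stream(req):
                yield chunk
    return StreamingResponse(guarded(), media_type="text/event-stream")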

For in-process model inference, set MAX_CONCURRENCY low (often 1–2 per GPU) to avoid VRAM thrash.
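
To see reject-versus-queue behaviour in isolation, here is a small self-contained sketch. It deliberately uses a plain in-flight counter instead of `asyncio.Semaphore` so the rejection path is explicit; `ConcurrencyGate`, `Busy`, and `demo` are hypothetical names, and the sleep stands in for inference latency.

```python
import asyncio


class Busy(Exception):
    """Raised when every slot is taken (the API would answer 429)."""


class ConcurrencyGate:
    """Counts in-flight work and rejects, rather than queues, past the limit."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.in_flight = 0

    async def run(self, coro_fn):
        # The check and the increment have no await between them, so on a
        # single event loop no other task can run in between.
        if self.in_flight >= self.limit:
            raise Busy()
        self.in_flight += 1
        try:
            return await coro_fn()
        finally:
            self.in_flight -= 1


async def demo(limit: int = 2, requests: int = 5) -> tuple[int, int]:
    gate = ConcurrencyGate(limit)

    async def fake_inference() -> str:
        await asyncio.sleep(0.01)  # stands in for model or upstream latency
        return "ok"

    results = await asyncio.gather(
        *(gate.run(fake_inference) for _ in range(requests)),
        return_exceptions=True,
    )
    served = sum(r == "ok" for r in results)
    rejected = sum(isinstance(r, Busy) for r in results)
    return served, rejected
```

With a limit of 2 and 5 simultaneous requests, two are served and three are rejected immediately, which is the behaviour the 429 path aims for.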

Step 3: Containerize for production

Run FastAPI under a production ASGI server such as Uvicorn, with a worker count that matches your architecture (see the note on workers below):

FROM python:3.12-slim
WORKDIR /app
ENV PYTHONUNBUFFERED=1
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]

# requirements.txt
fastapi==0.115.0
uvicorn[standard]==0.30.6
httpx==0.27.2
pydantic==2.9.2

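Note on workers: for a proxy service, more uvicorn workers help. Keep in mind that the Step 2 semaphore lives in each worker process, so the effective limit per replica is workers × MAX_CONCURRENCY. For an in-process model, use a single worker per replica (each worker would load its own copy of the model) and scale via replicas instead.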

Step 4: Deploy on PandaStack

1. Push the repo to GitHub.
2. Create a container app on [PandaStack](https://dashboard.pandastack.io); it builds the Dockerfile with rootless BuildKit and deploys via Helm.
3. Configure env vars: LLM_UPSTREAM_URL, UPSTREAM_API_KEY (as a secret), MAX_CONCURRENCY.
4. Set the health check path to /healthz so the platform only routes traffic to ready replicas.
5. Add a custom domain; SSL is automatic.

For a proxy-style API with bursty traffic, this scales horizontally cheaply. For in-process models, pick a memory- or compute-optimized tier and keep replicas warm.
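
The /healthz handler from Step 1 always answers ok, which suits a proxy. For an in-process model, readiness could instead flip only once the model has loaded. The sketch below shows that idea with the standard library only; `ReadinessState` and `load_model` are hypothetical, and it assumes the platform's health check treats a non-2xx status as "not ready".

```python
import asyncio


class ReadinessState:
    """Tracks whether this replica can serve traffic yet (hypothetical helper)."""

    def __init__(self) -> None:
        self.model_loaded = False

    def health(self) -> tuple[int, dict]:
        # 503 signals "not ready yet" to whatever is polling the health path.
        if not self.model_loaded:
            return 503, {"status": "loading"}
        return 200, {"status": "ok"}


async def load_model(state: ReadinessState) -> None:
    """Placeholder for real model loading, which may take much longer."""
    await asyncio.sleep(0.01)
    state.model_loaded = True


async def startup_demo() -> list[int]:
    state = ReadinessState()
    before = state.health()[0]   # checked while the model is still loading
    await load_model(state)
    after = state.health()[0]    # checked once loading has finished
    return [before, after]
```

In the FastAPI app, the /healthz handler could return the body with the matching status code, for example through `JSONResponse(body, status_code=code)`.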

Step 5: Tune for load

| Concern | Fix |
| --- | --- |
| Slow first byte | Use streaming responses (shown above) |
| OOM under burst | Semaphore + 429 backpressure |
| Long requests timing out | Raise client + ingress timeouts; stream to keep the connection alive |
| Cold starts on a model server | Keep ≥1 warm replica; don't scale-to-zero in-process models |
| Cost during quiet hours | For proxy services, scale-to-zero is fine; they reload fast |
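
Backpressure only works if clients cooperate with the 429. A minimal client-side sketch is retrying with exponential backoff. Here `send` is a hypothetical callable that performs one request and returns the HTTP status code, and the `sleep` parameter exists only so the delay can be replaced in tests.

```python
import asyncio
from typing import Awaitable, Callable


async def call_with_backoff(
    send: Callable[[], Awaitable[int]],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> tuple[int, list[float]]:
    """Retry while the server answers 429, doubling the wait each time.

    Returns the final status code and the list of delays that were used.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delays: list[float] = []
    status = 0
    for attempt in range(max_attempts):
        status = await send()
        if status != 429:
            break  # success or a non-retryable error: stop here
        if attempt < max_attempts - 1:
            delay = base_delay * (2 ** attempt)
            delays.append(delay)
            await sleep(delay)
    return status, delays
```

Real clients often add random jitter to each delay so that many callers rejected at once do not all retry at the same moment.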

Step 6: Add basic auth and rate limits

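Protect your endpoint with an API key dependency and per-key rate limiting (e.g. via a Redis token bucket). Even an internal API should require a token so a leaked URL isn't an open inference budget. The dependency below can be attached to a route with `dependencies=[Depends(require_key)]`, with `Depends` imported from `fastapi`.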

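import secrets
from fastapi import Header

async def require_key(authorization: str = Header("")):
    expected = f"Bearer {os.environ['SERVICE_TOKEN']}"
    # Constant-time comparison avoids leaking the token through response timing.
    if not secrets.compare_digest(authorization.encode(), expected.encode()):
        raise HTTPException(401, "unauthorized")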
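
The token bucket mentioned above can be sketched in a few lines. This version keeps its state in process memory, not Redis, so it only illustrates the algorithm: with several workers or replicas each process would enforce its own limit. `TokenBucket` and its parameters are hypothetical names, and the `clock` argument exists so time can be controlled in tests.

```python
import time
from typing import Callable


class TokenBucket:
    """Per-key token bucket held in memory (a sketch, not Redis-backed)."""

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.clock = clock
        # key -> (tokens remaining, time of last update)
        self._state: dict[str, tuple[float, float]] = {}

    def allow(self, key: str) -> bool:
        now = self.clock()
        tokens, last = self._state.get(key, (float(self.capacity), now))
        # Refill in proportion to elapsed time, capped at the bucket size.
        tokens = min(float(self.capacity), tokens + (now - last) * self.rate)
        if tokens < 1:
            self._state[key] = (tokens, now)
            return False
        self._state[key] = (tokens - 1, now)
        return True
```

A dependency could call `allow(api_key)` and raise `HTTPException(429, ...)` when it returns False; a shared store such as Redis would be needed to enforce one limit across processes.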

---

A well-built FastAPI inference layer is mostly about streaming and backpressure — get those right and it scales gracefully. PandaStack handles the build, deploy, health checks, and SSL so you focus on the API. Proxy services even fit comfortably in the free tier; start at [dashboard.pandastack.io](https://dashboard.pandastack.io).
