Tutorial · 11 min read · 2026-07-02

How to Deploy a FastAPI LLM Inference API

Wrapping model inference in a FastAPI service gives you a clean, scalable API. This guide covers async endpoints, streaming responses, concurrency limits, and deploying inference that won't fall over under load.

Ajay Kumar
Founder & DevOps, PandaStack

A FastAPI wrapper is the most common way to expose model inference as a production API — whether you're proxying a hosted LLM provider or serving a local model. FastAPI's async model is a great fit for inference workloads, which are I/O-bound when calling an upstream and need careful concurrency control when running locally.

This guide builds and deploys a robust inference API with streaming, backpressure, and health checks.

Two architectures, one API

Your FastAPI service can either:

  1. Proxy a hosted LLM (Anthropic, OpenAI, or a self-hosted Ollama endpoint): your service is a thin, async, I/O-bound layer that scales cheaply.
  2. Run the model in-process: heavy, GPU/RAM-bound, and scaled by replica with strict concurrency limits.

We'll structure the code so the same API surface works for both, then deploy.
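
One way to keep a single API surface over both architectures is to have the endpoint depend on a small interface rather than on a specific backend. The sketch below is illustrative only: `InferenceBackend`, `EchoBackend`, `UpperProxyBackend`, and `pick_backend` are hypothetical names, and both backends are stand-ins that do no real inference.

```python
import asyncio
from typing import AsyncIterator, Protocol


class InferenceBackend(Protocol):
    """Anything that can turn a prompt into a stream of text chunks."""

    def stream(self, prompt: str) -> AsyncIterator[str]: ...


class EchoBackend:
    """Stand-in for an in-process model: yields the prompt back word by word."""

    async def stream(self, prompt: str) -> AsyncIterator[str]:
        for word in prompt.split():
            await asyncio.sleep(0)  # give other tasks a turn, as a real call would
            yield word + " "


class UpperProxyBackend:
    """Stand-in for a hosted upstream: a real one would make an HTTP call here."""

    async def stream(self, prompt: str) -> AsyncIterator[str]:
        for word in prompt.split():
            await asyncio.sleep(0)
            yield word.upper() + " "


def pick_backend(mode: str) -> InferenceBackend:
    # The endpoint only sees the protocol, so switching modes is a one-line change.
    return EchoBackend() if mode == "local" else UpperProxyBackend()


async def collect(backend: InferenceBackend, prompt: str) -> str:
    """Drain a backend's stream into one string (handy for tests)."""
    return "".join([chunk async for chunk in backend.stream(prompt)])
```

In a real service, the mode would likely come from an environment variable, and the route handler would pass `backend.stream(...)` to the streaming response.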

Step 1: Build the API with streaming

Streaming is the single biggest UX win for LLM APIs — users see tokens immediately instead of waiting for the full response.

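# main.py
import os
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
import httpx

app = FastAPI()
UPSTREAM = os.environ["LLM_UPSTREAM_URL"]  # e.g. an Ollama or provider URL
API_KEY = os.environ.get("UPSTREAM_API_KEY", "")

class ChatRequest(BaseModel):
    prompt: str
    model: str = "llama3.2:3b"
    max_tokens: int = 512

@app.get("/healthz")
async def healthz():
    return {"status": "ok"}

# Module-level so other handlers (such as the guarded one in Step 2) can reuse it.
async def token_stream(req: ChatRequest):
    payload = {
        "model": req.model,
        "prompt": req.prompt,
        "stream": True,
        # num_predict is Ollama's max-token cap; other providers use a different field.
        "options": {"num_predict": req.max_tokens},
    }
    headers = {"Authorization": f"Bearer {API_KEY}"} if API_KEY else {}
    async with httpx.AsyncClient(timeout=120) as client:
        async with client.stream(
            "POST", f"{UPSTREAM}/api/generate", json=payload, headers=headers
        ) as resp:
            if resp.status_code != 200:
                # The 200 response has already started by the time this runs, so
                # raising HTTPException here cannot change the status code.
                # Report the failure as an SSE error event and stop.
                yield f"event: error\ndata: upstream returned {resp.status_code}\n\n"
                return
            async for line in resp.aiter_lines():
                if line:
                    # SSE framing: a "data:" field terminated by a blank line.
                    yield f"data: {line}\n\n"

@app.post("/chat")
async def chat(req: ChatRequest):
    return StreamingResponse(token_stream(req), media_type="text/event-stream")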
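
On the consuming side, a client has to turn the streamed lines back into text. The sketch below assumes the upstream is Ollama, whose `/api/generate` streams one JSON object per line with `response` and `done` fields; other providers use different shapes. `extract_tokens` is a hypothetical helper name, and it accepts lines with or without an SSE `data:` prefix.

```python
import json
from typing import Iterable, Iterator


def extract_tokens(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from a stream of Ollama-style JSON lines."""
    for raw in lines:
        line = raw.strip()
        if line.startswith("data:"):
            # Tolerate SSE framing by dropping the field name.
            line = line[len("data:"):].strip()
        if not line:
            continue  # blank lines separate SSE events
        chunk = json.loads(line)
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            break  # the final object marks the end of the generation


sample = [
    '{"response": "Hel", "done": false}',
    "",
    'data: {"response": "lo", "done": false}',
    '{"response": "", "done": true}',
]
text = "".join(extract_tokens(sample))
```

With httpx, the same helper could be fed from `response.iter_lines()` inside an `httpx.stream("POST", ...)` context, printing each fragment as it arrives.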

Step 2: Add concurrency control

Unbounded concurrency is how inference services die. Each in-flight request consumes memory (and GPU if local). Add a semaphore to bound concurrency and return 429 when saturated rather than OOM-crashing.

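import asyncio
MAX_CONCURRENCY = int(os.environ.get("MAX_CONCURRENCY", "8"))
sem = asyncio.Semaphore(MAX_CONCURRENCY)

# This replaces the /chat handler from Step 1. It expects token_stream to be a
# module-level async generator that takes the request as its argument.
@app.post("/chat")
async def chat(req: ChatRequest):
    # locked() is True when every slot is taken, so no private attribute is needed.
    # The check is best-effort: the slot is only taken once streaming starts, so a
    # tight burst can queue a few requests on the semaphore instead of rejecting them.
    if sem.locked():
        raise HTTPException(429, "server busy, retry shortly")
    async def guarded():
        # Hold the slot for the whole stream; it is released when the stream ends.
        async with sem:
            async for chunk in token_stream(req):
                yield chunk
    return StreamingResponse(guarded(), media_type="text/event-stream")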

For in-process model inference, set MAX_CONCURRENCY low (often 1–2 per GPU) to avoid VRAM thrash.
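
If you want rejection to be exact rather than best-effort, one option is an explicit in-flight counter that admits or rejects in a single step. This is a sketch of that idea, not part of the service above: `ConcurrencyGate` and `handle` are hypothetical names, and the sleep stands in for the inference call.

```python
import asyncio


class ConcurrencyGate:
    """Admit at most `limit` requests at once; reject the rest immediately."""

    def __init__(self, limit: int) -> None:
        self._limit = limit
        self._in_flight = 0

    def try_enter(self) -> bool:
        # No await between the check and the increment, so on a single
        # event loop nothing can interleave between them.
        if self._in_flight >= self._limit:
            return False
        self._in_flight += 1
        return True

    def leave(self) -> None:
        self._in_flight -= 1


async def handle(gate: ConcurrencyGate, work_seconds: float) -> int:
    """Return an HTTP-style status: 200 if admitted, 429 if the gate is full."""
    if not gate.try_enter():
        return 429
    try:
        await asyncio.sleep(work_seconds)  # stands in for the inference call
        return 200
    finally:
        gate.leave()  # always free the slot, even on errors or cancellation


async def demo() -> list[int]:
    gate = ConcurrencyGate(limit=2)
    # Three requests arrive together, but only two slots exist.
    return await asyncio.gather(*(handle(gate, 0.01) for _ in range(3)))
```

Running `asyncio.run(demo())` admits the first two requests and rejects the third. Like the semaphore, the counter lives in one process, so it bounds each worker separately.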

Step 3: Containerize for production

Run FastAPI under a production ASGI server such as Uvicorn, with a worker count that matches your architecture:

FROM python:3.12-slim
WORKDIR /app
ENV PYTHONUNBUFFERED=1
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]

# requirements.txt
fastapi==0.115.0
uvicorn[standard]==0.30.6
httpx==0.27.2
pydantic==2.9.2

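Note on workers: for a proxy service, more Uvicorn workers help, because each worker is a separate process that can use another CPU core. The Step 2 semaphore is per process, so the effective limit per replica is workers × MAX_CONCURRENCY. For an in-process model, use a single worker per replica (each worker would load its own copy of the model) and scale via replicas instead.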
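
Because an `asyncio.Semaphore` is local to one process, the deployment-wide ceiling is a product of three numbers. The arithmetic below is a quick sanity check; the replica and worker counts are made-up examples, and `max_in_flight` is a hypothetical helper.

```python
def max_in_flight(replicas: int, workers_per_replica: int, max_concurrency: int) -> int:
    """Upper bound on simultaneous upstream or model calls across the deployment."""
    return replicas * workers_per_replica * max_concurrency


# Proxy service: several workers per replica, generous per-worker limit.
proxy_cap = max_in_flight(replicas=3, workers_per_replica=4, max_concurrency=8)

# In-process model: one worker per replica, tight per-worker limit.
local_cap = max_in_flight(replicas=2, workers_per_replica=1, max_concurrency=2)
```

Here `proxy_cap` works out to 96 and `local_cap` to 4, which is the number to compare against your upstream's rate limit or your GPU budget.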

Step 4: Deploy on PandaStack

  1. Push the repo to GitHub.
  2. Create a container app on [PandaStack](https://dashboard.pandastack.io). It builds the Dockerfile with rootless BuildKit and deploys via Helm.
  3. Configure env vars: LLM_UPSTREAM_URL, UPSTREAM_API_KEY (as a secret), MAX_CONCURRENCY.
  4. Set the health check path to /healthz so the platform only routes traffic to ready replicas.
  5. Add a custom domain. SSL is automatic.

For a proxy-style API with bursty traffic, this scales horizontally cheaply. For in-process models, pick a memory- or compute-optimized tier and keep replicas warm.
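
For in-process models, a health check that always returns ok can mark a replica ready before its weights have loaded. A minimal sketch of a readiness-aware check follows; `ModelHolder` is a hypothetical name, and the sleep and `object()` are placeholders for real model loading.

```python
import asyncio


class ModelHolder:
    """Tracks whether the (hypothetical) in-process model has finished loading."""

    def __init__(self) -> None:
        self.model = None

    async def load(self) -> None:
        await asyncio.sleep(0.01)  # placeholder for reading weights into RAM/VRAM
        self.model = object()      # placeholder for the loaded model

    def health(self) -> tuple[int, dict[str, str]]:
        # A 503 signals that this replica should not receive traffic yet.
        if self.model is None:
            return 503, {"status": "loading"}
        return 200, {"status": "ok"}


async def startup_demo() -> list[int]:
    holder = ModelHolder()
    before = holder.health()[0]  # checked before the model is loaded
    await holder.load()
    after = holder.health()[0]   # checked once loading has finished
    return [before, after]
```

In the FastAPI app, `load()` could be kicked off from a lifespan handler and `/healthz` could return a `JSONResponse` carrying the status code from `health()`. A proxy-only service has nothing to load, so the plain check is enough there.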

Step 5: Tune for load

| Concern | Fix |
| --- | --- |
| Slow first byte | Use streaming responses (shown above) |
| OOM under burst | Semaphore + 429 backpressure |
| Long requests timing out | Raise client + ingress timeouts; stream to keep the connection alive |
| Cold starts on a model server | Keep ≥1 warm replica; don't scale-to-zero in-process models |
| Cost during quiet hours | For proxy services, scale-to-zero is fine; they reload fast |
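
The 429 backpressure row only works if clients back off instead of retrying immediately. Below is a sketch of exponential backoff with full jitter on the client side. `backoff_delays` and `call_with_retry` are hypothetical helpers, the base and cap values are arbitrary, and `send` is any callable that performs the request and returns a status code.

```python
import random
import time
from typing import Callable


def backoff_delays(
    attempts: int,
    base: float = 0.5,
    cap: float = 8.0,
    rng: Callable[[], float] = random.random,
) -> list[float]:
    """Full jitter: delay i is drawn uniformly from [0, min(cap, base * 2**i))."""
    return [rng() * min(cap, base * 2 ** i) for i in range(attempts)]


def call_with_retry(
    send: Callable[[], int],
    attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> int:
    """Call send() until it returns something other than 429 or attempts run out."""
    for delay in backoff_delays(attempts, rng=rng):
        status = send()
        if status != 429:
            return status
        sleep(delay)  # wait before the next try; jitter spreads retries out
    return 429
```

The `sleep` and `rng` parameters exist so the helper can be exercised in tests without real waiting or randomness. The jitter keeps a crowd of rejected clients from all retrying at the same instant.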

Step 6: Add basic auth and rate limits

Protect your endpoint with an API key dependency and per-key rate limiting (e.g. via a Redis token bucket). Even an internal API should require a token so a leaked URL isn't an open inference budget.

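import secrets
from fastapi import Depends, Header
async def require_key(authorization: str = Header("")):
    expected = f"Bearer {os.environ['SERVICE_TOKEN']}"
    # compare_digest runs in constant time, so response timing does not leak the token.
    if not secrets.compare_digest(authorization.encode(), expected.encode()):
        raise HTTPException(401, "unauthorized")
# Attach it to a route with: @app.post("/chat", dependencies=[Depends(require_key)])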
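
The per-key rate limiting mentioned above can be sketched with a token bucket. Note the swap: the text suggests a Redis-backed bucket so that limits hold across replicas, while the version below is an in-memory, single-process stand-in that only shows the refill logic. `TokenBucket`, `allow_request`, and the default rate and capacity are all illustrative assumptions.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity  # start full so the first burst is allowed
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# One bucket per API key; in-memory only, so each process keeps its own counts.
buckets: dict[str, TokenBucket] = {}


def allow_request(api_key: str, rate: float = 1.0, capacity: float = 5.0) -> bool:
    if api_key not in buckets:
        buckets[api_key] = TokenBucket(rate, capacity)
    return buckets[api_key].allow()
```

A dependency like `require_key` could call `allow_request` after validating the token and raise a 429 when it returns `False`. With several workers or replicas, the counts need shared storage, which is where Redis comes in.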

References

  • [FastAPI documentation](https://fastapi.tiangolo.com/)
  • [FastAPI StreamingResponse](https://fastapi.tiangolo.com/advanced/custom-response/#streamingresponse)
  • [Uvicorn deployment](https://www.uvicorn.org/deployment/)
  • [Server-Sent Events (MDN)](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events)

---

A well-built FastAPI inference layer is mostly about streaming and backpressure — get those right and it scales gracefully. PandaStack handles the build, deploy, health checks, and SSL so you focus on the API. Proxy services even fit comfortably in the free tier; start at [dashboard.pandastack.io](https://dashboard.pandastack.io).
