Two layers: the inference engine and the API in front of it
When teams say "host an LLM inference API," they're really talking about two layers that need different infrastructure:
1. The inference engine: the GPU-backed service running the model (vLLM, TGI, TensorRT-LLM, or SGLang) with batching, KV-cache management, and GPU autoscaling.
2. The API gateway / orchestration layer: auth, rate limiting, usage metering, request routing, streaming to clients, prompt logging, and fallback between providers. This layer runs on plain CPU and is where a lot of production value (and a lot of bugs) live.
Most guides only cover layer 1. Layer 2 is where you'll actually spend engineering time.
What the inference engine needs
- GPUs sized to your model (L4/A10 for small, A100/H100 for large).
- An optimized serving runtime: vLLM and TGI are the common choices in 2026, getting their throughput from continuous batching and PagedAttention-style KV-cache paging.
- Cold-start strategy — model weights are huge; loading takes time, so warm pools or fast snapshotting matter.
- Autoscaling on GPU utilization / queue depth.
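As a rough illustration of "GPUs sized to your model", the sketch below estimates memory for weights plus KV cache. The architecture numbers are the published Llama 3 8B configuration (32 layers, 8 KV heads, head dimension 128, about 8.03B parameters); the function names and the choice of 16-bit precision are assumptions made for illustration, and a real deployment also needs headroom for activations and runtime overhead.

```python
# Back-of-the-envelope GPU memory estimate (illustrative; names are hypothetical).

GIB = 1024 ** 3


def weight_bytes(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights, e.g. 2 bytes per parameter for fp16/bf16."""
    return n_params * bytes_per_param


def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Each token stores one key and one value vector per layer per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


# Llama 3 8B: 32 layers, 8 KV heads (grouped-query attention), head_dim 128.
weights = weight_bytes(8.03e9)
per_token = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
one_sequence = per_token * 8192  # one full 8192-token context

print(f"weights:        {weights / GIB:.1f} GiB")
print(f"KV per token:   {per_token / 1024:.0f} KiB")
print(f"KV per 8k seq:  {one_sequence / GIB:.1f} GiB")
```

Under these assumptions the weights alone take about 15 GiB, so a 24 GB card such as an L4 or A10 leaves only a few GiB for KV cache, which is roughly a handful of full-length sequences in flight. Larger models or higher concurrency push you toward A100/H100-class memory.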
What the gateway layer needs
- Streaming (SSE/WebSocket) passthrough.
- Rate limiting, API keys, usage metering.
- A database for keys, usage records, and logs.
- Provider fallback and routing.
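To make the rate-limiting requirement concrete, here is a minimal token-bucket sketch. It is an in-memory illustration only: `TokenBucket`, `check`, and the default rates are hypothetical names and values, and a gateway running more than one replica would need to keep the counters in a shared store instead of a process-local dict.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# One bucket per API key (process-local; not shared across replicas).
buckets: dict[str, TokenBucket] = {}


def check(api_key: str, rate: float = 5.0, capacity: float = 10.0) -> bool:
    """Return True if this key may make a request now."""
    bucket = buckets.setdefault(api_key, TokenBucket(rate, capacity))
    return bucket.allow()
```

A request handler would call `check(api_key)` before proxying and return HTTP 429 when it comes back `False`. The injectable `clock` exists so the refill logic can be tested without sleeping.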
The options
| Platform | GPU inference | Serving runtime | Gateway layer | Notes |
|---|---|---|---|---|
| Modal / Baseten / RunPod | Yes | vLLM/TGI/custom | DIY | GPU-first |
| Together / Fireworks / Anyscale | Yes (managed) | Optimized | Built-in | Managed inference APIs |
| AWS/GCP/Azure GPUs | Yes | Self-managed | DIY | Full control |
| OpenRouter / LiteLLM | N/A (router) | N/A | Yes | Gateway/routing |
| PandaStack | No GPU | N/A | Yes (CPU) | Run the gateway + DB |
GPU inference platforms
For the engine, Modal, Baseten, and RunPod let you deploy vLLM/TGI on GPUs with autoscaling. Together, Fireworks, and Anyscale offer fully managed inference endpoints if you'd rather not manage serving at all. Use these for layer 1.

    # Typical vLLM server on a GPU platform (layer 1)
    vllm serve meta-llama/Meta-Llama-3-8B-Instruct --max-model-len 8192

Where PandaStack fits
To be clear up front: PandaStack does not provide GPUs, so it is not where you run the model. PandaStack is an excellent place to run layer 2, the gateway and orchestration API, on CPU with a managed database wired in.
Deploy your gateway (API keys, rate limiting, metering, streaming, provider fallback) as a container; PandaStack injects DATABASE_URL for storing keys and usage, gives you Redis for rate-limit counters, and serves streaming responses through Kong ingress:

    # Gateway on PandaStack: meter + stream from an upstream GPU endpoint
    import os
    from typing import AsyncIterator

    import httpx

    UPSTREAM = os.environ['INFERENCE_URL']  # vLLM on Modal/Baseten

    async def proxy(req: dict) -> AsyncIterator[bytes]:
        # Forward the request body upstream and relay the response chunk by chunk.
        async with httpx.AsyncClient(timeout=120) as c:
            async with c.stream('POST', f'{UPSTREAM}/v1/chat/completions',
                                json=req) as r:
                async for chunk in r.aiter_bytes():
                    yield chunk  # stream to the client; record usage to DATABASE_URL

You get server-side metrics (ClickHouse), cronjobs (e.g., nightly usage rollups and invoicing), secure env vars for upstream credentials, custom domains, and automatic SSL. For production, run the gateway warm (paid tier) so it doesn't cold-start in front of your inference traffic.
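The "record usage" step in the proxy sketch depends on how the upstream reports token counts. The following is one possible approach, assuming the upstream emits OpenAI-style server-sent events (`data: {json}` lines ending with `data: [DONE]`) and includes a `usage` object on one of the chunks. OpenAI-compatible servers typically do that only when the request sets `stream_options` to `{"include_usage": true}`, so verify this against your upstream. `UsageSniffer` is a hypothetical helper name.

```python
import json
from typing import Optional


class UsageSniffer:
    """Watches passthrough bytes for a `usage` object without altering them."""

    def __init__(self) -> None:
        self._buffer = b""
        self.usage: Optional[dict] = None

    def feed(self, chunk: bytes) -> bytes:
        # Network chunks can split an SSE line, so buffer until a newline arrives.
        self._buffer += chunk
        *lines, self._buffer = self._buffer.split(b"\n")
        for line in lines:
            line = line.strip()
            if not line.startswith(b"data:"):
                continue
            payload = line[len(b"data:"):].strip()
            if payload == b"[DONE]":
                continue
            try:
                event = json.loads(payload)
            except json.JSONDecodeError:
                continue  # ignore anything that is not a JSON event
            if isinstance(event, dict) and event.get("usage"):
                self.usage = event["usage"]
        return chunk  # unchanged, ready to yield to the client


# Example with a usage event split across two network chunks.
sniffer = UsageSniffer()
stream = [
    b'data: {"choices":[{"delta":{"content":"Hi"}}]}\n\n',
    b'data: {"choices":[],"usage":{"prompt_to',
    b'kens":12,"completion_tokens":3,"total_tokens":15}}\n\ndata: [DONE]\n\n',
]
for chunk in stream:
    sniffer.feed(chunk)
print(sniffer.usage)
```

Inside the proxy loop this would become `yield sniffer.feed(chunk)`, with `sniffer.usage` written to the database once the stream ends. Because the bytes pass through untouched, the client sees exactly what the upstream sent.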
Honest summary: if you want a single platform that does GPU inference end-to-end, use a GPU-first provider or a managed inference API. PandaStack's role is the durable, CPU-bound gateway and control plane around inference — the part that needs a database, metering, streaming, and scheduled jobs more than it needs a GPU.
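The scheduled-jobs part can be quite small. Below is a sketch of a nightly usage rollup, using an in-memory SQLite database purely as a stand-in for the managed Postgres so the example runs anywhere. The table names (`usage_events`, `usage_daily`), columns, and sample rows are all hypothetical, and a real job would connect via `DATABASE_URL` with a Postgres driver.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the managed Postgres
conn.executescript("""
    CREATE TABLE usage_events (api_key TEXT, ts TEXT, total_tokens INTEGER);
    CREATE TABLE usage_daily (
        api_key TEXT, day TEXT, requests INTEGER, total_tokens INTEGER,
        PRIMARY KEY (api_key, day)
    );
""")
conn.executemany(
    "INSERT INTO usage_events VALUES (?, ?, ?)",
    [
        ("key_a", "2026-01-05T09:15:00", 120),
        ("key_a", "2026-01-05T17:40:00", 80),
        ("key_b", "2026-01-05T11:00:00", 300),
        ("key_a", "2026-01-06T08:00:00", 50),
    ],
)


def rollup(conn: sqlite3.Connection, day: str) -> None:
    """Aggregate one day's events per key; re-running replaces that day's rows."""
    with conn:  # commit on success, roll back on error
        conn.execute("DELETE FROM usage_daily WHERE day = ?", (day,))
        conn.execute(
            """
            INSERT INTO usage_daily (api_key, day, requests, total_tokens)
            SELECT api_key, ?, COUNT(*), SUM(total_tokens)
            FROM usage_events
            WHERE substr(ts, 1, 10) = ?
            GROUP BY api_key
            """,
            (day, day),
        )


rollup(conn, "2026-01-05")
rollup(conn, "2026-01-05")  # idempotent: safe if the cron job retries
rows = conn.execute(
    "SELECT api_key, requests, total_tokens FROM usage_daily ORDER BY api_key"
).fetchall()
print(rows)  # [('key_a', 2, 200), ('key_b', 1, 300)]
```

Deleting and re-inserting the day inside one transaction keeps the job idempotent, which matters because schedulers can retry or overlap runs.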
A clean split

    Client ─▶ Gateway API (PandaStack, warm) ──┬─▶ vLLM on GPU (Modal/Baseten)
              keys/rate-limit/metering         ├─▶ Managed Postgres (usage, keys)
              streaming passthrough            └─▶ Redis (rate-limit counters)
    Usage rollup cronjob (PandaStack) ─▶ Postgres ─▶ billing

Decision guide
- Self-host the model on GPUs → Modal / Baseten / RunPod (+ vLLM/TGI).
- Fully managed inference endpoint → Together / Fireworks / Anyscale.
- Build the gateway/metering/streaming layer with a DB → PandaStack.
References
- vLLM docs: https://docs.vllm.ai/
- Hugging Face TGI: https://huggingface.co/docs/text-generation-inference/
- Modal docs: https://modal.com/docs
- LiteLLM (gateway/router): https://docs.litellm.ai/
- OpenAI-compatible API spec: https://platform.openai.com/docs/api-reference