Running your own LLM inference API gives you control over latency, cost, and data privacy — but it's also where teams most often underestimate hardware requirements. This guide walks through serving a model with a runtime like Ollama or vLLM, sizing resources honestly, and wrapping it in a production API.
Choose a serving runtime
| Runtime | Strengths | Best for |
|---|---|---|
| Ollama | Dead-simple, great DX, CPU or GPU | Small models, prototypes, internal tools |
| vLLM | High-throughput batching, GPU | Production serving of open models |
| llama.cpp / GGUF | Efficient CPU/quantized inference | Resource-constrained or CPU-only |
Ollama is the easiest entry point: it pulls models with one command and exposes an HTTP API. vLLM is the choice when you need throughput and have GPUs.
Be honest about hardware
This is the part to get right before you deploy anything. Model size drives memory:
- A 7B-parameter model in 4-bit quantization needs roughly 4-6 GB; in 16-bit, ~14 GB.
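- Larger models scale roughly linearly with parameter count: a 70B model needs about 35 GB for the weights alone at 4-bit and ~140 GB at 16-bit, so 13B and 70B models realistically need GPUs for usable latency.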
- CPU inference *works* for small quantized models but is slow — fine for low-volume internal tools, not for latency-sensitive user-facing traffic.
Don't try to run a 70B model on a small CPU instance. Pick a small quantized model for CPU, or plan for GPU capacity for anything larger.
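The memory figures above follow from simple arithmetic: weight memory is roughly parameter count times bits per weight, divided by 8. A minimal sketch of that estimate follows. The function names and the 1.3x overhead factor for KV cache and runtime buffers are illustrative assumptions, not measured values.

```python
def weights_gb(params: float, bits: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bits / 8 / 1e9


def estimate_total_gb(params: float, bits: int, overhead: float = 1.3) -> float:
    """Weights plus an assumed allowance for KV cache and runtime buffers."""
    return weights_gb(params, bits) * overhead


if __name__ == "__main__":
    for name, params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
        for bits in (4, 16):
            total = estimate_total_gb(params, bits)
            print(f"{name} @ {bits}-bit: ~{total:.1f} GB")
```

Treat the result as a lower bound for the container's memory limit; long contexts and concurrent requests push the working set higher.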
Serving with Ollama
Ollama exposes an OpenAI-compatible API and its own /api/generate endpoint. A minimal Dockerfile starts from the ollama/ollama image:

    FROM ollama/ollama:latest
    # Optionally bake a model in at build time

At runtime, start the server and pull a model:

    ollama serve &
    ollama pull llama3.2:3b
    curl http://localhost:11434/api/generate -d '{
      "model": "llama3.2:3b",
      "prompt": "Hello"
    }'

Bake small models into the image or load them from a persistent volume so you don't re-download on every cold start; a multi-gigabyte pull on startup will blow past health-check timeouts.
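By default, /api/generate streams its answer as newline-delimited JSON: one object per chunk, with the text in a `response` field and `done` set to true on the final object. The sketch below shows one way to reassemble that stream. `collect_stream` is a hypothetical helper name, and the sample lines are made up for illustration rather than real model output.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[str]) -> str:
    """Concatenate the 'response' fragments of an NDJSON stream."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # final object; nothing after it belongs to this reply
    return "".join(parts)


if __name__ == "__main__":
    sample = [
        '{"model": "llama3.2:3b", "response": "Hel", "done": false}',
        '{"model": "llama3.2:3b", "response": "lo!", "done": false}',
        '{"model": "llama3.2:3b", "response": "", "done": true}',
    ]
    print(collect_stream(sample))
```

If you don't need token-by-token output, sending `"stream": false` in the request body returns a single JSON object instead.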
Model loading and memory
Models load into memory at startup, which can take tens of seconds. Configure your readiness probe to allow for this, and set the start-up grace period generously. Keep the model warm — unloading and reloading per request destroys latency. Size the container's memory limit above the model's footprint plus working memory, or the OOM killer will terminate it mid-load.
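A readiness check for this can be as simple as polling the runtime until it answers, with a generous deadline. The sketch below assumes Ollama's GET /api/tags endpoint as the probe target. The helper names, the timeout, and the polling interval are illustrative. A successful response only shows that the server is up, not that a particular model is already resident.

```python
import time
import urllib.request
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 120.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `probe` until it returns True or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except OSError:
            pass  # e.g. connection refused while the server is starting
        if clock() >= deadline:
            return False
        sleep(interval_s)


def ollama_probe(base_url: str = "http://localhost:11434") -> bool:
    """True if the Ollama server answers its model-listing endpoint."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=2) as resp:
        return resp.status == 200


# Example (requires a running server):
#     ready = wait_until_ready(ollama_probe, timeout_s=180)
```

The sleep and clock parameters exist only so the loop can be tested without real waiting.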
A thin API wrapper
Rather than exposing the raw runtime, front it with a small API that adds auth, rate limiting, and input validation:

    import os

    import httpx
    from fastapi import FastAPI, Header, HTTPException

    app = FastAPI()
    KEY = os.environ['API_KEY']  # shared secret; raises KeyError at startup if unset

    @app.post('/generate')
    async def generate(body: dict, authorization: str = Header(None)):
        # Reject requests that lack the expected bearer token.
        if authorization != f'Bearer {KEY}':
            raise HTTPException(401)
        # Force a single JSON response; Ollama streams NDJSON by default,
        # which r.json() cannot parse.
        payload = {**body, 'stream': False}
        async with httpx.AsyncClient(timeout=120) as c:
            r = await c.post('http://localhost:11434/api/generate', json=payload)
        return r.json()

This keeps the inference runtime private and gives you a control point for quotas and logging.
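The wrapper above only shows the auth check. Rate limiting and input validation could be sketched as follows, using only the standard library. `TokenBucket`, `validate_body`, and the 8000-character prompt cap are hypothetical names and limits chosen for illustration, not part of FastAPI or Ollama.

```python
import time
from typing import Callable, Optional


class TokenBucket:
    """Allow `rate` requests per second with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        refill = (now - self.updated) * self.rate
        self.tokens = min(self.capacity, self.tokens + refill)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def validate_body(body: dict, max_prompt_chars: int = 8000) -> Optional[str]:
    """Return an error message, or None if the body looks acceptable."""
    if not isinstance(body.get("model"), str):
        return "model must be a string"
    prompt = body.get("prompt")
    if not isinstance(prompt, str) or not prompt:
        return "prompt must be a non-empty string"
    if len(prompt) > max_prompt_chars:
        return "prompt too long"
    return None
```

In the handler, one option is to raise HTTPException(429) when `allow()` returns False and HTTPException(422) when `validate_body` returns a message. The bucket lives in process memory, so multiple replicas would each enforce their own limit unless the state moves to a shared store.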
Deploying on PandaStack
1. Connect your repo (with the Ollama or vLLM Dockerfile) as a container app in the [dashboard](https://dashboard.pandastack.io).
2. Choose a compute tier sized for your model. PandaStack compute tiers range from Free (0.25 CPU / 512 MB) up to C2-2XCompute (8 CPU / 16 GB); pick a memory-optimized m1/m2 tier for larger models. Be realistic: small quantized models only on modest CPU tiers.
3. Set API_KEY and any model configuration as environment secrets.
4. PandaStack builds with rootless BuildKit and deploys via Helm with automatic SSL and live logs.
Cold starts matter a lot here
LLM containers are large and slow to start because the model must load into memory. Free-tier scale-to-zero means an idle API cold-starts *and* loads the model on the next request — potentially a minute. For anything interactive, run on a tier that keeps the instance warm so the model stays resident.
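Within a running instance, one way to keep the model resident on Ollama is the `keep_alive` field of /api/generate, which controls how long a model stays loaded after a request. A request without a prompt simply loads the model. The sketch below builds such a warm-up call. The helper names are hypothetical, and reading `keep_alive: -1` as "stay loaded indefinitely" comes from Ollama's documentation and is worth verifying against your version.

```python
import json
import urllib.request
from typing import Union


def warmup_payload(model: str, keep_alive: Union[int, str] = -1) -> dict:
    """Request body that loads `model` without generating any text."""
    return {"model": model, "keep_alive": keep_alive}


def warm_model(model: str, base_url: str = "http://localhost:11434") -> int:
    """POST the warm-up request and return the HTTP status code."""
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(warmup_payload(model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return resp.status


# Example (requires a running server):
#     warm_model("llama3.2:3b")
```

This only helps while the container itself is running. It does not stop a platform from scaling an idle instance to zero, which is why the tier choice still matters.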
| Decision | Guidance |
|---|---|
| Model size | Match to memory: 4-bit 7B ≈ 4-6 GB |
| CPU vs GPU | Small quantized → CPU; larger → GPU |
| Cold start | Keep warm; bake model into image |
| Security | Front with an auth/rate-limit layer |
References
- [Ollama documentation](https://github.com/ollama/ollama/blob/main/docs/README.md)
- [Ollama API reference](https://github.com/ollama/ollama/blob/main/docs/api.md)
- [vLLM documentation](https://docs.vllm.ai/en/latest/)
- [Hugging Face: model memory and quantization](https://huggingface.co/docs/transformers/main/en/quantization/overview)
Serve your own LLM API with the right compute tier on PandaStack — explore the options and start at [dashboard.pandastack.io](https://dashboard.pandastack.io).