Running your own LLM inference API gives you control over latency, cost, and data privacy — but it's also where teams most often underestimate hardware requirements. This guide walks through serving a model with a runtime like Ollama or vLLM, sizing resources honestly, and wrapping it in a production API.

Choose a serving runtime

| Runtime | Strengths | Best for |
| --- | --- | --- |
| Ollama | Dead-simple, great DX, CPU or GPU | Small models, prototypes, internal tools |
| vLLM | High-throughput batching, GPU | Production serving of open models |
| llama.cpp / GGUF | Efficient CPU/quantized inference | Resource-constrained or CPU-only |

Ollama is the easiest entry point: it pulls models with one command and exposes an HTTP API. vLLM is the choice when you need throughput and have GPUs.

Be honest about hardware

This is the part to get right before you deploy anything. Model size drives memory:

- A 7B-parameter model in 4-bit quantization needs roughly 4-6 GB; in 16-bit, ~14 GB for the weights alone.
- Larger models (13B, 70B) scale roughly linearly with parameter count (a 70B model at 4-bit needs about 35 GB for weights alone) and realistically need GPUs for usable latency.
- CPU inference *works* for small quantized models but is slow. That is fine for low-volume internal tools, not for latency-sensitive user-facing traffic.

Don't try to run a 70B model on a small CPU instance. Pick a small quantized model for CPU, or plan for GPU capacity for anything larger.
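
The sizing figures above follow from simple arithmetic: parameter count times bytes per weight, plus headroom for the KV cache and runtime buffers. The sketch below is a back-of-the-envelope estimator, not a measurement. The function names are illustrative, and the `overhead_factor` of 1.3 is an assumption that varies with context length and runtime.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # billions of parameters * bytes per parameter = gigabytes
    return params_billions * bytes_per_weight


def estimated_footprint_gb(params_billions: float, bits_per_weight: int,
                           overhead_factor: float = 1.3) -> float:
    """Weights plus an assumed allowance for KV cache and runtime buffers."""
    return weight_memory_gb(params_billions, bits_per_weight) * overhead_factor


if __name__ == "__main__":
    for size in (7, 13, 70):
        for bits in (4, 16):
            weights = weight_memory_gb(size, bits)
            total = estimated_footprint_gb(size, bits)
            print(f"{size}B @ {bits}-bit: weights {weights:.1f} GB, "
                  f"about {total:.1f} GB with overhead")
```

Treat the result as a lower bound when picking a tier, and confirm it against the runtime's actual memory use under a realistic prompt length.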

Serving with Ollama

Ollama exposes an OpenAI-compatible API alongside its own /api/generate endpoint. To containerize it, start from the official image:

FROM ollama/ollama:latest
# Optionally bake a model in at build time

At runtime, start the server and pull a model:

ollama serve &
ollama pull llama3.2:3b
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:3b",
  "prompt": "Hello"
}'
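
By default, /api/generate streams its reply as newline-delimited JSON: each line is an object carrying a "response" fragment, and the last one has "done" set to true. A client has to stitch those fragments together. The helper below is a minimal sketch of that step; the function name and the sample lines are illustrative, not captured output.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[str]) -> str:
    """Join the "response" fragments from a newline-delimited JSON stream."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between objects
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final object closes the stream
    return "".join(parts)


# Sample lines shaped like the streamed output (values are illustrative)
sample = [
    '{"model": "llama3.2:3b", "response": "Hel", "done": false}',
    '{"model": "llama3.2:3b", "response": "lo!", "done": false}',
    '{"model": "llama3.2:3b", "response": "", "done": true}',
]
print(collect_stream(sample))  # prints: Hello!
```

If you would rather receive one JSON object, add "stream": false to the request body instead.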

Bake small models into the image or load them from a persistent volume so you don't re-download on every cold start — a multi-gigabyte pull on startup will blow past health-check timeouts.

Model loading and memory

Models load into memory at startup, which can take tens of seconds. Configure your readiness probe to allow for this, and set the start-up grace period generously. Keep the model warm — unloading and reloading per request destroys latency. Size the container's memory limit above the model's footprint plus working memory, or the OOM killer will terminate it mid-load.
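
One way to make the start-up grace period concrete is to poll the runtime until it answers, and only then report the container as ready. The sketch below is generic: `probe` is a hypothetical callable you supply (for example, one that sends a small request to the runtime and returns True on success), and the default timeout and interval are assumptions to tune against your model's real load time.

```python
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], bool], timeout_s: float = 120.0,
                     interval_s: float = 2.0,
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll probe() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except Exception:
            pass  # treat errors during startup as "not ready yet"
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The `clock` and `sleep` parameters exist only so the loop can be tested without waiting; in normal use, leave them at their defaults.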

A thin API wrapper

Rather than exposing the raw runtime, front it with a small API that handles auth and gives you a place to add rate limiting and input validation:

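import hmac
import os

import httpx
from fastapi import FastAPI, Header, HTTPException

app = FastAPI()
KEY = os.environ['API_KEY']
OLLAMA_URL = 'http://localhost:11434/api/generate'

@app.post('/generate')
async def generate(body: dict, authorization: str = Header(None)):
    # Constant-time comparison avoids leaking key contents through timing
    expected = f'Bearer {KEY}'.encode()
    if not hmac.compare_digest((authorization or '').encode(), expected):
        raise HTTPException(status_code=401, detail='Invalid or missing API key')
    # Ollama streams newline-delimited JSON by default, which r.json() cannot
    # parse; request a single JSON object instead
    payload = {**body, 'stream': False}
    async with httpx.AsyncClient(timeout=120) as c:
        r = await c.post(OLLAMA_URL, json=payload)
    if r.status_code != 200:
        raise HTTPException(status_code=502, detail='Inference runtime returned an error')
    return r.json()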

This keeps the inference runtime private and gives you a control point for quotas and logging.
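
Quotas at that control point are often implemented as a token bucket: each caller gets a burst allowance that refills at a steady rate. The class below is a minimal in-process sketch, not part of FastAPI or Ollama. The name `TokenBucket` and its parameters are illustrative, and a deployment with several replicas would need a shared store instead.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity      # start with a full bucket
        self.updated = clock()

    def allow(self) -> bool:
        """Consume one token if available; return whether the request may proceed."""
        now = self.clock()
        # Refill for the elapsed time, never exceeding capacity
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

In the wrapper, you might keep one bucket per API key and respond with HTTP 429 whenever `allow()` returns False.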

Deploying on PandaStack

1. Connect your repo (with the Ollama or vLLM Dockerfile) as a container app in the [dashboard](https://dashboard.pandastack.io).
2. Choose a compute tier sized for your model. PandaStack compute tiers range from Free (0.25 CPU / 512 MB) up to C2-2XCompute (8 CPU / 16 GB); pick a memory-optimized m1/m2 tier for larger models. Be realistic: modest CPU tiers are suitable only for small quantized models.
3. Set API_KEY and any model configuration as environment secrets.
4. PandaStack builds with rootless BuildKit and deploys via Helm with automatic SSL and live logs.
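
On the application side, reading those secrets in one place and failing fast when one is missing makes misconfiguration show up at startup, not on the first request. The loader below is a sketch under assumptions: only API_KEY comes from this guide, while MODEL_NAME and RUNTIME_URL are hypothetical variable names chosen for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    api_key: str
    model: str
    runtime_url: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read configuration from the environment, failing fast if the key is missing."""
    api_key = env.get("API_KEY", "")
    if not api_key:
        raise RuntimeError("API_KEY must be set")
    return Settings(
        api_key=api_key,
        # MODEL_NAME and RUNTIME_URL are hypothetical names, not platform settings
        model=env.get("MODEL_NAME", "llama3.2:3b"),
        runtime_url=env.get("RUNTIME_URL", "http://localhost:11434"),
    )
```

Passing a plain dict as `env` makes the loader easy to exercise in tests without touching the real environment.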

Cold starts matter a lot here

LLM containers are large and slow to start because the model must load into memory. Free-tier scale-to-zero means an idle API cold-starts *and* loads the model on the next request — potentially a minute. For anything interactive, run on a tier that keeps the instance warm so the model stays resident.
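
Within a running instance, Ollama can also be asked to keep a model resident. Its API reference describes a keep_alive field on /api/generate, and a request with no prompt loads the model without generating anything. The sketch below only builds such a request; the function name is illustrative, and the -1 value (keep loaded indefinitely) assumes the behavior described in Ollama's documentation for your version.

```python
import json
import urllib.request
from typing import Union


def preload_request(model: str, base_url: str = "http://localhost:11434",
                    keep_alive: Union[int, str] = -1) -> urllib.request.Request:
    """Build a POST that loads `model` and asks the runtime to keep it resident."""
    # No "prompt" key: the request loads the model without generating text
    payload = {"model": model, "keep_alive": keep_alive}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = preload_request("llama3.2:3b")
# urllib.request.urlopen(req, timeout=120) would send it to a running server
```

This keeps the model in memory only while the container itself is up; it does not stop a scale-to-zero tier from shutting the instance down.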

| Decision | Guidance |
| --- | --- |
| Model size | Match to memory: 4-bit 7B ≈ 4-6 GB |
| CPU vs GPU | Small quantized → CPU; larger → GPU |
| Cold start | Keep warm; bake model into image |
| Security | Front with an auth/rate-limit layer |

References

- [Ollama documentation](https://github.com/ollama/ollama/blob/main/docs/README.md)
- [Ollama API reference](https://github.com/ollama/ollama/blob/main/docs/api.md)
- [vLLM documentation](https://docs.vllm.ai/en/latest/)
- [Hugging Face: model memory and quantization](https://huggingface.co/docs/transformers/main/en/quantization/overview)

Serve your own LLM API with the right compute tier on PandaStack — explore the options and start at [dashboard.pandastack.io](https://dashboard.pandastack.io).
