Ollama is the easiest on-ramp to running open models like Llama, Mistral, Qwen, and Gemma. It bundles a runtime, a model registry, and an OpenAI-compatible API behind a single binary. Deploying it yourself gives you data control and predictable costs — but there are real considerations around compute, model storage, and security.
This guide covers a production-minded Ollama deployment and is honest about the hardware tradeoffs.
Set expectations on compute first
Be realistic: model size dictates everything.
| Model size | Practical compute | Notes |
|---|---|---|
| 1B–3B (e.g. Llama 3.2 3B) | CPU works; modest RAM | Slow but usable for light tasks |
| 7B–8B | GPU strongly recommended | CPU works but is slow |
| 13B–34B+ | GPU with ample VRAM | CPU impractical |
Small quantized models (3B/Q4) run acceptably on CPU for low-traffic internal tools. Anything 7B+ for real throughput wants a GPU. Choose your compute tier accordingly, and start small.
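To make the table concrete, here is a rough back-of-the-envelope estimate of how much memory a model needs. It is a sketch, not a measurement: the function names are illustrative, and the 1.2 overhead factor for KV cache and runtime buffers is an assumption that varies with context length and quantization format.

```python
def estimate_weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def estimate_runtime_gib(params_billion: float, bits_per_weight: float,
                         overhead_factor: float = 1.2) -> float:
    """Weights plus an assumed allowance for KV cache and runtime buffers."""
    return estimate_weights_gib(params_billion, bits_per_weight) * overhead_factor


# Compare a small quantized model with larger and unquantized ones.
for label, params, bits in [("3B at 4-bit", 3, 4), ("8B at 4-bit", 8, 4), ("8B at 16-bit", 8, 16)]:
    print(f"{label}: ~{estimate_runtime_gib(params, bits):.1f} GiB")
```

Real Q4 files usually carry slightly more than 4 bits per weight, so treat the result as a lower bound when picking a tier.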
Step 1: Run the official image
Ollama publishes an official container. The simplest deployment uses it directly:

    FROM ollama/ollama:latest
    # The image already exposes port 11434 and runs `ollama serve`

That's genuinely all you need to *run* the server. The interesting parts are storage, model pulling, and security.
Step 2: Persist model weights
Models are large (gigabytes each) and are downloaded at runtime into /root/.ollama. Containers are ephemeral, so without persistence every restart re-downloads everything, which wastes bandwidth and lengthens cold starts.
Attach a persistent volume mounted at /root/.ollama. With a volume, pulled models survive restarts and redeploys.
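One way to confirm the volume is doing its job is to ask the server which models it already has on disk after a restart. The sketch below uses Ollama's `GET /api/tags` endpoint; the helper names are hypothetical, and the base URL is whatever address your container is reachable at.

```python
import json
import urllib.request


def fetch_tags(base_url: str) -> dict:
    """GET /api/tags, which lists the models stored locally."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return json.load(resp)


def missing_models(tags: dict, wanted: list[str]) -> list[str]:
    """Return the wanted model names that are absent from a /api/tags response."""
    present = {m["name"] for m in tags.get("models", [])}
    return [name for name in wanted if name not in present]
```

For example, `missing_models(fetch_tags("http://localhost:11434"), ["llama3.2:3b"])` should return an empty list after a redeploy if the weights survived. The localhost address is an assumption for illustration.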
To pre-pull a model so it's ready on first request, run an init/post-deploy command:

    ollama pull llama3.2:3b

Step 3: Configure the server
Useful environment variables:

    # Bind to all interfaces so the platform ingress can reach it
    OLLAMA_HOST=0.0.0.0:11434
    # How long an idle model stays loaded in memory
    OLLAMA_KEEP_ALIVE=10m
    # Maximum number of models loaded at the same time, to control memory
    OLLAMA_MAX_LOADED_MODELS=1

OLLAMA_KEEP_ALIVE matters for cost. It is an idle timeout, not a schedule: a longer value keeps the model warm between requests during busy hours, while a shorter one lets it unload and free memory sooner when traffic is quiet.
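The server-wide timeout can also be overridden per request, because the Ollama API accepts a `keep_alive` field on `/api/generate` and `/api/chat`. The following is a minimal sketch with illustrative function names: a request without a prompt loads the model without generating text, and `keep_alive` set to `0` asks the server to unload it.

```python
import json
import urllib.request


def keep_alive_payload(model: str, keep_alive) -> bytes:
    """Body for POST /api/generate with no prompt: loads the model, or with
    keep_alive=0 unloads it, without generating any text."""
    body = {"model": model, "keep_alive": keep_alive, "stream": False}
    return json.dumps(body).encode()


def set_keep_alive(base_url: str, model: str, keep_alive) -> dict:
    """Send the request and return the decoded JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=keep_alive_payload(model, keep_alive),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)
```

A scheduled job could call `set_keep_alive(base_url, "llama3.2:3b", "30m")` before a busy period and `set_keep_alive(base_url, "llama3.2:3b", 0)` afterwards. Both the schedule and the values are assumptions to adapt.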
Step 4: Deploy on PandaStack
1. Push a repo containing the Dockerfile (or reference the `ollama/ollama` image).
2. Create a container app on [PandaStack](https://dashboard.pandastack.io).
3. Choose a compute tier sized for your model. PandaStack offers compute-optimized (c1/c2) and memory-optimized (m1/m2) tiers; pick memory-optimized for larger models that need RAM headroom on CPU.
4. Attach a persistent volume at `/root/.ollama`.
5. Expose port `11434`.
6. Set the pre-pull command to download your model after deploy.
Do not put an inference server on scale-to-zero unless you accept long cold starts — loading a multi-GB model from disk takes time, and from zero you'd also re-provision the node. For interactive use, keep at least one warm replica.
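To see how much of a request's latency comes from loading the model, you can read the `load_duration` field that Ollama returns (in nanoseconds) on non-streaming `/api/generate` and `/api/chat` responses. This sketch has hypothetical helper names, and the one-second threshold is an arbitrary assumption rather than a recommended value.

```python
def load_seconds(response: dict) -> float:
    """Time spent loading the model for this request, converted from nanoseconds."""
    return response.get("load_duration", 0) / 1e9


def is_cold_start(response: dict, threshold_s: float = 1.0) -> bool:
    """Heuristic: call it a cold start if loading took longer than threshold_s."""
    return load_seconds(response) > threshold_s
```

Logging `load_seconds` for the first request after each deploy gives a measured cold-start figure to weigh against the cost of a warm replica.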
Step 5: Secure the API — this is not optional
Ollama's API has no built-in authentication. An open :11434 on the public internet is an open invitation to run up your compute bill or exfiltrate data. Options:
- Don't expose it publicly. Keep Ollama internal and only let your own backend services reach it.
- Put a reverse proxy with auth in front. A small Nginx/Caddy sidecar that checks a bearer token before proxying to Ollama.
- Use platform firewall rules to restrict source IPs.
A minimal token-checking Caddy proxy:

    :8080 {
        @authorized header Authorization "Bearer {$API_TOKEN}"
        handle @authorized {
            reverse_proxy localhost:11434
        }
        respond 401
    }

Expose the proxy port publicly; keep 11434 internal.
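If one of your own backend services does the gatekeeping instead of a proxy, the same bearer-token check can be sketched in a few lines of Python. The function name is illustrative; the constant-time comparison is an addition beyond what the proxy config shows, included as a common precaution.

```python
import hmac


def is_authorized(authorization_header, expected_token: str) -> bool:
    """True only if the header is exactly 'Bearer <expected_token>'."""
    if not authorization_header or not expected_token:
        # Reject missing headers and refuse to run with an empty token.
        return False
    expected = f"Bearer {expected_token}"
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(authorization_header.encode(), expected.encode())
```

A handler would call `is_authorized(request_headers.get("Authorization"), token)` and return 401 when it is false, where `request_headers` and `token` stand in for whatever your framework and secret store provide.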
Step 6: Call it
Ollama speaks both its native API and an OpenAI-compatible endpoint:
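
    # Send the bearer token if the auth proxy from Step 5 is in front
    curl https://your-ollama.example.com/api/generate \
      -H "Authorization: Bearer $API_TOKEN" \
      -d '{
        "model": "llama3.2:3b",
        "prompt": "Explain durable workflows in one sentence.",
        "stream": false
      }'

The /v1 endpoint is OpenAI-compatible, so existing SDKs work: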
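
    from openai import OpenAI

    # api_key is sent as a bearer token; use the value your auth proxy expects
    client = OpenAI(base_url="https://your-ollama.example.com/v1", api_key="token")
    resp = client.chat.completions.create(
        model="llama3.2:3b",
        messages=[{"role": "user", "content": "Hi"}],
    )
    print(resp.choices[0].message.content)

Cost-control tips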
- Use the smallest model that meets quality needs; quantized (Q4) variants cut memory dramatically.
- Tune `OLLAMA_KEEP_ALIVE` so idle models free memory.
- Run one model per instance for predictable resource use.
- Monitor token throughput vs. compute spend before scaling up.
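For the last tip, throughput can be derived from the `eval_count` and `eval_duration` (nanoseconds) fields that Ollama includes in non-streaming responses. The sketch below pairs that with a simple cost figure. The helper names and the hourly rate are assumptions, and the cost calculation presumes the instance is busy the whole hour, which is rarely true in practice.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from eval_count and eval_duration (nanoseconds)."""
    duration_ns = response.get("eval_duration", 0)
    if duration_ns <= 0:
        return 0.0
    return response.get("eval_count", 0) / duration_ns * 1e9


def cost_per_million_tokens(tokens_per_s: float, hourly_rate: float) -> float:
    """Compute cost per one million generated tokens at full utilization."""
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    tokens_per_hour = tokens_per_s * 3600
    return hourly_rate / tokens_per_hour * 1_000_000
```

Comparing this number across tiers with your own measured throughput shows whether a larger instance actually lowers the cost per token before you scale up.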
References
- [Ollama documentation](https://github.com/ollama/ollama/tree/main/docs)
- [Ollama API reference](https://github.com/ollama/ollama/blob/main/docs/api.md)
- [Ollama OpenAI compatibility](https://github.com/ollama/ollama/blob/main/docs/openai.md)
- [Ollama Docker image](https://hub.docker.com/r/ollama/ollama)
---
Self-hosting Ollama is mostly about right-sizing compute and persisting weights — and locking down the API. PandaStack's compute and memory-optimized tiers plus persistent volumes cover the infrastructure; just remember to put auth in front. Explore the tiers at [dashboard.pandastack.io](https://dashboard.pandastack.io).