Tutorial · 10 min read · 2026-06-30

How to Deploy Ollama for Self-Hosted LLMs

Ollama makes running open LLMs as simple as pulling a model. This guide covers containerizing Ollama, choosing compute, persisting model weights, and securing the API for self-hosted inference.

Ajay Kumar
Founder & DevOps, PandaStack

Ollama is the easiest on-ramp to running open models like Llama, Mistral, Qwen, and Gemma. It bundles a runtime, a model registry, and an OpenAI-compatible API behind a single binary. Deploying it yourself gives you data control and predictable costs — but there are real considerations around compute, model storage, and security.

This guide covers a production-minded Ollama deployment and is honest about the hardware tradeoffs.

Set expectations on compute first

Be realistic: model size dictates everything.

| Model size | Practical compute | Notes |
|---|---|---|
| 1B–3B (e.g. Llama 3.2 3B) | CPU works; modest RAM | Slow but usable for light tasks |
| 7B–8B | GPU strongly recommended | CPU works but is slow |
| 13B–34B+ | GPU with ample VRAM | CPU impractical |

Small quantized models (3B/Q4) run acceptably on CPU for low-traffic internal tools. Anything 7B+ for real throughput wants a GPU. Choose your compute tier accordingly, and start small.
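As a rough sanity check on the sizing above, weight memory scales with parameter count times bits per weight. The sketch below is a back-of-the-envelope estimate only: the function names are illustrative, and the `overhead` multiplier (standing in for KV cache and runtime buffers) is an assumed placeholder, not a figure published by Ollama.

```python
def estimate_weight_gib(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def estimate_total_gib(params_billions: float, bits_per_weight: float,
                       overhead: float = 1.2) -> float:
    """Weights times an assumed overhead multiplier (hypothetical 20% default)."""
    return estimate_weight_gib(params_billions, bits_per_weight) * overhead
```

For example, `estimate_weight_gib(3, 4)` comes to roughly 1.4 GiB for a 3B model at 4 bits per weight, while `estimate_weight_gib(8, 16)` is close to 15 GiB, which is why unquantized 8B models quickly outgrow small instances.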

Step 1: Run the official image

Ollama publishes an official container. The simplest deployment uses it directly:

FROM ollama/ollama:latest
# The image already exposes port 11434 and runs `ollama serve`

That's genuinely all you need to *run* the server. The interesting parts are storage, model pulling, and security.

Step 2: Persist model weights

Models are large (gigabytes each) and downloaded at runtime into /root/.ollama. Containers are ephemeral, so without persistence every restart re-downloads everything — wasting bandwidth and cold-start time.

Attach a persistent volume mounted at /root/.ollama. With a volume, pulled models survive restarts and redeploys.

To pre-pull a model so it's ready on first request, run an init/post-deploy command:

ollama pull llama3.2:3b
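If the post-deploy hook can run Python instead of the CLI, the same pre-pull can go through Ollama's HTTP API (`/api/tags` to list local models, `/api/pull` to download). This is a minimal sketch, assuming the server is reachable at `http://localhost:11434`; the helper names are hypothetical, and older server versions may expect `name` rather than `model` in the pull body.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: Ollama reachable locally


def missing_models(tags_response: dict, wanted: list[str]) -> list[str]:
    """Return the wanted models that are not listed in a /api/tags response."""
    present = {m["name"] for m in tags_response.get("models", [])}
    return [name for name in wanted if name not in present]


def pull_missing(wanted: list[str], base_url: str = OLLAMA_URL) -> None:
    """Pull each wanted model that the server does not already have."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        tags = json.load(resp)
    for name in missing_models(tags, wanted):
        body = json.dumps({"model": name, "stream": False}).encode()
        req = urllib.request.Request(
            f"{base_url}/api/pull",
            data=body,
            headers={"Content-Type": "application/json"},
        )
        # With streaming disabled, the call blocks until the download finishes.
        with urllib.request.urlopen(req) as resp:
            print(name, json.load(resp).get("status"))
```

Calling `pull_missing(["llama3.2:3b"])` after the server is up skips the download when the persistent volume already holds the model.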

Step 3: Configure the server

Useful environment variables:

# Bind to all interfaces so the platform ingress can reach it
OLLAMA_HOST=0.0.0.0:11434
# How long an idle model stays loaded in memory
OLLAMA_KEEP_ALIVE=10m
# Maximum number of models kept loaded in memory at the same time
OLLAMA_MAX_LOADED_MODELS=1

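OLLAMA_KEEP_ALIVE matters for cost. It is an idle timeout rather than a schedule: a longer value keeps the model warm between requests during busy periods, while a shorter one lets it unload and free memory sooner when traffic is quiet.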
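The server-wide default can also be overridden per request through the API's `keep_alive` field, which is one way to approximate a business-hours policy from the client side. A minimal sketch, assuming a 09:00 to 18:00 weekday window and the durations `30m` and `1m`; all of those values and both function names are hypothetical choices for illustration.

```python
from datetime import datetime


def keep_alive_for(now: datetime, start_hour: int = 9, end_hour: int = 18) -> str:
    """Pick a keep_alive duration: long on weekdays in business hours, short otherwise."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    in_hours = start_hour <= now.hour < end_hour
    return "30m" if (is_weekday and in_hours) else "1m"


def generate_payload(model: str, prompt: str, now: datetime) -> dict:
    """Build an /api/generate request body with a per-request keep_alive."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": keep_alive_for(now),
    }
```

The backend that calls Ollama would pass `datetime.now()` in its own timezone, so the policy lives in application code rather than in the container environment.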

Step 4: Deploy on PandaStack

  1. Push a repo containing the Dockerfile (or reference the ollama/ollama image).
  2. Create a container app on [PandaStack](https://dashboard.pandastack.io).
  3. Choose a compute tier sized for your model. PandaStack offers compute-optimized (c1/c2) and memory-optimized (m1/m2) tiers; pick memory-optimized for larger models that need RAM headroom on CPU.
  4. Attach a persistent volume at /root/.ollama.
  5. Expose port 11434.
  6. Set the pre-pull command to download your model after deploy.

Do not put an inference server on scale-to-zero unless you accept long cold starts — loading a multi-GB model from disk takes time, and from zero you'd also re-provision the node. For interactive use, keep at least one warm replica.
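One way to take the model-load cost off the first user request is to warm the replica right after it starts. Per the Ollama API reference, a `/api/generate` call with no prompt loads the model into memory. The sketch below only builds that request; the helper name and the `10m` default are assumptions.

```python
import json
import urllib.request


def preload_request(base_url: str, model: str,
                    keep_alive: str = "10m") -> urllib.request.Request:
    """Build a prompt-less /api/generate call, which asks Ollama to load the model."""
    body = json.dumps({"model": model, "keep_alive": keep_alive}).encode()
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it with `urllib.request.urlopen(preload_request("http://localhost:11434", "llama3.2:3b"))` from a startup hook should leave the model resident before traffic arrives.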

Step 5: Secure the API — this is not optional

Ollama's API has no built-in authentication. An open :11434 on the public internet is an open invitation to run up your compute bill or exfiltrate data. Options:

  • Don't expose it publicly. Keep Ollama internal and only let your own backend services reach it.
  • Put a reverse proxy with auth in front. A small Nginx/Caddy sidecar that checks a bearer token before proxying to Ollama.
  • Use platform firewall rules to restrict source IPs.

A minimal token-checking Caddy proxy:

:8080 {
  @authorized header Authorization "Bearer {$API_TOKEN}"
  handle @authorized {
    reverse_proxy localhost:11434
  }
  respond 401
}

Expose the proxy port publicly; keep 11434 internal.
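If the check lives in your own backend rather than in a proxy, the same bearer-token comparison can be sketched in Python. This illustrates the logic only and is not how Caddy implements header matching; `is_authorized` is a hypothetical helper, and refusing requests when no token is configured is a design choice made here.

```python
import hmac
from typing import Optional


def is_authorized(authorization_header: Optional[str], api_token: str) -> bool:
    """Return True only if the header is exactly 'Bearer <api_token>'."""
    if not api_token or authorization_header is None:
        # Refuse everything when no token is configured, rather than
        # accepting an empty "Bearer " header.
        return False
    expected = f"Bearer {api_token}"
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(authorization_header.encode(), expected.encode())
```

The empty-token guard matters in either setup: an unset `API_TOKEN` should fail closed instead of turning the expected header into a bare `Bearer ` string.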

Step 6: Call it

Ollama speaks both its native API and an OpenAI-compatible endpoint:

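# Send the bearer token if requests go through the auth proxy from Step 5
curl https://your-ollama.example.com/api/generate \
  -H "Authorization: Bearer $API_TOKEN" \
  -d '{
  "model": "llama3.2:3b",
  "prompt": "Explain durable workflows in one sentence.",
  "stream": false
}'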

Because the /v1 endpoint is OpenAI-compatible, existing SDKs work once you point them at your base URL:

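from openai import OpenAI

# The SDK sends api_key as "Authorization: Bearer <api_key>", so use the
# proxy token from Step 5 here. Ollama itself ignores the value.
client = OpenAI(base_url="https://your-ollama.example.com/v1", api_key="token")
resp = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[{"role": "user", "content": "Hi"}],
)
print(resp.choices[0].message.content)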
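For environments where installing the SDK is not an option, the same chat call can be made with the standard library. This is a sketch of the request and response shapes for the OpenAI-compatible `/v1/chat/completions` route; the helper names are hypothetical, and the token is only needed when an auth proxy sits in front of Ollama.

```python
import json
import urllib.request


def build_chat_request(base_url: str, token: str, model: str,
                       user_message: str) -> urllib.request.Request:
    """Build a POST to the OpenAI-compatible chat completions endpoint."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }).encode()
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def extract_reply(response_body: dict) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion response."""
    return response_body["choices"][0]["message"]["content"]
```

A caller would pass the request to `urllib.request.urlopen`, decode the JSON body, and hand it to `extract_reply`.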

Cost-control tips

  • Use the smallest model that meets quality needs; quantized (Q4) variants cut memory dramatically.
  • Tune OLLAMA_KEEP_ALIVE so idle models free memory.
  • Run one model per instance for predictable resource use.
  • Monitor token throughput vs. compute spend before scaling up.
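To put numbers on the last tip, the final `/api/generate` response reports `eval_count` (tokens generated) and `eval_duration` (nanoseconds), which is enough to derive throughput. The cost function below is a rough sketch that assumes the instance is fully utilized; `hourly_cost` is whatever your compute tier charges, and the function names are illustrative.

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation throughput from the fields in an /api/generate response."""
    return eval_count / (eval_duration_ns / 1e9)


def cost_per_million_tokens(hourly_cost: float, tok_per_sec: float) -> float:
    """Compute spend per million generated tokens, assuming constant load."""
    tokens_per_hour = tok_per_sec * 3600
    return hourly_cost / tokens_per_hour * 1_000_000
```

Real utilization is usually well below 100%, so treat the result as a lower bound on cost per token and compare it across tiers before scaling up.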

References

  • [Ollama documentation](https://github.com/ollama/ollama/tree/main/docs)
  • [Ollama API reference](https://github.com/ollama/ollama/blob/main/docs/api.md)
  • [Ollama OpenAI compatibility](https://github.com/ollama/ollama/blob/main/docs/openai.md)
  • [Ollama Docker image](https://hub.docker.com/r/ollama/ollama)

---

Self-hosting Ollama is mostly about right-sizing compute and persisting weights — and locking down the API. PandaStack's compute and memory-optimized tiers plus persistent volumes cover the infrastructure; just remember to put auth in front. Explore the tiers at [dashboard.pandastack.io](https://dashboard.pandastack.io).

