Tutorial · 12 min read · 2026-07-01

How to Deploy an LLM Inference API to the Cloud

Deploy your own LLM inference API to the cloud: pick a serving runtime, size CPU vs GPU honestly, handle model loading and memory, and front it with a clean API.

Ajay Kumar
Founder & DevOps, PandaStack

Running your own LLM inference API gives you control over latency, cost, and data privacy — but it's also where teams most often underestimate hardware requirements. This guide walks through serving a model with a runtime like Ollama or vLLM, sizing resources honestly, and wrapping it in a production API.

Choose a serving runtime

| Runtime | Strengths | Best for |
|---|---|---|
| Ollama | Dead-simple, great DX, CPU or GPU | Small models, prototypes, internal tools |
| vLLM | High-throughput batching, GPU | Production serving of open models |
| llama.cpp / GGUF | Efficient CPU/quantized inference | Resource-constrained or CPU-only |

Ollama is the easiest entry point: it pulls models with one command and exposes an HTTP API. vLLM is the choice when you need throughput and have GPUs.

Be honest about hardware

This is the part to get right before you deploy anything. Model size drives memory:

  • A 7B-parameter model in 4-bit quantization needs roughly 4-6 GB; in 16-bit, ~14 GB.
  • Larger models (13B, 70B) scale up accordingly and realistically need GPUs for usable latency.
  • CPU inference *works* for small quantized models but is slow — fine for low-volume internal tools, not for latency-sensitive user-facing traffic.

Don't try to run a 70B model on a small CPU instance. Pick a small quantized model for CPU, or plan for GPU capacity for anything larger.
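
The figures above follow from simple arithmetic: weight memory is roughly parameter count times bits per weight, divided by 8. Here is a minimal sketch, where the function name and the 20% overhead allowance for KV cache and runtime buffers are illustrative assumptions rather than measured values:

```python
def estimate_model_memory_gb(params_billions: float, bits_per_weight: int,
                             overhead_factor: float = 1.2) -> float:
    """Rough memory estimate in GB (1 GB = 1e9 bytes).

    overhead_factor is an assumed allowance for KV cache and runtime
    buffers; real usage depends on context length and the runtime.
    """
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead_factor / 1e9


if __name__ == "__main__":
    for bits in (4, 16):
        print(f"7B @ {bits}-bit: ~{estimate_model_memory_gb(7, bits):.1f} GB")
```

With the assumed overhead, a 4-bit 7B model comes out at about 4.2 GB, inside the 4-6 GB range quoted above. Setting `overhead_factor=1.0` gives the weights-only figure of 14 GB for 16-bit.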

Serving with Ollama

Ollama exposes an OpenAI-compatible API and its own /api/generate endpoint. A minimal Dockerfile starts from the official image:

```dockerfile
FROM ollama/ollama:latest
# Optionally bake a model in at build time
```

At runtime, start the server and pull a model:

```bash
ollama serve &
ollama pull llama3.2:3b
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:3b",
  "prompt": "Hello"
}'
```
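
The same call can be made from Python. The sketch below uses only the standard library and assumes Ollama is listening on its default port 11434; `build_generate_request` and `generate` are hypothetical helper names. Setting `"stream": false` asks Ollama for a single JSON object instead of its default stream of newline-delimited chunks.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default local address


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST to Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the generated text."""
    req = build_generate_request(model, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running and the model pulled, `generate("llama3.2:3b", "Hello")` returns the completion as a string.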

Bake small models into the image or load them from a persistent volume so you don't re-download on every cold start — a multi-gigabyte pull on startup will blow past health-check timeouts.

Model loading and memory

Models load into memory at startup (or, with Ollama's defaults, on the first request that needs them), which can take tens of seconds. Configure your readiness probe to allow for this, and set the startup grace period generously. Keep the model warm: unloading and reloading per request destroys latency. Size the container's memory limit above the model's footprint plus working memory, or the OOM killer will terminate it mid-load.
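
One way to keep the model resident with Ollama is to preload it before marking the service ready. The sketch below relies on two behaviours described in the Ollama API reference: a `/api/generate` request with no prompt loads the model, and `keep_alive` controls how long it stays in memory (a negative value keeps it loaded indefinitely). The helper names, the polling interval, and the injected `probe` callable are illustrative assumptions.

```python
import time
from typing import Callable


def preload_payload(model: str, keep_alive: int = -1) -> dict:
    """Body for POST /api/generate that loads `model` without generating.

    Omitting the prompt asks Ollama to load the model only; a negative
    keep_alive asks it to keep the model in memory indefinitely.
    """
    return {"model": model, "keep_alive": keep_alive}


def wait_until_ready(probe: Callable[[], bool], timeout_s: float = 120.0,
                     interval_s: float = 2.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `probe` until it returns True or `timeout_s` elapses.

    `probe` is any zero-argument callable, e.g. one that POSTs
    preload_payload(...) to the runtime and returns True on HTTP 200.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if probe():
                return True
        except OSError:
            pass  # server not accepting connections yet
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

A startup script could call `wait_until_ready` with a probe that sends the preload body, and only then let the readiness endpoint report healthy.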

A thin API wrapper

Rather than exposing the raw runtime, front it with a small API that adds auth, rate limiting, and input validation:

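```python
from fastapi import FastAPI, Header, HTTPException
import httpx, os

app = FastAPI()
KEY = os.environ['API_KEY']  # fail fast at startup if the key is not set

@app.post('/generate')
async def generate(body: dict, authorization: str = Header(None)):
    # Reject requests that lack the expected bearer token
    if authorization != f'Bearer {KEY}':
        raise HTTPException(401)
    # Ollama streams newline-delimited JSON by default; ask for one JSON object
    payload = {**body, 'stream': False}
    async with httpx.AsyncClient(timeout=120) as c:
        r = await c.post('http://localhost:11434/api/generate', json=payload)
    if r.status_code != 200:
        raise HTTPException(502, 'inference runtime returned an error')
    return r.json()
```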

This keeps the inference runtime private and gives you a control point for quotas and logging.
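
The wrapper above only checks the API key. The rate limiting and input validation mentioned earlier could be sketched as follows; the token-bucket parameters, the 8,000-character prompt cap, and the function names are illustrative assumptions, not PandaStack or Ollama defaults.

```python
import time
from typing import Callable


class TokenBucket:
    """Token bucket: `capacity` burst, refilled at `rate` tokens per second."""

    def __init__(self, capacity: float, rate: float,
                 clock: Callable[[], float] = time.monotonic):
        self.capacity = capacity
        self.rate = rate
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self) -> bool:
        """Consume one token if available; return False when rate-limited."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


MAX_PROMPT_CHARS = 8000  # assumed cap; tune to the model's context window


def validate_body(body: dict) -> dict:
    """Return a sanitized request body or raise ValueError."""
    model, prompt = body.get("model"), body.get("prompt")
    if not isinstance(model, str) or not model:
        raise ValueError("model must be a non-empty string")
    if not isinstance(prompt, str) or not prompt:
        raise ValueError("prompt must be a non-empty string")
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError("prompt too long")
    # Forward only known fields and force a single JSON response.
    return {"model": model, "prompt": prompt, "stream": False}
```

In the handler, a failed `allow()` would typically map to HTTP 429 and a `ValueError` to HTTP 400, with one bucket kept per API key.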

Deploying on PandaStack

  1. Connect your repo (with the Ollama or vLLM Dockerfile) as a container app in the [dashboard](https://dashboard.pandastack.io).
  2. Choose a compute tier sized for your model. PandaStack compute tiers range from Free (0.25 CPU / 512 MB) up to C2-2XCompute (8 CPU / 16 GB); pick a memory-optimized m1/m2 tier for larger models. Be realistic: run only small quantized models on modest CPU tiers.
  3. Set API_KEY and any model configuration as environment secrets.
  4. PandaStack builds with rootless BuildKit and deploys via Helm with automatic SSL and live logs.

Cold starts matter a lot here

LLM containers are large and slow to start because the model must load into memory. Free-tier scale-to-zero means an idle API cold-starts *and* loads the model on the next request — potentially a minute. For anything interactive, run on a tier that keeps the instance warm so the model stays resident.

| Decision | Guidance |
|---|---|
| Model size | Match to memory: 4-bit 7B ≈ 4-6 GB |
| CPU vs GPU | Small quantized → CPU; larger → GPU |
| Cold start | Keep warm; bake model into image |
| Security | Front with an auth/rate-limit layer |

References

  • [Ollama documentation](https://github.com/ollama/ollama/blob/main/docs/README.md)
  • [Ollama API reference](https://github.com/ollama/ollama/blob/main/docs/api.md)
  • [vLLM documentation](https://docs.vllm.ai/en/latest/)
  • [Hugging Face: model memory and quantization](https://huggingface.co/docs/transformers/main/en/quantization/overview)

Serve your own LLM API with the right compute tier on PandaStack — explore the options and start at [dashboard.pandastack.io](https://dashboard.pandastack.io).
