Tutorial · 12 min read · 2026-07-03

How to Deploy a LlamaIndex RAG Service

Deploy a LlamaIndex retrieval-augmented generation service to production: separate ingestion from querying, persist your index, and serve grounded answers over HTTP.

Ajay Kumar
Founder & DevOps, PandaStack

Retrieval-augmented generation grounds an LLM in your own documents. LlamaIndex makes building a RAG pipeline straightforward, but deploying it raises questions the quickstart skips: where does the index live, how do you re-index without downtime, and how do you serve queries to many users? Here's a production-shaped approach.

Separate ingestion from querying

The single most important architectural decision: don't rebuild your index on every request, and don't rebuild it inside your query service. Split the system into two paths:

  1. Ingestion: chunk documents, compute embeddings, write vectors to a store. Runs offline or on a schedule.
  2. Query: load the existing index, retrieve relevant chunks, synthesize an answer. Runs per request, so it must be fast.

If you build the index in-process at startup, every deploy re-embeds your whole corpus (slow and expensive), and each replica ends up holding its own private copy of the index instead of sharing one.

Persist the index in a vector store

LlamaIndex's default in-memory VectorStoreIndex is fine for a demo but evaporates on restart. Back it with a real vector store. A clean option that needs no extra service: pgvector on a managed Postgres, so your vectors live in the same managed database you already have.

```python
# ingest.py: run as a job, not on every request
import os

from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.vector_stores.postgres import PGVectorStore
from sqlalchemy import make_url

# Parse the connection string into the pieces PGVectorStore expects.
url = make_url(os.environ["DATABASE_URL"])
vector_store = PGVectorStore.from_params(
    database=url.database, host=url.host, port=url.port or 5432,
    user=url.username, password=url.password,
    table_name="docs",
    embed_dim=1536,  # must match your embedding model's output dimension
)
storage = StorageContext.from_defaults(vector_store=vector_store)

# Load, chunk, and embed the documents, then write the vectors to Postgres.
docs = SimpleDirectoryReader("./data").load_data()
VectorStoreIndex.from_documents(docs, storage_context=storage)
print("Ingestion complete")
```
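To make the `make_url` step less opaque, here is a standard-library sketch that splits a connection URL into the same fields passed to `PGVectorStore.from_params`. The helper name `pg_params_from_url` and the example URL are illustrative assumptions; the scripts in this post use SQLAlchemy's `make_url` instead.

```python
from urllib.parse import unquote, urlsplit


def pg_params_from_url(database_url: str) -> dict:
    """Split a Postgres URL into connection fields (hypothetical helper)."""
    parts = urlsplit(database_url)
    return {
        "database": parts.path.lstrip("/"),
        "host": parts.hostname,
        "port": parts.port or 5432,  # fall back to the default Postgres port
        # Credentials may be percent-encoded inside a URL, so decode them.
        "user": unquote(parts.username) if parts.username else None,
        "password": unquote(parts.password) if parts.password else None,
    }


# The URL below is made up for illustration.
params = pg_params_from_url("postgresql://rag_user:s3cr%40t@db.internal:5432/ragdb")
print(params["host"], params["port"], params["database"])
```

Parsing the URL once and passing explicit fields keeps the ingestion job and the query service pointed at the same database without duplicating credentials in code.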

The query service

```python
# app.py
import os

from fastapi import FastAPI
from pydantic import BaseModel
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.postgres import PGVectorStore
from sqlalchemy import make_url

# Connect to the table that ingest.py populated; nothing is re-embedded here.
url = make_url(os.environ["DATABASE_URL"])
vector_store = PGVectorStore.from_params(
    database=url.database, host=url.host, port=url.port or 5432,
    user=url.username, password=url.password,
    table_name="docs", embed_dim=1536,
)
index = VectorStoreIndex.from_vector_store(vector_store)
query_engine = index.as_query_engine(similarity_top_k=4)

app = FastAPI()


class Q(BaseModel):
    question: str


@app.post("/query")
def query(q: Q):
    # Retrieve the top-k chunks and synthesize an answer from them.
    resp = query_engine.query(q.question)
    return {
        "answer": str(resp),
        # Metadata of each retrieved chunk, so callers can cite the source.
        "sources": [n.node.metadata for n in resp.source_nodes],
    }
```

Returning sources is not optional in production — grounded answers without citations are unverifiable. Always surface where the answer came from.
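Raw chunk metadata can be repetitive when several retrieved chunks come from the same file. The sketch below collapses them into one citation per document, best score first. It uses small stand-in classes rather than real LlamaIndex objects, and it assumes each chunk's metadata carries a `file_name` key; `format_sources` is a hypothetical helper, not part of LlamaIndex.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeNode:
    """Stand-in for a retrieved node: only the metadata dict is modelled."""
    metadata: dict = field(default_factory=dict)


@dataclass
class FakeNodeWithScore:
    """Stand-in for one item of resp.source_nodes (a node plus its score)."""
    node: FakeNode
    score: Optional[float] = None


def format_sources(source_nodes, key="file_name"):
    """Return one citation per document, ordered by best similarity score."""
    best = {}
    for item in source_nodes:
        name = item.node.metadata.get(key, "unknown")
        score = item.score if item.score is not None else 0.0
        # Keep only the highest-scoring chunk seen for each document.
        if name not in best or score > best[name]:
            best[name] = score
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return [{"source": name, "score": round(score, 3)} for name, score in ranked]


nodes = [
    FakeNodeWithScore(FakeNode({"file_name": "pricing.md"}), 0.81),
    FakeNodeWithScore(FakeNode({"file_name": "faq.md"}), 0.92),
    FakeNodeWithScore(FakeNode({"file_name": "pricing.md"}), 0.88),
]
print(format_sources(nodes))
```

In the service, the same function could be applied to `resp.source_nodes` in place of the list comprehension, provided your documents carry the metadata key you choose.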

Tuning retrieval quality

RAG quality lives and dies on retrieval, not the LLM:

| Lever | Effect |
| --- | --- |
| Chunk size | Too big = noisy context; too small = lost meaning. Start at ~512 tokens. |
| similarity_top_k | More chunks = more context but more cost/noise. 3-5 is a common sweet spot. |
| Embedding model | Better embeddings = better retrieval; match embed_dim to the model. |
| Reranking | A reranker over top-k results often beats raising k. |
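To get a feel for the chunk-size lever, here is a toy fixed-window chunker. It is a simplified illustration, not LlamaIndex's own splitter: whitespace-separated words stand in for tokens, and the overlap of 64 is an assumed value chosen only for the example.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


words = [f"w{i}" for i in range(1200)]
chunks = chunk_tokens(words, chunk_size=512, overlap=64)
print(len(chunks), [len(c) for c in chunks])

# Rough upper bound on retrieved context: top_k chunks of chunk_size tokens each.
context_tokens = 4 * 512
print(context_tokens)
```

With `similarity_top_k=4` and 512-token chunks, up to about 2048 tokens of retrieved text reach the LLM per query, which is why raising either lever increases cost.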

Containerize

```dockerfile
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```

Deploy on PandaStack

  1. Provision a managed PostgreSQL and enable the pgvector extension. PandaStack supports PostgreSQL 14.x/16.x; DATABASE_URL is auto-wired into your apps.
  2. Push the repo and create a container app for the query service connected to the same project, so it inherits DATABASE_URL.
  3. Add OPENAI_API_KEY (or your embeddings/LLM provider key) as an env var.
  4. Run ingestion as a separate step: a cronjob for periodic re-indexing, or a one-off run when documents change. PandaStack supports cronjobs natively, so scheduled re-indexing is built in rather than bolted on.
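Since both the service and the ingestion job depend on environment variables, it can help to fail fast when one is missing. The following is a minimal sketch; the helper names `missing_env` and `check_env` are hypothetical, and the variable list assumes you use OpenAI as the provider.

```python
import os

REQUIRED_ENV = ("DATABASE_URL", "OPENAI_API_KEY")


def missing_env(required=REQUIRED_ENV, environ=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


def check_env(environ=None):
    """Raise a clear error at startup instead of failing on the first query."""
    missing = missing_env(environ=environ)
    if missing:
        raise RuntimeError("Missing required env vars: " + ", ".join(missing))


# Demo with an explicit dict so the example does not depend on the real environment.
print(missing_env(environ={"DATABASE_URL": "postgresql://example"}))
```

Calling `check_env()` at the top of `app.py` and `ingest.py` would surface a misconfigured deploy immediately rather than on the first request.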

Why pgvector here instead of a dedicated vector DB

For small-to-medium corpora, keeping vectors in your managed Postgres means one fewer service to operate, transactional consistency between metadata and vectors, and automatic backups covering your index. If you later outgrow it, you can move to a dedicated store, but many RAG services never need to.

Free tier considerations

Free-tier databases are small (dev/hobby sized), which is fine for a few thousand chunks. For a large knowledge base, move to a paid plan with more storage and connections. The query service itself is light; just remember free-tier apps cold-start after idle.
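A quick back-of-the-envelope estimate shows why a few thousand chunks fit comfortably. pgvector documents the storage for a `vector` value as 4 bytes per dimension plus 8 bytes. The sketch below applies that formula; the chunk counts are illustrative assumptions, and the result covers raw vectors only, not chunk text, metadata, or indexes.

```python
def vector_bytes(dimensions: int) -> int:
    """Storage for one pgvector `vector` value: 4 bytes per dimension plus 8."""
    return 4 * dimensions + 8


def estimate_vector_storage_mb(num_chunks: int, dimensions: int = 1536) -> float:
    """Raw vector storage in MB (10^6 bytes), excluding text, metadata, indexes."""
    return num_chunks * vector_bytes(dimensions) / 1_000_000


# Illustrative corpus sizes, not measurements.
for n in (5_000, 100_000):
    print(n, "chunks ->", round(estimate_vector_storage_mb(n), 1), "MB")
```

At 1536 dimensions, 5,000 chunks come to roughly 30.8 MB of vectors, while 100,000 chunks come to roughly 615.2 MB, which is the kind of jump that pushes a knowledge base off a small database.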

Re-indexing without downtime

Because the query service only *reads* the index, you can re-ingest in the background while the service keeps serving the old vectors. Write the new vectors to a fresh table, then switch the service to that table in a single step (an env var or a small config flip), so no request ever reads a half-built index.
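One way to wire up the table swap is to version table names and let the service read the active one from configuration. This is a sketch under stated assumptions: the `INDEX_TABLE` variable and both helper names are hypothetical, and a service that reads the variable at startup only picks up the new table after a restart or redeploy.

```python
import os
from datetime import datetime, timezone


def new_table_name(base="docs", now=None):
    """Name for the next ingestion run, e.g. docs_20260703_120000 (UTC)."""
    now = now or datetime.now(timezone.utc)
    return f"{base}_{now:%Y%m%d_%H%M%S}"


def active_table_name(environ=None, default="docs"):
    """Table the query service should read: INDEX_TABLE if set, else the default."""
    env = os.environ if environ is None else environ
    return env.get("INDEX_TABLE") or default


# Fixed timestamp so the example output is reproducible.
stamp = datetime(2026, 7, 3, 12, 0, 0, tzinfo=timezone.utc)
fresh = new_table_name(now=stamp)
print(fresh)
print(active_table_name(environ={}))
print(active_table_name(environ={"INDEX_TABLE": fresh}))
```

The ingestion job would pass `new_table_name()` as `table_name`, and the service would pass `active_table_name()`; once ingestion finishes, updating `INDEX_TABLE` moves traffic to the new vectors, and the old table can be dropped later.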

References

  • [LlamaIndex documentation](https://docs.llamaindex.ai/)
  • [LlamaIndex: Postgres vector store](https://docs.llamaindex.ai/en/stable/examples/vector_stores/postgres/)
  • [pgvector](https://github.com/pgvector/pgvector)
  • [LlamaIndex: Storing and persisting](https://docs.llamaindex.ai/en/stable/module_guides/storing/)

A good RAG service is mostly good plumbing: persist the index, separate ingestion, and cite sources. PandaStack gives you managed pgvector-capable Postgres, auto-wired DATABASE_URL, and built-in cronjobs for re-indexing. Start on the free tier at [dashboard.pandastack.io](https://dashboard.pandastack.io).
