# How to Deploy a LlamaIndex RAG Service
Retrieval-augmented generation grounds an LLM in your own documents. LlamaIndex makes building a RAG pipeline straightforward, but deploying it raises questions the quickstart skips: where does the index live, how do you re-index without downtime, and how do you serve queries to many users? Here's a production-shaped approach.
## Separate ingestion from querying
The single most important architectural decision: don't rebuild your index on every request, and don't rebuild it inside your query service. Split the system into two paths:
1. Ingestion: chunk documents, compute embeddings, write vectors to a store. Runs offline or on a schedule.
2. Query: load the existing index, retrieve relevant chunks, synthesize an answer. Runs per request, fast.
If you build the index in-process at startup, every deploy re-embeds your whole corpus (slow and expensive) and your index isn't shared across replicas.
## Persist the index in a vector store
LlamaIndex's default in-memory VectorStoreIndex is fine for a demo but evaporates on restart. Back it with a real vector store. A clean option that needs no extra service: pgvector on a managed Postgres, so your vectors live in the same managed database you already have.
```python
# ingest.py: run as a job, not on every request
import os

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.vector_stores.postgres import PGVectorStore
from sqlalchemy import make_url

url = make_url(os.environ["DATABASE_URL"])
vector_store = PGVectorStore.from_params(
    database=url.database, host=url.host, port=url.port,
    user=url.username, password=url.password,
    table_name="docs",
    embed_dim=1536,  # must match the embedding model's output dimension
)
storage = StorageContext.from_defaults(vector_store=vector_store)

# Load, chunk, embed, and write the vectors to Postgres.
docs = SimpleDirectoryReader("./data").load_data()
VectorStoreIndex.from_documents(docs, storage_context=storage)
print("Ingestion complete")
```

## The query service
```python
# app.py
import os

from fastapi import FastAPI
from pydantic import BaseModel
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.postgres import PGVectorStore
from sqlalchemy import make_url

url = make_url(os.environ["DATABASE_URL"])
vector_store = PGVectorStore.from_params(
    database=url.database, host=url.host, port=url.port,
    user=url.username, password=url.password,
    table_name="docs",
    embed_dim=1536,  # must match the value used in ingest.py
)

# Attach to the vectors written by ingest.py; no documents are re-embedded at startup.
index = VectorStoreIndex.from_vector_store(vector_store)
query_engine = index.as_query_engine(similarity_top_k=4)

app = FastAPI()


class Q(BaseModel):
    question: str


@app.post("/query")
def query(q: Q):
    resp = query_engine.query(q.question)
    return {
        "answer": str(resp),
        "sources": [n.node.metadata for n in resp.source_nodes],
    }
```

Returning sources is not optional in production: grounded answers without citations are unverifiable. Always surface where the answer came from.
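Raw node metadata can be repetitive when several retrieved chunks come from the same file. As a minimal sketch of one way to shape the citations, the hypothetical `format_sources` helper below collapses the retrieved nodes into one entry per file and keeps the best similarity score. It assumes each node's metadata carries a `file_name` key (which `SimpleDirectoryReader` typically sets) and that each retrieved item exposes `.node.metadata` and `.score`; adjust the key if your metadata differs.

```python
def format_sources(source_nodes):
    """Collapse retrieved nodes into one citation per file, keeping the best score.

    Assumes each item has `.node.metadata` (a dict) and `.score` (float or None).
    """
    best = {}
    for n in source_nodes:
        # Fall back to "unknown" when the metadata has no file_name key.
        name = n.node.metadata.get("file_name", "unknown")
        score = n.score if n.score is not None else 0.0
        if name not in best or score > best[name]:
            best[name] = score
    # Highest-scoring source first.
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return [{"file_name": name, "score": round(score, 3)} for name, score in ranked]
```

In the endpoint above, this would replace the list comprehension: `"sources": format_sources(resp.source_nodes)`.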
## Tuning retrieval quality
RAG quality lives and dies on retrieval, not the LLM:
| Lever | Effect |
|---|---|
| Chunk size | Too big = noisy context; too small = lost meaning. Start ~512 tokens. |
| similarity_top_k | More chunks = more context but more cost/noise. 3-5 is a common sweet spot. |
| Embedding model | Better embeddings = better retrieval; match embed_dim to the model. |
| Reranking | A reranker over top-k results often beats raising k. |
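
To make the chunk-size lever concrete, here is a minimal sketch of fixed-size chunking with overlap. It is an illustration only: it counts whitespace-separated words as a rough stand-in for tokens, and the `chunk_words` helper and its 64-word default overlap are assumptions for this example. LlamaIndex's own splitters work on tokens and try to respect sentence boundaries, so treat this as a way to reason about the trade-off rather than a replacement.

```python
def chunk_words(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into windows of `chunk_size` words that share `overlap` words.

    Words approximate tokens here; real token counts depend on the tokenizer.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Larger windows produce fewer, noisier chunks; smaller windows produce more chunks that may each lose surrounding meaning. The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk.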
## Containerize

```dockerfile
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```

## Deploy on PandaStack
1. Provision a managed PostgreSQL and enable the pgvector extension. PandaStack supports PostgreSQL 14.x/16.x; DATABASE_URL is auto-wired into your apps.
2. Push the repo and create a container app for the query service connected to the same project, so it inherits DATABASE_URL.
3. Add OPENAI_API_KEY (or your embeddings/LLM provider key) as an env var.
4. Run ingestion as a separate step: a cronjob for periodic re-indexing, or a one-off run when documents change. PandaStack supports cronjobs natively, so scheduled re-indexing is built in rather than bolted on.
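
A scheduled ingestion job re-embeds the corpus on every run unless it can tell that nothing changed. One possible guard, sketched below under stated assumptions, is to fingerprint the data directory and skip the run when the fingerprint matches the last one. The helper names (`corpus_fingerprint`, `needs_reindex`, `record_fingerprint`) and the state file are hypothetical, not part of LlamaIndex or PandaStack.

```python
import hashlib
from pathlib import Path


def corpus_fingerprint(data_dir: str) -> str:
    """Hash every file's relative path and bytes into one stable digest."""
    digest = hashlib.sha256()
    root = Path(data_dir)
    # Sort so the digest does not depend on directory listing order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def needs_reindex(data_dir: str, state_file: str) -> bool:
    """True when the corpus differs from the fingerprint saved by the last run."""
    state = Path(state_file)
    previous = state.read_text().strip() if state.exists() else ""
    return corpus_fingerprint(data_dir) != previous


def record_fingerprint(data_dir: str, state_file: str) -> None:
    """Save the current fingerprint after a successful ingestion."""
    Path(state_file).write_text(corpus_fingerprint(data_dir))
```

The job would call `needs_reindex` first, run ingestion only when it returns `True`, and call `record_fingerprint` afterwards. The state file has to live outside the data directory and somewhere that survives between runs; a container's local filesystem usually does not, so a row in Postgres may be a better home for the digest.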
## Why pgvector here instead of a dedicated vector DB
For small-to-medium corpora, keeping vectors in your managed Postgres means one fewer service to operate, transactional consistency between metadata and vectors, and automatic backups covering your index. If you later outgrow it, you can move to a dedicated store — but most RAG services never need to.
## Free tier considerations
Free-tier databases are small (dev/hobby sized), which is fine for a few thousand chunks. For a large knowledge base, move to a paid plan with more storage and connections. The query service itself is light; just remember free-tier apps cold-start after idle.
## Re-indexing without downtime
Because the query service only *reads* the index, you can re-ingest in the background while the service keeps serving the old vectors. Write to a new table, then atomically point the service at it (env var or a small config flip) to swap.
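
The swap can be sketched with versioned table names. This is a minimal illustration, assuming a hypothetical `DOCS_TABLE` environment variable and two hypothetical helpers: ingestion writes to a fresh, timestamped table, and the query service reads whichever table the variable names.

```python
import os
from datetime import datetime, timezone
from typing import Mapping, Optional


def new_table_name(prefix: str = "docs", now: Optional[datetime] = None) -> str:
    """Name of the table a fresh ingestion run writes to, e.g. docs_20240102_030405."""
    now = now or datetime.now(timezone.utc)
    return f"{prefix}_{now:%Y%m%d_%H%M%S}"


def active_table_name(env: Mapping[str, str] = os.environ, default: str = "docs") -> str:
    """Table the query service reads; changing DOCS_TABLE flips it."""
    return env.get("DOCS_TABLE", default)
```

Ingestion would pass `table_name=new_table_name()` to the vector store, and the query service would pass `table_name=active_table_name()`. Because the service shown earlier builds its vector store once at import time, a new `DOCS_TABLE` value takes effect when the replicas restart, so a rolling restart completes the swap. Old tables can be dropped once no replica reads them.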
## References
- [LlamaIndex documentation](https://docs.llamaindex.ai/)
- [LlamaIndex: Postgres vector store](https://docs.llamaindex.ai/en/stable/examples/vector_stores/postgres/)
- [pgvector](https://github.com/pgvector/pgvector)
- [LlamaIndex: Storing and persisting](https://docs.llamaindex.ai/en/stable/module_guides/storing/)
A good RAG service is mostly good plumbing: persist the index, separate ingestion, and cite sources. PandaStack gives you managed pgvector-capable Postgres, auto-wired DATABASE_URL, and built-in cronjobs for re-indexing. Start on the free tier at [dashboard.pandastack.io](https://dashboard.pandastack.io).