Most "AI apps" don't need GPUs
There's a useful distinction at the heart of AI app hosting in 2026. The large majority of AI applications are orchestration apps: they call a model provider's API (OpenAI, Anthropic, Google, or an inference host), manage a vector store, stream tokens to the browser, and run background jobs for embeddings and ingestion. These run well on ordinary CPU infrastructure. A smaller set of apps run models themselves and need GPUs. Knowing which kind you're building determines where you should host it.
What an AI orchestration app needs
- Streaming support (SSE/WebSockets) for token-by-token responses — and no aggressive idle timeouts.
- A vector database (pgvector, or a dedicated vector store).
- Background jobs / cronjobs for embedding pipelines and ingestion.
- Secure secret management for model API keys.
- Generous request timeouts, since LLM calls can run for tens of seconds.
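To make the streaming requirement concrete, here is a minimal sketch of Server-Sent Events framing in Python. The helper names `sse_frame` and `stream_tokens`, the JSON payload shape, and the `[DONE]` sentinel are illustrative assumptions, not any provider's API.

```python
import json
from typing import Iterable, Iterator, Optional


def sse_frame(data: str, event: Optional[str] = None) -> str:
    """Format one Server-Sent Events frame.

    Each line of the payload gets its own `data:` field, and a blank
    line terminates the frame, as the SSE wire format requires.
    """
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    return "\n".join(lines) + "\n\n"


def stream_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one SSE frame per token, then a final `done` event."""
    for token in tokens:
        # JSON-encode so a token containing a newline stays on one data line.
        yield sse_frame(json.dumps({"token": token}))
    # Hypothetical end-of-stream marker; pick whatever your client expects.
    yield sse_frame("[DONE]", event="done")
```

In a FastAPI app, a generator like this could be wrapped in a `StreamingResponse` with `media_type="text/event-stream"`. Whatever proxy sits in front of the app then needs to leave the connection open between tokens, which is why idle timeouts matter.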
What a model-serving app needs
- GPU instances (A100/H100/L4 class) and GPU autoscaling.
- Fast cold starts or warm pools for inference.
- Model serving is a separate category, covered in more detail in our LLM inference hosting post.
The options
| Platform | CPU orchestration | GPU serving | Vector DB | Streaming |
|---|---|---|---|---|
| Vercel / Netlify | Yes (functions) | No | BYO | Yes |
| Modal / Replicate / Baseten | Yes | Yes | BYO | Yes |
| AWS/GCP/Azure | Yes | Yes | Managed/BYO | Yes |
| Fly.io / Railway | Yes | Limited | BYO | Yes |
| PandaStack | Yes | No | pgvector on Postgres | Yes |
GPU-first platforms
If you serve your own models, Modal, Replicate, and Baseten are purpose-built for GPU inference with autoscaling and cold-start optimization. Hyperscalers give you raw GPU instances with full control. Use these when you genuinely run models.
Where PandaStack fits
PandaStack is an excellent home for the orchestration majority of AI apps. Deploy your RAG API, chat backend, or agent server as a container (Node/Python/Go), and PandaStack wires in a managed Postgres — enable pgvector and you have a vector store next to your app with no extra service:
```python
# FastAPI RAG service on PandaStack: pgvector in the auto-wired Postgres
import os

import asyncpg
from pgvector.asyncpg import register_vector  # from the pgvector Python package

# DATABASE_URL is injected; enable the extension once:
#   CREATE EXTENSION IF NOT EXISTS vector;

async def search(embedding):
    conn = await asyncpg.connect(os.environ['DATABASE_URL'])
    try:
        await register_vector(conn)  # lets asyncpg encode the vector parameter
        # <-> is pgvector's L2 (Euclidean) distance operator
        return await conn.fetch(
            'SELECT id, content FROM docs ORDER BY embedding <-> $1 LIMIT 5',
            embedding)
    finally:
        await conn.close()  # avoid leaking one connection per request
```

You get streaming over WebSockets/SSE through Kong ingress, cronjobs for embedding/ingestion pipelines, secure environment variables for your model API keys, live logs, metrics, custom domains, and automatic SSL. The all-in-one model means your chat frontend (static site), API (container), vector store (managed Postgres + pgvector), and ingestion jobs (cronjobs) live on one platform.
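The ingestion side can be sketched briefly as well. The example below shows only the chunking step a cronjob might run before embedding. `chunk_text`, the default chunk size, and the overlap are hypothetical choices. The embedding call and the insert into `docs` are omitted because they depend on your provider and schema.

```python
from typing import List


def chunk_text(text: str, size: int = 800, overlap: int = 100) -> List[str]:
    """Split text into overlapping character windows for embedding.

    Overlap keeps a little shared context between neighbouring chunks so
    a sentence cut at a boundary still appears whole in one of them.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    step = size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        # Stop once this window has reached the end of the text, so we
        # do not emit a trailing chunk made up purely of overlap.
        if start + size >= len(text):
            break
    return chunks
```

A cronjob could call `chunk_text(document)` for each new document, embed every chunk through the model API, and write the rows to the same Postgres the search endpoint reads from.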
The honest limit: PandaStack does not offer GPU instances. If your app self-hosts and serves models, use a GPU-first platform (Modal/Replicate/Baseten/hyperscaler GPUs) for the inference part. You can still run your orchestration layer and vector store on PandaStack and call out to the GPU service from there. For the very common pattern of "call a model API + manage vectors + stream results," PandaStack covers the whole app on CPU.
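One way to picture the hybrid setup is a small routing function in the orchestration layer. This is a sketch under assumptions: the environment variable names (`GPU_INFERENCE_URL`, `GPU_INFERENCE_KEY`, `MODEL_API_KEY`), the placeholder provider URL, the timeout values, and the `InferenceTarget` type are all hypothetical, not PandaStack or provider conventions.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class InferenceTarget:
    kind: str               # "gpu_service" or "model_api"
    url: str
    api_key: Optional[str]  # read from the environment, never hard-coded
    timeout_s: float        # LLM calls can run for tens of seconds


def pick_target(env: Mapping[str, str]) -> InferenceTarget:
    """Prefer a self-hosted GPU service if one is configured,
    otherwise fall back to a hosted model API."""
    gpu_url = env.get("GPU_INFERENCE_URL")
    if gpu_url:
        return InferenceTarget(
            kind="gpu_service",
            url=gpu_url,
            api_key=env.get("GPU_INFERENCE_KEY"),
            timeout_s=120.0,  # assumed headroom for cold starts
        )

    api_key = env.get("MODEL_API_KEY")
    if not api_key:
        raise RuntimeError("set MODEL_API_KEY or GPU_INFERENCE_URL")
    return InferenceTarget(
        kind="model_api",
        url="https://api.example.com/v1",  # placeholder, not a real endpoint
        api_key=api_key,
        timeout_s=60.0,
    )
```

Calling `pick_target(os.environ)` at startup keeps the choice in configuration, so the same container image can serve either the API-only or the hybrid deployment.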
A pragmatic split architecture
```text
Frontend (PandaStack static) ─▶ Orchestration API (PandaStack container)
                                  ├─▶ Model API (OpenAI/Anthropic/...)
                                  ├─▶ pgvector (PandaStack managed Postgres)
                                  └─▶ GPU inference (Modal/Baseten) [only if self-serving]
Ingestion cronjob (PandaStack) ─▶ embeddings ─▶ pgvector
```

Decision guide
- Self-serve your own models on GPUs → Modal / Replicate / Baseten / hyperscaler GPUs.
- Call model APIs + vectors + streaming + jobs (CPU) → PandaStack.
- Hybrid → orchestration + vectors on PandaStack, GPU inference on a GPU platform.
References
- pgvector: https://github.com/pgvector/pgvector
- Modal docs: https://modal.com/docs
- Replicate: https://replicate.com/docs
- Anthropic API: https://docs.anthropic.com/
- Server-Sent Events (MDN): https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events
---
Building an AI app that calls model APIs and needs a vector store? PandaStack runs your orchestration API with pgvector in a managed Postgres and cronjobs for ingestion. Start free at https://dashboard.pandastack.io