Tutorial · 11 min read · 2026-06-30

How to Deploy Airbyte Data Integration

A practical guide to self-hosting Airbyte for ELT data integration, covering its architecture, resource requirements, and how to run the platform alongside a managed Postgres destination.

Ajay Kumar
Founder & DevOps, PandaStack

Airbyte has become the default open-source choice for ELT (extract, load, transform) pipelines. With 350+ connectors and licensing that allows free self-hosting, it lets data teams sync from SaaS APIs and databases into a warehouse without paying per-row fees. The catch: self-hosting Airbyte is not a single-container affair. It's a distributed system, and treating it like a hobby app is how teams end up with stuck syncs and OOM-killed workers.

This guide walks through what Airbyte actually needs to run, how to deploy it sensibly, and how to pair it with a managed database so you're not also operating Postgres by hand.

What Airbyte actually is

Airbyte is not one process. A current Airbyte deployment (Helm-based, now that the project has consolidated on abctl and Helm) includes:

  • airbyte-server — the API backend
  • airbyte-webapp — the React UI
  • airbyte-worker / workload-launcher — orchestrates sync jobs
  • temporal — workflow engine that drives every sync
  • airbyte-db — a Postgres instance for Airbyte's own metadata (configs, job history, state)
  • connector pods — ephemeral containers spun up per sync (one for source, one for destination)

That last point matters. Airbyte launches a new container for every connector run. On Kubernetes this means it needs permission to create pods; on a single VM with Docker it means the worker needs the Docker socket. This is why "just run the Docker image" advice tends to fall apart at scale.

Resource requirements (be honest with yourself)

The Airbyte docs recommend a minimum of 4 vCPU and 8 GB RAM for a low-volume deployment, and that's genuinely a floor, not a comfortable target. Connector containers each want 0.5–2 GB depending on the source. A few concurrent syncs of a chatty API (think large Salesforce or Postgres CDC streams) will push you past 8 GB quickly.

Workload                    vCPU   RAM      Notes
Eval / single small sync    2      4 GB     Will be tight; fine for kicking tires
Low-volume production       4      8 GB     Airbyte's stated minimum
Multiple concurrent syncs   8+     16 GB+   Connector pods add up fast
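
To make the arithmetic behind these tiers concrete, here is a rough sizing sketch. It assumes each sync runs two connector containers (source and destination, as described above) and takes the platform's own overhead as a parameter. The 4 GB baseline in the example is an illustrative assumption, not an Airbyte figure, and the function name is hypothetical.

```shell
# estimate_ram_mb SYNCS PER_CONNECTOR_MB PLATFORM_BASE_MB
# Rough RAM budget in MB: platform overhead plus two connector
# containers (source + destination) per concurrent sync.
estimate_ram_mb() (
  syncs=$1
  per_connector_mb=$2
  platform_base_mb=$3   # assumed overhead for server, worker, Temporal, etc.
  echo $(( platform_base_mb + syncs * 2 * per_connector_mb ))
)

# 3 concurrent syncs at 1 GB per connector on an assumed 4 GB baseline
estimate_ram_mb 3 1024 4096   # prints 10240
```

Under those assumptions, three modest concurrent syncs already need about 10 GB, which is why the 8 GB tier is a floor and not a target.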

Separate your metadata database from the workloads. Running Airbyte's internal Postgres as a co-located container is fine for testing, but for anything you care about, point Airbyte's database settings (DATABASE_URL, or the global.database.* Helm values) at a managed Postgres instance with backups. Losing Airbyte's metadata DB means losing all your connection configs and incremental sync state.

Deploying the core platform

The officially supported path is Helm. Add the repo and install:

helm repo add airbyte https://airbytehq.github.io/helm-charts
helm repo update

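# Quote the variables so special characters in credentials survive the shell.
# Assumption: the postgresql.enabled and global.database.type value names match
# your chart version; check its values.yaml before relying on them.
helm install airbyte airbyte/airbyte \
  --namespace airbyte --create-namespace \
  --set postgresql.enabled=false \
  --set global.database.type=external \
  --set global.database.host="$PG_HOST" \
  --set global.database.port=5432 \
  --set global.database.database=airbyte \
  --set global.database.user="$PG_USER" \
  --set-string global.database.password="$PG_PASSWORD"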

For a Docker-Compose-style single-node run during evaluation, the abctl CLI wraps a local kind cluster:

curl -LsfS https://get.airbyte.com | bash -
abctl local install

The key configuration decisions are: externalize the metadata DB, set a real secrets backend (don't leave credentials in the internal DB unencrypted in production), and size the worker's job parallelism (MAX_SYNC_WORKERS) to match your node.
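
One way to pick a starting value for MAX_SYNC_WORKERS is to work backwards from node memory. The sketch below is a back-of-the-envelope helper, not an Airbyte tool: the function name is hypothetical, and the platform baseline and per-connector memory are numbers you would measure on your own cluster.

```shell
# max_sync_workers NODE_MB PLATFORM_BASE_MB PER_CONNECTOR_MB
# How many syncs fit on a node if each sync needs two connector
# containers and the platform itself reserves PLATFORM_BASE_MB.
max_sync_workers() (
  node_mb=$1
  platform_base_mb=$2
  per_connector_mb=$3
  free_mb=$(( node_mb - platform_base_mb ))
  if [ "$free_mb" -le 0 ]; then
    echo 0   # the platform alone does not fit on this node
  else
    echo $(( free_mb / (2 * per_connector_mb) ))
  fi
)

# 16 GB node, assumed 4 GB platform baseline, 2 GB per connector
max_sync_workers 16384 4096 2048   # prints 3
```

Treat the result as an upper bound and start lower. How the variable is passed to the worker depends on your chart version.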

The auto-wired database angle

The single most error-prone part of self-hosting Airbyte is its metadata Postgres. On PandaStack you provision a managed PostgreSQL (14.x or 16.x via KubeBlocks) and the connection string is injected as DATABASE_URL, so you point Airbyte's global.database.* values at managed credentials instead of babysitting a Postgres container. Scheduled and manual backups cover the "I lost all my connectors" failure mode.
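
If your platform hands you a single DATABASE_URL, you still need to split it into the separate global.database.* values. A minimal sketch follows, assuming a URL of the form postgres://user:password@host:port/dbname with no percent-encoded characters and no '@' in the password. The function name is hypothetical, and the output includes the password, so keep it out of logs.

```shell
# database_url_to_helm_flags URL
# Print one Helm flag per line for the parts of a postgres:// URL.
database_url_to_helm_flags() (
  url=${1#*://}           # drop the scheme
  creds=${url%%@*}        # user:password
  rest=${url#*@}          # host:port/dbname?params
  hostport=${rest%%/*}    # host:port
  db=${rest#*/}           # dbname?params
  db=${db%%\?*}           # strip any query string
  host=${hostport%%:*}
  port=${hostport#*:}
  if [ "$port" = "$hostport" ]; then
    port=5432             # no port in the URL: assume the Postgres default
  fi
  printf '%s\n' "--set global.database.host=$host"
  printf '%s\n' "--set global.database.port=$port"
  printf '%s\n' "--set global.database.database=$db"
  printf '%s\n' "--set global.database.user=${creds%%:*}"
  printf '%s\n' "--set-string global.database.password=${creds#*:}"
)
```

Running it against your own URL and checking the output by eye before passing the flags to helm is a cheap way to catch a mis-split connection string.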

A pragmatic split:

  • Airbyte platform runs as container apps (server, worker, temporal) with enough compute headroom — Airbyte is memory-hungry, so use a memory-optimized tier (m1/m2) rather than the free tier.
  • Airbyte metadata DB runs as a managed Postgres with backups.
  • Your warehouse destination (the place Airbyte loads data into) is a separate, larger Postgres or an external warehouse like BigQuery/Snowflake.

Don't run Airbyte itself on free-tier, scale-to-zero, preemptible nodes — Temporal workflows and long-running syncs do not appreciate being cold-started mid-job.

Connecting a source and destination

Once the UI is reachable, the flow is the same as Airbyte Cloud:

  1. Set up a source: pick a connector (Postgres CDC, Stripe, GitHub, etc.), enter credentials, test the connection.
  2. Set up a destination: your warehouse Postgres or BigQuery.
  3. Create a connection: choose streams, sync mode (full refresh vs incremental + dedupe), and a schedule.

For incremental syncs, Airbyte stores a cursor/state in its metadata DB. This is exactly why that DB must be durable: a fresh full-refresh of a 200M-row table because you lost state is an expensive accident.

-- For Postgres CDC sources, enable logical replication on the source DB
ALTER SYSTEM SET wal_level = logical;
-- wal_level only takes effect after a Postgres restart; confirm with SHOW wal_level;
-- then create a publication and replication slot per Airbyte's source docs

Operational gotchas

  • Temporal is load-bearing. If Temporal is unhealthy, syncs silently queue and never run. Monitor it.
  • Connector pod scheduling. On Kubernetes, ensure the namespace has quota and the service account can create pods, or syncs fail with cryptic launcher errors.
  • Disk for logs and staging. Some destinations stage data on local disk before loading. Watch ephemeral storage.
  • Upgrades change schemas. Airbyte runs DB migrations on its metadata DB during upgrades. Back up before upgrading, every time.
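
Two of these checks are easy to script. The sketch below wraps a pre-upgrade metadata backup and a pod-creation permission check in small functions. It assumes the airbyte namespace, release name, and PG_HOST / PG_USER variables from the install step. The values file and the service account name are placeholders you would supply, and the helper names are hypothetical.

```shell
# Run a command, or only print it when DRY_RUN=1 so the plan can be reviewed first.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Dump the metadata DB, then upgrade. The upgrade is skipped if the dump fails.
# pg_dump reads the password from PGPASSWORD or ~/.pgpass.
backup_then_upgrade() {
  backup_file=$1    # where to write the dump
  values_file=$2    # placeholder: a Helm values file holding your settings
  run pg_dump --host="$PG_HOST" --port=5432 --username="$PG_USER" \
    --dbname=airbyte --format=custom --file="$backup_file" || return 1
  run helm upgrade airbyte airbyte/airbyte --namespace airbyte \
    --values "$values_file"
}

# Ask the API server whether the given service account may create pods.
check_pod_permissions() {
  service_account=$1   # assumption: the account your Airbyte release runs as
  run kubectl auth can-i create pods --namespace airbyte \
    --as="system:serviceaccount:airbyte:$service_account"
}
```

With DRY_RUN=1 set, `backup_then_upgrade airbyte-metadata.dump values.yaml` only prints the pg_dump and helm commands it would run, which is a reasonable first pass before doing it for real.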

When self-hosting is and isn't worth it

Self-hosting Airbyte makes sense when you have data-residency requirements, high sync volumes where Airbyte Cloud's credit pricing gets expensive, or you simply want full control. It does not make sense if you have one analyst and three small syncs — Airbyte Cloud or a lighter tool will cost you less in operational time.

If you do self-host, the win is treating the stateless platform pieces and the stateful database as separate concerns. A managed Postgres for metadata removes the scariest failure mode, and putting the platform on a memory-optimized container tier keeps syncs from getting OOM-killed.

References

  • Airbyte deployment docs: https://docs.airbyte.com/deploying-airbyte/
  • Airbyte Helm charts: https://github.com/airbytehq/helm-charts
  • abctl quickstart: https://docs.airbyte.com/using-airbyte/getting-started/oss-quickstart
  • Airbyte architecture overview: https://docs.airbyte.com/understanding-airbyte/high-level-view
  • Temporal documentation: https://docs.temporal.io/

---

Want to run Airbyte's metadata store without operating Postgres yourself? PandaStack's free tier includes a managed database with backups and an auto-injected DATABASE_URL, plus container apps for the platform itself. Spin it up at https://dashboard.pandastack.io.
