Gradio turns a Python function into a web UI in a few lines, which makes it the fastest way to share an ML model. The gap people hit is moving from demo.launch() on a laptop to a stable, public deployment that handles real traffic. This guide covers the deployment-specific configuration that matters: binding for containers, loading models once, managing concurrency with Gradio's queue, and securing the demo.

A minimal Gradio app, deployment-ready

The two settings that matter for containers are the server name and port. Gradio defaults to 127.0.0.1, which is unreachable from outside a container. Bind to 0.0.0.0 and read the port from the environment:

import os

import gradio as gr

# load_model() stands in for your own loading code; it is not defined here.
model = load_model()  # load ONCE at module level, not per request

def predict(image):
    return model(image)

demo = gr.Interface(fn=predict, inputs="image", outputs="label")

if __name__ == "__main__":
    demo.queue().launch(
        server_name="0.0.0.0",
        # PORT is set by the platform; fall back to Gradio's default port.
        server_port=int(os.getenv("PORT", "7860")),
    )

Load the model at import time so it's resident in memory and reused across requests. Loading inside predict would reload the weights on every call, adding the full load time to every request.
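If you would rather defer the load until the first request, a cached accessor gives the same load-once behavior. The sketch below uses only the standard library; load_model here is a hypothetical stand-in that counts its own calls, not a real model loader.

```python
import functools

LOAD_CALLS = 0  # counts how often the expensive load actually runs


def load_model():
    """Hypothetical stand-in for an expensive weight load."""
    global LOAD_CALLS
    LOAD_CALLS += 1
    # Return a fake "model": a callable that maps an image to label scores.
    return lambda image: {"cat": 0.9, "dog": 0.1}


@functools.lru_cache(maxsize=1)
def get_model():
    # The first call loads the model; later calls return the cached object.
    return load_model()


def predict(image):
    return get_model()(image)


for _ in range(3):
    predict("fake-image")

print(LOAD_CALLS)
```

The trade-off is that the first visitor pays the load time. Calling get_model() once at startup, before launch(), avoids that and also avoids two simultaneous first requests both triggering a load.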

The queue is not optional

ML inference is expensive and often single-threaded (especially on one GPU). If ten users hit your demo at once, running all ten predictions concurrently can exhaust memory. Gradio's built-in queue solves this: it processes a bounded number of requests at a time and makes the others wait with a position indicator. Gradio 4 and later enable the queue by default, so calling .queue() explicitly is mainly how you configure it:

demo.queue(max_size=20, default_concurrency_limit=2)

default_concurrency_limit caps how many predictions run simultaneously. Set it to match your hardware, often 1 or 2 for a single GPU. max_size bounds the waiting line so the app sheds load gracefully instead of queuing forever.
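To make the two knobs concrete, here is a toy model of the same idea built from standard-library threading primitives. It is not Gradio's implementation; the constants and the handle_request function are illustrative names, and the 50 ms sleep stands in for inference.

```python
import threading
import time

CONCURRENCY_LIMIT = 2  # plays the role of default_concurrency_limit
MAX_WAITING = 3        # plays the role of max_size

slots = threading.Semaphore(CONCURRENCY_LIMIT)
lock = threading.Lock()
state = {"running": 0, "peak": 0, "waiting": 0, "done": 0, "rejected": 0}


def handle_request():
    # Shed load: refuse the request if the waiting line is already full.
    with lock:
        if state["waiting"] >= MAX_WAITING:
            state["rejected"] += 1
            return
        state["waiting"] += 1
    # Block until one of the limited execution slots is free.
    with slots:
        with lock:
            state["waiting"] -= 1
            state["running"] += 1
            state["peak"] = max(state["peak"], state["running"])
        time.sleep(0.05)  # stand-in for model inference
        with lock:
            state["running"] -= 1
            state["done"] += 1


threads = [threading.Thread(target=handle_request) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(state)
```

However the threads interleave, the peak number of simultaneous "predictions" never exceeds the limit, and every request is either completed or rejected rather than left waiting indefinitely.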

Authentication

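A public ML demo can be abused (compute is expensive). Gradio has built-in username/password protection. Reading the credentials with os.environ[...] rather than os.getenv makes the app fail at startup if one is missing, instead of passing None as a credential:

demo.launch(server_name="0.0.0.0", auth=(os.environ["DEMO_USER"], os.environ["DEMO_PASS"]))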

Keep credentials in environment variables. For more than a couple of users, put the demo behind your platform's access controls or an SSO-protected route.
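Gradio's auth parameter also accepts a function that takes a username and password and returns True for a valid login. A minimal sketch of such a function follows, assuming the same DEMO_USER and DEMO_PASS variables; check_login is an illustrative name, not part of Gradio.

```python
import hmac
import os


def check_login(username, password):
    """Return True only when both values match the configured credentials."""
    expected_user = os.environ.get("DEMO_USER")
    expected_pass = os.environ.get("DEMO_PASS")
    if not expected_user or not expected_pass:
        # Refuse every login rather than accept empty or missing credentials.
        return False
    # compare_digest avoids leaking how many leading characters matched.
    user_ok = hmac.compare_digest(username.encode(), expected_user.encode())
    pass_ok = hmac.compare_digest(password.encode(), expected_pass.encode())
    return user_ok and pass_ok
```

You would pass it as demo.launch(server_name="0.0.0.0", auth=check_login). Because the environment is read on each call, rotating the credentials only requires restarting the container with new values.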

Dockerfile

Model dependencies are heavy, so be deliberate about the image:

FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
ENV HF_HOME=/cache/huggingface
EXPOSE 7860
CMD ["python", "app.py"]

If you download model weights at runtime, point HF_HOME at a mounted volume so they cache across restarts instead of re-downloading on every boot.
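A read-only or missing mount tends to show up only when the first download fails. As a sketch, a startup check like the one below can surface that earlier; ensure_cache_dir is a hypothetical helper, and the fallback path mirrors the usual Hugging Face default of ~/.cache/huggingface.

```python
import os
import tempfile
from pathlib import Path


def ensure_cache_dir(default="~/.cache/huggingface"):
    """Create the directory named by HF_HOME if needed and confirm it is writable."""
    cache_dir = Path(os.environ.get("HF_HOME", default)).expanduser()
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Creating a throwaway file raises early if the mount is read-only.
    with tempfile.NamedTemporaryFile(dir=cache_dir):
        pass
    return cache_dir
```

Calling ensure_cache_dir() before loading the model, and logging the returned path, also makes it easy to confirm from the logs that the volume is the one you expected.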

Deploying on PandaStack

1. Connect your GitHub repo as a container app. PandaStack auto-detects the Dockerfile (or you can use buildpacks for a pure-Python app).
2. The build runs in an ephemeral Kubernetes Job pod with rootless BuildKit, pushes to Artifact Registry, and deploys via Helm.
3. Add any auth credentials and model API keys as environment variables.
4. Choose a tier sized for your model; large CPU models need a memory-optimized tier.
5. Tail the live logs to confirm the model loaded once and the server bound to the port. You get an HTTPS URL with automatic SSL to share.

Setting	Value	Why
`server_name`	`0.0.0.0`	Reachable in container
`server_port`	`$PORT`	Platform-assigned
`.queue()`	enabled	Protect memory under load
concurrency limit	1-2 (GPU)	Match hardware
auth	env-based	Prevent abuse
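The binding rows of the table can be gathered into one small function so a malformed PORT value fails loudly at startup. This is a sketch under the assumption that the platform supplies PORT as a string; launch_settings is an illustrative helper, not a Gradio API.

```python
import os


def launch_settings(env=None):
    """Build keyword arguments for demo.launch() from environment variables."""
    env = os.environ if env is None else env
    raw_port = env.get("PORT", "7860")  # 7860 is Gradio's default port
    try:
        port = int(raw_port)
    except ValueError:
        raise ValueError(f"PORT must be an integer, got {raw_port!r}") from None
    if not 1 <= port <= 65535:
        raise ValueError(f"PORT out of range: {port}")
    # 0.0.0.0 makes the server reachable from outside the container.
    return {"server_name": "0.0.0.0", "server_port": port}
```

With this in place the launch call becomes demo.launch(**launch_settings()), and the env parameter lets you exercise the function with a plain dict in tests.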

Free tier and model size

Be realistic about size. A small classifier fits comfortably; large vision or diffusion models need a bigger tier than the free 0.25 CPU / 512MB. Use the free tier for lightweight demos and scale up for heavier models. Scale-to-zero is attractive for occasionally used demos: the app idles at zero cost and cold-starts on the next visit. The cold start includes reloading the model, though, so a multi-gigabyte model means a noticeable wait on first access. For demos with steady traffic, keep a warm instance.
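A rough way to judge fit is to multiply the parameter count by the bytes per parameter. The sketch below covers the weights only; activations, the Python runtime, and the framework itself add more, which is why it reserves headroom. The 50% headroom figure and the reading of 512MB as 512 MiB are assumptions for illustration, not platform guidance.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weights_mib(num_params, dtype="float32"):
    """Approximate size of the model weights in MiB."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 2)


def fits(num_params, dtype, limit_mib, headroom=0.5):
    """True when the weights use at most `headroom` of the memory limit."""
    return weights_mib(num_params, dtype) <= limit_mib * headroom


# A 25M-parameter classifier in float32 is roughly 95 MiB of weights.
print(round(weights_mib(25_000_000), 1))
print(fits(25_000_000, "float32", limit_mib=512))
# A 1B-parameter model in float16 is roughly 1.9 GiB of weights.
print(fits(1_000_000_000, "float16", limit_mib=512))
```

By this estimate the small classifier fits within a 512 MiB limit with room to spare, while the billion-parameter model does not fit even in half precision.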

WebSocket and Server-Sent Events power Gradio's live updates and queue position; these pass cleanly through Kong ingress, so the interactive experience works without extra configuration.

Verifying

Open the URL, run a prediction, and open a second tab to see the queue position indicator when concurrency is reached. Watch memory in the metrics view during inference to confirm you're within the tier's limit.
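If you want a number in the logs to compare against the metrics view, the process can report its own peak memory. This is a sketch using the standard-library resource module, which is available on Unix-like systems only; peak_rss_mib is an illustrative helper name.

```python
import resource
import sys


def peak_rss_mib():
    """Peak resident memory of this process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 ** 2 if sys.platform == "darwin" else 1024
    return peak / divisor


print(f"peak RSS: {peak_rss_mib():.1f} MiB")
```

Logging this after the model loads and again after the first few predictions shows how close the process gets to the tier's limit. It does not include GPU memory.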

Common pitfalls

Binding to 127.0.0.1, which makes the app unreachable in a container.
Loading the model per request instead of once at startup.
Running without a queue, so concurrent requests exhaust memory.
Leaving a public demo without auth, so anyone can run up your compute.
Choosing an undersized tier for the model, so the container OOM-crashes on load.

References

Gradio documentation: https://www.gradio.app/docs
Gradio queuing and concurrency: https://www.gradio.app/guides/queuing
Gradio sharing and deployment: https://www.gradio.app/guides/sharing-your-app
Gradio authentication: https://www.gradio.app/guides/sharing-your-app#authentication
Hugging Face cache configuration: https://huggingface.co/docs/huggingface_hub/guides/manage-cache

Gradio gets you from model to UI fast; the deployment work is binding correctly, loading once, and using the queue to protect memory. Deploy your demo with automatic SSL and a sized container on PandaStack's free tier: https://dashboard.pandastack.io

