Tutorial | 9 min read | 2026-07-01

How to Deploy a Gradio ML Demo App

Deploy a Gradio machine-learning demo to production: configuring the server for containers, loading models efficiently, handling concurrency and the queue, authentication, and going from notebook to a shareable URL.

Ajay Kumar
Founder & DevOps, PandaStack

Gradio turns a Python function into a web UI in a few lines, which makes it the fastest way to share an ML model. The gap people hit is moving from demo.launch() on a laptop to a stable, public deployment that handles real traffic. This guide covers the deployment-specific configuration that matters: binding for containers, loading models once, managing concurrency with Gradio's queue, and securing the demo.

A minimal Gradio app, deployment-ready

The two settings that matter for containers are the server name and port. Gradio defaults to 127.0.0.1, which is unreachable from outside a container. Bind to 0.0.0.0 and read the port from the environment:

import os

import gradio as gr

# load_model() is a placeholder for your own loading code, not a Gradio function.
model = load_model()  # load ONCE at module level, not per request

def predict(image):
    return model(image)

demo = gr.Interface(fn=predict, inputs="image", outputs="label")

if __name__ == "__main__":
    demo.queue().launch(
        server_name="0.0.0.0",
        server_port=int(os.getenv("PORT", 7860)),
    )

Load the model at import time so it's resident in memory and reused across requests. Loading inside predict would reload weights on every call, which is catastrophically slow.

The queue is not optional

ML inference is expensive and often effectively serial, especially on a single GPU. If ten users hit your demo at once, running all ten predictions concurrently can exhaust memory. Gradio's built-in queue solves this: call .queue() and Gradio processes a bounded number of requests at a time, making the others wait with a position indicator:

demo.queue(max_size=20, default_concurrency_limit=2)

default_concurrency_limit caps how many predictions run simultaneously. Set it to match your hardware, often 1 or 2 for a single GPU. max_size bounds the waiting line: once it is full, new requests are turned away with a queue-full message instead of waiting indefinitely, so the app sheds load gracefully.
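
To build intuition for how the two parameters interact, here is a standard-library simulation. It is a toy model and not Gradio's implementation: `QueueSim` and `simulate` are hypothetical names, and "waiting" here counts only requests that have not started running yet.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class QueueSim:
    """Toy bounded, concurrency-limited queue (not Gradio's internals)."""

    def __init__(self, max_size, concurrency_limit):
        self.max_size = max_size
        self._pool = ThreadPoolExecutor(max_workers=concurrency_limit)
        self._lock = threading.Lock()
        self._waiting = 0
        self._active = 0
        self.peak_active = 0
        self.rejected = 0

    def submit(self, fn, *args):
        """Queue fn, or return None if the waiting line is already full."""
        with self._lock:
            if self._waiting >= self.max_size:
                self.rejected += 1
                return None
            self._waiting += 1
        return self._pool.submit(self._run, fn, *args)

    def _run(self, fn, *args):
        with self._lock:
            self._waiting -= 1
            self._active += 1
            self.peak_active = max(self.peak_active, self._active)
        try:
            return fn(*args)
        finally:
            with self._lock:
                self._active -= 1

    def close(self):
        self._pool.shutdown(wait=True)


def simulate(n_requests, max_size, concurrency_limit):
    """Fire n_requests at once; return (accepted, rejected, peak_active)."""
    gate = threading.Event()

    def slow_predict(x):
        gate.wait()  # hold every job until all requests have arrived
        return x * 2

    sim = QueueSim(max_size, concurrency_limit)
    futures = [sim.submit(slow_predict, i) for i in range(n_requests)]
    gate.set()
    sim.close()
    accepted = sum(1 for f in futures if f is not None)
    return accepted, sim.rejected, sim.peak_active
```

Calling `simulate(30, max_size=20, concurrency_limit=2)` never runs more than two jobs at a time and turns away the requests that arrive after the line is full. The exact split depends on thread timing, so treat the numbers as illustrative.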

Authentication

A public ML demo invites abuse because every request costs compute. Gradio has built-in username/password auth, which puts a login page in front of the app:

creds = (os.environ["DEMO_USER"], os.environ["DEMO_PASS"])  # KeyError at startup if either is unset
demo.launch(server_name="0.0.0.0", server_port=int(os.getenv("PORT", 7860)), auth=creds)

Keep credentials in environment variables. For more than a couple of users, put the demo behind your platform's access controls or an SSO-protected route.
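
Gradio's auth parameter also accepts a function that takes a username and password and returns True or False. The sketch below builds such a function from the same two environment variables. The helper names are hypothetical, and the constant-time comparison is a precaution I am adding here, not something the Gradio docs require.

```python
import hmac
import os


def make_auth_checker(expected_user, expected_password):
    """Return a (username, password) -> bool function for launch(auth=...)."""

    def check(username, password):
        # Compare as bytes so non-ASCII input does not raise TypeError.
        user_ok = hmac.compare_digest(username.encode(), expected_user.encode())
        pass_ok = hmac.compare_digest(password.encode(), expected_password.encode())
        return user_ok and pass_ok

    return check


def auth_from_env(env=None):
    """Build the checker from DEMO_USER / DEMO_PASS, failing fast if unset."""
    env = os.environ if env is None else env
    missing = [name for name in ("DEMO_USER", "DEMO_PASS") if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return make_auth_checker(env["DEMO_USER"], env["DEMO_PASS"])
```

Passing `auth=auth_from_env()` to `demo.launch(...)` means a missing variable stops the app at startup rather than leaving it running with broken credentials.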

Dockerfile

Model dependencies are heavy, so be deliberate about the image:

FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
ENV HF_HOME=/cache/huggingface
EXPOSE 7860
CMD ["python", "app.py"]

If you download model weights at runtime, point HF_HOME at a mounted volume so they cache across restarts instead of re-downloading on every boot.
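
A quick startup check can tell you whether the cache volume is actually mounted and warm. This is a sketch under assumptions: it only inspects the directory named by HF_HOME (falling back to the usual ~/.cache/huggingface default), and the function names are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def resolve_cache_dir(env=None):
    """Return the directory HF_HOME points at, or the usual default."""
    env = os.environ if env is None else env
    return Path(env.get("HF_HOME", "~/.cache/huggingface")).expanduser()


def describe_cache(cache_dir):
    """Report whether the cache is writable and how many bytes it holds."""
    cache_dir = Path(cache_dir)
    try:
        cache_dir.mkdir(parents=True, exist_ok=True)
        with tempfile.TemporaryFile(dir=cache_dir):
            pass  # creating a temp file proves the directory is writable
        writable = True
    except OSError:
        writable = False
    cached = sum(p.stat().st_size for p in cache_dir.rglob("*") if p.is_file())
    return {"path": str(cache_dir), "writable": writable, "cached_bytes": cached}
```

Logging `describe_cache(resolve_cache_dir())` before loading the model shows in the deploy logs whether a restart found existing weights (a non-zero `cached_bytes`) or will download them again.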

Deploying on PandaStack

  1. Connect your GitHub repo as a container app. PandaStack auto-detects the Dockerfile (or you can use buildpacks for a pure-Python app).
  2. The build runs in an ephemeral Kubernetes Job pod with rootless BuildKit, pushes to Artifact Registry, and deploys via Helm.
  3. Add any auth credentials and model API keys as environment variables.
  4. Choose a tier sized for your model, such as a memory-optimized tier for large CPU models.
  5. Tail the live logs to confirm the model loaded once and the server bound to the port. You get an HTTPS URL with automatic SSL to share.

Setting             Value       Why
server_name         0.0.0.0     Reachable in container
server_port         $PORT       Platform-assigned
.queue()            enabled     Protect memory under load
concurrency limit   1-2 (GPU)   Match hardware
auth                env-based   Prevent abuse
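
The settings summarized above can be gathered in one function so the app has a single place to read them. This is a sketch, not a required pattern: QUEUE_MAX_SIZE and CONCURRENCY_LIMIT are hypothetical variable names I am introducing, while PORT, DEMO_USER, and DEMO_PASS come from the examples earlier in this guide.

```python
import os


def build_launch_config(env=None):
    """Return (queue_kwargs, launch_kwargs) built from environment variables."""
    env = os.environ if env is None else env

    port = int(env.get("PORT", "7860"))
    if not 1 <= port <= 65535:
        raise ValueError(f"PORT out of range: {port}")

    queue_kwargs = {
        # Hypothetical variable names; defaults match the values used above.
        "max_size": int(env.get("QUEUE_MAX_SIZE", "20")),
        "default_concurrency_limit": int(env.get("CONCURRENCY_LIMIT", "2")),
    }
    launch_kwargs = {"server_name": "0.0.0.0", "server_port": port}

    # Only enable auth when both credentials are present.
    user, password = env.get("DEMO_USER"), env.get("DEMO_PASS")
    if user and password:
        launch_kwargs["auth"] = (user, password)

    return queue_kwargs, launch_kwargs
```

In app.py this would be used as `queue_kwargs, launch_kwargs = build_launch_config()` followed by `demo.queue(**queue_kwargs).launch(**launch_kwargs)`. Note that this version silently skips auth when the credentials are missing; raise instead if the demo must never run unprotected.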

Free tier and model size

Be realistic about size. A small classifier fits comfortably; large vision or diffusion models need a bigger tier than the free 0.25 CPU / 512MB. Use the free tier for lightweight demos and scale up for heavier models. Scale-to-zero is attractive for occasionally used demos: the app idles at zero cost and cold-starts on the next visitor. The cold start includes reloading the model, though, so a multi-gigabyte model means a noticeable wait on first access. For demos with steady traffic, keep a warm instance.
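
A back-of-the-envelope estimate helps decide whether a model is even a candidate for a given tier. The sketch below multiplies parameter count by bytes per parameter. The 1.5x overhead factor is my own assumption for activations and runtime buffers, and it does not include the framework's baseline memory, so measure the real app before trusting it.

```python
def estimate_weight_mib(n_params, bytes_per_param=4):
    """Weights only: parameters x bytes each (4 for float32, 2 for float16)."""
    return n_params * bytes_per_param / (1024 ** 2)


def fits_in_tier(n_params, tier_mib, bytes_per_param=4, overhead=1.5):
    """Rough check; overhead is an assumed multiplier, not a measured one."""
    needed = estimate_weight_mib(n_params, bytes_per_param) * overhead
    return needed <= tier_mib


# Hypothetical model sizes checked against the 512 MB free tier:
small_ok = fits_in_tier(25_000_000, tier_mib=512)     # about 95 MiB of weights
large_ok = fits_in_tier(1_000_000_000, tier_mib=512)  # about 3.7 GiB of weights
```

Here `small_ok` is True and `large_ok` is False. Halving `bytes_per_param` to 2 (float16) halves the estimate, which is sometimes enough to drop a tier.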

WebSocket and Server-Sent Events power Gradio's live updates and queue position; these pass cleanly through Kong ingress, so the interactive experience works without extra configuration.

Verifying

Open the URL, run a prediction, and open a second tab to see the queue position indicator once the concurrency limit is reached. Watch memory in the metrics view during inference to confirm you're within the tier's limit.
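
The first part of that check can be scripted, which is handy after a cold start. This sketch polls a URL until it answers with HTTP 200, using only the standard library. `wait_until_ready` is a hypothetical helper, and with auth enabled the page you get back may be the login screen rather than the demo itself.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(url, timeout_s=60.0, interval_s=2.0):
    """Poll url until it returns HTTP 200; give up after timeout_s seconds."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not reachable yet, e.g. the model is still loading
        time.sleep(interval_s)
    return False
```

Timing the call from a cold instance gives a concrete number for the first-visitor wait, which you can compare before and after moving the weights to a cached volume.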

Common pitfalls

  • Binding to 127.0.0.1, which makes the app unreachable in a container.
  • Loading the model per request instead of once at startup.
  • No queue: concurrent requests exhaust memory.
  • Public demo with no auth: anyone can run up your compute.
  • Undersized tier for the model: the container OOM-crashes on load.

References

  • Gradio documentation: https://www.gradio.app/docs
  • Gradio queuing and concurrency: https://www.gradio.app/guides/queuing
  • Gradio sharing and deployment: https://www.gradio.app/guides/sharing-your-app
  • Gradio authentication: https://www.gradio.app/guides/sharing-your-app#authentication
  • Hugging Face cache configuration: https://huggingface.co/docs/huggingface_hub/guides/manage-cache

Gradio gets you from model to UI fast; the deployment work is binding correctly, loading once, and using the queue to protect memory. Deploy your demo with automatic SSL and a sized container on PandaStack's free tier: https://dashboard.pandastack.io
