Deploy a Hugging Face Model with FastAPI: Build a Production-Ready ML API in Python


A model in a notebook is useful for experiments. But if another application needs to use it, the model needs an API.

In this tutorial we will take a Hugging Face sentiment-analysis pipeline and serve it with FastAPI. The API will accept text, run the model, and return a clean JSON response. We’ll also add validation, batch prediction, a health endpoint, configurable settings, and a Dockerfile — so the service can run locally or on a cloud container platform.

The deployment pipeline at a glance

Figure: FastAPI serving a Hugging Face model. Client → POST /predict → FastAPI (validates with Pydantic) → Hugging Face pipeline → JSON response

What you’ll build

A sentiment-analysis API that loads a Hugging Face model once, exposes /predict and /health endpoints, validates input with Pydantic, returns clean JSON, supports batch prediction, and can run locally or inside Docker.

Prerequisites

Python 3.10+
Basic familiarity with the command line (Docker is optional, used near the end)

1. Create the project

mkdir hf-fastapi-sentiment
cd hf-fastapi-sentiment

2. Install dependencies

With uv (a fast, modern Python package & environment manager — recommended):

# install uv once (skip if you already have it)
curl -LsSf https://astral.sh/uv/install.sh | sh        # macOS / Linux
# Windows (PowerShell):  powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

uv venv --python 3.12
source .venv/bin/activate                              # Windows: .venv\Scripts\activate

uv pip install fastapi "uvicorn[standard]" "transformers[torch]" torch pydantic

With plain venv + pip (macOS / Linux):

python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install fastapi "uvicorn[standard]" "transformers[torch]" torch pydantic

Windows (venv):

python -m venv .venv
.venv\Scripts\activate
python -m pip install --upgrade pip
python -m pip install fastapi "uvicorn[standard]" "transformers[torch]" torch pydantic

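List them in requirements.txt so the Docker build in step 6 installs the same packages. As written, the file does not pin versions; for a truly reproducible build, replace each entry with an exact version (for example, taken from the output of pip freeze):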

fastapi
uvicorn[standard]
transformers[torch]
torch
pydantic

3. Build the FastAPI app

Create app.py. This is deliberately more than a minimal demo: it validates input, returns a typed response, loads the model once, includes a health endpoint, supports batch prediction, and hides internal errors from the caller.

import logging
from typing import List

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from transformers import pipeline

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

MODEL_ID = "distilbert-base-uncased-finetuned-sst-2-english"

app = FastAPI(
    title="Hugging Face Sentiment API",
    description="A simple FastAPI service for Hugging Face text classification.",
    version="1.0.0",
)

# Loaded ONCE at import/startup — never inside an endpoint.
classifier = pipeline(task="sentiment-analysis", model=MODEL_ID)


class PredictionRequest(BaseModel):
    text: str = Field(..., min_length=1, max_length=2000,
                      description="Text to classify.",
                      examples=["I love this product. It works perfectly."])


class PredictionResponse(BaseModel):
    label: str
    score: float


class BatchPredictionRequest(BaseModel):
    texts: List[str] = Field(..., min_length=1, max_length=32,
                             description="A list of texts to classify.")


@app.get("/health")
def health():
    return {"status": "ok", "model": MODEL_ID}


@app.post("/predict", response_model=PredictionResponse)
def predict(request: PredictionRequest):
    try:
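        # truncation=True clips inputs that exceed the model's token limit instead of raising.
        result = classifier(request.text, truncation=True)[0]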
        return PredictionResponse(label=result["label"], score=round(float(result["score"]), 4))
    except Exception as exc:
        logger.exception("Prediction failed")
        raise HTTPException(status_code=500, detail="Prediction failed.") from exc


@app.post("/predict-batch", response_model=List[PredictionResponse])
def predict_batch(request: BatchPredictionRequest):
    try:
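        # truncation=True clips inputs that exceed the model's token limit instead of raising.
        results = classifier(request.texts, truncation=True)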
        return [PredictionResponse(label=item["label"], score=round(float(item["score"]), 4))
                for item in results]
    except Exception as exc:
        logger.exception("Batch prediction failed")
        raise HTTPException(status_code=500, detail="Batch prediction failed.") from exc

Why load the model at startup?

The model is loaded once when the application starts. This is important: do not create the pipeline inside the /predict endpoint, because that would reload the model on every request and make the API extremely slow. Loading it at import time means each request reuses the model already in memory.
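
To make the difference concrete, here is a minimal sketch that uses a stand-in loader instead of a real pipeline, so it runs without downloading anything. The names load_model, get_model, handle_request, and load_calls are hypothetical and exist only for this illustration. app.py gets the same effect with a module-level variable; the cached function here is just one way to show that the expensive load runs once no matter how many requests arrive.

```python
from functools import lru_cache

load_calls = 0  # counts how many times the (expensive) load runs


def load_model():
    """Stand-in for pipeline(...); a real load takes seconds, not microseconds."""
    global load_calls
    load_calls += 1
    # Hypothetical toy classifier so the sketch runs without any download.
    return lambda text: {"label": "POSITIVE" if "good" in text.lower() else "NEGATIVE"}


@lru_cache(maxsize=1)
def get_model():
    """Load on first use, then return the same object on every later call."""
    return load_model()


def handle_request(text: str) -> dict:
    # Each request reuses the cached model instead of reloading it.
    return get_model()(text)


for message in ["Good service", "Slow and buggy", "Really good"]:
    handle_request(message)

print(load_calls)  # 1: three requests, one load
```

If handle_request called load_model() directly, the counter would reach 3, and with a real model every one of those loads would add seconds to the response.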

Why Pydantic validation?

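PredictionRequest, BatchPredictionRequest, and PredictionResponse are Pydantic models built on Python type hints. The request models reject empty or oversized text on /predict, and empty or oversized lists on /predict-batch, before anything reaches the model; FastAPI answers such requests with a 422 error. The response model guarantees the response shape, which also powers the automatic docs you’ll see shortly.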
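
The validation can be tried outside FastAPI, since the constraints live in the Pydantic model itself. The sketch below repeats the text field from app.py and checks a few payloads; is_valid is a hypothetical helper added only for this demonstration.

```python
from pydantic import BaseModel, Field, ValidationError


class PredictionRequest(BaseModel):
    # Same constraints as the request model in app.py.
    text: str = Field(..., min_length=1, max_length=2000)


def is_valid(payload: dict) -> bool:
    """Return True if the payload passes validation, False otherwise."""
    try:
        PredictionRequest(**payload)
        return True
    except ValidationError:
        return False


print(is_valid({"text": "Works well."}))  # True
print(is_valid({"text": ""}))             # False: shorter than min_length
print(is_valid({"text": "x" * 2001}))     # False: longer than max_length
print(is_valid({}))                       # False: required field is missing
```

Inside the API, FastAPI performs this check for you and turns a ValidationError into a 422 response, so the endpoint body only ever sees input that passed.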

4. Test the API locally

Run it with Uvicorn:

uvicorn app:app --reload

Uvicorn is the ASGI server that runs the application: FastAPI defines the API, Uvicorn serves it over HTTP. (--reload is for development only — remove it in production.)

Check health:

curl http://127.0.0.1:8000/health

{ "status": "ok", "model": "distilbert-base-uncased-finetuned-sst-2-english" }

Single prediction:

curl -X POST "http://127.0.0.1:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"text": "This tutorial is clear and useful."}'

{ "label": "POSITIVE", "score": 0.9998 }

Batch prediction:

curl -X POST "http://127.0.0.1:8000/predict-batch" \
  -H "Content-Type: application/json" \
  -d '{"texts": ["Great experience.", "This was terrible.", "It was okay."]}'

[
  { "label": "POSITIVE", "score": 0.9999 },
  { "label": "NEGATIVE", "score": 0.9997 },
  { "label": "POSITIVE", "score": 0.7150 }
]
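
The same calls can be made from Python, which is closer to how another application would consume the API. This is a small sketch using only the standard library; build_request and post_json are hypothetical helper names, and BASE_URL assumes the server from this section is running on the default port.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # assumes the local Uvicorn server from this section


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request equivalent to the curl commands above."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path: str, payload: dict, timeout: float = 30.0):
    """Send the request and decode the JSON body of the response."""
    with urllib.request.urlopen(build_request(path, payload), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example calls (uncomment while the server is running):
# print(post_json("/predict", {"text": "This tutorial is clear and useful."}))
# print(post_json("/predict-batch", {"texts": ["Great experience.", "This was terrible."]}))
```

A validation failure or server error surfaces here as urllib.error.HTTPError, so a real client would wrap post_json in its own error handling.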

Free interactive docs

Open http://127.0.0.1:8000/docs in your browser. FastAPI automatically generates interactive documentation from your endpoint definitions and Pydantic models — you can try the API right from the page. This is one of the biggest advantages of using FastAPI for ML services.

5. Make settings configurable

Don’t hardcode every setting. Create config.py:

import os

MODEL_ID = os.getenv("MODEL_ID", "distilbert-base-uncased-finetuned-sst-2-english")
MAX_BATCH_SIZE = int(os.getenv("MAX_BATCH_SIZE", "32"))

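Then import these values in app.py. Delete the hardcoded MODEL_ID constant there, and pass MAX_BATCH_SIZE as the max_length of the texts field so the limit actually takes effect: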

from config import MODEL_ID, MAX_BATCH_SIZE

Now you can point the API at a different model — including your own fine-tuned model — without changing code: export MODEL_ID="your-username/your-model".
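
One weakness of reading environment variables directly is that a typo such as MAX_BATCH_SIZE=abc only shows up as a bare int() error at import time. The sketch below is one possible variation that validates the values and fails with a clearer message. The Settings dataclass and load_settings function are hypothetical names, not part of the tutorial's config.py.

```python
import os
from dataclasses import dataclass

DEFAULT_MODEL_ID = "distilbert-base-uncased-finetuned-sst-2-english"


@dataclass(frozen=True)
class Settings:
    model_id: str
    max_batch_size: int


def load_settings(env=None) -> Settings:
    """Read settings from a mapping (defaults to os.environ) and validate them."""
    env = os.environ if env is None else env
    raw_batch = env.get("MAX_BATCH_SIZE", "32")
    try:
        max_batch_size = int(raw_batch)
    except ValueError:
        raise ValueError(f"MAX_BATCH_SIZE must be an integer, got {raw_batch!r}") from None
    if max_batch_size < 1:
        raise ValueError("MAX_BATCH_SIZE must be at least 1")
    return Settings(
        model_id=env.get("MODEL_ID", DEFAULT_MODEL_ID),
        max_batch_size=max_batch_size,
    )


# Passing a plain dict makes the function easy to try without touching the real environment.
print(load_settings({}))
print(load_settings({"MODEL_ID": "your-username/your-model", "MAX_BATCH_SIZE": "8"}))
```

Taking the environment as a parameter also makes the settings easy to unit test, which is harder when module-level constants are computed at import.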

6. Add a Dockerfile

FROM python:3.12-slim

ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY app.py .
COPY config.py .

EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

Add a .dockerignore so caches and secrets never enter the image:

__pycache__/
*.pyc
.venv/
.env
.git/
.pytest_cache/
.mypy_cache/

For a hardened, multi-stage build with a non-root user and a health check, see Deploy AI agents to production with Docker.

7. Run with Docker

docker build -t hf-sentiment-api:1.0 .
docker run --rm -p 8000:8000 hf-sentiment-api:1.0

curl -X POST "http://127.0.0.1:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"text": "Docker makes deployment easier."}'

Once the API is containerized, deployment becomes much easier: the same image can run on Google Cloud Run, AWS ECS/Fargate, Kubernetes, or Hugging Face Spaces.

8. Production checklist

For local development, uvicorn app:app --reload is convenient. For production, do not use --reload. Beyond that:

Load the model once (done) and batch requests where possible for throughput.
Add authentication and rate limiting so the API isn’t abused.
Add structured logs and monitoring; track latency and error rate.
Pre-download the model during the Docker build (or mount an HF_HOME cache) so cold starts are fast and don’t depend on network at startup.
Set resource limits (--memory, --cpus on docker run) and run multiple workers or replicas behind a load balancer. Each worker process loads its own copy of the model, so budget memory accordingly.

Common errors

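If startup is slow, that’s normal: because the pipeline is created at import time, the first launch downloads the model and loads it into memory before the server accepts requests. The first request after startup can also be somewhat slower than later ones.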

If the container is very large, use a slim Python image and avoid copying virtual environments or cache folders into the image (that’s what .dockerignore is for).

If the API is slow under load, batch requests where possible and make sure the model is loaded once at startup, not per request.

If Docker can’t find uvicorn, check that uvicorn[standard] is in requirements.txt.

If the model fails to download in production, pre-download it during the Docker build or use a platform with network access at startup.
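
On the batching advice above: /predict-batch accepts at most 32 texts per request, so a client with a longer list has to split it first. The helper below is a minimal sketch of that split; chunked is a hypothetical name, and MAX_BATCH_SIZE is assumed to match the limit configured on the server.

```python
from typing import Iterator, List

MAX_BATCH_SIZE = 32  # assumed to match the max_length on BatchPredictionRequest.texts


def chunked(texts: List[str], size: int = MAX_BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts`, each with at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(texts), size):
        yield texts[start:start + size]


reviews = [f"review {i}" for i in range(70)]
batches = list(chunked(reviews))
print([len(batch) for batch in batches])  # [32, 32, 6]
```

Each batch can then be sent as one POST to /predict-batch, which replaces 70 single requests with 3.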

FAQ

FastAPI vs Flask? Flask is simple and mature. FastAPI is usually a better fit for ML services because it gives you typed request validation, automatic OpenAPI docs, and strong async support.

Can I use my own fine-tuned model? Yes. Set MODEL_ID to the local path or Hugging Face Hub ID of your model (see fine-tuning Llama 3 with Unsloth or DistilBERT text classification).

Should I load the model inside the endpoint? No. Load it once when the app starts — loading it inside the endpoint makes every request slow.

Where do I deploy this? Any container platform works. Good options include Google Cloud Run, AWS ECS/Fargate, Kubernetes, and Hugging Face Spaces with Docker.

Is this production-ready? It’s a strong starting point. For production add authentication, rate limiting, logging, monitoring, batching, model versioning, and resource limits.

Conclusion

You now have a complete path from a Hugging Face model to a working API. The model is loaded once, FastAPI validates the request, the pipeline runs inference, and the service returns clean JSON.

This is the pattern you can reuse for many ML services: start with a simple local API, test it with curl and /docs, then containerize it when you’re ready to deploy.

Ruslan Magana Vsevolodovna
