Work in progress. The API is functional but not yet stable. Breaking changes may occur before 1.0.
Run your GPU-hungry Metaflow steps on Modal without managing any infrastructure.
Running ML training or GPU inference inside a Metaflow pipeline means either paying for always-on GPU instances (expensive when idle) or wiring up a custom remote execution backend (time-consuming, fragile). The built-in @batch and @kubernetes decorators solve this for AWS and Kubernetes, but not for Modal's serverless GPU cloud — leaving a gap for teams who want on-demand GPU execution without the overhead.
pip install metaflow-modalfrom metaflow import FlowSpec, step, modal
class TrainFlow(FlowSpec):
@step
def start(self):
self.next(self.train)
@modal(gpu="A10G", memory=8192)
@step
def train(self):
import torch
self.device = str(torch.cuda.get_device_name(0))
self.next(self.end)
@step
def end(self):
print(f"Trained on: {self.device}")
if __name__ == "__main__":
TrainFlow()METAFLOW_DEFAULT_DATASTORE=s3 python train_flow.py runpip install metaflow-modalRequires Modal credentials — run modal token new once to authenticate, or set:
export MODAL_TOKEN_ID=...
export MODAL_TOKEN_SECRET=...A remote datastore (S3, GCS, or Azure Blob) is required — @modal sandboxes can't reach the local filesystem.
@modal(cpu=4, memory=16384, image="python:3.11-slim", timeout=1800)
@step
def preprocess(self):
...@modal(gpu="H100", secrets=["my-hf-token"])
@step
def finetune(self):
import os
token = os.environ["HF_TOKEN"] # injected from Modal secret
...@modal(gpu="A100:4") # 4x A100 40GB
@step
def distributed_train(self):
...| Value | Hardware |
|---|---|
"T4" |
NVIDIA T4 |
"A10G" |
NVIDIA A10G |
"A100" |
NVIDIA A100 40GB |
"A100-80GB" |
NVIDIA A100 80GB |
"H100" |
NVIDIA H100 |
"L4" |
NVIDIA L4 |
"TYPE:N" |
N of TYPE (e.g. "A10G:2") |
@modal intercepts the Metaflow step execution and redirects it through Modal's sandbox API. Your flow's code package is uploaded to your remote datastore (S3/GCS/Azure), then downloaded and executed inside an ephemeral Modal container. Artifacts and metadata are written back to the same datastore, so the rest of your flow continues normally on your local machine.
@modal step -> upload code package to S3
-> modal.Sandbox.create("bash", "-c", step_cmd)
-> stream stdout/stderr live
-> sandbox exits -> sync artifacts + metadata from S3
No Modal Functions to deploy, no apps to register. Each step spins up a fresh container and exits.
| Parameter | Default | Description |
|---|---|---|
cpu |
1 |
Number of CPUs |
memory |
1024 |
Memory in MB |
gpu |
None |
GPU type (see table above) |
image |
Modal debian_slim | Docker image URL |
timeout |
600 |
Sandbox timeout in seconds |
env |
{} |
Extra environment variables |
secrets |
[] |
Modal secret names to mount |
| Environment variable | Purpose |
|---|---|
MODAL_TOKEN_ID |
Modal API token ID |
MODAL_TOKEN_SECRET |
Modal API token secret |
MODAL_ENVIRONMENT |
Modal environment name (optional) |
METAFLOW_DEFAULT_DATASTORE |
Must be s3, gs, or azure |
git clone https://github.com/npow/metaflow-modal
cd metaflow-modal
pip install -e ".[dev]"
# Unit tests (no credentials needed)
pytest tests/unit/
# Architecture enforcement tests
pytest tests/structural/ -m structural
# Lint + type check
ruff check src/ tests/
mypy src/