
MLflow + DVC

What it is

Three tools, three roles, one anchor:

- MLflow tracks each training run: hyperparameters, metrics, artifacts.
- DVC versions the large files: the input data and the output model.
- The git commit is the anchor that links a run to the exact file versions behind it.

The same pattern works with W&B as the tracker; nothing here is MLflow-specific except the client commands for reading the run back.

Why use it

Neither tool on its own gives full reproducibility. MLflow records what happened in a run but not which exact bytes of data fed it. DVC versions the files but has no concept of a training run.

Together, with the git commit as the linking key, they do. Every experiment becomes reproducible: from a commit, you can restore the data, restore the model, and pull up the full run record.

When to use it

If you run experiments often, swap datasets between runs, or need to come back months later and re-run exactly what produced a given number, you want both tools.

What each tool tracks

|                     | DVC                     | MLflow                                 |
| ------------------- | ----------------------- | -------------------------------------- |
| Input data          | .dvc pointer (MD5 hash) | dvc_data_md5 tag                       |
| Hyperparameters     |                         | logged per run                         |
| Metrics             |                         | logged per run, comparable across runs |
| Output model        | .dvc pointer (MD5 hash) | dvc_model_md5 tag + Model Registry     |
| Plots / predictions |                         | stored as artifacts                    |

The MD5 hashes logged to MLflow are what make reproduction possible — given any past run, you can look up the exact data and model that produced it.
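
For concreteness, a minimal sketch of the logging side, assuming the script reads each MD5 out of its .dvc pointer file and shells out to git for the commit SHA. The dvc_data_md5 and dvc_model_md5 tag names match the table above; the git-tag name and everything else here are illustrative, not the workshop script verbatim:

import subprocess
import yaml
import mlflow

def dvc_md5(pointer_path: str) -> str:
    """Read the MD5 recorded in a .dvc pointer file."""
    with open(pointer_path) as f:
        return yaml.safe_load(f)["outs"][0]["md5"]

with mlflow.start_run(run_name="experiment_v2"):
    # Anchor the run to the code version: the current commit SHA.
    sha = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    ).stdout.strip()
    mlflow.set_tag("git_commit", sha)  # illustrative tag name

    # Anchor the run to the exact input bytes.
    mlflow.set_tag("dvc_data_md5", dvc_md5("data/sst_sample.csv.dvc"))

    # ... mlflow.log_params(...), training, mlflow.log_metrics(...) ...

    # Once the model has been written and `dvc add`-ed, anchor the output bytes too.
    mlflow.set_tag("dvc_model_md5", dvc_md5("runs/sst_enso/model.joblib.dvc"))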

One-time setup

# Configure a DVC remote
dvc remote add -d myremote /path/to/shared/storage   # or s3://bucket/path

# Auto-stage .dvc files after dvc add
dvc config core.autostage true

git add .dvc/config
git commit -m "Configure DVC remote"
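
dvc remote list prints the configured remotes and their URLs if you want to confirm the setup.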

Workflow per experiment

# 1. Update your data if needed and version it
dvc add data/sst_sample.csv   # core.autostage stages the updated .dvc pointer
git commit -m "Update SST training data v2"
dvc push                  # upload the new bytes to the remote

# 2. Train — MLflow records params, metrics, and the DVC hashes
python scripts/train_sst_mlflow.py --run-name "experiment_v2"

# 3. Version the produced model
dvc add runs/sst_enso/model.joblib
git commit -m "Train model on SST v2 data"
dvc push

The final git commit is the single snapshot that captures the code, the data pointer, and the model pointer for this experiment. The training script records that commit SHA as an MLflow tag (workshop-sst/scripts/train_sst_mlflow.py:167), so the MLflow UI shows exactly which commit produced each run.

Train SST MLflow script link

Reproducing a past experiment

1. Retrieve the DVC hashes from the MLflow run

from mlflow.tracking import MlflowClient

# The workshop uses a local file-based tracking store.
client = MlflowClient(tracking_uri="file:runs/sst_enso/mlruns")

# Look the run up by name within the experiment.
runs = client.search_runs(
    experiment_ids=["955874916381101783"],
    filter_string="tags.mlflow.runName = 'experiment_v2'",
)
run = runs[0]

# The DVC hashes the training script logged as tags.
data_md5 = run.data.tags["dvc_data_md5"]
model_md5 = run.data.tags["dvc_model_md5"]
print(data_md5, model_md5)
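
If you don't have the experiment ID handy, client.get_experiment_by_name(...) looks the experiment up by name and returns it, ID included.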

2. Restore the files from those hashes

DVC’s cache lives at .dvc/cache/files/md5/XX/YYY…, where XX is the first two characters of the MD5 and YYY… is the rest. Point each .dvc pointer back at the historical hash and let dvc checkout pull the corresponding bytes out of cache:

import subprocess
import yaml
from pathlib import Path

def restore_dvc_file(md5: str, dvc_pointer_path: Path) -> None:
    """Update a .dvc pointer to a specific hash and restore the file."""
    with open(dvc_pointer_path) as f:
        dvc_info = yaml.safe_load(f)
    dvc_info["outs"][0]["md5"] = md5
    with open(dvc_pointer_path, "w") as f:
        yaml.dump(dvc_info, f)
    subprocess.run(["dvc", "checkout", str(dvc_pointer_path)], check=True)

restore_dvc_file(data_md5, Path("data/sst_sample.csv.dvc"))
restore_dvc_file(model_md5, Path("runs/sst_enso/model.joblib.dvc"))
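
Before restoring, you can sanity-check that the bytes for each hash are actually present, using the cache layout described above. A small sketch, assuming the default cache location and the DVC 3.x layout:

from pathlib import Path

def in_cache(md5: str, cache_root: Path = Path(".dvc/cache/files/md5")) -> bool:
    """True if the object for this MD5 is already in the local DVC cache."""
    return (cache_root / md5[:2] / md5[2:]).exists()

for h in (data_md5, model_md5):
    if not in_cache(h):
        print(f"{h} missing from local cache; fetch it first (next step)")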

3. Fetch from the remote if the cache is cold

If the bytes for a hash are not in the local cache, pull them from the DVC remote first:

dvc fetch data/sst_sample.csv.dvc
dvc fetch runs/sst_enso/model.joblib.dvc
# or
dvc pull data/sst_sample.csv.dvc
dvc pull runs/sst_enso/model.joblib.dvc
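
The difference: dvc fetch only downloads the bytes into the local cache, while dvc pull is fetch plus checkout, so it also places the restored file back in the workspace.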

Swapping in W&B

Everything above applies with W&B as the tracker: log the DVC MD5s as wandb.config entries (as the workshop-sst W&B script already does), and the reproduction flow is identical. Fetch the hashes from the run via wandb.Api().run(...) instead of MlflowClient, then pass them through the same restore_dvc_file helper.
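
A sketch of that W&B flow; the entity/project/run path is a placeholder, and the config keys assume the script logged the hashes under the same names:

import wandb
from pathlib import Path

api = wandb.Api()
run = api.run("my-entity/workshop-sst/abc123xy")  # hypothetical run path

# The DVC hashes logged as config entries at training time.
data_md5 = run.config["dvc_data_md5"]
model_md5 = run.config["dvc_model_md5"]

# Same restore helper as in the MLflow flow above.
restore_dvc_file(data_md5, Path("data/sst_sample.csv.dvc"))
restore_dvc_file(model_md5, Path("runs/sst_enso/model.joblib.dvc"))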

Link to script

Pros and cons

| Pros                                                | Cons                                                  |
| --------------------------------------------------- | ----------------------------------------------------- |
| Full data and model lineage per run                 | Most steps per workflow of any option in this section |
| Every run is reproducible from a git commit         | DVC remote must be accessible to all collaborators    |
| MD5 hashes link tracker runs to exact file versions | Overkill for stable datasets or one-off experiments   |
| Same pattern works with MLflow or W&B               |                                                       |
