DVC

What it is

DVC (Data Version Control) versions large files — datasets and model artifacts — outside of Git. A small content-hash pointer file (.dvc) is committed to the repo, and the actual bytes live in a separate remote: a local path, a shared drive, or a cloud bucket (S3, GCS, Azure).

Why use it

Git is a poor fit for large files: repository history and clone times grow with every new version, and hosts such as GitHub reject files larger than 100 MB. DVC avoids this by committing only the pointer file to Git and storing the data elsewhere. Because the pointer records an MD5 of the file’s contents, you can verify and restore any previous version exactly. You can also log that MD5 in an experiment tracker to tie a run to the exact bytes it trained on.
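As a rough illustration of tying a run to its data, the sketch below hashes a file with Python's standard library and stores the digest in a plain dictionary. The names file_md5 and run_record are hypothetical helpers, not part of DVC, and the dictionary stands in for whatever experiment tracker you use. It assumes the MD5 is taken over the raw file bytes, which I believe matches recent DVC releases; older releases are reported to have normalized line endings in text files before hashing, so digests may differ there.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until fh.read() returns b""
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def run_record(run_name, data_path, params):
    """Build a plain dict tying a run to the exact bytes of its input file.

    Hypothetical stand-in for an experiment tracker's "log params" call.
    """
    return {
        "run": run_name,
        "data_path": str(data_path),
        "data_md5": file_md5(data_path),
        "params": params,
    }
```

Calling run_record("baseline", "data/sst_sample.csv", {"alpha": 0.1}) would produce a record whose data_md5 can be compared against the md5 field of the committed pointer file.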

DVC shines in operational workflows: you run experiments often, the data changes between runs, and you need to know precisely which version produced which model. For a small, stable dataset that is loaded once, it is usually overkill, and Git alone is fine.

When to use it

How to use it

Install

uv tool install dvc

Initialize inside a Git repository

dvc init
git commit -m "Initialize DVC"

Track a data file

dvc add data/sst_sample.csv
# Creates data/sst_sample.csv.dvc and updates .gitignore
git add data/sst_sample.csv.dvc data/.gitignore
git commit -m "Track SST data v1"

The .dvc pointer committed to Git looks like this:

outs:
- md5: d08ae445bfa70901879bfe45ae78de40
  size: 2160
  path: sst_sample.csv
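To make the pointer's role concrete, here is a minimal sketch that reads the fields of a pointer like the one above and checks a local file against them. parse_pointer and matches_pointer are hypothetical helpers, not DVC functions, and the parser only handles this flat, single-output shape; a real YAML parser would be the safer choice for anything more complex. It also assumes the MD5 covers the raw file bytes.

```python
import hashlib
from pathlib import Path


def parse_pointer(text):
    """Parse the flat `key: value` entries of a single-output .dvc pointer."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("- "):
            line = line[2:]  # drop the YAML list-item marker
        key, sep, value = line.partition(":")
        if sep and value.strip():  # skips "outs:", which has no value
            fields[key.strip()] = value.strip()
    return fields


def matches_pointer(data_path, pointer_text):
    """Check a local file's size and MD5 against the pointer's entries."""
    fields = parse_pointer(pointer_text)
    data = Path(data_path).read_bytes()  # fine for small files
    return (
        len(data) == int(fields["size"])
        and hashlib.md5(data).hexdigest() == fields["md5"]
    )
```

In normal use DVC performs this comparison itself; the sketch only shows that the pointer carries enough information to detect any change to the file's contents.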

Track a model artifact

dvc add runs/sst_enso/model.joblib
git add runs/sst_enso/model.joblib.dvc
git commit -m "Version trained model"
dvc push    # requires a configured remote (see the next step)

Configure a remote and push

# Local path (e.g., a shared drive or scratch directory)
dvc remote add -d localremote /path/to/storage

# Or S3 (requires DVC's S3 support, installed via the dvc[s3] extra)
dvc remote add -d s3remote s3://your-bucket/dvc-store

dvc push

Restore a specific version

git checkout <commit-hash>
dvc pull

git checkout moves the .dvc pointer files back to that commit’s state; dvc pull then fetches the bytes they point to from the remote.
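If you restore versions from a script, the two-step order matters: the pointers must move before the data is fetched. The sketch below wraps the same two commands with Python's subprocess module. restore_commands and restore are hypothetical helper names, and the code assumes git and dvc are on the PATH and that a remote is already configured.

```python
import subprocess


def restore_commands(rev):
    """Return the two commands, in order, that restore the data at `rev`."""
    return [
        ["git", "checkout", rev],  # moves .dvc pointer files to that commit
        ["dvc", "pull"],           # fetches the bytes those pointers name
    ]


def restore(rev, repo_dir="."):
    """Run the restore sequence, stopping at the first failing command."""
    for cmd in restore_commands(rev):
        # check=True raises CalledProcessError on a non-zero exit code
        subprocess.run(cmd, cwd=repo_dir, check=True)
```

Separating the command list from its execution keeps the sequence easy to inspect or log before anything touches the working tree.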

Check status

dvc status                     # files changed since last dvc add
dvc data status                # detailed view of tracked file states
dvc list . --dvc-only          # list DVC-tracked files; works on a GitHub URL too

The dvc list form is handy when a collaborator wants to see what a repo has tracked without cloning:

dvc list https://github.com/chicago-aiscience/workshop-sst --dvc-only

Pros and cons

| Pros | Cons |
| --- | --- |
| Exact data and model lineage via content hashes | Adds dvc add / dvc push / dvc pull to every commit |
| Works with any remote storage backend | Remote must be accessible to all collaborators |
| .dvc pointer files integrate naturally with Git | Not needed for stable, small datasets |
| MD5 hashes can be logged to MLflow or W&B for cross-tool linking | |

Reference