What it is
DVC (Data Version Control) versions large files — datasets and model artifacts — outside of Git. A small content-hash pointer file (.dvc) is committed to the repo, and the actual bytes live in a separate remote: a local path, a shared drive, or a cloud bucket (S3, GCS, Azure).
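To make the link between the pointer's hash and the stored bytes concrete, here is a minimal sketch of how a hash could map to an object path inside a cache or remote. It assumes the DVC 3.x layout (`files/md5/<first two characters>/<rest>`); older DVC versions use a different layout, and `dvc_object_path` is a hypothetical helper, not a DVC command.

```bash
# Hypothetical helper: relative path of an object inside a DVC cache or remote.
# Assumes the DVC 3.x layout: files/md5/<first 2 hex chars>/<remaining chars>.
dvc_object_path() (
    hash=$1
    prefix=$(printf '%s' "$hash" | cut -c1-2)   # first two characters
    rest=$(printf '%s' "$hash" | cut -c3-)      # everything after them
    echo "files/md5/$prefix/$rest"
)

dvc_object_path d08ae445bfa70901879bfe45ae78de40
# prints files/md5/d0/8ae445bfa70901879bfe45ae78de40
```

Because the path depends only on the content hash, two identical files are stored once, and a changed file gets a new object without overwriting the old one.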
Why use it
Git handles large binary files poorly, and GitHub rejects any file larger than 100 MB. DVC works around this by committing only the pointer file to Git and storing the data elsewhere. Because the pointer contains an MD5 of the file’s contents, you can verify and restore any previous version exactly, provided that version was pushed to the remote. You can also log that MD5 in an experiment tracker to tie a run to the exact bytes it trained on.
DVC shines in operational workflows: you run experiments often, the data changes between runs, and you need to know precisely which version produced which model. For a small, stable, one-shot dataset it is usually overkill, and Git alone is fine.
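The MD5 check mentioned above can be done by hand. The sketch below compares a file's MD5 with the hash recorded in its pointer and prints the hash on a match, which is the value you might log in an experiment tracker. `dvc_md5_matches` is a hypothetical helper, not part of DVC. It assumes `md5sum` is available, that the pointer tracks a single file rather than a directory, and that DVC hashed the raw bytes (DVC releases before 3.0 normalized line endings in text files first, so results may differ there).

```bash
# Hypothetical helper (not part of DVC): compare a file's MD5 with the hash
# recorded in its .dvc pointer. Prints the hash and returns 0 on a match.
dvc_md5_matches() (
    pointer=$1   # e.g. data/sst_sample.csv.dvc
    data=$2      # e.g. data/sst_sample.csv
    # Take the first "md5:" entry; assumes a single-file pointer.
    expected=$(awk '$1 == "md5:" || $2 == "md5:" {print $NF; exit}' "$pointer")
    actual=$(md5sum "$data" | awk '{print $1}')
    if [ "$expected" = "$actual" ]; then
        echo "$actual"
    else
        echo "mismatch: pointer=$expected file=$actual" >&2
        return 1
    fi
)
```

A call such as `dvc_md5_matches data/sst_sample.csv.dvc data/sst_sample.csv` then either prints the hash or fails with a non-zero exit status.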
When to use it
- Data or model files exceed GitHub’s 100 MB per-file limit.
- You need to know exactly which data produced which model.
- You are running experiments repeatedly against changing datasets.
- You need to share large files with collaborators without requiring individual cloud storage accounts.
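One quick way to check the first criterion is to scan the working tree for oversized files. The following is a sketch only: `find_large_files` is a hypothetical helper, and it assumes a `find` that accepts the `M` size suffix (GNU and BSD `find` both do).

```bash
# Hypothetical helper: list files under a directory that are larger than a
# threshold in MiB (default 100), skipping Git's and DVC's internal directories.
find_large_files() (
    dir=${1:-.}
    limit_mib=${2:-100}
    find "$dir" \( -path '*/.git' -o -path '*/.dvc' \) -prune -o \
        -type f -size +"${limit_mib}"M -print
)
```

Running `find_large_files .` from the repository root lists candidates for `dvc add`; an empty result suggests plain Git may be enough on size grounds alone.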
How to use it
Install
```bash
uv tool install dvc
```
Initialize inside a Git repository
```bash
dvc init
git commit -m "Initialize DVC"
```
Track a data file
```bash
dvc add data/sst_sample.csv
# Creates data/sst_sample.csv.dvc and updates data/.gitignore
git add data/sst_sample.csv.dvc data/.gitignore
git commit -m "Track SST data v1"
```
The .dvc pointer committed to Git looks like this:
```yaml
outs:
- md5: d08ae445bfa70901879bfe45ae78de40
  size: 2160
  path: sst_sample.csv
```
Track a model artifact
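```bash
dvc add runs/sst_enso/model.joblib
# dvc add also writes runs/sst_enso/.gitignore so Git ignores the artifact itself
git add runs/sst_enso/model.joblib.dvc runs/sst_enso/.gitignore
git commit -m "Version trained model"
dvc push  # requires a configured remote (see the next step)
```
Configure a remote and push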
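```bash
# Local path (e.g., a shared drive or scratch directory)
dvc remote add -d localremote /path/to/storage
# Or S3
dvc remote add -d s3remote s3://your-bucket/dvc-store
dvc push
# The remote settings live in .dvc/config; commit it so collaborators get them
git add .dvc/config
git commit -m "Configure DVC remote"
```
Restore a specific version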
```bash
git checkout <commit-hash>
dvc pull
```
git checkout moves the .dvc pointer files back to that commit’s state; dvc pull then fetches the bytes they point to from the remote.
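The same two steps can be narrowed to a single file when you do not want to move the whole working tree to an old commit. This is a sketch under that assumption; `restore_tracked_file` is a hypothetical helper that wraps `git checkout <commit> -- <path>` and `dvc pull <target>`.

```bash
# Hypothetical helper: restore one DVC-tracked file as of a given commit,
# leaving the rest of the working tree where it is.
restore_tracked_file() (
    commit=$1    # commit hash, tag, or branch
    pointer=$2   # e.g. data/sst_sample.csv.dvc
    # Bring back only this pointer file from the chosen commit.
    git checkout "$commit" -- "$pointer" || return 1
    # Fetch and check out the bytes that pointer refers to.
    dvc pull "$pointer"
)
```

After `restore_tracked_file <commit-hash> data/sst_sample.csv.dvc`, Git reports the pointer file as modified relative to the current commit, which is a useful reminder that the data no longer matches HEAD.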
Check status
```bash
dvc status              # files changed since the last dvc add
dvc data status         # detailed view of tracked file states
dvc list . --dvc-only   # list DVC-tracked files; also works on a GitHub URL
```
The dvc list form is handy when a collaborator wants to see what a repo has tracked without cloning:
```bash
dvc list https://github.com/chicago-aiscience/workshop-sst --dvc-only
```
Pros and cons
| Pros | Cons |
|---|---|
| Exact data and model lineage via content hashes | Adds dvc add / dvc push / dvc pull steps whenever tracked data changes |
| Works with any remote storage backend | Remote must be accessible to all collaborators |
| .dvc pointer files integrate naturally with Git | Not needed for stable, small datasets |
| MD5 hashes can be logged to MLflow or W&B for cross-tool linking | — |
Reference
Next: combine DVC with an experiment tracker in MLflow + DVC.