Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Git & GitHub

What it is

Git tracks code changes as a series of commits. GitHub stores the repository remotely and adds collaboration features: pull requests, issues, Actions, and access control. For small data files and model artifacts (under 100 MB per file), Git can also serve as the versioning layer end-to-end with no extra tooling required.

Why use it

Every experiment is ultimately anchored to a git commit. That commit hash is the linking key the rest of this stack relies on:

Without a clean commit history, the other tools on these pages have nothing reliable to point at.

When to use it

How to use it

Tag a commit with an experiment label

git tag experiment-baseline
git push origin experiment-baseline

Capture the commit SHA in your training script

Log the commit into your run record so the tracker and the code are linked:

import subprocess

git_commit = subprocess.check_output(
    ["git", "rev-parse", "HEAD"], text=True
).strip()
git_branch = subprocess.check_output(
    ["git", "rev-parse", "--abbrev-ref", "HEAD"], text=True
).strip()

This is the pattern used in workshop-sst/scripts/train_sst_mlflow.py and train_sst_wandb.py where both log git_commit and git_branch as run tags so you can jump from the tracker UI back to the exact code.

Commit small models directly

If a trained model file is well under 100 MB, committing it straight to the repo is the simplest possible way to share it with collaborators: No registry, no remote, no extra tooling. Once the file grows past that limit, move it to DVC.

Keep large files and secrets out of the repo

Add these to .gitignore:

# Large binary files
*.joblib
*.pkl
*.pt
*.h5
data/*.csv

# Experiment outputs
runs/
wandb/
mlruns/

# Secrets
.env
*.key

Commit pointer files for anything that cannot live in Git

When a file is too large for Git but you do not want the overhead of DVC, a plain YAML pointer file is a good middle ground. The pointer is tiny, commits cleanly, and carries everything someone needs to fetch the real asset and verify they got the right bytes:

Because the pointer is committed, checking out any past commit gives you an exact, verifiable recipe for reconstructing that experiment’s inputs and outputs without a DVC remote or LFS account.

Example: data pointer

Naming convention: <asset>.pointer.yaml next to where the asset belongs.

# data/nino34_sample.csv.pointer.yaml
path: data/nino34_sample.csv
md5: af5118ccfd637b0b90349921e7146663
size: 2095
source: https://github.com/chicago-aiscience/workshop-sst/releases/download/v0.6.2/nino34_sample.csv
description: Monthly-mean Nino 3.4 index observations, ENSO indicator sample (2000-2023).
git_commit: 72bca4468f63daf10834181fca7599d134523857

Example: model pointer

# runs/sst_enso/model.joblib.pointer.yaml
path: runs/sst_enso/model.joblib
md5: 9f1c3d2e4a5b6c7d8e9f0a1b2c3d4e5f
size: 482113
source: https://github.com/chicago-aiscience/workshop-sst/releases/download/v0.6.2/model.joblib
description: Random forest SST-ENSO predictor trained on nino34_sample v0.6.2.
git_commit: 72bca4468f63daf10834181fca7599d134523857
trained_from:
  data: data/nino34_sample.csv.pointer.yaml
  config: configs/train_baseline.yaml
metrics:
  r2: 0.81
  rmse: 0.42

The trained_from block links the model back to the specific data and config pointers it was trained against, so one commit can anchor the whole chain.

Example: config pointer

Training configs are small enough to commit directly, but a pointer is useful when the config lives in another repo or an experiment-management system:

# configs/train_baseline.yaml.pointer.yaml
path: configs/train_baseline.yaml
md5: 1a2b3c4d5e6f7890abcdef1234567890
size: 612
source: https://github.com/chicago-aiscience/workshop-sst/releases/download/v0.6.2/train_baseline.yaml
description: Baseline training config — 12 lags, 20% test split, seed 42.
git_commit: 72bca4468f63daf10834181fca7599d134523857

Example: generic artifact pointer (figures, tables, reports)

The same pattern works for any output you want to version without checking in the binary:

# reports/figures/skill_vs_lead.png.pointer.yaml
path: reports/figures/skill_vs_lead.png
md5: abcdef0123456789abcdef0123456789
size: 184220
source: https://github.com/chicago-aiscience/workshop-sst/releases/download/v0.6.2/skill_vs_lead.png
description: Skill-vs-lead-time figure for the baseline model.
git_commit: 72bca4468f63daf10834181fca7599d134523857
generated_by: scripts/plot_skill.py

Fetching from a pointer

A short helper reads the pointer, downloads the asset, and verifies the MD5 before letting anything else use it. workshop-sst/scripts/fetch_from_pointer.py is a working reference:

python scripts/fetch_from_pointer.py data/nino34_sample.csv.pointer.yaml
python scripts/fetch_from_pointer.py runs/sst_enso/model.joblib.pointer.yaml --out-dir /tmp/models

The script skips the download when the file is already present and its hash matches, and aborts on mismatch, so it is safe to call every time a training script starts.

Use GitHub releases for model snapshots

When a model is stable enough to share, attach the artifact to a GitHub release:

gh release create v1.0.0 model.joblib \
  --title "Model v1.0.0" \
  --notes "Baseline SST-ENSO model"

Collaborators can then download a specific version without needing DVC configured locally.

Pros and cons of Git & GitHub

ProsCons
No setup100 MB per-file limit
Full code history for freeNo metric comparison UI
Universally understood workflowLarge binary history bloats the repo permanently
Commit SHA is the universal linking key for every other toolNo artifact storage beyond raw file blobs

Reference