Jupyter notebooks as trackers¶
A Jupyter notebook is a perfectly good experiment tracker for early-stage exploration: a handful of model variants, a single dataset, and results you don't need to keep long term. You don't need to start with heavier experiment-tracking tools like MLflow or Weights & Biases.
When a notebook is enough¶
Reach for a notebook (and skip the heavier tools) when:
You’re iterating solo and the audience is mostly future-you.
The dataset is stable across runs (e.g., same files, same content, same ordering).
You’re comparing on the order of 5–10 variants, not hundreds.
You’re willing to re-run from scratch if you ever need to reproduce a number.
If two of those stop being true, graduate to MLflow or Weights & Biases.
What to record inline¶
Treat the notebook like a lab notebook. Alongside the code, capture the things you’d otherwise forget:
Parameters: every value you varied (learning rate, seed, train/val split, feature set). Put them in a single cell at the top so they’re easy to scan.
Metrics: print them, don’t just plot them. A printed number is searchable; a plot is not.
Plots: keep them in the notebook output, not saved off to a figures/ folder you'll lose track of.
What surprised you: a markdown cell after each experiment with one or two sentences on what worked, what didn't, and what you'd try next. This is the part that's hardest to reconstruct later.
The question: state the question(s) you are trying to answer at the top of the notebook, before you start. What were you trying to find out?
A summary of conclusions: a short block near the question, filled in after the runs, so readers don't have to scroll through the entire notebook to understand the results.
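If the full starter template at the bottom of this page is more than you need, the same habits fit in two or three cells. A minimal sketch, with made-up stand-in data (the PARAMS keys and the toy "model" below are illustrative, not part of the workshop pipeline):
import numpy as np
# Top-of-notebook cell: every value you varied, in one place.
PARAMS = {"seed": 42, "n_lags": 3, "test_size": 0.2}
# ... experiment code goes here; this stand-in just echoes noisy inputs ...
rng = np.random.default_rng(PARAMS["seed"])
actual = rng.normal(size=20)
predicted = actual + rng.normal(scale=0.1, size=20)
# Print the metric (searchable), keep any plot inline, then add a markdown
# cell with one or two sentences on what surprised you.
rmse = float(np.sqrt(np.mean((actual - predicted) ** 2)))
print(f"n_lags={PARAMS['n_lags']}: RMSE={rmse:.4f}")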
Practical hygiene¶
A few habits that make notebooks much more reliable as a record:
Commit them to Git. Use nbstripout if cell outputs make git diffs noisy, but keep outputs in the committed copy when the plots and printed metrics are the record.
Restart and run all before you trust a result. Out-of-order cell execution is the single biggest source of "I can't reproduce my own number from yesterday." Make a habit of restarting the kernel and running top-to-bottom before you record a final number.
Name notebooks by date and topic. 2026-04-15-baseline-vs-augmented.ipynb ages better than experiment3-final-v2.ipynb.
Set random seeds explicitly. Seed numpy, torch, and any data-shuffling step (see the sketch after this list). Without this, "stable data across runs" stops being true.
One question per notebook. When a notebook starts answering three questions, split it.
Chronological append. New experiments go at the bottom (or in a new notebook). Don't overwrite the cells of previous experiments, so you can track the history of your work.
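As a concrete version of the seeding habit, a setup cell might look like the sketch below. The torch line is only needed if PyTorch is involved, and the scikit-learn note is just an example of passing the seed through to a shuffling step:
import random
import numpy as np
SEED = 42
random.seed(SEED)      # Python's built-in RNG (sampling, shuffling)
np.random.seed(SEED)   # NumPy (splits, noise, permutations)
# If PyTorch is in the mix, seed it too:
# import torch
# torch.manual_seed(SEED)
# And pass the seed explicitly to any shuffling step, e.g. in scikit-learn:
# train_test_split(X, y, test_size=0.2, random_state=SEED)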
Where notebooks fall short¶
Worth knowing the limits, so you can spot when it’s time to switch:
Cross-run comparison is manual. There’s no built-in “show me all runs sorted by validation accuracy”; you will need to scroll through notebooks.
Kernel state can hide bugs. A cell that works only because of a variable left over from an earlier session, or from a cell you've since deleted, will silently break for anyone else.
Sharing is awkward. A teammate needs the data, the environment, and the notebook, and even then kernel state may differ.
Diffs are noisy. Even with nbstripout, reviewing notebook changes in a PR is harder than reviewing a .py file.
If you find yourself building scripts to parse metrics out of old notebooks, or copying parameters between notebooks by hand, that’s the signal to move to MLflow (solo / local) or Weights & Biases (team / cloud).
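The switch itself is small: the PARAMS dict and printed metrics map almost one-to-one onto MLflow's logging calls. A minimal sketch, assuming mlflow is installed and using its default local ./mlruns store (the experiment and run names here are just examples):
import mlflow
mlflow.set_experiment("sst-lag-features")
params = {"seed": 42, "n_lags": 3, "test_size": 0.2}
metrics = {"r2": 0.0, "rmse": 0.0}  # fill in from your run
with mlflow.start_run(run_name="baseline-n_lags-3"):
    mlflow.log_params(params)
    for name, value in metrics.items():
        mlflow.log_metric(name, value)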
Starter template¶
The template below wires in every habit on this page: top-of-notebook question and conclusions block, a single PARAMS dict, explicit seeds, “what surprised me” cells, and an append-only structure for new experiments. Copy it into a new .ipynb file to get started, or download the notebook.
The example uses the workshop-sst pipeline (sst.io, sst.transform, sst.ml), so run it from a clone of that repo with the sst package installed and data/sst_sample.csv and data/nino34_sample.csv present.
# %% [markdown]
# # Experiment: <short descriptive title>
#
# **Date:** 2026-04-15
# **Author:** <your name>
# **Notebook file:** `2026-04-15-<topic>.ipynb`
# **Reference codebase:** [chicago-aiscience/workshop-sst](https://github.com/chicago-aiscience/workshop-sst)
#
# ---
#
# ## Question(s)
#
# > State the question this notebook is trying to answer, *before* you start.
# >
# > *Does increasing the number of lag features (3 → 6) improve the Random Forest's
# > ability to predict the Niño 3.4 index from SST?*
#
# ## Summary of conclusions
#
# > Fill this in **after** running the experiments.
# >
# > *Example: On the sample dataset, n_lags=3 and n_lags=6 produced nearly identical
# > test R² (~0.975) — the extra lags neither helped nor hurt meaningfully.*
#
# ## Parameters
#
# | Parameter | Value |
# |---|---|
# | Random seed | 42 |
# | Train/test split | 0.8 / 0.2 (chronological) |
# | Model | `RandomForestRegressor` (sklearn, via `sst.ml`) |
# | Feature column | `sst_c_roll_12` |
# | Target column | `nino34_roll_12` |
# | `n_lags` | 3 (baseline), 6 (variant) |
#
# ## Data
#
# - **Source:** `data/sst_sample.csv`, `data/nino34_sample.csv`
# - **Version / commit:** record the DVC pointer or git commit hash here
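# %%
# Optional helper for the "Version / commit" row above: capture the commit the
# notebook ran against. Assumes the notebook is executed from inside the git
# repo; falls back to "unknown" otherwise.
import subprocess
GIT_COMMIT = subprocess.run(
    ["git", "rev-parse", "--short", "HEAD"],
    capture_output=True, text=True, check=False,
).stdout.strip() or "unknown"
print(f"Code version: {GIT_COMMIT}")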
# %% [markdown]
# ## Setup
#
# Pin every source of randomness and collect parameters in one place.
# If you change a parameter, change it *here* — don't sprinkle literals
# through the notebook.
# %%
import random
from pathlib import Path
import numpy as np
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
PARAMS = {
"seed": SEED,
"test_size": 0.2,
"feature_col": "sst_c_roll_12",
"target_col": "nino34_roll_12",
"n_lags_baseline": 3,
"n_lags_variant": 6,
"sst_path": Path("data/sst_sample.csv"),
"enso_path": Path("data/nino34_sample.csv"),
}
PARAMS
# %% [markdown]
# ## Data
# %%
from sst.io import load_sst, load_enso
from sst.transform import tidy, join_on_month
sst_df = tidy(load_sst(PARAMS["sst_path"]), date_col="date", value_col="sst_c")
enso_df = tidy(load_enso(PARAMS["enso_path"]), date_col="date", value_col="nino34")
joined = join_on_month(sst_df, enso_df)
print(f"Joined shape: {joined.shape}")
print(f"Date range: {joined['date'].min().date()} → {joined['date'].max().date()}")
joined.head()
# %% [markdown]
# ## Experiment 1 — Baseline (n_lags = 3)
#
# **Hypothesis:** Three months of lag features should capture most of the
# short-term autocorrelation in the Niño 3.4 index. This is the workshop default.
# %%
from sst.ml import predict_enso_from_sst
baseline = predict_enso_from_sst(
joined,
target_col=PARAMS["target_col"],
feature_col=PARAMS["feature_col"],
test_size=PARAMS["test_size"],
n_lags=PARAMS["n_lags_baseline"],
random_state=PARAMS["seed"],
)
print(f"Baseline (n_lags={PARAMS['n_lags_baseline']})")
print(f" R²: {baseline['r2_score']:.4f}")
print(f" RMSE: {baseline['rmse']:.4f}")
print("\nTop features:")
baseline["feature_importance"].head()
# %% [markdown]
# ### What surprised me
#
# > One or two sentences: what worked, what didn't, what you'd try next.
# %% [markdown]
# ## Experiment 2 — More lags (n_lags = 6)
#
# **Hypothesis:** Six months of lags should capture seasonal structure that
# 3 months misses. On a small sample dataset this might overfit instead.
# %%
variant = predict_enso_from_sst(
joined,
target_col=PARAMS["target_col"],
feature_col=PARAMS["feature_col"],
test_size=PARAMS["test_size"],
n_lags=PARAMS["n_lags_variant"],
random_state=PARAMS["seed"],
)
print(f"Variant (n_lags={PARAMS['n_lags_variant']})")
print(f" R²: {variant['r2_score']:.4f}")
print(f" RMSE: {variant['rmse']:.4f}")
print("\nTop features:")
variant["feature_importance"].head()
# %% [markdown]
# ### What surprised me
#
# > One or two sentences: what worked, what didn't, what you'd try next.
# %% [markdown]
# ## Comparison
#
# Print the metrics (searchable) **and** plot predictions vs. actual (scannable).
# Keep both inline — don't write the figure out to `figures/`.
# %%
import matplotlib.pyplot as plt
results = {
f"Baseline (n_lags={PARAMS['n_lags_baseline']})": baseline,
f"Variant (n_lags={PARAMS['n_lags_variant']})": variant,
}
for name, r in results.items():
print(f"{name}: R²={r['r2_score']:.4f} RMSE={r['rmse']:.4f}")
fig, axes = plt.subplots(1, 2, figsize=(11, 4), sharey=True)
for ax, (name, r) in zip(axes, results.items()):
preds = r["predictions"]
ax.plot(preds["date"], preds["actual"], label="Actual", linewidth=2)
ax.plot(preds["date"], preds["predicted"], label="Predicted", linestyle="--")
ax.set_title(f"{name}\nR²={r['r2_score']:.3f}")
ax.set_xlabel("Date")
ax.legend()
axes[0].set_ylabel("Niño 3.4 (12-mo rolling)")
fig.autofmt_xdate()
plt.tight_layout()
plt.show()
# %% [markdown]
# ## Conclusions
#
# > Mirror the **Summary of conclusions** at the top, but with more detail.
# > What did you learn? What would you do next? What would you do differently?
# %% [markdown]
# ---
# ## Appendix: new experiments go below this line
#
# Append new experiments as `## Experiment 3 — ...` with the same
# hypothesis → code → "what surprised me" pattern. Don't overwrite cells above.