Methodology
Experiment Tracking and Tooling
MLflow, Weights and Biases, TensorBoard, Hydra, and the discipline of recording everything. What to log, how to compare experiments, and why you cannot reproduce what you did not record.
Why This Matters
You cannot reproduce what you did not record. Every ML practitioner has faced this situation: a model from three months ago performed well, but nobody remembers the exact hyperparameters, data version, or code commit that produced it. Experiment tracking tools solve this by automatically recording every detail of every run.
The difference between a research group that can reproduce its own results and one that cannot is almost always tooling discipline, not talent. This page covers the three industry-standard trackers (MLflow, W&B, TensorBoard), config management with Hydra / OmegaConf, and how reproducibility discipline maps onto what you actually log per run. The companion page on ablation study design covers what to do with the resulting record once you have one.
Mental Model
An experiment tracker is a structured lab notebook. For each run, it records: the inputs (code, data, config), the process (training curves, resource usage), and the outputs (metrics, artifacts, model checkpoints). Given any past result, you should be able to look up the exact conditions that produced it and re-run them.
What to Log
Experiment Record
A complete experiment record contains the following for each training run:
- Hyperparameters: learning rate, batch size, optimizer settings, architecture choices, regularization strength, number of epochs
- Metrics over time: training loss, validation loss, evaluation metrics at each logging step
- Data version: hash or identifier for the exact dataset used, including preprocessing
- Code version: git commit hash of the training code
- Environment: Python version, library versions, GPU type, CUDA version
- Artifacts: saved model checkpoints, generated outputs, evaluation predictions
- Random seeds: all seeds used for reproducibility
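As a concrete sketch, the record above can be assembled as a plain dictionary and serialized to JSON. The field names here are illustrative, and real trackers capture most of this automatically; the point is that every field in the list maps to a concrete value you can store.

```python
import hashlib
import json
import platform
import random
import sys

def build_experiment_record(config, dataset_bytes, git_commit, seed):
    """Assemble a reproducibility record for one training run.

    `git_commit` is passed in so the sketch stays self-contained; in
    practice you would read it from `git rev-parse HEAD`.
    """
    random.seed(seed)  # in real code, also seed numpy, torch, etc.
    return {
        "hyperparameters": config,
        "data_version": hashlib.sha256(dataset_bytes).hexdigest(),
        "code_version": git_commit,
        "environment": {
            "python": sys.version.split()[0],
            "platform": platform.platform(),
        },
        "seed": seed,
        "metrics": [],  # appended to at each logging step
    }

record = build_experiment_record(
    config={"lr": 3e-4, "batch_size": 64, "epochs": 10},
    dataset_bytes=b"toy dataset contents",
    git_commit="deadbeef",  # illustrative placeholder commit hash
    seed=42,
)
record["metrics"].append({"step": 1, "train_loss": 0.92})
print(json.dumps(record, indent=2)[:80])
```

Hashing the dataset bytes gives a cheap data-version identifier; for large datasets, hash a manifest of file paths and sizes instead.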
Major Tracking Platforms
MLflow
MLflow provides four components: Tracking (logging parameters, metrics, artifacts), Projects (packaging code for reproducibility), Models (a standard packaging format for deployment), and Model Registry (centralized model versioning and stage management). The tracking server stores runs in a backend database with an artifact store (local filesystem or S3).
Key design choice: MLflow is open-source and self-hosted. You own your data. The tradeoff is that you manage the infrastructure.
Weights and Biases (W&B)
W&B is a hosted platform that logs metrics, hyperparameters, system metrics (GPU utilization, memory), and artifacts. It provides interactive dashboards for comparing runs, a sweep agent for hyperparameter search, and a report system for sharing results.
Key design choice: W&B is hosted (with a self-hosted option). The hosted version requires sending data to external servers, which matters for proprietary work.
TensorBoard
TensorBoard is a visualization tool that reads event files written during training. It supports scalar metrics, histograms, images, text, and computation graphs. It is tightly integrated with TensorFlow and has PyTorch support via torch.utils.tensorboard.
Key design choice: TensorBoard is a local visualization tool, not a full tracking platform. It lacks built-in experiment comparison, hyperparameter logging, and artifact management. It is useful for monitoring a single run but insufficient for managing a research program.
Side-by-side comparison
| Capability | MLflow | Weights & Biases | TensorBoard |
|---|---|---|---|
| Hosted / self-hosted | self-hosted (open source) | hosted (paid) or self-hosted | local viewer |
| Hyperparameter logging | yes (Tracking API) | yes | manual via add_hparams |
| Metric dashboards | basic (web UI on the tracking server) | rich (custom panels, run groups) | scalar / histogram / image only |
| Artifact registry | yes (Models component) | yes | no |
| Hyperparameter sweeps | external (optuna / ax) | built-in (Sweeps agent) | none |
| System metrics (GPU, memory) | manual | automatic | manual |
| Team collaboration | basic permissions | reports, shared workspaces | none |
| Best fit | proprietary / regulated data, infra team available | external / academic / quick onboarding | sanity-check a single run |
The right choice depends on data residency, available DevOps capacity, and how much the team relies on shared dashboards. For reproducibility discipline, any of the three is better than ad hoc logging; the differences are in collaboration and convenience, not whether you can reconstruct a run.
Configuration Management
Configuration management is the practice of specifying all experiment parameters in structured config files rather than command-line arguments or hardcoded values. Tools like Hydra and OmegaConf provide:
- Hierarchical configs: nested YAML files for model, data, training, and evaluation settings
- Config composition: combine partial configs (e.g., model=resnet + optimizer=adam)
- Override from command line: change any parameter without editing files
- Automatic logging: the resolved config is saved with each run
Hydra (developed at Meta) is the de facto standard. It creates a timestamped output directory for each run, saves the full resolved config, and integrates with logging frameworks. The key discipline: never specify a hyperparameter in code that is not also in the config file.
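A minimal sketch of what Hydra composition looks like; the file layout and field names are illustrative:

```yaml
# conf/config.yaml — top-level config that composes partial configs
defaults:
  - model: resnet
  - optimizer: adam

training:
  epochs: 10
  batch_size: 64

# conf/optimizer/adam.yaml would then hold the optimizer group:
#   lr: 3e-4
#   weight_decay: 0.0
```

Running `python train.py optimizer=sgd optimizer.lr=1e-4` swaps the optimizer group and overrides a single field without editing any file; Hydra saves the fully resolved config inside the run's output directory, which is what gets archived with the experiment record.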
Experiment Comparison
Tracking Completeness as a Working Desideratum
Treat tracking completeness as a practitioner guideline, not a theorem. A tracking system is "complete enough" if, for any two runs A and B whose final metrics differ by more than the run-to-run noise you have measured, you can point to at least one recorded input difference (hyperparameters, data version, code commit, environment, random seed) that plausibly explains the gap. Symbolically: if |m_A − m_B| exceeds your noise floor, you expect some logged input x with x_A ≠ x_B.
If two runs produce different results and you cannot find any recorded difference, your tracking is incomplete. Common culprits are unrecorded library version changes, GPU floating-point nondeterminism, and silent data preprocessing changes.
Perfect completeness is unachievable. GPU FP operations are nondeterministic, library internals change between minor versions, and some randomness is irreducible. The realistic standard is: log enough to reproduce results within the run-to-run variance you have measured and documented. Logging learning rate and batch size is not enough if the preprocessing pipeline silently changed between runs.
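This check can be made mechanical. A sketch, assuming each run is stored as a dictionary of its recorded inputs and final metrics (field names illustrative):

```python
def explain_gap(run_a, run_b, noise_floor, metric="val_loss"):
    """Return the recorded input fields that differ between two runs,
    if their final metric gap exceeds the measured noise floor."""
    gap = abs(run_a["final"][metric] - run_b["final"][metric])
    if gap <= noise_floor:
        return []  # within run-to-run variance; nothing to explain
    tracked = ("hyperparameters", "data_version", "code_version",
               "environment", "seed")
    diffs = [k for k in tracked if run_a.get(k) != run_b.get(k)]
    if not diffs:
        # a metric gap with no recorded difference means tracking is incomplete
        raise RuntimeError("unexplained gap: tracking is incomplete")
    return diffs

run_a = {"final": {"val_loss": 0.31}, "hyperparameters": {"lr": 3e-4},
         "data_version": "abc", "code_version": "c1",
         "environment": "py3.11", "seed": 0}
run_b = {"final": {"val_loss": 0.42}, "hyperparameters": {"lr": 1e-3},
         "data_version": "abc", "code_version": "c1",
         "environment": "py3.11", "seed": 0}
print(explain_gap(run_a, run_b, noise_floor=0.02))  # → ['hyperparameters']
```

The RuntimeError branch is exactly the "incomplete tracking" situation described above: a gap beyond the noise floor with no logged input difference.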
Hyperparameter Sweep Management
Tracking tools typically integrate sweep (hyperparameter search) functionality:
- Grid search: enumerate all combinations. Logged as a group of runs with a shared sweep ID
- Random search: sample configurations from distributions. More efficient than grid for high-dimensional spaces (Bergstra and Bengio, 2012)
- Bayesian optimization: use past results to guide future configurations. W&B Sweeps and Optuna support this
- Early stopping: kill underperforming runs early. Requires real-time metric access, which tracking tools provide
The sweep metadata (search space, sampling strategy, stopping criteria) should be logged alongside the individual runs.
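A stdlib sketch of random search with the sweep metadata logged alongside the runs; the search-space schema is illustrative:

```python
import random
import uuid

def random_search(search_space, n_trials, seed=0):
    """Sample configs and tag each run with a shared sweep ID, recording
    the sweep metadata (space, strategy, trial count) as its own object."""
    rng = random.Random(seed)
    sweep = {
        "sweep_id": str(uuid.uuid4()),
        "strategy": "random",
        "search_space": search_space,
        "n_trials": n_trials,
    }
    runs = []
    for trial in range(n_trials):
        config = {
            # sample learning rate log-uniformly, batch size uniformly
            "lr": 10 ** rng.uniform(*search_space["log10_lr"]),
            "batch_size": rng.choice(search_space["batch_size"]),
        }
        runs.append({"sweep_id": sweep["sweep_id"],
                     "trial": trial, "config": config})
    return sweep, runs

sweep, runs = random_search(
    {"log10_lr": (-5, -2), "batch_size": [32, 64, 128]}, n_trials=4)
assert all(r["sweep_id"] == sweep["sweep_id"] for r in runs)
```

Storing the sweep object separately from the runs means you can later answer "what space was actually searched?" without reverse-engineering it from the sampled configs.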
Common Confusions
Logging is not the same as tracking
Writing print statements to stdout is logging. Tracking means structured storage that supports querying, comparison, and retrieval. If you cannot programmatically find the run with the best validation loss from last month, you have logging but not tracking.
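The distinction is concrete: structured records support queries that stdout logs cannot. A sketch over in-memory records (schema illustrative):

```python
runs = [
    {"run_id": "r1", "date": "2024-01-10", "val_loss": 0.41},
    {"run_id": "r2", "date": "2024-01-12", "val_loss": 0.35},
    {"run_id": "r3", "date": "2024-01-15", "val_loss": 0.38},
]

# "Find the run with the best validation loss" becomes a one-line query,
# not an archaeology session through old stdout logs.
best = min(runs, key=lambda r: r["val_loss"])
print(best["run_id"])  # → r2
```

Real trackers expose the same idea through an API (e.g., searching runs by metric), but the underlying requirement is just queryable structure.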
Version control is not experiment tracking
Git tracks code versions. Experiment tracking records the mapping from (code version, data version, config) to (metrics, artifacts). You need both. A git commit tells you what code was available; an experiment record tells you which configuration of that code produced which result.
Dashboards are not a substitute for raw logs
Interactive dashboards are useful for exploration but unreliable for archival. Always ensure the underlying data (metrics, configs, artifacts) is stored in a durable format that survives platform migrations. Export to JSON or CSV periodically.
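Periodic export can be a few lines of stdlib code. A sketch writing run metrics to CSV (schema illustrative; an in-memory buffer stands in for a file on durable storage):

```python
import csv
import io

runs = [
    {"run_id": "r1", "lr": 3e-4, "val_loss": 0.41},
    {"run_id": "r2", "lr": 1e-3, "val_loss": 0.35},
]

buf = io.StringIO()  # in practice, open a dated file on durable storage
writer = csv.DictWriter(buf, fieldnames=["run_id", "lr", "val_loss"])
writer.writeheader()
writer.writerows(runs)
print(buf.getvalue().splitlines()[0])  # → run_id,lr,val_loss
```

A cron job doing this against the tracker's API is cheap insurance against platform migrations.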
Summary
- Log everything: hyperparameters, metrics over time, data version, code commit, environment, seeds
- Use structured config management (Hydra/OmegaConf), not command-line arguments
- MLflow is open-source and self-hosted; W&B is hosted with richer visualization
- TensorBoard is a visualization tool, not a complete tracking platform
- Sweep metadata (search space, strategy) should be tracked alongside runs
- You cannot reproduce what you did not record
Exercises
Problem
You trained a model three months ago that achieved 94% accuracy. You want to reproduce it. You have the code in git and the final accuracy logged. What additional information do you need, and which of it would a proper tracking system have recorded?
Problem
You are choosing between MLflow (self-hosted) and W&B (hosted) for a team of 10 ML engineers working on proprietary medical data. List three specific technical factors that should influence this decision, beyond general preference.
References
Canonical:
- Zaharia et al., "Accelerating the Machine Learning Lifecycle with MLflow" (IEEE Data Engineering Bulletin, 2018), §3-4 on the Tracking and Models components.
- Bergstra and Bengio, "Random Search for Hyper-Parameter Optimization" (JMLR 2012), §2 for the variance argument vs grid search.
- Sculley et al., "Hidden Technical Debt in Machine Learning Systems" (NeurIPS 2015), §3-4 on configuration debt and entanglement; the canonical statement of why config and tracking discipline matter.
Current:
- Biewald, "Experiment Tracking with Weights and Biases" (2020), the original W&B technical write-up.
- Yadan, "Hydra: A Framework for Elegantly Configuring Complex Applications" (Meta, 2019), the canonical Hydra reference.
- Akiba et al., "Optuna: A Next-generation Hyperparameter Optimization Framework" (KDD 2019), for the Bayesian-optimization sweep alternative referenced above.
- Pineau et al., "Improving Reproducibility in Machine Learning Research" (JMLR 2021), for the empirical reproducibility checklist used by NeurIPS / ICML.
Next Topics
- Ablation study design: systematic experiments to understand component contributions
Last reviewed: April 26, 2026