Methodology
Experiment Tracking and Tooling
MLflow, Weights and Biases, TensorBoard, Hydra, and the discipline of recording everything. What to log, how to compare experiments, and why you cannot reproduce what you did not record.
Why This Matters
You cannot reproduce what you did not record. Every ML practitioner has faced this situation: a model from three months ago performed well, but nobody remembers the exact hyperparameters, data version, or code commit that produced it. Experiment tracking tools solve this by automatically recording every detail of every run.
The difference between a research group that can reproduce its own results and one that cannot is almost always tooling discipline, not talent. This page covers the three industry-standard trackers (MLflow, W&B, TensorBoard), config management with Hydra / OmegaConf, and how reproducibility discipline maps onto what you actually log per run. The companion page on ablation study design covers what to do with the resulting record once you have one.
Mental Model
An experiment tracker is a structured lab notebook. For each run, it records: the inputs (code, data, config), the process (training curves, resource usage), and the outputs (metrics, artifacts, model checkpoints). Given any past result, you should be able to look up the exact conditions that produced it and re-run them.
What to Log
Experiment Record
A complete experiment record contains the following for each training run:
- Hyperparameters: learning rate, batch size, optimizer settings, architecture choices, regularization strength, number of epochs
- Metrics over time: training loss, validation loss, evaluation metrics at each logging step
- Data version: hash or identifier for the exact dataset used, including preprocessing
- Code version: git commit hash of the training code
- Environment: Python version, library versions, GPU type, CUDA version
- Artifacts: saved model checkpoints, generated outputs, evaluation predictions
- Random seeds: all seeds used for reproducibility
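As a concrete sketch, the record above can be assembled as a plain dictionary and serialized to JSON. The field names here are illustrative, and real trackers capture most of this automatically; the point is that every field in the list maps to a concrete value you can store.

```python
import hashlib
import json
import platform
import random
import sys

def build_experiment_record(config, dataset_bytes, git_commit, seed):
    """Assemble a reproducibility record for one training run.

    `git_commit` is passed in so the sketch stays self-contained; in
    practice you would read it from `git rev-parse HEAD`.
    """
    random.seed(seed)  # in real code, also seed numpy, torch, etc.
    return {
        "hyperparameters": config,
        "data_version": hashlib.sha256(dataset_bytes).hexdigest(),
        "code_version": git_commit,
        "environment": {
            "python": sys.version.split()[0],
            "platform": platform.platform(),
        },
        "seed": seed,
        "metrics": [],  # appended to at each logging step
    }

record = build_experiment_record(
    config={"lr": 3e-4, "batch_size": 64, "epochs": 10},
    dataset_bytes=b"toy dataset contents",
    git_commit="deadbeef",  # illustrative placeholder commit hash
    seed=42,
)
record["metrics"].append({"step": 1, "train_loss": 0.92})
print(json.dumps(record, indent=2)[:80])
```

Hashing the dataset bytes gives a cheap data-version identifier; for large datasets, hash a manifest of file paths and sizes instead.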
Major Tracking Platforms
MLflow
MLflow provides four components: Tracking (logging parameters, metrics, artifacts), Projects (packaging code for reproducibility), Models (a standard packaging format for deployment), and Model Registry (centralized model versioning and stage management). The tracking server stores runs in a backend database with an artifact store (local filesystem or S3).
Key design choice: MLflow is open-source and self-hosted. You own your data. The tradeoff is that you manage the infrastructure.
Weights and Biases (W&B)
W&B is a hosted platform that logs metrics, hyperparameters, system metrics (GPU utilization, memory), and artifacts. It provides interactive dashboards for comparing runs, a sweep agent for hyperparameter search, and a report system for sharing results.
Key design choice: W&B is hosted (with a self-hosted option). The hosted version requires sending data to external servers, which matters for proprietary work.
TensorBoard
TensorBoard is a visualization tool that reads event files written during training. It supports scalar metrics, histograms, images, text, and computation graphs. It is tightly integrated with TensorFlow and has PyTorch support via torch.utils.tensorboard.
Key design choice: TensorBoard is a local visualization tool, not a full tracking platform. It lacks built-in experiment comparison, hyperparameter logging, and artifact management. It is useful for monitoring a single run but insufficient for managing a research program.
Side-by-side comparison
| Capability | MLflow | Weights & Biases | TensorBoard |
|---|---|---|---|
| Hosted / self-hosted | self-hosted (open source) | hosted (paid) or self-hosted | local viewer |
| Hyperparameter logging | yes (Tracking API) | yes | manual via add_hparams |
| Metric dashboards | basic (web UI on the tracking server) | rich (custom panels, run groups) | scalar / histogram / image only |
| Artifact registry | yes (Models component) | yes | no |
| Hyperparameter sweeps | external (optuna / ax) | built-in (Sweeps agent) | none |
| System metrics (GPU, memory) | manual | automatic | manual |
| Team collaboration | basic permissions | reports, shared workspaces | none |
| Best fit | proprietary / regulated data, infra team available | external / academic / quick onboarding | sanity-check a single run |
The right choice depends on data residency, available DevOps capacity, and how much the team relies on shared dashboards. For reproducibility discipline, any of the three is better than ad hoc logging; the differences are in collaboration and convenience, not whether you can reconstruct a run.
Configuration Management
Configuration management is the practice of specifying all experiment parameters in structured config files rather than command-line arguments or hardcoded values. Tools like Hydra and OmegaConf provide:
- Hierarchical configs: nested YAML files for model, data, training, and evaluation settings
- Config composition: combine partial configs (e.g., model=resnet + optimizer=adam)
- Override from command line: change any parameter without editing files
- Automatic logging: the resolved config is saved with each run
Hydra (developed at Meta) is the de facto standard. It creates a timestamped output directory for each run, saves the full resolved config, and integrates with logging frameworks. The key discipline: never specify a hyperparameter in code that is not also in the config file.
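A minimal sketch of what Hydra composition looks like; the file layout and field names are illustrative:

```yaml
# conf/config.yaml — top-level config that composes partial configs
defaults:
  - model: resnet
  - optimizer: adam

training:
  epochs: 10
  batch_size: 64

# conf/optimizer/adam.yaml would then hold the optimizer group:
#   lr: 3e-4
#   weight_decay: 0.0
```

Running `python train.py optimizer=sgd optimizer.lr=1e-4` swaps the optimizer group and overrides a single field without editing any file; Hydra saves the fully resolved config inside the run's output directory, which is what gets archived with the experiment record.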
Experiment Comparison
Tracking Completeness as a Working Desideratum
Treat tracking completeness as a practitioner guideline, not a theorem. A tracking system is "complete enough" if, for any two runs A and B whose final metrics differ by more than the run-to-run noise you have measured, you can point to at least one recorded input difference (hyperparameters, data version, code commit, environment, random seed) that plausibly explains the gap. Symbolically: if |m_A − m_B| exceeds your noise floor, you expect some logged input x with x_A ≠ x_B.
If two runs produce different results and you cannot find any recorded difference, your tracking is incomplete. Common culprits are unrecorded library version changes, GPU floating-point nondeterminism, and silent data preprocessing changes.
Perfect completeness is unachievable. GPU FP operations are nondeterministic, library internals change between minor versions, and some randomness is irreducible. The realistic standard is: log enough to reproduce results within the run-to-run variance you have measured and documented. Logging learning rate and batch size is not enough if the preprocessing pipeline silently changed between runs.
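This check can be made mechanical. A sketch, assuming each run is stored as a dictionary of its recorded inputs and final metrics (field names illustrative):

```python
def explain_gap(run_a, run_b, noise_floor, metric="val_loss"):
    """Return the recorded input fields that differ between two runs,
    if their final metric gap exceeds the measured noise floor."""
    gap = abs(run_a["final"][metric] - run_b["final"][metric])
    if gap <= noise_floor:
        return []  # within run-to-run variance; nothing to explain
    tracked = ("hyperparameters", "data_version", "code_version",
               "environment", "seed")
    diffs = [k for k in tracked if run_a.get(k) != run_b.get(k)]
    if not diffs:
        # a metric gap with no recorded difference means tracking is incomplete
        raise RuntimeError("unexplained gap: tracking is incomplete")
    return diffs

run_a = {"final": {"val_loss": 0.31}, "hyperparameters": {"lr": 3e-4},
         "data_version": "abc", "code_version": "c1",
         "environment": "py3.11", "seed": 0}
run_b = {"final": {"val_loss": 0.42}, "hyperparameters": {"lr": 1e-3},
         "data_version": "abc", "code_version": "c1",
         "environment": "py3.11", "seed": 0}
print(explain_gap(run_a, run_b, noise_floor=0.02))  # → ['hyperparameters']
```

The RuntimeError branch is exactly the "incomplete tracking" situation described above: a gap beyond the noise floor with no logged input difference.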
Hyperparameter Sweep Management
Tracking tools typically integrate sweep (hyperparameter search) functionality:
- Grid search: enumerate all combinations. Logged as a group of runs with a shared sweep ID
- Random search: sample configurations from distributions. More efficient than grid for high-dimensional spaces (Bergstra and Bengio, 2012)
- Bayesian optimization: use past results to guide future configurations. W&B Sweeps and Optuna support this
- Early stopping: kill underperforming runs early. Requires real-time metric access, which tracking tools provide
The sweep metadata (search space, sampling strategy, stopping criteria) should be logged alongside the individual runs.
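A stdlib sketch of random search with the sweep metadata logged alongside the runs; the search-space schema is illustrative:

```python
import random
import uuid

def random_search(search_space, n_trials, seed=0):
    """Sample configs and tag each run with a shared sweep ID, recording
    the sweep metadata (space, strategy, trial count) as its own object."""
    rng = random.Random(seed)
    sweep = {
        "sweep_id": str(uuid.uuid4()),
        "strategy": "random",
        "search_space": search_space,
        "n_trials": n_trials,
    }
    runs = []
    for trial in range(n_trials):
        config = {
            # sample learning rate log-uniformly, batch size uniformly
            "lr": 10 ** rng.uniform(*search_space["log10_lr"]),
            "batch_size": rng.choice(search_space["batch_size"]),
        }
        runs.append({"sweep_id": sweep["sweep_id"],
                     "trial": trial, "config": config})
    return sweep, runs

sweep, runs = random_search(
    {"log10_lr": (-5, -2), "batch_size": [32, 64, 128]}, n_trials=4)
assert all(r["sweep_id"] == sweep["sweep_id"] for r in runs)
```

Storing the sweep object separately from the runs means you can later answer "what space was actually searched?" without reverse-engineering it from the sampled configs.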
Common Confusions
Logging is not the same as tracking
Writing print statements to stdout is logging. Tracking means structured storage that supports querying, comparison, and retrieval. If you cannot programmatically find the run with the best validation loss from last month, you have logging but not tracking.
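The distinction is concrete: structured records support queries that stdout logs cannot. A sketch over in-memory records (schema illustrative):

```python
runs = [
    {"run_id": "r1", "date": "2024-01-10", "val_loss": 0.41},
    {"run_id": "r2", "date": "2024-01-12", "val_loss": 0.35},
    {"run_id": "r3", "date": "2024-01-15", "val_loss": 0.38},
]

# "Find the run with the best validation loss" becomes a one-line query,
# not an archaeology session through old stdout logs.
best = min(runs, key=lambda r: r["val_loss"])
print(best["run_id"])  # → r2
```

Real trackers expose the same idea through an API (e.g., searching runs by metric), but the underlying requirement is just queryable structure.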
Version control is not experiment tracking
Git tracks code versions. Experiment tracking records the mapping from (code version, data version, config) to (metrics, artifacts). You need both. A git commit tells you what code was available; an experiment record tells you which configuration of that code produced which result.
Dashboards are not a substitute for raw logs
Interactive dashboards are useful for exploration but unreliable for archival. Always ensure the underlying data (metrics, configs, artifacts) is stored in a durable format that survives platform migrations. Export to JSON or CSV periodically.
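Periodic export can be a few lines of stdlib code. A sketch writing run metrics to CSV (schema illustrative; an in-memory buffer stands in for a file on durable storage):

```python
import csv
import io

runs = [
    {"run_id": "r1", "lr": 3e-4, "val_loss": 0.41},
    {"run_id": "r2", "lr": 1e-3, "val_loss": 0.35},
]

buf = io.StringIO()  # in practice, open a dated file on durable storage
writer = csv.DictWriter(buf, fieldnames=["run_id", "lr", "val_loss"])
writer.writeheader()
writer.writerows(runs)
print(buf.getvalue().splitlines()[0])  # → run_id,lr,val_loss
```

A cron job doing this against the tracker's API is cheap insurance against platform migrations.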
Summary
- Log everything: hyperparameters, metrics over time, data version, code commit, environment, seeds
- Use structured config management (Hydra/OmegaConf), not command-line arguments
- MLflow is open-source and self-hosted; W&B is hosted with richer visualization
- TensorBoard is a visualization tool, not a complete tracking platform
- Sweep metadata (search space, strategy) should be tracked alongside runs
- You cannot reproduce what you did not record
Exercises
Problem
You trained a model three months ago that achieved 94% accuracy. You want to reproduce it. You have the code in git and the final accuracy logged. What additional information do you need, and which of it would a proper tracking system have recorded?
Problem
You are choosing between MLflow (self-hosted) and W&B (hosted) for a team of 10 ML engineers working on proprietary medical data. List three specific technical factors that should influence this decision, beyond general preference.
References
Canonical:
- Zaharia et al., "Accelerating the Machine Learning Lifecycle with MLflow" (IEEE Data Engineering Bulletin, 2018), §3-4 on the Tracking and Models components.
- Bergstra and Bengio, "Random Search for Hyper-Parameter Optimization" (JMLR 2012), §2 for the variance argument vs grid search.
- Sculley et al., "Hidden Technical Debt in Machine Learning Systems" (NeurIPS 2015), §3-4 on configuration debt and entanglement; the canonical statement of why config and tracking discipline matter.
Current:
- Biewald, "Experiment Tracking with Weights and Biases" (2020), the original W&B technical write-up.
- Yadan, "Hydra: A Framework for Elegantly Configuring Complex Applications" (Meta, 2019), the canonical Hydra reference.
- Akiba et al., "Optuna: A Next-generation Hyperparameter Optimization Framework" (KDD 2019), for the Bayesian-optimization sweep alternative referenced above.
- Pineau et al., "Improving Reproducibility in Machine Learning Research" (JMLR 2021), for the empirical reproducibility checklist used by NeurIPS / ICML.
Next Topics
- Ablation study design: systematic experiments to understand component contributions
Last reviewed: April 26, 2026