
Git and GitLab for ML Research

Reference for version control in ML projects: branching, rebase, monorepo vs multirepo, GitLab CI for training, Git LFS vs DVC for data, and why most ML teams use GitHub plus a model registry.

Core · Tier 3 · Current · Reference · ~12 min

What It Is

Git is a distributed version control system, written by Linus Torvalds in 2005. Every clone is a full repository: the entire history, every branch, every commit lives on disk. Operations are local and fast; pushing to a remote (origin, typically) is how clones synchronize. The unit of change is a commit, identified by a SHA-1 hash of its contents plus its parents.
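As a minimal sketch (throwaway repository; the file name and message are illustrative), you can see the commit object and its hash directly:

```shell
# Build a throwaway repo and inspect the resulting commit object.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"   # commit identity (local to this repo)
git config user.name  "You"
echo "lr: 3e-4" > config.yaml
git add config.yaml
git commit -qm "Add training config"
git rev-parse HEAD       # the commit's SHA-1: 40 hex characters
git cat-file -p HEAD     # raw object: tree hash, author, committer, message
```

`git cat-file -p` shows that a commit is just a small text object pointing at a tree (the snapshot) and its parents; changing any of those changes the hash.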

The branching model: a branch is a movable pointer to a commit. A merge combines two branches and produces a merge commit with two parents; a rebase replays a sequence of commits onto a different base and rewrites their hashes. A conflict arises when two branches modify the same file region; Git marks it with <<<<<<<, =======, and >>>>>>> and waits for a manual edit.
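A throwaway demo of the difference (branch and file names are illustrative): merge keeps both lines of history and adds a two-parent commit, while rebase replays the feature commit and gives it a new hash.

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q -b main                      # -b needs git >= 2.28
git config user.email you@example.com
git config user.name  You
echo base > file.txt; git add .; git commit -qm "base"
git checkout -qb feature
echo feature >> file.txt; git commit -qam "feature work"
git checkout -q main
echo main > other.txt; git add .; git commit -qm "main work"

git merge -q --no-edit feature           # merge commit with two parents
git log --oneline --merges               # shows the new merge commit

git reset -q --hard HEAD~1               # undo the merge to compare with rebase
git checkout -q feature
before=$(git rev-parse HEAD)
git rebase -q main                       # replay "feature work" onto main's tip
after=$(git rev-parse HEAD)
[ "$before" != "$after" ] && echo "rebase rewrote the hash"
```

The content of "feature work" is identical either way; only the history shape and the commit identity differ, which is exactly why rebasing shared branches causes trouble.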

GitLab and GitHub are competing hosted Git platforms with built-in CI/CD, code review, and issue tracking. GitLab self-hosts more cleanly (the Omnibus install) and bundles Container Registry and CI runners; GitHub is dominant in open source and has the larger third-party action ecosystem. Both run pipelines from a YAML file in the repo (.gitlab-ci.yml or .github/workflows/*.yml).

When You'd Use It

For ML research, the recurring decisions are repository layout, large-file handling, and CI scope.

Monorepo vs multirepo: a monorepo (one repo for data pipelines, training code, evaluation harness, deployment scripts) wins for cross-cutting refactors and atomic commits across components. Multirepo wins when teams have separate release cadences. ML research projects with fewer than five contributors almost always benefit from a monorepo.
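One common monorepo layout for such a project (directory names purely illustrative):

```text
repo/
├── data_pipeline/     # ingestion and preprocessing
├── training/          # model code and configs
├── evaluation/        # eval harness
├── deployment/        # serving scripts
└── pyproject.toml     # one environment, one lockfile for everything
```

A single lockfile at the root is what makes atomic cross-component refactors cheap: one commit can change the pipeline, the training code, and the environment together.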

Large files: Git handles binaries poorly because its pack-format delta compression works well for text but poorly for binary data, so every revision of a changed binary adds nearly its full size to the repository. Git LFS replaces large files with text pointers and stores the blobs on a separate server. LFS works for occasional binaries (logos, small fixtures) but is the wrong tool for model checkpoints or training data: bandwidth costs add up, history rewriting is painful, and there is no concept of dataset versioning.
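For reference, this is roughly what `git lfs track "*.png" "*.ckpt"` writes into `.gitattributes` (patterns illustrative); files matching these patterns are stored as LFS pointers from then on:

```text
# .gitattributes
*.png  filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
```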

DVC (Data Version Control, by Iterative.ai) is purpose-built for ML artifacts. It stores a small .dvc file in Git that points to a content-addressed blob in object storage (S3, GCS, Azure Blob, SSH). A dvc.yaml file defines a DAG of pipeline stages with input/output hashes, and dvc repro re-executes only the stages whose inputs changed. For any project with model checkpoints over ~100 MB or datasets that change, DVC is the right answer.
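A minimal two-stage dvc.yaml sketch (script names and paths are illustrative, not from any particular project); editing train.py re-runs only the train stage, because the prepare stage's input hashes are unchanged:

```yaml
# dvc.yaml
stages:
  prepare:
    cmd: python prepare.py data/raw data/processed
    deps:
      - prepare.py
      - data/raw
    outs:
      - data/processed
  train:
    cmd: python train.py data/processed models/model.ckpt
    deps:
      - train.py
      - data/processed
    outs:
      - models/model.ckpt
```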

GitLab CI for ML training: GitLab CI can launch GPU runners and run training as a pipeline stage, useful for nightly evaluation and integration tests. As of 2025–2026 most production ML teams have moved to a split: GitHub for code review and unit-test CI, and a model registry / orchestrator (W&B Model Registry, MLflow Model Registry, Hugging Face Hub, Modal, Argo Workflows) for actual training and deployment. GitLab CI runners do not handle multi-day jobs gracefully, and model lineage tracking is weak.
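A sketch of the nightly-evaluation pattern in .gitlab-ci.yml (the image, runner tag, and script paths are illustrative assumptions, not a drop-in config):

```yaml
# .gitlab-ci.yml
nightly-eval:
  stage: test
  image: pytorch/pytorch:latest
  tags: [gpu]                      # route to a runner registered with this tag
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'   # run only from a pipeline schedule
  script:
    - pip install -r requirements.txt
    - python eval.py --checkpoint models/latest.ckpt
  artifacts:
    paths: [eval_report.json]
```

The `rules` clause keeps the GPU job off every push; it fires only when a scheduled pipeline triggers it.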

Pre-commit hooks: the pre-commit framework (a Python tool) runs linters, formatters, and secret scanners before each commit. Standard ML setup: ruff (lint and format), mypy (type check), nbstripout (strip notebook outputs to keep diffs reviewable), and detect-secrets or trufflehog for credential scanning.
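A .pre-commit-config.yaml matching that setup might look like this (the `rev` pins are illustrative; pin to whatever versions your project has vetted):

```yaml
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.4.0
    hooks:
      - id: ruff          # lint
      - id: ruff-format   # format
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.10.0
    hooks:
      - id: mypy          # type check
  - repo: https://github.com/kynan/nbstripout
    rev: 0.7.1
    hooks:
      - id: nbstripout    # strip notebook outputs
  - repo: https://github.com/Yelp/detect-secrets
    rev: v1.5.0
    hooks:
      - id: detect-secrets
```

After `pre-commit install`, every commit runs all four hooks and blocks the commit if any fails.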

Definition

Version-Controlled ML Artifact

A version-controlled ML artifact is any object needed to reproduce a result with a stable identity: source code, config, dataset pointer, checkpoint pointer, environment lockfile, evaluation report, or run metadata. Git should store text and pointers; object storage should store large binary payloads.

Proposition

Large Binary History Principle

Statement

Putting large changing binaries directly in Git makes every clone and fetch pay for historical artifacts that most users do not need.

Intuition

Git is excellent at tracking source history. It is not a model registry or dataset store. ML projects need content-addressed pointers so code review remains small while artifacts remain reproducible.

Failure Mode

The failure mode is repeated churn of checkpoints, datasets, or notebook outputs, where every revision bloats history. Small fixtures, icons, and stable test assets are fine in Git.

ExerciseCore

Problem

Classify each item as Git, Git LFS, DVC, or a model registry: a 2 KB YAML config, a 400 MB checkpoint updated weekly, a 50 GB training dataset, and a released model card.

Notable Gotchas

Watch Out

git rebase rewrites hashes

Rebasing a branch that someone else has already pulled creates two parallel histories and a recurring "your branch and origin have diverged" message. Rule of thumb: rebase only branches that exist solely on your machine. Once a branch is shared, use merge commits. Force-pushing (--force-with-lease) a personal feature branch is fine; force-pushing main is destructive.
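The safe personal-branch workflow can be demonstrated end to end with a local bare repository standing in for the hosted remote (all names illustrative); amending rewrites the tip hash, so a plain push would be rejected, but --force-with-lease succeeds because the remote still matches our last-known state:

```shell
set -e
work=$(mktemp -d)
git init -q --bare "$work/origin.git"            # stand-in for a hosted remote
git clone -q "$work/origin.git" "$work/clone" 2>/dev/null
cd "$work/clone"
git config user.email you@example.com
git config user.name  You
echo a > a.txt; git add .; git commit -qm "base"
git push -q origin HEAD
git checkout -qb feature
echo f > f.txt; git add .; git commit -qm "feature"
git push -q -u origin feature
# Amending rewrites the tip hash; a plain push would now be rejected.
git commit -q --amend -m "feature, reworded"
git push -q --force-with-lease origin feature    # refuses if someone pushed meanwhile
```

With no value given, --force-with-lease compares the remote branch against your remote-tracking ref, so it fails fast if a collaborator pushed commits you have not fetched.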

Watch Out

Notebook diffs make code review painful

A .ipynb file is JSON containing both code and outputs (including base64-encoded images). A one-line code change can produce a 10,000-line diff. Run nbstripout --install in any repo that hosts notebooks, or use Jupytext to pair each notebook with a .py file that serves as the version-controlled artifact.


Last reviewed: April 18, 2026
