Applied ML
CNNs for Medical Imaging
U-Net and its 3D successors dominate medical image segmentation; ResNet/DenseNet backbones drive radiology classification. Distribution shift across hospitals is the recurring failure mode.
Why This Matters
Medical images differ from natural images in three ways: pixel intensities carry physical meaning (Hounsfield units, T1/T2 contrast), spatial resolution is anisotropic and modality-dependent, and labeled data is scarce. Architectures that succeed here are the ones that respect these constraints. The encoder-decoder U-Net with skip connections (Ronneberger, Fischer, Brox 2015) became the workhorse for segmentation precisely because it preserves spatial detail through the bottleneck while still aggregating context.
Two clinical use classes dominate. Segmentation: outline an organ, lesion, or cell, then compute volume or shape statistics. Classification: from a chest X-ray or histopathology slide, predict the presence of a finding. Both ride on convolutional neural networks; the failure modes are usually about data, not architecture.
The recurring evaluation trap is confounding by acquisition site. CheXNet-era models often learned scanner artifacts, view markers, or hospital-specific framing rather than the radiological feature itself. Zech et al. (2018, PLOS Medicine) showed pneumonia classifiers transfer poorly across hospitals because they had learned site-identifying cues, such as laterality tokens and hospital-specific postprocessing, as features.
Core Ideas
U-Net. Symmetric encoder and decoder with skip connections at each resolution. The encoder downsamples through pooling, the decoder upsamples through transposed convolutions or interpolation, and skips concatenate matched-resolution features so the decoder sees both context (deep layers) and locality (shallow layers). For an input volume x with ground-truth mask y and predicted probabilities ŷ, the loss is typically a sum of cross entropy and Dice: L(ŷ, y) = L_CE(ŷ, y) + L_Dice(ŷ, y), where L_Dice = 1 − 2·Σ_i y_i ŷ_i / (Σ_i y_i + Σ_i ŷ_i).
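A minimal NumPy sketch of that combined objective, assuming binary masks, predicted probabilities, and equal weighting of the two terms (the weights are a pipeline choice, not fixed by the architecture):

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    # Soft Dice: 1 - 2*|pred . target| / (|pred| + |target|), eps avoids 0/0
    inter = np.sum(pred * target)
    return float(1.0 - (2.0 * inter + eps) / (np.sum(pred) + np.sum(target) + eps))

def cross_entropy(pred, target, eps=1e-7):
    # Mean per-voxel binary cross entropy on predicted probabilities
    p = np.clip(pred, eps, 1.0 - eps)
    return float(np.mean(-(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))))

def segmentation_loss(pred, target, w_ce=1.0, w_dice=1.0):
    # Weighted sum of the two terms, as in most U-Net training pipelines
    return w_ce * cross_entropy(pred, target) + w_dice * dice_loss(pred, target)
```

A perfect prediction drives both terms to zero; a prediction that misses the foreground entirely keeps the Dice term near its maximum of 1.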
3D U-Net and volumetric variants. Replace 2D convolutions with 3D for CT and MRI volumes. Memory blows up cubically, so most 3D pipelines work on patches and stitch predictions. V-Net (Milletari et al. 2016) added residual blocks; nnU-Net (Isensee et al. 2021, Nature Methods 18) automated patch size, spacing, and normalization choices and won most public segmentation benchmarks without architectural innovation. The lesson: with medical data, configuration discipline beats novel architectures.
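The patch-and-stitch loop above can be sketched in a few lines (NumPy only; `predict_fn` stands in for a trained 3D network, and averaging overlapping predictions is one common stitching choice among several):

```python
import numpy as np

def sliding_window_predict(volume, predict_fn, patch=32, stride=16):
    # Tile a 3D volume into overlapping cubes, run predict_fn on each cube,
    # and average the overlapping predictions back into a full-size map.
    out = np.zeros(volume.shape, dtype=np.float32)
    count = np.zeros(volume.shape, dtype=np.float32)
    zs, ys, xs = volume.shape
    for z in range(0, max(zs - patch, 0) + 1, stride):
        for y in range(0, max(ys - patch, 0) + 1, stride):
            for x in range(0, max(xs - patch, 0) + 1, stride):
                sl = (slice(z, z + patch), slice(y, y + patch), slice(x, x + patch))
                out[sl] += predict_fn(volume[sl])
                count[sl] += 1.0
    return out / np.maximum(count, 1.0)  # normalize by visit count per voxel
```

Overlap (stride < patch) matters in practice: predictions near patch borders see less context and are less reliable, so averaging several offset views smooths the seams.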
Classification backbones. ResNet-50 and DenseNet-121 remain the default for chest X-ray and histology classification. CheXNet (Rajpurkar et al. 2017, arXiv:1711.05225) is a DenseNet-121 trained on ChestX-ray14 to predict pneumonia and 13 other findings. ImageNet pretraining still helps despite the domain gap; the early conv filters transfer.
Vision transformers in pathology. Whole-slide histopathology images are gigapixel; attention over patch tokens fits naturally. Models like CTransPath, HIPT, and UNI use self-supervised vision pretraining on millions of slide patches before fine-tuning. ViT-based foundation models now match or exceed CNNs on many pathology tasks, though convolutional baselines remain competitive when labeled data is small.
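The tokenization step itself is simple; a minimal sketch of turning a patch grid into flat tokens, the input format a ViT-style encoder consumes (sizes here are illustrative, and real pathology pipelines tile the gigapixel slide first):

```python
import numpy as np

def patch_tokens(img, p=16):
    # Split an HxWxC image into (H/p * W/p) flattened patch tokens.
    # Assumes H and W are divisible by the patch size p.
    h, w, c = img.shape
    grid = img.reshape(h // p, p, w // p, p, c).swapaxes(1, 2)
    return grid.reshape(-1, p * p * c)  # (num_tokens, token_dim)
```

Each token would then be linearly projected and fed through self-attention, which is what lets the model relate distant patches on the slide.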
Distribution shift and external validation. The single most important methodology rule: report performance on at least one external hospital. Internal AUC of 0.95 and external AUC of 0.75 is the typical gap. Causes include scanner manufacturer differences, demographic shift, and label noise from differing radiologist conventions. Stain normalization (Macenko, Reinhard) helps for histopathology; intensity standardization and bias-field correction help for MRI. Domain randomization at training time and test-time adaptation are the active research directions.
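The internal-versus-external gap is easy to quantify once AUC is in hand; a minimal rank-based implementation (AUC as the probability that a random positive outscores a random negative, ties counted as half-wins):

```python
import numpy as np

def auc(labels, scores):
    # Rank-based AUC: fraction of positive/negative pairs ranked correctly.
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))
```

Reporting this metric separately on a held-out external site, rather than only on internal test splits, is what exposes the 0.95-versus-0.75 gap described above.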
Common Confusions
Dice loss and cross entropy are not interchangeable
Dice optimizes the F1-like overlap and is robust to class imbalance, which is severe in lesion segmentation (a tumor may occupy a fraction of a percent of the voxels). Cross entropy is well-calibrated but gets dominated by background gradients. Most pipelines use a weighted sum so the network learns sharp boundaries (cross entropy) and high overlap (Dice).
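A quick numeric illustration of that imbalance, on a synthetic map with 1% foreground (numbers are illustrative, not from a real model):

```python
import numpy as np

# 1000 voxels, 1% foreground; the model confidently predicts background everywhere.
target = np.zeros(1000)
target[:10] = 1.0
pred = np.full(1000, 0.01)

# Cross entropy averages over all voxels, so 990 easy negatives dilute the miss.
ce = float(np.mean(-(target * np.log(pred) + (1 - target) * np.log(1 - pred))))

# Soft Dice only measures overlap, so missing the lesion costs nearly the maximum.
dice = float(1.0 - 2.0 * np.sum(pred * target) / (np.sum(pred) + np.sum(target)))
```

Here `ce` comes out small (around 0.06) while `dice` is near 1: cross entropy calls this prediction acceptable, Dice correctly calls it a total segmentation failure.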
ImageNet pretraining helps even though the images look nothing alike
Early conv filters learn edges, textures, and gradients that are universal. The benefit shrinks for late layers but the cost of training from scratch on small medical datasets is usually worse. Self-supervised pretraining on in-domain images (RadImageNet, MICLe) outperforms ImageNet when available.
References
Canonical:
- Ronneberger, Fischer, Brox. "U-Net: Convolutional Networks for Biomedical Image Segmentation." MICCAI 2015. https://arxiv.org/abs/1505.04597
- Isensee et al. "nnU-Net: a self-configuring method for deep-learning-based biomedical image segmentation." Nature Methods 18, 2021. https://www.nature.com/articles/s41592-020-01008-z
Current:
- Rajpurkar et al. "CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning." arXiv:1711.05225, 2017. https://arxiv.org/abs/1711.05225
- Zech et al. "Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study." PLOS Medicine, 2018. https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1002683
- Lu et al. "A visual-language foundation model for computational pathology (CONCH)." Nature Medicine, 2024. https://www.nature.com/articles/s41591-024-02856-4
- Çiçek et al. "3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation." MICCAI 2016. https://arxiv.org/abs/1606.06650
Related Topics
Last reviewed: April 18, 2026
Required prerequisites
- Convolutional Neural Networks
- Object Detection and Segmentation
Derived topics
- Vision Transformer Lineage: ViT, DeiT, Swin, MAE, DINOv2, SAM
- Out-of-Distribution Detection