Applied ML
CNNs for Medical Imaging
U-Net and its 3D successors dominate medical image segmentation; ResNet/DenseNet backbones drive radiology classification. Distribution shift across hospitals is the recurring failure mode.
Why This Matters
Medical images differ from natural images in three ways: pixel intensities carry physical meaning (Hounsfield units, T1/T2 contrast), spatial resolution is anisotropic and modality-dependent, and labeled data is scarce. Architectures that succeed here are the ones that respect these constraints. The encoder-decoder U-Net with skip connections (Ronneberger, Fischer, Brox 2015) became the workhorse for segmentation precisely because it preserves spatial detail through the bottleneck while still aggregating context.
Two clinical use classes dominate. Segmentation: outline an organ, lesion, or cell, then compute volume or shape statistics. Classification: from a chest X-ray or histopathology slide, predict the presence of a finding. Both ride on convolutional neural networks; the failure modes are usually about data, not architecture.
The recurring evaluation trap is confounding by acquisition site. CheXNet-era models often learned scanner artifacts, view markers, or hospital-specific framing rather than the radiological feature itself. Zech et al. (2018, PLOS Medicine) showed pneumonia classifiers transfer poorly across hospitals because they had learned site-identifying cues, such as laterality tokens and hospital-specific postprocessing, as features.
Core Ideas
U-Net. Symmetric encoder and decoder with skip connections at each resolution. The encoder downsamples through pooling, the decoder upsamples through transposed convolutions or interpolation, and skips concatenate matched-resolution features so the decoder sees both context (deep layers) and locality (shallow layers). For an input volume x with ground-truth mask y and predicted probabilities ŷ, the loss is typically a sum of cross entropy and Dice: L(ŷ, y) = L_CE(ŷ, y) + L_Dice(ŷ, y), where L_Dice = 1 − 2·Σ_i y_i ŷ_i / (Σ_i y_i + Σ_i ŷ_i).
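A minimal NumPy sketch of that combined objective, assuming binary masks, predicted probabilities, and equal weighting of the two terms (the weights are a pipeline choice, not fixed by the architecture):

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    # Soft Dice: 1 - 2*|pred . target| / (|pred| + |target|), eps avoids 0/0
    inter = np.sum(pred * target)
    return float(1.0 - (2.0 * inter + eps) / (np.sum(pred) + np.sum(target) + eps))

def cross_entropy(pred, target, eps=1e-7):
    # Mean per-voxel binary cross entropy on predicted probabilities
    p = np.clip(pred, eps, 1.0 - eps)
    return float(np.mean(-(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))))

def segmentation_loss(pred, target, w_ce=1.0, w_dice=1.0):
    # Weighted sum of the two terms, as in most U-Net training pipelines
    return w_ce * cross_entropy(pred, target) + w_dice * dice_loss(pred, target)
```

A perfect prediction drives both terms to zero; a prediction that misses the foreground entirely keeps the Dice term near its maximum of 1.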
3D U-Net and volumetric variants. Replace 2D convolutions with 3D for CT and MRI volumes. Memory blows up cubically, so most 3D pipelines work on patches and stitch predictions. V-Net (Milletari et al. 2016) added residual blocks; nnU-Net (Isensee et al. 2021, Nature Methods 18) automated patch size, spacing, and normalization choices and won most public segmentation benchmarks without architectural innovation. The lesson: with medical data, configuration discipline beats novel architectures.
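The patch-and-stitch loop above can be sketched in a few lines (NumPy only; `predict_fn` stands in for a trained 3D network, and averaging overlapping predictions is one common stitching choice among several):

```python
import numpy as np

def sliding_window_predict(volume, predict_fn, patch=32, stride=16):
    # Tile a 3D volume into overlapping cubes, run predict_fn on each cube,
    # and average the overlapping predictions back into a full-size map.
    out = np.zeros(volume.shape, dtype=np.float32)
    count = np.zeros(volume.shape, dtype=np.float32)
    zs, ys, xs = volume.shape
    for z in range(0, max(zs - patch, 0) + 1, stride):
        for y in range(0, max(ys - patch, 0) + 1, stride):
            for x in range(0, max(xs - patch, 0) + 1, stride):
                sl = (slice(z, z + patch), slice(y, y + patch), slice(x, x + patch))
                out[sl] += predict_fn(volume[sl])
                count[sl] += 1.0
    return out / np.maximum(count, 1.0)  # normalize by visit count per voxel
```

Overlap (stride < patch) matters in practice: predictions near patch borders see less context and are less reliable, so averaging several offset views smooths the seams.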
Classification backbones. ResNet-50 and DenseNet-121 remain the default for chest X-ray and histology classification. CheXNet (Rajpurkar et al. 2017, arXiv:1711.05225) is a DenseNet-121 trained on ChestX-ray14 to predict pneumonia and 13 other findings. ImageNet pretraining still helps despite the domain gap; the early conv filters transfer.
Vision transformers in pathology. Whole-slide histopathology images are gigapixel; attention over patch tokens fits naturally. Models like CTransPath, HIPT, and UNI use self-supervised vision pretraining on millions of slide patches before fine-tuning. ViT-based foundation models now match or exceed CNNs on many pathology tasks, though convolutional baselines remain competitive when labeled data is small.
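The tokenization step itself is simple; a minimal sketch of turning a patch grid into flat tokens, the input format a ViT-style encoder consumes (sizes here are illustrative, and real pathology pipelines tile the gigapixel slide first):

```python
import numpy as np

def patch_tokens(img, p=16):
    # Split an HxWxC image into (H/p * W/p) flattened patch tokens.
    # Assumes H and W are divisible by the patch size p.
    h, w, c = img.shape
    grid = img.reshape(h // p, p, w // p, p, c).swapaxes(1, 2)
    return grid.reshape(-1, p * p * c)  # (num_tokens, token_dim)
```

Each token would then be linearly projected and fed through self-attention, which is what lets the model relate distant patches on the slide.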
Distribution shift and external validation. The single most important methodology rule: report performance on at least one external hospital. Internal AUC of 0.95 and external AUC of 0.75 is the typical gap. Causes include scanner manufacturer differences, demographic shift, and label noise from differing radiologist conventions. Stain normalization (Macenko, Reinhard) helps for histopathology; intensity standardization and bias-field correction help for MRI. Domain randomization at training time and test-time adaptation are the active research directions.
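The internal-versus-external gap is easy to quantify once AUC is in hand; a minimal rank-based implementation (AUC as the probability that a random positive outscores a random negative, ties counted as half-wins):

```python
import numpy as np

def auc(labels, scores):
    # Rank-based AUC: fraction of positive/negative pairs ranked correctly.
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))
```

Reporting this metric separately on a held-out external site, rather than only on internal test splits, is what exposes the 0.95-versus-0.75 gap described above.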
Common Confusions
Dice loss and cross entropy are not interchangeable
Dice optimizes the F1-like overlap and is robust to class imbalance, which is severe in lesion segmentation (a tumor may occupy a fraction of a percent of the voxels). Cross entropy is well-calibrated but gets dominated by background gradients. Most pipelines use a weighted sum so the network learns sharp boundaries (cross entropy) and high overlap (Dice).
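A quick numeric illustration of that imbalance, on a synthetic map with 1% foreground (numbers are illustrative, not from a real model):

```python
import numpy as np

# 1000 voxels, 1% foreground; the model confidently predicts background everywhere.
target = np.zeros(1000)
target[:10] = 1.0
pred = np.full(1000, 0.01)

# Cross entropy averages over all voxels, so 990 easy negatives dilute the miss.
ce = float(np.mean(-(target * np.log(pred) + (1 - target) * np.log(1 - pred))))

# Soft Dice only measures overlap, so missing the lesion costs nearly the maximum.
dice = float(1.0 - 2.0 * np.sum(pred * target) / (np.sum(pred) + np.sum(target)))
```

Here `ce` comes out small (around 0.06) while `dice` is near 1: cross entropy calls this prediction acceptable, Dice correctly calls it a total segmentation failure.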
ImageNet pretraining helps even though the images look nothing alike
Early conv filters learn edges, textures, and gradients that are universal. The benefit shrinks for late layers but the cost of training from scratch on small medical datasets is usually worse. Self-supervised pretraining on in-domain images (RadImageNet, MICLe) outperforms ImageNet when available.
References
Canonical:
- Ronneberger, Fischer, Brox. "U-Net: Convolutional Networks for Biomedical Image Segmentation." MICCAI 2015. https://arxiv.org/abs/1505.04597
- Isensee et al. "nnU-Net: a self-configuring method for deep-learning-based biomedical image segmentation." Nature Methods 18, 2021. https://www.nature.com/articles/s41592-020-01008-z
Current:
- Rajpurkar et al. "CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning." arXiv:1711.05225, 2017. https://arxiv.org/abs/1711.05225
- Zech et al. "Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study." PLOS Medicine, 2018. https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1002683
- Lu et al. "A visual-language foundation model for computational pathology (CONCH)." Nature Medicine, 2024. https://www.nature.com/articles/s41591-024-02856-4
- Çiçek et al. "3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation." MICCAI 2016. https://arxiv.org/abs/1606.06650
Related Topics
Last reviewed: April 18, 2026
Required prerequisites
- Convolutional Neural Networks
- Object Detection and Segmentation
Derived topics
- Vision Transformer Lineage: ViT, DeiT, Swin, MAE, DINOv2, SAM
- Out-of-Distribution Detection