What Variables Affect Out-of-Distribution Generalization in Pretrained Models?

1Rochester Institute of Technology, 2University of Rochester, 3Georgia Tech

[* Equal contribution]

Published at NeurIPS 2024

Motivation

Embeddings from pre-trained deep neural networks (DNNs) are widely used across computer vision; however, the efficacy of these embeddings when used for downstream tasks can vary widely. We seek to understand what variables affect out-of-distribution (OOD) generalization. We do this through the lens of the tunnel effect hypothesis, which states that after training an over-parameterized DNN, its layers form two distinct groups. The first consists of the initial DNN layers that produce progressively more linearly separable representations, and the second consists of the deeper layers that compress these representations and hinder OOD generalization. Earlier work convincingly demonstrated the tunnel effect exists for DNNs trained on low-resolution images (e.g., CIFAR-10) and suggested that it was universally applicable. Here, we study the magnitude of the tunnel effect when the DNN architecture, training dataset, image resolution, augmentations, and OOD dataset are varied. We show that in some cases the tunnel effect is completely mitigated, therefore refuting that the hypothesis is universally applicable. Through extensive experiments with 10,584 trained linear probes, we find that each variable plays a role, but some have more impact than others. Our results caution against the practice of extrapolating findings from models trained on toy datasets to be universally applicable.

The Tunnel Effect Hypothesis

    An overparameterized N-layer DNN forms two distinct groups:
  1. The extractor consists of the first K layers, creating linearly separable representations.
  2. The tunnel comprises the remaining N - K layers, compressing representations and hindering OOD performance.
  • Linear probe ID accuracy increases monotonically with depth, but OOD accuracy increases only until the tunnel is reached and then declines (a small sketch of reading the tunnel start off these probe curves follows this list).
  • Earlier work used datasets with 32×32 images (CIFAR-10, etc.) for ID training data and did not measure tunnel effect strength (Masarczyk et al., NeurIPS 2023).
  • Their findings are contrary to widely used transfer learning approaches with ImageNet-1K backbones.
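
As a concrete illustration of the second bullet, here is a minimal sketch of reading the tunnel start off a per-layer OOD probe-accuracy curve. The function name, the peak-based rule, and the tolerance threshold are illustrative assumptions, not the paper's exact criterion.

```python
import numpy as np

def find_tunnel_start(ood_acc, tolerance=0.02):
    """Return the index of the layer where the tunnel appears to begin.

    ood_acc: per-layer OOD linear-probe accuracies, ordered from the first
             layer to the penultimate layer.
    The tunnel start is taken as the layer with the highest OOD accuracy,
    provided accuracy afterwards drops by more than `tolerance`;
    otherwise we report that no tunnel was detected (K == N).
    """
    ood_acc = np.asarray(ood_acc, dtype=float)
    peak = int(np.argmax(ood_acc))
    # No meaningful decline after the peak -> no tunnel.
    if ood_acc[peak] - ood_acc[-1] <= tolerance:
        return None
    return peak

# Example: OOD probe accuracy rises, then decays inside the tunnel.
ood_curve = [0.31, 0.44, 0.52, 0.58, 0.61, 0.63, 0.60, 0.51, 0.42, 0.35]
print(find_tunnel_start(ood_curve))  # -> 5 (0-indexed layer of peak OOD accuracy)
```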

In-distribution (CIFAR-10): ID performance

Out-of-distribution (CIFAR-100): OOD performance

RQ: How does image resolution impact the tunnel strength?


The tunnel impedes OOD generalization, which we study using linear probes trained on the ID and OOD datasets at each layer. In this example, identical VGGm-17 models are trained on the same ID dataset, with only the image resolution changed. Probe accuracy on OOD datasets decreases once the tunnel is reached (denoted by ⭐): the model trained on low-resolution (32×32) images creates a longer tunnel (layers 9-16) than the model trained on higher-resolution (224×224) images (layers 13-16). The Y-axis shows normalized accuracy. The OOD curve is the average over 8 OOD datasets, with the standard deviation shown as shading.
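
A minimal sketch of this layer-wise probing protocol, assuming PyTorch feature hooks and a scikit-learn logistic-regression probe; the authors' exact probe training recipe may differ.

```python
import torch
import numpy as np
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def layer_features(model, layer, loader, device="cpu"):
    """Collect flattened activations of `layer` for every batch in `loader`."""
    feats, labels = [], []
    hook_out = {}
    handle = layer.register_forward_hook(
        lambda m, i, o: hook_out.update(z=o.flatten(start_dim=1))
    )
    model.eval().to(device)
    for x, y in loader:
        model(x.to(device))
        feats.append(hook_out["z"].cpu().numpy())
        labels.append(y.numpy())
    handle.remove()
    return np.concatenate(feats), np.concatenate(labels)

def probe_accuracy(model, layer, train_loader, test_loader, device="cpu"):
    """Fit a linear probe on one layer's features and return its test accuracy."""
    Xtr, ytr = layer_features(model, layer, train_loader, device)
    Xte, yte = layer_features(model, layer, test_loader, device)
    probe = LogisticRegression(max_iter=2000).fit(Xtr, ytr)
    return probe.score(Xte, yte)

# Repeating probe_accuracy(...) for every layer, on the ID dataset and on each
# OOD dataset, yields per-layer curves like those in the figure; accuracies can
# then be normalized (e.g., by each curve's maximum) to match the Y-axis shown.
```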

Quantitative Results: High-Resolution Images Reduce The Tunnel Strength

Increasing image resolution improves OOD performance across all criteria. P-values are denoted by stars: *P < 0.05, **P < 0.01, ***P < 0.001, ****P < 0.0001.
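
To illustrate the star notation, here is a hedged sketch of a paired significance test on one criterion; the choice of the Wilcoxon signed-rank test and the numbers below are assumptions, not the paper's exact statistical procedure or data.

```python
from scipy.stats import wilcoxon

def significance_stars(p):
    """Map a p-value to the star notation used in the figures."""
    for threshold, stars in [(1e-4, "****"), (1e-3, "***"), (1e-2, "**"), (5e-2, "*")]:
        if p < threshold:
            return stars
    return "n.s."

# Hypothetical paired measurements of one criterion (e.g., % OOD performance
# retained) for the same runs at 32x32 vs. 224x224 resolution.
low_res  = [0.61, 0.55, 0.58, 0.63, 0.57, 0.60, 0.54, 0.59]
high_res = [0.78, 0.72, 0.75, 0.81, 0.74, 0.77, 0.70, 0.76]

stat, p = wilcoxon(low_res, high_res)
print(p, significance_stars(p))
```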


RQ: How does augmentation impact the tunnel strength?

Augmentation reduces the tunnel strength. In (a), augmentation shifts the tunnel start from layer 14 to layer 22, and in (b) from block 11 to block 15. The OOD curve is the average over 8 OOD datasets, with shading indicating a 95% confidence interval. ⭐ denotes the start of the tunnel.

(a) ResNet

(b) ViT

Quantitative Results: Augmentation Reduces The Tunnel Strength

Augmentations increase training data diversity and decrease the tunnel strength. P-values are denoted by stars: *P < 0.05, **P < 0.01, ***P < 0.001, ****P < 0.0001.
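
For reference, a typical torchvision pipeline of the kind contrasted in these experiments; the specific transforms and parameters are illustrative assumptions, not the paper's exact augmentation recipe.

```python
from torchvision import transforms

# With augmentation: random crops, flips, and color jitter increase
# within-class diversity.
train_augmented = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
])

# Augmentation-free baseline: deterministic resize only.
train_plain = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
```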


Training on more classes greatly reduces the tunnel effect, whereas increasing dataset size has less impact

(1) and (2) Results with a fixed number of samples but a varied number of classes.

(3) and (4) Results with a fixed number of classes but a varied number of samples per class.
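
A minimal sketch of how such controlled splits can be built from an ID dataset's label array; the helper `subsample` and the example class/sample counts are illustrative assumptions, not the paper's exact splits.

```python
import numpy as np

def subsample(labels, num_classes, samples_per_class, seed=0):
    """Return indices of a subset with `num_classes` classes and
    `samples_per_class` images per class (total = num_classes * samples_per_class)."""
    rng = np.random.default_rng(seed)
    chosen_classes = rng.choice(np.unique(labels), size=num_classes, replace=False)
    idx = []
    for c in chosen_classes:
        class_idx = np.flatnonzero(labels == c)
        idx.extend(rng.choice(class_idx, size=samples_per_class, replace=False))
    return np.array(idx)

# (1)/(2): fixed total budget, varied class count, e.g., 100 classes x 500 images
# vs. 50 classes x 1000 images (both 50k samples).
# (3)/(4): fixed class count, varied samples per class, e.g., 100 classes with
# 250 vs. 500 vs. 1000 images each.
```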

Impact of DNN Variables on The Tunnel Effect

DNN architectural variables can intensify the tunnel effect and impair OOD generalization.

(a) Increasing the overparameterization level (𝛄) intensifies the tunnel effect and impairs OOD generalization (a sketch of computing 𝛄 follows this list).

(b) Increasing DNN depth hurts OOD generalization.

(c) Decreasing spatial reduction ratio (ϕ) impairs OOD transfer.

(d) Large stem size (k×k) impairs OOD transfer.
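
A small sketch for panel (a), assuming the overparameterization level 𝛄 is the ratio of trainable parameter count to ID training-set size; the model choice and training-set size below are hypothetical.

```python
import torch.nn as nn
from torchvision.models import resnet34

def overparameterization_level(model: nn.Module, num_train_samples: int) -> float:
    """gamma = trainable parameter count / number of ID training samples (assumed definition)."""
    n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return n_params / num_train_samples

NUM_TRAIN = 130_000  # hypothetical ID training-set size
print(f"gamma = {overparameterization_level(resnet34(num_classes=100), NUM_TRAIN):.1f}")
```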

Is The Tunnel Effect Universal?

    The tunnel effect is not universal and its strength varies. Among 64 ID backbones, 4 did not exhibit any tunnel effect.
  • In (a), VGGm-11 with max-pooling in all 5 stages (φ = 0.5) creates a tunnel (layers 7-10, gray-shaded area).
  • In (b), the same VGGm-11 without max-pooling in the first 2 stages (φ = 1, denoted VGGm†-11) eliminates the tunnel for all OOD datasets.

(a) Strong Tunnel Effect

(b) No Tunnel Effect

SHAP Analysis - % OOD Performance Retained

In terms of % OOD Performance Retained, ID class count shows the greatest impact.
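
A minimal sketch of how such a SHAP importance ranking can be produced with the `shap` library, assuming a tree-based surrogate model fit on a table of experiment variables; the columns and placeholder values below are hypothetical, not the paper's actual experiment log.

```python
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestRegressor

# Hypothetical experiment log: one row per trained backbone, columns are the
# studied variables plus the resulting % OOD performance retained.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "resolution":   rng.choice([32, 64, 128, 224], size=64),
    "num_classes":  rng.choice([10, 100, 1000], size=64),
    "augmentation": rng.integers(0, 2, size=64),
    "depth":        rng.choice([11, 17, 34], size=64),
})
df["ood_retained"] = rng.uniform(0.3, 1.0, size=64)  # placeholder target

X, y = df.drop(columns="ood_retained"), df["ood_retained"]
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
# Mean |SHAP| per variable gives a global importance ranking like the one above.
print(pd.Series(np.abs(shap_values).mean(axis=0), index=X.columns).sort_values(ascending=False))
```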


SHAP Analysis - ID/OOD Alignment

In terms of ID/OOD alignment, image resolution shows the greatest impact.


SHAP Analysis - Pearson Correlation

In terms of Pearson correlation, ID class count shows the greatest impact.
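
A minimal sketch of the Pearson-correlation criterion, assuming it is computed between a backbone's per-layer ID and OOD probe accuracies; the values below are hypothetical.

```python
from scipy.stats import pearsonr

# Per-layer linear-probe accuracies for one backbone (hypothetical values).
id_acc  = [0.35, 0.48, 0.57, 0.66, 0.73, 0.79, 0.84, 0.88, 0.91, 0.93]
ood_acc = [0.31, 0.44, 0.52, 0.58, 0.61, 0.63, 0.60, 0.51, 0.42, 0.35]

r, p = pearsonr(id_acc, ood_acc)
# A strong tunnel drives r down (ID keeps rising while OOD falls); r close to 1
# indicates ID and OOD probes improve together, i.e., little or no tunnel.
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
```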


ID Dataset

The tunnel effect is observed across various ID datasets, but its strength varies with the ID class count. The tunnel effect is not a characteristic of any particular dataset, e.g., CIFAR-10.


Representation Compression


The t-SNE comparison between VGGm-11 models trained on low-resolution (1st row) and high-resolution (2nd row) images of the same ID dataset (ImageNet-100) in an augmentation-free setting. Layer 8 marks the start of the tunnel in the VGGm-11 trained on 32×32 images, whereas the 224×224 model does not create any tunnel. Layer 10 is the penultimate layer. The tunnel layers (layers 8-10) progressively compress representations at 32×32 resolution, whereas the corresponding layers at 224×224 resolution do not exhibit similar compression. For clarity, we show 5 classes from ImageNet-100 and indicate each class with a distinct color. The formation of distinct clusters in the 32×32 model is indicative of representation compression and intermediate neural collapse, which impairs OOD generalization.
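
A minimal sketch of this kind of t-SNE visualization, assuming features are collected with a hook such as the `layer_features` helper sketched earlier; the perplexity and other settings are illustrative assumptions.

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(features, labels, title):
    """2-D t-SNE of one layer's (flattened) activations, colored by class."""
    emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(features)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=5, cmap="tab10")
    plt.title(title)
    plt.show()

# `layer_features` from the probing sketch above could supply `features`/`labels`
# for a tunnel layer (e.g., layer 8) and the penultimate layer (layer 10);
# tighter, well-separated class clusters indicate stronger compression.
```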

Summary

  • Dataset variables, e.g., image resolution, ID class count, and augmentations, are dominant in altering the tunnel effect.
  • Increasing the ID class count (between-class diversity), using more augmentations (within-class diversity), and using higher image resolution (hierarchical features) reduce the tunnel effect and improve OOD transfer.
  • DNN variables, e.g., over-parameterization and depth, increase the tunnel effect, but their impact is smaller than that of the dataset variables.
  • Concretely, increasing dataset diversity plays the major role in mitigating the tunnel effect. This leads us to revise the tunnel effect hypothesis.

Revised Tunnel Effect Hypothesis

Our study indicates that the best way to mitigate the tunnel effect, and thereby improve OOD generalization, is to increase diversity in the ID training dataset, especially by increasing the number of semantic classes, using augmentations, and using higher-resolution images; hence, we revise the tunnel effect hypothesis as follows:

    An overparameterized N-layer DNN forms two distinct groups:
  1. The extractor consists of the first K layers, creating linearly separable representations.
  2. The tunnel comprises the remaining N - K layers, compressing representations and hindering OOD performance.
K is proportional to the diversity of the training inputs; if diversity is sufficiently high, then K = N and no tunnel forms.


Acknowledgements

This work was partly supported by NSF awards #2326491, #2125362, and #2317706.

BibTeX

@article{harun2024variables,
  title     = {What Variables Affect Out-of-Distribution Generalization in Pretrained Models?},
  author    = {Harun, Md Yousuf and Lee, Kyungbok and Gallardo, Jhair and Krishnan, Giri and Kanan, Christopher},
  journal   = {Neural Information Processing Systems},
  year      = {2024},
}