What Variables Affect Out-Of-Distribution Generalization in Pretrained Models?

Md Yousuf Harun1*, Kyungbok Lee2*, Jhair Gallardo1, Giri Krishnan3, Christopher Kanan2
1Rochester Institute of Technology, 2University of Rochester, 3Georgia Tech

*Equal contribution

Motivation

Embeddings from pre-trained deep neural networks (DNNs) are widely used across computer vision; however, the efficacy of these embeddings when used for downstream tasks can vary widely. We seek to understand what variables affect out-of-distribution (OOD) generalization. We do this through the lens of the tunnel effect hypothesis, which states that after training an over-parameterized DNN, its layers form two distinct groups. The first consists of the initial DNN layers that produce progressively more linearly separable representations, and the second consists of the deeper layers that compress these representations and hinder OOD generalization. Earlier work convincingly demonstrated that the tunnel effect exists for DNNs trained on low-resolution images (e.g., CIFAR-10) and suggested that it was universally applicable. Here, we study the magnitude of the tunnel effect as the DNN architecture, training dataset, image resolution, augmentations, and OOD dataset are varied. We show that in some cases the tunnel effect is completely mitigated, refuting the claim that the hypothesis is universally applicable. Through extensive experiments with 10,584 trained linear probes, we find that each variable plays a role, but some have more impact than others. Our results caution against extrapolating findings from models trained on toy datasets as if they were universally applicable.

The Tunnel Effect


The tunnel impedes OOD generalization, which we study by training linear probes on ID and OOD datasets at each layer. In this example, two identical VGGm-17 models are trained on the same ID dataset; only the image resolution differs. Probe accuracy on OOD datasets drops once the tunnel is reached (denoted by ⭐): the model trained on low-resolution (32x32) images creates a longer tunnel (layers 9-16) than the model trained on higher-resolution (224x224) images (layers 13-16). The Y-axis shows normalized accuracy. The OOD curve is the average over 8 OOD datasets, with the standard deviation shown as shading.
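As a concrete illustration, the following Python sketch shows how per-layer linear probes can be trained on frozen embeddings, assuming a PyTorch model whose features are collected with forward hooks; all names (model, layers, loaders) are placeholders rather than the released code.

import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def layer_embeddings(model, layer, loader, device="cpu"):
    """Collect frozen embeddings from one layer via a forward hook."""
    model.eval().to(device)
    feats, labels = [], []
    handle = layer.register_forward_hook(
        lambda mod, inp, out: feats.append(out.flatten(1).cpu()))
    for x, y in loader:
        model(x.to(device))
        labels.append(y)
    handle.remove()
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

def probe_accuracy(model, layer, train_loader, test_loader):
    """Fit a linear probe on one layer's embeddings; return test accuracy."""
    Xtr, ytr = layer_embeddings(model, layer, train_loader)
    Xte, yte = layer_embeddings(model, layer, test_loader)
    return LogisticRegression(max_iter=1000).fit(Xtr, ytr).score(Xte, yte)

# One probe per layer on an OOD dataset, normalized by the curve's maximum
# (placeholder names):
# ood_acc = np.array([probe_accuracy(model, l, ood_tr, ood_te) for l in layers])
# ood_norm = ood_acc / ood_acc.max()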

Is The Tunnel Effect Universal?

The tunnel effect is not universal. In (a), VGGm-11 with max pooling in all 5 stages (φ = 0.5) creates a tunnel (layers 7-10, gray-shaded area). In (b), the same VGGm-11 without max pooling in the first 2 stages (φ = 1, denoted VGGm†-11) eliminates the tunnel for all OOD datasets.

(a) Strong Tunnel Effect


(b) No Tunnel Effect


Augmentation Reduces The Tunnel Effect

In (a), augmentation shifts the start of the tunnel from layer 14 to layer 22; in (b), from block 11 to block 15. The OOD curve is the average over 8 OOD datasets, with shading indicating a 95% confidence interval. ⭐ denotes the start of the tunnel.

(a) ResNet


(b) ViT

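For reference, a typical augmentation pipeline of the kind compared here might look like the following torchvision sketch; the exact recipe used in the paper may differ, so treat the specific transforms and magnitudes as assumptions.

from torchvision import transforms

train_aug = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),  # random scaled crops
    transforms.RandomHorizontalFlip(),                    # mirror half the images
    transforms.ColorJitter(0.4, 0.4, 0.4),                # brightness/contrast/saturation
    transforms.ToTensor(),
])

# Augmentation-free baseline: deterministic resize only.
no_aug = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])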

Training on more classes greatly reduces the tunnel effect, whereas increasing dataset size has less impact

(1) and (2): results with a fixed number of samples but a varied number of classes.

(3) and (4): results with a fixed number of classes but a varied number of samples per class. (A sketch of constructing such subsets follows.)
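The following Python sketch shows one way to build such subsets from a generic labeled PyTorch dataset; make_subset and the counts in the comments are illustrative assumptions, not the paper's exact splits.

import random
from collections import defaultdict
from torch.utils.data import Subset

def make_subset(dataset, num_classes, samples_per_class, seed=0):
    """Keep `num_classes` classes with `samples_per_class` images each."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, (_, label) in enumerate(dataset):
        by_class[label].append(idx)
    classes = rng.sample(sorted(by_class), num_classes)
    indices = [i for c in classes
               for i in rng.sample(by_class[c], samples_per_class)]
    return Subset(dataset, indices)

# Fixed sample budget, varied class count (settings 1 and 2), e.g.:
#   make_subset(ds, num_classes=10,  samples_per_class=5000)
#   make_subset(ds, num_classes=100, samples_per_class=500)
# Fixed class count, varied samples per class (settings 3 and 4), e.g.:
#   make_subset(ds, num_classes=100, samples_per_class=250)
#   make_subset(ds, num_classes=100, samples_per_class=500)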

SHAP Analysis - % OOD Performance Retained

In terms of % OOD Performance Retained, ID class count shows the greatest impact.
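Read as an assumption (the paper gives the formal definition): % OOD performance retained compares the penultimate layer's OOD probe accuracy to the best OOD probe accuracy across all layers, so 100% means the tunnel costs nothing. A minimal Python sketch:

import numpy as np

def pct_ood_retained(ood_acc):
    """Penultimate-layer OOD accuracy as a % of the best layer's accuracy."""
    ood_acc = np.asarray(ood_acc)
    return 100.0 * ood_acc[-1] / ood_acc.max()

print(pct_ood_retained([0.42, 0.55, 0.61, 0.58, 0.40]))  # ~65.6: a tunnel formed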


SHAP Analysis - ID/OOD Alignment

In terms of ID/OOD alignment, image resolution shows the greatest impact.


SHAP Analysis - Pearson Correlation

In terms of Pearson correlation, ID class count shows the greatest impact.
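This metric is the Pearson correlation between the layer-wise ID and OOD probe-accuracy curves; when a tunnel forms, the OOD curve stops tracking the ID curve and the correlation drops. A minimal sketch with illustrative numbers:

from scipy.stats import pearsonr

id_acc  = [0.50, 0.65, 0.78, 0.86, 0.90]  # illustrative layer-wise ID curve
ood_acc = [0.40, 0.52, 0.61, 0.58, 0.40]  # illustrative OOD curve with a tunnel

r, _ = pearsonr(id_acc, ood_acc)
print(f"ID/OOD Pearson r = {r:.2f}")  # lower r signals a tunnel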

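A hedged sketch of how such a SHAP analysis can be set up: fit a regressor that maps the experiment variables to a target metric, then attribute its predictions to each variable. The file name and column names below are placeholders, and the variables are assumed to be numerically encoded.

import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingRegressor

df = pd.read_csv("experiments.csv")        # hypothetical: one row per trained model
X = df[["id_class_count", "resolution", "augmentation",
        "depth", "samples_per_class"]]     # experiment variables (encoded)
y = df["pct_ood_retained"]                 # target metric

reg = GradientBoostingRegressor().fit(X, y)
explainer = shap.TreeExplainer(reg)        # exact SHAP values for tree models
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)          # per-variable impact, as in the figures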

ID Dataset

The tunnel effect is observed across a variety of ID datasets, but its strength varies with the ID class count; it is not an artifact of any particular dataset, e.g., CIFAR-10.


Representation Compression


t-SNE comparison of VGGm-11 models trained on low-resolution (1st row) and high-resolution (2nd row) versions of the same ID dataset (ImageNet-100) in an augmentation-free setting. Layer 8 marks the start of the tunnel in the VGGm-11 trained on 32x32 images, whereas the 224x224 model forms no tunnel. Layer 10 is the penultimate layer. The tunnel layers (8-10) progressively compress representations at 32x32 resolution, whereas the corresponding layers at 224x224 resolution show no comparable compression. For clarity, we show 5 classes from ImageNet-100, each in a distinct color. The distinct clusters formed by the 32x32 model indicate representation compression and intermediate neural collapse, which impair OOD generalization.
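A minimal sketch of the visualization itself, assuming embeddings and labels gathered as in the probing sketch above; the helper below is illustrative, not the paper's code.

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(feats, labels, num_classes=5):
    """Project one layer's embeddings to 2-D and color a few classes."""
    mask = labels < num_classes                       # keep 5 classes for clarity
    pts = TSNE(n_components=2, init="pca",
               perplexity=30).fit_transform(feats[mask])
    plt.scatter(pts[:, 0], pts[:, 1], c=labels[mask], cmap="tab10", s=5)
    plt.show()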

The Tunnel Effect Hypothesis

Our study indicates that the best way to mitigate the tunnel effect, and thereby improve OOD generalization, is to increase the diversity of the ID training data, especially by increasing the number of semantic classes, applying augmentations, and using higher-resolution images. We therefore revise the tunnel effect hypothesis as follows:

An overparameterized N-layer DNN forms two distinct groups:

1. The extractor consists of the first K layers, creating linearly separable representations.

2. The tunnel comprises the remaining N - K layers, compressing representations and hindering OOD performance.

K is proportional to the diversity of the training inputs; if diversity is sufficiently high, K = N and no tunnel forms.
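To make K operational, one plausible heuristic (our assumption, not necessarily the paper's exact criterion) marks the tunnel at the first layer after which normalized OOD probe accuracy falls below its peak and never recovers:

import numpy as np

def tunnel_start(ood_norm, tol=0.02):
    """First layer (1-indexed) after which accuracy stays > `tol` below the
    peak; returns None when K = N and no tunnel forms."""
    ood_norm = np.asarray(ood_norm)
    below = ood_norm < ood_norm.max() - tol
    for k in range(len(ood_norm)):
        if below[k:].all():           # never recovers from layer k onward
            return k + 1
    return None

print(tunnel_start([0.7, 0.9, 1.0, 0.95, 0.8, 0.6]))  # -> 4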

Acknowledgements

This work was partly supported by NSF awards #2326491, #2125362, and #2317706.

BibTeX

@article{harun2024variables,
  title     = {What Variables Affect Out-Of-Distribution Generalization in Pretrained Models?},
  author    = {Harun, Md Yousuf and Lee, Kyungbok and Gallardo, Jhair and Krishnan, Giri and Kanan, Christopher},
  journal   = {arXiv preprint arXiv:2405.15018},
  year      = {2024}
}