SIESTA: Efficient Online Continual Learning with Sleep

Md Yousuf Harun1*, Jhair Gallardo1*, Tyler L. Hayes1†, Ronald Kemker2, Christopher Kanan3
1Rochester Institute of Technology, 2United States Space Force, 3University of Rochester

[* Equal contribution; † Now at NAVER LABS Europe]

Motivation

Most deep neural networks are trained once and then evaluated. In contrast, continual learning mimics how humans continually acquire new knowledge throughout their lifespan. Most continual learning research has focused on mitigating a phenomenon called catastrophic forgetting, in which neural networks forget past information. Despite remarkable progress toward alleviating catastrophic forgetting, existing algorithms remain compute-intensive and ill-suited for many resource-constrained real-world applications such as edge devices, mobile phones, robots, AR/VR, and virtual assistants. For continual learning to make a real-world impact, continual learning systems must be computationally efficient and rival traditional offline learning systems retrained from scratch as the dataset grows in size.

Towards that goal, we propose a novel online continual learning algorithm named SIESTA (Sleep Integration for Episodic STreAming). SIESTA uses a wake/sleep framework for training, which is well aligned with the needs of on-device learning. The major goal of SIESTA is to advance compute-efficient continual learning so that DNNs can be updated efficiently using far less time and energy. The principal innovations of SIESTA are: (1) rapid online updates using a rehearsal-free, backpropagation-free, and data-driven network update rule during its wake phase, and (2) expedited memory consolidation using a compute-restricted rehearsal policy during its sleep phase. SIESTA is far more computationally efficient than existing methods, enabling continual learning on ImageNet-1K in under 2 hours on a single GPU; moreover, in the augmentation-free setting it matches the performance of the offline learner, a milestone critical to driving adoption of continual learning in real-world applications.

SIESTA Outperforms Prior Art on the ImageNet-1K Dataset


SIESTA requires 7x-60x fewer network updates, 10x less memory, and 2x-20x fewer parameters than other methods. It needs only 1.9 hours to learn the full ImageNet-1K dataset, whereas other methods require many hours or even days on the same hardware!

How Efficient is SIESTA?

Our method, SIESTA, outperforms existing continual learning methods for class-incremental learning on ImageNet-1K while requiring fewer network updates and using fewer parameters, as denoted by circle size.


SIESTA Achieves "Zero Forgetting" - A Milestone

SIESTA matches the performance of the offline model while outperforming existing state-of-the-art methods such as ER, DER, and REMIND by large margins in continual learning on the ImageNet-1K dataset. In the augmentation-free setting, Cochran’s Q test reveals no significant difference between SIESTA’s final accuracy in the continual iid and class-incremental settings and that of the offline learner (p = 0.08). Therefore, SIESTA achieves “zero forgetting” by matching the performance of the offline model.
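For readers unfamiliar with the test, Cochran’s Q compares several learners on the same set of examples using only per-example correct/incorrect indicators. The sketch below shows how such a comparison could be run with statsmodels; it is not the authors’ evaluation code, and the three correctness arrays are hypothetical placeholders filled with random values.

import numpy as np
from statsmodels.stats.contingency_tables import cochrans_q

rng = np.random.default_rng(0)
N = 50_000  # e.g., the size of the ImageNet-1K validation set

# Placeholder 0/1 arrays: 1 if the learner classified test image i correctly.
correct_offline = rng.integers(0, 2, N)
correct_iid     = rng.integers(0, 2, N)
correct_cil     = rng.integers(0, 2, N)

# Rows = test images, columns = learners.
table = np.column_stack([correct_offline, correct_iid, correct_cil])
result = cochrans_q(table, return_object=True)
print(f"Q = {result.statistic:.3f}, p = {result.pvalue:.3f}")
# A large p-value (e.g., > 0.05) means we cannot reject the hypothesis that
# all learners have the same accuracy on these examples.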


SIESTA is Capable of Working with Arbitrary Orderings

In general, iid (shuffled) orderings do not cause catastrophic forgetting; at the other extreme, an ordering sorted by category causes severe catastrophic forgetting in conventional algorithms. When switching from the iid to the class-incremental setting, existing methods such as ER and REMIND fail to maintain their performance and forget more severely. In contrast, SIESTA maintains performance similar to the offline learner and achieves "zero forgetting" in both settings, demonstrating its robustness to data ordering.


SIESTA is Performant on Four Benchmark Datasets

SIESTA outperforms the state-of-the-art online continual learning method REMIND on four benchmark datasets. SIESTA learns the large-scale ImageNet-1K dataset (1.2M training samples) 3.4x faster than REMIND on the same hardware. Moreover, SIESTA provides a 4.4x speedup over REMIND on another large-scale dataset, Places365-Standard (1.8M training samples), using the same hardware.


Efficiency in the Large-Scale Dataset Regime

As the dataset grows, the gap in GFLOPs between SIESTA and existing methods widens significantly; SIESTA becomes far more efficient than other methods in the large-scale regime.


Online Updates with Offline Consolidation


An illustration of the online updates with offline consolidation paradigm. While awake, the agent performs online learning; while asleep, it performs computationally constrained offline learning. The wake/sleep cycles alternate. The proposed paradigm thus combines two existing paradigms: class-incremental batch learning and online learning. SIESTA operates in this framework.
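The control flow of this paradigm can be summarized with a short schematic. The sketch below is purely illustrative: the names (online_update, cache, should_sleep, rehearse) are hypothetical placeholders for the operations described above, not the authors’ API.

def continual_learning_loop(stream, agent, sleep_budget):
    """Alternate a wake phase (online learning on a stream) with a sleep
    phase (compute-constrained offline consolidation)."""
    for x, y in stream:                               # wake: one labeled sample at a time
        agent.predict(x)                              # inference on the current sample
        agent.online_update(x, y)                     # cheap, backpropagation-free update
        agent.cache(x, y)                             # store a compressed copy for rehearsal
        if agent.should_sleep():                      # e.g., after every K new classes
            agent.rehearse(num_updates=sleep_budget)  # constrained offline learning
    return agent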

How Does SIESTA Algorithm Work?


A high-level overview of SIESTA. During the Wake Phase, it transforms raw inputs into intermediate feature representations using network H. The inputs are then compressed with tensor quantization and cached. Then, weights belonging to recently seen classes in network F are updated with a running class mean using the output vectors from G. Finally, inference is performed on the current sample. During the Sleep Phase, a sampler uses a rehearsal policy to choose which examples should be reconstructed from the cached data for each mini-batch. Then, networks G and F are updated with backpropagation in a supervised manner. The wake/sleep cycles alternate.
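To make the two update rules concrete, here is a minimal PyTorch-style sketch under simplifying assumptions: it operates directly on feature vectors h produced by the frozen network H, omits the tensor quantization and caching step, and uses toy dimensions. It illustrates the mechanism described above rather than reproducing the authors’ implementation, and cache_sampler is a hypothetical stand-in for the rehearsal policy.

import torch
import torch.nn as nn
import torch.nn.functional as F_nn

embed_dim, hidden_dim, num_classes = 512, 512, 1000

G = nn.Sequential(nn.Linear(embed_dim, hidden_dim), nn.ReLU())  # plastic middle layers "G"
F_out = nn.Linear(hidden_dim, num_classes, bias=False)          # output layer "F"
class_counts = torch.zeros(num_classes)                         # samples seen per class

@torch.no_grad()
def wake_update(h, label):
    # Backprop-free wake update: the row of F for this class tracks a running
    # mean of G's output vector; inference is then run on the current sample.
    z = G(h)
    class_counts[label] += 1
    F_out.weight[label] += (z - F_out.weight[label]) / class_counts[label]
    return torch.argmax(F_out(z))

def sleep_phase(cache_sampler, num_updates, batch_size=256, lr=0.1):
    # Compute-constrained rehearsal: a fixed budget of supervised mini-batch
    # updates to G and F on examples drawn from the cache by the rehearsal policy.
    opt = torch.optim.SGD(list(G.parameters()) + list(F_out.parameters()), lr=lr)
    for _ in range(num_updates):
        h_batch, y_batch = cache_sampler(batch_size)
        loss = F_nn.cross_entropy(F_out(G(h_batch)), y_batch)
        opt.zero_grad()
        loss.backward()
        opt.step()

In deployment, wake_update would be called once per streamed sample, while sleep_phase would be invoked periodically with a budget of num_updates chosen to balance accuracy against compute, as discussed in the sleep-length study below.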

Sleep Enhances Learning

We ask the question “What is the impact of sleep on SIESTA’s ability to learn and remember?”. Examining the pre-sleep and post-sleep performance of SIESTA on ImageNet-1K, we see that performance after sleep is consistently higher than before sleep for all increments. Therefore, sleep greatly benefits online continual learning in DNNs.


Impact of Sleep Length

We study the impact of sleep length by varying the number of updates (m) during each sleep period, with SIESTA sleeping every 100 classes. Performance improves as sleep length increases; however, longer sleep also requires more updates, so accuracy must be balanced against efficiency.


Criteria for an Efficient Continual Learner

We argue that an ideal continual learner should have the following characteristics:

1. It should be capable of online learning and inference in a compute and memory constrained environment.

2. It should rival (or exceed) an offline learner, regardless of the structure of the training data stream.

3. It should be significantly more computationally efficient than training from scratch.

4. It should make no additional assumptions that constrain the supervised learning task, e.g., using task labels during inference.

Our method, SIESTA, meets all these criteria and thus aligns with real-world applications.

BibTeX

@article{harun2023siesta,
  title     = {{SIESTA}: Efficient Online Continual Learning with Sleep},
  author    = {Md Yousuf Harun and Jhair Gallardo and Tyler L. Hayes and Ronald Kemker and Christopher Kanan},
  journal   = {Transactions on Machine Learning Research},
  issn      = {2835-8856},
  year      = {2023},
  url       = {https://openreview.net/forum?id=MqDVlBWRRV},
}