An Impartial Take to the CNN vs Transformer Robustness Contest

by Francesco Pinto, Philip H.S. Torr and Puneet K. Dokania

Published at the European Conference on Computer Vision (ECCV), 2022

Abstract

Following the surge of popularity of Transformers in Computer Vision, several studies have attempted to determine whether they could be more robust to distribution shifts and provide better uncertainty estimates than Convolutional Neural Networks (CNNs). The almost unanimous conclusion is that they are, and it is often conjectured, more or less explicitly, that this supposed superiority is to be attributed to the self-attention mechanism. In this paper we perform extensive empirical analyses showing that recent state-of-the-art CNNs (particularly ConvNeXt) can be as robust and reliable as, or sometimes even more so than, the current state-of-the-art Transformers. However, there is no clear winner. Therefore, although it is tempting to declare the definitive superiority of one family of architectures over the other, they seem to enjoy similarly strong performance on a variety of tasks while also suffering from similar vulnerabilities, such as texture, background, and simplicity biases.
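As a rough illustration of what such a comparison involves (a minimal sketch, not the paper's evaluation code), one can load comparably sized pretrained ConvNeXt and Vision Transformer checkpoints from the `timm` library and measure their top-1 accuracy on a distribution-shifted test set; the model names are real `timm` identifiers, while the dataset path is a hypothetical placeholder for a corrupted or shifted ImageNet-style split.

```python
# Sketch: compare a CNN and a Transformer under distribution shift.
# Assumes `timm`, `torch`, and `torchvision` are installed, and that
# `shifted_data_dir` points to an ImageFolder-style shifted test set.
import torch
import timm
from timm.data import resolve_data_config, create_transform
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader

shifted_data_dir = "path/to/shifted_imagenet_split"  # placeholder path

@torch.no_grad()
def top1_accuracy(model, loader, device="cuda"):
    model.eval().to(device)
    correct, total = 0, 0
    for images, labels in loader:
        logits = model(images.to(device))
        correct += (logits.argmax(dim=1).cpu() == labels).sum().item()
        total += labels.numel()
    return correct / total

for name in ["convnext_base", "vit_base_patch16_224"]:
    model = timm.create_model(name, pretrained=True)
    # Use the preprocessing each checkpoint was trained with.
    config = resolve_data_config({}, model=model)
    transform = create_transform(**config)
    loader = DataLoader(ImageFolder(shifted_data_dir, transform=transform),
                        batch_size=64, num_workers=4)
    print(f"{name}: top-1 accuracy under shift = {top1_accuracy(model, loader):.3f}")
```

The same loop can be repeated over several shift types (e.g., different corruption categories) to see, as the paper argues, that neither architecture family is a clear winner across the board.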

Presenter

Francesco Pinto is a PhD student in the Torr Vision Group at the University of Oxford. He is currently a visiting researcher at ETH Zurich in the group of Fanny Yang.

Links

Paper

Slides

Recording