Longterm Wiki

Detection accuracy drops with newer generators

paper

Authors

Nam Hyeon-Woo·Kim Yu-Ji·Byeongho Heo·Dongyoon Han·Seong Joon Oh·Tae-Hyun Oh

Credibility Rating

Good (3/5)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Technical study examining Vision Transformers' attention mechanisms and the density of their spatial interactions, relevant to understanding model capacity, generalizability, and potential failure modes in deep learning systems.

Paper Details

Citations
38
3 influential
Year
2022

Metadata

arXiv preprint · primary source

Abstract

The favorable performance of Vision Transformers (ViTs) is often attributed to the multi-head self-attention (MSA). The MSA enables global interactions at each layer of a ViT model, which is a contrasting feature against Convolutional Neural Networks (CNNs) that gradually increase the range of interaction across multiple layers. We study the role of the density of the attention. Our preliminary analyses suggest that the spatial interactions of attention maps are close to dense interactions rather than sparse ones. This is a curious phenomenon, as dense attention maps are harder for the model to learn due to steeper softmax gradients around them. We interpret this as a strong preference for ViT models to include dense interaction. We thus manually insert the uniform attention to each layer of ViT models to supply the much needed dense interactions. We call this method Context Broadcasting, CB. We observe that the inclusion of CB reduces the degree of density in the original attention maps and increases both the capacity and generalizability of the ViT models. CB incurs negligible costs: 1 line in your model code, no additional parameters, and minimal extra operations.
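The abstract's notion of attention "density" can be made concrete with row entropy: a uniform (fully dense) attention row has maximal entropy, while a one-hot (sparse) row has near-zero entropy. The helper below is a minimal sketch of one such proxy measure; the paper's actual density metric may differ.

```python
import numpy as np

def attention_entropy(attn: np.ndarray) -> np.ndarray:
    """Per-query entropy of a row-stochastic attention map.

    attn: (num_queries, num_keys), each row sums to 1.
    Higher entropy means denser (closer-to-uniform) interactions;
    the maximum, log(num_keys), is reached by uniform attention.
    """
    eps = 1e-12  # avoid log(0) on exactly sparse rows
    return -(attn * np.log(attn + eps)).sum(axis=-1)

uniform = np.full((1, 4), 0.25)              # fully dense row
one_hot = np.array([[1.0, 0.0, 0.0, 0.0]])   # fully sparse row
dense_h = attention_entropy(uniform)[0]      # ~log(4)
sparse_h = attention_entropy(one_hot)[0]     # ~0
```

Comparing such entropies across layers is one way to observe the paper's claim that attention maps sit closer to the dense end of the spectrum.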

Summary

This paper investigates why Vision Transformers (ViTs) perform well, focusing on the role of attention density in multi-head self-attention (MSA). The authors find that ViTs naturally develop dense attention maps despite the learning difficulty this entails, suggesting a strong preference for dense interactions. They propose Context Broadcasting (CB), a simple method that explicitly injects uniform attention into each ViT layer to provide dense interactions. The approach reduces attention density in original maps while improving model capacity and generalizability, with minimal computational overhead and no additional parameters.
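Under the assumption that uniform attention over tokens amounts to emitting the token mean for every query, CB's "one line" can be sketched as blending each token with the global mean; the exact placement and scaling in the paper may differ.

```python
import numpy as np

def context_broadcast(tokens: np.ndarray) -> np.ndarray:
    """Hypothetical sketch of Context Broadcasting (CB).

    tokens: (num_tokens, dim) token embeddings at some ViT layer.
    Uniform attention returns the same vector for every query, namely
    the token mean; CB injects it by averaging each token with that
    mean. No learned parameters are added, matching the paper's claim
    of zero extra parameters.
    """
    return (tokens + tokens.mean(axis=0, keepdims=True)) / 2.0
```

In a full model, this single line would be applied to the token sequence inside each ViT layer, e.g. after the self-attention block.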

Cited by 1 page

Page | Type | Quality
AI-Driven Legal Evidence Crisis | Risk | 43.0

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 90 KB
# Scratching Visual Transformer’s Back with Uniform Attention

Nam Hyeon-Woo² Kim Yu-Ji²

Byeongho Heo¹ Dongyoon Han¹ Seong Joon Oh³ Tae-Hyun Oh²

¹NAVER AI Lab ²POSTECH ³University of Tübingen
This work was done during an internship at NAVER AI Lab.

###### Abstract

The favorable performance of Vision Transformers (ViTs) is often attributed to the multi-head self-attention (MSA).
The MSA enables global interactions at each layer of a ViT model, which is a contrasting feature against Convolutional Neural Networks (CNNs) that gradually increase the range of interaction across multiple layers.
We study the role of the density of the attention.
Our preliminary analyses suggest that the spatial interactions of attention maps are close to dense interactions rather than sparse ones.
This is a curious phenomenon, as dense attention maps are harder for the model to learn due to steeper softmax gradients around them.
We interpret this as a strong preference for ViT models to include dense interaction.
We thus manually insert the uniform attention to each layer of ViT models to supply the much needed dense interactions.
We call this method Context Broadcasting, CB.
We observe that the inclusion of CB reduces the degree of density in the original attention maps and increases both the capacity and generalizability of the ViT models.
CB incurs negligible costs: 1 line in your model code, no additional parameters, and minimal extra operations.

## 1 Introduction

After the success of Transformers (Vaswani et al., [2017](https://ar5iv.labs.arxiv.org/html/2210.08457#bib.bib40 "")) in language domains, Dosovitskiy et al. ([2021](https://ar5iv.labs.arxiv.org/html/2210.08457#bib.bib9 "")) extended them to Vision Transformers (ViTs), which operate almost identically to Transformers but for computer vision tasks.
Recent studies (Dosovitskiy et al., [2021](https://ar5iv.labs.arxiv.org/html/2210.08457#bib.bib9 ""); Touvron et al., [2021b](https://ar5iv.labs.arxiv.org/html/2210.08457#bib.bib38 "")) have shown that ViTs achieve superior performance on image classification tasks.

The favorable performance is often attributed to the multi-head self-attention (MSA) in ViTs (Dosovitskiy et al., [2021](https://ar5iv.labs.arxiv.org/html/2210.08457#bib.bib9 ""); Touvron et al., [2021b](https://ar5iv.labs.arxiv.org/html/2210.08457#bib.bib38 ""); Wang et al., [2018](https://ar5iv.labs.arxiv.org/html/2210.08457#bib.bib41 ""); Carion et al., [2020](https://ar5iv.labs.arxiv.org/html/2210.08457#bib.bib5 ""); Strudel et al., [2021](https://ar5iv.labs.arxiv.org/html/2210.08457#bib.bib32 ""); Raghu et al., [2021](https://ar5iv.labs.arxiv.org/html/2210.08457#bib.bib27 "")), which facilitates long-range dependency.¹

¹Long-range dependency is described in the literature with various terminologies: non-local, global, large receptive fields, etc.
Specifically, MSA

... (truncated, 90 KB total)