Academic Paper

An Adversarial Example for Direct Logit Attribution: Memory Management in GELU-4L

Document Type

Working Paper

Author

Janiak, Jett; Rager, Can; Dao, James; Lau, Yeu-Tong

Source

BlackboxNLP Workshop 2024, pages 232-237

Subject

Computer Science - Machine Learning
Computer Science - Artificial Intelligence

Language

Abstract

Prior work suggests that language models manage the limited bandwidth of the residual stream through a "memory management" mechanism, in which certain attention heads and MLP layers clear residual stream directions set by earlier layers. Our study provides concrete evidence for this erasure phenomenon in a 4-layer transformer, identifying heads that consistently remove the output of earlier heads. We further demonstrate that direct logit attribution (DLA), a common technique for interpreting the output of intermediate transformer layers, can give misleading results because it does not account for erasure.
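The erasure effect described in the abstract can be illustrated with a toy sketch. This is not the paper's code or model: the dimensions, the random unembedding matrix `W_U`, and the vectors `base`, `writer_out`, and `eraser_out` are all hypothetical, the final layer norm is ignored, and the eraser is assumed to cancel the writer's output exactly. Under those assumptions, DLA assigns the writer head a nonzero logit contribution even though its net effect on the final logits is zero.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 8, 5

# Hypothetical unembedding matrix (assumption; layer norm is ignored here).
W_U = rng.normal(size=(d_model, vocab))

base = rng.normal(size=d_model)        # residual stream from all other components
writer_out = rng.normal(size=d_model)  # output of an early "writer" head
eraser_out = -writer_out               # later head assumed to erase it exactly

final_resid = base + writer_out + eraser_out


def dla(component_out, unembed):
    """Direct logit attribution: project one component's output onto the logits."""
    return component_out @ unembed


writer_dla = dla(writer_out, W_U)      # looks like a real logit contribution
eraser_dla = dla(eraser_out, W_U)      # equal and opposite in this toy setup

logits_with = final_resid @ W_U
logits_without = base @ W_U            # writer ablated, so nothing is left to erase

print("writer DLA:", np.round(writer_dla, 3))
print("eraser DLA:", np.round(eraser_dla, 3))
print("net change in logits:", np.round(logits_with - logits_without, 3))
```

Read in isolation, either DLA vector suggests that the head pushes particular tokens up or down, while the last line shows that the pair together leaves the logits unchanged.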

Online Access

Open Access (arXiv)
