Unlock: Forgetting Transformer (FoX)
FoX adds a data-dependent forget gate to softmax attention. The gate down-weights unnormalized attention scores between past and present positions, giving the transformer a learned, recency-biased decay. FoX is FlashAttention-compatible, works without positional embeddings, and improves long-context language modeling and length extrapolation.
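Below is a minimal, naive sketch of the forgetting-attention idea described above, assuming a single head with precomputed q/k/v and one pre-sigmoid forget-gate logit per time step. The function name `forgetting_attention` and all tensor names are illustrative, not the paper's reference code (a practical implementation would fold the decay into a FlashAttention-style kernel rather than materialize the full T×T bias).

```python
# Naive O(T^2) sketch of forgetting attention (single head, no projections shown).
import torch
import torch.nn.functional as F

def forgetting_attention(q, k, v, fgate_logits):
    """q, k, v: (T, d) tensors; fgate_logits: (T,) pre-sigmoid forget-gate logits."""
    T, d = q.shape
    log_f = F.logsigmoid(fgate_logits)          # log f_t, data-dependent gates f_t in (0, 1)
    c = torch.cumsum(log_f, dim=0)              # c_t = sum_{l <= t} log f_l
    # D[i, j] = c_i - c_j = sum_{l = j+1 .. i} log f_l: the learned decay added to the
    # unnormalized attention score between query position i and earlier key position j.
    D = c.unsqueeze(1) - c.unsqueeze(0)         # (T, T)
    scores = (q @ k.T) / d ** 0.5 + D           # biased, unnormalized attention scores
    mask = torch.tril(torch.ones(T, T)).bool()  # causal mask: keys j <= query i
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v        # (T, d)

# Tiny usage example with random inputs.
T, d = 8, 16
q, k, v = (torch.randn(T, d) for _ in range(3))
fgate_logits = torch.randn(T)                   # in FoX these would come from the token inputs
out = forgetting_attention(q, k, v, fgate_logits)
print(out.shape)                                # torch.Size([8, 16])
```

Note that no positional embedding is applied in this sketch: the data-dependent decay bias D is the only position-dependent signal, which is consistent with the claim above that FoX works without positional embeddings.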
Prerequisites: 173 total · 0 mastered · 0 working · 145 gaps (prerequisite mastery 16%)
Recommended probe
Chernoff Bounds is your weakest prerequisite with available questions; you haven't been assessed on this topic yet (3 questions, no quiz).
Other prerequisite topics:
- Attention Mechanism Theory (Research) — not assessed, 11 questions
- Recurrent Neural Networks (Advanced) — not assessed, 3 questions, no quiz
- Transformer Architecture (Research) — not assessed, 11 questions