Table of contents
- 🐋 Part 1: Motivation, Background, and Prior Work
- 🐋 Part 2: Concept and Architecture
- 🐋 Part 3: Modules and Branches
- 🐋 Part 4: Hardware and Training
- 🐋 Part 5: Evaluations and Discussion
Native Sparse Attention (NSA) is a natively trainable, hardware-aligned sparse attention mechanism for Transformer architectures: it replaces full attention with hardware-friendly sparse patterns so that long-context modeling stays efficient in both training and inference.
In this series, we begin with the motivation, background, and prior work in sparse attention, then dive into the NSA architecture, its modules and branches, and hardware and training considerations, before closing with evaluations and broader discussion.
A Deep Research series
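Before Part 1, here is a minimal, self-contained sketch of the general block-sparse attention idea that NSA builds on: each query attends only to a few selected key blocks plus a local window, so cost scales with the number of kept blocks rather than the full sequence length. The block size, the mean-pooled block-scoring rule, and the window length below are illustrative assumptions, not NSA's actual algorithm or kernels, which later parts cover in detail.

```python
# Toy block-sparse attention in NumPy (illustrative only, not NSA's method):
# each query attends to (a) its top-scoring key blocks and (b) a local
# sliding window, under a causal mask.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def block_sparse_attention(q, k, v, block_size=16, blocks_per_query=4, window=32):
    """q, k, v: (T, d). Returns (T, d) outputs under a causal block-sparse pattern."""
    T, d = q.shape
    n_blocks = (T + block_size - 1) // block_size
    # Coarse block scores: each query vs. the mean-pooled key of each block.
    block_keys = np.stack(
        [k[i * block_size:(i + 1) * block_size].mean(0) for i in range(n_blocks)]
    )
    out = np.zeros_like(q)
    for t in range(T):
        # Only blocks that start at or before position t are eligible (causality).
        valid = np.arange(n_blocks) * block_size <= t
        scores = block_keys @ q[t]
        scores[~valid] = -np.inf
        top = np.argsort(scores)[::-1][:blocks_per_query]
        # Gather token indices: local window plus the selected blocks, clipped to t.
        idx = set(range(max(0, t - window + 1), t + 1))
        for b in top:
            if valid[b]:
                idx.update(range(b * block_size, min((b + 1) * block_size, t + 1)))
        idx = np.array(sorted(idx))
        w = softmax(q[t] @ k[idx].T / np.sqrt(d))
        out[t] = w @ v[idx]
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d = 256, 64
    q, k, v = (rng.standard_normal((T, d)).astype(np.float32) for _ in range(3))
    print(block_sparse_attention(q, k, v).shape)  # (256, 64)
```

The block granularity is what makes such patterns hardware-friendly: contiguous blocks map onto coalesced memory reads and dense tile computations, a theme Part 4 returns to.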