Self-Attention
Although neural networks have been around for a long time, it is mainly the development of the "self-attention" mechanism that has enabled models to understand relationships within massive amounts of data. Self-attention mimics human focus: instead of reading a document word by word, sequentially, it evaluates the relevance of every part of the input to every other part simultaneously.
Self-Attention enables modern AI to understand context by creating dynamic, weighted connections (relationships) between data points rather than treating them all equally.
1. Mechanism
Attention breaks down every input token (word) into three distinct vector representations (plus one scaling number), acting like a library lookup system:
- Query (Q): Your Search Term (e.g. "climate change")
  This represents the current word or data point that is "asking" for context - a specific piece of data that is "looking for" something (e.g., a verb looking for its subject). What am I looking for? What is this word trying to understand?
- Key (K): Book Titles (e.g. the titles of all the books available)
  This is the "index" or "label" for all other words in the sequence - a label representing what information each data point "contains". What do I offer? What information does this word have?
- Dimension (d): Scaling Factor (used to scale the scores down so they stay in a manageable range)
  This is the "dimensionality" of the Keys and Queries - a single number (a scalar) that tells the model how "long" those vectors are. For example, the Query and Key vectors can each contain 64 numbers, so d = 64.
- Value (V): The Books' Contents (open the matching books and extract the actual text, in proportion to the attention percentages)
  This is the actual information content of those words - the content or meaning that gets passed along if a match is found. What is my actual content? What information gets passed on?
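As a sketch of how these vectors come about: each token's embedding is multiplied by three learned weight matrices to produce its Query, Key, and Value. The NumPy example below uses random numbers in place of learned weights; the sizes (4 tokens, 8-dimensional embeddings, 64-dimensional Q/K/V) are illustrative assumptions, not fixed values.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 64   # 4 tokens, 8-dim embeddings, d = 64

X = rng.normal(size=(seq_len, d_model))   # token embeddings

# Three projection matrices (learned during training in a real model).
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = X @ W_q   # "what am I looking for?"
K = X @ W_k   # "what do I offer?"
V = X @ W_v   # "what content do I pass on?"

print(Q.shape, K.shape, V.shape)   # every token now has its own Q, K, V vector
```

Note that d (here 64) is simply the length of each Query and Key vector, which is why it later appears as the scaling factor.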
1.1. Scoring
The mechanism performs similarity matching by calculating a score: taking the dot product of the QUERY of one word with the KEYS of all other words, then dividing by the square root of the dimension d to keep the scores in a stable range. If a query matches a key, it gets a high score, telling the model that these two words have a strong relationship.
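The scoring step can be written as one matrix product: every Query is dotted with every Key at once, then scaled by the square root of d. A minimal NumPy sketch (random stand-in vectors, illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_k = 4, 64
Q = rng.normal(size=(seq_len, d_k))   # one Query per token
K = rng.normal(size=(seq_len, d_k))   # one Key per token

# Dot product of every Query with every Key, scaled by sqrt(d).
scores = Q @ K.T / np.sqrt(d_k)

print(scores.shape)   # (4, 4): one score for every (query word, key word) pair
```

Entry [i, j] of this matrix is how strongly word i's Query matches word j's Key.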
1.2. Weighting
After calculating the relevance scores, the scores are converted into WEIGHTS (probabilities that sum to 1) using a softmax function. This tells the model exactly what percentage of attention to pay to every other word; in other words, the model assigns a "weight" to each word.
If the model is processing the word "it," the attention mechanism might give a high weight (0.9) to "cat" and a low weight (0.01) to "the," allowing it to infer that "it" refers to "cat".
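The softmax itself is just exponentiation followed by normalization. Below, three hypothetical raw scores (made-up numbers for illustration) are turned into weights that sum to 1, with the highest score taking most of the attention:

```python
import numpy as np

# Hypothetical raw scores for the word "it" against three other words,
# e.g. ["cat", "the", "mat"] - the values are invented for illustration.
scores = np.array([2.2, -1.0, 0.3])

# Softmax: exponentiate, then normalize so the weights sum to 1.
weights = np.exp(scores) / np.exp(scores).sum()

print(weights)        # the highest score gets the largest share of attention
print(weights.sum())  # 1.0
```

Because of the exponential, even a moderate gap in scores produces a sharp gap in weights, which is how the model concentrates attention on the most relevant word.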
1.3. Aggregating
The model multiplies these weights by the VALUE of each word. This creates a weighted sum that updates the original data with its new, context-aware meaning.
After we multiply the WEIGHTS by the VALUES:
- If a word has a high weight (e.g., 0.9), most of its information is passed through.
- If a word has a low weight (e.g., 0.01), its information is mostly filtered out.
The output is not just the original word, but a weighted sum of the values of all other relevant words. This creates a context-aware embedding—a rich, numerical representation of the word that includes all its relationships to the surrounding text.
Note that this output is a one-dimensional context VECTOR for each word, but we process all words at once, comparing each word with every other word simultaneously, so we always end up with a two-dimensional attention MATRIX for the whole sentence.
So in everyday operation involving sentences, we are almost always dealing with matrices (e.g. the weight matrix, the value matrix, etc.).
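Putting the three steps together - score, softmax, weighted sum - gives the full attention computation in a few lines. This NumPy sketch again uses random stand-in matrices and illustrative sizes (4 tokens, d = 64):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_k = 4, 64
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))

# 1. Scoring: dot product of every Query with every Key, scaled by sqrt(d).
scores = Q @ K.T / np.sqrt(d_k)

# 2. Weighting: softmax over each row, so each word's weights sum to 1.
#    (Subtracting the row maximum first is a standard numerical-stability trick.)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# 3. Aggregating: each word's output is a weighted sum of all Value vectors.
context = weights @ V

print(context.shape)   # (4, 64): one context vector per word, stacked as a matrix
```

The result is exactly the matrix described above: each row is one word's context-aware vector, and stacking the rows gives the matrix for the whole sentence.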
2. Advantages
2.1. Parallel Processing
Unlike older AI models (RNNs/LSTMs) that processed data sequentially (step by step) and "forgot" early information, attention computes relationships between all words in a sentence at once. This allows it to:
- Capture Long-Range Dependencies: A word at the beginning of a paragraph can be directly linked to a word at the end, regardless of distance.
- Scale Efficiently: Because it can process everything in parallel, it can handle massive datasets, which is why it is used in models like GPT.
That is, it creates a direct "shortcut" between any two points in the data, meaning distance no longer hinders its understanding of complex relationships.
2.2. Multi-Head Attention
The model doesn't just do this once. It uses Multi-Head Attention (Multiple Perspectives), which means it runs the attention process multiple times in parallel (each a "head").
- Each head learns to focus on different types of relationships.
- For example, in the sentence "The cat sat on the mat because it was tired," one head might focus on syntax (what is the verb?), while another focuses on semantics (what does each noun mean?), and a third focuses on coreference (what does "it" refer to?).
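The heads above can be sketched as independent copies of the single-head attention, run in parallel and concatenated. In this hedged NumPy example, each head gets its own random Q/K/V (learned separately per head in a real model), and the head sizes (4 heads of 16 dimensions) are illustrative assumptions:

```python
import numpy as np

def attention(Q, K, V):
    """Single-head scaled dot-product attention (score, softmax, weighted sum)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
seq_len, n_heads, d_head = 6, 4, 16

# Each head has its own Q/K/V projections, so it can learn to track a
# different kind of relationship (syntax, semantics, coreference, ...).
heads = [
    attention(rng.normal(size=(seq_len, d_head)),
              rng.normal(size=(seq_len, d_head)),
              rng.normal(size=(seq_len, d_head)))
    for _ in range(n_heads)
]

out = np.concatenate(heads, axis=-1)   # head outputs joined side by side
print(out.shape)   # (6, 64): 4 heads of 16 dimensions each, per token
```

A real Transformer layer would additionally pass this concatenated output through one more learned projection, but the core idea - several attention "perspectives" computed in parallel - is all here.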