Multi-head Attention is the same as a Linear transformation with less computation



1) Multi-head Attention is the same as a Linear transformation because it has the same property: "Every output depends on every input".

2) Multi-head Attention does less computation because it calculates the weights on the fly, whatever the input length, unlike a Linear transformation, which stores the weights and expects an input of fixed length. An additional effect is a smaller model, because it has fewer weights.


Multi-head Attention uses this equation: Query * Transposed Key * Value, where Query = Key = Value = Input. The Query * Transposed Key part is the core concept of Multi-head Attention. With this part of the equation, Multi-head Attention calculates the weights (as shown in the video).
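
Here is a minimal sketch of that simplified equation in plain Python. It assumes Query = Key = Value = Input as stated above and leaves out everything else (no softmax, no scaling, no heads). The helper names and the token values are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def simplified_attention(tokens):
    """Compute (Query * Transposed Key) * Value with Query = Key = Value = tokens."""
    q = k = v = tokens
    # The weights are calculated from the input itself, nothing is stored.
    weights = matmul(q, transpose(k))  # shape: (num_tokens, num_tokens)
    output = matmul(weights, v)        # shape: (num_tokens, N)
    return weights, output


# Three tokens, each represented by N = 2 numbers (made-up values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, output = simplified_attention(tokens)

# The same function accepts a different number of tokens without any change.
short_weights, short_output = simplified_attention(tokens[:2])
```

Each row of `output` mixes all rows of the input, which is the "every output depends on every input" property, and the size of `weights` follows the number of tokens instead of being fixed in advance.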


Disclaimer: The equation is incomplete. It leaves out the learned projections, the scaling, and the softmax.


For context: in a Large Language Model, a token such as "hello" is represented by more than one number, e.g. the vector [1, 2, ..., N] represents the "hello" token. Our goal is to calculate outputs where every output depends on every input. With Multi-head Attention, the input size is N. With a Linear transformation, the input size is the number of tokens (e.g. ["hello", "_", "world"] is 3 tokens) multiplied by N.
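
To put rough numbers on this, here is a small sketch that counts stored weights. It assumes the Linear transformation maps the whole flattened sequence to an output of the same size, and it assumes the attention side stores three N x N projections (for Query, Key, and Value), which the simplified equation above does not include. Biases are ignored and the function names are made up for illustration.

```python
def dense_weight_count(num_tokens, n):
    """Weights in one Linear layer over the flattened sequence (num_tokens * N inputs and outputs)."""
    flat = num_tokens * n
    return flat * flat


def attention_projection_weight_count(n):
    """Stored weights assuming one N x N projection each for Query, Key, and Value."""
    return 3 * n * n


n = 4  # numbers per token (made-up value)
for num_tokens in (3, 6):
    dense = dense_weight_count(num_tokens, n)
    attention = attention_projection_weight_count(n)
    print(f"{num_tokens} tokens: linear = {dense} weights, attention = {attention} weights")
```

Under these assumptions the Linear count grows with the square of the number of tokens (144, then 576), while the attention count stays at 48 because it depends only on N.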


Note: The video was made by "Animated AI" on YouTube. The video title is "Multihead Attention's Impossible Efficiency Explained".


#transformer #largelanguagemodel

