Multilayer Perceptron (MLP) in the Transformer model is where the model store the facts (well, sort of)

Multilayer Perceptron (MLP) in the Transformer model is where the model stores the facts (well, sort of). In other words, if the input is "Michael Jordan plays the sport of ____". Then, with some magic called linear projections, we can recover the token "basketball".


"Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neural Level (Post 1). by Google DeepMind.

PyTorch support Flash Attention natively