I want to give you a headache: Empirical Rule

I want to give you a headache.


In ChatGPT-style models, tying the embedding weights with the final linear (unembedding) weights is theoretically and empirically bad for model performance. But people still do it for smaller models, because there the performance hit is not that large.
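For concreteness, here is a minimal PyTorch sketch of what "tying" means (a hypothetical toy module, not any lab's actual code): the final linear layer reuses the very same weight tensor as the token embedding.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Toy illustration of weight tying between embedding and unembedding."""
    def __init__(self, vocab_size: int = 50257, d_model: int = 768, tie_weights: bool = True):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, d_model)                 # token embedding
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)    # final linear (unembedding)
        if tie_weights:
            # Tying: embedding and unembedding share one parameter tensor.
            self.lm_head.weight = self.wte.weight

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model) output of the transformer blocks (omitted here)
        return self.lm_head(hidden)  # logits over the vocabulary
```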


It's as if the Deep Learning community wants to tell you: "whoever has the most compute to experiment wins the AI race, because empirical factors dominate and there is a limit to thinking logically."


Why is tying embedding weights theoretically bad? Tying the embedding weights with the final linear weights imposes symmetry: if the probability of "Obama" following "Barack" is high, the model is also pushed to assign a high probability to "Barack" following "Obama", whereas with untied (asymmetric) weights that need not be the case.


By symmetric I mean: with tied weights the unembedding vector of a token is the same as its embedding vector, so for any two tokens with vectors a and b the score is a \cdot b = b \cdot a; the dot product is commutative, so the relation between the two tokens is forced to be symmetric.
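A tiny numeric illustration of that symmetry, under the simplifying (hypothetical) assumption that we look only at the direct embedding-to-unembedding path and ignore what the transformer blocks do in between:

```python
import torch

torch.manual_seed(0)
vocab_size, d_model = 10, 8
E = torch.randn(vocab_size, d_model)   # tied: embedding matrix == unembedding matrix

# Direct-path scores between tokens: scores[i, j] = E[i] . E[j]
scores = E @ E.T
print(torch.allclose(scores, scores.T))  # True: score(i -> j) == score(j -> i)
```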


Note: Tying embedding weights is beneficial for smaller models (roughly under 8 billion parameters) because it removes a significant share of the parameters (e.g. in GPT-2 with 124 million parameters, the embedding matrix is about 30% of the total, so an untied head would add roughly that many parameters again). However, the benefit diminishes as the number of layers increases, so bigger models (larger than ~8 billion parameters) should not tie the embedding weights.
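Back-of-the-envelope arithmetic behind the ~30% figure, using GPT-2 small's public vocabulary and hidden sizes (a rough sketch, not an exact parameter count):

```python
vocab_size, d_model = 50257, 768     # GPT-2 small vocabulary and hidden size
tied_total = 124_000_000             # commonly quoted (tied) parameter count

embedding = vocab_size * d_model     # ~38.6M parameters in the token embedding
print(f"embedding matrix: {embedding / 1e6:.1f}M params")
print(f"share of the 124M total: {embedding / tied_total:.0%}")   # ~31%
print(f"an untied lm_head would add another {embedding / 1e6:.1f}M on top")
```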


Note 2: Qwen2 7B does not tie the embedding weights.
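A quick way to check this yourself with Hugging Face transformers (the `tie_word_embeddings` config field controls tying; checkpoint names assumed, and downloading the configs requires network access):

```python
from transformers import AutoConfig

# Compare a small Qwen2 checkpoint (reported to tie weights) with the 7B one.
for name in ["Qwen/Qwen2-0.5B", "Qwen/Qwen2-7B"]:
    cfg = AutoConfig.from_pretrained(name)
    print(name, "tie_word_embeddings =", cfg.tie_word_embeddings)
```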


Note 3: The Qwen2 paper does not tell you the reason. Another headache. As if the Deep Learning community wants to tell you "figure it out yourself".


#largelanguagemodel

The core tool of Mechanistic Interpretability for Large Language Models does not help probes generalize OOD