The core tool of mechanistic Interpretability for large language model does not help probe generalize OOD.
The findings are publicized by the author himself. So, hopefully the other tool by Lee Sharkey from Apollo Research, Attribution-based Parameter (APD), helps probe generalize OOD.
The mechanistic interpretability is important. For example, one of the recent research shows that we can debias professions from the gender feature.
It’s just not scalable and useful yet—does not help probe generalize OOD [Hurt].
#largelanguagemodel #mechanisticinterpretability