Can interpretability of machine learning models be separated by model architecture? The intuition for why it might be is that mechanistic interpretability is developed on, and demonstrated with, transformers specifically. To find out whether those techniques carry over to neural network architectures in general, I will first read the mechanistic interpretability work from Anthropic, and then read other interpretability works.
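To make the question concrete, here is a toy sketch of my own (not taken from the Anthropic papers, and assuming PyTorch): one basic interpretability primitive, capturing intermediate activations, is architecture-agnostic, while the transformer-specific parts of mech interp, such as attention-head analysis, have no direct analogue in a plain MLP.

```python
import torch
import torch.nn as nn

def capture_activations(model: nn.Module, x: torch.Tensor) -> dict[str, torch.Tensor]:
    """Run a forward pass and record the tensor output of every submodule."""
    activations: dict[str, torch.Tensor] = {}
    hooks = []
    for name, module in model.named_modules():
        if name == "":  # skip the root module itself
            continue
        def hook(mod, inputs, output, name=name):
            # Some submodules (e.g. attention) return tuples; keep only plain tensors.
            if isinstance(output, torch.Tensor):
                activations[name] = output.detach()
        hooks.append(module.register_forward_hook(hook))
    try:
        model(x)
    finally:
        for h in hooks:
            h.remove()
    return activations

# The same capture works on a plain MLP...
mlp = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
mlp_acts = capture_activations(mlp, torch.randn(8, 16))

# ...and on a transformer block, but transformer-specific analyses
# (attention patterns, per-head decompositions) only exist for the latter.
block = nn.TransformerEncoderLayer(d_model=16, nhead=4, batch_first=True)
block_acts = capture_activations(block, torch.randn(8, 10, 16))

print(sorted(mlp_acts), sorted(block_acts))
```

So at least the activation-capture step transfers across architectures; whether the rest of the transformer circuit-analysis toolkit does is the part I still need to read up on.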