The loss is 0, but why are the gradients not 0?

Hi, my goal is to share a trick that I found in the following paper. Knowing this trick lets you calculate the impact of a steering vector on a specific layer by taking the matrix product of the gradients and the steering vector.


You can check out this Pull Request to run it. The Pull Request uses NDIF, so you can run it on your own laptop instead of a workstation node with at least 40 GB of VRAM.


Pull Request https://github.com/FlyingPumba/steering-thinking-models/pull/1

NDIF https://nnsight.net/notebooks/tutorials/start_remote_access/


The paper: Understanding Reasoning in Thinking Models via Steering Vectors


Though sharing this would expose my weakness in math 🤣


I was curious why the paper calculates the loss between `logits.detach()` and `logits`.
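I don't reproduce the paper's exact code here; this is only a sketch, under my own assumptions, of what a loss between `logits.detach()` and `logits` can look like when built with `log_softmax` and KL divergence. The shapes and the `reduction` argument are illustrative:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(1, 5, requires_grad=True)

log_p = F.log_softmax(logits, dim=-1)           # live distribution
log_q = F.log_softmax(logits, dim=-1).detach()  # same values, but a constant

# KL(q || p): with log_target=True both arguments are log-probabilities.
loss = F.kl_div(log_p, log_q, log_target=True, reduction="sum")

# Pointwise, the loss is exp(log_q) * (log_q - log_p); since log_q and
# log_p are numerically identical, every term is exactly 0.
print(loss.item())  # 0.0
```

So the loss value itself is exactly zero; the interesting part is what happens during backpropagation.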


These are the two questions I wanted to answer:

1. The loss is 0.0, but why are the gradients not 0.0?

2. How does it work?


The paper uses `log_softmax` and KL divergence, which made my head dizzy when I tried to calculate it by hand. So, for the sake of my mental health, I simplified the loss to `loss = (logits.detach() - logits).sum()`.
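With that simplified loss, the effect is easy to reproduce end to end. Here is a minimal sketch using a toy linear layer in place of a full model (the names `x`, `w` are mine, not from the paper's code):

```python
import torch

torch.manual_seed(0)
x = torch.randn(4)
w = torch.randn(4, 3, requires_grad=True)

logits = x @ w  # toy "model": a single linear map

# Identical values subtracted, so the loss is exactly 0...
loss = (logits.detach() - logits).sum()
loss.backward()

print(loss.item())  # 0.0
# ...yet the gradient of w is not 0: since d(loss)/d(logits_j) = -1,
# the chain rule gives w.grad[i, j] = -x[i].
print(w.grad)
```

The gradients come out nonzero even though the loss value is exactly zero.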


The explanation is in the code: https://gist.github.com/jasonrichdarmawan/90b8d992232f40e721d5adf64caad609


TL;DR: when we call `logits.detach()`, the result is treated as a constant during backpropagation. The loss is therefore effectively `constant - logits`, whose gradient with respect to `logits` is -1 everywhere. That's why the gradients are not 0.
