Jason Rich Darmawan The loss is 0, but why are the gradients not 0? Hi, my goal is to share a trick that I found in the following paper. The use case of knowing this trick is the ability to calculate the impact of steering vectors on a specific layer by doing matrix... May 1, 2025 Blog
Jason Rich Darmawan Do you know why ChatGPT understands context within a sentence? Do you know why ChatGPT understands context within a sentence? For example, when you say "Michael Jordan is a basketball player. Who is Michael?", the model will respond "Michael Jordan is a basketb... Apr 22, 2025 Blog
Jason Rich Darmawan Is it true that you need at least 46 hours and $500 to pre-train a Large Language Model? Is it true that you need at least 46 hours and $500 to pre-train a Large Language Model*? A lab published an ICLR 2025 paper training a large language model with 850 million parameters, 2 million token... Apr 10, 2025 Large Language Model
Jason Rich Darmawan I want to give you a headache: human language does not follow a distribution of high-probability next words I want to give you a headache. Ari Holtzman et al. (2019) argued that high-quality human language does not follow a distribution of high-probability next words. In other words, as humans, we want gene... Apr 10, 2025 Large Language Model
Jason Rich Darmawan How is ChatGPT able to predict the next sub-word? How is ChatGPT able to predict that the next sub-word of the following sentence << The pizza was just baked from the oven. The pi >> is "zza" instead of "e"? So, in ChatGPT, there is something called lay... Apr 10, 2025 Large Language Model
Jason Rich Darmawan I want to give you a headache: the Empirical Rule I want to give you a headache. In the ChatGPT model, tying the embedding weights with the final linear weights is theoretically and empirically bad for the model performance. But people still do it for smaller ... Apr 10, 2025 Large Language Model
Jason Rich Darmawan The core tool of Mechanistic Interpretability for Large Language Models does not help probes generalize OOD The core tool of Mechanistic Interpretability for large language models does not help probes generalize OOD. The findings were publicized by the author himself. So, hopefully the other tool by Lee Sharke... Apr 10, 2025 Large Language Model
Jason Rich Darmawan The prerequisites of Mechanistic Interpretability of Large Language Models The prerequisites of Mechanistic Interpretability of Large Language Models 1. The Transformer Architecture A Transformer consists of multiple blocks. Each block consists of an attention mechanism and a mult... Apr 10, 2025 Large Language Model
Jason Rich Darmawan I want to give you a headache. Why can a small ChatGPT 2 with only 768 numbers to represent 1 meaning communicate with you? I want to give you a headache. Why can a small ChatGPT 2 with only 768 numbers to represent 1 meaning communicate with you? In general, a consequence of the Johnson-Lindenstrauss Lemma is that the number o... Apr 10, 2025 Large Language Model
Jason Rich Darmawan The Multilayer Perceptron (MLP) in the Transformer model is where the model stores the facts (well, sort of) The Multilayer Perceptron (MLP) in the Transformer model is where the model stores the facts (well, sort of). In other words, if the input is "Michael Jordan plays the sport of ____", then, with some magi... Apr 10, 2025 Large Language Model
Jason Rich Darmawan PyTorch supports Flash Attention natively PyTorch supports Flash Attention natively with F.scaled_dot_product_attention(q, k, v, is_causal=True). Flash Attention fuses multiple CUDA kernels (Matrix Multiplication, Dropout, Softmax, Mask, Mat... Apr 10, 2025 Large Language Model
Jason Rich Darmawan Multi-head Attention is the same as a Linear transformation with less computation Multi-head Attention is the same as a Linear transformation with less computation. 1) Multi-head Attention is the same as a Linear transformation because it has the same property of "Every output depend... Apr 10, 2025 Large Language Model