Jason Rich Darmawan The loss is 0, but why are the gradients not 0? Hi, my goal is to share a trick that I found in the following paper. The use case of knowing this trick is the ability to calculate the impact of steering vectors on a specific layer by doing matrix... May 1, 2025 Blog
Jason Rich Darmawan Do you know why ChatGPT understands context within a sentence? Do you know why ChatGPT understands context within a sentence? For example, when you say "Michael Jordan is a basketball player. Who is Michael?", the model will respond "Michael Jordan is a basketb... Apr 22, 2025 Blog
Jason Rich Darmawan Is it true that you need at least 46 hours and $500 to pre-train a Large Language Model? Is it true that you need at least 46 hours and $500 to pre-train a Large Language Model*? A lab published an ICLR 2025 paper training a large language model with 850 million parameters, 2 million token... Apr 10, 2025 Large Language Model
Jason Rich Darmawan I want to give you a headache: human language does not follow a distribution of high-probability next words I want to give you a headache. Ari Holtzman et al. (2019) argued that high-quality human language does not follow a distribution of high-probability next words. In other words, as humans, we want gene... Apr 10, 2025 Large Language Model
Jason Rich Darmawan How is ChatGPT able to predict the next sub-word? How is ChatGPT able to predict that the next sub-word of the following sentence << The pizza was just baked from the oven. The pi >> is "zza" instead of "e"? So, in ChatGPT, there is something called lay... Apr 10, 2025 Large Language Model
Jason Rich Darmawan I want to give you a headache: the Empirical Rule I want to give you a headache. In the ChatGPT model, tying the embedding weights with the final linear weights is theoretically and empirically bad for the model performance. But people still do it for smaller ... Apr 10, 2025 Large Language Model
Jason Rich Darmawan The core tool of Mechanistic Interpretability for Large Language Models does not help probes generalize OOD The core tool of Mechanistic Interpretability for large language models does not help probes generalize OOD. The findings were publicized by the author himself. So, hopefully the other tool by Lee Sharke... Apr 10, 2025 Large Language Model
Jason Rich Darmawan The prerequisites of Mechanistic Interpretability of Large Language Models The prerequisites of Mechanistic Interpretability of Large Language Models 1. The Transformer Architecture A Transformer consists of multiple blocks. Each block consists of an attention mechanism and a mult... Apr 10, 2025 Large Language Model
Jason Rich Darmawan I want to give you a headache. Why can a small ChatGPT 2 with only 768 numbers to represent 1 meaning communicate with you? I want to give you a headache. Why can a small ChatGPT 2 with only 768 numbers to represent 1 meaning communicate with you? In general, a consequence of the Johnson-Lindenstrauss Lemma is that the number o... Apr 10, 2025 Large Language Model
Jason Rich Darmawan The Multilayer Perceptron (MLP) in the Transformer model is where the model stores the facts (well, sort of) The Multilayer Perceptron (MLP) in the Transformer model is where the model stores the facts (well, sort of). In other words, if the input is "Michael Jordan plays the sport of ____", then, with some magi... Apr 10, 2025 Large Language Model
Jason Rich Darmawan PyTorch supports Flash Attention natively PyTorch supports Flash Attention natively with F.scaled_dot_product_attention(q, k, v, is_causal=True). Flash Attention fuses multiple CUDA kernels (Matrix Multiplication, Dropout, Softmax, Mask, Mat... Apr 10, 2025 Large Language Model
Jason Rich Darmawan Multi-head Attention is the same as a Linear transformation with less computation Multi-head Attention is the same as a Linear transformation with less computation. 1) Multi-head Attention is the same as a Linear transformation because it has the same property of "Every output depend... Apr 10, 2025 Large Language Model