Welcome to the walkthrough of the GPT large language model! Here we'll explore the model nano-gpt, with a mere 85,000 parameters.
Its goal is a simple one: take a sequence of six letters, for example C B A B B C, and sort them in alphabetical order, i.e. to "ABBBCC".
We call each of these letters a token, and the set of the model's different tokens makes up its vocabulary:
| token | A | B | C |
|---|---|---|---|
| index | 0 | 1 | 2 |
From this table, each token is assigned a number, its token index, so the sequence C B A B B C becomes 2 1 0 1 1 2. We can now enter this sequence of numbers into the model.
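As a minimal sketch, this lookup is just a small dictionary mapping from letter to index (the helper name `tokenize` here is only for illustration):

```python
# Tokenization sketch: map each letter to its index in the fixed vocabulary above.
VOCAB = {"A": 0, "B": 1, "C": 2}

def tokenize(letters: str) -> list[int]:
    """Turn a string of letters into a list of token indices."""
    return [VOCAB[ch] for ch in letters]

print(tokenize("CBABBC"))  # [2, 1, 0, 1, 1, 2]
```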
In the 3D view, each green cell represents a number being processed, and each blue cell is a weight.
Each number in the sequence first gets turned into a 48-element vector (a size chosen for this particular model). This is called an embedding.
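In code, the embedding step amounts to a row lookup into a weight matrix with one 48-element row per vocabulary token. The sketch below uses random placeholder values rather than the model's trained weights:

```python
import numpy as np

# Embedding sketch: a lookup into a (vocab_size x 48) weight matrix.
# The values here are random placeholders, not the trained weights.
rng = np.random.default_rng(0)
n_vocab, n_embd = 3, 48
token_embedding = rng.normal(size=(n_vocab, n_embd))

tokens = [2, 1, 0, 1, 1, 2]            # the example sequence as token indices
embeddings = token_embedding[tokens]   # one 48-element vector per token
print(embeddings.shape)                # (6, 48)
```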
The embedding then flows down through the model, passing through a series of layers, called transformer layers, before reaching the bottom.
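Structurally, that pass is just each layer consuming the previous layer's output while the shape stays the same. The sketch below stands in for the real layers (which apply self-attention and an MLP, covered later); the layer count of 3 is only an assumed value for illustration:

```python
import numpy as np

def transformer_layer(x: np.ndarray) -> np.ndarray:
    # Stand-in for a single layer: the real layer applies self-attention and an
    # MLP to the (6, 48) matrix; here we only show that the shape is preserved.
    return x

def forward(embeddings: np.ndarray, n_layers: int = 3) -> np.ndarray:
    x = embeddings
    for _ in range(n_layers):     # each layer consumes the previous layer's output
        x = transformer_layer(x)
    return x                      # still (6, 48) when it reaches the bottom
```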
So what's the output? A prediction of the next token in the sequence. At the 6th entry, we get a probability for each of 'A', 'B', and 'C' being the next token.
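Those probabilities come from giving each vocabulary token a score and normalizing the scores with a softmax. A small sketch of that conversion, using made-up scores rather than the model's real outputs:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    # Subtract the max for numerical stability, then normalize to probabilities.
    exps = np.exp(logits - logits.max())
    return exps / exps.sum()

# Illustrative scores (logits) at the 6th position; real values come from the model.
logits = np.array([4.0, 0.5, -1.0])   # one score each for 'A', 'B', 'C'
print(softmax(logits))                # ~[0.96, 0.03, 0.01]
```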
In this case, the model is pretty sure it's going to be 'A'. Now, we can feed this prediction back into the top of the model, and repeat the entire process.
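That feedback loop can be sketched as below. Here `model` is a hypothetical callable standing in for the full forward pass plus softmax, `n_new` is just how many tokens we want to produce, and greedily picking the most likely token is one simple decoding choice:

```python
import numpy as np

def generate(model, tokens: list[int], n_new: int) -> list[int]:
    # `model` is a hypothetical callable: given the token indices so far, it
    # returns the probabilities over {'A', 'B', 'C'} for the next token.
    tokens = list(tokens)
    for _ in range(n_new):
        probs = model(tokens)
        next_token = int(np.argmax(probs))  # greedily pick the most likely token
        tokens.append(next_token)           # feed the prediction back into the model
    return tokens
```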