Table of contents
- GPT models predict the next word in a sequence based on probabilities, turning complex math into seemingly human-like text.
- Predicting the next token in a sequence with GPT-2 shows how AI can generate text, but it’s like chatting with a four-year-old—impressive for data scientists, not so much for the general public.
- AI doesn't understand text; it predicts the next token based on probability.
- Writing prompts in English for AI models is more efficient, accurate, and cost-effective.
- AI can get really weird with language translations, sometimes giving bizarre results like a recipe for cheese turning into a story about Frozen characters.
- Understanding language models means recognizing how words like "good," "great," and "fantastic" share similar sentiments through complex calculations and embeddings.
- Gaming GPUs power AI models with the same matrix calculations used in 3D games.
- Attention mechanisms in AI models decode how words relate to each other, enabling accurate predictions by understanding context.
- Adding randomness to text generation makes it more interesting and less predictable.
- Understanding temperature and top-p in language models can make or break your AI's performance.
- Varying the temperature in GPT models can make outputs more creative and less predictable.
GPT models predict the next word in a sequence based on probabilities, turning complex math into seemingly human-like text.
Welcome, my name's Alan Smith, and I'm going to talk about large language models and demystify a bit of the maths behind the magic of the GPT language models. A bit about myself: I've resisted the temptation to go into architecture, project management, or sales because I like writing code. Since 1995, when I started my career, I have focused solely on writing code. I'm a Microsoft MVP for artificial intelligence, and I've held MVP status since 2005 in various disciplines. I do a lot with the community, including organizing conferences and helping with coding dojos. During COVID, I was quite active on YouTube, discussing topics like GPT versus StarCraft. Unfortunately, the random nature of these language models meant that my demos didn't work as well in front of an audience as they did at home. However, I have all of that content on my YouTube channel, where I've edited out the parts where things went wrong.
First of all, I'd like to thank the other YouTubers and community members who have done a fantastic job helping me learn how these language models work. Some of you might have seen their channels. I especially recommend 3Blue1Brown, who has made a couple of series looking at the maths behind Transformer models, which are fantastic to watch.
When we interact with GPT, we might ask it a question like how to make cheese, and it provides a detailed response. Sometimes, we might think it's human, feel sorry for it, or get angry when it makes mistakes, but really, it's just maths happening behind the scenes. Hopefully, I'll be able to explain a bit about how that maths works. The only thing these models do, including GPT and many other large language models, is predict the probabilities of each token being the next token in a sequence given a sequence of tokens. That's it. We put a lot of code and other elements around it to make it work.
For example, if we say "the cat sat in the," GPT will predict the most probable next token. It might come out with "4" as the most probable token, at 7.64%, so we add that to the sequence. We send the extended sequence through GPT again, and it might come out with "comma" at 26.14%, so we add a comma. We repeat this process, and that's why we see a stream of words or tokens appearing as GPT makes its predictions. These sequences can be of variable length, and the order is also important. For instance, "the film was not bad, in fact, it was very good" has a different sentiment compared to "the film was not good, in fact, it was very bad," just by swapping the position of two words.
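As a rough sketch of the loop just described, assuming the Hugging Face transformers package and the small GPT-2 checkpoint, the idea is simply: predict, append the most probable token, and feed the longer sequence back in.

```python
# Minimal sketch of greedy next-token generation with GPT-2 (assumed setup:
# Hugging Face transformers). Each iteration appends the most probable token.
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer.encode("The cat sat on the", return_tensors="pt")
for _ in range(10):
    with torch.no_grad():
        logits = model(input_ids).logits          # shape: (1, seq_len, vocab_size)
    next_id = logits[0, -1].argmax()              # most probable next token
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))
```

Always taking the top token like this is what makes raw generation predictable; the temperature and top-p discussion later in the talk is about relaxing exactly this step.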
Long-term dependencies are also crucial. For example, in "the cat sat on the floor; it had just been cleaned," the word "it" refers to the floor, whereas in "the cat sat on the floor; it had just been fed," "it" refers to the cat. The language model needs to understand how these words relate to each other, and that takes place in the attention mechanism inside these Transformer models.
Predicting the next token in a sequence with GPT-2 shows how AI can generate text, but it’s like chatting with a four-year-old—impressive for data scientists, not so much for the general public.
Predicting what tokens are going to come next in a sequence is a fundamental task in language models. This process involves generating tokens one at a time based on the preceding tokens. Additionally, we can employ retrieval augmented generation, where we incorporate information, typically from a search engine, to enhance the predictions. Despite this, the core task remains predicting the next token in the sequence.
Moreover, these models can utilize chat history to maintain context. For instance, when asked, "What is the capital of the Netherlands?", the model, which has no state of its own, needs the entire chat history in order to understand the context. These models can handle extensive histories, sometimes up to 4,000 or 16,000 tokens; this context window limits how much data they can process, including anything added by retrieval augmented generation.
I am currently working with GPT-2. Jody's session this morning was fantastic, providing insights into GPT-2. One reason for my focus on GPT-2 is its portability; I can run it on a computer that fits in my backpack, unlike other language models. Additionally, GPT-2 is open source, allowing me to download the source code, insert breakpoints, and explore its workings.
For example, if I set GPT-2 as the startup file and run it, it initializes as a friendly chatbot. When asked, "What is the capital of the Netherlands?", it responds, "The Dutch capital is Amsterdam." This interaction demonstrates the model's ability to generate syntactically correct text. However, the general public might not find such a chatbot impressive, as it often produces simplistic or incorrect answers, akin to conversing with a four-year-old. Data scientists, however, see potential in improving performance by feeding the model more data, parameters, and GPUs.
In another demonstration of sequence prediction, I set up a notebook to show how the model predicts the next token. By importing libraries such as PyTorch and Transformers, along with the GPT-2 tokenizer and GPT-2 LM head model classes from Hugging Face, I can create a tokenizer and load the model. Despite the limitations of my laptop's GPU, I define text strings to feed through the model and print them out. The model then predicts the next token in a sequence such as "The cat sat on the..." and provides probabilities for various continuations.
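A sketch of the kind of notebook cell described above, assuming the Hugging Face transformers package, would look roughly like this: run a single forward pass and print the ten most probable continuations.

```python
# Sketch: show the next-token probability distribution for a prompt (assumed
# setup with Hugging Face transformers and the small GPT-2 checkpoint).
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer.encode("The cat sat on the", return_tensors="pt")
with torch.no_grad():
    logits = model(input_ids).logits[0, -1]       # logits for the next token only
probs = torch.softmax(logits, dim=-1)

top = torch.topk(probs, 10)                       # ten most probable continuations
for p, tok_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode([tok_id.item()]):>10s}  {p.item():.2%}")
```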
AI doesn't understand text; it predicts the next token based on probability.
For instance, when we say "the king of rock," we can switch to a demo that uses highlighting and pie charts and press Control F5 to start without debugging. This takes a few seconds to load the model. The demo takes sequences of text, converts them into tokens, sends them into the GPT-2 model, and then shows the actual probability distribution.
The model displays various distributions, showing the probability of all the 50,000 or so tokens. Imagine this as a fairground wheel where the grand prize is a small wedge: when you spin the wheel, it lands on one of those segments. That is similar to how choosing from the probability distribution works. For example, for Helsinki, the model predicted "Helsinki is the capital of Finland" with a probability of about 42.76%. When predicting "the king of rock and roll is Elvis," the model shows a very high probability that the next token is "Lee."
The process of tokenization is crucial. Tokens are common sequences of characters found in text. Tokenization is separate from training the language model itself: the training data set is analyzed mathematically to determine the most efficient way of representing text with about 50,000 combinations of characters. Since most of the training data is in English, tokenization is more efficient for English. The models understand the statistical relationships between these tokens, not the text itself, and they predict the next tokens based on those relationships.
In practice, we convert the text into tokens and send integers into the model, not text. The models are adept at producing the next token in a sequence of tokens. Strictly speaking, though, they don't produce the next token; they predict the probability distribution over all 50,000 or so tokens. Depending on how creative we want the text to be, we may or may not choose the most probable one.
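A minimal sketch of that text-to-integers step, assuming the Hugging Face GPT-2 tokenizer, shows that the model only ever sees token IDs:

```python
# Sketch: the model never sees text, only integer token IDs (assumed: the
# Hugging Face GPT-2 tokenizer).
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

ids = tokenizer.encode("The cat sat on the")
print(ids)                                   # e.g. [464, 3797, 3332, 319, 262]
print([tokenizer.decode([i]) for i in ids])  # the character sequences behind each ID
```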
Writing prompts in English for AI models is more efficient, accurate, and cost-effective.
A good example of the model's limitations is when it gets confused by similar tokens. For example, when I asked for a well-known song lyric with the word "wind" as in winding a watch, it initially came up with "Blowin' in the Wind" by Bob Dylan. I clarified that I wanted "wind" as in winding a watch, not "blowing in the wind," and it then suggested "Against the Wind" by Bob Seger. Another interesting response was "Listen to the wind blow, watch the sun rise," which contains both "wind" and "watch," but uses the wrong meanings for both words. This confusion arises because the model treats these as tokens rather than words: the token ID for "wind" is the same whichever meaning is intended, and likewise for "watch," even though they are effectively different words. This demonstrates how the language model can be confused by identical sequences of letters.
Tokenization is more efficient in English. For instance, if we take a piece of English text, it comes out at 45 tokens. The word "the" is represented by different tokens depending on where it appears, and most word tokens are prefixed by a space. If the space were its own token, there would be about twice as many tokens, so it makes sense to include the space as part of the token at the start of a word. Most actual English words are single tokens.
For the programmers in the audience, the significance of token IDs 11 and 13, comma and full stop, is that the first 256 tokens correspond to single characters, covering the ASCII range. This means any text can be represented, and you'll see these single-character tokens popping up. When we take Swedish, it ends up more fragmented because there isn't much Swedish data in the training data, resulting in 83 tokens. Dutch comes out at 71 tokens, and Finnish, the most inefficient of the European languages I found, at 622 tokens even though there are only 247 characters.
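To reproduce this kind of comparison yourself, a rough sketch with the GPT-2 tokenizer is shown below; the sample sentences here are illustrative stand-ins, not the speaker's own examples, so the counts will differ from the ones quoted above.

```python
# Sketch: the same tokenizer produces far more tokens for languages that are
# rare in the training data (illustrative sentences, assumed GPT-2 tokenizer).
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

samples = {
    "English": "The cat sat on the floor because it had just been cleaned.",
    "Dutch":   "De kat zat op de vloer omdat die net was schoongemaakt.",
    "Swedish": "Katten satt på golvet eftersom det precis hade städats.",
    "Finnish": "Kissa istui lattialla, koska se oli juuri siivottu.",
}
for language, text in samples.items():
    n_tokens = len(tokenizer.encode(text))
    print(f"{language}: {len(text)} characters -> {n_tokens} tokens")
```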
This illustrates the efficiency of tokenization across different languages, and the language understanding of GPT-4 compared with GPT-3.5 using MMLU, one of the benchmarks for measuring language understanding. GPT-4 is much better in English than in other languages, but it is usually much better than GPT-3.5 in those other languages. There is a rough correlation between tokenization efficiency and language understanding, but it is not a straight line. For example, Telugu is the most inefficient language listed for processing tokens.
English is the new programming language. Writing prompts in English is recommended for several reasons: GPT models are most accurate with English because of the extensive English training data; billing is based on the number of tokens, so languages like Swedish or Finnish will be more expensive; throughput is measured in tokens per second, making English faster to process; and models have a maximum number of input tokens. As developers, if we can make something run 20-30% faster, more accurately, and reduce costs by 20-30%, we will do that. Even if your application deals with text in your native language and generates output in your native language, writing the prompt in English is still beneficial.
AI can get really weird with language translations, sometimes giving bizarre results like a recipe for cheese turning into a story about Frozen characters.
I recommend doing that just as a test. For example, I asked it how to make cheese in Telugu, and it came out with an answer recommending cheese powder as the main ingredient. However, it then switched back to English, offered to generate a poem or a story, and got really weird, generating an image of the Frozen characters in an autumn-bound enchanted land. This shows some confusion in understanding, and it is an extreme example; I've done loads of testing, and that was just one crazy outcome.
There is a correlation between language understanding and the quality of the answers you receive. Neural networks don't work with integers; they use floating-point numbers. Therefore, another stage of conversion is necessary to turn these tokens into floating-point numbers we can do maths with. This process is known as embedding or vectorization. Jody mentioned how we can vectorize text, or do text embedding, to get the sentiment of text. For example, in GPT-2, the word "cheese" is represented by 768 floating-point numbers.
Humans can't think in 768 dimensions; we can handle about three. As developers, we understand red, green, and blue colour values, which can be represented in a vector format. Data scientists like floating-point numbers between 0 and 1, and we can visualize similar colours clustering together in a 3D space. Measuring the distance between these points can be done using Euclidean distance or cosine similarity. Cosine similarity measures the angle between vectors, indicating that colours along the same line from the origin are similar.
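As a small sketch of those two distance measures, using 3-dimensional "colour" vectors as a stand-in for 768-dimensional embeddings:

```python
# Sketch: Euclidean distance vs cosine similarity on small colour vectors.
import numpy as np

def euclidean(a, b):
    return np.linalg.norm(a - b)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

red      = np.array([1.0, 0.0, 0.0])
dark_red = np.array([0.5, 0.0, 0.0])
blue     = np.array([0.0, 0.0, 1.0])

print(cosine_similarity(red, dark_red))  # 1.0 -- same direction, "same colour"
print(cosine_similarity(red, blue))      # 0.0 -- unrelated colours
print(euclidean(red, dark_red))          # 0.5 -- closer in space than red vs blue
```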
Understanding language models means recognizing how words like "good," "great," and "fantastic" share similar sentiments through complex calculations and embeddings.
This concept can also be applied to words. Word2Vec has a database with 10,000 words, each represented in 200 dimensions. We can squash these dimensions into three so the human brain can understand them, showing that similar words are clustered together. For example, words related to classical music, diseases, Greek letters, and numbers are grouped in similar locations. This vectorization helps in finding sentiment and understanding that words like "good," "great," and "fantastic" have similar meanings.
To achieve this, we will start by examining the actual Transformer architecture and see what goes on in these language models, specifically focusing on how the attention model works and how these calculations yield answers. This approach is derived from the white paper "Attention is All You Need", which was initially written about translations. However, as Jody mentioned this morning, instead of just doing translations, we are using GPT for sequence prediction.
We can essentially ignore the translation bit and focus on the decoder to see how we can take our data, send it in, get the embedding and positional encoding, and follow how that goes through the network. By focusing on this, we see how tokenization works. For instance, the phrase "the cat sat on the" involves lookups in the tokenization vocabulary to get the integer numbers that represent those token IDs. During the model training process, we create something known as an embeddings table. This table, present within the model, has 768 dimensions because each token is described by 768 floating-point numbers in a 768-dimensional space; the size of the token vocabulary is 50,257. We perform a lookup for tokens 464, 3797, 3332, 319, and 262, which gives us the actual embedding vectors for those tokens. These embedding vectors, number of tokens by 768, are then fed into the model.
Next, we get the positional vectors from a positional lookup table, where we feed in the actual position of each token: 0, 1, 2, 3, 4. The positional vectors, still 768 dimensions, are built from sine and cosine functions and give an indication of the token's position. We then add these positional vectors to the embedding vectors, resulting in the encoder input values that are fed into the model.
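A sketch of those two lookups, assuming the Hugging Face GPT-2 weights (wte is the token-embedding table mentioned later in the debugger walkthrough; wpe is the positional table in the same model):

```python
# Sketch: token-embedding and positional lookups, then their sum, which is what
# enters the first Transformer block (assumed: Hugging Face GPT-2 weights).
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
wte = model.transformer.wte          # token embeddings, shape (50257, 768)
wpe = model.transformer.wpe          # positional embeddings, shape (1024, 768)

token_ids = torch.tensor([[464, 3797, 3332, 319, 262]])    # "The cat sat on the"
positions = torch.arange(token_ids.shape[1]).unsqueeze(0)  # [[0, 1, 2, 3, 4]]

input_embeds = wte(token_ids)        # (1, 5, 768)
position_embeds = wpe(positions)     # (1, 5, 768)
hidden_states = input_embeds + position_embeds
print(hidden_states.shape)           # torch.Size([1, 5, 768])
```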
The advantage of having the source code available is that we can set breakpoints. For example, for the next-token prediction I have this set up as the startup file. By changing the input text and pressing F5, the model loads into memory. Instead of typing text in, the input is fixed as "the cat sat on the," and this time we will hit the breakpoints. In the watch window we can see which Python class we are in, such as the GPT-2 model class.
Gaming GPUs power AI models with the same matrix calculations used in 3D games.
The actual embeddings are obtained by feeding these IDs into the embeddings table, referred to as wte in the model. When we step over the relevant lines of code, we get the input embeddings, which are of size 1x5x768. This indicates that we are sending in one sequence during inference, typically just one stream of text, although the model supports more. The dimensions (1, 5, 768) represent one sequence, five tokens, and 768 dimensions for the embeddings.
Similarly, we obtain the positional embeddings, which are also of size 1x5x768. The encoder input values, or hidden states, are calculated by adding the input embeddings to the positional embeddings. This calculation is reflected in the slide deck.
Next, we look at the attention calculation. The formula involves the query, key, and value, which are extracted from the floating-point values by network layers trained for that purpose, and a softmax operation is performed as part of the matrix algebra. The gaming industry has contributed significantly to the rise of these models, because the matrix calculations in 3D games are similar to those used in training them. Nvidia, for instance, has become highly valuable because its GPUs are essential for these calculations.
The encoder input values (1x5x768) are processed through a convolution layer to generate different representations for the attention mechanism, resulting in a size of 1x5x2304. This is then split into three sections: the query, key, and value. In GPT-2, there are 12 attention heads in each layer and 12 layers in total. Each 1x5x768 section is split based on the number of heads, resulting in a size of 1x12x5x64: one batch, 12 attention heads, five tokens, and 64 floating-point numbers per head for each of the query, key, and value.
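A sketch of that split, with a plain linear layer standing in for the projection (GPT-2's own code calls it a Conv1D, but it behaves like a linear layer):

```python
# Sketch: one projection produces 2304 numbers per token, which are cut into
# query, key and value and then reshaped per attention head.
import torch

batch, seq_len, d_model, n_heads = 1, 5, 768, 12
head_dim = d_model // n_heads                               # 64

hidden_states = torch.randn(batch, seq_len, d_model)        # stand-in for (1, 5, 768)
c_attn = torch.nn.Linear(d_model, 3 * d_model)              # GPT-2's "c_attn" projection

qkv = c_attn(hidden_states)                                 # (1, 5, 2304)
query, key, value = qkv.split(d_model, dim=2)               # three tensors of (1, 5, 768)

def split_heads(x):
    # (1, 5, 768) -> (1, 12, 5, 64): batch, heads, tokens, per-head dimensions
    return x.view(batch, seq_len, n_heads, head_dim).transpose(1, 2)

query, key, value = map(split_heads, (query, key, value))
print(query.shape)                                          # torch.Size([1, 12, 5, 64])
```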
The query, key, and value are used to determine how the words relate to each other. For example, in a conference setting, the speaker can be the query, the audience members the keys, and the value represents the connection each key has to the query. The model examines all the tokens in the sequence to determine how each word relates to another, performing calculations to figure out the sentiment. The vectors for the words describe their sentiments, with similar words being closer together in the vector space.
Attention mechanisms in AI models decode how words relate to each other, enabling accurate predictions by understanding context.
For example, if the sentence is "the cat sat on the floor; it had just been cleaned," the word "it" will have a strong connection to "the floor." Within this matrix operation, we take the query, transpose the key, multiply the two together, and divide by the square root of 64, the per-head dimension, a scaling that has been found to make the underlying maths behave better. Then we perform a softmax operation, which I'll explain more about later. If you've played around with PyTorch or TensorFlow, you probably know what softmax does: it gives us a probability distribution from a sequence of floating-point numbers. This is also important when calculating the temperature. What this process gives us, when all the maths is done, is the actual attention.
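A compact sketch of that calculation, with a causal mask added so that each token can only attend to itself and earlier tokens (which is how GPT-style decoders behave):

```python
# Sketch: scaled dot-product attention, softmax(Q·Kᵀ / sqrt(64)) · V.
import math
import torch

def scaled_dot_product_attention(query, key, value):
    # query, key, value: (batch, heads, tokens, head_dim), e.g. (1, 12, 5, 64)
    head_dim = query.shape[-1]
    scores = query @ key.transpose(-2, -1) / math.sqrt(head_dim)   # (1, 12, 5, 5)

    # Causal mask: each token may only look at itself and the tokens before it.
    n = scores.shape[-1]
    mask = torch.tril(torch.ones(n, n, dtype=torch.bool))
    scores = scores.masked_fill(~mask, float("-inf"))

    weights = torch.softmax(scores, dim=-1)   # how much each token attends to each other token
    return weights @ value                    # (1, 12, 5, 64)

q = k = v = torch.randn(1, 12, 5, 64)
print(scaled_dot_product_attention(q, k, v).shape)   # torch.Size([1, 12, 5, 64])
```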
The vectorization or embedding gives us the meaning, and the positional vectors and positional lookup give us the position. The attention mechanism calculates how words relate to other words and the long-term dependencies between them. This mechanism is what allows these models to make predictions. If we look at the code, we should see that math taking place. We have the hidden states, which resulted from adding the position vectors and the embedding vectors for the tokens. The hidden states had a shape of 1x5x768.
Next, we perform the maths where we take the attention projection, split it by the relevant dimensions, and this gives us the keys, queries, and values for all 12 heads in the model. For example, the key shape starts as 1x5x768, and when split based on the number of heads, it becomes 1x12x5x64. This process is repeated through the 12 layers of the network before we get the final results. The main difference between GPT-2, GPT-3, and GPT-4 is the number of layers, parameters, and attention heads; they are just bigger.
Eventually, we get the actual output that we can start processing and making sense of. For instance, the word "it" in the sentence "it had just been cleaned" is ambiguous. The attention mechanism helps determine whether "it" refers to "the cat" or "the floor" by shifting the vectors in high-dimensional space. The model understands how these words relate to each other, so "it" refers to "the floor" because "floor" has more of a connection with "cleaned."
The model has 12 heads and several other layers for normalization and feed-forward processes, repeating 12 times in the GPT-2 model to give us the predictions. Thus, we can predict the next token, which is fantastic because it means we can actually start making accurate predictions.
Adding randomness to text generation makes it more interesting and less predictable.
We can predict the next token, which means we can start generating text. For instance, if we input "the king of rock and roll is Elvis" and press Control F5, we get a prediction with a probability of 64.98%. If we continue, we might see repetitive outputs like "he's a rockstar." This demonstrates that always picking the most probable next bit of text can be boring and predictable. To make text generation more interesting, we introduce stochasticity, or randomness, into how the next token is chosen. This prevents repetitive outputs and makes the generated text more engaging.
There are a couple of important parameters when working with these models. One of them is temperature, which affects the randomness or creativity of the network. Temperature values range from zero to one, where zero is very predictable, and one is very unpredictable. Although the slider in Azure OpenAI Studio only goes up to one, GPT-3 can handle values up to two, but higher values can result in nonsensical outputs.
In a demo, we used the phrase "the cat sat on the" and looked at the probability distribution for the next token. The temperature is applied to the next-token logits, which are floating-point numbers representing the likelihood of each token. By dividing these logits by the temperature and then performing a softmax operation, we convert them into a probability distribution. If the temperature is one, the values are unchanged; if the temperature is zero, it results in a divide-by-zero error.
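A minimal sketch of that step, with illustrative logit values rather than real model output:

```python
# Sketch: divide the next-token logits by the temperature before the softmax.
# Lower temperatures sharpen the distribution, higher ones flatten it
# (and exactly zero would literally be a divide-by-zero).
import torch

next_token_logits = torch.tensor([8.0, 6.5, 5.0, 4.0, 2.0])   # illustrative values

for temperature in (0.5, 1.0, 2.0):
    probs = torch.softmax(next_token_logits / temperature, dim=-1)
    print(temperature, [f"{p:.1%}" for p in probs.tolist()])
```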
For example, with a low temperature, "floor" might have a 100% probability. If the temperature is set to 0.5, "floor" might have a 24% probability, showing how the probability distribution shifts.
Understanding temperature and top-p in language models can make or break your AI's performance.
If I increase the temperature to two, you can see that all of the top 10 tokens have quite a low probability, making it more likely that other tokens will be chosen. To make this more visual, I'll first cover top P and then show the animation. Top P basically puts a hard cut-off on the tokens that can be chosen. For instance, if I choose a top P of 0.1, the model can only choose "floor" or "bed," and it is most likely to choose "floor." There is no way it can choose "couch" or "ground" or anything else. Top P varies how much of the probability distribution we are allowed to choose from.
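A sketch of that cut-off, using illustrative probabilities rather than real model output: sort the probabilities, keep the smallest set whose cumulative probability reaches top-p, and renormalise.

```python
# Sketch: top-p (nucleus) filtering of a next-token distribution.
# Everything outside the kept slice can never be chosen.
import torch

def apply_top_p(probs, top_p):
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative - sorted_probs < top_p      # always keeps at least the top token
    filtered = torch.zeros_like(probs)
    filtered[sorted_ids[keep]] = sorted_probs[keep]
    return filtered / filtered.sum()

probs = torch.tensor([0.45, 0.25, 0.12, 0.08, 0.06, 0.04])   # "floor", "bed", "couch", ...
print(apply_top_p(probs, 0.10))   # only "floor" survives
print(apply_top_p(probs, 0.75))   # "floor", "bed" and "couch" survive
```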
When developing these applications, it's crucial to understand what temperature and top P do and how you can combine them. I've seen people discussing large language models without a clear understanding of what top P does and how it works. I created these animations because I was curious about how the probability distribution is affected. We're sending in "the cat sat on the," with the temperature starting at one and likewise the top P. Using the same values for temperature and top P, we can see how the probability distributions change. As we get to lower top P values, tokens start disappearing from the probability distribution entirely; everything outside the cut-off is gone and can never be chosen. With a lower temperature, by contrast, those tokens can still be chosen, just with a very low probability.
If we go back to a temperature of one and a top P of one, there is no restriction, so there is a tiny chance the model will land on something unwanted like "the cat sat on the trees." In the guidelines, they specify a top P of 0.95, which chops off the 5% of tokens we probably don't want to choose, and they suggest varying the temperature or the top P, but not both. This ensures that even with a temperature of 1.0, top P will filter out unwanted tokens. As we move towards the lower end of the range, you can see that only "floor" and "bed" remain with a top P of 0.135, whereas with the temperature we still have "floor," "bed," "couch," "ground," and others, though most have very small probabilities.
Varying the temperature in GPT models can make outputs more creative and less predictable.
To see how temperature affects things, I have a demo called Temperature Theory. Initially, I set a file as a startup file and executed a control F5 command. This action cycles through different temperatures and generates different outputs. I realized I had removed the input, so I reset the startup file to the temperature theory. I often perform this demo at the end of sessions, as it tends to amuse the audience, despite its random or stochastic nature, which can lead to unpredictable and sometimes inappropriate results. I apologize in advance if that occurs.
With a temperature of zero, using GPT-2, the output is repetitive, stating, "it's a very good dog," due to the high probability of the token "dog." As the temperature increases, the predictions vary, occasionally selecting "black bear." With a temperature of 0.04, it starts picking "rat," "rabbit," "cat," and "rabbit," showing more creativity. I recommend varying the temperature when building GPT solutions.
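A rough sketch of a temperature-sweep demo in this style, assuming the Hugging Face transformers package; the prompt here is an illustrative placeholder, not the one used in the Temperature Theory demo.

```python
# Sketch: generate from the same prompt at increasing temperatures and compare
# the outputs (assumed: Hugging Face transformers, small GPT-2 checkpoint).
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer.encode("My favourite animal is a", return_tensors="pt")
for temperature in (0.2, 0.7, 1.2):
    output = model.generate(
        input_ids,
        do_sample=True,                       # sample instead of always taking the top token
        temperature=temperature,
        max_new_tokens=20,
        pad_token_id=tokenizer.eos_token_id,  # avoids the missing-pad-token warning
    )
    print(temperature, tokenizer.decode(output[0], skip_special_tokens=True))
```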
In the next conference session, I will focus on automating tests and analyzing these models. It's crucial to consider statistical analysis rather than just marking answers as good or bad. By modifying the temperature, the language generated becomes more creative. Remember, we are using GPT-2, so the answers are less predictable.
Returning to the slides, besides temperature, other parameters influence the output. Temperature and top P are the main ones. N determines how many completion choices to generate for each input message. The prediction process involves beam search, which generates multiple completions for each message. A stop sequence can be specified to halt output generation at certain points, like after a full stop. The maximum number of tokens can also be set to control the length of the generated answers, considering cost and time.
Presence penalty and frequency penalty help manage repetition. For instance, in creative texts like poems or songs, repetition might be desirable, but in scientific texts or code, it is not. Code generation often involves repetition, so managing this is crucial to avoid chaotic results. Additionally, the likelihood of specific tokens appearing can be modified, and a block list can prevent certain tokens from appearing.
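As a hedged sketch of where these parameters sit when calling a hosted model through the OpenAI (or Azure OpenAI) Python SDK; the model name, prompt, and values here are placeholders, not recommendations.

```python
# Sketch: the generation parameters discussed above, as they appear in a
# chat-completions call (placeholder model name and prompt).
from openai import OpenAI

client = OpenAI()  # for Azure OpenAI you would construct an AzureOpenAI client instead

response = client.chat.completions.create(
    model="gpt-4o-mini",                 # placeholder model/deployment name
    messages=[{"role": "user", "content": "Write two lines about cheese."}],
    temperature=0.7,        # randomness / creativity
    top_p=0.95,             # nucleus cut-off; vary this or temperature, not both
    n=1,                    # how many completion choices to generate
    max_tokens=200,         # cap on the length (and cost) of the answer
    stop=["\n\n"],          # stop sequence: halt generation at a blank line
    presence_penalty=0.0,   # discourage re-introducing tokens already present
    frequency_penalty=0.5,  # discourage repeating the same tokens too often
)
print(response.choices[0].message.content)
```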
I want to thank the community members who contribute to understanding and explaining these models. I highly recommend checking out 3Blue1Brown's videos on Transformers, specifically chapters five and six in the Deep Learning series. These videos use animations to explain how Transformer models work, building on what I've discussed.
We have about two or three minutes left for questions. Feel free to ask any questions from the audience.