With the AI bubble growing daily, I decided to slow down for a minute and give a basic overview of how I visualize the intelligence behind most LLMs, or, for that matter, anything that uses a model of variable weights and biases. Boiling a neural network down into basic parts is not an easy task, as there are billions of parameters that traverse a weighted probability tree to bring you your answer. So while this post gives a basic overview of how this kind of system works, it does not come close to explaining the full complexity of LLMs. Loosely, all LLMs can be thought of as extraordinarily complex Markov chains.
Markov chains use probabilities to predict a future event based on what has happened in the past. They are frequently used to predict the weather: the chain stores information about past weather patterns and assigns probabilities to them, then consults those past values to estimate the weather for today or tomorrow. It works pretty well.
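As a toy sketch of that idea, here is what a first-order weather chain might look like in Python. The transition probabilities are invented for the example, not taken from any real data:

```python
import random

# Hypothetical transition table: P(tomorrow's weather | today's weather).
# The numbers are made up purely for illustration.
transitions = {
    "sunny": {"sunny": 0.7, "windy": 0.2, "rainy": 0.1},
    "windy": {"sunny": 0.3, "windy": 0.4, "rainy": 0.3},
    "rainy": {"sunny": 0.2, "windy": 0.3, "rainy": 0.5},
}

def next_state(today):
    """Sample tomorrow's weather from today's transition probabilities."""
    states = list(transitions[today])
    weights = list(transitions[today].values())
    return random.choices(states, weights=weights)[0]

print(next_state("sunny"))
```

Each row sums to 1.0, so every state always has somewhere to go; run it a few times and sunny days mostly stay sunny.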
But how does this pertain to LLMs, to how they are trained, and to how prompts become sentences? It is not quite the same, but it is close. In the case of LLMs, the probabilities are attached to words, context, and sentence fragments instead of to whether it is windy, sunny, or raining.
Below is a long example of a Markov chain built from six similar but different sentences.
I love cats
You love dogs
I love baseball
You love football
Sometimes we need to love all things
Cats and dogs can get along
Here is a JSON representation of a 3rd-order, forward- and backward-looking Markov chain for the sentences above.
{
  "order_1_forward": {
    "I": { "love": { "probability": 1.0, "wordnet_synsets": ["verb.n.01"] } },
    "You": { "love": { "probability": 1.0, "wordnet_synsets": ["verb.n.01"] } },
    "love": {
      "cats": { "probability": 0.2, "wordnet_synsets": ["noun.n.01"] },
      "dogs": { "probability": 0.2, "wordnet_synsets": ["noun.n.01"] },
      "baseball": { "probability": 0.2, "wordnet_synsets": ["noun.n.01"] },
      "football": { "probability": 0.2, "wordnet_synsets": ["noun.n.01"] },
      "all": { "probability": 0.2, "wordnet_synsets": ["adjective.s.01"] }
    },
    "Sometimes": { "we": { "probability": 1.0, "wordnet_synsets": ["pronoun.n.01"] } },
    "we": { "need": { "probability": 1.0, "wordnet_synsets": ["verb.n.01"] } },
    "need": { "to": { "probability": 1.0, "wordnet_synsets": ["preposition.n.01"] } },
    "to": { "love": { "probability": 1.0, "wordnet_synsets": ["verb.n.01"] } },
    "all": { "things": { "probability": 1.0, "wordnet_synsets": ["noun.n.01"] } },
    "Cats": { "and": { "probability": 1.0, "wordnet_synsets": ["conjunction.n.01"] } },
    "and": { "dogs": { "probability": 1.0, "wordnet_synsets": ["noun.n.01"] } },
    "dogs": { "can": { "probability": 1.0, "wordnet_synsets": ["verb.n.01"] } },
    "can": { "get": { "probability": 1.0, "wordnet_synsets": ["verb.n.01"] } },
    "get": { "along": { "probability": 1.0, "wordnet_synsets": ["adverb.r.01"] } }
  },
  "order_1_backward": {
    "love": {
      "I": { "probability": 0.4, "wordnet_synsets": ["pronoun.n.01"] },
      "You": { "probability": 0.4, "wordnet_synsets": ["pronoun.n.01"] },
      "to": { "probability": 0.2, "wordnet_synsets": ["preposition.n.01"] }
    },
    "dogs": {
      "love": { "probability": 0.5, "wordnet_synsets": ["verb.n.01"] },
      "and": { "probability": 0.5, "wordnet_synsets": ["conjunction.n.01"] }
    },
    "cats": { "love": { "probability": 1.0, "wordnet_synsets": ["verb.n.01"] } },
    "and": { "Cats": { "probability": 1.0, "wordnet_synsets": ["noun.n.01"] } },
    "along": { "get": { "probability": 1.0, "wordnet_synsets": ["verb.n.01"] } }
  },
  "order_2_forward": {
    "I love": {
      "cats": { "probability": 0.5, "wordnet_synsets": ["noun.n.01"] },
      "baseball": { "probability": 0.5, "wordnet_synsets": ["noun.n.01"] }
    },
    "You love": {
      "dogs": { "probability": 0.5, "wordnet_synsets": ["noun.n.01"] },
      "football": { "probability": 0.5, "wordnet_synsets": ["noun.n.01"] }
    },
    "Sometimes we": { "need": { "probability": 1.0, "wordnet_synsets": ["verb.n.01"] } },
    "Cats and": { "dogs": { "probability": 1.0, "wordnet_synsets": ["noun.n.01"] } },
    "dogs can": { "get": { "probability": 1.0, "wordnet_synsets": ["verb.n.01"] } }
  },
  "order_2_backward": {
    "love cats": { "I": { "probability": 1.0, "wordnet_synsets": ["pronoun.n.01"] } },
    "love dogs": { "You": { "probability": 1.0, "wordnet_synsets": ["pronoun.n.01"] } },
    "and dogs": { "Cats": { "probability": 1.0, "wordnet_synsets": ["noun.n.01"] } },
    "we need": { "Sometimes": { "probability": 1.0, "wordnet_synsets": ["adverb.r.01"] } }
  },
  "order_3_forward": {
    "I love cats": { "": { "probability": 1.0 } },
    "You love dogs": { "": { "probability": 1.0 } },
    "Sometimes we need": { "to": { "probability": 1.0, "wordnet_synsets": ["preposition.n.01"] } },
    "Cats and dogs": { "can": { "probability": 1.0, "wordnet_synsets": ["verb.n.01"] } },
    "and dogs can": { "get": { "probability": 1.0, "wordnet_synsets": ["verb.n.01"] } }
  },
  "order_3_backward": {
    "need to love": { "we": { "probability": 1.0, "wordnet_synsets": ["pronoun.n.01"] } },
    "can get along": { "dogs": { "probability": 1.0, "wordnet_synsets": ["noun.n.01"] } }
  }
}
To provide a clearer understanding of how the probabilities tie together, let’s examine the JSON representation above. This data demonstrates the probabilities and connections within a Markov chain for our set of six sentences.
The JSON representation is organized into different sections based on the order of the Markov chain (order_1_forward, order_1_backward, order_2_forward, order_2_backward, order_3_forward, order_3_backward). Each section contains a series of key-value pairs representing the word combinations and their associated probabilities.
For example, in the “order_1_forward” section, we can see the following connections:
- “I” followed by “love” with a probability of 1.0 (every “I” in the training sentences is followed by “love”)
- “You” followed by “love” with a probability of 1.0
- “Sometimes” followed by “we” with a probability of 1.0
- “love” followed by “cats,” “dogs,” “baseball,” “football,” or “all,” each with a probability of 0.2
- “Cats” followed by “and” with a probability of 1.0
These probabilities reflect the likelihood of encountering these word combinations based on the training data. The associated “wordnet_synsets” provide additional information about the meaning or category of the words.
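For the curious, a table like this can be built mechanically by counting word pairs. Here is a minimal Python sketch that derives the order-1 forward portion from the six sentences; the probabilities fall straight out of the counts:

```python
from collections import defaultdict

sentences = [
    "I love cats",
    "You love dogs",
    "I love baseball",
    "You love football",
    "Sometimes we need to love all things",
    "Cats and dogs can get along",
]

# Count how often each word follows each other word...
counts = defaultdict(lambda: defaultdict(int))
for sentence in sentences:
    words = sentence.split()
    for current, nxt in zip(words, words[1:]):
        counts[current][nxt] += 1

# ...then normalize the counts into probabilities.
chain = {
    word: {nxt: n / sum(following.values()) for nxt, n in following.items()}
    for word, following in counts.items()
}

print(chain["love"])
```

Since “love” appears five times with five different followers, each continuation gets a probability of 0.2.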
If we tell a system to generate a random sentence from this chain, we may get something like this: “I love cats and dogs can get along.”
To understand how a sentence like “I love cats and dogs can get along” can be generated from this Markov chain, let’s break the process down step by step:
1. Starting Point: The generation begins with a starting word or fragment. In this case, we start with the word “I” from the “order_1_forward” section.
2. Probability and Next Word Selection: We look at the continuations recorded for “I” and sample one at random, weighted by its probability. Here, “love” is selected.
3. Widening the Context: With “I love” as the context, the “order_2_forward” section lists the continuations seen in training, and “cats” is sampled.
4. Crossing Sentence Boundaries: The fragment now reads “I love cats.” One training sentence ends at “cats,” but another begins with “Cats,” so the chain can keep going by connecting “cats” to “and.”
5. Repeat the Process: The same lookups continue word by word: “and” connects to “dogs,” “dogs” to “can,” “can” to “get,” and “get” to “along.”
6. Completing the Sentence: “along” has no recorded continuation, which signals the end of the sentence.
By following the probabilities and associations within the Markov chain, the sentence “I love cats and dogs can get along” is generated, even though it never appeared in the training data. The probabilities determine the likelihood of each word combination based on the observed patterns, and the process continues until a suitable endpoint is reached.
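That walk can be sketched in a few lines of Python. This is a deliberately minimal order-1 generator; lowercasing every word is my own simplification so that “cats” at the end of one sentence can link to “Cats” at the start of another, which is exactly how the run-on example sentence comes about:

```python
import random

sentences = [
    "I love cats",
    "You love dogs",
    "I love baseball",
    "You love football",
    "Sometimes we need to love all things",
    "Cats and dogs can get along",
]

# Build an order-1 forward chain; lowercasing merges "cats" and "Cats",
# which is what lets the walk jump between training sentences.
chain = {}
for sentence in sentences:
    words = sentence.lower().split()
    for current, nxt in zip(words, words[1:]):
        chain.setdefault(current, []).append(nxt)

def generate(start, max_words=12):
    """Walk the chain from `start`, sampling successors until a dead end."""
    word = start.lower()
    out = [word]
    while word in chain and len(out) < max_words:
        word = random.choice(chain[word])
        out.append(word)
    return " ".join(out)

print(generate("I"))
```

Run it a few times: every sentence starts with “i love,” but which of the five objects follows, and whether the walk crosses into “and dogs can get along,” is down to the dice.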
Similar connections and probabilities can be observed in the other sections of the JSON, corresponding to different orders of the Markov chain. These connections capture the patterns and relationships between words, enabling the generation of coherent text based on the learned probabilities.
By utilizing these probabilities, an LLM can generate text by selecting the most likely word or phrase to follow a given context, considering the observed patterns from the training data.
But let’s make sure we are on the same page: LLMs, at their core, are based on probabilities. They analyze vast amounts of data to learn the likelihood of different words and sentence fragments appearing together. This understanding allows them to generate text that follows coherent patterns.
Imagine you’re playing with a set of word blocks. Each block represents a word or a fragment of a sentence. LLMs learn the probability of one block being placed next to another, considering the previous blocks. It’s like a puzzle, where the pieces fit together based on the patterns they’ve observed in the training data.
In the case of a 3rd-order Markov chain, the model looks at groups of up to three blocks at a time. It examines the relationships between these blocks and calculates the probabilities of certain combinations occurring. By doing so, it can predict the most likely next block or sequence of blocks based on what it has seen before.
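That block-by-block idea can be sketched in Python: build lookup tables for orders 1 through 3 over one training sentence, then always try the longest context first and fall back to a shorter one. The backoff scheme here is my own illustration, not how any real LLM is implemented:

```python
from collections import defaultdict

words = "Sometimes we need to love all things".split()

# Build forward tables for orders 1-3: context tuple -> observed next words.
tables = {n: defaultdict(list) for n in (1, 2, 3)}
for n in (1, 2, 3):
    for i in range(len(words) - n):
        tables[n][tuple(words[i:i + n])].append(words[i + n])

def predict(context):
    """Return next-word candidates using the longest matching context."""
    for n in (3, 2, 1):
        key = tuple(context[-n:])
        if key in tables[n]:
            return tables[n][key]
    return []

print(predict(["we", "need", "to"]))
```

A three-word context gives the most specific prediction; when the full context has never been seen, the shorter tables still offer a guess.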
Of course, keep in mind that modern LLMs have gone far beyond this simplified representation. They incorporate advanced techniques like transformers, attention that can look far ahead, and billions of parameters, allowing them to capture complex language structures and dependencies across long ranges of text.
Building an LLM this way would be horrendously slow. I know, because I ran this experiment with 1MB worth of text documents in a MariaDB database. The larger the order tables grow, the slower the engine gets. If you look back at the JSON data above, you will see that many words and sentence fragments are repeated many times to make the proper connections. So 1MB of text turns into 10MB, 100MB, or more, depending on how deep your Markov chains go. Querying was quite slow as well. And the sentences it did make, while coherent, could not carry context over to the next sentence.
Neural networks are tuned and trained to look far ahead and hold context across many paragraphs. To attempt that with what I’ll refer to as the ‘SQLChain method’ would take, believe it or not, many times more processing power than the GPU-hungry LLMs of today. When you get to the billions of parameters that modern LLMs have, it becomes a futile task with any other tech stack, but it is the thought that counts.
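To make the ‘SQLChain method’ concrete, here is a small sketch using SQLite in place of MariaDB (the table and column names are made up for the example): one row per (context, next word) pair with a counter, and probabilities computed at query time. Every additional order means another full set of rows like this, which is why the tables balloon:

```python
import sqlite3

# One row per (context, next word) pair, with an occurrence count.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chain (context TEXT, next_word TEXT, n INTEGER, "
    "PRIMARY KEY (context, next_word))"
)

def observe(context, next_word):
    """Record one observed transition, incrementing the count on repeats."""
    conn.execute(
        "INSERT INTO chain VALUES (?, ?, 1) "
        "ON CONFLICT(context, next_word) DO UPDATE SET n = n + 1",
        (context, next_word),
    )

for sentence in ["I love cats", "I love baseball", "You love dogs"]:
    words = sentence.split()
    for cur, nxt in zip(words, words[1:]):
        observe(cur, nxt)

# Probability of each continuation of "love": its count over the context total.
rows = conn.execute(
    "SELECT next_word, 1.0 * n / "
    "(SELECT SUM(n) FROM chain WHERE context = ?) "
    "FROM chain WHERE context = ?",
    ("love", "love"),
).fetchall()
print(rows)
```

Even in this toy, every probability lookup is a query with an aggregate subquery; scale the corpus and the order depth up, and you can see where the slowness comes from.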
I hope this helps others gain a basic understanding of how LLMs work. What goes on inside the neural network is much more complex than the simple Markov chain shown in this post. But in the end, it all comes down to probabilities: how likely is this object to connect to that object, based on how many times it has been seen before.
–Bryan