Demystifying WORD EMBEDDINGS To Power Language Models

⚡ Quick Start LLMs

Peter Chung
7 min read · Nov 27, 2023


This is Part 2 of the series Quick Start Large Language Models: An Accessible Code-First Approach To Working with LLMs.

Reading time: ~7 minutes

Language modeling has changed our world.

ChatGPT is not even a year old at the time of this writing. And despite some recent disruptions at OpenAI, innovation in the space has not stopped.

In fact, it’s only accelerating.

To aid in this acceleration, I’ve started this Quick Start series to encourage LLM development broadly.

Building on Part 1, we saw that tokenization was only an initial step in working with text data.

To understand how language models can interpret, reason, and generate text, we need to understand word embeddings.

If we think of token IDs as serial numbers we assign to our data, i.e. unique identifiers we create, then word embeddings are descriptors to help us characterize them.

Embeddings are expressed numerically, so it is more accurate to think of them as coordinates we map out to show how words exist in our data relative to each other.

We’ll go through a simplified build-up to develop intuition.

Let’s take this iconic line from one of the great classics of English literature:

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair.

Charles Dickens, A Tale of Two Cities

First, to create an example with a bit more complexity, we’ll start with two very specific steps for our experiment.

  1. We’ll segment our sample text into ‘documents’ so we have a mini-corpus to work through. We’ll use the commas as separation points.
  2. We’ll inject two special tokens, [start] and [end], that will be used to mark the beginning and end of each of our documents.
Link to Colab notebook with all code below.
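
Here’s a minimal sketch of what those two steps might look like (the exact code lives in the linked notebook; the variable names here are my own):

```python
# Split the Dickens passage into 'documents' at the commas,
# then wrap each one in [start] / [end] marker tokens.
text = (
    "It was the best of times, it was the worst of times, "
    "it was the age of wisdom, it was the age of foolishness, "
    "it was the epoch of belief, it was the epoch of incredulity, "
    "it was the season of Light, it was the season of Darkness, "
    "it was the spring of hope, it was the winter of despair."
)

documents = [
    f"[start] {segment.strip().rstrip('.')} [end]"
    for segment in text.split(",")
]

print(documents[0])    # [start] It was the best of times [end]
print(len(documents))  # 10 documents in our mini-corpus
```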

Next, we’ll run our mini-corpus through a very basic word-based tokenization process.

We’ll find all the unique tokens, creating a vocabulary specific to our training corpus. We’ll also assign a unique ID to each token to make processing them more efficient.

Link to Colab notebook with all code below.
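
A simple version of that tokenizer and vocabulary build might look like this (lower-casing everything is an assumption made for this sketch, and it reuses the `documents` list from above):

```python
# Word-level tokenization: split each document on whitespace,
# then build a vocabulary of unique tokens with integer IDs.
tokenized_docs = [doc.lower().split() for doc in documents]

vocab = sorted({token for doc in tokenized_docs for token in doc})
token_to_id = {token: idx for idx, token in enumerate(vocab)}
id_to_token = {idx: token for token, idx in token_to_id.items()}

print(len(vocab))            # size of our mini-corpus vocabulary
print(token_to_id["epoch"])  # each unique token gets a unique integer ID
```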

With our data now prepped, we can begin to contextualize it.

Word embeddings are just models, i.e. representations, of our data. We examine each token, looking for context clues about how it is used, and then iterate on a set of scores that maps out the token’s position relative to other words in the training data.

We need a starting place for this modeling process, so we’ll create a randomly generated array of data for each of our tokens.

Link to Colab notebook with all code below.
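
Here’s one way that initialization could look, assuming NumPy and the vocabulary objects from the previous step (the 0.1 scale and the fixed random seed are arbitrary choices for this illustration):

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the run is reproducible

embedding_dim = 3                # deliberately tiny so we can plot it
vocab_size = len(vocab)

# One row per token in the vocabulary; each row is a 3-dimensional
# vector of random starting values (our 'seed' embeddings).
embeddings = rng.normal(loc=0.0, scale=0.1, size=(vocab_size, embedding_dim))

print(embeddings.shape)                # (vocab_size, 3)
print(embeddings[token_to_id["age"]])  # starting 'coordinates' for 'age'
```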

We’ve created a 2-dimensional tensor where each row corresponds to a token in our vocabulary and holds a 3-dimensional vector of random numbers that will act as the seed to initialize our embedding training.

For our purposes, we are intentionally using 3 dimensions for the embedding values for simplicity’s sake. This will help us easily visualize the position of our sample words spatially as we train them. In practice, word embedding vectors are much larger, often several hundred dimensions.

If we take a few keywords from our vocabulary, ‘age’, ‘epoch’, ‘wisdom’, and ‘foolishness’ for example, and plot them to the random seed numbers we’ve assigned, we can visualize their starting embeddings.
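
A quick 3-D scatter plot makes this concrete; matplotlib is my choice here, and the exact plotting code in the notebook may differ:

```python
import matplotlib.pyplot as plt

keywords = ["age", "epoch", "wisdom", "foolishness"]

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
for word in keywords:
    x, y, z = embeddings[token_to_id[word]]
    ax.scatter(x, y, z)
    ax.text(x, y, z, word)
ax.set_title("Randomly initialized embeddings")
plt.show()
```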

Unsurprisingly, the words are positioned randomly relative to each other, with no sense of meaning, similarity, or context we could use.

Now, if we begin to train our seed embeddings so the values shift bit by bit toward the pattern of words as they appear in our sample text, we can begin injecting some context.

We’ll try a ‘primitive’ model, using a simple training loop on our data to guess which word should come next given a specific set of context words.

Link to Colab notebook with all code below.

The training-loop code in the notebook takes the following steps (a rough sketch of the loop follows this list):

  1. Within each training ‘epoch’ look at each document and split it into tokens.
  2. For each token, look back at up to 3 preceding tokens (the context window) relative to the target token we’re trying to guess.
  3. Given the tokens in our context window, take the average of the current embedding values for those tokens (np.mean across axis 0).
  4. Take the averaged context vector and multiply it against every row of our embedding tensor (a dot product per token). This produces a score for every token in the vocabulary, and the best match should surface as a ‘high score’.
  5. The np.argmax function finds this ‘high score’ and returns the index ID associated with it.
  6. Use that index ID to predict the next word in our target sequence.
  7. If the predicted next word does not match the actual next word from our training data, compute an error: the difference between the embedding vector for the predicted ID and the embedding vector for the target ID.
  8. Take this error value and multiply it by our learning rate. We’ll use this value to adjust the embedding vectors of tokens in the context window.
  9. Repeat this process 1,000 times. As the training loop adjusts the embedding vectors, the values should learn the pattern of words in our training dataset.
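
Here is a rough NumPy reconstruction of those nine steps (the learning-rate value, the update direction, and the variable names are my assumptions, not necessarily the notebook’s exact code):

```python
learning_rate = 0.01   # assumed value for this sketch
context_size = 3
epochs = 1000

for epoch in range(epochs):
    for doc in tokenized_docs:
        token_ids = [token_to_id[token] for token in doc]
        # Step through each target token, using up to 3 preceding
        # tokens as the context window.
        for position in range(1, len(token_ids)):
            target_id = token_ids[position]
            context_ids = token_ids[max(0, position - context_size):position]

            # Average the current embeddings of the context tokens.
            context_vector = np.mean(embeddings[context_ids], axis=0)

            # Score every vocabulary token against the context vector;
            # the highest score is our predicted next token.
            scores = embeddings @ context_vector
            predicted_id = int(np.argmax(scores))

            if predicted_id != target_id:
                # Error: difference between the predicted and target
                # embedding vectors, scaled by the learning rate.
                error = embeddings[predicted_id] - embeddings[target_id]
                # Nudge the context-token embeddings toward the target.
                embeddings[context_ids] -= learning_rate * error
```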

This is an intentionally contrived example, for illustration purposes only. It is designed to help conceptualize the process of word embeddings so we can develop an intuitive understanding of how numbers can model words and meanings. In reality, a word embedding process is trained on a very large dataset, often using several hundred dimensions to represent tokens and much more sophisticated training techniques, which we’ll see shortly.

After our 1,000 epochs are finished, the resulting trained embeddings produce a much different result from our starting seed values.

Looking again at ‘age’, ‘epoch’, ‘wisdom’, and ‘foolishness’, we see the coordinates have now shifted.

From the plot, the words ‘age’ and ‘epoch’ are positioned very close together. Given how they are used in our sample text (after the sequence ‘it was the…’), it makes sense that our training loop captured this.

The words ‘wisdom’ and ‘foolishness’ also appear to separate from each other across some axes, which makes sense given that they are used in a contrasting way in the passage.
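
One way to check this numerically rather than visually (this check is my own addition, using cosine similarity as the measure; the article’s comparison relies on the plot):

```python
# Compare the trained vectors directly with cosine similarity.
def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

age = embeddings[token_to_id["age"]]
epoch_vec = embeddings[token_to_id["epoch"]]
wisdom = embeddings[token_to_id["wisdom"]]
foolishness = embeddings[token_to_id["foolishness"]]

# If training captured the shared 'it was the age/epoch of...' context,
# 'age' and 'epoch' should score noticeably closer than the contrasting pair.
print(cosine_similarity(age, epoch_vec))
print(cosine_similarity(wisdom, foolishness))
```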

This numerical calibration is what empowers the language models of today to capture deep meaning and context, allowing us to programmatically develop applications that use language.

With this fundamental concept in place, let’s see a more practical example pulled from the Hugging Face platform.

If you are unfamiliar with Hugging Face, I highly encourage you to visit their platform. Their documentation is tremendous and the insights from the community have always proven to be helpful and encouraging. (Not sponsored, just a fan!)

In the snippet below, we leverage the AutoTokenizer and AutoModel classes to call the ‘bert-base-uncased’ model.

Link to Colab notebook with all code below.
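
A version of that snippet looks roughly like this; the example sentence and the use of `last_hidden_state` as the embedding output are my choices for illustration:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Example sentence chosen for this sketch.
sentence = "It was the best of times, it was the worst of times."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional contextual embedding per input token.
token_embeddings = outputs.last_hidden_state
print(token_embeddings.shape)  # torch.Size([1, number_of_tokens, 768])
```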

This is a pre-trained model, with 768 dimensions defined for each of the 30,522 tokens in its vocabulary. Unlike our ‘basic’ model, it was trained using masking, where a token is randomly hidden from the model, and the context of the word in both directions (i.e. the words that came before and after the masked one) is used to model it.
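
To see that masking objective in action, Hugging Face’s fill-mask pipeline is a convenient demo (the example sentence is mine):

```python
from transformers import pipeline

# BERT predicts the hidden token from the context on both sides of [MASK].
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("It was the best of [MASK], it was the worst of times."):
    print(prediction["token_str"], round(prediction["score"], 3))
```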

As you’ll see out in the wild, word embedding approaches are far more complex: they map tokens into high-dimensional space, use very large and diverse training corpora, and rely on more robust training techniques like masking. But this complexity is what powers language models to develop insightful predictions and intuitions from language context.

And that’s it.

The notebook with all code examples can be found here: QuickStart_pt2_Embeddings.ipynb

Armed with this foundation, we will look at the process of modeling and inference with the transformer next. This is the architecture that powers BERT, GPT, LLaMA, and other commonly used large language models today.

Thanks for reading!



Written by Peter Chung

Founder and Principal at Aberrest.com. Applied AI research and development. Building intelligent automation, reasoning systems, and agents.