Words as Numbers: Tokens Are the Essential Building Blocks of Language Models ⚡ Quick Start LLMs
This is Part 1 of the series Quick Start Large Language Models: An Accessible Code-First Approach To Working with LLMs.
Reading time: ~5 minutes
Language modeling has finally brought the massive potential of artificial intelligence to the mainstream, as ChatGPT continues to report the fastest-growing user base in history.
As the power of language models begins to permeate the technology ecosystem, developers are becoming empowered to usher in a new wave of innovation.
I am a strong proponent of open-source software, especially with the dawn of Large Language Models (LLMs). I firmly believe the control, specificity, privacy, and cost of open-source models will be the way forward for the most creative AI applications.
For developers new to language modeling, I’ve started this Quick Start series to accelerate greater comprehension and adoption of LLM development.
To begin modeling language, we need to turn data into something we can work with programmatically.
We start with the basic atomic element of language modeling, the token.
When we think of the simplest unit of communication, words are likely the first answer that comes to mind. They are the individual components that combine to make sentences and paragraphs, which then convey ideas, questions, instructions, sentiments etc.
For example, let’s take this iconic line by Vito Corleone from The Godfather:
“I’m gonna make him an offer he can’t refuse.”
If we separate this sentence into individual words, we have word-tokenized this expression, meaning we’ve created a list of parsed units based on each word.
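Here is a minimal sketch of that word-level split, using Python's built-in split method (the words variable name is just illustrative):

text = "I'm gonna make him an offer he can't refuse."
words = text.split(" ")  # split the sample text on spaces to get word-level tokens
print(words)
print(f"token count: {len(words)}")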
The code above uses the split method to separate our sentence based on the spaces between each word. It creates a list of 9 unique word-only strings.
Not a bad start. But with this separation, there are some essential nuances that we are potentially missing. We have contractions that are captured as single words (e.g. “I’m”) and punctuation that is bundled together with letters (e.g. “refuse.”).
Perhaps some greater granularity would be more informative.
text = "I'm gonna make him an offer he can't refuse."
characters = list(text.replace(" ", ""))
print(characters)
print(f"token count: {len(characters)}")
We remove spacing with the replace function and then parse out each character in our sample text. The parsing produces a list of 36 individual string characters.
This does capture more granularity, but it is almost certainly too fine-grained to be helpful: letters repeat, and the individual characters carry no real context about what is being said.
Practically speaking, most modern tokenization techniques take a subword approach, separating complex words into individual parts as well as parsing out punctuation. This provides more specificity than a purely word tokenization approach and more context than character tokenization.
Using the Hugging Face transformer library, we can apply a more rigorous framework.
If you are unfamiliar with Hugging Face, I highly encourage you to visit their platform. Their documentation is tremendous and the insights from the community have always proven to be helpful and encouraging. (Not sponsored, just a fan!)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
text = "I'm gonna make him an offer he can't refuse."
tokens = tokenizer.tokenize(text)
print(tokens)
print(f"token count: {len(tokens)}")
- First, we import the AutoTokenizer class from the transformers library. This will automatically select the correct tokenizer based on the model checkpoint we reference in the next step, hence the name.
- Next, we instantiate a tokenizer object using the ‘bert-base-cased’ model.
- Then, we pass our sample text into the tokenizer.
Using this approach, our sample text has now been broken down into 14 tokens. Unlike pure word tokenization, we see contracted words like “I’m” split apart and punctuation parsed out as its own units.
With our raw input now more informatively parsed, we can process our data into a format our model can work with.
We do this by assigning a unique identifier to each token. With our tokenizer object, we can use the convert_tokens_to_ids method.
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)
print(f"token count: {len(token_ids)}")
Language models like BERT were trained on a large and diverse set of documents called a training corpus, or just corpus for short. Unique IDs are created for each token in the vocabulary from that training corpus.
The vocabulary is explicitly derived using methods like the Byte Pair Encoding algorithm to identify words and subwords, along with punctuation, starts of sentences, ends of sentences, and other special characters.
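As a quick peek at that vocabulary (assuming the same tokenizer object from the earlier snippet), we can check how many entries it contains and which special tokens it reserves:

# Number of entries in the bert-base-cased vocabulary
print(tokenizer.vocab_size)
# Special tokens reserved for things like sentence boundaries, padding, and masking
print(tokenizer.all_special_tokens)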
This process has turned our textual data into a numerical format that a model can process.
We can visualize this token-to-ID relationship more intuitively by thinking of them as pairs. If we zip the list of tokens and the list of token_ids together, we can see how each token is matched to its unique ID:
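One way to do this, reusing the tokens and token_ids lists from the snippets above:

# Pair each token with its corresponding vocabulary ID
for token, token_id in zip(tokens, token_ids):
    print(f"{token} -> {token_id}")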
To illustrate this further, if we call the tokenizer again and use the decode method on our list of IDs, we can see our numerical representation is transformed back to our original string.
decoded_token_ids = tokenizer.decode(token_ids)
print(decoded_token_ids)
“I’m gonna make him an offer he can’t refuse.”
And that’s it.
The notebook with all code examples can be found here: QuickStartLLMs_pt1_Tokenization.ipynb
Armed with this fundamental understanding of how words can be transformed into numbers, we can advance to our next step, Word Embeddings, where we’ll contextualize our numerical representations of text.
Thanks for reading!
References & Resources
- Tokenizers by Hugging Face. https://huggingface.co/learn/nlp-course/chapter2/4?fw=pt
- NLP Basics: What is Tokenization and How Do You Do It? by Weights & Biases. https://wandb.ai/sauravm/Tokenizers/reports/NLP-Basics-What-is-Tokenization-and-How-Do-You-Do-It---VmlldzoxOTAxNDU2
- Byte pair encoding. https://en.wikipedia.org/wiki/Byte_pair_encoding