Tokenisation: Breaking Documents and Files into Words or Sub-Words
- Pulkit Sahu
- Apr 14
- 4 min read
Tokenisation is a crucial step in language modelling.
Breaking large text down into smaller pieces called tokens is known as tokenisation. Tokens can be sentences, words, sub-words, or even characters.
Before we move ahead, let us briefly explain language modelling:
Language modelling is predicting the probability of the word (or token) that comes next, given the previous context words. For example, as I write this sentence, my brain predicts what I should write next. Similarly, an AI model predicts the probability of an upcoming word given the previous words. This is often termed conditional probability and can be expressed as:
P(upcoming word | previous words)
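To make this concrete, here is a minimal sketch of that conditional probability using made-up bigram counts (the counts and the example words are purely illustrative, not taken from any real corpus):
# Made-up bigram counts, purely for illustration.
bigram_counts = {
    ("the", "sun"): 3,
    ("the", "east"): 1,
}
# How many times "the" appears as the context (first) word.
context_count = sum(count for (prev, _), count in bigram_counts.items() if prev == "the")
# P("sun" | "the") = count("the sun") / count("the" followed by anything)
p_sun_given_the = bigram_counts[("the", "sun")] / context_count
print(p_sun_given_the)  # 0.75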
#1: Tokenisation
AI models work with words, so it helps to break the given data (a large text file) down into smaller, atomic units called tokens.
For instance, in Python,
text = """I am writing an educational post on tokenisation."""
can be broken into tokens as:
I
am
writing
an
educational
post
on
tokenisation.
How you tokenise text depends largely on the dataset you are working with and the output you want. You will often need to decide whether to keep capitalisation and punctuation in your tokens.
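As a small sketch of that decision, compare a plain split() with a lowercased, punctuation-stripped variant (the regular expression below is just one possible choice, not the only way to do it):
import re

text = """I am writing an educational post on tokenisation."""

# Plain whitespace split keeps the capital 'I' and the trailing full stop.
print(text.split())
# ['I', 'am', 'writing', 'an', 'educational', 'post', 'on', 'tokenisation.']

# Lowercasing and keeping only runs of letters drops both.
print(re.findall(r"[a-z]+", text.lower()))
# ['i', 'am', 'writing', 'an', 'educational', 'post', 'on', 'tokenisation']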
#2: The Tokenisation Process
If you plan to build a language model, you will start by collecting and pre-processing (cleaning, formatting) the data. The next step is preparing a tokeniser, which breaks the text down into tokens.
There are three common ways to tokenise:
Character-based
Word-based
Subword-based
In this post, we will discuss word-based tokenisation, that is, splitting the given text data into word tokens.
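For a quick feel of the difference, here is a tiny sketch contrasting character-based and word-based tokens (subword tokenisers such as BPE need a learned vocabulary, so they are left out here):
text = "The sun rises"

# Character-based: every character, including spaces, becomes a token.
print(list(text))
# ['T', 'h', 'e', ' ', 's', 'u', 'n', ' ', 'r', 'i', 's', 'e', 's']

# Word-based: whitespace-separated words become tokens.
print(text.split())
# ['The', 'sun', 'rises']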
Word-Tokenisation in Python
For example,
text = "The sun rises in the east."
the tokens would be:
The
sun
rises
in
the
east.
The handy tool available to us in Python is the split() method. We simply call it on the text variable to get a list of words. Try running the code below.
words = text.split()
# You can check by running this:
words
You will get this: ['The', 'sun', 'rises', 'in', 'the', 'east.']
Since computers work with numbers, we will assign each unique word or token from the above list an integer starting from 0.
We can simply use Python's built-in function set() to get unique tokens like this:
set(words)
Since a set is unordered, let us convert it into a sorted list so that our tokens always come out in the same order.
sorted(list(set(words)))
# Assign it to tokens variable
tokens = sorted(list(set(words)))
You can now check the tokens:
tokens
You will get: ['The', 'east.', 'in', 'rises', 'sun', 'the']. Notice that the capitalised 'The' and the lowercase 'the' are kept as separate tokens.
Now we will build a dictionary that maps each unique token to an integer.
# Create a word-to-number mapping
word_to_number = {word: number for number, word in enumerate(tokens)}
We are using a dictionary comprehension to keep the code concise.
Reversing the process:
# Create a number-to-word mapping
number_to_word = {number: word for word, number in word_to_number.items()}
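For our six tokens, the two mappings look like this:
print(word_to_number)
# {'The': 0, 'east.': 1, 'in': 2, 'rises': 3, 'sun': 4, 'the': 5}
print(number_to_word)
# {0: 'The', 1: 'east.', 2: 'in', 3: 'rises', 4: 'sun', 5: 'the'}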
#3: Encode and Decode
Let us encode and decode.
We will make two functions, encode and decode. encode takes a string and returns a list of numbers; decode takes a list of numbers and returns a list of tokens. In each one we loop with Python's for loop and append the dictionary value for the current item to a new list; for instance, encode appends word_to_number[word] for each word.
🚦 Note that encode splits the input string into words using split()
def encode(string):
    encoded = []
    for word in string.split():
        encoded.append(word_to_number[word])
    return encoded

def decode(numbers):
    decoded = []
    for number in numbers:
        decoded.append(number_to_word[number])
    return decoded
Let us check:
encode("The")
# Run it and you will get: [0]
Save it:
encoded = encode("The")
decode(encoded)
# Run it and you will get: ['The']
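We can also round-trip the whole example sentence:
encoded = encode("The sun rises in the east.")
print(encoded)          # [0, 4, 3, 2, 5, 1]
print(decode(encoded))  # ['The', 'sun', 'rises', 'in', 'the', 'east.']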
We can now make a vocabulary.
vocabulary = {index: word for index, word in enumerate(tokens)}
# You can check the prepared vocabulary:
# If your vocabulary is too large, please do not print it
vocabulary
Prepared Vocabulary:
{0: 'The', 1: 'east.', 2: 'in', 3: 'rises', 4: 'sun', 5: 'the'}
You can find a word if you provide its index number like this:
vocabulary[3]
# Run it and you will get: 'rises'
⚠️ This is an elementary example of how we prepare a vocabulary. If you try to encode a word that is not in it, such as apple, you will get a KeyError.
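One common way around this (just a sketch; the <unk> token and the encode_safe name are not part of the code above, they are introduced here for illustration) is to reserve an extra index for unknown words and fall back to it with dict.get:
# Reserve one extra index for words the vocabulary has never seen.
UNK = len(tokens)  # 6 for our toy example
number_to_word[UNK] = "<unk>"

def encode_safe(string):
    # dict.get falls back to UNK instead of raising a KeyError.
    return [word_to_number.get(word, UNK) for word in string.split()]

print(encode_safe("The apple rises"))
# [0, 6, 3]
print(decode(encode_safe("The apple rises")))
# ['The', '<unk>', 'rises']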
This way, we prepared a unique vocabulary of tokens for our language model. The size of the vocabulary can be checked using len(vocabulary). The size increases as we take more and more data. If you're just beginning, a vocabulary size of 1,000 is a good start for building a toy language model. For small models, it can range from 10,000 to 50,000, and for larger models, it can go as high as 100,000 to 200,000. A well-defined vocabulary will help us in the next stages of language modelling.
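For the toy sentence above, that check is simply:
print(len(vocabulary))  # 6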
Let us end this post by preparing a vocabulary of tokens for the paragraph above. We simply split it, collect the unique words, sort them, and save the result in the tokens variable, which we then use to build the vocabulary.
# The paragraph
text = """This way, we prepared a unique vocabulary of tokens for our language model. The size of the vocabulary can be checked using len(vocabulary). The size increases as we take more and more data. If you're just beginning, a vocabulary size of 1,000 is a good start for building a toy language model. For small models, it can range from 10,000 to 50,000, and for larger models, it can go as high as 100,000 to 200,000. A well-defined vocabulary will help us in the next stages of language modelling.
"""
# Splitting
words = text.split()
# Sorting
tokens = sorted(list(set(words)))
# Assigning numbers to our tokens using word_to_number
word_to_number = {word: number for number, word in enumerate(tokens)}
# Let us use our vocabulary
vocabulary = {index: word for index, word in enumerate(tokens)}
# Check your vocabulary by running this:
vocabulary