Tokenisation: Breaking Documents and Files into Words or Sub-Words
- Pulkit Sahu
- Apr 14
- 4 min read
Tokenisation is a crucial step in language modelling.
Breaking large text down into smaller pieces called tokens is known as tokenisation. Tokens can be sentences, words, sub-words, or even characters.
Before we move ahead, let us briefly explain language modelling:
Language modelling is predicting the probability of the word (or token) that comes next, given the previous context words. For example, as I write this sentence, my brain predicts what I should write next. Similarly, an AI model predicts the probability of an upcoming word given the previous words. This is often termed conditional probability and can be expressed as:
P(upcoming word | previous words)
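To make this concrete, here is a minimal sketch of that conditional probability using made-up bigram counts (the counts and the example words are purely illustrative, not taken from any real corpus):
# Made-up bigram counts, purely for illustration.
bigram_counts = {
    ("the", "sun"): 3,
    ("the", "east"): 1,
}
# How many times "the" appears as the context (first) word.
context_count = sum(count for (prev, _), count in bigram_counts.items() if prev == "the")
# P("sun" | "the") = count("the sun") / count("the" followed by anything)
p_sun_given_the = bigram_counts[("the", "sun")] / context_count
print(p_sun_given_the)  # 0.75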
#1: Tokenisation
AI models work with words, so it helps to break the given data (a large text file) down into smaller, atomic units called tokens.
For instance, in Python,
text = """I am writing an educational post on tokenisation."""
can be broken into tokens as:
I
am
writing
an
educational
post
on
tokenisation.
How you tokenise text depends largely on the dataset you are working with and the output you want. You will often need to decide whether to keep capitalisation and punctuation in your tokens.
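As a small sketch of that decision, compare a plain split() with a lowercased, punctuation-stripped variant (the regular expression below is just one possible choice, not the only way to do it):
import re

text = """I am writing an educational post on tokenisation."""

# Plain whitespace split keeps the capital 'I' and the trailing full stop.
print(text.split())
# ['I', 'am', 'writing', 'an', 'educational', 'post', 'on', 'tokenisation.']

# Lowercasing and keeping only runs of letters drops both.
print(re.findall(r"[a-z]+", text.lower()))
# ['i', 'am', 'writing', 'an', 'educational', 'post', 'on', 'tokenisation']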
#2: The Tokenisation Process
If you plan to build a language model, you will start by collecting and pre-processing (cleaning, formatting) the data. The next step is preparing a tokeniser, which breaks the text down into tokens.
There are three common ways to tokenise:
Character-based
Word-based
Subword-based
In this post, we will discuss word-based tokenisation, that is, splitting the given text data into word tokens.
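For a quick feel of the difference, here is a tiny sketch contrasting character-based and word-based tokens (subword tokenisers such as BPE need a learned vocabulary, so they are left out here):
text = "The sun rises"

# Character-based: every character, including spaces, becomes a token.
print(list(text))
# ['T', 'h', 'e', ' ', 's', 'u', 'n', ' ', 'r', 'i', 's', 'e', 's']

# Word-based: whitespace-separated words become tokens.
print(text.split())
# ['The', 'sun', 'rises']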
Word-Tokenisation in Python
For example,
text = "The sun rises in the east."
the tokens would be:
The
sun
rises
in
the
east.
The handy tool available to us in Python is the split() method. We simply call it on the text variable to get a list of words. Try running the code below.
words = text.split()
# You can check by running this:
words
You will get this: ['The', 'sun', 'rises', 'in', 'the', 'east.']
Since computers work with numbers, we will assign each unique word or token from the above list an integer starting from 0.
We can simply use Python's built-in function set() to get unique tokens like this:
set(words)
Since a set is unordered, let us convert it into a sorted list so that our tokens always come out in the same order.
sorted(list(set(words)))
# Assign it to tokens variable
tokens = sorted(list(set(words)))
You can now check the tokens:
tokens
You will get: ['The', 'east.', 'in', 'rises', 'sun', 'the']. Notice that the capitalised 'The' and the lowercase 'the' are kept as separate tokens.
Now we will build a dictionary that maps each unique token to an integer.
# Create a word-to-number mapping
word_to_number = {word: number for number, word in enumerate(tokens)}
We are using a dictionary comprehension to keep the code concise.
Reversing the process:
# Create a number-to-word mapping
number_to_word = {number: word for word, number in word_to_number.items()}
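For our six tokens, the two mappings look like this:
print(word_to_number)
# {'The': 0, 'east.': 1, 'in': 2, 'rises': 3, 'sun': 4, 'the': 5}
print(number_to_word)
# {0: 'The', 1: 'east.', 2: 'in', 3: 'rises', 4: 'sun', 5: 'the'}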
#3: Encode and Decode
Let us encode and decode.
We will make two functions, encode and decode. encode takes a string and returns a list of numbers; decode takes a list of numbers and returns a list of tokens. In each one we loop with Python's for loop and append the dictionary value for the current item to a new list; for instance, encode appends word_to_number[word] for each word.
🚦 Note that encode splits the input string into words using split()
def encode(string):
    encoded = []
    for word in string.split():
        encoded.append(word_to_number[word])
    return encoded

def decode(numbers):
    decoded = []
    for number in numbers:
        decoded.append(number_to_word[number])
    return decoded
Let us check:
encode("The")
# Run it and you will get: [0]
Save it:
encoded = encode("The")
decode(encoded)
# Run it and you will get: ['The']
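We can also round-trip the whole example sentence:
encoded = encode("The sun rises in the east.")
print(encoded)          # [0, 4, 3, 2, 5, 1]
print(decode(encoded))  # ['The', 'sun', 'rises', 'in', 'the', 'east.']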
We can now make a vocabulary.
vocabulary = {index: word for index, word in enumerate(tokens)}
# You can check the prepared vocabulary:
# If your vocabulary is too large, please do not print it
vocabulary
Prepared Vocabulary:
{0: 'The', 1: 'east.', 2: 'in', 3: 'rises', 4: 'sun', 5: 'the'}
You can find a word if you provide its index number like this:
vocabulary[3]
# Run it and you will get: 'rises'
⚠️ This is an elementary example of how we prepare a vocabulary. If you try to encode a word that is not in it, such as apple, you will get a KeyError.
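One common way around this (just a sketch; the <unk> token and the encode_safe name are not part of the code above, they are introduced here for illustration) is to reserve an extra index for unknown words and fall back to it with dict.get:
# Reserve one extra index for words the vocabulary has never seen.
UNK = len(tokens)  # 6 for our toy example
number_to_word[UNK] = "<unk>"

def encode_safe(string):
    # dict.get falls back to UNK instead of raising a KeyError.
    return [word_to_number.get(word, UNK) for word in string.split()]

print(encode_safe("The apple rises"))
# [0, 6, 3]
print(decode(encode_safe("The apple rises")))
# ['The', '<unk>', 'rises']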
This way, we prepared a unique vocabulary of tokens for our language model. The size of the vocabulary can be checked using len(vocabulary). The size increases as we take more and more data. If you're just beginning, a vocabulary size of 1,000 is a good start for building a toy language model. For small models, it can range from 10,000 to 50,000, and for larger models, it can go as high as 100,000 to 200,000. A well-defined vocabulary will help us in the next stages of language modelling.
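For the toy sentence above, that check is simply:
print(len(vocabulary))  # 6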
Let us end this post by preparing a vocabulary of tokens for the paragraph above. We simply split it, collect the unique words, sort them, and save the result in the tokens variable, which we then use to build the vocabulary.
# The paragraph
text = """This way, we prepared a unique vocabulary of tokens for our language model. The size of the vocabulary can be checked using len(vocabulary). The size increases as we take more and more data. If you're just beginning, a vocabulary size of 1,000 is a good start for building a toy language model. For small models, it can range from 10,000 to 50,000, and for larger models, it can go as high as 100,000 to 200,000. A well-defined vocabulary will help us in the next stages of language modelling.
"""
# Splitting
words = text.split()
# Sorting
tokens = sorted(list(set(words)))
# Assigning numbers to our tokens using word_to_number
word_to_number = {word: number for number, word in enumerate(tokens)}
# Let us use our vocabulary
vocabulary = {index: word for index, word in enumerate(tokens)}
# Check your vocabulary by running this:
vocabulary