
Tokenisation: Breaking Documents and Files into Words or Sub-Words


Tokenisation is a crucial step in language modelling.


Breaking large text into smaller pieces called tokens is known as tokenisation. Tokens can be sentences, words, sub-words, or even characters. In language modelling, it is a crucial step that is well worth the effort.


Before we move ahead, let us briefly explain language modelling:


Predicting the probability of what word (or token) comes next, given previous context words, is called language modelling. For example, as I write this sentence, my brain predicts what I should write next. Similarly, an AI model predicts the probability of an upcoming word given previous words. This is often termed conditional probability and can be mathematically expressed as:


P(upcoming word | previous words)
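To make this concrete, here is a minimal sketch of the idea using a tiny made-up corpus and bigram counts (real models are far more sophisticated, but the conditional probability being estimated is the same):

from collections import Counter

# A tiny made-up corpus, just for illustration
corpus = "the sun rises in the east and the sun sets in the west".split()

# Count pairs of consecutive words, and single words
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def probability(upcoming, previous):
    # P(upcoming word | previous word) = count(previous, upcoming) / count(previous)
    return bigrams[(previous, upcoming)] / unigrams[previous]

probability("sun", "the")   # 0.5 -- 'the' appears 4 times, twice followed by 'sun'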



#1: Tokenisation


AI models work with words, so it helps to break the given data (a large text file) into smaller, atomic units called tokens.


For instance, in Python,

text = """I am writing an educational post on tokenisation."""

can be broken into tokens as:

I

am

writing

an

educational

post

on

tokenisation.


How you tokenise text depends largely on the dataset you're working with and the output you want. Often, you will need to decide whether to keep capitalisation or punctuation in your tokens.
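For instance, here is a minimal sketch of one such choice, assuming we want lowercase tokens with the punctuation stripped out:

import string

text = """I am writing an educational post on tokenisation."""

# Lowercase the text and strip punctuation before splitting
cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
cleaned.split()
# ['i', 'am', 'writing', 'an', 'educational', 'post', 'on', 'tokenisation']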


#2: The Tokenisation Process


If you plan to build a language model, you will start by collecting and pre-processing the data. Once the data is pre-processed (cleaned and formatted), the first modelling step is preparing a tokeniser, which breaks the text down into tokens.


There are three common ways to tokenise:

  • Character-based

  • Word-based

  • Subword-based


In this post, we will discuss word-based tokenisation: breaking the given text data into words.
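Before we narrow down, here is a minimal sketch contrasting the three (the subword split below is hard-coded purely for illustration; real subword tokenisers such as BPE learn their splits from data):

# Character-based: every character becomes a token
list("sun")              # ['s', 'u', 'n']

# Word-based: split on whitespace
"The sun rises".split()  # ['The', 'sun', 'rises']

# Subword-based: words are broken into frequent pieces
# (hard-coded here; real tokenisers learn these splits)
["token", "isation"]     # a plausible subword split of 'tokenisation'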


Word Tokenisation in Python


For example,

text = "The sun rises in the east."

tokens would be

The

sun

rises

in

the

east.


Python gives us a handy string method, split(). We simply call it on the text variable to get a list of words. Try running the code below.

words = text.split()

# You can check by running this:
words

You will get something like this: ['The', 'sun', 'rises', 'in', 'the', 'east.']


Since computers work with numbers, we will assign each unique word or token from the above list an integer starting from 0.


We can simply use Python's built-in function set() to get unique tokens like this:

set(words)

Since sets are unordered, let us convert the set into a sorted list. sorted() already returns a list, so we can pass the set straight to it.

sorted(set(words))

# Assign it to the tokens variable
tokens = sorted(set(words))

You can now check the tokens:

tokens

You will get something like this: ['The', 'east.', 'in', 'rises', 'sun', 'the']. Notice how the capitalised 'The' and the lowercase 'the' are preserved as separate tokens.


Now we will build a mapping that assigns each unique token an integer.


# Create a word-to-number mapping
word_to_number = {word: number for number, word in enumerate(tokens)}

We are using a dictionary comprehension to keep this concise.
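If the comprehension looks dense, it is just shorthand for an explicit loop:

# Equivalent explicit loop
word_to_number = {}
for number, word in enumerate(tokens):
    word_to_number[word] = number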


Reversing the process:

# Create a number-to-word mapping
number_to_word = {number: word for word, number in word_to_number.items()}
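A quick sanity check that the two mappings invert each other:

word_to_number['The']   # 0
number_to_word[0]       # 'The'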

#3: Encode and Decode


Let us encode and decode.


We will write two functions, encode and decode. encode takes a string and returns a list of numbers; decode takes a list of numbers and returns a list of words. In each, we loop with Python's for loop and append the looked-up value to a new list; for instance, encode appends the number for each token using word_to_number[word].


🚦 Splitting the string into words using split()


def encode(string):
	encoded = []
	for word in string.split():
		encoded.append(word_to_number[word])
	return encoded

def decode(numbers):
	decoded = []
	for number in numbers:
		decoded.append(number_to_word[number])
	return decoded		

Let us check:

encode("The")

# Run it and you will get: [0]

Now save the result and decode it:

encoded = encode("The")
decode(encoded)

# Run it and you will get: ['The']
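As an aside, the same two functions can be written more compactly with list comprehensions; this sketch is equivalent to the loops above:

def encode(string):
    return [word_to_number[word] for word in string.split()]

def decode(numbers):
    return [number_to_word[number] for number in numbers]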

We can now make a vocabulary. (Notice that this is the same mapping as number_to_word; we are just giving it a new name.)


vocabulary = {index: word for index, word in enumerate(tokens)}

# You can check the prepared vocabulary:
# If your vocabulary is too large, please do not print it
vocabulary

Prepared Vocabulary:

{0: 'The', 1: 'east.', 2: 'in', 3: 'rises', 4: 'sun', 5: 'the'}

You can look up a word by its index number like this:

vocabulary[3]

# Run it and you will get: 'rises'

⚠️ This is an elementary example of how we prepare a vocabulary. If you try to encode a word that is not in it, such as apple, you will get a KeyError.
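One common remedy, sketched here with a hypothetical <unk> token (the name and the numbering convention are our assumptions, not a fixed rule), is to reserve a number for unknown words and fall back to it with dict.get():

# Reserve the next free number for an unknown token (a common convention)
UNK = len(word_to_number)
number_to_word[UNK] = '<unk>'

def encode(string):
    # dict.get() falls back to UNK instead of raising a KeyError
    return [word_to_number.get(word, UNK) for word in string.split()]

encode("The apple")           # [0, 6]
decode(encode("The apple"))   # ['The', '<unk>']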


This way, we prepared a unique vocabulary of tokens for our language model. The size of the vocabulary can be checked using len(vocabulary). The size increases as we take more and more data. If you're just beginning, a vocabulary size of 1,000 is a good start for building a toy language model. For small models, it can range from 10,000 to 50,000, and for larger models, it can go as high as 100,000 to 200,000. A well-defined vocabulary will help us in the next stages of language modelling.
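For the tiny vocabulary we built above:

len(vocabulary)

# Run it and you will get: 6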


Let us end this post by preparing a vocabulary of tokens for the paragraph above. We simply split it, collect the unique words, sort them, and save the result in the tokens variable, which is then passed into the vocabulary expression.

# The paragraph
text = """This way, we prepared a unique vocabulary of tokens for our language model. The size of the vocabulary can be checked using len(vocabulary). The size increases as we take more and more data. If you're just beginning, a vocabulary size of 1,000 is a good start for building a toy language model. For small models, it can range from 10,000 to 50,000, and for larger models, it can go as high as 100,000 to 200,000. A well-defined vocabulary will help us in the next stages of language modelling.
"""

# Splitting
words = text.split()

# Collecting the unique words and sorting
tokens = sorted(set(words))

# Assigning numbers to our tokens using word_to_number
word_to_number = {word: number for number, word in enumerate(tokens)}

# Build the vocabulary
vocabulary = {index: word for index, word in enumerate(tokens)}

# Check your vocabulary by running this:
vocabulary

Access the notebook on GitHub | Kaggle

