Use BERT to tokenize text


What exactly is a token? This article introduces the concept.

Think of it this way: a token is the smallest unit an AI model processes in natural language scenarios such as text generation and chat.

In computer vision, a model works with images pixel by pixel, reasoning about the relationships between pixels; a language model likewise uses the token as its basic unit and learns the relationships between tokens.

Below, I will use a small example to show how a language model converts a piece of text into tokens, and what the result looks like.

Through this example, I hope you gain an intuitive feel for how a model processes tokens.

1. How BERT converts text

We use the BERT model to process text.

BERT, which stands for Bidirectional Encoder Representations from Transformers, is a natural language processing model proposed by Google in 2018.

At the time, this model achieved strong performance on many NLP tasks, including text question answering, natural language inference, and sentiment analysis; it can fairly be said to have been state of the art (SOTA) for a long time.

We can use the following code to call the BERT tokenizer on a piece of text.

Note that when running the code below for the first time, any missing libraries need to be installed.

For example, use pip3 install transformers to install the transformers library.

When the code runs, BERT's tokenizer and related files will be downloaded from Hugging Face, so make sure your environment is configured to reach Hugging Face.

The first download may take a while; just wait for it to finish.

Once the download completes, the files are stored in the system cache directory; subsequent runs use the cached copies directly instead of downloading again.

# Import the BertTokenizer class from the transformers library
from transformers import BertTokenizer
# Initialize the tokenizer with the 'bert-base-uncased' pretrained model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Define an example text string
text = "I debug my code all day, and I am like a debugger."

# Encode the text: split it into tokens and map each token to its ID
encoded_input = tokenizer(text)

# Convert the token IDs back into readable tokens
tokens = tokenizer.convert_ids_to_tokens(encoded_input['input_ids'])

# Print the tokens and their corresponding IDs
print("Tokens:", tokens)
print("Token IDs:", encoded_input['input_ids'])

In the above code, the text I defined is “I debug my code all day, and I am like a debugger.”

2. Token explanation

After running the above code and tokenizing the text with the BERT model, the output tokens are:

Tokens: ['[CLS]', 'i', 'de', '##bu', '##g', 'my', 'code', 'all', 'day', ',', 'and', 'i', 'am', 'like', 'a', 'de', '##bu', '##gger', '.', '[SEP]']

As you can see, the model splits the text: the word debug is split into three subwords (de, ##bu, ##g).

Likewise, debugger is split into three subwords (de, ##bu, ##gger).

Among the tokens produced by the BERT model above, [CLS] (classifier token) is a special marker for the start of the input.

[SEP] (separator token) is a special symbol that separates different sentences or segments.

For example, when processing two sentences, [SEP] marks the boundary between them. This helps the model identify sentence boundaries correctly, so it can understand the semantic relationship between multiple sentences.

The ## prefix can be thought of as a subword-continuation marker.

The BERT model uses the WordPiece algorithm to segment words, which breaks some words into smaller units than the words themselves.

For example, debug is split into three pieces: de, ##bu, ##g. The ## indicates that a subword attaches to the preceding piece of the same word.
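The core of WordPiece, greedy longest-match-first splitting, can be sketched in a few lines of plain Python. The tiny vocabulary below is a toy stand-in for BERT's real vocabulary of roughly 30,000 entries, and this sketch omits details of the real algorithm (such as how the vocabulary is learned):

```python
def wordpiece_split(word, vocab):
    """Split a word into subwords by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, shrinking until it matches
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces get ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no split possible: emit the unknown token
        pieces.append(piece)
        start = end
    return pieces

vocab = {"de", "##bu", "##g", "##gger", "code"}
print(wordpiece_split("debug", vocab))     # ['de', '##bu', '##g']
print(wordpiece_split("debugger", vocab))  # ['de', '##bu', '##gger']
```

This reproduces the splits from the BERT output above: debug and debugger share the pieces de and ##bu, which is exactly what lets the model relate rare words to more common fragments.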

We don't need to dwell on these special symbols. Other algorithms and models may output different special symbols, but they are essentially all used to segment and annotate the text.

In addition to the tokens, the code above also outputs the ID corresponding to each token.

Token IDs: [101, 1045, 2139, 8569, 2290, 2026, 3642, 2035, 2154, 1010, 1998, 1045, 2572, 2066, 1037, 2139, 8569, 13327, 1012, 102]

Each ID is the token's unique index in the model's vocabulary; in other words, it is how text is converted into a numeric form the computer can recognize and process.
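The token-to-ID mapping can be sketched with a plain dictionary. The IDs below are taken from the real output shown above, but the five-entry vocabulary itself is only a toy illustration of the lookup:

```python
# Toy slice of a vocabulary: token -> ID (IDs match the BERT output above)
vocab = {"[CLS]": 101, "[SEP]": 102, "i": 1045, "my": 2026, "code": 3642}
# Reverse mapping: ID -> token
id_to_token = {i: t for t, i in vocab.items()}

def tokens_to_ids(tokens):
    """Encode: look up each token's ID in the vocabulary."""
    return [vocab[t] for t in tokens]

def ids_to_tokens(ids):
    """Decode: map each ID back to its token."""
    return [id_to_token[i] for i in ids]

ids = tokens_to_ids(["[CLS]", "i", "code", "[SEP]"])
print(ids)                  # [101, 1045, 3642, 102]
print(ids_to_tokens(ids))   # ['[CLS]', 'i', 'code', '[SEP]']
```

This round trip (tokens to IDs and back) is exactly what tokenizer() and convert_ids_to_tokens() did in the earlier code, just at the scale of BERT's full vocabulary.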

You can modify the text in the above code to see what it looks like after the model tokenizes it.


Different models use different tokenization algorithms; GPT-style models, for example, use byte-pair encoding (BPE) rather than WordPiece.
