Chinese text preprocessing with jieba
A basic text preprocessing pipeline for Chinese that:
- removes symbols, punctuation, URLs, and non-Chinese characters
- normalizes character variants
- carries out word segmentation (tokenization)
- removes stopwords
Notes
- This code does not convert Traditional Chinese characters into Simplified Chinese, mainly because such conversion can be challenging and reliable tools are scarce.
- This code does, however, normalize character variants to a common standard form. This is necessary because Chinese has many variant characters that can represent the same word or concept, owing to historical changes in the writing system and regional differences in usage.
- This code carries out word segmentation because Chinese does not use spaces to separate words, which makes unsegmented text difficult to analyze or process. Word segmentation is thus necessary for many downstream tasks such as information retrieval, sentiment analysis, and machine translation.
- The code may struggle with segmenting non-Chinese proper names that do not appear in the Chinese dictionaries the code relies on. 沿着佩诺布斯科特河 (Penobscot River), for example, is turned into 佩诺布 斯科特.
- The code relies on the `t2s.json` file to normalize character variants. You can find it HERE. Copy the text and paste it into a plaintext editor (like Notepad). Save the file as `t2s.json` in your working directory.
Load libraries
```python
import pandas as pd
import re
import requests
import jieba
import opencc
```

Load the Chinese .csv file
```python
df = pd.read_csv('/path/to/your/chinese_language_file.csv')
```

Download Chinese stopword list
```python
stopwords_url = 'https://raw.githubusercontent.com/stopwords-iso/stopwords-zh/master/stopwords-zh.txt'
response = requests.get(stopwords_url)
response.encoding = 'utf-8'
chinese_stopwords = set(response.text.split())
```

Define regular expression
The regular expression (regex) is used to remove punctuation, symbols, numbers, and non-Chinese characters.
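As a quick illustration (the sample string is invented for the example), a pattern that matches everything outside the CJK Unified Ideographs block (U+4E00 to U+9FA5) strips Latin letters, digits, and both ASCII and full-width punctuation when its matches are replaced with the empty string:

```python
import re

# Matches any character OUTSIDE the CJK Unified Ideographs block.
pattern = re.compile(r'[^\u4e00-\u9fa5]')

cleaned = pattern.sub('', '你好, World! 2024年。')
print(cleaned)  # → 你好年
```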
```python
pattern = re.compile(r'[^\u4e00-\u9fa5]')
```

Converter object to normalize character variants
```python
converter = opencc.OpenCC('/path/to/your/t2s.json')
```

Create function to process text
```python
def preprocess(text):
    text = re.sub(r'http\S+', '', text)  # remove URLs
    text = pattern.sub('', text)         # remove non-Chinese characters
    text = converter.convert(text)       # normalize character variants
    seg_list = jieba.cut(text)           # word segmentation
    tokens = [token for token in seg_list if token not in chinese_stopwords]
    return ' '.join(tokens)
```

Apply function to text column in .csv file
Option 1: The regular way

Uses a single core and can be slow on large datasets.

```python
df['text'] = df['text'].apply(preprocess)
```

Option 2: With parallelization
The more cores you have, the faster this runs.

```python
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True)
df['text'] = df['text'].parallel_apply(preprocess)
```

Save the processed .csv to working directory
```python
df.to_csv('processed.csv', index=False)  # index=False keeps the row index out of the file
```