Chinese text preprocessing with jieba
A basic text preprocessing pipeline for Chinese that:
- removes symbols, punctuation, URLs, and non-Chinese characters
- normalizes character variants
- tokenizes the text by carrying out word segmentation
- removes stopwords
Notes
- This code does not convert Traditional Chinese characters into Simplified Chinese. This is mainly because conversion can be challenging and reliable tools are scarce.
- This code does, however, normalize character variants to a common standard form. This is necessary because, owing to historical changes in the writing system and regional differences in usage, Chinese has many character variants that represent the same word or concept.
- This code carries out word segmentation because Chinese does not use spaces or punctuation to separate words, which makes raw text difficult to analyze or process. Segmentation is thus necessary for many downstream tasks such as information retrieval, sentiment analysis, and machine translation (see the short example after these notes).
- The code may struggle with segmenting non-Chinese proper names that do not appear in Chinese dictionaries (which the code relies on).
沿着佩诺布斯科特河 (Penobscot River), for example, is turned into 佩诺布 斯科特.
- The code relies on the t2s.json file to normalize character variants. You can find it HERE. Copy the text and paste it into a plaintext editor (like Notepad). Save the file as t2s.json in your working directory.
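To make the segmentation step concrete, here is a minimal, self-contained sketch (the sample sentence is my own; exact token boundaries depend on the dictionary version that ships with jieba):

import jieba

print(list(jieba.cut('我爱自然语言处理')))
# e.g. ['我', '爱', '自然语言', '处理']; jieba inserts the word boundaries
# that the raw text lacks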
Load libraries
import pandas as pd
import re
import requests
import jieba
import opencc
Load the Chinese .csv file
df = pd.read_csv('/path/to/your/chinese_language_file.csv')
Download Chinese stopword list
stopwords_url = 'https://raw.githubusercontent.com/stopwords-iso/stopwords-zh/master/stopwords-zh.txt'
response = requests.get(stopwords_url)
response.encoding = 'utf-8'
chinese_stopwords = set(response.text.split())
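As a quick sanity check (the exact count depends on the version of the stopwords-iso list you download):

print(len(chinese_stopwords))       # a few hundred entries
print('的' in chinese_stopwords)    # True: common function words are included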
Define regular expression
The regular expression (regex) matches any character outside the basic CJK range (U+4E00 to U+9FA5); substituting its matches removes punctuation, symbols, numbers, and other non-Chinese characters.
pattern = re.compile(r'[^\u4e00-\u9fa5]')
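For example, substituting every match with the empty string strips Latin letters, digits, whitespace, and punctuation while keeping the Chinese characters (the sample string is illustrative):

print(pattern.sub('', '沿着Penobscot River走了3公里!'))
# -> 沿着走了公里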
Converter object to normalize character variants
converter = opencc.OpenCC('/path/to/your/t2s.json')
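A minimal check of the converter, assuming the t2s.json you saved maps traditional and variant forms to standard simplified ones (the exact mappings depend on the file's contents):

print(converter.convert('臺灣'))
# e.g. 台湾, if 臺 and 灣 are covered by the mapping table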
Create function to process text
def preprocess(text):
    text = re.sub(r'http[^\s\u4e00-\u9fa5]+', '', text)  # remove URLs; stop at whitespace or a Chinese character so following text survives
    text = pattern.sub('', text)                         # remove remaining non-Chinese characters
    text = converter.convert(text)                       # normalize character variants
    seg_list = jieba.cut(text)                           # word segmentation
    tokens = [token for token in seg_list if token not in chinese_stopwords]  # remove stopwords
    return ' '.join(tokens)
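A quick test on a single string. The sample sentence is illustrative; the exact tokens you get back depend on your stopword list, jieba's dictionary, and the mappings in t2s.json:

print(preprocess('我在http://example.com上讀到了一篇關於機器學習的文章!'))
# the URL, punctuation, and stopwords are removed; the remaining words are
# normalized, segmented, and joined with spaces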
Apply function to text column in .csv file
Option 1: The regular way
Uses a single core. Can be slow on large datasets.
df['text'] = df['text'].apply(preprocess)
Option 2: With parallelization
The more cores you have, the faster it runs.
from pandarallel import pandarallel
pandarallel.initialize(progress_bar=True)
df['text'] = df['text'].parallel_apply(preprocess)
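Note that pandarallel runs the function in separate worker processes, so each worker may load jieba's dictionary on its own; on small files this startup overhead can outweigh the speedup.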
Save the processed .csv to the working directory
df.to_csv('processed.csv')