Chinese text preprocessing with jieba
A basic text preprocessing pipeline for Chinese that:
- removes symbols, punctuation, URLs, and non-Chinese characters
- normalizes character variants
- carries out word segmentation (tokenization)
- removes stopwords
Notes
- This code does not convert Traditional Chinese characters into Simplified Chinese, mainly because such conversion can be challenging and reliable tools are scarce.
- This code does, however, normalize character variants to a common standard form. This is necessary because Chinese has many variant characters that can represent the same word or concept, owing to historical changes in the writing system and regional differences in usage.
- This code carries out word segmentation because Chinese does not use spaces to separate words, which makes unsegmented text difficult to analyze or process. Word segmentation is thus necessary for many downstream tasks such as information retrieval, sentiment analysis, and machine translation.
- The code may struggle with segmenting non-Chinese proper names that do not appear in the Chinese dictionaries the code relies on. 沿着佩诺布斯科特河 (Penobscot River), for example, is turned into 佩诺布 斯科特.
- The code relies on the `t2s.json` file to normalize character variants. You can find it HERE. Copy the text and paste it into a plaintext editor (like Notepad). Save the file as `t2s.json` in your working directory.
Load libraries
```python
import pandas as pd
import re
import requests
import jieba
import opencc
```

Load the Chinese .csv file
```python
df = pd.read_csv('/path/to/your/chinese_language_file.csv')
```

Download Chinese stopword list
```python
stopwords_url = 'https://raw.githubusercontent.com/stopwords-iso/stopwords-zh/master/stopwords-zh.txt'
response = requests.get(stopwords_url)
response.encoding = 'utf-8'
chinese_stopwords = set(response.text.split())
```

Define regular expression
The regular expression (regex) is used to remove punctuation, symbols, numbers, and non-Chinese characters.
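As a quick illustration (the sample string is invented for the example), a pattern that matches everything outside the CJK Unified Ideographs block (U+4E00 to U+9FA5) strips Latin letters, digits, and both ASCII and full-width punctuation when its matches are replaced with the empty string:

```python
import re

# Matches any character OUTSIDE the CJK Unified Ideographs block.
pattern = re.compile(r'[^\u4e00-\u9fa5]')

cleaned = pattern.sub('', '你好, World! 2024年。')
print(cleaned)  # → 你好年
```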
```python
pattern = re.compile(r'[^\u4e00-\u9fa5]')
```

Converter object to normalize character variants
```python
converter = opencc.OpenCC('/path/to/your/t2s.json')
```

Create function to process text
```python
def preprocess(text):
    text = re.sub(r'http\S+', '', text)  # remove URLs
    text = pattern.sub('', text)         # remove non-Chinese characters
    text = converter.convert(text)       # normalize character variants
    seg_list = jieba.cut(text)           # word segmentation
    tokens = [token for token in seg_list if token not in chinese_stopwords]
    return ' '.join(tokens)
```

Apply function to text column in .csv file
Option 1: The regular way

Uses a single core and can be slow on large datasets.

```python
df['text'] = df['text'].apply(preprocess)
```

Option 2: With parallelization
The more cores you have, the faster this runs.

```python
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True)
df['text'] = df['text'].parallel_apply(preprocess)
```

Save the processed .csv to working directory
```python
df.to_csv('processed.csv', index=False)  # index=False keeps the row index out of the file
```