Japanese preprocessing with Janome
This is a basic preprocessing pipeline that uses Janome to clean Japanese text. It:
- tokenizes the text
- lemmatizes tokens
- removes stopwords
- removes punctuation
- removes symbols
- removes numbers
- removes URLs
Notes
- Janome is a Japanese morphological analyzer (tokenizer and POS tagger) written in pure Python, with the dictionary and language model built in. Each analyzer has its own advantages and disadvantages (many prefer the sophistication of MeCab), but Janome was chosen here because it is easy to use.
- Janome uses mecab-ipadic-2.7.0-20070801 as its built-in dictionary. The new Japanese era name "令和" (Reiwa) has been included in the dictionary since v0.3.8.
- This pipeline tokenizes Japanese text into morphemes (word units) and lemmatizes the tokens by providing the base form (lemma) of each morpheme, along with its Part of Speech (POS) tag. In this code, the preprocess() function tokenizes the input text with Janome's janome.tokenizer.Tokenizer() class. It then filters out certain parts of speech (助詞, 助動詞, 記号, 接頭辞) with the janome.tokenfilter.POSStopFilter() class, which removes unnecessary particles, auxiliary verbs, symbols, and prefixes. The remaining words are lemmatized with the janome.analyzer.Analyzer() class, which applies a sequence of token filters to the input tokens.
- A custom list of Japanese stopwords is provided in the code itself as stop_words = [...]. If there are stopwords you want to add or remove, modify the list below.
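The POS filtering described above can be sketched without Janome itself; the (surface, POS) pairs below are made-up illustrations of what POSStopFilter sees, not real Janome output:

```python
# Hypothetical (surface, POS) pairs, as a POS tagger might produce them
tagged = [('今日', '名詞'), ('は', '助詞'), ('映画', '名詞'),
          ('を', '助詞'), ('見る', '動詞'), ('。', '記号')]

# Drop the same POS classes that POSStopFilter(['助詞', '助動詞', '記号', '接頭辞']) removes
drop_pos = {'助詞', '助動詞', '記号', '接頭辞'}
kept = [surface for surface, pos in tagged if pos not in drop_pos]

print(kept)  # ['今日', '映画', '見る']
```

Only content words (here: two nouns and a verb) survive; particles and punctuation are discarded.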
Load libraries

```python
import pandas as pd
import os
import re
from janome.tokenizer import Tokenizer
from janome.analyzer import Analyzer
from janome.tokenfilter import POSStopFilter, LowerCaseFilter
```
Load the Japanese .csv file

```python
df = pd.read_csv('/full/file/path/to/your/japanese_language_file.csv')
```
Define list of Japanese stopwords

```python
stop_words = [
    'あそこ', 'あっ', 'あの', 'あのかた', 'あの人', 'あり', 'あります', 'ある', 'あれ', 'い', 'いう',
    'います', 'いる', 'う', 'うち', 'え', 'お', 'および', 'おり', 'おります', 'か', 'かつて', 'から',
    'が', 'き', 'ここ', 'こちら', 'こと', 'この', 'これ', 'これら', 'さ', 'さらに', 'し', 'しかし',
    'する', 'ず', 'せ', 'せる', 'そこ', 'そして', 'その', 'その他', 'その後', 'それ', 'それぞれ',
    'それで', 'た', 'ただし', 'たち', 'ため', 'たり', 'だ', 'だっ', 'つ', 'て', 'で', 'でき',
    'できる', 'です', 'では', 'でも', 'と', 'という', 'といった', 'とき', 'ところ', 'として', 'とともに',
    'とも', 'と共に', 'どこ', 'どの', 'な', 'ない', 'なお', 'なかっ', 'ながら', 'なく', 'なっ', 'など',
    'なに', 'なら', 'なり', 'なる', 'なん', 'に', 'において', 'における', 'について', 'にて', 'によって',
    'により', 'による', 'に対して', 'に対する', 'に関する', 'の', 'ので', 'のみ', 'は', 'ば', 'へ', 'ほか',
    'ほとんど', 'ほど', 'ます', 'また', 'または', 'まで', 'も', 'もの', 'ものの', 'や', 'よう', 'より',
    'ら', 'られ', 'られる', 'り', 'る', 'れ', 'れる', 'を', 'ん'
]
```
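As noted above, this list is meant to be edited. A minimal sketch of adding and removing entries (the words chosen here are arbitrary examples, and the list is a shortened stand-in for the full one):

```python
stop_words = ['これ', 'それ', 'ほとんど']  # shortened stand-in for the full list above

# Add a word you want filtered out
stop_words.append('こちらこそ')

# Remove a word you want kept as a real token
stop_words.remove('ほとんど')

print(stop_words)  # ['これ', 'それ', 'こちらこそ']
```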
Define the preprocessing function

```python
def preprocess(text):
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove punctuation, symbols, and numbers
    text = re.sub(r'[^\w\s]|\d+', '', text)
    # Build an analyzer: tokenize, drop particles, auxiliary verbs,
    # symbols, and prefixes, and lowercase any Latin characters
    tokenizer = Tokenizer()
    analyzer = Analyzer(tokenizer=tokenizer,
                        token_filters=[POSStopFilter(['助詞', '助動詞', '記号', '接頭辞']),
                                       LowerCaseFilter()])
    # Lemmatize each remaining token to its base form, skipping stopwords
    lemmatized_words = [token.base_form for token in analyzer.analyze(text)
                        if token.surface not in stop_words]
    # Return the list of lemmatized words
    return lemmatized_words
```
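The two regex substitutions at the top of preprocess() can be checked in isolation; the sample sentence below is made up for illustration:

```python
import re

sample = '詳細は https://example.com を参照。価格は1200円です!'

# Remove URLs: anything from "http" up to the next whitespace
no_urls = re.sub(r'http\S+', '', sample)

# Remove punctuation/symbols (non-word, non-space chars) and digit runs
cleaned = re.sub(r'[^\w\s]|\d+', '', no_urls)

print(cleaned)  # the URL, '。', '!', and '1200' are all gone
```

Note that Japanese characters count as word characters (`\w`) in Python's `re`, so only punctuation, symbols, and digits are stripped.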
Apply the function to the text column in the csv file

Option 1: The regular way

Uses a single core, which can be slow on large datasets.

```python
df["text"] = df["text"].apply(preprocess)
```
Option 2: With parallelization

The more cores you have, the faster this runs.

```python
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True)
df["text"] = df["text"].parallel_apply(preprocess)
```
Save the processed .csv file to the working directory

```python
df.to_csv('processed.csv')
```