Japanese preprocessing with Janome

Author

Erik Skare

Published

March 31, 2023

This is a basic preprocessing pipeline that uses Janome to digest and clean Japanese text: it removes URLs, punctuation, and numbers, tokenizes the text into morphemes, filters out unwanted parts of speech and stopwords, and returns the lemmatized (base-form) tokens.

Notes

  • Janome is a Japanese morphological analyzer (tokenizer and POS tagger) written in pure Python, with the dictionary and language model built in. Every analyzer has its advantages and disadvantages (many prefer the sophistication of MeCab), but Janome was chosen here because it is fairly easy to use.
  • Janome uses mecab-ipadic-2.7.0-20070801 as its built-in dictionary. The new Japanese era name “令和” (Reiwa) has been included in the dictionary since v0.3.8.
  • This pipeline tokenizes Japanese text into morphemes (word units) and lemmatizes them by returning the base form (lemma) of each morpheme along with its part-of-speech (POS) tag. The preprocess() function tokenizes the input text with the janome.tokenizer.Tokenizer() class and removes unwanted particles, auxiliary verbs, symbols, and prefixes (助詞, 助動詞, 記号, 接頭辞) with the janome.tokenfilter.POSStopFilter() class. The janome.analyzer.Analyzer() class chains these token filters and applies them to the token stream; a short demo of the tokenizer output follows this list.
  • A custom list of Japanese stopwords is provided in the code itself as stop_words = [...]. If there are stopwords you want to add or remove, modify the list provided below.
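
To see what the tokenizer produces before building the full pipeline, the minimal sketch below prints the surface form, top-level POS tag, and base form (lemma) for each token of a short example sentence. The sentence is purely illustrative, and the exact output depends on the dictionary version.

from janome.tokenizer import Tokenizer

tokenizer = Tokenizer()

# Example sentence: "I read a book at the library." (illustrative only)
for token in tokenizer.tokenize('図書館で本を読みました。'):
    # surface = form as it appears in the text, base_form = the lemma
    print(token.surface, token.part_of_speech.split(',')[0], token.base_form)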

Load libraries

import pandas as pd
import os
import re
from janome.tokenizer import Tokenizer
from janome.analyzer import Analyzer
from janome.tokenfilter import POSStopFilter, LowerCaseFilter

Load the Japanese .csv file

df = pd.read_csv('/full/file/path/to/your/japanese_language_file.csv')
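
If you just want to try the pipeline without a CSV file, you can instead build a small throwaway DataFrame. The column name text matches what the rest of the pipeline expects; the sentences themselves are hypothetical sample data.

# Hypothetical sample data: two short Japanese sentences in a 'text' column
df = pd.DataFrame({'text': ['図書館で本を読みました。', '彼は毎朝コーヒーを飲みます。']})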

Define list of Japanese stopwords

stop_words = [
    'あそこ', 'あっ', 'あの', 'あのかた', 'あの人', 'あり', 'あります', 'ある', 'あれ', 'い', 'いう',
    'います', 'いる', 'う', 'うち', 'え', 'お', 'および', 'おり', 'おります', 'か', 'かつて', 'から',
    'が', 'き', 'ここ', 'こちら', 'こと', 'この', 'これ', 'これら', 'さ', 'さらに', 'し', 'しかし',
    'する', 'ず', 'せ', 'せる', 'そこ', 'そして', 'その', 'その他', 'その後', 'それ', 'それぞれ',
    'それで', 'た', 'ただし', 'たち', 'ため', 'たり', 'だ', 'だっ', 'つ', 'て', 'で', 'でき',
    'できる', 'です', 'では', 'でも', 'と', 'という', 'といった', 'とき', 'ところ', 'として', 'とともに',
    'とも', 'と共に', 'どこ', 'どの', 'な', 'ない', 'なお', 'なかっ', 'ながら', 'なく', 'なっ', 'など',
    'なに', 'なら', 'なり', 'なる', 'なん', 'に', 'において', 'における', 'について', 'にて', 'によって',
    'により', 'による', 'に対して', 'に対する', 'に関する', 'の', 'ので', 'のみ', 'は', 'ば', 'へ', 'ほか',
    'ほとんど', 'ほど', 'ます', 'また', 'または', 'まで', 'も', 'もの', 'ものの', 'や', 'よう', 'より',
    'ら', 'られ', 'られる', 'り', 'る', 'れ', 'れる', 'を', 'ん'
]
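
Because the stopword check is a membership test performed once per token, converting the list to a set is an optional micro-optimization; the preprocessing function below works identically either way.

# Optional: set lookups are O(1), list lookups are O(n)
stop_words = set(stop_words)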

Define the preprocessing function

def preprocess(text):
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove punctuation, symbols, and numbers
    text = re.sub(r'[^\w\s]|\d+', '', text)
    # Build the analyzer: tokenize, then drop particles, auxiliary verbs,
    # symbols, and prefixes, and lowercase any Latin characters
    tokenizer = Tokenizer()
    analyzer = Analyzer(tokenizer=tokenizer,
                        token_filters=[POSStopFilter(['助詞', '助動詞', '記号', '接頭辞']),
                                       LowerCaseFilter()])
    # Keep the base form (lemma) of nouns and verbs, dropping stopwords
    lemmatized_words = [token.base_form
                        for token in analyzer.analyze(text)
                        if token.part_of_speech.split(',')[0] in ['名詞', '動詞']
                        and token.base_form not in stop_words]
    # Return the list of lemmatized words
    return lemmatized_words
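
As a quick sanity check, you can run preprocess() on a single sentence before applying it to the whole dataset. The sentence below is illustrative, and the exact tokens returned may vary with the dictionary version.

# URLs, particles, and punctuation are stripped; expect something like ['図書館', '本', '読む']
print(preprocess('図書館で本を読みました。 http://example.com'))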

Apply the function to the text column in the csv file

Option 1: The regular way

Uses a single core. Can be slow on large datasets.

df["text"] = df["text"].apply(preprocess_text)

Option 2: With parallelization

The more cores you have, the faster it runs.

from pandarallel import pandarallel
pandarallel.initialize(progress_bar = True)
df["text"] = df["text"].parallel_apply(preprocess)
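
By default pandarallel uses all available cores. If you want to cap the number of worker processes (for example on a shared machine), initialize() also accepts an nb_workers argument; the value 4 below is just a placeholder.

from pandarallel import pandarallel
pandarallel.initialize(nb_workers = 4, progress_bar = True)
df["text"] = df["text"].parallel_apply(preprocess)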

Save the processed .csv file to working directory

df.to_csv('processed.csv')
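
If you do not want pandas to write the row index as an extra column, or if the file will be opened in Excel (which handles Japanese text more reliably with a BOM), you can pass the optional arguments below.

df.to_csv('processed.csv', index = False, encoding = 'utf-8-sig')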