Persian text preprocessing with Hazm

Author

Erik Skare

Published

March 18, 2023

This is a basic preprocessing pipeline for ingesting and cleaning Persian text. It normalizes and tokenizes the text, tags each token with its part of speech, chunks the tagged tokens, stems and lemmatizes them, and removes stopwords.

Notes on pipeline

  • This pipeline carries out Part-of-Speech (POS) tagging of the Persian-language .csv file. That is, it assigns each word in a text a specific part of speech (noun, verb, adjective, and so on). The output of POS tagging is a sequence of word-tag pairs, where each pair represents a word in the text and its corresponding POS tag (see the short example after these notes). These tags can be used for various NLP tasks, such as text classification, named entity recognition, and sentiment analysis, among others.

  • The pipeline was developed on a Linux OS and runs there without difficulties. On Windows you may encounter the error ModuleNotFoundError: No module named 'wapiti' (raised by from wapiti import Model). To solve this, install WSL (Windows Subsystem for Linux) by running wsl --install in the command prompt, then install Python, pip, Hazm, and the other necessary packages inside WSL and run the code there. This is only necessary if you need to carry out POS tagging.

  • This Hazm code relies on trained tagger and chunker models, which you can download from the Hazm documentation page. Download and extract the files to a preferred folder and provide the file path to these resources in the relevant parts of the code below.
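
To see what the tagger's output looks like, here is a minimal sketch (the model path is a placeholder, and the exact tag labels depend on the Hazm version and model you use):

from hazm import POSTagger, word_tokenize

# Placeholder path: point this at the postagger.model file you downloaded
tagger = POSTagger(model='file_path_to_tagger_model/postagger.model')
tokens = word_tokenize('ما بسیار کتاب می‌خوانیم')
print(tagger.tag(tokens))
# A list of (word, tag) pairs, e.g. [('ما', 'PRO'), ('بسیار', 'ADV'), ('کتاب', 'N'), ('می‌خوانیم', 'V')]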

Load libraries

import requests
import re
from hazm import *
import pandas as pd
import os

Load the Persian .csv file

The expected input is a dataframe with one text item per row and a column called ‘text’ whose values are strings.

df = pd.read_csv('persian_language_file.csv')
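
If you just want to test the pipeline without a real dataset, a small stand-in dataframe with the same shape can be built directly (the sentences below are arbitrary sample text):

df = pd.DataFrame({'text': ['ما بسیار کتاب می‌خوانیم', 'هوا امروز خوب است']})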

Create function to process text

def preprocess_text(text):
    # Load Hazm tools (for large datasets, consider creating these once outside the function)
    normalizer = Normalizer()
    tokenizer = WordTokenizer(join_verb_parts=False)
    stemmer = Stemmer()
    lemmatizer = Lemmatizer()
    tagger = POSTagger(model='file_path_to_tagger_model/postagger.model')
    chunker = Chunker(model='file_path_to_chunker_model/chunker.model')
    # Fetch Hazm's stopword list from the project repository
    response = requests.get("https://raw.githubusercontent.com/roshan-research/hazm/master/hazm/data/stopwords.dat")
    stopwords = set(response.content.decode('utf-8').split('\n'))
    # Strip punctuation and digits, then normalize the remaining text
    text = re.sub(r'[^\w\s]|\d+', '', text)
    text = normalizer.normalize(text)
    # Tokenize, POS-tag, and chunk; tree2brackets returns a bracketed string
    tokens = tokenizer.tokenize(text)
    tagged_tokens = tagger.tag(tokens)
    chunked_sentence = tree2brackets(chunker.parse(tagged_tokens))[1:-1]
    tokens = chunked_sentence.split(' ')
    # Stem, lemmatize, and drop stopwords
    tokens = [stemmer.stem(token) for token in tokens]
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    tokens = [token for token in tokens if token not in stopwords]
    return " ".join(tokens)
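
As a quick sanity check, you can call the function on a single string before applying it to the whole dataframe (the exact output depends on the models and Hazm version installed):

sample = 'ما بسیار کتاب می‌خوانیم'
print(preprocess_text(sample))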

Specify location of Java class files and libraries

os.environ['CLASSPATH'] = 'file_path_to_tagger_and_parser_models'

Apply the function to the text column in the csv file

Option 1: The regular way

Uses a single core. Can be slow on large datasets.

df["text"] = df["text"].apply(preprocess_text)

Option 2: With parallelization

The more cores you have, the faster it runs.

from pandarallel import pandarallel
pandarallel.initialize(progress_bar = True)
df.text = df.text.parallel_apply(preprocess_text)
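
By default pandarallel uses all available cores; if you want to cap that (for example on a shared machine), the number of workers can be set explicitly when initializing:

pandarallel.initialize(progress_bar=True, nb_workers=4)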

Save the processed .csv file to working directory

df.to_csv('processed.csv')
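
If the processed file will be opened in Excel, Persian text may display incorrectly with plain UTF-8; writing with a byte-order mark and dropping the pandas index is one option (not required for the pipeline itself):

df.to_csv('processed.csv', index=False, encoding='utf-8-sig')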