Persian text preprocessing with Hazm
This is a basic preprocessing pipeline for ingesting and cleaning Persian text that:
- normalizes the text
- tokenizes the text
- removes punctuation and symbols
- removes numbers
- stems the tokens
- lemmatizes the tokens
- removes stopwords
- conducts Part-of-Speech (POS) tagging
Notes on pipeline
This pipeline carries out Part-of-Speech (POS) tagging of the text in the Persian-language .csv file. That is, the pipeline assigns each word in a text a specific part of speech (such as noun, verb, adjective, etc.). The output of POS tagging is a sequence of word-tag pairs, where each pair represents a word in the text and its corresponding POS tag. These tags can be used for various NLP tasks, such as text classification, named entity recognition, and sentiment analysis, among others.
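As a purely illustrative sketch of this output format, a tagged sentence is a list of (word, tag) tuples. The tag names below are made up for illustration; the actual tag set depends on the trained model.
# illustrative shape of POS-tagged output, not actual model output
tagged_tokens = [('من', 'PRON'), ('به', 'ADP'), ('مدرسه', 'NOUN'), ('رفتم', 'VERB')]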
The pipeline was developed on a Linux OS and runs there without difficulties. On Windows you may encounter the error message
from wapiti import Model
ModuleNotFoundError: No module named 'wapiti'
To solve this issue, install WSL (Windows Subsystem for Linux) by running wsl --install in the command prompt. Then install python, pip, hazm and the other necessary packages in WSL and run the code there. This is only necessary if you need to carry out POS tagging.
This Hazm code relies on trained tagger and parser models, which you can download from the Hazm documentation page. Download and extract the files to a preferred folder and provide the file path to the resources in the relevant parts of the code below.
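As a quick check that the downloaded model files can be found, you can load the tagger on its own and tag a short tokenized sentence. A minimal sketch, assuming the tagger was extracted to the placeholder path used in the pipeline below:
from hazm import WordTokenizer, POSTagger

tagger = POSTagger(model='file_path_to_tagger_model/postagger.model')  # placeholder path
tokens = WordTokenizer().tokenize('او به دانشگاه رفت')  # "He went to the university"
print(tagger.tag(tokens))  # prints a list of (word, tag) pairs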
Load libraries
import requests
import re
from hazm import *
import pandas as pd
import os
Load the Persian .csv file
The file is read into a dataframe with one text item per row and a column called ‘text’ whose values are strings.
df = pd.read_csv('persian_language_file.csv')
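If you want to try the pipeline without a file of your own, a dataframe of the expected shape can be built directly. The rows below are made-up example sentences.
# hypothetical two-row dataframe with the required 'text' column
df = pd.DataFrame({'text': ['این یک جمله آزمایشی است.', 'او کتاب جدیدی خرید.']})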
Create function to process text
def preprocess(text):
    # initialize Hazm components
    normalizer = Normalizer()
    tokenizer = WordTokenizer(join_verb_parts=False)
    stemmer = Stemmer()
    lemmatizer = Lemmatizer()
    tagger = POSTagger(model='file_path_to_tagger_model/postagger.model')
    chunker = Chunker(model='file_path_to_chunker_model/chunker.model')
    # fetch the Hazm stopword list
    response = requests.get("https://raw.githubusercontent.com/roshan-research/hazm/master/hazm/data/stopwords.dat")
    stopwords = response.content.decode('utf-8').split('\n')
    stopwords = set(stopwords)
    # remove punctuation, symbols and numbers, then normalize and tokenize
    text = re.sub(r'[^\w\s]|\d+', '', text)
    text = normalizer.normalize(text)
    tokens = tokenizer.tokenize(text)
    # POS-tag and chunk the tokens
    tagged_tokens = tagger.tag(tokens)
    chunked_sentence = tree2brackets(chunker.parse(tagged_tokens))[1:-1]
    tokens = chunked_sentence.split(' ')
    # stem, lemmatize and remove stopwords
    tokens = [stemmer.stem(token) for token in tokens]
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    tokens = [token for token in tokens if token not in stopwords]
    return " ".join(tokens)
Specify location of Java class files and libraries
os.environ['CLASSPATH'] = 'file_path_to_tagger_and_parser_models'
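At this point you can sanity-check the function on a single string before running it over the whole dataframe. The sentence below is just an example; the call requires the model files and an internet connection for the stopword download.
print(preprocess('او کتاب جدیدی را در کتابخانه خواند.'))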
Apply the function to the text column in the csv file
Option 1: The regular way
Uses a single core. Can be slow on large datasets.
"text"] = df["text"].apply(preprocess_text) df[
Option 2: With parallelization
The more cores you have, the faster this runs. Requires the pandarallel package, which is available via pip install pandarallel.
from pandarallel import pandarallel
pandarallel.initialize(progress_bar=True)
df.text = df.text.parallel_apply(preprocess)
Save the processed .csv file to working directory
df.to_csv('processed.csv')