Persian text preprocessing with Hazm
This is a basic preprocessing pipeline for ingesting and cleaning Persian text that:
- normalizes the text
- tokenizes the text
- removes punctuation and symbols
- removes numbers
- stems the tokens
- lemmatizes the tokens
- removes stopwords
- conducts Part-of-Speech (POS) tagging
Notes on pipeline
This pipeline carries out Part-of-Speech (POS) tagging of the text in the Persian-language .csv file. That is, the pipeline assigns each word in a text a specific part of speech (such as noun, verb, adjective, etc.). The output of POS tagging is a sequence of word-tag pairs, where each pair represents a word in the text and its corresponding POS tag. These tags can be used for various NLP tasks, such as text classification, named entity recognition, and sentiment analysis, among others.
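As a purely illustrative sketch of this output format, a tagged sentence is a list of (word, tag) tuples. The tag names below are made up for illustration; the actual tag set depends on the trained model.
# illustrative shape of POS-tagged output, not actual model output
tagged_tokens = [('من', 'PRON'), ('به', 'ADP'), ('مدرسه', 'NOUN'), ('رفتم', 'VERB')]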
The pipeline was developed on a Linux OS and runs there without difficulties. On Windows you may encounter the error message
from wapiti import Model
ModuleNotFoundError: No module named 'wapiti'
To solve this issue, install WSL (Windows Subsystem for Linux) by running wsl --install in the command prompt. Then install python, pip, hazm and the other necessary packages in WSL and run the code there. This is only necessary if you need to carry out POS tagging.
This Hazm code relies on trained tagger and parser models, which you can download from the Hazm documentation page. Download and extract the files to a preferred folder and provide the file path to the resources in the relevant parts of the code below.
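As a quick check that the downloaded model files can be found, you can load the tagger on its own and tag a short tokenized sentence. A minimal sketch, assuming the tagger was extracted to the placeholder path used in the pipeline below:
from hazm import WordTokenizer, POSTagger

tagger = POSTagger(model='file_path_to_tagger_model/postagger.model')  # placeholder path
tokens = WordTokenizer().tokenize('او به دانشگاه رفت')  # "He went to the university"
print(tagger.tag(tokens))  # prints a list of (word, tag) pairs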
Load libraries
import requests
import re
from hazm import *
import pandas as pd
import os
Load the Persian .csv file
The file is read into a dataframe with one text item per row and a column called ‘text’ whose values are strings.
df = pd.read_csv('persian_language_file.csv')
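If you want to try the pipeline without a file of your own, a dataframe of the expected shape can be built directly. The rows below are made-up example sentences.
# hypothetical two-row dataframe with the required 'text' column
df = pd.DataFrame({'text': ['این یک جمله آزمایشی است.', 'او کتاب جدیدی خرید.']})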
Create function to process text
def preprocess(text):
    # initialize Hazm components
    normalizer = Normalizer()
    tokenizer = WordTokenizer(join_verb_parts=False)
    stemmer = Stemmer()
    lemmatizer = Lemmatizer()
    tagger = POSTagger(model='file_path_to_tagger_model/postagger.model')
    chunker = Chunker(model='file_path_to_chunker_model/chunker.model')
    # fetch the Hazm stopword list
    response = requests.get("https://raw.githubusercontent.com/roshan-research/hazm/master/hazm/data/stopwords.dat")
    stopwords = response.content.decode('utf-8').split('\n')
    stopwords = set(stopwords)
    # remove punctuation, symbols and numbers, then normalize and tokenize
    text = re.sub(r'[^\w\s]|\d+', '', text)
    text = normalizer.normalize(text)
    tokens = tokenizer.tokenize(text)
    # POS-tag and chunk the tokens
    tagged_tokens = tagger.tag(tokens)
    chunked_sentence = tree2brackets(chunker.parse(tagged_tokens))[1:-1]
    tokens = chunked_sentence.split(' ')
    # stem, lemmatize and remove stopwords
    tokens = [stemmer.stem(token) for token in tokens]
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    tokens = [token for token in tokens if token not in stopwords]
    return " ".join(tokens)
Specify location of Java class files and libraries
os.environ['CLASSPATH'] = 'file_path_to_tagger_and_parser_models'
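At this point you can sanity-check the function on a single string before running it over the whole dataframe. The sentence below is just an example; the call requires the model files and an internet connection for the stopword download.
print(preprocess('او کتاب جدیدی را در کتابخانه خواند.'))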
Apply the function to the text column in the csv file
Option 1: The regular way
Uses a single core. Can be slow on large datasets.
"text"] = df["text"].apply(preprocess_text) df[
Option 2: With parallelization
The more cores you have, the faster this runs. Requires the pandarallel package, which is available via pip install pandarallel.
from pandarallel import pandarallel
pandarallel.initialize(progress_bar=True)
df.text = df.text.parallel_apply(preprocess)
Save the processed .csv file to working directory
df.to_csv('processed.csv')