Persian text preprocessing with Hazm
This is a basic preprocessing pipeline for digesting and cleaning Persian text that:
- normalizes the text
- tokenizes the text
- removes punctuation and symbols
- removes numbers
- stems the tokens
- lemmatizes the text
- removes stopwords
- conducts Part-of-Speech (POS) tagging
Notes on pipeline
This pipeline carries out Part-of-Speech (POS) tagging of the Persian-language .csv file. That is, the pipeline assigns each word in a text a specific part of speech (such as noun, verb, or adjective). The output of POS tagging is a sequence of word-tag pairs, where each pair represents a word in the text and its corresponding POS tag. These tags can be used for various NLP tasks, such as text classification, named entity recognition, and sentiment analysis, among others.
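As a quick illustration of these word-tag pairs, here is a minimal sketch, assuming hazm is installed and the postagger.model file has been downloaded as described below; the sentence and model path are placeholders:
from hazm import Normalizer, WordTokenizer, POSTagger
normalizer = Normalizer()
tokenizer = WordTokenizer()
tagger = POSTagger(model='file_path_to_tagger_model/postagger.model')
sentence = normalizer.normalize('او کتاب را خواند')  # "He/she read the book"
print(tagger.tag(tokenizer.tokenize(sentence)))
# prints a list of (word, tag) tuples; the exact tag labels depend on the model version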
The pipeline was developed on Linux, where it runs without difficulties. On Windows you may encounter the error ModuleNotFoundError: No module named 'wapiti' (raised by from wapiti import Model). To solve this issue, install WSL (Windows Subsystem for Linux) by running wsl --install in the command prompt, then install python, pip, hazm and the other necessary packages in WSL and run the code there. This is only necessary if you need to carry out POS tagging.
This Hazm code relies on trained tagger and parser models, which you can download from the Hazm documentation page. Download and extract the files to a folder of your choice and provide the file paths to these resources in the relevant parts of the code below.
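Once the archives are extracted, a quick sanity check before running the pipeline is to confirm that the model files are where you expect them; a minimal sketch, where the paths are placeholders that should match those used in the code below:
import os
# placeholder paths; replace with the locations of your extracted model files
for path in ['file_path_to_tagger_model/postagger.model',
             'file_path_to_chunker_model/chunker.model']:
    print(path, 'found' if os.path.exists(path) else 'missing')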
Load libraries
import requests
import re
from hazm import *
import pandas as pd
import os
Load the Persian .csv file
The file should load into a dataframe with text items as rows and a column called 'text' whose values are strings.
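For reference, the expected structure looks like the small hypothetical dataframe below (the sentences are placeholders only; this uses the pandas import from above):
sample_df = pd.DataFrame({'text': ['هوا امروز خوب است.',    # "The weather is good today."
                                   'من دیروز کتاب خواندم.']})  # "I read a book yesterday."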
df = pd.read_csv('persian_language_file.csv')
Create function to process text
def preprocess_text(text):
    # set up the Hazm tools
    # (note: the models and the stopword list are loaded on every call;
    # for large datasets you may want to move this setup outside the function)
    normalizer = Normalizer()
    tokenizer = WordTokenizer(join_verb_parts=False)
    stemmer = Stemmer()
    lemmatizer = Lemmatizer()
    tagger = POSTagger(model='file_path_to_tagger_model/postagger.model')
    chunker = Chunker(model='file_path_to_chunker_model/chunker.model')

    # fetch the Hazm stopword list
    response = requests.get("https://raw.githubusercontent.com/roshan-research/hazm/master/hazm/data/stopwords.dat")
    stopwords = set(response.content.decode('utf-8').split('\n'))

    # remove punctuation, symbols and numbers, then normalize and tokenize
    text = re.sub(r'[^\w\s]|\d+', '', text)
    text = normalizer.normalize(text)
    tokens = tokenizer.tokenize(text)

    # POS-tag the tokens, chunk the tagged sequence, and convert the chunk tree
    # to a bracketed string before splitting it back into tokens
    tagged_tokens = tagger.tag(tokens)
    chunked_sentence = tree2brackets(chunker.parse(tagged_tokens))[1:-1]
    tokens = chunked_sentence.split(' ')

    # stem and lemmatize, then drop stopwords
    tokens = [stemmer.stem(token) for token in tokens]
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    tokens = [token for token in tokens if token not in stopwords]
    return " ".join(tokens)
Specify location of Java class files and libraries
os.environ['CLASSPATH'] = 'file_path_to_tagger_and_parser_models'
Apply the function to the text column in the csv file
Option 1: The regular way
Uses a single core. Can be slow on large datasets.
df["text"] = df["text"].apply(preprocess_text)Option 2: With parallalization
The more cores you have, the faster it runs.
from pandarallel import pandarallel
pandarallel.initialize(progress_bar=True)
df["text"] = df["text"].parallel_apply(preprocess_text)
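By default pandarallel uses all available cores. If you want to cap resource use, the number of worker processes can be set explicitly when initializing; the value below is just an example:
# cap pandarallel at four worker processes instead of using every core
pandarallel.initialize(nb_workers=4, progress_bar=True)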
Save the processed .csv file to working directory
df.to_csv('processed.csv')
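Optionally, if you prefer not to write the pandas index as an extra column:
# same as above, but without the index column
df.to_csv('processed.csv', index=False)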