Turkish text preprocessing with Zemberek and Zeyrek
This is a basic preprocessing pipeline that uses Zemberek and Zeyrek to ingest and clean Turkish text. The pipeline:
- normalizes the text
- tokenizes the text
- lemmatizes tokens
- removes stopwords
- removes punctuation
- removes symbols
- removes numbers
- removes URLs
Notes
This pipeline uses Zemberek to tokenize and normalize the Turkish text in the .csv file. Zemberek is one of the most reliable Natural Language Processing (NLP) tools for Turkish, with features for morphological analysis, tokenization and sentence boundary detection, basic spell checking, word suggestion, noisy text normalization, Turkish Named Entity Recognition, and hashing functions, to mention just some. The pipeline also uses Zeyrek, a partial Python port of the Zemberek library, to lemmatize and analyze Turkish words.
As the installation notes below imply, it took some effort to make this code work, since Zemberek is written in Java. I am thus grateful for this very helpful guide on Turkish Text Preprocessing with Zemberek in Python, written and developed by Ayşe Kübra Kuyucu.
P.S. Zeyrek is in alpha stage, and according to its developers the API will probably change.
Installation notes
Compared to other NLP packages and libraries, the installation process requires slightly greater attention to detail and some patience for the code below to work. Please follow the steps below, as several of Zemberek’s features require associated libraries and files to function properly:
- The following steps assume that you have a working directory where you will store all required data, the Python script, and the Zemberek libraries. That is, your working directory will contain i) the Python script, ii) the `zemberek-full.jar` file (see the next step), and iii) a folder called `data`.
- Download `zemberek-full.jar` from Zemberek’s Google Drive (you will need to log in with your Google account to get access). You can find it by navigating `distributions` -> `0.17.1` -> `zemberek-full.jar`. Save the file to your working directory.
- Download the `normalization` folder from the same Zemberek archive. You can find it by navigating `data` -> `normalization`. Download the entire `normalization` folder and save it in the `data` folder in your working directory.
- Download the file called `lm.2gram.slm`. You will find it by navigating `data` -> `lm` -> `lm.2gram.slm`. Create a folder called `lm` inside your `data` folder and save the file there (so the file path is `/working_directory/data/lm/lm.2gram.slm`).
- Select all stopwords on this list by pressing `CTRL+A`, copy them with `CTRL+C`, paste the list into a plain-text editor (like Notepad), and save it as `stopwords.txt` in your `data` folder.
- Once all the associated data is downloaded and saved in its rightful place, all that remains is to provide the file/directory paths in the Python code below.
Load libraries
```python
import os
import re
from pathlib import Path

import pandas as pd
import zeyrek
import jpype
from jpype import JClass, JString, getDefaultJVMPath, shutdownJVM
```
Load the Turkish .csv file
The input is a dataframe with text items as rows and a column called ‘text’ whose values are strings.

```python
df = pd.read_csv('turkish_language_file.csv')
```
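If you want to dry-run the pipeline before pointing it at a real dataset, a tiny input file with the expected shape (a single `text` column of strings) can be written with the standard library. The two sample sentences are my own; the filename matches the one read above:

```python
import csv

# Two placeholder Turkish sentences, just to exercise the pipeline
rows = [
    {"text": "Merhaba dünya, bu bir deneme cümlesidir."},
    {"text": "Zemberek Türkçe metinler için geliştirilmiştir."},
]

with open("turkish_language_file.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text"])
    writer.writeheader()
    writer.writerows(rows)
```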
Set up the classpath to include zemberek-full.jar
```python
jarpath = str(Path('full/path/to/your/zemberek-full.jar').resolve())
jpype.startJVM(getDefaultJVMPath(), "-Djava.class.path=%s" % jarpath)
DATA_PATH = "/full/path/to/your/data/folder/with/associated/zemberek/files"
```
Initialize Zemberek tools
```python
TurkishTokenizer: JClass = JClass("zemberek.tokenization.TurkishTokenizer")
TurkishMorphology: JClass = JClass("zemberek.morphology.TurkishMorphology")
TurkishSentenceNormalizer: JClass = JClass("zemberek.normalization.TurkishSentenceNormalizer")
extractor = JClass("zemberek.tokenization.TurkishSentenceExtractor").DEFAULT
Paths: JClass = JClass("java.nio.file.Paths")

morphology = TurkishMorphology.createWithDefaults()
tokenizer = TurkishTokenizer.DEFAULT
normalizer = TurkishSentenceNormalizer(
    morphology,  # reuse the morphology instance created above
    Paths.get(str(os.path.join(DATA_PATH, "normalization"))),
    Paths.get(str(os.path.join(DATA_PATH, "lm", "lm.2gram.slm"))),
)
```
Initialize Zeyrek lemmatizer
```python
analyzer = zeyrek.MorphAnalyzer()
```
Define stopwords
```python
with open('/full/path/to/your/saved/stopwords.txt', 'r', encoding='utf-8') as f:
    stopwords = set(line.strip() for line in f)
```
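The stopword removal performed later in the pipeline is a plain set-membership test over tokens. A minimal standalone sketch with a hand-picked handful of common Turkish stopwords (the real pipeline uses the full list loaded from `stopwords.txt`; the variable name `demo_stopwords` is mine, to avoid shadowing the set above):

```python
# A few common Turkish stopwords, for illustration only
demo_stopwords = {"ve", "bir", "bu", "için", "ile"}

tokens = ["zemberek", "ve", "zeyrek", "bu", "metni", "temizler"]
filtered = [t for t in tokens if t not in demo_stopwords]
# → ["zemberek", "zeyrek", "metni", "temizler"]
```

Using a `set` rather than a list matters here: membership tests are O(1) on average, which adds up when the check runs once per token over a large corpus.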
Define the preprocessing function
```python
def preprocess_text(text):
    # Tokenize the text into words
    tokenized_text = tokenizer.tokenizeToStrings(JString(text))
    # Convert Java strings to Python strings
    token_list = [str(token) for token in tokenized_text]
    # Remove stopwords
    token_list = [token for token in token_list if token not in stopwords]
    # Join the tokens back into a normalized sentence
    normalized_sentence = normalizer.normalize(' '.join(token_list))
    # Lemmatize the text
    lemmas = []
    for word in str(normalized_sentence).split():
        analyses = analyzer.analyze(word)[0]
        for parse in analyses:
            lemma = parse.lemma
            if lemma not in lemmas:
                lemmas.append(lemma)
    lemmatized_text = ' '.join(lemmas)
    # Remove punctuation, numbers, symbols, URLs, and zero-width characters.
    # Turkish letters (ç, ğ, ı, ö, ş, ü) must be listed explicitly,
    # otherwise the character filter would strip them as well.
    preprocessed_text = re.sub(
        r'[^a-zA-ZçğıöşüÇĞİÖŞÜ\s.,?!]|[\u200b-\u200d\uFEFF]', '', lemmatized_text
    )
    return preprocessed_text
```
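The final character filter deserves care for Turkish: a pattern like `[^a-zA-Z0-9\s.,?!]` silently strips ç, ğ, ı, ö, ş, and ü along with the symbols it targets. A standalone sketch of a filter that keeps Turkish letters and, per the pipeline's stated goals, drops digits (the `clean_chars` helper is my own illustration, not part of Zemberek or Zeyrek):

```python
import re

# Keep Latin and Turkish letters, whitespace, and basic punctuation;
# drop everything else (digits, symbols, zero-width characters).
PATTERN = re.compile(r'[^a-zA-ZçğıöşüÇĞİÖŞÜ\s.,?!]|[\u200b-\u200d\uFEFF]')

def clean_chars(text):
    return PATTERN.sub('', text)

print(clean_chars("Türkçe metin 2024 :) işlendi"))
```

Note that this character-level filter mangles URLs rather than deleting them whole; stripping URLs cleanly would need a separate pattern such as `r'https?://\S+'` applied first.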
Apply the function to the text column in the csv file
Option 1: The regular way
Uses a single core and can be slow on large datasets.

```python
df["text"] = df["text"].apply(preprocess_text)
```
Option 2: With parallelization
The more cores you have, the faster this runs.

```python
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True)
df["text"] = df["text"].parallel_apply(preprocess_text)
```
Save the processed .csv file to working directory
```python
df.to_csv('processed.csv', index=False)  # index=False avoids writing the row index as an extra column

# Shut down the JVM once processing is done
shutdownJVM()
```