Turkish text preprocessing with Zemberek and Zeyrek


Erik Skare


March 30, 2023

This is a basic preprocessing pipeline for digesting and cleaning Turkish text using Zemberek and Zeyrek:


  • The pipeline uses Zemberek to tokenize and normalize the Turkish text in a .csv file. Zemberek is one of the most reliable Natural Language Processing (NLP) tools for Turkish, with features for morphological analysis, tokenization and sentence boundary detection, basic spell checking, word suggestion, noisy text normalization, Turkish Named Entity Recognition, and hash functions, to mention just some. The pipeline also uses Zeyrek, a partial port of the Zemberek library to Python, to lemmatize and analyze Turkish words.

  • As the installation notes below imply, it took some effort to make this code work, given that Zemberek is written in Java. I am thus grateful for the very helpful guide to Turkish text preprocessing with Zemberek in Python, written and developed by Ayşe Kübra Kuyucu.

P.S. Zeyrek is in alpha stage and, according to the developers, the API will probably change.

Installation notes

Compared to other NLP packages and libraries, the installation process requires slightly more attention to detail and some patience for the code below to work. Please follow the steps below, as several of Zemberek’s features require associated libraries and files to function properly:

  • The following process assumes that you have a working directory where you will store all required data, the Python script, and the Zemberek libraries. That is, your working directory will contain i) the Python script, ii) the zemberek-full.jar file (see the step below), and iii) a folder called data.
  • Download the zemberek-full.jar file from Zemberek’s Google Drive (you will need to log in with your Google account to get access). You can find it by navigating distributions -> 0.17.1 -> zemberek-full.jar. Download the file and save it in your working directory.
  • You need to download the normalization folder from the same Zemberek archive. You can find it by navigating data -> normalization. Again, download the entire normalization folder and save it in the data folder in your working directory.
  • You need to download a file called lm.2gram.slm. You will find it by navigating to data -> lm -> lm.2gram.slm. Create a folder in your data folder called lm and save the file there (so the file path is /working_directory/data/lm/lm.2gram.slm).
  • Mark all stopwords on this list by pressing CTRL+A. Copy the list with CTRL+C. Paste the list into a plaintext editor (like Notepad) and save the list as stopwords.txt in your data folder.
  • Once all the associated data is downloaded and saved in its rightful place, all we have to do is provide the file/directory paths in the Python code below.
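With everything downloaded, the working directory should look roughly like this (the script name preprocess.py is just a placeholder; use whatever you named your script):

```
working_directory/
├── preprocess.py        # your Python script
├── zemberek-full.jar
└── data/
    ├── normalization/   # the downloaded normalization folder
    ├── lm/
    │   └── lm.2gram.slm
    └── stopwords.txt
```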

Load libraries

import zeyrek
import pandas as pd
import re
import os
from pathlib import Path
import jpype
from jpype import JClass, JString, getDefaultJVMPath, shutdownJVM

Load the Turkish .csv file

The .csv file should load into a dataframe with one text item per row and a column called ‘text’ whose values are strings.

df = pd.read_csv('turkish_language_file.csv')
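If you just want to try the pipeline without your own data, a throwaway .csv in the expected layout can be sketched like this (the two sample sentences are arbitrary; the filename matches the read_csv call above):

```python
import pandas as pd

# Build a tiny sample file with a single 'text' column of strings
sample = pd.DataFrame({'text': ["Bu güzel bir kitap.", "Yarın İstanbul'a gidiyorum."]})
sample.to_csv('turkish_language_file.csv', index=False)

df = pd.read_csv('turkish_language_file.csv')
print(len(df))  # 2
```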

Set up the classpath to include zemberek-full.jar

jarpath = str(Path('full/path/to/your/zemberek-full.jar').resolve())
jpype.startJVM(getDefaultJVMPath(), "-Djava.class.path=%s" % jarpath)
DATA_PATH = "/full/path/to/your/data/folder/with/associated/zemberek/files"

Initialize Zemberek tools

TurkishTokenizer: JClass = JClass("zemberek.tokenization.TurkishTokenizer")
TurkishMorphology: JClass = JClass("zemberek.morphology.TurkishMorphology")
TurkishSentenceNormalizer: JClass = JClass("zemberek.normalization.TurkishSentenceNormalizer")
extractor = JClass("zemberek.tokenization.TurkishSentenceExtractor").DEFAULT
Paths: JClass = JClass("java.nio.file.Paths")
morphology = TurkishMorphology.createWithDefaults()
tokenizer = TurkishTokenizer.DEFAULT
normalizer = TurkishSentenceNormalizer(
    morphology,
    Paths.get(str(os.path.join(DATA_PATH, "normalization"))),
    Paths.get(str(os.path.join(DATA_PATH, "lm", "lm.2gram.slm"))),
)

Initialize Zeyrek lemmatizer

analyzer = zeyrek.MorphAnalyzer()

Define stopwords

with open('/full/path/to/your/saved/stopwords.txt', 'r', encoding='utf-8') as f:
    stopwords = set([line.strip() for line in f])
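Loading the stopwords into a set makes each membership check constant-time, which matters when filtering every token in a large corpus. A minimal sketch of the same set-based filtering, with a tiny hypothetical stopword set (ve ‘and’, bir ‘a/one’, bu ‘this’):

```python
# Hypothetical stopword set for illustration; in the pipeline this
# comes from stopwords.txt
stopwords = {'ve', 'bir', 'bu'}

tokens = ['bu', 'güzel', 've', 'eski', 'bir', 'kitap']
filtered = [token for token in tokens if token not in stopwords]
print(filtered)  # ['güzel', 'eski', 'kitap']
```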

Define the preprocessing function

def preprocess_text(text):
    # Tokenize the text into words
    tokenized_text = tokenizer.tokenizeToStrings(JString(text))
    # Convert Java strings to Python strings
    token_list = [str(token) for token in tokenized_text]
    # Remove stopwords
    token_list = [token for token in token_list if token not in stopwords]
    # Join the tokens back into a normalized sentence
    normalized_sentence = normalizer.normalize(' '.join(token_list))
    # Lemmatize the text
    lemmas = []
    for word in str(normalized_sentence).split():
        analyses = analyzer.analyze(word)[0]
        for parse in analyses:
            lemma = parse.lemma
            if lemma not in lemmas:
                lemmas.append(lemma)
    lemmatized_text = ' '.join(lemmas)
    # Strip symbols, URLs, and zero-width characters, keeping Turkish letters
    preprocessed_text = re.sub(r'[^a-zA-ZçğıöşüÇĞİÖŞÜ0-9\s.,?!]|[\u200b-\u200d\uFEFF]', '', lemmatized_text)
    return preprocessed_text
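Note that the character class in the final re.sub should include the Turkish letters çğıöşü (and their uppercase forms); a plain a-zA-Z class would strip them from words like çalışma. A minimal check of the Turkish-aware pattern on a made-up sample string:

```python
import re

pattern = r'[^a-zA-ZçğıöşüÇĞİÖŞÜ0-9\s.,?!]|[\u200b-\u200d\uFEFF]'
sample = 'çalışma & öğrenme #2023\u200b'
# '&', '#', and the zero-width space are removed; Turkish letters survive
print(re.sub(pattern, '', sample))  # 'çalışma  öğrenme 2023'
```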

Apply the function to the text column in the csv file

Option 1: The regular way

Uses a single core. Can be slow on large datasets.

df["text"] = df["text"].apply(preprocess_text)

Option 2: With parallelization

Speed scales with the number of cores: the more cores you have, the faster it runs.

from pandarallel import pandarallel
pandarallel.initialize(progress_bar = True)
df.text = df.text.parallel_apply(preprocess_text)

Save the processed .csv file to working directory
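A minimal sketch of the save step; the output filename is just an example, and index=False keeps the row index out of the file:

```python
import pandas as pd

# Stand-in for the processed dataframe from the steps above
df = pd.DataFrame({'text': ['örnek metin']})
df.to_csv('turkish_language_file_processed.csv', index=False, encoding='utf-8')
```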