Norwegian text preprocessing using spaCy
A basic preprocessing pipeline for Norwegian text that lemmatizes the text and removes punctuation, symbols, numbers, and stopwords. The code can be used to process csv files in any language for which spaCy provides a suitable language model. For example, to preprocess English text, just run nlp = spacy.load('en_core_web_trf') instead. Remember to change the "Define Stopwords" step as well, so it fits your language.
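Swapping the stopword step between languages is straightforward because spaCy ships a stopword list per language and filtering is plain set membership. A toy illustration (the three-word stopword set here is hand-picked for the example, not spaCy's real Norwegian list):

```python
# Hand-picked mini stopword set for illustration only;
# in the pipeline below, the real set comes from spaCy's Norwegian language data
stop_words = {"jeg", "og", "det"}

tokens = ["jeg", "liker", "og", "synge"]
kept = [t for t in tokens if t not in stop_words]
print(kept)  # ['liker', 'synge']
```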
Load libraries
import spacy
import pandas as pd
import csv
import regex
from spacy.lang.nb.examples import sentences

Load the Norwegian language module
# spacy.cli.download('nb_core_news_lg')  # Run this line first if loading the language module fails
nlp = spacy.load('nb_core_news_lg')

Define Stopwords
stop_words = set(spacy.lang.nb.STOP_WORDS)

Increase the maximum text size spaCy can process
This may not be necessary in all cases, but you may see error messages when preprocessing a very large batch of text files. If so, increase the maximum text length that spaCy can process.
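Raising the limits makes spaCy accept longer texts, but memory use grows with document length. If a single document is still too large, another option is to split it into chunks before calling nlp. A minimal sketch (chunk_text is a hypothetical helper, not part of spaCy; it breaks at whitespace so no word is cut in half):

```python
def chunk_text(text, max_chars=1_000_000):
    """Split text into pieces of at most max_chars, breaking at whitespace."""
    chunks = []
    while len(text) > max_chars:
        # Find the last space inside the window so words stay intact
        cut = text.rfind(" ", 0, max_chars)
        if cut == -1:        # no space found: fall back to a hard cut
            cut = max_chars
        chunks.append(text[:cut])
        text = text[cut:].lstrip()
    if text:
        chunks.append(text)
    return chunks

parts = chunk_text("en to tre fire fem", max_chars=7)
print(parts)  # ['en to', 'tre', 'fire', 'fem']
```

Each chunk can then be processed separately (e.g. via nlp.pipe(parts)) and the lemmas concatenated afterwards.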
# Increase maximum text length that can be processed by SpaCy model
nlp.max_length = 20000000
csv.field_size_limit(5000000)

Testing that the language module has been loaded correctly
This is just to make sure that the language module (in this case nb_core_news_lg) has been loaded correctly. If it has, the returned tokens should be lemmatized (gikk → gå, sang → synge).
doc = nlp("Jeg gikk og går for å gå etter å ha gått, og jeg synger, liker å synge da jeg sang.")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)

Load Norwegian dataframe
A dataframe with one text item per row, various metadata columns, and a column called text whose values are strings.
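If you want to dry-run the pipeline without your own data, a toy unprocessed.csv with the expected shape can be written with the standard csv module (the id column and the sample sentences are just placeholders; only the text column is required by the pipeline):

```python
import csv

# Two placeholder rows: one metadata column (id) plus the required text column
rows = [
    {"id": 1, "text": "Jeg gikk en tur."},
    {"id": 2, "text": "Vi sang hele natten."},
]
with open("unprocessed.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "text"])
    writer.writeheader()
    writer.writerows(rows)
```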
df = pd.read_csv('unprocessed.csv')

Create function to process the text
def preprocess_text(text):
    # Remove URLs
    text = regex.sub(r'https?://\S+', '', text)
    # Remove symbols and any characters that are not letters, numbers, or whitespace
    text = regex.sub(r'[^\p{L}\p{N}\s]+', '', text)
    # Remove numbers
    text = regex.sub(r'\d+', '', text)
    # Lemmatize the text and drop stopwords (stop_words is the set defined above)
    doc = nlp(text)
    lemmatized_text = [token.lemma_ for token in doc
                       if not token.is_stop and token.text not in stop_words]
    return " ".join(lemmatized_text)

Apply the function to the text column in the csv file
Option 1: The regular way
Uses a single core. Can be slow on large datasets.
df["text"] = df["text"].apply(preprocess_text)

Option 2: With parallelization
The more cores you have, the faster this runs.
from pandarallel import pandarallel
pandarallel.initialize(progress_bar=True)
df["text"] = df["text"].parallel_apply(preprocess_text)

Save the processed csv file to working directory
df.to_csv('processed.csv', index=False)  # index=False avoids writing the row index as an extra column
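If installing pandarallel is not an option, a similar fan-out can be sketched with the standard library's concurrent.futures. Here clean is a cheap stand-in for preprocess_text (which needs the loaded spaCy model); with pandas you would assign the results back with df["text"] = results. Note that threads mainly help with I/O-bound work; for CPU-bound spaCy processing a ProcessPoolExecutor (same map interface) parallelizes better:

```python
from concurrent.futures import ThreadPoolExecutor
import re

def clean(text):
    # Stand-in for preprocess_text: strip digits, then collapse whitespace
    return re.sub(r"\s+", " ", re.sub(r"\d+", "", text)).strip()

texts = ["Jeg gikk 3 turer", "Vi sang 100 sanger"]
with ThreadPoolExecutor() as pool:
    results = list(pool.map(clean, texts))
print(results)  # ['Jeg gikk turer', 'Vi sang sanger']
```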