Norwegian text preprocessing using spaCy
A basic preprocessing pipeline for Norwegian text that lemmatizes the text and removes punctuation, symbols, numbers, and stopwords. The code can be used to process csv files in any language for which spaCy provides a suitable language model. For example, to preprocess English text, just run nlp = spacy.load('en_core_web_trf') instead. Remember to change the "Define Stopwords" step as well, so it fits your language.
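Swapping the stopword step between languages is straightforward because spaCy ships a stopword list per language and filtering is plain set membership. A toy illustration (the three-word stopword set here is hand-picked for the example, not spaCy's real Norwegian list):

```python
# Hand-picked mini stopword set for illustration only;
# in the pipeline below, the real set comes from spaCy's Norwegian language data
stop_words = {"jeg", "og", "det"}

tokens = ["jeg", "liker", "og", "synge"]
kept = [t for t in tokens if t not in stop_words]
print(kept)  # ['liker', 'synge']
```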
Load libraries
import spacy
import pandas as pd
import csv
import regex
from spacy.lang.nb.examples import sentences

Load the Norwegian language module
# spacy.cli.download('nb_core_news_lg')  # Run this line first if loading the language module fails
nlp = spacy.load('nb_core_news_lg')

Define Stopwords
stop_words = set(spacy.lang.nb.STOP_WORDS)

Increase the maximum text size spaCy can process
This may not be necessary in all cases, but you may see error messages when preprocessing a very large batch of text files. If so, increase the maximum text length that spaCy can process.
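Raising the limits makes spaCy accept longer texts, but memory use grows with document length. If a single document is still too large, another option is to split it into chunks before calling nlp. A minimal sketch (chunk_text is a hypothetical helper, not part of spaCy; it breaks at whitespace so no word is cut in half):

```python
def chunk_text(text, max_chars=1_000_000):
    """Split text into pieces of at most max_chars, breaking at whitespace."""
    chunks = []
    while len(text) > max_chars:
        # Find the last space inside the window so words stay intact
        cut = text.rfind(" ", 0, max_chars)
        if cut == -1:        # no space found: fall back to a hard cut
            cut = max_chars
        chunks.append(text[:cut])
        text = text[cut:].lstrip()
    if text:
        chunks.append(text)
    return chunks

parts = chunk_text("en to tre fire fem", max_chars=7)
print(parts)  # ['en to', 'tre', 'fire', 'fem']
```

Each chunk can then be processed separately (e.g. via nlp.pipe(parts)) and the lemmas concatenated afterwards.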
# Increase maximum text length that can be processed by SpaCy model
nlp.max_length = 20000000
csv.field_size_limit(5000000)

Testing that the language module has been loaded correctly
This is just to make sure that the language module (in this case nb_core_news_lg) has been loaded correctly. If it has, the returned tokens should be lemmatized (gikk → gå, sang → synge).
doc = nlp("Jeg gikk og går for å gå etter å ha gått, og jeg synger, liker å synge da jeg sang.")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)

Load Norwegian dataframe
A dataframe with one text item per row, various metadata columns, and a column called text whose values are strings.
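If you want to dry-run the pipeline without your own data, a toy unprocessed.csv with the expected shape can be written with the standard csv module (the id column and the sample sentences are just placeholders; only the text column is required by the pipeline):

```python
import csv

# Two placeholder rows: one metadata column (id) plus the required text column
rows = [
    {"id": 1, "text": "Jeg gikk en tur."},
    {"id": 2, "text": "Vi sang hele natten."},
]
with open("unprocessed.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "text"])
    writer.writeheader()
    writer.writerows(rows)
```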
df = pd.read_csv('unprocessed.csv')

Create function to process the text
def preprocess_text(text):
    # Remove URLs
    text = regex.sub(r'https?://\S+', '', text)
    # Remove symbols and any characters that are not letters, numbers, or whitespace
    text = regex.sub(r'[^\p{L}\p{N}\s]+', '', text)
    # Remove numbers
    text = regex.sub(r'\d+', '', text)
    # Lemmatize the text and drop stopwords (stop_words is the set defined above)
    doc = nlp(text)
    lemmatized_text = [token.lemma_ for token in doc
                       if not token.is_stop and token.text not in stop_words]
    return " ".join(lemmatized_text)

Apply the function to the text column in the csv file
Option 1: The regular way
Uses a single core. Can be slow on large datasets.
df["text"] = df["text"].apply(preprocess_text)

Option 2: With parallelization
The more cores you have, the faster this runs.
from pandarallel import pandarallel
pandarallel.initialize(progress_bar=True)
df["text"] = df["text"].parallel_apply(preprocess_text)

Save the processed csv file to working directory
df.to_csv('processed.csv', index=False)  # index=False avoids writing the row index as an extra column
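If installing pandarallel is not an option, a similar fan-out can be sketched with the standard library's concurrent.futures. Here clean is a cheap stand-in for preprocess_text (which needs the loaded spaCy model); with pandas you would assign the results back with df["text"] = results. Note that threads mainly help with I/O-bound work; for CPU-bound spaCy processing a ProcessPoolExecutor (same map interface) parallelizes better:

```python
from concurrent.futures import ThreadPoolExecutor
import re

def clean(text):
    # Stand-in for preprocess_text: strip digits, then collapse whitespace
    return re.sub(r"\s+", " ", re.sub(r"\d+", "", text)).strip()

texts = ["Jeg gikk 3 turer", "Vi sang 100 sanger"]
with ThreadPoolExecutor() as pool:
    results = list(pool.map(clean, texts))
print(results)  # ['Jeg gikk turer', 'Vi sang sanger']
```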