Norwegian text preprocessing using spaCy
A basic text preprocessing pipeline for Norwegian text that lemmatizes the text and removes punctuation, symbols, numbers, and stopwords. The code below can be used to process CSV files in any language for which spaCy has a suitable language model. For example, if you want to preprocess English text, just run nlp = spacy.load('en_core_web_trf')
instead. Remember to change the "Define Stopwords" step as well, so it fits your language requirements.
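As a minimal sketch, the English variant of the setup would look like this (assuming the en_core_web_trf model has already been downloaded):

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load('en_core_web_trf')  # transformer-based English pipeline
stop_words = set(STOP_WORDS)         # English stopwords instead of the Norwegian ones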
Load libraries
import spacy
import pandas as pd
import csv
import regex  # third-party 'regex' package; needed for the \p{...} Unicode classes used below
from spacy.lang.nb.examples import sentences
Load the Norwegian language module
# spacy.cli.download('nb_core_news_lg')  # Run this line first if you struggle to load the language module
nlp = spacy.load('nb_core_news_lg')
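If you would rather not comment lines in and out, a small sketch that falls back to downloading the model on first use (OSError is what spacy.load raises when the model is missing):

try:
    nlp = spacy.load('nb_core_news_lg')
except OSError:
    # Model not installed yet: download it once, then load it
    spacy.cli.download('nb_core_news_lg')
    nlp = spacy.load('nb_core_news_lg')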
Define Stopwords
stop_words = set(spacy.lang.nb.STOP_WORDS)
Increase the maximum text size spaCy can process
This may not be necessary in all cases, but you may get error messages if you try to preprocess a very large batch of text files. If you do, increase the maximum text length that spaCy can process.
# Increase the maximum text length that the spaCy model can process
nlp.max_length = 20000000
csv.field_size_limit(5000000)
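Raising nlp.max_length increases memory use, so an alternative is to feed very long texts to the model in slices. A rough sketch (the helper name and chunk_size are assumptions, and naive slicing can split a word at a chunk boundary):

def process_in_chunks(text, chunk_size=500000):
    # Lemmatize a very long text in slices that stay well under nlp.max_length
    lemmas = []
    for start in range(0, len(text), chunk_size):
        doc = nlp(text[start:start + chunk_size])
        lemmas.extend(token.lemma_ for token in doc)
    return " ".join(lemmas)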
Testing that the language module has been loaded correctly
This is just to make sure that the language module (in this case nb_core_news_lg) has been loaded correctly. If it has, the returned text should be lemmatized (gikk = gå | sang = synge).
= nlp("Jeg gikk og går for å gå etter å ha gått, og jeg synger, liker å synge da jeg sang.")
doc for token in doc:
print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
token.shape_, token.is_alpha, token.is_stop)
Load Norwegian dataframe
A dataframe with text items as rows, various metadata columns, and a column called text where the values are strings.
df = pd.read_csv('unprocessed.csv')
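If you want to try the pipeline without a CSV file on disk, a hypothetical two-row stand-in (both the column layout and the sentences are made up):

df = pd.DataFrame({
    "id": [1, 2],
    "text": [
        "Jeg gikk en tur i parken i 2023!",
        "Hun sang en sang på https://example.com.",
    ],
})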
Create a function to preprocess the text
def preprocess_text(text):
    # Remove URLs
    text = regex.sub(r'https?://\S+', '', text)
    # Remove symbols and any characters that are not letters, numbers, or whitespace
    text = regex.sub(r'[^\p{L}\p{N}\s]+', '', text)
    # Remove numbers
    text = regex.sub(r'\d+', '', text)
    # Define stopwords
    stop_words = set(spacy.lang.nb.STOP_WORDS)
    # Lemmatize the text and drop stopwords
    doc = nlp(text)
    lemmatized_text = [token.lemma_ for token in doc if not token.is_stop and token.text not in stop_words]
    return " ".join(lemmatized_text)
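A quick sanity check on a single string (the sentence is made up, and the exact lemmas depend on the model):

sample = "Jeg gikk en tur på https://example.com i 2023!"
print(preprocess_text(sample))
# The URL, the digits, the punctuation, and the stopwords are gone; the remaining words are lemmatized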
Apply the function to the text column in the CSV file
Option 1: The regular way
Uses a single core. Can be slow on large datasets.
"text"] = df["text"].apply(preprocess_text) df[
Option 2: With parallelization
The more cores you have, the faster this runs.
from pandarallel import pandarallel
pandarallel.initialize(progress_bar=True)
df.text = df.text.parallel_apply(preprocess_text)
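A spaCy-native alternative worth knowing about is nlp.pipe, which batches documents and can spread the work across processes. A rough sketch that does the regex cleanup first and then lemmatizes in batches (batch_size and n_process are assumptions to tune for your machine):

def regex_clean(text):
    # Same regex steps as in preprocess_text, without the spaCy pass
    text = regex.sub(r'https?://\S+', '', text)
    text = regex.sub(r'[^\p{L}\p{N}\s]+', '', text)
    return regex.sub(r'\d+', '', text)

cleaned = df["text"].map(regex_clean)
docs = nlp.pipe(cleaned, batch_size=64, n_process=4)
df["text"] = [" ".join(t.lemma_ for t in doc if not t.is_stop) for doc in docs]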
Save the processed CSV file to the working directory
df.to_csv('processed.csv')
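By default, to_csv also writes the dataframe index as an extra unnamed column; pass index=False if you do not want it in the output file:

df.to_csv('processed.csv', index=False)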