R Code for Processing Big File Batches with daiR

Author

Erik Skare

This walk-through is intended to help you process big batches of files (whether .pdf, .tiff, .jpg, etc.) using Google Document AI’s OCR service with Thomas Hegghammer’s daiR package (see: Thomas Hegghammer. “daiR: an R package for OCR with Google Document AI.” The Journal of Open Source Software 6, no. 68 (2021): 3528).

Note that you need to modify some of the code depending on the files you want to process, what you have named these files, and so on.

Packages

require(daiR)
require(googleCloudStorageR)
require(usethis)
require(purrr)
require(fs)
require(glue)
require(stringr)
require(here)

Editing .Renviron file and preparing big batch processing

This step assumes that you have already:

  • Activated the Google Cloud Console
  • Linked your project to your billing account
  • Set up a service account
  • Downloaded a .json file with the service account key

If you have not done so already, please consult the daiR website.

Once we have completed all of the steps above, the first thing to do is retrieve our project ID, open our .Renviron file, and store the information needed to interact with Google Cloud Storage from R:

    project_id <- daiR::get_project_id()  # Saves your Google Cloud project ID in the R environment
    usethis::edit_r_environ() # Opens the .Renviron file
    # We will then save our project ID in the .Renviron file so that it looks something like this:
    DAI_PROCESSOR_ID="[your project ID]"
    GCS_AUTH_FILE="[file path to your .json file with the service account key]"

We will then check whether we already have a bucket in Google Cloud Storage:

gcs_list_buckets(project_id)

The first time we run this, the R console will return “NULL” (because no bucket exists yet). We must then create one. We first provide the name of the bucket (in quotation marks), then the project ID (the object we saved as “project_id” in the R environment), and finally the location of the server.

gcs_create_bucket("[your preferred bucket name]", project_id, location = "EU")

We can then save our bucket name in the .Renviron file:

usethis::edit_r_environ() # Open the .Renviron file in case you closed it
# The information you stored in the .Renviron file should now look like this:

DAI_PROCESSOR_ID="[your project ID]"
GCS_DEFAULT_BUCKET="[your bucket name]"
GCS_AUTH_FILE="[complete file path to your service account key (.json file)]"
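
Note that R only reads the .Renviron file at startup, so restart R (or run the line below, which assumes the user-level .Renviron that usethis::edit_r_environ() opens by default) for the new variables to take effect:

readRenviron("~/.Renviron") # Reloads the .Renviron file in the current session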

If our bucket name has not been saved in our .Renviron file, we have to set the default bucket for the session with the following function:

gcs_global_bucket("[bucket name]")

We will then check the content of the bucket we have created (which will obviously be empty, as we have not uploaded any files yet):

gcs_list_objects()
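
Before moving on, it can also be worth confirming that daiR itself can authenticate with the credentials we have just configured. This quick check is my own addition and not part of the original walk-through:

dai_auth() # Should confirm that daiR can obtain an access token with your service account key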

Processing a big batch of .pdf files

Here, I am assuming that you have already uploaded all the files you want to process to the Google Cloud bucket and that everything is ready to be processed.
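
If your files are in fact still on your local machine, a minimal sketch like the following (my own addition; the directory path is a placeholder) uploads them to the default bucket:

local_pdfs <- dir_ls("[local directory with .pdf files]", glob = "*.pdf")
map(local_pdfs, ~ gcs_upload(.x, name = basename(.x)))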

As Hegghammer writes:

Although dai_async() takes batches of files, it is constrained by Google’s rate limits. Currently, a dai_async() call can contain maximum 50 files (a multi-page pdf counts as one file), and you can not have more than 5 batch requests and 10 000 pages undergoing processing at any one time. Therefore, if you’re looking to process a large batch, you need to spread the dai_async() calls out over time. The simplest solution is to make a function that sends files off individually with a small wait in between. Say we have a vector called big_batch containing thousands of filenames. First we would make a function like this:

process_slowly <- function(file) {
  dai_async(file)
  Sys.sleep(15)
}

We will then create an object with all the content in our bucket:

content <- gcs_list_objects()
big_batch <- content$name

We should at this point have “process_slowly” (our function) and “big_batch” (a vector with the names of the files uploaded to the bucket) in the R environment. What remains is to process these files with purrr’s map() function:

map(big_batch, process_slowly)
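
If you want to keep track of the individual requests, a variant of the function (my own sketch, not part of the original walk-through) can store the responses returned by dai_async() so that they can be inspected later, for instance with daiR’s dai_status():

process_slowly_keep <- function(file) {
  response <- dai_async(file) # Send the file off for asynchronous processing
  Sys.sleep(15)               # Pause to respect Google's rate limits
  response                    # Return the response so it can be inspected later
}
# responses <- map(big_batch, process_slowly_keep) # Use instead of map(big_batch, process_slowly)
# dai_status(responses[[1]]) # Check the status of an individual request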

Managing unprocessed files

Every now and then, when processing hundreds of files, some requests will fail with a message such as “HTTP status: 429 - unsuccessful”. We therefore need to identify the files we were unable to process. First, we will create an object with all the .json files in our bucket (the processed .pdf files are turned into .json files) so that we can extract their stems:

contents <- gcs_list_objects()
jsons <- grep("\\.json$", contents$name, value = TRUE)

We will then use the head() and tail() functions to make sure we have all .json files:

head(jsons)
tail(jsons)

We will then use regex to identify the stems of the .json files in our bucket, as they carry a long numeric prefix such as “100143430000013434530000”. Note that the names of your .json files, and hence the regex required, depend on the names of the .pdf files you uploaded. The regex below assumes file names based on dates, for example “2018_08_31”, “2011_10_01”, and so on. The “\d” stands for “digit”, and {2} and {4} give the number of digits: “\d{4}” matches the four digits of the year (“2018”, for example), while “\d{2}” matches the two digits of the month or day (“08”, for example). So “\d{4}_\d{2}_\d{2}” corresponds to “[YEAR]_[MONTH]_[DAY]”. A regex cheat sheet helps if you find this confusing: https://www.rexegg.com/regex-quickstart.html

json_stems <- unlist(str_extract_all(jsons, "\\d{4}_\\d{2}_\\d{2}")) 
head(json_stems)

We will then identify the unique .json stems:

json_stems_unique <- unique(json_stems)
head(json_stems_unique)

Once we have identified the unique .json stems, we need to do the same with the files we uploaded. In this case we uploaded .pdfs, but you can change this to .jpg or .tiff if those are the files you uploaded. We will use the same regex to find the stems:

pdfs <- grep("\\.pdf$", contents$name, value = TRUE)
pdf_stems <- unlist(str_extract_all(pdfs, "\\d{4}_\\d{2}_\\d{2}"))
pdf_stems_unique <- unique(pdf_stems)

We will then use the setdiff() function to compare the unique .pdf stems with the unique .json stems (the resulting object “remaining” will contain the stems of all the unprocessed .pdf files in the bucket):

remaining <- setdiff(pdf_stems_unique, json_stems_unique)

Processing unprocessed files

Still, if we try to run the map() function on the vector “remaining” (with the same process_slowly function we used for the object “big_batch” above), it will not work, because the object contains only file stems and not the full names of the .pdf files we are trying to reprocess. The names in the object are simply “2018_08_31”, but we need “2018_08_31.pdf”. We therefore use the paste0() function to append “.pdf” to the end of the stems:

remaining <- paste0(remaining, ".pdf")

Once this is done, we can process the unprocessed .pdf files:

map(remaining, process_slowly)
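
When working with a very large batch, it may be worth re-running the comparison afterwards to confirm that nothing is still missing. The sketch below (my own addition) simply repeats the steps from the previous section:

contents <- gcs_list_objects()
jsons <- grep("\\.json$", contents$name, value = TRUE)
json_stems_unique <- unique(unlist(str_extract_all(jsons, "\\d{4}_\\d{2}_\\d{2}")))
still_remaining <- setdiff(pdf_stems_unique, json_stems_unique)
length(still_remaining) # Should be 0 once everything has been processed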

Downloading the .json files

Once we have processed all the .pdf (or .tiff/.jpg) files, we have all the .json files we need (the output from the preceding processing). The following code assumes that you want to download all the .json files but not the .pdf files you processed (in my case, I processed 30,000 .pdf files, which created an output of more than 100,000 .json files; downloading all those files manually would be a laborious task). The grep() function makes sure we only download the .json files and not the .pdfs we uploaded, and the saveToDisk argument saves the files to your working directory.

If you log into Google Cloud Storage and check the content of your bucket, you will see that the .json files have been saved in separate folders (one folder per .pdf file, each containing 1-5 .json files depending on the length of the .pdf). We therefore use the str_replace_all() function to replace the “/” in the object paths with a “_”, so that each .json file can be saved to disk under a flat, valid file name:

bucket_contents <- gcs_list_objects()
only_jsons <- grep("\\.json$", bucket_contents$name, value = TRUE)
map(only_jsons, ~ gcs_get_object(.x, saveToDisk = str_replace_all(.x, "/", "_")))
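
Before deleting anything from the bucket, a simple check (again my own addition) is to confirm that the number of downloaded .json files in your working directory matches the number of .json objects in the bucket:

downloaded_jsons <- list.files(pattern = "\\.json$") # The .json files in the working directory
length(downloaded_jsons)
length(only_jsons) # The two numbers should match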

Once you have the .json files, you can delete all content in the bucket:

contents <- gcs_list_objects()
map(contents$name, gcs_delete_object)

Extracting text from the .json files

Once we have downloaded all our .json files, we need to extract the text data. We will first provide a path to the directory containing all our .json files, and then provide a destination directory where we want to save the extracted shards:

jsons <- dir_ls(here("[directory with .json files]"))
destdir <- here("[directory where we want to save the .txt output files]")

We will then extract the text from the .json files by writing the following function:

get_text_and_name <- function(x) {
  print(glue("Parsing {basename(x)} .."))
  text <- text_from_dai_file(x)
  stem <- str_sub(basename(x), end = -5) # Removes the "json" extension but keeps the trailing dot
  filename <- paste0(stem, "txt") # so pasting "txt" yields a complete ".txt" file name
  filepath <- file.path(destdir, filename)
  Sys.setlocale("LC_CTYPE", "arabic") # Sets the locale to Arabic should you process Arabic text files (can be changed to German, Korean, etc.). Only relevant for Windows users.
  write.csv(text, filepath, fileEncoding = "utf8", row.names = FALSE)
  Sys.setlocale("LC_CTYPE", "English") # Resets the locale back to English.
}

map(jsons, get_text_and_name)

Merging the .txt shards

Once you have extracted all the .txt shards, you have to merge them (so that the .pdf file “2018_08_15.pdf” ends up as a single “2018_08_15.txt” and not as “2018_08_15_01.txt”, “2018_08_15_02.txt”, “2018_08_15_03.txt”, etc.). To do so, we will need to install the GitHub version of daiR:

devtools::install_github("hegghammer/daiR", force = TRUE)

We then provide the directory where we saved our .txt file shards and the directory in which we want to place the merged .txt files:

shard_dir <- here("[directory with all .txt shards]")
dest_dir <- here("[directory where we want to save the merged .txt files]")
merge_shards(shard_dir, dest_dir) # The function to merge the shards.
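
As a final sanity check (my own addition), the number of merged .txt files should match the number of unique stems identified earlier:

merged_txts <- list.files(dest_dir, pattern = "\\.txt$")
length(merged_txts)
length(pdf_stems_unique) # The two numbers should be equal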