#RObservations #24: Using Tesseract-OCR to Scan Bank Documents and Extract Relevant Data (2024)

#RObservations #24: Using Tesseract-OCR to Scan Bank Documents and Extract Relevant Data (1)

I have clearly been out of the loop because I have only recently learned about the tesseract library in R. If I knew about it earlier I would have wrote about it much sooner!

The tesseract library is a package which has bindings to the Tesseract-OCR engine: a powerful optical character recognition (OCR) engine that supports over 100 languages and enables users to scan and extract text from pictures, which has direct applications in any field that deals with with high amounts of manual data processing- like accounting, mortgages/real estate, insurance and archival work. In this blog I am going to explore how its possible to parse a JPMorgan Chase bank statement using the tesseract,pdftools, stringr, and tidyverse libraries.

Disclaimer: The bank statement I am using is from a sample template and is not personal information. The statement sample can be accessed here; the document being parsed is shown below.

From what I’ve seen in the CRAN documentation we need to first convert the .pdf file into .png files. This can be done with the pdf_convert() function available in the pdftools package. To ensure that text will be read accurately, setting the dpi argument to a large number is recommended.

bank_statement <- pdftools::pdf_convert("sample-bank-statement.pdf", dpi = 1000)

## Converting page 1 to sample-bank-statement_1.png... done!## Converting page 2 to sample-bank-statement_2.png... done!## Converting page 3 to sample-bank-statement_3.png... done!## Converting page 4 to sample-bank-statement_4.png... done!

Now that the bank statement has been converted to .png files it can now be read with tesseract. Right now the data is very unstructured and it needs to be parsed. I’ll save the output for you, but you can see it for yourself if you output the raw_text vector on your machine.

library(tesseract)raw_text <- ocr(bank_statement, engine = tesseract("eng") )

From a bank statement, businesses are interested in data from the fields listed in document. Namely the:

Deposits and additions,
Checks paid,
Other withdrawals, fees and charges; and
Daily ending balances.

While this bank statement is relatively small and only consists of 4 pages, to create a general method for extracting data from JPMorgan Chase bank statements which could be larger, the scanned text will need to be combined into a single text file and then parsed accordingly.

# Bind raw text togetherraw_text_combined<- paste(raw_text,collapse="")

To get the data for the desired fields, we can use the field titles as anchors for parsing the data. It was with this and with the help of regex101.com this cleaning script was constructed. Beyond the particular regular expressions involved, the cleaning script relies heavily on the natural anchors in the text relating encompassing the values where they begin (around the title of the table they belong) and where they end (just before the total amount).

library(tidyverse)library(stringr)deposits_and_additions<- raw_text_combined %>% # Need to replace \n str_replace_all('\\n',';') %>% str_extract("(?<=DEPOSITS AND ADDITIONS).*(?=Total Deposits and Additions)") %>% str_extract('\\d{2}\\/\\d{2}.*\\.\\d{2}') %>% str_split(';') %>% as_tibble(.name_repair = "unique") %>% transmute(date= ...1 %>% str_extract('\\d{2}\\/\\d{2}'), transaction = ...1 %>% str_extract('(?<=\\d{2} ).*(?= (\\$|\\d))'), amount = ...1 %>% str_extract("(?<=[A-z] ).*")%>% str_extract('\\d{1,}.*'))checks_paid <- raw_text_combined %>% # Need to replace \n str_replace_all('\\n',';') %>% str_extract('(?<=CHECKS PAID).*(?=Total Checks Paid)') %>% # Would have to change regex to get check numbers and description but its not relevant here str_extract('\\d{2}\\/\\d{2}.*\\.\\d{2}') %>% str_split(';') %>% as_tibble(.name_repair = "unique") %>% transmute(date= ...1 %>% str_extract('\\d{2}\\/\\d{2}'), amount = ...1 %>% str_extract("(?<=\\d{2}\\/\\d{2} ).*") %>% str_extract('\\d{1,}.*'))others <- raw_text_combined %>% # Need to replace \n str_replace_all('\\n',';') %>% str_extract('(?=OTHER WITHDRAWALS, FEES & CHARGES).*(?=Total Other Withdrawals, Fees & Charges)') %>% str_extract('\\d{2}\\/\\d{2}.*\\.\\d{2}') %>% str_split(';') %>% as_tibble(.name_repair = "unique") %>% transmute(date= ...1 %>% str_extract('\\d{2}\\/\\d{2}'), description = ...1 %>% str_extract('(?<=\\d{2} ).*(?= (\\$|\\d))'), amount = ...1 %>% str_extract("(?<=\\d{2}\\/\\d{2} ).*") %>% str_extract('\\d{1,}.*'))daily_ending_balances<- raw_text_combined %>% # Need to replace \n str_replace_all('\\n',';') %>% str_extract('(?<=DAILY ENDING BALANCE).*(?=SERVICE CHARGE SUMMARY)') %>% str_extract('\\d{2}\\/\\d{2}.*\\.\\d{2}') %>% str_split(';') %>% lapply(function(x) { x %>% str_split('(?= \\d{2}\\/\\d{2} )')}) %>% unlist() %>% as_tibble(.name_repair = "unique") %>% transmute(date= value %>% str_extract('\\d{2}\\/\\d{2}'), amount = value %>% str_extract("(?<=\\d{2}\\/\\d{2} ).*") %>% str_extract('\\d{1,}.*'))

The cleaned data extracted from the bank statement is:

statement_data <- list("DEPOSITS AND ADDITIONS"=deposits_and_additions, "CHECKS PAID"=checks_paid, "OTHER WITHDRAWALS, FEES & CHARGES"=others, "DAILY ENDING BALANCE"=daily_ending_balances)statement_data

## $`DEPOSITS AND ADDITIONS`## # A tibble: 10 x 3## date transaction amount ## <chr> <chr> <chr> ## 1 07/02 Deposit 17,120.00## 2 07/09 Deposit 24,610.00## 3 07/14 Deposit 11,424.00## 4 07/15 Deposit 1,349.00 ## 5 07/21 Deposit 5,000.00 ## 6 07/21 Deposit 3,120.00 ## 7 07/23 Deposit 33,138.00## 8 07/28 Deposit 18,114.00## 9 07/30 Deposit 6,908.63 ## 10 07/30 Deposit 5,100.00 ## ## $`CHECKS PAID`## # A tibble: 2 x 2## date amount ## <chr> <chr> ## 1 07/14 1,471.99## 2 07/08 1,697.05## ## $`OTHER WITHDRAWALS, FEES & CHARGES`## # A tibble: 4 x 3## date description amount ## <chr> <chr> <chr> ## 1 07/11 Online Payment XXXXX To Vendor 8,928.00## 2 07/11 Online Payment XXXXX To Vendor 2,960.00## 3 07/25 Online Payment XXXXX To Vendor 250.00 ## 4 07/30 ADP TX/Fincl Svc ADP 2,887.68## ## $`DAILY ENDING BALANCE`## # A tibble: 11 x 2## date amount ## <chr> <chr> ## 1 07/02 98,727.40 ## 2 07/21 129,173.36## 3 07/08 97,030.35 ## 4 07/23 162,311.36## 5 07/09 121,640.35## 6 07/25 162,061.36## 7 07/11 109,752.35## 8 07/28 180,175.36## 9 07/14 108,280.36## 10 07/30 189,296.31## 11 07/16 121,053.36

There you have it! The tesseract package has opened up a world of data processing tools that I now have at my disposal and I hope I was able to show it in this blog. While this blog only focused on JPMorgan Chase bank statements, its possible to apply the same techniques to other bank statements by having the cleaning script tweaked accordingly.

Thank you for reading!

Want to see more of my content?

Be sure to subscribe and never miss an update!

YouTube

Facebook

Patreon