SBI submitted all the data to the ECI, and the ECI published it on Thursday, 21 Mar 2024, 6:32 PM.
Links to the 2 files are provided here:
Details of Electoral Bonds submitted by SBI on 21st March 2024 (EB_Redemption_Details)
Details of Electoral Bonds submitted by SBI on 21st March 2024 (EB_Purchase_Details)
To know the background of the case read this: SBI submits all electoral bond details, including unique numbers, to ECI
A short note on regex
regex, short for Regular Expression, is essentially a sequence of characters that forms a search pattern, which is then used to match strings or parts of strings in a text. Think of regexes as sophisticated search queries that allow you to find specific patterns within a larger body of text. A regex can describe a simple pattern such as a specific word or phrase, or a complex one that matches more intricate structures such as email addresses, URLs, dates, and more.
For example, if you wanted to find all email addresses in a text document, you could use a regular expression pattern like [a-zA-Z0-9_.+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9-.]+. This pattern breaks down as follows:
# Sample emails
em1 <- "example-1_Random@gmail.com" # this is a valid email
em2 <- "example*1_Random$gmail.com" # this is an invalid email.
# Detect email addresses
stringr::str_detect(c(em1, em2), "[a-zA-Z0-9_.+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z0-9-.]+")
## [1] TRUE FALSE
[a-zA-Z0-9_.+-]+ matches one or more alphanumeric characters, dots, underscores, plus signs, or hyphens, representing the username portion of an email address.
@ matches the “@” symbol, separating the username from the domain.
[a-zA-Z0-9.-]+ matches one or more alphanumeric characters, dots, or hyphens, representing the domain name.
\. matches a literal dot (written "\\." inside an R string), separating the domain from the top-level domain.
[a-zA-Z0-9-.]+ matches one or more alphanumeric characters, hyphens, or dots, representing the top-level domain (e.g., com, org, net).
Regular expressions can be used in various programming languages; here we use them in R with the stringr package.
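Besides str_detect, the cleaning steps below lean on a few other stringr helpers. A minimal illustration (the sample string is made up):
x <- "  775    ALL INDIA ANNA DRAVIDA MUNNETRA KAZHAGAM  "
str_trim(x)                                      # drop leading/trailing whitespace
str_split_fixed(str_trim(x), "\\s{2,}", n = 2)   # split on runs of 2+ spaces into 2 pieces
str_pad("775", width = 5, pad = "0")             # left-pad with zeros -> "00775"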
Required R libraries for this analysis
# Install (if missing) and load the packages used to process the PDF documents and build the tables
if(!require(lubridate)){install.packages("lubridate");library(lubridate)}
if(!require(pdftools)){install.packages("pdftools");library(pdftools)}
if(!require(stringr)){install.packages("stringr");library(stringr)}
if(!require(readxl)){install.packages("readxl");library(readxl)}
if(!require(DT)){install.packages("DT");library(DT)}
if(!require(openxlsx)){install.packages("openxlsx");library(openxlsx)} # assumed source of write.xlsx() used below
Cleaning the Party data
We start by reading the file Parties.pdf from the ./Data/Raw Data/ directory. The regular expression [\r\n]+ matches one or more occurrences of a carriage return (\r) or a newline character (\n), splitting the text into individual lines. str_trim then removes whitespace from the start and end of each string, and str_split_fixed splits each string on "\\s{2,}", i.e. runs of 2 or more consecutive spaces, into at most 9 pieces. We choose 9 because there are 9 columns in Parties.pdf.
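For illustration, a made-up line in the same layout splits like this (the real code below applies it to every line of the PDF text):
ln <- "1   12/Apr/2019   SOME PARTY NAME   *******5199   OC   775   1,00,00,000   00800   2770121"
str_split_fixed(str_trim(ln), "\\s{2,}", n = 9)
# -> a 1 x 9 character matrix: SN, date, party name, account, prefix, bond no., denomination, branch code, teller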
Party <- pdf_text("./Data/Raw Data/Parties.pdf")
Party <- unlist(str_split(Party, "[\\r\\n]+"))
Party <- as.data.frame(str_split_fixed(str_trim(Party), "\\s{2,}", n = 9))
head(Party,5)
## V1 V2 V3
## 1 Date of Account no. of Bond
## 2 Sr No. Name of the Political Party Prefix
## 3 Encashment Political Party Number
## 4 1 12/Apr/2019 ALL INDIA ANNA DRAVIDA MUNNETRA KAZHAGAM *******5199
## 5 2 12/Apr/2019 ALL INDIA ANNA DRAVIDA MUNNETRA KAZHAGAM *******5199
## V4 V5 V6 V7 V8 V9
## 1 Pay Branch
## 2 Denominations Pay Teller
## 3 Code
## 4 OC 775 1,00,00,000 00800 2770121
## 5 OC 3975 1,00,00,000 00800 2770121
Upon inspection we find that every page of the PDF repeats the table header, so we remove those header rows: using the V1 column, I drop the rows containing “Date of”, “Sr No.”, and “Encashment”. Each page read by pdftools also carries a “Page <number>” footer; we detect it with str_extract, where the pattern "Page." matches “Page” followed by any single character, and keep only the rows where nothing was extracted. We also remove rows in which all the columns are empty.
Index = Party$V1 %in% c("Date of", "Sr No.", "Encashment")  # repeated header rows
Party <- Party[!Index,]
Index2 = is.na(str_extract(Party$V1, "Page."))              # TRUE for rows that are not "Page ..." footers
Party <- Party[Index2,]
Party <- Party[!apply(Party == "", 1, all),]                # drop fully empty rows
rm(Index, Index2)
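The logic of Index2 is that str_extract returns NA when “Page” is not found, so is.na() is TRUE exactly for the data rows. A tiny illustrative check (strings made up):
str_extract(c("Page 1 of 113", "12/Apr/2019"), "Page.")
## [1] "Page " NA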
Now all the data is stored in the Party variable; let’s check its structure.
class(Party)
## [1] "data.frame"
dim(Party)
## [1] 20421 9
We found that some rows do not split cleanly into all 9 columns: when the date of encashment and the party name are separated by only a single space, they land in the same field and V9 is left empty. So before constructing the final data frame we check each row with a for loop, pre-populating several columns such as SN and Date of Encashment first.
n = rep(NA,dim(Party)[1])                                              # placeholder column
Index_m <- !nchar(Party$V9)>0                                          # TRUE when the row split into fewer than 9 fields
Col1 = as.data.frame(str_split_fixed(Party$V2, pattern = " ", n = 2))  # V2 -> date (plus party name for shifted rows)
Party_Data <- data.frame("SN" = as.numeric(Party$V1),
"Date of Encashment" = lubridate::dmy(Col1$V1),
"Party Name" = n,
"Account No" = n,
"Prefix" = n,
"Bond No" = n,
"Denominations" = n,
"Pay Branch Code" = n,
"Pay Teller" = n)
col7_t <- as.numeric(gsub(",", "", Party$V6))   # denominations for shifted rows
col7_f <- as.numeric(gsub(",", "", Party$V7))   # denominations for normal rows
for(ii in seq_along(n)){
if (Index_m[ii]) {
Party_Data[ii,3] <- Col1$V2[ii]
Party_Data[ii,4] <- Party$V3[ii]
Party_Data[ii,5] <- Party$V4[ii]
Party_Data[ii,6] <- as.numeric(Party$V5)[ii]
Party_Data[ii,7] <- col7_t[ii]
Party_Data[ii,8] <- Party$V7[ii]
Party_Data[ii,9] <- as.numeric(Party$V8)[ii]
} else {
Party_Data[ii,3] <- Party$V3[ii]
Party_Data[ii,4] <- Party$V4[ii]
Party_Data[ii,5] <- Party$V5[ii]
Party_Data[ii,6] <- as.numeric(Party$V6)[ii]
Party_Data[ii,7] <- col7_f[ii]
Party_Data[ii,8] <- Party$V8[ii]
Party_Data[ii,9] <- as.numeric(Party$V9)[ii]
}
# print(ii) # running this might take some time.
# IMP: don't run while building the site.
}
Now we have a table similar to the PDF, but I want to create a unique ID un_id by merging the Prefix with the zero-padded Bond No., so that I can match Parties to Donors. After that I save the data into the Party_Cleaned.xlsx file.
un_id= paste0(Party_Data$Prefix,
str_pad(Party_Data$Bond.No, width = 5,pad = "0"))
Party_Data$unique_ID <- un_id
write.xlsx(Party_Data, file="./Data/Processed xlsx/Party_Cleaned.xlsx", row.names = FALSE)
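As a quick sanity check of the zero-padding (bond numbers taken from the tables above):
paste0("OC", str_pad(c(775, 16524), width = 5, pad = "0"))
## [1] "OC00775" "OC16524"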
Now it is time to generate some table data: I aggregate the Denominations by Party name. The aggregate function from base R is used for the aggregation, and DT::datatable is then used for an interactive table.
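One subtlety: because fx_name returns four values, aggregate stores the result as a four-column matrix inside the Denominations column, which is why it is indexed as agg_data$Denominations[,1], [,2], and so on below. A minimal sketch of the same idea on toy data:
toy <- data.frame(g = c("A", "A", "B"), v = c(10, 20, 30))
out <- aggregate(v ~ g, data = toy, FUN = function(x) c(n = length(x), total = sum(x)))
out$v   # a 2 x 2 matrix: one row per group, one column per summary statistic
##      n total
## [1,] 2    30
## [2,] 1    30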
Party_Data <- read_xlsx(path = "./Data/Processed xlsx/Party_Cleaned.xlsx")
fx_name <- function(x) {c(length(x), median(x), round(mean(x),0), sum(x))}
agg_data = aggregate(Denominations ~ Party.Name, data = Party_Data, FUN = fx_name)
agg_data_df <- data.frame("Party" = agg_data$Party.Name,
"Count" = agg_data$Denominations[,1],
"Median" = agg_data$Denominations[,2],
"Mean" = agg_data$Denominations[,3],
"Sum" = agg_data$Denominations[,4])
datatable(agg_data_df, options = list(
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#fff', 'color': '#111'});",
"$(this.api().table().container()).css({'background-color': '#fff', 'color': '#111'});",
"$(this.api().table().body()).css({'background-color': '#fff', 'color': '#111'});",
"}")))%>% formatCurrency(c('Median', 'Mean', 'Sum'),
currency = "₹",
interval = 3,
mark = ",",
digits = 0)
Cleaning the Donor data
Similar to the Party data, we now explore the Donor data from the PDF “Company.pdf” in the folder Data/Raw Data. The same processing methodology as for Parties.pdf is used, with one change: this time the total number of columns is 12.
Comp <- pdf_text("./Data/Raw Data/Company.pdf")
Comp <- unlist(str_split(Comp, "[\\r\\n]+"))
Comp <- str_trim(gsub(",","",Comp))
Based on the data format, I found it more useful to split on the lookbehind pattern (?<=\\d ), i.e. at every position immediately preceded by a digit and a space.
Comp = as.data.frame(str_split_fixed(Comp, "(?<=\\d )", n = 12))
This splits each line right after every “digit followed by a space” boundary, separating fields such as SN and Reference No URN into their own columns.
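For example, on a made-up line in the Company.pdf layout (commas already removed), the lookbehind splits right after every number-plus-space boundary:
ln <- "1  00300202310100000003344  05/Oct/2023  05/Oct/2023  19/Oct/2023  A B C INDIA LIMITED TL 11448  10000000  00300  1022034  Paid"
str_split_fixed(ln, "(?<=\\d )", n = 12)
# -> pieces: SN, Reference No URN, the three dates, "purchaser + prefix + bond no.", denomination, branch code, teller, status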
a <- gsub("\\s+", " ", Comp$V1, perl = TRUE)    # collapse runs of whitespace for comparison
Index = a %in% c(a[1:3])                        # the first three rows are the header lines repeated on every page
Comp <- Comp[!Index,]
Index2 = is.na(str_extract(Comp$V1, "Page."))   # TRUE for rows that are not "Page ..." footers
Comp <- Comp[Index2,]
Comp <- Comp[!apply(Comp == "", 1, all),]       # drop fully empty rows
rm(Index, Index2, a)
Now we will break down column V6, which has the structure
“A B C INDIA LIMITED TL 11448”.
First we extract the bond number from the last 6 characters with str_sub and keep the remainder, characters 1 to -6 (up to the 6th character from the end), in c. The Name is then obtained from c by dropping the last two characters, which are the Prefix of the unique bond number, and finally the Prefix itself is extracted with str_sub(c, -2, -1).
BondN <- as.numeric(str_sub(str_trim(Comp$V6), -6,-1))
# you can simply choose the last 4 digits of V6 instead
c = str_trim(str_sub(str_trim(Comp$V6),1,-6))
Name <- str_trim(str_sub(c,1,-3))
Prefix <- str_sub(c, -2, -1)
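As a quick check on the example string from above:
v6 <- "A B C INDIA LIMITED TL 11448"
as.numeric(str_sub(v6, -6, -1))                       # bond number: 11448
str_trim(str_sub(v6, 1, -6))                          # "A B C INDIA LIMITED TL"
str_trim(str_sub("A B C INDIA LIMITED TL", 1, -3))    # purchaser name: "A B C INDIA LIMITED"
str_sub("A B C INDIA LIMITED TL", -2, -1)             # prefix: "TL"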
Now we will assemble the data into the 12 columns.
Comp_Data <- data.frame("SN" = as.numeric(Comp$V1),
"Reference No URN" = str_trim(Comp$V2),
"Journal Date" = dmy(Comp$V3),
"Date of Purchase" = dmy(Comp$V4),
"Date of Expiry" = dmy(Comp$V5),
"Purchaser" = Name,
"Prefix" = Prefix,
"Bond No" = BondN,
"Denominations" = as.numeric(str_trim(Comp$V7)),
"Issue Branch Code" = str_trim(Comp$V8),
"Issue Teller" = as.numeric(str_trim(Comp$V9)),
"Status" = str_trim(Comp$V10))
# write.xlsx(Comp_Data, file="Comp_Cleaned.xlsx", row.names = FALSE)
We temporarily saved the file as “Comp_Cleaned.xlsx” and found some errors in the columns, presumably because the purchaser name “L7 HITECH PRIVATE LIMITED” itself contains a digit followed by a space, which the lookbehind split breaks on. We fix the affected rows of that one Donor here.
Index = Comp_Data$Reference.No.URN %in% c("00300202310100000003344",
"00300202310120000003422",
"00300202310130000003470")
Comp_Data$Purchaser[Index] <- "L7 HITECH PRIVATE LIMITED"
Comp_Data$Denominations[Index] <- 10000000
Comp_Data$Issue.Branch.Code[Index] <- "00300"
Comp_Data$Prefix[Index] <- "OC"
Comp_Data$Status[Index] <- "Paid"
Comp_Data$Issue.Teller[Index] <- 1022034 # same value is used as it's not much useful information
## Adding Bond no. for all the rows of these 3 URNs
Comp_Data$Bond.No[Index] <- c(16524, 16531, 16521, 16535, 16529, 16527, 16523, 16519, 16533, 16569, 16577, 16565, 16567,16573, 16563, 16571, 16575, 16636,16638, 16644, 16642, 16640)
write.xlsx(Comp_Data, file="Comp_Cleaned.xlsx", row.names = FALSE)
print(paste0("No of NAs: ", sum(is.na(Comp_Data))))
Upon inspection, around 16 rows had incorrect information because of the parsing approach used; these were cleaned manually.
NOTE: the NA values were cleaned manually and the file was renamed to “Comp_Cleaned_Man.xlsx”.
Now we will load the cleaned data for a tabular representation.
Comp_Data <- read_xlsx(path = "./Data/Processed xlsx/Comp_Cleaned_man.xlsx", sheet = "Sheet1")
Comp_Data$unique_ID <- paste0(Comp_Data$Prefix,
str_pad(Comp_Data$Bond.No, width = 5,pad = "0"))
fx_name <- function(x) {c(length(x), median(x), round(mean(x),0), sum(x))}
agg_data = aggregate(Denominations ~ Purchaser, data = Comp_Data, FUN = fx_name)
agg_data_df <- data.frame("Purchaser" = agg_data$Purchaser,
"Count" = agg_data$Denominations[,1],
"Median" = agg_data$Denominations[,2],
"Mean" = agg_data$Denominations[,3],
"Sum" = agg_data$Denominations[,4])
datatable(agg_data_df, options = list(
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#fff', 'color': '#111'});",
"$(this.api().table().container()).css({'background-color': '#fff', 'color': '#111'});",
"$(this.api().table().body()).css({'background-color': '#fff', 'color': '#111'});",
"}")))%>% formatCurrency(c('Median', 'Mean', 'Sum'),
currency = "₹",
interval = 3,
mark = ",",
digits = 0)
Processing the PDF in an easier way:
The tabulizer package can extract tables from PDFs directly. I find it slow, but it works well for tabular data. Unlike pdftools, it requires rJava.
# remotes::install_github(c("ropensci/tabulizerjars", "ropensci/tabulizer"))
library(tabulizer)
out_tables <- extract_tables("./2024-03-25-Electoral-Bond-Analysis/Data/Raw Data/Company.pdf")
Out <- out_tables[[1]]
for (ii in 2:length(out_tables)) {
temp <- out_tables[[ii]]
temp1 <- temp[2:dim(temp)[1],]   # skip the first row (the repeated header) of each page's table
Out <- rbind(Out,temp1)
}
Combining Party and Donor data using unique_ID
p_uid <- Party_Data$unique_ID # Unique ID for Party
c_uid <- Comp_Data$unique_ID # Unique ID for Donor
## Find how many unique IDs from the Party data have no match in the Donor data
missing_record_p_name = Party_Data$Party.Name[!p_uid %in% c_uid]
table(missing_record_p_name)
## missing_record_p_name
## AAM AADMI PARTY
## 2
## ADYAKSHA SAMAJVADI PARTY
## 30
## ALL INDIA ANNA DRAVIDA MUNNETRA KAZHAGAM
## 38
## ALL INDIA TRINAMOOL CONGRESS
## 36
## BHARAT RASHTRA SAMITHI
## 33
## BHARATIYA JANATA PARTY
## 1145
## BIHAR PRADESH JANTA DAL(UNITED)
## 2
## DRAVIDA MUNNETRA KAZHAGAM (DMK)
## 7
## JANATA DAL ( SECULAR )
## 25
## JHARKHAND MUKTI MORCHA
## 10
## NATIONALIST CONGRESS PARTY MAHARASHTRA PRADESH
## 25
## PRESIDENT, ALL INDIA CONGRESS COMMITTEE
## 238
## RASHTRIYA JANTA DAL
## 10
## SHIVSENA
## 36
## TELUGU DESAM PARTY
## 19
## YSR CONGRESS PARTY (YUVAJANA SRAMIKA RYTHU CONGRESS PARTY)
## 24
Now, the total number of missing records found by comparing the unique IDs of the Party data with the Company/Donor data is 1680. Of these, 130 are mismatches from our analysis; the remaining 1550 are simply not present in the SBI list.
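The total above can be reproduced directly from the two ID vectors:
sum(!p_uid %in% c_uid)   # 1680: Party-side bonds whose unique_ID is absent from the Donor data (the sum of the table above)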
Next we compute how many bonds were given to each party. Based on our analysis there are a total of 24 parties and 1318 Donors in the list.
p_name <- unique(Party_Data$Party.Name)
c_name <- unique(Comp_Data$Purchaser)
Comp_table <- matrix(NA, nrow = length(c_name), ncol = length(p_name))
rownames(Comp_table) <- c_name
colnames(Comp_table) <- p_name
for(ii in seq_along(p_name)){
p_Ind <- Party_Data$unique_ID[Party_Data$Party.Name %in% p_name[ii]]
c_Ind <- which(Comp_Data$unique_ID %in% p_Ind)
donner <- table(Comp_Data$Purchaser[c_Ind])
for (jj in seq_along(donner)) {
inx <- which(c_name %in% names(donner)[jj])
Comp_table[inx,ii] <- as.numeric(donner[jj])
}
}
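The nested loops work, but the same counts can also be produced without loops by merging the two datasets on unique_ID and cross-tabulating. A minimal sketch, assuming the column names constructed above (note that table() reports 0 where the loop leaves NA):
# Same counts via a merge + cross-tabulation (matched bonds only)
merged <- merge(Party_Data[, c("unique_ID", "Party.Name")],
                Comp_Data[, c("unique_ID", "Purchaser")],
                by = "unique_ID")
alt_table <- table(merged$Purchaser, merged$Party.Name)   # donors in rows, parties in columns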
Now the table:
datatable(as.data.frame(Comp_table), class = 'cell-border stripe', options = list(pageLength=5, scrollX='400px',
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#fff', 'color': '#111'});",
"$(this.api().table().container()).css({'background-color': '#fff', 'color': '#111'});",
"$(this.api().table().body()).css({'background-color': '#fff', 'color': '#111'});",
"}")))
Also, to see all the party data, use this link 🔗
Data <- read.csv(file = "Data/Agg_Data_Sorted.csv")
datatable(Data, class = 'cell-border stripe', options = list(pageLength=5, scrollX='400px',
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#fff', 'color': '#111'});",
"$(this.api().table().container()).css({'background-color': '#fff', 'color': '#111'});",
"$(this.api().table().body()).css({'background-color': '#fff', 'color': '#111'});",
"}")))