Normalize Geographic Variables

library(campfin)
library(dplyr)
library(stringr)
library(readr)
library(scales)

Background

This vignette contains an example of the workflow used by the wranglers on The Accountability Project, a tool created by The Investigative Reporting Workshop in Washington, DC. The Accountability Project curates, cleans, and indexes public data to give journalists, researchers and others a simple way to search across otherwise siloed records. The data focuses on people, organizations and locations.

Since state-level campaign finance data is typically reported by the campaigns, which in turn rely on information provided by the contributors themselves, there is often great disparity in data quality. The campfin package can be used to reduce this disparity in a consistent, confident, and programmatic manner.

Messy Data

In this vignette, we will clean some extremely messy fake contribution data from a fictitious campaign in Vermont. This data is found in the vt_contribs.csv file, located in the package's inst/extdata and data-raw directories.

read_csv(
  file = system.file("extdata", "vt_contribs.csv", package = "campfin")
)
#> 
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#>   id = col_character(),
#>   cand = col_character(),
#>   date = col_character(),
#>   amount = col_double(),
#>   first = col_character(),
#>   last = col_character(),
#>   address = col_character(),
#>   city = col_character(),
#>   state = col_character(),
#>   zip = col_character()
#> )
#> # A tibble: 10 x 10
#>    id    cand     date    amount first last   address       city    state zip   
#>    <chr> <chr>    <chr>    <dbl> <chr> <chr>  <chr>         <chr>   <chr> <chr> 
#>  1 01    Bob Mil… 02/09/…     10 Lisa  Miller 4 Sheffield … Sheffi… VT    05866 
#>  2 02    Bob Mil… 03/09/…     20 Deb   Brown  Requested     Reques… RE    00000 
#>  3 03    Chuck W… 04/09/…     25 <NA>  <NA>   p.o. box 567  Midleb… Verm… 05753…
#>  4 04    Chuck W… 05/09/…    100 Josh  Jones  sugarhouse S… e Cori… V.T.  5076  
#>  5 05    Bob Mil… 02/09/…     10 Lisa  Miller 4 Sheffield … Sheffi… VT    05866 
#>  6 06    Chuck W… 06/09/…   1000 Bob   Taylor 55 thisplace… young … mn    55555 
#>  7 07    Chuck W… 07/09/…   -600 Alex  Johns… 11 Liberty A… Bristo… VT    05443 
#>  8 08    Alice W… 08/09/…      0 Ruth  Smith  2 Burlington… Brulin… vt    05401 
#>  9 09    Alice W… 09/09/…     69 Joe   Garcia 770   5th-st… Washin… DC    20001…
#> 10 10    Alice W… 11/09/…    222 Dave  Wilson XXXXXXXXXXXX… SA      Texas 78202

What are some of the potential problems we can see in this data?

While this data is obviously much smaller and messier than real campaign finance data, these errors are not uncommon and need to be addressed. The campfin package contains many of the tools we need to fix these problems.

Read

In most cases, the first step is to download and read the file from a state agency. When reading the data with the popular readr::read_delim() function, the col_date_usa() function can be used as a quick shortcut for readr::col_date(format = "%m/%d/%Y"), the format most commonly found in U.S. campaign finance data.

vt <- read_csv(
  file = system.file("extdata", "vt_contribs.csv", package = "campfin"),
  trim_ws = FALSE,
  col_types = cols(
    amount = col_number(),
    date = col_date_usa()
  )
)

We can see how the new date column is an actual date object, allowing for mathematical manipulation.

transmute(vt, date, next_week = date + 7)
#> # A tibble: 10 x 2
#>    date       next_week 
#>    <date>     <date>    
#>  1 2019-02-09 2019-02-16
#>  2 2009-03-09 2009-03-16
#>  3 2019-04-09 2019-04-16
#>  4 2019-05-09 2019-05-16
#>  5 2019-02-09 2019-02-16
#>  6 2019-06-09 2019-06-16
#>  7 2019-07-09 2019-07-16
#>  8 2019-08-09 2019-08-16
#>  9 2019-09-09 2019-09-16
#> 10 2019-11-09 2019-11-16

Next, we should try to normalize our data as much as possible. We can use some simple counting functions and built-in vectors to check the cleanliness of our raw data.

percent(prop_in(vt$city, str_to_lower(valid_city)))
#> [1] "10%"
percent(prop_in(vt$state, valid_state))
#> [1] "40%"
percent(prop_in(vt$zip, valid_zip))
#> [1] "60%"
col_stats(vt, n_distinct)
#> # A tibble: 10 x 4
#>    col     class      n     p
#>    <chr>   <chr>  <int> <dbl>
#>  1 id      <chr>     10   1  
#>  2 cand    <chr>      3   0.3
#>  3 date    <date>     9   0.9
#>  4 amount  <dbl>      9   0.9
#>  5 first   <chr>      9   0.9
#>  6 last    <chr>      9   0.9
#>  7 address <chr>     10   1  
#>  8 city    <chr>      9   0.9
#>  9 state   <chr>      8   0.8
#> 10 zip     <chr>      9   0.9
col_stats(vt, count_na)
#> # A tibble: 10 x 4
#>    col     class      n     p
#>    <chr>   <chr>  <int> <dbl>
#>  1 id      <chr>      0   0  
#>  2 cand    <chr>      0   0  
#>  3 date    <date>     0   0  
#>  4 amount  <dbl>      0   0  
#>  5 first   <chr>      1   0.1
#>  6 last    <chr>      1   0.1
#>  7 address <chr>      0   0  
#>  8 city    <chr>      0   0  
#>  9 state   <chr>      0   0  
#> 10 zip     <chr>      0   0

A typical benchmark is to reach greater than 95% “valid.” That is, we want to normalize our data enough that less than 5% of our data cannot be easily confirmed as valid using a fairly comprehensive list of cities, states, and ZIP codes.
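
As a minimal sketch, that benchmark could be expressed programmatically; note the 0.95 threshold here is our own convention, not a package constant:

# does the share of recognizable ZIP codes clear the 95% benchmark?
prop_in(vt$zip, valid_zip, na.rm = TRUE) >= 0.95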

We will first try to reach this threshold by normalizing our data. This process involves reducing inconsistencies through string manipulation. There are separate normal_*() functions for each of the four types of geographic variables. Here, we will use dplyr::mutate() to create new, normalized versions of our messy columns. This preserves the old data for transparency.

vt <- vt %>% 
  mutate(
    address = normal_address(
      address = address,
      abbs = usps_street,
      na = invalid_city,
      na_rep = TRUE
    ),
    city = normal_city(
      city = city,
      abbs = usps_city,
      states = "VT",
      na = invalid_city,
      na_rep = TRUE
    ),
    state = normal_state(
      state = state,
      abbreviate = TRUE,
      na_rep = TRUE,
      valid = valid_state
    ),
    zip = normal_zip(
      zip = zip,
      na_rep = TRUE
    )
  )

We can see how these functions and our built-in data were used to normalize the geographic contributor data and remove anything that didn’t convey real information. This format is much more explorable and searchable.
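
The normalized columns shown below can be printed with a simple selection, e.g.:

# view only the four normalized geographic columns
select(vt, address, city, state, zip)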

#> # A tibble: 10 x 4
#>    address            city          state zip  
#>    <chr>              <chr>         <chr> <chr>
#>  1 4 SHEFFIELD SQ RD  SHEFFIELD     VT    05866
#>  2 <NA>               <NA>          <NA>  <NA> 
#>  3 PO BOX 567         MIDLEBURY     VT    05753
#>  4 SUGARHOUSE SQ      EAST CORINTH  VT    05076
#>  5 4 SHEFFIELD SQ AVE SHEFFIELD     VT    05866
#>  6 55 THISPLACE AVE   YOUNG AMERICA MN    <NA> 
#>  7 11 LIBERTY AVE     BRISTOL       VT    05443
#>  8 2 BURLINGTON SQ    BRULINGTON    VT    05401
#>  9 770 5 TH ST NW     WASHINGTON    DC    20001
#> 10 <NA>               SA            TX    78202

Cities

However, the problem has not been solved. Most troublesome are the city names. There are so many city names, and such great variety (compared to states and ZIP codes), that they can be difficult to normalize and difficult to assess.

Our valid_city vector contains many city names, but far fewer than exist in the country, especially when you account for neighborhoods that aren’t really cities but shouldn’t be changed (some of these are contained in our curated extra_city vector).

length(valid_city)
#> [1] 19083
sample(valid_city, 6)
#> [1] "KIRON"            "NORTH FORT MYERS" "BRIAR"            "NEW MIDWAY"      
#> [5] "NORTH NEWTON"     "SILVER GATE"
sample(extra_city, 6)
#> [1] "WILDWOOD CREST"       "MOCANAQUA"            "DORMONT"             
#> [4] "HELLAM"               "SHARONVILLE"          "INDIAN HARBOUR BEACH"
# combine both vectors
many_city <- c(valid_city, extra_city)

Still, checking against this combined list is a good way to identify values that need additional attention.
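
The check shown below can be produced with a sketch like the following, which also creates the bad data frame used in the next steps:

# keep only the records whose city is absent from our combined list
(bad <- vt %>% 
  mutate(valid = city %in% many_city) %>% 
  filter(!valid) %>% 
  select(id, city, state, zip, valid))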

#> # A tibble: 3 x 5
#>   id    city       state zip   valid
#>   <chr> <chr>      <chr> <chr> <lgl>
#> 1 03    MIDLEBURY  VT    05753 FALSE
#> 2 08    BRULINGTON VT    05401 FALSE
#> 3 10    SA         TX    78202 FALSE

It might not be clear what’s actually wrong with these values. A common way to check is by comparing them against the city expected for a given ZIP code.
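
A sketch of that comparison, joining each flagged record against the zipcodes lookup table included in campfin:

# attach the city expected for each record's state and ZIP code
(bad <- bad %>% 
  rename(city_raw = city) %>% 
  left_join(zipcodes, by = c("state", "zip")) %>% 
  rename(city_match = city))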

#> # A tibble: 3 x 6
#>   id    city_raw   state zip   valid city_match 
#>   <chr> <chr>      <chr> <chr> <lgl> <chr>      
#> 1 03    MIDLEBURY  VT    05753 FALSE MIDDLEBURY 
#> 2 08    BRULINGTON VT    05401 FALSE BURLINGTON 
#> 3 10    SA         TX    78202 FALSE SAN ANTONIO

Now the problems become clear. Two city names are misspelled and the third is an obvious abbreviation. When dealing with millions of city names, we need a way to check each raw value against its expected ZIP code match.

The str_dist() and is_abbrev() functions can be used to compare the value we have with the value we expect. By checking only against the city corresponding to that record’s ZIP code, we are making extremely confident changes (compared to the incredibly useful clustering algorithms like those provided by the refinr package).

First, we can use str_dist(), which calls the stringdist::stringdist() function to check the distance between the two strings; that is, the number of changes we’d need to make to our normalized value to get our expected matched value. If that distance is small (usually 1 or 2), we can confidently use the matched value!
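
For example, with the misspelling from our data (this assumes str_dist() keeps stringdist’s default OSA method, under which a transposition counts as a single edit):

str_dist("BRULINGTON", "BURLINGTON")
#> [1] 1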

But the string distance does not catch colloquial city abbreviations: NYC, BOS, LA, CHI, ABQ, BRNX, DFW, OKC. Many residents of a city get so used to writing their city’s name that they use abbreviations and assume them to be universally understood. The is_abbrev() function can be used to check whether one string might be an abbreviation of another using three common abbreviation characteristics: (1) the abbreviation is at least 2 characters shorter than the full name, (2) the first letter of both strings is the same, and (3) the letters of the abbreviation appear in the same order as in the full name. Every abbreviation generated by the base::abbreviate() function would satisfy these characteristics.
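
We can illustrate with values from our own data; note that the misspelled MIDLEBURY fails the length check, while SA passes all three:

is_abbrev(abb = "SA", full = "SAN ANTONIO")
#> [1] TRUE
is_abbrev(abb = "MIDLEBURY", full = "MIDDLEBURY")
#> [1] FALSE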

(bad <- bad %>% 
  mutate(
    match_dist = str_dist(city_raw, city_match),
    match_abb = is_abbrev(city_raw, city_match)
  ))
#> # A tibble: 3 x 8
#>   id    city_raw   state zip   valid city_match  match_dist match_abb
#>   <chr> <chr>      <chr> <chr> <lgl> <chr>            <dbl> <lgl>    
#> 1 03    MIDLEBURY  VT    05753 FALSE MIDDLEBURY           1 FALSE    
#> 2 08    BRULINGTON VT    05401 FALSE BURLINGTON           1 FALSE    
#> 3 10    SA         TX    78202 FALSE SAN ANTONIO          9 TRUE

Here’s what this process would look like when employed on an entire data frame. It’s important to ensure that the number of rows in our campaign finance data is kept consistent throughout the wrangling process.

vt <- vt %>%
  rename(city_raw = city) %>% 
  left_join(zipcodes) %>% 
  rename(city_match = city) %>% 
  mutate(
    match_dist = str_dist(city_raw, city_match),
    match_abb = is_abbrev(city_raw, city_match),
    city = if_else(match_abb | match_dist == 1, city_match, city_raw)
  ) %>% 
  select(-city_raw, -city_match, -match_dist, -match_abb)
#> Joining, by = c("state", "zip")
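
A minimal guard against a join silently duplicating records is an explicit assertion, e.g.:

# we began with 10 rows; stop if the ZIP code join changed that
stopifnot(nrow(vt) == 10)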

Now every non-missing city, state, and ZIP code value is contained in our lists of valid values.
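
That confirmation can be produced with a sketch like this:

# flag each remaining record as valid across all three lists
vt %>% 
  select(id, state, zip, city) %>% 
  filter(!is.na(city)) %>% 
  mutate(
    all_valid = city %in% many_city & 
      state %in% valid_state & 
      zip %in% valid_zip
  )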

#> # A tibble: 8 x 5
#>   id    state zip   city         all_valid
#>   <chr> <chr> <chr> <chr>        <lgl>    
#> 1 01    VT    05866 SHEFFIELD    TRUE     
#> 2 03    VT    05753 MIDDLEBURY   TRUE     
#> 3 04    VT    05076 EAST CORINTH TRUE     
#> 4 05    VT    05866 SHEFFIELD    TRUE     
#> 5 07    VT    05443 BRISTOL      TRUE     
#> 6 08    VT    05401 BURLINGTON   TRUE     
#> 7 09    DC    20001 WASHINGTON   TRUE     
#> 8 10    TX    78202 SAN ANTONIO  TRUE

Once our data is as normal as we can confidently make it, we can begin to explore. First, we’ll check the data for missing values with flag_na(), which takes a tidyselect selection of key columns to check (or something like dplyr::everything()).

(vt <- flag_na(vt, last))
#> # A tibble: 10 x 11
#>    id    cand   date       amount first last   address state zip   city  na_flag
#>    <chr> <chr>  <date>      <dbl> <chr> <chr>  <chr>   <chr> <chr> <chr> <lgl>  
#>  1 01    Bob M… 2019-02-09     10 Lisa  Miller 4 SHEF… VT    05866 SHEF… FALSE  
#>  2 02    Bob M… 2009-03-09     20 Deb   Brown  <NA>    <NA>  <NA>  <NA>  FALSE  
#>  3 03    Chuck… 2019-04-09     25 <NA>  <NA>   PO BOX… VT    05753 MIDD… TRUE   
#>  4 04    Chuck… 2019-05-09    100 Josh  Jones  SUGARH… VT    05076 EAST… FALSE  
#>  5 05    Bob M… 2019-02-09     10 Lisa  Miller 4 SHEF… VT    05866 SHEF… FALSE  
#>  6 06    Chuck… 2019-06-09   1000 Bob   Taylor 55 THI… MN    <NA>  <NA>  FALSE  
#>  7 07    Chuck… 2019-07-09   -600 Alex  Johns… 11 LIB… VT    05443 BRIS… FALSE  
#>  8 08    Alice… 2019-08-09      0 Ruth  Smith  2 BURL… VT    05401 BURL… FALSE  
#>  9 09    Alice… 2019-09-09     69 Joe   Garcia 770 5 … DC    20001 WASH… FALSE  
#> 10 10    Alice… 2019-11-09    222 Dave  Wilson <NA>    TX    78202 SAN … FALSE

Next, we’ll want to check for duplicate rows using flag_dupes(), which takes the same kind of arguments. Here, we can ignore the supposedly unique id variable. It’s possible for a person to make the same contribution on the same date, but we should flag such rows nonetheless.

(vt <- flag_dupes(vt, -id))
#> Warning in flag_dupes(vt, -id): no duplicate rows, column not created
#> # A tibble: 10 x 11
#>    id    cand   date       amount first last   address state zip   city  na_flag
#>    <chr> <chr>  <date>      <dbl> <chr> <chr>  <chr>   <chr> <chr> <chr> <lgl>  
#>  1 01    Bob M… 2019-02-09     10 Lisa  Miller 4 SHEF… VT    05866 SHEF… FALSE  
#>  2 02    Bob M… 2009-03-09     20 Deb   Brown  <NA>    <NA>  <NA>  <NA>  FALSE  
#>  3 03    Chuck… 2019-04-09     25 <NA>  <NA>   PO BOX… VT    05753 MIDD… TRUE   
#>  4 04    Chuck… 2019-05-09    100 Josh  Jones  SUGARH… VT    05076 EAST… FALSE  
#>  5 05    Bob M… 2019-02-09     10 Lisa  Miller 4 SHEF… VT    05866 SHEF… FALSE  
#>  6 06    Chuck… 2019-06-09   1000 Bob   Taylor 55 THI… MN    <NA>  <NA>  FALSE  
#>  7 07    Chuck… 2019-07-09   -600 Alex  Johns… 11 LIB… VT    05443 BRIS… FALSE  
#>  8 08    Alice… 2019-08-09      0 Ruth  Smith  2 BURL… VT    05401 BURL… FALSE  
#>  9 09    Alice… 2019-09-09     69 Joe   Garcia 770 5 … DC    20001 WASH… FALSE  
#> 10 10    Alice… 2019-11-09    222 Dave  Wilson <NA>    TX    78202 SAN … FALSE