Calling DeGAUSS from R • dht

Background

DeGAUSS is designed to be used at the command line: DeGAUSS commands are issued one at a time, each reading one CSV file and writing another. Our Sample Workflow and other step-by-step guides spell out each command, including the input and output filenames, but this approach can be inflexible, and working with CSV file names that grow longer with every step can be cumbersome.

degauss_run() is a method for using DeGAUSS on a data frame in R instead of on a CSV file on disk. Data is passed to DeGAUSS via system Docker calls and the resulting data is read back into R. Since DeGAUSS always works by adding additional columns to the original input data, degauss_run() can be used to flexibly create DeGAUSS data pipelines from within R. This also allows for cleaner integration of Make-like workflows for geomarker assessment at scale.
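
The round trip described above can be sketched in a few lines of base R. This is an illustration, not the actual implementation of degauss_run(): the helper name degauss_command() is hypothetical, and the command layout is an assumption based on the form used in the DeGAUSS command-line guides, where a directory is mounted at /tmp and images are pulled from ghcr.io/degauss-org. The sketch only builds the command string and never calls Docker.

```r
# Hypothetical helper: build (but do not run) the Docker command for one DeGAUSS step.
# Assumes images are published as ghcr.io/degauss-org/<image>:<version> and read
# a CSV file from a directory mounted at /tmp inside the container.
degauss_command <- function(image, version, csv_file,
                            work_dir = getwd(), argument = NULL) {
  parts <- c(
    "docker", "run", "--rm",
    "-v", shQuote(paste0(work_dir, ":/tmp")),
    paste0("ghcr.io/degauss-org/", image, ":", version),
    csv_file,
    argument # NULL drops out of c(), so no trailing argument is appended
  )
  paste(parts, collapse = " ")
}

cmd <- degauss_command("geocoder", "3.2.1", "addresses.csv", work_dir = "/data")
print(cmd)
```

A wrapper in the spirit of degauss_run() would presumably write the data frame to a temporary CSV file, run a command like this one with system(), and read the resulting CSV back into R.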

An example DeGAUSS pipeline in R might look like this:

addresses |>
  degauss_run("postal", "0.1.4") |>
  dplyr::select(id, address = parsed_address) |>
  degauss_run("geocoder", "3.2.1") |>
  dplyr::select(id, lat, lon) |>
  degauss_run("census_block_group", "0.6.0") |>
  degauss_run("greenspace", "0.3.0")

The next section is a detailed example with step-by-step code and output.

Example

library(dplyr, warn.conflicts = FALSE)
library(dht)

Create addresses

Addresses could be imported from a CSV file, database, or other source. For the sake of this example, we’ll use ten addresses randomly sampled from all dwellings in Hamilton County, Ohio:

addresses <-
  tibble::tibble(
    id = paste0("g_", 1:10),
    address = c(
      "518 Fortune Ave Cincinnati OH 45219",
      "3201 Stanhope Av Apt. 2 Cincinnati, OH 45211",
      "3917 Catherine Av Norwood OH 45212",
      "9960 Carolina Trace Road Harrison Township OH 45030",
      "332 East Sharon Rd Unit #15 Glendale OH 45246",
      "10101 Hamilton Cleves Road Crosby Township OH 45030",
      "6076 Lagrange Ln Green Township, OH 45239",
      "1325 Fuhrman Rd Reading, OH 45215",
      "8831 Wellerstation Drive Montgomery OH 45249",
      "2916 Willow Ridge Dr Colerain Township OH 45251"
    ))
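
If the addresses live in a CSV file instead, a minimal base R sketch of the import might look like the following. The file written here is a stand-in created only so the example is self-contained, and the name addresses_from_csv is hypothetical; the only assumption carried over from this example is that the data has id and address columns.

```r
# Write a small stand-in file so the sketch is self-contained;
# in practice this would be an existing file on disk.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,address",
  "g_1,518 Fortune Ave Cincinnati OH 45219",
  "g_2,3917 Catherine Av Norwood OH 45212"
), csv_path)

# Read every column as character so ids and ZIP codes are not converted to numbers.
addresses_from_csv <- read.csv(csv_path, colClasses = "character")

# Fail early if the columns used in the rest of this example are missing.
stopifnot(all(c("id", "address") %in% names(addresses_from_csv)))
```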

Parse Addresses

Before geocoding, we use the DeGAUSS postal image to create a “normalized and parsed” address column. We then keep only the id column and rename the parsed addresses to address so that they can be passed to the geocoder in the next step.

d <-
  addresses |>
  degauss_run("postal", "0.1.4", quiet = TRUE) |>
  select(id, address = parsed_address)

Geocode Parsed Addresses

Geocode the addresses and keep only the id, latitude (lat), and longitude (lon) columns, discarding the addresses.

d <- d |>
  degauss_run("geocoder", "3.2.1", quiet = TRUE) |>
  select(id, lat, lon)
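
Because the address column is discarded at this point, it can be worth checking which records did not receive coordinates before moving on. The sketch below uses a small mock data frame with made-up coordinates rather than real geocoder output, and it assumes that records that could not be geocoded have missing (NA) lat and lon values.

```r
# Mock stand-in for geocoder output; the coordinates are made up for illustration.
geocoded <- data.frame(
  id = c("g_1", "g_2", "g_3"),
  lat = c(39.13, NA, 39.21),
  lon = c(-84.51, NA, -84.45)
)

# Assumption: failed geocodes are returned with NA coordinates.
not_geocoded <- geocoded$id[is.na(geocoded$lat) | is.na(geocoded$lon)]
print(not_geocoded)
```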

Add 2010 Census Tract Geographies

Attach 2010 census tract and block group identifiers. Here, we specify argument = "2010" so that the DeGAUSS image uses the vintage of census geographies that matches the tract-level data joined in the next step.

d <- d |>
  degauss_run("census_block_group", "0.6.0", argument = "2010", quiet = TRUE)

Merge Census Data

Census tract identifiers can be used to link many different types of tract-level data. Here, we download the 2018 material deprivation index, which is defined on 2010 census tract boundaries, and join it to our data.

dep_index <- "https://github.com/geomarker-io/dep_index/raw/master/2018_dep_index/ACS_deprivation_index_by_census_tracts.rds" |>
  url() |>
  gzcon() |>
  readRDS() |>
  as_tibble()

d <- d |>
  left_join(dep_index, by = c("census_tract_id_2010" = "census_tract_fips"))
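
A left join keeps every row of d, so locations whose tract identifier has no match in the linked table silently receive missing values. A minimal sketch of checking for this follows, using base R and two mock data frames. The column names tract and dep_index_value and all values are hypothetical placeholders, not real identifiers or index values.

```r
# Mock data: placeholder tract identifiers, not real FIPS codes.
d_mock <- data.frame(
  id = c("g_1", "g_2", "g_3"),
  tract = c("00000000001", "00000000002", NA)
)
index_mock <- data.frame(
  tract = "00000000001",
  dep_index_value = 0.5
)

# Base R equivalent of a left join: keep all rows of d_mock.
joined <- merge(d_mock, index_mock, by = "tract", all.x = TRUE)

# Rows with no match (or no tract identifier at all) have a missing index value.
unmatched <- sort(joined$id[is.na(joined$dep_index_value)])
print(unmatched)
```

Unmatched rows can point to failed geocodes or to a mismatch between the vintage of the tract identifiers and the vintage of the linked data.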

Add Greenness

As an example of geomarker assessment that does not rely on census geography, we add the mean EVI (enhanced vegetation index) within three differently sized buffers (radii of 500, 1,500, and 2,500 m) around each location.

d <- d |>
  degauss_run("greenspace", "0.3.0", quiet = TRUE)

Our geomarker assessment process is complete and everything is linked in one data frame, ready for storing, analyzing, and sharing.