If `my_address_file.csv` is a file in the current working directory with an address column named `address`, then the DeGAUSS command:

```sh
docker run --rm -v $PWD:/tmp ghcr.io/degauss-org/postal:0.1.4 my_address_file.csv
```

will produce `my_address_file_postal_0.1.4.csv` with added columns:
- `cleaned_address`: `address` with non-alphanumeric characters and excess whitespace removed (with `dht::clean_address()`)
- `parsed.{address_component}`: multiple columns, one for each parsed address component (e.g., `parsed.road`, `parsed.state`, `parsed.house_number`)
- `parsed_address`: a “parsed” address created by pasting together available `parsed.house_number`, `parsed.road`, `parsed.city`, `parsed.state`, and the first five digits of the `parsed.postcode` address components

After parsing, the parsed addresses can be expanded into several possible normalized addresses using libpostal. This can be useful for matching these addresses with other messy, real-world addresses.
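The parsing step can be previewed outside the container. The sketch below is illustrative only: it assumes the pypostal bindings (the `postal` Python package) and a local libpostal installation, whereas the container itself calls libpostal from R. It simply shows the kind of components that end up in the `parsed.{address_component}` columns.

```python
# Illustrative sketch only; assumes the pypostal bindings and libpostal
# are installed locally. The DeGAUSS container calls libpostal from R.
from postal.parser import parse_address

components = parse_address("3333 Burnet Ave Cincinnati OH 45229")
# e.g., [('3333', 'house_number'), ('burnet ave', 'road'),
#        ('cincinnati', 'city'), ('oh', 'state'), ('45229', 'postcode')]
for value, label in components:
    print(f"parsed.{label} = {value}")
```

Pasting the house number, road, city, state, and the first five digits of the postcode back together gives the equivalent of `parsed_address`.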
If any value is provided as an argument (e.g., “expand”), then the DeGAUSS command:

```sh
docker run --rm -v $PWD:/tmp ghcr.io/degauss-org/postal:0.1.4 my_address_file.csv expand
```

will produce `my_address_file_postal_0.1.4_expand.csv` with the above columns plus:
- `expanded_addresses`: the expanded addresses for `parsed_address`
Because each `parsed_address` will likely result in more than one expanded address, each input row is duplicated to accommodate the multiple `expanded_addresses` values. This means that when expanding addresses, the input CSV file is “expanded” too by duplicating the input rows.
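The expansion step can be previewed the same way. This is again a hedged sketch assuming the pypostal bindings and a local libpostal install, not the container's internal R code.

```python
# Illustrative sketch only; assumes the pypostal bindings and libpostal
# are installed locally.
from postal.expand import expand_address

expansions = expand_address("3333 burnet ave cincinnati oh 45229")
# libpostal typically returns several normalized variants (e.g., "ave"
# expanded to "avenue"); each variant becomes its own row in the output CSV.
for e in expansions:
    print(e)
```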
Input addresses are parsed/normalized using libpostal by:
- removing non-alphanumeric characters and excess whitespace (with `dht::clean_address()`)
- parsing with `libpostal/src/address_parser` (a machine learning model trained on OpenStreetMap and OpenAddresses)
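For intuition, the cleaning step can be approximated in a few lines of Python; this is only a rough sketch of the idea, not the actual `dht::clean_address()` implementation, whose rules may differ.

```python
import re

def clean_address_approx(address: str) -> str:
    """Rough approximation of the cleaning step: strip non-alphanumeric
    characters and collapse excess whitespace. Not the actual
    dht::clean_address() implementation, whose rules may differ."""
    cleaned = re.sub(r"[^A-Za-z0-9 ]+", " ", address)
    return re.sub(r"\s+", " ", cleaned).strip()

print(clean_address_approx("3333  Burnet Ave.,  Cincinnati, OH  45229"))
# -> "3333 Burnet Ave Cincinnati OH 45229"
```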
For detailed documentation on DeGAUSS, including general usage and installation, please see the DeGAUSS homepage.