Getting started with Docker requires installing “Docker Desktop” on a Windows or macOS machine or a “Docker Engine” on a Linux operating system. Follow the official installation instructions specific to your operating system. Docker also has detailed installation guides, user manuals, and troubleshooting for both macOS and Windows.
To test your installation, open a shell and run
docker run hello-world
. (See the next section for details
on using a shell for command line instructions if you are unfamiliar.)
You should see some output describing what Docker did and that it is
working correctly:
$ docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
ca4f61b1923c: Pull complete
Digest: sha256:ca0eeb6fb05351dfc8759c20733c91def84cb8007aa89a5bf606bc8b315b9fc7
Status: Downloaded newer image for hello-world:latest
Hello from Docker!
This message shows that your installation appears to be working correctly.
...
Notice that after asking Docker to run a container, if it does not find the image locally, it downloads it from an online repository. This is only necessary the first time you run a container from each image. Once downloaded, Docker will continue to use the same local image to create containers.
If you are comfortable using the command line, please skip to DeGAUSS Commands.
DeGAUSS is operated through a command line interface by using a shell to issue Docker commands. If using macOS, access the command line by opening the “Terminal” application. For Windows, use the “Command Prompt” (sometimes abbreviated as “CMD”) or “Windows Powershell”. An alternative on Windows is to use a Linux shell through the Windows Subsystem for Linux.
Commands typed into a shell operate relative to a “working directory”. This allows us to access the input file by specifying its name without the full path including all parent folders/directories, but this requires us to first navigate to the directory on our computer where the input file is located.
After opening a new shell, navigate to the folder/directory where the
input file is stored by using the cd
command (for
change directory). For example, use
cd Users/Alice/Documents/my_project
to change to the
my_project
directory in Alice
’s
Documents
folder. (File/folder browsers in macOS and
Windows often have a secondary click contextual menu option to copy the
path to the current folder, which is usefuly for constructing
cd
commands in a shell.)
For more information on the command line, see these useful tuturials for macOS and Windows.
After succesfully navigating to the folder where the input file is stored, you are ready to use a DeGAUSS command.
DeGAUSS commands are essentially Docker commands with some specified arguments. Below, an example command is broken into color-coded annotated sections.
To use this generally for any DeGAUSS application,
my_address_file.csv
would be replaced by the name of your
csv file located in the current working directory of your shell.
geocoder
and 3.0.2
would be replaced by the
name and version, respectively, of the degauss container you would like to
run.
One caveat for using DeGAUSS commands on Windows is the use of the
$PWD
variable, which relies on the convention that the
shell will evaluate this as the current working directory. If using
Windows Command Prompt (but not Windows Powershell or Windows Subsystem
for Linux), this variable is not present and instead must be changed to
%cd%
. Windows Powershell users may have to use
${PWD}
, please see here for more details on
$PWD
for different Windows operating systems.
Addresses must be stored as a CSV
file and follow these
formatting requirements:
address
and an optional identifier column (e.g.,
id
). Fewer columns will increase geocoding speed.address
.32709
) and not
“plus four” (i.e. 32709-0000
)St.
instead of
Street
or OH
instead of
Ohio
)13
instead of thirteen
)3333 Burnet Ave Cincinnati 45229 OH
)Make sure Docker is running and open a new shell.
Navigate to the directory where your address file is located.
Enter the DeGAUSS command. See the example below:
If my_address_file.csv
is a file in the current working
directory with an address column named address
, then
docker run --rm -v $PWD:/tmp degauss/geocoder:3.0.1 my_address_file.csv
will produce my_address_file_geocoded_v3.0.csv
with
added columns including lat
, lon
, and
geocoding diagnostic information.
The geocoder’s output file includes the following columns:
matched_street
, matched_city
,
matched_state
, matched_zip
: matched address
componets (e.g., matched_street
is the street the geocoder
matched with the input address); can be used to investigate input
address misspellings, typos, etc.
precision
: The qualitative precision of the geocode.
The value will be one of:
range
: interpolated based on address ranges from
street segments
street
: center of the matched street
intersection
: intersection of two streets
zip
: centroid of the matched zip code
city
: centroid of the matched city
score
: The percentage of text match between the
given address and the geocoded result, expressed as a number between 0
and 1. A higher score indicates a closer match. Note that each score is
relative within a precision method (i.e. a score
of
0.8
with a precision
of range
is
not the same as a score
of 0.8
with a
precision
of street
).
lat
and lon
: geocoded coordinates for
matched address
geocode_result
: A qualitative summary of the
geocoding result. The value will be one of
po_box
: the address was not geocoded because it is a
PO Box
cincy_inst_foster_addr
: the address was not geocoded
because it is a known institutional address, not a residential
address
non_address_text
: the address was not geocoded
because it was blank or listed as “foreign”, “verify”, or
“unknown”
imprecise_geocode
: the address was geocoded, but
results were suppressed because the precision
was
intersection
, zip
, or city
and/or
the score
was less than 0.5
.
geocoded
: the address was geocoded with a
precision
of either range
or
street
and a score
of 0.5
or
greater.
intersection
,
zip
, or city
are returned with a missing
lat
and lon
because they are likely too
inaccurate and/or too imprecise to be used for further analysis.lat
and lon
are also returned
as missing if the score
is less than 0.5
(regardless of the precision). This threshold can be changed by
including an optional argument in the docker call
(docker run --rm -v $PWD:/tmp degauss/geocoder:3.0 my_address_file.csv 0.4
).The geomarker assessment images will only work with the output of the
geocoding docker image (or a CSV file with columns named
lat
and lon
). Similar to before, navigate to
the directory where the geocoded CSV file is located. If you are running
geomarker assessment right after geocoding and using the same shell, the
files will be in the same location, so no further navigation is
necessary.
Run:
docker run --rm -v "$PWD":/tmp degauss/<name-of-image> <name-of-geocoded-file>
Continuing with our usage example, if we wanted to calculate the distance to the nearest road and length of roads within a 400 m buffer for each subject, we could use the degauss/roads image:
docker run --rm -v "$PWD":/tmp degauss/roads my_address_file_geocoded.csv
Docker will emit some messages as it progresses through the
calculations and will again write the file to the working directory with
a descriptive name appended, in this case the distance to nearest
primary (dist_to_1100
) and secondary
(dist_to_1200
) roads and the length of primary
(length_1100
) and secondary (length_1200
)
roads within a 400 m buffer.
Again, our output file will be written into the same directory as our
input file. In our example above, this will be called
my_address_file_geocoded_roads.csv
:
id | address | lat | lon | dist_to_1100 | dist_to_1200 | length_1100 | length_1200 |
---|---|---|---|---|---|---|---|
131 | 1922 CATALINA AV CINCINNATI OH 45237 | 39.17112 | -84.46176 | 502.7 | 534.8 | 0 | 0 |
540 | 5358 LILIBET CT DELHI TOWNSHIP OH 45238 | 39.11552 | -84.61902 | 5793.1 | 1654.7 | 0 | 0 |
112 | 630 GREENWOOD AV CINCINNATI OH 45229 | 39.15321 | -84.49236 | 1453.0 | 548.5 | 0 | 0 |
Please note that the geomarker assesment programs will return
NA
for geomarkers when coordinate values are missing.
Missing coordinate values are possible if the geocoding container failed
to assign them, for example, when using a malformed address string. A
user should verify that the address strings have been recorded
correctly; however, geocoding sometimes fails even with a correctly
supplied address due to inconsistencies and inaccuracies in the street
range files provided by the census.
Now that we have our desired geomarkers, we can remove the addresses and coordinates from our output file, leaving only the geomarker information that will be associated with health outcomes in a downstream analysis:
id | dist_to_1100 | dist_to_1200 | length_1100 | length_1200 |
---|---|---|---|---|
131 | 502.7 | 534.8 | 0 | 0 |
540 | 5793.1 | 1654.7 | 0 | 0 |
112 | 1453.0 | 548.5 | 0 | 0 |
In theory, since this file no longer contains any PHI, it is no longer subject to HIPAA and can be shared with others or used with third party online services. In reality, we are applying the “Safe Harbor” method defined by HIPAA for deidentification, but re-identification is certainly possible when enough geomarkers and non-identifying information are combined together. Do not take the use of DeGAUSS as a guarantee of deidentification and please consult your institution for more information relating to their specific policies around sharing data.