This chapter aims to introduce basic map making concepts and techniques. It applies the analysis workflow that we have learnt in Chapter 3 to spatial visualization. After completing this tutorial, you should be able to:
Similar to what we have done in previous labs, under the course folder, please create a folder called “lab4”. Next, in the lab4 folder, please create two sub-folders that one is called “data” and another one is “plot”. We’ll use the “data” folder to store data collected from the internet. The “plot” will be used to export plots generated from R.
In this lab, we continue to explore US Census Bureau as a potential data source for administrative boundaries, and demographic, economic and population data. Specifically, we can collect:
Cartographic Boundaries by geography (e.g., cities, counties, census tracts, block groups) or by scale. And
Census Tables summarizing the socio-economic and demographic characteristics of particular areas.
This practical will use El Paso county (where UCCS is located) as a case study to explore its poverty level through map making. Please follow the steps below to download data, unzip it and move the data to the required folder.
If there you have any questions about the above-mentioned steps, please refer to Chapter 3.2.2 for detailed instructions.
Again, we would like to start a new project from scratch with a clean R Script. Please do the following steps. If you have any questions about these steps, please refer to the previous chapters for help.
Heads-Up!
From this lab, all scripts will be non-copyable. Please try to:
This section starts with creating a simple map of counties in the state of Colorado.
We are going to use the following packages: * dplyr: data manipulation * tmap: mapping * sf: represent simple features as records in a data.frame or tibble with a geometry list-column. * data.table: data manipulation such as renaming.
You should have the libraries tmap
and
dplyr
installed from previous labs. But in case you don’t,
please uncomment the bottom two lines and run.
#### install library
install.packages("sf")
install.packages("data.table")
#### If you don't have "tmap" and "dplyr" installed, please uncomment these two lines and run.
# install.packages("tmap")
# install.packages("dplyr")
Next, let’s load some packages.
# load library
library(dplyr)
library(tmap)
library(sf)
library(data.table)
Check the current working directory. You should see a path ending
with something similar to COURSEFOLDER/lab4
. If not, please
close the RStudio, go back to Section
4.2.1 to organize your folder, and Section
4.2.3 to re-launch RStudio.
# working directory
getwd()
We will read the data from the “data” folder. Please make sure you can see four folders in your data folder as shown in Figure 4.1. If not, please go back to Section 4.2.2.
We will import four data files and save them under different variable
names. In previous labs, we have used the function
read.csv()
to import .csv
files. This time, we
will use st_read()
to read shapefile (.shp).
Hint: The easy way to write the
"FILE_PATH"
is to go to your “data” folder and
copy folder & file names, and organize them using forward slashes
/
.
## US counties (simple features)
sf_us_county <- st_read("data/cb_2023_us_county_500k/cb_2023_us_county_500k.shp")
## Colorado census tracts (simple features)
sf_co_tract <- st_read("data/cb_2023_08_CO_tract_500k/cb_2023_08_tract_500k.shp")
## El Paso family in poverty - data (data frame)
df_ep_poverty <- read.csv("data/ACSST5Y2022.S1702_poverty_family_El Paso/ACSST5Y2022.S1702-Data.csv")
## El Paso family in poverty - metadata (data frame)
df_ep_poverty_meta <- read.csv("data/ACSST5Y2022.S1702_poverty_family_El Paso/ACSST5Y2022.S1702-Column-Metadata.csv")
Please open these data in RStudio (by clicking objects in the Environment panel) and take some time to understand what are the columns.
We will start with the sf_us_county
. This data includes
the information for all counties in the U.S. It also has a column
geometry storing the spatial boundaries of counties as
polygons.
Let’s check the class of the sf_us_county
data by using
the function class()
and have a look at it using
View()
.
# check class
class(sf_us_county)
# open
View(sf_us_county)
After examining the data, it seems sf_us_county
includes
all counties in the U.S. But we are only interested in those in the
state of Colorado. Let’s use the function filter()
to
extract only Colorado counties from
sf_us_county
based on their values in the
STATE_NAME
column, and save it as a new object
sf_co_county
.
# extract CO counties
sf_co_county <- sf_us_county %>% filter(STATE_NAME == "Colorado")
The
%>%
symbol is used to build pipes that chains multiple operations together. It takes the output of the expression on its left and passes it as the first argument to the function on its right. The shortcut of%>%
is [ctrl
+shift
+M
] (Windows) or [Cmd
+Shift
+M
] (Mac).
Have a look at the data sf_co_county
. Please write to
your lab4 report:
[Q1] How many counties in Colorado based on the data
sf_co_county
[Q2] In sf_co_county
, please locate the
record of El Paso county, what is its GEOID
?
View(sf_co_county)
Next,we will extract the record of El Paso county from
sf_co_county
based on its value in the column
NAME
.
# El Paso County
sf_ep_county <- sf_co_county %>% filter(NAME == "El Paso")
In addition to use the NAME
column to extract El Paso
county from sf_co_county
. We can also use other columns
such as GEOID
and NAMELSAD
.
[Q3] Please fill and run the commands below, and type
your answers to lab4 report.
Hint: Have a look at the code above and also have a
look at the data sf_co_county
.
sf_ep_county <- sf_co_county %>% filter(GEOID == "TYPE_ANSWER")
sf_ep_county <- sf_co_county %>% filter(NAMELSAD == "TYPE_ANSWER")
Okay, after data extraction, we have two new datasets representing
counties in Colorado and also the county of El Paso. Let’s do some
mapping using the tmap
package.
First, we set the tmap_mode()
to plot
for
static maps.
# static map
tmap_mode("plot")
Then, produce a simple map of Colorado Counties (MAP1).
tm_shape()
specifies what a spatial data object (in our
case, the simple feature class).tm_polygons()
draws polygons.# MAP1
tm_shape(sf_co_county) + tm_polygons()
Produce another map of El Paso county only (MAP2). This time, we specify the fill color and also the border color. Feel free to try other colors.
# MAP2
tm_shape(sf_ep_county) + tm_polygons(col = "orange", border.col = "blue")
Overlap MAP1 and MAP2 to produce another map that highlights the El Paso county (MAP3).
# MAP3
tm_shape(sf_co_county) + tm_polygons() +
tm_shape(sf_ep_county) + tm_polygons(col = "orange", border.col = "blue")
Improve MAP3 by using the following functions and
save the plot as a new variable map_ep_county
.
tm_text()
adds labels showing county names based on the
sf_co_county
data.tm_compass()
: adds a north arrow.tm_scale_bar()
: adds a scale bar and put it in the
bottom-right.# improve MAP3
map_ep_county <- tm_shape(sf_co_county) + tm_polygons(alpha = 0) +
tm_shape(sf_ep_county) + tm_polygons(col = "orange", border.col = "black", lwd = 2) +
tm_shape(sf_co_county) + tm_text(text = "NAME", size = 0.3) +
tm_compass(north = 0, size = 1) + tm_scale_bar(position = c("right", "bottom"), width = 0.15)
Export the improved MAP3 to the “plot” folder we have created in Step 4.2.1.
# export
pdf("plot/map3_elpaso_county.pdf")
map_ep_county
dev.off()
[Q4] Please go to your “plot” folder, open the pdf “map3_elpaso_county.pdf”, and paste the screenshot to your lab4 report.
This section will zoom into the census tracts in the state of Colorado.
According to U.S. Census Bureau, census tracts are small, relatively. permanent statistical subdivisions of a county that are used to collect and present census data. On average, each census tract typically has 4,000 residents (min=2,500, and max=8,000).
We have imported the census tract boundaries in Colorado in Step
4.3.1 and saved it into the variable sf_co_tract
. Let’s
have a look at the data.
[Q5] Please report how many census tracts in
Colorado according to sf_co_tract
.
[Q6] What column to use to extract census tracts that are located in El Paso county?
## Open
View(sf_co_tract)
Next, we check the class and data types of
sf_co_tract
.
## Check class
class(sf_co_tract)
## data type of each column
str(sf_co_tract)
In terms of data type, it seems of most columns are characters (or
string). But it also has two columns ALAND
(land area) and
AWATER
(water area) as numeric.
[Q7] Please report the data type of column
geometry
to lab4 report.
Let’s extract all census tracts that are located in
El Paso County using the function filter()
. (I know, the
chunk of code below reveals the answer of Q6 :)
# El Paso census tracts
sf_ep_tract <- sf_co_tract %>% filter(NAMELSADCO == "El Paso County")
# open data
View(sf_ep_tract)
Make a map showing census tracts in El Paso County (MAP4).
## MAP4
tm_shape(sf_ep_tract) + tm_polygons()
Highlight the border of El Paso County in MAP4 (through overlapping the census tracts map with El Paso county map).
alpha=
represents the transparency of used colors.
## highlight county boundary in MAP4
tm_shape(sf_ep_tract) + tm_polygons() +
tm_shape(sf_ep_county) + tm_polygons(alpha = 0, lwd = 3)
Continue to improve MAP4 by adding a north arrow, a
scale bar, and removing frame using tm_layout()
.
## improve MAP4
map_ep_tract <- tm_shape(sf_ep_tract) + tm_polygons(col="white") +
tm_shape(sf_ep_county) + tm_polygons(alpha = 0, lwd = 3) +
tm_compass(north = 0, size = 1.5) +
tm_scale_bar(position = c("right", "bottom"), width = 0.15) +
tm_layout(frame = F)
## Export
pdf("plot/map4_elpaso_tracts.pdf")
map_ep_tract
dev.off()
[Q8] Please go to your “plot” folder, open the pdf “map4_elpaso_tracts.pdf”, and paste the screenshot to your lab4 report.
Finally, we are approaching the most “exciting” part of this tutorial which is to combine tabular and spatial data and produce a choropleth map exploring the poverty issues in El Paso county.
Remember the two poverty data frames df_ep_poverty
and
df_ep_poverty_meta
that we have loaded in Step
4.2.2?
Both data frames are collected from US Census Bureau using the Table S1702: Poverty Status in the Past 12 Months of Families.
Unlike the census data we haves seen in Lab
3 that have been cleaned and pre-processed, the two data frame here
are the raw data that are directly collected from the
Census.
Let’s have a look at them.
## open
View(df_ep_poverty)
View(df_ep_poverty_meta)
The
df_ep_poverty
table shows the poverty data itself. Thedf_ep_poverty_meta
is the metadata that tells information about the poverty data. Particularly, it explains the meaning of column names indf_ep_poverty
.
For example, df_ep_poverty_meta
suggests that
S1702_C01_001E
is the estimate total
number of families in each census tractS1702_C01_041E
measures the number of families
occupying a housing unit as owners.S1702_C01_042E
measures the number of families
occupying a housing unit as renter.The df_ep_poverty_meta
also explains what are the
columns of income-to-poverty ratio.
The income-to-poverty ratio measures an individual or family’s income to the poverty threshold. A ratio less than 100% suggests that a family’s income is less than the poverty level.
S1702_C01_043E
tells about the estimate number of
families that has income less than 50% of the poverty
threshold. These are the families that suffer the most.[Q9] What is the meaning of column name
S1702_C01_044E
. Write your answer to lab4
report Hint: Please (1) open the data
`df_ep_poverty_meta
, (2) search S1702_C01_044E on
the top right (as shown in Figure 4.2), and (3) try to figure out the
“Label” associated with “S1702_C01_044E”.
Going back to our data, let’s check how many columns and rows are in
df_ep_poverty
.
# number of rows
nrow(df_ep_poverty)
# number of columns
ncol(df_ep_poverty)
While the df_ep_poverty
has 603 columns. But we only
need some of them. Let’s create a list storing the column names that we
are interested in including (the list below is copyable):
"GEO_ID", "NAME", "S1702_C01_001E", "S1702_C01_041E", "S1702_C01_042E", "S1702_C01_043E", "S1702_C01_044E"
.
# list of column names
lt_poverty_col <- c("GEO_ID", "NAME", "S1702_C01_001E", "S1702_C01_041E", "S1702_C01_042E", "S1702_C01_043E", "S1702_C01_044E")
Then, we create a dataframe that explains the above mentioned column names.
df_poverty_cols <- data.frame(
old_col_name = lt_poverty_col,
new_col_name = c("GEO_ID", "NAME", "ttl_famly", "famly_owner", "famly_renter", "famly_50pct_poverty", "famly_125pct_poverty")
)
View(df_poverty_cols)
[Q10] Open df_poverty_cols
and paste a
screenshot of it to lab4 report
Based on the column names in the list lt_poverty_col
, we
data corpus from df_ep_poverty
and save it as a new &
clean data frame df_ep_poverty_cln
.
# extract useful columns
df_ep_poverty_cln <- df_ep_poverty[lt_poverty_col]
Replace column names in df_ep_poverty_cln
based on table
df_poverty_cols
.
# update column names
setnames(df_ep_poverty_cln, old=df_poverty_cols$old_col_name, new=df_poverty_cols$new_col_name)
# view data
View(df_ep_poverty_cln)
Delete the first row df_ep_poverty_cln
and reset the row
index. Do you know why?
# delete the first row
df_ep_poverty_cln <- df_ep_poverty_cln[-1, ]
# reset row index
rownames(df_ep_poverty_cln) <- NULL
Check data types in df_ep_poverty_cln
.
# check the data
str(df_ep_poverty_cln)
While the columns
"ttl_famly", "famly_owner", "famly_renter", "famly_50pct_poverty", "famly_125pct_poverty"
contain numerical values, they are identified as characters. Let’s make
them numeric.
# convert one column from string to numeric
df_ep_poverty_cln$ttl_famly <- as.numeric(df_ep_poverty_cln$ttl_famly)
# convert multiple columns from string to numeric
df_ep_poverty_cln[,3:7] <- sapply(df_ep_poverty_cln[,3:7], as.numeric) %>% as.data.frame()
# check the data type again
str(df_ep_poverty_cln)
Calculate percentage by using the column famly_renter
(renter occupied family) as an example.
\[ \% \ of \ renter \ occupied \ family = 100 * \frac {count \ of \ renter \ occupied \ family} {total \ count \ of \ family} \]
df_ep_poverty_cln["pct_fam_renter"] <- 100 * df_ep_poverty_cln$famly_renter / df_ep_poverty_cln$ttl_famly
df_ep_poverty_cln["pct_fam_pov_50pct"] <- 100 * df_ep_poverty_cln$famly_50pct_poverty / df_ep_poverty_cln$ttl_famly
df_ep_poverty_cln["pct_fam_pov_125pct"] <- 100 * df_ep_poverty_cln$famly_125pct_poverty / df_ep_poverty_cln$ttl_famly
Have a look at the df_ep_poverty_cln
and check values
for Census Tract 38.02. It seems this row as some not valid
values (e.g., NA
). Replace NA
with
0
.
# replace NA
df_ep_poverty_cln[is.na(df_ep_poverty_cln)] <- 0
Quick look at the data.
# open data
View(df_ep_poverty_cln)
Let’s check the distribution of ttl_famly
of all
census tracts in El Paso by producing a histogram
hist()
.
# Exploratory analysis
hist(df_ep_poverty_cln$ttl_famly)
Check the distribution of three variables together including:
pct_fam_renter
percentage of renter-occupied
family;pct_fam_pov_50pct
percentage of family below 50% of
poverty level;pct_fam_pov_125pct
percentage of family below 125% of
poverty level;# run the three lines below together
hist(df_ep_poverty_cln$pct_fam_renter, col='orange')
hist(df_ep_poverty_cln$pct_fam_pov_125pct, col='blue', add= T)
hist(df_ep_poverty_cln$pct_fam_pov_50pct, col='green', add= T)
Let’s join the table df_ep_poverty_cln
with the spatial
boundaries stored in the simple data sf_ep_tract
based on
GEOIDs (i.e., a unique identifier for each census tract).
# left join
sf_ep_tract_pov <- left_join(sf_ep_tract, df_ep_poverty_cln, by=c("GEOIDFQ"="GEO_ID"))
Set the plot mode to static
# static
tmap_mode("plot")
Quick mapping to explore four variables including
"ttl_famly", "pct_fam_renter", "pct_fam_pov_50pct", "pct_fam_pov_125pct"
.
# quick mapping
lt_map_col <- c("ttl_famly", "pct_fam_renter", "pct_fam_pov_50pct", "pct_fam_pov_125pct")
for (column in lt_map_col){
print(column)
print(tm_shape(sf_ep_tract_pov) + tm_polygons(col = as.character(column)))
}
Please use the arrows in the Plot panels to see all maps (as shown in Figure 4.3).
Okay, after quickly mapping the four variables, let’s produce maps one by one and:
# MAP5 - # of families in each census tract
map_ep_tract_family <- tm_shape(sf_ep_tract_pov) + tm_polygons(col = "ttl_famly", title = "# of family", palette = "YlGnBu") +
tm_layout(legend.outside = T)
# export
pdf("plot/map5_elpaso_tracts_family_counts.pdf")
map_ep_tract_family
dev.off()
# MAP6 - % of renter-occupied families in each census tract
map_ep_tract_renter <- tm_shape(sf_ep_tract_pov) +
tm_polygons(col = "pct_fam_renter", title = "renter-occupied family (%)", palette = "Purples") +
tm_layout(legend.outside = T)
# export
pdf("plot/map6_elpaso_tracts_renter_occupied_family_pct.pdf")
map_ep_tract_renter
dev.off()
# MAP7 - % of families below 50% of poverty in each census tract
map_ep_tract_50pct_pov <- tm_shape(sf_ep_tract_pov) +
tm_polygons(col = "pct_fam_pov_50pct", title = "below 50% poverty (%)", palette = "Greens") +
tm_layout(legend.outside = T)
# export
pdf("plot/map7_elpaso_tracts_family_below_50pct_poverty.pdf")
map_ep_tract_50pct_pov
dev.off()
# MAP8 - % of families below 125% of poverty in each census tract
map_ep_tract_125pct_pov <- tm_shape(sf_ep_tract_pov) +
tm_polygons(col = "pct_fam_pov_125pct", title = "below 125% poverty (%)", palette = "Blues") +
tm_layout(legend.outside = T)
# export
pdf("plot/map8_elpaso_tracts_family_below_125pct_poverty.pdf")
map_ep_tract_125pct_pov
dev.off()
At the end, we combine all maps together into a single plot.
# organize all maps together
t=tmap_arrange(map_ep_tract_family,
map_ep_tract_renter,
map_ep_tract_50pct_pov,
map_ep_tract_125pct_pov, ncol=2)
pdf("plot/map9_elpaso_tracts_analysis.pdf", width = 12, height = 8)
t
dev.off()
[Q11] Please go to your “plot” folder, open the pdf “map9_elpaso_tracts_analysis.pdf”, and paste the screenshot to your lab4 report.
Congratulations!! You have completed the entire tutorial by typing the script by yourself!! Excellent work.
Please go “File”–> “Close Project” – a pop window asking “Do you want to save these changes” –> “Yes”.
Don’t forget to submit the lab4 report and your script to Canvas.