This chapter continues our conversation about Network Visualization by exploring:
Under the course folder, please create a folder called “lab8”. Next, in the lab8 folder, please create two sub-folders that one is called “data” and another one is “plot”.
This chapter continues to explore Job-to-Job Flows Explorer (J2J) by using Colorado as a case study. In addition to the job-to-job (j2j) flows between states, we also explore the j2j flows to 20 industries to 20 industries in Colorado. The filters used to download such data is shown in Figure 8.1.
Furthermore, this lab also collects the DP03 Economic Characteristics within all US States using the American Community Survey (ACS) 1-year estimate from US Census.
Please follow the steps below to download data, unzip it and move the data to the required folder.
If there you have any questions about the above-mentioned steps, please refer to Chapter 3.2.3 for detailed instructions.
Again, we would like to start a new project from scratch with a clean R Script. Please do the following steps. If you have any questions about these steps, please refer to the relevant chapters for help.
Heads-Up!
All scripts are non-copyable. Please read the tutorial carefully and try to understand the script as you would need to re-write some part of the script to complete the report.
Please install necessary packages by yourself.
# Library ----
library(dplyr)
library(data.table)
library(sf)
library(tidyverse)
library(igraph)
library(GGally) # correlation diagram
library(ggplot2)
library(ggspatial)
library(networkD3) # sankey network
Work smarter, not harder.
Instead of writing the path by yourself, please copy file name from file directories as shown in Figure 8.2.
# US Census CB (cartographic boundaries) by states (simple features)
# project it to "CRS NAD83 / Colorado North (ftUS)"
sf_us_state <- st_read("data/cb_2023_us_state/cb_2023_us_state_5m.shp") %>%
st_transform(., 2231)
## Reading layer `cb_2023_us_state_5m' from data source
## `D:\OneDrive\UCCS\Teaching\GES4070\Labs\01_git_wk\uccs_geoviz\data\cb_2023_us_state\cb_2023_us_state_5m.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 56 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -179.1473 ymin: -14.55255 xmax: 179.7785 ymax: 71.35256
## Geodetic CRS: NAD83
# lab7 (Section 7.6) - j2j outflow and inflow for all NAICS sections
j2j_co_xy <- read.csv("data/lab7_j2j_co_xy.csv")
# j2j inflow to CO by industry
j2j_inflow_industry <- read.csv("data/j2j_TO_Colorado_industries_2023Q1_Q2_Q3/table.csv")
# ACS 2023 - all states within US - 1yr estimate
# Data (Table TP03): Selected Economic Characteristics
df_economic <- read.csv("data/ACSDP1Y2023.DP03_2024-10-14T182606/ACSDP1Y2023.DP03-Data.csv")
# Metadata (Table TP03)
df_economic_meta <- read.csv("data/ACSDP1Y2023.DP03_2024-10-14T182606/ACSDP1Y2023.DP03-Column-Metadata.csv")
# lists extracted from j2j data
# list of states in "j2j_co_xy" networks
lt_state <- unique(c(j2j_co_xy$from_state, j2j_co_xy$to_state))
# list of naics sectors in "j2j_co_xy" data
lt_sector <- unique(c(j2j_co_xy$naics_sector))
## extract JOB OUTFLOW network
# edges
dt_edge <- j2j_co_xy %>% filter(direction == "outflow_from_co",
naics_sector == "all_sectors")
# node - states polygons
dt_node <- sf_us_state %>%
filter(NAME %in% unique(c(dt_edge$from_state, dt_edge$to_state)))
# node - convert to centroid
dt_node <- st_centroid(dt_node)
# spatial data: state boundaries
# only states that is located in the continuous continental United States
sf_j2j_state <- sf_us_state %>%
filter(GEOID < "60") %>%
filter(! NAME %in% c("Hawaii", "Alaska"))
At the end of previous lab Section 7.6.2, we have produced a map of job outflow network from Colorado to other states that is similar to the map below.
## Mapping network - from Lab 7
c_edge <- "blue"
lbl_title <- "Job Outflow from Colorado, 2023 Q1-Q3"
ggplot(sf_j2j_state) +
geom_sf() +
geom_sf(data = dt_node) +
geom_segment(data = dt_edge,
aes(x = long_start, y = lat_start,
xend = long_end, yend = lat_end, size = avg_n_job, alpha = 0.5),
colour = c_edge) +
scale_size_continuous(range = c(0.01, 4)) +
ggtitle(lbl_title) +
theme_bw() +
ylab("Latitude") +
xlab("Longitude")
Here are some frequently mentioned comments from lab7 report about how to improve the maps:
Let’s do that!
# Make edge color scale to the strength of job flow; use abbreviation
p1 <- ggplot(sf_j2j_state) +
geom_sf(fill = "white") +
geom_segment(data = dt_edge,
aes(x = long_start, y = lat_start,
xend = long_end, yend = lat_end, size = avg_n_job, colour = avg_n_job),
alpha = 0.9) +
scale_color_distiller(palette = "Reds", trans = "reverse") +
scale_size_continuous(range = c(0.001, 5)) +
geom_sf_text(data = dt_node, aes(label = STUSPS), size = 2) +
ggtitle(lbl_title) +
theme_bw() +
ylab("Latitude") +
xlab("Longitude")
print(p1)
# improve legend
p2 <- ggplot(sf_j2j_state) +
geom_sf(fill = "white") +
geom_segment(data = dt_edge,
aes(x = long_start, y = lat_start,
xend = long_end, yend = lat_end, size = avg_n_job, colour = avg_n_job),
alpha = 0.9) +
scale_color_distiller(palette = "Reds", trans = "reverse", name = "Job flow strength") +
scale_size_continuous(range = c(0.001, 5), name = "Job flow strength") +
geom_sf_text(data = dt_node, aes(label = STUSPS), size = 2) +
ggtitle(lbl_title) +
theme_bw() +
ylab("Latitude") +
xlab("Longitude")
print(p2)
# by adding North Arrow and Scale Bars
p3 <- p2 +
annotation_scale(location = "bl", width_hint = 0.1) +
annotation_north_arrow(location = "bl",
pad_x = unit(0.1, "in"),
pad_y = unit(0.3, "in"),
height = unit(0.3, "in"),
width = unit(0.3, "in"))
print(p3)
After mapping the j2j network, an intuitive question is what are the potential reasons behind job inflow/outflow. This section aims to explain the j2j patterns by linking them to states’ economic characteristics.
Check the DP03 Economic Characteristics data.
#### Economic Characteristic by State
View(df_economic)
View(df_economic_meta)
Identify the variables to extract from the “df_economic” data table. The chunk of code below is copyable.
lt_ecnomic_index <- c(
"DP03_0119PE", "DP03_0047PE", "DP03_0048PE",
"DP03_0049PE", "DP03_0050PE", "DP03_0062E",
"DP03_0009PE")
lt_economic_var_name <- c(
"pct_fam_below_poverty", "pct_cvl_16plus_emp_priv", "pct_cvl_16plus_emp_govr",
"pct_cvl_16plus_emp_self", "pct_cvl_16plus_emp_unpaid", "median_hh_income",
"unemply_rate")
Data cleaning
# clean the economic data table "df_economic"
df_economic_cln <- df_economic %>% select(GEO_ID, NAME, all_of(lt_ecnomic_index))
# delete first row
df_economic_cln <- df_economic_cln[-1, ]
# update column names to meaningful names
df_economic_cln <- setnames(df_economic_cln,
old = lt_ecnomic_index,
new=lt_economic_var_name)
Data processing by converting them to numeric.
# check column data type
str(df_economic_cln)
## 'data.frame': 52 obs. of 9 variables:
## $ GEO_ID : chr "0400000US01" "0400000US02" "0400000US04" "0400000US05" ...
## $ NAME : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ pct_fam_below_poverty : chr "11.3" "7.2" "8.7" "11.2" ...
## $ pct_cvl_16plus_emp_priv : chr "77.5" "69.9" "79.6" "78.8" ...
## $ pct_cvl_16plus_emp_govr : chr "16.8" "23.8" "14.0" "14.6" ...
## $ pct_cvl_16plus_emp_self : chr "5.5" "6.1" "6.2" "6.4" ...
## $ pct_cvl_16plus_emp_unpaid: chr "0.2" "0.3" "0.2" "0.3" ...
## $ median_hh_income : chr "62212" "86631" "77315" "58700" ...
## $ unemply_rate : chr "4.0" "4.8" "4.3" "4.2" ...
# convert columns to numeric
df_economic_cln[, 3:ncol(df_economic_cln)] <- sapply(
df_economic_cln[, 3:ncol(df_economic_cln)],
as.numeric) %>% as.data.frame()
Let’s use job flow from other states to Colorado as an example to examine how the strength of job flow is correlated to a state’s economic characteristics.
Preparing job inflow data.
# filter inflow to Colorado
data <- j2j_co_xy %>% filter(direction == "inflow_to_co",
naics_sector=="all_sectors") %>%
select(from_state, naics_sector, avg_n_job)
# untidy the data by using "pivot_wider"
data <- pivot_wider(data, names_from = "naics_sector", values_from = "avg_n_job")
Join the inflow data with states’ economic status.
# left join economic data with the inflow data
data <- left_join(df_economic_cln, data, by=c("NAME" = "from_state"))
# remove na
data <- na.omit(data)
[Q1] Please interpret the results from the correlation matrix.
# Pearson correlation matrix
ggcorr(data, label = TRUE, hjust = 0.75)
# join state boundary with economic data
sf_j2j_state_econ <- left_join(sf_j2j_state, df_economic_cln, by=c("GEOIDFQ" = "GEO_ID",
"NAME"="NAME"))
# Inflow network edge & node
dt_edge <- j2j_co_xy %>% filter(direction == "inflow_to_co",
naics_sector == "all_sectors")
# node - states polygons
dt_node <- sf_us_state %>% filter(NAME %in% unique(c(dt_edge$from_state, dt_edge$to_state)))
# node - convert to centroid
dt_node <- st_centroid(dt_node)
## mapping
lbl_title <- "2023 Job Inflow to Colorado (Q1-Q3) VS Unemployment Rate"
ggplot(sf_j2j_state_econ) +
geom_sf(aes(fill = unemply_rate), colour = "white", alpha = 0.7) +
scale_fill_distiller(palette = "Blues", trans = "reverse", name = "Unemployment Rate") +
geom_segment(data = dt_edge,
aes(x = long_start, y = lat_start,
xend = long_end, yend = lat_end, size = avg_n_job, colour = avg_n_job),
alpha = 0.9) +
scale_color_distiller(palette = "YlOrRd", trans = "reverse", name = "Job flow strength") +
scale_size_continuous(range = c(0.001, 5), name = "Job flow strength") +
geom_sf_text(data = dt_node, aes(label = STUSPS), size = 2) +
ggtitle(lbl_title) +
theme_bw() +
ylab("Latitude") +
xlab("Longitude") +
annotation_scale(location = "bl", width_hint = 0.1) +
annotation_north_arrow(location = "bl",
pad_x = unit(0.1, "in"),
pad_y = unit(0.3, "in"),
height = unit(0.3, "in"),
width = unit(0.3, "in"))
[Q2] Please revise the code above to produce another j2j-network map that uses a different variable other than “unemployment rate”. Please update the map title and legend title accordingly. Figure 8.3 shows the names of R color palettes. Please provide a screenshot of the new map and briefly explain the findings (2-3 sentences).
This section produces an interactive sankey chart to visualize j2j flows from top 10 industries to top 10 industries in Colorado. Section 8.2.2 explains the filters used in Job-to-Job Flows Explorer (J2J) to download this data.
# tidy and clean the data
j2j_inflow_ind_cln <- j2j_inflow_industry %>%
pivot_longer(cols = 2:ncol(.), names_to = "to_co_sector", values_to = "avg_n_job" ) %>%
filter(!to_co_sector %like% ".flag") %>%
na.omit() %>%
setnames(old="X", new="from_sector") %>%
select(from_sector, to_co_sector, avg_n_job)
This code below is copyable.
# clean sector names to make them consistent
j2j_inflow_ind_cln$to_co_sector <- j2j_inflow_ind_cln$to_co_sector %>%
gsub("\\.\\.", ", ", .) %>%
gsub("\\.", " ", .) %>%
gsub("Other Services, except Public Administration",
"Other Services (except Public Administration)", .) %>%
gsub("^\\s+|\\s+$", "", .)
Prepare network data: edge & node. !!! Please read carefully this section and try to understand scripts line by line. You would need to re-write this section to complete [Q3].
Hint The function slice_min()
could be
used to subset the smallest values of a variable.
# edges
dt_edge <- j2j_inflow_ind_cln
# top 10 industry with highest inflow to colorado
dt_ten_industry <- dt_edge %>% select(to_co_sector, avg_n_job) %>%
group_by(to_co_sector) %>%
summarise(n_ttl = sum(avg_n_job)) %>%
slice_max(n_ttl, n=10)
# extract edges from and to the top ten industries
dt_edge <- dt_edge %>%
filter(from_sector %in% dt_ten_industry$to_co_sector) %>%
filter(to_co_sector %in% dt_ten_industry$to_co_sector)
# add space to "to_co_sector"
dt_edge$to_co_sector <- paste(dt_edge$to_co_sector, " ", sep="")
# nodes
dt_node <- data.frame(name = unique(c(dt_edge$from_sector,
dt_edge$to_co_sector)))
# edges add id for nodes
dt_edge$id_from <- match(dt_edge$from_sector, dt_node$name) - 1
dt_edge$id_to <- match(dt_edge$to_co_sector, dt_node$name) - 1
# interactive sankey chart - network
sankeyNetwork(Links = dt_edge, Nodes = dt_node,
Source = "id_from", Target = "id_to",
Value = "avg_n_job", NodeID = "name",
sinksRight=FALSE, nodeWidth=15, fontSize=10, nodePadding=10)
The sankey chart above shows the number of jobs from the top 10 largest industries have been moved to the top 10 largest industries in Colorado.
[Q3] Please re-write the script to produce a sankey chart presenting the job flows from the BOTTOM 10 industries to the bottom 10 industries in Colorado. Paste a screenshot of your sankey chart to the report.
[Q4] Please list 3-5 functions that you have learnt from this tutorial and explain when and how to use them.
Congratulations!! You have completed the entire tutorial and learnt the intro to network visualization!!
Please submit your report on time, otherwise, a late penalty (5 pts / day) will be applied.
Please go “File”–> “Close Project” – a pop window asking “Do you want to save these changes” –> “Yes”.
Don’t forget to submit the lab8 report and your script to Canvas.