After completing this chapter, you should be able to:
Let’s first navigate to the course folder that we have created for the previous lab (see Chapter 1.2), and create a folder called lab2.
We’ll use RStudio again for this chapter. Please open RStudio as we have explained in Chapter 1.3.
(Note: all computers in COH 329 should have an updated RStudio already).
Before starting any works, please make sure you are in fresh new project with a blank script. Please follow the action checklist below:
At this step, you should see something similar to Figure 2.1. If you have any questions about the above steps, please refer to Chapter 1.3: Launching RStudio for detailed instructions.
R is a mixture of programming paragrams (Cotton, 2013). At its core, it is an imperative language.
An imperative language provides step-by-step instructions for a computer to perform a task. Your write a script that does one calculation after another. Therefore, the order to run codes in your script can affect outcomes.
But it is also supports object-oriented programming and functional programming.
Object-oriented programming (OOP) is based on the concept of objects, which can contain data and code. Data in the form of fields (e.g., variables) and code in the form of procedures.
Functional programming is constructed by applying and composing functions. In R functions are first-class objects. You can treat them like any other variable and call them recursively.
Let’s create some variables. Note: the codes below are pretty simple. Please try to re-write them by yourself and understand the purpose of each line.
var_int <- 99
We can also create consecutive integers by using the symbol
:
# create consecutive integers
var_ints <- -5:5
# let's check this variable
print(var_ints)
Note:
#
is used to write comments. Comments are
completed ignored by R and just there to make the code more
readable.<-
is Ctrl
+
-
.# Let's create floats by using the coordinates of UCCS as an example
var_long <- -104.801
var_lat <- 38.892
""
.# Create a character to indicate a city name.
var_char <- "Colorado Springs"
# Check the class of this variable "var_char"
class(var_char)
TRUE
or FALSE
. They can also be written as
T
or F
, but never true or
false because R programming is case-sensitive.
Logicals are useful to evaluate logical statements.# We can ask the machine to compare numbers. For example, we assign values to "a" and "b".
a <- -5
b <- 5
# Then, we ask the machine to compare the two values.
a > b
## [1] FALSE
c()
. Vectors are extremely important in statistics
and data science, since we usually want to analyze a whole dataset
rather than just a piece of data.# Vector can include a series of numeric values, by specifying them one by one
var_vec_num1 <- c(478961, 715522, 108250, 386261)
var_vec_num1
# or by using the symbol : to create consecutive integers
var_vec_num2 <- c(1:10, -99:-95)
var_vec_num2
# Vector can also include characters
var_vec_char <- c("Colorado Springs", "Denver", "Bouder", "Aurora")
var_vec_char
# or a mix of characters and numeric values
var_vec_mix <- c(1,3,5, "Aspen", "Pueble", 99)
var_vec_mix
Data frames are used to store spreadsheet-like data.
It provides a table-like way to organize multiple vectors into
one object. Data frames are also the most widely used format to
clean, manipulate and analyze data. In this lab, we only learn
the very basics of creating a simple data frame. We will learn more
about it in future chapters.
Let’s use the function data.frame()
to create a data
frame called “df_city_pop” to record the number of population
in four cities in Colorado.
# We create a table with two columns ("city" and "population") and four rows
# Hint: to make this code work, you need to select all four lines and run them together.
df_city_pop <- data.frame(
city = var_vec_char,
population = var_vec_num1
)
After ruining the code above, do you see a new variable df_city_pop in the Environment panel? Try to click on that variable and see what happens. Do you see the table? Please do a screenshot of the table and paste it to your “Lab2 Report”.
Next, we calculate some statistics for the four cities in the data
frame df_city_pop. Please close the table and go back to your
script.
To exact a particular column from a data frame, we use the symbol
$
.
# Let's examine the values in the "population" column
print(df_city_pop$population)
# Calculate the total number of population
sum(df_city_pop$population)
# The mean value
mean(df_city_pop$population)
# Median
median(df_city_pop$population)
Let’s play with some external data! I have used US Census Bureau to collect the total population and total housing unit in cities and towns in Colorado. This data was collected by US Census Bureau on 2022 through the 5-year American Community Survey (ACS) survey. US Census Bureau contains a lot of good dataset. We will explore that later.
Let’s pull the data in straight from the web using the function
read.csv()
and save it into a new variable “CoData”.
CoData <- read.csv(url("https://raw.githubusercontent.com/fuzhen-yin/uccs_geoviz/main/data/00_ACS_5yrs_estimate_2022_places_colorado_population_housing.csv"))
Click “CoData” in the Environment panel and examine the table a little bit. Report the following information to your “Lab2 Report”. In the table, by clicking a particular column name, you can order the table based the column’s value (see Figure 2.3).
Let’s examine the “CoData” column by column. Let’s check What are the
columns in the “CoData” data frame using names()
.
names(CoData)
Next, we check their data types using str()
.
str(CoData)
The column “name” is characters, while the other two
columns “total_pop” and “total_housing_unit” are
integers. For these two numerical columns, we calculate
some basic statistics for them. Remember the symbol $
, we
will use it again to pull out a particular column.
# First, let's calculate the statistics of "total_pop"
summary(CoData$total_pop)
The summary()
function gives us the (1) minimum, (2)
1st quartile, (3) median, (4) mean, (5) 3rd quartile and (6)
maximum.
Please write the code by yourself to calculate the above statistics for the column “total_housing_unit” and paste the outputs to “Lab2 Report”.
We are going to use the library ggplot2
visualization.
ggplot2
is a very popular open-source data visualization
package. Let’s install this package and launch it.
# install packages (only need to install once)
install.packages("ggplot2")
# launch it (have to launch it everytime we open RStudio).
library(ggplot2)
This chapter is already pretty lengthy. Let’s just do some basic visualization stuff and we will go back to this package for much cooler visualization later this semester.
We create a scatter plot to explore how cities’ total population are associated with the numbers of their total housing units.
ggplot(CoData, aes(x=total_pop, y =total_housing_unit )) +
geom_point()
We can improve the design of this scatter plot by using different color palette, background color, transparency etc.
color =
: shape outline color;fill =
: shape fill color;shape =
: shape. Please google ggplot2 shape
code for more information;alpha =
: shape transparency;size =
: shape size;stroke =
: stroke.ggplot(CoData, aes(x=total_pop, y =total_housing_unit )) +
geom_point(
color="black",
fill="#fb8500",
shape=22,
alpha=0.5,
size=4,
stroke = 1
) +
theme_bw()
Based on the code above, please make some changes to the plot and paste the final scatter plot to your “Lab2 Report”.
To close an R project, please go “File”–> “Close Project” – a pop window asking “Do you want to save these changes” –> “Yes”.
Cotton, R. (2013). Learning R: A step-by-step function guide to data analysis. ” O’Reilly Media, Inc.”.