Objectives
Introduce data wrangling using tidyverse verbs to:
- Add columns
- Change and rename columns
- Combine data objects
- Assign data to groups and calculate summary statistics
Basically we want to start being able to perform in \({\bf\textsf{R}}\) the intermediary steps between raw data and graphs/analysis, rather than going back to Excel.
Walking through the script
As stated before, I will use pacman in script for this course to load packages for each lesson.
I do this because p_load()
checks to see if one has the package installed already; if so, p_load()
calls library()
; if not, it calls install.packages()
first.
But to use p_load()
one must have pacman installed.
This line checks to see if pacman is on the local machine, and if not, it installs it.
This works because require()
returns a logical (TRUE
or FALSE
) as to whether or not a package is available locally. When the script queries that result and finds that the package is not available (The !
operator essentially means not, so if !require()...
can be read as ‘if require()
result = FALSE
)’, it calls install.packages()
. If result of require()
= TRUE
, the if
statement simply skips over install.packages()
.
if (!require("pacman")) install.packages("pacman")
Once a local installation of pacman is confirmed, use it to load tidyverse.
pacman::p_load(tidyverse)
Data loading
It is critical to get one’s data into \({\bf\textsf{R}}\) if one is to expect to use \({\bf\textsf{R}}\). Therefore, I take advantage of opportunities to make you practice data loading. There are many convenient shortcuts for getting data into \({\bf\textsf{R}}\), and it is tempting for me to write those in so we can get on with the fun stuff. But I don’t want anyone to be able to whine, Oh I’d use \({\bf\textsf{R}}\) for sure but I can never get my data loaded. No excuses! We practice loading data every lesson.
Several functions are available to load data. Selection depends on two things:
- What form the data are in (most important).
Options include:
- An \({\bf\textsf{R}}\) object already (
.Rdata
) - A simple table file (
.csv
,.txt
) - A spreadsheet/workbook format (
.xlsx
)
- An \({\bf\textsf{R}}\) object already (
- Where the data are located (less important).
Options include:
- Local folder (file path begins with
C:/
,E:/
,~/
, etc) - Online (file path begins with
http(s)://
,ftp://
, etc)
- Local folder (file path begins with
We begin by loading additional information about mtcars
from a .csv
file.
You should have saved the file mtcars_origins.csv
to a folder called data
in a local folder called R
that you can set as your working directory:
setwd(".../R")
Then one can use read_csv()
to bring that file into \({\bf\textsf{R}}\):
mtcars_origins <- read_csv("./data/mtcars_origins.csv")
If the file is published online, one can access it via the URL.
In this case, the file is posted to GitHub.
Because URLs can be long, it is often convenient to assign the URL to a shorter \({\bf\textsf{R}}\) object that can be used as an argument in subsequent functions.
Note that is begins with the url()
function, which helps \({\bf\textsf{R}}\) open an internet connection for you:
URL <- url("https://raw.githubusercontent.com/devanmcg/IntroRangeR/master/data/mtcars_origins.csv")
When reading from an object created by url()
, use the base read.csv()
function to load the data into \({\bf\textsf{R}}\) (read_csv()
is from tidyverse):
mtcars_origins <- read.csv(URL)
read.csv()
vs. read_csv()
Throwing both of these out there right now might seem confusing, but you can handle it.
I like read_csv()
because it automatically loads data into a tibble
object, the tidyverse form of data.frame
that (1) works well with other tidyverse operations and (2) is easy to query because one doesn’t need to use str()
or head()
; just call the object and it gives class information and just the top 10 rows.
But sometimes–like when using url()
–we need read.csv()
, which imports directly to a data.frame
that would need str()
or head()
to query.
In my script, I almost always immediately convert anything imported with read.csv()
to a tibble
using as_tibble()
.
mtcars_origins <- as_tibble(mtcars_origins)
Data manipulation
Let’s refamiliarize ourselves with mtcars
:
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Adding variables
The mtcars
dataset is a tad wonky.
Notice how the makes and models for the cars isn’t actually a variable column (see how it isn’t in the result of str(mtcars)
?).
Rather, makes and models are assigned to the names of the rows:
rownames(mtcars)
## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
## [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
## [7] "Duster 360" "Merc 240D" "Merc 230"
## [10] "Merc 280" "Merc 280C" "Merc 450SE"
## [13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
## [16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128"
## [19] "Honda Civic" "Toyota Corolla" "Toyota Corona"
## [22] "Dodge Challenger" "AMC Javelin" "Camaro Z28"
## [25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2"
## [28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino"
## [31] "Maserati Bora" "Volvo 142E"
We can move the row names to a new column, make.model
:
mtcars <- rownames_to_column(mtcars, var = "make.model")
str(mtcars)
## 'data.frame': 32 obs. of 12 variables:
## $ make.model: chr "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp : num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat : num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec : num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear : num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb : num 4 4 1 1 2 1 4 2 2 4 ...
Recall from last time that mtcars
has several categorical variables that are improperly stored as numeric
class.
Use a dplyr
pipe to convert these columns to character
and perform other manipulations that make the data more meaningful:
mtcars <-
mtcars %>%
mutate_at(vars(cyl, vs, gear, carb, am), # acceptable linebreak
as.character) %>%
mutate(am = recode(am, "0"="Automatic", # acceptable linebreak
"1"="Manual")) %>%
rename(transmission = am)
Pay attention to line breaks and how the script is indented. Take a minute to review this post.
Check out the results of the pipe:
str(mtcars)
## 'data.frame': 32 obs. of 12 variables:
## $ make.model : chr "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : chr "6" "6" "4" "6" ...
## $ disp : num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat : num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec : num 16.5 17 18.6 19.4 17 ...
## $ vs : chr "0" "0" "1" "1" ...
## $ transmission: chr "Manual" "Manual" "Manual" "Automatic" ...
## $ gear : chr "4" "4" "4" "3" ...
## $ carb : chr "4" "4" "1" "1" ...
Joining datasets
Very often one would like to combine data sets, say if one .csv
has treatment information for a set of plots and another has sample data.
As long as there is a column in each file that can be used to connect the two data sets, it is easy to combine them into a single object even if they are in wildly different orders.
Furthermore, the columns that connect the datasets don’t even need to have the same name.
tidyverse includes a number of options to join
datasets1.
We begin with full_join
, which simply requires us to identify the objects we’d like to join.
The by=
argument identifies the columns that should be used to connect the datasets.
If there is a single commonly-named column, this argument can be left out and full_join
will use that column by default (and print a message saying which column it used).
If the binding columns have different names, they are to be each identified (see ?full_join
).
We are going to add information about where each automobile in mtcars
was produced, which we identified with the new make.model
column.
We can confirm that our new dataset has this column, as well:
str(mtcars_origins)
## Classes 'tbl_df', 'tbl' and 'data.frame': 32 obs. of 3 variables:
## $ make.model: Factor w/ 32 levels "AMC Javelin",..: 18 19 5 13 14 31 7 21 20 22 ...
## $ make : Factor w/ 20 levels "AMC","Cadillac",..: 14 14 5 1 1 16 16 15 15 15 ...
## $ country : Factor w/ 6 levels "Germany","Italy",..: 3 3 3 6 6 6 6 1 1 1 ...
We’ll specify by=
because it is good practice and avoids the message:
mtcars2 <- full_join(mtcars,
mtcars_origins,
by="make.model" )
## Warning: Column `make.model` joining character vector and factor, coercing into
## character vector
Or at least one of them.
Confirm the columns have been added:
str(mtcars2)
## 'data.frame': 32 obs. of 14 variables:
## $ make.model : chr "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : chr "6" "6" "4" "6" ...
## $ disp : num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat : num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec : num 16.5 17 18.6 19.4 17 ...
## $ vs : chr "0" "0" "1" "1" ...
## $ transmission: chr "Manual" "Manual" "Manual" "Automatic" ...
## $ gear : chr "4" "4" "4" "3" ...
## $ carb : chr "4" "4" "1" "1" ...
## $ make : Factor w/ 20 levels "AMC","Cadillac",..: 14 14 5 1 1 16 16 15 15 15 ...
## $ country : Factor w/ 6 levels "Germany","Italy",..: 3 3 3 6 6 6 6 1 1 1 ...
Often we don’t have all of the information we want conveniently stored in another file. Whereas one approach would be to go through and type in data to each row in Excel, that (a) can take forever, (b) is prone to errors, and (c) would need to be repeated if new data come in, etc.
We can use \({\bf\textsf{R}}\) to add information based on values already stored in an object. For example, we’ve added the country in which each automobile in mtcars
was made.
If we wanted to analyze domestic (USA) vs. foreign (not USA) cars, we can easily add such a column using mutate()
to create a new column, origin
.
In this case, the new information is based on a simple comparison: if the car was made in the USA, we want it identified as domestic, otherwise it is foreign. Such an either-or situation is ideal for ifelse()
, which is a shortcut for the logical progression if X, then Y; otherwise Z (because not X): if the stated condition is met (country == "USA"
), mutate()
writes domestic
; otherwise, if country
is not USA
2, it writes foreign
:
mtcars2 <- mutate(mtcars2,
origin = ifelse(country == "USA",
"domestic",
"foreign"))
str(mtcars2)
## 'data.frame': 32 obs. of 15 variables:
## $ make.model : chr "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : chr "6" "6" "4" "6" ...
## $ disp : num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat : num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec : num 16.5 17 18.6 19.4 17 ...
## $ vs : chr "0" "0" "1" "1" ...
## $ transmission: chr "Manual" "Manual" "Manual" "Automatic" ...
## $ gear : chr "4" "4" "4" "3" ...
## $ carb : chr "4" "4" "1" "1" ...
## $ make : Factor w/ 20 levels "AMC","Cadillac",..: 14 14 5 1 1 16 16 15 15 15 ...
## $ country : Factor w/ 6 levels "Germany","Italy",..: 3 3 3 6 6 6 6 1 1 1 ...
## $ origin : chr "foreign" "foreign" "foreign" "domestic" ...
ifelse()
is either-or, which might seem limited to instances with only two options, but it can be nested to work through remaining options:
mtcars2 <- mutate(mtcars2,
continent = ifelse(country=="USA",
"North America",
ifelse(country=="Japan",
"Asia",
"Europe")))
str(mtcars2)
## 'data.frame': 32 obs. of 16 variables:
## $ make.model : chr "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : chr "6" "6" "4" "6" ...
## $ disp : num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat : num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec : num 16.5 17 18.6 19.4 17 ...
## $ vs : chr "0" "0" "1" "1" ...
## $ transmission: chr "Manual" "Manual" "Manual" "Automatic" ...
## $ gear : chr "4" "4" "4" "3" ...
## $ carb : chr "4" "4" "1" "1" ...
## $ make : Factor w/ 20 levels "AMC","Cadillac",..: 14 14 5 1 1 16 16 15 15 15 ...
## $ country : Factor w/ 6 levels "Germany","Italy",..: 3 3 3 6 6 6 6 1 1 1 ...
## $ origin : chr "foreign" "foreign" "foreign" "domestic" ...
## $ continent : chr "Asia" "Asia" "Asia" "North America" ...
Such nesting can proceed through any number of options, but typing all that out is onerous and making sure all the parentheses get closed properly is prone to error.
More complicated distinctions can be sorted with case_when()
, for example, assigning each car a ranking for how sporty it is based on my arbitrary assessment of images on Google, which we’ll store in a new column called sportiness
.
Here, we use the %in%
operator to compare make.model
against a series of vectors with groups of possible values bound by c()
.
In the case_when
a value for make.model
is in the vector on the left-hand side of ~
, mutate()
writes whatever is on the right hand side.
If the value for make.model
isn’t in the first option, it moves to the second, and so on, until it either finds a case_when
it does occur, or it has a value that hasn’t yet been found and is TRUE
, in which case it gets the final option (in this case, “Not sporty”).
mtcars2 <- mutate(mtcars2,
sportiness = case_when(
make.model %in%
c("Pontiac Firebird",
"Camero Z28",
"Maserati Bora",
"Duster 360",
"Dodge Challenger",
"AMC Javelin",
"Lotus Europa",
"Ford Pantera L",
"Ferrari Dino",
"Porsche 914-2") ~ "Pretty sporty",
make.model %in%
c("Fiat X1-9",
"Hornet Sportabout",
"Datsun 710",
"Mazda RX4",
"Merc 450SLC") ~ "Kinda sporty",
make.model %in%
c("Chrysler Imperial",
"Lincoln Continental",
"Cadillac Fleetwood",
"Merc 450SE") ~ "Total boat",
TRUE ~ "Not sporty" ))
str(mtcars2)
## 'data.frame': 32 obs. of 17 variables:
## $ make.model : chr "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : chr "6" "6" "4" "6" ...
## $ disp : num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat : num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec : num 16.5 17 18.6 19.4 17 ...
## $ vs : chr "0" "0" "1" "1" ...
## $ transmission: chr "Manual" "Manual" "Manual" "Automatic" ...
## $ gear : chr "4" "4" "4" "3" ...
## $ carb : chr "4" "4" "1" "1" ...
## $ make : Factor w/ 20 levels "AMC","Cadillac",..: 14 14 5 1 1 16 16 15 15 15 ...
## $ country : Factor w/ 6 levels "Germany","Italy",..: 3 3 3 6 6 6 6 1 1 1 ...
## $ origin : chr "foreign" "foreign" "foreign" "domestic" ...
## $ continent : chr "Asia" "Asia" "Asia" "North America" ...
## $ sportiness : chr "Kinda sporty" "Not sporty" "Kinda sporty" "Not sporty" ...
Summary statistics
While mean()
is definitely useful to get the average of a data vector, rarely are we just interested in a single vector.
Instead, we need to apply functions like mean()
across a number of groups.
In Excel, we’d create a PivotTable.
In base \({\bf\textsf{R}}\), we’d use aggregate()
to use the function mean
to calculate the average horsepower hp
for each sportiness
group:
aggregate(hp ~ sportiness, data=mtcars2, FUN=mean)
## sportiness hp
## 1 Kinda sporty 124.8000
## 2 Not sporty 110.1429
## 3 Pretty sporty 188.6667
## 4 Total boat 207.5000
This method has limitations: it is difficult to stick into a pipe workflow, and uses one function at a time.
The tidy way is to group and summarize:
mtcars2 %>% # pipe operator 'pours' data down
group_by(sportiness) %>%
summarize(mean_hp = mean(hp))
## # A tibble: 4 x 2
## sportiness mean_hp
## <chr> <dbl>
## 1 Kinda sporty 125.
## 2 Not sporty 110.
## 3 Pretty sporty 189.
## 4 Total boat 208.
A nice thing about using summarize
is that one has control over the name of the new summary column (we called it mean_hp
above).3
The magic of tidyverse becomes further apparent as we…
Add another level:
mtcars2 %>%
group_by(transmission, sportiness) %>%
summarize(mean_hp = mean(hp))
## # A tibble: 7 x 3
## # Groups: transmission [2]
## transmission sportiness mean_hp
## <chr> <chr> <dbl>
## 1 Automatic Kinda sporty 178.
## 2 Automatic Not sporty 127.
## 3 Automatic Pretty sporty 180
## 4 Automatic Total boat 208.
## 5 Manual Kinda sporty 89.7
## 6 Manual Not sporty 80.4
## 7 Manual Pretty sporty 196.
Add another operation:
mtcars2 %>%
group_by(transmission, sportiness) %>%
summarize(mean_hp = mean(hp),
sd_hp = sd(hp))
## # A tibble: 7 x 4
## # Groups: transmission [2]
## transmission sportiness mean_hp sd_hp
## <chr> <chr> <dbl> <dbl>
## 1 Automatic Kinda sporty 178. 3.54
## 2 Automatic Not sporty 127. 54.4
## 3 Automatic Pretty sporty 180 44.9
## 4 Automatic Total boat 208. 21.0
## 5 Manual Kinda sporty 89.7 22.2
## 6 Manual Not sporty 80.4 27.1
## 7 Manual Pretty sporty 196. 103.
Or need to make a modification to the data prior to summarization and output in a nice table:
mtcars2 %>%
mutate(transmission = str_c(transmission, " shifter")) %>%
group_by(sportiness, transmission) %>%
summarize(mean_hp = mean(hp),
sd_hp = sd(hp)) %>%
mutate_at(vars(mean_hp, sd_hp), ~round(., 0)) %>%
mutate(`HP (mean ± sd)` = str_c(mean_hp, " ± ", sd_hp)) %>%
select(-mean_hp, -sd_hp) %>%
knitr::kable(caption="Average horsepower by sportiness category.")
sportiness | transmission | HP (mean ± sd) |
---|---|---|
Kinda sporty | Automatic shifter | 178 ± 4 |
Kinda sporty | Manual shifter | 90 ± 22 |
Not sporty | Automatic shifter | 127 ± 54 |
Not sporty | Manual shifter | 80 ± 27 |
Pretty sporty | Automatic shifter | 180 ± 45 |
Pretty sporty | Manual shifter | 196 ± 103 |
Total boat | Automatic shifter | 208 ± 21 |
Wrap up
Practice saving files by saving mtcars2
.
We’ll use it again.
save(mtcars2, file="./data/mtcars2.Rdata")
# or
save(mtcars2, file=file.choose())
Homework assignment
In base \({\bf\textsf{R}}\), we would use the
merge()
function.↩“not USA” would be coded as
country != "USA"
, which explains why we usecountry == "USA"
with==
and not simply=
.↩aggregate()
recycles the name of the column it summarizedhp
, which makes it difficult to identify the summary value and very annoying when seeking to combine the results of two operations because of the shared column names.↩