Objectives

Introduce data wrangling using tidyverse verbs to:

Add columns
Change and rename columns
Combine data objects
Assign data to groups and calculate summary statistics

Basically we want to start being able to perform in \({\bf\textsf{R}}\) the intermediary steps between raw data and graphs/analysis, rather than going back to Excel.

Materials

Script: DataWrangling.R

Video on YouTube

Walking through the script

As stated before, I will use pacman in script for this course to load packages for each lesson. I do this because p_load() checks to see if one has the package installed already; if so, p_load() calls library(); if not, it calls install.packages() first.

But to use p_load() one must have pacman installed. This line checks to see if pacman is on the local machine, and if not, it installs it.

This works because require() returns a logical (TRUE or FALSE) as to whether or not a package is available locally. When the script queries that result and finds that the package is not available (The ! operator essentially means not, so if !require()... can be read as ‘if require() result = FALSE)’, it calls install.packages(). If result of require() = TRUE, the if statement simply skips over install.packages().

if (!require("pacman")) install.packages("pacman")

Once a local installation of pacman is confirmed, use it to load tidyverse.

pacman::p_load(tidyverse)

Data loading

It is critical to get one’s data into \({\bf\textsf{R}}\) if one is to expect to use \({\bf\textsf{R}}\). Therefore, I take advantage of opportunities to make you practice data loading. There are many convenient shortcuts for getting data into \({\bf\textsf{R}}\), and it is tempting for me to write those in so we can get on with the fun stuff. But I don’t want anyone to be able to whine, Oh I’d use \({\bf\textsf{R}}\) for sure but I can never get my data loaded. No excuses! We practice loading data every lesson.

Several functions are available to load data. Selection depends on two things:

What form the data are in (most important). Options include:
- An \({\bf\textsf{R}}\) object already (.Rdata)
- A simple table file (.csv, .txt)
- A spreadsheet/workbook format (.xlsx)
Where the data are located (less important). Options include:
- Local folder (file path begins with C:/, E:/, ~/, etc)
- Online (file path begins with http(s)://, ftp://, etc)

We begin by loading additional information about mtcars from a .csv file. You should have saved the file mtcars_origins.csv to a folder called data in a local folder called R that you can set as your working directory:

setwd(".../R")

Then one can use read_csv() to bring that file into \({\bf\textsf{R}}\):

mtcars_origins <- read_csv("./data/mtcars_origins.csv")

If the file is published online, one can access it via the URL. In this case, the file is posted to GitHub. Because URLs can be long, it is often convenient to assign the URL to a shorter \({\bf\textsf{R}}\) object that can be used as an argument in subsequent functions. Note that is begins with the url() function, which helps \({\bf\textsf{R}}\) open an internet connection for you:

  URL <- url("https://raw.githubusercontent.com/devanmcg/IntroRangeR/master/data/mtcars_origins.csv")

When reading from an object created by url(), use the base read.csv() function to load the data into \({\bf\textsf{R}}\) (read_csv() is from tidyverse):

mtcars_origins <- read.csv(URL)

`read.csv()` vs. `read_csv()`

Throwing both of these out there right now might seem confusing, but you can handle it. I like read_csv() because it automatically loads data into a tibble object, the tidyverse form of data.frame that (1) works well with other tidyverse operations and (2) is easy to query because one doesn’t need to use str() or head(); just call the object and it gives class information and just the top 10 rows.

But sometimes–like when using url()–we need read.csv(), which imports directly to a data.frame that would need str() or head() to query. In my script, I almost always immediately convert anything imported with read.csv() to a tibble using as_tibble().

mtcars_origins <- as_tibble(mtcars_origins)

Data manipulation

Let’s refamiliarize ourselves with mtcars:

str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Adding variables

The mtcars dataset is a tad wonky. Notice how the makes and models for the cars isn’t actually a variable column (see how it isn’t in the result of str(mtcars)?). Rather, makes and models are assigned to the names of the rows:

rownames(mtcars)

##  [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
##  [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
##  [7] "Duster 360"          "Merc 240D"           "Merc 230"           
## [10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
## [13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
## [16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
## [19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
## [22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
## [25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
## [28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
## [31] "Maserati Bora"       "Volvo 142E"

We can move the row names to a new column, make.model:

mtcars <- rownames_to_column(mtcars, var = "make.model")
str(mtcars)

## 'data.frame':    32 obs. of  12 variables:
##  $ make.model: chr  "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
##  $ mpg       : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl       : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp      : num  160 160 108 258 360 ...
##  $ hp        : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat      : num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt        : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec      : num  16.5 17 18.6 19.4 17 ...
##  $ vs        : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am        : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear      : num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb      : num  4 4 1 1 2 1 4 2 2 4 ...

Recall from last time that mtcars has several categorical variables that are improperly stored as numeric class. Use a dplyr pipe to convert these columns to character and perform other manipulations that make the data more meaningful:

mtcars <- 
    mtcars %>%
      mutate_at(vars(cyl, vs, gear, carb, am), # acceptable linebreak
                as.character) %>%
      mutate(am = recode(am, "0"="Automatic",  # acceptable linebreak  
                             "1"="Manual"))  %>%
      rename(transmission = am)

Pay attention to line breaks and how the script is indented. Take a minute to review this post.

Check out the results of the pipe:

str(mtcars)

## 'data.frame':    32 obs. of  12 variables:
##  $ make.model  : chr  "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
##  $ mpg         : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl         : chr  "6" "6" "4" "6" ...
##  $ disp        : num  160 160 108 258 360 ...
##  $ hp          : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat        : num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt          : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec        : num  16.5 17 18.6 19.4 17 ...
##  $ vs          : chr  "0" "0" "1" "1" ...
##  $ transmission: chr  "Manual" "Manual" "Manual" "Automatic" ...
##  $ gear        : chr  "4" "4" "4" "3" ...
##  $ carb        : chr  "4" "4" "1" "1" ...

Joining datasets

Very often one would like to combine data sets, say if one .csv has treatment information for a set of plots and another has sample data. As long as there is a column in each file that can be used to connect the two data sets, it is easy to combine them into a single object even if they are in wildly different orders. Furthermore, the columns that connect the datasets don’t even need to have the same name.

tidyverse includes a number of options to join datasets¹. We begin with full_join, which simply requires us to identify the objects we’d like to join. The by= argument identifies the columns that should be used to connect the datasets. If there is a single commonly-named column, this argument can be left out and full_join will use that column by default (and print a message saying which column it used). If the binding columns have different names, they are to be each identified (see ?full_join).

We are going to add information about where each automobile in mtcars was produced, which we identified with the new make.model column. We can confirm that our new dataset has this column, as well:

 str(mtcars_origins)

## Classes 'tbl_df', 'tbl' and 'data.frame':    32 obs. of  3 variables:
##  $ make.model: Factor w/ 32 levels "AMC Javelin",..: 18 19 5 13 14 31 7 21 20 22 ...
##  $ make      : Factor w/ 20 levels "AMC","Cadillac",..: 14 14 5 1 1 16 16 15 15 15 ...
##  $ country   : Factor w/ 6 levels "Germany","Italy",..: 3 3 3 6 6 6 6 1 1 1 ...

We’ll specify by= because it is good practice and avoids the message:

mtcars2 <- full_join(mtcars, 
                     mtcars_origins, 
                     by="make.model" )

## Warning: Column `make.model` joining character vector and factor, coercing into
## character vector

Or at least one of them.

Confirm the columns have been added:

str(mtcars2)

## 'data.frame':    32 obs. of  14 variables:
##  $ make.model  : chr  "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
##  $ mpg         : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl         : chr  "6" "6" "4" "6" ...
##  $ disp        : num  160 160 108 258 360 ...
##  $ hp          : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat        : num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt          : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec        : num  16.5 17 18.6 19.4 17 ...
##  $ vs          : chr  "0" "0" "1" "1" ...
##  $ transmission: chr  "Manual" "Manual" "Manual" "Automatic" ...
##  $ gear        : chr  "4" "4" "4" "3" ...
##  $ carb        : chr  "4" "4" "1" "1" ...
##  $ make        : Factor w/ 20 levels "AMC","Cadillac",..: 14 14 5 1 1 16 16 15 15 15 ...
##  $ country     : Factor w/ 6 levels "Germany","Italy",..: 3 3 3 6 6 6 6 1 1 1 ...

Often we don’t have all of the information we want conveniently stored in another file. Whereas one approach would be to go through and type in data to each row in Excel, that (a) can take forever, (b) is prone to errors, and (c) would need to be repeated if new data come in, etc.

We can use \({\bf\textsf{R}}\) to add information based on values already stored in an object. For example, we’ve added the country in which each automobile in mtcars was made. If we wanted to analyze domestic (USA) vs. foreign (not USA) cars, we can easily add such a column using mutate() to create a new column, origin.

In this case, the new information is based on a simple comparison: if the car was made in the USA, we want it identified as domestic, otherwise it is foreign. Such an either-or situation is ideal for ifelse(), which is a shortcut for the logical progression if X, then Y; otherwise Z (because not X): if the stated condition is met (country == "USA"), mutate() writes domestic; otherwise, if country is not USA², it writes foreign:

mtcars2 <- mutate(mtcars2, 
                    origin = ifelse(country == "USA", 
                                      "domestic",
                                         "foreign"))
str(mtcars2)

## 'data.frame':    32 obs. of  15 variables:
##  $ make.model  : chr  "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
##  $ mpg         : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl         : chr  "6" "6" "4" "6" ...
##  $ disp        : num  160 160 108 258 360 ...
##  $ hp          : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat        : num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt          : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec        : num  16.5 17 18.6 19.4 17 ...
##  $ vs          : chr  "0" "0" "1" "1" ...
##  $ transmission: chr  "Manual" "Manual" "Manual" "Automatic" ...
##  $ gear        : chr  "4" "4" "4" "3" ...
##  $ carb        : chr  "4" "4" "1" "1" ...
##  $ make        : Factor w/ 20 levels "AMC","Cadillac",..: 14 14 5 1 1 16 16 15 15 15 ...
##  $ country     : Factor w/ 6 levels "Germany","Italy",..: 3 3 3 6 6 6 6 1 1 1 ...
##  $ origin      : chr  "foreign" "foreign" "foreign" "domestic" ...

ifelse() is either-or, which might seem limited to instances with only two options, but it can be nested to work through remaining options:

mtcars2 <- mutate(mtcars2, 
                  continent =  ifelse(country=="USA", 
                                        "North America", 
                                  ifelse(country=="Japan", 
                                         "Asia", 
                                            "Europe")))
str(mtcars2)

## 'data.frame':    32 obs. of  16 variables:
##  $ make.model  : chr  "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
##  $ mpg         : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl         : chr  "6" "6" "4" "6" ...
##  $ disp        : num  160 160 108 258 360 ...
##  $ hp          : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat        : num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt          : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec        : num  16.5 17 18.6 19.4 17 ...
##  $ vs          : chr  "0" "0" "1" "1" ...
##  $ transmission: chr  "Manual" "Manual" "Manual" "Automatic" ...
##  $ gear        : chr  "4" "4" "4" "3" ...
##  $ carb        : chr  "4" "4" "1" "1" ...
##  $ make        : Factor w/ 20 levels "AMC","Cadillac",..: 14 14 5 1 1 16 16 15 15 15 ...
##  $ country     : Factor w/ 6 levels "Germany","Italy",..: 3 3 3 6 6 6 6 1 1 1 ...
##  $ origin      : chr  "foreign" "foreign" "foreign" "domestic" ...
##  $ continent   : chr  "Asia" "Asia" "Asia" "North America" ...

Such nesting can proceed through any number of options, but typing all that out is onerous and making sure all the parentheses get closed properly is prone to error. More complicated distinctions can be sorted with case_when(), for example, assigning each car a ranking for how sporty it is based on my arbitrary assessment of images on Google, which we’ll store in a new column called sportiness.

Here, we use the %in% operator to compare make.model against a series of vectors with groups of possible values bound by c(). In the case_when a value for make.model is in the vector on the left-hand side of ~, mutate() writes whatever is on the right hand side. If the value for make.model isn’t in the first option, it moves to the second, and so on, until it either finds a case_when it does occur, or it has a value that hasn’t yet been found and is TRUE, in which case it gets the final option (in this case, “Not sporty”).

mtcars2 <- mutate(mtcars2, 
              sportiness = case_when(
                make.model %in% 
                    c("Pontiac Firebird", 
                      "Camero Z28", 
                      "Maserati Bora",
                      "Duster 360", 
                      "Dodge Challenger", 
                      "AMC Javelin", 
                      "Lotus Europa", 
                      "Ford Pantera L", 
                      "Ferrari Dino", 
                      "Porsche 914-2") ~ "Pretty sporty", 
                make.model %in% 
                    c("Fiat X1-9", 
                      "Hornet Sportabout", 
                      "Datsun 710", 
                      "Mazda RX4", 
                      "Merc 450SLC") ~ "Kinda sporty", 
                make.model %in% 
                    c("Chrysler Imperial", 
                      "Lincoln Continental", 
                      "Cadillac Fleetwood", 
                      "Merc 450SE") ~ "Total boat", 
                TRUE ~ "Not sporty"  ))
str(mtcars2)

## 'data.frame':    32 obs. of  17 variables:
##  $ make.model  : chr  "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
##  $ mpg         : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl         : chr  "6" "6" "4" "6" ...
##  $ disp        : num  160 160 108 258 360 ...
##  $ hp          : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat        : num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt          : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec        : num  16.5 17 18.6 19.4 17 ...
##  $ vs          : chr  "0" "0" "1" "1" ...
##  $ transmission: chr  "Manual" "Manual" "Manual" "Automatic" ...
##  $ gear        : chr  "4" "4" "4" "3" ...
##  $ carb        : chr  "4" "4" "1" "1" ...
##  $ make        : Factor w/ 20 levels "AMC","Cadillac",..: 14 14 5 1 1 16 16 15 15 15 ...
##  $ country     : Factor w/ 6 levels "Germany","Italy",..: 3 3 3 6 6 6 6 1 1 1 ...
##  $ origin      : chr  "foreign" "foreign" "foreign" "domestic" ...
##  $ continent   : chr  "Asia" "Asia" "Asia" "North America" ...
##  $ sportiness  : chr  "Kinda sporty" "Not sporty" "Kinda sporty" "Not sporty" ...

Summary statistics

While mean() is definitely useful to get the average of a data vector, rarely are we just interested in a single vector. Instead, we need to apply functions like mean() across a number of groups. In Excel, we’d create a PivotTable. In base \({\bf\textsf{R}}\), we’d use aggregate() to use the function mean to calculate the average horsepower hp for each sportiness group:

aggregate(hp ~ sportiness, data=mtcars2, FUN=mean)

##      sportiness       hp
## 1  Kinda sporty 124.8000
## 2    Not sporty 110.1429
## 3 Pretty sporty 188.6667
## 4    Total boat 207.5000

This method has limitations: it is difficult to stick into a pipe workflow, and uses one function at a time.

The tidy way is to group and summarize:

mtcars2 %>% # pipe operator 'pours' data down
  group_by(sportiness) %>%
    summarize(mean_hp = mean(hp))

## # A tibble: 4 x 2
##   sportiness    mean_hp
##   <chr>           <dbl>
## 1 Kinda sporty     125.
## 2 Not sporty       110.
## 3 Pretty sporty    189.
## 4 Total boat       208.

A nice thing about using summarize is that one has control over the name of the new summary column (we called it mean_hp above).³

The magic of tidyverse becomes further apparent as we…

Add another level:

mtcars2 %>% 
  group_by(transmission, sportiness) %>%
    summarize(mean_hp = mean(hp))

## # A tibble: 7 x 3
## # Groups:   transmission [2]
##   transmission sportiness    mean_hp
##   <chr>        <chr>           <dbl>
## 1 Automatic    Kinda sporty    178. 
## 2 Automatic    Not sporty      127. 
## 3 Automatic    Pretty sporty   180  
## 4 Automatic    Total boat      208. 
## 5 Manual       Kinda sporty     89.7
## 6 Manual       Not sporty       80.4
## 7 Manual       Pretty sporty   196.

Add another operation:

mtcars2 %>% 
  group_by(transmission, sportiness) %>%
    summarize(mean_hp = mean(hp), 
              sd_hp = sd(hp))

## # A tibble: 7 x 4
## # Groups:   transmission [2]
##   transmission sportiness    mean_hp  sd_hp
##   <chr>        <chr>           <dbl>  <dbl>
## 1 Automatic    Kinda sporty    178.    3.54
## 2 Automatic    Not sporty      127.   54.4 
## 3 Automatic    Pretty sporty   180    44.9 
## 4 Automatic    Total boat      208.   21.0 
## 5 Manual       Kinda sporty     89.7  22.2 
## 6 Manual       Not sporty       80.4  27.1 
## 7 Manual       Pretty sporty   196.  103.

Or need to make a modification to the data prior to summarization and output in a nice table:

mtcars2 %>% 
  mutate(transmission = str_c(transmission, " shifter")) %>%
    group_by(sportiness, transmission) %>%
      summarize(mean_hp = mean(hp), 
                sd_hp = sd(hp)) %>%
  mutate_at(vars(mean_hp, sd_hp), ~round(., 0)) %>%
  mutate(`HP (mean ± sd)` = str_c(mean_hp, " ± ", sd_hp)) %>%
  select(-mean_hp, -sd_hp) %>%
  knitr::kable(caption="Average horsepower by sportiness category.")

Table 1: Average horsepower by sportiness category.
sportiness	transmission	HP (mean ± sd)
Kinda sporty	Automatic shifter	178 ± 4
Kinda sporty	Manual shifter	90 ± 22
Not sporty	Automatic shifter	127 ± 54
Not sporty	Manual shifter	80 ± 27
Pretty sporty	Automatic shifter	180 ± 45
Pretty sporty	Manual shifter	196 ± 103
Total boat	Automatic shifter	208 ± 21

Wrap up

Practice saving files by saving mtcars2. We’ll use it again.

save(mtcars2, file="./data/mtcars2.Rdata") 
# or
save(mtcars2, file=file.choose())

Homework assignment

Homework assignment on GitHub

In base \({\bf\textsf{R}}\), we would use the merge() function.↩
“not USA” would be coded as country != "USA", which explains why we use country == "USA" with == and not simply =.↩
aggregate() recycles the name of the column it summarized hp, which makes it difficult to identify the summary value and very annoying when seeking to combine the results of two operations because of the shared column names.↩

Lesson 3 | Intro to tidyverse