
Lesson 1 | A very basic introduction to R

Objectives

Introduce the following concepts:

  • Object-orientedness
  • Vectors
  • Functions

Working through the script

Comments

Note that anything following # on a line is ignored by \({\bf\textsf{R}}\). The # marks the start of a comment, which is useful for adding explanation to the script.
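As a small illustration (the arithmetic here is just a placeholder), a comment can take up a whole line or follow code on the same line:

```r
# A full-line comment: R skips this line entirely
2 + 3   # a trailing comment: R evaluates 2 + 3 and ignores the rest
## [1] 5
```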

At a very basic level \({\bf\textsf{R}}\) is a fancy calculator. It will chug arithmetic operations:

  2+2  
## [1] 4
  2*2 
## [1] 4
  2*2+6 
## [1] 10

\({\bf\textsf{R}}\) follows proper order of operations, including parentheses:

  6+2*2 
## [1] 10
  6+(2*2) 
## [1] 10
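The two lines above give the same result because multiplication is already done before addition. A quick sketch of a case where parentheses do change the answer:

```r
6 + 2 * 2      # multiplication first: 6 + 4
## [1] 10
(6 + 2) * 2    # parentheses first: 8 * 2
## [1] 16
2^3 * 2        # exponent before multiplication: 8 * 2
## [1] 16
```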

Object-oriented

\({\bf\textsf{R}}\) is often described as “object-oriented”. For our purposes, this means that values and results can be stored in named objects and referred to by name. We have two options when writing script to define objects:

  • <- assigns the result of the operation on the right to the named object on the left.
  • = does the same in an ordinary statement, and is shorter.
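A minimal sketch of the two operators side by side. The object names x and y are made up for this illustration:

```r
x <- 2 + 3   # assign with <-
y = 2 + 3    # assign with =
identical(x, y)   # both objects hold the same value
## [1] TRUE
```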

For programming purposes I like the idea of moving one side to the other, especially when the right side has many entries or is large. I tend to reserve = for defining variables or long file paths in my script, and use <- when creating data objects or storing statistical results.

Note that \({\bf\textsf{R}}\) creates objects on the fly: unlike in C++, they do not need to be declared before a value is assigned to them.

answer <- 2+(2*20)

Note that you should now see answer in the Environment pane of RStudio.

Calling the object will print its content in the console:

answer
## [1] 42

This object can now be used for additional operations…

answer*2
## [1] 84

…and the creation of new objects:

new.answer <- answer*2
new.answer
## [1] 84

Functions

Functions are a special type of \({\bf\textsf{R}}\) object that contain a series of operations instead of data. Functions are essentially shortcuts for common sets of operations.

For example, researchers often want to find the mean of data. Say we have the following five observations:

24, 13, 12, 22, and 15

The arithmetic mean is defined as the sum of the observations divided by the number of observations, which in \({\bf\textsf{R}}\) looks like:

(24 + 13 + 12 + 22 + 15) / 5 
## [1] 17.2

Alternatively, we can assign the data to an object using the c function, which stands for combine (it is often read as concatenate). It joins everything between the parentheses, separated by commas, into a vector that we’ll call data:

data <- c(24, 13, 12, 22, 15)
data 
## [1] 24 13 12 22 15

To find the mean of data, one might first think we can simply divide the object by 5…

data / 5
## [1] 4.8 2.6 2.4 4.4 3.0

…but this is obviously incorrect. Here, \({\bf\textsf{R}}\) has applied the “divide by five” operation to each value in the vector. This is an example of how \({\bf\textsf{R}}\) is vectorized: it is designed to perform its operations along vectors. Although it will be a while before you feed \({\bf\textsf{R}}\) large enough datasets to notice the difference, vectorization optimizes performance and makes \({\bf\textsf{R}}\) computations quick.
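The same element-by-element behavior applies to the other arithmetic operators. A short sketch using the same five observations:

```r
data <- c(24, 13, 12, 22, 15)
data * 2     # each value is doubled
## [1] 48 26 24 44 30
data + 10    # 10 is added to each value
## [1] 34 23 22 32 25
```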

Calculating the mean is a two-step process, and we need to define both. Thus, we must first find the sum of the data, for which we can use the shortcut function sum:

sum(data)
## [1] 86

Then we divide the sum by 5 to calculate the mean:

sum(data) / 5 
## [1] 17.2

This is an example of hard-coding: we’ve specified the divisor in this operation as a fixed value (5). But what if the value varies – say your technician (definitely not you!) inadvertently lost or failed to enter some data, and a given set of replicates does not have the number of observations you expect? Hard-coding your count creates problems:

data2 <- c(24, 13, 12, 22)
sum(data2) / 5
## [1] 14.2

The calculated mean is too low, because our hard-coded operation divided the sum of four observations by five.

It is preferable to have \({\bf\textsf{R}}\) determine the count for each operation, so if counts differ, \({\bf\textsf{R}}\) can automatically account for it.

We can use the length function to determine how many observations are in the set:

length(data)
## [1] 5

If length sounds odd, remember data is a vector composed of individual values. The number of entries determines how long the vector is, and so length is a convenient way to count the number of observations. This is a core concept in \({\bf\textsf{R}}\) that we will return to frequently.

Let’s see how this combination of functions performs:

sum(data) / length(data)
## [1] 17.2
length(data2)
## [1] 4
sum(data2) / length(data2)
## [1] 17.75

Of course, calculating the mean of a vector is a very common operation, and \({\bf\textsf{R}}\) has a built-in function that combines the sum, length, and / operations into one shortcut:

mean(data)
## [1] 17.2
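As a quick check, mean() also handles the shorter vector without any hard-coded count, and it agrees with the manual calculation. This sketch redefines both vectors so it can be run on its own:

```r
data  <- c(24, 13, 12, 22, 15)
data2 <- c(24, 13, 12, 22)
mean(data2)
## [1] 17.75
# all.equal() compares numbers while tolerating tiny rounding differences
all.equal(mean(data), sum(data) / length(data))
## [1] TRUE
```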

Custom functions

\({\bf\textsf{R}}\) has a lot of functions built in, and thousands of packages supply additional functions. But one often still encounters a situation where one’s life–or at least one’s script–is made simpler with a custom function.

Writing your own functions is easy. They are a special type of object in \({\bf\textsf{R}}\) that can be defined and added to the global environment. The function keyword creates them: one simply lists the arguments between the ( ) and specifies the operations between curly brackets { }.
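The general pattern can be sketched with a deliberately trivial function. Doubler is a hypothetical name used only for this illustration:

```r
# Pattern: name <- function(arguments) { operations }
Doubler <- function(x) { x * 2 }   # one argument, x; one operation, x * 2
Doubler(21)
## [1] 42
```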

Even though \({\bf\textsf{R}}\) already has mean(), let’s make our own alternative, called Meaner():

Meaner <- function(x) { sum(x) / length(x) }

We can type its name without parentheses to see what is stored in the object:

Meaner 
## function(x) { sum(x) / length(x) }

Then we can call it on our data:

Meaner(data)
## [1] 17.2

Our custom Meaner() function performs the same as the base mean().

Let’s make it truly custom, and add a little excitement to the operation:

Meaner <- function(x) {
  m = sum(x) / length(x)        # arithmetic mean
  m1 = paste(m, "!", sep="")    # append an exclamation mark
  return(m1)
}
Meaner(data)
## [1] "17.2!"

Notice how the function created two objects, m and m1, that were not added to the global environment. They existed only temporarily, while the function was running; return() specified which value should be handed back to \({\bf\textsf{R}}\) when the operation was complete.
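One way to see this for yourself is sketched below. It assumes a session in which you have not separately created an object called m1, and the name result is made up for this example. Note also that the value returned is now text rather than a number, because paste() produces a character string:

```r
Meaner <- function(x) {
  m  = sum(x) / length(x)
  m1 = paste(m, "!", sep = "")
  return(m1)
}
result <- Meaner(c(24, 13, 12, 22, 15))   # store the returned value
result
## [1] "17.2!"
class(result)   # the returned value is a character string
## [1] "character"
# m1 lived only inside the function, so it is not in the global environment
exists("m1", envir = globalenv(), inherits = FALSE)
## [1] FALSE
```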