An Introduction to R Programming

For the purposes of this course, we will be working with the integrated development environment (IDE) RStudio. Make sure you have downloaded it and have familiarised yourself with the interface before proceeding.

Features of R

R is a statistical programming language. As such, it understands and categorises its input as data types:

(Note that R is not too strict about data types, but you need to be able to identify them to use them in math operations.)

Examples of data types:

# Numeric (or double): 
numeric_vector <- c(1, 1.0, 65.5)
print(numeric_vector)
[1]  1.0  1.0 65.5
#Integers: 
integer_vector <- c(1L, 3L, 45L)
print(integer_vector)
[1]  1  3 45
#Logical (or boolean): 
boolean_vector <- c(TRUE, FALSE)
print(boolean_vector)
[1]  TRUE FALSE
#Character: 
character_vector <- c("Harry Potter", "Star Wars", "Lord of the Rings")
print(character_vector)
[1] "Harry Potter"      "Star Wars"         "Lord of the Rings"
#Factor: (also known as categorical variables)
factor_vector <- as.factor(c("male","female"))
print(factor_vector)
[1] male   female
Levels: female male
#Missing: 
NA
[1] NA
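
If you are unsure which type R has assigned to a value, you can ask it. The short sketch below is not part of the course script; it uses the base R function typeof() and also shows what "not too strict" means in practice: when types are mixed inside c(), R quietly converts everything to the most flexible type.

```r
# Ask R which type it assigned to a value
typeof(65.5)         # "double"
typeof(45L)          # "integer"
typeof(TRUE)         # "logical"
typeof("Star Wars")  # "character"

# Mixing types inside c() triggers implicit coercion to one common type
mixed_vector <- c(1L, 2.5, TRUE)    # integer and logical become double
typeof(mixed_vector)

mixed_with_text <- c(1, "a", TRUE)  # everything becomes character
print(mixed_with_text)

# Numbers stored as text must be converted before doing maths with them
as.numeric("65.5") + 1
```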

… and data structures. A data structure is either homogeneous (all elements are of the same data type) or heterogeneous (elements can be of more than one data type).

Examples of data structures:

Vectors: think of a row, or a column, of elements of the same type.

#Vectors: 
x <- c(1L,3L,5L,7L,9L) # we call this an integer vector
y <- c(1.3, 1, 5, 7, 11.2) # we call this a numeric (or double) vector

print(x)
[1] 1 3 5 7 9
print(y)
[1]  1.3  1.0  5.0  7.0 11.2

Matrices: they have rows and columns containing elements of the same type.

#Matrices: 
A <- matrix(1:9, ncol=3, nrow=3, byrow= TRUE)
print(A)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
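
As a small illustration (the object names below are made up for this sketch), here is how the byrow argument changes the way a matrix is filled, and how a single element is picked out with [row, column]:

```r
# byrow = TRUE fills row by row; the default (byrow = FALSE) fills column by column
A_by_row <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
A_by_col <- matrix(1:9, nrow = 3, ncol = 3)

A_by_row[2, 3]  # row 2, column 3 of the row-filled matrix: 6
A_by_col[2, 3]  # row 2, column 3 of the column-filled matrix: 8

dim(A_by_row)   # number of rows, number of columns

# Transposing the row-filled matrix gives the column-filled one
identical(t(A_by_row), A_by_col)
```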

Arrays: A vector is a one-dimensional array. A matrix is a two-dimensional array. In short, an array is a collection of data of the same type.

#Arrays:
n <- 5*5*3 
B <- array(1:n, c(5,5,3))
print(B)
, , 1

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11   16   21
[2,]    2    7   12   17   22
[3,]    3    8   13   18   23
[4,]    4    9   14   19   24
[5,]    5   10   15   20   25

, , 2

     [,1] [,2] [,3] [,4] [,5]
[1,]   26   31   36   41   46
[2,]   27   32   37   42   47
[3,]   28   33   38   43   48
[4,]   29   34   39   44   49
[5,]   30   35   40   45   50

, , 3

     [,1] [,2] [,3] [,4] [,5]
[1,]   51   56   61   66   71
[2,]   52   57   62   67   72
[3,]   53   58   63   68   73
[4,]   54   59   64   69   74
[5,]   55   60   65   70   75
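
Elements of an array are reached with one index per dimension. A minimal sketch, rebuilding the same 5 x 5 x 3 array so that it runs on its own:

```r
B <- array(1:75, dim = c(5, 5, 3))

dim(B)       # 5 5 3: rows, columns, slices
B[2, 3, 1]   # row 2, column 3 of the first slice: 12

# Leaving an index empty keeps everything along that dimension
first_slice <- B[, , 1]
is.matrix(first_slice)   # a single slice is an ordinary 5 x 5 matrix
```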

Lists: a list is a one-dimensional, heterogeneous data structure. Basically, it is an object that can store objects of any type.

my_list <- list(42, "The answer is", TRUE)
print(my_list)
[[1]]
[1] 42

[[2]]
[1] "The answer is"

[[3]]
[1] TRUE
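
List elements can also be given names, which makes them easier to retrieve. The sketch below uses made-up element names (number, text, flag) and shows the difference between double and single brackets:

```r
my_named_list <- list(number = 42, text = "The answer is", flag = TRUE)

my_named_list[[1]]     # double brackets return the element itself
my_named_list$text     # the $ operator does the same, by name
my_named_list["flag"]  # single brackets return a (shorter) list

# Each element keeps its own type
element_classes <- sapply(my_named_list, class)
print(element_classes)
```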

Data frames: a data frame is a list of column vectors. Each column holds a single data type, but different columns can hold different data types. Note, however, that in a data frame all columns must have the same length.

a <- 1L:5L
class(a)
[1] "integer"
b <- c("a1", "b2", "c3", "d4", "e5")
class(b)
[1] "character"
c <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
class(c)
[1] "logical"
df <- as.data.frame(cbind(a,b,c))

str(df)
'data.frame':   5 obs. of  3 variables:
 $ a: chr  "1" "2" "3" "4" ...
 $ b: chr  "a1" "b2" "c3" "d4" ...
 $ c: chr  "TRUE" "TRUE" "FALSE" "TRUE" ...

Notice that even though vector a is of class integer, vector b is of class character, and vector c is of class logical (boolean), binding them together with cbind() has coerced all of them into character: cbind() first builds a matrix, and a matrix can hold only one data type. You’ll have to manually transform them back into their original class to be able to use them in math operations.

df$a <- as.integer(df$a)
df$b <- as.character(df$b)
df$c <- as.logical(df$c)

str(df)
'data.frame':   5 obs. of  3 variables:
 $ a: int  1 2 3 4 5
 $ b: chr  "a1" "b2" "c3" "d4" ...
 $ c: logi  TRUE TRUE FALSE TRUE FALSE
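
One way to sidestep the manual conversion, sketched here with made-up vector names (ids, labels, flags), is to hand the vectors straight to data.frame() instead of going through cbind(). Each column then keeps its own type:

```r
ids    <- 1L:5L
labels <- c("a1", "b2", "c3", "d4", "e5")
flags  <- c(TRUE, TRUE, FALSE, TRUE, FALSE)

# stringsAsFactors = FALSE is the default from R 4.0 onwards; it is spelled out
# here so that older versions do not turn the character column into a factor
df_direct <- data.frame(a = ids, b = labels, c = flags, stringsAsFactors = FALSE)

str(df_direct)
column_classes <- sapply(df_direct, class)
print(column_classes)
```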

From here on, we will write an R script together, and learn some basic commands and tools that will allow you to explore and manipulate data. Please note that this is not an exhaustive tutorial. It is nonetheless a good place to start.

1. Setting up the RStudio working environment

It is good practice to make sure that the working environment is empty/clean before you start running any code.

# The base R command rm() stands for remove
rm(list = ls()) # this line tells R to remove every object from the environment (loaded packages stay loaded)

Once that has been taken care of, you need to load the libraries you will be working with. While base R has a large number of commands to explore, wrangle, and manipulate data, the open source nature of R means that people all over the world are constantly working on packages and functions to make our lives easier. These can be used by loading the libraries in which they are stored:

#  My personal favourites are the tidyverse, by Hadley Wickham, and data.table. Both are brilliant for data exploration, manipulation, and visualisation. 
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(data.table)

Attaching package: 'data.table'

The following objects are masked from 'package:lubridate':

    hour, isoweek, mday, minute, month, quarter, second, wday, week,
    yday, year

The following objects are masked from 'package:dplyr':

    between, first, last

The following object is masked from 'package:purrr':

    transpose

If you’re working with an imported data set, you should probably set up your working directory as well:

setwd("/Users/lucas/Documents/UNU-CDO/courses/ml4p/ml4p-website-v2")

From the working directory, you can read in files: .csv, .xls and .xlsx, images, .txt, .dta (yes, Stata files!), and more. You’ll need the right libraries to do so. For instance, the package readxl (part of the tidyverse, although it is not attached by library(tidyverse) and must be loaded separately) provides the function read_excel() to import .xlsx and .xls files. If you want to export a data frame in .xlsx format, you can use the function write_xlsx() from the writexl package.
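
For plain .csv files no extra library is needed. The following is a minimal sketch using only base R; it writes to a temporary file (csv_path is a made-up name) so nothing lands in your working directory:

```r
# Write the first rows of mtcars to a temporary .csv file...
csv_path <- tempfile(fileext = ".csv")
write.csv(head(mtcars), csv_path, row.names = TRUE)

# ...and read the file back in, using the first column as row names
cars_from_csv <- read.csv(csv_path, row.names = 1)

dim(cars_from_csv)        # same shape as head(mtcars)
rownames(cars_from_csv)   # the car names survived the round trip
```

With a real file you would replace csv_path with the file name, given relative to the working directory.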

2. R base commands for data set exploration

Now that we have the basics set up, let’s start with some base R commands that will allow us to explore our data. To do so, we will work with the toy data set mtcars, which ships with R, so there is no need to upload data or fetch it from a website.

# some basics to explore your data 
str(mtcars) # show the structure of the object in a compact format
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
dim(mtcars) # inspect the dimension of the dataset (returns #rows, #columns)
[1] 32 11
class(mtcars) # evaluate the class of the object (e.g. numeric, factor, character...)
[1] "data.frame"
length(mtcars$mpg) # evaluate the number of elements in vector mpg
[1] 32
mean(mtcars$mpg) # mean of all elements in vector mpg
[1] 20.09062
sum(mtcars$mpg) # sum of all elements in vector mpg (similar to a column sum)
[1] 642.9
sd(mtcars$mpg) # standard deviation
[1] 6.026948
median(mtcars$mpg) # median
[1] 19.2
cor(mtcars$mpg, mtcars$wt) # default is pearson correlation, specify method within function to change it.  
[1] -0.8676594
table(mtcars$am) #categorical data in a table: counts

 0  1 
19 13 
prop.table(table(mtcars$am)) #categorical data in a table: proportions

      0       1 
0.59375 0.40625 
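
The commands above summarise one column at a time. As a short extension (not part of the original script), the sketch below computes a summary per group with the base R functions tapply() and aggregate(), here the mean mileage by transmission type:

```r
# Five-number summary plus the mean, in one call
summary(mtcars$mpg)

# Mean mpg for each value of am (0 = automatic, 1 = manual)
mpg_by_am <- tapply(mtcars$mpg, mtcars$am, mean)
print(mpg_by_am)

# The same result as a data frame, using the formula interface
aggregate(mpg ~ am, data = mtcars, FUN = mean)
```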

3. Objects and assignments

Another important feature of the R programming language is that it is object oriented. In practice, this means that you will usually assign the result of a function to a named object, so that you can reuse it later. Let’s see an example of object assignment with a bivariate linear regression model:

ols_model <- lm(mpg ~ wt, data = mtcars) # lm stands for linear model. Inside the parentheses, the dependent variable comes first and the independent variable after the tilde (~).
summary(ols_model)

Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,    Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

If you would like to see the results from your regression, you do not need to run it again. Instead, print the object (ols_model) in which you stored the fitted model. Similarly, you can call information stored in that object at any time, for example, the estimated coefficients:

ols_model$coefficients
(Intercept)          wt 
  37.285126   -5.344472 
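
A stored model can be reused in other ways too. The sketch below refits the same model so that it runs on its own, then uses the base R functions coef(), summary(), and predict(); the data frame new_car is a made-up example of a car weighing 3,000 lbs (wt is measured in thousands of pounds).

```r
ols_model <- lm(mpg ~ wt, data = mtcars)

coef(ols_model)    # accessor function, equivalent to ols_model$coefficients
names(ols_model)   # everything else that is stored in the object

# The summary is itself an object with components you can extract
r_squared <- summary(ols_model)$r.squared
print(r_squared)

# Predict mileage for a hypothetical new car
new_car <- data.frame(wt = 3)
predicted_mpg <- predict(ols_model, newdata = new_car)
print(predicted_mpg)
```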

If you want to read more about object oriented programming in R, check out Hadley Wickham’s site.

4. Plotting with and without special libraries

We had previously loaded a couple of libraries. Why did we do that if we’ve only used base R commands so far? We’re going to illustrate the power of libraries by drawing plots using base R and ggplot2. ggplot2 is the plotting package from the tidyverse, and arguably one of the best data visualisation tools across programming languages. If you’d like to read more about why that is the case, check out the Grammar of Graphics site.

Plotting with R base

plot(mtcars$wt, mtcars$mpg, pch = 14, col = "grey", main ="Mileage and Weight")
abline(ols_model, col ="blue") # Note that to add a line of best fit, we had to call our previously estimated linear model, stored in the ols_model object.

# with base R, you cannot directly assign an object to a plot, you need to use...
p_rbase <- recordPlot()
# plot.new() # don't forget to clean up your device afterwards!
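
If the goal is simply to keep a copy of a base R plot, another option is to draw it into a file. A minimal sketch, assuming a PDF is an acceptable format and using a temporary file (plot_path is a made-up name):

```r
plot_path <- tempfile(fileext = ".pdf")

pdf(plot_path)   # open a PDF graphics device: plots now go to the file
plot(mtcars$wt, mtcars$mpg, pch = 14, col = "grey", main = "Mileage and Weight")
abline(lm(mpg ~ wt, data = mtcars), col = "blue")
dev.off()        # close the device, otherwise the file stays incomplete

file.exists(plot_path)
```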

Plotting with ggplot2, using the grammar of graphics

# Steps in the Grammar of Graphics
# 1: linking plot to dataset, 
# 2: defining (aes)thetic mapping, 
# 3: use (geom)etric objects such as points, bars, etc. as markers, 
# 4: the plot has layers

p_ggplot <- ggplot(data = mtcars, aes(x = wt, y = mpg, col=am )) + #the color is defined by car type (automatic 0 or manual 1)
    geom_smooth(method = "lm", col= "orange") + # no need to run a regression ex-ante to add a line of best fit 
    geom_point(alpha=0.5) + #alpha controls transparency of the geom (a.k.a. data point)
    theme(legend.position="none") #removing legend
print(p_ggplot)
`geom_smooth()` using formula = 'y ~ x'

Thanks to the grammar of graphics, we can continue to edit the plot after we have finished it. Perhaps we’ve come up with ideas to make it more stylish? Or helpful? Let’s see an example:

p_ggplot <- p_ggplot + theme(legend.position = "right") # I am saving it with the same name again, but I could easily choose another name and keep two versions of the plot. 
print(p_ggplot) # the legend we just added is NOT helpful. Why is that? 
`geom_smooth()` using formula = 'y ~ x'

# Remember when we talked about data types?!

class(mtcars$am) #for legends, we might prefer levels/categories 
[1] "numeric"
mtcars$am <- as.factor(mtcars$am) # we have now transformed the numeric am into a factor variable 
#the importance of assigning objects :)

#  Now we can plot our scatterplot without issues
p_ggplot <- ggplot(data = mtcars, aes(x = wt, y = mpg, col = am )) +
    geom_smooth(method = "lm", col= "red") +
    geom_point(alpha=0.5) +
    theme_classic() + # apply the complete theme first...
    theme(legend.position="right") # ...so that later theme() tweaks are not overwritten by it

print(p_ggplot)
`geom_smooth()` using formula = 'y ~ x'

# Shall we continue?

p_ggplot <- p_ggplot + ggtitle("Scatterplot of mileage vs weight of car") +
    xlab("Weight of car") + ylab("Miles per gallon") +
    theme(plot.title = element_text(color="black", size=14, face="bold", hjust = 0.5))


p_ggplot <- p_ggplot + scale_colour_manual(name = "Automatic or Manual", 
                                           labels = c("0.Automatic", "1.Manual"),
                                           values = c("darkorange2", "darkgoldenrod1"))

print(p_ggplot)
`geom_smooth()` using formula = 'y ~ x'

# Finally, perhaps we want two lines of best fit that follow the shape of the value dispersion by car type, and not the linear model function?

p <- ggplot(data = mtcars, aes(x = wt, y = mpg, col = am )) +
    geom_smooth() +
    geom_point(alpha=0.5) +
    theme_classic() +
    theme(legend.position="right") + ggtitle("Scatterplot of mileage vs weight of car") +
    xlab("Weight of car") + ylab("Miles per gallon") +
    theme(plot.title = element_text(color="black", size=14, face="bold", hjust = 0.5)) + 
    scale_colour_manual(name = "Automatic or Manual", 
                        labels = c("0.Automatic", "1.Manual"),
                        values = c("darkorange2", "darkgoldenrod1"))
print(p)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
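
In the plots above we converted mtcars$am to a factor before plotting. A possible alternative, sketched below, is to wrap the variable in factor() inside aes(), which leaves the data untouched. The objects cars and p_alt are made-up names, and the sketch assumes ggplot2 is installed.

```r
library(ggplot2)

cars <- datasets::mtcars   # a fresh copy in which am is still numeric

p_alt <- ggplot(cars, aes(x = wt, y = mpg, colour = factor(am))) +
    geom_point(alpha = 0.5) +
    geom_smooth(method = "lm", se = FALSE) +  # one straight line per transmission type
    theme_classic() +
    theme(legend.position = "right") +
    labs(colour = "Transmission", x = "Weight of car", y = "Miles per gallon")

print(p_alt)
class(cars$am)   # the column itself has not been changed
```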

5. R base capabilities

We’ve had some fun, now let’s go back to some base R basics. These are going to be relevant for writing algorithms of your own:

Arithmetic and other operations

2+2 #addition
[1] 4
5-2 #subtraction
[1] 3
33.3/2 #division
[1] 16.65
5^2 #exponentiation
[1] 25
200 %/% 60 #integer division  [or, how many full hours in 200 minutes?]/aka quotient
[1] 3
200 %% 60 #remainder [or, how many minutes are left over? ]
[1] 20
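
The same operators also work element by element on whole vectors. A small sketch with made-up values, extending the minutes-to-hours example:

```r
minutes <- c(200, 59, 125)

hours    <- minutes %/% 60   # full hours in each element: 3 0 2
leftover <- minutes %% 60    # remaining minutes: 20 59 5

sprintf("%d h %d min", as.integer(hours), as.integer(leftover))

# The usual order of operations applies; use parentheses to change it
2 + 3 * 4     # 14
(2 + 3) * 4   # 20
```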

Logical operators

34 < 35 #smaller than
[1] TRUE
34 < 35 | 34 < 33 #smaller than 35 OR smaller than 33 (returns TRUE if at least one comparison is TRUE)
[1] TRUE
34 > 35 & 34 > 33 #bigger than 35 AND bigger than 33 (returns TRUE only if both comparisons are TRUE)
[1] FALSE
34 != 34 #not equal to
[1] FALSE
34 %in% 1:100 #value matching (is the object contained in the list of items?)
[1] TRUE
`%ni%` <- Negate(`%in%`) #let's create a function for NOT IN
1001 %ni% 1:100 # home-made function! :)
[1] TRUE
34==34 #equality test
[1] TRUE
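
A word of caution when combining conditions: comparison operators are evaluated before | and &, so each side of | or & needs a complete comparison of its own. The sketch below (x and heights are made-up objects) shows the pattern, and what happens when one side is just a number:

```r
x <- 34

(x < 35) | (x < 33)   # TRUE: the first comparison holds
x > 30 & x < 40       # TRUE: x lies between 30 and 40

# Careful: this is read as (x < 35) | 33, and any non-zero number counts as TRUE
x < 35 | 33

# On vectors, & and | compare element by element
heights <- c(96, 172, 202)
heights > 100 & heights < 200
```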

Indexing

The tidyverse comes with a cool Starwars dataframe (or tibble, in the tidyverse language). As long as you have loaded the tidyverse library, you can use it.

head(starwars) # print the first 6 rows of the dataframe/tibble
# A tibble: 6 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Luke Sky…    172    77 blond      fair       blue            19   male  mascu…
2 C-3PO        167    75 <NA>       gold       yellow         112   none  mascu…
3 R2-D2         96    32 <NA>       white, bl… red             33   none  mascu…
4 Darth Va…    202   136 none       white      yellow          41.9 male  mascu…
5 Leia Org…    150    49 brown      light      brown           19   fema… femin…
6 Owen Lars    178   120 brown, gr… light      blue            52   male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
starwars$name[1] #indexing: extract the first element in the "name" vector from the starwars dataframe
[1] "Luke Skywalker"
starwars[1,1] #alternative indexing: extract the element of row 1, col 1
# A tibble: 1 × 1
  name          
  <chr>         
1 Luke Skywalker
starwars$name[2:4] # elements 2, 3, 4 of "name" vector
[1] "C-3PO"       "R2-D2"       "Darth Vader"
starwars[,1] #extract all elements from column 1
# A tibble: 87 × 1
   name              
   <chr>             
 1 Luke Skywalker    
 2 C-3PO             
 3 R2-D2             
 4 Darth Vader       
 5 Leia Organa       
 6 Owen Lars         
 7 Beru Whitesun Lars
 8 R5-D4             
 9 Biggs Darklighter 
10 Obi-Wan Kenobi    
# ℹ 77 more rows
starwars[1,] #extract all elements from row 1
# A tibble: 1 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Luke Sky…    172    77 blond      fair       blue              19 male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
starwars$height[starwars$height<150] # returns the heights below 150; rows where height is missing come back as NA
 [1]  96  97  66  NA  88 137 112  79  94 122  96  NA  NA  NA  NA  NA
starwars[c('height', 'name')] # select several columns by name, passed as a c(oncatenated) character vector
# A tibble: 87 × 2
   height name              
    <int> <chr>             
 1    172 Luke Skywalker    
 2    167 C-3PO             
 3     96 R2-D2             
 4    202 Darth Vader       
 5    150 Leia Organa       
 6    178 Owen Lars         
 7    165 Beru Whitesun Lars
 8     97 R5-D4             
 9    183 Biggs Darklighter 
10    182 Obi-Wan Kenobi    
# ℹ 77 more rows
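
When a vector contains missing values, a logical condition returns NA for them, and those NAs carry over into the result. The sketch below uses a small made-up data frame (toy) rather than starwars, and shows two ways of keeping only the rows that really meet the condition:

```r
toy <- data.frame(name   = c("A", "B", "C", "D", "E"),
                  height = c(172, 96, NA, 66, 202))

toy$height[toy$height < 150]          # 96 NA 66: the NA is kept

# which() returns only the positions where the condition is TRUE
short_heights <- toy$height[which(toy$height < 150)]
print(short_heights)

# Rows and columns can be filtered at the same time: [rows, columns]
short_names <- toy[which(toy$height < 150), "name"]
print(short_names)

# subset() drops the NA rows as well
subset(toy, height < 150)
```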

We are now reaching the end of this brief introduction to R and RStudio. We will not cover data.table here; it is a powerful data wrangling package that you can explore on your own if you need it. Instead, we will finish with reserved names in R: names you cannot use for your own objects, because they serve a programming purpose.

6. Reserved names

1. Not-observed data

# 1.1 NaN: results that cannot be reasonably defined
h <- 0/0
is.nan(h)
[1] TRUE
class(h)
[1] "numeric"
print(h)
[1] NaN
# 1.2 NA: missing data
colSums(is.na(starwars)) # How many missing values are there in each column of the starwars tibble?
      name     height       mass hair_color skin_color  eye_color birth_year 
         0          6         28          5          0          0         44 
       sex     gender  homeworld    species      films   vehicles  starships 
         4          4         10          4          0          0          0 
mean(starwars$height) # evaluation gives NA? Does that mean that the height vector is empty?!
[1] NA
is.na(starwars$height) # logical vector returns 6 true statements for is.na (coincides with colSums table)
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
[85]  TRUE  TRUE  TRUE
class(starwars$height) # class evaluation returns integer...
[1] "integer"
mean(as.integer(starwars$height), na.rm = TRUE) # read as integer, ignore NAs :) (rm stands for remove)
[1] 174.6049
# missing values (NA) are regarded by R as non-comparable, even to themselves, so if you ever encounter missing values in a vector, make sure to explicitly tell the function you are using what to do with them.
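
To see this non-comparability at work, here is a short sketch on a made-up vector of heights:

```r
NA == NA   # NA, not TRUE: R cannot know whether two unknown values are equal
NA > 1     # NA as well

heights <- c(172, NA, 96)

heights == NA     # never use == to look for missing values...
is.na(heights)    # ...use is.na() instead

sum(is.na(heights))           # number of missing values: 1
mean(heights)                 # NA, because one element is unknown
mean(heights, na.rm = TRUE)   # 134, the mean of the observed values
```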

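2. if, else, ifelse, function, for, print, length (etc…)

Of these, if, else, for, and function are truly reserved words: they drive control flow, looping, and function definition, and cannot be used as object names. ifelse, print, and length are ordinary base R functions; R will let you overwrite them, but doing so is best avoided.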
# if else statements
meanheight <- 174.358

if (meanheight > 175) {
    print('very tall!')
} else {
    print('average height')
} 
[1] "average height"
# a note on using if else: else must be on the same line as the closing brace } of the if block, otherwise it is not recognised

# home-made functions
f1 <- function(x){
    return(sum(x+1))
}
print(f1(5)) # returns value of x+1 when x = 5
[1] 6
# for loops
class(starwars$height) # if it is not integer or numeric, then transform it! 
[1] "integer"
starwars$height <- as.numeric(as.character(starwars$height)) # transforming the height vector to numeric
starwars$height_judge <- NA # if you're using a new object within a for loop, make sure you initialize it before running it

for (i in seq_along(starwars$height)) { # seq_along() is safer than 1:length(): it also works for empty vectors
    
    starwars$height_judge[i] <- ifelse(starwars$height[i]<100, "Short", "Tall")
}
print(starwars[c("name", "height_judge", "height")])
# A tibble: 87 × 3
   name               height_judge height
   <chr>              <chr>         <dbl>
 1 Luke Skywalker     Tall            172
 2 C-3PO              Tall            167
 3 R2-D2              Short            96
 4 Darth Vader        Tall            202
 5 Leia Organa        Tall            150
 6 Owen Lars          Tall            178
 7 Beru Whitesun Lars Tall            165
 8 R5-D4              Short            97
 9 Biggs Darklighter  Tall            183
10 Obi-Wan Kenobi     Tall            182
# ℹ 77 more rows
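
Because ifelse() is vectorised, the loop above is a good exercise but not strictly necessary. As a sketch of an alternative, the made-up function judge_height() below wraps the same rule in a home-made function with a default cutoff, and applies it to a whole vector in one go:

```r
# cutoff has a default value, so it only needs to be supplied when it changes
judge_height <- function(height, cutoff = 100) {
    ifelse(height < cutoff, "Short", "Tall")
}

heights <- c(172, 167, 96, NA, 97)   # made-up values, including one missing

judge_height(heights)                # "Tall" "Tall" "Short" NA "Short"
judge_height(heights, cutoff = 170)  # a stricter definition of "Tall"
```

Note that the missing height stays NA rather than being labelled.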
