An Introduction to R

Presented to Herbarium Bogoriense staff, 2016-01-26 and 2016-01-27

(Modified from ‘An introduction to R by example’ that I first prepared for the Harvard Biodiversity of Borneo courses. This document is also accessible via short URL: http://is.gd/rbogor)

Contents

  Basic R usage
  Community ecology and multivariate data
  Tools for systematists

Basic R usage

Background

Technically, R is an ‘object-based programming language optimized for statistics.’ Practically, it’s the most powerful, free statistical software available! It is the open-source implementation of the S language developed at AT&T Bell Laboratories by Rick Becker, John Chambers and Allan Wilks. Among its strengths are the ability to construct complicated, repeatable analyses that pass raw text data all the way through to publishable graphics. There are many contributed packages (libraries) that can be imported to extend the analyses (7,826 packages as of 2016-01-25!). R distributions, packages and documentation can be found at http://cran.r-project.org.

You can keep up to date with relevant resources (packages) by browsing the package lists and Task Views on CRAN (http://cran.r-project.org).

This document is a brief introduction to R, by example. It is designed to be run as a script itself: you can paste the indented lines of code into the console, in sequence. Note that ‘#’ is the comment symbol in R; everything following a # is ignored by the interpreter.

Getting help

Help is always available by browsing the HTML pages installed with R (opened with help.start()), or by typing help(...), e.g.:

  help(cbind)

in the R console. Note that there is always an example at the end of each help page that can be pasted directly into the console. For a more comprehensive introduction see the on-board HTML page ‘An Introduction to R.’ See also https://cran.r-project.org/manuals.html and google ‘R tutorial.’

Installation and usage

Installation is platform dependent, and discussed in other documents. In general, the base software is either installed as a unit (Windows, Mac OS X or Linux), or built on-board with a compiler (UNIX). Packages can be either downloaded as binaries (Windows or Mac OS X) or built on-board from source (Linux). If the former, it is important to make sure that the packages were built for the same version of R you have installed (e.g. v 2.3 for Windows). Note that some packages require minimum versions of the R distribution to work.
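
Packages can also be installed and loaded from the R console itself; a minimal sketch (using two packages we will need later):

  install.packages(c("vegan", "ape"))   # download and install from CRAN
  library(vegan)                        # load an installed package into the session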

Exercise: Install R on your laptop, and add (at minimum) the packages ‘vegan’ and ‘ape.’ Install one package from within the GUI, using the ‘Add Packages’ menu item. Add another by finding the right zip file on CRAN and using the ‘Install package from zip’ menu item.

R is command-line driven. Commands (i.e., program statements) are typed or pasted into the terminal window (or R console). Complete scripts can be imported and run, and interactive sessions can be saved for later editing and re-running. The Linux version of R is invoked by typing ‘R’ at the shell prompt. The directory in which R was invoked is the default directory for reading scripts and data or outputting graphics. The Windows and Mac OS X versions are started as usual applications, and the working directory can be set via a menu item. Scripts can be run as batch jobs either from within R (using source()), or from the UNIX shell using:

  R CMD BATCH script.R outputfile.txt
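
or, from within an R session (same hypothetical script name as above):

  source("script.R")     # runs the commands in script.R in the current session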

There is now an option to run the basic functionality of R as a ‘point and click’ application via John Fox’s Rcmdr package. This uses Tcl/Tk widgets to translate your pointing and clicking into R commands. This may be useful, especially when trying to remember the commands for simple actions, but it should be considered an add-on, rather than the base mode. This document deals with the ‘raw’ commands only.

Language structure

R statements are most commonly made up of actions on (functions of) objects with the output assigned to another object. E.g.:

  new.object <- a.function(old.object)

Think of each statement as the operation you would do in a ‘point-and-click’ stats program (e.g., JMP, SYSTAT, SPSS). The difference is that the output objects are easy to use again as the input to another function. This ‘atomic’ approach (similar to the philosophy behind UNIX) permits the easy construction of complicated analyses and graphics. You are not limited by what the designers of your stats package decided to include for analyses and graphics.
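
For example, the output of one function can be fed straight into another (a small illustrative sketch):

  heights      <- c(5, 3, 9, 1, 7)        # a vector of raw values
  heights.sqrt <- sqrt(heights)           # one function applied to the object
  sort(heights.sqrt)                      # its output passed to another function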

R is also a full programming language with control flow, so that repetitive actions can be performed on objects (‘for’ loops) and conditional statements included (‘if’). For more on the programming aspects of R, check the on-board documentation. Note that using loops in R is inefficient (slow) relative to operating on an object as a whole, due to the design of the software. If you can, find ways to avoid loops, as the sketch below illustrates.
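
A small sketch of the same calculation done both ways (both give identical results; the vectorized form is faster and more idiomatic):

  x <- 1:10000
  # loop version: square each element one at a time
  y <- numeric(length(x))
  for (i in 1:length(x)) {
    y[i] <- x[i]^2
  }
  # vectorized version: operate on the whole object at once
  y2 <- x^2
  all.equal(y, y2)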

Basic objects

The basic objects in R are vectors of 0 dimensions and up. That is: single variables (0 dimensions), then a linear sequence of variables (one dimension), then a matrix, or array (two dimensions), etc. Their contents are assigned with the ‘<-’ operator (though = also works):

  a0 <- 1
  a1 <- c(0,1,2)                        # c() means concatenate
  b1 <- c("a", "b", "c")
  a2 <- cbind( c(0,1,2), c(10,11,12) )  # cbind() means ‘bind into columns’

Entering an object on its own causes the contents to be listed:

  a0
  a1
  b1
  a2

Elements of an object are referenced by indexes (beginning at 1 for the first element):

  b1[2]

For two dimensional objects the first index is the row number, the second the column:

  a2[3,2]

Non-specified indexes include all unrestricted elements of the object:

  a1[]
  a2[2,]
  a2[,2]

Indexing by sequence (1:10 means all integer values between 1 and 10 inclusive) is possible:

  a1[1:3]

Dropping elements is done with negative values:

  b1[-1]

Subsets are possible by including logical statements (vectors) as indices. E.g.: print all rows of the object a2 for which the second column of a2 is greater than 10 (spend some time understanding this one):

  a2[(a2[,2]>10),]

  a2[,2]>10            # this produces the underlying logical vector

Another: print all columns of the object a2, but only the rows for which the corresponding value of b1 is "b":

  a2[(b1=="b"),]

  b1=="b"

Data frames are table-like objects, similar to matrices, whose columns can contain different types of data:

  d <- data.frame( cbind( c(0,1,2), c("x", "y", "z") ) )
  d
  colnames(d) <- c("a_num", "myletter")
  d

Note that the leftmost column of the printout shows the row names (by default an integer), and the top row shows the column names (by default ‘X’ followed by an integer).

Exercise: Play around with variables, vectors, matrices and data frames for at least 10 minutes. Be sure that you can select components, as above.

Most data that you might want to read into R take the form of a data frame (mixed variable types). The import function read.table() creates a data frame from raw text. (Download the file: mydata.txt).

  data.df <- read.table("mydata.txt", sep="|", header= TRUE)
  #   The columns of data were separated by the | symbol
  #   The first row is the column names 
  data.df
  attributes(data.df)

The simplest way to get your data out of a spreadsheet into R is to pass it through a plain text file, using the ‘Save as CSV’ function. You can use any column separator you want, but must tell read.table() what it is.
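
For example, for a comma-separated file (hypothetical file name):

  data2.df <- read.table("mydata.csv", sep=",", header=TRUE)
  # read.csv() is a convenience wrapper with sep="," and header=TRUE preset
  data2.df <- read.csv("mydata.csv")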

Columns of a data frame are referenced by the ‘$’ symbol:

  data.df$rep

It’s a good idea to force factors into their correct form (this can also be done in the read.table function):

  levels(data.df$moisture)
  data.df$moisture <- as.factor(data.df$moisture)
  attributes(data.df$moisture)

  levels(data.df$rep)
  data.df$rep <- as.factor(data.df$rep)
  attributes(data.df$rep)
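
To set the factors at read time instead, one option is the colClasses argument of read.table() (a sketch, using the column names from mydata.txt above):

  data.df <- read.table("mydata.txt", sep="|", header=TRUE,
                        colClasses=c(moisture="factor", rep="factor"))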

Exercise: Read in some of your own data, via a CSV file. Set the factors to be factors.

Data summaries

Means, variances, etc. First generate some random normal data with rnorm():

  my.norm <- rnorm(n=30, mean=1, sd=1)
  my.norm

  mean(my.norm)
  sd(my.norm)

Tabulations

  table(data.df$moisture, data.df$rep)

Another type of tabulation command (very useful):

  tapply(data.df$height, list(data.df$moisture, data.df$rep), FUN=length)
  tapply(data.df$height, data.df$moisture, FUN=mean)

Graphics

The first graphical command is a histogram. This will open a new graphics window:

  hist(my.norm)

You can fiddle endlessly with the parameters to make your image look good, but here are the basics:

  hist(my.norm, main="My histogram", xlab="a continuous variable", sub="")

Now an x-y plot. The form is plot(x,y). Alternatively ‘~’ means ‘as a function of’.

  plot(data.df$height ~ data.df$fertility)

Separating the factors. First plot a blank frame:

  plot(data.df$fertility, data.df$height, type="n")

Then add the points:

  points(data.df$fertility[data.df$moisture=="DRY"], 
         data.df$height[data.df$moisture=="DRY"], col = "red")
  points(data.df$fertility[data.df$moisture=="WET"], 
         data.df$height[data.df$moisture=="WET"], col = "blue")

The default graphics device is the screen. To capture a graphic, a different device should be specified:

  jpeg(filename = "histogram.jpg", width = 480, height = 480,
      pointsize = 12, bg = "white")

Run the plot

  hist(my.norm)

Switch off the device:

  dev.off()

The file is written in the working directory. Other devices include pdf(), postscript(), png() and xfig().
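
For example, a PDF version of the same figure (hypothetical file name; width and height are in inches):

  pdf(file = "histogram.pdf", width = 6, height = 6)
  hist(my.norm)
  dev.off()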

For a demo of the graphics capabilities of R, type:

  demo(graphics)

Simple tests

t-test:

  x1 <- rnorm(n=100, mean=5, sd=2)
  x2 <- rnorm(n=50, mean=6, sd=3)

  t.test(x1,x2)

Alternative formulation:

  x <- c(x1,x2)
  group <- as.factor( c( rep(1,times=100), rep(2,times=50) ) )

  t.test(x ~ group)
  plot(x ~ group)

Note that the two ‘columns’ x and group do not have to belong to the same object. They must have the same length though.
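
If they are columns of a single data frame, the data= argument saves repeating the object name; a sketch using the vectors made above:

  xg.df <- data.frame(x = x, group = group)
  t.test(x ~ group, data = xg.df)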

Non-parametric

  wilcox.test(x1,x2, paired= FALSE)

You might want to clean up the workspace every now and again:

  ls()

  rm(x1, x2)

Linear models

Make some data and view it:

  x <- rnorm(n=100, mean=5, sd=2)
  y <- (x * 0.5) + rnorm(n=100, mean=2, sd=1)
  plot(x,y, xlim=c(0,10), ylim=c(0,10))

lm() is the basic least-squares linear model function (i.e. ANOVA and regression). The output of lm() is an object:

  xy.lm <- lm(y ~ x)
  summary(xy.lm)

Add the regression line:

  abline(xy.lm, col="red")

Now some deviant (heteroscedastic) data of the same type:

  y <- (x * rnorm(n=100, mean=0.5, sd=0.25)) + 2
  xy.lm <- lm(y ~ x)
  plot(x,y, xlim=c(0,10), ylim=c(0,10))
  abline(xy.lm, col="red")

Examine the residuals:

  plot(fitted(xy.lm), resid(xy.lm))
  qqnorm(resid(xy.lm))
  qqline(resid(xy.lm))

Test for normality of the residuals:

  hist(resid(xy.lm))
  shapiro.test(resid(xy.lm))

But we can still use a non-parametric test:

  cor.test(x, y, method = "spearman", alternative="two.sided")

Now a more complicated model. The formula is Y ~ X1 + X2 + X1:X2 ... two factors and an interaction term:

  data.lm <- lm(height ~ fertility + moisture + fertility:moisture, data=data.df)

  summary(data.lm)

The following two sections introduce more complex analyses.

Randomization methods

Increasingly, modern statistical approaches use randomization and re-sampling tests (bootstrapping) to deal with ugly distributions and non-normal data. Here is a simple example:

We define a function that returns the mean, over all observations, of the distance from each observation to its nearest neighbour:

  meanmindist <- function(x){
    z <- as.matrix(dist(x))
    z[row(z)==col(z)] <- max(z) # recode the self-distances (diagonal) so they are never the minimum
    mean( apply(z,1,min) )
  }

We have a universe of possible observations (e.g. body sizes of organisms in a region):

  all <- c(1, 2, 3, 3, 4, 4, 4, 4, 5, 5, 6, 7)

We have an observed set of 4 organisms, e.g. in a community:

  obs <- c(1, 3, 5, 7)

Question: is the mean nearest-neighbour difference in organism size greater than expected by chance?

  obs.md <- meanmindist(obs)

Let’s sample 1000 cases of 4 organisms:

  exp.md <- numeric(1000) # create an empty vector of 1000 slots
  for (i in 1:1000) {
    exp.md[i] <- meanmindist( sample(all, size=4) )
  }

View the result, with observed line in red:

  hist(exp.md)
  lines(c(obs.md,obs.md),c(0,par("usr")[4]),col="red")

Get the quantile (sort of ... note the effect of ties):

  match(obs.md, sort(exp.md))
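
An alternative, less affected by ties, is to report the proportion of randomized values at least as large as the observed value (an approximate one-tailed p-value):

  mean(exp.md >= obs.md)   # proportion of randomizations >= the observed value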

So the observed distances are greater than expected.

Quitting R

Finally, to leave R, type:

  q()

Enjoy the world of R!


Community ecology and multivariate data

As our next example, we’ll do some basic community ecology analysis, using the amazing vegan package of Jari Oksanen, Roeland Kindt, Pierre Legendre and Bob O’Hara.

First, load the package, and another package we’ll need:

  library(vegan)
  library(MASS) ## for isoMDS

Let’s start with some diversity analyses. Data from the Barro Colorado Island 50-ha plot. Each row is a separate 1-ha subplot. Columns are species. This is the default table orientation for vegan (tables can be transposed with the t() function). Look at the distribution of species abundances in the whole plot:

  data(BCI)
  BCI[1:4,1:4]

  hist( log( apply(BCI, 2, sum) ) )

What is the comparative diversity of each subplot? There are many diversity methods in vegan - let’s use the rarefaction method:

  rarefy(BCI, 100)
  sort(rarefy(BCI, 100))

Let’s fit a series of rank-abundance curves to the 35th (least diverse) and 23rd (most diverse) subplots. We’ll do these plots in the same window:

  par(mfrow=c(1,2))
  plot(radfit(BCI[35,]))
  plot(radfit(BCI[23,]))
  par(mfrow=c(1,1)) # off again

Now some species composition analyses. Use the built-in dune meadow dataset, with its associated environmental data dune.env:

  data(dune)
  dune
  data(dune.env)
  dune.env

Create a distance matrix between each of the subplots, using the Bray-Curtis measure of compositional dissimilarity: d[jk] = sum(abs(x[ij] - x[ik])) / sum(x[ij] + x[ik]).

  dune.dist <- vegdist(dune, method="bray")
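
As a check of the formula, a sketch computing the Bray-Curtis distance between the first two plots by hand and comparing it with the vegdist() result:

  x.j <- unlist(dune[1, ])               # species counts in plot 1
  x.k <- unlist(dune[2, ])               # species counts in plot 2
  sum(abs(x.j - x.k)) / sum(x.j + x.k)   # Bray-Curtis by hand
  as.matrix(dune.dist)[1, 2]             # the same value from vegdist()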

Display a cluster dendrogram of these distances:

  plot(hclust(dune.dist,method="ward"))

Now some ordinations. Let’s start with Non-metric Multidimensional Scaling (NMDS; a good start for unknown data).

  dune.mds <- metaMDS(dune)
  plot(dune.mds, type="t")

This is a biplot - it has both sample units and species on it.

Here’s a compact way to visualize compositional turnover in community data. It uses the MDS scores to order both species and plots.

  vegemite(dune, dune.mds, "Hill", zero="-")

Now let’s add some environmental info, using envfit(). For factors it plots the centroid of each factor level; for continuous variables it plots a correlation vector.

  dune.envfit <- envfit(dune.mds, dune.env, perm = 1000)
  dune.envfit

Note that moisture and management are most strongly correlated with the MDS scores. Let’s put these on the plot:

  plot(dune.envfit, p.max = 0.005, col = "blue")

There was only one continuous variable, A1. See its correlation with the MDS axes:

  scores(dune.envfit, "vectors")

Now for something really cool. Join the plots that have the same level of Moisture to the Moisture centroid position:

  ordispider(dune.mds, dune.env$Moisture, col="skyblue")

Now look at the cluster plot again, using Moisture regime as the label:

  plot(hclust(dune.dist,method="ward"), labels=dune.env$Moisture)

How different are the plots under different management regimes? Let’s use a permutation test (ANOSIM) to compare variation within and between management groups:

  dune.ano <- anosim(dune.dist, dune.env$Management)
  summary(dune.ano)
  plot(dune.ano)

Finally, let’s do a constrained correspondence analysis (a.k.a. canonical correspondence analysis). In this analysis the environmental scores are directly included in the ordination. From the manual page: “CCA does not try to display all variation in the data, but only the part that can be explained by the used constraints. Consequently, the results are strongly dependent on the set of constraints and their transformations or interactions among the constraints.”

  dune.cca <- cca(dune ~ Moisture + Management, data=dune.env)
  plot(dune.cca)

The vectors Moisture.L, etc result from treating the ordered factor Moisture as a continuous variable.
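
You can confirm this by looking at how R codes the ordered factor (the .L, .Q and .C columns are the polynomial contrasts that become those vectors):

  class(dune.env$Moisture)       # an ordered factor
  contrasts(dune.env$Moisture)   # polynomial contrasts: linear, quadratic, cubic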

Putting CCA and clustering together:

  plot(dune.cca, type = "n")
  text(dune.cca, display="sites", col="red", cex=1.2)
  text(dune.cca, display="cn", arrow = 2, col="blue")
  ordicluster(dune.cca, hclust(vegdist(dune)), prune=3, col = "black")

Wow!


Tools for systematists

Mapping distributions

  library(maps)
  library(mapdata)

  map('worldHires', resolution=0, xlim=c(95,151), 
      ylim=c(-13,9), lwd=1.7)     # fill=T, col="gray" ...
  map.axes()

  locnx <- c(110,111,112)
  locny <- c(0,-1,-2)
  points(locnx,locny,pch=19)

Clustering

  library(cluster)
  data(flower)
  flower
  lapply(flower, class)
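
The flower data mix factors and numeric variables, so a general (Gower) dissimilarity is appropriate; a minimal sketch using daisy() and agnes() from the same package:

  flower.dist  <- daisy(flower)           # Gower dissimilarity for mixed data types
  flower.agnes <- agnes(flower.dist)      # agglomerative hierarchical clustering
  plot(flower.agnes, which.plots = 2)     # dendrogram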

Displaying phylogenies

  library(ape)
  TREE <- "((A:1,B:1):2,(C:3,D:4):2):3;"
  tree <- read.tree(text = TREE)
  plot(tree)

  plot(tree,type="cladogram")
  plot(tree,type="fan")
  plot(tree,type="unrooted")
  plot(tree,type="radial")
  plot(tree,use.edge.length=F)
  plot(tree, edge.color=c("red","orange","yellow",
                          "green","blue","purple"))

Ancestral character estimation

  data(bird.orders)
  x <- factor(c(rep(0, 5), rep(1, 18)))
  ans <- ace(x, bird.orders, type = "d")
  ans
  plot(bird.orders, type = "c", FALSE, label.offset = 1)
  co <- c("blue", "yellow")
  tiplabels(pch = 22, bg = co[as.numeric(x)], cex = 2, adj = 1)
  nodelabels(thermo = ans$lik.anc, piecol = co, cex = 0.75)

Phylogenetic inference

Neighbour-joining using ape

  library(ape)
  data(woodmouse)
  woodmouse
  trw <- nj(dist.dna(woodmouse))
  plot(trw)

Parsimony (on DNA data) using phangorn

  library(phangorn)
  data(Laurasiatherian)
  Laurasiatherian
  dm <- dist.hamming(Laurasiatherian)
  tree <- NJ(dm)
  plot(tree)
  parsimony(tree, Laurasiatherian)
  treeNNI <- optim.parsimony(tree, Laurasiatherian)
  plot(treeNNI)

Parsimony (on morphological data) using phangorn (from http://www.phytools.org/mpma/Exercise_2.1/)

  shark      <- c(1, 1, 0, 0, 0, 0, 0)
  goldfish   <- c(1, 1, 0, 0, 0, 0, 0)
  coelecanth <- c(1, 1, 0, 0, 0, 0, 0)
  iguana     <- c(1, 0, 0, 0, 0, 0, 1)
  robin      <- c(0, 0, 1, 0, 1, 0, 1)
  human      <- c(0, 0, 0, 1, 1, 0, 1)
  lemur      <- c(0, 0, 0, 1, 1, 0, 1)
  bat        <- c(0, 0, 1, 1, 1, 0, 1)
  pig        <- c(0, 0, 0, 1, 1, 1, 1)
  whale      <- c(0, 1, 0, 0, 1, 0, 0)
  cow        <- c(0, 0, 0, 1, 1, 1, 1)

  animals <- rbind(shark, goldfish, coelecanth, iguana, 
             robin, human, lemur, bat, pig, whale, cow)
  colnames(animals) <- c("scales","fins","wings","hair",
                         "warm-blood","hooves","legs")
  animals
  animals <- as.phyDat(animals, type = "USER", levels = c(0, 1))

  starting <- rtree(n = length(animals))
  starting$edge.length <- NULL
  starting$tip.label <- names(animals)
  plot(starting)
  tree <- optim.parsimony(starting, animals, rearrangements = "SPR")
  plot(tree)
  tree <- root(tree, outgroup = "shark", resolve.root = TRUE)
  plot(tree)

Spatial data

Here are some raster and vector data of Bogor: a satellite image (sat.tif), an elevation model (srtm.tif), and shapefiles of trees (trees/trees.shp) and roads (roads/roads.shp).

Analyse with:

  library(raster)
  library(maptools)

  sat <- raster("sat.tif", band = 1)
  dem <- raster("srtm.tif")

  projection(sat)
  projection(dem)

  plot(sat)
  plot(dem)
  plot(sat)

  trees <- readShapePoints("trees/trees.shp")
  roads <- readShapeLines("roads/roads.shp")

  plot(roads, add=T)
  plot(trees, add=T)

A very simple spatial query: what are the values of the red band (think ‘roofing’!) under the tree points, vs. the overall values?

  vals <- extract(sat,trees)
  hist(values(sat), freq=F, breaks=25, col = rgb(1,0,0,0.4))   # all red-band values
  hist(vals, freq=F, breaks=25, add=T, col = rgb(0,0,1,0.4))
  legend("topright", c("All", "Trees"),
         fill=c(rgb(1,0,0,0.4),rgb(0,0,1,0.4)))