An Introduction to R


Basic R usage


Technically, R is an ‘object-based programming language optimized for statistics.’ Practically, it’s the most powerful, free statistical software available! It is the open-source implementation of the S language developed at AT&T Bell Laboratories by Rick Becker, John Chambers and Allan Wilks in ca. 1975. Among its strengths are the ability to construct complicated, repeatable analyses that pass raw text data all the way through to publishable graphics. There are many contributed packages (libraries) that can be imported to extend the analyses (7,826 packages as of 2016-01-25!). R distributions, packages and documentation can be found at

You can keep up to date with some relevant available resources (packages) by checking these pages:

This document is a brief introduction to R, by example. It is designed to be run as a script itself: you can paste the isolated lines of script into the console, in a sequence. Note that ‘#’ is the ‘comments’ symbol in R; everything following the # is ignored by the interpreter.

Installation and usage

Installation is platform dependent, and discussed in other documents. In general, the base software is either installed as a unit (Windows, Mac OS X or Linux), or built on-board with a compiler (UNIX). Packages can be either downloaded as binaries (Windows or Mac OS X) or build on-board from source (Linux). If the former, it is important to make sure that the packages were build for the same version of R you have installed (e.g. v 2.3 for Windows). Note that some packages require minimum versions of the R distribution to work.

Exercise: Install R on your laptop, and add (at minimum) the packages ‘vegan’ and ‘ape.’ Install one package from within the GUI, using the ‘Add Packages’ menu item. Add another by finding the right zip file on CRAN and using the ‘Install package from zip’ menu item.

R is command-line driven. Commands (i.e., program statements) are typed or pasted into the terminal window (or R console). Complete scripts can be imported and run, and interactive sessions can be saved for later editing and re-rerunning. The Linux version of R is invoked by typing ‘R’ at the shell prompt, or from running it inside emacs using ESS (highly recomended). The directory in which R was invoked is the default directory for reading scripts and data or outputting graphics. The Windows and Mac OS X versions are started as usual applications, and the working directory can be set via a menu item. Scripts can be run as batch jobs either from within R (using source()), or from the unix shell using:

  R CMD BATCH script.R outputfile.txt

The Mac and Windows installations include a basic ‘development environment’ with console, graphics windows and editor. Another slick, popular environment is Rstudio. There is also now an option to run the basic functionality of R as a ‘point and click’ application via John Fox’s Rcmdr package. This uses Tk/Tcl widgets to translate your pointing and clicking into R commands. This may be useful, especially when trying to remember the commands for simple actions, but should be considered an add-on, rather than the base mode. This document deals with the ‘raw’ commands only. If you are using an R environment, make sure you know how to i) type into the console, ii) type into the editor, iii) push lines from editor to console (Ctrl-R), iv) set the home directory, v) save the editor scripts, vi) save the R history, vii) saving the R data (workspace).

Language structure

R can be used like a basic calculator, with basic operations:


and functions:


However, more complex R statements are most commonly made up of actions on (functions of) objects with the output assigned to another object. E.g.:

  new.object <- a.function(old.object)

Think of each statement as the operation you would do in a ‘point-and-click’ stats program (e.g., JMP, SYSTAT, SPSS). The difference is that the output objects are easy to use again as the input to another function. This ‘atomic’ approach (similar to the philosophy behind UNIX) permits the easy construction of complicated analyses and graphics. You are not limited by what the designers of your stats package decided to include for analyses and graphics.

R is also a full programming language with control flow, so that repetitive actions can be performed on objects (‘for’ loops) and conditional statements included (‘if’). For more on the programming aspects of R, check the on-board documentation. Note that using loops in R is inefficient (slow) relative to operating on an object as a whole, due to the design of the software. If you can, find ways not to use loops.

Getting help

Help is always available by locating the HTML pages installed with R, by typing help.start(), or by typing help(...) e.g.:


Demonstrations are also available for many packages:


Note that there is always an example at the end of each help page that can be pasted directly into the console. For a more comprehensive introduction see the on-board HTML page ‘An Introduction to R.’ See also and google ‘R tutorial.’

Basic Objects

A symbol (in R, a name containing at least one letter) represents R objects. You can use any name you want (except the names used for R functions). It is good practice to use descriptive names in complicated R scripts. An objects contents is assigned with the <- operator (though = also works):

  a <- 1
  a      # typing just the object’s name prints out the object 
  a = 1  # alternate syntax
  a <- 1+1
  a <- 3  # overwrite it

You can manage objects by listing them and deleting them:

  b <- a

And you can see the command history with:


R has various data ‘modes’ or data types: logical, character, numeric, integer (and lists). When you assign a character to an object it then has a different mode from a number:

  c <- d  # error, there is no d
  c <- "d"
  d <- "hello"

The most basic objects in R are vectors of 0 dimensions and up. That is: i) single variables (0 dimensions), ii) a linear sequence of variables (aka ‘vector;’ one dimension), then iii) a matrix, or array (two dimensions), etc.

  f <- c(1, 2, 3)  # c() means concatenate
  f <- c(1, 2,
        4)         # commands can span multiple lines 
  c(b, b, 4, 5)
  f <- c(f, 25)    # objects can appear on both right and left of the `<-`
  f * 2            # operations on vectors are applied to each element

String vectors behave similarly:

  g <- c( "a" , "b", "z" )
  i <- "q"
  g <- c(i, g)

Note that vector elements can be of only one type, and so, e.g., the integers are ‘coerced’ to character mode:

  c(1, "a", 2, "b")

A list can contain multiple data types, and is itself another ‘data mode.’ A non-atomic one. The output of many functions are list objects. Parts of the list can be accessed by names (if supplied), using a $ syntax.

  list(1, "a", 2, "b")
  list(part1=1, part2="a", part3=c(2,1), part4= "b")
  list(part1=1, part2="a", part3=c(2,1), part4= "b")$part4

Elements of a dimensional object are referenced by indices (beginning at 1 for the first element)


Non-specified indexes include all unrestricted elements of the object:


Dropping elements is done with negative values:

  g[-21]  # there is no 21st element

Indexing by a series of indices is also possible:

  g[ c(2,3) ]

As is indexing by a sequence (2:4 means all integer values between 2 and 4 inclusive):


Subsets of vectors and arrays can be created by including logical statements (vectors) as indices.

  g != "a"               # Logical vector: g elements not equal to "a"
  g[ g != "a" ]
  g[ (g != "a") ]        # parentheses can sometimes help readability

Matrices or arrays are two dimensional objects.

  v1 <- c(1,2,3)
  v2 <- c(11,12,13)
  cbind(v1, v2)      # bind as columns
  rbind(v1, v2)      # bind as rows
  m <- cbind(v1, v2)

For two dimensional objects the first index is the row number, the second the column:

  m[ ,2]
  m[2, ]

Subsetting again: e.g., print all columns of object m, for the rows in which v2 is less than 13:

  m[,2] < 13
  m[ (m[,2] < 13), ]

Data frames are a special form of arrays that can contain columns with different types of data (they are actually a form of lists)

  v1                         # object names can contain numbers
  v3 <- c("a", "b", "c")
  cbind(v1, v3)              # coercion happens again here, but not here:
  data.frame(v1, v3)
  j.df <- data.frame(v1, v3) 
  j.df                       # as objects become more complex, it can
                             # be helpful to use a suffix (e.g., `.df`)
                             # to help remember the nature of the object.

See how the first column is the name of the rows (by default an integer), and the first row is the name of the columns (by default X and an integer). Column names can be set with:

  colnames(j.df) <- c("size","plot")
  j.df[ , 1]
  j.df[ 1, ]

An important feature of basic objects is their ‘class,’ particularly if they are factors or not (factors in the sense of experimental independent variables). Factors have levels.


Nope. However, a character column is also by default a factor, with levels:


Non-factor vectors can be forced to be factors:

  j.df$sizeAsFactor <- as.factor(j.df$size)
  j.df$sizeAsFactor <- NULL # zap it!
                            # To zap many cols: df[,c("col1","col2")]<-list(NULL)

Subsetting data frames is a common activity.

  j.df$size > 1
  j.df[ (j.df$size > 1) , ]
  j.df[ c(2,3) , ]
  j.df$plot != "b"
  j.df [ j.df$plot != "b" , ]

Reading in external data

The import function read.table creates a data frame from raw text. Most data that you might want to read into R is of a data frame (mixed variable types) type. The simplest way to get your data out of a spreadsheet into R is to pass it through a plain text file, using the ‘Save as CSV’ function. You can use any column separator you want, but must tell read.table() what it is. For the following, download the file: mydata.txt.

  read.table("mydata.txt", sep="|")
  read.table("mydata.txt", sep="|", header=TRUE)
  df2 <- read.table("mydata.txt", sep="|", header=TRUE)
  head(df2) # handy! And these:

Checking factors (these can also be set in the read.table function):


Data manipulation

At first, one of the most frustrating things about R is how difficult it seems to manipulate and reshape data tables. Here are a few tips (and see also here). First, summarize with a table:

  table(df2$rep, df2$moisture)  
  tab1 <- table(df2$rep, df2$moisture)
  # A 'table' object must be converted to matrix or DF for most other uses:

Run a function across some factors:

  tapply(df2$length, INDEX=df2$moisture, FUN=mean)
  tapply(df2$length, INDEX=list(df2$moisture , df2$rep), FUN=mean)
  # Unlike the table function, tapply creates a matrix

Another way of aggregating:

  aggregate(df2$length,  by=df2[,c("moisture","rep")], FUN=mean)

‘Stacking’ a table:

  tab2 <- as.matrix(tab1)
  tab2[1,2] <- 0
  tab2[3,1] <- 0
  data.frame(expand.grid(dimnames(tab2)), n = as.vector(tab2))

Joining information from two tables:

  df3 <- data.frame(rep=c("a","b","c"), status=c("OK","OK","bust"))
  merge(df2, df3, by="rep")
  # Alternatively, a version of VLOOKUP():
  match(df2$rep, df3$rep)
  df2$status <- df3$status[ match(df2$rep, df3$rep) ]

Basic statistics

Data summaries

Create some random data, and find the mean:

  rnorm(n=30, mean=1, sd=1)
  mean(rnorm(n=30,    mean=1, sd=1))

See how the sample mean approaches the population mean with larger samples:

  mean(rnorm(n=300,   mean=1, sd=1))
  mean(rnorm(n=3000,  mean=1, sd=1))
  mean(rnorm(n=30000, mean=1, sd=1))

Other stats:

  my.norm <- rnorm(n=300, mean=1, sd=1)

Find the median another way:

  median(my.norm) # hmm... not the same. Ahh: median() is interpolated

Now some graphics (a histogram). This will open a new window, a graphics window:


You can fiddle endlessly with the parameters to make your image look good, but here are the basics:

  hist(my.norm, main="My histogram", xlab="a continuous variable",
       ylab="frequency", sub="subtitle", breaks=25)

Now an x-y plot. The form is plot(x,y). Alternatively ‘~’ means ‘as a function of’.

  plot(df2$height ~ df2$fertility)

Separating the factors. First plot a blank frame:

  plot(df2$fertility, df2$height, type="n")

Then add the points:

         df2$height[df2$moisture=="DRY"], col = "red")
         df2$height[df2$moisture=="WET"], col = "blue")

The default graphics device is the screen. To capture a graphic, a different device should be specified:

  jpeg(filename = "histogram.jpg", width = 480, height = 480,
       pointsize = 12, bg = "white")

Run the plot


Switch off the device:

The file is made in the active directory. Other devices include pdf, ps, png, xfig

Simple tests


  x1 <- rnorm(n=100, mean=5, sd=2)
  x2 <- rnorm(n=50, mean=6, sd=3)

Alternative formulation:

  x <- c(x1,x2)
  group <- as.factor( c( rep(1,times=100), rep(2,times=50) ) )
  t.test(x ~ group)
  plot(x ~ group)

Note that the two ‘columns’ x and group do not have to belong to the same object. They must have the same length though.


  wilcox.test(x1,x2, paired= FALSE)

Linear models

Make some data and view it:

  x <- rnorm(n=100, mean=5, sd=2)
  y <- (x * 0.5) + rnorm(n=100, mean=2, sd=1)
  plot(x,y, xlim=c(0,10), ylim=c(0,10))

lm() is the basic least-squares linear model function (i.e. ANOVA and regression). The output of lm() is an object:

  xy.lm <- lm(y ~ x)

Add the regression line:

  abline(xy.lm, col="red")

Now some deviant (heteroscedastic) data of the same type

  y <- (x * rnorm(n=100, mean=0.5, sd=0.25)) + 2
  xy.lm <- lm(y ~ x)
  plot(x,y, xlim=c(0,10), ylim=c(0,10))
  abline(xy.lm, col="red")

Examine the residuals

  plot(fitted(xy.lm), resid(xy.lm))

Test for normality of the residuals:


But we can still use a non-parametric test:

  cor.test(x, y, method = "spearman", alternative="two.sided")

Now a more complicated model. The formula is Y ~ X1 + X2 + X1:X1 ... two factors and an interaction term:

  data.lm <- lm(height ~ fertility + moisture + fertility:moisture, data=data.df)

The following two sections introduce more complex analyses.

Randomization methods

Increasingly, modern statistical approaches use randomization and re-sampling tests (bootstrapping) to deal with ugly distributions and non-normal events. Here is a simple example:

We define a function which is the mean of the minimum distance between the two closed pairs of observations:

  meanmindist <- function(x){
    z <- as.matrix(dist(x))
    z[row(z)==col(z)] <- max(z) # recode the identities
    mean( apply(z,1,min) )

We have a universe of possible observations (e.g. body sizes of organisms in region)

  all <- c(1, 2, 3, 3, 4, 4, 4, 4, 5, 5, 6, 7)

We have an observed set of 4 organisms, e.g. in a community:

  obs <- c(1, 3, 5, 7)

Question: is the mean difference between organism size more than expected by chance? <- meanmindist(obs)

Let’s sample 1000 cases of 4 organisms: <- numeric(1000) # create an empty vector of 1000 slots
  for (i in 1:1000) {[i] <- meanmindist( sample(all, size=4) )

View the result, with observed line in red:


Get the quantile (sort of ... note the effect of ties):

  match(, sort(

So the observed distances are greater than expected.

Quitting R

Finally, to leave R, type:


If you say yes to the saving data, your workspace will be saved as .RData and your history as .Rhistory.

Community ecology and multivariate data

Now for some basic community ecology analysis, using the amazing vegan package of Jari Oksanen, Roeland Kindt, Pierre Legendre, Bob O’Hara.

First, load the package, and another package we’ll need:


Diversity analysis

Let’s start with some diversity analyses. Data from the Barro Colorado Island 50-ha plot. Each row is a separate 1-ha subplot. Columns are species. This is the default table for vegan (tables can be transposed with he t() function).

  BCI[1,]                         # first plot
  BCI[1, (BCI[1,] != 0)]          # non-zero species
  length(BCI[1, (BCI[1,] != 0)])  # number of non-zero species

Look at the distribution of species abundances in the whole plot:

  hist( log( apply(BCI, 2, sum) ) )

What is the comparative diversity of each subplot? There are many diversity methods in vegan - let’s use the rarefaction method:

  rarefy(BCI, 100)
  rarefy(BCI, 10000)      # error, cannot sample more individuals than there are
  sort(rarefy(BCI, 100))
  hist(rarefy(BCI, 200))
  sort(rarefy(BCI, 200))

Let’s fit a series of rank-abundance curves to the 35th (least diverse) and 23rd (most diverse) subplots. We’ll do these plots in the same window:

  par(mfrow=c(1,1)) # off again

And diversity indices for these plots


Composition analysis

Using the same data, let's compare species composition across the 50 subplots. First with a clustergram. Create a distance matrix between each of the subplots, using the Bray-Curtis measure of compositional dissimilarity: d[jk] = (sum abs(x[ij]-x[ik])/(sum (x[ij]+x[ik])).

  vegdist(BCI, method="bray")
  BCI.dist <- vegdist(BCI, method="bray")

Now with an ordination. Let’s start with a Nonparametric Multidimensional Scaling (a good start for unknown data).

  bci.mds <- metaMDS(BCI)

This is a biplot - it has both sample units and species on it. A variety of ways to plot the points:

  plot(bci.mds, display="sites")
  plot(bci.mds, type="t",display="sites")
  plot(bci.mds, display="species")
  plot(bci.mds, type="t",display="species")
  plot(bci.mds, type="p",display="sites")
  text(bci.mds, display="sites", adj=1.2)

Now let's work with a data set that has an associated environment matrix: the built in dune meadow data.


Here’s a compact way to visualize compositional turnover in community data. It uses the mds scores to order both species and plots.

  dune.mds <- metaMDS(dune)
  vegemite(dune, dune.mds, "Hill", zero="-")

Now lets add some environmental info, using envfit. For factors, it plots the centroid of each factor level, or the correlation vector for continuous variables.

  dune.envfit <- envfit(dune.mds, dune.env, perm = 1000)

Note that moisture and management are most strongly correlated with the MDS scores. Let’s put these on the plot:

  plot(dune.envfit) #, p.max = 0.005, col = "blue")

There was only one continuous variable, A1. See it’s correlation with the MDS axes:

  scores(dune.envfit, "vectors")

Now for something really cool. Join the plots that have the same level of Moisture to the moisture centroid position

  ordispider(dune.mds, dune.env$Moisture, col="skyblue")

Now look at a cluster plot, using Moisture regime as the label:

  dune.dist <- vegdist(dune, method="bray")
  plot(hclust(dune.dist,method="ward"), labels=dune.env$Moisture)

How different are the plots on under different management regimes? Let’s use a permutation test to look at the variation between and among plots:

  dune.ano <- anosim(dune.dist, dune.env$Management)

Finally, lets do a constrained correspondence analysis (a.k.a. canonical correspondence analysis). In this analysis the environmental scores are directly included in the ordination. From the manual page: “CCA does not try to display all variation in the data, but only the part that can be explained by the used constraints. Consequently, the results are strongly dependent on the set of constraints and their transformations or interactions among the constraints.”

  dune.cca <- cca(dune ~ Moisture + Management, data=dune.env)

The vectors Moisture.L, etc result from treating the ordered factor Moisture as a continuous variable.

Putting CCA and clustering together:

  plot(dune.cca, type = "n")
  text(dune.cca, display="sites", col="red", cex=1.2)
  text(dune.cca, display="cn", arrow = 2, col="blue")
  ordicluster(dune.cca, hclust(vegdist(dune)), prune=3, col = "black")


Finally, another class of composition analysis is the search for nestedness within samples:


Tools for systematists

Mapping distributions

Making hundreds of repetitive distribution maps is easy in R.


  map('worldHires', resolution=0, xlim=c(95,151), 
      ylim=c(-13,9), lwd=1.7)     # fill=T, col="gray" ...

  locnx <- c(110,111,112)
  locny <- c(0,-1,-2)


If you have a matrix of morphological characters for a bunch of taxa, you may want to analyse the similarity in a non-phylogenetic fashion. The daisy distance matrix tool allows for multiple data types in its calculation of distance.

  lapply(flower, class)  # apply class() over the data frame (=list) columns

The data are:

Remember: conclusions based on any multivariate analysis are highly dependent on the choice of methods.

  summary(daisy(flower, type = list(asymm = c("V1", "V3"), symm= 2,
                  ordratio= 7, logratio= 8)))
  flower.dist <- daisy(flower, type = list(asymm = c("V1", "V3"), symm= 2,
                              ordratio= 7, logratio= 8))
  plot(hclust(flower.dist, method="average"))


Displaying phylogenies

  TREE <- "((A:1,B:1):2,(C:3,D:4):2):3;"
  tree <- read.tree(text = TREE)

  plot(tree, edge.color=c("red","orange","yellow",

Ancestral character estimation

  x <- factor(c(rep(0, 5), rep(1, 18)))
  ans <- ace(x, bird.orders, type = "d")
  plot(bird.orders, type = "c", FALSE, label.offset = 1)
  co <- c("blue", "yellow")
  tiplabels(pch = 22, bg = co[as.numeric(x)], cex = 2, adj = 1)
  nodelabels(thermo = ans$lik.anc, piecol = co, cex = 0.75)

Phylogenetic inference

Neighbour-joining using ape. Remember, UPGMA and NJ are clustering methods. They are ‘phenetic,’ and not ‘cladistic’ (cf. methodological wars of the seventies).

  trw <- nj(dist.dna(woodmouse))

Parsimony (on DNA data) using phangorn

  dm = dist.hamming(Laurasiatherian)
  tree = NJ(dm)
  parsimony(tree, Laurasiatherian)
  treeNNI <- optim.parsimony(tree, Laurasiatherian)

Parsimony (on morphological data) using phangorn (from

  shark      <- c(1, 1, 0, 0, 0, 0, 0)
  goldfish   <- c(1, 1, 0, 0, 0, 0, 0)
  coelecanth <- c(1, 1, 0, 0, 0, 0, 0)
  iguana     <- c(1, 0, 0, 0, 0, 0, 1)
  robin      <- c(0, 0, 1, 0, 1, 0, 1)
  human      <- c(0, 0, 0, 1, 1, 0, 1)
  lemur      <- c(0, 0, 0, 1, 1, 0, 1)
  bat        <- c(0, 0, 1, 1, 1, 0, 1)
  pig        <- c(0, 0, 0, 1, 1, 1, 1)
  whale      <- c(0, 1, 0, 0, 1, 0, 0)
  cow        <- c(0, 0, 0, 1, 1, 1, 1)

  animals <- rbind(shark, goldfish, coelecanth, iguana, 
                   robin, human, lemur, bat, pig, whale, cow)
  colnames(animals) <- c("scales","fins","wings","hair",
  animals <- as.phyDat(animals, type = "USER", levels = c(0, 1))

  starting <- rtree(n = length(animals))
  starting$edge.length <- NULL
  starting$tip.label <- names(animals)
  tree <- optim.parsimony(starting, animals, rearrangements = "SPR")
  tree <- root(tree, outgroup = "shark", resolve.root = TRUE)


The package BioGeoBEARS is a powerful and rapidly evolving package that implements a variety of models for discrete biogeography. It is “designed to perform inference of biogeographic history on phylogenies, and also model testing and model choice of the many different possible models of how biogeography may evolve on a phylogeny (dispersal, vicariance, founder-event speciation, DEC, DIVA, BAYAREA, etc.).” See for full details. It requires many dependencies, and must load (source()) many online scripts each time you run it, so may cause you some difficulty.

I have modified the example given on the main site to perform a sinlge, simple analysis of the Hawaiian Psychotria example. The script is here, with data here (psychotria_geog.phy,

If you would rather use LaGrange externally, but use R to display the phylogenies, here is a simple script, acting on a sample Lagrange output. A more comprehensive (pretty but more complex) script (by Vijay Patil) is available from Steffi.

  infile <- readLines("psychotria.results.txt")
  # the tree, with labelled nodes is in line 7
  tr <- read.tree(text=infile[7]) 

  # Find the extant distributions by looking for '---+':
  taxa   <- gsub("^[^[]*\\[([A-Z])\\] ([A-Za-z0-9_]+) *$", "\\2",
                 infile[grep("---\\+", infile)])
  distrib <- gsub("^[^[]*\\[([A-Z])\\] (.*)$", "\\1",
                 infile[grep("---\\+", infile)])

  # Find the (most likely) splits info by looking for 'At node'
  nodelines <- grep("^At node", infile)
  nodelabl <- gsub(":", "", substr(infile[nodelines],9,100))
  # flip the splits to better be viewed on a sideways tree
  split <- gsub("\\[([A-Z]+)\\|([A-Z]+)\\]\ *", "\\2|\\1",
  # plot it, making room for tip info
  plot(tr, label.offset=0.5)

  # add the node labels
  nodelabels(split[match(nodelabl, tr$node.label)])
  # and the extant distribution
  tiplabels(distrib[match(taxa, tr$tip.label)])

Spatial data

Here are some raster and vector data of Bogor:

Analyse with:


  sat <- raster("sat.tif", band = 1)
  dem <- raster("srtm.tif")



  trees <- readShapePoints("trees/trees.shp")
  roads <- readShapeLines("roads/roads.shp")

  plot(roads, add=T)
  plot(trees, add=T)

A very simple spatial query: what are the values of the red band (think ‘roofing!’) under the tree points, vs. the overall values.

  vals <- extract(sat,trees)
  hist(bogor, freq=F, breaks=25, col = rgb(1,0,0,0.4))
  hist(vals, freq=F, breaks=25, add=T, col = rgb(0,0,1,0.4))
  legend("topright", c("All", "Trees"),

Document history

short URL:

(This HTML document is generated from a markdown document, which is in turn generated from the R script using a simple gawk script.)