Presented to Herbarium Bogoriense staff, 2016-01-26 and 2016-01-27
(Modified from ‘An introduction to R by example’ that I first prepared for the Harvard Biodiversity of Borneo courses. This document is also accessible via short URL: http://is.gd/rbogor)
Technically, R is an ‘object-based programming language optimized for statistics.’ Practically, it’s the most powerful, free statistical software available! It is the open-source implementation of the S language developed at AT&T Bell Laboratories by Rick Becker, John Chambers and Allan Wilks. Among its strengths are the ability to construct complicated, repeatable analyses that pass raw text data all the way through to publishable graphics. There are many contributed packages (libraries) that can be imported to extend the analyses (7,826 packages as of 2016-01-25!). R distributions, packages and documentation can be found at http://cran.r-project.org.
You can keep up to date with some relevant available resources (packages) by checking these pages:
This document is a brief introduction to R, by example. It is designed
to be run as a script itself: you can paste the isolated lines of
script into the console, as a sequence. Note that ‘#
’ is the
‘comments’ symbol in R; everything following the #
is ignored by the
interpreter.
Help is always available by locating the HTML pages installed with R,
by typing help.start()
, or by typing help(...)
e.g.:
help(cbind)
in the R console. Note that there is always an example at the end of each help page that can be pasted directly into the console. For a more comprehensive introduction see the on-board HTML page ‘An Introduction to R.’ See also https://cran.r-project.org/manuals.html and google ‘R tutorial.’
Installation is platform dependent, and discussed in other documents. In general, the base software is either installed as a unit (Windows, Mac OS X or Linux), or built on-board with a compiler (UNIX). Packages can be either downloaded as binaries (Windows or Mac OS X) or build on-board from source (Linux). If the former, it is important to make sure that the packages were build for the same version of R you have installed (e.g. v 2.3 for Windows). Note that some packages require minimum versions of the R distribution to work.
Exercise: Install R on your laptop, and add (at minimum) the packages ‘vegan’ and ‘ape.’ Install one package from within the GUI, using the ‘Add Packages’ menu item. Add another by finding the right zip file on CRAN and using the ‘Install package from zip’ menu item.
R is command-line driven. Commands (i.e., program statements) are
typed or pasted into the terminal window (or R console). Complete
scripts can be imported and run, and interactive sessions can be saved
for later editing and re-rerunning. The Linux version of R is invoked
by typing ‘R’ at the shell prompt. The directory in which R was
invoked is the default directory for reading scripts and data or
outputting graphics. The Windows and Mac OS X versions are started as
usual applications, and the working directory can be set via a menu
item. Scripts can be run as batch jobs either from within R (using
source()
), or from the unix shell using:
R CMD BATCH script.R outputfile.txt
There is now an option to run the basic functionality of R as a ‘point
and click’ application via John Fox’s Rcmdr
package. This uses Tk/Tcl
widgets to translate your pointing and clicking into R commands. This
may be useful, especially when trying to remember the commands for
simple actions, but it should be considered an add-on, rather than the
base mode. This document deals with the ‘raw’ commands only.
R statements are most commonly made up of actions on (functions of) objects with the output assigned to another object. E.g.:
new.object <- a.function(old.object)
Think of each statement as the operation you would do in a ‘point-and-click’ stats program (e.g., JMP, SYSTAT, SPSS). The difference is that the output objects are easy to use again as the input to another function. This ‘atomic’ approach (similar to the philosophy behind UNIX) permits the easy construction of complicated analyses and graphics. You are not limited by what the designers of your stats package decided to include for analyses and graphics.
R is also a full programming language with control flow, so that repetitive actions can be performed on objects (‘for’ loops) and conditional statements included (‘if’). For more on the programming aspects of R, check the on-board documentation. Note that using loops in R is inefficient (slow) relative to operating on an object as a whole, due to the design of the software. If you can, find ways not to use loops.
The basic objects in R are vectors of 0 dimensions and up. That is:
single variables (0 dimensions), then a linear sequence of variables
(one dimension), then a matrix, or array (two dimensions), etc. Their
contents are assigned with the ‘<-’ operator (though =
also works):
a0 <- 1
a1 <- c(0,1,2) # c() means concatenate
b1 <- c("a", "b", "c")
a2 <- cbind( c(0,1,2), c(10,11,12) ) # cbind() means ‘bind into columns’
Entering an object on its own causes the contents to be listed:
a0
a1
b1
a2
Elements of an object are referenced by indexes (beginning at 1 for the first element)
b1[2]
For two dimensional objects the first index is the row number, the second the column:
a2[3,2]
Non-specified indexes include all unrestricted elements of the object:
a1[]
a2[2,]
a2[,2]
Indexing by sequence (1:10 means all integer values between 1 and 10 inclusive) is possible:
a1[1:3]
Dropping elements is done with negative values:
b1[-1]
Subsets are possible by including logical statements (vectors) as indices. E.g.: print all rows of the object a2 for which the second column of a2 is greater than 10 (spend some time understanding this one):
a2[(a2[,2]>10),]
a2[,2]>10 # this produces the underlying logical vector
Another: print all columns of the object a2 only for the rows for which the corresponding vale of b1 is "b":
a2[(b1=="b"),]
b1=="b"
Data frames are a special form of matrices that can contain columns with different types of data:
d <- data.frame( cbind( c(0,1,2), c("x", "y", "z") ) )
d
colnames(d) <- c("a_num", "myletter")
d
See how the first column is the name of the rows (by default an
integer), and the first row is the name of the columns (by default X
and an integer).
Exercise: Play around with variables, vectors, matrices and data frames for at least 10 minutes. Be sure that you can select components, as above.
Most data that you might want to read into R is of a data frame (mixed variable types) type. The import function read.table creates a data frame from raw text. (Download the file: mydata.txt).
data.df <- read.table("mydata.txt", sep="|", header= TRUE)
# The columns of data were separated by the | symbol
# The first row is the column names
data.df
attributes(data.df)
The simplest way to get your data out of a spreadsheet into R is to
pass it through a plain text file, using the ‘Save as CSV’
function. You can use any column separator you want, but must tell
read.table()
what it is.
Columns of a data frame are referenced by the ‘$’ symbol
data.df$rep
It’s a good idea to force factors into their correct form (this can also be done in the read.table function)
levels(data.df$moisture)
data.df$moisture <- as.factor(data.df$moisture)
attributes(data.df$moisture)
levels(data.df$rep)
data.df$rep <- as.factor(data.df$rep)
attributes(data.df$rep)
Exercise: Read in some of your own data, via a CSV file. Set the factors to be factors.
Means, variance, etc: Generate some random normal data with rnorm:
my.norm <- rnorm(n=30, mean=1, sd=1)
my.norm
mean(my.norm)
sd(my.norm)
Tabulations
table(data.df$moisture, data.df$rep)
Another type of tabulation command (very useful):
tapply(data.df$height, list(data.df$moisture, data.df$rep), FUN=length)
tapply(data.df$height, data.df$moisture, FUN=mean)
The first graphical command, a histogram. This will open a new window, a graphics window:
hist(my.norm)
You can fiddle endlessly with the parameters to make your image look good, but here are the basics:
hist(my.norm, main="My histogram", xlab="a continuous variable", sub="")
Now an x-y plot. The form is plot(x,y). Alternatively ‘~’ means ‘as a function of’.
plot(data.df$height ~ data.df$fertility)
Separating the factors. First plot a blank frame:
plot(data.df$fertility, data.df$height, type="n")
Then add the points:
points(data.df$fertility[data.df$moisture=="DRY"],
data.df$height[data.df$moisture=="DRY"], col = "red")
points(data.df$fertility[data.df$moisture=="WET"],
data.df$height[data.df$moisture=="WET"], col = "blue")
The default graphics device is the screen. To capture a graphic, a different device should be specified:
jpeg(filename = "histogram.jpg", width = 480, height = 480,
pointsize = 12, bg = "white")
Run the plot
hist(my.norm)
Switch off the device:
dev.off()
The file is made in the active directory. Other devices include pdf
,
ps
, png
, xfig
For a demo of the graphics capabilities of R, type:
demo(graphics)
t-test:
x1 <- rnorm(n=100, mean=5, sd=2)
x2 <- rnorm(n=50, mean=6, sd=3)
t.test(x1,x2)
Alternative formulation:
x <- c(x1,x2)
group <- as.factor( c( rep(1,times=100), rep(2,times=50) ) )
t.test(x ~ group)
plot(x ~ group)
Note that the two ‘columns’ x and group do not have to belong to the same object. They must have the same length though.
Non-parametric
wilcox.test(x1,x2, paired= FALSE)
you might want to clean up every now and again:
ls()
rm(x1, x2)
Make some data and view it:
x <- rnorm(n=100, mean=5, sd=2)
y <- (x * 0.5) + rnorm(n=100, mean=2, sd=1)
plot(x,y, xlim=c(0,10), ylim=c(0,10))
lm()
is the basic least-squares linear model function (i.e. ANOVA and regression). The output of lm() is an object:
xy.lm <- lm(y ~ x)
summary(xy.lm)
Add the regression line:
abline(xy.lm, col="red")
Now some deviant (heteroscedastic) data of the same type
y <- (x * rnorm(n=100, mean=0.5, sd=0.25)) + 2
xy.lm <- lm(y ~ x)
plot(x,y, xlim=c(0,10), ylim=c(0,10))
abline(xy.lm, col="red")
Examine the residuals
plot(fitted(xy.lm), resid(xy.lm))
qqnorm(resid(xy.lm))
qqline(resid(xy.lm))
Test for normality of the residuals:
hist(resid(xy.lm))
shapiro.test(resid(xy.lm))
But we can still use a non-parametric test:
cor.test(x, y, method = "spearman", alternative="two.sided")
Now a more complicated model. The formula is Y ~ X1 + X2 + X1:X1
... two factors and an interaction term:
data.lm <- lm(height ~ fertility + moisture + fertility:moisture, data=data.df)
summary(data.lm)
The following two sections introduce more complex analyses.
Increasingly, modern statistical approaches use randomization and re-sampling tests (bootstrapping) to deal with ugly distributions and non-normal events. Here is a simple example:
We define a function which is the mean of the minimum distance between the two closed pairs of observations:
meanmindist <- function(x){
z <- as.matrix(dist(x))
z[row(z)==col(z)] <- max(z) # recode the identities
mean( apply(z,1,min) )
}
We have a universe of possible observations (e.g. body sizes of organisms in region)
all <- c(1, 2, 3, 3, 4, 4, 4, 4, 5, 5, 6, 7)
We have an observed set of 4 organisms, e.g. in a community:
obs <- c(1, 3, 5, 7)
Question: is the mean difference between organism size more than expected by chance?
obs.md <- meanmindist(obs)
Let’s sample 1000 cases of 4 organisms:
exp.md <- numeric(1000) # create an empty vector of 1000 slots
for (i in 1:1000) {
exp.md[i] <- meanmindist( sample(all, size=4) )
}
View the result, with observed line in red:
hist(exp.md)
lines(c(obs.md,obs.md),c(0,par("usr")[4]),col="red")
Get the quantile (sort of ... note the effect of ties):
match(obs.md, sort(exp.md))
So the observed distances are greater than expected.
Finally, to leave R, type:
q()
Enjoy the world of R!
As our final example, we’ll do some basic community ecology analysis, using the amazing vegan package of Jari Oksanen, Roeland Kindt, Pierre Legendre, Bob O’Hara.
First, load the package, and another package we’ll need:
library(vegan)
library(MASS) ## for isoMDS
Let’s start with some diversity analyses. Data from the Barro Colorado Island 50-ha plot. Each row is a separate 1-ha subplot. Columns are species. This is the default table for vegan (tables can be transposed with he t() function). Look at the distribution of species abundances in the whole plot:
data(BCI)
BCI[1:4,1:4]
hist( log( apply(BCI, 2, sum) ) )
What is the comparative diversity of each subplot? There are many diversity methods in vegan - let’s use the rarefaction method:
rarefy(BCI, 100)
sort(rarefy(BCI, 100))
Let’s fit a series of rank-abundance curves to the 35th (least diverse) and 23rd (most diverse) subplots. We’ll do these plots in the same window:
par(mfrow=c(1,2))
plot(rad.veil(BCI[35,]))
plot(rad.veil(BCI[23,]))
par(mfrow=c(1,1)) # off again
Now some species composition analyses. Use the built in dune meadow dataset, with its associated environmental data dune.env
data(dune)
dune
data(dune.env)
dune.env
Create a distance matrix between each of the subplots, using the
Bray-Curtis measure of compositional dissimilarity: d[jk] = (sum
abs(x[ij]-x[ik])/(sum (x[ij]+x[ik]))
.
dune.dist <- vegdist(dune, method="bray")
Display a clust plot of these distances.
plot(hclust(dune.dist,method="ward"))
Now some ordinations. Let’s start with a Nonparametric Multidimensional Scaling (a good start for unknown data).
dune.mds <- metaMDS(dune)
plot(dune.mds, type="t")
This is a biplot - it has both sample units and species on it.
Here’s a compact way to visualize compositional turnover in community data. It uses the mds scores to order both species and plots.
vegemite(dune, dune.mds, "Hill", zero="-")
Now lets add some environmental info, using envfit. For factors, it plots the centroid of each factor level, or the correlation vector for continuous variables.
dune.envfit <- envfit(dune.mds, dune.env, perm = 1000)
Note that moisture and management are most strongly correlated with the MDS scores. Let’s put these on the plot:
plot(dune.envfit, p.max = 0.005, col = "blue")
There was only one continuous variable, A1. See it’s correlation with the MDS axes:
scores(dune.envfit, "vectors")
Now for something really cool. Join the plots that have the same level of Moisture to the moisture centroid position
ordispider(dune.mds, dune.env$Moisture, col="skyblue")
Now look at the cluster plot again, using Moisture regime as the label:
plot(hclust(dune.dist,method="ward"), labels=dune.env$Moisture)
How different are the plots on under different management regimes? Let’s use a permutation test to look at the variation between and among plots:
dune.ano <- anosim(dune.dist, dune.env$Management)
summary(dune.ano)
plot(dune.ano)
Finally, lets do a constrained correspondence analysis (a.k.a. canonical correspondence analysis). In this analysis the environmental scores are directly included in the ordination. From the manual page: “CCA does not try to display all variation in the data, but only the part that can be explained by the used constraints. Consequently, the results are strongly dependent on the set of constraints and their transformations or interactions among the constraints.”
dune.cca <- cca(dune ~ Moisture + Management, data=dune.env)
plot(dune.cca)
The vectors Moisture.L, etc result from treating the ordered factor Moisture as a continuous variable.
Putting CCA and clustering together:
plot(dune.cca, type = "n")
text(dune.cca, display="sites", col="red", cex=1.2)
text(dune.cca, display="cn", arrow = 2, col="blue")
ordicluster(dune.cca, hclust(vegdist(dune)), prune=3, col = "black")
Wow!
library(maps)
library(mapdata)
map('worldHires', resolution=0, xlim=c(95,151),
ylim=c(-13,9), lwd=1.7) # fill=T, col="gray" ...
map.axes()
locnx <- c(110,111,112)
locny <- c(0,-1,-2)
points(locnx,locny,pch=19)
library(cluster)
data(flower)
flower
lapply(flower, class)
"V8" (numeric), distance, is interval scaled, the distance in centimeters that should be left between the plants.
summary(dfl3 <- daisy(flower, type = list(asymm = c("V1", "V3"), symm= 2, ordratio= 7, logratio= 8))) plot(hclust(dfl3, method="average"))
library(ape)
TREE <- "((A:1,B:1):2,(C:3,D:4):2):3;"
tree <- read.tree(text = TREE)
plot(tree)
plot(tree,type="cladogram")
plot(tree,type="fan")
plot(tree,type="unrooted")
plot(tree,type="radial")
plot(tree,use.edge.length=F)
plot(tree, edge.color=c("red","orange","yellow",
"green","blue","purple"))
data(bird.orders)
x <- factor(c(rep(0, 5), rep(1, 18)))
ans <- ace(x, bird.orders, type = "d")
ans
plot(bird.orders, type = "c", FALSE, label.offset = 1)
co <- c("blue", "yellow")
tiplabels(pch = 22, bg = co[as.numeric(x)], cex = 2, adj = 1)
nodelabels(thermo = ans$lik.anc, piecol = co, cex = 0.75)
Neighbour-joining using ape
library(ape)
data(woodmouse)
woodmouse
trw <- nj(dist.dna(woodmouse))
plot(trw)
Parsimony (on DNA data) using phangorn
library(phangorn)
data(Laurasiatherian)
Laurasiatherian
dm = dist.hamming(Laurasiatherian)
tree = NJ(dm)
plot(tree)
parsimony(tree, Laurasiatherian)
treeNNI <- optim.parsimony(tree, Laurasiatherian)
plot(treeNNI)
Parsimony (on morphological data) using phangorn
(from
http://www.phytools.org/mpma/Exercise_2.1/)
shark <- c(1, 1, 0, 0, 0, 0, 0)
goldfish <- c(1, 1, 0, 0, 0, 0, 0)
coelecanth <- c(1, 1, 0, 0, 0, 0, 0)
iguana <- c(1, 0, 0, 0, 0, 0, 1)
robin <- c(0, 0, 1, 0, 1, 0, 1)
human <- c(0, 0, 0, 1, 1, 0, 1)
lemur <- c(0, 0, 0, 1, 1, 0, 1)
bat <- c(0, 0, 1, 1, 1, 0, 1)
pig <- c(0, 0, 0, 1, 1, 1, 1)
whale <- c(0, 1, 0, 0, 1, 0, 0)
cow <- c(0, 0, 0, 1, 1, 1, 1)
animals <- rbind(shark, goldfish, coelecanth, iguana,
robin, human, lemur, bat, pig, whale, cow)
colnames(animals) <- c("scales","fins","wings","hair",
"warm-blood","hooves","legs")
animals
animals <- as.phyDat(animals, type = "USER", levels = c(0, 1))
starting <- rtree(n = length(animals))
starting$edge.length <- NULL
starting$tip.label <- names(animals)
plot(starting)
tree <- optim.parsimony(starting, animals, rearrangements = "SPR")
plot(tree)
tree <- root(tree, outgroup = "shark", resolve.root = TRUE)
plot(tree)
Here are some raster and vector data of Bogor:
Analyse with:
library(raster)
library(maptools)
sat <- raster("sat.tif", band = 1)
dem <- raster("srtm.tif")
projection(sat)
projection(dem)
plot(sat)
plot(dem)
plot(sat)
trees <- readShapePoints("trees/trees.shp")
roads <- readShapeLines("roads/roads.shp")
plot(roads, add=T)
plot(trees, add=T)
A very simple spatial query: what are the values of the red band (think ‘roofing!’) under the tree points, vs. the overall values.
vals <- extract(sat,trees)
hist(bogor, freq=F, breaks=25, col = rgb(1,0,0,0.4))
hist(vals, freq=F, breaks=25, add=T, col = rgb(0,0,1,0.4))
legend("topright", c("All", "Trees"),
fill=c(rgb(1,0,0,0.4),rgb(0,0,1,0.4)))