# An Introduction to R
## Contents
* **Basic R usage**
* [Background](#bg)
* [Installation and usage](#install)
* [Language structure](#lang)
* [Getting help](#help)
* [Basic objects](#obj)
* [Reading in external data](#csv)
* [Data manipulation](#manip)
* **Basic statistics**
* [Data summaries](#sum)
* [Simple tests](#tests)
* [Linear models](#lm)
* [Randomization methods](#rand)
* **Community ecology and multivariate data**
* [Diversity](#div)
* [Composition](#comp)
* **Systematics**
* [Mapping distributions](#map)
* [Clustering](#clust)
* **Phylogenetics**
* [Displaying phylogenies](#phy)
* [Ancestral character estimation](#ace)
* [Phylogenetic inference](#pi)
* [Biogeography](#biog)
* **[Spatial data](#spat)**
* [Document history](#hist)
-------------------------------------------------------------------
## Basic R usage
### Background
Technically, R is an ‘object-based programming language optimized
for statistics.’ Practically, it’s the most powerful, free
statistical software available! It is the open-source
implementation of the S language developed at AT&T Bell
Laboratories by Rick Becker, John Chambers and Allan Wilks in
ca. 1975. Among its strengths are the ability to construct
complicated, repeatable analyses that pass raw text data all the
way through to publishable graphics. There are many contributed
packages (libraries) that can be imported to extend the analyses
(7,826 packages as of 2016-01-25!). R distributions, packages and
documentation can be found at .
You can keep up to date with some relevant available resources
(packages) by checking these pages:
* Ecology:
* Phylogenetics:
* Spatial:
This document is a brief introduction to R, by example. It is designed
to be run as a script itself: you can paste the isolated lines of
script into the console, in a sequence. Note that ‘`#`’ is the
‘comments’ symbol in R; everything following the `#` is ignored by the
interpreter.
### Installation and usage
Installation is platform dependent, and discussed in other documents.
In general, the base software is either installed as a unit (Windows,
Mac OS X or Linux), or built on-board with a compiler (UNIX). Packages
can be either downloaded as binaries (Windows or Mac OS X) or build
on-board from source (Linux). If the former, it is important to make
sure that the packages were build for the same version of R you have
installed (e.g. v 2.3 for Windows). Note that some packages require
minimum versions of the R distribution to work.
> _**Exercise**: Install R on your laptop, and add (at minimum) the
> packages ‘vegan’ and ‘ape.’ Install one package from within the GUI,
> using the ‘Add Packages’ menu item. Add another by finding the right
> zip file on CRAN and using the ‘Install package from zip’ menu item._
R is command-line driven. Commands (i.e., program statements) are
typed or pasted into the terminal window (or **R console**).
Complete scripts can be imported and run, and interactive sessions
can be saved for later editing and re-rerunning. The Linux version
of R is invoked by typing ‘R’ at the shell prompt, or from running
it inside `emacs` using ESS (highly recomended). The directory in
which R was invoked is the default directory for reading scripts
and data or outputting graphics. The Windows and Mac OS X versions
are started as usual applications, and the working directory can be
set via a menu item. Scripts can be run as batch jobs either from
within R (using `source()`), or from the unix shell using:
R CMD BATCH script.R outputfile.txt
The Mac and Windows installations include a basic ‘development
environment’ with console, graphics windows and editor. Another
slick, popular environment is [Rstudio](https://www.rstudio.com/).
There is also now an option to run the basic functionality of R as
a ‘point and click’ application via John Fox’s `Rcmdr` package.
This uses Tk/Tcl widgets to translate your pointing and clicking
into R commands. This may be useful, especially when trying to
remember the commands for simple actions, but should be considered
an add-on, rather than the base mode. This document deals with the
‘raw’ commands only. If you are using an R environment, make sure
you know how to i) type into the console, ii) type into the editor,
iii) push lines from editor to console (Ctrl-R), iv) set the home
directory, v) save the editor scripts, vi) save the R history, vii)
saving the R data (workspace).
### Language structure
R can be used like a basic calculator, with basic operations:
1
1+1
1/0
1:10
and functions:
sqrt(2)
rep(1,10)
However, more complex R statements are most commonly made up of
actions on (functions of) objects with the output assigned to
another object. E.g.:
new.object <- a.function(old.object)
Think of each statement as the operation you would do in a
‘point-and-click’ stats program (e.g., JMP, SYSTAT, SPSS). The
difference is that the output objects are easy to use again as the
input to another function. This ‘atomic’ approach (similar to the
philosophy behind UNIX) permits the easy construction of complicated
analyses and graphics. You are not limited by what the designers of
your stats package decided to include for analyses and graphics.
R is also a full programming language with control flow, so that
repetitive actions can be performed on objects (‘for’ loops) and
conditional statements included (‘if’). For more on the
programming aspects of R, check the on-board documentation. Note
that using loops in R is inefficient (slow) relative to operating
on an object as a whole, due to the design of the software. If you
can, find ways not to use loops.
### Getting help
Help is always available by locating the HTML pages installed with R,
by typing `help.start()`, or by typing `help(...)` e.g.:
help(cbind)
?cbind
help.start()
Demonstrations are also available for many packages:
demo(graphics)
Note that there is always an example at the end of each help page
that can be pasted directly into the console. For a more
comprehensive introduction see the on-board HTML page ‘An
Introduction to R.’ See also
and google ‘R tutorial.’
### Basic Objects
A symbol (in R, a name containing at least one letter) represents R
objects. You can use any name you want (except the names used for R
functions). It is good practice to use descriptive names in
complicated R scripts. An objects contents is assigned with the
`<-` operator (though `=` also works):
a <- 1
a # typing just the object’s name prints out the object
a = 1 # alternate syntax
a
a <- 1+1
a
a <- 3 # overwrite it
a
You can manage objects by listing them and deleting them:
ls()
b <- a
ls()
rm(a)
ls()
And you can see the command history with:
history()
R has various data ‘modes’ or data types: logical, character,
numeric, integer (and lists). When you assign a character to an
object it then has a different mode from a number:
is.character(b)
c <- d # error, there is no d
c <- "d"
c
is.character(c)
d <- "hello"
d
ls()
The most basic objects in R are vectors of 0 dimensions and
up. That is: i) single variables (0 dimensions), ii) a linear
sequence of variables (aka ‘vector;’ one dimension), then iii) a
matrix, or array (two dimensions), etc.
length(b)
f <- c(1, 2, 3) # c() means concatenate
f
length(f)
f <- c(1, 2,
4) # commands can span multiple lines
f
c(b, b, 4, 5)
f <- c(f, 25) # objects can appear on both right and left of the `<-`
f
f * 2 # operations on vectors are applied to each element
String vectors behave similarly:
g <- c( "a" , "b", "z" )
g
i <- "q"
i
g <- c(i, g)
Note that vector elements can be of only one type, and so, e.g.,
the integers are ‘coerced’ to character mode:
c(1, "a", 2, "b")
A _list_ can contain multiple data types, and is itself another
‘data mode.’ A non-atomic one. The output of many functions are
list objects. Parts of the list can be accessed by names (if
supplied), using a `$` syntax.
list(1, "a", 2, "b")
list(part1=1, part2="a", part3=c(2,1), part4= "b")
list(part1=1, part2="a", part3=c(2,1), part4= "b")$part4
Elements of a dimensional object are referenced by indices
(beginning at 1 for the first element)
g
g[2]
g[3]
Non-specified indexes include all unrestricted elements of the
object:
g[]
Dropping elements is done with negative values:
g[-2]
g[-21] # there is no 21st element
g[-1]
Indexing by a series of indices is also possible:
c(2,3)
g[ c(2,3) ]
As is indexing by a sequence (2:4 means all integer values between 2 and 4
inclusive):
g[2:4]
Subsets of vectors and arrays can be created by including logical
statements (vectors) as indices.
g != "a" # Logical vector: g elements not equal to "a"
g[ g != "a" ]
g[ (g != "a") ] # parentheses can sometimes help readability
Matrices or arrays are two dimensional objects.
v1 <- c(1,2,3)
v1
v2 <- c(11,12,13)
v2
cbind(v1, v2) # bind as columns
rbind(v1, v2) # bind as rows
m <- cbind(v1, v2)
m
For two dimensional objects the first index is the row number, the
second the column:
m[1,1]
m[1,2]
m[ ,2]
m[2, ]
Subsetting again: e.g., print all columns of object m, for the rows
in which v2 is less than 13:
m[,2] < 13
m[ (m[,2] < 13), ]
**Data frames** are a special form of arrays that can contain columns
with different types of data (they are actually a form of lists)
v1 # object names can contain numbers
v3 <- c("a", "b", "c")
v3
cbind(v1, v3) # coercion happens again here, but not here:
data.frame(v1, v3)
j.df <- data.frame(v1, v3)
j.df # as objects become more complex, it can
# be helpful to use a suffix (e.g., `.df`)
# to help remember the nature of the object.
See how the first column is the name of the rows (by default an
integer), and the first row is the name of the columns (by default
X and an integer). Column names can be set with:
colnames(j.df) <- c("size","plot")
j.df
j.df[ , 1]
j.df[ 1, ]
j.df$plot
j.df$plot[3]
An important feature of basic objects is their ‘class,’
particularly if they are factors or not (factors in the sense of
experimental independent variables). Factors have levels.
class(j.df$size)
Nope. However, a character column is also by default a factor, with
levels:
class(j.df$plot)
is.factor(j.df$plot)
levels(j.df$plot)
Non-factor vectors can be forced to be factors:
j.df$sizeAsFactor <- as.factor(j.df$size)
j.df
j.df$sizeAsFactor <- NULL # zap it!
# To zap many cols: df[,c("col1","col2")]<-list(NULL)
j.df
Subsetting data frames is a common activity.
j.df$size > 1
j.df[ (j.df$size > 1) , ]
j.df[ c(2,3) , ]
j.df$plot
j.df$plot != "b"
j.df [ j.df$plot != "b" , ]
### Reading in external data
The import function `read.table` creates a data frame from raw
text. Most data that you might want to read into R is of a data
frame (mixed variable types) type. The simplest way to get your
data out of a spreadsheet into R is to pass it through a plain text
file, using the ‘Save as CSV’ function. You can use any column
separator you want, but must tell read.table() what it is. For the
following, download the file: [mydata.txt](mydata.txt).
read.table("mydata.txt")
read.table("mydata.txt", sep="|")
read.table("mydata.txt", sep="|", header=TRUE)
df2 <- read.table("mydata.txt", sep="|", header=TRUE)
df2
head(df2) # handy! And these:
str(df2)
summary(df2)
attributes(df2)
dim(df2)
Checking factors (these can also be set in the `read.table` function):
is.factor(df2$moisture)
is.factor(df2$rep)
is.factor(df2$length)
### Data manipulation
At first, one of the most frustrating things about R is how
difficult it seems to manipulate and reshape data tables. Here are
a few tips (and see also
[here](http://camwebb.info/blog/2014-04-29/)). First, summarize
with a table:
table(df2$rep, df2$moisture)
tab1 <- table(df2$rep, df2$moisture)
# A 'table' object must be converted to matrix or DF for most other uses:
as.data.frame(tab1)
as.matrix(tab1)
as.data.frame.matrix(tab1)
Run a function across some factors:
tapply(df2$length, INDEX=df2$moisture, FUN=mean)
tapply(df2$length, INDEX=list(df2$moisture , df2$rep), FUN=mean)
# Unlike the table function, tapply creates a matrix
Another way of aggregating:
aggregate(df2$length, by=df2[,c("moisture","rep")], FUN=mean)
‘Stacking’ a table:
tab2 <- as.matrix(tab1)
tab2[1,2] <- 0
tab2[3,1] <- 0
tab2
data.frame(expand.grid(dimnames(tab2)), n = as.vector(tab2))
Joining information from two tables:
df3 <- data.frame(rep=c("a","b","c"), status=c("OK","OK","bust"))
df3
merge(df2, df3, by="rep")
# Alternatively, a version of VLOOKUP():
match(df2$rep, df3$rep)
df2$status <- df3$status[ match(df2$rep, df3$rep) ]
head(df2)
---------------------------------------------------------------------
## Basic statistics
### Data summaries
Create some random data, and find the mean:
rnorm(n=30, mean=1, sd=1)
mean(rnorm(n=30, mean=1, sd=1))
See how the sample mean approaches the population mean with larger samples:
mean(rnorm(n=300, mean=1, sd=1))
mean(rnorm(n=3000, mean=1, sd=1))
mean(rnorm(n=30000, mean=1, sd=1))
Other stats:
my.norm <- rnorm(n=300, mean=1, sd=1)
my.norm
sd(my.norm)
median(my.norm)
summary(my.norm)
Find the median another way:
sort(my.norm)
sort(my.norm)[150]
sort(my.norm)[151]
median(my.norm) # hmm... not the same. Ahh: median() is interpolated
Now some **graphics** (a histogram). This will open a new window, a graphics window:
hist(my.norm)
You can fiddle endlessly with the parameters to make your image
look good, but here are the basics:
hist(my.norm, main="My histogram", xlab="a continuous variable",
ylab="frequency", sub="subtitle", breaks=25)
Now an x-y plot. The form is `plot(x,y)`. Alternatively ‘~’ means ‘as
a function of’.
plot(df2$height ~ df2$fertility)
Separating the factors. First plot a blank frame:
plot(df2$fertility, df2$height, type="n")
Then add the points:
points(df2$fertility[df2$moisture=="DRY"],
df2$height[df2$moisture=="DRY"], col = "red")
points(df2$fertility[df2$moisture=="WET"],
df2$height[df2$moisture=="WET"], col = "blue")
The default graphics device is the screen. To capture a graphic, a
different device should be specified:
jpeg(filename = "histogram.jpg", width = 480, height = 480,
pointsize = 12, bg = "white")
Run the plot
hist(my.norm)
Switch off the device:
dev.off()
The file is made in the active directory. Other devices include
**pdf**, ps, png, xfig
### Simple tests
t-test:
x1 <- rnorm(n=100, mean=5, sd=2)
x2 <- rnorm(n=50, mean=6, sd=3)
t.test(x1,x2)
Alternative formulation:
x <- c(x1,x2)
group <- as.factor( c( rep(1,times=100), rep(2,times=50) ) )
t.test(x ~ group)
plot(x ~ group)
Note that the two ‘columns’ x and group do not have to belong to the
same object. They must have the same length though.
Non-parametric
wilcox.test(x1,x2, paired= FALSE)
### Linear models
Make some data and view it:
x <- rnorm(n=100, mean=5, sd=2)
y <- (x * 0.5) + rnorm(n=100, mean=2, sd=1)
plot(x,y, xlim=c(0,10), ylim=c(0,10))
`lm()` is the basic least-squares linear model function (i.e. ANOVA
and regression). The output of lm() is an object:
xy.lm <- lm(y ~ x)
summary(xy.lm)
Add the regression line:
abline(xy.lm, col="red")
Now some deviant (heteroscedastic) data of the same type
y <- (x * rnorm(n=100, mean=0.5, sd=0.25)) + 2
xy.lm <- lm(y ~ x)
plot(x,y, xlim=c(0,10), ylim=c(0,10))
abline(xy.lm, col="red")
Examine the residuals
plot(fitted(xy.lm), resid(xy.lm))
qqnorm(resid(xy.lm))
qqline(resid(xy.lm))
Test for normality of the residuals:
hist(resid(xy.lm))
shapiro.test(resid(xy.lm))
But we can still use a non-parametric test:
cor.test(x, y, method = "spearman", alternative="two.sided")
Now a more complicated model. The formula is `Y ~ X1 + X2 + X1:X1`
... two factors and an interaction term:
data.lm <- lm(height ~ fertility + moisture + fertility:moisture, data=data.df)
summary(data.lm)
The following two sections introduce more complex analyses.
### Randomization methods
Increasingly, modern statistical approaches use randomization and
re-sampling tests (bootstrapping) to deal with ugly distributions
and non-normal events. Here is a simple example:
We define a function which is the mean of the minimum distance between
the two closed pairs of observations:
meanmindist <- function(x){
z <- as.matrix(dist(x))
z[row(z)==col(z)] <- max(z) # recode the identities
mean( apply(z,1,min) )
}
We have a universe of possible observations (e.g. body sizes of
organisms in region)
all <- c(1, 2, 3, 3, 4, 4, 4, 4, 5, 5, 6, 7)
We have an observed set of 4 organisms, e.g. in a community:
obs <- c(1, 3, 5, 7)
Question: is the mean difference between organism size more than
expected by chance?
obs.md <- meanmindist(obs)
Let’s sample 1000 cases of 4 organisms:
exp.md <- numeric(1000) # create an empty vector of 1000 slots
for (i in 1:1000) {
exp.md[i] <- meanmindist( sample(all, size=4) )
}
View the result, with observed line in red:
hist(exp.md)
lines(c(obs.md,obs.md),c(0,par("usr")[4]),col="red")
Get the quantile (sort of ... note the effect of ties):
match(obs.md, sort(exp.md))
So the observed distances are greater than expected.
### Quitting R
Finally, to leave R, type:
q()
If you say yes to the saving data, your workspace will be saved as `.RData` and your history as `.Rhistory`.
-----------------------------------------------------------------
## Community ecology and multivariate data
Now for some basic community ecology analysis, using the amazing
vegan package of Jari Oksanen, Roeland Kindt, Pierre Legendre, Bob
O’Hara.
First, load the package, and another package we’ll need:
library(vegan)
### Diversity analysis
Let’s start with some diversity analyses. Data from the Barro
Colorado Island 50-ha plot. Each row is a separate 1-ha subplot.
Columns are species. This is the default table for vegan (tables
can be transposed with he t() function).
data(BCI)
head(BCI)
BCI[1:4,1:4]
BCI[1,] # first plot
BCI[1, (BCI[1,] != 0)] # non-zero species
length(BCI[1, (BCI[1,] != 0)]) # number of non-zero species
Look at the distribution of species abundances in the whole plot:
hist( log( apply(BCI, 2, sum) ) )
What is the comparative diversity of each subplot? There are many
diversity methods in vegan - let’s use the rarefaction method:
rarefy(BCI, 100)
rarefy(BCI, 10000) # error, cannot sample more individuals than there are
sort(rarefy(BCI, 100))
hist(rarefy(BCI, 200))
sort(rarefy(BCI, 200))
Let’s fit a series of rank-abundance curves to the 35th (least
diverse) and 23rd (most diverse) subplots. We’ll do these plots in
the same window:
par(mfrow=c(1,2))
plot(rad.lognormal(BCI[35,]))
plot(rad.lognormal(BCI[23,]))
par(mfrow=c(1,1)) # off again
And diversity indices for these plots
diversity(BCI[35,],index="shannon")
diversity(BCI[23,],index="shannon")
### Composition analysis
Using the same data, let's compare species composition across the
50 subplots. First with a **clustergram**. Create a distance
matrix between each of the subplots, using the Bray-Curtis measure
of compositional dissimilarity: `d[jk] = (sum abs(x[ij]-x[ik])/(sum
(x[ij]+x[ik]))`.
vegdist(BCI, method="bray")
BCI.dist <- vegdist(BCI, method="bray")
plot(hclust(BCI.dist))
Now with an **ordination**. Let’s start with a Nonparametric
Multidimensional Scaling (a good start for unknown data).
bci.mds <- metaMDS(BCI)
plot(bci.mds)
This is a biplot - it has both sample units and species on it. A
variety of ways to plot the points:
plot(bci.mds, display="sites")
plot(bci.mds, type="t",display="sites")
plot(bci.mds, display="species")
plot(bci.mds, type="t",display="species")
plot(bci.mds, type="p",display="sites")
text(bci.mds, display="sites", adj=1.2)
Now let's work with a data set that has an associated environment
matrix: the built in dune meadow data.
data(dune)
dune
data(dune.env)
dune.env
Here’s a compact way to visualize compositional turnover in
community data. It uses the mds scores to order both species and
plots.
dune.mds <- metaMDS(dune)
vegemite(dune, dune.mds, "Hill", zero="-")
Now lets add some environmental info, using envfit. For factors,
it plots the centroid of each factor level, or the correlation
vector for continuous variables.
dune.envfit <- envfit(dune.mds, dune.env, perm = 1000)
Note that moisture and management are most strongly correlated with
the MDS scores. Let’s put these on the plot:
plot(dune.mds)
plot(dune.envfit) #, p.max = 0.005, col = "blue")
There was only one continuous variable, A1. See it’s correlation
with the MDS axes:
scores(dune.envfit, "vectors")
Now for something really cool. Join the plots that have the same
level of Moisture to the moisture centroid position
ordispider(dune.mds, dune.env$Moisture, col="skyblue")
Now look at a cluster plot, using Moisture regime as the label:
dune.dist <- vegdist(dune, method="bray")
plot(hclust(dune.dist,method="ward"), labels=dune.env$Moisture)
How different are the plots on under different management regimes?
Let’s use a permutation test to look at the variation between and
among plots:
dune.ano <- anosim(dune.dist, dune.env$Management)
summary(dune.ano)
plot(dune.ano)
Finally, lets do a constrained correspondence analysis
(a.k.a. canonical correspondence analysis). In this analysis the
environmental scores are directly included in the ordination. From
the manual page: “CCA does not try to display all variation in the
data, but only the part that can be explained by the used
constraints. Consequently, the results are strongly dependent on
the set of constraints and their transformations or interactions
among the constraints.”
dune.cca <- cca(dune ~ Moisture + Management, data=dune.env)
plot(dune.cca)
The vectors Moisture.L, etc result from treating the ordered factor
Moisture as a continuous variable.
Putting CCA and clustering together:
plot(dune.cca, type = "n")
text(dune.cca, display="sites", col="red", cex=1.2)
text(dune.cca, display="cn", arrow = 2, col="blue")
ordicluster(dune.cca, hclust(vegdist(dune)), prune=3, col = "black")
Wow!
Finally, another class of composition analysis is the search for
**nestedness** within samples:
data(sipoo)
head(sipoo)
plot(nestedtemp(sipoo))
----------------------------------------------------------------------
## Tools for systematists
### Mapping distributions
Making hundreds of repetitive distribution maps is easy in R.
library(maps)
library(mapdata)
map('worldHires', resolution=0, xlim=c(95,151),
ylim=c(-13,9), lwd=1.7) # fill=T, col="gray" ...
map.axes()
locnx <- c(110,111,112)
locny <- c(0,-1,-2)
points(locnx,locny,pch=19)
### Clustering
If you have a matrix of morphological characters for a bunch of
taxa, you may want to analyse the similarity in a
_non_-phylogenetic fashion. The `daisy` distance matrix tool allows
for multiple data types in its calculation of distance.
library(cluster)
data(flower)
head(flower)
str(flower)
lapply(flower, class) # apply class() over the data frame (=list) columns
The data are:
* "V1" (factor), winters, is binary and indicates whether the plant may be
left in the garden when it freezes.
* "V2" (factor), shadow, is binary and shows whether the plant needs
to stand in the shadow.
* "V3" (factor), tubers, is asymmetric binary and distinguishes
between plants with tubers and plants that grow in any other way.
* "V4" (factor), color, is nominal and specifies the flower's color
(1 = white, 2 = yellow, 3 = pink, 4 = red, 5 = blue).
* "V5" (ordered), soil, is ordinal and indicates whether the plant
grows in dry (1), normal (2), or wet (3) soil.
* "V6" (ordered), preference, is ordinal and gives someone's
preference ranking going from 1 to 18.
* "V7" (numeric), height, is interval scaled, the plant's height in
centimeters.
* "V8" (numeric), distance, is interval scaled, the distance in
centimeters that should be left between the plants.
Remember: conclusions based on any multivariate analysis are highly
dependent on the choice of methods.
summary(daisy(flower))
plot(hclust(daisy(flower)))
summary(daisy(flower, type = list(asymm = c("V1", "V3"), symm= 2,
ordratio= 7, logratio= 8)))
flower.dist <- daisy(flower, type = list(asymm = c("V1", "V3"), symm= 2,
ordratio= 7, logratio= 8))
plot(hclust(flower.dist))
plot(hclust(flower.dist, method="average"))
---------------------------------------------------------------------
## Phylogenetics
### Displaying phylogenies
library(ape)
TREE <- "((A:1,B:1):2,(C:3,D:4):2):3;"
tree <- read.tree(text = TREE)
plot(tree)
plot(tree,type="cladogram")
plot(tree,type="fan")
plot(tree,type="unrooted")
plot(tree,type="radial")
plot(tree,use.edge.length=F)
plot(tree, edge.color=c("red","orange","yellow",
"green","blue","purple"))
### Ancestral character estimation
data(bird.orders)
x <- factor(c(rep(0, 5), rep(1, 18)))
ans <- ace(x, bird.orders, type = "d")
ans
plot(bird.orders, type = "c", FALSE, label.offset = 1)
co <- c("blue", "yellow")
tiplabels(pch = 22, bg = co[as.numeric(x)], cex = 2, adj = 1)
nodelabels(thermo = ans$lik.anc, piecol = co, cex = 0.75)
### Phylogenetic inference
Neighbour-joining using `ape`. Remember, UPGMA and NJ are
clustering methods. They are ‘phenetic,’ and not ‘cladistic’
(cf. methodological wars of the seventies).
library(ape)
data(woodmouse)
woodmouse
trw <- nj(dist.dna(woodmouse))
plot(trw)
Parsimony (on DNA data) using `phangorn`
library(phangorn)
data(Laurasiatherian)
Laurasiatherian
dm = dist.hamming(Laurasiatherian)
tree = NJ(dm)
plot(tree)
parsimony(tree, Laurasiatherian)
treeNNI <- optim.parsimony(tree, Laurasiatherian)
plot(treeNNI)
Parsimony (on morphological data) using `phangorn` (from
)
shark <- c(1, 1, 0, 0, 0, 0, 0)
goldfish <- c(1, 1, 0, 0, 0, 0, 0)
coelecanth <- c(1, 1, 0, 0, 0, 0, 0)
iguana <- c(1, 0, 0, 0, 0, 0, 1)
robin <- c(0, 0, 1, 0, 1, 0, 1)
human <- c(0, 0, 0, 1, 1, 0, 1)
lemur <- c(0, 0, 0, 1, 1, 0, 1)
bat <- c(0, 0, 1, 1, 1, 0, 1)
pig <- c(0, 0, 0, 1, 1, 1, 1)
whale <- c(0, 1, 0, 0, 1, 0, 0)
cow <- c(0, 0, 0, 1, 1, 1, 1)
animals <- rbind(shark, goldfish, coelecanth, iguana,
robin, human, lemur, bat, pig, whale, cow)
colnames(animals) <- c("scales","fins","wings","hair",
"warm-blood","hooves","legs")
animals
animals <- as.phyDat(animals, type = "USER", levels = c(0, 1))
starting <- rtree(n = length(animals))
starting$edge.length <- NULL
starting$tip.label <- names(animals)
plot(starting)
tree <- optim.parsimony(starting, animals, rearrangements = "SPR")
plot(tree)
tree <- root(tree, outgroup = "shark", resolve.root = TRUE)
plot(tree)
### Biogeography
The package `BioGeoBEARS` is a powerful and rapidly evolving
package that implements a variety of models for discrete
biogeography. It is “designed to perform inference of biogeographic
history on phylogenies, and also model testing and model choice of
the many different possible models of how biogeography may evolve
on a phylogeny (dispersal, vicariance, founder-event speciation,
DEC, DIVA, BAYAREA, etc.).” See
for full details. It
requires many dependencies, and must load (`source()`) many online
scripts each time you run it, so may cause you some difficulty.
I have modified the
[example](http://phylo.wikidot.com/biogeobears#toc22) given on the
main site to perform a sinlge, simple analysis of the Hawaiian
_Psychotria_ example. The script is [here](biog.R), with data here
([psychotria_geog.phy](psychotria_geog.phy),
[psychotria.new](psychotria.new)).
If you would rather use **LaGrange** externally, but use R to
display the phylogenies, here is a simple script, acting on a
[sample](psychotria.results.txt) Lagrange output. A more
comprehensive (pretty but more complex) script (by Vijay Patil) is
available from Steffi.
library(ape)
infile <- readLines("psychotria.results.txt")
# the tree, with labelled nodes is in line 7
tr <- read.tree(text=infile[7])
# Find the extant distributions by looking for '---+':
taxa <- gsub("^[^[]*\\[([A-Z])\\] ([A-Za-z0-9_]+) *$", "\\2",
infile[grep("---\\+", infile)])
distrib <- gsub("^[^[]*\\[([A-Z])\\] (.*)$", "\\1",
infile[grep("---\\+", infile)])
# Find the (most likely) splits info by looking for 'At node'
nodelines <- grep("^At node", infile)
nodelabl <- gsub(":", "", substr(infile[nodelines],9,100))
# flip the splits to better be viewed on a sideways tree
split <- gsub("\\[([A-Z]+)\\|([A-Z]+)\\]\ *", "\\2|\\1",
substr(infile[nodelines+2],4,10))
# plot it, making room for tip info
plot(tr, label.offset=0.5)
# add the node labels
nodelabels(split[match(nodelabl, tr$node.label)])
# and the extant distribution
tiplabels(distrib[match(taxa, tr$tip.label)])
---------------------------------------------------------------------
## Spatial data
Here are some raster and vector data of Bogor:
* [sat.tif](img/sat.tif): 3-band satellite image (from the NASA MrSID
project, now at )
* [srtm.tif](img/srtm.tif): 90-m SRTM digital elevation model (from
)
* [trees.zip](img/trees.zip): Some trees in Bogor (extracted from
OpenStreetMap)
* [roads.zip](img/roads.zip): Some roads in Bogor (extracted from
OpenStreetMap)
Analyse with:
library(raster)
library(maptools)
sat <- raster("sat.tif", band = 1)
dem <- raster("srtm.tif")
projection(sat)
projection(dem)
plot(sat)
plot(dem)
plot(sat)
trees <- readShapePoints("trees/trees.shp")
roads <- readShapeLines("roads/roads.shp")
plot(roads, add=T)
plot(trees, add=T)
A very simple spatial query: what are the values of the red band
(think ‘roofing!’) under the tree points, vs. the overall values.
vals <- extract(sat,trees)
hist(bogor, freq=F, breaks=25, col = rgb(1,0,0,0.4))
hist(vals, freq=F, breaks=25, add=T, col = rgb(0,0,1,0.4))
legend("topright", c("All", "Trees"),
fill=c(rgb(1,0,0,0.4),rgb(0,0,1,0.4)))
----------------------------------------------------------------
## Document history
_short URL: _
* (v3) Modified for an R workshop at Univ. Alaska Fairbanks
2016-03-07.
* (v2) [Presention](index_bogor.html) to Herbarium Bogoriense staff,
2016-01-26 and 2016-01-27.
* (v1) ‘An introduction to R by example’
prepared for the Harvard Biodiversity of Borneo courses (2007-2010).
(This HTML document is generated from a [markdown](index.mdwn)
document, which is in turn generated from the [R script](intro.R)
using a simple `gawk` [script](r2md).)
----