Technically, R is an ‘object-based programming language optimized for statistics.’ Practically, it’s the most powerful, free statistical software available! It is the open-source implementation of the S language developed at AT&T Bell Laboratories by Rick Becker, John Chambers and Allan Wilks in ca. 1975. Among its strengths are the ability to construct complicated, repeatable analyses that pass raw text data all the way through to publishable graphics. There are many contributed packages (libraries) that can be imported to extend the analyses (7,826 packages as of 2016-01-25!). R distributions, packages and documentation can be found at http://cran.r-project.org.
You can keep up to date with some relevant available resources (packages) by checking these pages:
This document is a brief introduction to R, by example. It is designed
to be run as a script itself: you can paste the isolated lines of
script into the console, in a sequence. Note that ‘#
’ is the
‘comments’ symbol in R; everything following the #
is ignored by the
interpreter.
Installation is platform dependent, and discussed in other documents. In general, the base software is either installed as a unit (Windows, Mac OS X or Linux), or built on-board with a compiler (UNIX). Packages can be either downloaded as binaries (Windows or Mac OS X) or build on-board from source (Linux). If the former, it is important to make sure that the packages were build for the same version of R you have installed (e.g. v 2.3 for Windows). Note that some packages require minimum versions of the R distribution to work.
Exercise: Install R on your laptop, and add (at minimum) the packages ‘vegan’ and ‘ape.’ Install one package from within the GUI, using the ‘Add Packages’ menu item. Add another by finding the right zip file on CRAN and using the ‘Install package from zip’ menu item.
R is command-line driven. Commands (i.e., program statements) are
typed or pasted into the terminal window (or R console).
Complete scripts can be imported and run, and interactive sessions
can be saved for later editing and re-rerunning. The Linux version
of R is invoked by typing ‘R’ at the shell prompt, or from running
it inside emacs
using ESS (highly recomended). The directory in
which R was invoked is the default directory for reading scripts
and data or outputting graphics. The Windows and Mac OS X versions
are started as usual applications, and the working directory can be
set via a menu item. Scripts can be run as batch jobs either from
within R (using source()
), or from the unix shell using:
R CMD BATCH script.R outputfile.txt
The Mac and Windows installations include a basic ‘development
environment’ with console, graphics windows and editor. Another
slick, popular environment is Rstudio.
There is also now an option to run the basic functionality of R as
a ‘point and click’ application via John Fox’s Rcmdr
package.
This uses Tk/Tcl widgets to translate your pointing and clicking
into R commands. This may be useful, especially when trying to
remember the commands for simple actions, but should be considered
an add-on, rather than the base mode. This document deals with the
‘raw’ commands only. If you are using an R environment, make sure
you know how to i) type into the console, ii) type into the editor,
iii) push lines from editor to console (Ctrl-R), iv) set the home
directory, v) save the editor scripts, vi) save the R history, vii)
saving the R data (workspace).
R can be used like a basic calculator, with basic operations:
1
1+1
1/0
1:10
and functions:
sqrt(2)
rep(1,10)
However, more complex R statements are most commonly made up of actions on (functions of) objects with the output assigned to another object. E.g.:
new.object <- a.function(old.object)
Think of each statement as the operation you would do in a ‘point-and-click’ stats program (e.g., JMP, SYSTAT, SPSS). The difference is that the output objects are easy to use again as the input to another function. This ‘atomic’ approach (similar to the philosophy behind UNIX) permits the easy construction of complicated analyses and graphics. You are not limited by what the designers of your stats package decided to include for analyses and graphics.
R is also a full programming language with control flow, so that repetitive actions can be performed on objects (‘for’ loops) and conditional statements included (‘if’). For more on the programming aspects of R, check the on-board documentation. Note that using loops in R is inefficient (slow) relative to operating on an object as a whole, due to the design of the software. If you can, find ways not to use loops.
Help is always available by locating the HTML pages installed with R,
by typing help.start()
, or by typing help(...)
e.g.:
help(cbind)
?cbind
help.start()
Demonstrations are also available for many packages:
demo(graphics)
Note that there is always an example at the end of each help page that can be pasted directly into the console. For a more comprehensive introduction see the on-board HTML page ‘An Introduction to R.’ See also https://cran.r-project.org/manuals.html and google ‘R tutorial.’
A symbol (in R, a name containing at least one letter) represents R
objects. You can use any name you want (except the names used for R
functions). It is good practice to use descriptive names in
complicated R scripts. An objects contents is assigned with the
<-
operator (though =
also works):
a <- 1
a # typing just the object’s name prints out the object
a = 1 # alternate syntax
a
a <- 1+1
a
a <- 3 # overwrite it
a
You can manage objects by listing them and deleting them:
ls()
b <- a
ls()
rm(a)
ls()
And you can see the command history with:
history()
R has various data ‘modes’ or data types: logical, character, numeric, integer (and lists). When you assign a character to an object it then has a different mode from a number:
is.character(b)
c <- d # error, there is no d
c <- "d"
c
is.character(c)
d <- "hello"
d
ls()
The most basic objects in R are vectors of 0 dimensions and up. That is: i) single variables (0 dimensions), ii) a linear sequence of variables (aka ‘vector;’ one dimension), then iii) a matrix, or array (two dimensions), etc.
length(b)
f <- c(1, 2, 3) # c() means concatenate
f
length(f)
f <- c(1, 2,
4) # commands can span multiple lines
f
c(b, b, 4, 5)
f <- c(f, 25) # objects can appear on both right and left of the `<-`
f
f * 2 # operations on vectors are applied to each element
String vectors behave similarly:
g <- c( "a" , "b", "z" )
g
i <- "q"
i
g <- c(i, g)
Note that vector elements can be of only one type, and so, e.g., the integers are ‘coerced’ to character mode:
c(1, "a", 2, "b")
A list can contain multiple data types, and is itself another
‘data mode.’ A non-atomic one. The output of many functions are
list objects. Parts of the list can be accessed by names (if
supplied), using a $
syntax.
list(1, "a", 2, "b")
list(part1=1, part2="a", part3=c(2,1), part4= "b")
list(part1=1, part2="a", part3=c(2,1), part4= "b")$part4
Elements of a dimensional object are referenced by indices (beginning at 1 for the first element)
g
g[2]
g[3]
Non-specified indexes include all unrestricted elements of the object:
g[]
Dropping elements is done with negative values:
g[-2]
g[-21] # there is no 21st element
g[-1]
Indexing by a series of indices is also possible:
c(2,3)
g[ c(2,3) ]
As is indexing by a sequence (2:4 means all integer values between 2 and 4 inclusive):
g[2:4]
Subsets of vectors and arrays can be created by including logical statements (vectors) as indices.
g != "a" # Logical vector: g elements not equal to "a"
g[ g != "a" ]
g[ (g != "a") ] # parentheses can sometimes help readability
Matrices or arrays are two dimensional objects.
v1 <- c(1,2,3)
v1
v2 <- c(11,12,13)
v2
cbind(v1, v2) # bind as columns
rbind(v1, v2) # bind as rows
m <- cbind(v1, v2)
m
For two dimensional objects the first index is the row number, the second the column:
m[1,1]
m[1,2]
m[ ,2]
m[2, ]
Subsetting again: e.g., print all columns of object m, for the rows in which v2 is less than 13:
m[,2] < 13
m[ (m[,2] < 13), ]
Data frames are a special form of arrays that can contain columns with different types of data (they are actually a form of lists)
v1 # object names can contain numbers
v3 <- c("a", "b", "c")
v3
cbind(v1, v3) # coercion happens again here, but not here:
data.frame(v1, v3)
j.df <- data.frame(v1, v3)
j.df # as objects become more complex, it can
# be helpful to use a suffix (e.g., `.df`)
# to help remember the nature of the object.
See how the first column is the name of the rows (by default an integer), and the first row is the name of the columns (by default X and an integer). Column names can be set with:
colnames(j.df) <- c("size","plot")
j.df
j.df[ , 1]
j.df[ 1, ]
j.df$plot
j.df$plot[3]
An important feature of basic objects is their ‘class,’ particularly if they are factors or not (factors in the sense of experimental independent variables). Factors have levels.
class(j.df$size)
Nope. However, a character column is also by default a factor, with levels:
class(j.df$plot)
is.factor(j.df$plot)
levels(j.df$plot)
Non-factor vectors can be forced to be factors:
j.df$sizeAsFactor <- as.factor(j.df$size)
j.df
j.df$sizeAsFactor <- NULL # zap it!
# To zap many cols: df[,c("col1","col2")]<-list(NULL)
j.df
Subsetting data frames is a common activity.
j.df$size > 1
j.df[ (j.df$size > 1) , ]
j.df[ c(2,3) , ]
j.df$plot
j.df$plot != "b"
j.df [ j.df$plot != "b" , ]
The import function read.table
creates a data frame from raw
text. Most data that you might want to read into R is of a data
frame (mixed variable types) type. The simplest way to get your
data out of a spreadsheet into R is to pass it through a plain text
file, using the ‘Save as CSV’ function. You can use any column
separator you want, but must tell read.table() what it is. For the
following, download the file: mydata.txt.
read.table("mydata.txt")
read.table("mydata.txt", sep="|")
read.table("mydata.txt", sep="|", header=TRUE)
df2 <- read.table("mydata.txt", sep="|", header=TRUE)
df2
head(df2) # handy! And these:
str(df2)
summary(df2)
attributes(df2)
dim(df2)
Checking factors (these can also be set in the read.table
function):
is.factor(df2$moisture)
is.factor(df2$rep)
is.factor(df2$length)
At first, one of the most frustrating things about R is how difficult it seems to manipulate and reshape data tables. Here are a few tips (and see also here). First, summarize with a table:
table(df2$rep, df2$moisture)
tab1 <- table(df2$rep, df2$moisture)
# A 'table' object must be converted to matrix or DF for most other uses:
as.data.frame(tab1)
as.matrix(tab1)
as.data.frame.matrix(tab1)
Run a function across some factors:
tapply(df2$length, INDEX=df2$moisture, FUN=mean)
tapply(df2$length, INDEX=list(df2$moisture , df2$rep), FUN=mean)
# Unlike the table function, tapply creates a matrix
Another way of aggregating:
aggregate(df2$length, by=df2[,c("moisture","rep")], FUN=mean)
‘Stacking’ a table:
tab2 <- as.matrix(tab1)
tab2[1,2] <- 0
tab2[3,1] <- 0
tab2
data.frame(expand.grid(dimnames(tab2)), n = as.vector(tab2))
Joining information from two tables:
df3 <- data.frame(rep=c("a","b","c"), status=c("OK","OK","bust"))
df3
merge(df2, df3, by="rep")
# Alternatively, a version of VLOOKUP():
match(df2$rep, df3$rep)
df2$status <- df3$status[ match(df2$rep, df3$rep) ]
head(df2)
Create some random data, and find the mean:
rnorm(n=30, mean=1, sd=1)
mean(rnorm(n=30, mean=1, sd=1))
See how the sample mean approaches the population mean with larger samples:
mean(rnorm(n=300, mean=1, sd=1))
mean(rnorm(n=3000, mean=1, sd=1))
mean(rnorm(n=30000, mean=1, sd=1))
Other stats:
my.norm <- rnorm(n=300, mean=1, sd=1)
my.norm
sd(my.norm)
median(my.norm)
summary(my.norm)
Find the median another way:
sort(my.norm)
sort(my.norm)[150]
sort(my.norm)[151]
median(my.norm) # hmm... not the same. Ahh: median() is interpolated
Now some graphics (a histogram). This will open a new window, a graphics window:
hist(my.norm)
You can fiddle endlessly with the parameters to make your image look good, but here are the basics:
hist(my.norm, main="My histogram", xlab="a continuous variable",
ylab="frequency", sub="subtitle", breaks=25)
Now an x-y plot. The form is plot(x,y)
. Alternatively ‘~’ means ‘as
a function of’.
plot(df2$height ~ df2$fertility)
Separating the factors. First plot a blank frame:
plot(df2$fertility, df2$height, type="n")
Then add the points:
points(df2$fertility[df2$moisture=="DRY"],
df2$height[df2$moisture=="DRY"], col = "red")
points(df2$fertility[df2$moisture=="WET"],
df2$height[df2$moisture=="WET"], col = "blue")
The default graphics device is the screen. To capture a graphic, a different device should be specified:
jpeg(filename = "histogram.jpg", width = 480, height = 480,
pointsize = 12, bg = "white")
Run the plot
hist(my.norm)
Switch off the device:
dev.off()
The file is made in the active directory. Other devices include pdf, ps, png, xfig
t-test:
x1 <- rnorm(n=100, mean=5, sd=2)
x2 <- rnorm(n=50, mean=6, sd=3)
t.test(x1,x2)
Alternative formulation:
x <- c(x1,x2)
group <- as.factor( c( rep(1,times=100), rep(2,times=50) ) )
t.test(x ~ group)
plot(x ~ group)
Note that the two ‘columns’ x and group do not have to belong to the same object. They must have the same length though.
Non-parametric
wilcox.test(x1,x2, paired= FALSE)
Make some data and view it:
x <- rnorm(n=100, mean=5, sd=2)
y <- (x * 0.5) + rnorm(n=100, mean=2, sd=1)
plot(x,y, xlim=c(0,10), ylim=c(0,10))
lm()
is the basic least-squares linear model function (i.e. ANOVA
and regression). The output of lm() is an object:
xy.lm <- lm(y ~ x)
summary(xy.lm)
Add the regression line:
abline(xy.lm, col="red")
Now some deviant (heteroscedastic) data of the same type
y <- (x * rnorm(n=100, mean=0.5, sd=0.25)) + 2
xy.lm <- lm(y ~ x)
plot(x,y, xlim=c(0,10), ylim=c(0,10))
abline(xy.lm, col="red")
Examine the residuals
plot(fitted(xy.lm), resid(xy.lm))
qqnorm(resid(xy.lm))
qqline(resid(xy.lm))
Test for normality of the residuals:
hist(resid(xy.lm))
shapiro.test(resid(xy.lm))
But we can still use a non-parametric test:
cor.test(x, y, method = "spearman", alternative="two.sided")
Now a more complicated model. The formula is Y ~ X1 + X2 + X1:X1
... two factors and an interaction term:
data.lm <- lm(height ~ fertility + moisture + fertility:moisture, data=data.df)
summary(data.lm)
The following two sections introduce more complex analyses.
Increasingly, modern statistical approaches use randomization and re-sampling tests (bootstrapping) to deal with ugly distributions and non-normal events. Here is a simple example:
We define a function which is the mean of the minimum distance between the two closed pairs of observations:
meanmindist <- function(x){
z <- as.matrix(dist(x))
z[row(z)==col(z)] <- max(z) # recode the identities
mean( apply(z,1,min) )
}
We have a universe of possible observations (e.g. body sizes of organisms in region)
all <- c(1, 2, 3, 3, 4, 4, 4, 4, 5, 5, 6, 7)
We have an observed set of 4 organisms, e.g. in a community:
obs <- c(1, 3, 5, 7)
Question: is the mean difference between organism size more than expected by chance?
obs.md <- meanmindist(obs)
Let’s sample 1000 cases of 4 organisms:
exp.md <- numeric(1000) # create an empty vector of 1000 slots
for (i in 1:1000) {
exp.md[i] <- meanmindist( sample(all, size=4) )
}
View the result, with observed line in red:
hist(exp.md)
lines(c(obs.md,obs.md),c(0,par("usr")[4]),col="red")
Get the quantile (sort of ... note the effect of ties):
match(obs.md, sort(exp.md))
So the observed distances are greater than expected.
Finally, to leave R, type:
q()
If you say yes to the saving data, your workspace will be saved as .RData
and your history as .Rhistory
.
Now for some basic community ecology analysis, using the amazing vegan package of Jari Oksanen, Roeland Kindt, Pierre Legendre, Bob O’Hara.
First, load the package, and another package we’ll need:
library(vegan)
Let’s start with some diversity analyses. Data from the Barro Colorado Island 50-ha plot. Each row is a separate 1-ha subplot. Columns are species. This is the default table for vegan (tables can be transposed with he t() function).
data(BCI)
head(BCI)
BCI[1:4,1:4]
BCI[1,] # first plot
BCI[1, (BCI[1,] != 0)] # non-zero species
length(BCI[1, (BCI[1,] != 0)]) # number of non-zero species
Look at the distribution of species abundances in the whole plot:
hist( log( apply(BCI, 2, sum) ) )
What is the comparative diversity of each subplot? There are many diversity methods in vegan - let’s use the rarefaction method:
rarefy(BCI, 100)
rarefy(BCI, 10000) # error, cannot sample more individuals than there are
sort(rarefy(BCI, 100))
hist(rarefy(BCI, 200))
sort(rarefy(BCI, 200))
Let’s fit a series of rank-abundance curves to the 35th (least diverse) and 23rd (most diverse) subplots. We’ll do these plots in the same window:
par(mfrow=c(1,2))
plot(rad.lognormal(BCI[35,]))
plot(rad.lognormal(BCI[23,]))
par(mfrow=c(1,1)) # off again
And diversity indices for these plots
diversity(BCI[35,],index="shannon")
diversity(BCI[23,],index="shannon")
Using the same data, let's compare species composition across the
50 subplots. First with a clustergram. Create a distance
matrix between each of the subplots, using the Bray-Curtis measure
of compositional dissimilarity: d[jk] = (sum abs(x[ij]-x[ik])/(sum
(x[ij]+x[ik]))
.
vegdist(BCI, method="bray")
BCI.dist <- vegdist(BCI, method="bray")
plot(hclust(BCI.dist))
Now with an ordination. Let’s start with a Nonparametric Multidimensional Scaling (a good start for unknown data).
bci.mds <- metaMDS(BCI)
plot(bci.mds)
This is a biplot - it has both sample units and species on it. A variety of ways to plot the points:
plot(bci.mds, display="sites")
plot(bci.mds, type="t",display="sites")
plot(bci.mds, display="species")
plot(bci.mds, type="t",display="species")
plot(bci.mds, type="p",display="sites")
text(bci.mds, display="sites", adj=1.2)
Now let's work with a data set that has an associated environment matrix: the built in dune meadow data.
data(dune)
dune
data(dune.env)
dune.env
Here’s a compact way to visualize compositional turnover in community data. It uses the mds scores to order both species and plots.
dune.mds <- metaMDS(dune)
vegemite(dune, dune.mds, "Hill", zero="-")
Now lets add some environmental info, using envfit. For factors, it plots the centroid of each factor level, or the correlation vector for continuous variables.
dune.envfit <- envfit(dune.mds, dune.env, perm = 1000)
Note that moisture and management are most strongly correlated with the MDS scores. Let’s put these on the plot:
plot(dune.mds)
plot(dune.envfit) #, p.max = 0.005, col = "blue")
There was only one continuous variable, A1. See it’s correlation with the MDS axes:
scores(dune.envfit, "vectors")
Now for something really cool. Join the plots that have the same level of Moisture to the moisture centroid position
ordispider(dune.mds, dune.env$Moisture, col="skyblue")
Now look at a cluster plot, using Moisture regime as the label:
dune.dist <- vegdist(dune, method="bray")
plot(hclust(dune.dist,method="ward"), labels=dune.env$Moisture)
How different are the plots on under different management regimes? Let’s use a permutation test to look at the variation between and among plots:
dune.ano <- anosim(dune.dist, dune.env$Management)
summary(dune.ano)
plot(dune.ano)
Finally, lets do a constrained correspondence analysis (a.k.a. canonical correspondence analysis). In this analysis the environmental scores are directly included in the ordination. From the manual page: “CCA does not try to display all variation in the data, but only the part that can be explained by the used constraints. Consequently, the results are strongly dependent on the set of constraints and their transformations or interactions among the constraints.”
dune.cca <- cca(dune ~ Moisture + Management, data=dune.env)
plot(dune.cca)
The vectors Moisture.L, etc result from treating the ordered factor Moisture as a continuous variable.
Putting CCA and clustering together:
plot(dune.cca, type = "n")
text(dune.cca, display="sites", col="red", cex=1.2)
text(dune.cca, display="cn", arrow = 2, col="blue")
ordicluster(dune.cca, hclust(vegdist(dune)), prune=3, col = "black")
Wow!
Finally, another class of composition analysis is the search for nestedness within samples:
data(sipoo)
head(sipoo)
plot(nestedtemp(sipoo))
Making hundreds of repetitive distribution maps is easy in R.
library(maps)
library(mapdata)
map('worldHires', resolution=0, xlim=c(95,151),
ylim=c(-13,9), lwd=1.7) # fill=T, col="gray" ...
map.axes()
locnx <- c(110,111,112)
locny <- c(0,-1,-2)
points(locnx,locny,pch=19)
If you have a matrix of morphological characters for a bunch of
taxa, you may want to analyse the similarity in a
non-phylogenetic fashion. The daisy
distance matrix tool allows
for multiple data types in its calculation of distance.
library(cluster)
data(flower)
head(flower)
str(flower)
lapply(flower, class) # apply class() over the data frame (=list) columns
The data are:
Remember: conclusions based on any multivariate analysis are highly dependent on the choice of methods.
summary(daisy(flower))
plot(hclust(daisy(flower)))
summary(daisy(flower, type = list(asymm = c("V1", "V3"), symm= 2,
ordratio= 7, logratio= 8)))
flower.dist <- daisy(flower, type = list(asymm = c("V1", "V3"), symm= 2,
ordratio= 7, logratio= 8))
plot(hclust(flower.dist))
plot(hclust(flower.dist, method="average"))
library(ape)
TREE <- "((A:1,B:1):2,(C:3,D:4):2):3;"
tree <- read.tree(text = TREE)
plot(tree)
plot(tree,type="cladogram")
plot(tree,type="fan")
plot(tree,type="unrooted")
plot(tree,type="radial")
plot(tree,use.edge.length=F)
plot(tree, edge.color=c("red","orange","yellow",
"green","blue","purple"))
data(bird.orders)
x <- factor(c(rep(0, 5), rep(1, 18)))
ans <- ace(x, bird.orders, type = "d")
ans
plot(bird.orders, type = "c", FALSE, label.offset = 1)
co <- c("blue", "yellow")
tiplabels(pch = 22, bg = co[as.numeric(x)], cex = 2, adj = 1)
nodelabels(thermo = ans$lik.anc, piecol = co, cex = 0.75)
Neighbour-joining using ape
. Remember, UPGMA and NJ are
clustering methods. They are ‘phenetic,’ and not ‘cladistic’
(cf. methodological wars of the seventies).
library(ape)
data(woodmouse)
woodmouse
trw <- nj(dist.dna(woodmouse))
plot(trw)
Parsimony (on DNA data) using phangorn
library(phangorn)
data(Laurasiatherian)
Laurasiatherian
dm = dist.hamming(Laurasiatherian)
tree = NJ(dm)
plot(tree)
parsimony(tree, Laurasiatherian)
treeNNI <- optim.parsimony(tree, Laurasiatherian)
plot(treeNNI)
Parsimony (on morphological data) using phangorn
(from
http://www.phytools.org/mpma/Exercise_2.1/)
shark <- c(1, 1, 0, 0, 0, 0, 0)
goldfish <- c(1, 1, 0, 0, 0, 0, 0)
coelecanth <- c(1, 1, 0, 0, 0, 0, 0)
iguana <- c(1, 0, 0, 0, 0, 0, 1)
robin <- c(0, 0, 1, 0, 1, 0, 1)
human <- c(0, 0, 0, 1, 1, 0, 1)
lemur <- c(0, 0, 0, 1, 1, 0, 1)
bat <- c(0, 0, 1, 1, 1, 0, 1)
pig <- c(0, 0, 0, 1, 1, 1, 1)
whale <- c(0, 1, 0, 0, 1, 0, 0)
cow <- c(0, 0, 0, 1, 1, 1, 1)
animals <- rbind(shark, goldfish, coelecanth, iguana,
robin, human, lemur, bat, pig, whale, cow)
colnames(animals) <- c("scales","fins","wings","hair",
"warm-blood","hooves","legs")
animals
animals <- as.phyDat(animals, type = "USER", levels = c(0, 1))
starting <- rtree(n = length(animals))
starting$edge.length <- NULL
starting$tip.label <- names(animals)
plot(starting)
tree <- optim.parsimony(starting, animals, rearrangements = "SPR")
plot(tree)
tree <- root(tree, outgroup = "shark", resolve.root = TRUE)
plot(tree)
The package BioGeoBEARS
is a powerful and rapidly evolving
package that implements a variety of models for discrete
biogeography. It is “designed to perform inference of biogeographic
history on phylogenies, and also model testing and model choice of
the many different possible models of how biogeography may evolve
on a phylogeny (dispersal, vicariance, founder-event speciation,
DEC, DIVA, BAYAREA, etc.).” See
http://phylo.wikidot.com/biogeobears for full details. It
requires many dependencies, and must load (source()
) many online
scripts each time you run it, so may cause you some difficulty.
I have modified the example given on the main site to perform a sinlge, simple analysis of the Hawaiian Psychotria example. The script is here, with data here (psychotria_geog.phy, psychotria.new).
If you would rather use LaGrange externally, but use R to display the phylogenies, here is a simple script, acting on a sample Lagrange output. A more comprehensive (pretty but more complex) script (by Vijay Patil) is available from Steffi.
library(ape)
infile <- readLines("psychotria.results.txt")
# the tree, with labelled nodes is in line 7
tr <- read.tree(text=infile[7])
# Find the extant distributions by looking for '---+':
taxa <- gsub("^[^[]*\\[([A-Z])\\] ([A-Za-z0-9_]+) *$", "\\2",
infile[grep("---\\+", infile)])
distrib <- gsub("^[^[]*\\[([A-Z])\\] (.*)$", "\\1",
infile[grep("---\\+", infile)])
# Find the (most likely) splits info by looking for 'At node'
nodelines <- grep("^At node", infile)
nodelabl <- gsub(":", "", substr(infile[nodelines],9,100))
# flip the splits to better be viewed on a sideways tree
split <- gsub("\\[([A-Z]+)\\|([A-Z]+)\\]\ *", "\\2|\\1",
substr(infile[nodelines+2],4,10))
# plot it, making room for tip info
plot(tr, label.offset=0.5)
# add the node labels
nodelabels(split[match(nodelabl, tr$node.label)])
# and the extant distribution
tiplabels(distrib[match(taxa, tr$tip.label)])
Here are some raster and vector data of Bogor:
Analyse with:
library(raster)
library(maptools)
sat <- raster("sat.tif", band = 1)
dem <- raster("srtm.tif")
projection(sat)
projection(dem)
plot(sat)
plot(dem)
plot(sat)
trees <- readShapePoints("trees/trees.shp")
roads <- readShapeLines("roads/roads.shp")
plot(roads, add=T)
plot(trees, add=T)
A very simple spatial query: what are the values of the red band (think ‘roofing!’) under the tree points, vs. the overall values.
vals <- extract(sat,trees)
hist(bogor, freq=F, breaks=25, col = rgb(1,0,0,0.4))
hist(vals, freq=F, breaks=25, add=T, col = rgb(0,0,1,0.4))
legend("topright", c("All", "Trees"),
fill=c(rgb(1,0,0,0.4),rgb(0,0,1,0.4)))
short URL: http://is.gd/webbR
(This HTML document is generated from a markdown
document, which is in turn generated from the R script
using a simple gawk
script.)