Andrew MacDonald (ecologist,
hacker, past ASRI volunteer) recently wrote a very useful
page (which he just
tweeted)
about basic table manipulation in R
, and referred to a document I
wrote in ca. 2001 (!), which used to be on my old website. I thought I
would refind that document and post it again here. Thanks to Andrew,
and to all the original responders!
******************************************************************* *** Solutions to a question I posted on the S-news mailing list *** *** about simple table manipulation *** ******************************************************************* THE QUESTION: Hi all, I am fairly new to S-plus (actually R, on linux), and am continually frustrated at how hard it seems to be to do the most basic table/matrix manipulations. So, if anyone has some spare time for a challenge, I pose the following problem... Of several canned stats packages, I have found the table manipulation options in JMP to be the most helpful/intuitive. The options there are Group, Split, Stack, Join, Subset, Transpose, Sort. What would be the simplest S-plus/R code to do the same? Some of these are quite `easy' but I leave them in for completeness. A full solution to this will be an invaluable resource - apologies/request for directions if it already exists! In detail, with examples: 1) GROUP: Turn this table(A): A a 1 A a 3 A a 1 A b 1 A b 2 B c 2 B d 1 into this (B): Group1 Group2 Nrows SumOfCol3 A a 3 5 A b 2 3 B c 1 2 B d 1 1 2) SPLIT: Turn col3 of the second table (B) into this (C): a b c d A 3 2 . . B . . 1 1 3) STACK: Turn the above table (C) into this (D): A a 3 A b 2 A c . A d . B a . B b . B c 1 B d 1 4) JOIN these tables (E, F): A 1 B 2 C 3 A a B a B a B b C c A b to give (G): A a 1 B a 2 B a 2 B b 2 C c 3 A b 1 5) SUBSET Table G to give H: A a 1 A b 1 6) TRANSPOSE H to give: A A a b 1 1 7) and finally SORT, by up to 3 keys. Solutions referring to data frames would be most helpful. A big thanks to anyone who responds! Cam This message was distributed by s-news@lists.biostat.wustl.edu. THE ANSWERS: 1111111111111111111111111111111111111111111111111111111111111111111111111111111 ********** From: Bill Dunlap ********** > 1) GROUP: Turn this table(A): > > A a 1 > A a 3 > A a 1 > A b 1 > A b 2 > B c 2 > B d 1 > > into this (B): > > Group1 Group2 Nrows SumOfCol3 > A a 3 5 > A b 2 3 > B c 1 2 > B d 1 1 You can use aggregate(), but you only get 1 new column for each call to aggregate(). E.g., > aggregate(A$Number, by=A[,c("Group1","Group2")], FUN=sum) Group1 Group2 x 1 A a 5 2 A b 3 3 B c 2 4 B d 1 > aggregate(A$Number, by=A[,c("Group1","Group2")], FUN=length) Group1 Group2 x 1 A a 3 2 A b 2 3 B c 1 4 B d 1 Use, e.g., cbind() to paste together the columns of interest: > B <- aggregate(A$Number, by=A[,c("Group1","Group2")], FUN=length) > names(B)[3] <- "Nrows" > B <- cbind(B,SumOfCol3=aggregate(A$Number, by=A[,c("Group1","Group2")], FUN=sum)$x) > B Group1 Group2 Nrows SumOfCol3 1 A a 3 5 2 A b 2 3 3 B c 1 2 4 B d 1 1 aggregate() is a pretty general purpose function and there exist versions which are optimized for FUN=sum or length in slightly restricted circumstances. They are much faster when you are dealing with a lot of categories. Check in the S-news archives for aggsum or agglength. ********** From: Ed Kademan ********** ## I am doing the following in R. A <- data.frame( c1 = c('A', 'A', 'A', 'A', 'A', 'B', 'B'), c2 = c('a', 'a', 'a', 'b', 'b', 'c', 'd'), c3 = c(1, 3, 1, 1, 2, 2, 1)) problem1 <- function(A) { nRows <- as.vector(table(A[1:2])) sumOfCol3 <- as.vector(tapply(A[[3]], A[1:2], sum)) groups <- expand.grid(lapply(A[1:2], levels)) dimnames(groups)[[2]][1:2] <- c('Group1', 'Group2') cbind(groups, nRows, sumOfCol3)[nRows > 0,] } B <- problem1(A) ********** From: Frank Harrell ********** Some of what you want is worked out in http://hesweb1.med.virginia.edu/biostat/s/doc/splus.pdf section 4.2 - My libraries will be available in R in a few weeks ********** From: Patrick Connolly ********** Check out aggregate(). It'll do most of what you want. ********** From: Nick Ellis ********** 2 calls to aggregate() do this 2222222222222222222222222222222222222222222222222222222222222222222222222222222 ********** From: Bill Dunlap ********** > 2) SPLIT: Turn col3 of the second table (B) into this (C): > > a b c d > A 3 2 . . > B . . 1 1 The fastest way to get this result is to start with the raw data A: > table(A$Group1, A$Group2) a b c d A 3 2 0 0 B 0 0 1 1 Do you really want missing values (NA's in Splus) where table puts 0's? Do you really want to be able to turn B directly into C? ********** From: Ed Kademan ********** ## I am not turning B into C as you specified, but maybe this is ok. C <- table(A[1:2]) ********** From: Nick Ellis ********** see my attached function table.trans (originally Richard Heiberger's code) table.trans(df[,3],df[1:2]) > table.trans function(X, INDICES, ...) { len.ind <- length(INDICES) IN1 <- interaction(INDICES[ - len.ind], drop = T) IN2 <- INDICES[[len.ind]] a <- tapply(X, data.frame(IN1, IN2), c, ...) cbind(INDICES[seq(1, nrow(INDICES), length(levels(IN2))), - len.ind], a) } HELP FILE: filename="table_trans.d" .BG .FN table.trans .TL Transpose a data frame with 'by' variables .DN Transposes a data frame with SAS-style 'by' variables .CS table.trans(X, INDICES, sep="#") .RA .AG X The single data vector to transpose .AG INDICES All columns except the last are the 'by' variables. The last column is a factor whose levels are the names of the transposed columns. .OA .AG sep Delimiter used internally. Shouldn't occur as a character in any of the 'by' variables .RT A data frame with columns corresponding to 'by' variables and extra columns corresponding to levels of last column of INDICES whose value comes from X .SE none .DT Transposes all values for vector X into columns with names given by the levels of the last column of INDICES. Missing entries are denoted by NA. Usually the 'by' variables in INDICES will be factors or discrete values. .br The inverse of this function is `reshape', which folds columns into rows. .SH REFERENCES Adapted from code by Richard M Heiberger in s-news 22 Mar 1996 .SA `reshape', as.data.frame.array .EX > names(dimnames(iris)) <- c("flower", "measurement", "species") > iris1 <- as.data.frame.array(iris, row.dims = 1:3) > iris1[1:10, ] x.1 flower measurement species 1 5.1 1 Sepal L. Setosa 2 4.9 2 Sepal L. Setosa 3 4.7 3 Sepal L. Setosa 4 4.6 4 Sepal L. Setosa 5 5.0 5 Sepal L. Setosa 6 5.4 6 Sepal L. Setosa 7 4.6 7 Sepal L. Setosa 8 5.0 8 Sepal L. Setosa 9 4.4 9 Sepal L. Setosa 10 4.9 10 Sepal L. Setosa > iris.meas <- table.trans(iris1[, 1], iris1[, c(2, 4, 3)]) > iris.meas[1:10, ] flower species Sepal.L. Sepal.W. Petal.L. Petal.W. 1 1 Setosa 5.1 3.5 1.4 0.2 2 1 Versicolor 7.0 3.2 4.7 1.4 3 1 Virginica 6.3 3.3 6.0 2.5 4 2 Setosa 4.9 3.0 1.4 0.2 5 2 Versicolor 6.4 3.2 4.5 1.5 6 2 Virginica 5.8 2.7 5.1 1.9 7 3 Setosa 4.7 3.2 1.3 0.2 8 3 Versicolor 6.9 3.1 4.9 1.5 9 3 Virginica 7.1 3.0 5.9 2.1 10 4 Setosa 4.6 3.1 1.5 0.2 > samp <- sample(nrow(iris1), 20) > iris.meas <- table.trans(iris1[ - samp, 1], iris1[ - samp, c(2, 3, 4)]) > iris.meas[1:10, ] flower measurement Setosa Versicolor Virginica 1 1 Sepal L. 5.1 7.0 6.3 2 1 Sepal W. 3.5 3.2 3.3 3 1 Petal L. 1.4 4.7 6.0 4 1 Petal W. 0.2 1.4 2.5 5 2 Sepal L. 4.9 6.4 5.8 6 2 Sepal W. 3.0 3.2 2.7 7 2 Petal L. 1.4 4.5 5.1 8 2 Petal W. 0.2 1.5 1.9 9 3 Sepal L. 4.7 6.9 7.1 10 3 Sepal W. 3.2 3.1 3.0 .KW misc .WR 3333333333333333333333333333333333333333333333333333333333333333333333333333333 ********** From: Bill Dunlap ********** > 3) STACK: Turn the above table (C) into this (D): > > A a 3 > A b 2 > A c . > A d . > B a . > B b . > B c 1 > B d 1 I think an equivalent result is > as.data.frame.array(table(A$Group1, A$Group2), row.dims=1:2) X Dim1 Dim2 1 3 A a 2 0 B a 3 2 A b 4 0 B b 5 0 A c 6 1 B c 7 0 A d 8 1 B d See the sorting section on how to reorder it to your taste. The GUI has a pretty flexible stacking function. ********** From: Ed Kademan ********** problem3 <- function(C) { C <- t(C) cbind(expand.grid(dimnames(C))[2:1], as.vector(C)) } D <- problem3(C) ********** From: Nick Ellis ********** see my attached function reshape() reshape(df,col.unfol=1:4) > reshape function(obj, col.copy = NULL, col.unfold = 1:dim(obj)[2], label = "label", value = "value") { n <- dim(obj)[1] p <- length(col.unfold) res <- vector("list", length(col.copy) + 2) names(res) <- c(if(is.null(col.copy)) NULL else dimnames(obj[, col.copy, drop = F])[[2]], label, value) for(i in seq(along = col.copy)) { res[[i]] <- rep(obj[, col.copy[i]], p) } res[[label]] <- rep(dimnames(obj[, col.unfold, drop = F])[[2]], rep(n, p)) res[[value]] <- as.vector(data.matrix(obj[, col.unfold, drop = F])) as.data.frame(res) } HELP FILE: filename="reshape.d" .BG .FN reshape .TL Reshape, unfold or transpose a data frame .DN Reshape, unfold or transpose a data frame. .CS reshape(obj, col.copy=NULL, col.unfold=1:dim(obj)[2], label="label", value="value") .RA .AG obj the data frame to reshape .OA .AG col.copy The columns of `obj' to copy to result .AG col.unfold The columns of `obj' to be unfolded to rows .AG label the name of the label column .AG value the name of the value column .RT A data frame consisting of the copied columns replicated p times (where p is number of unfolded columns) and two extra columns: label contains the name of the unfolded column value contains the value .SE none .DT see code below. The inverse of this function is table.trans. .SH WARNINGS the data in the unfolded columns should all have the same mode .SA `table.trans', as.data.frame.array .EX > air[1:5, ] ozone radiation temperature wind 1 3.448217 190 67 7.4 2 3.301927 118 72 8.0 3 2.289428 149 74 12.6 4 2.620741 313 62 11.5 5 2.843867 299 65 8.6 > reshape(air[1:5,],2,c(1,3,4),"attribute") radiation attribute value 1 190 ozone 3.448217 2 118 ozone 3.301927 3 149 ozone 2.289428 4 313 ozone 2.620741 5 299 ozone 2.843867 6 190 temperature 67.000000 7 118 temperature 72.000000 8 149 temperature 74.000000 9 313 temperature 62.000000 10 299 temperature 65.000000 11 190 wind 7.400000 12 118 wind 8.000000 13 149 wind 12.600000 14 313 wind 11.500000 15 299 wind 8.600000 > reshape(air[1:5,],label="attribute") attribute value 1 ozone 3.448217 2 ozone 3.301927 3 ozone 2.289428 4 ozone 2.620741 5 ozone 2.843867 6 radiation 190.000000 7 radiation 118.000000 8 radiation 149.000000 9 radiation 313.000000 10 radiation 299.000000 11 temperature 67.000000 12 temperature 72.000000 13 temperature 74.000000 14 temperature 62.000000 15 temperature 65.000000 16 wind 7.400000 17 wind 8.000000 18 wind 12.600000 19 wind 11.500000 20 wind 8.600000 .KW misc .WR 4444444444444444444444444444444444444444444444444444444444444444444444444444444 ********** From: Bill Dunlap ********** > 4) JOIN these tables (E, F): > > A 1 > B 2 > C 3 > > A a > B a > B a > B b > C c > A b > > to give (G): > > A a 1 > B a 2 > B a 2 > B b 2 > C c 3 > A b 1 Use merge(): > E<-data.frame(Group1=c("A","B","C"), Number=1:3) > # F is a reserved word, so I call it FF > FF <- data.frame(Group1=c("A","B","B","B","C","A"), Group2=c("a","a","a","b","c","b")) > G <- merge(FF, E) > G Group1 Group2 Number 1 A a 1 2 A b 1 3 B a 2 4 B a 2 5 B b 2 6 C c 3 You get a different order but I think it is an equivalent table. ********** From: Ed Kademan ********** E <- data.frame(c1 = c('A', 'B', 'C'), c2 = 1:3) F <- data.frame(c1 = c('A', 'B', 'B', 'B', 'C', 'A'), c2 = c('a', 'a', 'a', 'b', 'c', 'b')) ## F is a very bad name for a variable in R, by the way. G <- cbind(F, E[[2]][match(F[[1]], E[[1]])]) ********** From: Nick Ellis ********** merge(df1,df2) # if fields have common name 5555555555555555555555555555555555555555555555555555555555555555555555555555555 ********** From: Bill Dunlap ********** > 5) SUBSET Table G to give H: > > A a 1 > A b 1 You didn't say what subset this represents. To get the first 2 rows use G[1:2,]. To get the rows with Number==1 use G[ G$Number==1,]. To get the rows with Group1=="A" use G[ G$Group1=="A",]. To get the rows with Group1=="A" and Group2 != "c" use G[G$Group1=="A" & G$Group2!="c",]. ********** From: Ed Kademan ********** H <- G[G[[1]] == 'A',] ## I am assuming you just want those rows whose first column contains ## 'A.' ********** From: Nick Ellis ********** df[df[,1] == "A",] also see help(subscript) 6666666666666666666666666666666666666666666666666666666666666666666666666666666 ********** From: Bill Dunlap ********** > 6) TRANSPOSE H to give: > > A A > a b > 1 1 Transpose is harder because data frames must have only one type of data per column. Our transpose function converts the data frame into a matrix (which has one type for the whole matrix) and returns the transpose of that matrix. You would have to convert that back to a data frame: > data.frame(t(H)) X1 X2 Group1 A A Group2 a b Number 1 1 ********** From: Ed Kademan ********** transposeH <- t(H) ********** From: Nick Ellis ********** t(df) 7777777777777777777777777777777777777777777777777777777777777777777777777777777 ********** From: Bill Dunlap ********** > 7) and finally SORT, by up to 3 keys. Use order() and subscripting. E.g., > G[ order(G$Group1, G$Group2, G$Number), ] Group1 Group2 Number 1 A a 1 2 A b 1 3 B a 2 4 B a 2 5 B b 2 6 C c 3 > G[ order(G$Number, G$Group2), ] Group1 Group2 Number 1 A a 1 2 A b 1 3 B a 2 4 B a 2 5 B b 2 6 C c 3 ********** From: Ed Kademan ********** dfr <- as.data.frame(matrix(sample(5, 120, replace = T), 30, 4)) sortedDfr <- dfr[do.call('order', dfr[1:3]),] ## The first column is the primary key, the second is the secondary ## key, and the third is tertiary. ********** From: Nick Ellis ********** use order() e.g. df[order(df$a,df$b,df$c),] ===============================================================================