JMP vs. S-plus (a very old post!)

Andrew MacDonald (ecologist, hacker, past ASRI volunteer) recently wrote a very useful page (which he just tweeted) about basic table manipulation in R, and referred to a document I wrote in ca. 2001 (!), which used to be on my old website. I thought I would refind that document and post it again here. Thanks to Andrew, and to all the original responders!


*******************************************************************
*** Solutions to a question I posted on the S-news mailing list ***
*** about simple table manipulation                             ***
*******************************************************************

THE QUESTION:

Hi all,

I am fairly new to S-plus (actually R, on linux), and am continually
frustrated at how hard it seems to be to do the most basic table/matrix
manipulations.  So, if anyone has some spare time for a challenge, I pose
the following problem...

Of several canned stats packages, I have found the table manipulation
options in JMP to be the most helpful/intuitive.  The options there
are Group, Split, Stack, Join, Subset, Transpose, Sort.  What would be
the simplest S-plus/R code to do the same?  Some of these are quite `easy'
but I leave them in for completeness.  A full solution to this will be an
invaluable resource - apologies/request for directions if it already
exists!

In detail, with examples:

1) GROUP: Turn this table(A):

A  a  1
A  a  3
A  a  1
A  b  1
A  b  2
B  c  2
B  d  1

into this (B):

Group1 Group2 Nrows SumOfCol3
A      a      3     5
A      b      2     3
B      c      1     2
B      d      1     1


2) SPLIT: Turn col3 of the second table (B) into this (C):

   a  b  c  d
A  3  2  .  .
B  .  .  1  1


3) STACK: Turn the above table (C) into this (D):

A  a  3
A  b  2
A  c  .
A  d  .
B  a  .
B  b  .
B  c  1
B  d  1


4) JOIN these tables (E, F):

A  1
B  2
C  3

A  a
B  a
B  a
B  b
C  c
A  b

to give (G):

A  a  1
B  a  2
B  a  2
B  b  2
C  c  3
A  b  1


5) SUBSET Table G to give H:

A  a  1
A  b  1


6)  TRANSPOSE H to give:

A  A
a  b
1  1

7) and finally SORT, by up to 3 keys.


Solutions referring to data frames would be most helpful. A big thanks to
anyone who responds!

Cam

This message was distributed by s-news@lists.biostat.wustl.edu.


THE ANSWERS:

1111111111111111111111111111111111111111111111111111111111111111111111111111111

********** From: Bill Dunlap **********

> 1) GROUP: Turn this table(A):
>
> A  a  1
> A  a  3
> A  a  1
> A  b  1
> A  b  2
> B  c  2
> B  d  1
>
> into this (B):
>
> Group1 Group2 Nrows SumOfCol3
> A      a      3     5
> A      b      2     3
> B      c      1     2
> B      d      1     1

You can use aggregate(), but you only get 1 new column for each call
to aggregate().  E.g.,

> aggregate(A$Number,  by=A[,c("Group1","Group2")], FUN=sum)

  Group1 Group2 x
1      A      a 5
2      A      b 3
3      B      c 2
4      B      d 1

> aggregate(A$Number,  by=A[,c("Group1","Group2")], FUN=length)

  Group1 Group2 x
1      A      a 3
2      A      b 2
3      B      c 1
4      B      d 1

Use, e.g., cbind() to paste together the columns of interest:

> B <- aggregate(A$Number,  by=A[,c("Group1","Group2")], FUN=length)
> names(B)[3] <- "Nrows"
> B <- cbind(B,SumOfCol3=aggregate(A$Number,  by=A[,c("Group1","Group2")], FUN=sum)$x)
> B

  Group1 Group2 Nrows SumOfCol3
1      A      a     3         5
2      A      b     2         3
3      B      c     1         2
4      B      d     1         1

aggregate() is a pretty general purpose function and there exist versions
which are optimized for FUN=sum or length in slightly restricted circumstances.
They are much faster when you are dealing with a lot of categories.
Check in the S-news archives for aggsum or agglength.


********** From: Ed Kademan **********
## I am doing the following in R.

A <- data.frame(
       c1 = c('A', 'A', 'A', 'A', 'A', 'B', 'B'),
       c2 = c('a', 'a', 'a', 'b', 'b', 'c', 'd'),
       c3 = c(1, 3, 1, 1, 2, 2, 1))

problem1 <-
  function(A)
{
  nRows     <- as.vector(table(A[1:2]))
  sumOfCol3 <- as.vector(tapply(A[[3]], A[1:2], sum))
  groups    <- expand.grid(lapply(A[1:2], levels))
  dimnames(groups)[[2]][1:2] <- c('Group1', 'Group2')
  cbind(groups, nRows, sumOfCol3)[nRows > 0,]
}

B <- problem1(A)


********** From: Frank Harrell  **********

Some of what you want is worked out in
http://hesweb1.med.virginia.edu/biostat/s/doc/splus.pdf
section 4.2

- My libraries will be available in R in a few weeks


********** From: Patrick Connolly **********

Check out aggregate().  It'll do most of what you want.


********** From: Nick Ellis **********

2 calls to aggregate() do this


2222222222222222222222222222222222222222222222222222222222222222222222222222222

********** From: Bill Dunlap **********

> 2) SPLIT: Turn col3 of the second table (B) into this (C):
>
>    a  b  c  d
> A  3  2  .  .
> B  .  .  1  1

The fastest way to get this result is to start with the raw data A:

   > table(A$Group1, A$Group2)
     a b c d
   A 3 2 0 0
   B 0 0 1 1

Do you really want missing values (NA's in Splus) where table puts 0's?
Do you really want to be able to turn B directly into C?


********** From: Ed Kademan **********

## I am not turning B into C as you specified, but maybe this is ok.

C <- table(A[1:2])


********** From: Nick Ellis **********

see my attached function table.trans (originally Richard Heiberger's code)
table.trans(df[,3],df[1:2])

> table.trans
function(X, INDICES, ...)
{
    len.ind <- length(INDICES)
    IN1 <- interaction(INDICES[ - len.ind], drop = T)
    IN2 <- INDICES[[len.ind]]
    a <- tapply(X, data.frame(IN1, IN2), c, ...)
    cbind(INDICES[seq(1, nrow(INDICES), length(levels(IN2))),  -
len.ind], a)
}

HELP FILE:
filename="table_trans.d"

.BG
.FN table.trans
.TL
Transpose a data frame with 'by' variables
.DN
Transposes a data frame with SAS-style 'by' variables
.CS
table.trans(X, INDICES, sep="#")
.RA
.AG X
The single data vector to transpose
.AG INDICES
All columns except the last are the 'by' variables.
The last column is a factor whose levels are the names
of the transposed columns.
.OA
.AG sep
Delimiter used internally. Shouldn't occur as a character in any of the 'by' variables
.RT
A data frame with columns corresponding to 'by' variables
and extra columns corresponding to levels of last column 
of INDICES whose value comes from X
.SE
none
.DT
Transposes all values for vector X into columns with names given by the
levels of the last column of INDICES. Missing entries are denoted by NA. 
Usually the 'by' variables in INDICES will be factors or discrete values.
.br
The inverse of this function is `reshape', which folds columns into rows.
.SH REFERENCES
Adapted from code by Richard M Heiberger in s-news 22 Mar 1996
.SA
`reshape', as.data.frame.array
.EX
> names(dimnames(iris)) <- c("flower", "measurement", "species")
> iris1 <- as.data.frame.array(iris, row.dims = 1:3)
> iris1[1:10,  ]
   x.1 flower measurement species 
 1 5.1      1    Sepal L.  Setosa
 2 4.9      2    Sepal L.  Setosa
 3 4.7      3    Sepal L.  Setosa
 4 4.6      4    Sepal L.  Setosa
 5 5.0      5    Sepal L.  Setosa
 6 5.4      6    Sepal L.  Setosa
 7 4.6      7    Sepal L.  Setosa
 8 5.0      8    Sepal L.  Setosa
 9 4.4      9    Sepal L.  Setosa
10 4.9     10    Sepal L.  Setosa
> iris.meas <- table.trans(iris1[, 1], iris1[, c(2, 4, 3)])
> iris.meas[1:10,  ]
   flower    species Sepal.L. Sepal.W. Petal.L. Petal.W. 
 1      1     Setosa      5.1      3.5      1.4      0.2
 2      1 Versicolor      7.0      3.2      4.7      1.4
 3      1  Virginica      6.3      3.3      6.0      2.5
 4      2     Setosa      4.9      3.0      1.4      0.2
 5      2 Versicolor      6.4      3.2      4.5      1.5
 6      2  Virginica      5.8      2.7      5.1      1.9
 7      3     Setosa      4.7      3.2      1.3      0.2
 8      3 Versicolor      6.9      3.1      4.9      1.5
 9      3  Virginica      7.1      3.0      5.9      2.1
10      4     Setosa      4.6      3.1      1.5      0.2
> samp <- sample(nrow(iris1), 20)
> iris.meas <- table.trans(iris1[ - samp, 1], iris1[ - samp, c(2, 3, 4)])
> iris.meas[1:10,  ]
   flower measurement Setosa Versicolor Virginica 
 1      1    Sepal L.    5.1        7.0       6.3
 2      1    Sepal W.    3.5        3.2       3.3
 3      1    Petal L.    1.4        4.7       6.0
 4      1    Petal W.    0.2        1.4       2.5
 5      2    Sepal L.    4.9        6.4       5.8
 6      2    Sepal W.    3.0        3.2       2.7
 7      2    Petal L.    1.4        4.5       5.1
 8      2    Petal W.    0.2        1.5       1.9
 9      3    Sepal L.    4.7        6.9       7.1
10      3    Sepal W.    3.2        3.1       3.0

.KW misc
.WR


3333333333333333333333333333333333333333333333333333333333333333333333333333333

********** From: Bill Dunlap **********

> 3) STACK: Turn the above table (C) into this (D):
>
> A  a  3
> A  b  2
> A  c  .
> A  d  .
> B  a  .
> B  b  .
> B  c  1
> B  d  1

   I think an equivalent result is
   > as.data.frame.array(table(A$Group1, A$Group2), row.dims=1:2)
     X Dim1 Dim2
   1 3    A    a
   2 0    B    a
   3 2    A    b
   4 0    B    b
   5 0    A    c
   6 1    B    c
   7 0    A    d
   8 1    B    d
   See the sorting section on how to reorder it to your taste.
   The GUI has a pretty flexible stacking function.


********** From: Ed Kademan **********

problem3 <- function(C)
{
  C <- t(C)
  cbind(expand.grid(dimnames(C))[2:1], as.vector(C))
}

D <- problem3(C)


********** From: Nick Ellis **********

see my attached function reshape()
reshape(df,col.unfol=1:4)

> reshape
function(obj, col.copy = NULL, col.unfold = 1:dim(obj)[2], label = "label",
value = "value")
{
    n <- dim(obj)[1]
    p <- length(col.unfold)
    res <- vector("list", length(col.copy) + 2)
    names(res) <- c(if(is.null(col.copy)) NULL else dimnames(obj[,
col.copy, drop = F])[[2]], label, value)
    for(i in seq(along = col.copy)) {
        res[[i]] <- rep(obj[, col.copy[i]], p)
    }
    res[[label]] <- rep(dimnames(obj[, col.unfold, drop = F])[[2]],
rep(n, p))
    res[[value]] <- as.vector(data.matrix(obj[, col.unfold, drop = F]))
    as.data.frame(res)
}

HELP FILE:
filename="reshape.d"

.BG
.FN reshape
.TL
Reshape, unfold or transpose a data frame
.DN
Reshape, unfold or transpose a data frame. 
.CS
reshape(obj, col.copy=NULL, col.unfold=1:dim(obj)[2], 
label="label", value="value")
.RA
.AG obj
the data frame to reshape
.OA
.AG col.copy
The columns of `obj' to copy to result
.AG col.unfold
The columns of `obj' to be unfolded to rows
.AG label
the name of the label column
.AG value
the name of the value column
.RT
A data frame consisting of the copied columns replicated p times
(where p is number of unfolded columns) and two extra columns:
label contains the name of the unfolded column
value contains the value
.SE
none
.DT
see code below. The inverse of this function is table.trans.
.SH WARNINGS
the data in the unfolded columns should all have the same mode
.SA
`table.trans', as.data.frame.array
.EX
> air[1:5,  ]
     ozone radiation temperature wind 
1 3.448217       190          67  7.4
2 3.301927       118          72  8.0
3 2.289428       149          74 12.6
4 2.620741       313          62 11.5
5 2.843867       299          65  8.6
> reshape(air[1:5,],2,c(1,3,4),"attribute")
   radiation   attribute     value 
 1       190       ozone  3.448217
 2       118       ozone  3.301927
 3       149       ozone  2.289428
 4       313       ozone  2.620741
 5       299       ozone  2.843867
 6       190 temperature 67.000000
 7       118 temperature 72.000000
 8       149 temperature 74.000000
 9       313 temperature 62.000000
10       299 temperature 65.000000
11       190        wind  7.400000
12       118        wind  8.000000
13       149        wind 12.600000
14       313        wind 11.500000
15       299        wind  8.600000
> reshape(air[1:5,],label="attribute")
     attribute      value 
 1       ozone   3.448217
 2       ozone   3.301927
 3       ozone   2.289428
 4       ozone   2.620741
 5       ozone   2.843867
 6   radiation 190.000000
 7   radiation 118.000000
 8   radiation 149.000000
 9   radiation 313.000000
10   radiation 299.000000
11 temperature  67.000000
12 temperature  72.000000
13 temperature  74.000000
14 temperature  62.000000
15 temperature  65.000000
16        wind   7.400000
17        wind   8.000000
18        wind  12.600000
19        wind  11.500000
20        wind   8.600000
.KW misc
.WR


4444444444444444444444444444444444444444444444444444444444444444444444444444444

********** From: Bill Dunlap **********

> 4) JOIN these tables (E, F):
>
> A  1
> B  2
> C  3
>
> A  a
> B  a
> B  a
> B  b
> C  c
> A  b
>
> to give (G):
>
> A  a  1
> B  a  2
> B  a  2
> B  b  2
> C  c  3
> A  b  1

Use merge():

    > E<-data.frame(Group1=c("A","B","C"), Number=1:3)
    > # F is a reserved word, so I call it FF
    > FF <- data.frame(Group1=c("A","B","B","B","C","A"), Group2=c("a","a","a","b","c","b"))
    > G <- merge(FF, E)
    > G
       Group1 Group2 Number
     1      A      a      1
     2      A      b      1
     3      B      a      2
     4      B      a      2
     5      B      b      2
     6      C      c      3

     You get a different order but I think it is an equivalent table.


********** From: Ed Kademan **********

E <- data.frame(c1 = c('A', 'B', 'C'), c2 = 1:3)
F <- data.frame(c1 = c('A', 'B', 'B', 'B', 'C', 'A'),
                c2 = c('a', 'a', 'a', 'b', 'c', 'b'))
## F is a very bad name for a variable in R, by the way.

G <- cbind(F, E[[2]][match(F[[1]], E[[1]])])


********** From: Nick Ellis **********

merge(df1,df2) # if fields have common name


5555555555555555555555555555555555555555555555555555555555555555555555555555555

********** From: Bill Dunlap **********

> 5) SUBSET Table G to give H:
>
> A  a  1
> A  b  1

   You didn't say what subset this represents.  To get the first 2 rows
   use G[1:2,].  To get the rows with Number==1 use G[ G$Number==1,].
   To get the rows with Group1=="A" use G[ G$Group1=="A",].  To get the
   rows with Group1=="A" and Group2 != "c" use G[G$Group1=="A" & G$Group2!="c",].


********** From: Ed Kademan **********

H <- G[G[[1]] == 'A',]
## I am assuming you just want those rows whose first column contains
## 'A.'


********** From: Nick Ellis **********

df[df[,1] == "A",]
also see help(subscript)


6666666666666666666666666666666666666666666666666666666666666666666666666666666

********** From: Bill Dunlap **********

> 6)  TRANSPOSE H to give:
>
> A  A
> a  b
> 1  1

   Transpose is harder because data frames must have only one type of data
   per column.  Our transpose function converts the data frame into a
   matrix (which has one type for the whole matrix) and returns the transpose
   of that matrix.  You would have to convert that back to a data frame:

   > data.frame(t(H))
          X1 X2
   Group1  A  A
   Group2  a  b
   Number  1  1


********** From: Ed Kademan **********

transposeH <- t(H)


********** From: Nick Ellis **********

t(df)


7777777777777777777777777777777777777777777777777777777777777777777777777777777

********** From: Bill Dunlap  **********

> 7) and finally SORT, by up to 3 keys.

   Use order() and subscripting.  E.g.,
   > G[ order(G$Group1, G$Group2, G$Number), ]
     Group1 Group2 Number
   1      A      a      1
   2      A      b      1
   3      B      a      2
   4      B      a      2
   5      B      b      2
   6      C      c      3
   > G[ order(G$Number, G$Group2), ]
     Group1 Group2 Number
   1      A      a      1
   2      A      b      1
   3      B      a      2
   4      B      a      2
   5      B      b      2
   6      C      c      3


********** From: Ed Kademan **********

dfr       <- as.data.frame(matrix(sample(5, 120, replace = T), 30, 4))
sortedDfr <- dfr[do.call('order', dfr[1:3]),]
## The first column is the primary key, the second is the secondary
## key, and the third is tertiary.


********** From: Nick Ellis **********

use order()

e.g. df[order(df$a,df$b,df$c),]


===============================================================================