Working with a data frame similar to this:
set.seed(100)
df <- data.frame(cat = c(rep("aaa", 5), rep("bbb", 5), rep("ccc", 5)), val = runif(15))
df <- df[order(df$cat, df$val), ]
df
cat val
1 aaa 0.05638315
2 aaa 0.25767250
3 aaa 0.30776611
4 aaa 0.46854928
5 aaa 0.55232243
6 bbb 0.17026205
7 bbb 0.37032054
8 bbb 0.48377074
9 bbb 0.54655860
10 bbb 0.81240262
11 ccc 0.28035384
12 ccc 0.39848790
13 ccc 0.62499648
14 ccc 0.76255108
15 ccc 0.88216552
I am trying to add a column with numbering within each group. Doing it this way obviously isn't using the powers of R:
df$num <- 1
for (i in 2:(length(df[,1]))) {
if (df[i,"cat"]==df[(i-1),"cat"]) {
df[i,"num"]<-df[i-1,"num"]+1
}
}
df
cat val num
1 aaa 0.05638315 1
2 aaa 0.25767250 2
3 aaa 0.30776611 3
4 aaa 0.46854928 4
5 aaa 0.55232243 5
6 bbb 0.17026205 1
7 bbb 0.37032054 2
8 bbb 0.48377074 3
9 bbb 0.54655860 4
10 bbb 0.81240262 5
11 ccc 0.28035384 1
12 ccc 0.39848790 2
13 ccc 0.62499648 3
14 ccc 0.76255108 4
15 ccc 0.88216552 5
What would be a good way to do this?
Use ave
, ddply
, dplyr
or data.table
:
df$num <- ave(df$val, df$cat, FUN = seq_along)
or:
library(plyr)
ddply(df, .(cat), mutate, id = seq_along(val))
or:
library(dplyr)
df %>% group_by(cat) %>% mutate(id = row_number())
or (the most memory efficient, as it assigns by reference within DT
):
library(data.table)
DT <- data.table(df)
DT[, id := seq_len(.N), by = cat]
DT[, id := rowid(cat)]
It might be worth mentioning that
ave
gives a float instead of an int here. Alternately, could changedf$val
toseq_len(nrow(df))
. I just ran into this over here: stackoverflow.com/questions/42796857/…Interestingly this
data.table
solution seems to be quicker than usingfrank
:library(microbenchmark); microbenchmark(a = DT[, .(val ,num = frank(val)), by = list(cat)] ,b =DT[, .(val , id = seq_len(.N)), by = list(cat)] , times = 1000L)
Thanks! The
dplyr
solution is good. But if, like me, you kept getting weird errors when trying this approach, make sure that you are not getting conflicts betweenplyr
anddplyr
as explained in this post It can be avoided by explicitly callingdplyr::mutate(...)
another
data.table
method issetDT(df)[, id:=rleid(val), by=.(cat)]
How to modify
library(plyr)
andlibrary(dplyr)
answers to make the ranking val column in descending order?