Considering a dataset such as the classic mtcars, I want to know the number of observations (rows) for each level of several factors, both for each factor separately and for their combinations.
For example, the following code generates a column N with the number of observations for each combination of cyl and gear, but not the number of observations for cyl and gear separately.
mtcars %>% dplyr::group_by(cyl, gear) %>% dplyr::summarise(N = n())
I know that the separate counts for cyl and gear can be obtained in a similar way, by creating separate data frames and joining them all together. The following generates the expected output:
df <- mtcars %>% dplyr::group_by(cyl, gear) %>% dplyr::summarise(N = n())
df_gear <- mtcars %>% dplyr::group_by(gear) %>% dplyr::summarise(Ngear = n())
df_cyl <- mtcars %>% dplyr::group_by(cyl) %>% dplyr::summarise(Ncyl = n())
df %>% dplyr::left_join(df_cyl, by = "cyl") %>% dplyr::left_join(df_gear, by = "gear")
But I am wondering if there is a cleaner way to generate this dataset, ideally without creating intermediate data frames.
Here is one way that you might approach this, relying on mutate() and ave() instead of group_by() and summarise() for compactness:
library(dplyr)
mtcars %>%
  mutate(n = ave(cyl, cyl, gear, FUN = length),
         n_cyl = ave(cyl, cyl, FUN = length),
         n_gear = ave(gear, gear, FUN = length)) %>%
  select(gear, cyl, n, n_cyl, n_gear) %>%
  distinct()
  gear cyl  n n_cyl n_gear
1    4   6  4     7     12
2    4   4  8    11     12
3    3   6  2     7     15
4    3   8 12    14     15
5    3   4  1    11     15
6    5   4  2    11      5
7    5   8  2    14      5
8    5   6  1     7      5
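If you would rather stay within dplyr's counting helpers, another possible sketch uses count() and add_count(). It assumes a reasonably recent dplyr in which both functions accept the name argument; the object name counts is just an illustration, and the column names N, Ncyl and Ngear are taken from the question. The combinations are counted once, and the per-factor totals are then obtained by summing those counts (wt = N) within cyl and within gear:

```r
library(dplyr)

# Count each cyl/gear combination once, then add the per-factor totals
# by summing those counts within cyl and within gear.
counts <- mtcars %>%
  count(cyl, gear, name = "N") %>%
  add_count(cyl, wt = N, name = "Ncyl") %>%
  add_count(gear, wt = N, name = "Ngear")

print(counts)
```

This keeps everything in a single pipeline with no intermediate data frames. Unlike the ave() version, the rows come back sorted by cyl and gear rather than in order of first appearance.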