This article is reproduced from stackoverflow.com.

How to group by consecutive rows in an R data frame?

Published 2020-03-27 10:31:40

I have a data frame of time series data with the columns TimeStamp, Type, and Value. Type indicates whether a row is a peak or a valley. I want to:

1. Group all rows by consecutive runs of the same type.
2. For each group of "peak" type, select the highest value.
3. For each group of "valley" type, select the lowest value.
4. Filter the data frame down to these highest/lowest rows.

Expectation: a data frame whose rows alternate between the highest peak and the lowest valley.

The only way I know how to do this is with a for loop: collect consecutive values into a vector, take the max, append it to a new data frame, and so on.

For those who know Python, this is what I did there (I need to port the code to R, though):

# Rows in the same run of identical pv_type share one group id
run_id = segmentation.pv_type.ne(segmentation.pv_type.shift()).cumsum()
segmentation['min_v'] = segmentation.groupby(run_id).price.transform('min')
segmentation['max_p'] = segmentation.groupby(run_id).price.transform('max')

EDIT

Sample data set:

types <- c('peak', 'peak', 'valley', 'peak', 'valley', 'valley', 'valley')
values <- c(1.01,   1.00,    0.4,     1.2,     0.3,      0.1,      0.2)
segmentation <- data.frame(types, values)
segmentation

expectedTypes <- c('peak', 'valley', 'peak', 'valley')
expectedValues <- c(1.00, 0.4, 1.2, 0.1 )
expectedResult <- data.frame(expectedTypes, expectedValues)
expectedResult

I don't know a better way to generate the data.

Questioner: Fred Johnson
Viewed: 55

akrun 2019-07-04 03:18

In R, one approach with dplyr is to build a grouping column from the cumulative sum of the logical comparison between 'pv_type' and its lag, and then add the min and max of 'price' within each group as two new columns:

library(dplyr)
segmentation %>%
       # the group id increases each time pv_type differs from the previous row
       group_by(pv_type_group = cumsum(pv_type != lag(pv_type,
                 default = first(pv_type)))) %>%
       mutate(min_v = min(price), max_p = max(price))
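
To see what such a grouping column evaluates to, here is a minimal base R sketch on the 'types' vector from the question's sample data. The names `changed` and `run_id` are illustrative only and do not appear in the original post. Each time a value differs from the one before it, the comparison is TRUE and the cumulative sum moves on to the next group id.

```r
# Sample 'types' vector from the question
types <- c('peak', 'peak', 'valley', 'peak', 'valley', 'valley', 'valley')

# TRUE wherever an element differs from its predecessor; the first
# element has no predecessor, so it is treated as "no change" (FALSE)
changed <- c(FALSE, types[-1] != types[-length(types)])

# The cumulative sum of the change flags yields one id per consecutive run
run_id <- cumsum(changed)
run_id
# 0 0 1 2 3 3 3
```

The ids start at 0 here, as they do in the dplyr version above when the lag default is the first value; only the fact that each run gets a distinct id matters for grouping.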

Update

With the OP's example, the expected output is summarised (one row per group), so we use summarise instead of mutate. This version also uses rleid (from data.table) instead of the logical cumulative sum.

library(dplyr)
library(data.table)
segmentation %>%
    # rleid() gives each run of consecutive identical types its own id
    group_by(grp = rleid(types)) %>%
    summarise(types = first(types), expectedvalues = min(values)) %>%
    ungroup() %>%
    select(-grp)
# A tibble: 4 x 2
#  types  expectedvalues
# <fct>           <dbl>
#1 peak              1  
#2 valley            0.4
#3 peak              1.2
#4 valley            0.1
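
The question's stated goal is the highest value in each run of peaks and the lowest in each run of valleys, whereas taking min for every group happens to reproduce the posted expected result. A sketch of the type-dependent variant, written in base R so that it needs no packages, might look like the following. It assumes the sample 'types' and 'values' columns from the question; the names `runs`, `run_id`, and `pick` are hypothetical helpers, not part of the original answer. Under this rule the first peak run would yield 1.01 rather than 1.00.

```r
# Sample data from the question
types  <- c('peak', 'peak', 'valley', 'peak', 'valley', 'valley', 'valley')
values <- c(1.01, 1.00, 0.4, 1.2, 0.3, 0.1, 0.2)

# Run-length encoding of consecutive identical types
runs <- rle(types)

# One id per run, repeated once for every row in that run
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# Hypothetical helper: max for a run of peaks, min for a run of valleys
pick <- function(v, type) if (type == 'peak') max(v) else min(v)

# One row per run: its type and the selected value
result <- data.frame(
  types  = runs$values,
  values = unname(mapply(pick, split(values, run_id), runs$values))
)
result
```

If the posted expected output (1.00 for the first peak run) is what is actually wanted, replacing `pick` with plain `min` gives the same values as the summarise call above.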