lundi 20 septembre 2021

Conditional filtering and data transformation in R; removing rows and adding additional columns based on group and value

I am trying to transform an existing dataset by filtering out rows that I no longer need based on condition and by group. Subsequently, I would like to add additional column which essentially is a status column and an activity instance column. Here is the dataframe:-

rawdata<-structure(list(DateTime = c("20/02/2021 13:00", "20/02/2021 14:00", 
                                     "20/02/2021 15:00", "20/02/2021 16:00", "20/02/2021 17:00", "20/02/2021 18:00", 
                                     "20/02/2021 19:00", "20/02/2021 20:00", "20/02/2021 21:00", "20/02/2021 22:00", 
                                     "20/02/2021 23:00", "21/02/2021 00:00", "01/03/2021 00:00", "01/03/2021 01:00", 
                                     "01/03/2021 02:00", "01/03/2021 03:00", "01/03/2021 04:00", "01/03/2021 05:00", 
                                     "01/03/2021 06:00", "01/03/2021 07:00", "01/03/2021 08:00", "01/03/2021 09:00", 
                                     "01/03/2021 10:00", "01/03/2021 11:00", "01/03/2021 12:00", "01/03/2021 13:00"
), Cluster = c("Cluster 3", "Cluster 3", "Cluster 3", "Cluster 3", 
               "NotActive", "NotActive", "NotActive", "Cluster 2", "Cluster 1", 
               "Cluster 3", "NotActive", "NotActive", "NotActive", "Cluster 5", 
               "Cluster 5", "Cluster 4", "NotActive", "NotActive", "NotActive", 
               "NotActive", "Cluster 2", "Cluster 2", "Cluster 3", "NotActive", 
               "NotActive", "NotActive"), UserID = c("AAA", "AAA", "AAA", "AAA", 
                                                     "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", 
                                                     "BBB", "BBB", "BBB", "BBB", "BBB", "BBB", "BBB", "BBB", "BBB", 
                                                     "BBB", "BBB", "BBB", "BBB")), class = "data.frame", row.names = c(NA, 
                                                                                                                       -26L))

print(rawdata)
           DateTime   Cluster UserID
1  20/02/2021 13:00 Cluster 3    AAA
2  20/02/2021 14:00 Cluster 3    AAA
3  20/02/2021 15:00 Cluster 3    AAA
4  20/02/2021 16:00 Cluster 3    AAA
5  20/02/2021 17:00 NotActive    AAA
6  20/02/2021 18:00 NotActive    AAA
7  20/02/2021 19:00 NotActive    AAA
8  20/02/2021 20:00 Cluster 2    AAA
9  20/02/2021 21:00 Cluster 1    AAA
10 20/02/2021 22:00 Cluster 3    AAA
11 20/02/2021 23:00 NotActive    AAA
12 21/02/2021 00:00 NotActive    AAA
13 01/03/2021 00:00 NotActive    AAA
14 01/03/2021 01:00 Cluster 5    BBB
15 01/03/2021 02:00 Cluster 5    BBB
16 01/03/2021 03:00 Cluster 4    BBB
17 01/03/2021 04:00 NotActive    BBB
18 01/03/2021 05:00 NotActive    BBB
19 01/03/2021 06:00 NotActive    BBB
20 01/03/2021 07:00 NotActive    BBB
21 01/03/2021 08:00 Cluster 2    BBB
22 01/03/2021 09:00 Cluster 2    BBB
23 01/03/2021 10:00 Cluster 3    BBB
24 01/03/2021 11:00 NotActive    BBB
25 01/03/2021 12:00 NotActive    BBB
26 01/03/2021 13:00 NotActive    BBB

And for further explanation, here is the desired output:-

desiredoutput<-structure(list(DateTime = structure(c(1613826000, 1613829600, 
                                                     1613833200, 1613836800, 1613840400, 1613851200, 1613854800, 1613858400, 
                                                     1613862000, 1614560400, 1614564000, 1614567600, 1614571200, 1614585600, 
                                                     1614589200, 1614592800, 1614596400), class = c("POSIXct", "POSIXt"
                                                     ), tzone = "UTC"), Cluster = c("Cluster 3", "Cluster 3", "Cluster 3", 
                                                                                    "Cluster 3", "NotActive", "Cluster 2", "Cluster 1", "Cluster 3", 
                                                                                    "NotActive", "Cluster 5", "Cluster 5", "Cluster 4", "NotActive", 
                                                                                    "Cluster 2", "Cluster 2", "Cluster 3", "NotActive"), UserID = c("AAA", 
                                                                                                                                                    "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "BBB", 
                                                                                                                                                    "BBB", "BBB", "BBB", "BBB", "BBB", "BBB", "BBB"), Status = c("Start", 
                                                                                                                                                                                                                 "Ongoing", "Ongoing", "Ongoing", "Complete", "Start", "Ongoing", 
                                                                                                                                                                                                                 "Ongoing", "Complete", "Start", "Ongoing", "Ongoing", "Complete", 
                                                                                                                                                                                                                 "Start", "Ongoing", "Ongoing", "Complete"), Instance = c(1, 2, 
                                                                                                                                                                                                                                                                          3, 4, 5, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4)), row.names = c(NA, 
                                                                                                                                                                                                                                                                                                                                       -17L), class = c("tbl_df", "tbl", "data.frame"))

print(desiredoutput)
   DateTime            Cluster   UserID Status   Instance
   <dttm>              <chr>     <chr>  <chr>       <dbl>
 1 2021-02-20 13:00:00 Cluster 3 AAA    Start           1
 2 2021-02-20 14:00:00 Cluster 3 AAA    Ongoing         2
 3 2021-02-20 15:00:00 Cluster 3 AAA    Ongoing         3
 4 2021-02-20 16:00:00 Cluster 3 AAA    Ongoing         4
 5 2021-02-20 17:00:00 NotActive AAA    Complete        5
 6 2021-02-20 20:00:00 Cluster 2 AAA    Start           1
 7 2021-02-20 21:00:00 Cluster 1 AAA    Ongoing         2
 8 2021-02-20 22:00:00 Cluster 3 AAA    Ongoing         3
 9 2021-02-20 23:00:00 NotActive AAA    Complete        4
10 2021-03-01 01:00:00 Cluster 5 BBB    Start           1
11 2021-03-01 02:00:00 Cluster 5 BBB    Ongoing         2
12 2021-03-01 03:00:00 Cluster 4 BBB    Ongoing         3
13 2021-03-01 04:00:00 NotActive BBB    Complete        4
14 2021-03-01 08:00:00 Cluster 2 BBB    Start           1
15 2021-03-01 09:00:00 Cluster 2 BBB    Ongoing         2
16 2021-03-01 10:00:00 Cluster 3 BBB    Ongoing         3
17 2021-03-01 11:00:00 NotActive BBB    Complete        4

What I am after is that in the Cluster column (by UserID), if the value says NotActive, keep the first NotActive value and discard the remaining NotActive rows until it is something else. As well, I would like to create a "Status" column, where the first value in this column corresponds to Start, until the first NotActive value, which will correspond to Complete; everything else in between Ongoing. Lastly, the Instance column is just numerated from 1==Start to N==Complete in the Status column.

Any help would be appreciated :)

Aucun commentaire:

Enregistrer un commentaire