I would like to remove rows from my dataframe if the contents across multiple columns do not match varying user criteria.
The following repex should cover the complexity of what I am trying to achieve.
install.packages("dplyr")
install.packages("purrr")
# Create user criteria (UC) data.
UC <- data.frame(
Series = 1:5,
Unit = c("cm","mm",NA,NA,"cm"),
Month = c(NA,NA,"Jan",NA,"Feb"),
Height = c(3,NA,NA,3,1)
)
# Create range of scenarios (RS) but only consider two series initially.
set.seed(2)
num_series <- 2
RS <- data.frame(
Series = sample(c(1:5), num_series, replace=TRUE),
Unit = sample(c("cm","mm"), num_series, replace=TRUE),
Month = sample(c("Jan","Feb","Mar","Apr"), num_series, replace=TRUE),
Height = sample(c(1:3), num_series, replace=TRUE)
)
# Identify applicable critera for matching (AC).
AC <- dplyr::filter(UC, UC$Series %in% unique(RS$Series))
AC <- AC[, !purrr::map_lgl(AC, ~all(is.na(.))), drop=FALSE]
# Combine the scenario data and the applicable criteria.
SC <- merge(x=RS, y=AC, by="Series", all.x=TRUE)
# Function to identify rows for removal.
fn_remove_row <- function(cols, rm) {
x <- paste0(cols,".x")
y <- paste0(cols,".y")
rm$remove <- ifelse(rm$remove == 0 && !is.na(rm[[x]]) && rm[[y]] != rm[[x]], 1, rm$remove)
rm[[y]] <- NULL
setnames(rm, paste0(cols,".x"), cols)
}
# Identify columns to be considered for matching for the given scenarios.
cols <- as.list(gsub("\\.y","",grep("\\.y", names(SC), value=TRUE)))
# Final dataframe (with option to filter).
SC$remove <- 0 # Initial values.
df <- cbind(data.frame(lapply(cols, fn_remove_row, rm=SC)))
#df <- dplyr::filter(df, remove == 0)
Some explanations on my code:
• The UC dataframe provides some example rules that may or may not apply to specific columns depending on the series observed in the data.
• The RS dataframe creates different data scenarios that need to be catered for although I initially only consider two scenarios. The num_series parameter can be increased to provide more scenarios.
• The AC dataframe subsets the user criteria to only select the columns that apply to the observed scenarios.
• The SC dataframe combines the scenario data and the applicable criteria. Columns with rules to be applied will be identified with a .x suffix (the original data) and a .y suffix (the criteria).
• I have created a function to consider the required columns in turn and check if the values match. If they do not match then the row will be flagged with a “1” to indicate it is to be removed. If the criteria value is missing (NA) for a particular column then there is no need to make a match in that case. After making the check, the criteria column is removed and the original data column is renamed to remove the suffix.
• I use lapply to create a final dataframe (df) containing a column to be filtered. The filter is not currently applied as the flags are not being created correctly.
The input dataframes (created with seed=2) are:
> UC > RS
Series Unit Month Height Series Unit Month Height
1 cm <NA> 3 1 mm Apr 1
2 mm <NA> NA 4 cm Apr 3
3 <NA> Jan NA
4 <NA> <NA> 3
5 cm Feb 1
Since RS contains series 1 and 4, AC is created to also contain these series and only the columns that remain applicable:
> AC
Series Unit Height
1 cm 3
4 <NA> 3
The merge combines RS and AC as desired and the remove flag is initialised:
> SC
Series Unit.x Month Height.x Unit.y Height.y remove
1 mm Apr 1 cm 3 0
4 cm Apr 3 <NA> 3 0
In this case I want to flag series 1 for removal because Unit.x does not equal Unit.y but if they did match it would still be flagged because Height.x does not equal Height.y. The Month column does not come into the equation because there was no applicable criteria for these two series.
Series 4 would not be flagged because the Unit comparison is not applicable (Unit.y = NA) and the Height comparison gives a match.
In the end I want (before filtering):
> df
Series Unit Month Height remove
1 mm Apr 1 1
4 cm Apr 3 0
But what I am getting from the lapply is repeated columns despite no return() call and various unshown cbind attempts:
> df
Series Unit Month Height.x Height.y remove Series.1 Unit.x Month.1 Height Unit.y remove.1
1 mm Apr 1 3 1 1 mm Apr ...
Is lapply the wrong function for looping through the applicable columns or can it be made to work? It feels to me like there is just a tiny crucial element missing.
A full solution test should use different seeds and increase num_series.
Aucun commentaire:
Enregistrer un commentaire