Tuesday, 12 November 2019

Data Cleaning Function: Replacing powers of ten with the median power

In forestry, hand-held measuring devices frequently produce decimal-point errors through mishandling. When these are not corrected during data input, the result is obviously wrong records, such as a tree whose measured height goes 24 cm, 250 cm, 26 cm.

My idea was to write a filtering function that converts each height to scientific notation, ignores NAs, and checks whether the value lies within the range of the previous and subsequent values. If it does not, the function replaces the exponent with a power of ten that matches the others (the median exponent, applied only when median == mode, for safety), e.g. 2.4e+1, 2.5e+2, 2.6e+1 -> 2.4e+1, 2.5e+1, 2.6e+1.

I quickly realised that a plain if/else chain does not work well here, as if/else is not vectorised in R, which is why I wrapped my function in Vectorize() rather than writing a deeply nested ifelse().
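To illustrate the difference between the two constructs, here is a minimal sketch (the names `x` and `flag` and the threshold of 100 are made up for the example): `if ()` expects a single TRUE/FALSE, whereas `ifelse()` tests every element of a vector.

```r
x <- c(24, 250, 26)   # heights in cm from the example above

# ifelse() evaluates the condition element by element and returns a vector.
flag <- ifelse(x > 100, "suspect", "ok")

# By contrast, if (x > 100) ... would only accept a length-one condition.
print(flag)
```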

This is what I've got so far:

I take a test vector, convert it to scientific notation, split the shorthand value and create lead and lag variables. A copied function finds the mode.

As formatC() returns the scientific notation as a character string, I convert it back to numeric before running the if statements that check whether a value is in range. If it is not, I replace the exponent with the mode exponent.
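The string round trip described here can be sketched in isolation as follows. The names `v`, `sci`, `mantissa` and `exponent` are only for illustration, and the same two regular expressions as in the function further down are used.

```r
v <- c(2e+2, 2.3e+4)

# formatC() with format = "e" returns character strings in scientific notation.
sci <- formatC(v, format = "e")

# Everything before the "e" is the mantissa, everything after it the exponent.
mantissa <- as.numeric(sub("e.*", "", sci))
exponent <- as.numeric(sub("^.*e", "", sci))

# Rebuild the numbers to confirm that nothing was lost in the split.
rebuilt <- mantissa * 10^exponent
```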

When I run the function, however, I still get a lot of errors, most notably ones stating that Vectorscientific[i,"leader"] or similar expressions have the incorrect number of dimensions. What am I doing wrong?
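One plausible source of this message can be reproduced on a toy data frame (`df`, `visited` and `res` are hypothetical names). A `for` loop over a data frame visits its columns rather than its row numbers, and a function wrapped in `Vectorize()` is likewise handed one column at a time when given a data frame, so two-subscript indexing inside it is applied to a plain vector. This is a sketch of the mechanism, not a full diagnosis of the function below.

```r
df <- data.frame(value = c(1, 2), leader = c(2, NA))

# A for loop over a data frame visits its columns, not its row indices.
visited <- character(0)
for (col in df) visited <- c(visited, class(col))
# visited now holds one class name per column

# Each column is a plain vector, so indexing it with [row, "name"] fails.
res <- tryCatch(df$value[1, "leader"],
                error = function(e) conditionMessage(e))
print(res)  # the error message, as text
```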

Test vector

Vector <- c(2e+2, 2.1e+2, 2.2e+2, 2.3e+4, 2.4e+2)

Create Magnitude Filter

magnitudefilter <- function(Vector){

  Vectorscientific <- data.frame(Vectorscientific=formatC(Vector, format = "e"))
  Vectorscientific$leader <- dplyr::lead(Vectorscientific$Vectorscientific,1)
  Vectorscientific$lagger <- dplyr::lag(Vectorscientific$Vectorscientific,1)

  Vectorscientific$shorthandvalue <- gsub("e.*","",Vectorscientific$Vectorscientific)

  medianexponent <-  median(as.numeric(gsub("^.*e","",Vectorscientific$Vectorscientific)))

  getmode <- function(v) {
    uniqv <- unique(v)
    uniqv[which.max(tabulate(match(v, uniqv)))]
  }

  modeexponent <-  getmode(as.numeric(gsub("^.*e","",Vectorscientific$Vectorscientific)))

  Vectorscientific$Vectorscientific <- as.numeric(as.character(Vectorscientific$Vectorscientific))

  ## Create sorting chain
  sortingchain <- function(Vectorscientific){
    # if lead NA
    if(is.na(Vectorscientific[i,"leader"])){
      Vectorscientific[i,"Vectorscientific"] <- Vectorscientific[i,"Vectorscientific"]
    }
    # if lag NA
    else if(is.na(Vectorscientific[i,"lagger"])){
      Vectorscientific[i,"Vectorscientific"] <- Vectorscientific[i,"Vectorscientific"]
    }
    # if in range
    else if(Vectorscientific[i,"Vectorscientific"] >= Vectorscientific[i,"lagger"] &
            Vectorscientific[i,"Vectorscientific"] <= Vectorscientific[i,"leader"]){
      Vectorscientific[i,"Vectorscientific"] <- Vectorscientific[i,"Vectorscientific"]
    }
    # if replace exponent
    else {
      Vectorscientific[i,"Vectorscientific"] <- paste0(Vectorscientific[i,"shorthandvalue"], "e+",medianexponent)
    }
  }

  # Vectorize sorting chain (if/else not vectorised in R)
  vectorizedsort <- Vectorize(sortingchain)

  if(identical(modeexponent, medianexponent)){
    for(i in Vectorscientific){
      vectorizedsort(Vectorscientific[i,])
    }
  }

  return(Vectorscientific$Vectorscientific)
}


magnitudefilter(Vector)
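
For comparison, the intended behaviour can be sketched without string conversion, row-wise loops or Vectorize(). This is a hypothetical rewrite and not the function above: `fix_magnitude` and its internal names are invented here. It assumes strictly positive measurements, and it treats "in range" as lying between the two neighbours in either order, which is slightly looser than the lag <= value <= lead test above. The first and last values are never changed because they lack one neighbour.

```r
# Hypothetical rewrite; assumes strictly positive measurements.
fix_magnitude <- function(x) {
  exponent <- floor(log10(abs(x)))        # power of ten of each value
  med <- median(exponent, na.rm = TRUE)   # median exponent

  # Most frequent exponent (mode); table() ignores NAs by default.
  counts <- table(exponent)
  mod <- as.numeric(names(counts)[which.max(counts)])

  # Safety check from the text: only act when median and mode agree.
  if (!isTRUE(all.equal(med, mod))) return(x)

  n <- length(x)
  prev <- c(NA, x[-n])                    # previous value (lag by one)
  nxt <- c(x[-1], NA)                     # next value (lead by one)
  in_range <- x >= pmin(prev, nxt) & x <= pmax(prev, nxt)

  # Fix values that have both neighbours, fall outside them,
  # and carry an exponent different from the median.
  fix <- !is.na(in_range) & !in_range & exponent != med
  x[fix] <- x[fix] / 10^(exponent[fix] - med)
  x
}

fix_magnitude(c(2e+2, 2.1e+2, 2.2e+2, 2.3e+4, 2.4e+2))
# expected: 200 210 220 230 240
```

Applied to the heights from the introduction, `fix_magnitude(c(24, 250, 26))` should return 24, 25, 26.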
