Tuesday, May 23, 2017

Removing duplicates in R (20K observations)

I am currently working with a large data set of water rights that contains duplicates. Some duplicates are helpful to me, but others have no relevance. For example, there are double entries wherever a metal tag number was assigned to a specific water right. To avoid double counting the critical information, I need to delete one observation from each such pair.

This is what I have written at the moment:

# Updated Metal Tag Number
for(i in 1:duplicate.rights){
  met.tag <- if( [i, "RightID"]==[i-1, "RightID"] & [i,"MetalTagNu"]=![i-1, "MetalTagNu"] ){
    remove(i)
  }
}

I know there are a lot of syntax errors, but I am not sure how to write this correctly. The idea is to run a for loop through my data set and store the result in the data frame met.tag: if the RightID is identical to the previous row's and the metal tag isn't, remove the first of the two observations.
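To make the intended logic concrete, here is a minimal sketch of what such a loop might look like once the syntax is valid. It assumes duplicate.rights is the data frame itself, that rows belonging to the same right are already adjacent, and that neither column contains NA values; the toy values are invented purely for illustration, and only the column names RightID and MetalTagNu come from the attempt above.

```r
# Toy stand-in for the real data. The values are made up; only the column
# names RightID and MetalTagNu are taken from the question.
duplicate.rights <- data.frame(
  RightID    = c(101, 101, 102, 103, 103),
  MetalTagNu = c("A1", "A2", "B1", "C1", "C1"),
  stringsAsFactors = FALSE
)

# One flag per row; TRUE marks a row that should be removed.
drop <- logical(nrow(duplicate.rights))

# Skip row 1 so that row i - 1 always exists.
for (i in seq_len(nrow(duplicate.rights))[-1]) {
  same_right <- duplicate.rights[i, "RightID"] == duplicate.rights[i - 1, "RightID"]
  new_tag    <- duplicate.rights[i, "MetalTagNu"] != duplicate.rights[i - 1, "MetalTagNu"]
  if (same_right && new_tag) {
    drop[i - 1] <- TRUE  # flag the earlier of the two observations
  }
}

# Subset once at the end instead of deleting rows while looping.
met.tag <- duplicate.rights[!drop, ]
```

Rows are flagged inside the loop and removed in a single subsetting step afterwards, so the row indices stay stable while the loop is running.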

I am quite the novice, so please forgive my poor attempt above. I understand that I could use the lapply function to make this faster and more efficient. Any guidance there would be much appreciated.
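For roughly 20K rows, a vectorized comparison of each row with its neighbour is probably a simpler route than lapply. The sketch below is one possible way to do that, under the same assumptions as the description above: the rows for a given right sit next to each other, the columns have no NA values, and the toy values are invented for illustration.

```r
# Toy stand-in for the real data. The values are made up; only the column
# names RightID and MetalTagNu are taken from the question.
duplicate.rights <- data.frame(
  RightID    = c(101, 101, 102, 103, 103),
  MetalTagNu = c("A1", "A2", "B1", "C1", "C1"),
  stringsAsFactors = FALSE
)

right <- duplicate.rights$RightID
tag   <- duplicate.rights$MetalTagNu

# Compare every row except the last with the row that follows it.
# head(x, -1) drops the last element, tail(x, -1) drops the first.
superseded <- c(
  head(right, -1) == tail(right, -1) & head(tag, -1) != tail(tag, -1),
  FALSE  # the last row has no following row, so it is always kept
)

# Keep only the rows that are not followed by the same right with a new tag.
met.tag <- duplicate.rights[!superseded, ]
```

If the data are not already ordered by right, sorting first (for example with duplicate.rights[order(duplicate.rights$RightID), ]) would be needed before the neighbour comparison means anything.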
