I have the following mess of code, which works for what I want to accomplish. However, in reading about R, I'm seeing over and over again that for loops are incredibly slow, and to avoid them whenever possible.
My actual datasets that I need to apply this code to contain 2,000,000+ data points, so speed is a significant concern. I've been reading up on mapply, but I am brand new to coding, and really unsure of how to make it work, since I have if/else statements within my for loops.
Moreover, if I understand correctly, mapply input has to have a function with the same number of variables as the number of arguments you input? My function sum.bin only has one variable, but I want to apply this over 3 sets of data - length.feature, num.bins, and length.data. Any suggestions?
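For reference, here is a minimal sketch of how mapply pairs up its inputs. The function add3 and the toy vectors are hypothetical and are not part of the code in this post. mapply walks its vector arguments in parallel (the first element of each, then the second element of each, and so on), so the function needs one parameter per vector supplied, and it does not by itself reproduce nested loops.

```r
# Hypothetical toy function with three arguments (illustration only).
add3 <- function(a, b, c) a + b + c

# mapply calls add3(1, 10, 100), then add3(2, 20, 200), then add3(3, 30, 300).
res <- mapply(add3, 1:3, c(10, 20, 30), c(100, 200, 300))
print(res)
```

Note that mapply still runs an R-level loop internally, so on its own it is unlikely to be much faster than an equivalent for loop.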
for(x in 1:length.feature){
  for(y in 1:num.bins){
    sum.bin <- 0
    count <- 0
    bin.start <- feature.bins[x,y]
    bin.end <- feature.bins[x,(y + 1)]
    for(i in j:length.data){
      if(data.arm[i] == feature.arm[x]){
        if(data.position[i] < bin.start){next}
        if(data.position[i] > bin.end){break}
        sum.bin <- sum.bin + data.value[i]
        count <- count + 1
        z <- i
      }
      else{next}
    }
    j <- z - count
    if(j < 1){j <- 1}
    feature.value[n] <- sum.bin
    n <- n + 1
  }
}
My input data is essentially 3 data frames, which in the following code, I break apart into smaller pieces (e.g. length.feature <- dim(feature)[1]) to work with.
data: data frame, 2,772,122 rows x 3 columns
feature: data frame, 8,538 rows x 6 columns
feature.bins: data frame, 8,538 rows x 101 columns
The output that I'm looking for is a matrix or data frame 8538 rows by 100 columns, containing the sum.bin value for each feature and bin.
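One possible loop-free direction, sketched under assumptions rather than offered as a definitive answer: for each arm, sort the positions, take a cumulative sum of the values, and use findInterval to look up how many positions fall at or below each bin end and strictly below each bin start. The difference of the two cumulative sums is then the sum of values inside the bin, with both boundaries inclusive as in the loop version. The function name bin_sums and the toy data are hypothetical, and the sketch assumes each bin start is no greater than its bin end and that there are no missing values.

```r
# Hypothetical helper (illustration only): sums data.value per feature bin,
# counting positions with bin.start <= position <= bin.end on the matching arm.
bin_sums <- function(data.arm, data.position, data.value,
                     feature.arm, feature.bins) {
  feature.bins <- as.matrix(feature.bins)
  num.bins <- ncol(feature.bins) - 1
  out <- matrix(0, nrow = nrow(feature.bins), ncol = num.bins)
  for (arm in unique(feature.arm)) {          # loops over arms only, not data points
    rows <- which(feature.arm == arm)
    keep <- data.arm == arm
    ord <- order(data.position[keep])
    pos <- data.position[keep][ord]
    cs <- c(0, cumsum(data.value[keep][ord])) # cs[k + 1] = sum of the first k values
    starts <- feature.bins[rows, 1:num.bins, drop = FALSE]
    ends <- feature.bins[rows, 2:(num.bins + 1), drop = FALSE]
    hi <- findInterval(ends, pos)                      # count of positions <= end
    lo <- findInterval(starts, pos, left.open = TRUE)  # count of positions < start
    out[rows, ] <- cs[hi + 1] - cs[lo + 1]
  }
  out
}

# Toy data: two arms, two features, two bins per feature.
data.arm <- c("A", "A", "A", "B", "B")
data.position <- c(1, 5, 10, 2, 8)
data.value <- c(1, 2, 3, 4, 5)
feature.arm <- c("A", "B")
feature.bins <- matrix(c(0, 5, 10,
                         0, 4, 8), nrow = 2, byrow = TRUE)

res <- bin_sums(data.arm, data.position, data.value, feature.arm, feature.bins)
print(res)
```

On the toy data, the first feature's bins [0, 5] and [5, 10] on arm A give 3 and 5, and the second feature's bins [0, 4] and [4, 8] on arm B give 4 and 5. The remaining for loop runs once per distinct arm, so its cost should be small compared with looping over every data point.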