Wednesday, 22 July 2020

How to test for the presence of a node (string) in an XML file, for use in a loop that extracts data from multiple files?

As a new R user I've been struggling with this problem for a while and could not figure it out on my own. Perhaps the answer is simple, and someone can help me. My challenge is that I have thousands of XML files in a folder, and I want to extract the content of a specific node from each of them and save it in a data frame. The files, however, repeat the name of my node of interest, so I used positions (numbers) instead of names to extract the data I want.

for (i in (1 : length(file_list))) {
  
test.file<- file_list[i]
datax<-xmlParse(test.file)  #enter the xml file name you want to analyze
data<-xmlToList(datax) #convert xml as a list
serial<-as.vector(unlist(data$.attrs[2]))
print(serial)

# Check if the xml file contains the node AuditRules

n <- ifelse(xml_find_all(test.file, "//AuditRules") == TRUE, 6, 5)

#Extract waveform values for the Current ECG strip

waveform <- as.vector(data[[n]][[3]][[1]][[1]][[2]])
waveform <- as.character(waveform)
waveform<-strsplit(waveform, split = " ")
waveform<-as.numeric(unlist(waveform))
waveform<-as.data.frame(waveform)

#Extract the serial number to be used as ID for the animal and create a column on the dataframe

serial<-as.vector(unlist(data$.attrs[2]))
serial<-as.factor(serial)
waveform$serial<-serial

#Extract date and time of Current ECG and save it as a column date

date<-as.vector(unlist(data$.attrs[n]))
date <- gsub("T", " ", date)
waveform$date <- as.POSIXct(date, format = "%Y-%m-%d %H:%M:%S", tz = 'Etc/GMT+5')

#Extract time offset [the first R-R interval from the Current ECG ]

offset <-as.vector(unlist(data[[n]][[3]][[1]][[1]][[1]][[1]]))
offset <- gsub("[a-zA-Z]+", "", offset)
waveform$offset <- offset

#Create a column for voltage in mV using the amplitudeScaleFactor="0.000815"
#waveform$mv <- waveform$waveform*0.000815

#Create a column for time (sec) using the sampleInterval="PT0.0078125S"

#waveform$time <- as.numeric(waveform$offset)
# add a new column to old data.frame. Set value "offset"  as the starting value for row 1.

# populate newcol with values starting from row 2.

#for (i in 4:nrow(waveform)){
#  waveform[i,6] <- waveform[i-1,6] +0.0078125

# Write data to CSV  
 
 write.csv(waveform, paste0(data_export_dir,"/Savannah_", file_names[i],"_ECG.csv"))
}

My problem: some files have one extra node before the node of interest, which is normally at position [5]. For those files I need to use position [6] instead. My question: how can I change the above code to include a condition (presence or absence of the extra node) and switch between [5] and [6] accordingly? I tried to add something like this to my loop, but it did not work:

for (i in (1 : length(file_list))) {
  
test.file<- file_list[i]
datax<-xmlParse(test.file)  
data<-xmlToList(datax) #convert xml as a list
serial<-as.vector(unlist(data$.attrs[2]))
print(serial)

# Check if the xml file contains the node AuditRules

n <- ifelse(xml_find_all(test.file, "//AuditRules") == TRUE, 6, 5)

#Extract waveform values for the Current ECG strip

waveform <- as.vector(data[[n]][[3]][[1]][[1]][[2]])
waveform <- as.character(waveform)
waveform<-strsplit(waveform, split = " ")
waveform<-as.numeric(unlist(waveform))
waveform<-as.data.frame(waveform)

I would appreciate any help! Thanks in advance.
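
One possible way to do the check, as a rough sketch rather than a tested solution for these files: xml_find_all() comes from the xml2 package, expects a parsed document rather than a file path, and returns a node set rather than TRUE/FALSE, which is likely why the comparison with TRUE fails. Since the loop already parses each file with xmlParse() from the XML package, the same parsed object can be queried with getNodeSet() and the number of matches counted. The helper name has_node and the two tiny XML strings below are made up for illustration, and the local-name() XPath is an assumption that sidesteps any default namespace the real files might declare.

```r
library(XML)

# Hypothetical helper: TRUE if the parsed document contains at least one
# element with the given name, ignoring namespaces.
has_node <- function(doc, name) {
  xpath <- sprintf("//*[local-name()='%s']", name)
  length(getNodeSet(doc, xpath)) > 0
}

# Two made-up documents standing in for the real ECG files.
with_audit    <- xmlParse("<Root><AuditRules/><Strip/></Root>", asText = TRUE)
without_audit <- xmlParse("<Root><Strip/></Root>", asText = TRUE)

# Pick the position index depending on whether the extra node is present.
n_with    <- if (has_node(with_audit, "AuditRules")) 6 else 5
n_without <- if (has_node(without_audit, "AuditRules")) 6 else 5
```

Inside the loop, the equivalent line would be n <- if (has_node(datax, "AuditRules")) 6 else 5, placed after datax is created. A plain if/else is used instead of ifelse() because the condition is a single value, not a vector.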
