I have a dataset that is 521 variables wide and 17M observations deep. Yes, it's big.
Today I have been asked to add two new variables to the mix. The first variable should contain either TRUE or FALSE: TRUE if any of the variables in the row contain special characters, FALSE if not.
The second variable needs to list, in numeric order, the positions of the variables that contain special characters. If variables 1, 13 and 251 contain special characters on that row, the new variable should read "1, 13, 251".
This way the data controller can filter to the cases where there are special characters, identify where in the chain the issue is and then fix at the source.
I can find numerous examples of how to remove special characters, and indeed I do so regularly with a mix of regex and gsub, both at data frame level and at individual variable level, but nowhere can I find an example of how to create either of the two required new variables.
Any help would be greatly appreciated. The special characters in question are anything that is not alphanumeric; the arguments (x, "[^[:alnum:]]", " ") have covered me when I removed these from this dataset in the past.
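To make the request concrete, here is a minimal sketch in base R of what the two new variables might look like on a toy data frame. The data, the column names `a`, `b`, `c`, and the output names `has_special` and `special_cols` are all made up for illustration; only the pattern "[^[:alnum:]]" comes from the question. It walks the columns one at a time rather than building a full rows-by-columns logical matrix, on the assumption that memory matters at 17M rows.

```r
# Toy stand-in for the real 521-column data; names and values are hypothetical.
df <- data.frame(
  a = c("abc", "a-b", "xyz"),
  b = c("123", "456", "7 8"),
  c = c("ok", "n/a", "fine"),
  stringsAsFactors = FALSE
)

pattern <- "[^[:alnum:]]"  # anything that is not alphanumeric

n <- nrow(df)
has_special  <- logical(n)    # starts as all FALSE
special_cols <- character(n)  # starts as all ""

# Loop over the original columns only (the new columns are attached afterwards).
for (j in seq_along(df)) {
  # grepl() coerces the column to character; NA values give FALSE.
  hit <- grepl(pattern, df[[j]])
  if (any(hit)) {
    has_special <- has_special | hit
    # Append this column's position to the rows that matched.
    special_cols[hit] <- ifelse(
      special_cols[hit] == "",
      as.character(j),
      paste(special_cols[hit], j, sep = ", ")
    )
  }
}

df$has_special  <- has_special
df$special_cols <- special_cols

print(df)
# Row 1 has no hits, row 2 hits columns 1 and 3, row 3 hits column 2.
```

Because the columns are visited in order, the positions in `special_cols` come out in numeric order without any sorting. Note that a space counts as a special character under this pattern, which may or may not be what the data controller wants.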