I have a list of many names (>1.000.000) which I need to check for diacritics. Since the dataset is encoded in UTF-8 de-DE I have to consider special characters like ö,ä,ü.
Names might also contain characters like , - ' . and of course spaces for middle names.
This is what I came up with so far:
I take every name and break it up into individual characters rowwise. Formula for B1: =ArrayFormula((LOWER(MID(A1,ROW(INDIRECT("1:"&LEN(A1))),1))))
The next step is to compare each row in B against an array of characters that are allowed - which came to me is a simpler idea than figuring out which characters are not allowed: {"a"\"b"\"c"\"d"\"e"\"f"\"g"\"h"\"i"\"j"\"k"\"l"\"m"\"n"\"o"\"p"\"q"\"r"\"s"\"t"\"u"\"v"\"w"\"x"\"y"\"z"\"."\"'"\" "\"ö"\"ä"\"ü"\"-"\","}
Formula for C: =ArrayFormula(SUM(--(LOWER(MID(A1,ROW(INDIRECT("1:"&LEN(A1))),1))={"a"\"b"\"c"\"d"\"e"\"f"\"g"\"h"\"i"\"j"\"k"\"l"\"m"\"n"\"o"\"p"\"q"\"r"\"s"\"t"\"u"\"v"\"w"\"x"\"y"\"z"\"."\"'"\" "\"ö"\"ä"\"ü"\"-"\","}))=LEN(A1)) which returns TRUE or FALSE.
This approach takes quite some time for this amount of rows. Is there any better, faster, more concise approach?


Aucun commentaire:
Enregistrer un commentaire