jeudi 23 janvier 2020

Find Diacritics in String (Normalize Strings)

I have a list of many names (>1.000.000) which I need to check for diacritics. Since the dataset is encoded in UTF-8 de-DE I have to consider special characters like ö,ä,ü.

Names might also contain characters like , - ' . and of course spaces for middle names.

This is what I came up with so far:

I take every name and break it up into individual characters rowwise. Formula for B1: =ArrayFormula((LOWER(MID(A1,ROW(INDIRECT("1:"&LEN(A1))),1))))

enter image description here

The next step is to compare each row in B against an array of characters that are allowed - which came to me is a simpler idea than figuring out which characters are not allowed: {"a"\"b"\"c"\"d"\"e"\"f"\"g"\"h"\"i"\"j"\"k"\"l"\"m"\"n"\"o"\"p"\"q"\"r"\"s"\"t"\"u"\"v"\"w"\"x"\"y"\"z"\"."\"'"\" "\"ö"\"ä"\"ü"\"-"\","}

enter image description here

Formula for C: =ArrayFormula(SUM(--(LOWER(MID(A1,ROW(INDIRECT("1:"&LEN(A1))),1))={"a"\"b"\"c"\"d"\"e"\"f"\"g"\"h"\"i"\"j"\"k"\"l"\"m"\"n"\"o"\"p"\"q"\"r"\"s"\"t"\"u"\"v"\"w"\"x"\"y"\"z"\"."\"'"\" "\"ö"\"ä"\"ü"\"-"\","}))=LEN(A1)) which returns TRUE or FALSE.

This approach takes quite some time for this amount of rows. Is there any better, faster, more concise approach?

Aucun commentaire:

Enregistrer un commentaire