(TIL) Python: Normalize text with unicodedata


With Python's unicodedata module it's easy to normalize Unicode strings, for example to strip accents:

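import unicodedata

data = 'ïnvéntìvé'
# NFKD splits each accented letter into a base letter plus combining marks;
# encoding to ASCII with 'ignore' then drops the marks.
normal = unicodedata.normalize('NFKD', data).encode('ASCII', 'ignore')
print(normal)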

The output is a bytes object, since encode() returns bytes:

b'inventive'

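NFKD stands for Normalization Form Compatibility Decomposition: characters are decomposed by compatibility equivalence, and multiple combining characters are arranged in a specific order. Decomposition turns each accented letter into a plain base letter followed by its combining marks. Those marks are not ASCII, so encoding with 'ignore' discards them and leaves only the base letters.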
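To make the decomposition visible, here is a minimal sketch that prints the code points NFKD produces for a single accented letter, then wraps the same normalize-and-encode approach in a small helper that returns a str instead of bytes. The helper name to_ascii is hypothetical and not part of unicodedata.

```python
import unicodedata


def to_ascii(text):
    """Hypothetical helper: NFKD-decompose, drop non-ASCII, return a str."""
    decomposed = unicodedata.normalize('NFKD', text)
    # encode() yields bytes; decode() turns them back into a str.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


# U+00E9 is the precomposed 'é'; NFKD splits it into two code points.
decomposed = unicodedata.normalize('NFKD', '\u00e9')
names = [unicodedata.name(ch) for ch in decomposed]
print(names)

print(to_ascii('ïnvéntìvé'))
# Compatibility decomposition also expands ligatures such as U+FB01 ('fi').
print(to_ascii('\ufb01'))
```

Note that this approach silently drops any character with no ASCII equivalent after decomposition, so it suits Latin-script text with accents rather than arbitrary Unicode input.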

Via enkipro.com.
