Warm tip: This article is reproduced from stackoverflow.com, please click
nlp python string

Displacement of diacritics/accents from strings in python

发布于 2020-03-27 10:29:20

I am working on a NLP task that requires using a corpus of the language called Yoruba. Yoruba is a language that has diacritics in its alphabets. If I read any text/corpus into the python environment, some of the upper diacritics gets displaced/shifted especially for alphabets ẹ and ọ:

an image description

for characters ẹ with diacritics at the top they get displaced. to have:ẹ́ ẹ̀ also for ọ the same thing occurs.( ọ́ ọ̀ )

def readCorpus(directory="news_sites.txt"):
with open(directory, 'r',encoding="utf8", errors='replace') as doc:
data = doc.readlines()
return data

The expected result is having the diacritics rightly placed at the top (I am surprised stackoverflow was able to fix the diacritics).

Later the diacritics that have been displaced are seen as a punctuation and therefore removed (by my NLP processing function) thus affecting the whole task.

Jesujoba ALABI
Jesujoba ALABI 2019-07-04 04:47

I removed numbers using reg ex 'cos some the strings contain numbers but used the maketrans method from the string library to remove the punctuation marks.

import string
out = re.sub(r'[0-9\.]+', '', ins)
punct = str.maketrans({k: None for k in string.punctuation})
new_s = out.translate(punct)