I am working on an NLP task that requires a corpus in the Yoruba language. Yoruba uses diacritics in its alphabet. When I read any text/corpus into the Python environment, some of the upper diacritics get displaced/shifted, especially on the letters ẹ and ọ:
For ẹ with a diacritic on top, the mark gets displaced, giving ẹ́ ẹ̀. The same thing happens for ọ (ọ́ ọ̀).
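To check what such a string actually contains, its code points can be listed one by one. This is only a minimal sketch: `describe` is a hypothetical helper name that is not part of my code, and the sample string is built from escapes so that it does not depend on how the text is rendered here.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]

# "e with dot below" followed by a combining acute accent
sample = "\u1eb9\u0301"
for code_point, name in describe(sample):
    print(code_point, name)
```

If the top mark shows up as a separate `COMBINING ...` code point, the letter and its tone mark are stored as two characters, and how they are drawn then depends on the font or terminal.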
    def readCorpus(directory="news_sites.txt"):
        with open(directory, 'r', encoding="utf8", errors='replace') as doc:
            data = doc.readlines()
        return data
The expected result is to have the diacritics correctly placed on top of the letters (I am surprised Stack Overflow was able to render them correctly).
Later, the displaced diacritics are treated as punctuation and therefore removed by my NLP processing function, which affects the whole task.
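For reference, here is a hedged sketch of a punctuation filter that would leave the tone marks alone. It assumes the marks are lost because the cleaning step treats everything that is not a letter, digit, or space as punctuation. `strip_punctuation` is a hypothetical stand-in, not my actual processing function. It normalizes to NFC first and then drops only characters whose Unicode category starts with `P` (punctuation), so combining marks (category `Mn`) survive.

```python
import unicodedata

def strip_punctuation(text):
    """Remove punctuation but keep letters, spaces, and combining marks."""
    # NFC puts the characters into a canonical composed order.
    text = unicodedata.normalize("NFC", text)
    # Keep every character that is not in a punctuation category (Pc, Pd, Po, ...).
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P")
    )

# e + dot below + acute, comma, space, o + dot below + grave, exclamation mark
raw = "e\u0323\u0301, o\u0323\u0300!"
cleaned = strip_punctuation(raw)
print([f"U+{ord(ch):04X}" for ch in cleaned])
```

Note that even after NFC the top accent remains a separate combining code point, since as far as I know Unicode has no single precomposed character for ẹ or ọ with a tone mark, so any filter has to keep category `Mn` characters explicitly.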