I am working on a NLP task that requires using a corpus of the language called Yoruba. Yoruba is a language that has diacritics in its alphabets. If I read any text/corpus into the python environment, some of the upper diacritics gets displaced/shifted especially for alphabets ẹ and ọ:
for characters ẹ with diacritics at the top they get displaced. to have:ẹ́ ẹ̀ also for ọ the same thing occurs.( ọ́ ọ̀ )
def readCorpus(directory="news_sites.txt"):
with open(directory, 'r',encoding="utf8", errors='replace') as doc:
data = doc.readlines()
return data
The expected result is having the diacritics rightly placed at the top (I am surprised stackoverflow was able to fix the diacritics).
Later the diacritics that have been displaced are seen as a punctuation and therefore removed (by my NLP processing function) thus affecting the whole task.
I removed numbers using reg ex 'cos some the strings contain numbers but used the maketrans method from the string library to remove the punctuation marks.
import string
out = re.sub(r'[0-9\.]+', '', ins)
punct = str.maketrans({k: None for k in string.punctuation})
new_s = out.translate(punct)
Note: You don't need to build the
dict
manually before passing tostr.maketrans
, you can just dostr.maketrans('', '', string.punctuation)
(the third argument is characters to be deleted in the translation table). Also note that this only handles ASCII punctuation, stuff like smart quotes (that many word processing tools substitute for ASCII single and double quotes automatically) aren't included instring.punctuation
.