Note: This article is reproduced from serverfault.com.

Best way to detect accent HTML escape in a string?

Posted on 2020-11-29 17:57:43

Python has some good libraries for converting accented Unicode characters to their closest ASCII equivalents, as well as libraries for converting a code point to its Unicode character.

However, what options are there for checking whether a string contains a Unicode code point or an HTML escape? For example, this string:

Rialta te Venice&#199

It contains &#199, which translates to Ç (Latin capital letter C with cedilla). Is there a Python library that detects code points/escapes within a string and outputs the Unicode equivalent?

Questioner: Elie
xjcl 2020-11-30 03:37:38

It's not quite clear to me what you're asking, but here is my best try:

  1. &#199 is an HTML escape (a numeric character reference, here written without its trailing semicolon), which you can unescape like so:

    >>> s = 'Rialta te Venice&#199'
    >>> import html
    >>> s2 = html.unescape(s); s2
    'Rialta te VeniceÇ'
    
  2. As you've said, there are libraries for normalizing/removing accents, such as the third-party unidecode package:

    >>> import unidecode
    >>> unidecode.unidecode(s2)
    'Rialta te VeniceC'
    

    You don't really need to check whether the string contains non-ASCII characters, as unidecode leaves plain ASCII characters unchanged. But you could check anyway using s2.isascii() (available since Python 3.7).
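
A small sketch of that check, reusing the string from the question. One thing worth noticing: the escaped form is itself pure ASCII, so isascii() only tells you something useful after unescaping.

```python
import html

s = "Rialta te Venice&#199"
s2 = html.unescape(s)

# The escape sequence "&#199" consists only of ASCII characters,
# so the raw string passes the check.
print(s.isascii())   # True

# After unescaping, the string contains "Ç", which is not ASCII.
print(s2.isascii())  # False
```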

So the complete solution is to use unidecode.unidecode(html.unescape(s)).
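
If you also want to detect whether a string contains an escape at all, one option is to compare it with its unescaped form. The following is only a sketch: the helper names (has_html_escape, strip_accents, to_plain_ascii) are made up for illustration, and strip_accents is a standard-library approximation of unidecode that I am assuming is good enough for accented Latin letters. Unlike unidecode, it will not transliterate characters that have no Unicode decomposition, such as "ø" or "ß".

```python
import html
import unicodedata


def has_html_escape(text: str) -> bool:
    """Return True if html.unescape() would change the text."""
    return html.unescape(text) != text


def strip_accents(text: str) -> str:
    """Stdlib-only stand-in for unidecode (an assumption, not a full replacement).

    Decomposes each character (NFKD) and drops the combining marks,
    so "Ç" becomes "C".
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def to_plain_ascii(text: str) -> str:
    """Unescape HTML escapes first, then remove accents."""
    return strip_accents(html.unescape(text))


s = "Rialta te Venice&#199"
print(has_html_escape(s))  # True
print(to_plain_ascii(s))   # Rialta te VeniceC
```

If unidecode is installed, swapping strip_accents for unidecode.unidecode gives the one-liner above with broader transliteration coverage.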