Note: This article is reproduced from serverfault.com.

ascii python string unicode non-ascii-characters

Best way to detect accent HTML escape in a string?

Posted on 2020-11-29 17:57:43

Python has some good libraries for converting accented Unicode characters to their closest ASCII equivalents, as well as libraries for converting a code point to its Unicode character.

However, what options are there for checking whether a string contains a Unicode code point or an HTML escape? For example, this string:

Rialta te Venice&#199

contains &#199, which translates to the Latin capital letter C with cedilla (Ç). Is there a Python library that detects code points/escapes within a string and outputs the Unicode equivalent?

Questioner

Elie


xjcl 2020-11-30 03:37:38

It's not quite clear to me what you're asking, but here is my best try:

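&#199 is an HTML escape (a numeric character reference). html.unescape follows the HTML 5 rules, so it accepts the reference even though the trailing semicolon is missing. You can unescape it like so: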

```
>>> s = 'Rialta te Venice&#199'
>>> import html
>>> s2 = html.unescape(s); s2
'Rialta te VeniceÇ'
```

As you've said, there are libraries for normalizing/removing accents, such as the third-party unidecode package:
```
>>> import unidecode
>>> unidecode.unidecode(s2)
'Rialta te VeniceC'
```
You don't really need to check whether the string contains non-ASCII characters first, since unidecode leaves plain ASCII characters unchanged. If you want to check anyway, use s2.isascii() (available since Python 3.7).
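If you do want an explicit check for escapes, one possible approach is to compare the string with its unescaped form. This is only a sketch: the helper name contains_html_escape is made up for illustration and is not part of the html module.

```python
import html


def contains_html_escape(text: str) -> bool:
    """Return True if html.unescape() would change the text.

    Hypothetical helper: it reports an escape whenever unescaping
    alters the string, which is an assumption about what "detect" means.
    """
    return html.unescape(text) != text


s = 'Rialta te Venice&#199'
print(contains_html_escape(s))         # True
print(contains_html_escape('Venice'))  # False
print(html.unescape(s).isascii())      # False, because of the 'Ç'
```

The comparison avoids writing a regular expression for every named and numeric reference, at the cost of doing the unescaping work twice if you then call html.unescape again.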

So the complete solution is to use unidecode.unidecode(html.unescape(s)).
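If installing unidecode is not an option, a rough standard-library alternative is sketched below. Note that this swaps in a different technique: it decomposes characters with unicodedata.normalize and drops the combining marks, so characters without a decomposition (for example 'ø' or 'ß') are left as they are, whereas unidecode would transliterate them. The function name unescape_and_strip_accents is hypothetical.

```python
import html
import unicodedata


def unescape_and_strip_accents(text: str) -> str:
    """Unescape HTML references, then remove combining accent marks.

    Stdlib-only approximation of unidecode.unidecode(html.unescape(s));
    it does not transliterate characters that have no decomposition.
    """
    unescaped = html.unescape(text)
    # NFKD splits 'Ç' into 'C' followed by a combining cedilla.
    decomposed = unicodedata.normalize('NFKD', unescaped)
    # Keep only the characters that are not combining marks.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


print(unescape_and_strip_accents('Rialta te Venice&#199'))  # Rialta te VeniceC
print(unescape_and_strip_accents('caf&eacute;'))            # cafe
```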

Elie 2020-11-29 19:47:40

My question was essentially how to "undo" the &#199, which I mistakenly thought was a code point (when it is actually an HTML escape, as you pointed out). Your answer works beautifully, as unescape(s) detects where an HTML escape exists. Thank you!
