Note: This article is reproduced from serverfault.com.

unicode text-segmentation

Non-reducible grapheme clusters in Unicode

Posted on 2015-08-13 10:06:08

I'm of the opinion that a "user-perceived character" (henceforth UPC) iterator would be very useful in a Unicode library. By UPC I mean the sense discussed in Unicode Standard Annex #29: what a user perceives as a single character, which may be represented in Unicode as either a single code point or a grapheme cluster of several code points. Since I typically work with Latin-script languages, I always come up with examples like "I want to handle ü as one UPC, regardless of whether it is encoded as a grapheme cluster or as a single code point".

Colleagues who are against a UPC iterator (or grapheme cluster iterator, take your pick) counter "You can normalize to NFC, and then use codepoint iteration", and "there is no use case for grapheme cluster iteration".
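To make the colleagues' argument concrete, here is a minimal sketch of the "normalize to NFC, then iterate code points" approach for the ü case, using Python's standard `unicodedata` module. The variable names are only illustrative.

```python
import unicodedata

# 'u' followed by U+0308 COMBINING DIAERESIS: two code points, one UPC.
decomposed = "u\u0308"

# NFC replaces the pair with the precomposed U+00FC where one exists.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 2 1
print(composed == "\u00fc")            # True
```

This works for ü only because Unicode happens to define a precomposed code point for it.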

I keep thinking of Latin-centric use cases, which may not translate well to the rest of the Unicode universe. For example, when producing terminal output I want to pad a column to a width of N columns, so I want to know how many UPCs are in a string.
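A small sketch of how the padding use case goes wrong when padding is based on code point counts. It assumes a terminal that renders a base letter plus combining mark in one column; the example name is made up.

```python
# The same name, once with precomposed ü and once with 'u' + U+0308.
name_nfc = "M\u00fcller"
name_nfd = "Mu\u0308ller"

# str.ljust pads by code point count, not by user-perceived characters.
padded_nfc = name_nfc.ljust(8)
padded_nfd = name_nfd.ljust(8)

print(len(name_nfc), len(name_nfd))  # 6 7
print(repr(padded_nfc))              # two trailing spaces
print(repr(padded_nfd))              # only one trailing space
```

Both strings display six UPCs, but the decomposed one receives one space less, so the column edge shifts.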

I think what I want to know is:

Are there meaningful grapheme clusters which can't be normalized to a single code point? Are any of them likely to occur among Western users? I'm assuming Korean and Arabic are cases of this, but I have to admit to total ignorance there.
Do any other languages provide UPC/grapheme cluster iteration or operations? Is there any advice on this in the Unicode specification?

Questioner

Spacemoose


一二三 2015-08-13 19:35:04

It's unclear how your questions are not answered by UAX #29:

There are many such grapheme clusters, even for languages that only use the Latin alphabet, as not all combining marks have compositions with all other letters/forms (for example, the gaps in this table on Wikipedia). Table 1a in UAX #29 has several non-Latin examples.
This is the purpose of UAX #29: to generalise grapheme cluster operations to all languages that are supported in Unicode.
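As a rough illustration of the first point, the sketch below counts the code points that remain after NFC for a few sequences. The helper `nfc_len` and the sample strings are illustrative choices, not taken from UAX #29 or the Wikipedia table.

```python
import unicodedata

def nfc_len(text):
    """Number of code points left after NFC normalization."""
    return len(unicodedata.normalize("NFC", text))

u_diaeresis = "u\u0308"            # composes to U+00FC
q_tilde = "q\u0303"                # Latin, but no precomposed form exists
devanagari_ni = "\u0928\u093f"     # NA + VOWEL SIGN I, no composition
hangul_han = "\u1112\u1161\u11ab"  # conjoining jamo, compose to U+D55C

print(nfc_len(u_diaeresis))    # 1
print(nfc_len(q_tilde))        # 2
print(nfc_len(devanagari_ni))  # 2
print(nfc_len(hangul_han))     # 1
```

Each of the four inputs is perceived as one character, yet two of them stay at two code points after NFC, so code point iteration over NFC text would split them. The Hangul case shows that modern Korean syllables do compose under NFC.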

Spacemoose 2015-08-13 13:39:07

I just reread UAX #15... Are you referring to section 5, "Composition Exclusion Table"? I have to admit I have trouble taking the content of that section and applying it to the languages I know. I suppose I am asking for cultural knowledge -- how commonly will I need to be aware of grapheme clusters? Is it reasonable to tell my customers we don't support them? There's an element in my company leaning towards ignoring their presence until they bite us. I'd like to know the risks, and have compelling arguments at hand, if they exist.

Spacemoose 2015-08-13 13:42:22

The Wikipedia table seems to be what I'm looking for re: Latin languages. Can you or anyone else tell me how common these excluded clusters are, and in which countries I'm likely to encounter them?

一二三 2015-08-13 14:18:49

Given that the algorithm for supporting grapheme clusters is well-known and implemented in any decent Unicode library, not supporting them would seem to be more difficult.
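For a feel of what such iteration looks like, here is a deliberately simplified sketch using only Python's standard library. It is not the UAX #29 algorithm: it only attaches characters with a nonzero canonical combining class to the preceding character, and it ignores Hangul jamo, spacing marks, ZWJ emoji sequences and regional indicators. The function name `approx_clusters` is hypothetical.

```python
import unicodedata

def approx_clusters(text):
    """Group each base character with the combining marks that follow it.

    Approximation only: a real implementation should follow the
    grapheme cluster boundary rules in UAX #29.
    """
    clusters = []
    for ch in text:
        # combining() returns the canonical combining class; 0 means
        # the character does not combine with the one before it.
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

sample = "q\u0303u\u0308x"
print(len(sample))                   # 5
print(len(approx_clusters(sample)))  # 3
```

In practice a library that implements the full rules is the safer choice, for example ICU's break iterators or the `\X` pattern in the third-party Python `regex` module.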
