I'm of the opinion that a "user-perceived character" (henceforth UPC) iterator would be very useful in a Unicode library. By UPC I mean the sense discussed in Unicode Standard Annex #29: what a user perceives as a single character, which might be represented in Unicode as a single code point or as a grapheme cluster of several code points. Since I typically work with Latin-script languages, I always come up with examples like "I want to handle ü as one UPC, regardless of whether it is encoded as a grapheme cluster or as a single code point".
Colleagues who are against a UPC iterator (or grapheme cluster iterator, take your pick) counter with "You can normalize to NFC, and then use code point iteration" and "there is no use case for grapheme cluster iteration".
I keep thinking of Latin-centric use cases, which may not translate well to the rest of the Unicode universe. For example: I'm doing terminal output and want to pad a column to N character cells, so I want to know how many UPCs are in a string...
I think what I want to know is:
It's unclear how your questions are not answered by UAX #29:
There are many such grapheme clusters, even for languages that only use the Latin alphabet as not all combining marks have compositions with all other letters/forms—for example, the gaps in this table on Wikipedia. Table 1a in UAX #29 has several non-Latin examples.
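A minimal sketch of why "normalize to NFC, then iterate code points" does not cover every case, using only Python's standard `unicodedata` module. The helper name `nfc_len` is made up for this illustration. The example relies on `q` + COMBINING TILDE having no precomposed code point, while `u` + COMBINING DIAERESIS composes to U+00FC.

```python
import unicodedata


def nfc_len(s: str) -> int:
    """Return the number of code points in s after NFC normalization."""
    return len(unicodedata.normalize("NFC", s))


# u + COMBINING DIAERESIS (U+0308): NFC composes this to the single code point U+00FC.
u_diaeresis = "u\u0308"

# q + COMBINING TILDE (U+0303): there is no precomposed code point,
# so NFC leaves the sequence as two code points.
q_tilde = "q\u0303"

print(nfc_len(u_diaeresis))  # 1
print(nfc_len(q_tilde))      # 2
```

Both strings are a single user-perceived character, yet code point iteration after NFC reports two elements for the second one.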
This is the purpose of UAX #29: to generalise grapheme cluster operations to all languages that are supported in Unicode.
I just reread UAX #15... Are you referring to section 5, "Composition Exclusion Table"? I have to admit I have trouble taking the content of that section and applying it to the languages I know. I suppose I am asking for cultural knowledge: how commonly will I need to be aware of grapheme clusters? Is it reasonable to tell my customers we don't support them? There's an element in my company leaning towards ignoring their presence until they bite us. I'd like to know the risks, and have compelling arguments at hand, if they exist.
The Wikipedia table seems to be what I'm looking for regarding Latin-script languages. Can you or anyone else tell me how common these excluded clusters are, and in which countries I'm likely to encounter them?
Given that the algorithm for supporting grapheme clusters is well-known and implemented in any decent Unicode library, not supporting them would seem to be more difficult.
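For the column-padding case from the question, here is a rough sketch using only Python's standard library. It is not the UAX #29 algorithm: it simply treats every code point whose general category starts with "M" (a combining mark) as a continuation of the previous cluster, so it ignores Hangul jamo sequences, emoji ZWJ sequences, and double-width characters. The names `approx_cluster_count` and `pad_right` are hypothetical, and a real implementation should call a library's grapheme cluster iterator instead.

```python
import unicodedata


def approx_cluster_count(s: str) -> int:
    """Approximate the number of user-perceived characters in s.

    Counts every code point that is not a combining mark (general
    category Mn, Mc, or Me). This is a simplification, not UAX #29.
    """
    return sum(1 for ch in s if not unicodedata.category(ch).startswith("M"))


def pad_right(s: str, width: int) -> str:
    """Pad s with spaces so it occupies roughly `width` character cells."""
    return s + " " * max(0, width - approx_cluster_count(s))


# "q" + COMBINING TILDE is two code points but one perceived character,
# so it gets two spaces of padding to fill a 3-cell column.
print(repr(pad_right("q\u0303", 3)))
```

Padding by `len(s)` instead would give this string only one space, which is where the column misalignment comes from.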