Support lots of ligatures #53

Jules-Bertholet · 2024-05-28T14:17:11Z

Arabic Lam-Alef ligature

The lam-alef type of ligatures are extremely common in the Arabic script. These ligatures occur in almost all font designs, except for a few modern styles. When supported by the style of the font, lam-alef ligatures are considered obligatory. This means that all character sequences rendered in that font, which match the rules specified in the following discussion, must form these ligatures.

In practice, the three monospace Arabic fonts that seem to be most widespread (Courier New, Simplified Arabic Fixed, and Kawkab Mono) all render this ligature with width 1.
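To make "width 1" concrete, here is a minimal Python sketch. It is not the crate's implementation: the name `lam_alef_width`, the flat width of 1 per character, and the choice of four ALEF forms are assumptions for illustration. It also ignores any marks that may sit between the two letters.

```python
LAM = "\u0644"
# Assumed set of ALEF forms that ligate with LAM:
# ALEF WITH MADDA ABOVE, ALEF WITH HAMZA ABOVE, ALEF WITH HAMZA BELOW, ALEF
ALEFS = {"\u0622", "\u0623", "\u0625", "\u0627"}


def lam_alef_width(s: str) -> int:
    """Toy width: 1 column per character, except an ALEF right after LAM adds 0."""
    width = 0
    prev = None
    for ch in s:
        if prev == LAM and ch in ALEFS:
            # The pair renders as one ligature glyph, so ALEF adds no column.
            pass
        else:
            width += 1
        prev = ch
    return width
```

Under these assumptions, LAM followed by ALEF measures one column, while LAM followed by any other letter measures two.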

Buginese <a, -i> ya ligature

Documented in Unicode.

Hebrew Alef-Lamed ligature

This one is not documented in the Unicode standard, but all the Hebrew monospace fonts I could find have it. And it requires an explicit ZWJ to get, so there is no real risk in including it.

Khmer coeng signs

Documented in Unicode.

Old Turkic ligature

Documented in Unicode.

Tifinagh bi-consonants

Documented in Unicode.

Emoji modifiers and ZWJ sequences

Documented in UTS 51.

U+17D8 KHMER SIGN BEYYAL

According to the charts, this character should be equivalent to U+17D4 U+179B U+17D4, so give it the same width as that sequence.
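A minimal sketch of that idea in Python, assuming each character of the expansion occupies one column (so the sequence, and therefore BEYYAL, has width 3). The names `BEYYAL_EXPANSION` and `beyyal_width` are hypothetical and not taken from the PR.

```python
BEYYAL = "\u17D8"
# U+17D4 KHMER SIGN KHAN, U+179B KHMER LETTER LO, U+17D4 KHMER SIGN KHAN
BEYYAL_EXPANSION = "\u17D4\u179B\u17D4"


def beyyal_width(s: str) -> int:
    """Toy width: expand BEYYAL to its equivalent sequence, then count 1 per character."""
    expanded = s.replace(BEYYAL, BEYYAL_EXPANSION)
    # Assumption: every character in this toy model is one column wide.
    return len(expanded)
```

With this model, `beyyal_width("\u17D8")` and `beyyal_width("\u17D4\u179B\u17D4")` agree by construction.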

This PR also fixes a bug with canonical equivalence for the CJK width of certain strings containing '\u{0338}' COMBINING LONG SOLIDUS OVERLAY. I can try to split that out if you'd prefer.
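The canonical-equivalence property can be sketched as a small Python check: a width function should return the same value for the NFC and NFD forms of a string. The helper `widths_agree` and the two toy width functions are hypothetical, not part of the PR; only `unicodedata.normalize` and `unicodedata.combining` are real standard-library calls.

```python
import unicodedata
from typing import Callable


def widths_agree(width_fn: Callable[[str], int], s: str) -> bool:
    """True if width_fn gives the same result for the NFC and NFD forms of s."""
    nfc = unicodedata.normalize("NFC", s)
    nfd = unicodedata.normalize("NFD", s)
    return width_fn(nfc) == width_fn(nfd)


def naive_width(s: str) -> int:
    # Counts code points, so it breaks on decomposed sequences.
    return len(s)


def combining_aware_width(s: str) -> int:
    # Toy model: combining marks take no column, everything else takes one.
    return sum(0 if unicodedata.combining(ch) else 1 for ch in s)
```

For example, U+2260 NOT EQUAL TO decomposes to `=` followed by U+0338, so `naive_width` fails the check on it while `combining_aware_width` passes.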

Manishearth

not a huge fan of the python getting super complex but it's fine

Manishearth · 2024-06-06T16:53:06Z

scripts/unicode.py

-class EffectiveWidth(enum.IntEnum):
-    """Represents the width of a Unicode character. All East Asian Width classes resolve into
-    either `EffectiveWidth.NARROW`, `EffectiveWidth.WIDE`, or `EffectiveWidth.AMBIGUOUS`.
+def to_sorted_ranges(iter: Iterable[Codepoint]) -> list[tuple[Codepoint, Codepoint]]:
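
Only the signature of `to_sorted_ranges` appears in this hunk. As a rough guess at what a function with that signature might do, here is a sketch that collapses code points into sorted, inclusive ranges. The body and the `Codepoint = int` alias are assumptions, not the code from the PR.

```python
from typing import Iterable

Codepoint = int  # assumption: code points are plain integers


def to_sorted_ranges(iter: Iterable[Codepoint]) -> list[tuple[Codepoint, Codepoint]]:
    """Collapse code points into sorted, inclusive (start, end) ranges."""
    ranges: list[tuple[Codepoint, Codepoint]] = []
    for cp in sorted(set(iter)):
        if ranges and ranges[-1][1] + 1 == cp:
            # cp directly follows the last range, so extend that range.
            ranges[-1] = (ranges[-1][0], cp)
        else:
            ranges.append((cp, cp))
    return ranges
```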


thought: given the amount of logic that is ending up in Python here I think it may make sense to switch to a rust script for this. unfortunately that will make CI much slower since a rust script would need HTTP deps

Perhaps we can instead put more work into documenting and cleaning up the python. you've been doing pretty well so far!

a rust script would need HTTP deps

Or we could just shell out to curl…

Not opposed.

(let's not do that in this PR, though)

Actually we could also just check in the fetched file and have a CI job ensure it's up to date.
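
A CI check along those lines could look roughly like the sketch below. The function names are hypothetical, and the fetch step is passed in as a callable so the comparison stays independent of how the file is downloaded.

```python
import hashlib
from pathlib import Path
from typing import Callable


def digest(data: bytes) -> str:
    """SHA-256 hex digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def checked_in_copy_is_current(local: Path, fetch: Callable[[], bytes]) -> bool:
    """Compare the checked-in file against freshly fetched bytes."""
    return digest(local.read_bytes()) == digest(fetch())
```

In a CI job, `fetch` could wrap `urllib.request.urlopen(...).read()` or a call to curl, and the job would fail when the function returns `False`.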

kchibisov · 2024-10-07T17:34:26Z

@Jules-Bertholet what's the reason to test string normalization with all the internal modes defined instead of always starting with the Default mode? It's really hard to do that for Unicode 16 because of the more complex composition rules and the need for internal states to account for width(AXX) == width(XX) and so on; see https://www.unicode.org/reports/tr15/tr15-56.html#Contexts_Care

Jules-Bertholet · 2024-10-07T17:50:09Z

@kchibisov It ensures that normalization-equivalent strings have the same width no matter what might follow them. I don’t see the issue wrt Unicode 16; you can either make the composed characters wide, or have some extra states (encounter B -> state B, encounter A while in state B -> state AB and don’t increment width, etc)
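
The "extra states" idea can be sketched as a tiny state machine in Python. The characters "A" and "B" are the placeholders from the comment above, and the state names, `step`, and `sketch_width` are hypothetical; the crate's real states and processing direction may differ.

```python
from enum import Enum, auto


class State(Enum):
    DEFAULT = auto()
    SEEN_B = auto()
    SEEN_AB = auto()


def step(state: State, ch: str) -> tuple[State, int]:
    """Return (next_state, columns_added) for one character."""
    if ch == "B":
        return State.SEEN_B, 1
    if ch == "A" and state is State.SEEN_B:
        # A joins the B already seen: move to SEEN_AB, add no width.
        return State.SEEN_AB, 0
    return State.DEFAULT, 1


def sketch_width(s: str, start: State = State.DEFAULT) -> int:
    """Sum the columns added by each character, beginning in `start`."""
    state, total = start, 0
    for ch in s:
        state, added = step(state, ch)
        total += added
    return total
```

Allowing `start` to be any state is what lets a test check a string's width independently of what was processed before it.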

kchibisov · 2024-10-07T18:05:07Z

Well, then the only way to avoid that is to have a state that is not passed directly, because you have an issue where character X can repeat itself multiple times and the result will differ in width.

States don't really help here: I never want to start from a special state when I encounter a specific character, because computing its width is more complex than that.

You may still think all of this is possible without changing the test to prevent certain states from being passed automatically as the start state. Either way, the Python code is too complex for me to pursue this further, since I don't really need normalization anyway.

kchibisov · 2024-10-07T18:25:27Z

Also, I don't see why it matters to test states like that. You can only get into, e.g., state X from character y, so if you begin in state X, it means that whatever came before was character y. For repeating characters that means you're in the repeat state (and if you add one more state you'll end up in this situation again, since you'll just start from the repeat state, which also changes the processing).

Everything you say about states makes sense as long as the Unicode characters being composed are different, which is not the case if you look at the link I posted. It's also not the case with anything present already, which is why they said that manual changes are required for certain algorithms.

Jules-Bertholet force-pushed the ligatures branch 3 times, most recently from 714ddc5 to 19dc63f Compare June 2, 2024 19:09

Jules-Bertholet mentioned this pull request Jun 6, 2024

v0.1.13 breaks semver #55

Closed

Manishearth approved these changes Jun 6, 2024


Jules-Bertholet added 17 commits June 6, 2024 13:45

Use array of arrays for tables

9289c0a

3 means codepoint needs special handling

1eb69dc

…instead of meaning that it has ambiguous width. Instead, we use a (partly) separate table to handle CJK width.

Document and test cjk flag

3a21f14

Hebrew Alef Lamed

c8ceb18

Arabic Lam-Alef ligature

1eafad8

Buginese ligature, ligature transparency, fix canonical equivalence

8f0ea15

Tifinagh biconsonants

a38437b

Old Turkic ligature

246b9bd

Emoji modifiers

e1566f2

unicode.py: print new table sizes

79bb6e7

Compress EMOJI_MODIFIER_LEAVES a little more

ce19987

Compress a little

3bad9e6

Support emoji ZWJ sequences

af3e4cd

Add test that traits are not sealed

7bf7490

More extensive normalization tests

b3cdccc

and a bugfix caught by said tests

Test emoji sequences from Unicode test files

060cbbb

Support Khmer

84455bb

- Support nonspacing coeng signs - Assign width 2 to KHMER INDEPENDENT VOWEL QAA and 3 to KHMER SIGN BEYYAL (https://unicode.org/charts/nameslist/n_1780.html)

Jules-Bertholet force-pushed the ligatures branch from 19dc63f to 84455bb Compare June 6, 2024 17:53

Jules-Bertholet requested a review from Manishearth June 6, 2024 17:54

Manishearth approved these changes Jun 6, 2024


Manishearth merged commit afab363 into unicode-rs:master Jun 6, 2024
2 checks passed

Jules-Bertholet deleted the ligatures branch June 6, 2024 18:00
