Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support lots of ligatures #53

Merged
merged 17 commits into from
Jun 6, 2024
Merged

Conversation

Jules-Bertholet
Copy link
Contributor

@Jules-Bertholet Jules-Bertholet commented May 28, 2024

Arabic Lam-Alef ligature

Unicode:

The lam-alef type of ligatures are extremely common in the Arabic script. These ligatures occur in almost all font designs, except for a few modern styles. When supported by the style of the font, lam-alef ligatures are considered obligatory. This means that all character sequences rendered in that font, which match the rules specified in the following discussion, must form these ligatures.

In practice, the three monospace Arabic fonts that seem to be most widespread (Courier New, Simplified Arabic Fixed, and Kawkab Mono) all render this ligature with width 1.

Buginese <a, -i> ya ligature

Documented in Unicode.

Hebrew Alef-Lamed ligature

This one is not documented in the Unicode standard, but all the Hebrew monospace fonts I could find have it. And it requires an explicit ZWJ to get, so there is no real risk in including it.

Khmer coeng signs

Documented in Unicode.

Old Turkic ligature

Documented in Unicode.

Tifinagh bi-consonants

Documented in Unicode.

Emoji modifiers and ZWJ sequences

Documented in UTS 51.

U+17D8 KHMER SIGN BEYYAL

According to the charts, this character should be equivalent to U+17D4 U+179B U+17D4, so give it the same width as that sequence.


This PR also fixes a bug with canonical equivalence for the CJK width of certain strings containing '\u{0338}' COMBINING LONG SOLIDUS OVERLAY. I can try to split that out if you'd prefer.

@Jules-Bertholet Jules-Bertholet force-pushed the ligatures branch 3 times, most recently from 714ddc5 to 19dc63f Compare June 2, 2024 19:09
Copy link
Member

@Manishearth Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a huge fan of the python getting super complex but it's fine

class EffectiveWidth(enum.IntEnum):
"""Represents the width of a Unicode character. All East Asian Width classes resolve into
either `EffectiveWidth.NARROW`, `EffectiveWidth.WIDE`, or `EffectiveWidth.AMBIGUOUS`.
def to_sorted_ranges(iter: Iterable[Codepoint]) -> list[tuple[Codepoint, Codepoint]]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: given the amount of logic that is ending up in Python here I think it may make sense to switch to a rust script for this. unfortunately that will make CI much slower since a rust script would need HTTP deps

Perhaps we can instead put more work into documenting and cleaning up the python. you've been doing pretty well so far!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a rust script would need HTTP deps

Or we could just shell out to curl

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not opposed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(let's not do that in this PR, though)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually we could also just check in the fetched file and have a CI job ensure it's up to date.

@Manishearth Manishearth merged commit afab363 into unicode-rs:master Jun 6, 2024
2 checks passed
@Jules-Bertholet Jules-Bertholet deleted the ligatures branch June 6, 2024 18:00
@kchibisov
Copy link

@Jules-Bertholet what's the reason to test string normalization with all the internal modes defined instead of always starting with the Default mode? Like it's really hard to do that for unicode-16 that way due to more complex compose rules and the need for internal states to account for width(AXX) == width(XX), and so on, see https://www.unicode.org/reports/tr15/tr15-56.html#Contexts_Care .

@Jules-Bertholet
Copy link
Contributor Author

@kchibisov It ensures that normalization-equivalent strings have the same width no matter what might follow them. I don’t see the issue wrt Unicode 16; you can either make the composed characters wide, or have some extra states (encounter B -> state B, encounter A while in state B -> state AB and don’t increment width, etc)

@kchibisov
Copy link

Well, then the only way to avoid that is to have a state that is not passed directly. Because you have an issue were X character can multiple itself multiple times and it'll differ in width.

Like states don't really help here, since I don't ever want to start from a special state, when I face specific char, because the width on how to compute it is more complex.

If you still think that all of this possible without touching test to prevent certain state passed automatically as start. The python code is too complex anyway for me to care about that more, since I don't really need normalization as well.

@kchibisov
Copy link

Also, I don't see how it matters to test states like that, since you can only get into e.g. state X from character y, but if you get into this X state from start, it means that whatever you have before is y character meaning that for repeating characters you're in repeat state(if you add one more state you'll end up in this situation again, since you'll just start from repeat, which is also changes the processing).

Like all you say with states makes sense, as long as your unicode characters to compose are different, which is not the case if you look into the link I've posted. It's also not the case with anything present already, which is way they said that manual changes required for certain algos.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants