-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support lots of ligatures #53
Conversation
714ddc5
to
19dc63f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a huge fan of the python getting super complex but it's fine
class EffectiveWidth(enum.IntEnum): | ||
"""Represents the width of a Unicode character. All East Asian Width classes resolve into | ||
either `EffectiveWidth.NARROW`, `EffectiveWidth.WIDE`, or `EffectiveWidth.AMBIGUOUS`. | ||
def to_sorted_ranges(iter: Iterable[Codepoint]) -> list[tuple[Codepoint, Codepoint]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thought: given the amount of logic that is ending up in Python here I think it may make sense to switch to a rust script for this. unfortunately that will make CI much slower since a rust script would need HTTP deps
Perhaps we can instead put more work into documenting and cleaning up the python. you've been doing pretty well so far!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a rust script would need HTTP deps
Or we could just shell out to curl
…
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not opposed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(let's not do that in this PR, though)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually we could also just check in the fetched file and have a CI job ensure it's up to date.
…instead of meaning that it has ambiguous width. Instead, we use a (partly) separate table to handle CJK width.
and a bugfix caught by said tests
- Support nonspacing coeng signs - Assign width 2 to KHMER INDEPENDENT VOWEL QAA and 3 to KHMER SIGN BEYYAL (https://unicode.org/charts/nameslist/n_1780.html)
19dc63f
to
84455bb
Compare
@Jules-Bertholet what's the reason to test string normalization with all the internal modes defined instead of always starting with the |
@kchibisov It ensures that normalization-equivalent strings have the same width no matter what might follow them. I don’t see the issue wrt Unicode 16; you can either make the composed characters wide, or have some extra states (encounter B -> state B, encounter A while in state B -> state AB and don’t increment width, etc) |
Well, then the only way to avoid that is to have a state that is not passed directly. Because you have an issue were Like states don't really help here, since I don't ever want to start from a special state, when I face specific char, because the width on how to compute it is more complex. If you still think that all of this possible without touching test to prevent certain state passed automatically as start. The python code is too complex anyway for me to care about that more, since I don't really need normalization as well. |
Also, I don't see how it matters to test states like that, since you can only get into e.g. state Like all you say with states makes sense, as long as your unicode characters to compose are different, which is not the case if you look into the link I've posted. It's also not the case with anything present already, which is way they said that manual changes required for certain algos. |
Arabic Lam-Alef ligature
Unicode:
In practice, the three monospace Arabic fonts that seem to be most widespread (Courier New, Simplified Arabic Fixed, and Kawkab Mono) all render this ligature with width 1.
Buginese <a, -i> ya ligature
Documented in Unicode.
Hebrew Alef-Lamed ligature
This one is not documented in the Unicode standard, but all the Hebrew monospace fonts I could find have it. And it requires an explicit ZWJ to get, so there is no real risk in including it.
Khmer coeng signs
Documented in Unicode.
Old Turkic ligature
Documented in Unicode.
Tifinagh bi-consonants
Documented in Unicode.
Emoji modifiers and ZWJ sequences
Documented in UTS 51.
U+17D8 KHMER SIGN BEYYAL
According to the charts, this character should be equivalent to U+17D4 U+179B U+17D4, so give it the same width as that sequence.
This PR also fixes a bug with canonical equivalence for the CJK width of certain strings containing
'\u{0338}'
COMBINING LONG SOLIDUS OVERLAY. I can try to split that out if you'd prefer.