update for Unicode 16.0.0 #271

Open · wants to merge 1 commit into master

Conversation

stevengj (Member)

Draft PR to update our data tables to the upcoming Unicode 16.0.0 standard.

The data_generator.jl script is currently failing with:

julia --project=. data_generator.jl > utf8proc_data.c.new
ERROR: LoadError: AssertionError: !(haskey(comb_indices, dm1))
Stacktrace:
 [1] top-level scope
   @ ~/Documents/Code/utf8proc/data/data_generator.jl:325
in expression starting at /Users/stevenj/Documents/Code/utf8proc/data/data_generator.jl:293

@c42f, since you wrote/ported this script in #258, can you help?

inkydragon (Contributor) commented Oct 9, 2024

The error occurs while processing the Character Decomposition Mapping data. Here, dm0 and dm1 are the two characters produced by decomposing a character.

The assert fails when dm0 == dm1. I tried the old Ruby script, and it fails at the same assertion.

    @assert !haskey(comb_indices, dm0)   # dm0: first character of a decomposition pair
    comb_indices[dm0] = cumoffset
    cumoffset += last - first + 1 + 2
end

offset = 0
for dm1 in comb2nd_indices_sorted_keys   # dm1: second character of a decomposition pair
    @assert !haskey(comb_indices, dm1)

In other words, the script assumes that a character never decomposes into two identical characters.

But Unicode 16 introduces a new character, KIRAT RAI VOWEL SIGN AI (U+16D68), whose canonical decomposition is two copies of KIRAT RAI VOWEL SIGN E (U+16D67).

[Figure 13-16, p. 683, The Unicode Standard, Version 16.0 – Core Specification]

I'm not sure whether the current compressed table (xref: #68) can represent this type of mapping.
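
One way to see which code points are affected is to scan UnicodeData.txt for canonical decompositions whose two elements are identical. A minimal sketch, assuming a local copy of the 16.0.0 UnicodeData.txt (field layout per UAX #44):

# Print code points whose canonical decomposition is two identical characters.
for line in eachline("UnicodeData.txt")
    fields = split(line, ';')
    dm = fields[6]                                      # decomposition type/mapping field
    (isempty(dm) || startswith(dm, "<")) && continue    # skip empty and compatibility mappings
    parts = split(dm)
    if length(parts) == 2 && parts[1] == parts[2]
        println("U+", fields[1], " (", fields[2], ") -> U+", parts[1], " + U+", parts[2])
    end
end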

stevengj (Member, Author) commented Oct 9, 2024

Naively commenting out the assert doesn't work; it fails the normalization test for U+113C5, another character introduced in Unicode 16.0 that decomposes into two identical characters, U+113C2 + U+113C2.

It looks like we'll have to special-case the tables somehow for this. It would be unfortunate to have to add an extra table just for this, but I'm not sure I see a way around it yet.
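
For reference, once a Julia build picks up utf8proc tables regenerated for Unicode 16, the expected round trip can be checked from the REPL. A sketch (on current releases U+113C5 has no decomposition, so the output will differ):

using Unicode

s = "\U113C5"
nfd = Unicode.normalize(s, :NFD)
nfc = Unicode.normalize(nfd, :NFC)
println([string(codepoint(c), base = 16) for c in nfd])   # expected: ["113c2", "113c2"]
println([string(codepoint(c), base = 16) for c in nfc])   # expected: ["113c5"]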

@eschnett

Do we also need to introduce re-combining? If we start with [16D67, 16D68, 16D67], should this then result in [16D68, 16D68]?
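
Assuming U+16D67 + U+16D67 gets a primary composite (i.e. U+16D68 is not composition-excluded), the standard algorithm already does this: NFD turns the sequence into four copies of U+16D67, and the composition pass pairs them back up. A sketch, again assuming rebuilt tables:

using Unicode

s = "\U16D67\U16D68\U16D67"
nfc = Unicode.normalize(s, :NFC)
println([string(codepoint(c), base = 16) for c in nfc])   # expected: ["16d68", "16d68"]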
