Support ES2015 Regex u flag (Unicode) #958
The harness code used by the tests in test262 makes use of Unicode regexes, so tests are failing not so much because what is being tested fails, but because the code for executing the tests fails.
In order to run …
Will try next week to resurrect my attempt to implement this and share my WIP.
Did make progress, but not yet as far as I hoped. Still at it, though.
Got some Java implementation-related questions I'm hoping to get answered before going down the wrong rabbit hole. Right now, inside the regex implementation, there are two things I'm contemplating changing:

- operating on int instead of char
- using int[] instead of byte[]

Anyone got insight into what the changes from char to int and from byte[] to int[] would do memory- and performance-wise?
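A back-of-envelope sketch of the memory side of that question (class and variable names invented here for illustration): char[] costs 2 bytes per UTF-16 unit and int[] costs 4 bytes per code point, so for BMP-only text the int[] roughly doubles the footprint, while for text dominated by surrogate pairs the two converge.

```java
// Hypothetical sketch: compares the raw buffer sizes of the two layouts
// under discussion. Not a real measurement of JVM object overhead.
public final class FootprintSketch {
    public static void main(String[] args) {
        String bmp = "abc".repeat(1000);   // 3000 chars, 3000 code points
        String astral = "😀".repeat(1000); // 2000 chars, 1000 code points

        // char[]: 2 bytes per UTF-16 unit
        System.out.println(bmp.toCharArray().length * 2);             // 6000
        // int[]: 4 bytes per code point -- double for BMP-only text
        System.out.println(bmp.codePoints().toArray().length * 4);    // 12000

        // for surrogate-pair-heavy text the sizes converge
        System.out.println(astral.toCharArray().length * 2);          // 4000
        System.out.println(astral.codePoints().toArray().length * 4); // 4000
    }
}
```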
And another question. The logic needs the mappings defined in this file: https://www.unicode.org/Public/14.0.0/ucd/CaseFolding.txt What's the better approach:

- ship the file and parse it at runtime, or
- generate Java sources from it at build time?
FYI: Graal did the latter: https://github.com/oracle/graal/blob/89e4cfc7aeea69970b60c64cd075ceb2a104e864/regex/docs/UpdatingUnicodeFiles.md
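For what it's worth, CaseFolding.txt has a simple, documented line format (`<code>; <status>; <mapping>; # <name>`), so whichever approach is chosen, the parsing step itself is small. A minimal sketch (class name invented here), assuming only the C (common) and S (simple) foldings are needed for regex-style simple case folding:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Minimal CaseFolding.txt parser sketch. F (full) lines map one code
// point to several and T (Turkic) lines are locale-special, so both
// are skipped for simple case folding.
public final class CaseFoldingParser {
    public static Map<Integer, Integer> parseSimpleFoldings(Path file)
            throws IOException {
        Map<Integer, Integer> foldings = new HashMap<>();
        for (String line : Files.readAllLines(file)) {
            int hash = line.indexOf('#');
            if (hash >= 0) line = line.substring(0, hash); // strip comments
            String[] fields = line.split(";");
            if (fields.length < 3) continue;               // blank/header line
            String status = fields[1].trim();
            if (!status.equals("C") && !status.equals("S")) continue;
            foldings.put(Integer.parseInt(fields[0].trim(), 16),
                         Integer.parseInt(fields[2].trim(), 16));
        }
        return foldings;
    }
}
```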
I am not familiar with Java performance, but when I implemented BigInt I used …. Perhaps you can check with ….

However, the benchmarks used by Rhino are quite old. When I looked at what benchmarks other JavaScript engines use, I couldn't find any that were compatible with the latest ECMA-262, but instead found this article.
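The benchmark tool referenced above is elided in the thread; assuming something like JMH (a common choice for JVM microbenchmarks, not necessarily what was suggested), a sketch comparing the two scanning styles might look like this:

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

// Hypothetical JMH benchmark: walking a String by code points via
// codePointAt/charCount vs. iterating a precomputed int[].
@State(Scope.Thread)
public class ScanBenchmark {
    String text;
    int[] codePoints;

    @Setup
    public void setup() {
        text = "a𝒳b".repeat(10_000); // mixes BMP chars with a surrogate pair
        codePoints = text.codePoints().toArray();
    }

    @Benchmark
    public void scanChars(Blackhole bh) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            bh.consume(cp);
            i += Character.charCount(cp);
        }
    }

    @Benchmark
    public void scanInts(Blackhole bh) {
        for (int cp : codePoints) {
            bh.consume(cp);
        }
    }
}
```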
I think you want to look at UTF-16 code units/surrogate pairs. Java Strings that require more than a single byte per character are stored as UTF-16, which is a variable-width encoding, and the higher code points are stored in two chars (i.e. two pairs of bytes).
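A small demonstration of that point:

```java
// A supplementary character occupies two chars (a surrogate pair)
// in a Java String, but is a single Unicode code point.
public class SurrogateDemo {
    public static void main(String[] args) {
        String s = "😀"; // U+1F600, outside the Basic Multilingual Plane
        System.out.println(s.length());                      // 2 code units
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true
    }
}
```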
Yeah, already got to the bottom of surrogate pairs and such. Thing is, if the code is to operate on chars in the string, then on each and every operation on a char you'd need to check whether it's a high/leading surrogate and, if so, whether it's followed by a low/trailing surrogate, which messes up the code big time. If instead the code operates on ints (Unicode code points), it is much cleaner.

The ECMAScript spec also mentions this in the note in https://tc39.es/ecma262/#sec-regexpidentifiercodepoint: from a spec POV the operations are on code points. It's up to the implementation to choose whether to internally operate on code points or on chars (and thus deal with the high/low surrogate pairs).

From an ease-of-coding perspective I'd prefer code points; I'm just not sure about the ramifications memory- and performance-wise.
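A sketch of the two iteration styles being contrasted here (`handle` is a hypothetical stand-in for whatever the matcher does per code point):

```java
public final class IterationStyles {

    static void handle(int codePoint) {
        // stand-in for the real per-code-point work
    }

    // char-based: every read needs the surrogate-pair check
    static void iterateChars(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c) && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                handle(Character.toCodePoint(c, s.charAt(i + 1)));
                i++; // skip the low surrogate just consumed
            } else {
                handle(c);
            }
        }
    }

    // code-point based: the pair handling disappears entirely
    static void iterateCodePoints(String s) {
        s.codePoints().forEach(IterationStyles::handle);
    }
}
```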
Another interesting one: … However, imho that v1 spec doesn't say what should happen if …

So, how to proceed? Break backwards compatibility completely? Tie the spec-compliant behavior to a specific language version (if so, which one)? Remain incompatible? By default become spec-compliant, but through a feature flag allow enabling the old behavior (I can see a lot of such flags coming)?

I think my preference would be to tie the proper, spec-compliant behavior to a new ECMAScript language version and then have one generic feature flag to enable ALL non-standard behavior (as far as it can be made optional in the codebase).
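A hedged sketch of that last option: Context.getLanguageVersion(), Context.VERSION_ES6, and Context.hasFeature(int) are existing Rhino API, but FEATURE_LEGACY_REGEXP is a flag name invented here purely for illustration.

```java
import org.mozilla.javascript.Context;

final class RegExpCompat {
    // Hypothetical feature index -- not part of Rhino today.
    static final int FEATURE_LEGACY_REGEXP = 999;

    // Spec-compliant behavior kicks in for ES6+ language versions,
    // unless the embedder opts back into the legacy behavior.
    static boolean specCompliant(Context cx) {
        return cx.getLanguageVersion() >= Context.VERSION_ES6
                && !cx.hasFeature(FEATURE_LEGACY_REGEXP);
    }
}
```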
See https://mathiasbynens.be/notes/es6-unicode-regex