Support Unicode RegExp property escapes #32214

Open
5 tasks done
mathiasbynens opened this issue Jul 2, 2019 · 3 comments
Labels
In Discussion Not yet reached consensus Suggestion An idea for TypeScript

Comments

@mathiasbynens

TypeScript currently doesn't transpile Unicode property escapes (of the form \p{ID_Start} or \P{ASCII}) in regular expressions.

It would be great if it did!

https://www.typescriptlang.org/play/?target=1#code/MYewdgzgLgBATgUwOYIB4wLwwPQB0AOA3gMrBwCW+UGA4oggNYC+2ArgNwBQokIANggB0fEEgAUiFKkFQE0MQHJAA8AKAlKvZA

Search Terms

regexp, regular expression, Unicode, property escapes, ES2018

Suggestion

Support transpiling Unicode property escapes in regular expressions. Examples:

/\p{ID_Start}/u;
/\P{ASCII}/u;
/\p{Script_Extensions=Greek}/u;
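
To make the request concrete, here is a minimal sketch of what downleveling one of these patterns amounts to. The replacement pattern is hand-written for illustration, not the output of TypeScript or any other tool; it relies only on the fact that `\p{ASCII}` covers U+0000 through U+007F. The property-escape pattern is built with the `RegExp` constructor so that the snippet itself type-checks at any compile target.

```typescript
// Hand-written equivalents for illustration only (not real transpiler output).
// \p{ASCII} covers U+0000..U+007F, so \P{ASCII} is everything outside that range.
const withEscape = new RegExp('\\P{ASCII}', 'u'); // needs an engine with ES2018 support
const downleveled = /[^\x00-\x7F]/;               // plain ES5 character class

const samples: string[] = ['plain ascii', 'caf\u00E9', '\u03C0', ''];

// For each sample, check that both patterns give the same test() answer.
const agreement: boolean[] = samples.map(function (s) {
  return withEscape.test(s) === downleveled.test(s);
});
console.log(agreement);
```

Properties such as `ID_Start` or `Script_Extensions=Greek` expand to far longer range lists, which is why generating them by hand is impractical.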

Use Cases

One particular use case is matching identifier characters in JavaScript parsers. This is currently commonly implemented as a large script-generated regular expression pattern (like in Esprima) or as a magical-looking list of code point ranges (like in TypeScript itself). However, it would be much simpler to use property escapes.

const regexIdentifierStart = /[$_\p{ID_Start}]/u;
const regexIdentifierPart = /[$_\u200C\u200D\p{ID_Continue}]/u;
const regexIdentifierName = /^(?:[$_\p{ID_Start}])(?:[$_\u200C\u200D\p{ID_Continue}])*$/u;
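
As a rough sketch of how the last pattern might be used, the snippet below wraps it in a small validator. `isIdentifierName` is a hypothetical helper name, the pattern is passed through the `RegExp` constructor only so the snippet type-checks regardless of the configured target, and it assumes a runtime that already supports property escapes. It does not handle reserved words or `\uXXXX` escapes inside identifiers.

```typescript
// Same pattern as regexIdentifierName above, written as a string so that the
// TypeScript compiler does not need to understand the `u` flag or \p{...}.
const identifierName = new RegExp(
  '^(?:[$_\\p{ID_Start}])(?:[$_\\u200C\\u200D\\p{ID_Continue}])*$',
  'u'
);

// Hypothetical helper: true when `candidate` has the shape of an identifier name.
// Reserved words (e.g. `class`) are not rejected here.
function isIdentifierName(candidate: string): boolean {
  return identifierName.test(candidate);
}

console.log(isIdentifierName('$bar_1'));     // ASCII identifier
console.log(isIdentifierName('\u03C0r2'));   // starts with Greek small letter pi
console.log(isIdentifierName('1abc'));       // starts with a digit
```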

Checklist

My suggestion meets these guidelines:

  • This wouldn't be a breaking change in existing TypeScript/JavaScript code
  • This wouldn't change the runtime behavior of existing JavaScript code
  • This could be implemented without emitting different JS based on the types of the expressions
  • This isn't a runtime feature (e.g. library functionality, non-ECMAScript syntax with JavaScript output, etc.)
  • This feature would agree with the rest of TypeScript's Design Goals.
@xanatos

xanatos commented Apr 28, 2020

It is probably more complex than that. RegExp also has a constructor that accepts a string, so you'd need a regex "translator" that can be used both by the compiler and at JavaScript runtime, because:

const regex = /\p{Script=Greek}/u;
const regex2 = new RegExp('\\p{Script=Greek}', 'u');
console.log(regex.test('π'));
console.log(regex2.test('π'));

Both of these should work.

Another possibility would be to rewrite every /.../ regular expression literal as a RegExp(string) call and let a polyfill (like regexpu) do all the regex translation at runtime.
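
For context on the runtime side of this, a string passed to the constructor is only parsed when the code runs, so whether it works depends on the engine rather than on the compiler. A minimal sketch of detecting that at runtime follows; `supportsPropertyEscapes` is a hypothetical helper name, and the sketch assumes that engines without support throw a `SyntaxError` when constructing the pattern.

```typescript
// Hypothetical helper: reports whether the current engine understands
// \p{...} property escapes under the `u` flag.
function supportsPropertyEscapes(): boolean {
  try {
    // Engines without ES2018 support reject either the `u` flag itself
    // or the \p escape, and throw while constructing the RegExp.
    return new RegExp('\\p{Script=Greek}', 'u').test('\u03C0');
  } catch (e) {
    return false;
  }
}

console.log(supportsPropertyEscapes());
```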

@mathiasbynens

regexpu is a transpiler, not a polyfill. It only translates regular expression literals, which seems strictly better than blocking this feature on RegExp constructor support and effectively doing nothing.

@redstrike
Copy link

redstrike commented Sep 29, 2021

Hello, is there any update? This is a very important feature to support (Babel already does). I'm relying on RegExp Unicode property escapes to simplify some normalization and CJKV character extraction functions. For example:

const normalizedWord = word.trim().toLowerCase().normalize('NFD')

// normalizedWord.replace(/[\u0300-\u036f]/gu, '')
// normalizedWord.replace(/[\^`\xA8\xAF\xB4\xB7\xB8\u02B0-\u034E\u0350-\u0357\u035D-\u0362\u0374\u0375\u037A\u0384\u0385\u0483-\u0487\u0559\u0591-\u05A1\u05A3-\u05BD\u05BF\u05C1\u05C2\u05C4\u064B-\u0652\u0657\u0658\u06DF\u06E0\u06E5\u06E6\u06EA-\u06EC\u0730-\u074A\u07A6-\u07B0\u07EB-\u07F5\u0818\u0819\u0898-\u089F\u08C9-\u08D2\u08E3-\u08FE\u093C\u094D\u0951-\u0954\u0971\u09BC\u09CD\u0A3C\u0A4D\u0ABC\u0ACD\u0AFD-\u0AFF\u0B3C\u0B4D\u0B55\u0BCD\u0C3C\u0C4D\u0CBC\u0CCD\u0D3B\u0D3C\u0D4D\u0DCA\u0E47-\u0E4C\u0E4E\u0EBA\u0EC8-\u0ECC\u0F18\u0F19\u0F35\u0F37\u0F39\u0F3E\u0F3F\u0F82-\u0F84\u0F86\u0F87\u0FC6\u1037\u1039\u103A\u1063\u1064\u1069-\u106D\u1087-\u108D\u108F\u109A\u109B\u135D-\u135F\u1714\u1715\u17C9-\u17D3\u17DD\u1939-\u193B\u1A75-\u1A7C\u1A7F\u1AB0-\u1ABE\u1AC1-\u1ACB\u1B34\u1B44\u1B6B-\u1B73\u1BAA\u1BAB\u1C36\u1C37\u1C78-\u1C7D\u1CD0-\u1CE8\u1CED\u1CF4\u1CF7-\u1CF9\u1D2C-\u1D6A\u1DC4-\u1DCF\u1DF5-\u1DFF\u1FBD\u1FBF-\u1FC1\u1FCD-\u1FCF\u1FDD-\u1FDF\u1FED-\u1FEF\u1FFD\u1FFE\u2CEF-\u2CF1\u2E2F\u302A-\u302F\u3099-\u309C\u30FC\uA66F\uA67C\uA67D\uA67F\uA69C\uA69D\uA6F0\uA6F1\uA700-\uA721\uA788-\uA78A\uA7F8\uA7F9\uA8C4\uA8E0-\uA8F1\uA92B-\uA92E\uA953\uA9B3\uA9C0\uA9E5\uAA7B-\uAA7D\uAABF-\uAAC2\uAAF6\uAB5B-\uAB5F\uAB69-\uAB6B\uABEC\uABED\uFB1E\uFE20-\uFE2F\uFF3E\uFF40\uFF70\uFF9E\uFF9F\uFFE3\u{102E0}\u{10780}-\u{10785}\u{10787}-\u{107B0}\u{107B2}-\u{107BA}\u{10AE5}\u{10AE6}\u{10D22}-\u{10D27}\u{10F46}-\u{10F50}\u{10F82}-\u{10F85}\u{11046}\u{11070}\u{110B9}\u{110BA}\u{11133}\u{11134}\u{11173}\u{111C0}\u{111CA}-\u{111CC}\u{11235}\u{11236}\u{112E9}\u{112EA}\u{1133C}\u{1134D}\u{11366}-\u{1136C}\u{11370}-\u{11374}\u{11442}\u{11446}\u{114C2}\u{114C3}\u{115BF}\u{115C0}\u{1163F}\u{116B6}\u{116B7}\u{1172B}\u{11839}\u{1183A}\u{1193D}\u{1193E}\u{11943}\u{119E0}\u{11A34}\u{11A47}\u{11A99}\u{11C3F}\u{11D42}\u{11D44}\u{11D45}\u{11D97}\u{16AF0}-\u{16AF4}\u{16B30}-\u{16B36}\u{16F8F}-\u{16F9F}\u{16FF0}\u{16FF1}\u{1AFF0}-\u{1AFF3}\u{1AFF5}-\u{1AFFB}\u{1AFFD}\u{1AFFE}\u{1CF00}-\u{1CF2D}\u{1CF30}-\u{1CF46}\u{1D167}-\u{1D169}\u{1D16D}-\u{1D172}\u{1D17B}-\u{1D182}\u{1D185}-\u{1D18B}\u{1D1AA}-\u{1D1AD}\u{1E130}-\u{1E136}\u{1E2AE}\u{1E2EC}-\u{1E2EF}\u{1E8D0}-\u{1E8D6}\u{1E944}-\u{1E946}\u{1E948}-\u{1E94A}]/gu, '')
normalizedWord.replace(/\p{Diacritic}/gu, '')

// normalizedWord.match(/[\u3006\u3007\u3021-\u3029\u3038-\u303A\u3400-\u4DBF\u4E00-\u9FFF\uF900-\uFA6D\uFA70-\uFAD9\u{16FE4}\u{17000}-\u{187F7}\u{18800}-\u{18CD5}\u{18D00}-\u{18D08}\u{1B170}-\u{1B2FB}\u{20000}-\u{2A6DF}\u{2A700}-\u{2B738}\u{2B740}-\u{2B81D}\u{2B820}-\u{2CEA1}\u{2CEB0}-\u{2EBE0}\u{2F800}-\u{2FA1D}\u{30000}-\u{3134A}]/gu)
normalizedWord.match(/\p{Ideographic}/gu)

// My thanks to this online tool: https://mothereff.in/regexpu
// It was very hard to crawl the Unicode docs and write out accurate Unicode ranges matching properties such as "Diacritic", "Ideographic", etc.

Most popular browsers have supported this feature since ES2018. However, I'm currently stuck with an outdated V8 engine that is embedded in a native Android/iOS game runtime. Upgrading the embedded V8 engine and integrating it into a third-party game engine is not a realistic option given the very tight deadline and the limits of my skills (it would be interesting to try, but that is better left for later).

I'm hoping that TypeScript will support transpiling this super awesome feature to lower targets such as ES3, ES5, and ES2015 through ES2017, so that many developers can benefit from it easily.
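
Until then, one possible stopgap for an older embedded engine is to pick the pattern at runtime. The sketch below is an assumption-laden illustration, not a recommended implementation: `buildDiacriticPattern` and `stripDiacritics` are hypothetical names, the input is assumed to be NFD-normalized already, and the fallback covers only the U+0300 to U+036F range from the commented-out line above, which is much narrower than `\p{Diacritic}`.

```typescript
// Hypothetical helper: prefer \p{Diacritic} when the engine can parse it,
// otherwise fall back to the Combining Diacritical Marks block only.
function buildDiacriticPattern(): RegExp {
  try {
    return new RegExp('\\p{Diacritic}', 'gu');
  } catch (e) {
    // Narrower than \p{Diacritic}: misses marks outside U+0300..U+036F.
    return /[\u0300-\u036f]/g;
  }
}

// Assumes `decomposed` has already been NFD-normalized by the caller.
function stripDiacritics(decomposed: string): string {
  return decomposed.replace(buildDiacriticPattern(), '');
}

// 'cafe' followed by U+0301 (combining acute accent).
console.log(stripDiacritics('cafe\u0301')); // prints "cafe"
```

Because the two branches match different sets of characters, results can differ between engines for input outside the fallback range, which is exactly the kind of inconsistency a compile-time transform would avoid.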

4 participants