Support Unicode RegExp property escapes #32214

mathiasbynens · 2019-07-02T09:52:01Z

TypeScript currently doesn't transpile Unicode property escapes (of the form \p{ID_Start} or \P{ASCII}) in regular expressions.

It would be great if it did!

https://www.typescriptlang.org/play/?target=1#code/MYewdgzgLgBATgUwOYIB4wLwwPQB0AOA3gMrBwCW+UGA4oggNYC+2ArgNwBQokIANggB0fEEgAUiFKkFQE0MQHJAA8AKAlKvZA

Search Terms

regexp, regular expression, Unicode, property escapes, ES2018

Suggestion

Support transpiling Unicode property escapes in regular expressions. Examples:

/\p{ID_Start}/u;
/\P{ASCII}/u;
/\p{Script_Extensions=Greek}/u;

Use Cases

One particular use case is matching identifier characters in JavaScript parsers. This is currently commonly implemented as a large script-generated regular expression pattern (like in Esprima) or as a magical-looking list of code point ranges (like in TypeScript itself). However, it would be much simpler to use property escapes.

const regexIdentifierStart = /[$_\p{ID_Start}]/u;
const regexIdentifierPart = /[$_\u200C\u200D\p{ID_Continue}]/u;
const regexIdentifierName = /^(?:[$_\p{ID_Start}])(?:[$_\u200C\u200D\p{ID_Continue}])*$/u;
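As a minimal sketch of how the last pattern could be used, the helper below wraps it in a predicate. The name `isIdentifierName` is hypothetical. The pattern is built with the `RegExp` constructor, not a literal, on the assumption that this keeps the source compiling whatever `target` is configured; the runtime itself must still support ES2018 property escapes.

```typescript
// Same pattern as regexIdentifierName above, expressed as a string source.
// Backslashes are doubled because this is a string, not a regex literal.
const identifierNamePattern = new RegExp(
  '^(?:[$_\\p{ID_Start}])(?:[$_\\u200C\\u200D\\p{ID_Continue}])*$',
  'u'
);

// Hypothetical helper: true when `candidate` has the shape of an identifier name.
// It does not exclude reserved words such as `class` or `return`.
function isIdentifierName(candidate: string): boolean {
  return identifierNamePattern.test(candidate);
}

console.log(isIdentifierName('π'));
console.log(isIdentifierName('1abc'));
```

A real parser would also need to handle Unicode escape sequences inside identifiers and reject reserved words, which this sketch leaves out.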

Checklist

My suggestion meets these guidelines:

This wouldn't be a breaking change in existing TypeScript/JavaScript code
This wouldn't change the runtime behavior of existing JavaScript code
This could be implemented without emitting different JS based on the types of the expressions
This isn't a runtime feature (e.g. library functionality, non-ECMAScript syntax with JavaScript output, etc.)
This feature would agree with the rest of TypeScript's Design Goals.


xanatos · 2020-04-28T11:42:22Z

This is probably more complex than it looks. RegExp also has a constructor that accepts a string, so you would need a regex "translator" that can be used by both the compiler and the JavaScript runtime, because:

const regex = /\p{Script=Greek}/u;
const regex2 = new RegExp('\\p{Script=Greek}', 'u');
console.log(regex.test('π'));
console.log(regex2.test('π'));

Both of these should work.

Another possibility would be to rewrite all /.../ regular expression literals as RegExp(string) calls and let a polyfill (like regexpu) do all the regex translation.
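To illustrate why the constructor case is harder for a compiler, here is a minimal sketch in which the pattern source only exists at runtime. The function name `scriptMatcher` is hypothetical, and the example assumes an engine with ES2018 property escape support.

```typescript
// The pattern is assembled from a runtime value, so a compile-time rewrite
// of regex literals never sees the final `\p{Script=...}` escape.
function scriptMatcher(script: string): RegExp {
  return new RegExp('\\p{Script=' + script + '}', 'u');
}

const greek = scriptMatcher('Greek');
console.log(greek.test('π'));
console.log(greek.test('a'));
```

An unknown script name makes the constructor throw a `SyntaxError` at runtime, which is another thing a literal-only transpiler cannot report ahead of time.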

mathiasbynens · 2020-04-28T17:55:31Z

regexpu is a transpiler, not a polyfill. It only translates regular expression literals, which seems strictly better than blocking this feature on RegExp support and effectively doing nothing.
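As a rough illustration of what translating a pattern ahead of time means, the sketch below rewrites only `\p{ASCII}` and `\P{ASCII}` into equivalent character classes. This is not regexpu's API or algorithm: `rewriteAsciiEscapes` is a hypothetical name, and the approach is deliberately naive (it ignores escapes that already sit inside a character class, escaped backslashes, and every other property).

```typescript
// \p{ASCII} is exactly the range U+0000..U+007F.
const ASCII_RANGE = '\\0-\\x7F';

// Hypothetical, naive rewrite of a regex *source string*.
function rewriteAsciiEscapes(source: string): string {
  return source
    .replace(/\\p\{ASCII\}/g, '[' + ASCII_RANGE + ']')
    .replace(/\\P\{ASCII\}/g, '[^' + ASCII_RANGE + ']');
}

const rewritten = rewriteAsciiEscapes('^\\p{ASCII}+$');
console.log(rewritten);

// The rewritten source no longer needs property escape support.
console.log(new RegExp(rewritten).test('abc'));
```

Properties such as `ID_Start` or `Diacritic` expand to far larger sets of ranges generated from Unicode data, which is the part a real transpiler has to supply.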

redstrike · 2021-09-29T12:04:03Z

Hello, is there any update? This is a very important feature to support (Babel already does). I'm relying on RegExp Unicode property escapes to simplify some normalization and CJKV character extraction functions. For example:

const normalizedWord = word.trim().toLowerCase().normalize('NFD')

// normalizedWord.replace(/[\u0300-\u036f]/gu, '')
// normalizedWord.replace(/[\^`\xA8\xAF\xB4\xB7\xB8\u02B0-\u034E\u0350-\u0357\u035D-\u0362\u0374\u0375\u037A\u0384\u0385\u0483-\u0487\u0559\u0591-\u05A1\u05A3-\u05BD\u05BF\u05C1\u05C2\u05C4\u064B-\u0652\u0657\u0658\u06DF\u06E0\u06E5\u06E6\u06EA-\u06EC\u0730-\u074A\u07A6-\u07B0\u07EB-\u07F5\u0818\u0819\u0898-\u089F\u08C9-\u08D2\u08E3-\u08FE\u093C\u094D\u0951-\u0954\u0971\u09BC\u09CD\u0A3C\u0A4D\u0ABC\u0ACD\u0AFD-\u0AFF\u0B3C\u0B4D\u0B55\u0BCD\u0C3C\u0C4D\u0CBC\u0CCD\u0D3B\u0D3C\u0D4D\u0DCA\u0E47-\u0E4C\u0E4E\u0EBA\u0EC8-\u0ECC\u0F18\u0F19\u0F35\u0F37\u0F39\u0F3E\u0F3F\u0F82-\u0F84\u0F86\u0F87\u0FC6\u1037\u1039\u103A\u1063\u1064\u1069-\u106D\u1087-\u108D\u108F\u109A\u109B\u135D-\u135F\u1714\u1715\u17C9-\u17D3\u17DD\u1939-\u193B\u1A75-\u1A7C\u1A7F\u1AB0-\u1ABE\u1AC1-\u1ACB\u1B34\u1B44\u1B6B-\u1B73\u1BAA\u1BAB\u1C36\u1C37\u1C78-\u1C7D\u1CD0-\u1CE8\u1CED\u1CF4\u1CF7-\u1CF9\u1D2C-\u1D6A\u1DC4-\u1DCF\u1DF5-\u1DFF\u1FBD\u1FBF-\u1FC1\u1FCD-\u1FCF\u1FDD-\u1FDF\u1FED-\u1FEF\u1FFD\u1FFE\u2CEF-\u2CF1\u2E2F\u302A-\u302F\u3099-\u309C\u30FC\uA66F\uA67C\uA67D\uA67F\uA69C\uA69D\uA6F0\uA6F1\uA700-\uA721\uA788-\uA78A\uA7F8\uA7F9\uA8C4\uA8E0-\uA8F1\uA92B-\uA92E\uA953\uA9B3\uA9C0\uA9E5\uAA7B-\uAA7D\uAABF-\uAAC2\uAAF6\uAB5B-\uAB5F\uAB69-\uAB6B\uABEC\uABED\uFB1E\uFE20-\uFE2F\uFF3E\uFF40\uFF70\uFF9E\uFF9F\uFFE3\u{102E0}\u{10780}-\u{10785}\u{10787}-\u{107B0}\u{107B2}-\u{107BA}\u{10AE5}\u{10AE6}\u{10D22}-\u{10D27}\u{10F46}-\u{10F50}\u{10F82}-\u{10F85}\u{11046}\u{11070}\u{110B9}\u{110BA}\u{11133}\u{11134}\u{11173}\u{111C0}\u{111CA}-\u{111CC}\u{11235}\u{11236}\u{112E9}\u{112EA}\u{1133C}\u{1134D}\u{11366}-\u{1136C}\u{11370}-\u{11374}\u{11442}\u{11446}\u{114C2}\u{114C3}\u{115BF}\u{115C0}\u{1163F}\u{116B6}\u{116B7}\u{1172B}\u{11839}\u{1183A}\u{1193D}\u{1193E}\u{11943}\u{119E0}\u{11A34}\u{11A47}\u{11A99}\u{11C3F}\u{11D42}\u{11D44}\u{11D45}\u{11D97}\u{16AF0}-\u{16AF4}\u{16B30}-\u{16B36}\u{16F8F}-\u{16F9F}\u{16FF0}\u{16FF1}\u{1AFF0}-\u{1AFF3}\u{1AFF5}-\u{1AFFB}\u{1AFFD}\u{1AFFE}\u{1CF00}-\u{1CF2D}\u{1CF30}-\u{1CF46}\u{1D167}-\u{1D169}\u{1D16D}-\u{1D172}\u{1D17B}-\u{1D182}\u{1D185}-\u{1D18B}\u{1D1AA}-\u{1D1AD}\u{1E130}-\u{1E136}\u{1E2AE}\u{1E2EC}-\u{1E2EF}\u{1E8D0}-\u{1E8D6}\u{1E944}-\u{1E946}\u{1E948}-\u{1E94A}]/gu, '')
normalizedWord.replace(/\p{Diacritic}/gu, '')

// normalizedWord.match(/[\u3006\u3007\u3021-\u3029\u3038-\u303A\u3400-\u4DBF\u4E00-\u9FFF\uF900-\uFA6D\uFA70-\uFAD9\u{16FE4}\u{17000}-\u{187F7}\u{18800}-\u{18CD5}\u{18D00}-\u{18D08}\u{1B170}-\u{1B2FB}\u{20000}-\u{2A6DF}\u{2A700}-\u{2B738}\u{2B740}-\u{2B81D}\u{2B820}-\u{2CEA1}\u{2CEB0}-\u{2EBE0}\u{2F800}-\u{2FA1D}\u{30000}-\u{3134A}]/gu)
normalizedWord.match(/\p{Ideographic}/gu)

// My thanks to this online tool: https://mothereff.in/regexpu
// It was very hard to crawl the Unicode docs and write out accurate Unicode ranges matching properties such as "Diacritic", "Ideographic", etc.

Most popular browsers have supported this feature since ES2018. However, I'm currently stuck with an outdated V8 engine embedded in a native Android/iOS game runtime. Upgrading the embedded V8 engine and integrating it into a third-party game engine is not a viable option given the very tight deadline and the limits of my own skills (it would be interesting to try, but that is better left for later).

I'm hoping that TypeScript will support transpiling this super awesome feature to lower targets such as ES3, ES5, and ES2015 through ES2017, so that many developers can benefit from it easily.
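Until such transpilation exists, one possible workaround on an older engine is runtime feature detection with a hand-written fallback. The sketch below is an assumption-laden illustration: `supportsPropertyEscapes` and `createDiacriticMatcher` are hypothetical names, and the fallback is the narrower `[\u0300-\u036f]` range from the commented-out line earlier, which covers only the combining diacritical marks block and not everything `\p{Diacritic}` matches.

```typescript
// Engines without the `u` flag or without property escapes throw a SyntaxError
// when the pattern is constructed, so the probe must use the constructor.
function supportsPropertyEscapes(): boolean {
  try {
    return new RegExp('\\p{ASCII}', 'u').test('a');
  } catch (error) {
    return false;
  }
}

// Picks the property escape when available, else the narrower fallback range.
function createDiacriticMatcher(): RegExp {
  return supportsPropertyEscapes()
    ? new RegExp('\\p{Diacritic}', 'gu')
    : /[\u0300-\u036f]/g;
}

console.log('e\u0301'.replace(createDiacriticMatcher(), ''));
```

The probe has to go through the constructor because a regex literal with an unsupported escape would fail when the whole script is parsed, before any detection code could run.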

mathiasbynens mentioned this issue Jul 2, 2019

Consider using Unicode RegExp property escapes for identifier matching jquery/esprima#1979

Open

sandersn added the "Suggestion" (An idea for TypeScript) and "In Discussion" (Not yet reached consensus) labels on Jul 2, 2019

This was referenced May 25, 2021

alpha25+ requires ECMAScript 2018 mui/mui-x#1630

Closed

[DataGrid] Build failing with message "expected atom at position 4" mui/mui-x#1766

Closed

evanw mentioned this issue Apr 28, 2022

Can't convert RegExp:Unicode property escapes evanw/esbuild#2215

Closed
