This repository has been archived by the owner on May 20, 2022. It is now read-only.

Using something other than \p{…}, e.g. \m{…}? #10

Closed
mathiasbynens opened this issue May 24, 2018 · 43 comments

Comments

@mathiasbynens
Member

After the TC39 meeting, @msaboff suggested not overloading the meaning of \p on the grounds that it behaves differently than existing \p for non-sequence properties. This proposal breaks the invariant that \p always expands to a character class. He suggested \q for sequence. Unfortunately, \Q is a modifier in Perl regular expressions so using \q might be confusing.

I thought not having to introduce new syntax was a nice property (pun not intended), and I defend overloading \p{…} because sequence properties are still properties. The mental model is: \p{…} refers to a Unicode property. I don’t think end users think of \p as character classes (although they are currently implemented as such, and currently happen to be transpiled as such).

What does everyone else think? Should we use \p{…} or use something else?

@bathos

bathos commented May 25, 2018

I think \q adds useful clarity. (As for Perl, I would expect anybody coming from there to be unfazed by learning new meanings for symbols.)

@mathiasbynens
Member Author

mathiasbynens commented May 25, 2018

@bathos Can you elaborate on the additional clarity offered by \q? What makes \q{SomeSequenceProperty} more clear than \p{SomeSequenceProperty}?

@bathos

bathos commented May 25, 2018

My understanding (correct me if this is wrong) is that sequences cannot be used in [] character classes. So the benefit is being able to discern correctness from syntax (without needing to know the name of every character property and character sequence property / alias).

@mathiasbynens
Member Author

My counterargument is that you don’t need syntax to make that distinction. All sequence properties, now and in the future, will have _Sequence in them.

@bathos

bathos commented May 25, 2018

I didn’t realize that. In that case, although it still feels like they’re substantially different to me, the case for q (as I saw it at least) is certainly weakened.

@msaboff

msaboff commented May 29, 2018

I believe that \q (or whatever distinct character is chosen) does add clarity because sequences couldn't be used in character classes. This is irrespective of the _Sequence suffix for the Unicode sequences.

It is much clearer that \p and \P terms can be used in character classes while \q terms cannot. There is no need to look at the properties or sequence names to see that.

It doesn't seem to be too far of a stretch to think that the Unicode consortium might add property names for the named sequences in Unicode TR34. None of these sequence names end with "Sequence" and few have it in their name.

@mathiasbynens
Member Author

If a developer explicitly refers to a given Unicode property, we have to assume they know what that property means according to Unicode. The property's exact behavior, including which code points (or sequences) it covers, is defined at the Unicode level, not the ECMAScript level.

Unicode properties, and the way any given Unicode property behaves per the Unicode Standard, can change over time. Non-sequence properties could start to include some sequences as well. If we decide to go with anything else than \p{...}, such an upstream change in Unicode would suddenly be a breaking change in ECMAScript, as then \p{SaidProperty} would no longer work, and instead developers would have to start using \q{SaidProperty}. IMHO, we shouldn't let syntax depend on an upstream spec we do not control.

@msaboff

msaboff commented Sep 14, 2018

If a developer explicitly refers to a given Unicode property, we have to assume they know what that property means according to Unicode. The property's exact behavior, including which code points (or sequences) it covers, is defined at the Unicode level, not the ECMAScript level.

Unicode properties, and the way any given Unicode property behaves per the Unicode Standard, can change over time. Non-sequence properties could start to include some sequences as well. If we decide to go with anything else than \p{...}, such an upstream change in Unicode would suddenly be a breaking change in ECMAScript, as then \p{SaidProperty} would no longer work, and instead developers would have to start using \q{SaidProperty}. IMHO, we shouldn't let syntax depend on an upstream spec we do not control.

The properties that formed the basis for the prior Unicode Property Escape feature are part of the core Unicode Standard, with some Emoji properties pulled in from the standalone Emoji Unicode Technical Standard #51. The property data files for the core Unicode Standard are defined in the Unicode Character Database, along with the data files for Unicode Technical Standard #51. These are distinct from the data files that contain sequences: UCD NamedSequences.txt and UTS #51 emoji-sequences.txt. The core Unicode specification says that all named sequences are in NamedSequences.txt (see Unicode Named Character Sequences (UAX34) Data Files) and not in the property data files. The Emoji Technical Standard (UTS 51) Data Files lists the properties in one file and the sequences in other files.

In the case of the initial Emoji sequences proposed for this feature, they all have a _Sequence suffix. If Unicode changes a property to a sequence, the standard states that it will be put in a sequence data file. Likely this means that they will move that property from a property data file to a sequence data file. For core Unicode sequences, UAX 34 defines a sequence uniqueness naming rule that would assure that the sequence name would be different from a property name. For Emoji properties that become sequences, at a minimum there would be an additional _Sequence suffix. Given their past history, I would expect a deprecation period where both the property and the sequence exist. Both the old property and new sequence would need to be supported by JavaScript, something we could not do if they had the same name and only \p{Property} syntax. If we used \p for both of these, developers would need to update their source when a property changed names when becoming a sequence, e.g. from \p{Emoji_Keycap} to \p{Emoji_Keycap_Sequence}. Changing the escape character from \p to \q at that time would communicate that the semantics change from matching a single code point to a sequence.

Given this, I don't think that we have an issue with using a distinct character to denote a sequence \q{SaidSequence} from what we use for a property \p{SaidProperty} or \P{SaidProperty}. In fact I believe it is preferred given the way Unicode specifies sequence names distinct from property names.

@gibson042

Restricting \p{…} to match only single characters would be nice, as would be forbidding a distinct sequence-friendly escape inside a character class.

But if we do pursue the change, I would like to avoid \q or anything else already in use by e.g. Perl (in either upper or lower case). We are constrained by existing ECMAScript regular expression syntax, but not by much given the presence of a Unicode flag—there are still many letters available (\g, \i, \j, \x, \y, \z), not to mention punctuation like \_, \#, or \&.

@ljharb
Member

ljharb commented Sep 26, 2018

What about \p+{ or \p*{?

@mathiasbynens
Member Author

mathiasbynens commented Sep 26, 2018

Let’s separate the discussion on whether or not to use \p from the discussion/bikeshed (😅) on what else to use. When it comes to it, I’ll file an issue where we can all discuss which letter to use.

Let’s also avoid discussing hypothetical changes on the Unicode level where an existing property suddenly goes from being a non-sequence property to also supporting sequences. As @gibson042 points out, this is already a problem for existing property escapes, and is not new to this proposal. I’m confident we can work with the Unicode Consortium to prevent such changes from happening.

For now, I’d like to follow @littledan’s suggestion of basing this decision on the developer-facing mental model.

The case for \p{Seq}

With \p{Seq}, the mental model would remain: \p{Foo} refers to the Unicode property Foo. Depending on what Unicode says about this property, behavior varies. (E.g. sequence properties don’t work in character classes.) Developers don’t have to worry about it for the common case (i.e. \p{Seq} outside of a character class); if and only if they run into the \P{Seq} or [\p{Seq}] case, they’ll see the (hopefully descriptive) error message, consult documentation, fix their code, and move on.

The case for something else

With \q{Seq}, developers are forced to learn new syntax just in case they run into the \q{Seq} or [\q{Seq}] case. The mental model would be: use \p{Foo} if Foo is a non-sequence property, and use \q{Foo} if it’s a sequence property.

My bias is clearly showing here so I’m hoping someone else can chime in and illustrate how the mental model with \q{…} would be more developer-friendly.

@ljharb
Member

ljharb commented Sep 26, 2018

I don't have a strong opinion either way, but it might be relatively friendly if a developer uses a sequence with the single syntax and gets an error "this is a sequence, use PLACEHOLDER instead" - and if the reverse, "this is not a sequence, use \p instead" - in that it would communicate the likely effects more clearly than "let me go google the unicode property name"

@msaboff

msaboff commented Sep 26, 2018

The case for \p{Seq}

With \p{Seq}, the mental model would remain: \p{Foo} refers to the Unicode property Foo. Depending on what Unicode says about this property, behavior varies. (E.g. sequence properties don’t work in character classes.) Developers don’t have to worry about it for the common case (i.e. \p{Seq} outside of a character class); if and only if they run into the \P{Seq} or [\p{Seq}] case, they’ll see the (hopefully descriptive) error message, consult documentation, fix their code, and move on.

There currently are not any Unicode named sequence properties; they are hypothetical in this proposal. Properties are defined in UTR 23 - The Unicode Character Property Model, and named sequences are defined in UTR 34 - Unicode Named Character Sequences. They are separate.

This proposal references Basic_Emoji, but the current draft spec TR#51 - Unicode Emoji describes this using the new terminology of a Set, which is discussed separately from Emoji sequences.

The case for something else

With \q{Seq}, developers are forced to learn new syntax just in case they run into the \q{Seq} or [\q{Seq}] case. The mental model would be: use \p{Foo} if Foo is a non-sequence property. Use \q{Foo} if it’s a sequence property.

They only have to learn the new syntax if they use sequences.

My bias is clearly showing here so I’m hoping someone else can chime in and illustrate how the mental model with \q{…} would be more developer-friendly.

Note that in UTR 18 - Unicode Regular Expressions, 2.5.1 Individually Named Characters, there is separate syntax, \N, for named characters, which can include named character sequences. Point 1 under that section states that \N matches a single character or a sequence, while \p matches a set of characters.

@gibson042

gibson042 commented Sep 26, 2018

The mental model in favor of separate escapes would be correspondence with the Unicode data model, in which sequences are not properties but rather a separate class of information (cf. emoji-sequences.txt: "The type_field [any of {Emoji_Combining_Sequence, Emoji_Flag_Sequence, Emoji_Modifier_Sequence}] is a convenience for parsing the emoji sequence files, and is not intended to be maintained as a property."). So \p{…} is always a single code point, while a separate escape is one or more.

@tonton-pixel

tonton-pixel commented Sep 29, 2018

As a developer, I find it quite difficult to consider that a list of sequences of code points can be expressed as a property; I don't really understand what it would be a property of, to begin with...

Also, I have the feeling that, in order for anything to qualify as a regex property escape, it must comply with at least these basic rules:

  • A property escape \p{...} can be negated: \P{...}.
  • A property escape \p{...} can be part of a character class: [...\p{...}...].

Right now, a valid Unicode property escape \p{...} maps to some [...] character class; likewise, \P{...} would map to [^...]; so, it does respect the rules because it involves only one character at a time.

On the other hand, a list of sequences of code points (characters) would map instead to a set of alternatives (?:a|mn|xyz), and therefore would not meet the requirements to qualify because it deals with variable-length strings.
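To make the contrast above concrete, here is a minimal sketch using toy stand-ins rather than real Unicode data. The set [abc] plays the role of a character property and the list a, mn, xyz plays the role of a list of sequences; all variable names are made up for illustration.

```javascript
// Toy stand-ins, not real Unicode data.
const charClass = /^[abc]$/u;        // shape of \p{...}: one code point from a set
const negatedClass = /^[^abc]$/u;    // shape of \P{...}: one code point outside the set
const sequences = /^(?:a|mn|xyz)$/u; // a list of sequences: variable-length alternatives

// Count code points the way the u flag sees them.
const codePointCount = (s) => [...s].length;

const classMatches = ["a", "b", "c"].filter((s) => charClass.test(s));
const sequenceMatches = ["a", "mn", "xyz"].filter((s) => sequences.test(s));

const classLengths = classMatches.map(codePointCount);       // [1, 1, 1]
const sequenceLengths = sequenceMatches.map(codePointCount); // [1, 2, 3]

console.log(classLengths, sequenceLengths);
console.log(negatedClass.test("z")); // true: negation is a simple complement for a class
```

Every match of the class is exactly one code point long, so its complement is well defined. The alternation matches strings of different lengths, so there is no equally simple "everything else" to negate it into.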

In conclusion, I would propose that:

  • If a list of sequences of characters can indeed be expressed as a property, then it must use \p{...}, for the sake of consistency.
  • Otherwise, it must use some other syntax...

[ FYI, I opened a separate issue on the subject of "Sequence properties" terminology ]

@mathiasbynens
Member Author

Re: @gibson042's and @msaboff's comments w.r.t.

the Unicode data model, in which sequences are not properties but rather a separate class of information

I don't really understand what it would be a property of, to begin with...

I've been working with the Unicode Consortium on clarifying all this upstream. Here's a quote from Mark Davis himself:

I'll give you my mental model:

A property P is a function that maps a key K to value V: that is, K is in the domain and V is in the codomain and P(K) = V. For Unicode properties, the codomain can be simple (binary, enum, etc.) or complex (set of enum values, etc.). The value of P(K) is stable within the same version of the spec that defines the property P, but may change in successive versions. Typically there is a distinguished "n/a" value returned where the key is not in the domain, since that is easier for programming. A function from X to Y can also be called a "Y property of Xs" or "Y property over Xs". Thus a "binary property of strings" is a function from a string to a binary value.

A property of code points is equivalent to a property of strings with exactly one code point. In most Unicode documentation, the "of code-points" may be omitted. Thus when you see "string property", it [currently] usually means a string property of code points: that is, a function whose domain is code points and codomain is strings.

A property of code points is also known as a code point property or character property, although formally speaking not all code points are characters; some are not (yet) assigned to characters, and some cannot be (e.g. surrogates). A property of strings is also known as a property of sequences.

So Emoji_Keycap_Sequence is a binary property of strings, and Emoji_Presentation is a binary property of code points (but thereby also a binary property of strings). Both of them can also be called emoji properties, since they deal with emoji.
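As a rough illustration of that model, the two kinds of binary property can be sketched as predicates over strings. This is only a sketch: isEmojiPresentation and isEmojiKeycapSequence are hypothetical helper names, and the keycap pattern is written out by hand (base character, U+FE0F, U+20E3) rather than relying on any engine support for a sequence property escape.

```javascript
// A binary property of code points: only strings of exactly one code point can be true.
const isEmojiPresentation = (s) => /^\p{Emoji_Presentation}$/u.test(s);

// A binary property of strings, hand-written for illustration:
// one of 0-9, # or *, followed by U+FE0F and U+20E3.
const isEmojiKeycapSequence = (s) => /^[0-9#*]\uFE0F\u20E3$/u.test(s);

const keycapOne = "1\uFE0F\u20E3"; // three code points
const grinningFace = "\u{1F600}";  // one code point

console.log(isEmojiPresentation(grinningFace));   // true
console.log(isEmojiPresentation(keycapOne));      // false
console.log(isEmojiKeycapSequence(keycapOne));    // true
console.log(isEmojiKeycapSequence(grinningFace)); // false
```

Both helpers have the same signature (string in, boolean out), which is the sense in which a property of code points is also a property of strings.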

@mathiasbynens
Member Author

Over the past few months, I've worked on a proposal at the Unicode level, which has now been formally submitted.

One of the many things it addresses is this question of "are sequence properties really properties"? As you all know, my assumption has always been that various Unicode documents just hadn't been updated when emoji got standardized, and that's why "properties" only seems to refer to character properties right now. Mark Davis, who authored several of these documents, shares this view. Among other things, the proposal makes this assumption explicit, which (if accepted) would resolve this part of the discussion.

Mark Davis will present the proposal during the January UTC meeting at Google MTV from January 22–25. We'll be able to resolve this issue accordingly once the UTC accepts/rejects the proposal.

@mathiasbynens
Member Author

mathiasbynens commented May 4, 2019

While I’m still hoping we can proceed with \p{…}, it would be good to come up with a list of available letters instead of \q just in case UTC decides otherwise. Requirements:

  • must not be used in JavaScript (this rules out e.g. \d)
  • must not be used in any other language (this rules out e.g. \x)
  • the uppercased/lowercased form of the letter must also not be used in JavaScript or any other language (this rules out e.g. \q, since Perl has \Q)
  • must be within the ASCII range (this rules out e.g. \🔡{…})

Here’s the results:

| letter | status | notes |
|---|---|---|
| a | taken | \a matches U+0007 BELL in Perl |
| b | taken | \b is a word boundary |
| c | taken | \cX matches Control-X |
| d | taken | \d matches ASCII digits |
| e | taken | \e means “escape character” in Perl |
| f | taken | \f matches U+000C FORM FEED (FF) |
| g | taken | \g{} and \g1 are backreferences in Perl |
| h | taken | \h matches horizontal whitespace in PCRE, PHP, Java, and Boost |
| i | available | (?i:) (but not \i) makes part of the regular expression case-insensitive in Perl |
| j | available | |
| k | taken | \k<foo> is a named backreference |
| l | taken | \l means “lowercase next character” in Perl |
| m | available | |
| n | taken | \n matches U+000A LINE FEED (LF) |
| o | taken | \o{…} is an octal escape sequence in Perl |
| p | current syntax | |
| q | taken | \Q is a modifier in Perl |
| r | taken | \r matches U+000D CARRIAGE RETURN (CR) |
| s | taken | \s matches whitespace characters |
| t | taken | \t matches U+0009 CHARACTER TABULATION |
| u | taken | \uFFFF and \u{FFFF} are hexadecimal escape sequences |
| v | taken | \v matches U+000B LINE TABULATION |
| w | taken | \w matches word characters |
| x | taken | \x{FFFF} is a hexadecimal escape sequence in Perl, PCRE, Boost, and std::regex |
| y | available | |
| z | taken | \z and \Z mean “end of string” in Perl |

TL;DR Other than \p{…}, we only have the following single-letter options: \i{…}, \j{…}, \m{…}, \y{…}.
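The filtering behind that TL;DR can be reproduced in a few lines. This is just a sketch: the taken string is transcribed by hand from the letters whose notes above list an existing use (q counts because of \Q), and the variable names are made up.

```javascript
// Letters with an existing use according to the notes above (q is ruled out by Perl's \Q).
const taken = "abcdefghklnoqrstuvwxz";
const current = "p"; // already used for property escapes
const alphabet = "abcdefghijklmnopqrstuvwxyz";

const available = [...alphabet].filter(
  (letter) => !taken.includes(letter) && letter !== current
);

console.log(available.map((letter) => `\\${letter}{…}`).join(", "));
// \i{…}, \j{…}, \m{…}, \y{…}
```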

@FireyFly

FireyFly commented May 4, 2019

@mathiasbynens shouldn't \a{…} be disqualified if \a is used to match \x07 in Perl? (per the note provided)

@mathiasbynens
Member Author

@FireyFly It should have been! I forgot to update the status column. Fixed now — thanks!

@msaboff

msaboff commented May 6, 2019

Why the restriction that both the lower and upper case must be available in the relevant RE engines?
Also, by requiring that we choose an escape letter that isn't used by other languages, we may unnecessarily restrict our options.

@mathiasbynens
Member Author

Why the restriction that both the lower and upper case must be available in the relevant RE engines?

We could use an uppercase letter when only the lowercase variant is taken, or vice versa. It’s just not ideal.

Also, by requiring that we choose an escape letter that isn't used by other languages, we may unnecessarily restrict our options.

TC39 has made the decision to let UTC decide here. For UTC and UTS18, other languages matter.

@littledan
Member

Distinguishing based on case seems odd, since for the most part, that is used for negation in RegExps.

@bakkot

bakkot commented Oct 2, 2019

I said this at the meeting today, but let me try to capture it here as well:

In order to reason about how a regular expression with the u flag behaves, you need to know that the notion of "character" as understood by such regexes is different from the notion of "character" as understood by .length and string[n]. Anyone working with Unicode strings has likely encountered the fact that these two distinct notions exist, or will be bitten by it soon in any case, and I would give good odds that they have also encountered the fact that you can expose this other notion of "character" by doing [...string]. You don't need to have the word "code point" in your head to do this.
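For readers who have not run into the two notions yet, a small sketch with an arbitrarily chosen astral character (U+1F4A9) shows the difference; the variable names are only for illustration.

```javascript
const astral = "\u{1F4A9}"; // one code point, stored as two UTF-16 code units

const codeUnits = astral.length;       // 2: the .length / string[n] notion
const codePoints = [...astral].length; // 1: the notion exposed by [...string]

const dotWithoutU = /^.$/.test(astral); // false: "." consumes one code unit
const dotWithU = /^.$/u.test(astral);   // true: "." consumes one code point

console.log(codeUnits, codePoints, dotWithoutU, dotWithU); // 2 1 false true
```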

As a reader or maintainer of code, the way that I understand the behavior of a regular expression is by "walking" it along the string, one "character" at a time. With the u flag, that's "one character as exposed by [...string]", which is what I would probably be looking at when trying to understand the regex. It is very important, then, that I be able to understand from reading the regex if it is going to match exactly one character or if it is going to match something other than that. (As a maintainer, I would also like to understand if I can put something in a character class.)

Currently that is the case: it is always obvious from surface syntax how many code points an atom can match.

If \p is overloaded, now I need to understand what kind of thing is referred to by the name within its braces.

That's bad.

We should not overload \p in this way.

@mathiasbynens
Member Author

Currently that is the case: it is always obvious from surface syntax how many code points an atom can match.

How many code points can \p{Foo} match? It can either match one, or throw (e.g. \p{Invalid}).

If \p is overloaded, now I need to understand what kind of thing is referred to by the name within its braces.

That’s already the case — see the above example. In order to understand how a regular expression behaves, knowing what the thing between the braces means is already a requirement today.
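The "match one, or throw" behavior of existing property escapes can be observed with the RegExp constructor. This is a sketch only; compiles is a hypothetical helper, and \p{Invalid} is the made-up property name from the example above.

```javascript
// Returns true if the pattern compiles with the u flag, false on a SyntaxError.
const compiles = (source) => {
  try {
    new RegExp(source, "u");
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error; // anything else is unexpected
  }
};

console.log(compiles("\\p{Lu}"));      // true: a known property
console.log(compiles("[\\p{Lu}]"));    // true: allowed in a character class
console.log(compiles("\\P{Lu}"));      // true: negation is allowed
console.log(compiles("\\p{Invalid}")); // false: unknown property name
```

With a regex literal the same failure surfaces as an early error when the source is parsed, rather than as an exception at run time.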

@waldemarhorwat

My preference is to just go with \p for everything. It's hard to remember which properties expand to single characters and which expand to sequences. In a context where we don't need negation or character classes it's gratuitous to have to remember to write \p{Emoji} vs. \q{Basic_Emoji}.

I can see some of the arguments for going with \q for sequence properties. However, if we were to adopt that approach, it would make no sense to disallow \q{Emoji}. A single character is always a one-character sequence, so a character property such as {Emoji} should be usable with either \p or \q. Multi-character properties such as {Basic_Emoji} would be usable with only \q.

@bakkot

bakkot commented Oct 2, 2019

How many code points can \p{Foo} match? It can either match one, or throw (e.g. \p{Invalid}).

As a reader or maintainer of code it is generally safe to assume that the code is not a syntax error. I (and, I expect, you) generally do not consider "causing an early syntax error if textually present" to be within the realm of behavior of regular expressions. (If you like, you could think of this as being "not a regular expression", and hence not required for reasoning about the behavior of regular expressions.)

Guarding against early errors is also much easier than reasoning about the behavior of regular expressions.

This really feels like a non-sequitur.

@hax
Member

hax commented Oct 2, 2019

In my experience dealing with regexp and Unicode bugs, you need to be a Unicode expert anyway (in this case, as @mathiasbynens said, "you need to know what the thing between the braces means"). So there is not much difference between overloading \p and adding a separate \q. On the other hand, for the most common use cases, adding another \q just adds a burden for average programmers who simply want to remember a pattern: should I use \p{Emoji} or \q{Emoji}? Why does \p{Emoji} work but \p{Basic_Emoji} doesn't? ...

@bakkot

bakkot commented Oct 2, 2019

@hax (sorry for the mis-ping)

If you are enough of a unicode expert to know which things are sequence properties and which are not, "why \p{Emoji} works but \p{Basic_Emoji} doesn't" should be clear. (And it's not much of a burden to remember - if you get it wrong you will get an early error telling you to fix it, and hopefully why.)

If you are not that much of an expert, but are still trying to reason about how the regular expression behaves, it is very important that you be able to identify which things can match exactly one code point and which can match multiple.

@mathiasbynens
Member Author

What @hax is saying is that splitting the syntax into \p and \q complicates the common case for developers, where the property escape is NOT used within a character class or within \P.

(And it's not much of a burden to remember - if you get it wrong you will get an early error telling you to fix it, and hopefully why.)

This argument applies equally to the statement “\p{Seq} doesn’t work within character classes”.

@bakkot

bakkot commented Oct 2, 2019

This argument applies equally to the statement “\p{Seq} doesn’t work within character classes”.

Sure, but no more so. (Actually, I would argue less so - different kinds of things usually have different syntax, and here you are encountering the fact that these are different kinds of things.) So if the argument for overloading is "it will be confusing to deal with this early error", and the argument against is "it will be confusing to deal with this other early error, and also make it harder to reason about the behavior of regular expressions", that seems to come down pretty hard on the "not overloading" side.

In any case, my main claim was not "\p{Seq} doesn't work within character classes, and that's confusing in itself". It was "the fact that \p{Foo} works in character classes at least some of the time strongly implies that it always matches exactly one character (because I certainly would not intuit that it works in character classes only sometimes), and we should not violate that intuition".

@mathiasbynens
Member Author

So if the argument for overloading is "it will be confusing to deal with this early error" […]

The argument for it is the argument I presented in plenary today.

We can choose to preserve the current mental model of \p{…} while adding the new functionality. Or, we can choose to violate the current mental model of \p{…} and introduce new syntax for the new functionality. The latter is needlessly complicated.

@bakkot

bakkot commented Oct 2, 2019

We can choose to preserve the current mental model of \p{…} while adding the new functionality.

No, we can't. We cannot dictate mental models.

Currently things are consistent with both of our mental models: my "\p{Foo} matches a set of characters" and your "\p{Foo} matches a set of sequences of characters as defined by Unicode and behaves as described therein, including being usable in some places and possibly not others, but it happens that all such sets currently allowed consist exclusively of single characters".

With the introduction of sequence properties, those mental models stop being consistent. So we have to pick one with which to retain consistency.

I claim my model is more obvious: the fact that \p is usable in character classes today seems to indicate it is syntax for matching a set of characters. You say it is not syntax for matching a set of characters, except that sometimes it is, which is why it can be sometimes used in character classes (and why, in fact, it can always be used in character classes today). That seems much less intuitive and much more complicated to me, but I accept that some people will intuit this.

I also claim that the consequences of trying to retain consistency with your mental model are strictly worse, because it makes it harder to reason about the behavior of regular expressions.

@msaboff

msaboff commented Oct 2, 2019

Given that programmers likely have to look up the Unicode escape names irrespective of \p or \q, that fact provides no insight or added weight in this discussion to favor overloading \p versus adding a new escape (\q).

I think it is appropriate to this discussion to consider current Unicode property escapes as sugaring for character classes. Correspondingly, the proposed Unicode sequence property escapes are sugaring for non-capturing groups of alternations.

\p{Foo} => [<characters-in-Foo>]
\q{Bar_Seq} => (?:Bar-string1|Bar-string2|...|Bar-stringN)

From this, it is clear that the desugared forms can be used in quite different places within larger REs. Trying to hide the distinctive differences between character properties and sequence properties by overloading \p would, I claim, make it more difficult to form a correct mental model. New syntax would help developers form a correct mental model of where each kind can appropriately be used within a larger RE.
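A sketch of what the second desugaring could look like in practice, assuming a hand-supplied list of strings. escapeForRegExp and toAlternation are hypothetical helpers, not part of any proposal, and the three keycap strings are just sample data. Alternatives are sorted longest first so that a shorter string cannot shadow a longer one that starts with it.

```javascript
// Escape regex syntax characters so each string is matched literally.
const escapeForRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

// Build (?:string1|string2|...|stringN), longest alternatives first.
const toAlternation = (strings) =>
  "(?:" +
  [...strings]
    .sort((a, b) => b.length - a.length)
    .map(escapeForRegExp)
    .join("|") +
  ")";

// Sample data only: three keycap-shaped strings.
const keycaps = ["1\uFE0F\u20E3", "#\uFE0F\u20E3", "*\uFE0F\u20E3"];
const keycapPattern = new RegExp("^" + toAlternation(keycaps) + "$", "u");

console.log(toAlternation(["a", "mn", "xyz"]));      // (?:xyz|mn|a)
console.log(keycapPattern.test("*\uFE0F\u20E3"));    // true
console.log(keycapPattern.test("*"));                // false
```

The result is a group, not a class, so it cannot be dropped inside [...] or complemented the way the character-class expansion of \p{Foo} can.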

@mathiasbynens
Member Author

The p in \p stands for property. What does it stand for in your mental model?

@bakkot

bakkot commented Oct 2, 2019

It doesn't stand for anything. What does the k in \k stand for?

@msaboff

msaboff commented Oct 2, 2019

I'm willing to agree that p means property and that \p is the logical character given UTS#18 and the use in other languages for properties as described in the current UTS#18. The proposed changes to UTS#18 use \m (multicharacter property) and not \q. It seems like an easy-to-remember escape. Although \m and \p are both listed in that proposed UTS#18 change, it is interesting to note that the \m escape is listed ahead of \p.

@mathiasbynens
Member Author

Although \m and \p are both listed in that proposed UTS#18 change, it is interesting to note that the \m escape is listed ahead of \p.

An attempt was made to make the changes in UTS18 as neutral as possible. I don’t think we should be ascribing any particular meaning to the order in which the options are listed.

FWIW, Mark Davis, the author of the original UTS18 and the proposed UTS18 update, prefers \p for sequence properties.

@sffc
Collaborator

sffc commented Oct 10, 2019

Notes from UTC Discussion about this topic:

RPR = Roozbeh Pournader
MSH = Markus Scherer
SFC = Shane F. Carr
MED = Mark E. Davis
MGR = Manish Goregaokar
BYG = Benjamin Yang
MLS = Michael Saboff

MLS: (summarizes points made in TC39)

RPR: Those sound like the same points we arrived at on our own.

MED: You wouldn't be prevented from using string properties in ranges, it's just that the range becomes an alternation. For example, I could say the set of emoji minus the set of emoji characters. It's not a range; it's an alternation.

SFC: The fact that the binary operator has different behavior there could be justification for using different syntaxes.

MED: The downside is that people think of properties as a unified set of things.

SFC: Most users don't have the properties memorized and won't try to write \p{sequence property} because they will get examples directly from documentation.

MSH: When I first looked at this discussion, I thought this was about matching functions. We clarified that for the purpose of this proposal, that isn't the case. Once we wanted to go to unlimited matching functions, we definitely want different syntax (escape letter). For what we're proposing here, which is basically a named alternation, that's close enough to the things we have that it makes sense to keep using \p. In the ICU project, we've had a class UnicodeSet, which first only included code point values but now also includes support for strings. It can match against that. We also added a span function to that UnicodeSet API. This seems like a natural extension to what we have. While it seems natural to have a new syntax, I don't think it's necessary.

RPR: Is this all about readability? At compile time, you're going to notice a difference anyway.

MED: You may want to write, for example, [\p{x}--\p{y}]. For example, we may want to write, [\p{RGI_Emoji}--\p{Emoji_flags}].

SFC/RPR: That's not regex syntax.

MED: It's an extension.

MSH: ICU UnicodeSet has been doing this for years. You have, \p{Grek}, \p{Lu}.

MLS: Those properties are sets of code points.

MSH: If you look at the Basic_Emoji property, it has code points and sequences in the same property.

MED: A code point can always be considered a single-character string.

SFC: Regex syntax and UnicodeSet syntax are two different nodes in my brain. I don't like to conflate them. In regular regexes, I like to know that \p and \m resolve to fundamentally different regex operations.

RPR: (missed)

MED: It feels like /u at the end, that now I'm working in Unicode space.

MSH: I want to point out that [abcde{ch}{Ch}...] is already supported in TR 18, with \q syntax.

SFC: Should we support multi-character sequences in square brackets in ECMAScript?

MLS: Knowing TC39, that would probably have to be a separate proposal.

MED: I don't know until runtime whether [] is a character range or a string range. Even if we have separate syntax, you don't know until runtime.

SFC: If you have \p only, then it is definitely a character range. If you have \m, it makes you think harder.

MSH: A property of characters will never be changed to be a property of strings. In my opinion, a regular expression engine that's interested in supporting properties of strings should support the set operations as well.

MLS: But then you can't unconditionally negate the [].

MED: I think it's a perfectly reasonable concern that you don't want the same syntax to fail on different Unicode versions that may have added more strings to a set.

MLS: The programmer knows immediately that if they have \m, it would be a syntax error to negate the []. And \M is undefined.

RPR: With the same syntax, they have to go and look it up to figure out whether the property is a character property or a string property.

MED: The advantage of \m is that it is self-documenting. The advantage of overloading \p is that it is fewer things to think about.

SFC: I would prefer that UTC set the recommendations for new regex syntax, rather than being wishy-washy and deferring to downstream consumers of the syntax. My first preference is \m.

MSH: My first preference is \p because I think it is a reasonable option.

RPR: If we have \p then we should require implementations to support extended [] syntax.

MSH: If you did that, it would also mean that you could expand any arbitrary \p into [].

SFC: I think RPR hit the nail on the head. If you think of [] as a set of characters, then you want the self-documenting \m. If you think of [] as being able to also include strings, then you want to overload the syntax.

SFC: I made an email list to schedule meetings to discuss this topic further.

https://groups.google.com/a/chromium.org/forum/#!forum/regex-sp-wg

@mathiasbynens
Member Author

@mathiasbynens mathiasbynens changed the title from "Using something other than \p{…}, e.g. \q{…}?" to "Using something other than \p{…}, e.g. \m{…}?" on Apr 20, 2020
@bakkot

bakkot commented Apr 20, 2020

@mathiasbynens From your update there:

Implementations that are constrained in that they do not support strings in character classes should use \m{Property_Name} as an alternate notation for properties of strings appearing outside of character class expressions. \m should also accept ordinary properties of characters; it can be limited in where it may appear, not in what properties it allows.

Am I correct in reading this to mean that the Unicode consortium recommends (or proposes to recommend) that we go with \m for this proposal, since we are an implementation which does not support strings in character classes?

(Specifically, with the semantics that \m could also be used with regular properties, just not negated or used in character classes, and with no modifications to \p.)

@mathiasbynens
Member Author

mathiasbynens commented Apr 20, 2020

@bakkot That's what the current UTS18 draft says, yes. Note the use of "should" instead of "must"; UTC wanted to leave the final syntax decision up to implementers. cc @markusicu @macchiati @sffc

Some other suggestions were made for JavaScript to consider:

  • We could allow [\p{Seq}] if we wanted to. [a-z_\p{Seq}] would then behave as (?:\p{Seq}|[a-z_]). [^\p{Seq}] and \P{Seq} would still throw.
  • We could consider expanding JavaScript's character classes with string support more generally. UTS18 has had \q{...} syntax for that for a long time. I consider that a separate effort which would be its own proposal.
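
The first suggestion amounts to a desugaring: the strings contributed by the property become alternatives, and the remaining class members stay in an ordinary character class. A minimal sketch follows, assuming a hypothetical helper `classWithStrings` and a hand-written list of three keycap sequences standing in for a real property of strings (real data would come from the Unicode emoji files). Longer strings are tried first so that a longer sequence is not shadowed by one of its prefixes.

```javascript
// Escape regex syntax characters so each string is matched literally.
function escapeLiteral(s) {
  return s.replace(/[\\^$.*+?()[\]{}|]/g, "\\$&");
}

// Hypothetical stand-in for a property of strings: three keycap sequences.
const KEYCAPS = ["#\uFE0F\u20E3", "*\uFE0F\u20E3", "1\uFE0F\u20E3"];

// Builds (?:str1|str2|...|[classBody]) from a list of strings and the source
// text of an ordinary character class body such as "a-z_".
function classWithStrings(strings, classBody) {
  const alternatives = [...strings]
    .sort((a, b) => b.length - a.length) // longest first
    .map(escapeLiteral);
  alternatives.push(`[${classBody}]`);
  return `(?:${alternatives.join("|")})`;
}

const token = new RegExp(`^${classWithStrings(KEYCAPS, "a-z_")}+$`, "u");

console.log(token.test("a_1\uFE0F\u20E3b")); // true
console.log(token.test("a1b"));              // false: a bare "1" is not in the set
```

This rewriting has no counterpart for `[^\p{Seq}]` or `\P{Seq}`, which is why those forms would still throw under the suggestion.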

@mathiasbynens
Member Author

After much discussion, this has been resolved as part of the RegExp v flag proposal which enables extended character classes, supporting properties of strings using the same \p{…} syntax.
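
As a brief illustration of that outcome, the sketch below feature-detects the `v` flag and then uses a property of strings through `\p{…}`. The helper name `tryCompile` is hypothetical, and the block assumes an engine that either implements the `v` flag or throws a `SyntaxError` for it. `RGI_Emoji_Flag_Sequence` is one of the properties of strings available under that flag.

```javascript
// Returns a RegExp, or null if the engine rejects the pattern or the flag.
function tryCompile(source, flags) {
  try {
    return new RegExp(source, flags);
  } catch (e) {
    if (e instanceof SyntaxError) return null;
    throw e;
  }
}

const supportsV = tryCompile(".", "v") !== null;

if (supportsV) {
  // The flag of Japan is a sequence of two regional-indicator code points.
  const flag = tryCompile("^\\p{RGI_Emoji_Flag_Sequence}$", "v");
  console.log(flag.test("\u{1F1EF}\u{1F1F5}"));

  // Negating a property of strings is rejected when the pattern is compiled.
  console.log(tryCompile("\\P{RGI_Emoji_Flag_Sequence}", "v"));    // null
  console.log(tryCompile("[^\\p{RGI_Emoji_Flag_Sequence}]", "v")); // null
} else {
  console.log("This engine does not support the v flag.");
}
```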
