Using something other than \p{…}, e.g. \m{…}? #10
Comments
I think |
@bathos Can you elaborate on the additional clarity offered by |
My understanding (correct me if this is wrong) is that sequences cannot be used in |
My counterargument is that you don’t need syntax to make that distinction. All sequence properties, now and in the future, will have |
I didn’t realize that. In that case, although it still feels like they’re substantially different to me, the case for q (as I saw it at least) is certainly weakened. |
I believe that \q (or whatever distinct character is chosen) does add clarity because they couldn't be used in character classes. This is irrespective of the It is much clearer that \p and \P terms can be used in character classes while \q terms cannot. There is no need to look at the properties or sequence names to see that. It doesn't seem to be too much of a stretch to think that the Unicode Consortium might add property names for the named sequences in Unicode TR34. None of these sequence names end with "Sequence" and few have it in their name. |
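
For readers less familiar with the existing feature, here is a minimal sketch of what the comment above relies on: today's `\p{…}` and `\P{…}` escapes (with the `u` flag) can stand alone or sit inside a character class. The variable names are illustrative only, and no sequence syntax is shown because it was still undecided in this thread.

```javascript
// Existing Unicode property escapes require the `u` flag.
const letter = /\p{L}/u;                     // one code point that is a letter
const notLetter = /\P{L}/u;                  // uppercase P negates the property
const letterOrDigit = /^[\p{L}\p{Nd}]+$/u;   // \p used inside a character class
const notLetterOrDigit = /[^\p{L}\p{Nd}]/u;  // negated class that contains \p

const results = {
  letter: letter.test('\u00E9'),                      // precomposed e-acute
  notLetter: notLetter.test('\u00E9'),
  letterOrDigit: letterOrDigit.test('abc123'),
  notLetterOrDigit: notLetterOrDigit.test('abc-123'), // the hyphen matches
};
console.log(results);
```

Each of these atoms consumes exactly one code point, which is why they compose freely inside `[...]`.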
If a developer explicitly refers to a given Unicode property, we have to assume they know what that property means according to Unicode. The property's exact behavior, including which code points (or sequences) it covers, is defined at the Unicode level, not the ECMAScript level. Unicode properties, and the way any given Unicode property behaves per the Unicode Standard, can change over time. Non-sequence properties could start to include some sequences as well. If we decide to go with anything other than |
The properties that formed the basis for the prior Unicode Property Escape feature are part of the core Unicode Standard, with some Emoji properties pulled in from the standalone Emoji Unicode Technical Standard #51. The property data files for the core Unicode Standard are defined in the Unicode Character Database, along with the data files for Unicode Technical Standard #51. These are distinct from the data files that contain sequences: UCD NamedSequences.txt and UTS #51 emoji-sequences.txt. The core Unicode specification says that all named sequences are in NamedSequences.txt (see Unicode Named Character Sequences (UAX34) Data Files) and not in the property data files. The Emoji Technical Standard (UTS 51) Data Files lists the properties in one file and the sequences in other files. In the case of the initial Emoji sequences proposed for this feature, they all have a Given this, I don't think that we have an issue with using a distinct character to denote a sequence |
Restricting But if we do pursue the change, I would like to avoid |
What about |
Let’s separate the discussion on whether or not to use Let’s also avoid discussing hypothetical changes on the Unicode level where an existing property suddenly goes from being a non-sequence property to also supporting sequences. As @gibson042 points out, this is already a problem for existing property escapes, and is not new to this proposal. I’m confident we can work with the Unicode Consortium to prevent such changes from happening. For now, I’d like to follow @littledan’s suggestion of basing this decision on the developer-facing mental model. The case for
|
I don't have a strong opinion either way, but it might be relatively friendly if a developer uses a sequence with the single syntax and gets an error "this is a sequence, use PLACEHOLDER instead" - and if the reverse, "this is not a sequence, use \p instead" - in that it would communicate the likely effects more clearly than "let me go google the unicode property name" |
There currently are not any Unicode Named sequence properties; they are hypothetical in this proposal. Properties are defined in UTR 23 - The Unicode Character Property Model, and Named Sequences are defined in UTR 34 - Unicode Named Character Sequences. They are separate. This proposal references Basic_Emoji, but the current draft spec TR#51 - Unicode Emoji describes this using the new terminology of a Set, which is discussed separately from Emoji sequences.
They only have to learn the new syntax if they use sequences.
Note that in UTR 18 - Unicode Regular Expressions, 2.5.1 Individually Named Character there is separate syntax \N for named characters, that can include named character sequences. Point 1 under that section states \N matches a single character or a sequence, while \p matches a set of characters. |
The mental model in favor of separate escapes would be correspondence with the Unicode data model, in which sequences are not properties but rather a separate class of information (cf. emoji-sequences.txt: "The type_field [any of {Emoji_Combining_Sequence, Emoji_Flag_Sequence, Emoji_Modifier_Sequence}] is a convenience for parsing the emoji sequence files, and is not intended to be maintained as a property."). So |
As a developer, I find it quite difficult to consider that a list of sequences of code points can be expressed as a property; I don't really understand what it would be a property of, to begin with... Also, I have the feeling that, in order for anything to qualify as a regex property escape, it must comply with at least these basic rules:
Right now, a valid Unicode property escape On the other hand, a list of sequences of code points (characters) would map instead to a set of alternatives In conclusion, I would propose that:
[ FYI, I opened a separate issue on the subject of "Sequence properties" terminology ] |
Re: @gibson042's and @msaboff's comments w.r.t.
I've been working with the Unicode Consortium on clarifying all this upstream. Here's a quote from Mark Davis himself:
|
Over the past few months, I've worked on a proposal at the Unicode level, which has now been formally submitted. One of the many things it addresses is this question of "are sequence properties really properties"? As you all know, my assumption has always been that various Unicode documents just hadn't been updated when emoji got standardized, and that's why "properties" only seems to refer to character properties right now. Mark Davis, who authored several of these documents, shares this view. Among other things, the proposal makes this assumption explicit, which (if accepted) would resolve this part of the discussion. Mark Davis will present the proposal during the January UTC meeting at Google MTV from January 22–25. We'll be able to resolve this issue accordingly once the UTC accepts/rejects the proposal. |
While I’m still hoping we can proceed with
Here are the results:
TL;DR Other than |
@mathiasbynens shouldn't |
@FireyFly It should have been! I forgot to update the status column. Fixed now — thanks! |
Why the restriction that both the lower and upper case must be available in the relevant RE engines? |
We could use an uppercase letter when only the lowercase variant is taken, or vice versa. It’s just not ideal.
TC39 has made the decision to let UTC decide here. For UTC and UTS18, other languages matter. |
Distinguishing based on case seems odd, since for the most part, that is used for negation in RegExps. |
I said this at the meeting today, but let me try to capture it here as well: In order to reason about how a regular expression with the As a reader or maintainer of code, the way that I understand the behavior of a regular expression is by "walking" it along the string, one "character" at a time. With the Currently that is the case: it is always obvious from surface syntax how many code points an atom can match. If That's bad. We should not overload |
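
A small sketch of the "walking the string one code point at a time" model described above, using only behavior that exists today with the `u` flag. The helper `codePointCount` and the sample strings are illustrative choices, not part of the proposal.

```javascript
// Count code points (not UTF-16 code units) in a string.
const codePointCount = (s) => [...s].length;

const pileOfPoo = '\u{1F4A9}';             // one code point, two code units
const belgiumFlag = '\u{1F1E7}\u{1F1EA}';  // two regional indicator code points

// With `u`, both `.` and \p{…} consume exactly one code point per atom.
const oneAny = /^.$/u;
const twoAny = /^..$/u;
const oneEmoji = /^\p{Emoji_Presentation}$/u;

console.log(codePointCount(pileOfPoo), codePointCount(belgiumFlag));
console.log(oneAny.test(pileOfPoo), oneAny.test(belgiumFlag), twoAny.test(belgiumFlag));
console.log(oneEmoji.test(pileOfPoo), oneEmoji.test(belgiumFlag));
```

The flag never matches a single atom here, because every atom available today covers one code point. That is the invariant the comment above wants to stay visible in the surface syntax.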
How many code points can
That’s already the case — see the above example. In order to understand how a regular expression behaves, knowing what the thing between the braces means is already a requirement today. |
My preference is to just go with I can see some of the arguments for going with |
As a reader or maintainer of code it is generally safe to assume that the code is not a syntax error. I (and, I expect, you) generally do not consider "causing an early syntax error if textually present" to be within the realm of behavior of regular expressions. (If you like, you could think of this as being "not a regular expression", and hence not required for reasoning about the behavior of regular expressions.) Guarding against early errors is also much easier than reasoning about the behavior of regular expressions. This really feels like a non-sequitur. |
In my experience dealing with regexp and Unicode bugs, you need to be a Unicode expert anyway (in this case, as @mathiasbynens said, "you need to know what the thing between the braces means"). So there is not much difference between overloading |
@hax (sorry for the mis-ping) If you are enough of a Unicode expert to know which things are sequence properties and which are not, "why If you are not that much of an expert, but are still trying to reason about how the regular expression behaves, it is very important that you be able to identify which things can match exactly one code point and which can match multiple. |
What @hax is saying is that splitting the syntax into
This argument applies equally to the statement “ |
Sure, but no more so. (Actually, I would argue less so - different kinds of things usually have different syntax, and here you are encountering the fact that these are different kinds of things.) So if the argument for overloading is "it will be confusing to deal with this early error", and the argument against is "it will be confusing to deal with this other early error, and also make it harder to reason about the behavior of regular expressions", that seems to come down pretty hard on the "not overloading" side. In any case, my main claim was not " |
The argument for it is the argument I presented in plenary today. We can choose to preserve the current mental model of |
No, we can't. We cannot dictate mental models. Currently things are consistent with both of our mental models: my " With the introduction of sequence properties, those mental models stop being consistent. So we have to pick one with which to retain consistency. I claim my model is more obvious: the fact that I also claim that the consequences of trying to retain consistency with your mental model are strictly worse, because it makes it harder to reason about the behavior of regular expressions. |
Given that programmers likely have to look up the Unicode escape names irrespective of \p or \q, that fact provides no insight or added weight in this discussion to favor overloading \p versus adding a new escape (\q). I think it is appropriate to this discussion to consider current Unicode property escapes as sugar for character classes. Correspondingly, the proposed Unicode sequence property escapes are sugar for non-capturing groups of alternations. \p{Foo} => [<characters-in-Foo>] From this, it is clear that the desugared use permitted in larger REs is quite different. Trying to hide the distinctive differences between character properties and sequence properties by overloading \p would, I claim, make it more difficult to form a correct mental model. The new syntax would aid the formation of a proper mental model of the appropriate use within a larger RE. |
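
The desugaring contrast in the comment above can be sketched as follows. This is an illustration under stated assumptions, not how any engine or transpiler implements it: `toEscapes`, `desugarCharacterProperty`, and `desugarSequenceProperty` are hypothetical helpers, and the vowel and flag lists are stand-ins for real Unicode data.

```javascript
// Render a string as \u{…} escapes, which are valid anywhere in a `u`-mode pattern.
const toEscapes = (s) =>
  [...s].map((ch) => `\\u{${ch.codePointAt(0).toString(16)}}`).join('');

// Hypothetical: a set of single code points desugars to a character class.
const desugarCharacterProperty = (codePoints) =>
  `[${codePoints.map(toEscapes).join('')}]`;

// Hypothetical: a set of sequences desugars to a non-capturing alternation.
// Longer sequences go first so a prefix never shadows a longer alternative.
const desugarSequenceProperty = (sequences) =>
  `(?:${[...sequences]
    .sort((a, b) => [...b].length - [...a].length)
    .map(toEscapes)
    .join('|')})`;

const vowels = ['a', 'e', 'i', 'o', 'u'];                    // stand-in character property
const flags = ['\u{1F1E7}\u{1F1EA}', '\u{1F1EB}\u{1F1F7}'];  // stand-in sequence property

const classSource = desugarCharacterProperty(vowels);
const groupSource = desugarSequenceProperty(flags);

// A class body can be merged into a larger, negated class; a group cannot.
const noVowelsOrDigits = new RegExp(`^[^${classSource.slice(1, -1)}\\d]+$`, 'u');
const flagRun = new RegExp(`^${groupSource}+$`, 'u');

console.log(classSource, groupSource);
console.log(noVowelsOrDigits.test('xyz'), flagRun.test(flags.join('')));
```

The class form composes inside `[...]` and `[^...]`, while the group form only composes where a full atom is allowed, which is the difference in "permitted use within a larger RE" that the comment points at.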
The |
It doesn't stand for anything. What does the |
I'm willing to agree that |
An attempt was made to make the changes in UTS18 as neutral as possible. I don’t think we should be ascribing any particular meaning to the order in which the options are listed. FWIW, Mark Davis, the author of the original UTS18 and the proposed UTS18 update, prefers |
Notes from UTC Discussion about this topic:
RPR = Roozbeh Pournader
MLS: (summarizes points made in TC39)
RPR: Those sound like the same points we arrived at on our own.
MED: You wouldn't be prevented from using string properties in ranges, it's just that the range becomes an alternation. For example, I could say the set of emoji minus the set of emoji characters. It's not a range; it's an alternation.
SFC: The fact that the binary operator has different behavior there could be justification for using different syntaxes.
MED: The downside is that people think of properties as a unified set of things.
SFC: Most users don't have the properties memorized and won't try to write
MSH: When I first looked at this discussion, I thought this was about matching functions. We clarified that for the purpose of this proposal, that isn't the case. Once we wanted to go to unlimited matching functions, we definitely want different syntax (escape letter). For what we're proposing here, which is basically a named alternation, that's close enough to the things we have that it makes sense to keep using
RPR: Is this all about readability? At compile time, you're going to notice a difference anyway.
MED: You may want to write, for example,
SFC/RPR: That's not regex syntax.
MED: It's an extension.
MSH: ICU UnicodeSet has been doing this for years. You have,
MLS: Those properties are sets of code points.
MSH: If you look at the Basic_Emoji property, it has code points and sequences in the same property.
MED: A code point can always be considered a single-character string.
SFC: Regex syntax and UnicodeSet syntax are two different nodes in my brain. I don't like to conflate them. In regular regexes, I like to know that
RPR: (missed)
MED: It feels like
MSH: I want to point out that
SFC: Should we support multi-character sequences in square brackets in ECMAScript?
MLS: Knowing TC39 that would probably have to be a separate proposal.
MED: I don't know until runtime whether
SFC: If you have
MSH: A property of characters will never be changed to be a property of strings. In my opinion, a regular expression engine that's interested in supporting properties of strings should support the set operations as well.
MLS: But then you can't unconditionally negate the
MED: I think it's a perfectly reasonable concern that you don't want the same syntax to fail on different Unicode versions that may have added more strings to a set.
MLS: The programmer knows immediately that if they have
RPR: With the same syntax, they have to go and look up and figure out if the character is a character property or string property.
MED: The advantage of
SFC: I prefer if UTC would set the recommendations for new regex syntax, rather than being wishy-washy and deferring to downstream consumers of the syntax. My first preference is
MSH: My first preference is
RPR: If we have
MSH: If you did that, it would also mean that you could expand any arbitrary
SFC: I think RPR hit the point on the head. If you think of
SFC: I made an email list to schedule meetings to discuss this topic further. https://groups.google.com/a/chromium.org/forum/#!forum/regex-sp-wg |
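
MED's example of "the set of emoji minus the set of emoji characters" can be sketched with plain JavaScript sets of strings. This is only an illustration of why the result is an alternation and not a range: the sample strings are a tiny hand-picked list, not real Unicode data, and `difference` and `toEscapes` are hypothetical helpers.

```javascript
// Illustrative sample only; real data would come from the Unicode emoji files.
const emojiStrings = new Set([
  '\u{1F600}',            // single code point
  '\u{1F4A9}',            // single code point
  '\u{1F1E7}\u{1F1EA}',   // flag sequence: two code points
  '1\uFE0F\u20E3',        // keycap sequence: three code points
]);
const emojiCharacters = new Set(
  [...emojiStrings].filter((s) => [...s].length === 1)
);

// Set difference: members of `a` that are not in `b`.
const difference = (a, b) => new Set([...a].filter((s) => !b.has(s)));
const sequencesOnly = difference(emojiStrings, emojiCharacters);

// Every remaining member spans several code points, so the only way to
// express the result as a pattern is an alternation of whole strings.
const toEscapes = (s) =>
  [...s].map((ch) => `\\u{${ch.codePointAt(0).toString(16)}}`).join('');
const source = `(?:${[...sequencesOnly].map(toEscapes).join('|')})`;
const matcher = new RegExp(`^${source}$`, 'u');

console.log(sequencesOnly.size, source);
console.log(matcher.test('1\uFE0F\u20E3'), matcher.test('\u{1F600}'));
```

Under this model, negating the result has no obvious single-code-point meaning, which is the concern MLS raises about unconditional negation.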
@mathiasbynens From your update there:
Am I correct in reading this to mean that the Unicode consortium recommends (or proposes to recommend) that we go with (Specifically, with the semantics that |
@bakkot That's what the current UTS18 draft says, yes. Note the use of "should" instead of "must"; UTC wanted to leave the final syntax decision up to implementers. cc @markusicu @macchiati @sffc Some other suggestions were made for JavaScript to consider:
|
After much discussion, this has been resolved as part of the RegExp |
After the TC39 meeting, @msaboff suggested not overloading the meaning of \p on the grounds that it behaves differently than existing \p for non-sequence properties. This proposal breaks the invariant that \p always expands to a character class. He suggested \q for sequence. Unfortunately, \Q is a modifier in Perl regular expressions so using \q might be confusing.
I thought not having to introduce new syntax was a nice property (pun not intended), and I defend overloading \p{…} because sequence properties are still properties. The mental model is: \p{…} refers to a Unicode property. I don’t think end users think of \p as character classes (although they are currently implemented as such, and currently happen to be transpiled as such).
What does everyone else think? Should we use \p{…} or use something else?