Enhanced st.from_regex() strategy with alphabet=... argument and filter-rewriting #3479
Comments
@Zac-HD Thanks for this feature! Here's a use case of ours in case you ever wonder how it might be used: this is important when you want to test XML - JSON conversions for the same specs. XML does not support the whole character range of UTF-32 (for example, |
I was looking into a related scenario where I wanted to include/exclude certain Unicode scripts and codepoints. Python’s own re package doesn’t support this, but the regex package does, e.g.
>>> regex.match(r"\p{Script=Latin}+", "שלום")
>>> regex.match(r"\p{Script=Latin}+", "hello")
<regex.Match object; span=(0, 5), match='hello'>
>>> re.match(r"\p{Script=Latin}+", "hello")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
...
re.error: bad escape \p at position 0
Would it be possible/make sense to build |
We can't require the |
That’d be perfect 👍🏼 |
Closing this issue as we now have |
Currently, the st.from_regex() strategy produces strings which might include any Unicode character (or byte, for bytestrings). However, this can be frustrating if you want strings restricted to some subset of codepoints, e.g. those in a particular encoding (#1664).

I therefore propose adding an alphabet=... argument, which accepts a collection of length-one strings, or a sampled_from() strategy of the same (or st.characters(), for Unicode strings). All generated characters must then be valid according to this set. If there are no matches for the pattern which also satisfy the alphabet restriction, an error should be raised; if this only requires dropping some arms of an alternation or subsets of character sets, that's OK.

(I considered allowing out-of-alphabet literals etc., but that would violate the invariant we need for good encoding support and also make filter-rewriting with regex intersection work differently. Better to be restrictive but consistent.)
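
To make the proposed invariant concrete, here is a minimal sketch using only the standard library. It does not use Hypothesis: the satisfies() helper and the ASCII_LOWER alphabet are hypothetical names chosen for illustration, and the check is a post-hoc filter, not how a strategy would generate values.

```python
import re

# Hypothetical alphabet: a collection of length-one strings.
ASCII_LOWER = frozenset("abcdefghijklmnopqrstuvwxyz")


def satisfies(pattern, alphabet, candidate):
    """Return True if candidate fully matches pattern and every
    character of candidate is a member of alphabet."""
    return (
        re.fullmatch(pattern, candidate) is not None
        and all(ch in alphabet for ch in candidate)
    )


# Both arms of the alternation match the pattern, but only the first
# stays inside the alphabet, so the second arm would have to be dropped.
PATTERN = r"caf(?:e|é)"

assert satisfies(PATTERN, ASCII_LOWER, "cafe")
assert not satisfies(PATTERN, ASCII_LOWER, "café")
```

If no arm survived the alphabet restriction at all, the proposal above says an error should be raised instead of silently producing nothing.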
Once we've got that working, it should be feasible to complete the last #2701-style filter rewriting tricks:
- st.text() / st.binary() filtered with re.compile(...).search / match / fullmatch
- st.from_regex() with the same filters - note that this will require some upstream work in greenery.

See also: Use greenery and regex_transformer to merge pattern and patternProperties keywords (python-jsonschema/hypothesis-jsonschema#85)
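
For context on the filter rewriting mentioned above: the three compiled-pattern predicates accept different sets of strings, so any rewrite would presumably have to preserve each one's semantics. The sketch below uses only the standard library to show the difference. The candidates list and the keep() helper are made up for illustration. In Hypothesis the predicates would be passed to .filter() on a strategy instead of being applied to a fixed list.

```python
import re

pattern = re.compile(r"[a-c]+")

# Illustrative inputs; a strategy such as st.text() would draw these at random.
candidates = ["abc", "abcx", "xabc", "xyz", ""]


def keep(predicate):
    """Return the candidates for which predicate gives a truthy result."""
    return [s for s in candidates if predicate(s)]


by_search = keep(pattern.search)        # a match anywhere in the string
by_match = keep(pattern.match)          # a match anchored at the start
by_fullmatch = keep(pattern.fullmatch)  # the whole string must match

print(by_search)     # ['abc', 'abcx', 'xabc']
print(by_match)      # ['abc', 'abcx']
print(by_fullmatch)  # ['abc']
```

A rewritten strategy should generate exactly the strings the corresponding predicate would keep, without the rejection cost of filtering.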