Optimize `pathlib.Path.glob()` by avoiding repeated calls to `os.path.normcase()` #104104
Comments
…calls to `os.path.normcase()` Use `re.IGNORECASE` to implement case-insensitive matching. This restores behaviour from before python#31691.
Using `re.IGNORECASE` can give different results. For example, Python's lowercase mapping of "İ" (U+0130) is a two-character string:

>>> print(ascii('İ'.lower()))
'i\u0307'

To a Windows filesystem, "İ" and "i\u0307" are different filenames.

>>> open('i\u0307', 'w').close()
>>> open('\u0130', 'w').close()
>>> names = os.listdir()
>>> names
['i̇', 'İ']
>>> os.path.normcase(names[0])
'i̇'
>>> os.path.normcase(names[1])
'İ'
>>> os.path.normcase(names[0]) == os.path.normcase(names[1])
False

Using `re.IGNORECASE`, the pattern matches:

>>> re.match(names[1], names[0], re.IGNORECASE)
<re.Match object; span=(0, 1), match='i'>
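The mismatch shown above can be reproduced without touching a filesystem. The following is a minimal sketch that only exercises Python's own case mapping and the `re` module; the variable names are illustrative, and it says nothing about what any particular filesystem does with these names.

```python
import re

# U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
dotted_capital_i = "\u0130"

# str.lower() applies the full Unicode lowercase mapping,
# which expands this character to two code points.
lowered = dotted_capital_i.lower()
print(ascii(lowered), len(lowered))

# re.IGNORECASE compares one character at a time, so the one-character
# pattern can match the leading "i" but can never cover the whole
# two-character string.
prefix_match = re.match(dotted_capital_i, lowered, re.IGNORECASE)
whole_match = re.fullmatch(dotted_capital_i, lowered, re.IGNORECASE)
print(prefix_match is not None, whole_match is not None)
```

`re.match()` only anchors at the start, which is why the transcript reports a span of `(0, 1)`. An anchored pattern, as sketched here with `re.fullmatch()`, does not match the two-character name.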
That bug existed in all previous versions of pathlib; I didn't intentionally fix it, IIRC. Personally, I consider it pretty minor. I'm hoping to add a `case_sensitive` argument to
Okay, I didn't review the code in detail. I just wanted you to be aware that Windows paths should not be case-insensitively compared for equality using Python's lowercase mapping. The system's locale-invariant, non-linguistic case mapping has to be used when comparing paths.
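From Python, one way to follow this advice is to compare names after `os.path.normcase()` instead of `str.lower()`. The sketch below is a minimal illustration; `same_name` is a hypothetical helper, not part of pathlib. On POSIX, `os.path.normcase()` returns its argument unchanged, so the comparison is case-sensitive there.

```python
import os.path


def same_name(a, b):
    """Compare two names using the platform's path case normalisation.

    Hypothetical helper for illustration only.
    """
    return os.path.normcase(a) == os.path.normcase(b)


print(same_name("setup.py", "setup.py"))
print(same_name("README", "readme"))  # result depends on the platform
```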
Of course, when I actually checked this just now, I discovered a bug. When
>>> os.listdir()
[]
>>> chars = [c for i in range(65536, sys.maxunicode) if normcase(c:=chr(i)) != c]
>>> len(chars)
40
>>> [(c, normcase(c)) for c in chars]
[('𐐀', '𐐨'), ('𐐁', '𐐩'), ('𐐂', '𐐪'), ('𐐃', '𐐫'), ('𐐄', '𐐬'), ('𐐅', '𐐭'), ('𐐆', '𐐮'), ('𐐇', '𐐯'), ('𐐈',
'𐐰'), ('𐐉', '𐐱'), ('𐐊', '𐐲'), ('𐐋', '𐐳'), ('𐐌', '𐐴'), ('𐐍', '𐐵'), ('𐐎', '𐐶'), ('𐐏', '𐐷'), ('𐐐', '𐐸'),
('𐐑', '𐐹'), ('𐐒', '𐐺'), ('𐐓', '𐐻'), ('𐐔', '𐐼'), ('𐐕', '𐐽'), ('𐐖', '𐐾'), ('𐐗', '𐐿'), ('𐐘', '𐑀'), ('𐐙',
'𐑁'), ('𐐚', '𐑂'), ('𐐛', '𐑃'), ('𐐜', '𐑄'), ('𐐝', '𐑅'), ('𐐞', '𐑆'), ('𐐟', '𐑇'), ('𐐠', '𐑈'), ('𐐡', '𐑉'),
('𐐢', '𐑊'), ('𐐣', '𐑋'), ('𐐤', '𐑌'), ('𐐥', '𐑍'), ('𐐦', '𐑎'), ('𐐧', '𐑏')]

To the filesystem, each of these normalized names is unique:

>>> len(os.listdir())
0
>>> for c in chars: open(c, 'w').close(); open(normcase(c), 'w').close()
...
>>> len(os.listdir())
80

I don't know how to efficiently resolve this bug, short of calling a lower-level NTAPI function instead of
As part of removing "flavour" classes in #31691, I changed pathlib's `glob()` implementation: previously it used `re.IGNORECASE` to implement case-insensitive matches, whereas afterwards it called `os.path.normcase()` on the pattern and the paths. The new behaviour is a little slower, and I think we should restore the previous implementation.

Linked PRs
GH-104104: Optimize `pathlib.Path.glob()` by avoiding repeated calls to `os.path.normcase()` #104105
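The two strategies described in the issue text can be sketched roughly as follows. This is not pathlib's actual implementation: the function names are hypothetical, `fnmatch.translate()` is used here only as a convenient way to turn a glob pattern into a regular expression, and `ntpath.normcase` is an assumed stand-in so the sketch behaves case-insensitively on any platform.

```python
import fnmatch
import ntpath
import re


def filter_with_normcase(names, pattern, normcase=ntpath.normcase):
    """Normalise the pattern once, then normalise every candidate name."""
    regex = re.compile(fnmatch.translate(normcase(pattern)))
    # One normcase() call per name: this is the repeated work.
    return [name for name in names if regex.match(normcase(name))]


def filter_with_ignorecase(names, pattern):
    """Compile the pattern once with re.IGNORECASE; names are used as-is."""
    regex = re.compile(fnmatch.translate(pattern), flags=re.IGNORECASE)
    return [name for name in names if regex.match(name)]


names = ["README.TXT", "notes.txt", "image.PNG"]
print(filter_with_normcase(names, "*.txt"))
print(filter_with_ignorecase(names, "*.txt"))
```

For plain ASCII names both functions select the same entries. The second avoids one function call per candidate name, at the cost of the case-mapping differences discussed in the comments.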