normalize unicode during export / import #1085

Closed
RhetTbull opened this issue Jun 14, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@RhetTbull
Owner

RhetTbull commented Jun 14, 2023

Strings with unicode (e.g. keywords) need to be normalized in all round trips with the Photos library.

Hi there. It worked very well.

But same issue found now with Keywords.

While exporting, I added keywords to the files (via XMP), e.g. "Cão". On osxphotos import, reading them via exiftool caused the same issue, now with Keywords: they are duplicated, visually the same name but one in NFD and the other in NFC.

  • On Photos under the Keywords windows you can see two entries for the "same" keyword: "Cão" and "Cão".
  • On a Photos SmartAlbum if you select Keywords, oddly enough, it only shows one option to filter under Keyword, but shows the total pics, say 100.
  • Also querying with `osxphotos query --keyword "Cão"`

Workaround:

  1. Tried using --keyword "{keyword|function:fixunicode.py::fixunicode}" with and without --merge-keywords on the osxphotos import but it does not seem to change keywords!
  2. I can query on Photos on Keyword "Cão" and imported in the last 30 days - basically the pics from the EXPORT-CONVERT-IMPORT and remove the bad keyword and add the "old" one.
    • Testing export... to see if merge-keywords will still pickup the "new" keyword tagged on the file's EXIF.
    • Hmmm.. there isn't really a way to tell on the exported files ;)

Originally posted by @oPromessa in #907 (reply in thread)

@RhetTbull RhetTbull added the bug Something isn't working label Jun 14, 2023
@RhetTbull
Owner Author

@oPromessa Trying to figure this one out. As I said in the original comment, there are 2 things I hate dealing with in programming: dates and unicode! Sorry for long post -- these notes to help me figure out how to fix this.

Unicode text can take one of 4 different normalization forms: NFC, NFD, NFKC, NFKD. See this explanation for more details.
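
To make the difference between the forms concrete, here is a minimal sketch using only the standard library `unicodedata` module. The helper name `code_points` is made up for illustration and is not part of osxphotos. It shows that the composed forms store "ã" as one code point while the decomposed forms store "a" followed by a combining tilde:

```python
import unicodedata


def code_points(text: str, form: str) -> list[str]:
    """Return the code points of `text` after normalizing it to `form`."""
    normalized = unicodedata.normalize(form, text)
    return [f"U+{ord(ch):04X}" for ch in normalized]


# "Cão" written with the precomposed ã (U+00E3), i.e. already in NFC
keyword = "C\u00e3o"

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, code_points(keyword, form))
```

For this particular string the compatibility forms (NFKC, NFKD) give the same result as NFC and NFD respectively, because it contains no compatibility characters.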

osxphotos uses NFC internally (but not everywhere apparently) and NFD when writing to disk (except on Linux) -- see below for more details.

I created a photo and gave it the keyword Cão in NFC form. I confirmed that both osxphotos and the native Photos interface (accessible via photoscript, my Python-to-AppleScript bridge) are using unicode form NFC:

From the osxphotos REPL:

>>> import unicodedata
>>> import photoscript
>>> keyword = get_selected()[0].keywords[0]
>>> keyword
'Cão'

>>> unicodedata.is_normalized("NFC", keyword)
True

>>> photo = photoscript.Photo(uuid=get_selected()[0].uuid)
>>> photo.keywords
['Cão']

>>> unicodedata.is_normalized("NFC", photo.keywords[0])
True

But when I paste the same keyword using NFD form, osxphotos reports it as NFC (as expected -- see below) but Photos preserves the original form and reports as NFD:

>>> import unicodedata
>>> import photoscript
>>> get_selected()[0].keywords[0]
'Cão'

>>> keyword = get_selected()[0].keywords[0]
>>> keyword
'Cão'

>>> unicodedata.is_normalized("NFC", keyword)
True

>>> photo = photoscript.Photo(uuid=get_selected()[0].uuid)
>>> photo.keywords[0]
'Cão'

>>> unicodedata.is_normalized("NFC", photo.keywords[0])
False

>>> unicodedata.is_normalized("NFD", photo.keywords[0])
True

When writing to files, osxphotos uses NFD unicode form. This is because the HFS+ filesystem, on which osxphotos was first developed, normalized unicode to NFD, and normalization was required when using --update with unicode file names (see #410 and #515). More about this in an excellent article here. APFS, which modern Macs use, does not normalize unicode, but osxphotos still normalizes all file names to NFD (except on Linux, which uses NFC by default):

def normalize_fs_path(path: T) -> T:
    """Normalize filesystem paths with unicode in them"""
    form = "NFD" if is_macos else "NFC"
    if isinstance(path, pathlib.Path):
        return pathlib.Path(unicodedata.normalize(form, str(path)))
    else:
        return unicodedata.normalize(form, path)
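
As a rough illustration of why this matters for something like --update, the sketch below compares an expected file name against a name as it might come back from a directory listing. The helper `names_match` and the file names are hypothetical and are not taken from osxphotos. It only assumes that one side is NFC and the other NFD:

```python
import unicodedata


def names_match(expected: str, on_disk: str, form: str = "NFD") -> bool:
    """Compare two file names after normalizing both to the same form."""
    return unicodedata.normalize(form, expected) == unicodedata.normalize(form, on_disk)


expected = "C\u00e3o.jpg"   # NFC: precomposed ã
on_disk = "Ca\u0303o.jpg"   # NFD: "a" followed by a combining tilde

print(expected == on_disk)             # plain comparison of the raw strings
print(names_match(expected, on_disk))  # comparison after normalization
```

The plain comparison treats the two names as different files even though they render identically; normalizing both sides to a single form first avoids that.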

However, internally, osxphotos uses NFC when comparing strings because that's used by Photos (as is shown above):

def normalize_unicode(value) -> Any:
    """normalize unicode data"""
    if value is None:
        return None
    if isinstance(value, (tuple, list)):
        return tuple(unicodedata.normalize(UNICODE_FORMAT, v) for v in value)
    elif isinstance(value, str):
        return unicodedata.normalize(UNICODE_FORMAT, value)
    else:
        return value

# Unicode format to use for comparing strings
UNICODE_FORMAT = "NFC"

It appears the system preserves whatever format was used when reading from the command line as is demonstrated by the following simple script:

uni.py:

import sys
import unicodedata

if __name__ == "__main__":
    text = sys.argv[1]
    for form in ["NFC", "NFD", "NFKC", "NFKD"]:
        print(form, unicodedata.is_normalized(form, text))

❯ python uni.py Cão
NFC True
NFD False
NFKC True
NFKD False

❯ python
Python 3.11.2 (v3.11.2:878ead1ac1, Feb  7 2023, 10:02:41) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> unicodedata.normalize("NFD", "Cão")
'Cão'

^^^ Copy this output (the NFD form of the string) and paste it into the command line

❯ python uni.py Cão
NFC False
NFD True
NFKC False
NFKD True
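
Since the shell passes arguments through unchanged, one option is to normalize them at the boundary. The following is a minimal sketch of that idea; `normalize_args` is a hypothetical helper, not an existing osxphotos function, and NFC is assumed as the target because that is the form used for comparisons above:

```python
import unicodedata


def normalize_args(args: list[str], form: str = "NFC") -> list[str]:
    """Return a copy of `args` with every string normalized to `form`."""
    return [unicodedata.normalize(form, arg) for arg in args]


# Simulated command-line input: the same keyword pasted once as NFC and once as NFD
raw_args = ["C\u00e3o", "Ca\u0303o"]
clean_args = normalize_args(raw_args)

for raw, clean in zip(raw_args, clean_args):
    print(
        unicodedata.is_normalized("NFC", raw),
        "->",
        unicodedata.is_normalized("NFC", clean),
    )
```

In a real CLI the list would come from `sys.argv[1:]` or from the option parser rather than a literal.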

So in osxphotos import (and likely several other places wherever metadata is read from the command line or user input), I need to normalize the input to NFC before using it in the code. For example, in the code for osxphotos import --merge-keywords, which currently does no normalization:

def set_photo_metadata(
    photo: Photo,
    metadata: MetaData,
    merge_keywords: bool,
) -> MetaData:
    """Set metadata (title, description, keywords) for a Photo object"""
    photo.title = metadata.title
    photo.description = metadata.description
    keywords = metadata.keywords.copy()
    if merge_keywords:
        if old_keywords := photo.keywords:
            keywords.extend(old_keywords)
            keywords = list(set(keywords))
    photo.keywords = keywords
    return MetaData(metadata.title, metadata.description, keywords, metadata.location)
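
The `list(set(keywords))` step is where the duplicates slip through: a set compares strings code point by code point, so the NFC and NFD spellings of the same keyword both survive. The sketch below demonstrates that and shows one possible way to merge after normalizing. `merge_keywords_nfc` is a hypothetical helper written for this note, not the fix that ended up in osxphotos, and it rewrites every keyword to NFC, which may not be desirable for data already in the library:

```python
import unicodedata


def merge_keywords_nfc(new: list[str], old: list[str]) -> list[str]:
    """Merge two keyword lists, de-duplicating after normalizing to NFC."""
    merged = {unicodedata.normalize("NFC", kw) for kw in [*new, *old]}
    return sorted(merged)


new_keywords = ["Ca\u0303o"]          # NFD, e.g. read from a file via exiftool
old_keywords = ["C\u00e3o", "Gato"]   # NFC, already on the photo

# A plain set keeps both spellings of the "same" keyword
print(len(set(new_keywords + old_keywords)))

# Normalizing first collapses them into one entry
print(merge_keywords_nfc(new_keywords, old_keywords))
```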

@RhetTbull
Owner Author

RhetTbull commented Jun 20, 2023

I've started a unicode_refactor branch to work on this. The first part, pulling the unicode and platform-specific functions out of utils.py into separate modules, is done. Now I need to create a map of all the places where unicode conversion needs to happen and determine what to do in each case.

  • Currently, when processing the database, osxphotos normalizes "user facing text" (keywords, descriptions, etc.) for each photo. I don't think this is done for album names, need to verify this.
  • When writing files to disk or creating paths, osxphotos normalizes all file paths. For historical reasons, the form is NFD on macOS and NFC on Linux. There should be an option to allow users to specify the form for paths. (On HFS+, file paths must be in the decomposed NFD form.)
  • Any use of PhotosDB.query() should normalize input to the query (though I don't think this currently happens).
  • Creating albums for --add-to-album needs to check both normalization forms to see if an album with that name exists, then normalize the form used for creation if it doesn't.
  • osxphotos import must normalize all user input. Some things to think about: when merging keywords, need to check if a keyword exists in both normalized forms and ensure the original form is used to write the keywords back. This complicates the keyword merge but ensures we're not changing data in ways that could be mysterious to the user when writing data back to the library.
  • Similar behavior needed for captions and titles.
  • I don't think detected_text currently normalizes -- need to do so.
  • Does normalization happen when writing sidecars? What about exiftool? Changing this now would break a lot of existing exports so need to understand what is actually occurring.
  • See export.py:1410 for album checks -- need to check existence of matching album in each form before calling PhotoAlbum()
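
The "check both forms, but keep the original" idea from the list above could look roughly like the sketch below. Both helpers, `find_equivalent` and `merge_preserving_form`, are hypothetical names invented for this note and do not exist in osxphotos. The sketch assumes that comparing NFC-normalized copies is enough to detect a match, and that only names not already present should be written in NFC:

```python
import unicodedata
from typing import Iterable, Optional


def _nfc(text: str) -> str:
    """Normalize `text` to NFC for comparison purposes."""
    return unicodedata.normalize("NFC", text)


def find_equivalent(name: str, existing: Iterable[str]) -> Optional[str]:
    """Return the existing entry equal to `name` under NFC, in its original form."""
    target = _nfc(name)
    for candidate in existing:
        if _nfc(candidate) == target:
            return candidate
    return None


def merge_preserving_form(new: Iterable[str], existing: Iterable[str]) -> list[str]:
    """Add new keywords to existing ones without changing how existing ones are stored."""
    merged = list(existing)
    for keyword in new:
        if find_equivalent(keyword, merged) is None:
            merged.append(_nfc(keyword))  # genuinely new keyword: store it as NFC
    return merged


library_keywords = ["Ca\u0303o"]        # stored in the library as NFD
incoming_keywords = ["C\u00e3o", "Gato"]

print(merge_preserving_form(incoming_keywords, library_keywords))

# The same lookup works for album names: reuse a match, otherwise create in NFC
album_name = find_equivalent("C\u00e3o", ["Ca\u0303o"]) or _nfc("C\u00e3o")
print(album_name == "Ca\u0303o")
```

The linear scan is fine for a handful of keywords or albums; for large collections, a dictionary keyed on the NFC form would avoid repeated normalization.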

On macOS 13.4, creating new data (keywords, titles, descriptions) in Photos uses NFC:

>>> import unicodedata
>>> unicodedata.is_normalized("NFC", get_selected()[0].keywords[1])
True

>>> unicodedata.is_normalized("NFC", get_selected()[0].title)
True

>>> unicodedata.is_normalized("NFC", get_selected()[0].description)
True

RhetTbull added a commit that referenced this issue Jun 24, 2023
* Began refactoring for improving unicode handling

* Added platform and unicode modules

* Added tests for unicode utilities

* Added tests for unicode utilities

* Added tests for unicode utilities

* Added tests for unicode utilities

* Fixed unicode tests for linux

* Fixed unicode tests for linux

* Fixed duplicate album name with --add-to-album

* Fixed test for linux

* Fix for duplicate unicode keywords, see #907, #1085