Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix byte/string handling for Python 3 #19

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

ndevenish
Copy link
Member

Problem: All char * fields, input and output are mapped to str. This works on Python 2, but on Python 3 causes problems that your binary data is now encoded in a string and needs to be converted via encode('utf-8', errors='surrogateescape') (see SWIG docs). This can be worked around by setting SWIG_PYTHON_STRICT_BYTE_CHAR in the build - but now all char * fields are bytes, meaning that you need to encode any strings that you are passing into pycbf (see e.g. cctbx/dxtbx@44b6d72).

This commit fixes this. When using Python 3, functions that return data will return it as a python bytes object, and those that accept data will accept a python bytes object. Everything else will accept str.

The output/swig files have been regenerated from the nuweb sources, and the test/make_pycbf.py converted to print functions so that they will work on python 3 or 2 (using __future__, so Python 2.6.0+ required for regeneration now).
Because of the nature of the generation, that I've generated in SWIG 4.0, that the output files had previously all been generated with CRLF line endings, and that the base sources haven't been regenerated from the HTML for a while, the total diff is quite large and hard to read. The important parts are in:

  • 2834241, which makes relevant functions accept bytes objects for data input, and
  • e8b4152, which makes data functions return bytes objects for data
  • master...ndevenish:9052c5c for comparing the source trees without the regeneration step.

The one additional caveat is that the pycbf.py doesn't now work in python 2 without adding the line # coding: utf-8 to the top. This isn't because of these changes, but because doc/CBFlib.html now uses unicode characters for e.g. , these are carried over the the .py file in regeneration. I've made this change manually here, under the presumption that people probably aren't going to regenerate, but this could be fixed by adding:

sed -i '' '1s/^/\# coding: utf-8\
/' pycbf.py

to the linux.sh file. I don't know the equivalent for windows though, and don't know how important this use case is, so have left it rather than making the platforms work differently.

Add bytestring_output_allocate_size, and use for appropriate functions.
This means that you don't need to set SWIG_PYTHON_STRICT_BYTE_CHAR,
which means that you can pass strings to functions that accept them
whilst also retrieving byte data objects without having to convert
to/from unicode representations.

The functions this is applied to are:
    get_integerarray_as_string
    get_image_as_string
    get_image_fs_as_string
    get_image_sf_as_string
    get_real_image_as_string
    get_real_image_fs_as_string
    get_real_image_sf_as_string
    get_3d_image_as_string
    get_3d_image_fs_as_string
    get_3d_image_sf_as_string
    get_real_3d_image_as_string
    get_real_3d_image_fs_as_string
    get_real_3d_image_sf_as_string
    get_realarray_as_string
Even without SWIG_PYTHON_STRICT_BYTE_CHAR - on python 3, previously you
had a choice of everything being a bytes - in which case string arguments
have to be encoded - or nothing is, in which case you need to convert
binary data to/from unicode in order to pass in/out of functions.

This applies this to the following functions:

    set_integerarray
    set_integerarray_wdims
    set_integerarray_wdims_fs
    set_integerarray_wdims_sf
    set_realarray
    set_realarray_wdims
    set_realarray_wdims_fs
    set_realarray_wdims_sf
    set_image
    set_image_fs
    set_image_sf
    set_real_image
    set_real_image_fs
    set_real_image_sf
    set_3d_image
    set_3d_image_fs
    set_3d_image_sf
    set_real_3d_image
    set_real_3d_image_fs
    set_real_3d_image_sf
Also, ensure consistency of LF/CRLF line endings,
and remove trailing spaces.
@ndevenish
Copy link
Member Author

See discussion about this same issue: cctbx/cctbx_project#475

@yayahjb
Copy link
Collaborator

yayahjb commented Jan 31, 2021

Except for the Windows issue, this looks good, but with increasing use by small molecule people on windows,
we need a good version there as well, at least under WSL2 ubuntu and debian. I think we need to work out
clean python 3 and python 2 support on Mac, Linux and Windows 10. On a related issue, I think it is time to
restrict our HDF5 support to 1.12. -- Herbert

@ndevenish
Copy link
Member Author

To be clear - with this branch the windows issue is that I don't know if there is an equivalent for rewriting the pycbf.py, which is probably fine as long as nobody wants to regenerate from the literate programming sources from scratch, which I'm not sure anyone has done in quite a while.

I did run into different windows issues however, as I described in cctbx/staged-recipes#1 (comment) - mainly to do with cl.exe support for optional language features (most notably variable-length arrays) and a slightly hacky attempt at fixing in the unmerged branch ndevenish/cbflib@refactor...ndevenish:winchanges - taking the minimally intrusive option instead of reworking the relevant functions (or using goto!). I need to polish this up and make a PR/issue for discussion though; haven't spent much time on cbflib/conda since November. (I think we had plans to try to discuss general cbflib issues with you at the march HDRMX meeting but not sure if anything has moved on that yet)

@yayahjb
Copy link
Collaborator

yayahjb commented Jan 31, 2021 via email

@ndevenish
Copy link
Member Author

This would also solve #18 - a person claiming to be the current (then) maintainer of libcbf on debian seems to have run into this issue

Base automatically changed from master to main March 25, 2021 10:06
ndevenish added a commit to dials/pycbf that referenced this pull request May 14, 2021
This is the patches from the PR dials/cbflib#19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants