Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: set keyword argument so zipfile actually compresses #21144

Merged
merged 13 commits into from
May 29, 2018

Conversation

minggli
Copy link
Contributor

@minggli minggli commented May 20, 2018

zipfile.ZipFile has default compression mode zipfile.ZIP_STORED. It creates an uncompressed archive member. Whilst it doesn't cause issue, it is a strange default to have given users would want to compress files.

In order for zip compression to actually reduce file size, keyword argument compression=zipfile.ZIP_DEFLATED is added.

@minggli minggli changed the title set keyword argument so zipfile actually compresses BUG: set keyword argument so zipfile actually compresses May 20, 2018
@codecov
Copy link

codecov bot commented May 20, 2018

Codecov Report

Merging #21144 into master will not change coverage.
The diff coverage is 100%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #21144   +/-   ##
=======================================
  Coverage   91.84%   91.84%           
=======================================
  Files         153      153           
  Lines       49505    49505           
=======================================
  Hits        45466    45466           
  Misses       4039     4039
Flag Coverage Δ
#multiple 90.23% <100%> (ø) ⬆️
#single 41.88% <75%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/io/common.py 70.04% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1c2844a...974b063. Read the comment docs.

@gfyoung gfyoung added Bug IO Data IO issues that don't fit into a more specific label labels May 21, 2018
if mode in ['wb', 'rb']:
mode = mode.replace('b', '')
super(BytesZipFile, self).__init__(file, mode, **kwargs)
super(BytesZipFile, self).__init__(file, mode, compression, **kwargs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add tests and a whatsnew for this?

Copy link
Member

@gfyoung gfyoung May 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, because you are modifying the default behavior, I'm not sure if we need a deprecation cycle for this (to be safe, we should I would imagine).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no this is a bug

Copy link
Member

@gfyoung gfyoung May 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough, though tests and whatsnew are still needed (just to be clear).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks. added whatsnew and tests.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls add a whatsnew & a test (not sure what that should look like)

@minggli minggli closed this May 26, 2018
@minggli minggli reopened this May 26, 2018
@@ -943,6 +943,22 @@ def test_to_csv_compression(self, compression):
with tm.decompress_file(filename, compression) as fh:
assert_frame_equal(df, read_csv(fh, index_col=0))

def test_to_csv_compression_size(self, compression):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since these are all the same tests I think it makes more sense to put in test_common and parametrize for the different writers, rather than splitting out across the various modules

Copy link
Contributor Author

@minggli minggli May 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sense. done.

@minggli
Copy link
Contributor Author

minggli commented May 27, 2018

@WillAyd @jreback @gfyoung changes implemented, comments welcome. 👍

s = df.iloc[:, 0]

with tm.ensure_clean() as filename:
for obj in [df, s]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can even parametrize the Series and Frame instead of a loop

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

file_size = os.path.getsize(filename)
getattr(obj, method)(filename, compression=None)
uncompressed_file_size = os.path.getsize(filename)
if compression:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't need this conditional

Copy link
Contributor Author

@minggli minggli May 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. though had to skip or xfail when compression==None. the fixture is shared across other tests so no need to change fixture.



@pytest.mark.parametrize('frame', [
pd.concat(100 * [DataFrame([[0.123456, 0.234567, 0.567567],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using concat just multiply your list of lists by 100 within the constructor - will be more performant and idiomatic

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point.

@@ -222,3 +224,22 @@ def test_standardize_mapping():

dd = collections.defaultdict(list)
assert isinstance(com.standardize_mapping(dd), partial)


@pytest.mark.parametrize('frame', [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is either a frame or a series don't use frame as the variable name, as that's slightly confusing in case of the former being passed. obj should be fine

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm - will see what others say but thanks for the PR!

@@ -83,6 +83,7 @@ Indexing
I/O
^^^

- Bug in :class:`pandas.io.common.BytesZipFile` where zip compression produces uncompressed zip archive (:issue:`17778`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you reference 21144 (as well is fine) as that other issue was closed for 0.23.0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jreback @WillAyd , tested 21144 on older version of pandas 0.20 on windows and same issue occurred. I think it's an existing issue unrelated to this PR. This PR only address zip compression and doesn't touch gzip at all.

pytest.skip("only test compression case.")

with tm.ensure_clean() as filename:
getattr(obj, method)(filename, compression=compression)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this fail under 0.23.0? (and works here clearly)

@jreback jreback merged commit c85ab08 into pandas-dev:master May 29, 2018
@jreback
Copy link
Contributor

jreback commented May 29, 2018

thanks @minggli !

@minggli
Copy link
Contributor Author

minggli commented May 29, 2018

happy to help!

@minggli minggli deleted the bugfix/zip branch May 29, 2018 10:44
jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this pull request Jun 8, 2018
jorisvandenbossche pushed a commit that referenced this pull request Jun 9, 2018
david-liu-brattle-1 pushed a commit to david-liu-brattle-1/pandas that referenced this pull request Jun 18, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO Data IO issues that don't fit into a more specific label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants