Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: set keyword argument so zipfile actually compresses #21144

Merged
merged 13 commits into from
May 29, 2018
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.23.1.txt
Original file line number Diff line number Diff line change
Expand Up @@ -83,6 +83,7 @@ Indexing
I/O
^^^

- Bug in IO methods specifying ``compression='zip'`` which produced uncompressed zip archives (:issue:`17778`, :issue:`21144`)
- Bug in :meth:`DataFrame.to_stata` which prevented exporting DataFrames to buffers and most file-like objects (:issue:`21041`)
-

Expand Down
8 changes: 4 additions & 4 deletions pandas/io/common.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@
import codecs
import mmap
from contextlib import contextmanager, closing
from zipfile import ZipFile
import zipfile

from pandas.compat import StringIO, BytesIO, string_types, text_type
from pandas import compat
Expand Down Expand Up @@ -428,7 +428,7 @@ def _get_handle(path_or_buf, mode, encoding=None, compression=None,
return f, handles


class BytesZipFile(ZipFile, BytesIO):
class BytesZipFile(zipfile.ZipFile, BytesIO):
"""
Wrapper for standard library class ZipFile and allow the returned file-like
handle to accept byte strings via `write` method.
Expand All @@ -437,10 +437,10 @@ class BytesZipFile(ZipFile, BytesIO):
bytes strings into a member of the archive.
"""
# GH 17778
def __init__(self, file, mode='r', **kwargs):
def __init__(self, file, mode, compression=zipfile.ZIP_DEFLATED, **kwargs):
if mode in ['wb', 'rb']:
mode = mode.replace('b', '')
super(BytesZipFile, self).__init__(file, mode, **kwargs)
super(BytesZipFile, self).__init__(file, mode, compression, **kwargs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add tests and a whatsnew for this?

Copy link
Member

@gfyoung gfyoung May 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, because you are modifying the default behavior, I'm not sure if we need a deprecation cycle for this (to be safe, we should I would imagine).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no this is a bug

Copy link
Member

@gfyoung gfyoung May 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough, though tests and whatsnew are still needed (just to be clear).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks. added whatsnew and tests.


def write(self, data):
super(BytesZipFile, self).writestr(self.filename, data)
Expand Down
21 changes: 20 additions & 1 deletion pandas/tests/test_common.py
Original file line number Diff line number Diff line change
@@ -1,12 +1,13 @@
# -*- coding: utf-8 -*-

import pytest
import os
import collections
from functools import partial

import numpy as np

from pandas import Series, Timestamp
from pandas import Series, DataFrame, Timestamp
from pandas.compat import range, lmap
import pandas.core.common as com
from pandas.core import ops
Expand Down Expand Up @@ -222,3 +223,21 @@ def test_standardize_mapping():

dd = collections.defaultdict(list)
assert isinstance(com.standardize_mapping(dd), partial)


@pytest.mark.parametrize('obj', [
DataFrame(100 * [[0.123456, 0.234567, 0.567567],
[12.32112, 123123.2, 321321.2]],
columns=['X', 'Y', 'Z']),
Series(100 * [0.123456, 0.234567, 0.567567], name='X')])
@pytest.mark.parametrize('method', ['to_pickle', 'to_json', 'to_csv'])
def test_compression_size(obj, method, compression):
if not compression:
pytest.skip("only test compression case.")

with tm.ensure_clean() as filename:
getattr(obj, method)(filename, compression=compression)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this fail under 0.23.0? (and works here clearly)

compressed = os.path.getsize(filename)
getattr(obj, method)(filename, compression=None)
uncompressed = os.path.getsize(filename)
assert uncompressed > compressed