The tafra began life as a thought experiment: how could we reduce the idea
of a dataframe (as expressed in libraries like pandas or languages
like R) to its useful essence, while carving away the cruft?
The original proof of concept stopped at "group by".
This library expands on the proof of concept to produce a practically
useful tafra, which we hope you may find to be a helpful lightweight
substitute for certain uses of pandas.
A tafra is, more-or-less, a set of named columns or dimensions.
Each of these is a typed numpy array of consistent length, representing
the values for each column by rows.
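To make the data model concrete, here is a minimal numpy-only sketch of "named, typed columns of consistent length". The helper `make_columns` is hypothetical and written only for illustration; it is not part of tafra and does not reflect how the library is implemented.

```python
import numpy as np

def make_columns(data):
    """Hypothetical helper: turn a mapping of name -> values into
    named numpy arrays, rejecting columns of differing length."""
    columns = {name: np.asarray(values) for name, values in data.items()}
    lengths = {len(col) for col in columns.values()}
    if len(lengths) > 1:
        raise ValueError(f"columns have inconsistent lengths: {sorted(lengths)}")
    return columns

columns = make_columns({
    'x': [1, 2, 3, 4],
    'y': np.array(['one', 'two', 'one', 'two'], dtype=object),
})

# Every column has the same length, so any one of them gives the row count.
rows = len(next(iter(columns.values())))

# dtype.kind is a one-letter code: 'i' for signed integers, 'O' for objects.
dtypes = {name: col.dtype.kind for name, col in columns.items()}

print(rows, dtypes)
```

Everything else in a dataframe-like structure (selection, grouping, joins) can be layered on top of such a mapping, which is the sense in which the structure is lightweight.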
The library provides lightweight syntax for manipulating rows and columns, support for managing data types, iterators for rows and sub-frames, pandas-like "transform" support and conversion from pandas DataFrames, and SQL-style "group by" and join operations.
Tafra | Tafra |
Aggregations | Union, GroupBy, Transform, IterateBy, InnerJoin, LeftJoin, CrossJoin |
Aggregation Helpers | union, union_inplace, group_by, transform, iterate_by, inner_join, left_join, cross_join |
Constructors | as_tafra, from_dataframe, from_series, from_records |
SQL Readers | read_sql, read_sql_chunks |
Destructors | to_records, to_list, to_tuple, to_array, to_pandas |
Properties | rows, columns, data, dtypes, size, ndim, shape |
Iter Methods | iterrows, itertuples, itercols |
Functional Methods | row_map, tuple_map, col_map, pipe |
Dict-like Methods | keys, values, items, get, update, update_inplace, update_dtypes, update_dtypes_inplace |
Other Helper Methods | select, head, copy, rename, rename_inplace, coalesce, coalesce_inplace, _coalesce_dtypes, delete, delete_inplace |
Printer Methods | pprint, pformat, to_html |
Indexing Methods | _slice, _index, _ndindex |
Install the library with pip:
pip install tafra
>>> import numpy as np
>>> from tafra import Tafra
>>> t = Tafra({
... 'x': np.array([1, 2, 3, 4]),
... 'y': np.array(['one', 'two', 'one', 'two'], dtype='object'),
... })
>>> t.pformat()
Tafra(data = {
'x': array([1, 2, 3, 4]),
'y': array(['one', 'two', 'one', 'two'])},
dtypes = {
'x': 'int', 'y': 'object'},
rows = 4)
>>> print('List:', '\n', t.to_list())
List:
[array([1, 2, 3, 4]), array(['one', 'two', 'one', 'two'], dtype=object)]
>>> print('Records:', '\n', tuple(t.to_records()))
Records:
((1, 'one'), (2, 'two'), (3, 'one'), (4, 'two'))
>>> gb = t.group_by(
... ['y'], {'x': sum}
... )
>>> print('Group By:', '\n', gb.pformat())
Group By:
Tafra(data = {
'x': array([4, 6]), 'y': array(['one', 'two'])},
dtypes = {
'x': 'int', 'y': 'object'},
rows = 2)
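For readers who want to see what the grouped sum above computes, the following is a numpy-only sketch of the same aggregation. It is an illustration under the assumption of a single grouping column, not tafra's implementation.

```python
import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array(['one', 'two', 'one', 'two'], dtype=object)

# Unique group keys, plus for every row the index of the group it belongs to.
keys, inverse = np.unique(y, return_inverse=True)

# Sum the values of x that fall into each group.
sums = np.array([x[inverse == g].sum() for g in range(len(keys))])

grouped = {'x': sums, 'y': keys}
print(grouped)
```

The result has one row per distinct value of `y`, matching the shape of the `group_by` output shown above.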
Have some code that works with pandas, or just a way of doing things
that you prefer? tafra is flexible:
>>> import pandas as pd
>>> # Build from a dict so 'x' stays numeric; np.c_ would coerce both columns to strings.
>>> df = pd.DataFrame({
... 'x': np.array([1, 2, 3, 4]),
... 'y': np.array(['one', 'two', 'one', 'two'])
... })
>>> t = Tafra.from_dataframe(df)
And going back is just as simple:
>>> df = pd.DataFrame(t.data)
In this case, lightweight also means performant. Beyond any additional
features added to the library, tafra should provide the necessary
base for organizing data structures for numerical processing. One of the
most important aspects is fast access to the data itself. By minimizing
abstraction to access the underlying numpy arrays, tafra is roughly
an order of magnitude faster than pandas in the timings below.
- Important note: If you assign directly to the Tafra.data or Tafra._data
attributes, you must call Tafra._coalesce_dtypes
afterwards in order to ensure the typing is consistent.
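The kind of inconsistency this note warns about can be illustrated without the library. In the sketch below, `data`, `dtypes`, and `refresh_dtypes` are hypothetical stand-ins for a column store and its dtype record; they are assumptions made for illustration and say nothing about what Tafra._coalesce_dtypes does internally.

```python
import numpy as np

# Hypothetical stand-ins for a frame's column store and its dtype record.
data = {'x': np.array([1, 2, 3, 4])}
dtypes = {name: col.dtype.name for name, col in data.items()}

def refresh_dtypes(data):
    """Rebuild the dtype record from the arrays that are actually stored."""
    return {name: col.dtype.name for name, col in data.items()}

# Writing straight into the store bypasses any bookkeeping,
# so the recorded dtype no longer matches the stored array.
data['x'] = np.array([1.5, 2.5, 3.5, 4.5])
stale = dtypes['x'] != data['x'].dtype.name

# An explicit refresh brings the record back in line with the data.
dtypes = refresh_dtypes(data)
print(stale, dtypes)
```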
Construct a Tafra and a DataFrame:
>>> tf = Tafra({
... 'x': np.array([1., 2., 3., 4., 5., 6.]),
... 'y': np.array(['one', 'two', 'one', 'two', 'one', 'two'], dtype='object'),
... 'z': np.array([0, 0, 0, 1, 1, 1])
... })
>>> df = pd.DataFrame(tf.data)
Direct access:
>>> %timeit x = tf._data['x']
55.3 ns ± 5.64 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Indirect access, with some penalty to support Tafra slicing and numpy's
advanced indexing:
>>> %timeit x = tf['x']
219 ns ± 71.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
pandas timing:
>>> %timeit x = df['x']
1.55 µs ± 105 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
This is the fastest method for accessing the numpy array among the alternatives of
df.values, df.to_numpy(), and df.loc[].
Direct access is not recommended as it avoids the validation steps, but it does provide fast access to the data attribute:
>>> x = np.arange(6)
>>> %timeit tf._data['x'] = x
65 ns ± 5.55 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Indirect access has a performance penalty due to the validation checks that
ensure consistency of the tafra:
>>> %timeit tf['x'] = x
7.39 µs ± 950 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Even so, there is considerable performance improvement over pandas.
pandas timing:
>>> %timeit df['x'] = x
47.8 µs ± 3.53 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
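The `%timeit` magic used in these measurements is only available in IPython. The sketch below shows one way to run a comparable direct-versus-validated assignment measurement with the standard library `timeit` module. `CheckedColumns` is a hypothetical wrapper invented for this illustration; its checks are an assumption and are much simpler than whatever validation tafra performs, so the absolute numbers will not match the timings above.

```python
import timeit
import numpy as np

class CheckedColumns:
    """Hypothetical wrapper: a dict of arrays whose setter validates length."""

    def __init__(self, data):
        self._data = {name: np.asarray(col) for name, col in data.items()}
        self._rows = len(next(iter(self._data.values())))

    def __setitem__(self, name, value):
        value = np.asarray(value)
        # Reject anything that is not a 1-D array with the expected row count.
        if value.ndim != 1 or len(value) != self._rows:
            raise ValueError(f"expected a 1-D array of length {self._rows}")
        self._data[name] = value

cols = CheckedColumns({'x': np.arange(6.0)})
new_x = np.arange(6)

def assign_direct():
    cols._data['x'] = new_x   # bypasses the checks

def assign_checked():
    cols['x'] = new_x         # goes through __setitem__ and its checks

n = 100_000
direct = timeit.timeit(assign_direct, number=n)
checked = timeit.timeit(assign_checked, number=n)
print(f"direct:  {direct / n * 1e9:.0f} ns per assignment")
print(f"checked: {checked / n * 1e9:.0f} ns per assignment")
```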
tafra also excels at aggregation methods, the primary ones being a
SQL-like GROUP BY and the split-apply-combine "transform", which is equivalent to a SQL-like
GROUP BY followed by a LEFT JOIN back to the original table.
>>> %timeit tf.group_by(['y', 'z'], {'x': sum})
138 µs ± 4.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit tf.transform(['y', 'z'], {'sum_x': (sum, 'x')})
161 µs ± 2.31 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
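The "GROUP BY followed by a LEFT JOIN" description of transform can be spelled out in two explicit steps. The sketch below uses the same columns as the example frame above, with plain numpy and a dictionary. It illustrates the semantics only and is not how tafra implements `transform`.

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5., 6.])
y = np.array(['one', 'two', 'one', 'two', 'one', 'two'], dtype=object)
z = np.array([0, 0, 0, 1, 1, 1])

# One (y, z) key per row of the original table.
row_keys = list(zip(y.tolist(), z.tolist()))

# Step 1 (GROUP BY): accumulate the sum of x for each distinct key.
group_sums = {}
for key, value in zip(row_keys, x.tolist()):
    group_sums[key] = group_sums.get(key, 0.0) + value

# Step 2 (LEFT JOIN): look up each original row's key in the grouped result,
# so the output has the same number of rows as the input.
sum_x = np.array([group_sums[key] for key in row_keys])

print(sum_x)
```

Unlike the grouped result, which has one row per distinct key, `sum_x` lines up row for row with the original columns.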
The equivalent pandas functions are given below. They require a chain
of several object methods to perform the same role, and the transform requires
a copy operation and assignment into the copied DataFrame in order to
preserve immutability.
>>> %timeit df.groupby(['y','z']).agg({'x': 'sum'}).reset_index()
2.5 ms ± 177 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %%timeit
... tdf = df.copy()
... tdf['x'] = df.groupby(['y', 'z'])[['x']].transform(sum)
2.81 ms ± 143 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)