DataFrame
This document describes design issues related to DataFrames (tabular structures containing heterogeneous types, allowing NAs, and facilitating the types of data-processing operations typically performed in the course of interactive data investigation and analysis) for Julia.
The three most commonly used existing DataFrame-like objects in other programming languages are R's data.frame, Python/Pandas' DataFrame, and R's data.table. Other implementations exist in other languages, with much smaller user-bases. The structures used in relational databases, especially column-oriented databases, are similar as well.
Julia's design goals include being high-performance and simple to use for large-scale computing. There are built-in constructs for distributed arrays and parallel computing via message passing. Therefore, compared to the design tradeoffs made in R and other languages, consideration of large and distributed data should be a high priority. In particular, support for data sets that exceed the size of RAM should be easy to construct and access, with good performance.
Julia also has a goal of providing a simple syntax that is easy to use from a REPL. As discussed below, it is unlikely that Julia can provide as simple a syntax for accessing DataFrame elements as R can (df$field). However, the process of manipulating DataFrames should be as clear and easy to write (and read) as possible.
In other systems, DataFrames are thought of as generalized matrices. It may be valuable for Julian DataFrames to be thought of slightly differently, as sets of columns linked together in a data structure. The difference would be visible in the types used to represent the DataFrame -- explicit DataVec columns, and perhaps separate per-column files when dealing with large data sets.
Another philosophical goal is to easily support common operations specific to statistical data, such as recoding and transforming data in-place, adding and removing columns, reshaping (converting from long to wide format or vice-versa), joining (in a relational sense), and split-apply-combine operations.
From a user's point of view, a DataFrame should look like a rectangular, indexable structure, similar to a Matrix, but with differently-typed columns. Internally, it may or may not have that structure. R implements data.frames as a list of vectors with several additional attributes.
Critically for statistical use, the columns of a DataFrame cannot be strictly type-homogeneous: each column must allow NAs (missing data) in addition to its base type. See Issue #470, as well as insightful documents from the Python community: Python, NumPy.
A viable option for Julia is to create a parameterized data type called DataVec{T}, which implements a masked array. Importantly, slices of DataVecs are DataVecs, while references to single elements are either the base type or an NA. DataVecs should support iteration, and an efficient way of removing NAs could be through iteration. The naFilter and naReplace methods wrap a DataVec so that iteration (but no other operation) skips or substitutes NAs, while the nafilter and nareplace methods copy the vector to a new vector of the base type, without NAs. Arithmetic operations on numeric DataVecs should work as in R, with NAs propagating.
For example (from a prototype implementation):
dvint = DataVec[1, 2, NA, 4]
dvstr = DataVec["one", "two", NA, "four"]
julia> show(dvint)
[1,2,NA,4]
julia> typeof(dvint)
DataVec{Int64}
julia> dvint[1:3]
[1,2,NA]
julia> typeof(dvint[2])
Int64
julia> typeof(dvint[3])
NAtype
julia> nafilter(dvint)
3-element Int64 Array:
1
2
4
julia> [naFilter(dvint)]
1-element DataVec{Int64} Array:
[1,2,4]
julia> [naReplace(dvint,-1)]
1-element DataVec{Int64} Array:
[1,2,-1,4]
julia> isna(dvint)
4-element Bool Array:
false
false
true
false
julia> dvint+1
[2,3,NA,5]
julia> dvint .* 2
[2,4,NA,8]
julia> dvint[1] = 10
10
julia> dvint
[10,2,NA,4]
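The prototype's behavior above could be realized by pairing a data vector with a Boolean NA mask. Here is a minimal sketch in present-day Julia syntax (not the prototype's actual code; all names are hypothetical):

# Hypothetical masked-array representation of DataVec{T}.
struct NAtype end
const NA = NAtype()

struct DataVec{T}
    data::Vector{T}   # element values; entries under the mask are ignored
    na::Vector{Bool}  # true where the element is missing
end

# Scalar references return the base type or NA; slices return DataVecs.
Base.getindex(dv::DataVec, i::Int) = dv.na[i] ? NA : dv.data[i]
Base.getindex(dv::DataVec, r::UnitRange{Int}) = DataVec(dv.data[r], dv.na[r])

# nafilter copies to a plain Vector of the base type, dropping NAs.
nafilter(dv::DataVec) = dv.data[.!dv.na]

# NAs propagate through arithmetic: the mask is carried along unchanged.
Base.:+(dv::DataVec{T}, x::T) where {T<:Number} = DataVec(dv.data .+ x, copy(dv.na))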
DataFrames need to support any standard numeric type, including integers, floats, Booleans, as well as Rational and Complex types, and even BigInt/BigFloat and similar. The underlying DataVec types should be able to handle this transparently.
Julia doesn't yet support dates or times, but when these types exist, the DataVec type should be able to support them. DataFrame read/write methods (see below) should be able to detect dates/times automatically, when possible.
A DataVec type can easily support strings. There may be complexity with UTF-8 -- it hasn't been tried yet.
A more substantial note is that much statistical data of string types tends to be repetitive, with the identical string appearing thousands or millions of times in a column. R uses a global string cache, so that instances of string constants take up only as much space as the index, not the full string. It's not clear if there is a good option for doing this with a DataVec{String}. Although Julia methods can be parameterized to have specific behavior for that type, the fields in the type cannot. It may be useful/necessary to have DataVec inherit from an abstract type, AbstractDataVec, that also includes a DataVecString type with an internal (or perhaps global) string pool, and to have DataFrames be collections of AbstractDataVec types.
Depending on implementation, a DataVecString could perhaps allow operations such as dvstr == "cat" to avoid making string comparisons at all, which would be an additional performance benefit beyond the memory savings.
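To make the pooling idea concrete, here is a hedged sketch of a DataVecString, continuing the DataVec sketch above (NA as defined there; all names hypothetical):

# Hypothetical pooled string vector: each element is an Int index into a
# shared pool, so a string repeated millions of times costs one Int per
# occurrence plus a single stored copy.
struct DataVecString
    refs::Vector{Int}     # 0 encodes NA; otherwise an index into pool
    pool::Vector{String}  # unique strings, stored once
end

Base.getindex(dv::DataVecString, i::Int) =
    dv.refs[i] == 0 ? NA : dv.pool[dv.refs[i]]

# Elementwise comparison against a constant does one string lookup in the
# pool, then compares Ints only -- the performance benefit noted above.
function elementwise_eq(dv::DataVecString, s::String)
    j = findfirst(==(s), dv.pool)
    j === nothing ? falses(length(dv.refs)) : dv.refs .== j
end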
Of critical importance to statistical data is a type that looks like strings, but optionally has arbitrary (non-alphabetic) ordering. For example, an experiment might have conditions "Control", "Low Dose", and "High Dose", or a survey response might be "Completely Agree", "Agree", "Disagree", "Completely Disagree". For both data analysis (allowing ordinal data to be dealt with appropriately) and display, it is extremely useful for these enumerated types to have the specified order.
In terms of implementation, the additional storage required to specify the optional ordering precludes a DataVec{Factor} or similar approach, and probably requires a DataVecFactor descended from AbstractDataVec. Alternatively, in lieu of factors being a separate data type, DataVecStrings could optionally have defined levels (a superset of the existing strings) with an optional ordering.
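A sketch of the DataVecFactor alternative, with the level vector carrying the meaningful order (hypothetical names; NA handling elided):

# Hypothetical ordered factor: the position of a level in `levels`
# defines its order, so "Low Dose" < "High Dose" can hold even though it
# fails alphabetically.
struct DataVecFactor
    refs::Vector{Int}        # indices into levels; 0 encodes NA
    levels::Vector{String}   # levels in their meaningful order
end

# Ordinal comparison of two elements reduces to comparing level indices.
isless_elements(f::DataVecFactor, i::Int, j::Int) = f.refs[i] < f.refs[j]

# Example: levels = ["Control", "Low Dose", "High Dose"]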
When interacting with DataFrames in an interactive REPL environment, it is critical that the columns be nameable, and that columns can be efficiently accessed by name. In R, axes can optionally have a string name, columns must each have a unique string name, and rows may optionally have a unique string name. In Pandas, there is a complex system where any data type can be a row or column name, including dates/times (handy for time-series representations) and tuples (allowing for powerful hierarchical indexing and grouping).
My current recommendation is that in Julia, DataFrames should have no row names, but only column names. If rows need to be identified, a column may be created by the user to do just that. Column names should be optional, and are of a specified type at DataFrame{T} creation, where T is often String, but may be Any.
Alternatively, Julia could optionally allow both types of names. If a row or column name vector is present, there should be a hash table allowing fast indexing of rows/columns and fast uniqueness-detection operations. (Rebuilding the hash table after a table is reshaped may slow down that operation, however.) A disadvantage of having indexed row names is that it partially replicates the functionality of data indexes, below.
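To illustrate the column-name recommendation, here is a minimal sketch of a DataFrame holding named columns with a hash table for fast lookup (hypothetical; the full design would use AbstractDataVec columns):

# Hypothetical DataFrame: parallel vectors of columns and names, plus a
# Dict giving O(1) column lookup by name. The Dict is rebuilt whenever
# columns are added, removed, or renamed.
struct DataFrame
    columns::Vector{Any}        # AbstractDataVec columns in the full design
    colnames::Vector{String}
    colindex::Dict{String,Int}  # name => column position
end

DataFrame(cols::Vector{Any}, cnames::Vector{String}) =
    DataFrame(cols, cnames, Dict(n => i for (i, n) in pairs(cnames)))

# Fast access by column name goes through the hash table.
Base.getindex(df::DataFrame, name::String) = df.columns[df.colindex[name]]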
Indexes are a critical and contentious issue. Whether and how various things are indexed affects the speed, syntax, and mental models associated with DataFrames. R data.frames are not really indexed, and many operations are incredibly slow. The R data.table class allows a user to define one or more columns as a key, sorts the table, and then allows very fast reference and joining operations via binary search. Pandas always indexes the row and column names of a DataFrame in such a way that many operations are very fast.
My current thinking is that for Julia, DataFrames should feel more like a relational database (but with a natural syntax for operations), meaning that data may be optionally and easily indexed by the user, providing a speedup during reference/join operations, but necessarily slowing changes to that column. Syntax could be as simple as index!(df, "id"). Indexes would thus be associated with/part of a DataVec, not a DataFrame as a whole.
Lazy indexing, where an index is marked "dirty", then only rebuilt when next accessed, could help performance.
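A sketch of what lazy, per-column indexing might look like (hypothetical names; not part of the prototype):

# Hypothetical lazy column index: writes mark it dirty, and it is rebuilt
# only when the next lookup actually needs it.
mutable struct ColumnIndex
    perm::Vector{Int}   # permutation that sorts the column's values
    dirty::Bool
end

mark_dirty!(idx::ColumnIndex) = (idx.dirty = true; nothing)

function lookup(col::Vector, idx::ColumnIndex, key)
    if idx.dirty                    # rebuild lazily on first access
        idx.perm = sortperm(col)
        idx.dirty = false
    end
    r = searchsorted(view(col, idx.perm), key)  # binary search, as in data.table
    idx.perm[r]                     # row numbers whose value equals key
end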
As mentioned above, DataFrames should be usable for bigger-than-RAM data, which is generally a Julian design goal. There are two types of Big Data in question. First, it should be possible to define a DataFrame as a memory-mapped disk file (or set of disk files) that once constructed could be accessed and processed as if it were entirely in RAM. Efforts are underway to memory-map the built-in Array types. Second, and more challengingly, it should be possible to define a DataFrame against a set of sharded data on different machines. The darray construct in base Julia offers some suggestions for how data might be distributed to different machines for processing.
For a memory-mapped DataFrame, it seems reasonable to expect that the DataFrame infrastructure, including column names, properties, and indexes, is loaded into RAM, while the underlying data is read from disk only when necessary for access.
Note that memory-mapping fixed-length types, such as numbers, is relatively easy, but that memory-mapping types such as string vectors requires considerably more effort.
It is likely that memory-mapped DataFrames would be fixed-length on disk. Certain insert/append operations might require rebuilding the columns, which could be very slow. See also Concatenation and Insertion, below.
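For fixed-width element types, the Mmap standard library in present-day Julia suggests how a column might be backed by a disk file; a sketch under that assumption (file layout hypothetical):

using Mmap

# Hypothetical on-disk column of raw Float64s. The returned array behaves
# like an in-RAM Vector, but pages are read from disk only when touched.
# A real implementation would map a parallel file for the NA mask too.
function mmap_column(path::String, n::Int)
    io = open(path, "r")
    data = Mmap.mmap(io, Vector{Float64}, n)
    close(io)   # the mapping stays valid after the stream is closed
    return data
end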
There should be methods to give text-based overviews and summaries of a DataFrame as a whole, and of properties of the columns. This should be straightforward.
Julia allows user-defined ref() and assign() methods, which means that DataFrame access and assignment should be relatively straightforward. The general principle should be that 2-D access methods should return a DataFrame as a result, unless the access method is explicitly scalar, as in df[1,2], in which case the result should also be a scalar of the appropriate type.
1-D column access methods should return a 1-D DataVec as a result. The syntax df["cat"] is easy to support, but it may be desirable to have a syntax such as df.cat, as in Pandas, or df$cat, as in R (sorta), to extract the column object.
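Continuing the DataFrame sketch above, the access rules might look like this in present-day Julia, where getindex is the modern name for ref (hypothetical):

# Scalar access returns a scalar of the column's base type (or NA).
Base.getindex(df::DataFrame, i::Int, j::Int) = df.columns[j][i]

# 2-D range access returns another DataFrame; 1-D access by name (defined
# above) returns the column itself.
Base.getindex(df::DataFrame, rows::UnitRange{Int}, cols::UnitRange{Int}) =
    DataFrame(Any[df.columns[j][rows] for j in cols], df.colnames[cols])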
It may be advantageous to have a distinction between DataFrame subsetting, which would involve copying any referenced data, and DataFrame views, which would refer to the existing data. The latter may be a challenge when dealing with large memory-mapped data sets.
DataFrames should be serializable to a standard format. This may or may not be an appropriate structure for Big Data representation and memory-mapping.
DataFrames must be creatable in intelligent ways by reading CSV, XLS(X), HDF5, and similar formats. DataFrames must be exportable to those formats as well.
Concatenation is a relatively common operation in which rows are added to an existing DataFrame, either one at a time or in large chunks. Making this work efficiently is tricky. Care will need to be taken that the existing Julia code for array extension is used appropriately. A further consideration is how to deal with extending pooled-String and Factor types. A common bottleneck when repeatedly extending data.frames in R is the need to reconcile factor levels on each append -- this should be avoidable here.
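A sketch of pool reconciliation done once per append rather than once per element, continuing the DataVecString sketch above (hypothetical):

# Append b's elements to a, reconciling the two pools once per append
# rather than once per element (the R bottleneck noted above).
function append_pooled!(a::DataVecString, b::DataVecString)
    remap = zeros(Int, length(b.pool))   # b's pool index => a's pool index
    for (j, s) in pairs(b.pool)
        k = findfirst(==(s), a.pool)
        if k === nothing
            push!(a.pool, s)             # genuinely new level: extend a's pool
            k = length(a.pool)
        end
        remap[j] = k
    end
    for r in b.refs
        push!(a.refs, r == 0 ? 0 : remap[r])   # 0 stays NA
    end
    return a
end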
A related concern is the question of how to deal with concatenation of indexed vectors. It may be possible to join indexes efficiently, but it will not be overhead-free.
An improvement to extending arrays when concatenating is to allow the underlying DataVec representation to be a chunked array. Chunks of records might be stored in something like a B+tree. Certain types of operations, including small concatenations, insertions, and deletions could be much cheaper. This type of representation might also tie in nicely with memory-mapped representations.
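A very rough sketch of a chunked backing store; a flat vector of chunks stands in for a real B+tree (hypothetical):

# Hypothetical chunked column: small appends touch only the last chunk,
# and an insertion or deletion rewrites one chunk, not the whole column.
struct ChunkedVec{T}
    chunks::Vector{Vector{T}}
    chunksize::Int
end

function Base.push!(cv::ChunkedVec{T}, x::T) where {T}
    if isempty(cv.chunks) || length(cv.chunks[end]) >= cv.chunksize
        push!(cv.chunks, T[])   # start a fresh chunk
    end
    push!(cv.chunks[end], x)
    return cv
end

Base.length(cv::ChunkedVec) = sum(length, cv.chunks; init=0)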
Adding columns to a DataFrame should be straightforward.
Although individual users may use relational joins more or less often, joins are a fundamental operation on DataFrames. In general, my preference is to allow naive joins on un-indexed columns to work, but to make it syntactically very easy to index columns, which would dramatically increase join speed.
The explicit functional syntax of R data.frames and Pandas seems to make the most sense here: join(a, b, Options(by.x="cat", by.y="dog", type="left", etc)). The R data.table syntax is not very intuitive.
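To show why indexing matters for joins, here is a sketch of a naive left join over plain key vectors; building the Dict is exactly the work a persistent column index would already have done (hypothetical):

# Naive left join on plain key vectors. Scanning the right side for each
# left row would be O(n*m); hashing the right keys first -- the work a
# column index would have done already -- makes it O(n+m).
function left_join_rows(left_keys::Vector, right_keys::Vector)
    byval = Dict{Any,Vector{Int}}()
    for (j, k) in pairs(right_keys)
        push!(get!(byval, k, Int[]), j)
    end
    out = Tuple{Int,Union{Int,Nothing}}[]
    for (i, k) in pairs(left_keys)
        rows = get(byval, k, nothing)
        if rows === nothing
            push!(out, (i, nothing))   # unmatched left rows are kept
        else
            for j in rows
                push!(out, (i, j))
            end
        end
    end
    return out
end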
It may make sense to have separate functionality for joins that maintain order, as are typically done when joining time-series data.
DataFrame-like structures are reshaped occasionally, although not in the same fashion as matrices are reshaped. An example would be when an experiment includes multiple samples from the same individual, which would typically be stored with one column per observation and one row per subject. Reshaping to "long" form entails generating a single column per observation type, and one row per subject*observation combination. Long-form data can be easier to analyze, and lends itself to split-apply-combine paradigms, as described below, but requires replication of data. Reshaping to "wide" form entails the reverse operation.
The libraries that have been created for doing this by column name, including reshape2 in R, and to a lesser extent the stack/unstack/melt functionality in Pandas, do not have a particularly appealing or intuitive syntax, perhaps unavoidably.
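For reference, a hedged sketch of the core of a wide-to-long reshape over plain columns (hypothetical names; no reshape2-style formula interface):

# Wide -> long: one output row per subject*observation combination. The
# id column is replicated once per observation column, as noted above.
function stack_columns(ids::Vector, obs::Dict{String,<:Vector})
    long_id  = similar(ids, 0)
    long_var = String[]
    long_val = Any[]
    for (varname, col) in obs
        append!(long_id, ids)                         # replicate ids
        append!(long_var, fill(varname, length(ids)))
        append!(long_val, col)
    end
    return (id = long_id, variable = long_var, value = long_val)
end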
Split-apply-combine is a widespread functional-programming idiom and a very effective way of working with tabular data. When the column(s) to be split on are indexed, the split operation is very fast. If there are views of DataFrames that don't require copying, the apply step should also be fast and easily parallelizable. The complexity is in the combine/reduce step, which often requires pasting together many sub-objects (DataFrames or matrices) in intelligent ways. The plyr package for R and the groupby/aggregate/transform/apply methods in Pandas are reasonable approaches.
Note that splits need not yield disjoint subsets. See John Myles White's work on overlapping subsets in R.
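A sketch of split-apply-combine where the split step builds the same hash of row indices that an index on the split column would hand back directly (hypothetical):

# split: hash row indices by key value. apply/combine: run f per group
# and collect the results keyed by group value.
function split_apply_combine(keycol::Vector, f::Function)
    groups = Dict{Any,Vector{Int}}()
    for (i, k) in pairs(keycol)
        push!(get!(groups, k, Int[]), i)               # split
    end
    return Dict(k => f(rows) for (k, rows) in groups)  # apply + combine
end

# e.g. mean of a values column per group:
# split_apply_combine(keycol, rows -> sum(valcol[rows]) / length(rows))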
I tend to think that DataFrame operations are relational in nature, and that when it comes time to perform mathematical or statistical algorithms on the data, it should be converted to a standard Array or Sparse Array type. Julian DataFrames should support various types of exports to these structures, including various rules for dealing with non-numeric types so that the resulting structure is of a common type. The R function model.matrix, for example, converts factors to numeric dummy variables in various statistically-informed ways.
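In the spirit of model.matrix, a sketch of dummy-coding one factor column into a numeric matrix (hypothetical; no contrasts handling):

# Expand a factor column (refs into levels) to 0/1 dummy columns, one per
# level after the first, so the result is full-rank next to an intercept.
# Rows with refs of 0 (NA) or 1 (baseline) are all zeros; real code would
# treat NAs explicitly.
function dummy_matrix(refs::Vector{Int}, nlevels::Int)
    X = zeros(Float64, length(refs), nlevels - 1)
    for (i, r) in pairs(refs)
        r > 1 && (X[i, r - 1] = 1.0)   # level 1 is the baseline
    end
    return X
end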
Split-apply-combine, parallelized, is map-reduce. The split could be done on chunks of rows, or by values of columns (as above), depending on the operation.
It may be useful to consider operations on sharded data, where a common structure defines the columns and the shards (sets of DataVecs), and individual clients perform operations on their shard.
A rough sequence of implementation steps:
- AbstractDataVec to support different implementations.
- DataVecString for pooled strings.
- DataFrame views.
- Append DataFrames and add columns.
- Split-apply-combine.
- Exports to Arrays.
- Join/merge operations.
- Factor extension to DataVecString.
- Blocked Arrays. Use for DataVec to improve row-deletion speed.
- Indexes for fast get-index-by-name operations on DataVecs. Fast replace-by-name. Fast creation of views.
- MMapDataVecs.
- ???
- profit.