Replace PooledDataArray with NominalVariable/CategoricalVariable #73
Are you aware of #50 (comment)? |
Thanks for laying out these ideas, Simon. I'm not clear how we handle per-variable orderings with the proposal described in (1). If I do

How are you thinking of handling those kinds of issues? |
I was thinking that for (1) and (2):

```julia
immutable OrdinalVariable{T<:Integer,S}
    index::T
    pool::Vector{S} # Or maybe a wrapper over the pool?
end
```

Then we'd have:

```julia
function Base.(:<)(x::OrdinalVariable, y::OrdinalVariable)
    x.pool === y.pool || error("OrdinalVariables are not comparable")
    x.index < y.index
end
```

|
Everything you've said, @simonster, seems completely reasonable. I agree that your first proposal would be the most straightforward way to move forward. But I can't yet shake my sense that making every nominal/ordinal variable into its own type is the simplest way to do things when you want to work with categorical data without using DataArrays. (This, of course, assumes that it's actually reasonable to care about ordinal factors without also caring about missingness. I think it is, but may get convinced it's not worth the effort.) Rather than parametrize each runtime-generated type with extra information, I was thinking that we would just define functions like:
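For instance, something of this general shape (the `Gender` type and `levels` function below are made-up names purely for illustration, not an actual proposal, written in the Julia syntax of the time):

```julia
# Hypothetical example of a runtime-generated nominal type plus per-type functions.
abstract NominalVariable

immutable Gender <: NominalVariable   # one generated type per variable
    index::UInt8
end

# Instead of encoding the levels in type parameters, attach them via functions:
levels(::Type{Gender}) = ["female", "male"]
Base.string(x::Gender) = levels(Gender)[x.index]
```

|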
Both (3) and (4) have the problem that the pool itself is getting leaked, but dynamically creating types has the additional problem that every function gets inferred/compiled for each type. The performance and memory characteristics depend on the number of types * the number of different functions called in addition to the size of the data. See the quick benchmark I made here. With the

```
option 1:   0.007096643
option 2:   0.000856045
option 3:   0.000178305
option 3.5: 7.3403e-5
option 4:   1.166516775
```

With

```
option 1:   0.755386073
option 2:   1.2290704
option 3:   0.16911997
option 3.5: 0.034059569
option 4:   0.034355416
```

So for some but not all cases, dynamically generating types can be far slower, since

The option that I didn't consider in my original post, which I call option 3.5 above, is using an enum ID as in (3) in conjunction with a special array type as in (1) to get efficient storage but also avoid a GC root when accessing the values. If the ID is a

As far as categorical data without missingness with a PooledDataArray-like type, one possibility is to have a separate PooledArray type that shares most of the implementation but defines different |
Those benchmarks aren't entirely fair, since some have to deal with garbage produced by preceding benchmarks and some benchmarks include copy overhead while others don't. The below timings should be fairer. With

```
option 1:   0.000691289
option 2:   0.000917854
option 3:   0.000205124
option 3.5: 8.7721e-5
option 4:   1.307516131
```

With

```
option 1:   0.547448457
option 2:   1.013021837
option 3:   0.173764345
option 3.5: 0.055936094
option 4:   0.056068792
```

My feeling is that option 3.5 (storing raw values and the enum ID in a PooledDataArray type, and maintaining a global array of enum pools) is the way to go. Options 1 and 2 incur substantial garbage collection-related overhead for large data sets. With option 4, performance drops with the number of enums and the number of functions you plan to call on them, since each function needs to be compiled for each enum type. With option 3.5, performance depends only on the amount of data, and even if the user stores the enums in regular vectors instead of a special structure, it is not hugely inefficient (option 3 is ~3x slower).
I'm not very comfortable with the idea of not being able to clear the pool from memory. Even if in most cases it shouldn't be a problem, in some cases the number of levels might be relatively high. When loading data produced by someone else (e.g. survey data), there may be many variables, some of which have many levels, and if they are marked as categorical they will waste memory which cannot be freed without restarting Julia (I need to do this with R quite often and it's painful). Let's dream a little. In theory, it seems that the best solution would have the following properties:
Using enums and thus letting the type system handle the global pool sounds like a good solution, which would allow writing simpler code. Also, the type system is (or could be made) smart enough to avoid compiling two specialized functions for two enums which are actually

@simonster Regarding the benchmarks, I think you shouldn't build the vectors in the loops, as that's a less common operation than indexing. Does the code in the gist correspond to your last comment? |
To make Julia avoid compiling two specialized functions for two enums with the same fields, you need 1) abstract types with fields (JuliaLang/julia#4935) so that the compiler can optimize field access and 2) some way to specify that the compiler should not specialize on the subtypes (potentially related to "direct call methods specialized for non-leaf types" from JuliaLang/julia#3440, although in this case we'd have to be able to specify that on the type and not on the method). This is a little "featurey." There is also JuliaLang/julia#3080, although in that discussion it seems like people would like to be able to specialize/dispatch on the enum, which we probably do not want.

GC of types poses its own problems. Objects cannot simply hold a reference to the type without the same kind of overhead as in option 2 (which makes GC take 0.5 seconds with 50,000,000 values). I don't know if reference counting would really be much better, since we would still have to increment the reference count every time a value is retrieved from the array and decrement it when the value goes out of scope. This idea seems really featurey, since it isn't really intended that types are dynamically generated in the course of executing code in the first place.

In general, hacking this onto the type system continues to seem kind of wrong to me. The main purpose of the type system is dispatch and specialization. We don't want those things, just efficient storage. Maybe @JeffBezanson or @StefanKarpinski have some thoughts here?

Here are the benchmarks for just array accesses. While option 2 looks better here, it is not really an option because, as noted above, with 50,000,000 data points and thus 50,000,000 references to the pool, it makes every GC take half a second. (This is not a problem with the other approaches; option 1 holds only one reference to the pool and the others hold no references.) Also the benchmarks for option 4 would obviously be worse if I were calling more than two functions that need to be specialized for each type. With

With

|
Maybe using types is not the best solution after all. Anyway, the big issue with all solutions is how to keep a global pool which is not attached to any individual object, and yet could be garbage-collected when no longer needed. Indeed, for that you (or Julia) need to keep track of all objects which are using the pool, either by adapting the reference count when creating/deleting objects (and thus when indexing), or by going over all existing objects at GC time. |
Let me add another alternative to be sure we have weighed all the advantages and drawbacks of using types to handle this. @johnmyleswhite's comment #73 (comment) seems to point to a fifth solution, which could be more efficient (even though the garbage collection issue isn't fixed): create one type for each variable, which would only wrap a

Obviously all the issues about unwanted specialization and garbage collection do not go away in this scenario. |
I'm not sure if/how what you're suggesting differs from option 4 as implemented above. (I didn't implement the

One compromise between garbage collection and performance would be a variant of 3.5 where we maintain a strong reference to the pool in each PooledDataArray and a weak global reference. Then we'd avoid the overhead of maintaining references in each enum, and the pool would get garbage collected after every PooledDataArray that uses it gets garbage collected. For most use cases I think this would work well, but it's certainly possible to write code where the PooledDataArray falls out of scope before the enum, which could be surprising, and resulting bugs would be nondeterministic, which is not great.
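Roughly, that variant might look like this (a sketch only; names and details are made up):

```julia
# Hypothetical sketch of "strong reference in the array, weak global reference".
immutable LevelPool{S}
    id::Int
    levels::Vector{S}
end

# Registering a pool: POOL_REGISTRY[p.id] = WeakRef(p) when the pool is created.
const POOL_REGISTRY = Dict{Int,WeakRef}()

immutable WeakPooledVec{S}
    refs::Vector{UInt32}
    pool::LevelPool{S}       # strong reference keeps the pool alive
end

# Values would carry only the pool id; recovering the pool goes through the weak
# registry, so a pool becomes collectable once no WeakPooledVec references it.
function poolfor(id::Int)
    wr = POOL_REGISTRY[id]
    wr.value === nothing && error("pool $id has already been garbage collected")
    wr.value::LevelPool
end
```

|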
Yeah, that's actually option 4), but not parametrized on a tuple of levels. |
@johnmyleswhite, can I bother you to get your current thinking on this? |
Yeah, let me think about it again and get back to you. I'd really like to hear @StefanKarpinski, @JeffBezanson or @ViralBShah's opinions before we make a decision. In retrospect, I think a lot of the design of DataArrays/DataFrames was too close to R's idioms and not optimized for Julia. I'd like our decision here to be more appropriate for what Julia is going to offer when 1.0 comes out. |
Coming back to this with fresh eyes, option 3.5 does indeed seem best. Could we expose a mechanism for intentionally freeing the pool? This would only be used by performance buffs. Even though it's a little ugly, it does seem like the best solution. I'm going to e-mail Jeff, Viral and Stefan offline to see if I can get them to chime in. Once we resolve this, I'd like to also debate the remaining design issues for PDAs:
|
I haven't read all of the commentary on this issue (there are a lot of comments here and in the linked discussions), but I think that part of the confusion here comes from conflation of semantics and implementation. There are data that you want to represent in a DataFrame that are nominal or ordinal and you want them to behave as such. As John observed, the nominal/ordinal nature of a variable must be linked to the element type, not a property of the container – otherwise when you take the elements out of the container, they have different behavior, which is bad. The solution there seems pretty obvious: use appropriate ordinal and nominal types. This isn't really related to the data frame abstraction at all. It makes sense to have types for these kinds of things, possibly something like an enum (although the difference between ordinal and nominal, not to mention flags, indicates to me that maybe we need a slightly finer gradation).

Then there's also the issue of efficient storage and operation – which should not be confused with behavior. The fact that nominal and ordinal values can often be stored and operated on particularly efficiently might be the red herring here. I'm unconvinced that they should be related. Shouldn't this be like building an index for a data frame column? The element type is still the same and it doesn't affect the behavior of its values, but it provides a different, possibly more efficient implementation?
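Concretely, something like this sketch (names made up, not a concrete proposal) is what I have in mind: the container stores references into a pool, but the element type and the values you get back are completely ordinary.

```julia
# Pooling as an implementation detail: getindex returns a plain value of type T,
# so elements behave exactly as they would in an ordinary Vector{T}.
immutable PooledVector{T} <: AbstractVector{T}
    refs::Vector{UInt32}     # index into the pool for each element
    pool::Vector{T}          # each distinct value stored once
end

Base.size(v::PooledVector) = (length(v.refs),)
Base.getindex(v::PooledVector, i::Integer) = v.pool[v.refs[i]]
```

|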
In other words, I feel like this entire discussion assumes implicitly that nominal/ordinal values and value pooling should be linked – after all, it's often sensible to use a value pool when you have nominal/ordinal values and vice versa. But that's not necessarily the case. You could, for example, have a BigInt column that is truly integral, but because BigInts are huge, you may want to pool their storage. That data is pooled (implementation), but numerical. You might also have nominal data that you don't want to pool because the values are small and easier to simply store inline. |
This is basically the structure we have for PooledDataArrays right now, but a switch to enums would suggest two pools: one for the enums, containing the levels, and one for the PooledDataArrays, containing enums. In some sense this is more complicated, and I'm not convinced that the use case of storing BigInts/BigFloats in a PooledDataArray is a particularly strong one, but from a performance standpoint it might work better than the approaches I've discussed above, since we can just get the enum out of the pool instead of creating a new object on |
If you're using an enum, why do you need the pool? The only case where I can see needing the pool is one where you have a lot of strings or something and you want to reduce storage or do faster indexing. What kind of data really benefits from pooled storage? |
How do we store our enums in an ordinary array? Either we potentially have many thousands of objects that each hold a reference to the enum pool, which makes GC take forever, or we have to make the enum pools un-garbage-collectable. There is an additional concern of storage efficiency, since each enum needs to hold some kind of (redundant) pool identifier in addition to the level, but I don't think that's a big deal. |
I don't know what you're thinking about when you say "enum" but I'm thinking about a thin immutable type that wraps a |
Let's say you have a set of responses on a Likert scale, where each response is one of "strongly disagree," "disagree," "neutral," "agree," and "strongly agree." Those five levels are what I'm calling the enum pool, but perhaps that is a confusing name for it. We need a way to go from a given value back to the level to which it refers, or else you can't e.g. select all rows of a DataFrame where the response is "strongly agree." |
I feel like this should be the same as the proposed |
See JuliaLang/julia#2988 for a possible implementation. The title is misleading, since the PR includes a non-bitmask version of |
This has the same problem as the other type-based approaches: functions get specialized for each enum type, which means that if you have 100 different categorical columns you are compiling a lot of functions 100 times. Also both the functions and the lists of levels stay around in memory forever. There are major differences between what we want for doing stats and what |
Ok, these are good examples. Why not use strings or symbols? Do you have examples where pooling the storage is essential? |
Pooling the storage is essential in most uses; in the example of Likert scales, you don't want to repeat the strings "strongly disagree," "disagree," "neutral," "agree," and "strongly agree" one million times, as that would be terribly inefficient. And you don't want to store integers directly because it would be terribly confusing. We seem to need a specific type of enum, which would not trigger function specialization, and which would be garbage collected when no longer used. |
One option would be to just autopool strings behind the user's back. This is an option because Julia treats strings as immutable, so any two strings with the same value are just as good as each other. The strings still just act as strings, but you'd avoid all the storage overhead of repeating the strings over and over again.
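A minimal sketch of that kind of interning, assuming nothing more than a global table of previously seen strings:

```julia
# Keep one canonical instance per distinct string value; since strings are
# treated as immutable, the interned copy is interchangeable with the original.
const STRING_POOL = Dict{UTF8String,UTF8String}()

intern(s::UTF8String) = get!(STRING_POOL, s, s)
```

|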
After implementing some ideas for working with missing scalar values in the NullableTypes package, I decided to go ahead and implement Simon's 1st proposal at the start of this issue so that we could have some way of working with scalar values of categorical datatypes. This reflects my concession that getting an

Based on Simon's proposal, I've built up a potential backend for working with scalar values that are instances of either categorical or ordinal variables. You can check out what I've done at https://github.com/johnmyleswhite/CategoricalData.jl. Basically, I've:
One thing that I've done that's not strictly necessary, but I would argue is helpful, is to remove any notion of missingness from the underlying representation of categorical data. My idea is that we'd replace the current PooledDataArray type with two new types, CategoricalArray and OrdinalArray, that build on top of the functionality provided in CategoricalData.jl. Those new types would handle the problem of missingness for arrays. At the scalar level, we'd kick things back to NullableTypes.jl. Thus, indexing a scalar element from a CategoricalArray or an OrdinalArray would produce either an object of type
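Very roughly, the division of labor would look something like this (a sketch only; it assumes a Nullable type like the one in NullableTypes.jl, and the field names and constructors below are not the actual CategoricalData.jl API):

```julia
immutable CategoricalPool{T}
    levels::Vector{T}
end

immutable CategoricalVariable{T}
    level::UInt32
    pool::CategoricalPool{T}
end

immutable CategoricalArray{T}
    refs::Vector{UInt32}        # 0 marks a missing value, as in PooledDataArray
    pool::CategoricalPool{T}
end

# Indexing handles missingness at the array level and hands scalars back as Nullables.
function Base.getindex{T}(x::CategoricalArray{T}, i::Integer)
    r = x.refs[i]
    r == 0 ? Nullable{CategoricalVariable{T}}() :
             Nullable(CategoricalVariable{T}(r, x.pool))
end
```

|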
I like the CategoricalArray without missingness -- I've thought about that when dealing with PDAs. |
Sorry, I was a little unclear. My idea was that CategoricalArray would support missingness just as PDA does now (by using a pointer to an invalid index like 0), but that CategoricalPool would not support a missing value as a level. (Unless, of course, the missing value were explicit like "Did not answer question.") |
What's the use case for allowing users to say that a CategoricalArray does not permit NULL? To get something like the non-NULL constraint we're allowing now by inserting an Array into a DataFrame? |
It would definitely be good for that. I suppose the real goal is that, however people end up representing arrays of categorical data without missing values elsewhere (perhaps an Enum type in Base), the API for the categorical version of DataArrays should mirror the Enum-like type as DataArrays mirrors Arrays (working with either is near identical except for added code/computational complexity to deal with missingness when working with DataArrays), and, for example, a function that builds a ModelMatrix should be happy to accept either type. That seems like a consideration to push aside for now in favor of easier experimentation. But I really appreciate the work to make DataArrays and DataFrames consistent with Base -- I'm the kind of person who forgets a library's names, syntax quirks, etc., after a very short time away. |
I guess the question for me is how Enum relates to CategoricalVariable. OrdinalVariable is, in my mind, clearly distinct from Enum because of the support for a custom order. So the question is whether CategoricalVariable is redundant in the face of Enum. I think the answer is probably no, but I'm not super sure. Here's what I feel comfortable saying:
|
Very good points in the bullets. I remember seeing in discussion of Enums for Base that order may be handled there, so it's possible that if CategoricalVariable was redundant OrdinalVariable would be, too, but I think you're right that neither will be redundant. If neither type is redundant, it seems increasingly desirable for both to have mirror types that don't take NA, but I'm not ready to take a strong stance one way or the other there. |
I think we actually want the CategoricalPool to hold an array of CategoricalVariables so that, instead of constructing a new CategoricalVariable on every access to a CategoricalArray, we can just pull the CategoricalVariable out of the CategoricalPool. The reason is that a CategoricalVariable is a non-pointerfree immutable (because it holds a reference to the CategoricalPool, which holds an array), and it appears that Julia presently allocates these on the heap, i.e., they have performance characteristics more like regular objects than pointerfree immutable objects. To avoid allocation on every

I still think that Enums of the type people want in Base, which are primarily for controlling what code gets executed, are different from CategoricalVariables, which are primarily for holding data, and it's important we not conflate these two.
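For illustration, the caching idea could look roughly like this (hypothetical names; the extra type parameter is just one way to break the circular reference between the pool type and the value type):

```julia
# The pool caches one pre-built value per level, so getindex on the array can
# return pool.valindex[ref] instead of heap-allocating a new value each time.
type CachedPool{T,V}
    levels::Vector{T}
    valindex::Vector{V}          # one cached value object per level
end

immutable CatValue{T}
    level::UInt32
    pool::CachedPool{T,CatValue{T}}
end

function build_pool{T}(levels::Vector{T})
    pool = CachedPool{T,CatValue{T}}(levels, CatValue{T}[])
    for i in 1:length(levels)
        push!(pool.valindex, CatValue{T}(UInt32(i), pool))
    end
    pool
end
```

|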
I kind of regret referring to Enums now without knowing more about what |
Ok. It sounds like we're all in agreement that Enums aren't quite what we want. That makes me feel better about having implemented this stuff. :) @simonster, your note about being pointer-free is well-taken. I'll try to cycle back on this soon and experiment with an alternative implementation that includes a small set of constant CategoricalVariables in a pool that can be returned to users. There's also a question of whether the CategoricalPool should be immutable or not. The

@garborg, the package I wrote doesn't provide much usable functionality: it's more a demonstration of a potential backend than a drop-in replacement for PooledDataArrays. I can demo out an implementation of CategoricalArrays and OrdinalArrays to give a feel for how these things would work out in practice. |
@johnmyleswhite I don't get why

I also have a small remark regarding the naming:

Regarding enums, I agree with @garborg that we don't know yet how they will work, so it's hard to tell whether they would make CategoricalVariable redundant or not. |
@nalimilan You're right: there's no reason to have the two-vector constructor for
|
With a Ph.D. and a permanent position to start soon, I've finally found the time to take up @johnmyleswhite's work on CategoricalData.jl, and bring it to a point where I think it can constitute a replacement for

The features are essentially the same (though with additional support for ordered levels), and the API is very similar, except that instead of suffering from the

Please have a look at the README (and if possible at the implementation) and tell me what you think. If you agree, I'd like to move this to JuliaStats soon, and open a branch in DataFrames which no longer uses DataArrays at all.

Cc possibly interested people: @andreasnoack @davidagold @tshort @bkamins |
Looks great. A few comments
|
I haven't done systematic benchmarks yet. You're right that the most common use case is probably to pass a whole array, not individual values. The reference to the pool may not be required after all. As I see it, it can be useful as it would allow comparing values from two distinct pools which are
I don't think that would be very practical, as it would introduce type instability in a lot of places. It's already quite annoying to change the integer type used to store references based on the size of the pool (which isn't done automatically for this reason). Also, do you think it would bring a significant speed improvement? Accessing an array by integer indexing is fast too. |
I remember the problem now: storing (a reference to) the pool is the only way to provide both very fast equality checks between values from the same pool (the most common scenario) and support for comparisons between different pools. In the first case (same pool), you can simply check that pools are

If we don't want to keep a reference to the pool, we need to store the actual level, which in the common case will be a string: AFAIK this wouldn't be really more efficient than keeping a reference to the pool, would it? Except if we could find a way to store strings inline, e.g. as a tuple of chars, as an implementation detail.

(While it's currently an error in the package, I think the ability to compare values from different pools is important, as when importing data, arrays may have the same set of levels but not the same pool objects, and requiring the user to merge them would be painful.)

Forgot to CC @quinnj.
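In other words, something along these lines (a sketch only, with a made-up wrapper type):

```julia
immutable PoolRef{T}
    level::Int            # index into the pool
    pool::Vector{T}       # the pool of levels
end

# Same pool (checked by identity): just compare the integer indices.
# Different pools: fall back to comparing the underlying levels.
function Base.(:(==))(x::PoolRef, y::PoolRef)
    if x.pool === y.pool
        x.level == y.level
    else
        x.pool[x.level] == y.pool[y.level]
    end
end
```

|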
Thanks for the CC @nalimilan; I'll look more into your fork of CategoricalArrays. I've read through a good number of the comments here, but I'd like to add one more point on @simonster's original point 4, which would make a type definition like:

```julia
# `O` = whether the Category is ordered or not
# `I` = internal storage type
# `T` = tuple of symbols representing Category level values
immutable Category{O,I,T}
    value::I
end
```

I dislike throwing out this option simply because of the current state of code generation in Julia (i.e. over-specializing every method that operates on Category). IMO, this is the simplest, most straightforward implementation and I'd hate to see us get additional code despecialization capabilities in 0.6 and then have an entire package or two become overly complicated. It'd be nice to be able to do something like:

```julia
immutable Category{O,I,T::ANY}
    value::I
end
```

that would help the compiler know that it shouldn't ever specialize methods on the

I'd much rather write up an implementation based on this approach and then provide enough whiskey to @JeffBezanson, @vtjnash, and @carnaval during JuliaCon to add the ability to despecialize a type parameter 😄
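For illustration (building on the Category definition above; the alias and concrete levels are made up), a value of such a type carries its levels purely in its type:

```julia
# Hypothetical alias for one particular categorical variable.
const Likert = Category{true, UInt8, (:disagree, :neutral, :agree)}

x = Likert(0x02)                               # the second level, :neutral

# The levels are recoverable from the type alone, with no pool object anywhere.
levels{O,I,T}(::Type{Category{O,I,T}}) = T
levels(typeof(x))                              # (:disagree, :neutral, :agree)
```

|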
The downside with that approach is that any code modifying the levels would become type unstable if the levels are not known statically. I'm also not sure what the practical advantage is: why would statically knowing the possible levels make anything faster? |
Is it really a common use case to need to modify an existing Category's levels? I can see a case where you "build up" the levels first, outside the type, but in my mind, once you have them, you then build your Category type and you're done. The practical advantage of a definition like |
It's quite frequent in my experience to change levels, e.g. because you take a subset of the dataset and you want to drop unused levels, or even more frequently to recode data. If the levels are part of the type definition, you need to pass them in advance before assigning them based on various conditions (which violates the DRY principle), and

The machinery required to implement a custom type is significant, but that's not the end of the world either. CategoricalArrays.jl is mostly functional now, and it's only ~600 lines of code under
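To illustrate with the Category sketch above (the levels here are hypothetical): any change to the set of levels yields a different type, so containers and compiled methods tied to the old type can't be reused.

```julia
# Dropping an unused level produces a distinct type.
OldT = Category{false, UInt8, (:a, :b, :c)}
NewT = Category{false, UInt8, (:a, :b)}
OldT === NewT    # false: a Vector{OldT} cannot simply be reused as a Vector{NewT}
```

|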
See JuliaData/DataFrames.jl#994 regarding the port of DataFrames to NullableArrays and CategoricalArrays. |
Closing now that DataFrames has been ported (JuliaData/DataFrames.jl#1008). |
In #67, @johnmyleswhite proposed replacing PooledDataArrays with OrdinalVariable/NominalVariable enums. Here are four possible approaches in order of my current preferences regarding them:
getindex wraps the extracted value as an OrdinalVariable or NominalVariable. So far, this is looking like the best approach to me. Pros: Efficient storage and avoids most of the potential performance pitfalls of the below approaches. Cons: We don't get to get rid of much code.

I have no idea how much this matters.

This makes garbage collection really slow.)