-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Union{T,Null} inference in structs #6
Comments
Can you give a concrete example where this behavior is a problem? |
So I've been experimenting w/ porting DataStreams/CSV/DataFrames over to use using Nulls
struct Foo3{T}
v::T
end
function bar3(x)
x = rand() > 0.5 ? 0. : null
return Foo3{?(Float64)}(x)
end In this case now, So all that to say that I recognize the issue you're talking about here and ran into it myself, but the work-around isn't actually a problem once you stop to think about it. It just takes a little more pre-declaring container types instead of purely relying on inferring the type parameters. |
For a query like this: @from i in df begin
@select {i.a, i.b}
end we will end up with an anonymous function like this somewhere deep inside Query.jl: struct Foo{T1,T2}
a::T1
b::T2
end This anonymous function will be called with instances of another type, say struct Bar{T1,T2}
a::T1
b::T2
end The specific concrete type that would be passed into this anonymous function might be @quinnj suggestion to explicitly declare the fields to be @from i in df begin
@select @NT(a::?(Int) = i.a, b::?(String) = i.b)
end |
Couldn't the query macros do that work automatically? |
No, the macro doesn't even know that the |
I think this problem is actually just an instance of two issues that were raised and discussed in the original julep (but never resolved, as far as I can tell). The core of the problem is this: if you have a higher-order function and pass a procedural parameter that returns a composite type and it operates over one or more collections with element type The two issues from the julep (and the discussion in the PR) that are relevant here are 1) John's rational for preferring John's reasoning for preferring @andyferris's reason for preferring The only way to solve both problems with the Here is a simple examples that illustrate this problem: using Nulls
A = Union{Int,Null}[1,2,null,4,null]
B = Union{Int,Null}[1,null,3,4,null]
C = map(Pair, A, B) Note how this array will have elements with four different concrete types: So I think this issue is essentially about two problems with |
Yes, this is still the same original problem here, but I'm also not sure why the same solution doesn't apply julia> C = map((x,y)->Pair{?(Int),?(Int)}(x, y), A, B)
5-element Array{Pair{Union{Int64, Nulls.Null},Union{Int64, Nulls.Null}},1}:
Pair{Union{Int64, Nulls.Null},Union{Int64, Nulls.Null}}(1, 1)
Pair{Union{Int64, Nulls.Null},Union{Int64, Nulls.Null}}(2, null)
Pair{Union{Int64, Nulls.Null},Union{Int64, Nulls.Null}}(null, 3)
Pair{Union{Int64, Nulls.Null},Union{Int64, Nulls.Null}}(4, 4)
Pair{Union{Int64, Nulls.Null},Union{Int64, Nulls.Null}}(null, null)
julia> Pair{?Int, ?Int}[Pair(x, y) for (x, y) in zip(A, B)]
5-element Array{Pair{Union{Int64, Nulls.Null},Union{Int64, Nulls.Null}},1}:
Pair{Union{Int64, Nulls.Null},Union{Int64, Nulls.Null}}(1, 1)
Pair{Union{Int64, Nulls.Null},Union{Int64, Nulls.Null}}(2, null)
Pair{Union{Int64, Nulls.Null},Union{Int64, Nulls.Null}}(null, 3)
Pair{Union{Int64, Nulls.Null},Union{Int64, Nulls.Null}}(4, 4)
Pair{Union{Int64, Nulls.Null},Union{Int64, Nulls.Null}}(null, null) i.e. the container just has to be declared/tagged w/ the right Going back over the John/Andy discussions, I'm inclined to say that it's probably a more theoretical problem than practical; especially since I believe most users won't even run into this issue in their own code. They'll more than likely be calling higher-level libraries like DataStreams/Query which can handle the typing under the hood. Even if they're extracting DataTable/Frame columns to do their own operations, they're already set since the columns themselves will already be correctly typed with |
As a prelude: Like all great works, I feel that reading John's julep was a process of gradually transcending many layers of deeper conceptual understanding :) That said, after much reflection, I actually think that this application (nullables, nulls, etc) may have been the first to expose a bit of an inconsistency (*) in Julia's type system. For example, "covariant" objects like direct bindings follow the semantics of a dynamic language, while the elements of structs, refs, arrays, etc behave a lot like the types in a static type system. Unfortunately, once you reach any level of complexity, you end up taking covariant objects and throwing them into containers, passing that through a function, which maps these elementwise through another function, etc... and you end up with the problems discussed in this issue. So, I think that none of
are perfectly suitable for all situations - more precisely, I think you can construct objectionable situations for any of the five types above. I think what happened in the Julep is that different people found different objectionable situations for several of the above. Given that, my suggestion is to choose whatever is easiest for the typical user, and then do your best to patch up the resulting holes that appear - case in point being the OP. At this stage, I'd say that (*) - inconsistency is definitely too strong of a word (apologies to Jeff B, if you're reading this), but I couldn't think of another one. (Warning, going off-topic). I'm not saying that it's fundamentally broken or anything, but we definitely have to deal with covariant types and their invariant type parameters. In a static language, you would just use |
Yes, declaring the return types in all these cases works. But I think that would prevent us from enabling a large class of very elegant data manipulation solutions. Here is one I just enabled on With the latest using DataFrames, IterableTables, NamedTuples
df = DataFrame(a=[1,2,3], b=[4,5,6])
df2 = DataFrame(@NT(c=i.a, b=i.b) for i in getiterator(df) if get(i.a)>2) This will generate a new This is still somewhat verbose, but I think there is a very good chance that we could get this down to the following in maybe even the julia 1.0 time frame: df2 = DataFrame({c=i.a, i.b} for i in df if i.a>2) This would require a few things: First, a named tuple syntax in base. Doesn't have to be In my mind this is syntax that would potentially be used a lot by end users for very simple data manipulation of table data. Now, if we wanted this to work with the df2 = DataFrame({c::?(Int)=i.a, b::?(Int)=i.b} for i in df if i.a>2) I think that would make this whole idea probably moot, folks don't want to specify every column type in this kind of simple query. Alright, running out of time, I'll write up why this is not trivially handled in something like Query.jl, or at least why I don't see a way to do this right now. Would obviously be great if one of you could think of a solution to that, once I've described the problem. Oh, one last thing: I'm not surprised that this all works great in DataStreams.jl. I'm 99% sure I would not run into the slightest problem with |
Thanks for the details, that helps understanding the issue. So the concrete problem is very similar to this: julia> using NamedTuples, Nulls
julia> itr = (@NT(c=i.a, b=i.b) for i in (@NT(a=1, b=2), @NT(a=2, b=null)));
julia> collect(itr)
2-element Array{NamedTuples._NT_c_b{Int64,T2} where T2,1}:
(c = 1, b = 2)
(c = 2, b = null) Or even this: julia> itr2 = (((i[1], i[2]) for i in ((1, 2), (2, null))));
julia> collect(itr2)
2-element Array{Tuple{Int64,Any},1}:
(1, 2)
(2, null) The element type is chosen by julia> Core.Inference.return_type(first, Tuple{typeof(itr)})
Union{NamedTuples._NT_c_b{Int64,Int64}, NamedTuples._NT_c_b{Int64,Nulls.Null}}
julia> Core.Inference.return_type(first, Tuple{typeof(itr2)})
Tuple{Int64,Any} Fortunately, even though function getiterator(source::Base.Generator)
T = Core.Inference.return_type(first, Tuple{typeof(source)})
S = typeof(source)
return IterableTables.GeneratorIterator{T,S}(source)
end I get this: julia> eltype(getiterator(itr))
Union{NamedTuples._NT_c_b{Int64,Int64}, NamedTuples._NT_c_b{Int64,Nulls.Null}} This means you should be able to perform whatever type computation you need inside BTW, I haven't been able to replicate the example from your last post on Julia 0.6, even with IterableTables from git master. The |
Also, regarding the julia> [a=>b for (a, b) in zip([1, 2], [2, null])]
2-element Array{Pair{Int64,B} where B,1}:
1=>2
2=>null Notice that the return type is not |
Perhaps, though it seems to me that the advantage of going as general as possible off the bat for the element type of a mutable object is that the element type doesn't have to be recomputed if a new element is added. I imagine that could be detrimental to performance. (Edit: This makes no sense, disregard) |
What do you mean? The type of an array won't change when adding an element, so you can't even add an entry of a different type. This is indeed one of the drawbacks of precise typing, but we accept this for concrete types because it is essential for performance (and safety/predictability BTW). |
What I mean is that if you have |
Yes, but how is that different from e.g. |
That's a good point. Don't mind me, then... |
@nalimilan There are really three questions, I believe: 1) can one recover enough type information to construct the right data structure to materialize a query, 2) will that data structure be fast and efficient and 3) can one write code that materializes things that is fast. In your examples we end up with the following types for elements: a) I believe b) is something we can only get if there are very few fields in the target struct that can take a missing value. Otherwise you end up with 2^n elements in the union. I think type inference will actually infer things as UnionAll types (as in c)) once the number of fields increases. So, for now I'll just limit the discussion to a) and c). In the a) case we lost the required type information, i.e. we no longer can tell that the second tuple element can only be an c) is interesting. In this case we have to deal with UnionAll types. In your example there is only one type variable, but if there were more fields that can take nulls, this might look like Now, I believe you are right that with the type information in c) we could construct say a Now, just for the case of materializing something into a for i in source
# Store tuple elements into DataFrame columns
end Type inference would infer the type of Now, if type inference was able to generally infer these UnionAll types as say |
What does it do differently? |
f(a::Nullable{Int}) = Pair(a,a) always returns a f(a::Union{Int,Null} = Pair(a,a) will return either a
|
Unless you define f(a::Union{Int, Null}) = Pair{Union{Int, Null}, Union{Int, Null}}(a, a) which has already been pointed out. |
As I wrote before, that doesn't work with the design for Query. I can't ask users to make these type annotations in every query, and the macro doesn't know anything about types, so it can't automatically rewrite this either. There is no other stage where one could rewrite this, so I don't see how this suggestion can be made to work. |
Why is this? I have some guesses but I can't tell for sure. |
Will |
@shashi Query.jl is essentially a Julia implementation of LINQ, so for a general overview of the design any LINQ description should work. Specifically, in many cases a query will often end up being translated into a chain of iterations, and a bunch of those chained iterators are map-like, i.e.they apply a user provided anonymous function to each element in the iteration (this is true for Are you coming to juliacon this year? I'll dive into some detail of this in my talk. |
@nalimilan I don't think that will change anything. I'm really looking forward to |
Do you mean maybe @vtjnash knows if that would work. I think this sort of thing should be taken care of the compiler either way...
I'll make sure to attend! |
The basic problem here is that this example seems to be too naive: it explicitly asserted that the inputs had to be Thus to be pedantic and accurate about what the language / compiler can do, the result of this function should really be represented as With the alternate design, the possible outputs are simply Aside: I'm not sure the latter is necessarily meaningful, so I would expect the intended result is instead Making the
|
@shashi I was a bit sloppy there. If the function creates a named tuple with just one field things get inferred as
@vtjnash Well, for Query.jl I can guarantee that the inputs are always properly typed as say @from i in df begin
@select {x=i.a, y=i.b, z=i.c}
@collect DataFrame
end This will construct a chain of two iterators. The first iterator returns a named tuple for each row in the source The So how would this look with
I don't understand that, but maybe my exposition of the issue was just not very clear. Hopefully the explanation in this post is more comprehensive and closer to the problem I have in Query.jl with |
I just reread @vtjnash's comment and somehow I had missed/not really noticed the last paragraph:
struct Maybe{T} # née Nullable{T}
value::Union{T, Null}
end That kind of design of course would solve the issue for Query.jl, essentially this amounts to sticking with a container based approach, but swapping out the internals of how that container is structured. If there are performance benefits of that over the old In general, what do folks think about the problem in this issue at this point? I've played with various other attempts to get around this in Query.jl over the last weeks, and I just don't see a solution that works. I did see that various packages are going full steam ahead in switching over to Nulls.jl (DataStreams, CategoricalArrays, DataTables?), but unless we find a solution to this I fear this will just lead to the next fragmentation of the data ecosystem, i.e. Query.jl and friends won't be able to follow suite and will use DataValues.jl, and then the other half of the package world will use Nulls.jl... Is that the plan right now? |
@vtjnash This seems relevant for JuliaLang/julia#22682. |
I'm a bit stuck here on not understanding what One possibility is to use something like type promotion, where Another possibility is to use some interface that tells you the declared type of a field. For example
|
Here is a simpler setup that demonstrates the issue, using the named tuple syntax from the PR in base: using Nulls
A = [1,2,null,4]
B = [1.,2.,3.,null]
result = map((i,j)->(a=i,b=j),A,B) The question is a) can If I understand co-variance properly, then the idea that came up during juliacon (make named tuples co-variant) would actually imply "no" for question a), right? In that case the element type of the If named tuples are not co-variant, then One option is @JeffBezanson's suggestion to change the anonymous function so that it always returns instances of The promotion idea is interesting, but I can't think of a way of using that which wouldn't tie things even more to inference, whereas I'm trying to actually move away from the heavy reliance on inference. But on the flip-side, I haven't thought about this option long (well, since yesterday), so maybe there is some way to make this work. Here is the issue: if I was trying to implement a It is also just worth pointing out again how much easier everything is for the Query.jl situation with |
Thanks, that's a useful example.
As you probably know, that's exactly how
That would help reduce the number of re-allocations, though it wouldn't eliminate them. However, if we (improperly) exploit inference (the way Nullable and DataValue do) we could find out the (nullable) result type up front and avoid all re-allocations. |
@JeffBezanson, what exactly are you referring to when you say "if we (improperly) exploit inference"; are you talking about the liberal use of |
No, Arrays of NamedTuples will not have a compact memory layout (they're rotated wrong in memory for that, although by contrast that means they would be laid out in cache order). For that, you would need a RowView type (a la DataFrame's current design). |
Indeed, the fact that The problem is that currently An issue with the proposal in @JeffBezanson's last comment is that if |
Instinctively, I think this might be a reasonable choice for data packages, but I'm wondering if the default versions methods in
This seems like a good idea to me, For |
|
I've tried implementing this strategy at JuliaLang/julia#23940. Comments welcome. |
I have three memory layouts in mind: a) you can end up with a situation where you have an array of pointers to the actual values (I guess this is called boxed values?), b) you can have a dense memory layout of structs, i.e. no pointers, just the content of one element after another in a consecutive memory region, and c) you can store an array of structs as a struct of vectors. My understanding was that an |
@JeffBezanson I guess I also left out two complications in my simple example... Always tricky to find the right balance of a simple example and one that does justice to the complexity of the real problem. Here they are: The loops that materialize queries are largely not part of Query itself, they really should be part of the containers that can materialize a query. So that can essentially be any container that accepts an iterator. Things like The second complication is that my example was misleading in that it started out with a map over two vectors. That actually never happens in Query. Rather, every query operator is actually operating on just one iterator, and with tables that iterator will return named tuple elements. So a better example would have been this (using a slightly modified named tuple syntax from base): using Nulls
source = [
(a::Union{Int,Null}=1, b::Union{Float64,Null}=4.),
(a::Union{Int,Null}=2, b::Union{Float64,Null}=null),
(a::Union{Int,Null}=null, b::Union{Float64,Null}=3.),
(a::Union{Int,Null}=null, b::Union{Float64,Null}=null)
]
result = map(i->(a=i.a,b=i.b),source) With this, the anonymous function that you pass to |
I think at the end of the day there are a bunch of assumptions that drove folks towards the
The following is not a rigorous argument, but my gut feeling on this still is the following: LINQ was designed by a group of folks that had been thinking for a long time about composable software systems (Erik Meijer, Brian Beckman etc.). They essentially based everything around ideas from functional programming languages, and in particular monads. What they ended up with is a design that is truly a marvel at composability: you have all these really small, self-contained things (the IEnumerable monad, the Nullable monad etc.), that know nothing about each other and they all just happen to magically work wonderfully together to form larger, more complicated systems. I think I managed to bring most of that over to Query: the design right now for example is super, super close to one where the query side of things doesn't even know that things like missing values can exist, but it just happens to compose really well with DataValue, which also doesn't know anything about Query. That whole model seems to not carry over to I'd be really interested to hear @andyferris thoughts on this, in particular as it relates to the https://github.com/JuliaData/SplitApplyCombine.jl package. I would expect that you'll run into exactly the same issues, given how similar the designs of Query.jl and your package are? |
For SplitApplyCombine I'm flat out asserting that the null data problem is an orthogonal design issue (*). The package deals with iterable/indexable containers and transformations thereof, and the element/scalar-level functions are provided by the user. I agree 100% with the premise that these things must compose seamlessly, and I think several others have already expressed a similar opinion. I can appreciate that having typed null values (indicating e.g. a value of type I think my points above do add some constraints on "valid" ways of implementing missing values, but there is still some design space left to explore. Addressing the OP directly, we primarily circumvent the tension between covariance and invariance for containers because they are instantiated with Regarding LINQ, I wonder how much of the design is influenced by statically typed languages that use vtables to do a bit of dynamic stuff, and what parts should be different in a fully dynamic language. (That's an honest question). TLDR; Sorry, I'm not sure I provided any actual answers here! I think with some effort we could make any one of several approaches work OK (but like I hinted doing it really well might have wider implications for things as far ranging as how (*) I did play with what the SplitApplyCombine functions do to |
I do think I agree the annoyance of unwrapping might not be a showstopper. At one point I conceded that some small set of operations ( But then there are problems:
I agree this is an issue. We should find a way to abstract this pattern if possible. |
Having not worked much with I think the issue here is a mismatch of terms. A type-error, or kind-error if you will. This problem pops up in various forms: in Each package makes its choices, but one way or another it's trying to fit a square peg into a round hole. A missing value and a value are not the sme kind of thing. A concrete value represents an element of a set. A missing value, due to some form of uncertainty, represents a set of elements. A missing value is more like a type than a value. Sticking them together in a Similarly, an
I have no doubt these discussions will eventually lead to a solution or number of solutions with best tradeoffs. In that sense I am echoing @andyferris original comment, and also suggesting what I think may be the root cause. I think a deeper resolution to this problem would essentially require Julia to make computing with concrete representations of sets of values (abstract domains) a first class principle. Probably this isn't a common enough problem to warrant that. But it'd be neat. |
One more thing: What |
Agreed a 100%, that is the philosophy I followed with Query from the beginning. There was one place where I had messed that up, but I was just able to remove that today, so now QueryOperators.jl doesn't depend on DataValues.jl anymore, it just happens to work well together with it. I'd just be interested whether you share my view that you will run into the same problems that I have in Query.jl when you try to use a named tuple with fields of type
Could you elaborate a bit more, or have an example in mind, what you mean by that? I think you aren't referring to the reliance of Query.jl on type inference, but something else, right? I have a pretty solid plan at this point to get rid of the reliance of Query.jl on type inference (or at least only use it as a performance optimization) that would work well with
Yeah, that is still tricky. I should say though that I think that
In my mind this is an annoying problem, but I agree, it is probably not impossible to find some solution for it. At least it is doable, all these collections could after all just implement that complicated logic. Not elegant, and I'm not sure how realistic it is in practice, but it doesn't look entirely impossible to me. The much bigger concern in my mind is the type instability of the anonymous function in something like |
It works pretty well, but I ran into the issue recently that |
On the type inference issue, I'm referring to the need to determine the parameter in |
@zenna I agree about the connections to domain theory. It would indeed be interesting to explore e.g. making
would play out in that world? @davidanthoff I would seriously consider making the syntax Also, if you're consuming an iterator don't you have to worry about varying types anyway? For instance the user could give you |
Just for those that aren't following things over at DataValues.jl: this has been fixed now, that was just a bug.
We don't have any automatic lifting in any case, so it is just something one needs to manually take care of when defining the lifted version of a function. Some of the manual definitions right now rely on type inference (that code was mostly just copied from the Nullable implementations in base), and that seems to work fine because those are all very simple functions. I'm pretty sure one could actually get rid of that, it would just be a bit more work (famous last words, I hope not). In my mind this counterfactual return type problem is really serious for any attempt to automatically lift, but if the lifted methods are coded by hand in any case, it seems to mostly work. DataValues also has
Would that also work for say
Yes, I think Query.jl needs to not error in these situations. But I think if someone ends up in that situation, it is ok if things are slow. But the more common use cases really need to be fast, and a standard projection call that selects a couple of columns from a table is certainly a super common case, so that needs to be very performant. In some sense the usual rules apply here: if you write a type-stable function, you'll get speed. But it would be nice if the most typical use case actually is just type stable. |
Well, we have been adding code to the compiler to make |
Say we have a struct like this:
And a function like this:
Then the return type of this is currently inferred as
Union{Foo{Float64}, Foo{Nulls.Null}}
. This won't work for the whole Query.jl design, I would need the inferred return type here to beFoo{Union{Float64,Null}}
.But here is the twist: I'm not convinced that this would be the right thing to do in general, i.e. that changing the current julia behavior across the board to something that squares with the needs of Query.jl would be a good idea. Note that this is not an issue with the current approach taken in Query.jl where I use the
DataValue
type from the DataValues.jl package to represent missing values.The text was updated successfully, but these errors were encountered: