Made WeightVec a subtype of RealVector #248

Merged: 6 commits, Apr 24, 2017

Conversation

Member

@rofinn rofinn commented Apr 20, 2017

Seems like WeightVec should support the same functionality as regular vectors.

@@ -281,6 +288,9 @@ Base.mean{T<:Number,W<:Real}(A::AbstractArray{T}, w::WeightVec{W}, dim::Int) =


###### Weighted median #####
function Base.median(v::AbstractArray, w::WeightVec)
Member

Shouldn't this method be unnecessary since all you're doing is throwing a MethodError? Presumably without the method, it would also be a MethodError.

Member Author

@rofinn rofinn Apr 20, 2017

I needed to add this because this test was failing with a BoundsError instead. It appears that making WeightVec a RealVector allowed it to dispatch to a method in Base.

julia> median([4 3 2 1 0], weights(wt))
ERROR: BoundsError: attempt to access 2-element Array{Any,1} at index [3]
 in mapslices(::Base.#median!, ::Array{Int64,2}, ::StatsBase.WeightVec{Int64,Array{Int64,1},Int64}) at ./abstractarray.jl:1619
 in median(::Array{Int64,2}, ::StatsBase.WeightVec{Int64,Array{Int64,1},Int64}) at ./statistics.jl:579

I figured it was better to manually throw a `MethodError` rather than letting it hit the `BoundsError` in that case.

Member

Ah, I see. This is fine then. Thanks for clarifying.

Member

That said, the current behavior isn't that different:

julia> median([4 3 2 1 0], weights([1,2,3,4,5]))
ERROR: MethodError: no method matching start(::StatsBase.WeightVec{Int64,Array{Int64,1}})
Closest candidates are:
  start(::SimpleVector) at essentials.jl:259
  start(::Base.MethodList) at reflection.jl:560
  start(::ExponentialBackOff) at error.jl:107
  ...
Stacktrace:
 [1] append_any(::StatsBase.WeightVec{Int64,Array{Int64,1}}, ::Vararg{StatsBase.WeightVec{Int64,Array{Int64,1}},N} where N) at ./essentials.jl:170
 [2] median(::Array{Int64,2}, ::StatsBase.WeightVec{Int64,Array{Int64,1}}) at ./statistics.jl:619

So I'm not sure this MethodError is really needed. We typically don't throw them for all possible cases that fail.

@ararslan
Member

It would be great if you could add tests for this.

@ararslan ararslan requested a review from nalimilan April 20, 2017 23:27
src/weights.jl Outdated
-immutable WeightVec{W,Vec<:RealVector}
-values::Vec
-sum::W
+immutable WeightVec{T, V<:RealVector, S} <: RealVector{T}
Member

@nalimilan nalimilan Apr 21, 2017

Better do WeightVec{S<:Real, T<:Real, V<:AbstractVector{T}} <: AbstractVector{T}. That should simplify constructors since the parameter could be inferred from the input arguments.

EDIT: RealVector is just an alias used for conciseness, but it's not useful when defining types since it's redundant with the information provided by T<:Real.

Member

Is it really useful to allow typeof(sum) and eltype(values) to be different? Having three parameters for such a simple type seems overly complicated.

Member

I wondered about this too, then I realized that it might be useful to allow storing weights using a small type (UInt8, Float16) to save space, and yet be able to store the sum in a type that won't overflow or lose precision too easily (Int64, BigInt, Float64).
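
The trade-off described here can be sketched in current Julia 1.x syntax (the thread itself targets 0.5/0.6). `SketchWeights` is a hypothetical stand-in, not the StatsBase type; it only illustrates storing weights in one byte each while keeping the sum in a wider type, using the suggested `{S<:Real, T<:Real, V<:AbstractVector{T}}` parameterization.

```julia
# Hypothetical stand-in for the type under discussion (not the StatsBase definition).
struct SketchWeights{S<:Real, T<:Real, V<:AbstractVector{T}} <: AbstractVector{T}
    values::V
    sum::S
end

# Outer constructor: all three parameters are inferred from the arguments.
SketchWeights(values::V, s::S) where {S<:Real, T<:Real, V<:AbstractVector{T}} =
    SketchWeights{S, T, V}(values, s)

# Minimal AbstractVector interface so the wrapper behaves like a regular vector.
Base.size(w::SketchWeights) = size(w.values)
Base.getindex(w::SketchWeights, i::Int) = w.values[i]

stored = UInt8[200, 200, 200]

# Accumulate in Int64 even though each weight occupies a single byte.
compact = SketchWeights(stored, sum(Int64, stored))

# For comparison: accumulating in the storage type itself wraps around.
wrapped = reduce(+, stored)
```

Because `S` is independent of `T`, the element type stays `UInt8` while `compact.sum` holds the exact total.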

Member Author

My view was that all of these could be useful in cases like WeightVec{Bool, NullableArray{Bool}, Int}, but I figured you'd be more likely to want to dispatch on V than S.

Member Author

@nalimilan Isn't WeightVec{S<:Real, T<:Real, V<:AbstractVector{T}} <: AbstractVector{T} triangular dispatch, which is only supported in 0.6?

Member Author

@rofinn rofinn Apr 21, 2017

@nalimilan I've updated the parameterization to match your suggestion, but I needed to version check whether we can use triangular dispatch (implemented in this PR).

- Parameterization changed to `WeightVec{S<:Real, T<:Real, V<:AbstractVector{T}}`
- Uses triangular dispatch for Julia versions past v"0.6.0-dev.2123"
src/weights.jl Outdated
end

function WeightVec{T<:Real, V<:AbstractVector{T}}(vs::V)
sum_ = sum(vs)
Member

The trailing underscore is a bit weird. Why not use s or any other simple name?

Member Author

This is mostly just habit, as I try to avoid single-character variable names. For math and stats code I tend to use the function name with a trailing underscore as the variable name if I can't think of anything better.

src/weights.jl Outdated
sum::S
end

function WeightVec{S<:Real, T<:Real, V<:AbstractVector{T}}(vs::V, s::S)
Member

You should be able to move this function and the next one out of the version-dependent block using V<:RealVector and replacing T with eltype(vs). That is, just slightly adapting the existing constructors.

src/weights.jl Outdated
end

"""
WeightVec(vs, [wsum])
function WeightVec{S<:Real, V<:RealVector}(vs::V, s::S)
Member

This change is incorrect: the signature should just reflect how the user can call the function, not how it's implemented.

src/weights.jl Outdated

Construct a `WeightVec` with weight values `vs` and sum of weights `wsum`.
If omitted, `wsum` is computed.
"""
-WeightVec{Vec<:RealVector,W<:Real}(vs::Vec,wsum::W) = WeightVec{W,Vec}(vs, wsum)
-WeightVec(vs::RealVector) = WeightVec(vs, sum(vs))
+function WeightVec{S<:Real, V<:RealVector}(vs::V, s::S)
Member

You can actually use s::S=sum(vs) to get rid of the next method.

Member Author

Interesting, I didn't realize that was valid syntax in 0.5 (i.e., referencing an earlier argument when setting the default value).
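
A minimal sketch of that syntax, using a hypothetical helper `normalize_weights` rather than the actual `WeightVec` constructor. Default values are evaluated left to right, so a later default may refer to an earlier positional argument.

```julia
# Hypothetical helper: the default for `total` is computed from the earlier argument `vs`.
normalize_weights(vs::AbstractVector{<:Real}, total::Real = sum(vs)) = vs ./ total

implicit = normalize_weights([1, 3])      # total defaults to sum(vs)
explicit = normalize_weights([1, 3], 2)   # caller supplies the total
```

Julia lowers this into a one-argument and a two-argument method, which is why the separate single-argument definition becomes redundant.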

src/weights.jl Outdated
@@ -281,6 +290,9 @@ Base.mean{T<:Number,W<:Real}(A::AbstractArray{T}, w::WeightVec{W}, dim::Int) =


###### Weighted median #####
function Base.median(v::AbstractArray, w::WeightVec)
throw(MethodError(median, (typeof(v), typeof(w))))
Member

Hadn't noticed that one. Should pass (v, w) instead of their types. Otherwise, looks good to me.
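
A sketch of the suggested fix, with hypothetical names (`FakeWeights`, `sketch_median`) standing in for `WeightVec` and `Base.median`. `MethodError(f, args)` expects the argument values and works out their types when the error is displayed, so a tuple of types would likely be reported as if the types themselves had been passed.

```julia
# Hypothetical stand-in for WeightVec.
struct FakeWeights
    values::Vector{Float64}
end

# Hypothetical function that explicitly rejects an unsupported combination.
function sketch_median(v::AbstractArray, w::FakeWeights)
    # Pass the argument values, not typeof(v) and typeof(w).
    throw(MethodError(sketch_median, (v, w)))
end

# Capture the error to inspect it instead of aborting.
err = try
    sketch_median([4 3 2 1 0], FakeWeights([1.0, 2.0]))
catch e
    e
end
```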

@rofinn
Member Author

rofinn commented Apr 24, 2017

Should I squash my commits or is a more verbose history preferred here?

Member

@nalimilan nalimilan left a comment

Thanks! No need for squashing, we can do it from GitHub when merging.

@ararslan ararslan merged commit 5a7fa11 into JuliaStats:master Apr 24, 2017
@ararslan
Member

ararslan commented May 4, 2017

I'm pretty sure this change is what broke DataArrays. Weighted mean on a DataArray is throwing a MethodError when passed an object constructed with weights().

@rofinn
Member Author

rofinn commented May 4, 2017

@ararslan Can you post the specific error?

This seems to work.

julia> da = @data([4.0, 2.0, 3.0])
3-element DataArrays.DataArray{Float64,1}:
 4.0
 2.0
 3.0

julia> w = weights(rand(3))
3-element StatsBase.Weights{Float64,Float64,Array{Float64,1}}:
 0.727859
 0.000978015
 0.301205

julia> mean(da, w)
3.7056809658486145

Not the exact same version of StatsBase, but pretty close for this use case.

@nalimilan
Member

Can you post the code which triggers the failure?

@ararslan
Member

ararslan commented May 4, 2017

From the DataArrays tests:

da1 = DataArray(randn(128))
da2 = DataArray(randn(128))
da1[1:3:end] = NA
da2[1:2:end] = NA
mean(da1, weights(da2))

The last line is now a MethodError:

julia> mean(da1, weights(da2))
ERROR: MethodError: no method matching StatsBase.WeightVec(::DataArrays.DataArray{Float64,1}, ::DataArrays.NAtype)
Closest candidates are:
  StatsBase.WeightVec(::V<:(AbstractArray{T,1} where T<:Real)) where V<:(AbstractArray{T,1} where T<:Real) at /Users/alex/.julia/v0.7/StatsBase/src/weights.jl:23
  StatsBase.WeightVec(::V<:(AbstractArray{T,1} where T<:Real), ::S<:Real) where {S<:Real, V<:(AbstractArray{T,1} where T<:Real)} at /Users/alex/.julia/v0.7/StatsBase/src/weights.jl:23
Stacktrace:
 [1] weights(::DataArrays.DataArray{Float64,1}) at /Users/alex/.julia/v0.7/StatsBase/src/weights.jl:31

@rofinn
Member Author

rofinn commented May 4, 2017

@ararslan That makes sense, because DataArrays.NAtype isn't a subtype of Real and sum(da2) returns NA.
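
The same failure mode can be reproduced in current Julia without DataArrays. This is an analogy under a stated assumption: `missing` from Base (Julia 1.x) plays the role of `NA` here, and `checked_total` is a hypothetical function constrained the way the `S<:Real` constructor is.

```julia
# `missing` stands in for DataArrays' NA in this sketch.
weights_with_gap = [0.5, missing, 1.5]

# One missing element makes the whole sum missing.
total = sum(weights_with_gap)

# Hypothetical function that, like the constructor, only accepts a Real sum.
checked_total(s::Real) = s

# Passing the missing sum hits a MethodError, mirroring the report above.
failed = try
    checked_total(total)
    false
catch e
    e isa MethodError
end

# Skipping the gaps yields a usable Real total.
usable = sum(skipmissing(weights_with_gap))
```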

@rofinn
Member Author

rofinn commented May 4, 2017

Removing the restriction S<:Real would probably fix this.

@nalimilan
Member

OK. I guess we could remove the S<:Real restriction since we don't really need it. But I'm not actually sure this is a use case we want to support: when do you want to use a missing weight? One can always use a zero weight to ignore an observation.
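
The zero-weight alternative can be sketched as follows. `weighted_mean` is a hypothetical helper written for this illustration, not the StatsBase method.

```julia
# Hypothetical helper: plain weighted mean of x with weights w.
weighted_mean(x, w) = sum(x .* w) / sum(w)

x = [4.0, 2.0, 3.0]

# A zero weight drops the second observation without changing the vector lengths.
w = [1.0, 0.0, 1.0]

result = weighted_mean(x, w)
```

This only works when the ignored observation is itself a finite number: `0.0 * NaN` is `NaN`, so a zero weight does not mask an invalid data value in this formulation.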

@ararslan
Member

ararslan commented May 4, 2017

Perhaps, but I'm not sure it's worth loosening that restriction; I don't think it makes sense to have a weight vector whose sum is missing.

@ararslan
Member

ararslan commented May 4, 2017

Perhaps it would make more sense to require that the input to a weight constructor be a Base.Array rather than any AbstractArray{<:Real}?

@nalimilan
Member

Well, there's no point in restricting the type of underlying array. DataArray{T} is cheating by pretending that its eltype is T when it's actually Union{NAtype, T}.

Let's just remove that test from DataArrays if we don't have a use case for it.

@rofinn
Member Author

rofinn commented May 4, 2017

@ararslan To answer your specific question, it was initially WeightVec{W,Vec<:RealVector} and this PR changed it to WeightVec{S<:Real, T<:Real, V<:RealVector} as I thought that made more sense. I tend to agree that having a weight vector whose sum is missing doesn't really make sense.

@rofinn
Member Author

rofinn commented May 4, 2017

Seems like a better approach for that use case would be to use dropna before?

@ararslan
Member

ararslan commented May 4, 2017

> Seems like a better approach for that use case would be to use dropna before this?

Not necessarily, since you can dropna from your weight vector but you may have reason for legitimately missing values in your data vector, in which case you don't want to dropna from that, and then the vectors have differing lengths, which is an error.

@ararslan
Member

ararslan commented May 4, 2017

> Let's just remove that test from DataArrays if we don't have a use case for it.

Okay ¯\_(ツ)_/¯

@ararslan
Member

ararslan commented May 4, 2017

Cool, DataArrays is all good now.

However, it appears this change still introduced ambiguities with Base methods:

julia> using StatsBase

julia> using Base.Test

julia> detect_ambiguities(StatsBase, Base, Core)
Skipping StatsBase.hist
Skipping StatsBase.wmean!
Skipping Base.<|
3-element Array{Tuple{Method,Method},1}:
 (stdm(v::AbstractArray{T,N} where N where T<:Real, wv::StatsBase.WeightVec, m::AbstractArray{T,N} where N where T<:Real, dim::Int64) in StatsBase at /Users/alex/.julia/v0.7/StatsBase/src/moments.jl:99, stdm(v::AbstractArray{T,N} where N where T<:Real, m::AbstractArray{T,N} where N where T<:Real, wv::StatsBase.WeightVec, dim::Int64) in StatsBase at deprecated.jl:50)
 (varm(A::AbstractArray{T,N} where N where T<:Real, wv::StatsBase.WeightVec, M::AbstractArray{T,N} where N where T<:Real, dim::Int64) in StatsBase at /Users/alex/.julia/v0.7/StatsBase/src/moments.jl:65, varm(A::AbstractArray{T,N} where N where T<:Real, M::AbstractArray{T,N} where N where T<:Real, wv::StatsBase.WeightVec, dim::Int64) in StatsBase at deprecated.jl:50)
 (stdm(v::AbstractArray{T,N} where N where T<:Real, wv::StatsBase.WeightVec, m::Real) in StatsBase at /Users/alex/.julia/v0.7/StatsBase/src/moments.jl:87, stdm(v::AbstractArray{T,N} where N where T<:Real, m::AbstractArray{T,N} where N where T<:Real, dim::Int64) in StatsBase at /Users/alex/.julia/v0.7/StatsBase/src/moments.jl:98)

@rofinn
Member Author

rofinn commented May 4, 2017

Shouldn't the first two go away once those deprecations are removed? Not sure about the last one though.

@ararslan
Member

ararslan commented May 4, 2017

Yeah the first two aren't too worrisome, it's mostly the last one. This previously wasn't a problem because we didn't have WeightVec <: AbstractVector. Not sure what to do about it now.
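
A reduced sketch of why the third ambiguity appears once the weight type is an `AbstractVector`. `SketchVec` and `sketch_stdm` are hypothetical stand-ins for `WeightVec` and `stdm`; the two signatures mirror the shapes reported by `detect_ambiguities`.

```julia
# Hypothetical weight type that is now a vector.
struct SketchVec <: AbstractVector{Float64}
    values::Vector{Float64}
end
Base.size(w::SketchVec) = size(w.values)
Base.getindex(w::SketchVec, i::Int) = w.values[i]

# Shape of the weighted method: (data, weights, scalar mean).
sketch_stdm(v::AbstractArray, wv::SketchVec, m::Real) = :weighted

# Shape of the other method: (data, array of means, dimension).
sketch_stdm(v::AbstractArray, m::AbstractArray, dim::Int) = :along_dim

# An Int third argument matches both: SketchVec is an AbstractArray and Int is a Real.
ambiguous = try
    sketch_stdm([1.0, 2.0], SketchVec([0.5, 0.5]), 1)
    false
catch e
    e isa MethodError
end
```

Neither method is more specific in every argument position, so dispatch cannot choose between them. Before the subtyping change, the weight type did not match `AbstractArray` and only the first method applied.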

@rofinn
Member Author

rofinn commented May 4, 2017

Could we do something like Base.stdm(v::RealVector, w::AbstractWeights, m::Real) and Base.stdm(v::RealMatrix, m::RealArray, dim::Int)?

@ararslan
Member

ararslan commented May 4, 2017

We can't overload a Base function with Base types though; this is the sinful act of type piracy.

@rofinn
Member Author

rofinn commented May 4, 2017

All the more reason for it not to be a Base function? :)

@rofinn
Member Author

rofinn commented May 4, 2017

Wait, does this even count as type piracy, since we're overloading our own behaviour and not Base's? If StatsBase weren't loaded, these calls would result in a MethodError.

@ararslan
Member

ararslan commented May 4, 2017

If we extend a Base function using Base types, regardless of how we define the behavior, we've committed an act of ~~treason~~ type piracy.

@rofinn
Member Author

rofinn commented May 4, 2017

Well in that case we're already committing type piracy with Base.stdm(v::RealArray, m::RealArray, dim::Int).

@ararslan
Member

ararslan commented May 4, 2017

Oh dang you're right. If we get rid of those we should be okay on ambiguities. We should commence... the purge.

@nalimilan
Member

Indeed it's surprising that we ship this method. It's not even documented in StatsBase. Let's remove it, the ambiguities due to deprecations are OK (though they shouldn't be hard to fix if we wanted to).

@rofinn
Member Author

rofinn commented May 5, 2017

I think it makes sense to either keep that method or add it to Base which already has a varm(A::AbstractArray{T,N} where N, m::AbstractArray, region; corrected), but no corresponding method for stdm.

@nalimilan
Member

I guess we could add it to Base. But since it's not exported, it's not really an issue whether a method is missing or not.

@rofinn
Member Author

rofinn commented May 5, 2017

@nalimilan
Member

Ah, cool. I extrapolated incorrectly from covm, which isn't exported.

@ararslan
Member

ararslan commented May 5, 2017

That wouldn't help us for 0.5 and 0.6 though, since if we add the method to Base then it will only be available in 0.7+.

@ararslan
Member

ararslan commented May 5, 2017

We can just define an internal function that behaves like Base.stdm with the appropriate methods, assuming the currently pirated method is being used in any significant way in the package. Otherwise it could just be a simple refactoring.

@rofinn
Member Author

rofinn commented May 5, 2017

As far as I can tell it's only being used here, so we could probably rename it to _stdm, change that one reference for now, and deprecate the previous method in case anyone was using it. I still think the correct long-term solution would be to move the Base stats stuff into StatsBase, but that sounds like a reasonable interim solution.
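
One way the internal-helper idea might look, in Julia 1.x syntax. `_stdm_sketch` is a hypothetical name and body, and the `n - 1` correction is an assumption; the actual StatsBase method is not shown in this thread. Because the function belongs to the package, defining it on Base array types is not piracy.

```julia
# Hypothetical package-internal helper: standard deviation of `v` around the
# known means `m`, reduced along dimension `dim`. Nothing in Base is extended.
function _stdm_sketch(v::AbstractArray{<:Real}, m::AbstractArray{<:Real}, dim::Int)
    n = size(v, dim)
    # Assumes the usual n - 1 correction; m must broadcast against v.
    return sqrt.(sum(abs2, v .- m; dims=dim) ./ (n - 1))
end

v = [1.0 2.0; 3.0 4.0]
column_means = [2.0 3.0]
column_sd = _stdm_sketch(v, column_means, 1)
```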

@ararslan
Member

ararslan commented May 5, 2017

Not worth a deprecation IMO since it's a Base function.

@nalimilan
Member

Certainly not worth a deprecation. I think the easiest solution is to define the method on older Julia versions using a VERSION check (or a method_exists one).
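
Both guards can be sketched as below. `SketchUpstream` is a hypothetical module standing in for Base so the example does not modify Base itself. Note that `method_exists` was renamed `hasmethod` in Julia 0.7, so the sketch uses the newer name.

```julia
# Hypothetical stand-in for the upstream module (Base in the discussion).
module SketchUpstream
    describe(x::Real) = "real"
end

# Guard 1: add the method only if upstream does not already provide it.
if !hasmethod(SketchUpstream.describe, Tuple{AbstractVector})
    SketchUpstream.describe(x::AbstractVector) = "vector"
end

# Guard 2: branch on the running Julia version, as the PR does for triangular dispatch.
const USES_TRIANGULAR_DISPATCH = VERSION >= v"0.6.0-dev.2123"
```

The `hasmethod` form keeps working if the method is later added upstream, whereas the `VERSION` form needs the exact version in which the method appeared.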
