-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add 2-arg versions of findmax/min, argmax/min #35316
Add 2-arg versions of findmax/min, argmax/min #35316
Conversation
I rebased this a few times and github is showing the commits in the wrong order for me on this page. On the repo's page and in the git log they are in the right order. The right order is this:
|
base/reduce.jl
Outdated
while true | ||
y = iterate(p, s) | ||
y === nothing && break | ||
m != m && break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implementation of _findmax(a, ::Colon)
does not handle missing
. I understand this is not part of the original plan for this PR. But it may be nice to just replace the function body with mapfoldl((k, v) -> (v, k), _rf_findmax, pairs(a))
, now that we have all the machinery.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had planned to leave that and the more involved definition in reducedim.jl alone for this PR in the interest of getting it through quickly and with the least chance of introducing breaking changes.
I can just carefully read the code and reimplement both, though :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rewriting _findmax/min(a, ::Colon)
using _rf_findmax/min
also means that we can test _rf_findmax/min
much more heavily via existing tests on/using findmin/max
.
Though I guess doing this later is fine too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used keys
rather than pairs
here, but maybe I should have used pairs
. I'll benchmark in a little bit and maybe change it.
base/reduce.jl
Outdated
julia> findmax(identity, 5:9) | ||
(9, 9) | ||
|
||
julia> findmax(-, 1:10) | ||
(-1, 1) | ||
|
||
julia> findmax(cos, 0:π/2:2π) | ||
(1.0, 0.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it'd be nice to do demonstrate "first match wins" behavior. Maybe something like:
julia> findmax(first, [(1, :a), (2, :b), (2, :c)])
(2, (2, :b))
Maybe this was the intention of cos
example but it is not very obvious to realize this when floating point arithmetic is involved.
base/operators.jl
Outdated
# Implementation | ||
Types should not usually implement this function. Instead, implement `isless`. | ||
""" | ||
isgreater(x, y) = _is_reflexive(x) && _is_reflexive(y) ? isless(y, x) : isless(x, y) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should that be !isless(x, y)
in the final branch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. That's the branch for NaN and missing and I just want those sorted the same as for isless.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This definition is the trickiest part of this PR and I think it deserves at least some comments about it in the source code.
Having said that, I still have a very uneasy feeling about this definition. In particular, proving that isgreater
satisfies transitivity requires that "_is_reflexive(x) && !_is_reflexive(y)
implies isless(x, y)
" holds. This is the case for NaN
and missing
but this is not in the API. It'd be much easier to understand and enforce that isgreater
is a total order [*] if we just use
function isgreater2(x, y)
xisnan = x != x
xisnan isa Bool || return false # x is missing
yisnan = y != y
yisnan isa Bool || return true # y is missing
xisnan && return false
yisnan && return true
return isless(y, x)
end
In other words, using _is_reflexive
may look like it makes isgreater
very generic but it crucially depends on the implementation details.
Anyway, I think this PR has to do at least one of the following:
- Use
isgreater2
(or something equivalent) instead of currentisgreater
. - Add the aforementioned property of
=
-reflexivity andisless
to the part of implementation requirement of=
andisless
.
(Note: Transitivity is very important as otherwise "min" is not going to be associative. If "min" is not associative, we can't have parallel argmin/findmin/minimum/extrema.)
[*] OK this is not quite enough for user-defined missing
-like types but we don't have such facility even for isless
yet.
Maybe it's too late, but it would be awesome if this made it into 1.5. Very handy function. (Of course, it's easy enough to copy the PR code via module FindMin
import Base.findmin, Base.findmax
isgreater(x, y) = _is_reflexive(x) && _is_reflexive(y) ? isless(y, x) : isless(x, y)
_is_reflexive(x) = let eq = x == x
isa(eq, Bool) && eq
end
findmax(f, domain) = mapfoldl(x -> (f(x), x), _rf_findmax, domain)
_rf_findmax((fm, m), (fx, x)) = isless(fm, fx) ? (fx, x) : (fm, m)
findmin(f, domain) = mapfoldl(x -> (f(x), x), _rf_findmin, domain)
_rf_findmin((fm, m), (fx, x)) = isgreater(fm, fx) ? (fx, x) : (fm, m)
end
using .FindMin to get the functionality already). |
Yes, this would be nice to have. What remains to be done here? Is this ready to go? |
I paused on this because I found that the existing methods use an optimisation to avoid calling The new definitions don't have that optimisation and are a bit slower, even when it is guaranteed that there are zero incomparables in the element type of the iterable. I wanted to try and work out why that was, but didn't get around to diving into the I can just not do that and then I think I only need to change the docstrings a bit to make sure we're giving a loose enough guarantee for the weird values. |
I think #35316 (comment) is still a blocker. |
I summarized the current issues around reducers in the " |
Bump. Can we get this merged? I hit the case today, where I needed it. |
The only remaining issues are:
Need to think about if the ::Colon and reducedim methods need to be done now or not. If they're not done they may be inconsistent. Similarly, min, and maximum will be a bit inconsistent with findmax if not updated here, but they're not consistent now, so it probably doesn't really matter. Though would be good to fix. |
I can rebase and stuff later and have a look, but might be good to get a steer on whether it is okay to make the existing funcs a bit slower or not. |
While I can see the motivation for |
Did triage have any opinion about whether it is okay to make The most common values that are not Here's how the function outputs would change: # These won't change
isless((NaN, 10), (NaN, 20)) # true
isless((NaN, 20), (NaN, 10)) # false
# Current implementation
findmax_orig([(NaN, 10), (NaN, 20)]) # (NaN, 10)
findmax_orig([(NaN, 20), (NaN, 10)]) # (NaN, 20)
# Proposed consistent implementation
findmax_new( [(NaN, 10), (NaN, 20)]) # (NaN, 20)
findmax_new( [(NaN, 20), (NaN, 10)]) # (NaN, 20) Implementations and benchmarking codefindmax_orig(a) = _findmax_orig(a, :)
function _findmax_orig(a, ::Colon)
p = pairs(a)
y = iterate(p)
if y === nothing
throw(ArgumentError("collection must be non-empty"))
end
(mi, m), s = y
i = mi
while true
y = iterate(p, s)
y === nothing && break
m != m && break
(i, ai), s = y
if ai != ai || isless(m, ai)
m = ai
mi = i
end
end
return (m, mi)
end
# Smallest possible change (2 lines)
findmax_new(a) = _findmax_new(a, :)
function _findmax_new(a, ::Colon)
p = pairs(a)
y = iterate(p)
if y === nothing
throw(ArgumentError("collection must be non-empty"))
end
(mi, m), s = y
i = mi
while true
y = iterate(p, s)
y === nothing && break
# m != m && break
(i, ai), s = y
if #= ai != ai || =# isless(m, ai)
m = ai
mi = i
end
end
return (m, mi)
end
# This PR
_rf_findmax((fm, m), (fx, x)) = isless(fm, fx) ? (fx, x) : (fm, m)
findmax_2arg(f, domain) = mapfoldl(x -> (f(x), x), _rf_findmax, domain)
# How the existing functions might be implemented with this PR
findmax_fold(a) = findmax_2arg(idx -> a[idx], keys(a))
# Same speed
# findmax_fold(a) = mapfoldl(((k, v),) -> (v, k), _rf_findmax, pairs(a))
# Differences in behaviour
isless((NaN, 10), (NaN, 20)) # true
isless((NaN, 20), (NaN, 10)) # false
# Inconsistant with folding `isless`
findmax_orig([(NaN, 10), (NaN, 20)]) # (NaN, 10)
findmax_orig([(NaN, 20), (NaN, 10)]) # (NaN, 20)
# Consistant
findmax_new( [(NaN, 10), (NaN, 20)]) # (NaN, 20)
findmax_new( [(NaN, 20), (NaN, 10)]) # (NaN, 20)
findmax_fold([(NaN, 20), (NaN, 10)]) # (NaN, 20)
findmax_fold([(NaN, 10), (NaN, 20)]) # (NaN, 20)
using BenchmarkTools
input = rand(Int, 10^6);
@btime findmax_orig($input) # 920 μs
@btime findmax_new($input) # 920 μs
@btime findmax_fold($input) # 1.1 ms
inputf = rand(Float64, 10^6);
@btime findmax_orig($inputf) # 2.3 ms
@btime findmax_new($inputf) # 2.8 ms
@btime findmax_fold($inputf) # 2.8 ms
inputfbig = rand(Float64, 10^9);
@btime findmax_orig($inputfbig) # 2.5 s
@btime findmax_new($inputfbig) # 3.0 s
@btime findmax_fold($inputfbig) # 2.9 s I did put a note in the docs for |
Bump. Would be great to get this rebased with Jameson's suggsted change applied (and without isgreater export). |
dc157d3
to
e4ac3bb
Compare
Done. Sorry for the delay :) Jameson said on slack that making the existing findmax/min slower in the name of consistency is okay, so I've done that too and The work never ends, though, because now |
c087cf9
to
4a10d55
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this implementation of isgreater
is incorrect? I get:
julia> isgreater(x, y) = _is_reflexive_eq(x) && _is_reflexive_eq(y) ? isless(y, x) : isless(x, y)
isgreater (generic function with 1 method)
julia> _is_reflexive_eq(x) = (x == x) === true
_is_reflexive_eq (generic function with 1 method)
julia> min1(x, y) = ifelse(isgreater(x, y), y, x)
min1 (generic function with 1 method)
julia> function isgreater2(x, y)
xisnan = x != x
xisnan isa Bool || return false # x is missing
yisnan = y != y
yisnan isa Bool || return true # y is missing
xisnan && return false
yisnan && return true
return isless(y, x)
end
isgreater2 (generic function with 1 method)
julia> min2(x, y) = ifelse(isgreater2(x, y), y, x)
min2 (generic function with 1 method)
julia> min((NaN, missing), (missing, NaN))
(NaN, missing)
julia> min1((NaN, missing), (missing, NaN))
(missing, NaN)
julia> min2((NaN, missing), (missing, NaN))
(NaN, missing)
Sorry to be persistent, but, as I mentioned in #35316 (comment), I think this pattern ... ? isless(y, x) : isless(x, y)
is hard to reason about and it'd be better to have a verbose and more "honest" implementation like isgreater2
as I've been suggesting. You can probably fix _is_reflexive_eq
so that min1
and min
behave identically. So I don't think the linked comment is the source of the bug. But I think the attempt to "minimize" the code size may be a contributing factor.
I'm sorry that I didn't address that earlier. I thought you were probably right, but after looking into it some more I think my initial approach is better and that julia> isgreater2((missing, NaN), (NaN, missing))
false
julia> isgreater2((NaN, missing), (missing, NaN))
false One of the pair has to evaluate to julia> min2((NaN, missing), (missing, NaN))
(NaN, missing)
julia> min2((missing, NaN), (NaN, missing))
(missing, NaN) Comparing with
That makes the ordering of min(1, NaN) === NaN
min( (1,), (NaN,) ) === (1,)
min( [1], [NaN] ) == [1] If min1(x, y) = ifelse(isgreater(x, y), y, x)
min1(1, NaN) === NaN
min1( (1,), (NaN,) ) === (NaN,)
min1( [1], [NaN] ) # [NaN] That's a change in behaviour, but I think it's better behaviour. We also wouldn't need special sauce for floats or missing and all the other functions would come out consistent and easy. In general, I don't think we can satisfy all of:
There's two ways out of it that I see:
Currently, this PR goes with scenario 2 and I think that's fine. A follow up PR could maybe change Here's some reproducible code that explores the consequences of those two scenarios for three inputs: x = (-1.0, NaN, 1.0)
y = ((NaN, missing), (missing, NaN))
z = ((1, 2), (missing, NaN), (NaN, missing))
# isgreater from this PR
_is_reflexive_eq(x) = (x == x) === true
isgreater(x, y) = _is_reflexive_eq(x) && _is_reflexive_eq(y) ? isless(y, x) : isless(x, y)
# min and minimum compatible with "Scenario 1"
min1(x, y) = ifelse(isgreater(x, y), y, x)
minimum1(itr) = foldl(min1, itr)
# findmin compatible with this PR
findmin_(itr) = reverse(foldl(
((k1, v1), (k2, v2)) -> ifelse(isgreater(v1, v2), (k2, v2), (k1, v1)),
pairs(itr)
))
# Scenario 1
# x, y, z all consistent:
@assert minimum1(x) === NaN === findmin_(x)[1]
@assert minimum1(y) === (missing, NaN) === findmin_(y)[1]
@assert minimum1(z) === (missing, NaN) === findmin_(z)[1]
# Scenario 2
# x is fine, y and z are weird.
@assert minimum(x) === NaN === findmin(x)[1]
@assert minimum(y) === (NaN, missing)
@assert findmin_(y)[1] === (missing, NaN)
@assert minimum(z) === (1, 2)
@assert findmin_(z)[1] === (missing, NaN)
@assert minimum(x) === NaN
@assert minimum(y) === (NaN, missing)
@assert minimum(z) === (1, 2) I agree with your earlier comment that |
Thanks @nalimilan for the suggestion. This is simpler, more extensible and should handle larger unions.
convert(T, Inf) fails for lots of T (e.g. Int).
f20aa8d
to
925a365
Compare
We did previously use that :) It gives the wrong result for containers like tuples (previously I thought it was the right result, but it's not). The new behaviour matches The old behaviour would imply Concretely: isunordered_old(x) = (x == x) !== true
isunordered_new(x) = false
isunordered_new(x::AbstractFloat) = isnan(x)
isunordered_new(x::Missing) = true
isunordered_old( (NaN, 1) ) == true
# This result is required to match the behaviour of `min`.
isunordered_new( (NaN, 1) ) == false I've renamed, added docs, and exported. Edit: The new
Great! I think I've done that and I've rebased, added a NEWS entry for If CI passes then I think this is good to go! |
…accept Inf This used to work, so it would be a regression to break it.
CI passes. I think this is ready to merge. |
At times I doubted this would ever be done ;) I started work on this on the 11th March 2020 and the issue was originally opened on the 16th June 2018. julia> using Dates
julia> today() - Date(2020, 3, 11)
282 days
julia> today() - Date(2018, 6, 16)
916 days Big thanks to @tkf and to the various reviewers! |
Yay! We're using this syntax in our new book. Thanks for all the hard work to get this right! |
Defines a descending total order, `isgreater` (not exported), where unordered values like NaNs and missing are last. This makes defining min, argmin, etc, simpler and more consistent. Also adds 2-arg versions of findmax/min, argmax/min. Defines and exports the `isunordered` predicate for testing whether a value is unordered like NaN and missing. Fixes JuliaLang#27613. Related: JuliaLang#27639, JuliaLang#27612, JuliaLang#34674. Thanks to @tkf, @StefanKarpinski and @drewrobson for their assistance with this PR. Co-authored-by: Jameson Nash <[email protected]> Co-authored-by: Takafumi Arakaki <[email protected]>
Fixes #27613. Related: #27639, #27612, #34674.
Thanks to @tkf, @StefanKarpinski and @drewrobson for their assistance with
this PR.
I (and Stefan) wanted this to use partial order comparisons, but that would not be backwards compatible without a bit more faffing around. If there's interest, I might introduce another PR to explore that. That would introduce two partial orderings that bend the rules for these special cases: infinitesimals, unorderables (
NaN
) andmissing
.I think reviewers should pay particular attention to the
isgreater
function I introduce. There are a lot of corner cases when comparing or ordering values and I'm not sure I've thought of them all. I'm fairly sure that it won't introduce any regressions infindmin
orargmin
(and it actually makes them work withmissing
where before they did not), but it may not be quite reliable enough to export.Despite adding the
isgreater
function, I have not updatedmin
,minimum
, or the one-arg versions offindmin
orargmin
to use it. I can do that, but I wanted to get some review of what I've done first and possibly get this smaller change merged.