Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

naming: setdiff, diff, gradient #26361

Closed
JeffBezanson opened this issue Mar 8, 2018 · 26 comments
Closed

naming: setdiff, diff, gradient #26361

JeffBezanson opened this issue Mar 8, 2018 · 26 comments
Labels
arrays [a, r, r, a, y, s] collections Data structures holding multiple items, e.g. sets deprecation This change introduces or involves a deprecation

Comments

@JeffBezanson
Copy link
Member

JeffBezanson commented Mar 8, 2018

I think there are a couple issues with these names:

  • They have a common suffix but are unrelated.
  • Every other function that begins with set involves assigning something, e.g. setenv, setfield, setrounding

Other possible names for setdiff: difference, complement, without, delete, remove

Other possible names for diff: finitediff, differences, decumulate

@JeffBezanson JeffBezanson added arrays [a, r, r, a, y, s] deprecation This change introduces or involves a deprecation collections Data structures holding multiple items, e.g. sets labels Mar 8, 2018
@Sacha0
Copy link
Member

Sacha0 commented Mar 8, 2018

Ref. #20041 (comment) and downstream discussion regarding diff. Best!

@stevengj
Copy link
Member

stevengj commented Mar 8, 2018

The standard notation for this is A \ B, and some authors use A - B. Would one of those be too confusing?

@KlausC
Copy link
Contributor

KlausC commented Mar 8, 2018

diff: I prefer differences from the proposals. decumulate sounds more general, while I feel, it should be a kind of inverse of cumsum. Then also differences(cumsum(x)) == x or cumsum(differences(x)) == x should hold.

setdiff: I would tend to difference and make \ an alias for that. At the same time deprecate all set operations when applied to Tuple{AbstractVector,AbstractVectorVector}. See also https://en.wikipedia.org/wiki/Set_(mathematics)#Complements

@StefanKarpinski
Copy link
Member

I’m not sure there’s another reasonable short name for the set difference operation; calling this setdiff is quite standard. That said, I’d like an operator for this. I’d be happy to ditch diff and introduce decumulate which is the proper inverse of accumulate.

@TsurHerman
Copy link

I vote strongly against decumulate .. for a non-english speaker this sounds like something Sheldon Cooper from big bang theory would suggest.
diff in Matlab and probably in R and probably in NumPy means finite differencing .. please don't change that name.

If you must change then I vote for finitediff but thats is wrong too because there isn't an infinite diff,
also there isn't any diff by factor 2 or 3, the later cases are covered by gradient so another option is to deprecate diff entirely and rely on gradient instead

@mbauman
Copy link
Member

mbauman commented Mar 8, 2018

I would support a general unaccumulate (which I think is a slightly better word than decumulate, since decumulate still sounds like it's cumulative but just doing the inverse operation), but I'd then also want to change the definition such that:

# pseudocode for one-dimension
unaccumulate(op, A) = unaccumulate(op, A[1], A)
unaccumulate(op, v0, A) = [v0, (op(A[i], A[i-1]) for i in 2:length(A))...]

With this definition, A == accumulate(op, unaccumulate(op, A)) == unaccumulate(op, accumulate(op, A)).

I think this is a bit of a tangent, though, since, like cumsum, we could still provide a helper function with a nicer name. In that vein, maybe uncumdiff :trollface:?

@JeffBezanson
Copy link
Member Author

If the names are getting a bit long it seems fine to me to use accum instead of accumulate.

@Sacha0
Copy link
Member

Sacha0 commented Mar 8, 2018

👍 for Matt's general unaccumulate/decumulate definition. Nice to mitigate the common off-by-one-errors that occur with diff. Best!

@stevengj
Copy link
Member

stevengj commented Mar 8, 2018

I'm skeptical that anyone will comprehend a neologism like unaccumulate, and I'm skeptical of the need for something generic here when people only really seem to use this with -

+1 for merging diff into gradient, with keyword arguments to specify the different possible boundary conditions and output lengths.

Note also that @mbauman's definition of unaccumulate is not equivalent to diff or gradient if you look at the first element — you'd always end up discarding that with popfirst! or something. Seems weird to compute something via a definition that probably no one would actually want (if they are coming from diff)…

@Sacha0
Copy link
Member

Sacha0 commented Mar 8, 2018

(Note that gradient is deprecated.)

@mbauman
Copy link
Member

mbauman commented Mar 8, 2018

I've used diff a ton in Matlab where you need to do all sorts of contortions to vectorize code. I find I don't reach for it nearly as often in Julia. Looking back through my crusty old code (which includes 100k+ LOC written by myself and other engineers), I find that we used it in predominantly three different ways:

  • Because I actually want to look at the sequential differences: mean(diff(A)) or median(diff(A)).
  • Because I want to relate it back to the original array. In these cases we almost always used explicit padding like find(diff([0 waveform]) ~= 0) or [1 diff(A)] or add .5 to their "axis" so they plot in-between elements.
  • As a shorthand for A[2] - A[1] or subtracting two columns with diff(A(:,1:2),[],2) for A[:,2] .- A[:,1]

The third one is by far the most common (I'd estimate roughly 80%)… but it's largely irrelevant in Julia since diff([x,y]) isn't y-x but rather [y-x].

Uses 1 and 2 are actually quite distinct from each other, and the conflation between the two has been a large source of bugs in my experience. Here's one of my favorite examples, which includes both use-cases and two different ways of dealing with off-by-one errors:

A([1; find(diff(A)>2*median(diff(A)))+1])

So, yes, my desire to change the behavior of diff was very intentional… but upon reflection I'm not sure it's as big of a win as I had thought.

@bramtayl
Copy link
Contributor

bramtayl commented Mar 9, 2018

I think I've seen python overload arithmetic for sets. Maybe + for union, * for intersect, - for setdiff, etc?

@stevengj
Copy link
Member

stevengj commented Mar 9, 2018

Python uses | for set union, & for set intersection, - for set difference, and ^ for the symmetric difference. These are aliases for union, intersection, difference, and symmetric_difference.

Julia has synonyms for (\cup) for union and (\cap) for intersect. The set-difference operator in Unicode is (U+2216), but this is so similar in appearance to \ (backslash) that I doubt we would want as a distinct operator. (Currently it is unsupported by the parser, I think for this reason.)

I think we don't currently have a symmetric difference function for sets?

@KlausC
Copy link
Contributor

KlausC commented Mar 9, 2018

I propose to use backslash (like reverse division) and deprecate all set operations on AbstractArray (currently set operations with vectors are allowed).

@stevengj
Copy link
Member

stevengj commented Mar 9, 2018

The problem with \ is that if I see A \ B in Julia code I assume it is a linear solve.

Honestly, I think setdiff is fine. I'm doubtful that this operation is so common in Julia code that it needs a binary operator.

@JeffBezanson JeffBezanson changed the title naming: setdiff, diff naming: setdiff, diff, gradient Mar 19, 2018
@JeffBezanson JeffBezanson added the triage This should be discussed on a triage call label Mar 19, 2018
@oschulz
Copy link
Contributor

oschulz commented Mar 20, 2018

Personally, I use things like mean(diff(A)) a lot, especially to take a quick look at data properties, etc. It would be nice to keep this (even under a different name).

@GunnarFarneback
Copy link
Contributor

Tangential, but I'm a little puzzled by the interest in mean(diff(A)) as it can be optimized to (A[end] - A[1]) / (length(A) - 1) and thus only depends on the endpoints and length of A (or slightly more complicated if A is more than one-dimensional).

@oschulz
Copy link
Contributor

oschulz commented Mar 21, 2018

True, but that was just an example (since @mbauman mentioned it, together with median). For median(diff(A)), var(diff(A)), etc., there isn't such a simple shortcut. :-)

@JeffBezanson JeffBezanson removed the triage This should be discussed on a triage call label Mar 22, 2018
@StefanKarpinski
Copy link
Member

Triage decided not to do anything here.

@juliohm
Copy link
Contributor

juliohm commented Jan 4, 2020

Came across this issue today while writing some set differences in some of my own packages. No chance to reconsider the - (minus) or \ (backslash) notation for Set types only? I understand the possible confusion when there are both vectors (as in vector spaces) and sets in the same line of code, but does that confusion eliminate the usefulness of the notation for sets? As mentioned above, we have the nice union and intersection notation. Adding the set difference to these symbols would be great.

(A  B) - (A  B)

reads so much nicer than

setdiff(A  B, A  B)

@StefanKarpinski
Copy link
Member

I will frequently, if I'm doing a lot of setdiff operations just locally define const \ = setdiff and then use backslash. This work pretty well since it's uncommon to do a lot of linear solves in the same codebase as one does lots of set diffs.

While it's somewhat tempting for the syntax, I'm pretty uncomfortable trying to actually stuff setdiff into either - or \: it's clearly not related to the \ operation on matrices and if you have - for setdiff then it's so very tempting to have + for set unions, which is a slippery slope I really don't think we want to go down since sets form a lattice rather than an abelian group.

@StefanKarpinski
Copy link
Member

It's rather unfortunate that the set minus symbol looks identical to backslash symbol \ in so many fonts. If it were reliably double wide, then I think it would be fine to give it this meaning.

@juliohm
Copy link
Contributor

juliohm commented Jan 5, 2020 via email

@StefanKarpinski
Copy link
Member

It's quite important for generic programming to work in Julia that we are strict about generic functions having a single meaning. So far that is working remarkably well. Let's not throw it out just so that we can have a little syntactic convenience.

@juliohm
Copy link
Contributor

juliohm commented Jan 5, 2020

You mean introducing the unicode syntax would restrict the meaning of setdiff? How does it restrict meaning, I don't get it, appreciate if you can clarify.

@StefanKarpinski
Copy link
Member

There's no problem with introducing a Unicode symbol for setdiff. However, the obvious Unicode character,, looks almost identical to \, which could potentially lead to confusion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrays [a, r, r, a, y, s] collections Data structures holding multiple items, e.g. sets deprecation This change introduces or involves a deprecation
Projects
None yet
Development

No branches or pull requests