Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

map_values and select work together? #970

Closed
rikrd opened this issue Oct 1, 2015 · 36 comments
Closed

map_values and select work together? #970

rikrd opened this issue Oct 1, 2015 · 36 comments

Comments

@rikrd
Copy link

rikrd commented Oct 1, 2015

Hi, shouldn't this work?

https://jqplay.org/jq?q=.toto%20|%20map_values(select(.x==%22t%22))&j={%22toto%22:%20{%22a%22:%20{%22x%22:%20%22t%22},%20%22b%22:%20{%22x%22:%20%22t%22},%20%22c%22:%20{%22x%22:%20%22r%22}}}

The output I was expecting was:

"a": {"x": "t"}, "b": {"x": "t"}
@nicowilliams
Copy link
Contributor

Indeed, it doesn't. But for map(f) it does.

@pkoppstein
Copy link
Contributor

map_values(select(_)) is unlikely to be useful, because of the way empty works.

You can use with_entries to achieve the desired effect:

$ jq -nc 'def data: {"toto": {"a": {"x": "t"}, "b": {"x": "t"}, "c": {"x": "r"}}};
          data | .toto | with_entries(select((.value|.x) == "t"))'

produces:

{"a":{"x":"t"},"b":{"x":"t"}}

@nicowilliams
Copy link
Contributor

Not because of the way empty works but because of the way |= works.

@nicowilliams
Copy link
Contributor

Even if |= handled backtracking from the update expression correctly, since map_values/1 uses |= the end result would be that non-selected keys would not get updated. There's a way to express what @rikrd wants, but by writing a function that uses a reduction over the input's values and which deletes those keys for which the given filter produces no values.

@pkoppstein
Copy link
Contributor

@nicowilliams wrote:

Not because of the way empty works but because of the way |= works.

If we use null instead of the empty that is used by select, we get:

jq -c '.toto | map_values(if .x == "t" then . else null end)' toto.json
{"a":{"x":"t"},"b":{"x":"t"},"c":null}

Thus I think it's reasonable to say that it's the black-hole-like behavior of "empty" that makes the difference here.

@nicowilliams
Copy link
Contributor

@pkoppstein Really, this is NOT about empty. It's about |= not handling RHS expressions that backtrack without producing values (i.e., they evaluate the same as empty).

@pkoppstein
Copy link
Contributor

@nicowilliams - The difference between:

.toto | map_values(if .x == "t" then . else null end)

and

.toto | map_values(if .x == "t" then . else empty end)

is that in the second case, "empty" appears instead of "null". Furthermore, any JSON value in the place of that null will not cause the "implosion" that "empty" produces. So whatever else can be said, it's clear that it's "empty" that makes the difference here. Of course if you were to change |= then that would make a difference, too, but we're talking here about the behavior of jq-as-it-is, not jq as-it-might-be.

@nicowilliams
Copy link
Contributor

But it's nothing about empty and everything about |=. Please do not confuse others as to this.

@nicowilliams
Copy link
Contributor

Call it a bug or shortcoming of map_values if you wish, but not of empty. We should not make empty seem mysterious.

@pkoppstein
Copy link
Contributor

@nicowilliams wrote:

Please do not confuse others

Hopefully "others" will speak up if they feel the need to. I never spoke about short-comings in this thread, least of all about empty; on the contrary, I suggested to the OP that with_entries would do what was wanted.

@nicowilliams
Copy link
Contributor

Here's a version of map_values/1 that works as the OP expected:

def map_values(f):
  reduce (path(.[])) as $p (.;
    ([getpath($p)|first(f)]) as $r |
    if ($r|length) == 0 then delpaths([$p])
    else setpath($p; $r[0]) end
  );

This deletes any keys in the input when the filter produces no outputs, otherwise it updates the value for each key to the first output of the given filter on the value at that key. This isn't quite right as it doesn't work for array inputs for the obvious reason that if we delete items from the array while traversing it forwards... we get confused. But it works for object inputs, and making it also work for array inputs is left as an exercise for the reader :) EDIT: Hint: replace the expression to reduce over with if type=="array" then range(length; -1; -1)|[.] else path(.[]) end.

Example:

$ jq '
  def map_values(f):
    reduce (path(.[])) as $p (.;
      ([getpath($p)|first(f)]) as $r |
      if ($r|length) == 0 then delpaths([$p])
      else setpath($p; $r[0]) end
    );
    map_values(select(.>2)+1)'
{"a":0,"b":5}
{
  "b": 6
}
$ 

And as you can see, empty isn't the issue. The reduction, by the way, is very similar to the reduction in |= (which is def _modify(paths; update): reduce path(paths) as $p (.; setpath($p; getpath($p) | update));), but it gives meaning to the RHS backtracking without producing values.

We couldn't make |= behave like this because some users might expect the RHS producing no values to leave the value alone. But this would have been the better behavior for |=, IMO.

Someday I'll make it possible to import specific versions of the "jq core" module, then we could in fact make backwards-incompatible changes to things like |=.

@nicowilliams
Copy link
Contributor

Or, perhaps since the |= RHS backtracking with no values results in what we might called "undefined behavior", maybe we can make it mean "delete the value at the LHS path when the RHS outputs no values". Hmmm. @dtolnay, what think ye?

@dtolnay
Copy link
Member

dtolnay commented Oct 2, 2015

I think there are 4 possibilities we need to evaluate.

  1. {a:0} | .a |= empty --> null

    This is the current behavior, which I don't think is justifiable. I view this as a bug. It is an unintended consequence of the specific way |= is built on top of reduce.

  2. {a:0} | .a |= empty --> {}

    This is "backtrack means remove," which seems to be popular in this thread so far. Personally I think this may be the most powerful in the hands of someone who understands what it does. At the same time, I don't think it is the right thing to optimize for. Having to update OR remove, but not knowing which, seems unusual. If you need to remove, use syntax designed for removing things rather than for updating things.

  3. {a:0} | .a |= empty --> {a:0}

    This is "backtrack means leave alone." It seems to be popular in Unexpected "empty" behaviour #873 and Assignment operators should handle backtracking (fix #873) #879 and @nicowilliams even made some progress on implementing it.

  4. {a:0} | .a |= empty --> empty

    In my opinion, this is "|= behaves correctly," i.e. the result is the same as {a: (0 | empty)}. This is my favorite, but unfortunately this interpretation contradicts the following statement in the manual:

    If the right-hand side outputs multiple values, only the last one will be used.

    That is, {a:0} | .a |= (1,2) is explicitly documented to be different from {a: (0 | (1,2))}. Maybe this is something to clean up in a 2.0-style break.

    Please let me know if I am failing to notice any downsides of this interpretation, especially if the behavior documented in the manual is commonly useful. I defer to @pkoppstein here because he has way more experience as a user of jq than I do.

Overall my favorite at this point is option 4. I don't think it makes sense to leave option 1 to go to option 2 or 3, and later have a second break to go to option 4. Maybe we should document this case as unspecified behavior, and plan to do option 4 as part of a bigger break - unless of course I am missing downsides of option 4.

@nicowilliams
Copy link
Contributor

@dtolnay Regarding (2), the author of the |=-using expression may not know if the RHS expression might empty, after all, the RHS might just be an invocation of a closure argument to the function containing the |=-using expression.

Yes, I've decided I like (2) better than (3).

As to having to update OR remove, the way the reduction in |= would work would be to keep an array of two things: the value being updated, and the paths to delete at the end, then at the end the value and paths-to-delete would be extracted, paths deleted, value output.

def _modify(paths; update): reduce ... ([., []]; ...) | .[1] as $delpaths | .[0] | delpaths($delpaths);

Something like (UNTESTED):

def _modify(paths; update):
  reduce (path(paths)) as $p ([.,[]];
    ([.[0]|getpath($p)|first(f)]) as $r |
    if ($r|length) == 0 then [.[0],.[1]+[$p])
    else [.[0]|setpath($p; $r[0]),.[1]] end
  ) | .[1] as $delpaths | .[0] | delpaths($delpaths);

@pkoppstein
Copy link
Contributor

First, thanks to @dtolnay for laying out the alternatives so clearly and reasonably.

Second, I'd agree with @dtolnay that (1) can safely be viewed as a bug. Indeed, the following variant might well be viewed as a serious bug:

{a:1, b:2} |= (.a = empty) => null

At any rate, I cannot imagine many people intentionally relying on such behavior.

Regarding the merits of (2), (3) and (4), I'd say that, although (4) has the advantage of consistency with the ordinary understanding of |=, it has the drawback of being fairly useless: inexpert users will (sooner or later) find it surprising or annoying; and expert users will almost always want to avoid it.

Regarding the statement in the manual that: "If the right-hand side outputs multiple values, only the last one will be used.", I have a question:

If we were free of that constraint, would that make a difference with respect to the choice between (2) and (3)?

@nicowilliams
Copy link
Contributor

@pkoppstein I agree that (4) would be surprising and useless. To break out of a looping control structure (|= loops over the paths in . matched by the LHS) we should use break (or error).

As to taking the first or last value output by the RHS, it makes no difference to the choice between (2) and (3) (in my _modify above replace first/1 with last/1 to match the manual).

(3) makes some sense, but if you want to leave a path/value unchanged, just output the value (the RHS' input). OTOH, (2) could be surprising too, and (3) seems like the least surprising behavior -- just not the most useful.

I suppose we could introduce a new builtin/syntax to use instead of empty that would allow one to delete a path/value from the RHS. This would have the advantage of being more obvious than (2). EDIT: And, coupled with (3), not surprising at all.

@pkoppstein
Copy link
Contributor

@nicowilliams wrote:

As to taking the first or last value output by the RHS ...

The issue that @dtolnay and I had in mind was whether |= should subvert the normal backtracking behavior.

That is, [{a:0} | .a = (1,2)] | length is 2, as is [{a:0} | .a += (1,2)] | length, but change "=" to "|=", and the result is 1.

Perhaps I can be persuaded otherwise, but this seems like a misfeature that is severe enough to justify an imminent jq 2.0.

@dtolnay
Copy link
Member

dtolnay commented Oct 2, 2015

I can see why (4) may be fairly useless. I agree that (2) is better than (3), and I think we can safely move to (2) as long as we preserve the current behavior when one or more values are produced by the rhs.

There are a few more cases to pin down before coding this up:

  • [0,1,2] | .[1] |= empty

    Given (2), I don't think it is controversial that this should return [0,2]. The current behavior is null.

  • [0,1,2] | .[1:2] |= empty

    This should be the same as .[1:2] |= [] right?

  • [0,1,2] | . |= empty

    Should this be null or empty?

  • [0,1,2,3,4,5] | .[1,3] |= empty

    Should this be [0,2,3,5] or [0,2,4,5]?

@dtolnay
Copy link
Member

dtolnay commented Oct 2, 2015

@pkoppstein do you mean [{a:0} | .a = (1,2)] | length and [{a:0} | .a += (1,2)] | length should be changed to 1 to match our plan for |=?

@pkoppstein
Copy link
Contributor

@dtolnay asked:

do you mean [{a:0} | .a = (1,2)] | length and [{a:0} | .a += (1,2)] | length should be changed ...

Absolutely not. I was, on the contrary, suggesting that |= should be brought back into conformance with jq's general approach to backtracking. The question in my mind is whether this would "require" a jump in numbering to 2.0, or whether that statement in the manual can be regarded as a bug.

@nicowilliams
Copy link
Contributor

@dtolnay For the first corner case I already gave an answer (delpaths after all updates are done). For the second corner case I'd say if the RHS is empty then delete that slice. For the third corner case the answer is that . |= empty should behave the same as empty (this requires noticing that we're saying "delete ." and then breaking out of the reduction to backtrack). For the fourth corner case, hopefully delpaths takes care of this automatically (it does!) by deleting higher array indices first, thus producing [0,2,4,5] in your example.

@dtolnay
Copy link
Member

dtolnay commented Oct 2, 2015

I agree that [0,2,4,5] is the correct result in that example, but that is not consistent with [0,1,2,3,4,5] | (.[1:2],.[3:4]) |= [] which currently returns [0,2,3,5] (same for =). I guess that is a bug.

@nicowilliams
Copy link
Contributor

Yeah, I'd call that a bug.

@nicowilliams
Copy link
Contributor

@pkoppstein When you write things like "whether |= should subvert the normal backtracking behavior", I have no idea what you mean (is it code for how = and |= are different? something else?). From a pure VM point of view nothing in jq "subverts normal backtracking behavior" -- jq functions get to define how they handle the case where a closure argument backtracks without producing values.

I really don't want to re-litigate the behavior of = vs. |= -- that's a different issue, and the only thing on the table here is |= behaviors (1) through (4) as outlined by @dtolnay.

@pkoppstein
Copy link
Contributor

@nicowilliams wrote:

I really don't want to re-litigate the behavior of = vs. |=

In that case, that is, if {"a":0} | .a = empty must yield the empty stream, then I think @dtolnay is probably right: option (4) for |= wins for consistency, no?

@nicowilliams
Copy link
Contributor

@pkoppstein No, = and |= are already inconsistent and, as I've explained before, almost necessarily so. I think wanting them to be consistent is what might qualify as a "foolish consistency". Useful behavior is more important than consistent behavior.

That said, we could make = more like |= at the price of performance thusly:

def _assign(paths; value): . as $dot | reduce path(paths) as $p (.; setpath($p; $dot | value));

or more backwards-incompatibly:

def _assign(paths; value): reduce path(paths) as $p (.; setpath($p; value));

I think it's too late for this, and I'm absolutely not interested in it nor in a 2.0 for this either.

I do note that we could declare _assign/2 and _modify/2 to be interfaces, and even provide builtin modules that provide alternative ones. Then you could have N behaviors for = and |=.

And there, we've re-litigated this :(

@nicowilliams
Copy link
Contributor

For comparison, = currently is:

def _assign(paths; value): value as $v | reduce path(paths) as $p (.; setpath($p; $v));

This means that if value outputs multiple values then the reduction happens once for each output of value, which is a bit surprising.

For other readers, LHS = RHS expands to _assign(LHS; RHS) and LHS |= RHS expands to _modify(LHS; RHS). _modify is currently defined as:

def _modify(paths; update): reduce path(paths) as $p (.; setpath($p; getpath($p) | update));

Here the update closure is invoked in the reduction, which is why multiple outputs of the RHS in |= yield very different behavior than for =.

We could define _modify thusly:

def _modify(paths; update): path(paths) as $p | (getpath($p) | update) as $v | setpath($p; $v);

to attempt to get similar behavior to =, but this will fail to be similar to =. There is no way to make |= similar to =. But we can make = more similar to |= in a variety of ways.

@nicowilliams
Copy link
Contributor

@pkoppstein Ah, perhaps you just want more cartesian product behavior because that's so easy to get in jq. But because function arguments are closures, functions can use them to build new control structures -- cartesian product behavior needn't always be what you get. I don't believe we can get anything like useful cartesian product behavior out of |=, but maybe I'm not seeing something obvious.

@pkoppstein
Copy link
Contributor

@nicowilliams wrote:

Ah, perhaps you just want more cartesian product behavior because that's so easy to get in jq.

I thought that was obvious.

maybe I'm not seeing something obvious.

Yes, e.g. generate-and-test. If it's good enough for = it should be good enough for |=.

@nicowilliams
Copy link
Contributor

@pkoppstein No, as I said, when you write about "subverting normal backtracking behavior" I have no clue what you mean. If you have a _modify/2 to propose, let's see it, but I can't think of a way to define one that would give you what you want while still being useful at all (I grant that I'm not trying hard).

@pkoppstein
Copy link
Contributor

@nicowilliams - @dtolnay favored #4 (with backtracking, as I understand it), so I would expect he doesn't see any problem implementing it. In fact, if you would agree beforehand to give serious consideration to his implementation of #4 (with backtracking), maybe he'd be less reluctant to work on it :-)

@nicowilliams
Copy link
Contributor

nicowilliams commented Oct 4, 2015 via email

@pkoppstein
Copy link
Contributor

@nicwilliams wrote:

Tomorrow ...

I'm greatly honored! But if it's not possible to implement |= in a way that ensures .a |= STREAM will be comparable in performance to .a = STREAM, then maybe it would be better not to invest too much time in the effort. Btw, it's my turn to go a-traveling for a couple of weeks, which will probably result in some delays in my responses. Hopefully @dtolnay and others are following and will provide feedback. Thanks.

@amckinlay
Copy link

amckinlay commented Jan 23, 2017

I experienced this issue today. The manual implies map_values(x) is the same thing as map(x), but returns objects when an object is passed. As a new user, I spent a good 15 minutes trying to figure out what I was doing wrong with map_values(select(x)), as I thought map_values(x) was a drop-in replacement for map(x) and the manual states map(select(x)) is a useful pattern.

@pkoppstein
Copy link
Contributor

pkoppstein commented Jan 23, 2017

@amckinlay - As pointed out in this thread, the simplest workaround is to avoid using map_values as currently implemented. Here is an alternative definition that you may wish to use in its stead:

# The following variant of map_values/1 can be used with select/1 so that 
# map_values(select(f)) will select key-value pairs
def map_values(f):
   with_entries( [(.value|f)] as $v | if $v | length == 0 then empty else .value = $v[0] end  ) ;

If you find this or any other definition of map_values/1 to be superior to the one currently provided (in builtin.jq), then please indicate as much here.

@StephenWithPH
Copy link

I stumbled upon this issue after having a similar problem. Of note, I have jq-1.5-1-a5b5cbe (what comes with Ubuntu).

According to the 1.6 playground (https://jqplay.org/s/jprJeMQtYl), this now works as expected! 🎉

@itchyny itchyny closed this as completed Jul 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants