Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explicit delegation to target class for ArrayBag #5189

Merged
merged 38 commits into from
Sep 4, 2024

Conversation

robsyme
Copy link
Collaborator

@robsyme robsyme commented Jul 31, 2024

Solves #5187

Copy link

netlify bot commented Jul 31, 2024

Deploy Preview for nextflow-docs-staging ready!

Name Link
🔨 Latest commit a856025
🔍 Latest deploy log https://app.netlify.com/sites/nextflow-docs-staging/deploys/66d8602e9e9dc2000877dbec
😎 Deploy Preview https://deploy-preview-5189--nextflow-docs-staging.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@pditommaso pditommaso changed the title Explicit delgation to target class for ArrayBag Explicit delegation to target class for ArrayBag Aug 5, 2024
@pditommaso
Copy link
Member

@robsyme is this PR solving the issue for you?

@robsyme
Copy link
Collaborator Author

robsyme commented Aug 5, 2024

Yup, the changes solve both the very minimal example in the new test included in this PR and also the slightly more complicated example in #5187

@robsyme robsyme force-pushed the ensure-arraybag-equivalence branch from b6cca1f to f5c7942 Compare August 25, 2024 19:06
@robsyme robsyme requested a review from a team as a code owner August 25, 2024 19:06
@robsyme robsyme force-pushed the ensure-arraybag-equivalence branch from f5c7942 to 088fc7e Compare August 25, 2024 19:11
Signed-off-by: Rob Syme <[email protected]>
@robsyme robsyme removed the request for review from a team August 25, 2024 19:16
@robsyme robsyme force-pushed the ensure-arraybag-equivalence branch from 23675e4 to 1daf22a Compare August 25, 2024 21:21
@robsyme
Copy link
Collaborator Author

robsyme commented Aug 25, 2024

If the point of the ArrayBag class is to provide a container for objects that order-invariant, it makes sense to ensure that hashCode and equivalence methods also reflect this property. The commented-out equals method only tests for length equivalence and a containsAll test, which would fail when testing ArrayBags with the same items with differing frequencies, e.g. [1,1,2] and [1,2,2]. I added a slighty more computationally costly equals method, but it will not fail for these cases. I also added a JoinOp unit test for this case.

@pditommaso
Copy link
Member

This opens an interesting point: should [1,1,2] be the same of [1,2,2] ?

@pditommaso
Copy link
Member

pditommaso commented Sep 3, 2024

Also it results into an inconsistency that hashCode is honoured, but it does not equals and == operator. If you check this tests will fail

     given:
     def alpha = new ArrayBag(['abc',123,'x', 9])
     def delta = new ArrayBag([123,9,'abc','x'])

     expect:
     alpha == delta
     alpha.equals(delta)

@bentsherman
Copy link
Member

ArrayBag itself seems over-complicated and contradictory. It's basically trying to be a List in some places and a Set in others. Instead I would argue that you should just use a List or a Set depending on what you want.

Here are all the places where ArrayBag is used:

  • collect operator: produce a List if sorting is enabled, otherwise produce a Set
  • groupTuple operator: same as above
  • process glob input: it is an unordered collection, use a Set

The two operators should have return type Collection, so that in general order is not guaranteed in general, but can be depending on runtime conditions. Or even better, separate the sort functionality from collect and groupTuple so that they always return Set, then the sorting can be done is a separate operation.

Glob patterns in general are unordered, so it doesn't make sense to give the illusion of ordering in this case. What does files('*.fastq')[0] mean? It has no meaning

If we do all of that, we can remove ArrayBag and avoid all of this mess

@bentsherman
Copy link
Member

Example of sorting groups explicitly after a groupTuple (source):

// ...
    | groupTuple
    | map { group, items ->
      def sorted = items
        .sort { item -> item[0].id }
        .collect { meta, csv -> csv }
      return [group, sorted]
    }

@robsyme
Copy link
Collaborator Author

robsyme commented Sep 4, 2024

Thanks for doing that analysis Ben, I agree that replacing the ArrayBag seems very sensible.

pditommaso and others added 5 commits September 4, 2024 09:53

Signed-off-by: Paolo Di Tommaso <[email protected]>
Co-authored-by: Jordi Deu-Pons <[email protected]>
Co-authored-by: Adam Talbot <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
pditommaso and others added 20 commits September 4, 2024 09:53
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
This commit disables the AWS Batch spot auto-retries. 

The main reasons to disable this capability is:

* The same tasks can be re-tried multiple times incurring in significant spending increase with the user is a aware of that
* The AWS automatic retry re-execute a task in the same working directory because it's not directly managed by nextflow. This can introduce nasty side effects with partial/corrupted data left in a previous execution
* There's not log/visual feedback during the pipeline execution, because it's managed directly by AWS Batch.

User can still enable this capability by setting the following option:

```
aws.batch.maxSpotAttempts = n 
```

where n is a integer > 0


Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Ben Sherman <[email protected]>
Co-authored-by: Ben Sherman <[email protected]>
This commit disables the automatic retry made by Google Batch when a spot instance is reclaimed.

The main reasons to disable this capability is:

* The same tasks can be re-tried multiple times incurring in significant spending increase with the user is a aware of that
* The Google automatic retry re-execute a task in the same working directory because it's not directly managed by nextflow. This can introduce nasty side effects with partial/corrupted data left in a previous execution
* There's not log/visual feedback during the pipeline execution, because it's managed directly by Google Batch.

User can still enable this capability by setting the following option:

```
google.batch.maxSpotAttempts = n 
```

where n is a integer > 0


Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Docs pages under the developer/ directory (workflow diagram, packages, core plugins etc) had a broken Seqera logo in the sidebar.

Fixes this by using a Sphinx function to always use a path for that file that's relative to the build root.

Signed-off-by: Phil Ewels <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
The default boot image is batch-cos, but it may not be desired in all
situations. In particular, there is an ongoing issue in which the
batch-cos image does not retry pulling a docker image if there was a
network issue. The user may also have some some other configuration
pre-configured in their custom boot image.

This commit adds the ability to specify a custom boot disk image by using 
the configuration option 

```
google.batch.bootDiskImage = '<NAME>'
```


Signed-off-by: Siddhartha Bagaria <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Co-authored-by: Paolo Di Tommaso <[email protected]>
@pditommaso
Copy link
Member

The main reason for ArrayBag is to not invalidate the cache when files are emitted in different order

https://github.com/nextflow-io/nextflow/blob/ensure-arraybag-equivalence/modules/nf-commons/src/main/nextflow/util/HashBuilder.java#L158-L159

What is tricky is that hashCode and equals is honoured when invoked by Java SDK (and therefore fixes the JoinOp), and it is not when invoked by Groovy (and pipeline code).

Given that adding hashCode and equals methods at least fixes the join implementation, I think it's fine to add them with a big notice.

I would avoid however the check for differing frequencies tho.

Do you agree?

@bentsherman
Copy link
Member

The main reason for ArrayBag is to not invalidate the cache when files are emitted in different order

my point is that this is already supported for Set, so ArrayBag should just be replaced with Set. if the order of a collection doesn't matter for purposes of caching, it shouldn't matter for the user either.

@pditommaso
Copy link
Member

But Set is not a List! there's a lot of stuff (including) pipeline code that rely on that

@bentsherman
Copy link
Member

If the order doesn't matter... then they shouldn't be relying on it. Doing something like foo[0] on an ArrayBag is non-deterministic. Tbh it might explain why people have mysterious caching issues from time to time

That being said, I would be willing to defer this change to the introduction of static type checking. We'll have to clean up APIs like this anyway and then we can have a feature flag like nextflow.enable.types to make it opt-in

@pditommaso
Copy link
Member

Sound like a reasonable plan

@pditommaso pditommaso merged commit e7dc0d6 into master Sep 4, 2024
21 checks passed
@pditommaso pditommaso deleted the ensure-arraybag-equivalence branch September 4, 2024 14:06
@bentsherman
Copy link
Member

I forgot that Set requires uniqueness, which makes it the wrong abstraction for collect and groupTuple (should be fine for file globs though).

The Bag interface is actually the right abstraction for collecting values from a channel, as it is a generic collection, but serves as a marker for the hash builder to hash it like a set.

The problem is that ArrayBag implements List, which gives a false sense that it is ordered. If a user starts doing indexed access on an ArrayBag, they will be relying on non-deterministic behavior that could cause cryptic resume issues -- even though the ArrayBag itself is deterministic, their usage of the ArrayBag could lead to non-deterministic hashes downstream.

You mentioned frequencies as a way to test equality and that it would be expensive, but... what if we just implement Bag as a frequency map instead of an array list? Call it a HashBag. The elements already need to be hashable so it shouldn't break anything. Then the equality test is the same as for a map.

It looks like Guava even has this, it's called a Multiset:

A collection that supports order-independent equality, like Set, but may have duplicate elements. A multiset is also sometimes called a bag.

@bentsherman
Copy link
Member

In any case, I think the migration plan is the same, since replacing ArrayBag with Multiset would drop indexing, better to change it as part of the static type API.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Join behaviour non-deterministic if key contains a nextflow.util.ArrayBag object
6 participants