Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add MultimapState API #23491

Merged
merged 13 commits into from
Dec 16, 2022
Merged

Add MultimapState API #23491

merged 13 commits into from
Dec 16, 2022

Conversation

zhengbuqian
Copy link
Contributor

@zhengbuqian zhengbuqian commented Oct 5, 2022

Adds the API and and in-memory implementation for multimap state. This is the first PR for #22831. Doc - Beam MultimapState.

#23492 is the PR for the windmill impl.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI.

@codecov
Copy link

codecov bot commented Oct 5, 2022

Codecov Report

Merging #23491 (e1d765b) into master (7f54035) will decrease coverage by 0.71%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #23491      +/-   ##
==========================================
- Coverage   73.33%   72.62%   -0.72%     
==========================================
  Files         719      736      +17     
  Lines       95792    97267    +1475     
==========================================
+ Hits        70250    70638     +388     
- Misses      24231    25318    +1087     
  Partials     1311     1311              
Flag Coverage Δ
python 81.83% <ø> (-1.23%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...python/apache_beam/examples/complete/distribopt.py 0.00% <0.00%> (-98.57%) ⬇️
...hon/apache_beam/examples/cookbook/mergecontacts.py 23.21% <0.00%> (-73.22%) ⬇️
...he_beam/examples/cookbook/multiple_output_pardo.py 37.50% <0.00%> (-58.34%) ⬇️
...python/apache_beam/examples/wordcount_debugging.py 38.77% <0.00%> (-57.15%) ⬇️
...python/apache_beam/examples/dataframe/wordcount.py 38.46% <0.00%> (-53.85%) ⬇️
...s/python/apache_beam/examples/wordcount_minimal.py 40.74% <0.00%> (-51.86%) ⬇️
sdks/python/apache_beam/examples/wordcount.py 48.27% <0.00%> (-44.83%) ⬇️
...apache_beam/examples/cookbook/custom_ptransform.py 53.65% <0.00%> (-41.58%) ⬇️
.../apache_beam/examples/cookbook/group_with_coder.py 51.16% <0.00%> (-34.89%) ⬇️
sdks/python/apache_beam/examples/complete/tfidf.py 73.43% <0.00%> (-24.95%) ⬇️
... and 95 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@zhengbuqian zhengbuqian changed the title [DRAFT] Add MultimapState API Add MultimapState API Oct 12, 2022
@zhengbuqian zhengbuqian marked this pull request as ready for review October 12, 2022 03:28
@zhengbuqian
Copy link
Contributor Author

R: @reuvenlax @scwhittle @y1chi

@github-actions
Copy link
Contributor

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control


@Override
public void clear() {
contents = ArrayListMultimap.create();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just clear the instantiated contents?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See

// Even though we're clearing we can't remove this from the in-memory state map, since

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should create a new container instead of clear(), this is also how Bag/Map/Set do clear().

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The +1 was for your response that we need to keep stuff consistent for past reads that were returned.

Copy link
Contributor

@y1chi y1chi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, added some minor comments.

@zhengbuqian
Copy link
Contributor Author

Gentle ping for review.

R: @reuvenlax @scwhittle

runners/flink/flink_runner.gradle Outdated Show resolved Hide resolved
/** Returns an {@link Iterable} over the keys contained in this multimap. */
ReadableState<Iterable<K>> keys();

/** Returns an {@link Iterable} over all key-values pairs contained in this multimap. */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/** Returns an {@link Iterable} over all key-values pairs contained in this multimap. */
/** Returns an {@link Iterable} over all key-value pairs contained in this multimap. */

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I meant is in each pair there may be multiple values, WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed that we are returning Entry<K, Iterable<V>> which is why I suggested the swap to key-value over key-values.

Note that other map APIs use Entry<K, V>, see:
https://guava.dev/releases/23.0/api/docs/com/google/common/collect/ArrayListMultimap.html#entries--
https://commons.apache.org/proper/commons-collections/apidocs/org/apache/commons/collections4/MultiValuedMap.html#entries--

It would likely make sense to match what existing common libraries do on this front so we are idiomatic for Java users.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, done.

/** Returns an {@link Iterable} over the keys contained in this multimap. */
ReadableState<Iterable<K>> keys();

/** Returns an {@link Iterable} over all key-values pairs contained in this multimap. */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed that we are returning Entry<K, Iterable<V>> which is why I suggested the swap to key-value over key-values.

Note that other map APIs use Entry<K, V>, see:
https://guava.dev/releases/23.0/api/docs/com/google/common/collect/ArrayListMultimap.html#entries--
https://commons.apache.org/proper/commons-collections/apidocs/org/apache/commons/collections4/MultiValuedMap.html#entries--

It would likely make sense to match what existing common libraries do on this front so we are idiomatic for Java users.


@Override
public void clear() {
contents = ArrayListMultimap.create();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1


@Override
public ReadableState<Iterable<V>> get(K key) {
return CollectionViewState.of(contents.get(keyCoder.structuralValue(key)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This returns a stale view in this order of operations:

mm.put("A", 1);
ReadableState<Iterable<V>> iter = mm.get("A");
mm.clear();
iter.read(); <-- will contain "1"

I believe there are other combinations where you won't get what you want. Please ensure that ParDoTest covers:

  • get before and after remove/put/clear.
  • containsKey before and after remove/put/clear
  • entries before and after remove/put/clear
  • isEmpty before and after remove/put/clear
  • keys fore and after remove/put/clear

Currently I see your covering entries before and after put/remove.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, updated ParDoTest to include those as well.

@zhengbuqian zhengbuqian requested review from lukecwik and reuvenlax and removed request for reuvenlax and lukecwik December 1, 2022 07:55
@zhengbuqian
Copy link
Contributor Author

gentle ping

Copy link
Member

@lukecwik lukecwik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add coverage for get/containsKey for the structural multimap test and then LGTM


@Override
public void clear() {
contents = ArrayListMultimap.create();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The +1 was for your response that we need to keep stuff consistent for past reads that were returned.

assertEquals(4, Iterables.size(entries));
assertEquals(5, Iterables.size(entriesView.read()));
assertEquals(5, Iterables.size(state.entries().read()));

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add coverage for containsKey/get since these accessors are important and need to do the structural key conversion.

Copy link
Contributor Author

@zhengbuqian zhengbuqian Dec 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Can you help merge the PR if LGTY? thanks!

@zhengbuqian zhengbuqian requested a review from lukecwik December 14, 2022 00:21
@lukecwik
Copy link
Member

Run Java PreCommit

@lukecwik
Copy link
Member

Run Java_Spark3_Versions PreCommit

…work(that input elements are processed in the same order as in Create.of) is made.
@zhengbuqian
Copy link
Contributor Author

The unit test testMultimapStateRemove failed without commit e1d765b because I made the assumption that the test framework will process the following 4 elements in order, but seems that it's not the case, for example b, 33 may be the 1st/2nd/3rd/4th processed element. testMultimapStateRemove was the only unit test that relies on such ordering and has been fixed, all other tests verify the result only after all elements have been processed thus have no issue.

pipeline
              .apply(
                  Create.of(
                      KV.of("hello", KV.of("a", 97)), KV.of("hello", KV.of("a", 97)),
                      KV.of("hello", KV.of("a", 98)), KV.of("hello", KV.of("b", 33))))
              .apply(ParDo.of(fn));

@zhengbuqian
Copy link
Contributor Author

Run Java_GCP_IO_Direct PreCommit

@zhengbuqian
Copy link
Contributor Author

Run Java_Examples_Dataflow_Java11 PreCommit

@lukecwik lukecwik merged commit 7d5d62e into apache:master Dec 16, 2022
@zhengbuqian zhengbuqian deleted the multimapstate-api branch December 19, 2022 23:52
lostluck pushed a commit to lostluck/beam that referenced this pull request Dec 22, 2022
* Add Multimap API

* Update based on commit

* Fix format

* Fix typo and adopt comment suggestion

* Fix spotless check

* Update based on comments

* Change Multimap in memory implementation to use structural value for keys.

* Address comments

* Apply suggestions from code review

Co-authored-by: Lukasz Cwik <[email protected]>

* Update MultimapState.entries() to return Entry<K, V> instead of Entry<K, Iterable<V>>

* Update MultimapState inmemory runner impl and added more thorough unit tests

* In MultimapState ParDoTest also test against get/containsKey when testing structural values.

* Fix flaky unit test, in which incorrect assumptions of the test framework(that input elements are processed in the same order as in Create.of) is made.

Co-authored-by: Lukasz Cwik <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants