Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYSTEMDS-3548] Optimize IO path Python interface for SystemDS #2154

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

Nakroma
Copy link
Contributor

@Nakroma Nakroma commented Dec 17, 2024

Draft for the student project SYSTEMDS-3548.

Current contributions:

  • Fixes some minor bugs related to the performance tests
  • Parallelizes pandas_to_frame_block column processing (see image below for speed up, tested on my machine)

load_pandas

This commit fixes the load_numpy string performance test case. It keeps the CLI usage consistent with the other test cases, but converts the dtype to the correct one internally.
This commit fixes the array boolean convert breaking for row numbers above 64. It also adds a bit more error handling to prevent cases like this in the future.
This commit parallelizes the column processing in the pandas DataFrame to FrameBlock conversion.
Copy link

codecov bot commented Dec 17, 2024

Codecov Report

Attention: Patch coverage is 20.00000% with 4 lines in your changes missing coverage. Please review.

Project coverage is 72.03%. Comparing base (d3fcfb1) to head (8323515).

Files with missing lines Patch % Lines
.../apache/sysds/runtime/util/Py4jConverterUtils.java 20.00% 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##               main    #2154   +/-   ##
=========================================
  Coverage     72.03%   72.03%           
+ Complexity    43937    43935    -2     
=========================================
  Files          1441     1441           
  Lines        166106   166110    +4     
  Branches      32428    32431    +3     
=========================================
+ Hits         119655   119659    +4     
- Misses        37199    37204    +5     
+ Partials       9252     9247    -5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Nakroma Nakroma marked this pull request as ready for review December 18, 2024 14:04
@christinadionysio
Copy link
Contributor

LGTM! Thank you for your contribution @Nakroma!

@Baunsgaard
Copy link
Contributor

LGTM as well.

How did you measure the time?
Is it with startup time of the system?

@Nakroma
Copy link
Contributor Author

Nakroma commented Dec 18, 2024

@Baunsgaard
Copy link
Contributor

@Baunsgaard I used the IO benchmark scripts for the figure provided above:

https://github.com/apache/systemds/blob/main/scripts/perftest/runAllIO.sh https://github.com/apache/systemds/blob/main/scripts/perftest/python/io/load_pandas.py

Great!

Then you can get better numbers:

Modify the script to start the context not in the 'run' part, instead move it to the 'setup' part, and remember to shut down the system with ctx.close() after you are done measuring.

@Nakroma
Copy link
Contributor Author

Nakroma commented Dec 19, 2024

@Baunsgaard Okey yeah that makes sense - pushed a commit for that 👍 I didnt move it to the setup but rather inside the global context, so .close() time is not included in the timing and also to support the args.number parameter.

@Baunsgaard
Copy link
Contributor

@Baunsgaard Okey yeah that makes sense - pushed a commit for that 👍 I didnt move it to the setup but rather inside the global context, so .close() time is not included in the timing and also to support the args.number parameter.

what are the times then?

@Nakroma
Copy link
Contributor Author

Nakroma commented Dec 19, 2024

what are the times then?

seems to be about a difference of 1-2s, at least on my local machine

load_pandas

@Baunsgaard
Copy link
Contributor

what are the times then?

seems to be about a difference of 1-2s, at least on my local machine

load_pandas

60% speedup on int32 and 100% on int64 is great!
However, it does seem to me like there is something else taking time from your results. I would expect speedup closer to the number of cores in your system.

@Nakroma
Copy link
Contributor Author

Nakroma commented Dec 19, 2024

60% speedup on int32 and 100% on int64 is great! However, it does seem to me like there is something else taking time from your results. I would expect speedup closer to the number of cores in your system.

So there is some more constant time, the building of the frameblock a few lines before the .convert calls for example is around 400ms.

I looked at the profiling a bit more and it seems like most time is spent on socket communication between Java and Python. My assumption would be that this adds quite a bit of overhead and doesn't parallelize well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

3 participants