Spark 3.5.3, .NET 8, Dependencies #1178
base: main
Conversation
@dotnet-policy-service agree
Can you share how many of the unit tests pass? The UDF unit tests have not been updated.
Hello @GeorgeS2019, they do. Saw your issue; probably my env uses UTF-8 by default.
What is the status of this PR?
It works, the tests pass, and performance-wise it's the best solution I've found for integrating .NET with Spark. The next steps are on Microsoft's side. I'm also working on implementing CoGrouped UDFs, and I plan to push those updates here as well.
Can you investigate whether your solution works in polyglot .NET Interactive notebooks? Previously we all had problems with UDFs after making adjustments to migrate to .NET 6.
Any idea who is "in charge" of this repo?
I can take a look, but only if a lonely evening with bad weather rolls around :) No promises, as this isn’t my primary focus. There are two suggestions from developers that might help. The first is for a separate code cell, and the second is for a separate environment variable. Have you tried both approaches, and does the issue still persist?
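For reference, here is a minimal sketch of the "separate code cell" suggestion (assumptions: a .NET Interactive notebook with the Microsoft.Spark package loaded and a worker already configured; the DataFrame and the UDF are illustrative, not taken from this PR):

```csharp
// --- cell 1: define the UDF in its own cell ---
// Keeping the lambda in a dedicated cell lets the kernel compile it into its own
// assembly, which the Spark worker then resolves separately from the rest of the notebook.
using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

Func<Column, Column> plusOne = Udf<long, long>(x => x + 1);

// --- cell 2: use the UDF ---
SparkSession spark = SparkSession.Builder().GetOrCreate();
DataFrame df = spark.Range(0, 5);
df.Select(plusOne(df["id"])).Show();
```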
Hi Ihor (@grazy27), thanks for the contribution! I recently got write permission for this repo and am happy to move this forward. Due to limited bandwidth on our team and other priorities, we don't have concrete work items for this project. But we can review your code, and let's work together to move this forward!
Hello Dan (@wudanzy), that's fantastic news - great to hear! I'd be happy to help with a few more issues to get this project back on track. In my opinion, the most important ones are:
Thanks for sharing that!
Can we split this PR a little bit? That would speed up the review.
}
catch (Exception)
{
    // It tries to delete a non-existent file, but other than that it's ok
Are those exceptions expected? If we expect them, we could add some logic here; if not, we could fail the test in such cases.
My logic here is that nothing related to this API changed inside Dotnet.Spark: it just calls AddArchive on the JVM sparkContext, and the archive is added successfully, so it must be an internal bug in Spark itself.
I tested it with Scala directly, and it fails with the same exception.
I plan to test it more and report it to Spark later.
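A minimal sketch of the call path being discussed (assumption: the PR exposes a SparkContext.AddArchive method on the .NET side that forwards to the JVM sparkContext.addArchive; the archive path and the broad catch are illustrative of the E2E test pattern, not its exact code):

```csharp
using System;
using Microsoft.Spark.Sql;

class AddArchiveSketch
{
    static void Main()
    {
        SparkSession spark = SparkSession.Builder().GetOrCreate();
        try
        {
            // Forwards to the JVM sparkContext.addArchive; the archive itself is
            // registered successfully.
            spark.SparkContext.AddArchive("archive.zip");
        }
        catch (Exception)
        {
            // Spark's own cleanup occasionally tries to delete a non-existent temp
            // file; treated as a Spark-side issue rather than a Dotnet.Spark bug.
        }
    }
}
```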
src/csharp/Microsoft.Spark.E2ETest/IpcTests/Sql/CatalogTests.cs
src/csharp/Microsoft.Spark.E2ETest/IpcTests/Sql/DataFrameTests.cs
@@ -839,4 +838,143 @@ private CommandExecutorStat ExecuteDataFrameGroupedMapCommand(
            return stat;
        }
    }

    internal class ArrowOrDataFrameCoGroupedMapCommandExecutor : ArrowOrDataFrameSqlCommandExecutor
Hi Igor, is it possible to split this PR a little bit? This PR is huge; it would be better if it only contained Spark 3.5 support and .NET 8 support. We can leave other changes to future PRs.
Sure, Dan. I'll create a separate PR for CoGrouped UDFs and the binary serializer. The vast majority of the other fixes are related to each other, though, so the PR will still be relatively large.
At one point, I was unsure if this would ever get merged, so I ended up including in one place all the improvements I needed to properly test whether the library meets my requirements, so that if someone wants to build a version it's relatively simple to do.
It will help this project if support for polyglot notebooks is included
#1178 (comment)
Is it possible to set up CI/CD so that each PR's usability can be tracked?
@SparkSnail
https://github.com/SparkSnail/spark/actions
Is it possible to set up CI/CD so that each PR's usability can be tracked? @SparkSnail https://github.com/SparkSnail/spark/actions

Yes, the current test pipeline for the repo is broken; we are working to recover the pipeline for PRs.
Removed unnecessary refactoring and new features from the PR; preserved only .NET 8, Spark 3.5, and a few fixes.
src/csharp/Microsoft.Spark.Worker/Processor/TaskContextProcessor.cs
333ea16 to 9e30f44
Hi @grazy27, I got a basic idea of what is changed, and overall it looks good to me. One thing I found is that the Scala files' content has not changed much. Could you please see if you can move the files instead of adding new ones? That helps highlight what is changed.
@@ -0,0 +1,91 @@
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
Is this file copied from src/scala/microsoft-spark-3-2/pom.xml? Can you try to move it and then modify it? Similar to this: 80c745b
It highlights what is changed.
Sure, it's already done, but as a separate commit: e7eccdf. Please let me know if that's ok.
There are 4 commits in total: one for the copied files, one for the 3.5.1 fixes, one for .NET 8, and one for the Databricks fixes.
Looks good.
LGTM
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
Are those two the same?
Hi @grazy27, it looks good to me, but we have to wait for the check and another review. @SparkSnail is fixing the broken check and will do another review.
Wonderful, thanks @wudanzy.
…4.x, 3.3.4+ as well.
…onsole after successful run on windows. Added logging to help with troubleshooting
/AzurePipelines run
Commenter does not have sufficient privileges for PR 1178 in repo dotnet/spark
Changes:
Tested with:
Spark:
Databricks:
Fails on 15.4:
The following error occurs:
Works on 14.3:
Tested on Databricks 14.3, and it works. However, there is missing functionality for Vector UDFs. Since UseArrow is always set to true on Databricks, Vector UDFs do not function properly and can crash the entire job. This occurs because Spark splits a single expected RecordBatch into a collection of smaller batches, while the code assumes a single batch.
Relevant Spark settings: useArrow, maxRecordsPerBatch.
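To illustrate the batching issue, here is a minimal sketch (assumptions: the worker reads Arrow data via Apache.Arrow's ArrowStreamReader, and maxRecordsPerBatch refers to spark.sql.execution.arrow.maxRecordsPerBatch; the actual worker internals differ). Once that limit is exceeded, Spark sends several smaller RecordBatches for one group, so the reader has to loop instead of assuming a single batch:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;
using Apache.Arrow;
using Apache.Arrow.Ipc;

static class BatchReading
{
    // Collect every RecordBatch in the stream; a single group may arrive as many
    // small batches rather than the single batch the old code assumed.
    public static async Task<List<RecordBatch>> ReadAllBatchesAsync(Stream input)
    {
        var batches = new List<RecordBatch>();
        using var reader = new ArrowStreamReader(input);
        RecordBatch batch;
        while ((batch = await reader.ReadNextRecordBatchAsync()) != null)
        {
            batches.Add(batch);
        }
        return batches;
    }
}
```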
Affected Tickets: