
Spark 3.5.3, .NET 8, Dependencies #1178

Open · wants to merge 5 commits into main
Conversation


@grazy27 grazy27 commented Jul 8, 2024

Changes:

  • Implemented compatibility with Spark 3.5.3 (fixes included in a separate commit).
  • Updated project dependencies.
  • Upgraded .NET 6 → .NET 8 and .NET 4.6.1 → .NET 4.8.
  • Fixed several small bugs, including:
    • Null reference exceptions.
    • Handling Windows paths with spaces (see the sketch after this list).
    • Exceptions after job completion.
    • A few more issues when running locally and on Databricks.
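
As an illustration of the Windows-path fix mentioned above, here is a minimal sketch of the kind of quoting involved (the helper and the example path are hypothetical, not the PR's actual code):

    using System.Diagnostics;

    // Quote a Windows path before passing it as a process argument,
    // so a path like "C:\Program Files\Spark" is not split on the space.
    static string QuoteIfNeeded(string path) =>
        path.Contains(' ') && !path.StartsWith("\"") ? $"\"{path}\"" : path;

    var jarPath = @"C:\Program Files\spark\microsoft-spark-3-5_2.12.jar";  // example path with a space
    var startInfo = new ProcessStartInfo
    {
        FileName = "spark-submit.cmd",
        Arguments = "--class org.apache.spark.deploy.dotnet.DotnetRunner " + QuoteIfNeeded(jarPath)
    };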

Tested with:

Spark:

  • Spark 3.5.0 on Databricks 14.3: works, see the comment below.
  • Spark 3.5.1 on Windows
  • Spark 3.5.2 on Windows

Databricks:

  • Fails on 15.4:
    The following error occurs:

    [Error] [JvmBridge] JVM method execution failed: Static method 'createPythonFunction' failed for class 'org.apache.spark.sql.api.dotnet.SQLUtils' when called with 7 arguments ([Index=1, Type=Byte[], Value=System.Byte[]], [Index=2, Type=Hashtable, Value=Microsoft.Spark.Interop.Internal.Java.Util.Hashtable], [Index=3, Type=ArrayList, Value=Microsoft.Spark.Interop.Internal.Java.Util.ArrayList], [Index=4, Type=String, Value=Microsoft.Spark.Worker], [Index=5, Type=String, Value=2.1.1.0], [Index=6, Type=ArrayList, Value=Microsoft.Spark.Interop.Internal.Java.Util.ArrayList], [Index=7, Type=null, Value=null])
    [2024-09-13T10:47:53.1569404Z] [machine] [Error] [JvmBridge] java.lang.NoSuchMethodError: org.apache.spark.api.python.SimplePythonFunction.<init>(Lscala/collection/Seq;Ljava/util/Map;Ljava/util/List;Ljava/lang/String;Ljava/lang/String;Ljava/util/List;Lorg/apache/spark/api/python/PythonAccumulatorV2;)V
    	at org.apache.spark.sql.api.dotnet.SQLUtils$.createPythonFunction(SQLUtils.scala:35)
    	at org.apache.spark.sql.api.dotnet.SQLUtils.createPythonFunction(SQLUtils.scala)
    
  • Works on 14.3:
    Tested on Databricks 14.3, and it works. However, Vector UDF functionality is incomplete.
    Since UseArrow is always set to true on Databricks, Vector UDFs do not function properly and can crash the entire job. This happens because Spark splits the single expected RecordBatch into a collection of smaller batches, while the code assumes a single batch (see the sketch below this list).
    Relevant Spark settings: useArrow, maxRecordsPerBatch.
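
To illustrate the multi-batch point above, here is a minimal sketch (not the library's actual executor code) that drains an Arrow stream with the Apache.Arrow package instead of assuming one RecordBatch per group:

    using System.Collections.Generic;
    using System.IO;
    using System.Threading.Tasks;
    using Apache.Arrow;
    using Apache.Arrow.Ipc;

    static async Task<IReadOnlyList<RecordBatch>> ReadAllBatchesAsync(Stream input)
    {
        var batches = new List<RecordBatch>();
        using var reader = new ArrowStreamReader(input);
        RecordBatch batch;
        // One logical group can arrive as several batches (bounded by maxRecordsPerBatch),
        // so keep reading until the stream is exhausted instead of stopping after the first batch.
        while ((batch = await reader.ReadNextRecordBatchAsync()) != null)
        {
            batches.Add(batch);
        }
        return batches;
    }

On the Spark side, the standard setting that controls how many rows go into each Arrow batch is spark.sql.execution.arrow.maxRecordsPerBatch.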


Affected Tickets:

@grazy27 grazy27 changed the title Spark 3.5.1, .NET 8, Dependencies and documentation Spark 3.5.1, .NET 8, Dependencies and Documentation Jul 8, 2024
Author

grazy27 commented Jul 8, 2024

@dotnet-policy-service agree


GeorgeS2019 commented Jul 22, 2024

@grazy27

Can you share how many of the unit tests pass?

The UDF unit tests have not been updated.
Are you able to get all of them to pass?


Author

grazy27 commented Jul 22, 2024

@grazy27

Can you share how many of the unit tests pass?

The UDF unit tests have not been updated. Are you able to get all of them to pass?


Hello @GeorgeS2019, they do.

I saw your issue; my environment probably uses UTF-8 by default.
Several tests fail from time to time with executor driver): java.nio.file.NoSuchFileException: C:\Users\grazy27\AppData\Local\Temp\spark-cc2cf7bc-3c8c-4fdf-a496-266424de943d\userFiles-92d122bb-af9a-40ea-a430-131454afc705\archive.zip
But they pass if run a second time, so I didn't dive deeper.

@travis-leith

What is the status of this PR?

Author

grazy27 commented Aug 26, 2024

What is the status of this PR?

It works, the tests pass, and performance-wise, it's the best solution I've found for integrating .NET with Spark. The next steps are on Microsoft's side.

I'm also working on implementing CoGrouped UDFs, and I plan to push those updates here as well


GeorgeS2019 commented Aug 26, 2024

@grazy27

Can you investigate whether your solution works in a polyglot .NET Interactive notebook?

Previously we all had problems with UDFs after making the adjustments to migrate to .NET 6.

#796

https://github.com/Apress/introducing-.net-for-apache-spark/tree/main/ch04/Chapter4

@travis-leith

The next steps are on Microsoft's side.

Any idea who is "in charge" of this repo?

Author

grazy27 commented Aug 26, 2024

@grazy27

Can you investigate whether your solution works in a polyglot .NET Interactive notebook?

Previously we all had problems with UDFs.

#796

I can take a look, but only if a lonely evening with bad weather rolls around :) No promises, as this isn’t my primary focus.

There are two suggestions from developers that might help: the first is to use a separate code cell, and the second is to set a separate environment variable. Have you tried both approaches, and does the issue still persist?

@grazy27 grazy27 changed the title Spark 3.5.1, .NET 8, Dependencies and Documentation Spark 3.5.3, .NET 8, CoGrouped UDFs, Fixes, Dependencies and Documentation Nov 23, 2024
@grazy27 grazy27 mentioned this pull request Nov 23, 2024
@wudanzy wudanzy added the enhancement New feature or request label Nov 25, 2024
Collaborator

wudanzy commented Nov 25, 2024

Hi Ihor (@grazy27), thanks for the contribution! I recently got write permission for this repo and am happy to move this forward. Due to limited bandwidth in our team and other priorities, we don't have concrete work items for this project, but we can review your code, and let's work together to move this forward!

Author

grazy27 commented Nov 25, 2024

Hi Ihor (@grazy27), thanks for the contribution! I recently got write permission for this repo and am happy to move this forward. Due to limited bandwidth in our team and other priorities, we don't have concrete work items for this project, but we can review your code, and let's work together to move this forward!

Hello Dan <@wudanzy>,

That's fantastic news—great to hear!

I'd be happy to help with a few more issues to get this project back on track. In my opinion, the most important ones are:

  • Support for UDFs when UseArrow = true
  • Migrating to a standalone NuGet package for BinarySerializer and upgrading the solution to .NET 9
  • Addressing the bug with Databricks 15.4

Collaborator

wudanzy commented Nov 25, 2024

Thanks for sharing that!

Collaborator

@wudanzy wudanzy left a comment


Can we split this PR up a bit? That would speed up the review.

docs/building/windows-instructions.md (outdated, resolved)
}
catch (Exception)
{
// It tries to delete a non-existent file, but other than that it's OK
Collaborator


Are those exceptions expected? If we expect them, we could add some logic here; if not, we could fail the test in such cases.

Author

@grazy27 grazy27 Nov 28, 2024


My logic here is that nothing related to this API changed inside Dotnet.Spark: it just calls AddArchive on the JVM SparkContext, and the archive is added successfully, so it must be an internal bug in Spark itself.
I tested it with Scala directly, and it fails with the same exception.
I plan to test it more and report it to Spark later.
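
For context, a rough sketch of the thin forwarding being described, assuming Microsoft.Spark's JvmObjectReference.Invoke bridge call (the type and member names here are assumptions, not the repo's exact code); the .NET side only relays the call, so the failure has to originate inside Spark:

    using Microsoft.Spark.Interop.Ipc;

    internal sealed class SparkContextSketch
    {
        private readonly JvmObjectReference _jvmObject;  // reference to the JVM SparkContext

        public SparkContextSketch(JvmObjectReference jvmObject) => _jvmObject = jvmObject;

        // Forward straight to SparkContext.addArchive (available in Spark 3.1+);
        // all archive handling happens on the JVM side.
        public void AddArchive(string path) => _jvmObject.Invoke("addArchive", path);
    }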


Author


Actually, this is the exact reason why other tests fail when run together with it.

I restricted this test to (3.1..3.2), the versions on which it had already been tested.

src/csharp/Microsoft.Spark.E2ETest/SparkFixture.cs (outdated, resolved)
src/csharp/Microsoft.Spark.E2ETest/SparkFixture.cs (outdated, resolved)
src/csharp/Microsoft.Spark.UnitTest/BinarySerDeTests.cs (outdated, resolved)
@@ -839,4 +838,143 @@ private CommandExecutorStat ExecuteDataFrameGroupedMapCommand(
return stat;
}
}

internal class ArrowOrDataFrameCoGroupedMapCommandExecutor : ArrowOrDataFrameSqlCommandExecutor
Collaborator


Hi Igor, is it possible to split this PR up a bit? This PR is huge; it would be better for it to contain only the Spark 3.5 support and .NET 8 support. We can leave the other changes to future PRs.

Author


Sure, Dan. I'll create a separate PR for the CoGrouped UDFs and the binary serializer. The vast majority of the other fixes are related to each other, though, so the PR will still be relatively large.
At one point, I was unsure whether this would ever get merged, so I ended up including in one place all the improvements I needed to properly test whether the library meets my requirements, so that anyone who wants to build a version can do so relatively easily.


@wudanzy

#1178 (comment)

It will help this project if support for polyglot notebooks is included
#1178 (comment)


@GeorgeS2019 GeorgeS2019 Nov 27, 2024


Is it possible to set up CI/CD so that each PR's usability can be tracked?
@SparkSnail
https://github.com/SparkSnail/spark/actions

Collaborator


Is it possible to set up CI/CD so that each PR's usability can be tracked? @SparkSnail https://github.com/SparkSnail/spark/actions

Yes, the current test pipeline for the repo is broken; we are working to recover the pipeline for PRs.

Author


Removed unnecessary refactoring and new features from the PR; preserved only the .NET and Spark 3.5 upgrades and a few fixes.

@grazy27 grazy27 changed the title Spark 3.5.3, .NET 8, CoGrouped UDFs, Fixes, Dependencies and Documentation Spark 3.5.3, .NET 8, Dependencies Nov 28, 2024
@grazy27 grazy27 force-pushed the main branch 3 times, most recently from 333ea16 to 9e30f44 Compare November 28, 2024 10:57
Collaborator

@wudanzy wudanzy left a comment


Hi @grazy27, I got a basic idea of what has changed; overall, it looks good to me. One thing I found is that the content of the Scala files has not changed much. Could you please see if you can move the files instead of adding new ones? That helps highlight what has changed.

src/csharp/Microsoft.Spark/Sql/Catalog/Catalog.cs (outdated, resolved)
@@ -0,0 +1,91 @@
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
Collaborator


Is this file copied from src/scala/microsoft-spark-3-2/pom.xml? Can you try to move it and then modify it? Similar to this: 80c745b

Collaborator


It highlights what is changed.

Author

@grazy27 grazy27 Nov 30, 2024


Sure, it's already done, but as a separate commit: e7eccdf. Please let me know if that's OK.
There are 4 commits in total: one for the copied files, one for the 3.5.1 fixes, one for .NET 8, and one for the Databricks fixes.

Collaborator


Looks good.

Author

grazy27 commented Dec 1, 2024

@grazy27

Can you investigate whether your solution works in a polyglot .NET Interactive notebook?

Previously we all had problems with UDFs after making the adjustments to migrate to .NET 6.

#796

https://github.com/Apress/introducing-.net-for-apache-spark/tree/main/ch04/Chapter4

Followed up in #1179, and found a related bug: #1043

wudanzy previously approved these changes Dec 5, 2024
Collaborator

@wudanzy wudanzy left a comment


LGTM

<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
Collaborator


Are those two the same?

Author


They're slightly different; the Scala language version is more specific, and it has to match the package name.

Collaborator

wudanzy commented Dec 5, 2024

Hi @grazy27, it looks good to me, but we have to wait for the check and another review. @SparkSnail is fixing the broken check and will do another review.

Author

grazy27 commented Dec 5, 2024

Hi @grazy27, it looks good to me, but we have to wait for the check and another review. @SparkSnail is fixing the broken check and will do another review.

Wonderful, thanks @wudanzy.
I'll create a few more PRs with further improvements after this one is merged.

Author

grazy27 commented Dec 14, 2024

/AzurePipelines run


Commenter does not have sufficient privileges for PR 1178 in repo dotnet/spark

Labels: enhancement (New feature or request)
Projects: none yet
5 participants