
Spark 3.5.3, .NET 8, Dependencies #1178

Open · wants to merge 5 commits into main
Conversation


@grazy27 grazy27 commented Jul 8, 2024

Changes:

  • Implemented compatibility with Spark 3.5.3 (fixes included in a separate commit).
  • Updated project dependencies.
  • Upgraded .NET 6 → .NET 8 and .NET 4.6.1 → .NET 4.8.
  • Fixed several small bugs, including:
    • Null reference exceptions.
    • Handling Windows paths with spaces (see the sketch after this list).
    • Exceptions after job completion.
    • A few more issues when running locally and on Databricks.
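
As an illustration of the Windows-path fix mentioned above, here is a minimal sketch of the kind of quoting involved (the helper and the example path are hypothetical, not the PR's actual code):

    using System.Diagnostics;

    // Quote a Windows path before passing it as a process argument,
    // so a path like "C:\Program Files\Spark" is not split on the space.
    static string QuoteIfNeeded(string path) =>
        path.Contains(' ') && !path.StartsWith("\"") ? $"\"{path}\"" : path;

    var jarPath = @"C:\Program Files\spark\microsoft-spark-3-5_2.12.jar";  // example path with a space
    var startInfo = new ProcessStartInfo
    {
        FileName = "spark-submit.cmd",
        Arguments = "--class org.apache.spark.deploy.dotnet.DotnetRunner " + QuoteIfNeeded(jarPath)
    };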

Tested with:

Spark:

  • Spark 3.5.0 on Databricks 14.3: works, see the comment below.
  • Spark 3.5.1 on Windows
  • Spark 3.5.2 on Windows

Databricks:

  • Fails on 15.4:
    The following error occurs:

    [Error] [JvmBridge] JVM method execution failed: Static method 'createPythonFunction' failed for class 'org.apache.spark.sql.api.dotnet.SQLUtils' when called with 7 arguments ([Index=1, Type=Byte[], Value=System.Byte[]], [Index=2, Type=Hashtable, Value=Microsoft.Spark.Interop.Internal.Java.Util.Hashtable], [Index=3, Type=ArrayList, Value=Microsoft.Spark.Interop.Internal.Java.Util.ArrayList], [Index=4, Type=String, Value=Microsoft.Spark.Worker], [Index=5, Type=String, Value=2.1.1.0], [Index=6, Type=ArrayList, Value=Microsoft.Spark.Interop.Internal.Java.Util.ArrayList], [Index=7, Type=null, Value=null])
    [2024-09-13T10:47:53.1569404Z] [machine] [Error] [JvmBridge] java.lang.NoSuchMethodError: org.apache.spark.api.python.SimplePythonFunction.<init>(Lscala/collection/Seq;Ljava/util/Map;Ljava/util/List;Ljava/lang/String;Ljava/lang/String;Ljava/util/List;Lorg/apache/spark/api/python/PythonAccumulatorV2;)V
    	at org.apache.spark.sql.api.dotnet.SQLUtils$.createPythonFunction(SQLUtils.scala:35)
    	at org.apache.spark.sql.api.dotnet.SQLUtils.createPythonFunction(SQLUtils.scala)
    
  • Works on 14.3:
    Tested on Databricks 14.3, and it works. However, Vector UDF functionality is incomplete.
    Since UseArrow is always set to true on Databricks, Vector UDFs do not function properly and can crash the entire job. This happens because Spark splits the single expected RecordBatch into a collection of smaller batches, while the code assumes a single batch (see the sketch below this list).
    Relevant Spark settings: useArrow, maxRecordsPerBatch.
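
To illustrate the multi-batch point above, here is a minimal sketch (not the library's actual executor code) that drains an Arrow stream with the Apache.Arrow package instead of assuming one RecordBatch per group:

    using System.Collections.Generic;
    using System.IO;
    using System.Threading.Tasks;
    using Apache.Arrow;
    using Apache.Arrow.Ipc;

    static async Task<IReadOnlyList<RecordBatch>> ReadAllBatchesAsync(Stream input)
    {
        var batches = new List<RecordBatch>();
        using var reader = new ArrowStreamReader(input);
        RecordBatch batch;
        // One logical group can arrive as several batches (bounded by maxRecordsPerBatch),
        // so keep reading until the stream is exhausted instead of stopping after the first batch.
        while ((batch = await reader.ReadNextRecordBatchAsync()) != null)
        {
            batches.Add(batch);
        }
        return batches;
    }

On the Spark side, the standard setting that controls how many rows go into each Arrow batch is spark.sql.execution.arrow.maxRecordsPerBatch.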


Affected Tickets:

@grazy27 grazy27 changed the title Spark 3.5.1, .NET 8, Dependencies and documentation Spark 3.5.1, .NET 8, Dependencies and Documentation Jul 8, 2024
Author

grazy27 commented Jul 8, 2024

@dotnet-policy-service agree


GeorgeS2019 commented Jul 22, 2024

@grazy27

Can you share how many of the unit tests pass?

The UDF unit tests have not been updated.
Are you able to get all of them to pass?


Author

grazy27 commented Jul 22, 2024

@grazy27

Can you share how many of the unit tests pass?

The UDF unit tests have not been updated. Are you able to get all of them to pass?


Hello @GeorgeS2019, they do.

I saw your issue; my environment probably uses UTF-8 by default.
Several tests fail from time to time with executor driver): java.nio.file.NoSuchFileException: C:\Users\grazy27\AppData\Local\Temp\spark-cc2cf7bc-3c8c-4fdf-a496-266424de943d\userFiles-92d122bb-af9a-40ea-a430-131454afc705\archive.zip
But they pass if run a second time, so I didn't dive deeper.

@travis-leith

What is the status of this PR?

Author

grazy27 commented Aug 26, 2024

What is the status of this PR?

It works, the tests pass, and performance-wise, it's the best solution I've found for integrating .NET with Spark. The next steps are on Microsoft's side.

I'm also working on implementing CoGrouped UDFs, and I plan to push those updates here as well


GeorgeS2019 commented Aug 26, 2024

@grazy27

Can you investigate whether your solution works in a polyglot .NET Interactive notebook?

Previously we all had problems with UDFs after making the adjustments to migrate to .NET 6.

#796

https://github.com/Apress/introducing-.net-for-apache-spark/tree/main/ch04/Chapter4

@travis-leith

The next steps are on Microsoft's side.

Any idea who is "in charge" of this repo?

Author

grazy27 commented Aug 26, 2024

@grazy27

Can you investigate whether your solution works in a polyglot .NET Interactive notebook?

Previously we all had problems with UDFs.

#796

I can take a look, but only if a lonely evening with bad weather rolls around :) No promises, as this isn’t my primary focus.

There are two suggestions from developers that might help: the first is to use a separate code cell, and the second is to set a separate environment variable. Have you tried both approaches, and does the issue still persist?

@grazy27 grazy27 changed the title Spark 3.5.1, .NET 8, Dependencies and Documentation Spark 3.5.3, .NET 8, CoGrouped UDFs, Fixes, Dependencies and Documentation Nov 23, 2024
@grazy27 grazy27 mentioned this pull request Nov 23, 2024
@wudanzy wudanzy added the enhancement New feature or request label Nov 25, 2024
Collaborator

wudanzy commented Nov 25, 2024

Hi Ihor (@grazy27), thanks for the contribution! I recently got write permission for this repo and am happy to move this forward. Due to limited bandwidth in our team and other priorities, we don't have concrete work items for this project, but we can review your code, and let's work together to move this forward!

Author

grazy27 commented Nov 25, 2024

Hi Ihor (@grazy27), thanks for the contribution! I recently got write permission for this repo and am happy to move this forward. Due to limited bandwidth in our team and other priorities, we don't have concrete work items for this project, but we can review your code, and let's work together to move this forward!

Hello Dan <@wudanzy>,

That's fantastic news—great to hear!

I'd be happy to help with a few more issues to get this project back on track. In my opinion, the most important ones are:

  • Support for UDFs when UseArrow = true
  • Migrating to a standalone NuGet package for BinarySerializer and upgrading the solution to .NET 9
  • Addressing the bug with Databricks 15.4

Collaborator

wudanzy commented Nov 25, 2024

Thanks for sharing that!

Collaborator

@wudanzy wudanzy left a comment


Can we split this PR up a bit? That would speed up the review.

docs/building/windows-instructions.md (outdated, resolved)
}
catch (Exception)
{
// It tries to delete a non-existent file, but other than that it's OK
Collaborator


Are those exceptions expected? If we expect them, we could add some logic here; if not, we could fail the test in such cases.

Author

@grazy27 grazy27 Nov 28, 2024


My logic here is that nothing related to this API changed inside Dotnet.Spark: it just calls AddArchive on the JVM SparkContext, and the archive is added successfully, so it must be an internal bug in Spark itself.
I tested it with Scala directly, and it fails with the same exception.
I plan to test it more and report it to Spark later.
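
For context, a rough sketch of the thin forwarding being described, assuming Microsoft.Spark's JvmObjectReference.Invoke bridge call (the type and member names here are assumptions, not the repo's exact code); the .NET side only relays the call, so the failure has to originate inside Spark:

    using Microsoft.Spark.Interop.Ipc;

    internal sealed class SparkContextSketch
    {
        private readonly JvmObjectReference _jvmObject;  // reference to the JVM SparkContext

        public SparkContextSketch(JvmObjectReference jvmObject) => _jvmObject = jvmObject;

        // Forward straight to SparkContext.addArchive (available in Spark 3.1+);
        // all archive handling happens on the JVM side.
        public void AddArchive(string path) => _jvmObject.Invoke("addArchive", path);
    }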


Author


Actually, this is the exact reason why other tests fail when run together with it.

I restricted this test to (3.1..3.2), the versions on which it had already been tested.

src/csharp/Microsoft.Spark.E2ETest/SparkFixture.cs (outdated, resolved)
src/csharp/Microsoft.Spark.E2ETest/SparkFixture.cs (outdated, resolved)
src/csharp/Microsoft.Spark.UnitTest/BinarySerDeTests.cs (outdated, resolved)
@@ -839,4 +838,143 @@ private CommandExecutorStat ExecuteDataFrameGroupedMapCommand(
return stat;
}
}

internal class ArrowOrDataFrameCoGroupedMapCommandExecutor : ArrowOrDataFrameSqlCommandExecutor
Collaborator


Hi Igor, is it possible to split this PR up a bit? This PR is huge; it would be better for it to contain only the Spark 3.5 support and .NET 8 support. We can leave the other changes to future PRs.

Author


Sure, Dan. I'll create a separate PR for the CoGrouped UDFs and the binary serializer. The vast majority of the other fixes are related to each other, though, so the PR will still be relatively large.
At one point, I was unsure whether this would ever get merged, so I ended up including in one place all the improvements I needed to properly test whether the library meets my requirements, so that anyone who wants to build a version can do so relatively easily.


@wudanzy

#1178 (comment)

It will help this project if support for polyglot notebooks is included
#1178 (comment)


@GeorgeS2019 GeorgeS2019 Nov 27, 2024


Is it possible to set up CI/CD so that each PR's usability can be tracked?
@SparkSnail
https://github.com/SparkSnail/spark/actions

Collaborator


Is it possible to set up CI/CD so that each PR's usability can be tracked? @SparkSnail https://github.com/SparkSnail/spark/actions

Yes, the current test pipeline for the repo is broken; we are working to recover the pipeline for PRs.

Author


Removed unnecessary refactoring and new features from the PR; preserved only the .NET and Spark 3.5 upgrades and a few fixes.

@grazy27 grazy27 changed the title Spark 3.5.3, .NET 8, CoGrouped UDFs, Fixes, Dependencies and Documentation Spark 3.5.3, .NET 8, Dependencies Nov 28, 2024
@grazy27 grazy27 force-pushed the main branch 3 times, most recently from 333ea16 to 9e30f44 Compare November 28, 2024 10:57
Collaborator

@wudanzy wudanzy left a comment


Hi @grazy27, I got a basic idea of what has changed; overall, it looks good to me. One thing I found is that the content of the Scala files has not changed much. Could you please see if you can move the files instead of adding new ones? That helps highlight what has changed.

src/csharp/Microsoft.Spark/Sql/Catalog/Catalog.cs (outdated, resolved)
@@ -0,0 +1,91 @@
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
Collaborator


Is this file copied from src/scala/microsoft-spark-3-2/pom.xml? Can you try to move it and then modify it? Similar to this: 80c745b

Collaborator


It highlights what is changed.

Author

@grazy27 grazy27 Nov 30, 2024


Sure, it's already done, but as a separate commit: e7eccdf. Please let me know if that's OK.
There are 4 commits in total: one for the copied files, one for the 3.5.1 fixes, one for .NET 8, and one for the Databricks fixes.

Collaborator


Looks good.

Author

grazy27 commented Dec 1, 2024

@grazy27

Can you investigate whether your solution works in a polyglot .NET Interactive notebook?

Previously we all had problems with UDFs after making the adjustments to migrate to .NET 6.

#796

https://github.com/Apress/introducing-.net-for-apache-spark/tree/main/ch04/Chapter4

Followed up in #1179, and found a related bug: #1043

wudanzy previously approved these changes Dec 5, 2024
Collaborator

@wudanzy wudanzy left a comment


LGTM

<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
Collaborator


Are those two the same?

Author


They're slightly different; the Scala language version is more specific, and it has to match the package name.

Collaborator

wudanzy commented Dec 5, 2024

Hi @grazy27, it looks good to me, but we have to wait for the check and another review. @SparkSnail is fixing the broken check and will do another review.

Author

grazy27 commented Dec 5, 2024

Hi @grazy27, it looks good to me, but we have to wait for the check and another review. @SparkSnail is fixing the broken check and will do another review.

Wonderful, thanks @wudanzy.
I'll create a few more PRs with further improvements after this one is merged.

Author

grazy27 commented Dec 14, 2024

/AzurePipelines run


Commenter does not have sufficient privileges for PR 1178 in repo dotnet/spark

Labels: enhancement (New feature or request)
Projects: none yet
5 participants