Deploy .NET for Apache Spark worker and user-defined function binaries

Article
12/16/2022

This how-to provides general instructions for deploying .NET for Apache Spark worker and user-defined function (UDF) binaries. You learn which environment variables to set and which parameters are commonly used when launching applications with `spark-submit`.

Warning

.NET for Apache Spark targets an out-of-support version of .NET (.NET Core 3.1). For more information, see the .NET Support Policy.

Configurations

This section lists the general environment variables and parameter settings needed to deploy .NET for Apache Spark worker and user-defined function binaries.

Environment variables

When deploying workers and writing UDFs, there are a few commonly used environment variables that you may need to set:

Environment Variable	Description
DOTNET_WORKER_DIR	Path where the `Microsoft.Spark.Worker` binary has been generated. It's used by the Spark driver and is passed to the Spark executors. If this variable isn't set, the Spark executors search the paths specified in the `PATH` environment variable. Example: `C:\bin\Microsoft.Spark.Worker`
DOTNET_ASSEMBLY_SEARCH_PATHS	Comma-separated paths from which `Microsoft.Spark.Worker` loads assemblies. If a path starts with `.`, the working directory is prepended. In YARN mode, `.` represents the container's working directory. Example: `C:\Users\<user name>\<mysparkapp>\bin\Debug\<dotnet version>`
DOTNET_WORKER_DEBUG	If you want to debug a UDF, then set this environment variable to `1` before running `spark-submit`.
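
As a minimal sketch, the variables in the table could be set in a POSIX-style shell session before calling `spark-submit`. The directory names below (`WORKER_HOME`, `APP_BUILD_DIR`, and the paths assigned to them) are hypothetical placeholders rather than locations the tooling requires; substitute the folders that hold your own worker and application binaries.

```shell
# Hypothetical locations; replace them with your own paths.
WORKER_HOME="$HOME/bin/Microsoft.Spark.Worker"
APP_BUILD_DIR="$HOME/mySparkApp/bin/Debug"

# Directory that contains the Microsoft.Spark.Worker binary.
export DOTNET_WORKER_DIR="$WORKER_HOME"

# Comma-separated list of directories searched for UDF assemblies.
export DOTNET_ASSEMBLY_SEARCH_PATHS="$APP_BUILD_DIR,./udfs"

# Set to 1 only when you want to debug a UDF.
export DOTNET_WORKER_DEBUG=1

echo "worker dir:   $DOTNET_WORKER_DIR"
echo "search paths: $DOTNET_ASSEMBLY_SEARCH_PATHS"
```

On Windows, the same variables would be set with the syntax of the command shell you use there.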

Parameter options

Once the Spark application is bundled, you can launch it using spark-submit. The following table shows some of the commonly used options:

Parameter Name	Description
--class	The entry point for your application. Example: `org.apache.spark.deploy.dotnet.DotnetRunner`
--master	The master URL for the cluster. Example: `yarn`
--deploy-mode	Whether to deploy your driver on the worker nodes (`cluster`) or locally as an external client (`client`). Default: `client`
--conf	Arbitrary Spark configuration property in `key=value` format. Example: `spark.yarn.appMasterEnv.DOTNET_WORKER_DIR=.\worker\Microsoft.Spark.Worker`
--files	Comma-separated list of files to be placed in the working directory of each executor. This option applies only in YARN mode. As in Hadoop, you can append `#` and an alias to a file name. Example: `myLocalSparkApp.dll#appSeen.dll`. When running on YARN, your application should then use the name `appSeen.dll` to reference `myLocalSparkApp.dll`.
--archives	Comma-separated list of archives to be extracted into the working directory of each executor. This option applies only in YARN mode. As in Hadoop, you can append `#` and an alias to a file name. Example: `hdfs://<path to your worker file>/Microsoft.Spark.Worker.zip#worker`. This copies the zip file and extracts it to the `worker` folder.
application-jar	Path to a bundled jar that includes your application and all dependencies. Example: `hdfs://<path to your jar>/microsoft-spark-<version>.jar`
application-arguments	Arguments passed to the main method of your main class, if any. Example: `hdfs://<path to your app>/<your app>.zip <your app name> <app args>`

Note

Specify all the `--options` before `application-jar` when launching applications with `spark-submit`; otherwise, they are ignored. For more information, see spark-submit options and running Spark on YARN details.
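
To illustrate the ordering rule, here is a sketch of a helper that assembles a `spark-submit` command line with every `--option` placed ahead of the application jar. The function name `build_submit_command` and the sample file names are hypothetical. The function only prints the command so that you can inspect it before running it.

```shell
# Print a spark-submit command: options first, then the jar, then app arguments.
# Usage: build_submit_command <master> <application-jar> [application-arguments...]
build_submit_command() {
  master=$1
  app_jar=$2
  shift 2
  # Options come first; anything after the jar is passed to the application.
  printf '%s' "spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master $master $app_jar"
  for arg in "$@"; do
    printf ' %s' "$arg"
  done
  printf '\n'
}

# Hypothetical jar and application names, used only to show the ordering.
build_submit_command local microsoft-spark.jar mySparkApp.zip mySparkApp
```

Keeping the jar as a fixed positional argument makes it harder to append an option after it by accident.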

Frequently asked questions

When I run a Spark app with UDFs, I get a `FileNotFoundException` error. What should I do?

Error: [Error] [TaskRunner] [0] ProcessStream() failed with exception: System.IO.FileNotFoundException: Assembly 'mySparkApp, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null' file not found: 'mySparkApp.dll'

Answer: Check that the `DOTNET_ASSEMBLY_SEARCH_PATHS` environment variable is set correctly. It should include the directory that contains your `mySparkApp.dll`.
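
One way to run that check from a POSIX-style shell is sketched below. The helper `find_assembly` is hypothetical and not part of .NET for Apache Spark. It splits a comma-separated path list and reports the first directory that contains the named assembly, resolving relative paths against the current working directory.

```shell
# Print the first location of an assembly on a comma-separated search path.
# Usage: find_assembly <assembly-file-name> <comma-separated-directories>
find_assembly() {
  old_ifs=$IFS
  IFS=,
  # Unquoted on purpose so the list is split on commas.
  for dir in $2; do
    if [ -f "$dir/$1" ]; then
      IFS=$old_ifs
      printf '%s\n' "$dir/$1"
      return 0
    fi
  done
  IFS=$old_ifs
  return 1
}

find_assembly mySparkApp.dll "${DOTNET_ASSEMBLY_SEARCH_PATHS:-}" \
  || echo "mySparkApp.dll was not found on DOTNET_ASSEMBLY_SEARCH_PATHS"
```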

After I upgraded my .NET for Apache Spark version and reset the `DOTNET_WORKER_DIR` environment variable, why do I still get the following `IOException` error?

Error: Lost task 0.0 in stage 11.0 (TID 24, localhost, executor driver): java.io.IOException: Cannot run program "Microsoft.Spark.Worker.exe": CreateProcess error=2, The system cannot find the file specified.

Answer: First restart your PowerShell window (or other command window) so that it picks up the latest environment variable values. Then start your program.
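
If you work from a POSIX-style shell rather than PowerShell, a quick check like the following sketch can confirm what the current session actually sees. The helper name `check_worker_dir` is hypothetical, and looking for the file name without `.exe` is an assumption about non-Windows worker builds.

```shell
# Report whether a worker executable is visible from the current session.
# Usage: check_worker_dir [directory]   (defaults to $DOTNET_WORKER_DIR)
check_worker_dir() {
  dir=${1:-${DOTNET_WORKER_DIR:-}}
  if [ -z "$dir" ]; then
    echo "DOTNET_WORKER_DIR is not set in this session"
    return 1
  fi
  # The error above names Microsoft.Spark.Worker.exe. The name without .exe is
  # checked too, on the assumption that non-Windows builds omit the extension.
  for name in Microsoft.Spark.Worker.exe Microsoft.Spark.Worker; do
    if [ -f "$dir/$name" ]; then
      echo "found $dir/$name"
      return 0
    fi
  done
  echo "no worker executable in $dir"
  return 1
}
```

Calling `check_worker_dir` with no argument in a freshly opened window shows whether the updated value has reached that session.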

After submitting my Spark application, I get the error `System.TypeLoadException: Could not load type 'System.Runtime.Remoting.Contexts.Context'`.

Error: [Error] [TaskRunner] [0] ProcessStream() failed with exception: System.TypeLoadException: Could not load type 'System.Runtime.Remoting.Contexts.Context' from assembly 'mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=...'.

Answer: Check which `Microsoft.Spark.Worker` build you are using. There are two builds: one for .NET Framework 4.6.1 and one for .NET Core 3.1.x. In this case, use `Microsoft.Spark.Worker.net461.win-x64-<version>` (which you can download), because `System.Runtime.Remoting.Contexts.Context` exists only in .NET Framework.

How do I run my Spark application with UDFs on YARN? Which environment variables and parameters should I use?

Answer: To launch the Spark application on YARN, specify the environment variables as `spark.yarn.appMasterEnv.[EnvironmentVariableName]`. The following example uses `spark-submit`:

spark-submit \
--class org.apache.spark.deploy.dotnet.DotnetRunner \
--master yarn \
--deploy-mode cluster \
--conf spark.yarn.appMasterEnv.DOTNET_WORKER_DIR=./worker/Microsoft.Spark.Worker-<version> \
--conf spark.yarn.appMasterEnv.DOTNET_ASSEMBLY_SEARCH_PATHS=./udfs \
--archives hdfs://<path to your files>/Microsoft.Spark.Worker.net461.win-x64-<version>.zip#worker,hdfs://<path to your files>/mySparkApp.zip#udfs \
hdfs://<path to jar file>/microsoft-spark-<spark_majorversion-spark_minorversion>_<scala_majorversion.scala_minorversion>-<spark_dotnet_version>.jar \
hdfs://<path to your files>/mySparkApp.zip mySparkApp
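
In the command above, the alias after `#` in each `--archives` entry (`worker`, `udfs`) is the directory name that the two `--conf` values point to. As a sketch of how that value could be assembled, the hypothetical helper `archives_arg` below joins archive and alias pairs into the comma-separated form. The HDFS paths in the sample call are placeholders.

```shell
# Join "<uri> <alias>" pairs into a value for --archives: uri1#alias1,uri2#alias2
# Usage: archives_arg <uri> <alias> [<uri> <alias> ...]
archives_arg() {
  result=""
  while [ "$#" -ge 2 ]; do
    # Add a comma only when an earlier pair is already in the result.
    result="${result:+$result,}${1}#${2}"
    shift 2
  done
  printf '%s\n' "$result"
}

# Placeholder archive locations; the aliases match ./worker and ./udfs above.
archives_arg "hdfs:///apps/Microsoft.Spark.Worker.zip" worker \
             "hdfs:///apps/mySparkApp.zip" udfs
```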
