Getting Started with Blaze for Apache Spark
Build from source
To build Blaze, please follow the steps below:
- Install Rust
The native execution library is written in Rust, so you need a Rust (nightly) toolchain to compile it. We recommend installing it via rustup.
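For example, a minimal sketch using the official rustup installer (assuming a Unix-like shell):
```shell
# install rustup, the official Rust toolchain installer
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# install and select a nightly toolchain
rustup toolchain install nightly
rustup default nightly
```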
- Install Protobuf
Ensure protoc is available in your PATH. Protobuf can be installed via a Linux system package manager (or Homebrew on macOS), or downloaded and built manually from https://github.com/protocolbuffers/protobuf/releases .
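For example (package names assume Debian/Ubuntu or Homebrew; adjust for your system):
```shell
# Debian/Ubuntu
sudo apt-get install -y protobuf-compiler
# macOS
brew install protobuf
# verify protoc is on PATH
protoc --version
```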
- Install JDK+Maven
Blaze has been well tested on JDK 8 and Maven 3.5, and should work fine with higher versions.
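You can verify that both are installed and on your PATH:
```shell
# both commands should print a version banner
java -version
mvn -version
```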
- Check out the source code.
```shell
git clone git@github.com:kwai/blaze.git
cd blaze
```
- Build the project.
Specify the shims package for the Spark version you want to run on. The following shims are currently supported:
- spark-3.0 - for spark3.0.x
- spark-3.1 - for spark3.1.x
- spark-3.2 - for spark3.2.x
- spark-3.3 - for spark3.3.x
- spark-3.4 - for spark3.4.x
- spark-3.5 - for spark3.5.x
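If you are unsure which shim you need, check the Spark version of your client installation first:
```shell
# prints the Spark version, among other build information
spark-submit --version
```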
You can build Blaze either in pre mode for debugging or in release mode to unlock its full performance.
```shell
SHIM=spark-3.3 # or spark-3.0/spark-3.1/spark-3.2/spark-3.3/spark-3.4/spark-3.5
MODE=release # or pre
mvn package -P"${SHIM}" -P"${MODE}"
```
After the build finishes, a fat JAR containing all the dependencies is generated in the `target` directory.
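To locate it (the exact file name depends on the shim and version you built):
```shell
ls target/*.jar
```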
Build with Docker
You can use the following command to build a CentOS 7-compatible release:
```shell
SHIM=spark-3.3 MODE=release ./release-docker.sh
```
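This requires a working Docker installation on the build machine; you can verify with:
```shell
docker --version
```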
Run Spark Job with Blaze Accelerator
This section describes how to submit and configure a Spark job with Blaze support.
- Move the Blaze JAR package to the Spark client classpath (normally `spark-xx.xx.xx/jars/`).
- Add the following confs to the Spark configuration in `spark-xx.xx.xx/conf/spark-defaults.conf`:
```properties
spark.blaze.enable true
spark.sql.extensions org.apache.spark.sql.blaze.BlazeSparkSessionExtension
spark.shuffle.manager org.apache.spark.sql.execution.blaze.shuffle.BlazeShuffleManager
spark.memory.offHeap.enabled false

# suggested executor memory configuration
spark.executor.memory 4g
spark.executor.memoryOverhead 4096
```
- Submit a query with spark-sql, or other tools like spark-thriftserver:
```shell
spark-sql -f tpcds/q01.sql
```
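Alternatively, with the Blaze JAR already under `jars/`, the same settings can be passed per job instead of editing `spark-defaults.conf`; a minimal sketch:
```shell
spark-sql \
  --conf spark.blaze.enable=true \
  --conf spark.sql.extensions=org.apache.spark.sql.blaze.BlazeSparkSessionExtension \
  --conf spark.shuffle.manager=org.apache.spark.sql.execution.blaze.shuffle.BlazeShuffleManager \
  --conf spark.memory.offHeap.enabled=false \
  -f tpcds/q01.sql
```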