Setup Apache Spark

Published:
1 minute read

Instruction to setup an Apache Spark cluster - Standalone mode (Ubuntu 16.04)

First make sure you have installed all the dependencies in all nodes. (master and clients):

sudo apt-get install python-software-properties
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
sudo apt-get install scala

Create a user of same name in master and all slaves to make your tasks easier during ssh and also switch to that user in master.

Add entries in hosts file (master and slaves)

Edit hosts file.

sudo vim /etc/hosts

And add entries of master and slaves in hosts file

<MASTER-IP> master
<SLAVE01-IP> slave01
<SLAVE02-IP> slave02

Install Open SSH Server-Client and Generate key pairs, then copy your key to all nodes to make passwordless ssh available.

sudo apt-get install openssh-server openssh-client
ssh-keygen -t rsa -P ""
ssh-copy-id your-user@slave-01
ssh-copy-id your-user@slave-02

Download spark binaries, extract it and put it in (/usr/local/spark)

wget http://www-us.apache.org/dist/spark/spark-2.3.2/spark-2.3.2-bin-hadoop2.7.tgz
tar xvf spark-2.3.2-bin-hadoop2.7.tgz
sudo mv spark-2.3.2-bin-hadoop2.7 /usr/local/spark

Set up the environment for Spark

vim ~/.bashrc
# add these two lines at the end of the file(remove the #s):
# export SPARK_HOME="/usr/loca/spark"
# export PATH=$PATH:/usr/local/spark/bin
source ~/.bashrc

Make a copy of spark-env-template and Update its content

cd /usr/local/spark/conf
cp spark-env.sh.template spark-env.sh
vim spark-env.sh
# add these two lines to the file (remove the #s):
# export SPARK_MASTER_HOST='SERVER-IP'
# export JAVA_HOME=<JAVA_PATH>

Add these lines to the slave file (in the current directory):

master
slave01
slave02
mvn clean install