Set up Apache Spark
Instructions for setting up an Apache Spark cluster in standalone mode (Ubuntu 16.04).
First, make sure you have installed all the dependencies on all nodes (master and slaves):
sudo apt-get install software-properties-common
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
sudo apt-get install scala
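To confirm the Java and Scala installs succeeded, you can check their versions on each node:
java -version
scala -version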
Create a user with the same name on the master and on every slave to make passwordless SSH between nodes easier, then switch to that user on the master (a minimal example follows).
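For example, assuming a user named spark (the name is arbitrary; just keep it identical on every node):
sudo adduser spark
su - spark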
Add entries to the hosts file (master and slaves)
Edit the hosts file:
sudo vim /etc/hosts
and add entries for the master and all slaves:
<MASTER-IP> master
<SLAVE01-IP> slave01
<SLAVE02-IP> slave02
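Once saved, you can verify that the names resolve from each node, for example:
ping -c 1 slave01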
Configure SSH (required on the master; recommended on all nodes)
Install the OpenSSH server and client, generate a key pair, then copy your public key to every slave to enable passwordless SSH:
sudo apt-get install openssh-server openssh-client
ssh-keygen -t rsa -P ""
ssh-copy-id your-user@slave01
ssh-copy-id your-user@slave02
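You should now be able to log in to each slave without being prompted for a password, for example:
ssh your-user@slave01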
Download the Spark binaries, extract them, and move them to /usr/local/spark:
wget http://www-us.apache.org/dist/spark/spark-2.3.2/spark-2.3.2-bin-hadoop2.7.tgz
tar xvf spark-2.3.2-bin-hadoop2.7.tgz
sudo mv spark-2.3.2-bin-hadoop2.7 /usr/local/spark
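Because the directory was moved with sudo, it is owned by root. Optionally, give your Spark user ownership so the configuration steps below do not require sudo (your-user is a placeholder, as above):
sudo chown -R your-user /usr/local/spark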
Set up the environment for Spark
vim ~/.bashrc
# add these two lines at the end of the file (remove the #s):
# export SPARK_HOME="/usr/local/spark"
# export PATH=$PATH:/usr/local/spark/bin
source ~/.bashrc
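To confirm the environment is set and the Spark binaries are on your PATH, you can run:
echo $SPARK_HOME
spark-submit --version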
Make a copy of the spark-env template and update its contents:
cd /usr/local/spark/conf
cp spark-env.sh.template spark-env.sh
vim spark-env.sh
# add these two lines to the file (remove the #s):
# export SPARK_MASTER_HOST='<MASTER-IP>'
# export JAVA_HOME=<JAVA_PATH>
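For example, with the Oracle JDK 8 installed above, the finished lines might look like this (the IP address is a placeholder; use your master's real IP):
export SPARK_MASTER_HOST='192.168.1.10'
export JAVA_HOME=/usr/lib/jvm/java-8-oracle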
Make a copy of the slaves template (in the current directory) and add these lines to it; every node listed here, including the master, will run a worker:
cp slaves.template slaves
vim slaves
master
slave01
slave02
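Finally, with the configuration in place, you can start the whole cluster from the master using Spark's bundled scripts (this assumes Spark is installed at the same path on every slave; if not, copy /usr/local/spark to each slave first, e.g. with scp):
/usr/local/spark/sbin/start-all.sh
The standalone master's web UI should then be reachable at http://<MASTER-IP>:8080, where you can check that every worker has registered.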