- Visit the link below and choose the right version for your OS to download: https://www.virtualbox.org/wiki/Downloads
- Install. Follow the instructions based on your OS. If you are using Ubuntu, I recommend you download and install from the Ubuntu Software Center.
- Download the Ubuntu OS version 18: Click here to download
- Create a new VM in your Oracle VirtualBox and install Ubuntu 18. Follow the instructions.
- (Optional) After the installation, clone the newly installed VM (as a backup in case something goes wrong).
- Update your packages:
sudo apt-get update
- Install Git:
sudo apt-get install git -y
- Let's clone a repository on the desktop:
cd ~/Desktop
sudo git clone https://github.com/dseneh-eit/hadoop
- cd into the cloned repository and execute the bash script:
cd hadoop/
sudo bash install.sh
Wait for the installation to complete. Sometimes it takes a little longer.
- Test your installation:
jps
If you get Jps back, then congratulations! If not, source your .bash_profile and .bashrc files respectively:
source ~/.bash_profile
source ~/.bashrc
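As an extra, optional check (a suggestion, not part of the original script), you can also confirm the Hadoop binaries are on your PATH:
hadoop version
# prints the Hadoop release the install script set up, e.g. 2.7.3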
- In your terminal, paste the code below:
cd ~/opt
sudo wget http://archive.apache.org/dist/hive/hive-2.3.5/apache-hive-2.3.5-bin.tar.gz
- Unzip the downloaded file and rename the folder:
tar -xvf apache-hive-2.3.5-bin.tar.gz
sudo mv apache-hive-2.3.5-bin hive
- Let's open and edit the .bash_profile file:
sudo gedit ~/.bash_profile
- In your .bash_profile file, paste the following:
#HIVE_HOME
export HIVE_HOME=~/opt/hive
export PATH=$PATH:$HIVE_HOME/bin
- Source your .bash_profile file:
source ~/.bash_profile
- Give it a quick test with:
hive --version
You should get the version of Hive back.
- Next, we need to create some directories in HDFS. But before that, let's start our Hadoop cluster. If you have yours started already, skip this step:
start-all.sh
To verify that your cluster is running, run the following command:
jps
If all goes well, you should see the below (the order doesn't matter):
NameNode DataNode ResourceManager Jps NodeManager SecondaryNameNode
If you didn't get them all, then please check your configurations.
- Create directories and add permissions in HDFS:
hadoop fs -mkdir -p /user/hive/warehouse
hadoop fs -chmod g+w /user/hive/warehouse
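If you'd like a quick sanity check (optional, not in the original steps), list the new directory and confirm the group write bit is set:
hadoop fs -ls /user/hive
# the warehouse directory should appear with group write permission, e.g. drwxrwxr-x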
- cd into the Hive config folder and create/edit hive-env.sh:
cd ~/opt/hive/conf
sudo gedit hive-env.sh
- In the hive-env.sh file, find, uncomment, and replace the values of the following variables so they look like this:
export HADOOP_HOME=~/opt/hadoop-2.7.3
export HADOOP_HEAPSIZE=512
export HIVE_CONF_DIR=~/opt/hive/conf
- While still in ~/opt/hive/conf, create/edit hive-site.xml:
sudo gedit hive-site.xml
Paste the below and save:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=~/opt/hive/metastore_db;create=true</value>
    <description>JDBC connect string for a JDBC metastore.</description>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
    <description>location of default database for the warehouse</description>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
    <description>Thrift URI for the remote metastore.</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.PersistenceManagerFactoryClass</name>
    <value>org.datanucleus.api.jdo.JDOPersistenceManagerFactory</value>
    <description>class implementing the jdo persistence</description>
  </property>
  <property>
    <name>hive.server2.enable.doAs</name>
    <value>false</value>
  </property>
</configuration>
- (Optional) Since Hive and Kafka are running on the same system, you'll get a warning message about some SLF4J logging file. From your Hive home you can just rename the file:
cd ~/opt/hive
sudo mv lib/log4j-slf4j-impl-2.6.2.jar lib/log4j-slf4j-impl-2.6.2.jar.bak
- Now we need to create a database schema for Hive to work with, using schematool:
schematool -initSchema -dbType derby
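Optionally, you can ask schematool to report what it just created; this should print the Derby metastore schema version:
schematool -info -dbType derby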
- We are now ready to enter the Hive shell and create the database for holding tweets. First, we need to start the Hive Metastore server with the following command:
hive --service metastore
This should give some output that indicates that the metastore server is running. You'll need to keep this running, so open up a new terminal tab to continue with the next steps.
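If you want to double-check that the metastore came up, you can confirm something is listening on the Thrift port set in hive-site.xml (9083); this is just an optional sanity check:
ss -ltn | grep 9083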
- Now, leave the Hive service running, open a new tab, and start the Hive shell with the hive command:
hive
If you are able to get to this point: CONGRATULATIONS!
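As a quick smoke test, you can create a database from the shell. The name tweets below is only an example; use whatever your project expects. You can type the statements inside the interactive shell, or run them non-interactively like this:
hive -e "CREATE DATABASE IF NOT EXISTS tweets; SHOW DATABASES;"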
- First, let's update our packages:
sudo apt-get update
- Next, install MySQL server (enter root as the password when it prompts for one):
sudo apt-get install mysql-server
- Log in to MySQL (enter [YOUR PASSWORD] when prompted) and check the available default databases:
sudo mysql -u root -p
show databases;
- (Optional) Set the root user's password to 'root':
ALTER USER 'root'@'localhost' IDENTIFIED WITH mysql_native_password BY 'root';
- Install the MySQL connector:
sudo apt-get install libmysql-java
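To confirm MySQL and the connector are in place (assuming you set the root password to 'root' in the optional step above; otherwise use your own password), a quick check from the terminal:
sudo mysql -u root -proot -e "SHOW DATABASES;"
ls /usr/share/java | grep -i mysql   # libmysql-java usually drops the connector jar here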
- Let's cd into our opt folder and download HBase:
cd ~/opt
sudo wget http://archive.apache.org/dist/hbase/1.1.4/hbase-1.1.4-bin.tar.gz
- Unzip the .tar.gz file:
tar -xvf hbase-1.1.4-bin.tar.gz
- In your .bash_profile file, paste the following:
#HBASE_HOME
export HBASE_HOME=~/opt/hbase-1.1.4
export PATH=$PATH:$HBASE_HOME/bin
- Source your .bash_profile file:
source ~/.bash_profile
- cd into the HBase conf folder and edit the hbase-env.sh file:
cd ~/opt/hbase-1.1.4/conf/
sudo gedit hbase-env.sh
- In the hbase-env.sh file, find the export JAVA_HOME variable, uncomment it, and replace its value to look like this:
export JAVA_HOME=~/opt/jdk1.8.0_221
Also find and uncomment the following, then save and close the file:
export HBASE_REGIONSERVERS=${HBASE_HOME}/conf/regionservers
export HBASE_MANAGES_ZK=true
- While still in the HBase conf directory, also open and edit the hbase-site.xml file:
sudo gedit hbase-site.xml
- Paste the below between the <configuration> tags:
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://localhost:9000/hbase</value>
</property>
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>localhost</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>2181</value>
</property>
<property>
  <name>hbase.zookeeper.property.dataDir</name>
  <value>~/opt/hbase-1.1.4/zookeeper</value>
</property>
- Start the HBase daemons:
start-hbase.sh
To ensure everything is working, run the jps command and you should be able to get the following. If you didn't get them all, then please check your configurations:
HQuorumPeer HMaster HRegionServer
- To log in to the HBase shell:
hbase shell
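As an optional smoke test, you can create a small table, write one cell, scan it, and clean up. The table and column family names below are just examples, and the same statements can be typed interactively inside the shell:
hbase shell <<'EOF'
create 'smoke_test', 'cf'
put 'smoke_test', 'row1', 'cf:msg', 'hello'
scan 'smoke_test'
disable 'smoke_test'
drop 'smoke_test'
EOF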
- Let's first install pip for Linux:
sudo apt-get install python3-pip python-dev
- Verify the installation:
pip3 --version
- Let's create an airflow directory, and inside this directory, let's also create a dags directory. This is where we'll store our Python DAG files:
mkdir ~/airflow
cd ~/airflow
mkdir dags
- (Optional) Uninstall any old apache-airflow installations using pip:
sudo pip3 uninstall apache-airflow
- Install apache-airflow using pip:
sudo pip3 install apache-airflow
- Initialize the apache-airflow database (the default is SQLite):
airflow db init
- Create an admin user and password:
airflow users create \
    --username admin \
    --firstname [YOUR_FIRST_NAME] \
    --lastname [YOUR_LAST_NAME] \
    --role Admin \
    --email spiderman@superhero.org
- Open another terminal, start the web server, and let it run:
airflow webserver --port 8080
- Open another terminal, start the scheduler, and let it run:
airflow scheduler
- Visit localhost:8080 in the browser to access the GUI.
- Enter your username and password (from the user-creation step above).
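To confirm the scheduler and web server are wired up, you can drop a minimal DAG into the ~/airflow/dags directory from earlier. The file name, dag_id, and schedule below are placeholders, and this is only a sketch for an Airflow 2.x install like the one above. A quick way to create it from the terminal:
cat > ~/airflow/dags/hello_dag.py <<'EOF'
# Minimal example DAG (hypothetical) to verify the Airflow setup.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hello_dag",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # trigger it manually from the UI
    catchup=False,
) as dag:
    BashOperator(task_id="say_hello", bash_command="echo 'Airflow is working'")
EOF
After a minute or two it should show up in the DAG list at localhost:8080, where you can trigger it manually.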