Tuesday, November 27, 2018

How to install Hadoop Hive on CentOS7

How to install Hadoop Hive on CentOS7



What is Hive

Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
Initially Hive was developed by Facebook, later the Apache Software Foundation took it up and developed it further as an open source under the name Apache Hive. It is used by different companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Hive is not
  • A relational database
  • A design for OnLine Transaction Processing (OLTP)
  • A language for real-time queries and row-level updates
  • It stores schema in a database and processed data into HDFS.
  • It is designed for OLAP.
  • It provides SQL type language for querying called HiveQL or HQL.
  • It is familiar, fast, scalable, and extensible.
The following component diagram depicts the architecture of Hive:Hive Architecture

Requirements: Install JAVA and Hadoop

Apache Hive2.1 requires java 7 or later version. We also need to install hadoop first before installing apache hive on our system. In example below, I'm using java 8
$ java -version
java version "1.8.0_92"
Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.92-b14, mixed mode)



Download Hive package

After configuring hadoop successfully on your linux system. lets start hive setup. First download latest hive source code and extract archive using following commands.
$ cd /opt/hadoop
$ wget http://archive.apache.org/dist/hive/hive-2.1.0/apache-hive-2.1.0-bin.tar.gz
$ tar xzf apache-hive-2.1.0-bin.tar.gz
$ mv hive-2.1.0-bin hive
$ chown hadoop -R hive

Setup Environment Variables

Beside Hadoop environments, there are two more environment variables you need to add to ~.bashrc
# su - hadoop
export HIVE_HOME=/opt/hadoop/hive
export PATH=$HIVE_HOME/bin:$PATH

Preparation before starting Hive

Create /tmp and /user/hive/warehouse and set them chmod g+w in HDFS.

$ hdfs dfs -mkdir /tmp
$ hdfs dfs -mkdir /user/hive/warehouse
$ hdfs dfs -chmod g+w /tmp
$ hdfs dfs -chmod g+w /user/hive/warehouse

Hive metastore schema initialization

On Hive2.1.0, it's required to initialize metastore schema before hive starts, in the case below, I used db type derby

$ schematool -dbType derby -initSchemaTo 2.0.0 -verbose

which: no hbase in (/opt/hadoop/hive/bin:/usr/lib64/qt-3.3/bin:/bin:/usr/bin:/usr/X11R6/bin:/opt/local/bin:/usr/local/bin:/opt/hadoop/hadoop-2.7.3/sbin:/opt/hadoop/hadoop-2.7.3/bin:/home/hadoop/.local/bin:/home/hadoop/bin:/opt/hadoop/hadoop-2.7.3/sbin:/opt/hadoop/hadoop-2.7.3/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hadoop/hive/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL:     jdbc:derby:;databaseName=metastore_db;create=true
Metastore Connection Driver :     org.apache.derby.jdbc.EmbeddedDriver
Metastore connection User:     APP
Starting metastore schema initialization to 2.0.0
Initialization script hive-schema-2.0.0.derby.sql
Connecting to jdbc:derby:;databaseName=metastore_db;create=true
Connected to: Apache Derby (version 10.10.2.0 - (1582446))
Driver: Apache Derby Embedded JDBC Driver (version 10.10.2.0 - (1582446))
Transaction isolation: TRANSACTION_READ_COMMITTED
0: jdbc:derby:> !autocommit on
Autocommit status: true
0: jdbc:derby:> CREATE FUNCTION "APP"."NUCLEUS_ASCII" (C CHAR(1)) RETURNS INTEGER LANGUAGE JAVA PARAMETER STYLE JAVA READS SQL DATA CALLED ON NULL INPUT EXTERNAL NAME 'org.datanucleus.store.rdbms.adapter.DerbySQLFunction.ascii'
No rows affected (0.374 seconds)


Note:
You may have noticed that I initialized the schema to version 2.0.0 instead of 2.1.0, why?
This is caused old known problem of Hive, new release of Hive quite often missing the latest version of db schema support. To avoid that, just use previous release.
Here is the example error output when using version 2.1.0

$ schematool -dbType derby -initSchemaTo 2.1.0 -verbose

which: no hbase in (/opt/hadoop/hive/bin:/usr/lib64/qt-3.3/bin:/bin:/usr/bin:/usr/X11R6/bin:/opt/local/bin:/usr/local/bin:/opt/hadoop/hadoop-2.7.3/sbin:/opt/hadoop/hadoop-2.7.3/bin:/home/hadoop/.local/bin:/home/hadoop/bin:/opt/hadoop/hadoop-2.7.3/sbin:/opt/hadoop/hadoop-2.7.3/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hadoop/hive/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL:     jdbc:derby:;databaseName=metastore_db;create=true
Metastore Connection Driver :     org.apache.derby.jdbc.EmbeddedDriver
Metastore connection User:     APP
Starting metastore schema initialization to 2.1.0
org.apache.hadoop.hive.metastore.HiveMetaException: Unknown version specified for initialization: 2.1.0
org.apache.hadoop.hive.metastore.HiveMetaException: Unknown version specified for initialization: 2.1.0
    at org.apache.hadoop.hive.metastore.MetaStoreSchemaInfo.generateInitFileName(MetaStoreSchemaInfo.java:128)
    at org.apache.hive.beeline.HiveSchemaTool.doInit(HiveSchemaTool.java:282)
    at org.apache.hive.beeline.HiveSchemaTool.main(HiveSchemaTool.java:508)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
*** schemaTool failed ***

Tips:
In old Hive version, you don't have init metastore schema, so if you accidently ran hive before schema initialization, you may see error like below:
Error: FUNCTION 'NUCLEUS_ASCII' already exists. (state=X0Y68,code=30000)
Here is the fix for it
rm -f $HIVE_HOME/scripts/metastore/upgrade/derby/hive-schema-2.1.0.derby.sql 
then re init Hive metastore schema

Start Hive

Lets start using hive using following command.
$ hive
which: no hbase in (/opt/hadoop/hive/bin:/usr/lib64/qt-3.3/bin:/bin:/usr/bin:/usr/X11R6/bin:/opt/local/bin:/usr/local/bin:/opt/hadoop/hadoop-2.7.3/sbin:/opt/hadoop/hadoop-2.7.3/bin:/home/hadoop/.local/bin:/home/hadoop/bin:/opt/hadoop/hadoop-2.7.3/sbin:/opt/hadoop/hadoop-2.7.3/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hadoop/hive/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in jar:file:/opt/hadoop/hive/lib/hive-common-2.1.0.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
$ hive >

Create  test Table

At this stage you have successfully installed hive. Lets create a sample table using following command
hive>  CREATE TABLE test (id int, name string, age int);
OK
Time taken: 1.223 seconds
Show the created tables with below command.
hive> SHOW TABLES;
OK
test
Time taken: 0.431 seconds, Fetched: 1 row(s)
Drop the table using below command.
hive> DROP TABLE test;
OK
Time taken: 1.393 seconds
Now you have Hadoop Hive installed and configured on your Hadoop cluster!

http://fibrevillage.com/storage/649-how-to-install-hadoop-hive-on-rhel7-centos7


How to install Hadoop Hive on CentOS7

How to install Hadoop Hive on CentOS7 What is Hive Hive is a data warehouse infrastructure tool to process structured data in Had...