This guide provides step-by-step instructions to install and configure Apache Hive on a Linux system.
Before installing Apache Hive, ensure that the following prerequisites are met:
- Java Development Kit (JDK): Hive requires Java to run. Ensure that Java is installed and set up correctly.
- Hadoop: Hive runs on top of Hadoop. Ensure that Hadoop is installed and configured.
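The prerequisite check above can be sketched as a small shell helper. This is only an illustration: the function name `missing_cmds` is hypothetical, and it assumes that the `java` and `hadoop` launchers are expected on your `PATH`.

```shell
# Hypothetical helper: print every command from the argument list
# that cannot be found on PATH.
missing_cmds() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# Example (not executed here): list the prerequisites still missing.
# missing_cmds java hadoop
```

If the call prints nothing, both commands were found. It does not check versions or whether Hadoop is configured correctly.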
Start the Hadoop services:
start-dfs.sh
start-yarn.sh
Download the latest stable version of Apache Hive from the official website or using the following wget command:
wget https://dlcdn.apache.org/hive/hive-4.0.0/apache-hive-4.0.0-bin.tar.gz
Extract the downloaded Hive tarball:
tar -xzvf apache-hive-4.0.0-bin.tar.gz
Move the extracted Hive directory to your preferred installation path:
sudo mv apache-hive-4.0.0-bin /usr/local/hive
To ensure that Hive is accessible from any directory, add Hive to your system's environment variables.
- Open your .bashrc file:
  nano ~/.bashrc
- Add the following lines at the end of the file:
  # hive
  export HIVE_HOME=/usr/local/hive
  export PATH=$PATH:$HIVE_HOME/bin
  export CLASSPATH=$CLASSPATH:$HADOOP_HOME/lib/*:$HIVE_HOME/lib/*
- Save and close the file, then apply the changes:
  source ~/.bashrc
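To confirm that the new settings took effect, you could check whether the Hive bin directory is now on your `PATH`. The helper below is a minimal sketch, and the name `path_has_dir` is hypothetical. It compares entries literally, so a trailing slash or a symlinked path will not match.

```shell
# Hypothetical helper: succeed if directory $1 is one of the
# colon-separated entries in PATH (exact string match only).
path_has_dir() {
  case ":$PATH:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (not executed here):
# path_has_dir "$HIVE_HOME/bin" && echo "Hive is on PATH"
```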
Hive uses a metastore to store schema information. By default, Hive uses a Derby database, but for production, it's recommended to use MySQL or PostgreSQL.
- Create the necessary directories:
  mkdir -p $HIVE_HOME/hive_metastore
- Edit the hive-site.xml file (in $HIVE_HOME/conf) to adjust temporary file storage and username settings:
  a. Replace all occurrences of ${system:java.io.tmpdir} with /tmp/hive:
     - Use the Ctrl + \ keys to open the search and replace function in nano.
     - Enter ${system:java.io.tmpdir} as the search term.
     - Enter /tmp/hive as the replacement.
     - Confirm the replacements.
     This sets the location where Hive stores all its temporary files.
  b. Replace all occurrences of ${system:user.name} with hadoop (or your username):
     - Again, use Ctrl + \ to search and replace.
     - Enter ${system:user.name} as the search term.
     - Enter hadoop (or your username) as the replacement.
     - Confirm the replacements.
     This sets the username that Hive uses for its operations.
- Save and close the file: press Ctrl + S, or Ctrl + O followed by Enter, to save, then press Ctrl + X to exit.
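The two search-and-replace steps above can also be done non-interactively with sed. The function below is a sketch, not a step from this guide: the name `patch_hive_site` is hypothetical, and it assumes the replacement values contain no `|` or `&` characters, which sed would treat specially.

```shell
# Hypothetical helper: replace the two placeholders in a hive-site.xml file.
#   $1 = path to hive-site.xml
#   $2 = temporary directory (for example /tmp/hive)
#   $3 = username (for example hadoop)
patch_hive_site() {
  file="$1"
  tmp_dir="$2"
  user="$3"
  # Keep a backup copy before modifying the file.
  cp "$file" "$file.bak"
  # Write to a scratch file and move it into place, which avoids
  # relying on the non-portable "sed -i" option.
  sed -e 's|\${system:java\.io\.tmpdir}|'"$tmp_dir"'|g' \
      -e 's|\${system:user\.name}|'"$user"'|g' \
      "$file.bak" > "$file.patched" && mv "$file.patched" "$file"
}

# Example (not executed here):
# patch_hive_site "$HIVE_HOME/conf/hive-site.xml" /tmp/hive hadoop
```

The original file is kept next to it with a .bak suffix, so the change can be reverted by copying it back.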
Before proceeding with Hive, ensure that SSH, HDFS, and YARN services are running.
- Start the SSH service:
  sudo service ssh start
  ssh localhost
- Start the HDFS services:
  start-dfs.sh
- Start the YARN services:
  start-yarn.sh
- Verify that the services are running correctly by using the jps command to list Java processes:
  jps
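Reading the jps output by eye works, but the check can also be scripted. The sketch below assumes a single-node setup in which the NameNode, DataNode, ResourceManager, and NodeManager daemons all run on this machine. The function name `missing_daemons` is hypothetical.

```shell
# Hypothetical helper: read jps output on stdin ("<pid> <name>" per line)
# and print each expected Hadoop daemon that is not listed.
missing_daemons() {
  running="$(awk '{print $2}')"
  for daemon in NameNode DataNode ResourceManager NodeManager; do
    # -x matches whole lines, so SecondaryNameNode does not count as NameNode.
    printf '%s\n' "$running" | grep -qx "$daemon" || printf '%s\n' "$daemon"
  done
}

# Example (not executed here):
# jps | missing_daemons
```

No output means all four daemons were found. Adjust the list if your cluster layout differs.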
hdfs dfs -mkdir /
hdfs dfs -mkdir -p </path/to/folder/with parents>
Hive requires specific directories in HDFS to store data and temporary files.
Create the data warehouse directory:
hdfs dfs -mkdir -p /user/hive/warehouse
Create the temporary directory:
hdfs dfs -mkdir -p /user/tmp
Grant write permissions to the group for both directories:
hdfs dfs -chmod g+w /user/tmp
hdfs dfs -chmod g+w /user/hive/warehouse
Hive uses an RDBMS like Derby for efficient management, retrieval, and updating of metadata. This step initializes the Derby database for Hive:
Navigate to the Hive home directory:
cd $HIVE_HOME
Initialize the schema for Derby:
schematool -initSchema -dbType derby
These steps ensure that the necessary directories are created with the correct permissions and that Hive's Derby database is initialized for metadata management.
To verify that Hive is installed correctly, check the Hive version:
bin/hive --version
This command will display the version of Hive installed, confirming that the setup is correct.
Due to limitations with the traditional Hive CLI, it has been deprecated in favor of Beeline. Beeline allows you to connect to Hive from a remote server or local machine using the Hive JDBC connection string.
Launch Beeline and connect to Hive:
bin/beeline -u jdbc:hive2:// -n scott -p tiger
Replace scott and tiger with your actual Hive username and password. Beeline provides a more flexible and efficient way to interact with Hive, especially in distributed environments.
or
Start the Hive shell:
hive
You should see the Hive prompt, indicating that Hive is running and ready to execute queries.
If you no longer need the installation files, you can remove them:
rm apache-hive-4.0.0-bin.tar.gz
This hive.md provides a comprehensive guide to installing and configuring Apache Hive on a Linux system.
- Launch Hive CLI:
  hive
  This command opens the Hive Command Line Interface (CLI) for running Hive queries.
- Create a new database:
  CREATE DATABASE sample_db;
  This command creates a new Hive database named sample_db.
- Switch to a specific database:
  USE sample_db;
  This command switches the context to the sample_db database.
- Create a new table:
  CREATE TABLE sample_table (
    id INT,
    name STRING,
    age INT
  )
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  STORED AS TEXTFILE;
  This command creates a new table named sample_table with columns id, name, and age. The data is stored as a text file, with fields separated by commas.
- Load data into the table:
  LOAD DATA LOCAL INPATH '/path/to/datafile.csv' INTO TABLE sample_table;
  This command loads data from a local CSV file (/path/to/datafile.csv) into sample_table.
- Query the table:
  SELECT * FROM sample_table;
  This command retrieves all records from sample_table.
- Exit the Hive CLI:
  exit;
  This command exits the Hive Command Line Interface. Pressing Ctrl+C also ends the session.
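The load step above expects a comma-separated file whose columns match sample_table (id, name, age). If you have no data at hand, a small test file can be generated from the shell. The function name `write_sample_csv` and the three rows are made up purely for illustration.

```shell
# Hypothetical helper: write three illustrative rows (id,name,age)
# to the file given as $1. The values are invented sample data.
write_sample_csv() {
  cat > "$1" <<'EOF'
1,Alice,30
2,Bob,25
3,Carol,41
EOF
}

# Example (not executed here):
# write_sample_csv /path/to/datafile.csv
```

The file has no header row, because a header line would be loaded as an ordinary data row by the table definition shown above.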
To configure and run Hive using a specific IP address, follow these steps:
- Stop the Running Hive Session:
  If Hive is already running, stop the session by pressing Ctrl+C.
- Navigate to the Hive Home Directory:
  cd $HIVE_HOME
- Edit the hive-site.xml File:
  Open the hive-site.xml configuration file for editing:
  vi conf/hive-site.xml
- Set hive.server2.enable.doAs to False:
  In the hive-site.xml file, find the following property and set its value to false:
  <property>
    <name>hive.server2.enable.doAs</name>
    <value>false</value>
  </property>
- Modify the hive.conf.restricted.list Property:
  In the same hive-site.xml file, locate the hive.conf.restricted.list property and remove the value hive.users.in.admin.role. Then save the file.
  <property>
    <name>hive.conf.restricted.list</name>
    <!-- Remove the value "hive.users.in.admin.role" -->
  </property>
- Run HiveServer 2:
  Start HiveServer 2 by running the following command:
  $HIVE_HOME/bin/hiveserver2
Open a New Terminal Window:
Open a new terminal window to proceed with the next steps.
-
Navigate to the Hive Home Directory:
cd $HIVE_HOME
-
Start Beeline Hive with IP Address:
Launch Beeline and connect to Hive using the specific IP address:
bin/beeline -u jdbc:hive2://10.4.47.55:10000 hadoop
-
Access the Hive Command Line:
You should now have access to the Hive Command Line through Beeline.
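When you connect to several hosts, it can help to build the connection string in one place. The sketch below assembles a URL in the jdbc:hive2://host:port/database form. The function name `hive_jdbc_url` is hypothetical, and the fallbacks (port 10000, database default) are assumptions that match a stock HiveServer2 setup but may differ on your cluster.

```shell
# Hypothetical helper: print a HiveServer2 JDBC URL.
#   $1 = host or IP address
#   $2 = port (assumed default: 10000)
#   $3 = database (assumed default: default)
hive_jdbc_url() {
  host="$1"
  port="${2:-10000}"
  db="${3:-default}"
  printf 'jdbc:hive2://%s:%s/%s\n' "$host" "$port" "$db"
}

# Example (not executed here):
# bin/beeline -u "$(hive_jdbc_url 10.4.47.55)" -n hadoop
```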