Apache Hive Installation Guide

This guide provides step-by-step instructions to install and configure Apache Hive on a Linux system.

Step 1: Prerequisites

Before installing Apache Hive, ensure that the following prerequisites are met:

  • Java Development Kit (JDK): Hive requires Java to run. Ensure that Java is installed and set up correctly.
  • Hadoop: Hive runs on top of Hadoop. Ensure that Hadoop is installed and configured.

Check Java Installation

Confirm that Java is installed and available on your PATH:

java -version

Ensure Hadoop is Running

Start the Hadoop services:

start-dfs.sh
start-yarn.sh
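The prerequisite checks in this step can be bundled into a small helper. The following is a minimal sketch rather than part of the original guide; the function names have_cmd and check_prereqs are hypothetical, and it only tests whether each command is on the PATH, not whether the services are running.

```shell
# Return 0 if the given command is available on PATH, 1 otherwise.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Print one status line per command; return non-zero if any is missing.
check_prereqs() {
    missing=0
    for cmd in "$@"; do
        if have_cmd "$cmd"; then
            echo "ok: $cmd"
        else
            echo "missing: $cmd"
            missing=1
        fi
    done
    return "$missing"
}

# Example (not run here): check_prereqs java hadoop hdfs
```

A non-zero return value means at least one prerequisite still needs to be installed or added to the PATH before continuing.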

Step 2: Download Apache Hive

Download the latest stable version of Apache Hive from the official website or using the following wget command:

wget https://dlcdn.apache.org/hive/hive-4.0.0/apache-hive-4.0.0-bin.tar.gz
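It is usually worth verifying the download before extracting it. The sketch below assumes that you have obtained the expected SHA-256 digest from the official Apache Hive download page and that sha256sum (GNU coreutils) is available; the function name verify_sha256 is hypothetical.

```shell
# Compare a file's SHA-256 digest with an expected lowercase hex string.
# Returns 0 on a match, non-zero otherwise.
verify_sha256() {
    file=$1
    expected=$2
    actual=$(sha256sum "$file" | awk '{ print $1 }')
    [ "$actual" = "$expected" ]
}

# Example (not run here; <expected-digest> is a placeholder you must supply):
#   verify_sha256 apache-hive-4.0.0-bin.tar.gz "<expected-digest>" && echo "checksum ok"
```

If the digests differ, download the archive again rather than proceeding with the installation.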

Step 3: Extract the Downloaded Archive

Extract the downloaded Hive tarball:

tar -xzvf apache-hive-4.0.0-bin.tar.gz

Step 4: Move Hive to the Desired Directory

Move the extracted Hive directory to your preferred installation path:

sudo mv apache-hive-4.0.0-bin /usr/local/hive

Step 5: Set Up Environment Variables

To ensure that Hive is accessible from any directory, add Hive to your system's environment variables.

  1. Open your .bashrc file:

    nano ~/.bashrc
  2. Add the following lines at the end of the file:

    # hive
    export HIVE_HOME=/usr/local/hive
    export PATH=$PATH:$HIVE_HOME/bin
    export CLASSPATH=$CLASSPATH:$HADOOP_HOME/lib/*:$HIVE_HOME/lib/*
  3. Save and close the file, then apply the changes:

    source ~/.bashrc
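After reloading .bashrc, you may want to confirm that the Hive bin directory really is on the PATH. The following is a small sketch; path_contains is a hypothetical helper name, and it compares directory names literally without resolving symlinks.

```shell
# Return 0 if the given directory appears as an entry in PATH, 1 otherwise.
path_contains() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example (not run here):
#   path_contains "$HIVE_HOME/bin" && echo "Hive is on the PATH"
```

Wrapping PATH in colons lets the pattern match the first and last entries the same way as the ones in the middle.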

Step 6: Configure Hive Metastore

Hive uses a metastore to store schema information. By default, Hive uses a Derby database, but for production, it's recommended to use MySQL or PostgreSQL.

Configure Hive to Use Derby (Default)

  1. Create the necessary directories:

    mkdir -p $HIVE_HOME/hive_metastore
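  2. Edit $HIVE_HOME/conf/hive-site.xml to adjust the temporary file location and username settings. If the file does not exist yet, create it first by copying conf/hive-default.xml.template to conf/hive-site.xml.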

    a. Replace all occurrences of ${system:java.io.tmpdir} with /tmp/hive:

    • Use the Ctrl + \ keys to open the search and replace function in nano.
    • Enter ${system:java.io.tmpdir} as the search term.
    • Enter /tmp/hive as the replacement.
    • Confirm the replacements.

    This sets the location where Hive stores all its temporary files.

    b. Replace all occurrences of ${system:user.name} with hadoop (or your username):

    • Again, use Ctrl + \ to search and replace.
    • Enter ${system:user.name} as the search term.
    • Enter hadoop (or your username) as the replacement.
    • Confirm the replacements.

    This sets the correct username for Hive operations.

  3. Save the file by pressing Ctrl + O followed by Enter, then exit nano with Ctrl + X.
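The two search-and-replace passes above can also be done non-interactively with sed. This is a sketch under assumptions, not the guide's own method: patch_hive_site is a hypothetical function name, and the username must not contain the characters | or &, which are special in the sed expression used here.

```shell
# Replace the ${system:java.io.tmpdir} and ${system:user.name} placeholders
# in a hive-site.xml file. Arguments: path to the file, username to insert.
patch_hive_site() {
    file=$1
    user=$2
    tmp="$file.tmp.$$"
    # \$ matches a literal dollar sign; the braces are literal in basic regex.
    sed -e 's|\${system:java\.io\.tmpdir}|/tmp/hive|g' \
        -e 's|\${system:user\.name}|'"$user"'|g' \
        "$file" > "$tmp" && mv "$tmp" "$file"
}

# Example (not run here): patch_hive_site "$HIVE_HOME/conf/hive-site.xml" hadoop
```

Writing to a temporary file and renaming it avoids relying on sed -i, whose syntax differs between sed implementations. Keep a backup copy of hive-site.xml before running it.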

Step 7: Start SSH, HDFS, and YARN Services

Before proceeding with Hive, ensure that SSH, HDFS, and YARN services are running.

  1. Start the SSH service:

    sudo service ssh start
    ssh localhost
  2. Start the HDFS services:

    start-dfs.sh
  3. Start the YARN services:

    start-yarn.sh
  4. Verify that the services are running correctly by using the jps command to list Java processes:

    jps
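Reading the jps output by eye is easy to get wrong, so a short check can help. The sketch below assumes a single-node setup where NameNode, DataNode, ResourceManager, and NodeManager should all be running; missing_daemons is a hypothetical helper that takes the jps output as its first argument and prints the name of every expected daemon that is absent.

```shell
# Print each expected daemon name that does not appear in the jps output.
# Arguments: the jps output as one string, followed by the daemon names.
missing_daemons() {
    jps_output=$1
    shift
    for daemon in "$@"; do
        # jps prints "<pid> <name>"; compare the second field exactly so
        # that SecondaryNameNode is not mistaken for NameNode.
        if ! printf '%s\n' "$jps_output" |
            awk -v n="$daemon" '$2 == n { found = 1 } END { exit !found }'; then
            echo "$daemon"
        fi
    done
}

# Example (not run here):
#   missing_daemons "$(jps)" NameNode DataNode ResourceManager NodeManager
```

Empty output means every listed daemon was found; any printed name points to a service that did not start.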

Step 8: Set Up Hive Directories and Initialize Derby Database

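Check Directories on HDFS

List the HDFS root to confirm that the file system is reachable. If a directory you need is missing, create it together with its parent directories by using the -p option:

hdfs dfs -ls /
hdfs dfs -mkdir -p </path/to/folder/with parents>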

Create the Hive Data Warehouse and Temporary Directories on HDFS

Hive requires specific directories in HDFS to store data and temporary files.

Create the data warehouse directory:

hdfs dfs -mkdir -p /user/hive/warehouse

Create the temporary directory:

hdfs dfs -mkdir -p /user/tmp

Set Permissions for the Directories

Grant write permissions to the group for both directories:

hdfs dfs -chmod g+w /user/tmp
hdfs dfs -chmod g+w /user/hive/warehouse
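The directory creation and permission commands above can be grouped into one function. This is only a sketch: setup_hive_dirs and the HDFS_CMD variable are hypothetical names introduced here so that the commands can be previewed with a dry run before touching HDFS.

```shell
# HDFS_CMD defaults to the real client; set HDFS_CMD="echo hdfs" to print
# the commands instead of running them.
HDFS_CMD=${HDFS_CMD:-hdfs}

# Create the Hive warehouse and temporary directories and make them
# group-writable. Stops at the first command that fails.
setup_hive_dirs() {
    for dir in /user/hive/warehouse /user/tmp; do
        # HDFS_CMD is left unquoted on purpose so "echo hdfs" splits into words.
        $HDFS_CMD dfs -mkdir -p "$dir" || return 1
        $HDFS_CMD dfs -chmod g+w "$dir" || return 1
    done
}

# Example (not run here): HDFS_CMD="echo hdfs" setup_hive_dirs
```

Running it first with HDFS_CMD="echo hdfs" shows exactly which commands would be issued.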

Initialize the Derby Database

Hive keeps its metadata, such as table schemas and data locations, in a relational database; Derby is the default. This step initializes the Derby metastore schema for Hive:

Navigate to the Hive home directory:

cd $HIVE_HOME

Initialize the schema for Derby:

schematool -initSchema -dbType derby

These steps ensure that the necessary directories are created with the correct permissions and that Hive's Derby database is initialized for metadata management.

Step 9: Verify Hive Installation and Launch Query Shell

Check if Hive is Installed Correctly

To verify that Hive is installed correctly, check the Hive version:

bin/hive --version

This command will display the version of Hive installed, confirming that the setup is correct.

Launch the Hive Query Shell

The traditional Hive CLI has been deprecated in favor of Beeline. Beeline is a JDBC client: it connects to Hive through a JDBC connection string, either on the local machine or on a remote server.

Launch Beeline and connect to Hive:

bin/beeline -u jdbc:hive2:// -n scott -p tiger

Replace scott and tiger with your actual Hive username and password. A URL with no host, such as jdbc:hive2://, runs Hive in embedded mode inside the Beeline process; to reach a separate HiveServer2 instance, use jdbc:hive2://<host>:<port> instead.

Alternatively, start the Hive shell:

hive

You should see the Hive prompt, indicating that Hive is running and ready to execute queries.

Step 10: Clean Up

If you no longer need the installation files, you can remove them:

rm apache-hive-4.0.0-bin.tar.gz

This completes the installation and configuration of Apache Hive on a Linux system.

Sample Hive CLI Commands

  1. Launch Hive CLI:

    hive

    This command opens the Hive Command Line Interface (CLI) for running Hive queries.

  2. Create a new database:

    CREATE DATABASE sample_db;

    This command creates a new Hive database named sample_db.

  3. Switch to a specific database:

    USE sample_db;

    This command switches the context to the sample_db database.

  4. Create a new table:

    CREATE TABLE sample_table (
        id INT,
        name STRING,
        age INT
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    This command creates a new table named sample_table with columns id, name, and age. The data is stored as a text file, with fields separated by commas.

  5. Load data into the table:

    LOAD DATA LOCAL INPATH '/path/to/datafile.csv' INTO TABLE sample_table;

    This command loads data from a local CSV file (/path/to/datafile.csv) into the sample_table.

  6. Query the table:

    SELECT * FROM sample_table;

    This command retrieves all records from the sample_table.

  7. Exit the Hive CLI:

    exit;

    This command exits the Hive Command Line Interface. Pressing Ctrl+C also ends the session.
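To try the LOAD DATA step, you need a comma-separated file whose columns match sample_table (id, name, age). The sketch below writes such a file; the function name make_sample_csv and the three rows are invented purely for illustration.

```shell
# Write a small CSV file with id,name,age rows matching sample_table.
# Argument: path of the file to create. The rows are made-up sample data.
make_sample_csv() {
    cat > "$1" <<'EOF'
1,Alice,30
2,Bob,25
3,Carol,41
EOF
}

# Example (not run here): make_sample_csv /tmp/datafile.csv
```

The file has no header row because a plain TEXTFILE table, as defined above, would load a header line as an ordinary data row.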

Running Hive Using an IP Address

To configure and run Hive using a specific IP address, follow these steps:

  1. Stop the Running Hive Session:

    If Hive is already running, stop the session by pressing Ctrl+C.

  2. Navigate to the Hive Home Directory:

    cd $HIVE_HOME
  3. Edit the hive-site.xml File:

    Open the hive-site.xml configuration file for editing:

    vi conf/hive-site.xml
  4. Set hive.server2.enable.doAs to False:

    In the hive-site.xml file, find the following property and set its value to false:

    <property>
       <name>hive.server2.enable.doAs</name>
       <value>false</value> 
    </property>
  5. Modify the hive.conf.restricted.list Property:

    In the same hive-site.xml file, locate the hive.conf.restricted.list property and remove the value hive.users.in.admin.role. Then save the file.

    <property>
       <name>hive.conf.restricted.list</name>
       <!-- Remove the value "hive.users.in.admin.role" -->
    </property>
  6. Run HiveServer2:

    Start HiveServer2 by running the following command:

    $HIVE_HOME/bin/hiveserver2
  7. Open a New Terminal Window:

    Open a new terminal window to proceed with the next steps.

  8. Navigate to the Hive Home Directory:

    cd $HIVE_HOME
  9. Start Beeline Hive with IP Address:

    Launch Beeline and connect to Hive using the specific IP address:

    bin/beeline -u jdbc:hive2://10.4.47.55:10000 -n hadoop
  10. Access the Hive Command Line:

    You should now have access to the Hive Command Line through Beeline.
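When connecting to several hosts, it can be convenient to build the JDBC connection string from its parts. The following is a sketch only; hive_jdbc_url is a hypothetical helper, and it assumes HiveServer2 listens on its default port 10000 and that you want the default database unless told otherwise.

```shell
# Build a HiveServer2 JDBC URL.
# Arguments: host, optional port (default 10000), optional database (default "default").
hive_jdbc_url() {
    host=$1
    port=${2:-10000}
    db=${3:-default}
    printf 'jdbc:hive2://%s:%s/%s\n' "$host" "$port" "$db"
}

# Example (not run here):
#   bin/beeline -u "$(hive_jdbc_url 10.4.47.55)" -n hadoop
```

Replace the IP address with the address of the machine that runs HiveServer2.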