2/24/2020

Sqoop - Importing Data into Hadoop (Hive)



In my previous article, I walked through using Sqoop to import data into Hadoop (HDFS).  In this article, I'll walk through using Sqoop to import data into Hadoop (Hive).

Import to Hive Managed table


There are only a few args that need to be supplied in order to instruct the sqoop import command to import data directly to Hive:

  • --hive-import 
  • --create-hive-table 
  • --hive-database
  • --hive-overwrite

The following is an example command that will connect to Oracle and import data directly into Hive:

sqoop import --connect 'jdbc:oracle:thin:@//[Hostname:Port]/[ServiceName]' --username '[Login]' -P -m 1 --table [TableName] --hive-import --create-hive-table --hive-database [DBName]

You'll need to ensure the Hive database exists prior to running the import, or else you will get an error:
  • Error: Error while compiling statement: FAILED: SemanticException [Error 10072]: Database does not exist:
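
If the database doesn't exist yet, create it first.  A minimal sketch using Beeline (the HiveServer2 URL below is a placeholder; adjust for your cluster):

beeline -u 'jdbc:hive2://[HiveServer2Host:Port]/default' -e 'CREATE DATABASE IF NOT EXISTS [DBName];'
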
Sqoop completes the import task by running MapReduce jobs that import the data to HDFS, and then running Hive commands (CREATE TABLE / LOAD DATA INPATH) to move the data into Hive.  The default HDFS staging location is: /user/[login]/[TABLENAME].  If you have any issues during the import, you may need to remove the HDFS directory prior to re-running, or else you will get an error:


  •  ERROR tool.ImportTool: Import failed: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory ... already exists
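
If you do hit this, removing the stale staging directory before re-running clears it (a simple sketch; the path is whatever the error message reports):

hdfs dfs -rm -r /user/[login]/[TABLENAME]
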
Sqoop does clean up the temporary HDFS directory, and following a successful import the HDFS directory /user/[login]/[TABLENAME] should no longer exist.  Using Sqoop to import directly to Hive creates a Hive "managed" table.  Running describe on the Sqoop-created Hive table will provide you with the HDFS location where the data is stored.
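
For example, using Beeline (the HiveServer2 URL is again a placeholder), the Location field in the output shows where the managed table's data lives:

beeline -u 'jdbc:hive2://[HiveServer2Host:Port]/[DBName]' -e 'DESCRIBE FORMATTED [TableName];'
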

Replacing --create-hive-table with --hive-overwrite will overwrite the existing Hive table:

sqoop import --connect 'jdbc:oracle:thin:@//[Hostname:Port]/[ServiceName]' --username '[Login]' -P -m 1 --table [TableName] --hive-import --hive-database [DBName] --hive-overwrite

Sqoop completes this task the same way as the prior import; the only difference is that the Hive LOAD DATA INPATH command includes OVERWRITE INTO TABLE.
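
For reference, the statement Sqoop issues behind the scenes looks roughly like the following (an illustrative sketch, not the exact generated command):

LOAD DATA INPATH '/user/[login]/[TABLENAME]' OVERWRITE INTO TABLE [DBName].[TableName];
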

There are other args that can be used, like --target-dir, which lets you control the temporary directory where Sqoop stages the data in HDFS.  Again, just make sure the directory doesn't already exist.
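
For example, to stage the data under a custom path (the path below is just an example; any writable HDFS location that doesn't already exist will do):

# /tmp/sqoop-staging/[TableName] is an example staging path; choose any unused HDFS directory
sqoop import --connect 'jdbc:oracle:thin:@//[Hostname:Port]/[ServiceName]' --username '[Login]' -P -m 1 --table [TableName] --hive-import --hive-database [DBName] --hive-overwrite --target-dir /tmp/sqoop-staging/[TableName]
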

Import to Hive External table


It is important to note that you can accomplish the goal of importing data to a Hive external table without using any of the "hive" sqoop import args that we just went through.  This can be useful if you'd like the data to live in HDFS and be accessible by both Hive and Spark.  As of this writing, Spark is unable to read Hive managed tables, but it can read data from HDFS.

In order to import to a Hive external table, the first step is creating the external table in Hive.  It looks something like this:

CREATE EXTERNAL TABLE [TABLENAME] (col_name data_type...)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Running describe on the table will provide you with the HDFS directory where Hive expects the related data to live.  The default location will look like this:

hdfs://[ClusterName]/warehouse/tablespace/external/hive/[DBName].db/[TableName]
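
If you prefer, SHOW CREATE TABLE also prints the LOCATION clause, which can be copied straight into the Sqoop command below:

SHOW CREATE TABLE [DBName].[TableName];
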

From here you can run a standard Sqoop-to-HDFS import, using --target-dir to point at that location.  Because the import will generate an error if the target directory already exists, we'll also add --delete-target-dir:

sqoop import --connect 'jdbc:oracle:thin:@//[Hostname:Port]/[ServiceName]' --username '[Login]' -P -m 1 --table [TableName] --target-dir [HDFS Location] --delete-target-dir
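
Once the import finishes, a quick sanity check is to list the external table's directory and run a count through Hive (the Beeline URL is a placeholder for your cluster):

hdfs dfs -ls [HDFS Location]
beeline -u 'jdbc:hive2://[HiveServer2Host:Port]/[DBName]' -e 'SELECT COUNT(*) FROM [TableName];'
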

