2/17/2020

Sqoop - Importing Data into Hadoop (HDFS)



In this article, I'll walk through using Sqoop to import data into Hadoop (HDFS).

"Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases."

Validate Connectivity


In my environment I'm usually running Sqoop from a Linux shell and pulling data into Hadoop from a relational database management system (RDBMS) like Oracle or Microsoft SQL Server.  Before getting started with a new Sqoop project, I like to validate that I can connect to the RDBMS.  A good way to test connectivity is with curl:

curl telnet://[Hostname:Port] -v
If you connect successfully you should see a message like:
* Connected to...
If you fail to connect the message will be:
* Could not resolve host...
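If you want to script this check, curl's exit status distinguishes the two cases (0 on a successful connection, non-zero otherwise). A minimal sketch; the hostname in the example call is a hypothetical placeholder:

```shell
# check_port HOST PORT -- prints "reachable" or "unreachable" based on
# whether curl can open a TCP connection to HOST:PORT.
check_port() {
  if curl --silent --connect-timeout 5 -o /dev/null "telnet://$1:$2" < /dev/null; then
    echo "reachable"
  else
    echo "unreachable"
  fi
}

# Hypothetical database host and Oracle listener port:
check_port db.example.com 1521
```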

Once you've established that you can connect, you can perform some preliminary Sqoop actions.

Oracle:

This will validate that you can connect and run a simple query:
sqoop eval --connect 'jdbc:oracle:thin:@//[Hostname:Port]/[ServiceName]' --username '[Login]' -P --query "select 1 from dual"
-P indicates that you will be prompted for your password.

Assuming you have the necessary permissions, this one will list the databases, or schemas, in your Oracle environment:
sqoop list-databases --connect 'jdbc:oracle:thin:@//[Hostname:Port]/[ServiceName]' --username '[Login]' -P

Again, assuming you have the necessary permissions, this one will list the tables in the default schema of your Oracle environment:
sqoop list-tables --connect 'jdbc:oracle:thin:@//[Hostname:Port]/[ServiceName]' --username '[Login]' -P

Microsoft SQL Server:

This will validate that you can connect and run a simple query:
sqoop eval --connect 'jdbc:sqlserver://[Hostname];instanceName=[Instance];database=[DBName]' --query "select 1" --username '[Login]' -P

Assuming you have the necessary permissions, this one will list the tables in your default database:
sqoop list-tables --connect 'jdbc:sqlserver://[Hostname];instanceName=[Instance];database=[DBName]' --username '[Login]' -P

(There is no 'list-databases' equivalent in Sqoop for MS SQL Server)

Import to HDFS


OK, so we've validated that we can connect to an RDBMS host and have used Sqoop to further validate general access to the database.  Next, we'll run a manual, one-time data import to HDFS.

All we need to do is use the "sqoop import" command and pass in a table name:

sqoop import --connect 'jdbc:oracle:thin:@//[Hostname:Port]/[ServiceName]' --username '[Login]' -P --table [TableName]

Sqoop handles all the datatype conversions and drops your table into HDFS as a CSV file.  By default, Sqoop attempts to split the data coming from your RDBMS on the primary key so that it can run the import as multiple processes in parallel via MapReduce.  If your table has no primary key, or its key is a textual column, you may receive errors like the following:


  • ERROR tool.ImportTool: Import failed: java.io.IOException: Generating splits for a textual index column allowed only in case of "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" property passed as a parameter
  • ERROR tool.ImportTool: Import failed: No primary key could be found for table [TableName]. Please specify one with --split-by or perform a sequential import with '-m 1'.
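For the textual-index case, the property named in the first error can be passed straight through on the command line. A sketch using the same bracketed placeholders as the commands above; note that -D properties must come immediately after the tool name:

```shell
# Allow Sqoop to generate splits on a text column.  The -D property must
# appear right after 'import', before the other arguments.
sqoop import \
  -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
  --connect 'jdbc:oracle:thin:@//[Hostname:Port]/[ServiceName]' \
  --username '[Login]' -P \
  --table [TableName] \
  --split-by [TextColumn]
```

Be aware that splitting on a text column only distributes work evenly if the column's values are reasonably uniform.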

As stated in the second error message, this can be addressed by adding '-m 1' to the import command.  This tells Sqoop not to attempt to run the import in parallel and to run only a single process:

sqoop import --connect 'jdbc:oracle:thin:@//[Hostname:Port]/[ServiceName]' --username '[Login]' -P --table [TableName] -m 1
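If you'd rather keep the import parallel, the other option from that error message is --split-by, which names the column Sqoop should use to divide the work. A sketch, where [SplitColumn] is any column (ideally numeric and evenly distributed) you choose:

```shell
# Run 4 parallel map tasks, splitting the rows on [SplitColumn]
# instead of a primary key.
sqoop import \
  --connect 'jdbc:oracle:thin:@//[Hostname:Port]/[ServiceName]' \
  --username '[Login]' -P \
  --table [TableName] \
  --split-by [SplitColumn] \
  -m 4
```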

Another potential issue when using Oracle is that you need to pass the table name in upper case (Oracle stores unquoted identifiers that way).  If you don't, you'll receive this error:

  • ERROR tool.ImportTool: Import failed: There is no column found in the target table [TableName]. Please ensure that your table name is correct.

You may receive additional errors if you have insufficient rights to the HDFS directory, so make sure the permissions are correct.  The default HDFS location is /user/[Login]/

You can validate your import with the following HDFS command:
hdfs dfs -ls /user/[login]/[TABLENAME]

Here you may notice a file ending in .deflate.  As mentioned earlier, Sqoop uses MapReduce to execute the import, so the files that get written use the MapReduce defaults, and by default MapReduce compresses its output; that is where the .deflate extension comes from.  The number of files will equal the number of map tasks that ran during the import, so if you use '-m 1' you'll get 1 file.  In my test case I ended up with 4 files.  To read a .deflate file (part-m-00000.deflate, for example) directly from HDFS, use the following:

hdfs dfs -text /user/[login]/[TABLENAME]/part-m-00000.deflate

A couple of other arguments you can pass into the sqoop import command are a where clause and a target directory:

sqoop import --connect 'jdbc:oracle:thin:@//[Hostname:Port]/[ServiceName]' --username '[Login]' -P --table [TableName] -m 1 --where "[Predicate]" --target-dir [HDFS Location]

When running with --target-dir, make sure the target-dir does not exist or you will get an error.
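One way to handle a leftover directory from a previous run is to remove it before re-importing. A sketch; the --delete-target-dir flag is available in recent Sqoop releases and does the removal for you, so the manual rm is only needed on older versions:

```shell
# Manually clear a leftover target directory from a previous run:
hdfs dfs -rm -r [HDFS Location]

# Or let Sqoop remove it as part of the import (newer releases):
sqoop import \
  --connect 'jdbc:oracle:thin:@//[Hostname:Port]/[ServiceName]' \
  --username '[Login]' -P \
  --table [TableName] -m 1 \
  --target-dir [HDFS Location] \
  --delete-target-dir
```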

A really neat thing Sqoop can do is grab an entire database or schema (depending on which RDBMS you're using).  The only caveat is that every table must have a primary key, unless you run sequentially with '-m 1'.  Here's what that command looks like:

sqoop import-all-tables --connect 'jdbc:oracle:thin:@//[Hostname:Port]/[ServiceName]' --username '[Login]' -P -m 1

You can use the --exclude-tables argument to tell Sqoop to skip specific tables.  For example, when running against MS SQL Server you'll want to add: --exclude-tables sysdiagrams
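Putting that together for SQL Server, a sketch of a full import-all-tables command with the system table excluded (placeholders as before):

```shell
# Import every table in the default database except sysdiagrams,
# sequentially so tables without primary keys don't fail.
sqoop import-all-tables \
  --connect 'jdbc:sqlserver://[Hostname];instanceName=[Instance];database=[DBName]' \
  --username '[Login]' -P \
  --exclude-tables sysdiagrams \
  -m 1
```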