Sqoop - Scheduling and Security

In previous articles, I've walked through using Sqoop to import data into HDFS.  I've also detailed how to perform full and incremental imports into Hive external and Hive managed tables.

In this article I'm going to show you how to automate execution of Sqoop jobs via Cron.

However, before we get to scheduling, we need to address security.  In prior examples I've used -P to prompt the user for login credentials interactively.  With a scheduled job, this isn't going to work because nobody is there to type the password.  Fortunately, Sqoop provides the --password-alias argument, which lets us pass in a password stored in a protected keystore.

The tl;dr is:
  1. Create a credential in a keystore on HDFS (the command prompts you for the password to store):
    hadoop credential create My.password -provider jceks://hdfs/user/MyPassword.jceks
  2. Reference this password in your Sqoop import command:
    sqoop import -Dhadoop.security.credential.provider.path=jceks://hdfs/user/MyPassword.jceks ... --username '[Login]' --password-alias My.password ...
Now that we can pass in our credentials automatically, we can discuss how to automate the execution of Sqoop jobs.
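
Before relying on the alias in a scheduled job, it can be worth confirming that the keystore actually contains it.  Here is a minimal sketch that wraps "hadoop credential list" for that purpose.  The run helper and the DRY_RUN variable are names I made up for illustration (they are not part of Hadoop or Sqoop), and the provider path is simply the one from step 1.

```shell
#!/bin/sh
# Keystore location from step 1 above.
PROVIDER='jceks://hdfs/user/MyPassword.jceks'

# Illustrative helper: with DRY_RUN=1 it prints the command instead of
# running it, which is handy on a machine without a Hadoop client.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# List the aliases stored in the keystore; the alias created in step 1
# should show up in the output.
list_aliases() {
  run hadoop credential list -provider "$PROVIDER"
}
```

Calling list_aliases on a cluster node runs the real command, while DRY_RUN=1 list_aliases only prints what would be executed.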

The first way is very simple and should be familiar to any Linux user: Cron / Crontab.  There are a ton of articles out there that explain how to use Cron to schedule jobs in Linux.  For our purposes, it's as simple as creating a new file such as /home/[User]/MyFirstSqoopJob.sh, making it executable (chmod +x), and editing the file to look something like this:

sqoop import -Dhadoop.security.credential.provider.path=jceks://hdfs/user/MyPassword.jceks --connect 'jdbc:oracle:thin:@//[Hostname:Port]/[ServiceName]' --username '[Login]' --password-alias My.password --table [TableName] -m 1 --target-dir [HDFS Location] --delete-target-dir

You could also create a saved Sqoop job for an incremental import and edit the file to look like:

sqoop job --exec [JobName]
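
One thing to keep in mind is that a job started by Cron has no terminal, so anything Sqoop prints is either mailed to the job owner or lost, depending on how the machine is set up.  A small wrapper that appends the output to a log file makes failures much easier to diagnose.  The sketch below is one way to do it; run_logged is a helper name I chose for illustration, and the log file location is up to you.

```shell
#!/bin/sh
# run_logged is an illustrative helper (not part of Sqoop or Cron).
# It appends a command's output to a log file, framed by timestamped
# START/END markers, and returns the command's exit status.
run_logged() {
  log_file=$1
  shift
  status=0
  {
    printf '[%s] START %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*"
    # Capture the exit status without aborting the wrapper on failure.
    "$@" || status=$?
    printf '[%s] END exit=%s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$status"
  } >>"$log_file" 2>&1
  return "$status"
}
```

In MyFirstSqoopJob.sh, the Sqoop command would then be prefixed with the helper, for example: run_logged "$HOME/MyFirstSqoopJob.log" sqoop job --exec [JobName].  The log path here is only an assumption; any location the Cron user can write to will do.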

Next, configure Cron to execute MyFirstSqoopJob.sh:

crontab -e
0 8 * * * /home/[User]/MyFirstSqoopJob.sh

The schedule above runs the script every day at 8:00 AM.
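
If the five schedule fields look cryptic, it may help to see them split apart.  The snippet below is only a reading aid, not something Cron needs: describe_cron is a hypothetical helper that labels each field of a crontab entry in the standard order (minute, hour, day of month, month, day of week), followed by the command.

```shell
#!/bin/sh
# describe_cron is an illustrative helper, not part of Cron itself.
# It splits a crontab entry into its five schedule fields and the command.
describe_cron() {
  # read does not perform glob expansion, so the "*" fields stay literal.
  read -r minute hour dom month dow command <<EOF
$1
EOF
  printf 'minute=%s hour=%s day-of-month=%s month=%s day-of-week=%s\n' \
    "$minute" "$hour" "$dom" "$month" "$dow"
  printf 'command=%s\n' "$command"
}

describe_cron '0 8 * * * /home/[User]/MyFirstSqoopJob.sh'
```

For the entry used above, this reports minute 0 and hour 8, with * in the remaining three fields, which is Cron's way of saying "every day of the month, every month, every day of the week".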

Another way to schedule Sqoop jobs is via Oozie.  I'll discuss Oozie in my next blog post.