Importing data with Sqoop into a Hive external table with Avro encoding

I wanted to figure out how to import the contents of an RDBMS table into Hive with Avro encoding. During this process I wanted to use external Hive tables so that I have complete control over the location of the data files. The MySQL database I used is on the same machine as my Hortonworks Sandbox, and these are the steps I followed:

  • First, create the CUSTOMER table in MySQL:

    CREATE TABLE CUSTOMER (
      contactid INTEGER NOT NULL,
      firstname VARCHAR(50),
      lastname  VARCHAR(50),
      email     VARCHAR(50)
    );
  • After creating the table, add a couple of records to it by executing an insert statement like the following:
    insert into customer values(1,'Sachin','Tendulark','sachin@gmail.com');
  • The next step is to run a Sqoop import that downloads the records of the table into HDFS at /tmp/customer/sample. In the real world you might want to download only the first 10 records or so, because you just need a few sample records to generate the Avro schema:
    sqoop import --connect jdbc:mysql://localhost/test --table CUSTOMER --username sqoop1 --password sqoop -m 1 --create-hive-table --hive-table CONTACT --as-avrodatafile  --target-dir /tmp/customer/sample
  • The Sqoop command dumps the records into HDFS as an Avro data file, so first download the Avro file generated by Sqoop:
    hdfs dfs -get /tmp/customer/sample/part-m-00000.avro
  • Use avro-tools-*.jar to extract the schema from the file generated by Sqoop by executing the following command:
    java -jar avro-tools-1.7.5.jar getschema part-m-00000.avro > customer.avsc

    This is what the customer.avsc file looks like in my case:

    {  "type" : "record",  "name" : "CUSTOMER",  "doc" : "Sqoop import of CUSTOMER",  "fields" : [ {    "name" : "contactid",    "type" : [ "int", "null" ],    "columnName" : "contactid",    "sqlType" : "4"  }, {    "name" : "firstname",    "type" : [ "string", "null" ],    "columnName" : "firstname",    "sqlType" : "12"  }, {    "name" : "lastname",    "type" : [ "string", "null" ],    "columnName" : "lastname",    "sqlType" : "12"  }, {    "name" : "email",    "type" : [ "string", "null" ],    "columnName" : "email",    "sqlType" : "12"  } ],  "tableName" : "CUSTOMER"}
  • The next step is to upload the Avro schema file that you created in the last step back to HDFS. In my case I had an HDFS folder called /tmp/customer/schema, and I uploaded the schema file into it:
    hdfs dfs -put customer.avsc /tmp/customer/schema/
  • Now go to Hive and execute the following command to define the external CUSTOMER table with the Avro schema uploaded in the last step:
    CREATE EXTERNAL TABLE CUSTOMER
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    STORED AS
      INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    LOCATION '/tmp/customer/data'
    TBLPROPERTIES ('avro.schema.url'='hdfs:///tmp/customer/schema/customer.avsc');
  • The last step is to run Sqoop again, but this time importing all the data into the external directory that the CUSTOMER Hive table points to:
    sqoop import --connect jdbc:mysql://localhost/test --table CUSTOMER --username sqoop1 --password sqoop -m 1 --as-avrodatafile  --target-dir /tmp/customer/data --compression-codec snappy
  • Now if you run a select query on the CUSTOMER table in Hive, you should get back all the data that you see in your RDBMS; see the examples below.
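
Before querying, you can quickly confirm that the second Sqoop run wrote Avro data files into the location the external table points to. This is just a sanity check; the path matches the --target-dir used above, and the exact part-m-* file names will vary:

    hdfs dfs -ls /tmp/customer/data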
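
To verify from Hive itself, a couple of simple queries against the external table should return the rows imported from MySQL. This is a minimal sketch; CUSTOMER is the external table defined above:

    -- should return every record that Sqoop copied into /tmp/customer/data
    SELECT * FROM CUSTOMER;
    -- the count should match the number of rows in the MySQL CUSTOMER table
    SELECT COUNT(*) FROM CUSTOMER;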