Powered By Blogger

Saturday, November 2, 2019

Sqoop file format


Importing fileformat

By default it will come as text file format.
We can import as text file, sequence file,
as avro file
as parquet file.

By default orc is not supported, we need to make
extra changes


[cloudera@quickstart ~]$ sqoop import --connect jdbc:mysql://quickstart.cloudera:3306/retail_db --username root --password cloudera --table order_items --warehouse-dir /user/training/sqoop_import/retail_db --delete-target-dir --num-mappers 8 --as-avrodatafile
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/11/02 02:53:49 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.0
19/11/02 02:53:49 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
19/11/02 02:53:50 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
19/11/02 02:53:50 INFO tool.CodeGenTool: Beginning code generation
19/11/02 02:53:50 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `order_items` AS t LIMIT 1
19/11/02 02:53:51 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `order_items` AS t LIMIT 1
19/11/02 02:53:51 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-cloudera/compile/4679aa752358218113a2b16c00eef2f1/order_items.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
19/11/02 02:53:54 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/4679aa752358218113a2b16c00eef2f1/order_items.jar
19/11/02 02:53:56 INFO tool.ImportTool: Destination directory /user/training/sqoop_import/retail_db/order_items deleted.
19/11/02 02:53:56 WARN manager.MySQLManager: It looks like you are importing from mysql.
19/11/02 02:53:56 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
19/11/02 02:53:56 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
19/11/02 02:53:56 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
19/11/02 02:53:56 INFO mapreduce.ImportJobBase: Beginning import of order_items
19/11/02 02:53:56 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
19/11/02 02:53:56 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
19/11/02 02:53:56 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `order_items` AS t LIMIT 1
19/11/02 02:53:56 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `order_items` AS t LIMIT 1
19/11/02 02:53:56 INFO mapreduce.DataDrivenImportJob: Writing Avro schema file: /tmp/sqoop-cloudera/compile/4679aa752358218113a2b16c00eef2f1/order_items.avsc
19/11/02 02:53:56 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
19/11/02 02:53:57 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/11/02 02:54:04 INFO db.DBInputFormat: Using read commited transaction isolation
19/11/02 02:54:04 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`order_item_id`), MAX(`order_item_id`) FROM `order_items`
19/11/02 02:54:04 INFO db.IntegerSplitter: Split size: 21524; Num splits: 8 from: 1 to: 172198
19/11/02 02:54:04 INFO mapreduce.JobSubmitter: number of splits:8
19/11/02 02:54:04 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1572629054486_0006
19/11/02 02:54:05 INFO impl.YarnClientImpl: Submitted application application_1572629054486_0006
19/11/02 02:54:05 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1572629054486_0006/
19/11/02 02:54:05 INFO mapreduce.Job: Running job: job_1572629054486_0006
19/11/02 02:54:16 INFO mapreduce.Job: Job job_1572629054486_0006 running in uber mode : false
19/11/02 02:54:16 INFO mapreduce.Job:  map 0% reduce 0%
19/11/02 02:54:44 INFO mapreduce.Job:  map 13% reduce 0%
19/11/02 02:54:45 INFO mapreduce.Job:  map 25% reduce 0%
19/11/02 02:54:51 INFO mapreduce.Job:  map 38% reduce 0%
19/11/02 02:54:55 INFO mapreduce.Job:  map 50% reduce 0%
19/11/02 02:54:56 INFO mapreduce.Job:  map 75% reduce 0%
19/11/02 02:55:06 INFO mapreduce.Job:  map 100% reduce 0%
19/11/02 02:55:07 INFO mapreduce.Job: Job job_1572629054486_0006 completed successfully
19/11/02 02:55:08 INFO mapreduce.Job: Counters: 31
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=1376832
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1028
HDFS: Number of bytes written=3939042
HDFS: Number of read operations=32
HDFS: Number of large read operations=0
HDFS: Number of write operations=16
Job Counters
Killed map tasks=1
Launched map tasks=8
Other local map tasks=8
Total time spent by all maps in occupied slots (ms)=226848
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=226848
Total vcore-milliseconds taken by all map tasks=226848
Total megabyte-milliseconds taken by all map tasks=232292352
Map-Reduce Framework
Map input records=172198
Map output records=172198
Input split bytes=1028
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=1970
CPU time spent (ms)=46110
Physical memory (bytes) snapshot=1955717120
Virtual memory (bytes) snapshot=12548964352
Total committed heap usage (bytes)=1317535744
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=3939042
19/11/02 02:55:08 INFO mapreduce.ImportJobBase: Transferred 3.7566 MB in 71.0198 seconds (54.1641 KB/sec)
19/11/02 02:55:08 INFO mapreduce.ImportJobBase: Retrieved 172198 records.
[cloudera@quickstart ~]$ hdfs dfs -ls /user/training/sqoop_import/retail_db/order_itemsFound 9 items
-rw-r--r--   1 cloudera supergroup          0 2019-11-02 02:55 /user/training/sqoop_import/retail_db/order_items/_SUCCESS
-rw-r--r--   1 cloudera supergroup     467151 2019-11-02 02:54 /user/training/sqoop_import/retail_db/order_items/part-m-00000.avro
-rw-r--r--   1 cloudera supergroup     496015 2019-11-02 02:54 /user/training/sqoop_import/retail_db/order_items/part-m-00001.avro
-rw-r--r--   1 cloudera supergroup     496016 2019-11-02 02:54 /user/training/sqoop_import/retail_db/order_items/part-m-00002.avro
-rw-r--r--   1 cloudera supergroup     496017 2019-11-02 02:54 /user/training/sqoop_import/retail_db/order_items/part-m-00003.avro
-rw-r--r--   1 cloudera supergroup     496010 2019-11-02 02:54 /user/training/sqoop_import/retail_db/order_items/part-m-00004.avro
-rw-r--r--   1 cloudera supergroup     495999 2019-11-02 02:54 /user/training/sqoop_import/retail_db/order_items/part-m-00005.avro
-rw-r--r--   1 cloudera supergroup     495951 2019-11-02 02:55 /user/training/sqoop_import/retail_db/order_items/part-m-00006.avro
-rw-r--r--   1 cloudera supergroup     495883 2019-11-02 02:55 /user/training/sqoop_import/retail_db/order_items/part-m-00007.avro
[cloudera@quickstart ~]$

Observe the files created are  in avro format.


We can use compression
--compress : gzip compression will be used
We can mention explictly the compression codec

--compression-codec

[cloudera@quickstart ~]$ sqoop import --connect jdbc:mysql://quickstart.cloudera:3306/retail_db --username root --password cloudera --table order_items --warehouse-dir /user/training/sqoop_import/retail_db --delete-target-dir --num-mappers 8 --compress
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/11/02 03:00:00 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.0
19/11/02 03:00:00 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
19/11/02 03:00:00 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
19/11/02 03:00:00 INFO tool.CodeGenTool: Beginning code generation
19/11/02 03:00:01 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `order_items` AS t LIMIT 1
19/11/02 03:00:01 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `order_items` AS t LIMIT 1
19/11/02 03:00:01 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-cloudera/compile/7415c1a3098a34bab10f1b3a218e0272/order_items.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
19/11/02 03:00:04 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/7415c1a3098a34bab10f1b3a218e0272/order_items.jar
19/11/02 03:00:06 INFO tool.ImportTool: Destination directory /user/training/sqoop_import/retail_db/order_items deleted.
19/11/02 03:00:06 WARN manager.MySQLManager: It looks like you are importing from mysql.
19/11/02 03:00:06 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
19/11/02 03:00:06 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
19/11/02 03:00:06 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
19/11/02 03:00:06 INFO mapreduce.ImportJobBase: Beginning import of order_items
19/11/02 03:00:06 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
19/11/02 03:00:06 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
19/11/02 03:00:06 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
19/11/02 03:00:06 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/11/02 03:00:12 INFO db.DBInputFormat: Using read commited transaction isolation
19/11/02 03:00:12 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`order_item_id`), MAX(`order_item_id`) FROM `order_items`
19/11/02 03:00:12 INFO db.IntegerSplitter: Split size: 21524; Num splits: 8 from: 1 to: 172198
19/11/02 03:00:12 INFO mapreduce.JobSubmitter: number of splits:8
19/11/02 03:00:13 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1572629054486_0007
19/11/02 03:00:13 INFO impl.YarnClientImpl: Submitted application application_1572629054486_0007
19/11/02 03:00:13 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1572629054486_0007/
19/11/02 03:00:13 INFO mapreduce.Job: Running job: job_1572629054486_0007
19/11/02 03:00:23 INFO mapreduce.Job: Job job_1572629054486_0007 running in uber mode : false
19/11/02 03:00:23 INFO mapreduce.Job:  map 0% reduce 0%
19/11/02 03:00:50 INFO mapreduce.Job:  map 13% reduce 0%
19/11/02 03:00:52 INFO mapreduce.Job:  map 25% reduce 0%
19/11/02 03:00:53 INFO mapreduce.Job:  map 38% reduce 0%
19/11/02 03:00:56 INFO mapreduce.Job:  map 50% reduce 0%
19/11/02 03:00:58 INFO mapreduce.Job:  map 63% reduce 0%
19/11/02 03:00:59 INFO mapreduce.Job:  map 75% reduce 0%
19/11/02 03:01:09 INFO mapreduce.Job:  map 88% reduce 0%
19/11/02 03:01:10 INFO mapreduce.Job:  map 100% reduce 0%
19/11/02 03:01:11 INFO mapreduce.Job: Job job_1572629054486_0007 completed successfully
19/11/02 03:01:11 INFO mapreduce.Job: Counters: 31
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=1371960
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1028
HDFS: Number of bytes written=1031191
HDFS: Number of read operations=32
HDFS: Number of large read operations=0
HDFS: Number of write operations=16
Job Counters
Killed map tasks=1
Launched map tasks=8
Other local map tasks=8
Total time spent by all maps in occupied slots (ms)=200085
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=200085
Total vcore-milliseconds taken by all map tasks=200085
Total megabyte-milliseconds taken by all map tasks=204887040
Map-Reduce Framework
Map input records=172198
Map output records=172198
Input split bytes=1028
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=1338
CPU time spent (ms)=37500
Physical memory (bytes) snapshot=1616883712
Virtual memory (bytes) snapshot=12531953664
Total committed heap usage (bytes)=1187512320
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=1031191
19/11/02 03:01:11 INFO mapreduce.ImportJobBase: Transferred 1,007.0225 KB in 65.0638 seconds (15.4775 KB/sec)
19/11/02 03:01:11 INFO mapreduce.ImportJobBase: Retrieved 172198 records.
[cloudera@quickstart ~]$ hdfs dfs -ls /user/training/sqoop_import/retail_db/order_itemsFound 9 items
-rw-r--r--   1 cloudera supergroup          0 2019-11-02 03:01 /user/training/sqoop_import/retail_db/order_items/_SUCCESS
-rw-r--r--   1 cloudera supergroup     127703 2019-11-02 03:00 /user/training/sqoop_import/retail_db/order_items/part-m-00000.gz
-rw-r--r--   1 cloudera supergroup     130033 2019-11-02 03:00 /user/training/sqoop_import/retail_db/order_items/part-m-00001.gz
-rw-r--r--   1 cloudera supergroup     129596 2019-11-02 03:00 /user/training/sqoop_import/retail_db/order_items/part-m-00002.gz
-rw-r--r--   1 cloudera supergroup     129103 2019-11-02 03:00 /user/training/sqoop_import/retail_db/order_items/part-m-00003.gz
-rw-r--r--   1 cloudera supergroup     129773 2019-11-02 03:00 /user/training/sqoop_import/retail_db/order_items/part-m-00004.gz
-rw-r--r--   1 cloudera supergroup     130172 2019-11-02 03:00 /user/training/sqoop_import/retail_db/order_items/part-m-00005.gz
-rw-r--r--   1 cloudera supergroup     128645 2019-11-02 03:01 /user/training/sqoop_import/retail_db/order_items/part-m-00006.gz
-rw-r--r--   1 cloudera supergroup     126166 2019-11-02 03:01 /user/training/sqoop_import/retail_db/order_items/part-m-00007.gz
[cloudera@quickstart ~]$ sqoop import --connect jdbc:mysql://quickstart.cloudera:3306/retail_db --username root --password cloudera --table order_items --warehouse-dir /user/training/sqoop_import/retail_db --delete-target-dir --num-mappers 8 --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/11/02 03:02:54 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.0
19/11/02 03:02:54 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
19/11/02 03:02:55 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
19/11/02 03:02:55 INFO tool.CodeGenTool: Beginning code generation
19/11/02 03:02:55 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `order_items` AS t LIMIT 1
19/11/02 03:02:55 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `order_items` AS t LIMIT 1
19/11/02 03:02:55 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-cloudera/compile/0cb7f59c5efc5021dccf99b0e64ccb2f/order_items.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
19/11/02 03:02:59 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/0cb7f59c5efc5021dccf99b0e64ccb2f/order_items.jar
19/11/02 03:03:01 INFO tool.ImportTool: Destination directory /user/training/sqoop_import/retail_db/order_items deleted.
19/11/02 03:03:01 WARN manager.MySQLManager: It looks like you are importing from mysql.
19/11/02 03:03:01 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
19/11/02 03:03:01 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
19/11/02 03:03:01 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
19/11/02 03:03:01 INFO mapreduce.ImportJobBase: Beginning import of order_items
19/11/02 03:03:01 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
19/11/02 03:03:01 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
19/11/02 03:03:01 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
19/11/02 03:03:01 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/11/02 03:03:09 INFO db.DBInputFormat: Using read commited transaction isolation
19/11/02 03:03:09 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`order_item_id`), MAX(`order_item_id`) FROM `order_items`
19/11/02 03:03:09 INFO db.IntegerSplitter: Split size: 21524; Num splits: 8 from: 1 to: 172198
19/11/02 03:03:09 INFO mapreduce.JobSubmitter: number of splits:8
19/11/02 03:03:09 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1572629054486_0008
19/11/02 03:03:09 INFO impl.YarnClientImpl: Submitted application application_1572629054486_0008
19/11/02 03:03:10 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1572629054486_0008/
19/11/02 03:03:10 INFO mapreduce.Job: Running job: job_1572629054486_0008
19/11/02 03:03:19 INFO mapreduce.Job: Job job_1572629054486_0008 running in uber mode : false
19/11/02 03:03:19 INFO mapreduce.Job:  map 0% reduce 0%
19/11/02 03:03:41 INFO mapreduce.Job:  map 13% reduce 0%
19/11/02 03:03:46 INFO mapreduce.Job:  map 25% reduce 0%
19/11/02 03:03:52 INFO mapreduce.Job:  map 50% reduce 0%
19/11/02 03:03:54 INFO mapreduce.Job:  map 63% reduce 0%
19/11/02 03:03:55 INFO mapreduce.Job:  map 75% reduce 0%
19/11/02 03:04:04 INFO mapreduce.Job:  map 100% reduce 0%
19/11/02 03:04:06 INFO mapreduce.Job: Job job_1572629054486_0008 completed successfully
19/11/02 03:04:06 INFO mapreduce.Job: Counters: 31
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=1373528
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1028
HDFS: Number of bytes written=1877875
HDFS: Number of read operations=32
HDFS: Number of large read operations=0
HDFS: Number of write operations=16
Job Counters
Killed map tasks=1
Launched map tasks=8
Other local map tasks=8
Total time spent by all maps in occupied slots (ms)=198994
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=198994
Total vcore-milliseconds taken by all map tasks=198994
Total megabyte-milliseconds taken by all map tasks=203769856
Map-Reduce Framework
Map input records=172198
Map output records=172198
Input split bytes=1028
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=946
CPU time spent (ms)=35350
Physical memory (bytes) snapshot=1830961152
Virtual memory (bytes) snapshot=12676071424
Total committed heap usage (bytes)=1250426880
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=1877875
19/11/02 03:04:06 INFO mapreduce.ImportJobBase: Transferred 1.7909 MB in 64.9677 seconds (28.2273 KB/sec)
19/11/02 03:04:06 INFO mapreduce.ImportJobBase: Retrieved 172198 records.
[cloudera@quickstart ~]$ hdfs dfs -ls /user/training/sqoop_import/retail_db/order_itemsFound 9 items
-rw-r--r--   1 cloudera supergroup          0 2019-11-02 03:04 /user/training/sqoop_import/retail_db/order_items/_SUCCESS
-rw-r--r--   1 cloudera supergroup     233103 2019-11-02 03:03 /user/training/sqoop_import/retail_db/order_items/part-m-00000.snappy
-rw-r--r--   1 cloudera supergroup     235883 2019-11-02 03:03 /user/training/sqoop_import/retail_db/order_items/part-m-00001.snappy
-rw-r--r--   1 cloudera supergroup     236049 2019-11-02 03:03 /user/training/sqoop_import/retail_db/order_items/part-m-00002.snappy
-rw-r--r--   1 cloudera supergroup     236247 2019-11-02 03:03 /user/training/sqoop_import/retail_db/order_items/part-m-00003.snappy
-rw-r--r--   1 cloudera supergroup     235934 2019-11-02 03:03 /user/training/sqoop_import/retail_db/order_items/part-m-00004.snappy
-rw-r--r--   1 cloudera supergroup     236833 2019-11-02 03:03 /user/training/sqoop_import/retail_db/order_items/part-m-00005.snappy
-rw-r--r--   1 cloudera supergroup     233914 2019-11-02 03:04 /user/training/sqoop_import/retail_db/order_items/part-m-00006.snappy
-rw-r--r--   1 cloudera supergroup     229912 2019-11-02 03:04 /user/training/sqoop_import/retail_db/order_items/part-m-00007.snappy
[cloudera@quickstart ~]$ 

No comments:

Post a Comment