sqoop
When we connect to hadoop, we connect to edge node and that internally
takes care of executing stuff in the cluster.
connecting to mysql from hadoop
sqoop-list-databases \
--connect "jdbc:mysql://quickstart.cloudera:3306" \
--username retail_dba \
--password cloudera
[cloudera@quickstart ~]$ hdfs dfs -ls
Found 1 items
drwxr-xr-x - cloudera cloudera 0 2019-11-03 03:58 _sqoop
[cloudera@quickstart ~]$ sqoop-list-databases \
> --connect "jdbc:mysql://quickstart.cloudera:3306" \
> --username retail_dba \
> --password cloudera
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/11/29 21:49:45 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.0
19/11/29 21:49:45 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
19/11/29 21:49:45 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
information_schema
retail_db
[cloudera@quickstart ~]$
now try to see the tables using the root user to see all tables listed.
sqoop-list-tables \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera
[cloudera@quickstart ~]$ sqoop-list-tables \
> --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
> --username retail_dba \
> --password cloudera
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/11/29 21:56:50 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.0
19/11/29 21:56:50 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
19/11/29 21:56:50 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
categories
customers
departments
order_items
orders
products
[cloudera@quickstart ~]$
sqoop-eval \
--connect "jdbc:mysql://quickstart.cloudera:3306" \
--username retail_dba \
--password cloudera \
--query "select * from retail_db.customers limit 10"
https://sqoop.apache.org/docs/1.4.4/SqoopUserGuide.html#_controlling_the_import_process
sqoop import \
-Dgapreduce.job.queuename=${job_queue_nm} \
-Dmapreduce.job.queuename=${job_queue_nm} \
-Dmapred.job.queuename=${job_queue_nm} \
--options-file ${goa_lkup_dir}/connect.txt \
--delete-target-dir \
--target-dir ${landingdir}/order_node_allocation_reason_e \
--hive-drop-import-delims \
--query "select nearest_node_json from order_node_allocation_reason \
where insert_time >= '$strt_ts' and \$CONDITIONS" \
--null-string '\\N' --null-non-string '\\N' \
--num-mappers ${nummapper} \
--fields-terminated-by '|' \
--lines-terminated-by '\n'
create a table in sql and insert some data
CREATE TABLE people (PersonID int,LastName varchar(255),FirstName varchar(255),Address varchar(255),City varchar(255));
insert into people values(1,'patil', 'basan', 'address','bangalore');
insert into people values(2,'patil2', 'basan2', 'address2','bangalore2');
insert into people values(3,'patil3', 'basan3', 'address3','bangalore3');
commit;
mysql> insert into people values(1,'patil', 'basan', 'address','bangalore');
Query OK, 1 row affected (0.01 sec)
mysql> commit
-> ;
Query OK, 0 rows affected (0.00 sec)
mysql>
Data can be sqooped without primary key but we will get only one mapper
.
If the output of the sqoop already exists then it will throw error
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table people \
--target-dir /queryresult
[cloudera@quickstart ~]$ sqoop import \
> --connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
> --username root \
> --password cloudera \
> --table people \
> --target-dir /queryresult
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/11/29 22:15:26 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.0
19/11/29 22:15:26 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
19/11/29 22:15:26 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
19/11/29 22:15:26 INFO tool.CodeGenTool: Beginning code generation
19/11/29 22:15:26 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `people` AS t LIMIT 1
19/11/29 22:15:26 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `people` AS t LIMIT 1
19/11/29 22:15:26 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-cloudera/compile/22a1f58b27b214b8bc8a6b8727c10967/people.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
19/11/29 22:15:28 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/22a1f58b27b214b8bc8a6b8727c10967/people.jar
19/11/29 22:15:28 WARN manager.MySQLManager: It looks like you are importing from mysql.
19/11/29 22:15:28 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
19/11/29 22:15:28 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
19/11/29 22:15:28 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
19/11/29 22:15:28 ERROR tool.ImportTool: Import failed: No primary key could be found for table people. Please specify one with --split-by or perform a sequential import with '-m 1'.
[cloudera@quickstart ~]$
hdfs dfs -ls /queryresult
To fix this pass -m =1
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table people \
-m 1 \
--target-dir /queryresult
[cloudera@quickstart ~]$ sqoop import \
> --connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
> --username root \
> --password cloudera \
> --table people \
> -m 1 \
> --target-dir /queryresult
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/11/29 22:17:58 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.0
19/11/29 22:17:58 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
19/11/29 22:17:58 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
19/11/29 22:17:58 INFO tool.CodeGenTool: Beginning code generation
19/11/29 22:17:59 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `people` AS t LIMIT 1
19/11/29 22:17:59 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `people` AS t LIMIT 1
19/11/29 22:17:59 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-cloudera/compile/aea02172d63d47af0d50fec188aa6c21/people.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
19/11/29 22:18:00 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/aea02172d63d47af0d50fec188aa6c21/people.jar
19/11/29 22:18:00 WARN manager.MySQLManager: It looks like you are importing from mysql.
19/11/29 22:18:00 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
19/11/29 22:18:00 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
19/11/29 22:18:00 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
19/11/29 22:18:00 INFO mapreduce.ImportJobBase: Beginning import of people
19/11/29 22:18:00 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
19/11/29 22:18:00 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
19/11/29 22:18:01 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
19/11/29 22:18:01 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/11/29 22:18:03 INFO db.DBInputFormat: Using read commited transaction isolation
19/11/29 22:18:03 INFO mapreduce.JobSubmitter: number of splits:1
19/11/29 22:18:03 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1572771724749_0035
19/11/29 22:18:04 INFO impl.YarnClientImpl: Submitted application application_1572771724749_0035
19/11/29 22:18:04 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1572771724749_0035/
19/11/29 22:18:04 INFO mapreduce.Job: Running job: job_1572771724749_0035
19/11/29 22:18:10 INFO mapreduce.Job: Job job_1572771724749_0035 running in uber mode : false
19/11/29 22:18:10 INFO mapreduce.Job: map 0% reduce 0%
19/11/29 22:18:15 INFO mapreduce.Job: map 100% reduce 0%
19/11/29 22:18:15 INFO mapreduce.Job: Job job_1572771724749_0035 completed successfully
19/11/29 22:18:15 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=171167
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=87
HDFS: Number of bytes written=104
HDFS: Number of read operations=4
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=3008
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=3008
Total vcore-milliseconds taken by all map tasks=3008
Total megabyte-milliseconds taken by all map tasks=3080192
Map-Reduce Framework
Map input records=3
Map output records=3
Input split bytes=87
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=32
CPU time spent (ms)=850
Physical memory (bytes) snapshot=215085056
Virtual memory (bytes) snapshot=1573986304
Total committed heap usage (bytes)=235929600
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=104
19/11/29 22:18:15 INFO mapreduce.ImportJobBase: Transferred 104 bytes in 14.0751 seconds (7.3889 bytes/sec)
19/11/29 22:18:15 INFO mapreduce.ImportJobBase: Retrieved 3 records.
[cloudera@quickstart ~]$ hdfs dfs -ls /queryresultFound 2 items
-rw-r--r-- 1 cloudera supergroup 0 2019-11-29 22:18 /queryresult/_SUCCESS
-rw-r--r-- 1 cloudera supergroup 104 2019-11-29 22:18 /queryresult/part-m-00000
[cloudera@quickstart ~]$
By default 4 mappers will be used. But if the table is not having key
then 1 mapper will be used. SO we see one file as the output.
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--target-dir /queryresultorders
hdfs dfs -ls /queryresultorders
hdfs dfs -cat /queryresultorders/*
[cloudera@quickstart ~]$ sqoop import \
> --connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
> --username root \
> --password cloudera \
> --table orders \
> --target-dir /queryresultorders
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/11/29 22:21:41 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.0
19/11/29 22:21:41 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
19/11/29 22:21:41 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
19/11/29 22:21:41 INFO tool.CodeGenTool: Beginning code generation
19/11/29 22:21:41 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `orders` AS t LIMIT 1
19/11/29 22:21:41 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `orders` AS t LIMIT 1
19/11/29 22:21:41 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-cloudera/compile/4038ef3cab402a8d089b3d928036f385/orders.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
19/11/29 22:21:43 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/4038ef3cab402a8d089b3d928036f385/orders.jar
19/11/29 22:21:43 WARN manager.MySQLManager: It looks like you are importing from mysql.
19/11/29 22:21:43 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
19/11/29 22:21:43 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
19/11/29 22:21:43 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
19/11/29 22:21:43 INFO mapreduce.ImportJobBase: Beginning import of orders
19/11/29 22:21:43 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
19/11/29 22:21:43 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
19/11/29 22:21:43 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
19/11/29 22:21:44 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/11/29 22:21:45 INFO db.DBInputFormat: Using read commited transaction isolation
19/11/29 22:21:45 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`order_id`), MAX(`order_id`) FROM `orders`
19/11/29 22:21:45 INFO db.IntegerSplitter: Split size: 17220; Num splits: 4 from: 1 to: 68883
19/11/29 22:21:45 INFO mapreduce.JobSubmitter: number of splits:4
19/11/29 22:21:45 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1572771724749_0036
19/11/29 22:21:45 INFO impl.YarnClientImpl: Submitted application application_1572771724749_0036
19/11/29 22:21:46 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1572771724749_0036/
19/11/29 22:21:46 INFO mapreduce.Job: Running job: job_1572771724749_0036
19/11/29 22:21:52 INFO mapreduce.Job: Job job_1572771724749_0036 running in uber mode : false
19/11/29 22:21:52 INFO mapreduce.Job: map 0% reduce 0%
19/11/29 22:22:02 INFO mapreduce.Job: map 25% reduce 0%
19/11/29 22:22:05 INFO mapreduce.Job: map 50% reduce 0%
19/11/29 22:22:06 INFO mapreduce.Job: map 75% reduce 0%
19/11/29 22:22:07 INFO mapreduce.Job: map 100% reduce 0%
19/11/29 22:22:07 INFO mapreduce.Job: Job job_1572771724749_0036 completed successfully
19/11/29 22:22:07 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=685420
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=469
HDFS: Number of bytes written=2999944
HDFS: Number of read operations=16
HDFS: Number of large read operations=0
HDFS: Number of write operations=8
Job Counters
Launched map tasks=4
Other local map tasks=4
Total time spent by all maps in occupied slots (ms)=36317
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=36317
Total vcore-milliseconds taken by all map tasks=36317
Total megabyte-milliseconds taken by all map tasks=37188608
Map-Reduce Framework
Map input records=68883
Map output records=68883
Input split bytes=469
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=335
CPU time spent (ms)=13480
Physical memory (bytes) snapshot=1053622272
Virtual memory (bytes) snapshot=6251585536
Total committed heap usage (bytes)=945291264
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=2999944
19/11/29 22:22:07 INFO mapreduce.ImportJobBase: Transferred 2.861 MB in 23.5627 seconds (124.3332 KB/sec)
19/11/29 22:22:07 INFO mapreduce.ImportJobBase: Retrieved 68883 records.
[cloudera@quickstart ~]$ hdfs dfs -ls /queryresultorders
Found 5 items
-rw-r--r-- 1 cloudera supergroup 0 2019-11-29 22:22 /queryresultorders/_SUCCESS
-rw-r--r-- 1 cloudera supergroup 741614 2019-11-29 22:22 /queryresultorders/part-m-00000
-rw-r--r-- 1 cloudera supergroup 753022 2019-11-29 22:22 /queryresultorders/part-m-00001
-rw-r--r-- 1 cloudera supergroup 752368 2019-11-29 22:22 /queryresultorders/part-m-00002
-rw-r--r-- 1 cloudera supergroup 752940 2019-11-29 22:22 /queryresultorders/part-m-00003
[cloudera@quickstart ~]$
Observer by default 4 mappers got created.
hdfs dfs -cat /queryresultorders/*
If the table is huge and having no key single mapper will not work. We need to optimize.
Bringing all of the tables
sqoop-import-all-tables \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--as-sequencefile \
-m 4 \
--warehouse-dir /user/cloudera/sqoopdir
We can mention file formats while sqooping data.
By default it uses text file formats
sequence file format is supported.
avro file format
parquet file format
Orc is not supported.
warehouse-dir and target-dir?
warehouse-dir : With the name of the table sub directory will be created. When we are importing multiple tables
we have to use warehouse-dir, target-dir cannot be used.
target-dir :
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /orderswithwarehouse
hdfs dfs -ls /orderswithwarehouse
[cloudera@quickstart ~]$ sqoop import \
> --connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
> --username root \
> --password cloudera \
> --table orders \
> --warehouse-dir /orderswithwarehouse
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/11/29 22:34:23 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.0
19/11/29 22:34:23 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
19/11/29 22:34:24 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
19/11/29 22:34:24 INFO tool.CodeGenTool: Beginning code generation
19/11/29 22:34:24 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `orders` AS t LIMIT 1
19/11/29 22:34:24 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `orders` AS t LIMIT 1
19/11/29 22:34:24 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-cloudera/compile/2c43aaa902ea582daec0aa0cd99368a7/orders.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
19/11/29 22:34:25 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/2c43aaa902ea582daec0aa0cd99368a7/orders.jar
19/11/29 22:34:25 WARN manager.MySQLManager: It looks like you are importing from mysql.
19/11/29 22:34:25 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
19/11/29 22:34:25 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
19/11/29 22:34:25 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
19/11/29 22:34:25 INFO mapreduce.ImportJobBase: Beginning import of orders
19/11/29 22:34:25 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
19/11/29 22:34:26 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
19/11/29 22:34:26 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
19/11/29 22:34:27 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/11/29 22:34:28 INFO db.DBInputFormat: Using read commited transaction isolation
19/11/29 22:34:28 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`order_id`), MAX(`order_id`) FROM `orders`
19/11/29 22:34:28 INFO db.IntegerSplitter: Split size: 17220; Num splits: 4 from: 1 to: 68883
19/11/29 22:34:28 INFO mapreduce.JobSubmitter: number of splits:4
19/11/29 22:34:28 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1572771724749_0037
19/11/29 22:34:28 INFO impl.YarnClientImpl: Submitted application application_1572771724749_0037
19/11/29 22:34:28 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1572771724749_0037/
19/11/29 22:34:28 INFO mapreduce.Job: Running job: job_1572771724749_0037
19/11/29 22:34:36 INFO mapreduce.Job: Job job_1572771724749_0037 running in uber mode : false
19/11/29 22:34:36 INFO mapreduce.Job: map 0% reduce 0%
19/11/29 22:34:47 INFO mapreduce.Job: map 25% reduce 0%
19/11/29 22:34:50 INFO mapreduce.Job: map 50% reduce 0%
19/11/29 22:34:51 INFO mapreduce.Job: map 100% reduce 0%
19/11/29 22:34:52 INFO mapreduce.Job: Job job_1572771724749_0037 completed successfully
19/11/29 22:34:52 INFO mapreduce.Job: Counters: 31
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=685476
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=469
HDFS: Number of bytes written=2999944
HDFS: Number of read operations=16
HDFS: Number of large read operations=0
HDFS: Number of write operations=8
Job Counters
Killed map tasks=1
Launched map tasks=4
Other local map tasks=4
Total time spent by all maps in occupied slots (ms)=38440
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=38440
Total vcore-milliseconds taken by all map tasks=38440
Total megabyte-milliseconds taken by all map tasks=39362560
Map-Reduce Framework
Map input records=68883
Map output records=68883
Input split bytes=469
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=424
CPU time spent (ms)=13190
Physical memory (bytes) snapshot=989405184
Virtual memory (bytes) snapshot=6315798528
Total committed heap usage (bytes)=946339840
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=2999944
19/11/29 22:34:52 INFO mapreduce.ImportJobBase: Transferred 2.861 MB in 25.5098 seconds (114.8435 KB/sec)
19/11/29 22:34:52 INFO mapreduce.ImportJobBase: Retrieved 68883 records.
[cloudera@quickstart ~]$ hdfs dfs -ls /orderswithwarehouse
Found 1 items
drwxr-xr-x - cloudera supergroup 0 2019-11-29 22:34 /orderswithwarehouse/orders
[cloudera@quickstart ~]$ hdfs dfs -ls /orderswithwarehouse/orders
Found 5 items
-rw-r--r-- 1 cloudera supergroup 0 2019-11-29 22:34 /orderswithwarehouse/orders/_SUCCESS
-rw-r--r-- 1 cloudera supergroup 741614 2019-11-29 22:34 /orderswithwarehouse/orders/part-m-00000
-rw-r--r-- 1 cloudera supergroup 753022 2019-11-29 22:34 /orderswithwarehouse/orders/part-m-00001
-rw-r--r-- 1 cloudera supergroup 752368 2019-11-29 22:34 /orderswithwarehouse/orders/part-m-00002
-rw-r--r-- 1 cloudera supergroup 752940 2019-11-29 22:34 /orderswithwarehouse/orders/part-m-00003
[cloudera@quickstart ~]$
Observe [cloudera@quickstart ~]$ hdfs dfs -ls /orderswithwarehouse/orders sub folder been created for the config warehouse-dir
==============================================================================================
#getting help of all
sqoop help
sqoop version
[cloudera@quickstart ~]$ sqoop version
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/11/29 22:42:21 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.0
Sqoop 1.4.6-cdh5.13.0
git commit id
Compiled by jenkins on Wed Oct 4 11:04:44 PDT 2017
[cloudera@quickstart ~]$
sqoop help eval
#getting help of import
sqoop help import
==============================================================================================
--password we need to enter password in plain text
-P it will prompt for password
sqoop-list-databases \
--connect "jdbc:mysql://quickstart.cloudera:3306" \
--username retail_dba \
-P
[cloudera@quickstart ~]$ sqoop-list-databases \
> --connect "jdbc:mysql://quickstart.cloudera:3306" \
> --username retail_dba \
> -P
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/11/29 22:43:51 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.0
Enter password:
19/11/29 22:43:55 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
information_schema
retail_db
[cloudera@quickstart ~]$
Observer it expects Enter password: to be entered
==============================================================================================
sqoop eval command -e --query
-e or --query are same
-m or -nummapper are same
==============================================================================================
redirecting logs - stdout stderr
1 is the code for stdout
2 is the code for stderr
1>query.out 2>query.err
When we automate and see the logs this is needed
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--target-dir /queryresulttargetdir-err 1>query.out 2>query.err
[cloudera@quickstart ~]$ sqoop import \
> --connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
> --username root \
> --password cloudera \
> --table orders \
> --target-dir /queryresulttargetdir-err 1>query.out 2>query.err
;
[cloudera@quickstart ~]$ ;
bash: syntax error near unexpected token `;'
[cloudera@quickstart ~]$ ls
cloudera-manager Desktop Downloads enterprise-deployment.json kerberos Music parcels Pictures query.err Templates workspace
cm_api.py Documents eclipse express-deployment.json lib orders.java people.java Public query.out Videos
[cloudera@quickstart ~]$ vi query.out
[cloudera@quickstart ~]$ vi query.err
[cloudera@quickstart ~]$ ls query.out
query.out
[cloudera@quickstart ~]$ ls query.err
query.err
[cloudera@quickstart ~]$
Observer the file query.out and query.err created
==============================================================================================
sqoop import
target-dir
warehouse-dir
--append
--delete-target-dir
==============================================================================================
If there is no primary key we will see how it can be optimized.
==============================================================================================
What is bounding val query?
When there is no primary key we have to specify mapper as 1 or specify split-by
Use sqoop with the column which is of integral type.
==============================================================================================
file formats
--as-avrodatafile
--as-sequencefile
--as-parquetfile
==============================================================================================
by default --compression is gz compression.
If we specify the specific compression
specify --compression-codec can pass SnappyCodec
-compress or -z (default compression is gzip and we will see .gz extension)
--compression-codec (specify the compression algorithm)
/etc/hadoop/conf core-site.xml
--compress
--compression-codec org.apache.hadoop.io.compress.SnappyCodec
==============================================================================================
When we connect to hadoop, we connect to edge node and that internally
takes care of executing stuff in the cluster.
connecting to mysql from hadoop
sqoop-list-databases \
--connect "jdbc:mysql://quickstart.cloudera:3306" \
--username retail_dba \
--password cloudera
[cloudera@quickstart ~]$ hdfs dfs -ls
Found 1 items
drwxr-xr-x - cloudera cloudera 0 2019-11-03 03:58 _sqoop
[cloudera@quickstart ~]$ sqoop-list-databases \
> --connect "jdbc:mysql://quickstart.cloudera:3306" \
> --username retail_dba \
> --password cloudera
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/11/29 21:49:45 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.0
19/11/29 21:49:45 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
19/11/29 21:49:45 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
information_schema
retail_db
[cloudera@quickstart ~]$
now try to see the tables using the root user to see all tables listed.
sqoop-list-tables \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera
[cloudera@quickstart ~]$ sqoop-list-tables \
> --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
> --username retail_dba \
> --password cloudera
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/11/29 21:56:50 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.0
19/11/29 21:56:50 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
19/11/29 21:56:50 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
categories
customers
departments
order_items
orders
products
[cloudera@quickstart ~]$
sqoop-eval \
--connect "jdbc:mysql://quickstart.cloudera:3306" \
--username retail_dba \
--password cloudera \
--query "select * from retail_db.customers limit 10"
https://sqoop.apache.org/docs/1.4.4/SqoopUserGuide.html#_controlling_the_import_process
sqoop import \
-Dgapreduce.job.queuename=${job_queue_nm} \
-Dmapreduce.job.queuename=${job_queue_nm} \
-Dmapred.job.queuename=${job_queue_nm} \
--options-file ${goa_lkup_dir}/connect.txt \
--delete-target-dir \
--target-dir ${landingdir}/order_node_allocation_reason_e \
--hive-drop-import-delims \
--query "select nearest_node_json from order_node_allocation_reason \
where insert_time >= '$strt_ts' and \$CONDITIONS" \
--null-string '\\N' --null-non-string '\\N' \
--num-mappers ${nummapper} \
--fields-terminated-by '|' \
--lines-terminated-by '\n'
create a table in sql and insert some data
CREATE TABLE people (PersonID int,LastName varchar(255),FirstName varchar(255),Address varchar(255),City varchar(255));
insert into people values(1,'patil', 'basan', 'address','bangalore');
insert into people values(2,'patil2', 'basan2', 'address2','bangalore2');
insert into people values(3,'patil3', 'basan3', 'address3','bangalore3');
commit;
mysql> insert into people values(1,'patil', 'basan', 'address','bangalore');
Query OK, 1 row affected (0.01 sec)
mysql> commit
-> ;
Query OK, 0 rows affected (0.00 sec)
mysql>
Data can be sqooped without primary key but we will get only one mapper
.
If the output of the sqoop already exists then it will throw error
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table people \
--target-dir /queryresult
[cloudera@quickstart ~]$ sqoop import \
> --connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
> --username root \
> --password cloudera \
> --table people \
> --target-dir /queryresult
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/11/29 22:15:26 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.0
19/11/29 22:15:26 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
19/11/29 22:15:26 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
19/11/29 22:15:26 INFO tool.CodeGenTool: Beginning code generation
19/11/29 22:15:26 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `people` AS t LIMIT 1
19/11/29 22:15:26 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `people` AS t LIMIT 1
19/11/29 22:15:26 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-cloudera/compile/22a1f58b27b214b8bc8a6b8727c10967/people.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
19/11/29 22:15:28 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/22a1f58b27b214b8bc8a6b8727c10967/people.jar
19/11/29 22:15:28 WARN manager.MySQLManager: It looks like you are importing from mysql.
19/11/29 22:15:28 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
19/11/29 22:15:28 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
19/11/29 22:15:28 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
19/11/29 22:15:28 ERROR tool.ImportTool: Import failed: No primary key could be found for table people. Please specify one with --split-by or perform a sequential import with '-m 1'.
[cloudera@quickstart ~]$
hdfs dfs -ls /queryresult
To fix this pass -m =1
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table people \
-m 1 \
--target-dir /queryresult
[cloudera@quickstart ~]$ sqoop import \
> --connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
> --username root \
> --password cloudera \
> --table people \
> -m 1 \
> --target-dir /queryresult
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/11/29 22:17:58 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.0
19/11/29 22:17:58 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
19/11/29 22:17:58 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
19/11/29 22:17:58 INFO tool.CodeGenTool: Beginning code generation
19/11/29 22:17:59 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `people` AS t LIMIT 1
19/11/29 22:17:59 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `people` AS t LIMIT 1
19/11/29 22:17:59 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-cloudera/compile/aea02172d63d47af0d50fec188aa6c21/people.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
19/11/29 22:18:00 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/aea02172d63d47af0d50fec188aa6c21/people.jar
19/11/29 22:18:00 WARN manager.MySQLManager: It looks like you are importing from mysql.
19/11/29 22:18:00 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
19/11/29 22:18:00 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
19/11/29 22:18:00 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
19/11/29 22:18:00 INFO mapreduce.ImportJobBase: Beginning import of people
19/11/29 22:18:00 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
19/11/29 22:18:00 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
19/11/29 22:18:01 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
19/11/29 22:18:01 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/11/29 22:18:03 INFO db.DBInputFormat: Using read commited transaction isolation
19/11/29 22:18:03 INFO mapreduce.JobSubmitter: number of splits:1
19/11/29 22:18:03 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1572771724749_0035
19/11/29 22:18:04 INFO impl.YarnClientImpl: Submitted application application_1572771724749_0035
19/11/29 22:18:04 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1572771724749_0035/
19/11/29 22:18:04 INFO mapreduce.Job: Running job: job_1572771724749_0035
19/11/29 22:18:10 INFO mapreduce.Job: Job job_1572771724749_0035 running in uber mode : false
19/11/29 22:18:10 INFO mapreduce.Job: map 0% reduce 0%
19/11/29 22:18:15 INFO mapreduce.Job: map 100% reduce 0%
19/11/29 22:18:15 INFO mapreduce.Job: Job job_1572771724749_0035 completed successfully
19/11/29 22:18:15 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=171167
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=87
HDFS: Number of bytes written=104
HDFS: Number of read operations=4
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=3008
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=3008
Total vcore-milliseconds taken by all map tasks=3008
Total megabyte-milliseconds taken by all map tasks=3080192
Map-Reduce Framework
Map input records=3
Map output records=3
Input split bytes=87
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=32
CPU time spent (ms)=850
Physical memory (bytes) snapshot=215085056
Virtual memory (bytes) snapshot=1573986304
Total committed heap usage (bytes)=235929600
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=104
19/11/29 22:18:15 INFO mapreduce.ImportJobBase: Transferred 104 bytes in 14.0751 seconds (7.3889 bytes/sec)
19/11/29 22:18:15 INFO mapreduce.ImportJobBase: Retrieved 3 records.
[cloudera@quickstart ~]$ hdfs dfs -ls /queryresultFound 2 items
-rw-r--r-- 1 cloudera supergroup 0 2019-11-29 22:18 /queryresult/_SUCCESS
-rw-r--r-- 1 cloudera supergroup 104 2019-11-29 22:18 /queryresult/part-m-00000
[cloudera@quickstart ~]$
By default 4 mappers will be used. But if the table is not having key
then 1 mapper will be used. SO we see one file as the output.
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--target-dir /queryresultorders
hdfs dfs -ls /queryresultorders
hdfs dfs -cat /queryresultorders/*
[cloudera@quickstart ~]$ sqoop import \
> --connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
> --username root \
> --password cloudera \
> --table orders \
> --target-dir /queryresultorders
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/11/29 22:21:41 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.0
19/11/29 22:21:41 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
19/11/29 22:21:41 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
19/11/29 22:21:41 INFO tool.CodeGenTool: Beginning code generation
19/11/29 22:21:41 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `orders` AS t LIMIT 1
19/11/29 22:21:41 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `orders` AS t LIMIT 1
19/11/29 22:21:41 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-cloudera/compile/4038ef3cab402a8d089b3d928036f385/orders.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
19/11/29 22:21:43 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/4038ef3cab402a8d089b3d928036f385/orders.jar
19/11/29 22:21:43 WARN manager.MySQLManager: It looks like you are importing from mysql.
19/11/29 22:21:43 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
19/11/29 22:21:43 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
19/11/29 22:21:43 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
19/11/29 22:21:43 INFO mapreduce.ImportJobBase: Beginning import of orders
19/11/29 22:21:43 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
19/11/29 22:21:43 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
19/11/29 22:21:43 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
19/11/29 22:21:44 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/11/29 22:21:45 INFO db.DBInputFormat: Using read commited transaction isolation
19/11/29 22:21:45 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`order_id`), MAX(`order_id`) FROM `orders`
19/11/29 22:21:45 INFO db.IntegerSplitter: Split size: 17220; Num splits: 4 from: 1 to: 68883
19/11/29 22:21:45 INFO mapreduce.JobSubmitter: number of splits:4
19/11/29 22:21:45 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1572771724749_0036
19/11/29 22:21:45 INFO impl.YarnClientImpl: Submitted application application_1572771724749_0036
19/11/29 22:21:46 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1572771724749_0036/
19/11/29 22:21:46 INFO mapreduce.Job: Running job: job_1572771724749_0036
19/11/29 22:21:52 INFO mapreduce.Job: Job job_1572771724749_0036 running in uber mode : false
19/11/29 22:21:52 INFO mapreduce.Job: map 0% reduce 0%
19/11/29 22:22:02 INFO mapreduce.Job: map 25% reduce 0%
19/11/29 22:22:05 INFO mapreduce.Job: map 50% reduce 0%
19/11/29 22:22:06 INFO mapreduce.Job: map 75% reduce 0%
19/11/29 22:22:07 INFO mapreduce.Job: map 100% reduce 0%
19/11/29 22:22:07 INFO mapreduce.Job: Job job_1572771724749_0036 completed successfully
19/11/29 22:22:07 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=685420
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=469
HDFS: Number of bytes written=2999944
HDFS: Number of read operations=16
HDFS: Number of large read operations=0
HDFS: Number of write operations=8
Job Counters
Launched map tasks=4
Other local map tasks=4
Total time spent by all maps in occupied slots (ms)=36317
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=36317
Total vcore-milliseconds taken by all map tasks=36317
Total megabyte-milliseconds taken by all map tasks=37188608
Map-Reduce Framework
Map input records=68883
Map output records=68883
Input split bytes=469
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=335
CPU time spent (ms)=13480
Physical memory (bytes) snapshot=1053622272
Virtual memory (bytes) snapshot=6251585536
Total committed heap usage (bytes)=945291264
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=2999944
19/11/29 22:22:07 INFO mapreduce.ImportJobBase: Transferred 2.861 MB in 23.5627 seconds (124.3332 KB/sec)
19/11/29 22:22:07 INFO mapreduce.ImportJobBase: Retrieved 68883 records.
[cloudera@quickstart ~]$ hdfs dfs -ls /queryresultorders
Found 5 items
-rw-r--r-- 1 cloudera supergroup 0 2019-11-29 22:22 /queryresultorders/_SUCCESS
-rw-r--r-- 1 cloudera supergroup 741614 2019-11-29 22:22 /queryresultorders/part-m-00000
-rw-r--r-- 1 cloudera supergroup 753022 2019-11-29 22:22 /queryresultorders/part-m-00001
-rw-r--r-- 1 cloudera supergroup 752368 2019-11-29 22:22 /queryresultorders/part-m-00002
-rw-r--r-- 1 cloudera supergroup 752940 2019-11-29 22:22 /queryresultorders/part-m-00003
[cloudera@quickstart ~]$
Observer by default 4 mappers got created.
hdfs dfs -cat /queryresultorders/*
If the table is huge and having no key single mapper will not work. We need to optimize.
Bringing all of the tables
sqoop-import-all-tables \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--as-sequencefile \
-m 4 \
--warehouse-dir /user/cloudera/sqoopdir
We can mention file formats while sqooping data.
By default it uses text file formats
sequence file format is supported.
avro file format
parquet file format
Orc is not supported.
warehouse-dir and target-dir?
warehouse-dir : With the name of the table sub directory will be created. When we are importing multiple tables
we have to use warehouse-dir, target-dir cannot be used.
target-dir :
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /orderswithwarehouse
hdfs dfs -ls /orderswithwarehouse
[cloudera@quickstart ~]$ sqoop import \
> --connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
> --username root \
> --password cloudera \
> --table orders \
> --warehouse-dir /orderswithwarehouse
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/11/29 22:34:23 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.0
19/11/29 22:34:23 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
19/11/29 22:34:24 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
19/11/29 22:34:24 INFO tool.CodeGenTool: Beginning code generation
19/11/29 22:34:24 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `orders` AS t LIMIT 1
19/11/29 22:34:24 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `orders` AS t LIMIT 1
19/11/29 22:34:24 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-cloudera/compile/2c43aaa902ea582daec0aa0cd99368a7/orders.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
19/11/29 22:34:25 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/2c43aaa902ea582daec0aa0cd99368a7/orders.jar
19/11/29 22:34:25 WARN manager.MySQLManager: It looks like you are importing from mysql.
19/11/29 22:34:25 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
19/11/29 22:34:25 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
19/11/29 22:34:25 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
19/11/29 22:34:25 INFO mapreduce.ImportJobBase: Beginning import of orders
19/11/29 22:34:25 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
19/11/29 22:34:26 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
19/11/29 22:34:26 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
19/11/29 22:34:27 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/11/29 22:34:28 INFO db.DBInputFormat: Using read commited transaction isolation
19/11/29 22:34:28 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`order_id`), MAX(`order_id`) FROM `orders`
19/11/29 22:34:28 INFO db.IntegerSplitter: Split size: 17220; Num splits: 4 from: 1 to: 68883
19/11/29 22:34:28 INFO mapreduce.JobSubmitter: number of splits:4
19/11/29 22:34:28 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1572771724749_0037
19/11/29 22:34:28 INFO impl.YarnClientImpl: Submitted application application_1572771724749_0037
19/11/29 22:34:28 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1572771724749_0037/
19/11/29 22:34:28 INFO mapreduce.Job: Running job: job_1572771724749_0037
19/11/29 22:34:36 INFO mapreduce.Job: Job job_1572771724749_0037 running in uber mode : false
19/11/29 22:34:36 INFO mapreduce.Job: map 0% reduce 0%
19/11/29 22:34:47 INFO mapreduce.Job: map 25% reduce 0%
19/11/29 22:34:50 INFO mapreduce.Job: map 50% reduce 0%
19/11/29 22:34:51 INFO mapreduce.Job: map 100% reduce 0%
19/11/29 22:34:52 INFO mapreduce.Job: Job job_1572771724749_0037 completed successfully
19/11/29 22:34:52 INFO mapreduce.Job: Counters: 31
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=685476
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=469
HDFS: Number of bytes written=2999944
HDFS: Number of read operations=16
HDFS: Number of large read operations=0
HDFS: Number of write operations=8
Job Counters
Killed map tasks=1
Launched map tasks=4
Other local map tasks=4
Total time spent by all maps in occupied slots (ms)=38440
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=38440
Total vcore-milliseconds taken by all map tasks=38440
Total megabyte-milliseconds taken by all map tasks=39362560
Map-Reduce Framework
Map input records=68883
Map output records=68883
Input split bytes=469
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=424
CPU time spent (ms)=13190
Physical memory (bytes) snapshot=989405184
Virtual memory (bytes) snapshot=6315798528
Total committed heap usage (bytes)=946339840
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=2999944
19/11/29 22:34:52 INFO mapreduce.ImportJobBase: Transferred 2.861 MB in 25.5098 seconds (114.8435 KB/sec)
19/11/29 22:34:52 INFO mapreduce.ImportJobBase: Retrieved 68883 records.
[cloudera@quickstart ~]$ hdfs dfs -ls /orderswithwarehouse
Found 1 items
drwxr-xr-x - cloudera supergroup 0 2019-11-29 22:34 /orderswithwarehouse/orders
[cloudera@quickstart ~]$ hdfs dfs -ls /orderswithwarehouse/orders
Found 5 items
-rw-r--r-- 1 cloudera supergroup 0 2019-11-29 22:34 /orderswithwarehouse/orders/_SUCCESS
-rw-r--r-- 1 cloudera supergroup 741614 2019-11-29 22:34 /orderswithwarehouse/orders/part-m-00000
-rw-r--r-- 1 cloudera supergroup 753022 2019-11-29 22:34 /orderswithwarehouse/orders/part-m-00001
-rw-r--r-- 1 cloudera supergroup 752368 2019-11-29 22:34 /orderswithwarehouse/orders/part-m-00002
-rw-r--r-- 1 cloudera supergroup 752940 2019-11-29 22:34 /orderswithwarehouse/orders/part-m-00003
[cloudera@quickstart ~]$
Observe [cloudera@quickstart ~]$ hdfs dfs -ls /orderswithwarehouse/orders sub folder been created for the config warehouse-dir
==============================================================================================
#getting help of all
sqoop help
sqoop version
[cloudera@quickstart ~]$ sqoop version
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/11/29 22:42:21 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.0
Sqoop 1.4.6-cdh5.13.0
git commit id
Compiled by jenkins on Wed Oct 4 11:04:44 PDT 2017
[cloudera@quickstart ~]$
sqoop help eval
#getting help of import
sqoop help import
==============================================================================================
--password we need to enter password in plain text
-P it will prompt for password
sqoop-list-databases \
--connect "jdbc:mysql://quickstart.cloudera:3306" \
--username retail_dba \
-P
[cloudera@quickstart ~]$ sqoop-list-databases \
> --connect "jdbc:mysql://quickstart.cloudera:3306" \
> --username retail_dba \
> -P
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/11/29 22:43:51 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.0
Enter password:
19/11/29 22:43:55 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
information_schema
retail_db
[cloudera@quickstart ~]$
Observer it expects Enter password: to be entered
==============================================================================================
sqoop eval command -e --query
-e or --query are same
-m or -nummapper are same
==============================================================================================
redirecting logs - stdout stderr
1 is the code for stdout
2 is the code for stderr
1>query.out 2>query.err
When we automate and see the logs this is needed
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--target-dir /queryresulttargetdir-err 1>query.out 2>query.err
[cloudera@quickstart ~]$ sqoop import \
> --connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
> --username root \
> --password cloudera \
> --table orders \
> --target-dir /queryresulttargetdir-err 1>query.out 2>query.err
;
[cloudera@quickstart ~]$ ;
bash: syntax error near unexpected token `;'
[cloudera@quickstart ~]$ ls
cloudera-manager Desktop Downloads enterprise-deployment.json kerberos Music parcels Pictures query.err Templates workspace
cm_api.py Documents eclipse express-deployment.json lib orders.java people.java Public query.out Videos
[cloudera@quickstart ~]$ vi query.out
[cloudera@quickstart ~]$ vi query.err
[cloudera@quickstart ~]$ ls query.out
query.out
[cloudera@quickstart ~]$ ls query.err
query.err
[cloudera@quickstart ~]$
Observer the file query.out and query.err created
==============================================================================================
sqoop import
target-dir
warehouse-dir
--append
--delete-target-dir
==============================================================================================
If there is no primary key we will see how it can be optimized.
==============================================================================================
What is bounding val query?
When there is no primary key we have to specify mapper as 1 or specify split-by
Use sqoop with the column which is of integral type.
==============================================================================================
file formats
--as-avrodatafile
--as-sequencefile
--as-parquetfile
==============================================================================================
by default --compression is gz compression.
If we specify the specific compression
specify --compression-codec can pass SnappyCodec
-compress or -z (default compression is gzip and we will see .gz extension)
--compression-codec (specify the compression algorithm)
/etc/hadoop/conf core-site.xml
--compress
--compression-codec org.apache.hadoop.io.compress.SnappyCodec
==============================================================================================
No comments:
Post a Comment