This article introduces Sqoop, a big-data ETL tool that transfers (business) data between relational databases (MySQL, Oracle, ...) and Hive/HBase/HDFS, i.e. a tool for moving data between Hadoop and relational databases. Hopefully it serves as a useful reference; follow along to learn.
The ETL tools we commonly use are Sqoop, Kettle, and NiFi:
- Kettle is fairly full-featured, but it hits a noticeable bottleneck when processing large data volumes;
- NiFi is powerful and handles large data volumes, but a NiFi cluster is independent of the Hadoop cluster and needs its own servers; that power also means a steeper learning curve and higher staffing costs;
- Sqoop was built specifically for ETL between relational databases and Hadoop. It supports massive data volumes, fits the project's needs, and is simple to use with a low barrier to entry.
Sqoop vs. Kettle:
- If the requirement is only to migrate relational-database data (Oracle, MySQL) into non-relational stores (HDFS, HBase, Hive), Sqoop is recommended and fully sufficient.
- If the goal is to consolidate different kinds of relational databases (Oracle, MySQL, SQL Server) into a single relational database such as MySQL, Kettle is recommended: it has a GUI and is easy to operate.
I. Introduction to Sqoop
Sqoop is a top-level Apache project for moving data between Hadoop and relational databases: it can import data from a relational database (e.g. MySQL, Oracle, PostgreSQL) into HDFS, and it can also export data from HDFS back into a relational database.
- Import: from an RDBMS (relational database) to Hadoop
- Export: from Hadoop to an RDBMS (relational database)
It is widely used across companies and its outlook is fairly positive. Its characteristics:
- Built specifically for Hadoop, it tracks Hadoop releases well; it was originally an open-source project incubated out of the CDH distribution and supports the various CDH versions.
- It supports multiple relational databases, such as MySQL, Oracle, and PostgreSQL.
- It uses resources efficiently and in a controllable way.
- It performs data mapping and conversion automatically.
- Most enterprises still run Sqoop 1, which covers their basic needs.
- It ships with a rich set of helper tools, such as sqoop-import, sqoop-list-databases, and sqoop-list-tables.
II. Installing Sqoop
1. Extract the tarball onto any node (for example, the hadoop102 node)
Sqoop is a client-side tool: install it on whichever server you want to run it from; there is no need to distribute it across the cluster or do any cluster configuration.
[whx@hadoop102 soft]$ tar -zxf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz -C /opt/module/
2. Rename the Sqoop directory
[whx@hadoop102 module]$ ll
total 36
drwxrwxr-x. 9 whx whx 4096 Jan 31 14:45 flume
drwxr-xr-x. 11 whx whx 4096 Jan 31 10:43 hadoop-2.7.2
drwxrwxr-x. 8 whx whx 4096 Feb 2 10:48 hbase
drwxrwxr-x. 9 whx whx 4096 Jan 30 19:27 hive
drwxr-xr-x. 8 whx whx 4096 Dec 13 2016 jdk1.8.0_121
drwxr-xr-x. 8 whx whx 4096 Feb 1 16:32 kafka
drwxrwx---. 5 whx whx 4096 May 24 2019 phoenix
drwxr-xr-x. 9 whx whx 4096 Apr 27 2015 sqoop-1.4.6.bin__hadoop-2.0.4-alpha
drwxr-xr-x. 11 whx whx 4096 Jan 29 22:01 zookeeper-3.4.10
[whx@hadoop102 module]$ mv sqoop-1.4.6.bin__hadoop-2.0.4-alpha/ sqoop
[whx@hadoop102 module]$ ll
total 36
drwxrwxr-x. 9 whx whx 4096 Jan 31 14:45 flume
drwxr-xr-x. 11 whx whx 4096 Jan 31 10:43 hadoop-2.7.2
drwxrwxr-x. 8 whx whx 4096 Feb 2 10:48 hbase
drwxrwxr-x. 9 whx whx 4096 Jan 30 19:27 hive
drwxr-xr-x. 8 whx whx 4096 Dec 13 2016 jdk1.8.0_121
drwxr-xr-x. 8 whx whx 4096 Feb 1 16:32 kafka
drwxrwx---. 5 whx whx 4096 May 24 2019 phoenix
drwxr-xr-x. 9 whx whx 4096 Apr 27 2015 sqoop
drwxr-xr-x. 11 whx whx 4096 Jan 29 22:01 zookeeper-3.4.10
[whx@hadoop102 module]$
3. Edit the configuration
You can either configure the variables in /etc/profile and export them globally, or rename /opt/module/sqoop/conf/sqoop-env-template.sh to sqoop-env.sh and set HADOOP_HOME, HIVE_HOME, HBASE_HOME, and ZOOKEEPER_HOME there (a minimal sqoop-env.sh sketch follows the /etc/profile listing below). The /etc/profile version:
JAVA_HOME=/opt/module/jdk1.8.0_121
HADOOP_HOME=/opt/module/hadoop-2.7.2
ZOOKEEPER_HOME=/opt/module/zookeeper-3.4.10
HIVE_HOME=/opt/module/hive
FLUME_HOME=/opt/module/flume
HBASE_HOME=/opt/module/hbase
PHOENIX_HOME=/opt/module/phoenix
PHOENIX_CLASSPATH=$PHOENIX_HOME
SQOOP_HOME=/opt/module/sqoop
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$ZOOKEEPER_HOME/bin:$HIVE_HOME/bin:$FLUME_HOME/bin:$HBASE_HOME/bin:$PHOENIX_HOME/bin:$SQOOP_HOME/bin
export PATH JAVA_HOME HADOOP_HOME ZOOKEEPER_HOME HIVE_HOME FLUME_HOME HBASE_HOME PHOENIX_HOME PHOENIX_CLASSPATH SQOOP_HOME
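If you configure Sqoop itself instead of /etc/profile, a minimal sqoop-env.sh sketch could look like the following; the variable names come from sqoop-env-template.sh, and the paths are assumed to match the install layout shown above:
export HADOOP_COMMON_HOME=/opt/module/hadoop-2.7.2
export HADOOP_MAPRED_HOME=/opt/module/hadoop-2.7.2
export HIVE_HOME=/opt/module/hive
export HBASE_HOME=/opt/module/hbase
export ZOOCFGDIR=/opt/module/zookeeper-3.4.10/conf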
4. Copy the JDBC driver
Find an available MySQL JDBC driver and copy it into Sqoop's lib directory:
[whx@hadoop102 ~]$ cp /opt/module/hive/lib/mysql-connector-java-5.1.27-bin.jar /opt/module/sqoop/lib/
5. Verify Sqoop
We can run one of the subcommands to check that Sqoop is configured correctly:
[whx@hadoop102 ~]$ sqoop help
Some warnings appear, followed by the help output:
[whx@hadoop102 ~]$ sqoop help
Warning: /opt/module/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /opt/module/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
21/02/04 14:12:49 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  import-mainframe   Import datasets from a mainframe server to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information

See 'sqoop help COMMAND' for information on a specific command.
[whx@hadoop102 ~]$
6. Test that Sqoop can connect to the MySQL database
[whx@hadoop102 ~]$ sqoop list-databases --connect jdbc:mysql://hadoop102:3306/ --username root --password 123456
Warning: /opt/module/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /opt/module/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
21/02/04 14:15:18 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
21/02/04 14:15:18 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
21/02/04 14:15:18 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/module/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/module/hbase/lib/phoenix-4.14.2-HBase-1.3-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/module/hbase/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
information_schema
metastore
mysql
performance_schema
test
[whx@hadoop102 ~]$
III. Using Sqoop
Create a table in MySQL and insert a few rows for testing:
$ mysql -uroot -p123456
mysql> create database company;
mysql> create table company.staff(id int(4) primary key not null auto_increment, name varchar(255), sex varchar(255));
mysql> insert into company.staff(name, sex) values('Thomas', 'Male');
mysql> insert into company.staff(name, sex) values('Catalina', 'FeMale');
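As an optional sanity check (assuming the inserts above succeeded), query the table before importing:
mysql> select * from company.staff;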
1. Start Hadoop YARN
[whx@hadoop102 ~]$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /opt/module/hadoop-2.7.2/logs/yarn-whx-resourcemanager-hadoop102.out
hadoop103: starting nodemanager, logging to /opt/module/hadoop-2.7.2/logs/yarn-whx-nodemanager-hadoop103.out
hadoop101: starting nodemanager, logging to /opt/module/hadoop-2.7.2/logs/yarn-whx-nodemanager-hadoop101.out
hadoop102: starting nodemanager, logging to /opt/module/hadoop-2.7.2/logs/yarn-whx-nodemanager-hadoop102.out
[whx@hadoop102 ~]$ xcall.sh jps
The command to execute is: jps
----------------------------hadoop101----------------------------------
5411 HRegionServer
3476 DataNode
3591 QuorumPeerMain
6583 NodeManager
3354 NameNode
5547 HMaster
6718 Jps
----------------------------hadoop102----------------------------------
4336 QuorumPeerMain
18641 Jps
5123 SqlLine
14551 HRegionServer
4077 DataNode
18493 NodeManager
14381 HMaster
18190 ResourceManager
----------------------------hadoop103----------------------------------
3553 QuorumPeerMain
3474 SecondaryNameNode
5874 HRegionServer
3347 DataNode
7524 Jps
6009 HMaster
7387 NodeManager
[whx@hadoop102 ~]$
2. Importing data
In Sqoop, "import" means transferring data from a non-big-data cluster (an RDBMS) into the big-data cluster (HDFS, Hive, HBase); the import keyword is used.
2.1 MySQL to HDFS
- List all databases in MySQL
sqoop list-databases --connect jdbc:mysql://hadoop01:3306 --username root --password 123456
- List all tables of a given MySQL database
sqoop list-tables --connect jdbc:mysql://hadoop01:3306/scm --username root --password 123456
Note:
If you forget an option, you can look it up with --help, as shown below.
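For example, to see every option of the import tool:
sqoop help import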
- Import data from MySQL into HDFS (full import)
Command 1:
sqoop import \
--connect jdbc:mysql://hadoop01:3306/test \
--username root --password 123456 \
--table emp
Testing shows that specifying only the source side is enough to import the data into HDFS. By default the data lands in the operating user's HDFS home directory, in a directory named after the imported table; Sqoop reads the table with several map tasks by default, and the default field delimiter is a comma. How do we reduce the number of maps? Just add the -m option:
sqoop import \
--connect jdbc:mysql://hadoop01:3306/test \
--username root --password 123456 \
--table emp \
-m 1
What if we want two maps instead? Add --split-by to indicate which column the table data should be split on:
sqoop import \
--connect jdbc:mysql://hadoop01:3306/test \
--username root --password 123456 \
--table emp \
--split-by id \
-m 2
To change the default delimiter, for example to a space:
sqoop import \
--connect jdbc:mysql://hadoop01:3306/test \
--username root --password 123456 \
--table emp \
--fields-terminated-by ' ' \
--split-by id \
-m 2
To write to a specific destination, use --target-dir (the destination directory) and --delete-target-dir (delete the destination directory first if it already exists):
sqoop import \
--connect jdbc:mysql://hadoop01:3306/test \
--username root --password 123456 \
--table emp \
--fields-terminated-by ' ' \
--target-dir '/sqoop_emp' \
--delete-target-dir \
--split-by id \
-m 2
- Import into Hive with Sqoop (full import)
Step 1: create the target table in Hive first:
create database sqooptohive;
use sqooptohive;
create table sqooptohive.emp_hive(id int, name string, deg string, salary int, dept string) row format delimited fields terminated by '\t' stored as orc;
Step 2: run the import, which goes through HCatalog:
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/test \
--username root \
--password 123456 \
--table emp \
--fields-terminated-by '\t' \
--hcatalog-database sqooptohive \
--hcatalog-table emp_hive \
-m 1
Option notes:
--hcatalog-database specifies the database name
--hcatalog-table specifies the table name
Note: with this approach, the column names of the Hive table must match the column names of the corresponding MySQL table.
- Import data into HDFS with a where condition (conditional import):
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/test \
--username root \
--password 123456 \
--table emp_add \
--target-dir /sqoop/emp_add \
--delete-target-dir \
-m 1 \
--where "city = 'sec-bad'"
- Import data into HDFS with a SQL query (conditional import)
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/test \
--username root \
--password 123456 \
--query 'select phno from emp_conn where 1=1 and $CONDITIONS' \
--target-dir /sqoop/emp_conn \
--delete-target-dir \
-m 1
Note:
1) When importing with a SQL query, the $CONDITIONS keyword must be appended at the end of the query.
2) If the whole query is wrapped in double quotes, $CONDITIONS must be escaped with a backslash: "select phno from emp_conn where 1=1 and \$CONDITIONS"
- Import data into Hive with a SQL query (conditional import), used here as an incremental import:
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/test \
--username root \
--password 123456 \
--query "select * from emp where id>1205 and \$CONDITIONS" \
--fields-terminated-by '\t' \
--hcatalog-database sqooptohive \
--hcatalog-table emp_hive \
-m 1
2.1.1 Full import
- Connection URL: --connect jdbc:mysql://hadoop102:3306/company
- Username: --username root
- Password: --password 123456
- Which table's data to pull from MySQL: --table staff
- HDFS path to import the data into: --target-dir /company
- Delete the target directory first if it already exists: --delete-target-dir
- Use \t as the field separator when writing the MySQL columns to HDFS: --fields-terminated-by "\t"
- How many MapTasks to run: --num-mappers 2
- Split the data into 2 slices based on the id column; only needed when --num-mappers > 1. Prefer a column without NULL values (NULL rows cannot be imported this way!), ideally the primary key or a numeric column: --split-by id
[whx@hadoop102 ~]$ sqoop import \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 123456 \
--table staff \
--target-dir /company \
--delete-target-dir \
--fields-terminated-by "\t" \
--num-mappers 2 \
--split-by id
Warning: /opt/module/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /opt/module/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
21/02/04 14:43:51 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
21/02/04 14:43:51 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
21/02/04 14:43:51 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
21/02/04 14:43:51 INFO tool.CodeGenTool: Beginning code generation
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/module/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/module/hbase/lib/phoenix-4.14.2-HBase-1.3-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/module/hbase/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
21/02/04 14:43:51 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `staff` AS t LIMIT 1
21/02/04 14:43:52 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `staff` AS t LIMIT 1
21/02/04 14:43:52 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/module/hadoop-2.7.2
Note: /tmp/sqoop-whx/compile/0438fe2a8b732eae5cb55256e845c9c4/staff.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
21/02/04 14:43:53 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-whx/compile/0438fe2a8b732eae5cb55256e845c9c4/staff.jar
21/02/04 14:43:54 INFO tool.ImportTool: Destination directory /company is not present, hence not deleting.
21/02/04 14:43:54 WARN manager.MySQLManager: It looks like you are importing from mysql.
21/02/04 14:43:54 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
21/02/04 14:43:54 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
21/02/04 14:43:54 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
21/02/04 14:43:54 INFO mapreduce.ImportJobBase: Beginning import of staff
21/02/04 14:43:54 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
21/02/04 14:43:54 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
21/02/04 14:43:54 INFO client.RMProxy: Connecting to ResourceManager at hadoop102/192.168.1.102:8032
21/02/04 14:43:56 INFO db.DBInputFormat: Using read commited transaction isolation
21/02/04 14:43:56 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`id`), MAX(`id`) FROM `staff`
21/02/04 14:43:56 INFO mapreduce.JobSubmitter: number of splits:2
21/02/04 14:43:56 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1612420994275_0001
21/02/04 14:43:56 INFO impl.YarnClientImpl: Submitted application application_1612420994275_0001
21/02/04 14:43:56 INFO mapreduce.Job: The url to track the job: http://hadoop102:8088/proxy/application_1612420994275_0001/
21/02/04 14:43:56 INFO mapreduce.Job: Running job: job_1612420994275_0001
21/02/04 14:44:04 INFO mapreduce.Job: Job job_1612420994275_0001 running in uber mode : false
21/02/04 14:44:04 INFO mapreduce.Job: map 0% reduce 0%
21/02/04 14:44:08 INFO mapreduce.Job: map 50% reduce 0%
21/02/04 14:44:09 INFO mapreduce.Job: map 100% reduce 0%
21/02/04 14:44:09 INFO mapreduce.Job: Job job_1612420994275_0001 completed successfully
21/02/04 14:44:09 INFO mapreduce.Job: Counters: 30
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=270734
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=197
                HDFS: Number of bytes written=32
                HDFS: Number of read operations=8
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=4
        Job Counters
                Launched map tasks=2
                Other local map tasks=2
                Total time spent by all maps in occupied slots (ms)=5689
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=5689
                Total vcore-milliseconds taken by all map tasks=5689
                Total megabyte-milliseconds taken by all map tasks=5825536
        Map-Reduce Framework
                Map input records=2
                Map output records=2
                Input split bytes=197
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=271
                CPU time spent (ms)=1370
                Physical memory (bytes) snapshot=341798912
                Virtual memory (bytes) snapshot=4193730560
                Total committed heap usage (bytes)=229638144
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=32
21/02/04 14:44:09 INFO mapreduce.ImportJobBase: Transferred 32 bytes in 15.7015 seconds (2.038 bytes/sec)
21/02/04 14:44:09 INFO mapreduce.ImportJobBase: Retrieved 2 records.
[whx@hadoop102 ~]$
2.1.2 Import specific columns with --columns (tip: when several columns are involved, separate them with commas and do not add spaces)
- Import only the id and name columns
--columns id,name
$ bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 123456 \
--table staff \
--columns id,name \
--target-dir /company \
--delete-target-dir \
--fields-terminated-by "\t" \
--num-mappers 1 \
--split-by id
2.1.3 Import specific rows with --where
- Import only the rows that match the filter condition
--where 'id >= 10 and id <= 20' \
$ bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 123456 \
--table staff \
--where 'id >= 10 and id <= 20' \
--target-dir /company \
--delete-target-dir \
--fields-terminated-by "\t" \
--num-mappers 1 \
--split-by id
2.1.4 Query import
- Tip: the query must contain '$CONDITIONS' in its WHERE clause.
- If the query is wrapped in double quotes, $CONDITIONS must be preceded by an escape character so the shell does not interpret it as one of its own variables.
- Runs the given SQL query and imports its result set; when --query is used, do not also use --table, --where, or --columns.
- Whenever --query is used, $CONDITIONS must be included; Sqoop automatically replaces this placeholder with its own expressions.
--query "select * from staff where \$CONDITIONS and id <= 25" \
$ bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 123456 \
--query "select * from staff where \$CONDITIONS and id <= 25" \
--target-dir /company \
--delete-target-dir \
--fields-terminated-by "\t" \
--num-mappers 2 \
--split-by id
2.2 MySQL to Hive
A Sqoop import into Hive runs in two stages: the data is first imported into HDFS, and the HDFS data is then loaded into the Hive table.
- Import into Hive: --hive-import
- Field separator for the Hive table columns: --fields-terminated-by "\t"
- Overwrite existing data (insert overwrite semantics): --hive-overwrite
- Name of the target Hive table; Sqoop creates it automatically, but creating it manually in Hive is recommended so the data types match the MySQL table: --hive-table staff_hive
$ bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 123456 \
--table staff \
--num-mappers 1 \
--hive-import \
--fields-terminated-by "\t" \
--hive-overwrite \
--hive-table staff_hive
Tip: the process has two steps. Step one imports the data into HDFS; step two moves that HDFS data into the Hive warehouse. The default staging directory for step one is /user/<username>/<table name>.
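A quick, optional way to confirm the Hive-side result (assuming the Hive CLI is on the PATH; staff_hive is the table created by the import above):
hive -e "select * from staff_hive limit 10;"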
2.3 MySQL to HBase
- Sqoop 1.4.6 targets older HBase releases; HBase 1.3.0 is comparatively too new for it.
- During an import Sqoop can create the HBase table automatically, but with Sqoop 1.4.6 against HBase 1.3.0 the automatic table creation fails, so creating the table manually is recommended.
- Create the HBase table automatically if it does not exist: --hbase-create-table
- Target HBase table name: --hbase-table "staff_hbase"
- Which MySQL column to use as the rowkey: --hbase-row-key "id"
- Column family to import into: --column-family "info"
$ bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 123456 \
--table staff \
--where 'id <= 5' \
--hbase-create-table \
--hbase-table "staff_hbase" \
--hbase-row-key "id" \
--column-family "info" \
--num-mappers 1 \
--split-by id
Tip: Sqoop 1.4.6 only supports automatic HBase table creation for HBase versions before 1.0.1.
Workaround: create the HBase table manually
hbase> create 'staff_hbase','info'
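After the import finishes, the rows can be checked from the HBase shell, for example:
hbase> scan 'staff_hbase'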
3. Exporting data (Hive/HDFS to MySQL)
- In Sqoop, "export" means transferring data from the big-data cluster (HDFS, Hive) into a non-big-data cluster, i.e. an RDBMS such as MySQL; the export keyword is used.
- If the MySQL table does not exist it is not created automatically, so create it in advance.
- Exporting from HBase directly into an RDBMS is not supported.
- The MySQL table to export into: --table staff2
- Number of MapTasks: --num-mappers 1
- HDFS path of the data to export: --export-dir /company
- Which column is used to detect duplicate rows during export. If the target table already contains data, specify --update-key; the exported rows are then written with UPDATE statements, so only matching rows are updated and non-matching rows are not inserted: --update-key
- To update matching rows and insert the rest, combine --update-key with --update-mode (default updateonly): --update-mode=allowinsert (an upsert sketch follows the run log below)
- Field separator of the exported data: --input-fields-terminated-by "\t"
[whx@hadoop102 ~]$ sqoop export \
--connect 'jdbc:mysql://hadoop102:3306/company?useUnicode=true&characterEncoding=utf-8' \
--username root \
--password 123456 \
--table staff2 \
--num-mappers 1 \
--export-dir /company \
--input-fields-terminated-by "\t"
Warning: /opt/module/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /opt/module/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
21/02/04 15:25:37 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
21/02/04 15:25:37 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
21/02/04 15:25:37 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
21/02/04 15:25:37 INFO tool.CodeGenTool: Beginning code generation
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/module/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/module/hbase/lib/phoenix-4.14.2-HBase-1.3-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/module/hbase/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
21/02/04 15:25:37 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `staff2` AS t LIMIT 1
21/02/04 15:25:37 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `staff2` AS t LIMIT 1
21/02/04 15:25:37 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/module/hadoop-2.7.2
Note: /tmp/sqoop-whx/compile/3030961eef5236a3e482591dec69c120/staff2.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
21/02/04 15:25:38 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-whx/compile/3030961eef5236a3e482591dec69c120/staff2.jar
21/02/04 15:25:38 INFO mapreduce.ExportJobBase: Beginning export of staff2
21/02/04 15:25:38 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
21/02/04 15:25:39 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
21/02/04 15:25:39 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
21/02/04 15:25:39 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
21/02/04 15:25:39 INFO client.RMProxy: Connecting to ResourceManager at hadoop102/192.168.1.102:8032
21/02/04 15:25:40 INFO input.FileInputFormat: Total input paths to process : 2
21/02/04 15:25:40 INFO input.FileInputFormat: Total input paths to process : 2
21/02/04 15:25:40 INFO mapreduce.JobSubmitter: number of splits:1
21/02/04 15:25:40 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
21/02/04 15:25:41 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1612420994275_0002
21/02/04 15:25:41 INFO impl.YarnClientImpl: Submitted application application_1612420994275_0002
21/02/04 15:25:41 INFO mapreduce.Job: The url to track the job: http://hadoop102:8088/proxy/application_1612420994275_0002/
21/02/04 15:25:41 INFO mapreduce.Job: Running job: job_1612420994275_0002
21/02/04 15:25:46 INFO mapreduce.Job: Job job_1612420994275_0002 running in uber mode : false
21/02/04 15:25:46 INFO mapreduce.Job: map 0% reduce 0%
21/02/04 15:25:51 INFO mapreduce.Job: map 100% reduce 0%
21/02/04 15:25:51 INFO mapreduce.Job: Job job_1612420994275_0002 completed successfully
21/02/04 15:25:51 INFO mapreduce.Job: Counters: 30
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=135055
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=220
                HDFS: Number of bytes written=0
                HDFS: Number of read operations=7
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=0
        Job Counters
                Launched map tasks=1
                Rack-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=2304
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=2304
                Total vcore-milliseconds taken by all map tasks=2304
                Total megabyte-milliseconds taken by all map tasks=2359296
        Map-Reduce Framework
                Map input records=2
                Map output records=2
                Input split bytes=182
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=177
                CPU time spent (ms)=490
                Physical memory (bytes) snapshot=168833024
                Virtual memory (bytes) snapshot=2093150208
                Total committed heap usage (bytes)=115343360
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=0
21/02/04 15:25:51 INFO mapreduce.ExportJobBase: Transferred 220 bytes in 12.3045 seconds (17.8796 bytes/sec)
21/02/04 15:25:51 INFO mapreduce.ExportJobBase: Exported 2 records.
[whx@hadoop102 ~]$
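If the target table already holds data and you want update-or-insert behavior, the --update-key and --update-mode options described above can be added to the same command. A sketch under the same assumptions as the example above (id is assumed to be the key column of staff2):
sqoop export \
--connect 'jdbc:mysql://hadoop102:3306/company?useUnicode=true&characterEncoding=utf-8' \
--username root \
--password 123456 \
--table staff2 \
--update-key id \
--update-mode allowinsert \
--num-mappers 1 \
--export-dir /company \
--input-fields-terminated-by "\t"
With --update-mode allowinsert against MySQL, Sqoop generates upsert-style statements, so matching rows are updated and new rows are inserted.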
Example:
Requirement: export the emp_hive table in Hive to MySQL (full export).
- Step 1: create the target table in MySQL (required):
CREATE TABLE `emp_out` (
  `id` INT(11) DEFAULT NULL,
  `name` VARCHAR(100) DEFAULT NULL,
  `deg` VARCHAR(100) DEFAULT NULL,
  `salary` INT(11) DEFAULT NULL,
  `dept` VARCHAR(10) DEFAULT NULL
) ENGINE=INNODB DEFAULT CHARSET=utf8;
- Step 2: run the Sqoop export:
sqoop export \
--connect jdbc:mysql://192.168.52.150:3306/test \
--username root \
--password 123456 \
--table emp_out \
--hcatalog-database sqooptohive \
--hcatalog-table emp_hive \
-m 1
Sqoop parameter reference
| Parameter | Description |
| --- | --- |
| --connect | JDBC URL of the relational database |
| --username | Username for the database connection |
| --password | Password for the database connection |
| --driver | JDBC driver class |
| --query or -e | Imports the result set of a query; must be used together with --target-dir or --hcatalog-table. If the query has a WHERE clause, the $CONDITIONS keyword must be appended to it; if the SQL is wrapped in double quotes, escape it with a backslash: \$CONDITIONS |
| --hcatalog-database | Database name of the HCatalog table; if not given, the default database is used. Supplying --hcatalog-database without --hcatalog-table is an error. |
| --hcatalog-table | Name of the HCatalog table. Its presence means the import or export job is done through an HCatalog table, and it is required for HCatalog jobs. |
| --create-hcatalog-table | Whether the HCatalog table should be created automatically during import; the table name equals the database table name converted to lowercase. |
| --hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="SNAPPY")' | Appends a storage clause to the generated CREATE TABLE statement; tblproperties sets table properties, here ORC compressed with SNAPPY. |
| -m | Number of parallel MapReduce tasks. When -m is not 1, use --split-by to specify the split column for parallel import, preferably an integer column. |
| --split-by id | When --split-by is used together with --query, the $CONDITIONS keyword is still required, and a double-quoted query still needs the backslash escape. |
| --hcatalog-partition-keys, --hcatalog-partition-values | Keys and values must appear together; equivalent to specifying a static partition. Multiple static partition keys and values may be supplied, separated by commas, e.g. --hcatalog-partition-keys year,month,day --hcatalog-partition-values 1999,12,31 |
| --null-string '\N', --null-non-string '\N' | What to store when a MySQL value is NULL; --null-string handles NULLs in string columns, --null-non-string handles NULLs in non-string columns. |
| --hive-drop-import-delims | Ignores delimiter characters found inside string values (enabled by default for HCatalog jobs). |
| --fields-terminated-by '\t' | Sets the field separator. |
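To show how several of these options combine in practice, here is a sketch of an HCatalog import into an auto-created, ORC-backed, statically partitioned table. The table emp_hive_part and the day partition are hypothetical names; the connection details are reused from the earlier examples:
sqoop import \
--connect jdbc:mysql://hadoop01:3306/test \
--username root --password 123456 \
--table emp \
--hcatalog-database sqooptohive \
--hcatalog-table emp_hive_part \
--create-hcatalog-table \
--hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="SNAPPY")' \
--hcatalog-partition-keys day \
--hcatalog-partition-values '2021-02-04' \
-m 1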
IV. Packaging Sqoop commands into a script
With a Sqoop options file, imports and exports can be run as scheduled jobs (a crontab sketch follows at the end of this section).
The Sqoop command is written into the options file, with each option name and its value on separate lines.
Create a .opt file:
$ mkdir opt
$ touch opt/job_mysql2hdfs.opt
Script format:
--connect
jdbc:mysql://hadoop102:3306/company?useUnicode=true&characterEncoding=utf-8
--username
root
--password
123456
--table
staff
--target-dir
/company3
--delete-target-dir
--num-mappers
1
--fields-terminated-by
"\t"
--split-by
id
Run the script:
sqoop --options-file opt/job_mysql2hdfs.opt
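Because the options file holds the whole job definition, it is easy to schedule. A hypothetical crontab entry (the options-file path and log location are assumptions) that runs the job every day at 2 a.m.:
0 2 * * * /opt/module/sqoop/bin/sqoop --options-file /home/whx/opt/job_mysql2hdfs.opt >> /tmp/job_mysql2hdfs.log 2>&1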
That concludes this article on Sqoop, the big-data ETL tool for moving (business) data between relational databases (MySQL, Oracle, ...) and Hive/HBase/HDFS, i.e. between Hadoop and relational databases. Hopefully it is helpful to other developers.