ETL Visualization Tool DataX: Installation and Deployment (Part 2)


Introduction

DataX series articles:

ETL Visualization Tool DataX: Introduction (Part 1)

DataX private repositories:

https://gitee.com/dazhong000/datax.git
https://gitee.com/dazhong000/datax-web.git
Local path: E:\soft\2023-08-datax

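2.1 Installing DataX

Installation guide (GitHub): https://github.com/alibaba/DataX/blob/master/userGuid.md

2.1.1 Unpack and install

Method 1: download the DataX package directly:
Download URL: https://datax-opensource.oss-cn-hangzhou.aliyuncs.com/202308/datax.tar.gz
After downloading, extract the archive to a local directory and change into the bin directory to run a synchronization job: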

$ cd  {YOUR_DATAX_HOME}/bin
$ python datax.py {YOUR_JOB.json}

Self-check script:

python {YOUR_DATAX_HOME}/bin/datax.py {YOUR_DATAX_HOME}/job/job.json
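
If you would rather launch the self-check from a script, the same command can be assembled in Python. This is only a minimal sketch: `build_datax_command`, `run_datax`, and the path `/opt/datax` are hypothetical names chosen for illustration. Only the `python datax.py <job.json>` invocation and the `job/job.json` location come from the text above.

```python
import os
import subprocess
import sys


def build_datax_command(datax_home, job_file=None):
    """Return the argument list for `python datax.py <job.json>`.

    When job_file is omitted, the bundled self-check job (job/job.json) is used.
    """
    launcher = os.path.join(datax_home, "bin", "datax.py")
    if job_file is None:
        job_file = os.path.join(datax_home, "job", "job.json")
    # sys.executable is the interpreter running this script; swap in another
    # interpreter if your DataX release expects a specific Python version.
    return [sys.executable, launcher, job_file]


def run_datax(datax_home, job_file=None):
    """Run the job and return the exit code of the DataX process."""
    return subprocess.call(build_datax_command(datax_home, job_file))


# /opt/datax is a placeholder for {YOUR_DATAX_HOME}
print(build_datax_command("/opt/datax"))
```

Calling `run_datax("/opt/datax")` would then execute the self-check and return its exit code, which a wrapper script can test for success.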

Method 2: download the DataX source code and build it yourself:
DataX source code

(1) Download the DataX source code:

$ git clone git@github.com:alibaba/DataX.git

(2) Package it with Maven:

$ cd  {DataX_source_code_home}
$ mvn -U clean package assembly:assembly -Dmaven.test.skip=true
When packaging succeeds, the log shows:
[INFO] BUILD SUCCESS
[INFO] -----------------------------------------------------------------
[INFO] Total time: 08:12 min
[INFO] Finished at: 2015-12-13T16:26:48+08:00
[INFO] Final Memory: 133M/960M
[INFO] -----------------------------------------------------------------

After a successful build, the DataX package is located in {DataX_source_code_home}/target/datax/datax/ and has the following layout:

$ cd  {DataX_source_code_home}
$ ls ./target/datax/datax/
bin        conf        job        lib        log        log_perf    plugin        script        tmp
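
A quick way to confirm that a build or an extracted archive has this layout is to check for each directory by name. The sketch below assumes the directory list shown in the `ls` output above; `EXPECTED_DIRS` and `missing_dirs` are hypothetical names, and a missing entry is a hint to investigate rather than proof of a broken build.

```python
import os
import tempfile

# Directory names copied from the `ls` listing above
EXPECTED_DIRS = ["bin", "conf", "job", "lib", "log", "log_perf",
                 "plugin", "script", "tmp"]


def missing_dirs(datax_home):
    """Return the expected sub-directories that are absent from datax_home."""
    return [name for name in EXPECTED_DIRS
            if not os.path.isdir(os.path.join(datax_home, name))]


# Demonstrate on a throwaway directory that only contains bin/ and conf/
with tempfile.TemporaryDirectory() as home:
    for name in ("bin", "conf"):
        os.mkdir(os.path.join(home, name))
    print(missing_dirs(home))
```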

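2.1.2 Configuration example: read data from a stream and print it to the console

Step 1: create the job configuration file (JSON format)

You can print a configuration template with the command: python datax.py -r {YOUR_READER} -w {YOUR_WRITER}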

$ cd  {YOUR_DATAX_HOME}/bin
$  python datax.py -r streamreader -w streamwriter
DataX (UNKNOWN_DATAX_VERSION), From Alibaba !
Copyright (C) 2010-2015, Alibaba Group. All Rights Reserved.
Please refer to the streamreader document:
    https://github.com/alibaba/DataX/blob/master/streamreader/doc/streamreader.md

Please refer to the streamwriter document:
    https://github.com/alibaba/DataX/blob/master/streamwriter/doc/streamwriter.md

Please save the following configuration as a json file and  use
    python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json
to run the job.

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "column": [],
                        "sliceRecordCount": ""
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "encoding": "",
                        "print": true
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": ""
            }
        }
    }
}
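
The template command prints banner text before the JSON, and the banner itself contains braces (`{DATAX_HOME}`, `{JSON_FILE_NAME}`), so cutting at the first `{` does not work. The following is a minimal sketch of one way to pull the template out of captured output, assuming the output has the shape shown above; `extract_template` is a hypothetical helper, not part of DataX.

```python
import json


def extract_template(output):
    """Return the first valid JSON object found in the command output.

    Each '{' is tried in turn, so placeholders such as {DATAX_HOME}
    in the banner are skipped because they do not parse as JSON.
    """
    decoder = json.JSONDecoder()
    pos = output.find("{")
    while pos != -1:
        try:
            obj, _end = decoder.raw_decode(output, pos)
            return obj
        except json.JSONDecodeError:
            pos = output.find("{", pos + 1)
    raise ValueError("no JSON object found in output")


# Shortened stand-in for the real banner and template
sample = (
    "Please save the following configuration as a json file and  use\n"
    "    python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json\n"
    "to run the job.\n"
    '{"job": {"content": [], "setting": {"speed": {"channel": ""}}}}\n'
)
print(extract_template(sample)["job"]["setting"])
```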

Based on the template, configure the JSON as follows:

#stream2stream.json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "sliceRecordCount": 10,
                        "column": [
                            {
                                "type": "long",
                                "value": "10"
                            },
                            {
                                "type": "string",
                                "value": "hello，你好，世界-DataX"
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "encoding": "UTF-8",
                        "print": true
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 5
            }
        }
    }
}
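
If you prefer to generate this file rather than edit it by hand, the same structure can be built as a Python dict and written out with the standard `json` module. This is a sketch under the assumption that only `sliceRecordCount` and `channel` vary; `stream_job` and `write_job` are hypothetical helper names, and the field names and values are copied from the configuration above.

```python
import json


def stream_job(slice_record_count, channel):
    """Build the stream-to-stream job shown above as a Python dict."""
    return {
        "job": {
            "content": [{
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "sliceRecordCount": slice_record_count,
                        "column": [
                            {"type": "long", "value": "10"},
                            {"type": "string", "value": "hello，你好，世界-DataX"},
                        ],
                    },
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {"encoding": "UTF-8", "print": True},
                },
            }],
            "setting": {"speed": {"channel": channel}},
        }
    }


def write_job(path, job):
    """Write the job dict to path as UTF-8 JSON."""
    # ensure_ascii=False keeps the Chinese sample text readable in the file
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(job, fh, ensure_ascii=False, indent=4)


print(json.dumps(stream_job(10, 5)["job"]["setting"]))
```

Calling `write_job("stream2stream.json", stream_job(10, 5))` would produce a file equivalent to the one above.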

Example: MySQL data synchronization configuration:

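{
    "job": {
        "content": [
            {
                "reader": {
                    // reader side
                    "name": "mysqlreader",
                    "parameter": {
                        // user for the source database connection
                        "username": "root",
                        // password for the source database connection
                        "password": "root",
                        // columns to synchronize (* means all columns)
                        "column": ["*"],
                        "connection": [
                            {
                                // source database connection
                                "jdbcUrl": ["jdbc:mysql://127.0.0.3:3360/studysource?useUnicode=true&characterEncoding=utf8"],
                                // source table
                                "table": ["staff_info"]
                            }
                        ]
                    }
                },
                "writer": {
                    // writer side
                    "name": "mysqlwriter",
                    "parameter": {
                        // user for the target database connection
                        "username": "root",
                        // password for the target database connection
                        "password": "root",
                        "connection": [
                            {
                                // target database connection
                                "jdbcUrl": "jdbc:mysql://127.2.3.4:3360/studysync?useUnicode=true&characterEncoding=utf8",
                                // target table
                                "table": ["staff_info"]
                            }
                        ],
                        // statements to run before synchronization
                        "preSql": ["TRUNCATE TABLE staff_info"],
                        // columns to synchronize
                        "column": ["*"]
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                // number of concurrent channels
                "channel": "5"
            }
        }
    }
}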
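
The `//` annotations in this example are not valid in strict JSON, and Python's `json` module rejects them. If you want to validate such an annotated file with a strict parser, one option is to strip the comments first. The sketch below is a hypothetical helper (`strip_line_comments`), not a DataX feature; it tracks string literals so that the `//` inside `jdbc:mysql://...` is left alone.

```python
import json


def strip_line_comments(text):
    """Remove // comments that appear outside JSON string literals."""
    out = []
    in_string = False
    escaped = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False          # character after a backslash
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False        # closing quote
            i += 1
        elif ch == '"':
            in_string = True             # opening quote
            out.append(ch)
            i += 1
        elif text.startswith("//", i):
            # Skip to the end of the line, keeping the newline itself
            end = text.find("\n", i)
            i = len(text) if end == -1 else end
        else:
            out.append(ch)
            i += 1
    return "".join(out)


annotated = """{
    // source database connection
    "jdbcUrl": ["jdbc:mysql://127.0.0.3:3360/studysource"],
    "table": ["staff_info"] // source table
}"""
config = json.loads(strip_line_comments(annotated))
print(config["jdbcUrl"][0])
```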

Step 2: start DataX

$ cd {YOUR_DATAX_DIR_BIN}
$ python datax.py ./stream2stream.json

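When the synchronization finishes, the log shows the following. DataX prints the summary labels in Chinese; in order they are: task start time, task end time, total elapsed time, average throughput, record write speed, total records read, and total read/write failures.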

...
2015-12-17 11:20:25.263 [job-0] INFO  JobContainer - 
任务启动时刻                    : 2015-12-17 11:20:15
任务结束时刻                    : 2015-12-17 11:20:25
任务总计耗时                    :                 10s
任务平均流量                    :              205B/s
记录写入速度                    :              5rec/s
读出记录总数                    :                  50
读写失败总数                    :                   0
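
The figures in this summary are consistent with the job configuration: `sliceRecordCount` of 10 with 5 channels gives 50 records, and 50 records over 10 seconds gives 5 rec/s. The sketch below reproduces that arithmetic, assuming each channel emits `sliceRecordCount` records, which matches this log but is not stated explicitly in the text; `expected_totals` is a hypothetical helper.

```python
def expected_totals(slice_record_count, channel, elapsed_seconds):
    """Return (total records, records per second) for a stream job.

    Assumes every channel emits slice_record_count records.
    """
    total = slice_record_count * channel
    return total, total / elapsed_seconds


total, rate = expected_totals(10, 5, 10)
print(total, rate)  # 50 5.0
```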