本文主要是介绍存储过程实现上亿级图数据分块ETL,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!
图重构-重复关系重构
- 图数据分块ETL
- 函数与过程功能介绍
- 完整实现案例
图数据分块ETL
图数据ETL的一个场景是需要将上亿条上百G的原始数据构建为图数据,在内存不够用的情况下保证数据构建过程可以平稳顺利运行,需要使用数据分块的方式进行构建。如下通过存储过程实现数据分块方案。该解决方案依赖于原始数据库的自增ID【超大CSV文件的构建可以导入MySQL之后构建】,经过测试可以在生产环境正常运行并且避免过多的内存消耗。
函数与过程功能介绍
- 从关系数据库加载数据
apoc.load.jdbc
- 函数实现数据块ID拆分
olab.ids.batch
- 迭代处理数据块
apoc.periodic.iterate
- 对包含特殊字符的变量进行转义操作
olab.escape
- 数据分块-从数据库获取最大最小自增ID
WITH 'jdbc:mysql://test.amazonaws.com.cn:3306/database?user=dev&password=test&useUnicode=true&characterEncoding=utf8&serverTimezone=UTC' AS url,'SELECT MIN(puid) AS min,MAX(puid) AS max FROM table' AS sql,10000000 AS batch
CALL apoc.load.jdbc(url,sql) YIELD row WITH row.min AS min,row.max AS max,batch
WITH olab.ids.batch(min,max,batch) AS value
UNWIND value AS list
RETURN list[0] AS min,list[1] AS max
过程与函数插件ongdb-lab-apoc
过程与函数插件neo4j-apoc-procedures
完整实现案例
WITH 'jdbc:mysql://test.amazonaws.com.cn:3306/database?user=dev&password=test&useUnicode=true&characterEncoding=utf8&serverTimezone=UTC' AS url,'SELECT MIN(puid) AS min,MAX(puid) AS max FROM table' AS sql,1000000 AS batch,'SELECT parent_pcode AS `name`,CONVERT(DATE_FORMAT(hupdatetime,\'%Y%m%d%H%i%S\'),UNSIGNED INTEGER) AS hupdatetime FROM table WHERE hisvalid=1 AND parent_pcode IS NOT NULL AND puid>=? AND puid<=?' AS loadSql
CALL apoc.load.jdbc(url,sql) YIELD row WITH row.min AS min,row.max AS max,url,batch,loadSql
WITH olab.ids.batch(min,max,batch) AS value,url,batch,loadSql
UNWIND value AS list
WITH list[0] AS min,list[1] AS max,url,loadSql,'CALL apoc.load.jdbc({url},{loadSql},[{min},{max}])' AS jdbc
CALL apoc.periodic.iterate(olab.replace(jdbc,[{raw:'{url}',rep:'\''+url+'\''},{raw:'{loadSql}',rep:'\''+olab.escape(loadSql)+'\''},{raw:'{min}',rep:min},{raw:'{max}',rep:max}]),'MERGE (n:PREPCODE {name:row.name}) SET n+=row',{parallel:false,batchSize:1000,iterateList: false}) YIELD batches,total,timeTaken,committedOperations,failedOperations,failedBatches,retries,errorMessages,batch,operations RETURN batches,total,timeTaken,committedOperations,failedOperations,failedBatches,retries,errorMessages,batch,operations;
这篇关于存储过程实现上亿级图数据分块ETL的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!