This article looks at writing data from Spark into HBase (the BulkLoad approach) and is intended as a practical reference for developers who need to do the same.
When working with Spark you often need to land data in HBase. Writing through the ordinary Java client API (row-by-row Puts) is slow, so Spark exposes an interface for bulk writes. What advantages does a bulk write have over a normal write?
- A BulkLoad does not write to the WAL, and it does not trigger flushes or splits.
- Inserting data through a large number of Put calls can cause heavy GC activity. Besides hurting performance, in severe cases it can even threaten the stability of HBase nodes. A bulk load avoids this concern.
- The load itself does not burn performance on a large number of API calls (a minimal sketch of this HFile-based path follows this list).
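The HFile-based path those points describe roughly looks like the sketch below: write HFiles with `HFileOutputFormat2`, then hand them to the RegionServers with `LoadIncrementalHFiles`. This is only a minimal sketch under assumptions, not the article's worked example (the example further down writes Puts through `TableOutputFormat`, which is a different path); the table name `bulk_demo`, the column family `dev`, the qualifier `di`, the sample rows and the staging directory `/tmp/hfiles_bulk_demo` are all made up for illustration, and the exact `LoadIncrementalHFiles` API differs slightly across HBase versions.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat2, LoadIncrementalHFiles}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.sql.SparkSession

object BulkLoadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()
    val sc = spark.sparkContext

    val conf = HBaseConfiguration.create(sc.hadoopConfiguration)
    val tableName = TableName.valueOf("bulk_demo")   // assumed, existing HBase table with family "dev"
    val hfilePath = "/tmp/hfiles_bulk_demo"          // assumed HDFS staging directory for the HFiles

    val connection = ConnectionFactory.createConnection(conf)
    val table = connection.getTable(tableName)
    val regionLocator = connection.getRegionLocator(tableName)

    // Let HFileOutputFormat2 produce HFiles that line up with the table's current regions
    val job = Job.getInstance(conf)
    HFileOutputFormat2.configureIncrementalLoad(job, table, regionLocator)

    // HFiles must be written in row-key order, so sort before building the KeyValues
    val kvRdd = sc.parallelize(Seq("row1" -> "v1", "row2" -> "v2"))
      .sortByKey()
      .map { case (rowKey, value) =>
        val kv = new KeyValue(Bytes.toBytes(rowKey), Bytes.toBytes("dev"),
          Bytes.toBytes("di"), Bytes.toBytes(value))
        (new ImmutableBytesWritable(Bytes.toBytes(rowKey)), kv)
      }

    // Step 1: write HFiles straight to HDFS -- no WAL, no memstore flush, no split pressure
    kvRdd.saveAsNewAPIHadoopFile(hfilePath, classOf[ImmutableBytesWritable],
      classOf[KeyValue], classOf[HFileOutputFormat2], job.getConfiguration)

    // Step 2: hand the finished HFiles over to the RegionServers
    new LoadIncrementalHFiles(conf).doBulkLoad(new Path(hfilePath),
      connection.getAdmin, table, regionLocator)

    connection.close()
    spark.stop()
  }
}
```

The key points are that the RDD has to be sorted by row key before the HFiles are written, and that the finished HFiles are simply adopted by the RegionServers, which is why no WAL, flush or split activity is involved.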
```scala
import org.apache.hadoop.hbase.client.{Put, Result}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

/**
 * GDPR data masking:
 * 1. MD5-hash the plaintext device IDs.
 * 2. Incrementally update HBase.
 * 3. Sync a copy of the data to Hive.
 */
object DimDevInfoPro extends AnalysysLogger { // AnalysysLogger: project trait that provides `logger`

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org.apache.spark").setLevel(Level.OFF)

    if (args.length < 2) {
      throw new IllegalArgumentException("Need 2 args!!!")
    }
    val Array(proDate, proTableName) = args

    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    val sc = spark.sparkContext

    val tablename = "dim_device_id_info"
    sc.hadoopConfiguration.set(TableOutputFormat.OUTPUT_TABLE, tablename)

    val job = Job.getInstance(sc.hadoopConfiguration)
    job.setOutputKeyClass(classOf[ImmutableBytesWritable])
    job.setOutputValueClass(classOf[Result])
    job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

    // Devices that are not yet in the dimension table and therefore need to be added
    var dim_device_id_info_sql =
      s"""
         |select md5(a.device_id) di_md5, a.device_id di
         |from (
         |  select device_id
         |  from ${proTableName}
         |  where day=${proDate}
         |    and device_id is not null
         |    and length(device_id) > 15
         |    and length(device_id) < 100
         |  group by 1
         |) a
         |left join dim.dim_device_id_info b on b.di = a.device_id
         |where b.di is null
       """.stripMargin
    if (proTableName.equals("source.source_dev_filter")) {
      dim_device_id_info_sql =
        s"""
           |select md5(a.device_id) di_md5, a.device_id di
           |from (
           |  select device_id
           |  from source.source_dev_filter
           |  where day=${proDate}
           |    and status=0
           |    and device_id is not null
           |    and length(device_id) > 15
           |    and length(device_id) < 100
           |  group by 1
           |) a
           |left join dim.dim_device_id_info b on b.di = a.device_id
           |where b.di is null
         """.stripMargin
    }
    logger.info("dim_device_id_info_sql: " + dim_device_id_info_sql)
    val dim_device_id_info = spark.sql(dim_device_id_info_sql)

    // Update HBase
    val rdd = dim_device_id_info.rdd.map { x =>
      val di_md5 = x.getAs("di_md5").toString
      val di = x.getAs("di").toString
      val put = new Put(di_md5.getBytes())                                     // row key
      put.addColumn(Bytes.toBytes("dev"), Bytes.toBytes("di"), di.getBytes())  // value of the dev:di column
      (new ImmutableBytesWritable, put)
    }
    rdd.saveAsNewAPIHadoopDataset(job.getConfiguration())

    // Sync to a Hive table to make querying convenient
    spark.sql("insert overwrite table dim.dim_device_id_info select * from dim.dim_device_id_info_hbase")

    spark.stop()
  }
}
```
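The final `insert overwrite` presumably reads from `dim.dim_device_id_info_hbase`, a Hive external table mapped onto the HBase table (the article does not show its DDL), and copies the rows into a plain Hive table for convenient querying. If you want to sanity-check the HBase side directly from Spark instead, a minimal read-back sketch along these lines should work; the object name `DimDevInfoCheck` is made up for illustration, and the table, column family and qualifier match what the job above writes.

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.SparkSession

object DimDevInfoCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()
    val sc = spark.sparkContext

    val conf = HBaseConfiguration.create(sc.hadoopConfiguration)
    conf.set(TableInputFormat.INPUT_TABLE, "dim_device_id_info")

    // Scan the table through TableInputFormat and rebuild (md5, device_id) pairs
    val hbaseRdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])

    val pairs = hbaseRdd.map { case (rowKey, result) =>
      val diMd5 = Bytes.toString(rowKey.copyBytes())
      val di = Bytes.toString(result.getValue(Bytes.toBytes("dev"), Bytes.toBytes("di")))
      (diMd5, di)
    }
    println(s"rows in dim_device_id_info: ${pairs.count()}")
    spark.stop()
  }
}
```

Counting (or sampling) the scanned pairs is usually enough to confirm that the incremental update landed as expected.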
That wraps up this look at writing to HBase from Spark (the BulkLoad approach); hopefully it is useful in your own work.