This article introduces the right way to read data from an Iceberg data lake through Hive integration, and is intended as a practical reference for developers tackling this problem.
1 Overview
Iceberg is a table format specification whose data splits into metadata and table data, stored independently of each other. Metadata currently supports storage in the local file system, HMS, Hadoop, a JDBC database, AWS Glue, or a custom store; table data supports the local file system, HDFS, S3, MinIO, OBS, OSS, and so on. HMS-based metadata storage is the most widely used; in this article, table data is stored on MinIO and metadata mainly in HMS. In practice, even with HMS-based storage only a small amount of metadata actually lives in HMS; the bulk of it still resides in an external system such as HDFS or MinIO, so the Hadoop-based storage approach also has to be covered here.
Iceberg's metadata management is selected via the catalog type; here we focus mainly on hive, hadoop, location_based_table, and custom catalogs.
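As a rough sketch, catalog types are registered through Hive session properties of the form `iceberg.catalog.<name>.type`; the catalog name `hadoop_cat` and warehouse path below are illustrative, not from this article's environment:

```sql
-- Illustrative only: registering a hadoop-type catalog
-- (catalog name and warehouse path are made-up examples)
SET iceberg.catalog.hadoop_cat.type=hadoop;
SET iceberg.catalog.hadoop_cat.warehouse=hdfs://doitedu01:8020/example_warehouse;

-- A hive-type catalog instead points at an HMS thrift URI,
-- as shown concretely in section 1.2:
-- SET iceberg.catalog.some_hive_cat.type=hive;
-- SET iceberg.catalog.some_hive_cat.uri=thrift://doitedu01:9083;
```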
What happens if we create a table directly without specifying any catalog type?
1.1 Creating a table without specifying a catalog
CREATE TABLE tb_no_cata_demo01(
id bigint,
name string
)
PARTITIONED BY(city string )
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler';
+--------------------+
| tab_name |
+--------------------+
| tb_no_cata_demo01 |
+--------------------+
Creating an Iceberg table relies on HiveIcebergStorageHandler. In this case the metadata is stored only in the local HMS (the one configured in hive-site.xml). The HMS metadata table table_params shows that the table's location uses the file storage path of that local HMS:
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| col_name | data_type | comment |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| # col_name | data_type | comment |
| id | bigint | from deserializer |
| name | string | from deserializer |
| city | string | from deserializer |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | default | NULL |
| OwnerType: | USER | NULL |
| Owner: | root | NULL |
| CreateTime: | Tue Mar 12 21:19:08 CST 2024 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Retention: | 0 | NULL |
| Location: | hdfs://doitedu01:8020/user/hive/warehouse/tb_no_cata_demo01 | NULL |
| Table Type: | MANAGED_TABLE | NULL |
| Table Parameters: | NULL | NULL |
| | bucketing_version | 2 |
| | current-schema | {\"type\":\"struct\",\"schema-id\":0,\"fields\":[{\"id\":1,\"name\":\"id\",\"required\":false,\"type\":\"long\"},{\"id\":2,\"name\":\"name\",\"required\":false,\"type\":\"string\"},{\"id\":3,\"name\":\"city\",\"required\":false,\"type\":\"string\"}]} |
| | default-partition-spec | {\"spec-id\":0,\"fields\":[{\"name\":\"city\",\"transform\":\"identity\",\"source-id\":3,\"field-id\":1000}]} |
| | engine.hive.enabled | true |
| | external.table.purge | TRUE |
| | metadata_location | hdfs://doitedu01:8020/user/hive/warehouse/tb_no_cata_demo01/metadata/00000-6626f3eb-c538-4099-b965-507bbffb0b60.metadata.json |
| | snapshot-count | 0 |
| | storage_handler | org.apache.iceberg.mr.hive.HiveIcebergStorageHandler |
| | table_type | ICEBERG |
| | transient_lastDdlTime | 1710249548 |
| | uuid | a65c4850-7028-4914-8ea9-08c15eb9ac58 |
| | write.parquet.compression-codec | zstd |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.iceberg.mr.hive.HiveIcebergSerDe | NULL |
| InputFormat: | org.apache.iceberg.mr.hive.HiveIcebergInputFormat | NULL |
| OutputFormat: | org.apache.iceberg.mr.hive.HiveIcebergOutputFormat | NULL |
| Compressed: | No | NULL |
| Num Buckets: | 0 | NULL |
| Bucket Columns: | [] | NULL |
| Sort Columns: | [] | NULL |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
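To verify the table actually works, one can write and read a row as with any Hive table. This is a minimal sketch, assuming the current Hive session supports Iceberg writes; the sample values are made up:

```sql
-- Sketch: write and read back through the default HMS-backed table
INSERT INTO tb_no_cata_demo01 VALUES (1, 'zhangsan', 'beijing');
SELECT id, name, city FROM tb_no_cata_demo01 WHERE city = 'beijing';
```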
1.2 Registering a hive catalog
SET iceberg.catalog.another_hive.type=hive;
SET iceberg.catalog.another_hive.uri=thrift://doitedu01:9083;
SET iceberg.catalog.another_hive.clients=10;
SET iceberg.catalog.another_hive.warehouse=hdfs://doitedu01:8020/iceber_warehouse;
drop table tb_hive_cata_demo01;
CREATE TABLE tb_hive_cata_demo01(id bigint, name string ) PARTITIONED BY( city string )
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
location 'hdfs://doitedu01:8020/iceber_warehouse/default/tb_hive_cata_demo01'
TBLPROPERTIES ('iceberg.catalog'='another_hive');
View the table information:
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| col_name | data_type | comment |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| # col_name | data_type | comment |
| id | bigint | from deserializer |
| name | string | from deserializer |
| city | string | from deserializer |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | default | NULL |
| OwnerType: | USER | NULL |
| Owner: | root | NULL |
| CreateTime: | Tue Mar 12 21:32:26 CST 2024 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Retention: | 0 | NULL |
| Location: | hdfs://doitedu01:8020/iceber_warehouse/default/tb_hive_cata_demo01 | NULL |
| Table Type: | MANAGED_TABLE | NULL |
| Table Parameters: | NULL | NULL |
| | bucketing_version | 2 |
| | current-schema | {\"type\":\"struct\",\"schema-id\":0,\"fields\":[{\"id\":1,\"name\":\"id\",\"required\":false,\"type\":\"long\"},{\"id\":2,\"name\":\"name\",\"required\":false,\"type\":\"string\"},{\"id\":3,\"name\":\"city\",\"required\":false,\"type\":\"string\"}]} |
| | default-partition-spec | {\"spec-id\":0,\"fields\":[{\"name\":\"city\",\"transform\":\"identity\",\"source-id\":3,\"field-id\":1000}]} |
| | engine.hive.enabled | true |
| | external.table.purge | TRUE |
| | iceberg.catalog | another_hive |
| | metadata_location | hdfs://doitedu01:8020/iceber_warehouse/default/tb_hive_cata_demo01/metadata/00000-802bd3f5-13b7-46c5-b28b-996c93fa9f8d.metadata.json |
| | snapshot-count | 0 |
| | storage_handler | org.apache.iceberg.mr.hive.HiveIcebergStorageHandler |
| | table_type | ICEBERG |
| | transient_lastDdlTime | 1710250346 |
| | uuid | ef5644ae-3823-460a-b4d4-f841088e0ecc |
| | write.parquet.compression-codec | zstd |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.iceberg.mr.hive.HiveIcebergSerDe | NULL |
| InputFormat: | org.apache.iceberg.mr.hive.HiveIcebergInputFormat | NULL |
| OutputFormat: | org.apache.iceberg.mr.hive.HiveIcebergOutputFormat | NULL |
| Compressed: | No | NULL |
| Num Buckets: | 0 | NULL |
| Bucket Columns: | [] | NULL |
| Sort Columns: | [] | NULL |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
Use SHOW CREATE TABLE to see more details:
0: jdbc:hive2://doitedu01:10000> show create table tb_hive_cata_demo01 ;
+----------------------------------------------------+
| createtab_stmt |
+----------------------------------------------------+
| CREATE TABLE `tb_hive_cata_demo01`( |
| `id` bigint COMMENT 'from deserializer', |
| `name` string COMMENT 'from deserializer', |
| `city` string COMMENT 'from deserializer') |
| ROW FORMAT SERDE |
| 'org.apache.iceberg.mr.hive.HiveIcebergSerDe' |
| STORED BY |
| 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' |
| |
| LOCATION |
| 'hdfs://doitedu01:8020/iceber_warehouse/default/tb_hive_cata_demo01' |
| TBLPROPERTIES ( |
| 'bucketing_version'='2', |
| 'current-schema'='{"type":"struct","schema-id":0,"fields":[{"id":1,"name":"id","required":false,"type":"long"},{"id":2,"name":"name","required":false,"type":"string"},{"id":3,"name":"city","required":false,"type":"string"}]}', |
| 'default-partition-spec'='{"spec-id":0,"fields":[{"name":"city","transform":"identity","source-id":3,"field-id":1000}]}', |
| 'engine.hive.enabled'='true', |
| 'external.table.purge'='TRUE', |
| 'iceberg.catalog'='another_hive', |
| 'metadata_location'='hdfs://doitedu01:8020/iceber_warehouse/default/tb_hive_cata_demo01/metadata/00000-802bd3f5-13b7-46c5-b28b-996c93fa9f8d.metadata.json', |
| 'snapshot-count'='0', |
| 'table_type'='ICEBERG', |
| 'transient_lastDdlTime'='1710250346', |
| 'uuid'='ef5644ae-3823-460a-b4d4-f841088e0ecc', |
| 'write.parquet.compression-codec'='zstd') |
+----------------------------------------------------+
This shows that the information returned by the Hive CLI comes from the local HMS, just as with an ordinary Hive table.
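After data is written, the table parameters above change accordingly. A sketch, with made-up sample values; the exact metadata file name after the commit will differ:

```sql
-- Sketch: commit a row, then re-inspect the table parameters
INSERT INTO tb_hive_cata_demo01 VALUES (1, 'lisi', 'shanghai');
-- After the commit, snapshot-count increases from 0 and
-- metadata_location points at a new 00001-*.metadata.json file
DESCRIBE FORMATTED tb_hive_cata_demo01;
```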
1.3 Loading an external Iceberg table
How can Hive load an Iceberg table that was created by Spark or Flink? Simply use an external table:
CREATE EXTERNAL TABLE tb_hive_cata_demo02 (id bigint, name string )
PARTITIONED BY( city string ) STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
location 'hdfs://doitedu01:8020/iceber_warehouse/default/tb_hive_cata_demo01'
TBLPROPERTIES ('iceberg.catalog'='another_hive');
This is how Hive can load tables stored in Iceberg.
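Once registered, the external table can be queried like any other Hive table; a minimal sketch:

```sql
-- Sketch: read the externally managed Iceberg table from Hive
SELECT * FROM tb_hive_cata_demo02 LIMIT 10;
-- Rows committed to the underlying Iceberg table by Spark or Flink
-- become visible to Hive as new snapshots land
```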