Lesson 62: Study Notes on Parquet Best Practices and Code Practice with Spark SQL

2024-01-09 19:08


Topics covered in this lesson:

1. Best practices for using Parquet with Spark SQL

2. Hands-on Parquet with Spark SQL

 

I. Best Practices for Using Parquet with Spark SQL

1. Historically, the industry's big-data analytics pipelines have generally followed one of two patterns:

a) Data Source -> HDFS -> MR/Hive/Spark (acting as the ETL layer) -> HDFS Parquet -> Spark SQL/Impala -> Result Service (the results may be stored in a database, or exposed as a data service via JDBC/ODBC)

b) Data Source -> real-time updates into HBase/DB -> export to Parquet -> Spark SQL/Impala -> Result Service (again, stored in a database or exposed via JDBC/ODBC)

The second pattern above can be replaced entirely by Kafka + Spark Streaming + Spark SQL (and within that pipeline it is strongly recommended to store the data as Parquet as well).

Real-time processing is needed in practically every scenario: face recognition, credit-card fraud detection and the like are all built on stream processing.

2. The preferred approach: Data Source -> Kafka -> Spark Streaming -> Parquet -> Spark SQL (Spark SQL can be combined with MLlib and GraphX) -> Parquet -> various other data-mining workloads.
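The Parquet hand-off in the middle of this pipeline (one job writing Parquet, Spark SQL reading it back) is only a few lines of code. The sketch below uses the same Spark 1.6 Java API as the example in section III; the class name, the JSON input path, the output directory and the query are made-up placeholders, and in a real streaming job the DataFrame would be built from each micro-batch RDD rather than from a static file:

package SparkSQLByJava;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SaveMode;

public class ParquetPipelineSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("ParquetPipelineSketch");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Stand-in for the output of one Spark Streaming batch (hypothetical input file).
        DataFrame batchDF = sqlContext.read().json("D:\\DT-IMF\\testdata\\events.json");

        // Persist the batch as Parquet so downstream Spark SQL / data-mining jobs can query it.
        // SaveMode.Overwrite lets the sketch be re-run against the same output directory.
        batchDF.write().mode(SaveMode.Overwrite).parquet("D:\\DT-IMF\\output\\events_parquet");

        // Downstream consumer: read the Parquet data back and analyze it with Spark SQL.
        DataFrame events = sqlContext.read().parquet("D:\\DT-IMF\\output\\events_parquet");
        events.registerTempTable("events");
        sqlContext.sql("select count(*) from events").show();

        sc.stop();
    }
}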

 

II. A Brief Introduction to Parquet

1. Parquet is a file format based on columnar storage. Columnar storage has the following core advantages:

a. It can skip data that does not satisfy the query conditions and read only the data that is actually needed, which reduces I/O (see the sketch after this list).

b. Compression encodings reduce the disk footprint. Because all values in a column share the same data type, more efficient encodings (such as Run Length Encoding and Delta Encoding) can be applied to save even more storage.

c. Reading only the required columns also enables vectorized processing, which yields better scan performance.
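As a small illustration of points a and c, the snippet below filters on one column and selects another, so the Parquet reader only needs to touch the name and favorite_color columns rather than the whole file. It is a sketch that plugs into the SQLContext set up in the full example of section III; the path and column names come from the users.parquet file used there:

// Column pruning: only the columns referenced in the query are read from the Parquet file;
// the favorite_numbers column is never touched.
DataFrame users = sqlContext.read().parquet("D:\\DT-IMF\\testdata\\users.parquet");
DataFrame names = users.filter(users.col("favorite_color").isNotNull()).select("name");
names.show();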

 

III. The following code reads the contents of a Parquet file and prints them:

package SparkSQLByJava;

import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

public class SparkSQLParquetOps {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("SparkSQLParquetOps");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Load the Parquet file as a DataFrame.
        DataFrame usersDF = sqlContext.read().parquet("D:\\DT-IMF\\testdata\\users.parquet");

        // Register it as a temporary table so it can be queried with SQL.
        usersDF.registerTempTable("users");

        // Run the analysis query (here simply select everything).
        DataFrame result = sqlContext.sql("select * from users");

        // Process the result: convert the DataFrame to an RDD<Row>, collect it to the driver and print each row.
        List<Row> listRow = result.javaRDD().collect();
        for (Row row : listRow) {
            System.out.println(row);
        }
    }
}
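For reference, the same read-and-query flow on the Spark 2.x API would use SparkSession and Dataset<Row> instead of SQLContext and DataFrame. This is an untested sketch for comparison only; the class name is made up, and the 1.6 program above is what produced the console output that follows:

package SparkSQLByJava;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSQLParquetOps2x {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local")
                .appName("SparkSQLParquetOps2x")
                .getOrCreate();

        // In Spark 2.x a DataFrame is simply Dataset<Row>.
        Dataset<Row> usersDF = spark.read().parquet("D:\\DT-IMF\\testdata\\users.parquet");

        // registerTempTable was replaced by createOrReplaceTempView.
        usersDF.createOrReplaceTempView("users");

        spark.sql("select * from users").show();
        spark.stop();
    }
}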

 

Console output of the Spark 1.6 program when run in Eclipse:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties

16/04/02 09:17:56 INFO SparkContext: Running Spark version 1.6.0

16/04/02 09:18:07 INFO SecurityManager: Changing view acls to: think

16/04/02 09:18:07 INFO SecurityManager: Changing modify acls to: think

16/04/02 09:18:07 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(think); users with modify permissions: Set(think)

16/04/02 09:18:09 INFO Utils: Successfully started service 'sparkDriver' on port 60088.

16/04/02 09:18:11 INFO Slf4jLogger: Slf4jLogger started

16/04/02 09:18:11 INFO Remoting: Starting remoting

16/04/02 09:18:11 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.56.1:60101]

16/04/02 09:18:11 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 60101.

16/04/02 09:18:11 INFO SparkEnv: Registering MapOutputTracker

16/04/02 09:18:12 INFO SparkEnv: Registering BlockManagerMaster

16/04/02 09:18:12 INFO DiskBlockManager: Created local directory at C:\Users\think\AppData\Local\Temp\blockmgr-c045274d-ef94-471d-819a-93e044022e60

16/04/02 09:18:12 INFO MemoryStore: MemoryStore started with capacity 1773.8 MB

16/04/02 09:18:12 INFO SparkEnv: Registering OutputCommitCoordinator

16/04/02 09:18:13 INFO Utils: Successfully started service 'SparkUI' on port 4040.

16/04/02 09:18:13 INFO SparkUI: Started SparkUI at http://192.168.56.1:4040

16/04/02 09:18:13 INFO Executor: Starting executor ID driver on host localhost

16/04/02 09:18:13 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 60108.

16/04/02 09:18:13 INFO NettyBlockTransferService: Server created on 60108

16/04/02 09:18:13 INFO BlockManagerMaster: Trying to register BlockManager

16/04/02 09:18:13 INFO BlockManagerMasterEndpoint: Registering block manager localhost:60108 with 1773.8 MB RAM, BlockManagerId(driver, localhost, 60108)

16/04/02 09:18:13 INFO BlockManagerMaster: Registered BlockManager

16/04/02 09:18:17 WARN : Your hostname, think-PC resolves to a loopback/non-reachable address: fe80:0:0:0:d401:a5b5:2103:6d13%eth8, but we couldn't find any external IP address!

16/04/02 09:18:18 INFO ParquetRelation: Listing file:/D:/DT-IMF/testdata/users.parquet on driver

16/04/02 09:18:20 INFO SparkContext: Starting job: parquet at SparkSQLParquetOps.java:16

16/04/02 09:18:20 INFO DAGScheduler: Got job 0 (parquet at SparkSQLParquetOps.java:16) with 1 output partitions

16/04/02 09:18:20 INFO DAGScheduler: Final stage: ResultStage 0 (parquet at SparkSQLParquetOps.java:16)

16/04/02 09:18:20 INFO DAGScheduler: Parents of final stage: List()

16/04/02 09:18:20 INFO DAGScheduler: Missing parents: List()

16/04/02 09:18:20 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at parquet at SparkSQLParquetOps.java:16), which has no missing parents

16/04/02 09:18:20 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 61.5 KB, free 61.5 KB)

16/04/02 09:18:20 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 20.6 KB, free 82.1 KB)

16/04/02 09:18:20 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:60108 (size: 20.6 KB, free: 1773.7 MB)

16/04/02 09:18:20 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006

16/04/02 09:18:20 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at parquet at SparkSQLParquetOps.java:16)

16/04/02 09:18:20 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks

16/04/02 09:18:21 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2180 bytes)

16/04/02 09:18:21 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)

16/04/02 09:18:21 INFO ParquetFileReader: Initiating action with parallelism: 5

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".

SLF4J: Defaulting to no-operation (NOP) logger implementation

SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

16/04/02 09:18:24 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1842 bytes result sent to driver

16/04/02 09:18:24 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 3301 ms on localhost (1/1)

16/04/02 09:18:24 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool

16/04/02 09:18:24 INFO DAGScheduler: ResultStage 0 (parquet at SparkSQLParquetOps.java:16) finished in 3.408 s

16/04/02 09:18:24 INFO DAGScheduler: Job 0 finished: parquet at SparkSQLParquetOps.java:16, took 4.121836 s

16/04/02 09:18:26 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 61.8 KB, free 143.9 KB)

16/04/02 09:18:26 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 19.3 KB, free 163.2 KB)

16/04/02 09:18:26 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:60108 (size: 19.3 KB, free: 1773.7 MB)

16/04/02 09:18:26 INFO SparkContext: Created broadcast 1 from javaRDD at SparkSQLParquetOps.java:23

16/04/02 09:18:28 INFO deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize

16/04/02 09:18:28 INFO ParquetRelation: Reading Parquet file(s) from file:/D:/DT-IMF/testdata/users.parquet

16/04/02 09:18:28 INFO SparkContext: Starting job: collect at SparkSQLParquetOps.java:23

16/04/02 09:18:28 INFO DAGScheduler: Got job 1 (collect at SparkSQLParquetOps.java:23) with 1 output partitions

16/04/02 09:18:28 INFO DAGScheduler: Final stage: ResultStage 1 (collect at SparkSQLParquetOps.java:23)

16/04/02 09:18:28 INFO DAGScheduler: Parents of final stage: List()

16/04/02 09:18:28 INFO DAGScheduler: Missing parents: List()

16/04/02 09:18:28 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[3] at javaRDD at SparkSQLParquetOps.java:23), which has no missing parents

16/04/02 09:18:28 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 4.6 KB, free 167.8 KB)

16/04/02 09:18:28 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.6 KB, free 170.4 KB)

16/04/02 09:18:28 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:60108 (size: 2.6 KB, free: 1773.7 MB)

16/04/02 09:18:28 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006

16/04/02 09:18:28 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[3] at javaRDD at SparkSQLParquetOps.java:23)

16/04/02 09:18:28 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks

16/04/02 09:18:28 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, partition 0,PROCESS_LOCAL, 2179 bytes)

16/04/02 09:18:28 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)

16/04/02 09:18:28 INFO ParquetRelation$$anonfun$buildInternalScan$1$$anon$1: Input split: ParquetInputSplit{part: file:/D:/DT-IMF/testdata/users.parquet start: 0 end: 615 length: 615 hosts: []}

16/04/02 09:18:28 WARN ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl

16/04/02 09:18:28 INFO CatalystReadSupport: Going to read the following fields from the Parquet file:

 

Parquet form:
message spark_schema {
  required binary name (UTF8);
  optional binary favorite_color (UTF8);
  required group favorite_numbers (LIST) {
    repeated int32 array;
  }
}

Catalyst form:
StructType(StructField(name,StringType,false), StructField(favorite_color,StringType,true), StructField(favorite_numbers,ArrayType(IntegerType,false),false))

       

16/04/02 09:18:29 INFO BlockManagerInfo: Removed broadcast_0_piece0 on localhost:60108 in memory (size: 20.6 KB, free: 1773.7 MB)

16/04/02 09:18:29 INFO ContextCleaner: Cleaned accumulator 1

16/04/02 09:18:29 INFO GenerateUnsafeProjection: Code generated in 422.989887 ms

16/04/02 09:18:29 INFO InternalParquetRecordReader: RecordReader initialized will read a total of 2 records.

16/04/02 09:18:29 INFO InternalParquetRecordReader: at row 0. reading next block

16/04/02 09:18:29 INFO CodecPool: Got brand-new decompressor [.snappy]

16/04/02 09:18:29 INFO InternalParquetRecordReader: block read in memory in 54 ms. row count = 2

16/04/02 09:18:30 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 3532 bytes result sent to driver

16/04/02 09:18:30 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 1796 ms on localhost (1/1)

16/04/02 09:18:30 INFO DAGScheduler: ResultStage 1 (collect at SparkSQLParquetOps.java:23) finished in 1.798 s

16/04/02 09:18:30 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool

16/04/02 09:18:30 INFO DAGScheduler: Job 1 finished: collect at SparkSQLParquetOps.java:23, took 1.863220 s

[Alyssa,null,WrappedArray(3, 9, 15, 20)]

[Ben,red,WrappedArray()]

16/04/02 09:18:30 INFO SparkContext: Invoking stop() from shutdown hook

16/04/02 09:18:30 INFO SparkUI: Stopped Spark web UI at http://192.168.56.1:4040

16/04/02 09:18:30 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!

16/04/02 09:18:30 INFO MemoryStore: MemoryStore cleared

16/04/02 09:18:30 INFO BlockManager: BlockManager stopped

16/04/02 09:18:30 INFO BlockManagerMaster: BlockManagerMaster stopped

16/04/02 09:18:30 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!

16/04/02 09:18:30 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.

16/04/02 09:18:30 INFO SparkContext: Successfully stopped SparkContext

16/04/02 09:18:30 INFO ShutdownHookManager: Shutdown hook called

16/04/02 09:18:30 INFO ShutdownHookManager: Deleting directory C:\Users\think\AppData\Local\Temp\spark-46e1adfd-4a69-42a8-9b91-24fb8dd8da16

16/04/02 09:18:30 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
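The "Parquet form" / "Catalyst form" schema printed by CatalystReadSupport in the log above can also be obtained directly in code. A small sketch, reusing the sqlContext and file path from the program above:

// Print the schema Spark SQL derives from the Parquet footer metadata (no job over the data is needed).
DataFrame usersDF = sqlContext.read().parquet("D:\\DT-IMF\\testdata\\users.parquet");
usersDF.printSchema();                 // tree rendering of the Catalyst schema
StructType schema = usersDF.schema();  // org.apache.spark.sql.types.StructType, for programmatic access to field names/types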

 


The above are my study notes from Lesson 62 of the "IMF Legendary Action" (《IMF传奇行动》) course at Mr. Wang Jialin's DT Big Data Dream Factory (DT大数据梦工厂).

Mr. Wang Jialin (王家林) is a China evangelist for Spark, Flink, Docker and Android technologies, president and chief expert of the Spark Asia-Pacific Research Institute, founder of DT Big Data Dream Factory, a source-code-level expert on Android software/hardware integration, an English pronunciation magician, and a fitness enthusiast.

WeChat public account: DT_Spark

Email: 18610086859@126.com

Phone: 18610086859

QQ: 1740415547

WeChat: 18610086859

Sina Weibo: ilovepains



 

Appendix:

Excerpt from the official Apache Parquet documentation (http://parquet.apache.org/documentation/latest/):

Apache Parquet (http://parquet.apache.org)

Motivation

We created Parquet to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem.

Parquet is built from the ground up with complex nested data structures in mind, and uses the record shredding and assembly algorithm (https://github.com/Parquet/parquet-mr/wiki/The-striping-and-assembly-algorithms-from-the-Dremel-paper) described in the Dremel paper. We believe this approach is superior to simple flattening of nested name spaces.

Parquet is built to support very efficient compression and encoding schemes. Multiple projects have demonstrated the performance impact of applying the right compression and encoding scheme to the data. Parquet allows compression schemes to be specified on a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented.

Parquet is built to be used by anyone. The Hadoop ecosystem is rich with data processing frameworks, and we are not interested in playing favorites. We believe that an efficient, well-implemented columnar storage substrate should be useful to all frameworks without the cost of extensive and difficult to set up dependencies.

Modules

The parquet-format project contains format specifications and Thrift definitions of metadata required to properly read Parquet files.

The parquet-mr project contains multiple sub-modules, which implement the core components of reading and writing a nested, column-oriented data stream, map this core onto the parquet format, and provide Hadoop Input/Output Formats, Pig loaders, and other Java-based utilities for interacting with Parquet.

The parquet-compatibility project contains compatibility tests that can be used to verify that implementations in different languages can read and write each other's files.

Building

Java resources can be built using mvn package. The current stable version should always be available from Maven Central.

C++ thrift resources can be generated via make.

Thrift can also be code-genned into any other thrift-supported language.

Releasing

See How to Release (../how-to-release/).

Glossary


Block (hdfs block): This means a block in hdfs and the meaning is unchanged for describing this file format. The file format is designed to work well on top of hdfs.

File: A hdfs file that must include the metadata for the file. It does not need to actually contain the data.

Row group: A logical horizontal partitioning of the data into rows. There is no physical structure that is guaranteed for a row group. A row group consists of a column chunk for each column in the dataset.

Column chunk: A chunk of the data for a particular column. These live in a particular row group and are guaranteed to be contiguous in the file.

Page: Column chunks are divided up into pages. A page is conceptually an indivisible unit (in terms of compression and encoding). There can be multiple page types which are interleaved in a column chunk.

Hierarchically, a file consists of one or more row groups. A row group contains exactly one column chunk per column. Column chunks contain one or more pages.

Unit of parallelization

MapReduce - File/Row Group
IO - Column chunk
Encoding/Compression - Page

File format

This file and the thrift definition should be read together to understand the format.

4-byte magic number "PAR1"
<Column 1 Chunk 1 + Column Metadata>
<Column 2 Chunk 1 + Column Metadata>
...
<Column N Chunk 1 + Column Metadata>
<Column 1 Chunk 2 + Column Metadata>
<Column 2 Chunk 2 + Column Metadata>
...
<Column N Chunk 2 + Column Metadata>
...
<Column 1 Chunk M + Column Metadata>
<Column 2 Chunk M + Column Metadata>
...
<Column N Chunk M + Column Metadata>
File Metadata
4-byte length in bytes of file metadata
4-byte magic number "PAR1"

In the above example, there are N columns in this table, split into M row groups. The file metadata contains the locations of all the column metadata start locations. More details on what is contained in the metadata can be found in the thrift files.

Metadata is written after the data to allow for single pass writing.


Readers are expected to first read the file metadata to find all the column chunks they are interested in. The column chunks should then be read sequentially.
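As a small illustration of the layout described above (data first, then the Thrift file metadata, then a 4-byte footer length and the "PAR1" magic), the sketch below peeks at the last 8 bytes of a Parquet file. It is for illustration only; real applications should rely on parquet-mr rather than hand-parsing the footer. The class name is made up and the path is the users.parquet file used earlier in these notes:

import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class ParquetFooterPeek {
    public static void main(String[] args) throws Exception {
        String path = "D:\\DT-IMF\\testdata\\users.parquet";
        try (RandomAccessFile f = new RandomAccessFile(path, "r")) {
            long len = f.length();

            // The file ends with: [file metadata][4-byte little-endian metadata length]["PAR1"].
            f.seek(len - 8);
            byte[] tail = new byte[8];
            f.readFully(tail);

            int footerLength = (tail[0] & 0xFF)
                    | ((tail[1] & 0xFF) << 8)
                    | ((tail[2] & 0xFF) << 16)
                    | ((tail[3] & 0xFF) << 24);
            String magic = new String(tail, 4, 4, StandardCharsets.US_ASCII);

            System.out.println("magic = " + magic + ", footer length = " + footerLength + " bytes");
            // The Thrift-encoded FileMetaData starts at offset (len - 8 - footerLength);
            // decoding it requires the parquet-format Thrift definitions.
        }
    }
}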

Metadata

There are three types of metadata: file metadata, column (chunk) metadata and page header metadata. All thrift structures are serialized using the TCompactProtocol.


Types

The types supported by the file format are intended to be as minimal as possible, with a focus on how the types affect on-disk storage. For example, 16-bit ints are not explicitly supported in the storage format since they are covered by 32-bit ints with an efficient encoding. This reduces the complexity of implementing readers and writers for the format. The types are:

- BOOLEAN: 1 bit boolean
- INT32: 32 bit signed ints
- INT64: 64 bit signed ints
- INT96: 96 bit signed ints
- FLOAT: IEEE 32-bit floating point values
- DOUBLE: IEEE 64-bit floating point values
- BYTE_ARRAY: arbitrarily long byte arrays


Logical Types

Logical types are used to extend the types that parquet can be used to store, by specifying how the primitive types should be interpreted. This keeps the set of primitive types to a minimum and reuses parquet's efficient encodings. For example, strings are stored as byte arrays (binary) with a UTF8 annotation. These annotations define how to further decode and interpret the data. Annotations are stored as a ConvertedType in the file metadata and are documented in LogicalTypes.md (https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md).

Nested Encoding

To encode nested columns, Parquet uses the Dremel encoding with definition and repetition levels. Definition levels specify how many optional fields in the path for the column are defined. Repetition levels specify at which repeated field in the path the value is repeated. The max definition and repetition levels can be computed from the schema (i.e. how much nesting there is); this defines the maximum number of bits required to store the levels (levels are defined for all values in the column). For example, for the column b in the schema message m { optional group a { optional binary b; } }, the max definition level is 2 (two optional fields on its path) and the max repetition level is 0 (no repeated fields on its path).

Two encodings for the levels are supported: BITPACKED and RLE. Only RLE is now used, as it supersedes BITPACKED.

Nulls

Nullity is encoded in the definition levels (which is run-length encoded). NULL values are not encoded in the data. For example, in a non-nested schema, a column with 1000 NULLs would be encoded with run-length encoding (0, 1000 times) for the definition levels and nothing else.

Data Pages

For data pages, the 3 pieces of information are encoded back to back, after the page header:

- definition levels data,
- repetition levels data,
- encoded values.

The size specified in the header is for all 3 pieces combined.

The data for the data page is always required. The definition and repetition levels are optional, based on the schema definition. If the column is not nested (i.e. the path to the column has length 1), we do not encode the repetition levels (it would always have the value 1). For data that is required, the definition levels are skipped (if encoded, it will always have the value of the max definition level).

For example, in the case where the column is non-nested and required, the data in the page is only the encoded values.

The supported encodings are described in Encodings.md (https://github.com/Parquet/parquet-format/blob/master/Encodings.md).

Column chunks


Column chunks are composed of pages written back to back. The pages share a common header and readers can skip over pages they are not interested in. The data for the page follows the header and can be compressed and/or encoded. The compression and encoding is specified in the page metadata.

Checksumming

Data pages can be individually checksummed. This allows disabling of checksums at the HDFS file level, to better support single row lookups.

Error recovery

If the file metadata is corrupt, the file is lost. If the column metadata is corrupt, that column chunk is lost (but column chunks for this column in other row groups are okay). If a page header is corrupt, the remaining pages in that chunk are lost. If the data within a page is corrupt, that page is lost. The file will be more resilient to corruption with smaller row groups.

Potential extension: With smaller row groups, the biggest issue is placing the file metadata at the end. If an error happens while writing the file metadata, all the data written will be unreadable. This can be fixed by writing the file metadata every Nth row group. Each file metadata would be cumulative and include all the row groups written so far. Combining this with the strategy used for rc or avro files using sync markers, a reader could recover partially written files.

Separating metadata and column data

The format is explicitly designed to separate the metadata from the data. This allows splitting columns into multiple files, as well as having a single metadata file reference multiple parquet files.

Configurations

Row group size: Larger row groups allow for larger column chunks, which makes it possible to do larger sequential IO. Larger groups also require more buffering in the write path (or a two pass write). We recommend large row groups (512MB - 1GB). Since an entire row group might need to be read, we want it to completely fit on one HDFS block. Therefore, HDFS block sizes should also be set larger. An optimized read setup would be: 1GB row groups, 1GB HDFS block size, 1 HDFS block per HDFS file.

Data page size: Data pages should be considered indivisible, so smaller data pages allow for more fine grained reading (e.g. single row lookup). Larger page sizes incur less space overhead (fewer page headers) and potentially less parsing overhead (processing headers). Note: for sequential scans, it is not expected to read a page at a time; this is not the IO chunk. We recommend 8KB for page sizes.
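In a Spark job, the two knobs above correspond to the parquet-mr write properties parquet.block.size (row group size) and parquet.page.size, which Spark passes down through the Hadoop configuration. The snippet below is a hedged sketch that reuses the sc and sqlContext from the example in section III; the sizes and the output path are illustrative, and the property names should be checked against the parquet-mr version actually on the classpath:

// Sizes are in bytes; parquet-mr reads these settings when Spark SQL writes Parquet output.
sc.hadoopConfiguration().set("parquet.block.size", String.valueOf(256 * 1024 * 1024)); // 256 MB row groups (illustrative)
sc.hadoopConfiguration().set("parquet.page.size", String.valueOf(8 * 1024));           // 8 KB pages, as recommended above

DataFrame df = sqlContext.read().parquet("D:\\DT-IMF\\testdata\\users.parquet");
df.write().parquet("D:\\DT-IMF\\output\\users_tuned"); // hypothetical output path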

Extensibility


There are many places in the format for compatible extensions:

- File Version: The file metadata contains a version.
- Encodings: Encodings are specified by enum and more can be added in the future.
- Page types: Additional page types can be added and safely skipped.

Copyright 2014 Apache Software Foundation (http://www.apache.org/). Licensed under the Apache License v2.0 (http://www.apache.org/licenses/). Apache Parquet and the Apache feather logo are trademarks of The Apache Software Foundation.
