第64课:SparkSQL下Parquet的数据切分和压缩内幕详解学习笔记

2024-01-09 19:08

本文主要是介绍第64课:SparkSQL下Parquet的数据切分和压缩内幕详解学习笔记,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!

第64课:SparkSQLParquet的数据切分和压缩内幕详解学习笔记

本期内容:

1  SparkSQLParquet数据切分

2  SparkSQL下的Parquet数据压缩

 

Spark官网上的SparkSQL操作Parquet的实例进行讲解:

Schema Merging

Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge schemas of all these files.

 

Since schema merging is a relatively expensive operation, and is not a necessity in most cases, we turned it off by default starting from 1.5.0. You may enable it by

 

setting data source option mergeSchema to true when reading Parquet files (as shown in the examples below), or

setting the global SQL option spark.sql.parquet.mergeSchema to true.

// sqlContext from the previous example is used in this example.// This is used to implicitly convert an RDD to a DataFrame.

import sqlContext.implicits._

// Create a simple DataFrame, stored into a partition directory

val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")

df1.write.parquet("data/test_table/key=1")

// Create another DataFrame in a new partition directory,// adding a new column and dropping an existing column

val df2 = sc.makeRDD(6 to 10).map(i => (i, i * 3)).toDF("single", "triple")

df2.write.parquet("data/test_table/key=2")

// Read the partitioned table

val df3 = sqlContext.read.option("mergeSchema", "true").parquet("data/test_table")

df3.printSchema()

// The final schema consists of all 3 columns in the Parquet files together// with the partitioning column appeared in the partition directory paths.// root// |-- single: int (nullable = true)// |-- double: int (nullable = true)// |-- triple: int (nullable = true)// |-- key : int (nullable = true)

 

 

实际运行结果:

scala> val df1 = sc.makeRDD(1 to 5).map(i => (i,i * 2)).toDF("single","double")

df1: org.apache.spark.sql.DataFrame = [single: int, double: int]

 

scala> df1.write.parquet("data/text_table/key=1")

16/04/02 04:27:07 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id

16/04/02 04:27:07 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id

16/04/02 04:27:07 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id

16/04/02 04:27:07 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap

16/04/02 04:27:07 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition

16/04/02 04:27:07 INFO parquet.ParquetRelation: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter

16/04/02 04:27:07 INFO datasources.DefaultWriterContainer: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter

16/04/02 04:27:09 INFO spark.SparkContext: Starting job: parquet at <console>:33

16/04/02 04:27:09 INFO scheduler.DAGScheduler: Got job 0 (parquet at <console>:33) with 3 output partitions

16/04/02 04:27:09 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (parquet at <console>:33)

16/04/02 04:27:09 INFO scheduler.DAGScheduler: Parents of final stage: List()

16/04/02 04:27:09 INFO scheduler.DAGScheduler: Missing parents: List()

16/04/02 04:27:09 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[3] at parquet at <console>:33), which has no missing parents

16/04/02 04:27:12 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 68.0 KB, free 68.0 KB)

16/04/02 04:27:12 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 24.6 KB, free 92.5 KB)

16/04/02 04:27:12 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.121:56069 (size: 24.6 KB, free: 517.4 MB)

16/04/02 04:27:12 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006

16/04/02 04:27:12 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ResultStage 0 (MapPartitionsRDD[3] at parquet at <console>:33)

16/04/02 04:27:12 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 3 tasks

16/04/02 04:27:13 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, slq1, partition 0,PROCESS_LOCAL, 2078 bytes)

16/04/02 04:27:13 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, slq2, partition 1,PROCESS_LOCAL, 2078 bytes)

16/04/02 04:27:13 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, slq3, partition 2,PROCESS_LOCAL, 2135 bytes)

16/04/02 04:27:17 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on slq2:44836 (size: 24.6 KB, free: 517.4 MB)

16/04/02 04:27:17 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on slq3:53765 (size: 24.6 KB, free: 517.4 MB)

16/04/02 04:27:18 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on slq1:44043 (size: 24.6 KB, free: 517.4 MB)

16/04/02 04:28:13 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 60174 ms on slq3 (1/3)

16/04/02 04:28:16 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 62700 ms on slq2 (2/3)

16/04/02 04:28:27 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 74088 ms on slq1 (3/3)

16/04/02 04:28:27 INFO scheduler.DAGScheduler: ResultStage 0 (parquet at <console>:33) finished in 74.105 s

16/04/02 04:28:27 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool

16/04/02 04:28:27 INFO scheduler.DAGScheduler: Job 0 finished: parquet at <console>:33, took 78.540234 s

16/04/02 04:28:29 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".

SLF4J: Defaulting to no-operation (NOP) logger implementation

SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

16/04/02 04:28:35 INFO datasources.DefaultWriterContainer: Job job_201604020427_0000 committed.

16/04/02 04:28:36 INFO parquet.ParquetRelation: Listing hdfs://slq1:9000/user/richard/data/text_table/key=1 on driver

16/04/02 04:28:36 INFO parquet.ParquetRelation: Listing hdfs://slq1:9000/user/richard/data/text_table/key=1 on driver

 

scala> 16/04/02 04:39:10 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on slq2:44836 in memory (size: 24.6 KB, free: 517.4 MB)

16/04/02 04:39:10 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on 192.168.1.121:56069 in memory (size: 24.6 KB, free: 517.4 MB)

16/04/02 04:39:11 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on slq3:53765 in memory (size: 24.6 KB, free: 517.4 MB)

16/04/02 04:39:11 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on slq1:44043 in memory (size: 24.6 KB, free: 517.4 MB)

16/04/02 04:39:11 INFO spark.ContextCleaner: Cleaned accumulator 3

16/04/02 04:39:11 INFO spark.ContextCleaner: Cleaned accumulator 2

 

 

scala> val df2 = sc.makeRDD(6 to 10).map(i => (i,i * 3)).toDF("single","triple")

df2: org.apache.spark.sql.DataFrame = [single: int, triple: int]

 

scala> df2.write.parquet("data/text_table/key=2")

16/04/02 04:56:13 INFO parquet.ParquetRelation: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter

16/04/02 04:56:13 INFO datasources.DefaultWriterContainer: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter

16/04/02 04:56:14 INFO spark.SparkContext: Starting job: parquet at <console>:33

16/04/02 04:56:14 INFO scheduler.DAGScheduler: Got job 1 (parquet at <console>:33) with 3 output partitions

16/04/02 04:56:14 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (parquet at <console>:33)

16/04/02 04:56:14 INFO scheduler.DAGScheduler: Parents of final stage: List()

16/04/02 04:56:14 INFO scheduler.DAGScheduler: Missing parents: List()

16/04/02 04:56:14 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[14] at parquet at <console>:33), which has no missing parents

16/04/02 04:56:14 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 68.0 KB, free 68.0 KB)

16/04/02 04:56:14 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 24.6 KB, free 92.5 KB)

16/04/02 04:56:14 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.1.121:56069 (size: 24.6 KB, free: 517.4 MB)

16/04/02 04:56:14 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006

16/04/02 04:56:14 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ResultStage 1 (MapPartitionsRDD[14] at parquet at <console>:33)

16/04/02 04:56:14 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 3 tasks

16/04/02 04:56:14 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 3, slq1, partition 0,PROCESS_LOCAL, 2078 bytes)

16/04/02 04:56:14 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 1.0 (TID 4, slq2, partition 1,PROCESS_LOCAL, 2078 bytes)

16/04/02 04:56:14 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 1.0 (TID 5, slq3, partition 2,PROCESS_LOCAL, 2135 bytes)

16/04/02 04:56:15 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on slq3:53765 (size: 24.6 KB, free: 517.4 MB)

16/04/02 04:56:15 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on slq2:44836 (size: 24.6 KB, free: 517.4 MB)

16/04/02 04:56:15 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on slq1:44043 (size: 24.6 KB, free: 517.4 MB)

16/04/02 04:56:16 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 1.0 (TID 4) in 1472 ms on slq2 (1/3)

16/04/02 04:56:16 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 1.0 (TID 5) in 1486 ms on slq3 (2/3)

16/04/02 04:56:16 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 3) in 2093 ms on slq1 (3/3)

16/04/02 04:56:16 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool

16/04/02 04:56:16 INFO scheduler.DAGScheduler: ResultStage 1 (parquet at <console>:33) finished in 2.095 s

16/04/02 04:56:16 INFO scheduler.DAGScheduler: Job 1 finished: parquet at <console>:33, took 2.673089 s

16/04/02 04:56:17 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5

16/04/02 04:56:18 INFO datasources.DefaultWriterContainer: Job job_201604020456_0000 committed.

16/04/02 04:56:18 INFO parquet.ParquetRelation: Listing hdfs://slq1:9000/user/richard/data/text_table/key=2 on driver

16/04/02 04:56:18 INFO parquet.ParquetRelation: Listing hdfs://slq1:9000/user/richard/data/text_table/key=2 on driver

 

scala> val df3 = sqlContext.read.option("mergeSchema","true").parquet("data/text_table")

16/04/02 05:00:59 INFO parquet.ParquetRelation: Listing hdfs://slq1:9000/user/richard/data/text_table on driver

16/04/02 05:00:59 INFO parquet.ParquetRelation: Listing hdfs://slq1:9000/user/richard/data/text_table/key=1 on driver

16/04/02 05:00:59 INFO parquet.ParquetRelation: Listing hdfs://slq1:9000/user/richard/data/text_table/key=2 on driver

16/04/02 05:01:00 INFO spark.SparkContext: Starting job: parquet at <console>:28

16/04/02 05:01:00 INFO scheduler.DAGScheduler: Got job 2 (parquet at <console>:28) with 3 output partitions

16/04/02 05:01:00 INFO scheduler.DAGScheduler: Final stage: ResultStage 2 (parquet at <console>:28)

16/04/02 05:01:00 INFO scheduler.DAGScheduler: Parents of final stage: List()

16/04/02 05:01:00 INFO scheduler.DAGScheduler: Missing parents: List()

16/04/02 05:01:00 INFO scheduler.DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[17] at parquet at <console>:28), which has no missing parents

16/04/02 05:01:00 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 61.7 KB, free 154.2 KB)

16/04/02 05:01:00 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 21.0 KB, free 175.2 KB)

16/04/02 05:01:00 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.1.121:56069 (size: 21.0 KB, free: 517.4 MB)

16/04/02 05:01:00 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006

16/04/02 05:01:00 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ResultStage 2 (MapPartitionsRDD[17] at parquet at <console>:28)

16/04/02 05:01:00 INFO scheduler.TaskSchedulerImpl: Adding task set 2.0 with 3 tasks

16/04/02 05:01:00 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 2.0 (TID 6, slq3, partition 0,PROCESS_LOCAL, 2524 bytes)

16/04/02 05:01:00 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 2.0 (TID 7, slq1, partition 1,PROCESS_LOCAL, 2524 bytes)

16/04/02 05:01:00 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 2.0 (TID 8, slq2, partition 2,PROCESS_LOCAL, 2469 bytes)

16/04/02 05:01:00 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on slq2:44836 (size: 21.0 KB, free: 517.4 MB)

16/04/02 05:01:00 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on slq3:53765 (size: 21.0 KB, free: 517.4 MB)

16/04/02 05:01:01 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on slq1:44043 (size: 21.0 KB, free: 517.4 MB)

16/04/02 05:01:02 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 2.0 (TID 6) in 1697 ms on slq3 (1/3)

16/04/02 05:01:02 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 2.0 (TID 8) in 2189 ms on slq2 (2/3)

16/04/02 05:01:05 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 2.0 (TID 7) in 4740 ms on slq1 (3/3)

16/04/02 05:01:05 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool

16/04/02 05:01:05 INFO scheduler.DAGScheduler: ResultStage 2 (parquet at <console>:28) finished in 4.804 s

16/04/02 05:01:05 INFO scheduler.DAGScheduler: Job 2 finished: parquet at <console>:28, took 5.169726 s

df3: org.apache.spark.sql.DataFrame = [single: int, double: int, triple: int, key: int]

 

scala> df3.printSchema()

root

 |-- single: integer (nullable = true)

 |-- double: integer (nullable = true)

 |-- triple: integer (nullable = true)

 |-- key: integer (nullable = true)

 

 

scala> df3.show()

16/04/02 05:03:35 INFO datasources.DataSourceStrategy: Selected 2 partitions out of 2, pruned 0.0% partitions.

16/04/02 05:03:36 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 62.4 KB, free 237.6 KB)

16/04/02 05:03:36 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 19.7 KB, free 257.3 KB)

16/04/02 05:03:36 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on 192.168.1.121:56069 (size: 19.7 KB, free: 517.3 MB)

16/04/02 05:03:36 INFO spark.SparkContext: Created broadcast 3 from show at <console>:31

16/04/02 05:03:38 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize

16/04/02 05:03:38 INFO parquet.ParquetRelation: Reading Parquet file(s) from hdfs://slq1:9000/user/richard/data/text_table/key=2/part-r-00000-2f220b3f-43a1-4093-ad51-1d3af7707ca8.gz.parquet, hdfs://slq1:9000/user/richard/data/text_table/key=2/part-r-00001-2f220b3f-43a1-4093-ad51-1d3af7707ca8.gz.parquet, hdfs://slq1:9000/user/richard/data/text_table/key=2/part-r-00002-2f220b3f-43a1-4093-ad51-1d3af7707ca8.gz.parquet

16/04/02 05:03:38 INFO parquet.ParquetRelation: Reading Parquet file(s) from hdfs://slq1:9000/user/richard/data/text_table/key=1/part-r-00000-f6a15341-401e-41b0-8f8a-acbf97ce42fb.gz.parquet, hdfs://slq1:9000/user/richard/data/text_table/key=1/part-r-00001-f6a15341-401e-41b0-8f8a-acbf97ce42fb.gz.parquet, hdfs://slq1:9000/user/richard/data/text_table/key=1/part-r-00002-f6a15341-401e-41b0-8f8a-acbf97ce42fb.gz.parquet

16/04/02 05:03:38 INFO spark.SparkContext: Starting job: show at <console>:31

16/04/02 05:03:38 INFO scheduler.DAGScheduler: Got job 3 (show at <console>:31) with 1 output partitions

16/04/02 05:03:38 INFO scheduler.DAGScheduler: Final stage: ResultStage 3 (show at <console>:31)

16/04/02 05:03:38 INFO scheduler.DAGScheduler: Parents of final stage: List()

16/04/02 05:03:38 INFO scheduler.DAGScheduler: Missing parents: List()

16/04/02 05:03:38 INFO scheduler.DAGScheduler: Submitting ResultStage 3 (MapPartitionsRDD[24] at show at <console>:31), which has no missing parents

16/04/02 05:03:38 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 7.1 KB, free 264.4 KB)

16/04/02 05:03:38 INFO storage.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 3.9 KB, free 268.4 KB)

16/04/02 05:03:38 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on 192.168.1.121:56069 (size: 3.9 KB, free: 517.3 MB)

16/04/02 05:03:38 INFO spark.SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1006

16/04/02 05:03:38 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 3 (MapPartitionsRDD[24] at show at <console>:31)

16/04/02 05:03:38 INFO scheduler.TaskSchedulerImpl: Adding task set 3.0 with 1 tasks

16/04/02 05:03:39 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 3.0 (TID 9, slq1, partition 0,NODE_LOCAL, 2353 bytes)

16/04/02 05:03:39 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on slq1:44043 (size: 3.9 KB, free: 517.4 MB)

16/04/02 05:03:39 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on slq1:44043 (size: 19.7 KB, free: 517.3 MB)

16/04/02 05:03:44 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 3.0 (TID 9) in 5898 ms on slq1 (1/1)

16/04/02 05:03:44 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool

16/04/02 05:03:44 INFO scheduler.DAGScheduler: ResultStage 3 (show at <console>:31) finished in 5.901 s

16/04/02 05:03:44 INFO scheduler.DAGScheduler: Job 3 finished: show at <console>:31, took 6.358506 s

16/04/02 05:03:44 INFO spark.SparkContext: Starting job: show at <console>:31

16/04/02 05:03:44 INFO scheduler.DAGScheduler: Got job 4 (show at <console>:31) with 5 output partitions

16/04/02 05:03:44 INFO scheduler.DAGScheduler: Final stage: ResultStage 4 (show at <console>:31)

16/04/02 05:03:44 INFO scheduler.DAGScheduler: Parents of final stage: List()

16/04/02 05:03:44 INFO scheduler.DAGScheduler: Missing parents: List()

16/04/02 05:03:44 INFO scheduler.DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[24] at show at <console>:31), which has no missing parents

16/04/02 05:03:45 INFO storage.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 7.1 KB, free 275.4 KB)

16/04/02 05:03:45 INFO storage.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 3.9 KB, free 279.4 KB)

16/04/02 05:03:45 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on 192.168.1.121:56069 (size: 3.9 KB, free: 517.3 MB)

16/04/02 05:03:45 INFO spark.SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1006

16/04/02 05:03:45 INFO scheduler.DAGScheduler: Submitting 5 missing tasks from ResultStage 4 (MapPartitionsRDD[24] at show at <console>:31)

16/04/02 05:03:45 INFO scheduler.TaskSchedulerImpl: Adding task set 4.0 with 5 tasks

16/04/02 05:03:45 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 4.0 (TID 10, slq3, partition 1,NODE_LOCAL, 2354 bytes)

16/04/02 05:03:45 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 4.0 (TID 11, slq1, partition 2,NODE_LOCAL, 2354 bytes)

16/04/02 05:03:45 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 4.0 (TID 12, slq2, partition 3,NODE_LOCAL, 2353 bytes)

16/04/02 05:03:45 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on slq1:44043 (size: 3.9 KB, free: 517.3 MB)

16/04/02 05:03:45 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on slq2:44836 (size: 3.9 KB, free: 517.4 MB)

16/04/02 05:03:45 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on slq3:53765 (size: 3.9 KB, free: 517.4 MB)

16/04/02 05:03:45 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on slq3:53765 (size: 19.7 KB, free: 517.3 MB)

16/04/02 05:03:46 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 4.0 (TID 13, slq1, partition 4,NODE_LOCAL, 2354 bytes)

16/04/02 05:03:46 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 4.0 (TID 11) in 1205 ms on slq1 (1/5)

16/04/02 05:03:47 INFO scheduler.TaskSetManager: Starting task 4.0 in stage 4.0 (TID 14, slq1, partition 5,NODE_LOCAL, 2354 bytes)

16/04/02 05:03:47 INFO scheduler.TaskSetManager: Finished task 3.0 in stage 4.0 (TID 13) in 703 ms on slq1 (2/5)

16/04/02 05:03:47 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on slq2:44836 (size: 19.7 KB, free: 517.3 MB)

16/04/02 05:03:49 INFO scheduler.TaskSetManager: Finished task 4.0 in stage 4.0 (TID 14) in 2032 ms on slq1 (3/5)

16/04/02 05:03:52 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 4.0 (TID 10) in 7654 ms on slq3 (4/5)

16/04/02 05:03:54 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 4.0 (TID 12) in 9789 ms on slq2 (5/5)

16/04/02 05:03:54 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool

16/04/02 05:03:54 INFO scheduler.DAGScheduler: ResultStage 4 (show at <console>:31) finished in 9.805 s

16/04/02 05:03:54 INFO scheduler.DAGScheduler: Job 4 finished: show at <console>:31, took 9.980420 s

+------+------+------+---+

|single|double|triple|key|

+------+------+------+---+

|     6|  null|    18|  2|

|     7|  null|    21|  2|

|     8|  null|    24|  2|

|     9|  null|    27|  2|

|    10|  null|    30|  2|

|     1|     2|  null|  1|

|     2|     4|  null|  1|

|     3|     6|  null|  1|

|     4|     8|  null|  1|

|     5|    10|  null|  1|

+------+------+------+---+

 

16/04/02 05:09:12 INFO storage.BlockManagerInfo: Removed broadcast_5_piece0 on 192.168.1.121:56069 in memory (size: 3.9 KB, free: 517.3 MB)

16/04/02 05:09:12 INFO storage.BlockManagerInfo: Removed broadcast_5_piece0 on slq1:44043 in memory (size: 3.9 KB, free: 517.3 MB)

16/04/02 05:09:12 INFO storage.BlockManagerInfo: Removed broadcast_5_piece0 on slq3:53765 in memory (size: 3.9 KB, free: 517.3 MB)

16/04/02 05:09:12 INFO storage.BlockManagerInfo: Removed broadcast_5_piece0 on slq2:44836 in memory (size: 3.9 KB, free: 517.3 MB)

16/04/02 05:09:12 INFO spark.ContextCleaner: Cleaned accumulator 8

16/04/02 05:09:12 INFO storage.BlockManagerInfo: Removed broadcast_4_piece0 on 192.168.1.121:56069 in memory (size: 3.9 KB, free: 517.3 MB)

16/04/02 05:09:12 INFO storage.BlockManagerInfo: Removed broadcast_4_piece0 on slq1:44043 in memory (size: 3.9 KB, free: 517.3 MB)

16/04/02 05:09:12 INFO spark.ContextCleaner: Cleaned accumulator 7

16/04/02 05:09:12 INFO storage.BlockManagerInfo: Removed broadcast_3_piece0 on 192.168.1.121:56069 in memory (size: 19.7 KB, free: 517.4 MB)

16/04/02 05:09:12 INFO storage.BlockManagerInfo: Removed broadcast_3_piece0 on slq3:53765 in memory (size: 19.7 KB, free: 517.4 MB)

16/04/02 05:09:12 INFO storage.BlockManagerInfo: Removed broadcast_3_piece0 on slq2:44836 in memory (size: 19.7 KB, free: 517.4 MB)

16/04/02 05:09:12 INFO storage.BlockManagerInfo: Removed broadcast_3_piece0 on slq1:44043 in memory (size: 19.7 KB, free: 517.4 MB)

16/04/02 05:09:12 INFO storage.BlockManagerInfo: Removed broadcast_2_piece0 on 192.168.1.121:56069 in memory (size: 21.0 KB, free: 517.4 MB)

16/04/02 05:09:12 INFO storage.BlockManagerInfo: Removed broadcast_2_piece0 on slq1:44043 in memory (size: 21.0 KB, free: 517.4 MB)

16/04/02 05:09:12 INFO storage.BlockManagerInfo: Removed broadcast_2_piece0 on slq3:53765 in memory (size: 21.0 KB, free: 517.4 MB)

16/04/02 05:09:12 INFO storage.BlockManagerInfo: Removed broadcast_2_piece0 on slq2:44836 in memory (size: 21.0 KB, free: 517.4 MB)

16/04/02 05:09:12 INFO spark.ContextCleaner: Cleaned accumulator 6

16/04/02 05:09:12 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on 192.168.1.121:56069 in memory (size: 24.6 KB, free: 517.4 MB)

16/04/02 05:09:12 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on slq3:53765 in memory (size: 24.6 KB, free: 517.4 MB)

16/04/02 05:09:12 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on slq2:44836 in memory (size: 24.6 KB, free: 517.4 MB)

16/04/02 05:09:12 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on slq1:44043 in memory (size: 24.6 KB, free: 517.4 MB)

16/04/02 05:09:13 INFO spark.ContextCleaner: Cleaned accumulator 5

16/04/02 05:09:13 INFO spark.ContextCleaner: Cleaned accumulator 4

 

 

 

实例中使用了df.write方法将DataFrame数据以parquet格式写入到HDFS上。

下面从源码的角度解读此实例:

DataFrame.scala类中,可以找到write方法:

/**
 * :: Experimental ::
 * Interface for saving the content of the
[[DataFrame]] out into external storage.
 *
 *
@group output
 *
@since 1.4.0
 */
@Experimental
def write: DataFrameWriter = new DataFrameWriter(this)

可以看出,DataFramewrite方法直接生成了一个DataFrameWriter实例。

DataFrameWriter类中可以找到parquet方法:

/**
 * Saves the content of the
[[DataFrame]] in Parquet format at the specified path.
 * This is equivalent to:
 *
{{{
 *   format("parquet").save(path)
 *
}}}
 *
 *
@since 1.4.0
 */
def parquet(path: String): Unit = format("parquet").save(path)

可以看出parquet方法只是format("parquet").save(path)方法的快捷方式。

format方法的源码如下:

/**
 * Specifies the underlying output data source. Built-in options include "parquet", "json", etc.
 *
 *
@since 1.4.0
 */
def format(source: String): DataFrameWriter = {
  this.source = source
  this
}

format方法只是返回“parquet”格式名称本身,然后进行save操作。

/**
 * Saves the content of the
[[DataFrame]] at the specified path.
 *
 *
@since 1.4.0
 */
def save(path: String): Unit = {
  this.extraOptions += ("path" -> path)
  save()
}

可以看出save操作中调用了extraOptions方法:

private var extraOptions = new scala.collection.mutable.HashMap[String, String]

可以看出extraOptions 是一个HashMap

save操作还调用了save()方法:

/**
 * Saves the content of the
[[DataFrame]] as the specified table.
 *
 *
@since 1.4.0
 */
def save(): Unit = {
  ResolvedDataSource(
    df.sqlContext,
    source,
    partitioningColumns.map(_.toArray).getOrElse(Array.empty[String]),
    mode,
    extraOptions.toMap,
    df)
}

save()方法主要就是调用ResolvedDataSource的apply方法:

/** Create a [[ResolvedDataSource]] for saving the content of the given DataFrame. */
  
def apply(
      sqlContext: SQLContext,  //对应save()方法中的df.sqlContext。
      provider: String,    //对应save()方法中的source,即“parquet”格式名称
      partitionColumns: Array[String],
      mode: SaveMode,
      options: Map[String, String],
      data: DataFrame): ResolvedDataSource = {
    if (data.schema.map(_.dataType).exists(_.isInstanceOf[CalendarIntervalType])) {
      throw new AnalysisException("Cannot save interval data type into external storage.")
    }
    val clazz: Class[_] = lookupDataSource(provider)
    val relation = clazz.newInstance() match {
      case dataSource: CreatableRelationProvider =>
        dataSource.createRelation(sqlContext, mode, options, data)
      case dataSource: HadoopFsRelationProvider =>
        // Don't glob path for the write path.  The contracts here are:
        //  1. Only one output path can be specified on the write path;
        //  2. Output path must be a legal HDFS style file system path;
        //  3. It's OK that the output path doesn't exist yet;
        
val caseInsensitiveOptions = new CaseInsensitiveMap(options)
        val outputPath = {
          val path = new Path(caseInsensitiveOptions("path"))
          val fs = path.getFileSystem(sqlContext.sparkContext.hadoopConfiguration)
          path.makeQualified(fs.getUri, fs.getWorkingDirectory)
        }

        val caseSensitive = sqlContext.conf.caseSensitiveAnalysis
        PartitioningUtils.validatePartitionColumnDataTypes(
          data.schema, partitionColumns, caseSensitive)

        val equality = columnNameEquality(caseSensitive)
        val dataSchema = StructType(
          data.schema.filterNot(f => partitionColumns.exists(equality(_, f.name))))
        val r = dataSource.createRelation(
          sqlContext,
          Array(outputPath.toString),
          Some(dataSchema.asNullable),
          Some(partitionColumnsSchema(data.schema, partitionColumns, caseSensitive)),
          caseInsensitiveOptions)

        // For partitioned relation r, r.schema's column ordering can be different from the column
        // ordering of data.logicalPlan (partition columns are all moved after data column).  This
        // will be adjusted within InsertIntoHadoopFsRelation.
        
sqlContext.executePlan(
          InsertIntoHadoopFsRelation(
            r,
            data.logicalPlan,
            mode)).toRdd
        
r
      case _ =>
        sys.error(s"${clazz.getCanonicalName} does not allow create table as select.")
    }
    ResolvedDataSource(clazz, relation)
  }
}

 

save()方法中的source的源码为:

private var source: String = df.sqlContext.conf.defaultDataSourceName

SQLContext的conf中的defaultDataSourceName方法为:

private[spark] def defaultDataSourceName: String = getConf(DEFAULT_DATA_SOURCE_NAME)

在SQLConf.scala中可以看到:
// This is used to set the default data source
val DEFAULT_DATA_SOURCE_NAME = stringConf("spark.sql.sources.default",
  defaultValue = Some("org.apache.spark.sql.parquet"),
  doc = "The default data source to use in input/output.")

即默认数据源是parquet

 

parquet.block.size基本上是压缩后的大小。读取数据时可能数据还在encoding

 

page内部有repetitionLevel DefinitionLevel data

Java的二进制就是字节流

Parquet非常耗内存,采用高压缩比率,采用很多Cache

解压后的大小是解压前的5-10倍。

BlockSize采用默认256MB

 

 

 

 

 

以上内容是王家林老师DT大数据梦工厂《 IMF传奇行动》第64课的学习笔记。
王家林老师是Spark、Flink、Docker、Android技术中国区布道师。Spark亚太研究院院长和首席专家,DT大数据梦工厂创始人,Android软硬整合源码级专家,英语发音魔术师,健身狂热爱好者。

微信公众账号:DT_Spark

联系邮箱18610086859@126.com 

电话:18610086859

QQ:1740415547

微信号:18610086859  

新浪微博:ilovepains


 

 

这篇关于第64课:SparkSQL下Parquet的数据切分和压缩内幕详解学习笔记的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!



http://www.chinasem.cn/article/588125

相关文章

java中反射(Reflection)机制举例详解

《java中反射(Reflection)机制举例详解》Java中的反射机制是指Java程序在运行期间可以获取到一个对象的全部信息,:本文主要介绍java中反射(Reflection)机制的相关资料... 目录一、什么是反射?二、反射的用途三、获取Class对象四、Class类型的对象使用场景1五、Class

golang 日志log与logrus示例详解

《golang日志log与logrus示例详解》log是Go语言标准库中一个简单的日志库,本文给大家介绍golang日志log与logrus示例详解,感兴趣的朋友一起看看吧... 目录一、Go 标准库 log 详解1. 功能特点2. 常用函数3. 示例代码4. 优势和局限二、第三方库 logrus 详解1.

MySQL大表数据的分区与分库分表的实现

《MySQL大表数据的分区与分库分表的实现》数据库的分区和分库分表是两种常用的技术方案,本文主要介绍了MySQL大表数据的分区与分库分表的实现,文中通过示例代码介绍的非常详细,对大家的学习或者工作具有... 目录1. mysql大表数据的分区1.1 什么是分区?1.2 分区的类型1.3 分区的优点1.4 分

一文详解如何从零构建Spring Boot Starter并实现整合

《一文详解如何从零构建SpringBootStarter并实现整合》SpringBoot是一个开源的Java基础框架,用于创建独立、生产级的基于Spring框架的应用程序,:本文主要介绍如何从... 目录一、Spring Boot Starter的核心价值二、Starter项目创建全流程2.1 项目初始化(

Mysql删除几亿条数据表中的部分数据的方法实现

《Mysql删除几亿条数据表中的部分数据的方法实现》在MySQL中删除一个大表中的数据时,需要特别注意操作的性能和对系统的影响,本文主要介绍了Mysql删除几亿条数据表中的部分数据的方法实现,具有一定... 目录1、需求2、方案1. 使用 DELETE 语句分批删除2. 使用 INPLACE ALTER T

Spring Boot3虚拟线程的使用步骤详解

《SpringBoot3虚拟线程的使用步骤详解》虚拟线程是Java19中引入的一个新特性,旨在通过简化线程管理来提升应用程序的并发性能,:本文主要介绍SpringBoot3虚拟线程的使用步骤,... 目录问题根源分析解决方案验证验证实验实验1:未启用keep-alive实验2:启用keep-alive扩展建

Java异常架构Exception(异常)详解

《Java异常架构Exception(异常)详解》:本文主要介绍Java异常架构Exception(异常),具有很好的参考价值,希望对大家有所帮助,如有错误或未考虑完全的地方,望不吝赐教... 目录1. Exception 类的概述Exception的分类2. 受检异常(Checked Exception)

Python Dash框架在数据可视化仪表板中的应用与实践记录

《PythonDash框架在数据可视化仪表板中的应用与实践记录》Python的PlotlyDash库提供了一种简便且强大的方式来构建和展示互动式数据仪表板,本篇文章将深入探讨如何使用Dash设计一... 目录python Dash框架在数据可视化仪表板中的应用与实践1. 什么是Plotly Dash?1.1

C#基础之委托详解(Delegate)

《C#基础之委托详解(Delegate)》:本文主要介绍C#基础之委托(Delegate),具有很好的参考价值,希望对大家有所帮助,如有错误或未考虑完全的地方,望不吝赐教... 目录1. 委托定义2. 委托实例化3. 多播委托(Multicast Delegates)4. 委托的用途事件处理回调函数LINQ

Python GUI框架中的PyQt详解

《PythonGUI框架中的PyQt详解》PyQt是Python语言中最强大且广泛应用的GUI框架之一,基于Qt库的Python绑定实现,本文将深入解析PyQt的核心模块,并通过代码示例展示其应用场... 目录一、PyQt核心模块概览二、核心模块详解与示例1. QtCore - 核心基础模块2. QtWid