This article covers how to use Dataset.show in Spark and what is worth paying attention to when you do. I hope it serves as a useful reference.
Preface
This article belongs to my original column 《大数据技术体系》. Please credit the source when quoting, and feel free to point out errors and omissions in the comments. Thanks!
For the column's table of contents and reference list, see 大数据技术体系.
WHAT
Dataset.show comes up constantly in everyday Spark development and testing. It prints the contents of a Dataset so you can inspect it.
Overloads
def show(numRows: Int): Unit
def show(): Unit
def show(truncate: Boolean): Unit
def show(numRows: Int, truncate: Boolean): Unit
def show(numRows: Int, truncate: Int): Unit
def show(numRows: Int, truncate: Int, vertical: Boolean): Unit
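These six overloads ultimately funnel into a single rendering call driven by three effective parameters: numRows, truncate, and vertical. The sketch below is illustrative, not Spark's actual source; it mimics the delegation chain so you can see which defaults each overload fills in (ShowResolver and ShowArgs are hypothetical names).

```scala
// Hypothetical model of how the show() overloads delegate to one another.
// Each overload resolves missing arguments to its documented default.
final case class ShowArgs(numRows: Int, truncate: Int, vertical: Boolean)

class ShowResolver {
  def show(): ShowArgs = show(20)                       // default 20 rows
  def show(numRows: Int): ShowArgs = show(numRows, truncate = true)
  def show(truncate: Boolean): ShowArgs = show(20, truncate)
  def show(numRows: Int, truncate: Boolean): ShowArgs =
    show(numRows, if (truncate) 20 else 0)              // true means "cap at 20 chars"
  def show(numRows: Int, truncate: Int): ShowArgs =
    show(numRows, truncate, vertical = false)           // horizontal by default
  def show(numRows: Int, truncate: Int, vertical: Boolean): ShowArgs =
    ShowArgs(numRows, truncate, vertical)
}

object ResolverDemo {
  def main(args: Array[String]): Unit = {
    val r = new ShowResolver
    println(r.show())              // ShowArgs(20,20,false)
    println(r.show(2, false))      // ShowArgs(2,0,false)
  }
}
```

Note how a Boolean truncate collapses to the Int form: true becomes 20, false becomes 0.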
Usage notes
What is worth noting about the six overloads above?
1. vertical
show() has two print modes. The default is the horizontal (tabular) layout, as shown below:
year month AVG('Adj Close) MAX('Adj Close)
1980 12    0.503218        0.595103
1981 01    0.523289        0.570307
1982 02    0.436504        0.475256
1983 03    0.410516        0.442194
1984 04    0.450090        0.483521
The other is the vertical layout, as shown below:
-RECORD 0-------------------
 year            | 1980
 month           | 12
 AVG('Adj Close) | 0.503218
 MAX('Adj Close) | 0.595103
-RECORD 1-------------------
 year            | 1981
 month           | 01
 AVG('Adj Close) | 0.523289
 MAX('Adj Close) | 0.570307
-RECORD 2-------------------
 year            | 1982
 month           | 02
 AVG('Adj Close) | 0.436504
 MAX('Adj Close) | 0.475256
-RECORD 3-------------------
 year            | 1983
 month           | 03
 AVG('Adj Close) | 0.410516
 MAX('Adj Close) | 0.442194
-RECORD 4-------------------
 year            | 1984
 month           | 04
 AVG('Adj Close) | 0.450090
 MAX('Adj Close) | 0.483521
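The vertical layout is easy to reproduce by hand: one "-RECORD n" header per row, then one "fieldName | value" line per column, with field names padded to a common width so the separators line up. The following is a minimal sketch of that formatting, not Spark's actual implementation; renderRecord and the header width are illustrative.

```scala
// Minimal sketch of vertical-mode layout: pad field names to the widest
// name so the '|' separators align, and prefix each row with a header.
object VerticalShowSketch {
  def renderRecord(index: Int, fields: Seq[(String, String)]): String = {
    val nameWidth = fields.map(_._1.length).max
    // Hypothetical header width; Spark derives it from field and data widths.
    val header = s"-RECORD $index".padTo(nameWidth + 12, '-')
    val body = fields.map { case (name, value) =>
      s" ${name.padTo(nameWidth, ' ')} | $value"
    }
    (header +: body).mkString("\n")
  }

  def main(args: Array[String]): Unit = {
    println(renderRecord(0, Seq("year" -> "1980", "month" -> "12")))
  }
}
```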
2. numRows
show() takes numRows to control how many rows are displayed; the default is 20.
3. truncate
- show() uses the truncate parameter to cap how many characters of each cell are displayed. When truncation is on, cells are right-aligned (with truncate = false or 0 they are left-aligned, as the output below shows).
- A string longer than truncate (default 20) keeps its first truncate - 3 characters and gets "..." appended:
str.substring(0, truncate - 3) + "..."
- Columns of type Array[Byte] are printed as hex bytes in the "[", " ", "]" format:
binary.map("%02X".format(_)).mkString("[", " ", "]")
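The two cell-formatting rules above can be tried out in plain Scala. This is a hedged sketch that mirrors the documented behavior (the real logic lives inside Dataset's string-rendering code); truncateCell and formatBinary are illustrative names, and the truncate < 4 branch is an assumption about how widths too small to fit "..." are handled.

```scala
// Sketch of the per-cell truncation rule described above.
def truncateCell(str: String, truncate: Int): String =
  if (truncate > 0 && str.length > truncate) {
    // Assumption: widths below 4 cannot fit "...", so the string is hard-cut.
    if (truncate < 4) str.substring(0, truncate)
    else str.substring(0, truncate - 3) + "..."
  } else str

// Array[Byte] cells print as upper-case hex bytes wrapped in brackets.
// %02X on a Byte formats its unsigned 8-bit value, so -1 prints as FF.
def formatBinary(binary: Array[Byte]): String =
  binary.map("%02X".format(_)).mkString("[", " ", "]")

println(truncateCell("[address_1, address_2, address_3]", 20)) // [address_1, addre...
println(formatBinary(Array[Byte](0, 31, -1)))                  // [00 1F FF]
```

The first call reproduces exactly the truncated addr cells in the output further below.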
For a full source-level walkthrough of Dataset.show, see my post Spark SQL 工作流程源码解析(四)optimization 阶段(基于 Spark 3.3.0).
Practice
Download the source
The spark-examples code is open source. The project aims to provide the most hands-on guide to learning Apache Spark development.
Click the link to download the source from GitHub: spark-examples
Data
{"name": "Alice","age": 18,"sex": "Female","addr": ["address_1","address_2", " address_3"]}
{"name": "Thomas","age": 20, "sex": "Male","addr": ["address_1"]}
{"name": "Tom","age": 50, "sex": "Male","addr": ["address_1","address_2","address_3"]}
{"name": "Catalina","age": 30, "sex": "Female","addr": ["address_1","address_2"]}
Code
package com.shockang.study.spark.sql.show

import com.shockang.study.spark.SQL_DATA_DIR
import com.shockang.study.spark.util.Utils.formatPrint
import org.apache.spark.sql.SparkSession

/**
 *
 * @author Shockang
 */
object ShowExample {
  val DATA_PATH: String = SQL_DATA_DIR + "user.json"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("ShowExample").getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")
    spark.read.json(DATA_PATH).createTempView("t_user")
    val df = spark.sql("SELECT * FROM t_user")
    formatPrint("""df.show""")
    df.show
    formatPrint("""df.show(2)""")
    df.show(2)
    formatPrint("""df.show(true)""")
    df.show(true)
    formatPrint("""df.show(false)""")
    df.show(false)
    formatPrint("""df.show(2, truncate = true)""")
    df.show(2, truncate = true)
    formatPrint("""df.show(2, truncate = false)""")
    df.show(2, truncate = false)
    formatPrint("""df.show(2, truncate = 0)""")
    df.show(2, truncate = 0)
    formatPrint("""df.show(2, truncate = 20)""")
    df.show(2, truncate = 20)
    formatPrint("""df.show(2, truncate = 0, vertical = true)""")
    df.show(2, truncate = 0, vertical = true)
    formatPrint("""df.show(2, truncate = 20, vertical = true)""")
    df.show(2, truncate = 20, vertical = true)
    formatPrint("""df.show(2, truncate = 0, vertical = false)""")
    df.show(2, truncate = 0, vertical = false)
    formatPrint("""df.show(2, truncate = 20, vertical = false)""")
    df.show(2, truncate = 20, vertical = false)
    spark.stop()
  }
}
Output
========== df.show ==========
+--------------------+---+--------+------+
|                addr|age|    name|   sex|
+--------------------+---+--------+------+
|[address_1, addre...| 18|   Alice|Female|
|         [address_1]| 20|  Thomas|  Male|
|[address_1, addre...| 50|     Tom|  Male|
|[address_1, addre...| 30|Catalina|Female|
+--------------------+---+--------+------+

========== df.show(2) ==========
+--------------------+---+------+------+
|                addr|age|  name|   sex|
+--------------------+---+------+------+
|[address_1, addre...| 18| Alice|Female|
|         [address_1]| 20|Thomas|  Male|
+--------------------+---+------+------+
only showing top 2 rows

========== df.show(true) ==========
+--------------------+---+--------+------+
|                addr|age|    name|   sex|
+--------------------+---+--------+------+
|[address_1, addre...| 18|   Alice|Female|
|         [address_1]| 20|  Thomas|  Male|
|[address_1, addre...| 50|     Tom|  Male|
|[address_1, addre...| 30|Catalina|Female|
+--------------------+---+--------+------+

========== df.show(false) ==========
+----------------------------------+---+--------+------+
|addr                              |age|name    |sex   |
+----------------------------------+---+--------+------+
|[address_1, address_2,  address_3]|18 |Alice   |Female|
|[address_1]                       |20 |Thomas  |Male  |
|[address_1, address_2, address_3] |50 |Tom     |Male  |
|[address_1, address_2]            |30 |Catalina|Female|
+----------------------------------+---+--------+------+

========== df.show(2, truncate = true) ==========
+--------------------+---+------+------+
|                addr|age|  name|   sex|
+--------------------+---+------+------+
|[address_1, addre...| 18| Alice|Female|
|         [address_1]| 20|Thomas|  Male|
+--------------------+---+------+------+
only showing top 2 rows

========== df.show(2, truncate = false) ==========
+----------------------------------+---+------+------+
|addr                              |age|name  |sex   |
+----------------------------------+---+------+------+
|[address_1, address_2,  address_3]|18 |Alice |Female|
|[address_1]                       |20 |Thomas|Male  |
+----------------------------------+---+------+------+
only showing top 2 rows

========== df.show(2, truncate = 0) ==========
+----------------------------------+---+------+------+
|addr                              |age|name  |sex   |
+----------------------------------+---+------+------+
|[address_1, address_2,  address_3]|18 |Alice |Female|
|[address_1]                       |20 |Thomas|Male  |
+----------------------------------+---+------+------+
only showing top 2 rows

========== df.show(2, truncate = 20) ==========
+--------------------+---+------+------+
|                addr|age|  name|   sex|
+--------------------+---+------+------+
|[address_1, addre...| 18| Alice|Female|
|         [address_1]| 20|Thomas|  Male|
+--------------------+---+------+------+
only showing top 2 rows

========== df.show(2, truncate = 0, vertical = true) ==========
-RECORD 0----------------------------------
 addr | [address_1, address_2,  address_3]
 age  | 18
 name | Alice
 sex  | Female
-RECORD 1----------------------------------
 addr | [address_1]
 age  | 20
 name | Thomas
 sex  | Male
only showing top 2 rows

========== df.show(2, truncate = 20, vertical = true) ==========
-RECORD 0--------------------
 addr | [address_1, addre...
 age  | 18
 name | Alice
 sex  | Female
-RECORD 1--------------------
 addr | [address_1]
 age  | 20
 name | Thomas
 sex  | Male
only showing top 2 rows

========== df.show(2, truncate = 0, vertical = false) ==========
+----------------------------------+---+------+------+
|addr                              |age|name  |sex   |
+----------------------------------+---+------+------+
|[address_1, address_2,  address_3]|18 |Alice |Female|
|[address_1]                       |20 |Thomas|Male  |
+----------------------------------+---+------+------+
only showing top 2 rows

========== df.show(2, truncate = 20, vertical = false) ==========
+--------------------+---+------+------+
|                addr|age|  name|   sex|
+--------------------+---+------+------+
|[address_1, addre...| 18| Alice|Female|
|         [address_1]| 20|Thomas|  Male|
+--------------------+---+------+------+
only showing top 2 rows
That wraps up how to use Dataset.show in Spark and what to watch out for. I hope it helps!