Statistical Tests in Data Analysis

2024-08-26 04:32

Tags: statistics, test, data analysis


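When a statistical question comes up in data analysis, the table below is a reasonable starting point for choosing a method:
[Image: table of statistical methods]

(Image found online; original source unknown.)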

The first step is to check whether the data follow a normal distribution. There are four ways to do this:

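Plot a histogram of the data and compare it with the overlaid normal curve. This is a rough check, not a hard-and-fast rule. With experience, you learn to recognize which shapes are approximately normal and which are not.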
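Look at skewness and kurtosis:
Skewness measures whether the distribution is symmetric.
A positively skewed distribution has its bulk on the left and a long tail to the right; a negatively skewed distribution has its bulk on the right and a long tail to the left.
Kurtosis measures how peaked the distribution is and how heavy its tails are.
It indicates whether the curve is bell-shaped rather than, for example, flatter or more sharply peaked than a bell.
For a normal distribution, skewness and excess kurtosis (kurtosis minus 3) are both 0, so the further these values are from 0, the less normal the data look. But how close to 0 is close enough to call the data normal? That is hard to judge from the raw numbers, which is why the formal statistical tests below are used.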
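To make the two measures concrete, here is a minimal sketch that computes them with only the Python standard library. The function names `skewness` and `excess_kurtosis` and the simulated samples are assumptions made for illustration. The formulas are the simple moment-based (biased) versions, so statistical packages that apply small-sample corrections may report slightly different numbers.

```python
import random
from statistics import fmean


def skewness(values):
    """Third central moment divided by the cubed standard deviation."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


def excess_kurtosis(values):
    """Fourth standardized moment minus 3, so a normal distribution scores about 0."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3.0


rng = random.Random(0)  # fixed seed so the illustration is reproducible
normal_sample = [rng.gauss(0.0, 1.0) for _ in range(5000)]
right_skewed_sample = [rng.expovariate(1.0) for _ in range(5000)]  # long right tail

print("normal:     ", skewness(normal_sample), excess_kurtosis(normal_sample))
print("exponential:", skewness(right_skewed_sample), excess_kurtosis(right_skewed_sample))
```

The simulated normal sample should land near 0 on both measures, while the exponential sample should show clearly positive skewness, matching the "bulk on the left, tail to the right" description.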
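Kolmogorov-Smirnov (K-S) test and Shapiro-Wilk (S-W) test
Both test for normality by comparing your data to a normal distribution with the same mean and standard deviation as your sample.
If the test is not significant (p > 0.05), the data are consistent with a normal distribution; if it is significant (p < 0.05), the data are treated as non-normal.
Note that the larger the sample, the more likely the test is to come out significant, even when the departure from normality is small.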
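The two tests can be run roughly as follows. This is a sketch that assumes NumPy and SciPy are installed and uses `scipy.stats.shapiro` and `scipy.stats.kstest`; the helper name `normality_report`, the simulated samples, and the fixed seed are illustrative assumptions, not part of the original article.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed for reproducibility
samples = {
    "normal": rng.normal(loc=50.0, scale=10.0, size=200),
    "exponential": rng.exponential(scale=10.0, size=200),
}

ALPHA = 0.05  # the significance threshold used in the text


def normality_report(x, alpha=ALPHA):
    """Run Shapiro-Wilk and K-S against a normal with the sample's own mean and SD."""
    x = np.asarray(x, dtype=float)
    sw_stat, sw_p = stats.shapiro(x)
    # Caveat: estimating the mean and SD from the same data makes the plain
    # K-S p-value too large (conservative); the Lilliefors variant corrects this.
    ks_stat, ks_p = stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1)))
    return {
        "shapiro_p": float(sw_p),
        "ks_p": float(ks_p),
        "shapiro_rejects_normality": bool(sw_p < alpha),
        "ks_rejects_normality": bool(ks_p < alpha),
    }


for name, x in samples.items():
    print(name, normality_report(x))
```

Keep in mind that a p-value above 0.05 does not prove the data are normal; it only means the test found no evidence against normality at that sample size.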
The fourth method is graphical: the normal Q-Q plot.
The black line indicates the values your sample should follow if the distribution were normal. The dots are your actual data. If the dots fall on or close to the black line, your data are approximately normal. If they deviate systematically from the black line, your data are non-normal.
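The dots and the line of a normal Q-Q plot can be computed by hand, which shows what the plot actually compares. The sketch below uses only the Python standard library; the function names `normal_qq_points` and `reference_line`, the plotting positions (i + 0.5) / n, and the simulated sample are assumptions chosen for illustration, and statistical packages may use slightly different conventions.

```python
import random
from statistics import NormalDist, fmean, stdev


def normal_qq_points(values):
    """Return (theoretical, observed) pairs for a normal Q-Q plot."""
    ordered = sorted(values)
    n = len(ordered)
    standard_normal = NormalDist()
    # Plotting positions (i + 0.5) / n avoid asking for the 0% and 100% quantiles.
    theoretical = [standard_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


def reference_line(values):
    """Slope and intercept of the line the dots follow if the data are normal."""
    return stdev(values), fmean(values)


rng = random.Random(1)  # fixed seed so the illustration is reproducible
sample = [rng.gauss(100.0, 15.0) for _ in range(300)]
slope, intercept = reference_line(sample)

points = normal_qq_points(sample)
# Largest vertical gap between a dot and the reference line.
worst_gap = max(abs(obs - (intercept + slope * z)) for z, obs in points)
print(f"largest deviation from the line: {worst_gap:.2f}")
```

Passing `points` to any plotting library as a scatter plot, with the reference line drawn on top, gives the picture described above.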

Some situations in which the data are clearly not normally distributed:
when the outcome is an ordinal variable or a rank
when there are definite outliers or
when the outcome has clear limits of detection.