This article introduces single-factor (univariate) analysis in Python, using pandas and NumPy on an HR dataset (HR.csv).
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import pandas as pd
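import numpy as np

## Read the data
df = pd.read_csv("./HR.csv", header=0)

# Inspect the data structure
summary = df.describe()

# Compute means (numeric columns only, since the file also holds string columns such as salary)
row_mean = df.mean(axis=1, numeric_only=True)
col_mean = df.mean(numeric_only=True)

# Select data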
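## Columns
print(df["satisfaction_level"].head())
# A plain slice selects rows by position (the first three rows)
print(df[0:3])

## Labels
# .loc slices by label and includes the end label, so this prints rows 0 through 3
print(df.loc[0:3])
print(df.loc[0, ["satisfaction_level"]])

### 1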
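### Outlier analysis
### Handling null values
sl_l = df["satisfaction_level"]
# Rows whose satisfaction_level is null
df[df['satisfaction_level'].isnull()]
#print(sl_l.isnull())
# Number of null values, then the null entries themselves
print(sl_l.isnull().sum())
print(sl_l[sl_l.isnull()])

## Filling null values
#print(sl_l.fillna(value=5))

## Dropping null values
#print(sl_l.dropna(how="any"))
sl_l = sl_l.dropna(how="any")

### 2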
### Handling values that are too large or too small
le_s = df['last_evaluation']
le_s[le_s.isnull()]
le_s.isnull().sum()

## Skewness
le_s.skew()
## Kurtosis
le_s.kurt()

## Handling outliers in a continuous column (upper and lower bounds from the quartiles)
#(1) le_s = le_s[le_s <= 1]
q_low = le_s.quantile(q=0.25)
q_high = le_s.quantile(q=0.75)
q_interval = q_high - q_low
k = 1.5

### Filter the data: keep values inside (q_low - k*IQR, q_high + k*IQR)
le_s = le_s[(le_s > q_low - k * q_interval) & (le_s < q_high + k * q_interval)]

### Distribution
np.histogram(le_s.values, bins=np.arange(0.0, 1.1, 0.1))

### 3
## Sorting
np_s = df['number_project']
# Share of each project count, ordered by the count itself
np_s.value_counts(normalize=True).sort_index()

### 4 Distribution
pl5_s = df['promotion_last_5years']
pl5_s.value_counts()
pl5_s.value_counts(normalize=True)

## 5 Conditional filtering
s_s = df['salary']
# where() replaces values that fail the condition with NaN; dropna() then removes them
s_s.where(s_s != "nme").dropna()

### Summary
# Drop null values
df = pd.read_csv("./HR.csv", header=0)
df = df.dropna(axis=0, how='any')
# Simple rule-based view: evaluations of at most 1 and salary other than "nme"
df[(df['last_evaluation'] <= 1) & (df['salary'] != 'nme')]
le_s = df['last_evaluation']
q_low = le_s.quantile(q=0.25)
q_high = le_s.quantile(q=0.75)
q_interval = q_high - q_low
k = 1.5
# Keep rows inside the quartile bounds (q_low - k*IQR, q_high + k*IQR)
in_range = (le_s > q_low - k * q_interval) & (le_s < q_high + k * q_interval)
le_s = le_s[in_range]
df = df[in_range & (df['salary'] != 'nme')]
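The quartile-based filtering used above can be sketched as a small reusable function. This is only a minimal sketch: the helper name `iqr_bounds` and the toy Series are hypothetical and made up for illustration; they are not part of the original script or of HR.csv.

```python
import pandas as pd


def iqr_bounds(s, k=1.5):
    """Return (lower, upper) = (q_low - k*IQR, q_high + k*IQR) for a Series."""
    q_low = s.quantile(q=0.25)
    q_high = s.quantile(q=0.75)
    q_interval = q_high - q_low
    return q_low - k * q_interval, q_high + k * q_interval


# Toy data (assumed values): five plausible scores and one extreme value.
toy = pd.Series([0.5, 0.6, 0.7, 0.8, 0.9, 100.0])

lower, upper = iqr_bounds(toy)
# Combine both conditions with & so the mask stays aligned with the Series.
kept = toy[(toy > lower) & (toy < upper)]
print(kept.tolist())
```

Combining the two conditions with `&` in one mask avoids chained indexing such as `s[cond1][cond2]`, where the second mask no longer matches the index of the already filtered Series.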
Simple comparison analysis
df.groupby("department").mean(numeric_only=True)
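To make the per-department comparison concrete, here is a minimal sketch on a toy DataFrame. The rows and values are made up for illustration and only stand in for HR.csv; the column names follow the ones used in this article.

```python
import pandas as pd

# Toy stand-in for HR.csv (assumed values, not real data).
toy = pd.DataFrame({
    "department": ["sales", "sales", "IT", "IT"],
    "satisfaction_level": [0.4, 0.6, 0.7, 0.9],
    "salary": ["low", "medium", "low", "high"],
})

# numeric_only=True skips the string column "salary"; without it,
# recent pandas versions raise an error when averaging strings.
dept_mean = toy.groupby("department").mean(numeric_only=True)
print(dept_mean)
```

The result has one row per department and one column per numeric field, so the groups can be compared side by side.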
To be continued...