Data Analysis Example 1 (English Report): Predicting Future Income with SAS Logistic Regression on 1994 U.S. Census Data


Prediction of Future Income by Using Logistic Regression

Matthew LaFrance, Yu Zhang

 

1 Introduction

Many factors can influence a person’s annual income, including age, gender, race, level of education, marital status, and nationality. The authors fit four models of these factors to a census dataset, aiming to predict annual income as accurately as possible.

The dataset was obtained from the UCI Machine Learning Repository. It consists of 48,844 observations from the 1994 U.S. Census. The target variable, “Salary”, has two levels: >50k and <=50k. There are 8 categorical features and 5 numeric features covering demographic, educational, and occupational information. Table 1 lists the variables as they appear in the SAS version of the dataset.
 

Table 1 Variables in the Income Dataset
The target variable “Salary” is notably unbalanced (see Table 2). Raw accuracy could therefore be a misleading success metric: a model that classifies every observation as <=50k would achieve roughly 76% accuracy.
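The point about raw accuracy can be illustrated with a minimal sketch. The authors worked in SAS; the Python below is only an illustration, and the labels are synthetic values chosen to mirror the roughly 76/24 split mentioned above, not the census records themselves. The function name `majority_baseline_accuracy` is hypothetical.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


# Synthetic labels (assumption): 76 of 100 fall in the majority class.
labels = ["<=50k"] * 76 + [">50k"] * 24

label, accuracy = majority_baseline_accuracy(labels)
print(f"Always predicting {label!r} gives accuracy {accuracy:.2f}")
```

Any model evaluated on data like this should be compared against that baseline, since a classifier that never identifies a single >50k earner still scores about 0.76.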
 
Table 2 Salary Overview

2 Data Preprocessing

2.1 Missing Values

A total of 3,622 observations contained missing values, primarily in the “occupation” column. Because the occupation column has many factor levels, we judged that most imputation methods would not retain enough information to justify the added bias, so we kept only complete cases. This approach assumes that the values in the deleted observations were missing at random; if that assumption does not hold, our conclusions may be biased. Future work could explore other methods of handling the missing data. After deletion, the final dataset had 45,222 observations.
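Complete-case deletion of this kind could be sketched as follows. This is an illustrative Python sketch rather than the authors' SAS code. The rows are toy data, and the missing-value markers (an empty string, `?`, or `None`) are assumptions about how missing entries might be encoded.

```python
def complete_cases(rows, missing_markers=("", "?", None)):
    """Keep only the rows in which no field holds a missing-value marker."""
    kept = []
    for row in rows:
        # Strip whitespace from string fields before comparing to the markers.
        values = [v.strip() if isinstance(v, str) else v for v in row.values()]
        if not any(v in missing_markers for v in values):
            kept.append(row)
    return kept


# Toy rows (assumption): two of the four have a missing occupation.
rows = [
    {"age": "39", "occupation": "Adm-clerical", "salary": "<=50k"},
    {"age": "54", "occupation": "?", "salary": ">50k"},
    {"age": "28", "occupation": "Sales", "salary": "<=50k"},
    {"age": "41", "occupation": None, "salary": "<=50k"},
]

kept = complete_cases(rows)
dropped = len(rows) - len(kept)
print(f"kept {len(kept)} rows, dropped {dropped}")
```

Reporting the number of dropped rows alongside the retained ones makes it easy to check how much of the sample the deletion removes.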
 

2.2 Multicollinearity Checks

None of our numeric features were strongly correlated with one another (see Table 3), so we were not particularly concerned about multicollinearity.
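A pairwise correlation check of this kind might look like the sketch below. It is written in Python for illustration only (the authors used SAS). The column names and values are toy data, and the 0.7 cut-off for a "strong" correlation is an assumption, since the report does not state a threshold.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def strongly_correlated_pairs(columns, threshold=0.7):
    """List the column pairs whose absolute correlation reaches the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Toy columns (assumption): "b" is an exact multiple of "a"; "c" is unrelated.
columns = {
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [1, -1, -1, 1],
}

flagged = strongly_correlated_pairs(columns)
print(flagged)
```

An empty result for the real numeric features would correspond to the conclusion above that multicollinearity is not a major concern.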
 

Table 3 The Result of Correlation Checks

2.3 Exploratory Data Analysis and Feature Engineering

Our initial look at the data yielded several noteworthy observations, which are detailed below by variable and summarized at the end.

2.3.1 Capital Gains and Capital Losses
Figure 1 Overview of Capital Gains
Figure 2 Overview of Capital Losses
 

Most individuals in our dataset report neither capital gains nor capital losses (see Figure 1 and Figure 2), which suggests that most of them do not hold any investments.

2.3.2 Native Country
As expected, most observations are U.S. natives (see Figure 3). The native country variable has many factor levels. To avoid creating too many dummy variables later on, we rebinned this feature into a 0/1 indicator variable for being a native-born citizen.
Figure 3 Overview of Native Countries
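The rebinning of native country into a 0/1 indicator could be sketched as below. This is illustrative Python rather than the authors' SAS code. The level name `United-States` is an assumption about how the U.S. level is spelled in the data, and `is_us_native` is a hypothetical helper name.

```python
def is_us_native(native_country, us_label="United-States"):
    """Return 1 when the country matches the U.S. level, otherwise 0."""
    return 1 if native_country.strip() == us_label else 0


# Toy values (assumption); the stray whitespace mimics untrimmed raw fields.
countries = ["United-States", "Mexico", " United-States", "India"]

indicator = [is_us_native(c) for c in countries]
print(indicator)
```

Collapsing the feature this way replaces dozens of sparse dummy variables with a single column, at the cost of losing any differences among the non-U.S. countries.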
 

2.3.3 Marital Status
Looking into the marital status feature, we noticed several different levels that all represent being married (see Figure 4). We combined these levels into a single level, “married”, for easier interpretation. Married individuals also appear more likely to earn more than 50k.
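Combining the married levels could be sketched as follows. This is illustrative Python, not the authors' SAS code. The level names used here are assumptions about how the categories are spelled, and the report does not say exactly which levels were merged, so the prefix rule is only one plausible reading.

```python
def collapse_marital_status(status):
    """Map every level that starts with 'Married' to the single level 'married'."""
    cleaned = status.strip()
    # "Never-married" does not start with "married", so it is left unchanged.
    if cleaned.lower().startswith("married"):
        return "married"
    return cleaned


# Toy values (assumption) covering married and non-married levels.
statuses = [
    "Married-civ-spouse",
    "Married-AF-spouse",
    "Never-married",
    "Divorced",
    " Married-spouse-absent",
]

collapsed = [collapse_marital_status(s) for s in statuses]
print(collapsed)
```

A prefix match keeps the rule short, but an explicit lookup table of level names would be safer if the data contained a level that begins with "Married" and should not be merged.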
