
2023-11-04 03:20


What Demographics Can Tell Us About Voters' Choice of Party in House Elections

2017 Election for National House of Representatives

Using data put together by Steve Riffe at data.world for the 2017 Election for the House of Representatives, I built a simple logistic regression model for predicting whether a voting district will elect a Democratic or Republican candidate based on its ethno-racial demographics. Riffe's demographic data were taken from the US Census Bureau's 2013 estimates.

The Census Bureau’s estimates use the following designations for the various ethno-racial groups:


  • Hispanic

  • White

  • Black

  • Native American

  • Asian

  • Pacific Islander

  • Other

  • Multiple Races


Training the Model

After converting the data’s raw values to reflect each demographic as a percentage of the total estimated population of each district, I created a voter turnout feature. I do need to note here that, due to Riffe’s lack of clarity in the dataset’s data dictionary, it is not clear whether this feature represents actual voter turnout or simply the total number of votes the winning candidate received. The target category, the victorious candidates’ party affiliation, was then converted to binary values with Democrat = 0 and Republican = 1. Before my reader asks where all the independents went, surprisingly, there were none.
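
The preprocessing described above can be sketched roughly as follows. This is a minimal illustration rather than the author's actual code: the column names and the three sample districts are hypothetical, since the dataset's real schema is not shown in the article.

```python
import pandas as pd

# Hypothetical raw counts for three districts; names and numbers are illustrative only.
raw = pd.DataFrame({
    "district": ["AA-01", "AA-02", "BB-01"],
    "total_population": [700_000, 720_000, 710_000],
    "white": [420_000, 360_000, 497_000],
    "black": [140_000, 72_000, 71_000],
    "hispanic": [140_000, 288_000, 142_000],
    "party": ["D", "D", "R"],
})

group_cols = ["white", "black", "hispanic"]

# Express each group as a percentage of the district's estimated population.
features = raw[group_cols].div(raw["total_population"], axis=0) * 100

# Encode the target as in the article: Democrat = 0, Republican = 1.
target = raw["party"].map({"D": 0, "R": 1})
```

Dividing row-wise by the population column keeps every district on the same 0 to 100 scale, regardless of how many people live there.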

I used the scikit-learn implementation of logistic regression with a train, validate, test split to keep in line with best practices. Logistic regression was a natural first choice because the question is a binary classification problem. I took the model's accuracy score as the primary evaluation metric rather than the area under the receiver operating characteristic (ROC) curve, and measured it against a mode-baseline accuracy score in which every district is predicted to elect a Republican representative. I chose accuracy because it is fairly interpretable to the layman, and because the mode accounted for only roughly 55% of districts, so the classes were balanced enough that the ROC score was not needed to compensate for class imbalance.
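
As a rough illustration of the mode baseline, the sketch below uses scikit-learn's DummyClassifier on made-up labels with the same 55/45 split mentioned above. The labels are hypothetical and the author may well have computed the baseline differently, for example directly from the class counts.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 55 Republican (1) and 45 Democratic (0) districts.
y = np.array([1] * 55 + [0] * 45)
X = np.zeros((len(y), 1))  # the baseline ignores the features entirely

# Always predict the most frequent class (the mode).
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_accuracy = baseline.score(X, y)
```

Any model worth keeping has to beat this number, which is simply the share of districts belonging to the majority class.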

Since the data were originally in alphabetical order by state name and sequentially by district number, I shuffled the data and performed a 70/15/15 train, validation, test split. The model was then fit in a scikit-learn pipeline with StandardScaler, as shown below.

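from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Pipeline steps reconstructed from the text; any hyperparameters in the original are unknown.
lr = make_pipeline(StandardScaler(), LogisticRegression())
lr.fit(X_train, y_train)
lr_accuracy_ = lr.score(X_val, y_val)  # accuracy on the validation set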

This simple model achieved 84.62% accuracy, nearly 30 percentage points above the baseline.
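
The article does not show how the shuffled 70/15/15 split was produced. One common way to get it, assuming scikit-learn's train_test_split and placeholder data standing in for the real districts, is to split twice:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 "districts" with 8 demographic features each (hypothetical).
rng = np.random.default_rng(42)
X = rng.random((100, 8))
y = rng.integers(0, 2, size=100)

# Use integer sizes so the 15% validation and test sets are exact.
n_val = n_test = round(0.15 * len(X))

# First split off the 30% holdout, shuffling to break the alphabetical ordering.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=n_val + n_test, shuffle=True, random_state=42
)

# Then divide the holdout evenly into validation and test sets.
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=n_test, shuffle=True, random_state=42
)
```

Fixing random_state makes the shuffle reproducible, so the reported accuracy can be regenerated from the same rows.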


Analyzing the Data

The Logit Function

Given that the above model was sufficiently accurate to draw conclusions from, I used the intercept and coefficients of the logit function, the linear form of the sigmoid logistic function, to see how each group contributed to its district's choice of candidate. Each coefficient was then converted to a probability for ease of interpretation. The process and these data are shown below:

import math
import pandas as pd

# Pull the fitted logistic regression step out of the pipeline.
lr_model = lr.named_steps['logisticregression']
lr_coef = list(lr_model.coef_[0])
lr_coef_data = {'Feature': features, 'Coefficients': lr_coef}
lr_coefficients = pd.DataFrame(lr_coef_data)
lr_coefficients = lr_coefficients.sort_values(by='Coefficients', ascending=False)

probabilities = []

def log_odds_to_prob(coefficient):
    # Logistic function: e^x / (1 + e^x)
    numerator = math.e ** coefficient
    denominator = 1 + numerator
    return numerator / denominator

for coefficient in lr_coefficients.Coefficients:
    probabilities.append(log_odds_to_prob(coefficient))

lr_coefficients['Probabilities'] = probabilities

[Figure: Logit function coefficients and probabilities]

The sign of each coefficient represents the direction in which that feature pushes the vote, with positive values indicating a benefit for the Republicans and negative values indicating a benefit for the Democrats. The probabilities indicate the likelihood that a district composed entirely of the selected demographic would have a Republican representative. In the case of the voter turnout category (assuming that is the correct interpretation of the feature), the probability indicates the likelihood of a randomly selected district anywhere in the nation electing a Republican candidate given 100% voter turnout. Given that only data from a single year, 2017, are being examined here, it would be prudent not to jump to any hasty conclusions. Even with 100% voter turnout, a Republican candidate would likely still stand a strong chance of winning, given that the probability is close to 0.5. Had we analyzed data over a longer time span, we could reasonably presume that the value would come closer to 0.5. The intercept of the logit function was 0.11225385011865346, or 0.5280340308125522 when expressed as a probability; since this is close to even odds, it can be ignored, and the above interpretation remains valid even when examining the year 2017 alone.
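
As a quick check of the intercept conversion quoted above, the snippet below applies the logistic function to the reported value. The function name is reused from the article's code for readability, but this is a standalone sketch, written in the algebraically equivalent form 1 / (1 + e^-x).

```python
import math

def log_odds_to_prob(log_odds: float) -> float:
    """Logistic (sigmoid) function: maps log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

intercept = 0.11225385011865346  # intercept value reported in the article
intercept_prob = log_odds_to_prob(intercept)

# Log-odds of exactly zero correspond to even odds.
even_odds_prob = log_odds_to_prob(0.0)
```

A positive log-odds value always maps above 0.5 and a negative one below it, which is why the sign of a coefficient can be read as a push toward one party or the other.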

What the data do tell us, however, can be extremely insightful. Bearing in mind that the coefficients describe the relationship between the percentage of the population each ethno-racial group makes up and an elected official's party affiliation, we find that states where the percentages of Whites and Native Americans are highest tend to elect Republican representatives. This does not indicate that the latter group, being a minority, actually votes Republican; rather, it tells us that the states with the highest per capita American Indian populations tend to be in the West. An example of this would be Arizona, a red state which also happens to be home to the Navajo Nation. Nor does this finding indicate that the former group predominantly votes Republican; rather, it indicates that Whites who live in states where a larger proportion of the population is White do.

Note also the magnitude with which the percentage of Pacific Islanders seems to influence the direction of the vote. This is likely because Hawaii has by far the largest proportion of Pacific Islanders of any state, and Hawaii happens to be a Democratic-run state. Given that the largest Pacific Islander populations on the US mainland are found in Utah and the Carolinas, it's likely that the trend would run in the opposite direction if Hawaii were omitted from the data.

In my opinion, quite possibly the most fascinating finding here is the indication that states with higher percentages of people who check the “Other” box on the Census tend to elect Democratic candidates. This may simply be because coastal areas and large cities tend to have more diverse populations, or it may be something more curious.

Permutation Importances

Permutation importance, another statistical tool for examining the effect each feature has on the target, suggests that the magnitude of the Other category's effect dwarfs the influence of the other categories.


[Figure: Permutation importances]

The permutation importance algorithm calculates a weight for each feature's contribution to the target variable. Unlike the logit function's intercept and coefficients, here the weights indicate the magnitude rather than the direction of those contributions. Taking these weights into our analysis, the contributions of all categories except Other, Multiple Races, Pacific Islanders, and Native Americans can effectively be discounted, with Other being the only definite contributor to the net result. This is strong evidence for the diversity theory described above.
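
The article does not say which library produced these weights. As one possible approach, the sketch below uses scikit-learn's permutation_importance on synthetic data in which only the first feature determines the label; the data and variable names are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 "districts", 3 features; only feature 0 drives the label.
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

# Shuffle each feature in turn and record the mean drop in accuracy.
result = permutation_importance(
    model, X, y, scoring="accuracy", n_repeats=10, random_state=0
)
importances = result.importances_mean
```

A feature whose shuffling barely changes accuracy receives a weight near zero, which is the sense in which such contributions can be discounted. In practice the importances are usually computed on held-out data rather than on the training set used here.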

Translated from: https://medium.com/@samswank/democrat-or-republican-politics-and-logistic-regression-7639648be5f0


