Data Mining: 5. Association Analysis

2023-12-10 11:01


6. Association Analysis

Given a set of records, each of which contains some number of items from a given collection, produce dependency rules that predict the occurrence of an item based on the occurrences of other items.
Application areas: marketing and sales promotion, content-based recommendation, customer loyalty programs.

Association analysis was initially used for market basket analysis, to find how items purchased by customers are related. It was later extended to more complex data structures such as sequential patterns and subgraph patterns.

6.1 Simple Approach: Pearson’s correlation coefficient

[Figure: Pearson's correlation coefficient in association analysis]

Note that correlation does not imply causality.
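As a rough illustration of this simple approach, the sketch below computes Pearson's correlation coefficient between two items, each encoded as a 0/1 column over the transactions. The `bread` and `butter` columns are hypothetical toy data, not taken from the original notes.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

# Hypothetical baskets: 1 = item was bought in that transaction, 0 = not bought.
bread  = [1, 1, 0, 1, 0, 1]
butter = [1, 1, 0, 1, 0, 0]

r = pearson(bread, butter)
print(round(r, 4))
```

A positive value suggests that the two items tend to be bought together, but it says nothing about which purchase, if any, drives the other.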

6.2 Definition

6.2.1 Frequent Itemset

[Figure: Frequent Itemset]

6.2.2 Association Rule

[Figure: Association Rule]

6.2.3 Evaluation Metrics

[Figure: Evaluation Metrics]
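The two metrics used throughout the rest of these notes are support and confidence. Assuming the usual definitions (support of an itemset is the fraction of transactions containing it; confidence of X → Y is support(X ∪ Y) / support(X)), a minimal sketch looks like this. The five baskets are hypothetical toy data.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Confidence of the rule lhs -> rhs: support(lhs and rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

# Hypothetical market baskets, used only for illustration.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

s = support({"Milk", "Diaper", "Beer"}, transactions)
c = confidence({"Milk", "Diaper"}, {"Beer"}, transactions)
print(s, c)
```

An itemset is called frequent when its support reaches the user-supplied minsup threshold.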

6.3 Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
minsup and minconf are provided by the user.
Brute-force approach:
Step 1: List all possible association rules
Step 2: Compute the support and confidence for each rule
Step 3: Remove rules that fail the minsup and minconf thresholds

However, this is computationally prohibitive because of the large number of candidate rules.
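To get a feel for how quickly the brute-force approach blows up, the sketch below counts the possible rules over d items in two ways: by enumeration (each item goes to the left-hand side, the right-hand side, or neither, and both sides must be non-empty) and with the closed form 3^d - 2^(d+1) + 1 that follows from that counting argument. The function names are illustrative only.

```python
from itertools import product

def count_rules_enumerate(d):
    """Count rules X -> Y over d items by trying every assignment of items."""
    count = 0
    # 0 = item unused, 1 = item on the left-hand side, 2 = item on the right-hand side
    for assignment in product((0, 1, 2), repeat=d):
        if 1 in assignment and 2 in assignment:
            count += 1
    return count

def count_rules_formula(d):
    """Closed form: all 3^d assignments minus those with an empty side."""
    return 3 ** d - 2 ** (d + 1) + 1

for d in (3, 6, 10):
    print(d, count_rules_formula(d))
```

Already at d = 6 there are 602 candidate rules, and the count grows exponentially from there.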

[Figure: Brute-force Approach]

[Figure: Mining Association Rules]

6.4 Apriori Algorithm

Two-step approach
Step 1: Frequent itemset generation (generate all itemsets whose support ≥ minsup)
Step 2: Rule generation (generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset)

However, frequent itemset generation is still computationally expensive: given d items, there are 2^d candidate itemsets.

Anti-Monotonicity of Support
The support of an itemset never exceeds the support of any of its subsets, so every superset of an infrequent itemset is itself infrequent (the Apriori principle).
[Figure: Anti-Monotonicity of Support]

Steps

  1. Start at k=1
  2. Generate frequent itemsets of length k=1
  3. Repeat until no new frequent itemsets are identified
    1. Generate length (k+1) candidate itemsets from length k frequent itemsets; increase k
    2. Prune candidate itemsets that cannot be frequent because they contain subsets of length k that are infrequent (Apriori Principle)
    3. Count the support of each remaining candidate by scanning the DB
    4. Eliminate candidates that are infrequent, leaving only those that are frequent
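The steps above can be sketched as follows. This is a minimal, unoptimized illustration rather than a reference implementation: candidates are generated by joining pairs of frequent k-itemsets, and `minsup_count` is an absolute transaction count. The baskets are hypothetical toy data.

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Return {frequent itemset (frozenset): support count}."""
    transactions = [frozenset(t) for t in transactions]

    # k = 1: count single items and keep the frequent ones
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= minsup_count}
    all_frequent = dict(frequent)

    k = 1
    while frequent:
        candidates = set()
        # Generate (k+1)-candidates by joining pairs of frequent k-itemsets
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) != k + 1:
                continue
            # Apriori principle: prune if any k-subset is infrequent
            if all(frozenset(sub) in frequent for sub in combinations(union, k)):
                candidates.add(union)
        # Count the support of the remaining candidates by scanning the DB
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        # Keep only the frequent candidates
        frequent = {s: c for s, c in counts.items() if c >= minsup_count}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
result = apriori(transactions, minsup_count=3)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

On this toy data, {Bread, Diaper, Beer} is never counted against the database because its subset {Bread, Beer} is already known to be infrequent.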

[Figure: Illustrating the Apriori Principle]

[Figure: From Frequent Itemsets to Rules]

[Figure: Challenge: Combinatorial Explosion]

[Figure: Rule Generation]

[Figure: Rule Generation for the Apriori Algorithm]
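As a sketch of the rule generation step, the function below enumerates every binary partitioning of one frequent itemset and keeps the rules that reach minconf. It assumes a dictionary `support_of` that maps frozensets to support counts (for example the output of a frequent itemset miner); the counts shown are hypothetical. The sketch checks all partitionings and does not use the additional pruning that is possible because confidence can only drop when items move from the left-hand side to the right-hand side.

```python
from itertools import combinations

def generate_rules(itemset, support_of, minconf):
    """Return [(lhs, rhs, confidence)] for all rules from one frequent itemset."""
    itemset = frozenset(itemset)
    rules = []
    for r in range(1, len(itemset)):          # size of the left-hand side
        for lhs in combinations(sorted(itemset), r):
            lhs = frozenset(lhs)
            rhs = itemset - lhs
            conf = support_of[itemset] / support_of[lhs]
            if conf >= minconf:
                rules.append((lhs, rhs, conf))
    return rules

# Hypothetical support counts for one frequent itemset and its subsets.
support_of = {
    frozenset({"Diaper"}): 4,
    frozenset({"Beer"}): 3,
    frozenset({"Diaper", "Beer"}): 3,
}
rules = generate_rules({"Diaper", "Beer"}, support_of, minconf=0.8)
for lhs, rhs, conf in rules:
    print(sorted(lhs), "->", sorted(rhs), round(conf, 2))
```

Here Beer → Diaper has confidence 3/3 and is kept, while Diaper → Beer has confidence 3/4 and is dropped.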

[Figure: Complexity of the Apriori Algorithm]

6.5 FP-growth Algorithm

FP-growth is usually faster than Apriori and requires at most two passes over the database.
It uses a compressed representation of the database, the FP-tree.
Once the FP-tree has been constructed, a recursive divide-and-conquer approach mines the frequent itemsets from it.

[Figure: FP-Tree Construction]

[Figure: FP-Growth (Summary)]
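The two database passes can be sketched as follows: the first pass counts item frequencies, the second inserts each transaction into a prefix tree with its frequent items sorted by descending frequency. This is a simplified illustration that covers only tree construction. The header table, the node links, and the recursive mining step are left out, and the `FPNode` class and the toy baskets are assumptions made for the example.

```python
from collections import Counter

class FPNode:
    """One node of the FP-tree: an item, a count, and child nodes."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, minsup_count):
    # Pass 1: count how many transactions contain each item
    freq = Counter(item for t in transactions for item in set(t))
    freq = {i: c for i, c in freq.items() if c >= minsup_count}

    root = FPNode(None)
    # Pass 2: insert every transaction as a path of its frequent items
    for t in transactions:
        # Most frequent items first; ties are broken alphabetically
        items = sorted((i for i in set(t) if i in freq),
                       key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
            child.count += 1      # shared prefixes only increase counts
            node = child
    return root, freq

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
root, freq = build_fp_tree(transactions, minsup_count=3)

def show(node, depth=0):
    for child in sorted(node.children.values(), key=lambda n: n.item):
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

show(root)
```

Transactions that share a prefix share a path in the tree, which is where the compression comes from.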

6.6 Interestingness Measures

Interestingness measures can be used to prune or rank the derived rules.
In the original formulation of association rules, support and confidence are the only interestingness measures used.
Various other measures have been proposed.

Drawback of Confidence
[Figure: Drawback of Confidence (two parts)]

6.6.1 Correlation

Correlation takes all of the data into account at once.
In our scenario, corr(tea, coffee) = -0.25, i.e., the correlation is negative.
Interpretation: people who drink tea are less likely to drink coffee.
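For two binary items, Pearson's correlation can be computed directly from the 2x2 contingency table (the phi coefficient). The figure with the underlying counts did not survive, so the table below is an assumption: it uses the counts of the common textbook tea/coffee example (100 people; 15 drink both, 5 only tea, 75 only coffee, 5 neither), which happen to reproduce the value of -0.25 quoted above.

```python
import math

def phi(f11, f10, f01, f00):
    """Correlation of two binary variables from their contingency counts.

    f11: both present, f10: only the first, f01: only the second, f00: neither.
    """
    row1, row0 = f11 + f10, f01 + f00      # first item present / absent
    col1, col0 = f11 + f01, f10 + f00      # second item present / absent
    return (f11 * f00 - f10 * f01) / math.sqrt(row1 * row0 * col1 * col0)

# Assumed counts (not from the original figure): first item tea, second item coffee.
corr = phi(f11=15, f10=5, f01=75, f00=5)
print(corr)
```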

6.6.2 Lift

[Figure: Lift (two parts)]

[Figure: Example: Lift]

Lift and correlation are symmetric: lift(tea → coffee) = lift(coffee → tea).
Confidence is asymmetric.
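The symmetry claim can be checked with a small sketch. It assumes the usual definition lift(X → Y) = support(X ∪ Y) / (support(X) · support(Y)) and, since the original example figure is missing, the same assumed textbook counts for tea and coffee (100 people, 20 tea drinkers, 90 coffee drinkers, 15 who drink both).

```python
def lift(n_xy, n_x, n_y, n):
    """Lift of X -> Y from absolute counts."""
    return (n_xy / n) / ((n_x / n) * (n_y / n))

def confidence(n_xy, n_x):
    """Confidence of X -> Y from absolute counts."""
    return n_xy / n_x

# Assumed counts, not taken from the original figure.
n, n_tea, n_coffee, n_both = 100, 20, 90, 15

lift_tea_coffee = lift(n_both, n_tea, n_coffee, n)
lift_coffee_tea = lift(n_both, n_coffee, n_tea, n)
conf_tea_coffee = confidence(n_both, n_tea)
conf_coffee_tea = confidence(n_both, n_coffee)

print(lift_tea_coffee, lift_coffee_tea)    # equal in both directions
print(conf_tea_coffee, conf_coffee_tea)    # differ by direction
```

With these counts the lift is below 1 even though the confidence of tea → coffee is 0.75, because coffee is bought by 90% of all people anyway.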

6.6.3 Others

6.7 Handling Continuous and Categorical Attributes

6.7.1 Handling Categorical Attributes

Transform each categorical attribute into asymmetric binary variables: introduce a new “item” for each distinct attribute-value pair (one-hot encoding).
Potential issues:
(1) Many attribute values
Many of the attribute values may have very low support.
Potential solution: aggregate the low-support attribute values into a bin for “other”.
(2) Highly skewed attribute values
Example: 95% of the visitors have Buy = No.
Most of the items will be associated with the (Buy=No) item.
Potential solution: drop the highly frequent items.
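A minimal sketch of this transformation, including the “other” bin for rare values, might look as follows. The records, the attribute names, and the `min_count` cut-off are hypothetical.

```python
from collections import Counter

def to_items(records, min_count=2):
    """Turn attribute-value records into transactions of 'attr=value' items.

    Values that occur fewer than `min_count` times are merged into 'other'.
    """
    counts = Counter((attr, val) for r in records for attr, val in r.items())
    transactions = []
    for r in records:
        items = set()
        for attr, val in r.items():
            if counts[(attr, val)] < min_count:
                val = "other"               # aggregate low-support values
            items.add(f"{attr}={val}")
        transactions.append(items)
    return transactions

# Hypothetical web-shop visitors.
records = [
    {"Browser": "Chrome",  "Buy": "No"},
    {"Browser": "Chrome",  "Buy": "No"},
    {"Browser": "Lynx",    "Buy": "Yes"},
    {"Browser": "Firefox", "Buy": "No"},
    {"Browser": "Chrome",  "Buy": "Yes"},
]
transactions = to_items(records, min_count=2)
for t in transactions:
    print(sorted(t))
```

Dropping a highly frequent item such as Buy=No would be a simple filter on the resulting transactions.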

6.7.2 Handling Continuous Attributes

Transform each continuous attribute into binary variables using discretization, e.g., equal-width binning or equal-frequency binning.
Issue: the size of the intervals affects support and confidence. If the intervals are too small, rules do not have enough support; if they are too large, rules do not have enough confidence.
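The two binning strategies can be sketched as below. Both functions return a bin index per value, which can then be turned into items such as "Age=bin0". The `ages` list is hypothetical, and the equal-width version assumes that the values are not all identical.

```python
def equal_width_bins(values, k):
    """Split the value range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k                   # assumes hi > lo
    # The maximum value would land in bin k, so clamp it into the last bin
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Split the sorted values into k bins with (roughly) equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

ages = [18, 19, 20, 21, 22, 60]             # hypothetical, with one outlier
print(equal_width_bins(ages, 3))
print(equal_frequency_bins(ages, 3))
```

With the outlier present, equal-width binning puts almost every value into the first bin, while equal-frequency binning keeps the bins balanced.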

6.8 Effect of Support Distribution

Many real data sets have a skewed support distribution.
How should the minsup threshold be set?
If minsup is set too high, we could miss itemsets involving interesting rare items (e.g., expensive products).
If minsup is set too low, mining is computationally expensive and the number of itemsets is very large.
Using a single minimum support threshold may therefore not be effective.

[Figure: Multiple Minimum Support]
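The original figure is not available, so the following is only a sketch of one common formulation of multiple minimum supports: every item gets its own minimum support, and an itemset has to reach the smallest minimum support among its items. The item names and thresholds are hypothetical.

```python
def itemset_minsup(itemset, item_minsup):
    """Minimum support of an itemset: the lowest threshold among its items."""
    return min(item_minsup[i] for i in itemset)

def is_frequent(itemset, support, item_minsup):
    """Check an itemset's support against its own item-specific threshold."""
    return support >= itemset_minsup(itemset, item_minsup)

# Hypothetical per-item thresholds: rare items get a lower minsup.
item_minsup = {"Milk": 0.05, "Coke": 0.03, "Broccoli": 0.001, "Salmon": 0.005}

print(itemset_minsup({"Milk", "Broccoli"}, item_minsup))
print(is_frequent({"Milk", "Broccoli"}, 0.002, item_minsup))
print(is_frequent({"Milk", "Coke"}, 0.002, item_minsup))
```

One consequence of this formulation is that a superset can be frequent even though one of its subsets is not, so Apriori-style pruning has to be adapted.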

6.9 Association Rules with Temporal Components

[Figure: Association Rules with Temporal Components]

6.10 Subgroup Discovery

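Association rule mining: find all patterns in the data.
Classification: identify the best patterns that can predict a target variable.
Subgroup discovery: find all patterns that can explain a target variable.
Subgroup discovery finds subgroups or subsets of a data set that have specific attributes and characteristics. The goal is to identify subgroups related to an attribute or behavior of interest, in order to understand the data more deeply, make predictions, or take related action. In some cases, subgroup discovery can be used to generate new features that are then used in a classification task.
Subgroup discovery aims to find subgroups in the data, whereas classification aims to assign data to known classes. Subgroup discovery is usually more exploratory, and classification is usually more predictive.
Even when we have strong predictor variables, we are also interested in the weaker ones.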

Algorithms
Early algorithms: learn an unpruned decision tree; extract rules; compute measures for the rules, then rate and rank them.
Newer algorithms: based on association rule mining; based on evolutionary algorithms.

Rating Rules
Goals: rules should cover many examples and be accurate.
Rules with both high coverage and high accuracy are interesting.

[Figure: Subgroup Discovery: Rating Rules]

[Figure: Subgroup Discovery: Metrics]

[Figure: WRAcc (three parts)]
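Weighted relative accuracy (WRAcc) combines the two goals of coverage and accuracy in one number. Assuming the usual definition WRAcc(Cond → Class) = P(Cond) · (P(Class | Cond) - P(Class)), a small sketch with hypothetical counts looks like this.

```python
def wracc(n, n_cond, n_class, n_both):
    """Weighted relative accuracy of the rule Cond -> Class.

    n:       number of examples
    n_cond:  examples covered by the rule condition
    n_class: examples of the target class
    n_both:  covered examples that belong to the target class
    """
    coverage = n_cond / n
    accuracy_gain = n_both / n_cond - n_class / n
    return coverage * accuracy_gain

# Hypothetical data: 100 examples, 40 positive; the rule covers 20, 16 of them positive.
score = wracc(n=100, n_cond=20, n_class=40, n_both=16)
print(round(score, 4))
```

A rule that covers few examples, or whose accuracy is no better than the class prior, gets a score near zero; a rule that performs worse than the prior gets a negative score.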

[Figure: Subgroup Discovery: Summary]

6.11 Summary

Association Analysis: discovering patterns in data; patterns are described by rules.
Apriori & FP-Growth: find rules with minimum support (i.e., number of transactions) and minimum confidence (i.e., strength of the implication).
Subgroup Discovery: learn rules for a particular target variable; create a comprehensive model of a class.




http://www.chinasem.cn/article/476920
