Markov Blanket-Embedded Genetic Algorithm for Gene Selection


#Citation

##BibTeX

@article{ZHU20073236,
title = "Markov blanket-embedded genetic algorithm for gene selection",
journal = "Pattern Recognition",
volume = "40",
number = "11",
pages = "3236--3248",
year = "2007",
issn = "0031-3203",
doi = "https://doi.org/10.1016/j.patcog.2007.02.007",
url = "http://www.sciencedirect.com/science/article/pii/S0031320307000945",
author = "Zexuan Zhu and Yew-Soon Ong and Manoranjan Dash",
keywords = "Microarray, Feature selection, Markov blanket, Genetic algorithm (GA), Memetic algorithm (MA)"
}

##Normal

Zexuan Zhu, Yew-Soon Ong, Manoranjan Dash,
Markov blanket-embedded genetic algorithm for gene selection,
Pattern Recognition,
Volume 40, Issue 11,
2007,
Pages 3236-3248,
ISSN 0031-3203,
https://doi.org/10.1016/j.patcog.2007.02.007.
(http://www.sciencedirect.com/science/article/pii/S0031320307000945)
Keywords: Microarray; Feature selection; Markov blanket; Genetic algorithm (GA); Memetic algorithm (MA)

#Abstract

Microarray technologies:
find the smallest possible set of genes

Proposes a Markov blanket-embedded genetic algorithm (MBEGA) for the gene selection problem

Uses both the Markov blanket and predictive power in the classifier model

Compared with filter, wrapper, and standard GA methods

Evaluation criteria:
classification accuracy, number of selected genes, computational cost, and robustness

#Main content


##Markov Blanket

$F$: the set of all features
$C$: the class

The Markov blanket of a feature $F_i$ is defined as follows:

Definition (Markov blanket)
$M$: a subset of features that does not contain $F_i$,
i.e., $M \subseteq F$ and $F_i \notin M$.
$M$ is a Markov blanket of $F_i$ if,
given $M$, $F_i$ is conditionally independent of $\left( F \cup C \right) - M - \left\{ F_i \right\}$,
i.e., $P \left( F - M - \left\{ F_i \right\}, C | F_i, M \right) = P \left( F - M - \left\{ F_i \right\}, C | M \right)$

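Given X, two attributes A and B are conditionally independent if $P \left( A | X, B \right) = P \left( A | X \right)$; that is, B provides no information about A beyond what X already provides. If a feature $F_i$ has a Markov blanket $M$ within the currently selected feature subset, then $F_i$ provides no information about $C$ beyond that already given by the selected features in $M$, so $F_i$ can be safely removed. However, determining the conditional independence of features is usually computationally very expensive, so only a single feature is used to approximate the Markov blanket of $F_i$.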

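Definition (approximate Markov blanket)
For two features $F_i$ and $F_j$ ($i \neq j$), $F_j$ can be regarded as an approximate Markov blanket of $F_i$ if $SU_{j,C} \geq SU_{i,C}$ and $SU_{i,j} \geq SU_{i,C}$, where
symmetrical uncertainty (SU) measures the correlation between features (including the class $C$) and is defined as:

$SU \left( F_i, F_j \right) = 2 \left[ \frac{IG \left( F_i | F_j \right)}{H \left( F_i \right) + H \left( F_j \right)} \right]$

$IG \left( F_i | F_j \right)$: the information gain between features $F_i$ and $F_j$
$H \left( F_i \right)$ and $H \left( F_j \right)$: the entropies of features $F_i$ and $F_j$
$SU_{i,C}$: the correlation between feature $F_i$ and the class $C$, called the C-correlation
A feature is considered relevant if its C-correlation exceeds a user-specified threshold $\gamma$, i.e., $SU_{i,C} > \gamma$
A feature that has no approximate Markov blanket is a predominant feature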
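
The definitions above can be made concrete with a small sketch. It assumes discrete (already discretized) feature values and uses the identity $IG(A | B) = H(A) + H(B) - H(A, B)$; the function names are illustrative and are not taken from the paper.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(a, b):
    """SU(a, b) = 2 * IG(a | b) / (H(a) + H(b)), a value in [0, 1]."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a + h_b == 0:
        # Both variables are constant; returning 0 is a convention chosen here.
        return 0.0
    h_ab = entropy(list(zip(a, b)))   # joint entropy H(a, b)
    info_gain = h_a + h_b - h_ab      # IG(a | b) = H(a) - H(a | b)
    return 2.0 * info_gain / (h_a + h_b)


def is_approx_markov_blanket(f_j, f_i, c):
    """True if feature f_j is an approximate Markov blanket of feature f_i."""
    su_ic = symmetrical_uncertainty(f_i, c)
    return (symmetrical_uncertainty(f_j, c) >= su_ic
            and symmetrical_uncertainty(f_i, f_j) >= su_ic)


# Toy data: f_strong copies the class, f_weak is a noisy copy of it.
c = [0, 0, 1, 1, 0, 1, 0, 1]
f_strong = list(c)
f_weak = [0, 0, 1, 1, 0, 1, 1, 0]
print(is_approx_markov_blanket(f_strong, f_weak, c))
print(is_approx_markov_blanket(f_weak, f_strong, c))
```

In this toy data `f_strong` covers `f_weak`, so `f_weak` would be treated as redundant, while `f_strong` has no approximate Markov blanket and is predominant.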

##Markov blanket-embedded genetic algorithm


If the fitness values of two individuals differ by less than $\varepsilon$, the individual with fewer features is considered better
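
A minimal sketch of this comparison rule, assuming that fitness is a score where higher is better (for example classification accuracy); the function name and the value of `eps` are illustrative.

```python
def is_better(fit_a, n_feat_a, fit_b, n_feat_b, eps):
    """Return True if individual A is preferred over individual B."""
    if abs(fit_a - fit_b) < eps:
        # Fitness is effectively tied: prefer the smaller feature subset.
        return n_feat_a < n_feat_b
    return fit_a > fit_b


# A is slightly less accurate but uses half as many features.
print(is_better(0.900, 10, 0.905, 20, eps=0.01))
```

With `eps=0.01` the two fitness values count as tied, so the individual with 10 features is preferred.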

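Lamarckian learning:
forces the genotype to reflect the result of improvement by placing the locally improved individual back into the population to compete for reproductive opportunities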


$X$: the subset of selected features
$Y$: the subset of excluded features


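The C-correlation is computed only once

Search range $L$: defines the maximum number of $Add$ and $Del$ operations, giving $L^2$ combinations of operations
Random order: combinations are tried until an improvement is obtained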
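
The search over operation combinations can be sketched as follows. Several things here are assumptions rather than details stated in these notes: the counts of $Add$ and $Del$ operations each range over 0 to $L-1$ (one way to obtain $L^2$ combinations), fitness is a score where higher is better, and the operators are passed in as functions. The `random_add` and `random_del` stand-ins are hypothetical placeholders, not the Markov blanket-based operators.

```python
import random
from itertools import product


def local_search(selected, excluded, fitness, add_op, del_op, L, eps, rng):
    """Improvement-first search over combinations of Add and Del counts.

    selected, excluded: the sets X and Y of feature indices (not modified).
    fitness: maps a set of selected features to a score (higher is better).
    add_op, del_op: functions (x, y, rng) that modify x and y in place.
    """
    base_fit = fitness(selected)
    combos = list(product(range(L), repeat=2))  # assumption: counts in 0..L-1
    rng.shuffle(combos)                         # try combinations in random order
    for n_add, n_del in combos:
        x, y = set(selected), set(excluded)
        for _ in range(n_add):
            add_op(x, y, rng)
        for _ in range(n_del):
            del_op(x, y, rng)
        new_fit = fitness(x)
        if abs(new_fit - base_fit) < eps:
            improved = len(x) < len(selected)   # tie: fewer features wins
        else:
            improved = new_fit > base_fit
        if improved:
            return x, y, new_fit                # stop at the first improvement
    return set(selected), set(excluded), base_fit


# --- hypothetical stand-ins, for illustration only ---
def random_add(x, y, rng):
    if y:
        f = rng.choice(sorted(y))
        y.remove(f)
        x.add(f)


def random_del(x, y, rng):
    if len(x) > 1:                              # keep at least one feature
        f = rng.choice(sorted(x))
        x.remove(f)
        y.add(f)


def toy_fitness(x):
    return len(x & {0, 1}) / 2                  # features 0 and 1 are "informative"


X, Y = {0, 2, 3}, {1, 4}
new_X, new_Y, new_fit = local_search(X, Y, toy_fitness, random_add, random_del,
                                     L=3, eps=0.01, rng=random.Random(0))
print(sorted(new_X), sorted(new_Y), new_fit)
```

Returning the modified subsets rather than only the fitness matches the Lamarckian idea that the improved individual itself goes back into the population.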


The Lamarckian learning process is followed by the usual evolutionary operations:

linear ranking selection
uniform crossover
mutation operators with elitism
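
One generation built from these three operators might look like the sketch below. It assumes a bit-string encoding (1 = feature selected), a selection pressure `sp` of 1.5, bit-flip mutation with a rate of 0.02, and elitism that copies the single best individual; none of these settings come from the paper.

```python
import random


def linear_ranking_select(population, fitnesses, rng, sp=1.5):
    """Pick one individual with probability linear in its fitness rank.

    sp is the selection pressure in [1, 2]; needs at least two individuals.
    """
    n = len(population)
    order = sorted(range(n), key=lambda i: fitnesses[i])   # worst first
    weights = [2 - sp + 2 * (sp - 1) * r / (n - 1) for r in range(n)]
    return population[rng.choices(order, weights=weights, k=1)[0]]


def uniform_crossover(p1, p2, rng):
    """Each gene is copied from either parent with equal probability."""
    return [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]


def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]


def next_generation(population, fitness, rng, mutation_rate=0.02):
    fits = [fitness(ind) for ind in population]
    best = max(range(len(population)), key=lambda i: fits[i])
    children = [list(population[best])]    # elitism: best survives unchanged
    while len(children) < len(population):
        p1 = linear_ranking_select(population, fits, rng)
        p2 = linear_ranking_select(population, fits, rng)
        child = uniform_crossover(p1, p2, rng)
        children.append(mutate(child, mutation_rate, rng))
    return children


rng = random.Random(1)
pop = [[rng.randint(0, 1) for _ in range(8)] for _ in range(6)]
new_pop = next_generation(pop, sum, rng)   # toy fitness: number of ones
```

Because of elitism, the best fitness in `new_pop` is never lower than the best fitness in `pop`.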

##Experiments

Methods considered alongside the MBEGA method:

the FCBF (fast correlation-based filter)
BIRS (best incremental ranked subset)
standard GA feature selection algorithms

FCBF:
a fast correlation-based filter method

selects a subset of relevant features whose C-correlation is larger than a given threshold $\gamma$
sorts the relevant features in descending order of C-correlation
eliminates redundant features one by one, following that descending order

A feature is redundant only if it has an approximate Markov blanket

The result is a set of predominant features with no redundant features in terms of C-correlation
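
The FCBF steps listed above can be sketched as follows, assuming the SU values have already been computed. The argument layout (`su_c` as a list, `su_ff` as a symmetric matrix) and the numbers in the example are made up for illustration.

```python
def fcbf(su_c, su_ff, gamma):
    """Return the indices of predominant features.

    su_c[i]: C-correlation of feature i.
    su_ff[i][j]: SU between features i and j.
    gamma: relevance threshold.
    """
    # Keep relevant features, sorted by descending C-correlation.
    ranked = sorted((i for i in range(len(su_c)) if su_c[i] > gamma),
                    key=lambda i: su_c[i], reverse=True)
    selected = []
    while ranked:
        f_j = ranked.pop(0)        # highest remaining feature is predominant
        selected.append(f_j)
        # Drop every feature for which f_j is an approximate Markov blanket.
        # SU_{j,C} >= SU_{i,C} already holds because of the sort order.
        ranked = [f_i for f_i in ranked if su_ff[f_i][f_j] < su_c[f_i]]
    return selected


su_c = [0.9, 0.8, 0.3, 0.05]
su_ff = [
    [1.0, 0.85, 0.1, 0.0],
    [0.85, 1.0, 0.5, 0.0],
    [0.1, 0.5, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
print(fcbf(su_c, su_ff, gamma=0.1))   # [0, 2]
```

Feature 3 is dropped as irrelevant, and feature 1 is dropped as redundant because feature 0 is an approximate Markov blanket for it.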

BIRS: follows a scheme similar to FCBF,
but evaluates the goodness of features using a classifier