CFS调度器（一）

本文主要是介绍CFS调度器（一），希望对大家解决编程问题提供一定的参考价值，需要的开发者们随着小编来一起学习吧！

关于调度类和优先级的概念，前面的文章《调度器概述》中已经做了介绍了，本文不在重述。本文主要关注的就是CFS调度器，或者叫做fair_sched_class调度类。这种调度器是被SCHED_NORMAL、SCHED_BATCH这两种调度策略使用的。（本文基于Linux4.0）

提到调度器涉及到两个问题：

1.怎么根据参量快速选出下一个要调度执行的进程？
2.怎么定义和更新不同进程的相关参量？

带着这两个问题，我们慢慢去讲解CFS调度器。

权重

权重代表着进程对于CPU使用时间的占有率，所以它的大小跟进程优先级相关，优先级越高，抢占的CPU时间越多，因此它的权重就越大。它的关系可以用下面的公式来表示：

进程运行时间 = 进程权重 * CPU总时间 / 所有进程总权重

内核中每个调度实体都用一个结构体成员来表示权重：

struct sched_entity {struct load_weight load;...
}

表示权重的结构体为：

struct load_weight {unsigned long weight;u32 inv_weight;
}

其中weight代表的是权重，inv_weight是为了方便计算出来的一个中间值。内核根据优先级nice来定义权重，约定nice值为0的进程权重值为1024，其他nice值的进程权重可以根据查表获取：

/** Nice levels are multiplicative, with a gentle 10% change for every* nice level changed. I.e. when a CPU-bound task goes from nice 0 to* nice 1, it will get ~10% less CPU time than another CPU-bound task* that remained on nice 0.** The "10% effect" is relative and cumulative: from _any_ nice level,* if you go up 1 level, it's -10% CPU usage, if you go down 1 level* it's +10% CPU usage. (to achieve that we use a multiplier of 1.25.* If a task goes up by ~10% and another task goes down by ~10% then* the relative distance between them is ~25%.)*/
static const int prio_to_weight[40] = {/* -20 */     88761,     71755,     56483,     46273,     36291,/* -15 */     29154,     23254,     18705,     14949,     11916,/* -10 */      9548,      7620,      6100,      4904,      3906,/*  -5 */      3121,      2501,      1991,      1586,      1277,/*   0 */      1024,       820,       655,       526,       423,/*   5 */       335,       272,       215,       172,       137,/*  10 */       110,        87,        70,        56,        45,/*  15 */        36,        29,        23,        18,        15,
};

其他nice值的权重值都是基于1024计算出来的，高一优先级的进程权重总是低一优先级进程的1.25倍。除了这个表，内核为了计算方便还定义了另外一个表：

/** Inverse (2^32/x) values of the prio_to_weight[] array, precalculated.** In cases where the weight does not change often, we can use the* precalculated inverse to speed up arithmetics by turning divisions* into multiplications:*/
static const u32 prio_to_wmult[40] = {/* -20 */     48388,     59856,     76040,     92818,    118348,/* -15 */    147320,    184698,    229616,    287308,    360437,/* -10 */    449829,    563644,    704093,    875809,   1099582,/*  -5 */   1376151,   1717300,   2157191,   2708050,   3363326,/*   0 */   4194304,   5237765,   6557202,   8165337,  10153587,/*   5 */  12820798,  15790321,  19976592,  24970740,  31350126,/*  10 */  39045157,  49367440,  61356676,  76695844,  95443717,/*  15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
};

内核设置进程权重的函数：

static void set_load_weight(struct task_struct *p)
{int prio = p->static_prio - MAX_RT_PRIO;struct load_weight *load = &p->se.load;/** SCHED_IDLE tasks get minimal weight:*/if (p->policy == SCHED_IDLE) {load->weight = scale_load(WEIGHT_IDLEPRIO);load->inv_weight = WMULT_IDLEPRIO;return;}load->weight = scale_load(prio_to_weight[prio]);load->inv_weight = prio_to_wmult[prio];
}

虚拟时间

CFS调度器的目标是保证每个进程的完全公平调度，尝试对进程一视同仁，把CPU时间平均分配给每个进程去运行，进程运行的同时会记录每个进程的运行时间，按照这种公平算法，CPU每次调度都会优先选择运行时间最少的那个进程，以保证进程的公平。
假如按照上述思想去设计调度器，那么会有一个问题，所有进程都是完全公平，那么如何体现优先级的不同呢？实际上内核并不是使用实际运行时间作为pick_next进程的参考量，而采用的是一个虚拟时间，这个虚拟时间加入了不同优先级权重因素的。虚拟时间的计算方法如下：

vruntime = delta_exec * NICE0_WEIGHT / weight;

vruntime是虚拟运行时间，delta_exec是实际运行时间，NICE0_WEIGHT是nice值为0的权重，也就是1024，weight进程的权重。这样由上面的表可知，优先级越高的进程算出来的vruntime越小，从而被CFS调度器选出来的概率就越大。runqueue中的所有进程都要维护一个vruntime变量，并且把这些进程按照红黑树来管理，vrunquque总是处于红黑树的最左侧，这样在算法选择时具有更高的效率。

回到前面的问题，CFS调度器就是通过vruntime参量来决定当前要选择运行的进程，为了选择算法的高效和快速，采用红黑树来管理这个参量，这就是CFS调度器的核心思想了。
下面是内核中的代码定义：

struct sched_entity {u64         exec_start;u64         sum_exec_runtime;u64         vruntime;u64         prev_sum_exec_runtime;......
};

对于实际运行时间转换为虚拟时间，通过如下函数来实现calc_delta_fair：

static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
{if (unlikely(se->load.weight != NICE_0_LOAD))delta = __calc_delta(delta, NICE_0_LOAD, &se->load);return delta;
}

有了这个虚拟时间，我们可以用来作为选择pick next进程的参考量，那么此参考量何时需要更新呢？


1.进程创建，加入运行队列时要更新2.周期调度时要更新

调度延迟

调度延迟表示是保证系统中的每个进程都能执行一遍所需要的时间。它不是一个固定值，计算方式如下：


unsigned int sysctl_sched_latency = 6000000ULL;         //6ms
static unsigned int sched_nr_latency = 8;
unsigned int sysctl_sched_min_granularity = 750000ULL;   //0.75msstatic u64 __sched_period(unsigned long nr_running)
{u64 period = sysctl_sched_latency;unsigned long nr_latency = sched_nr_latency;if (unlikely(nr_running > nr_latency)) {period = sysctl_sched_min_granularity;period *= nr_running;}return period;
}

当系统处于就runnable的进程少于一个定值（默认值8）的时候，调度延迟也是固定一个值不变（默认值6ms）。当系统就绪态进程个数超过这个值时，我们保证每个进程至少运行一定的时间，那么调度延迟总时间就是nr_running * sysctl_sched_min_granularity。
当获取到一个调度延迟后，还需要计算一个进程分配的时间片，CFS调度器和以往的O(1)调度器不同，不在有固定的时间片，而是通过权重计算出的时间片。按照前面权重的介绍，通过

时间片=调度延迟总时间*进程权重/总权重

来计算一个进程的时间片。

static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{u64 slice = __sched_period(cfs_rq->nr_running + !se->on_rq);for_each_sched_entity(se) {struct load_weight *load;struct load_weight lw;cfs_rq = cfs_rq_of(se);load = &cfs_rq->load;if (unlikely(!se->on_rq)) {lw = cfs_rq->load;update_load_add(&lw, se->load.weight);load = &lw;}slice = __calc_delta(slice, se->load.weight, load);}return slice;
}

__calc_delta传入的参数分别是总时间，当前调度实体的权重，整个runqueue的总权重。因此每个进程的时间片是不固定的，而是根据权重和系统中进程的数量发生着变化。