Linux 3.8 Writeback Mechanism: Source Code Analysis

2024-02-14 00:48


Writeback data structures

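The main data structures involved in writeback are:

  1. backing_dev_info, which describes everything about a backing device. A block device's request queue normally contains a backing_dev_info object.
  2. bdi_writeback, which wraps the writeback kernel thread together with the inode lists it operates on.
  3. wb_writeback_work, which wraps a single writeback work item.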




The structures are defined as follows:

struct backing_dev_info {
	struct list_head bdi_list;
	unsigned long ra_pages;	/* max readahead in PAGE_CACHE_SIZE units */
	unsigned long state;	/* Always use atomic bitops on this */
	unsigned int capabilities; /* Device capabilities */
	congested_fn *congested_fn; /* Function pointer if device is md/dm */
	void *congested_data;	/* Pointer to aux data for congested func */

	char *name;

	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];

	unsigned long bw_time_stamp;	/* last time write bw is updated */
	unsigned long dirtied_stamp;
	unsigned long written_stamp;	/* pages written at bw_time_stamp */
	unsigned long write_bandwidth;	/* the estimated write bandwidth */
	unsigned long avg_write_bandwidth; /* further smoothed write bw */

	/*
	 * The base dirty throttle rate, re-calculated on every 200ms.
	 * All the bdi tasks' dirty rate will be curbed under it.
	 * @dirty_ratelimit tracks the estimated @balanced_dirty_ratelimit
	 * in small steps and is much more smooth/stable than the latter.
	 */
	unsigned long dirty_ratelimit;
	unsigned long balanced_dirty_ratelimit;

	struct fprop_local_percpu completions;
	int dirty_exceeded;

	unsigned int min_ratio;
	unsigned int max_ratio, max_prop_frac;

	struct bdi_writeback wb;  /* default writeback info for this bdi */
	spinlock_t wb_lock;	  /* protects work_list */

	struct list_head work_list;

	struct device *dev;

	struct timer_list laptop_mode_wb_timer;

#ifdef CONFIG_DEBUG_FS
	struct dentry *debug_dir;
	struct dentry *debug_stats;
#endif
};


struct bdi_writeback {
	struct backing_dev_info *bdi;	/* our parent bdi */
	unsigned int nr;

	unsigned long last_old_flush;	/* last old data flush */
	unsigned long last_active;	/* last time bdi thread was active */

	struct task_struct *task;	/* writeback thread */
	struct timer_list wakeup_timer; /* used for delayed bdi thread wakeup */
	struct list_head b_dirty;	/* dirty inodes */
	struct list_head b_io;		/* parked for writeback */
	struct list_head b_more_io;	/* parked for more writeback */
	spinlock_t list_lock;		/* protects the b_* lists */
};

/*
 * Passed into wb_writeback(), essentially a subset of writeback_control
 */
struct wb_writeback_work {
	long nr_pages;
	struct super_block *sb;
	unsigned long *older_than_this;
	enum writeback_sync_modes sync_mode;
	unsigned int tagged_writepages:1;
	unsigned int for_kupdate:1;
	unsigned int range_cyclic:1;
	unsigned int for_background:1;
	enum wb_reason reason;		/* why was writeback initiated? */

	struct list_head list;		/* pending work list */
	struct completion *done;	/* set if the caller waits */
};

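  1. The BDI structure describes a block device. A bdi object is registered on the system-wide bdi list when the block device is added. For ext3, mounting the filesystem associates the underlying block device's bdi object with the ext3 root inode. The bdi structure contains a work_list, which holds the work items the writeback kernel thread has to process. If there is no work on this list, the writeback kernel thread goes to sleep and waits.
  2. The writeback object wraps the kernel thread (task) and the lists of inodes to be processed. When the page cache/buffer cache needs to flush the pages of an inode (tracked in the inode's radix tree), the inode is placed on the writeback object's b_dirty list and the writeback thread is woken up. During processing, the inode is moved to the b_io list. Using several lists reduces the amount of state shared between threads.
  3. The wb_writeback_work structure wraps a writeback task; different tasks can use different flushing policies. The writeback thread operates on writeback_work items. If the writeback_work list is empty, the kernel thread can go to sleep.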
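
The work-list behaviour described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not kernel code: demo_work, demo_bdi, demo_queue_work, demo_get_next_work and demo_drain are hypothetical names, and the locking (wb_lock in the kernel) as well as thread sleep/wakeup are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct wb_writeback_work. */
struct demo_work {
    long nr_pages;            /* pages this request wants written */
    struct demo_work *next;   /* next pending request */
};

/* Hypothetical stand-in for the bdi: only the FIFO of pending work. */
struct demo_bdi {
    struct demo_work *head;
    struct demo_work *tail;
};

/* Append a request to the tail of the pending list. */
void demo_queue_work(struct demo_bdi *bdi, struct demo_work *work)
{
    assert(bdi != NULL && work != NULL);
    work->next = NULL;
    if (bdi->tail)
        bdi->tail->next = work;
    else
        bdi->head = work;
    bdi->tail = work;
}

/* Detach and return the oldest request, or NULL if nothing is pending. */
struct demo_work *demo_get_next_work(struct demo_bdi *bdi)
{
    struct demo_work *work = bdi->head;

    if (!work)
        return NULL;
    bdi->head = work->next;
    if (!bdi->head)
        bdi->tail = NULL;
    work->next = NULL;
    return work;
}

/* Drain the list as a flusher loop would; returns the pages "written". */
long demo_drain(struct demo_bdi *bdi)
{
    long wrote = 0;
    struct demo_work *work;

    while ((work = demo_get_next_work(bdi)) != NULL)
        wrote += work->nr_pages;
    return wrote;
}
```

In this model a NULL return from demo_get_next_work is the point at which the real thread would go to sleep until new work is queued.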

Main writeback functions


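The main functions of the writeback mechanism fall into two groups:
  1. Managing bdi objects and forking the writeback kernel threads that flush cached data.
  2. The writeback kernel thread function itself, which writes back dirty pages.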

Writeback thread management

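Linux runs a kernel daemon thread that manages the system bdi list and creates writeback threads for block devices. When a bdi has dirty pages but no kernel thread has been assigned to it yet, bdi_forker_thread creates one. When a writeback thread has been idle for a long time (5 minutes by default), bdi_forker_thread releases it.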

static int bdi_forker_thread(void *ptr)
{
	struct bdi_writeback *me = ptr;

	current->flags |= PF_SWAPWRITE;
	set_freezable();

	/*
	 * Our parent may run at a different priority, just set us to normal
	 */
	set_user_nice(current, 0);

	for (;;) {
		struct task_struct *task = NULL;
		struct backing_dev_info *bdi;
		enum {
			NO_ACTION,   /* Nothing to do */
			FORK_THREAD, /* Fork bdi thread */
			KILL_THREAD, /* Kill inactive bdi thread */
		} action = NO_ACTION;

		/*
		 * Temporary measure, we want to make sure we don't see
		 * dirty data on the default backing_dev_info
		 */
		if (wb_has_dirty_io(me) || !list_empty(&me->bdi->work_list)) {
			del_timer(&me->wakeup_timer);
			wb_do_writeback(me, 0);
		}

		spin_lock_bh(&bdi_lock);
		/*
		 * In the following loop we are going to check whether we have
		 * some work to do without any synchronization with tasks
		 * waking us up to do work for them. Set the task state here
		 * so that we don't miss wakeups after verifying conditions.
		 */
		set_current_state(TASK_INTERRUPTIBLE);

		list_for_each_entry(bdi, &bdi_list, bdi_list) {
			bool have_dirty_io;

			if (!bdi_cap_writeback_dirty(bdi) ||
			    bdi_cap_flush_forker(bdi))
				continue;

			WARN(!test_bit(BDI_registered, &bdi->state),
			     "bdi %p/%s is not registered!\n", bdi, bdi->name);

			have_dirty_io = !list_empty(&bdi->work_list) ||
					wb_has_dirty_io(&bdi->wb);

			/*
			 * If the bdi has work to do, but the thread does not
			 * exist - create it.
			 */
			if (!bdi->wb.task && have_dirty_io) {
				/*
				 * Set the pending bit - if someone will try to
				 * unregister this bdi - it'll wait on this bit.
				 */
				set_bit(BDI_pending, &bdi->state);
				action = FORK_THREAD;
				break;
			}

			spin_lock(&bdi->wb_lock);

			/*
			 * If there is no work to do and the bdi thread was
			 * inactive long enough - kill it. The wb_lock is taken
			 * to make sure no-one adds more work to this bdi and
			 * wakes the bdi thread up.
			 */
			if (bdi->wb.task && !have_dirty_io &&
			    time_after(jiffies, bdi->wb.last_active +
						bdi_longest_inactive())) {
				task = bdi->wb.task;
				bdi->wb.task = NULL;
				spin_unlock(&bdi->wb_lock);
				set_bit(BDI_pending, &bdi->state);
				action = KILL_THREAD;
				break;
			}
			spin_unlock(&bdi->wb_lock);
		}
		spin_unlock_bh(&bdi_lock);

		/* Keep working if default bdi still has things to do */
		if (!list_empty(&me->bdi->work_list))
			__set_current_state(TASK_RUNNING);

		switch (action) {
		case FORK_THREAD:
			__set_current_state(TASK_RUNNING);
			task = kthread_create(bdi_writeback_thread, &bdi->wb,
					      "flush-%s", dev_name(bdi->dev));
			if (IS_ERR(task)) {
				/*
				 * If thread creation fails, force writeout of
				 * the bdi from the thread. Hopefully 1024 is
				 * large enough for efficient IO.
				 */
				writeback_inodes_wb(&bdi->wb, 1024,
						    WB_REASON_FORKER_THREAD);
			} else {
				/*
				 * The spinlock makes sure we do not lose
				 * wake-ups when racing with 'bdi_queue_work()'.
				 * And as soon as the bdi thread is visible, we
				 * can start it.
				 */
				spin_lock_bh(&bdi->wb_lock);
				bdi->wb.task = task;
				spin_unlock_bh(&bdi->wb_lock);
				wake_up_process(task);
			}
			bdi_clear_pending(bdi);
			break;

		case KILL_THREAD:
			__set_current_state(TASK_RUNNING);
			kthread_stop(task);
			bdi_clear_pending(bdi);
			break;

		case NO_ACTION:
			if (!wb_has_dirty_io(me) || !dirty_writeback_interval)
				/*
				 * There are no dirty data. The only thing we
				 * should now care about is checking for
				 * inactive bdi threads and killing them. Thus,
				 * let's sleep for longer time, save energy and
				 * be friendly for battery-driven devices.
				 */
				schedule_timeout(bdi_longest_inactive());
			else
				schedule_timeout(msecs_to_jiffies(dirty_writeback_interval * 10));
			try_to_freeze();
			break;
		}
	}

	return 0;
}
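
The per-bdi decision made inside the list_for_each_entry loop can be condensed into a pure function. The following is a user-space sketch only: demo_bdi_state, demo_action and demo_forker_decide are hypothetical names, the tick counter stands in for jiffies, and all locking and the BDI_pending bit are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum demo_action { DEMO_NO_ACTION, DEMO_FORK_THREAD, DEMO_KILL_THREAD };

/* Hypothetical snapshot of the fields the forker looks at. */
struct demo_bdi_state {
    bool has_thread;            /* models bdi->wb.task != NULL */
    bool has_dirty_io;          /* pending work items or dirty inodes */
    unsigned long last_active;  /* tick at which the thread last wrote pages */
};

/*
 * Decide what the forker should do for one bdi.
 * The unsigned subtraction stays correct across counter wrap-around,
 * assuming "now" is not logically earlier than last_active.
 */
enum demo_action demo_forker_decide(const struct demo_bdi_state *s,
                                    unsigned long now,
                                    unsigned long longest_inactive)
{
    assert(s != NULL);

    /* Work to do but nobody to do it: create a thread. */
    if (!s->has_thread && s->has_dirty_io)
        return DEMO_FORK_THREAD;

    /* Thread exists, nothing to do, idle for too long: stop it. */
    if (s->has_thread && !s->has_dirty_io &&
        now - s->last_active > longest_inactive)
        return DEMO_KILL_THREAD;

    return DEMO_NO_ACTION;
}
```

With longest_inactive set to the tick equivalent of five minutes, a thread that last wrote pages exactly that long ago is kept, and it is stopped one tick later.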

The writeback worker thread

The writeback thread is created by bdi_forker_thread. Its job is to process pending flush work. The thread function is bdi_writeback_thread, implemented as follows:
/*
 * Handle writeback of dirty data for the device backed by this bdi. Also
 * wakes up periodically and does kupdated style flushing.
 */
int bdi_writeback_thread(void *data)
{
	struct bdi_writeback *wb = data;
	struct backing_dev_info *bdi = wb->bdi;
	long pages_written;

	current->flags |= PF_SWAPWRITE;
	set_freezable();
	wb->last_active = jiffies;

	/*
	 * Our parent may run at a different priority, just set us to normal
	 */
	set_user_nice(current, 0);

	trace_writeback_thread_start(bdi);

	while (!kthread_freezable_should_stop(NULL)) {
		/*
		 * Remove own delayed wake-up timer, since we are already awake
		 * and we'll take care of the periodic write-back.
		 */
		del_timer(&wb->wakeup_timer);

		pages_written = wb_do_writeback(wb, 0);

		trace_writeback_pages_written(pages_written);

		if (pages_written)
			wb->last_active = jiffies;

		set_current_state(TASK_INTERRUPTIBLE);
		if (!list_empty(&bdi->work_list) || kthread_should_stop()) {
			__set_current_state(TASK_RUNNING);
			continue;
		}

		if (wb_has_dirty_io(wb) && dirty_writeback_interval)
			schedule_timeout(msecs_to_jiffies(dirty_writeback_interval * 10));
		else {
			/*
			 * We have nothing to do, so can go sleep without any
			 * timeout and save power. When a work is queued or
			 * something is made dirty - we will be woken up.
			 */
			schedule();
		}
	}

	/* Flush any work that raced with us exiting */
	if (!list_empty(&bdi->work_list))
		wb_do_writeback(wb, 1);

	trace_writeback_thread_stop(bdi);
	return 0;
}
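
The sleep policy at the bottom of the loop above has three outcomes: loop again immediately, sleep with a timeout, or sleep until woken. A minimal user-space sketch of that choice follows. The function name demo_sleep_msecs and the DEMO_SLEEP_FOREVER constant are hypothetical, and the interval is assumed to be given in centiseconds, as the "* 10" conversion to milliseconds in the kernel code suggests.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical marker for "sleep without timeout" (schedule() in the kernel). */
#define DEMO_SLEEP_FOREVER (-1L)

/*
 * Return how long the flusher would sleep, in milliseconds:
 *   0                  -> do not sleep, more work is queued or a stop was requested
 *   > 0                -> periodic wake-up while dirty inodes remain
 *   DEMO_SLEEP_FOREVER -> nothing to do, wait for an explicit wake-up
 */
long demo_sleep_msecs(bool work_queued, bool should_stop,
                      bool has_dirty_io, unsigned long interval_centisecs)
{
    if (work_queued || should_stop)
        return 0;
    if (has_dirty_io && interval_centisecs != 0) {
        long msecs = (long)(interval_centisecs * 10);
        assert(msecs > 0);
        return msecs;
    }
    return DEMO_SLEEP_FOREVER;
}
```

An interval of zero disables the periodic wake-up, so the thread sleeps until woken even when dirty inodes remain.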

The main thing bdi_writeback_thread does is call wb_do_writeback().
/*
 * Retrieve work items and do the writeback they describe
 */
long wb_do_writeback(struct bdi_writeback *wb, int force_wait)
{
	struct backing_dev_info *bdi = wb->bdi;
	struct wb_writeback_work *work;
	long wrote = 0;

	set_bit(BDI_writeback_running, &wb->bdi->state);
	while ((work = get_next_work_item(bdi)) != NULL) {
		/*
		 * Override sync mode, in case we must wait for completion
		 * because this thread is exiting now.
		 */
		if (force_wait)
			work->sync_mode = WB_SYNC_ALL;

		trace_writeback_exec(bdi, work);

		wrote += wb_writeback(wb, work);

		/*
		 * Notify the caller of completion if this is a synchronous
		 * work item, otherwise just free it.
		 */
		if (work->done)
			complete(work->done);
		else
			kfree(work);
	}

	/*
	 * Check for periodic writeback, kupdated() style
	 */
	wrote += wb_check_old_data_flush(wb);
	wrote += wb_check_background_flush(wb);
	clear_bit(BDI_writeback_running, &wb->bdi->state);

	return wrote;
}

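wb_check_old_data_flush periodically checks for dirty pages and writes them back. By default it writes back pages that were dirtied more than 30 s ago, and it scans every 5 s. The scan period comes from dirty_writeback_interval, which is kept in centiseconds, hence the "* 10" when it is converted to milliseconds below.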
static long wb_check_old_data_flush(struct bdi_writeback *wb)
{
	unsigned long expired;
	long nr_pages;

	/*
	 * When set to zero, disable periodic writeback
	 */
	if (!dirty_writeback_interval)
		return 0;

	expired = wb->last_old_flush +
			msecs_to_jiffies(dirty_writeback_interval * 10);
	if (time_before(jiffies, expired))
		return 0;

	wb->last_old_flush = jiffies;
	nr_pages = get_nr_dirty_pages();

	if (nr_pages) {
		struct wb_writeback_work work = {
			.nr_pages	= nr_pages,
			.sync_mode	= WB_SYNC_NONE,
			.for_kupdate	= 1,
			.range_cyclic	= 1,
			.reason		= WB_REASON_PERIODIC,
		};

		return wb_writeback(wb, &work);
	}

	return 0;
}
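
The timing arithmetic behind the 5 s and 30 s defaults can be checked with a few lines of user-space C. This is a sketch under assumptions: the demo_ names are hypothetical, time is modelled in plain milliseconds instead of jiffies, and the two constants simply encode the defaults quoted in the text as centiseconds.

```c
#include <assert.h>

/* Defaults from the text, expressed in centiseconds: 5 s and 30 s. */
enum { DEMO_WRITEBACK_CENTISECS = 500, DEMO_EXPIRE_CENTISECS = 3000 };

/* Centiseconds to milliseconds, mirroring the "* 10" in the kernel code. */
unsigned long demo_centisecs_to_msecs(unsigned long centisecs)
{
    return centisecs * 10;
}

/* A periodic flush is due once a full interval has passed; 0 disables it. */
int demo_periodic_flush_due(unsigned long now_ms, unsigned long last_flush_ms,
                            unsigned long interval_centisecs)
{
    assert(now_ms >= last_flush_ms);
    if (interval_centisecs == 0)
        return 0;
    return now_ms - last_flush_ms >= demo_centisecs_to_msecs(interval_centisecs);
}

/* An inode counts as "old" once it has been dirty for the expiry age. */
int demo_inode_expired(unsigned long now_ms, unsigned long dirtied_ms,
                       unsigned long expire_centisecs)
{
    assert(now_ms >= dirtied_ms);
    return now_ms - dirtied_ms >= demo_centisecs_to_msecs(expire_centisecs);
}
```

With these defaults, a page dirtied at time t becomes eligible for periodic writeback at t + 30 s and is picked up by the first 5 s scan that runs after that point.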

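wb_check_background_flush starts writeback once dirty pages exceed the background threshold and keeps writing dirty pages back until the amount of dirty data has dropped below that threshold again.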
static long wb_check_background_flush(struct bdi_writeback *wb)
{
	if (over_bground_thresh(wb->bdi)) {
		struct wb_writeback_work work = {
			.nr_pages	= LONG_MAX,
			.sync_mode	= WB_SYNC_NONE,
			.for_background	= 1,
			.range_cyclic	= 1,
			.reason		= WB_REASON_BACKGROUND,
		};

		return wb_writeback(wb, &work);
	}

	return 0;
}
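
The "write until below the threshold" behaviour can be modelled with a simple loop. This is a hypothetical user-space sketch: demo_background_flush and its parameters are invented for illustration, the dirty-page count is a plain counter, and the fixed chunk size is an assumption standing in for whatever one pass of the real writeback manages to write.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Keep "writing" chunks of pages while the dirty count is above the
 * background threshold. Returns the number of pages written and updates
 * *dirty_pages in place.
 */
long demo_background_flush(unsigned long *dirty_pages,
                           unsigned long background_thresh,
                           unsigned long chunk)
{
    long wrote = 0;

    assert(dirty_pages != NULL && chunk > 0);

    while (*dirty_pages > background_thresh) {
        /* Never write more pages than are actually dirty. */
        unsigned long n = *dirty_pages < chunk ? *dirty_pages : chunk;

        *dirty_pages -= n;
        wrote += (long)n;
    }
    return wrote;
}
```

Because the condition is only re-checked between chunks, the loop can overshoot and leave the dirty count well below the threshold rather than exactly at it.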

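wb_check_background_flush and wb_check_old_data_flush only fill in the fields of a wb_writeback_work and then call wb_writeback, which is the function that actually performs the writeback. Every write to disk in the writeback mechanism goes through wb_writeback, which calls the filesystem-specific write functions to write the data back to disk.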

/*
 * Explicit flushing or periodic writeback of "old" data.
 *
 * Define "old": the first time one of an inode's pages is dirtied, we mark the
 * dirtying-time in the inode's address_space.  So this periodic writeback code
 * just walks the superblock inode list, writing back any inodes which are
 * older than a specific point in time.
 *
 * Try to run once per dirty_writeback_interval.  But if a writeback event
 * takes longer than a dirty_writeback_interval interval, then leave a
 * one-second gap.
 *
 * older_than_this takes precedence over nr_to_write.  So we'll only write back
 * all dirty pages if they are all attached to "old" mappings.
 */
static long wb_writeback(struct bdi_writeback *wb,
			 struct wb_writeback_work *work)
{
	unsigned long wb_start = jiffies;
	long nr_pages = work->nr_pages;
	unsigned long oldest_jif;
	struct inode *inode;
	long progress;

	oldest_jif = jiffies;
	work->older_than_this = &oldest_jif;

	spin_lock(&wb->list_lock);
	for (;;) {
		/*
		 * Stop writeback when nr_pages has been consumed
		 */
		if (work->nr_pages <= 0)
			break;

		/*
		 * Background writeout and kupdate-style writeback may
		 * run forever. Stop them if there is other work to do
		 * so that e.g. sync can proceed. They'll be restarted
		 * after the other works are all done.
		 */
		if ((work->for_background || work->for_kupdate) &&
		    !list_empty(&wb->bdi->work_list))
			break;

		/*
		 * For background writeout, stop when we are below the
		 * background dirty threshold
		 */
		if (work->for_background && !over_bground_thresh(wb->bdi))
			break;

		/*
		 * Kupdate and background works are special and we want to
		 * include all inodes that need writing. Livelock avoidance is
		 * handled by these works yielding to any other work so we are
		 * safe.
		 */
		if (work->for_kupdate) {
			oldest_jif = jiffies -
				msecs_to_jiffies(dirty_expire_interval * 10);
		} else if (work->for_background)
			oldest_jif = jiffies;

		trace_writeback_start(wb->bdi, work);
		if (list_empty(&wb->b_io))
			queue_io(wb, work);
		if (work->sb)
			progress = writeback_sb_inodes(work->sb, wb, work);
		else
			progress = __writeback_inodes_wb(wb, work);
		trace_writeback_written(wb->bdi, work);

		wb_update_bandwidth(wb, wb_start);

		/*
		 * Did we write something? Try for more
		 *
		 * Dirty inodes are moved to b_io for writeback in batches.
		 * The completion of the current batch does not necessarily
		 * mean the overall work is done. So we keep looping as long
		 * as made some progress on cleaning pages or inodes.
		 */
		if (progress)
			continue;
		/*
		 * No more inodes for IO, bail
		 */
		if (list_empty(&wb->b_more_io))
			break;
		/*
		 * Nothing written. Wait for some inode to
		 * become available for writeback. Otherwise
		 * we'll just busyloop.
		 */
		if (!list_empty(&wb->b_more_io))  {
			trace_writeback_wait(wb->bdi, work);
			inode = wb_inode(wb->b_more_io.prev);
			spin_lock(&inode->i_lock);
			spin_unlock(&wb->list_lock);
			/* This function drops i_lock... */
			inode_sleep_on_writeback(inode);
			spin_lock(&wb->list_lock);
		}
	}
	spin_unlock(&wb->list_lock);

	return nr_pages - work->nr_pages;
}
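
The three early-exit checks at the top of the wb_writeback loop are evaluated in a fixed order, which can be captured as a small predicate. This is a user-space sketch with hypothetical names (demo_wb_work, demo_stop, demo_check_stop); the two boolean parameters stand in for the work_list check and the over_bground_thresh() call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum demo_stop {
    DEMO_CONTINUE,          /* keep writing */
    DEMO_STOP_QUOTA,        /* nr_pages has been used up */
    DEMO_STOP_YIELD,        /* background/kupdate work yields to queued work */
    DEMO_STOP_BELOW_THRESH  /* background work: dirty level is low enough */
};

/* Hypothetical subset of struct wb_writeback_work. */
struct demo_wb_work {
    long nr_pages;
    bool for_kupdate;
    bool for_background;
};

/* Apply the checks in the same order as the loop in wb_writeback. */
enum demo_stop demo_check_stop(const struct demo_wb_work *work,
                               bool other_work_queued,
                               bool over_bground_thresh)
{
    assert(work != NULL);

    if (work->nr_pages <= 0)
        return DEMO_STOP_QUOTA;
    if ((work->for_background || work->for_kupdate) && other_work_queued)
        return DEMO_STOP_YIELD;
    if (work->for_background && !over_bground_thresh)
        return DEMO_STOP_BELOW_THRESH;
    return DEMO_CONTINUE;
}
```

The yield check explains why an explicit request such as sync is not starved: background and periodic work give way as soon as anything else is queued on the work list.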

Summary

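The writeback mechanism is fairly simple. At its core, a resident kernel thread assigns a writeback thread to each BDI object, and those threads flush the dirty pages held in the cache back to disk.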

Reference: http://www.linuxidc.com/Linux/2013-01/77576.htm
