Rami Rosen: Understanding CGroup v2

2024-03-08 02:10

This article is reposted from: https://lwn.net/Articles/679786/ Original title:

Understanding the new control groups API

After many years, the Linux kernel's control group (cgroup) infrastructure is undergoing a rewrite that makes changes to the API in a number of places.  Understanding the changes is important to developers, particularly those working with containerization projects.  This article will look at the new features of cgroups v2, which were recently declared production-ready in kernel 4.5.  It is based on a talk I gave at the recent Netdev 1.1 conference in Seville, Spain.  The video [YouTube] for that talk is now available online.

Background

The cgroup subsystem and associated controllers handle management and accounting of various system resources like CPU, memory, I/O, and more.  Together with the Linux namespace subsystem, which is a bit older (having started around 2002) and is considered a bit more mature (apart, perhaps, from user namespaces, which still raise discussions), these subsystems form the basis of Linux containers. Currently, most projects involving Linux containers, like Docker, LXC, OpenVZ, Kubernetes, and others, are based on both of them.

The development of the Linux cgroup subsystem started in 2006 at Google, led primarily by Rohit Seth and Paul Menage. Initially, the project was called "Process Containers", but later on the name was changed to "Control Groups", to avoid confusion with Linux containers, and nowadays everybody calls them "cgroups" for short.

There are currently 12 cgroup controllers in cgroups v1; all—except one—have existed for several years.  The new addition is the PIDs controller, developed by Aditya Kali and merged in kernel 4.3.  It allows restricting the number of processes created inside a control group, and it can be used as an anti-fork-bomb solution. The PID space in Linux consists of, at a maximum, about four million PIDs (PID_MAX_LIMIT). Given today's RAM capacities, this limit could easily and quite quickly be exhausted by a fork bomb from within a single container. The PIDs controller is supported by both cgroups v1 and cgroups v2.

Over the years, there was a lot of criticism about the implementation of cgroups, which seems to present a number of inconsistencies and a lot of chaos.  For example, when creating subgroups (cgroups within cgroups), several cgroup controllers propagate parameters to their immediate subgroups, while other controllers do not.  Or, for a different example, some controllers use interface files (such as the cpuset controller's clone_children) that appear in all controllers even though they only affect one.

As maintainer Tejun Heo himself has admitted [YouTube], "design followed implementation", "different decisions were taken for different controllers", and "sometimes too much flexibility causes a hindrance". In an LWN article from 2012, it was said that "control groups are one of those features that kernel developers love to hate."

Migration

The cgroups v2 interface was declared non-experimental in kernel 4.5. However, the cgroups v1 subsystem was not removed from the kernel, so, after the system boots, both cgroups v1 and cgroups v2 are enabled by default.  You can use a mixture of both of them, although you cannot use the same type of controller in both cgroups v1 and in cgroups v2 at the same time.

It is worth mentioning that there is a patch that adds a kernel command-line option for disabling cgroups v1 controllers (cgroup_no_v1), which was merged for kernel 4.6.
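As a rough illustration, the shell sketch below checks whether a kernel command line carries the cgroup_no_v1 option. The function name `cgroup_no_v1_value` is invented for this example. It reads /proc/cmdline by default, and the optional file argument exists only so that it can be tried on a sample string. The sample value `memory,pids` in the usage note is an assumption about how the option is typically written (a comma-separated controller list), not something taken from the article.

```shell
# Print the value of cgroup_no_v1= from a kernel command line, if present.
# $1: file holding the command line (defaults to /proc/cmdline).
# Returns 0 and prints the value when the option is set, 1 otherwise.
cgroup_no_v1_value() {
    cmdline_file=${1:-/proc/cmdline}
    # Put each whitespace-separated option on its own line, then keep
    # only the text after "cgroup_no_v1=" from the first match.
    value=$(tr -s ' \t' '\n\n' < "$cmdline_file" \
        | sed -n 's/^cgroup_no_v1=//p' | head -n 1)
    [ -n "$value" ] || return 1
    printf '%s\n' "$value"
}
```

On a system booted with, say, `cgroup_no_v1=memory,pids`, calling `cgroup_no_v1_value` with no argument would print `memory,pids`; on a system without the option it prints nothing and returns a non-zero status.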

Kernel support for cgroups v1 will probably still exist for at least several more years, as long as there are user-space applications that use it—quite like what we had in the past with iptables and ipchains, and what we observe now with iptables and nftables. Some user-space applications have already started migration to cgroups v2—for example, systemd and CGManager.

Both versions of cgroups are controlled by way of a synthetic filesystem that gets mounted by the user.  During the last three years or so, a special mount option was available in cgroups v1 (__DEVEL__sane_behavior). This mount option enabled using certain experimental features, some of which formed the basis of cgroups v2 (the option was removed in kernel 4.5, however).  For example, using this mount option forces the use of the unified hierarchy mode, in which controller management is handled similarly to how it is done in cgroups v2. The __DEVEL__sane_behavior mount option is mutually exclusive with the mount options that were removed in cgroups v2, like noprefix, clone_children, release_agent, and more.

Systemd started to use cgroups for service management rather than for resource management many years ago. Each systemd service is mapped to a separate control group. However, the migration of systemd to cgroups v2 is still partial, as it uses the __DEVEL__sane_behavior mount option. Also, in CGManager, current support for cgroups v2 is partial: it is available only when using Upstart, and not when using systemd.

Currently, three cgroup controllers are available in cgroups v2: I/O, memory, and PIDs. There are already patches and discussions in the cgroups mailing list about adding the CPU controller as well.

There are also interesting patches adding support for resource groups, posted just last week by Heo. In cgroups v1, you could assign threads of the same process to different cgroups, but this is not possible in cgroups v2. As a result, in-process resource-management abilities, like the ability to control CPU cycle distribution hierarchically between the threads of a process, are missing, as all of the threads belong to a single cgroup. With the suggested resource groups (rgroups) infrastructure, this ability can be implemented as a natural extension of the setpriority() system call.

Details of the cgroups v2 interface

Mounting cgroups v2 is done as follows:

    mount -t cgroup2 none $MOUNT_POINT

Note that the type argument (following -t) specified has changed; cgroups v1 used -t cgroup.  As in cgroups v1, the mount point can be anywhere in the filesystem.  But, in contrast, there are no mount options at all in cgroups v2. One could use mount options to enable controllers in cgroups v1, but in cgroups v2 this is done differently, as we will see below. Creation of new subgroups in cgroups v2 is done with mkdir groupName, and removal is done with rmdir groupName.
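A small sketch of how a script might check for an existing cgroups v2 mount before mounting one itself. The helper name `is_cgroup2_mount` is made up for illustration, and it assumes the usual /proc/mounts layout (device, mount point, filesystem type, options). The second argument is only there so the function can be exercised against a sample table.

```shell
# Succeed if $1 is listed as a cgroup2 mount point.
# $2: mounts table to read (defaults to /proc/mounts).
is_cgroup2_mount() {
    mount_point=$1
    mounts_file=${2:-/proc/mounts}
    # Field 2 is the mount point and field 3 the filesystem type.
    awk -v mp="$mount_point" \
        '$2 == mp && $3 == "cgroup2" { found = 1 } END { exit !found }' \
        "$mounts_file"
}

# Typical use as root (group1 is a placeholder name):
#   is_cgroup2_mount /sys/fs/cgroup2 || mount -t cgroup2 none /sys/fs/cgroup2
#   mkdir /sys/fs/cgroup2/group1    # create a subgroup
#   rmdir /sys/fs/cgroup2/group1    # remove it again
```

The commented commands at the end are the same mount, mkdir, and rmdir operations described above; they are left as comments because they need root privileges.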

After mounting cgroups v2, a cgroup root object is created, with three cgroup core interface files beneath it. For example, if cgroups v2 is mounted on /sys/fs/cgroup2, the following files are created under that directory:

  • cgroup.controllers – This shows the supported cgroup controllers. All v2 controllers not bound to a v1 hierarchy are automatically bound to the v2 hierarchy, and show up in cgroup.controllers of the cgroup root object.

  • cgroup.procs – When the cgroup filesystem is first mounted, cgroup.procs in the root cgroup contains the list of PIDs of all processes in the system, excluding zombie processes. For each newly created subgroup, cgroup.procs is empty, as no process is attached to the newly created group. Attaching a process to a subgroup is done by writing its PID into the subgroup's cgroup.procs.

  • cgroup.subtree_control – This holds the controllers that are enabled for the immediate subgroups. This entry is empty just after mount, as no controllers are enabled by default. Enabling and disabling controllers in the immediate subgroups of a parent is done only by writing into its cgroup.subtree_control file. So, for example, enabling the memory controller is done by:

        echo "+memory" > /sys/fs/cgroup2/cgroup.subtree_control

    and disabling it is done by:

        echo "-memory" > /sys/fs/cgroup2/cgroup.subtree_control

    You can enable/disable more than one controller in the same command line.
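The two write operations described in the list above (enabling controllers and attaching a process) could be wrapped in small helpers along the following lines. The function names `enable_controllers` and `attach_pid` are hypothetical. The sketch assumes the caller passes a directory inside a mounted cgroups v2 hierarchy and has permission to write there; on a real hierarchy the kernel may reject the write, for example when a requested controller is not listed in cgroup.controllers.

```shell
# Enable the listed controllers for the immediate subgroups of a cgroup.
# $1: cgroup directory; remaining arguments: controller names.
enable_controllers() {
    cgroup_dir=$1
    shift
    request=
    for controller in "$@"; do
        # Build a single request such as "+memory +pids".
        request="${request:+$request }+$controller"
    done
    printf '%s\n' "$request" > "$cgroup_dir/cgroup.subtree_control"
}

# Attach a process to a cgroup by writing its PID into cgroup.procs.
# $1: cgroup directory; $2: PID to move.
attach_pid() {
    printf '%s\n' "$2" > "$1/cgroup.procs"
}
```

For example, `enable_controllers /sys/fs/cgroup2 memory pids` writes `+memory +pids` in one go, which is the "more than one controller in the same command line" case.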

These three cgroup core interface files are also created for each newly created subgroup. Apart from these three files, a cgroup core interface file called cgroup.events is created. This interface file is unique to non-root subgroups.

The cgroup.events file reflects the number of processes attached to the subgroup, and consists of one item, "populated: value". The value is 0 when there are no processes attached to that subgroup or its descendants, and 1 when there are one or more processes attached to that subgroup or its descendants.
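Reading the "populated" item from a script might look like the sketch below. The helper name `cgroup_is_populated` is invented here, and the code assumes the file holds a line of the form `populated 1` (key, space, value), which is the flat keyed layout the kernel documentation describes for cgroups v2 interface files.

```shell
# Succeed if the cgroup in $1 (or any of its descendants) has processes.
cgroup_is_populated() {
    # Pick the value that follows the "populated" key.
    populated=$(awk '$1 == "populated" { print $2 }' "$1/cgroup.events")
    [ "$populated" = "1" ]
}
```

A caller could then write something like `cgroup_is_populated /sys/fs/cgroup2/group1 || rmdir /sys/fs/cgroup2/group1` to remove a subgroup once it is empty (the path is only an example).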

As mentioned, subgroup creation is similar to how it is done in cgroups v1. But in cgroups v2, you can only create subgroups in a single hierarchy, under the cgroups v2 mount point.  When a new subgroup is created, the value of the "populated" entry in cgroup.events is 0, as you would expect, as there is no process yet attached to this newly created subgroup.

You can monitor events in this subgroup by calling poll(), inotify(), or dnotify() from user space. Thus, you can be notified when cgroup.events changes, which can be used to determine when the last process attached to a subgroup terminates or when the first process is attached to that subgroup. This mechanism is much more efficient in terms of performance than the parallel mechanism in cgroups v1, the release agent.

It is worth mentioning that this notification mechanism can also be used by controller-specific interface files. For example, the cgroups v2 memory controller has an interface file called memory.events, which enables monitoring memory events like out-of-memory (OOM) in a similar way.
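As a sketch of how a monitoring script could read such a file after being notified, the helper below prints the counter for a single event. The name `memory_event_count` is hypothetical, and the code assumes memory.events uses one `key value` pair per line (for example `oom 1`).

```shell
# Print the counter for one event from a cgroup's memory.events file.
# $1: cgroup directory; $2: event name, for example "oom".
# Returns non-zero if the event name does not appear in the file.
memory_event_count() {
    awk -v key="$2" \
        '$1 == key { print $2; found = 1 } END { exit !found }' \
        "$1/memory.events"
}
```

Comparing the value returned before and after a notification tells the script whether that particular event occurred in between.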

When a new subgroup is created, controller-specific files are created for each enabled controller in this subgroup. For example, when the PIDs controller is enabled, two interface files are created: pids.max and pids.current, for setting a limit on the number of processes forked in that subgroup, and for accounting of the number of processes in that subgroup.
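The two PIDs controller files can be combined in a short script, sketched here under the assumption that the PIDs controller is already enabled for the subgroup. The helper names `set_pids_limit` and `pids_headroom` are made up for this example; the literal value `max` in pids.max means that no limit is set.

```shell
# Set the maximum number of processes for the cgroup in $1.
# $2: a number, or the literal word "max" for no limit.
set_pids_limit() {
    printf '%s\n' "$2" > "$1/pids.max"
}

# Print how many more processes the cgroup in $1 may create,
# or "unlimited" when pids.max holds "max".
pids_headroom() {
    limit=$(cat "$1/pids.max")
    current=$(cat "$1/pids.current")
    if [ "$limit" = "max" ]; then
        echo unlimited
    else
        echo $((limit - current))
    fi
}
```

A container manager guarding against fork bombs might call `set_pids_limit` once when the subgroup is created and consult `pids_headroom` when deciding whether to start more work in it.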

Let's take a look at two diagrams illustrating what we just described.  The following sequence mounts cgroups v2 on /cgroup2 and creates a subgroup called "group1", creates two subgroups of group1 ("nested1" and "nested2"), then enables the PIDs controller in group1:

    mount -t cgroup2 nodev /cgroup2
    mkdir /cgroup2/group1
    mkdir /cgroup2/group1/nested1
    mkdir /cgroup2/group1/nested2
    echo +pids > /cgroup2/cgroup.subtree_control

The following diagram illustrates the status after running this sequence.  We can see that the two PIDs controller interface files, pids.max and pids.current, were created for group1.


Now, if we run:

    echo +pids > /cgroup2/group1/cgroup.subtree_control

this will enable the PIDs controller in group1's immediate subgroups, nested1 and nested2.  By writing +pids into the subtree_control of the root cgroup, we only enable the PIDs controller in the root's direct child subgroups and no other descendants.  As a result, the PIDs-controller–specific files (pids.max and pids.current) are created for both these newly-created subgroups.

The subsequent diagram shows the status after enabling the PIDs controller on group1.


The no-internal-process rule

Unlike in cgroups v1, in cgroups v2 you can attach processes only to leaves. This means that you cannot attach a process to an internal subgroup if it has any controller enabled. The reason behind this rule is that processes in a given subgroup competing for resources with threads attached to its parent group create significant implementation difficulties.

The following diagram illustrates this.


(Note: when you write 0 into cgroup.procs, this will write the PID of the process performing the writing into the file.)

The documentation discusses the no-internal-process rule in more detail.
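One way a script could respect this rule before trying to attach itself is sketched below. The helper names `may_hold_processes` and `attach_self` are hypothetical, and the check is a simplification for non-root subgroups: it treats an empty cgroup.subtree_control as "no controller enabled for the children". The kernel enforces the rule itself, so the check only avoids a failed write.

```shell
# Succeed if the non-root cgroup in $1 has no controllers enabled for
# its children, so that processes may be attached to it.
may_hold_processes() {
    # Strip whitespace; anything left is the name of an enabled controller.
    enabled=$(tr -d ' \n' < "$1/cgroup.subtree_control")
    [ -z "$enabled" ]
}

# Move the current shell into the cgroup in $1 if the rule allows it.
# Writing 0 to cgroup.procs attaches the process doing the write.
attach_self() {
    if may_hold_processes "$1"; then
        echo 0 > "$1/cgroup.procs"
    else
        echo "refusing: $1 has controllers enabled for its subgroups" >&2
        return 1
    fi
}
```

With the hierarchy from the earlier example, `attach_self /cgroup2/group1/nested1` would be expected to go ahead, while `attach_self /cgroup2/group1` would be refused once +pids has been written to group1's cgroup.subtree_control.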

In cgroups v1, a process can belong to many subgroups, if those subgroups are in different hierarchies with different controllers attached.  But, because belonging to more than one subgroup made it difficult to disambiguate subgroup membership, in cgroups v2, a process can belong only to a single subgroup.

We will look at an example when this restriction is important. In cgroups v1, there are two network controllers: net_prio (written by Neil Horman) and net_cls (by Thomas Graf). These controllers were not extended to support cgroups v2. Instead, the xt_cgroup netfilter matching module was extended to support matching by a cgroup path. For example, the following iptables rule matches traffic that was generated by a socket created in a process attached to mygroup (or its descendants):

    iptables -A OUTPUT -m cgroup --path mygroup -j LOG

Such a match is not possible in cgroups v1, because sometimes a process can belong to more than a single subgroup. In cgroups v2, this problem does not exist, because of the single-subgroup rule.

Summary

Work is ongoing; in addition to the resource-group patches mentioned earlier, there are patches for a new RDMA cgroup controller that are currently in the pipeline. This patch set allows resource accounting and limit enforcement on a per-cgroup, per-RDMA-device basis. These patches are in the post-RFC phase, and are in the ninth iteration as of this writing; it seems likely that they are to be merged soon.

As we have seen, the new interface of cgroups v2, which was recently declared stable in the kernel, has several advantages over cgroups v1, such as its notification-to-user-space mechanism. Although the cgroups v2 implementation is still in its initial stages, it seems to be much better organized and more consistent than cgroups v1.


