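This article walks through the cgroup initialization process in detail.

The analysis is based on kernel version 3.10.0-862.el7.x86_64.

Before looking at initialization, we need to look at the structures that represent hierarchies and subsystems, as well as several important global variables.

A hierarchy is represented by struct cgroupfs_root, and a subsystem by struct cgroup_subsys.

The cgroupfs_root structure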

/*
 * A cgroupfs_root represents the root of a cgroup hierarchy, and may be
 * associated with a superblock to form an active hierarchy.  This is
 * internal to cgroup core.  Don't access directly from controllers.
 */
struct cgroupfs_root {
        struct super_block *sb;

        /*
         * The bitmask of subsystems intended to be attached to this
         * hierarchy
         */
        unsigned long subsys_mask;

        /* Unique id for this hierarchy. */
        int hierarchy_id;

        /* The bitmask of subsystems currently attached to this hierarchy */
        unsigned long actual_subsys_mask;

        /* A list running through the attached subsystems */
        struct list_head subsys_list;

        /* The root cgroup for this hierarchy */
        struct cgroup top_cgroup;

        /* Tracks how many cgroups are currently defined in hierarchy.*/
        int number_of_cgroups;

        /* A list running through the active hierarchies */
        struct list_head root_list;

        /* All cgroups on this root, cgroup_mutex protected */
        struct list_head allcg_list;

        /* Hierarchy-specific flags */
        unsigned long flags;

        /* IDs for cgroups in this hierarchy */
        struct ida cgroup_ida;

        /* The path to use for release notifications. */
        char release_agent_path[PATH_MAX];

        /* The name for this hierarchy - may be empty */
        char name[MAX_CGROUP_ROOT_NAMELEN];
};
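  • sb points to the filesystem superblock associated with this hierarchy.
  • subsys_mask and actual_subsys_mask are bitmasks of the subsystems that are to be attached to the hierarchy and of those currently attached to it, respectively; they are used while subsystems are being attached to the hierarchy.
  • hierarchy_id is the unique id of this hierarchy.
  • top_cgroup is the root cgroup of this hierarchy (embedded in the structure, not a pointer).
  • number_of_cgroups records the number of cgroups in this hierarchy.
  • root_list is an embedded list_head that links all hierarchies in the system into a list.
  • subsys_list is a list that links together the subsystems attached to this mount point.

The cgroup_subsys structure

A subsystem is represented by struct cgroup_subsys: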

/*
 * Control Group subsystem type.
 * See Documentation/cgroups/cgroups.txt for details
 */

struct cgroup_subsys {
        struct cgroup_subsys_state *(*css_alloc)(struct cgroup *cgrp);
        int (*css_online)(struct cgroup *cgrp);
        void (*css_offline)(struct cgroup *cgrp);
        void (*css_free)(struct cgroup *cgrp);

        int (*can_attach)(struct cgroup *cgrp, struct cgroup_taskset *tset);
        void (*cancel_attach)(struct cgroup *cgrp, struct cgroup_taskset *tset);
        void (*attach)(struct cgroup *cgrp, struct cgroup_taskset *tset);
        RH_KABI_REPLACE(void (*fork)(struct task_struct *task),
                        void (*fork)(struct task_struct *task, void *priv))
        void (*exit)(struct cgroup *cgrp, struct cgroup *old_cgrp,
                     struct task_struct *task);
        void (*bind)(struct cgroup *root);

        int subsys_id;
        int disabled;
        int early_init;
        /*
         * True if this subsys uses ID. ID is not available before cgroup_init()
         * (not available in early_init time.)
         */
        bool use_id;

        /*
         * If %false, this subsystem is properly hierarchical -
         * configuration, resource accounting and restriction on a parent
         * cgroup cover those of its children.  If %true, hierarchy support
         * is broken in some ways - some subsystems ignore hierarchy
         * completely while others are only implemented half-way.
         *
         * It's now disallowed to create nested cgroups if the subsystem is
         * broken and cgroup core will emit a warning message on such
         * cases.  Eventually, all subsystems will be made properly
         * hierarchical and this will go away.
         */
        bool broken_hierarchy;
        bool warned_broken_hierarchy;

#define MAX_CGROUP_TYPE_NAMELEN 32
        const char *name;

        /*
         * Link to parent, and list entry in parent's children.
         * Protected by cgroup_lock()
         */
        struct cgroupfs_root *root;
        struct list_head sibling;
        /* used when use_id == true */
        struct idr idr;
        spinlock_t id_lock;

        /* list of cftype_sets */
        struct list_head cftsets;

        /* base cftypes, automatically [de]registered with subsys itself */
        struct cftype *base_cftypes;
        struct cftype_set base_cftset;

        /* should be defined only by modular subsystems */
        struct module *module;

        RH_KABI_EXTEND(int (*can_fork)(struct task_struct *task, void **priv_p))
        RH_KABI_EXTEND(void (*cancel_fork)(struct task_struct *task, void *priv))
};	

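cgroup_subsys defines a set of operations that each subsystem implements according to its own needs. It plays the role of an abstract base class in C++, and the cgroup_subsys instance of each concrete subsystem is like a subclass that implements the relevant operations.

The same idea is used for cgroup_subsys_state. cgroup_subsys_state does not define any control information itself; it only holds what every subsystem needs in common, such as the cgroup that the cgroup_subsys_state belongs to. Each subsystem then defines its own control structure as needed and embeds a cgroup_subsys_state in it, so that macros such as container_of in the Linux kernel can recover the subsystem-specific structure from a cgroup_subsys_state pointer.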
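The embedding pattern described here can be sketched in userspace C. This is a minimal illustration only: demo_css, demo_pids_state, and the helper functions are hypothetical names invented for the example, and container_of is re-implemented with offsetof rather than taken from the kernel headers.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of(): step back from a
 * pointer to an embedded member to the structure that contains it. */
#define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical, trimmed-down "base class": state shared by every subsystem. */
struct demo_css {
        int cgroup_id;          /* stands in for the owning cgroup */
};

/* Hypothetical per-subsystem state that embeds the shared part. */
struct demo_pids_state {
        long limit;             /* subsystem-specific control data */
        struct demo_css css;    /* embedded common state */
};

/* Recover the subsystem-specific structure from the generic pointer. */
static struct demo_pids_state *demo_css_to_pids(struct demo_css *css)
{
        return container_of(css, struct demo_pids_state, css);
}

/* Round trip: hand out the generic pointer, then get the full state back. */
static long demo_limit_via_css(struct demo_pids_state *state)
{
        struct demo_css *generic = &state->css;   /* what the core code sees */
        return demo_css_to_pids(generic)->limit;  /* what the subsystem sees */
}
```

The core code only ever handles the embedded demo_css pointer, while the subsystem converts it back whenever it needs its own fields, which is the same division of labour the text describes for cgroup_subsys_state.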

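Several global variables

  • init_css_set is the default css_set. Before any other cgroup hierarchy is mounted, it is used by init and its child processes.
  • css_set_count is the number of css_set instances currently in the system.
  • rootnode is a dummy hierarchy. It has only one cgroup, and all processes belong to that cgroup.
  • dummytop is shorthand for a pointer to rootnode.top_cgroup.
  • roots is a list head that links all active (mounted) cgroupfs_root instances together.
  • root_count is the number of cgroupfs_root instances.
  • init_css_set_link is the cg_cgroup_link that links init_css_set and dummytop.
  • struct cgroup_subsys *subsys[CGROUP_SUBSYS_COUNT] is an array that holds every subsystem (cgroup_subsys).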

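cgroups initialization

During kernel boot, because different cgroup subsystems have different requirements, cgroup initialization is split into two parts:

  • cgroup_init_early
  • cgroup_init

cgroup_init_early

cgroup_init_early initializes the subsystems that must be set up as early as possible. These are typically cpuset, cpu, and cpuacct.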

/**
 * cgroup_init_early - cgroup initialization at system boot
 *
 * Initialize cgroups at system boot, and initialize any
 * subsystems that request early init.
 */
int __init cgroup_init_early(void)
{
        int i;
        atomic_set(&init_css_set.refcount, 1);
        INIT_LIST_HEAD(&init_css_set.cg_links);
        INIT_LIST_HEAD(&init_css_set.tasks);
        INIT_HLIST_NODE(&init_css_set.hlist);
        css_set_count = 1;
        init_cgroup_root(&rootnode);
        root_count = 1;
        init_task.cgroups = &init_css_set;

        init_css_set_link.cg = &init_css_set;
        init_css_set_link.cgrp = dummytop;
        list_add(&init_css_set_link.cgrp_link_list,
                 &rootnode.top_cgroup.css_sets);
        list_add(&init_css_set_link.cg_link_list,
                 &init_css_set.cg_links);

        for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
                struct cgroup_subsys *ss = subsys[i];

                /* at bootup time, we don't worry about modular subsystems */
                if (!ss || ss->module)
                        continue;
                /* basic sanity checks */
                BUG_ON(!ss->name);
                BUG_ON(strlen(ss->name) > MAX_CGROUP_TYPE_NAMELEN);
                BUG_ON(!ss->css_alloc);
                BUG_ON(!ss->css_free);
                if (ss->subsys_id != i) {
                        printk(KERN_ERR "cgroup: Subsys %s id == %d\n",
                               ss->name, ss->subsys_id);
                        BUG();
                }

                if (ss->early_init) /* initialized here only when early_init is 1 */
                        cgroup_init_subsys(ss);
        }
        return 0;
}

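The function does the following:

  • Initializes the members of init_css_set and the global variables css_set_count, root_count, rootnode, and init_css_set_link.
  • Initializes the cgroup subsystems that request early init (cpuset, cpu, and cpuacct).

cgroup_init

cgroup_init completes cgroup initialization. Its code is as follows: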

/**
 * cgroup_init - cgroup initialization
 *
 * Register cgroup filesystem and /proc file, and initialize
 * any subsystems that didn't request early init.
 */
int __init cgroup_init(void)
{       
        int err;
        int i;
        unsigned long key;
        
        err = bdi_init(&cgroup_backing_dev_info);
        if (err)
                return err;
        
        for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
                struct cgroup_subsys *ss = subsys[i];
                
                /* at bootup time, we don't worry about modular subsystems */
                if (!ss || ss->module)
                        continue;
                if (!ss->early_init)
                        cgroup_init_subsys(ss);
                if (ss->use_id)
                        cgroup_init_idr(ss, init_css_set.subsys[ss->subsys_id]);
        }
        
        /* Add init_css_set to the hash table */
        key = css_set_hash(init_css_set.subsys);
        hash_add(css_set_table, &init_css_set.hlist, key);
        BUG_ON(!init_root_id(&rootnode));
        
        err = sysfs_create_mount_point(fs_kobj, "cgroup");
        if (err)
                goto out;
        
        err = register_filesystem(&cgroup_fs_type);
        if (err < 0) {
                sysfs_remove_mount_point(fs_kobj, "cgroup");
                goto out;
        }
        
        proc_create("cgroups", 0, NULL, &proc_cgroupstats_operations);

out:    
        if (err)
                bdi_destroy(&cgroup_backing_dev_info);
        
        return err;
}

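It does the following:

  • Initializes cgroup_backing_dev_info.
  • Initializes the remaining subsystems, that is, those that did not request early init.
  • Performs the id-related initialization (cgroup_init_idr) for subsystems whose use_id is true.
  • Adds init_css_set, currently the only css_set, to the hash table css_set_table.
  • Creates the directory /sys/fs/cgroup.
  • Registers the cgroup filesystem type.
  • Creates /proc/cgroups.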
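The split between cgroup_init_early and cgroup_init comes down to two passes over the same subsystem table, each pass filtering on early_init. The following userspace sketch illustrates only that control flow. The demo_ names are hypothetical, and the early_init values for memory and pids are assumptions made for the example rather than values read from the kernel source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down subsystem descriptor. */
struct demo_subsys {
        const char *name;
        int early_init;         /* 1: initialize in the early phase */
        int initialized;        /* how many times init has run */
};

/* cpuset/cpu/cpuacct are early per the article; the others are assumed late. */
static struct demo_subsys demo_subsys_table[] = {
        { "cpuset",  1, 0 },
        { "cpu",     1, 0 },
        { "cpuacct", 1, 0 },
        { "memory",  0, 0 },
        { "pids",    0, 0 },
};

#define DEMO_SUBSYS_COUNT \
        (sizeof(demo_subsys_table) / sizeof(demo_subsys_table[0]))

/* Run one phase: want_early = 1 mimics cgroup_init_early(),
 * want_early = 0 mimics the loop in cgroup_init().
 * Returns the number of subsystems initialized in this phase. */
static int demo_init_phase(int want_early)
{
        int done = 0;

        for (size_t i = 0; i < DEMO_SUBSYS_COUNT; i++) {
                struct demo_subsys *ss = &demo_subsys_table[i];

                if (!!ss->early_init != !!want_early)
                        continue;       /* belongs to the other phase */
                ss->initialized++;
                done++;
        }
        return done;
}
```

Calling demo_init_phase(1) followed by demo_init_phase(0) touches every entry exactly once, which mirrors how the two kernel functions together cover all built-in subsystems.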

cgroup_init_subsys

Both functions above call cgroup_init_subsys, whose code is as follows:

static void __init cgroup_init_subsys(struct cgroup_subsys *ss)                                                                          
{                                                                                                                                        
        struct cgroup_subsys_state *css;                                                                                                 
                                                                                                                                         
        printk(KERN_INFO "Initializing cgroup subsys %s\n", ss->name);                                                                   
                                                                                                                                         
        mutex_lock(&cgroup_mutex);                                                                                                       
                                                                                                                                         
        /* init base cftset */                                                                                                           
        cgroup_init_cftsets(ss);                                                                                                         
                                                                                                                                         
        /* Create the top cgroup state for this subsystem */                                                                             
        list_add(&ss->sibling, &rootnode.subsys_list); /* added to rootnode.subsys_list only temporarily; it is moved away later */
        ss->root = &rootnode;                                                                                                            
        css = ss->css_alloc(dummytop);                                                                                                   
        /* We don't handle early failures gracefully */                                                                                  
        BUG_ON(IS_ERR(css));                                                                                                             
        init_cgroup_css(css, ss, dummytop);                                                                                              
                                                                                                                                         
        /* Update the init_css_set to contain a subsys                                                                                   
         * pointer to this state - since the subsystem is                                                                                
         * newly registered, all tasks and hence the                                                                                     
         * init_css_set is in the subsystem's top cgroup. */                                                                             
        init_css_set.subsys[ss->subsys_id] = css;                                                                                        
                                                                                                                                         
        need_forkexit_callback |= ss->fork || ss->exit;                                                                                  
                                                                                                                                         
        /* At system boot, before all subsystems have been                                                                               
         * registered, no tasks have been forked, so we don't                                                                            
         * need to invoke fork callbacks here. */                                                                                        
        BUG_ON(!list_empty(&init_task.tasks));                                                                                           
                                                                                                                                         
        BUG_ON(online_css(ss, dummytop));                                                                                                
                                                                                                                                         
        mutex_unlock(&cgroup_mutex);                                                                                                     
                                                                                                                                         
        /* this function shouldn't be used with modular subsystems, since they                                                           
         * need to register a subsys_id, among other things */                                                                           
        BUG_ON(ss->module);                                                                                                              
}                                                                                                                                        
    
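This function:

  • Initializes the cftsets of the cgroup_subsys.
  • Allocates the css.
  • Initializes the corresponding entries of the subsys arrays in dummytop and init_css_set.
  • Calls online_css.

Relationships among these data structures after initialization and before mount

The system I use is CentOS 7, where the cgroup subsystems are mounted by systemd by default. Since there is no way to stop systemd from mounting them, we manually umount these cgroup hierarchies after the system has booted, so that we can analyze the data structures in the kernel. The procedure is as follows: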

# # List the mounted cgroup hierarchies
# mount | grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_prio,net_cls)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/debug type cgroup (rw,nosuid,nodev,noexec,relatime,debug)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
# # Move all processes into the root cgroup of each hierarchy (shown here for the current shell)
# echo $$ >  /sys/fs/cgroup/systemd/cgroup.procs 
# echo $$ >  /sys/fs/cgroup/debug/cgroup.procs 
# echo $$ >  /sys/fs/cgroup/blkio/cgroup.procs 
# echo $$ >  /sys/fs/cgroup/cpu,cpuacct/cgroup.procs 
# echo $$ >  /sys/fs/cgroup/cpuset/cgroup.procs 
# echo $$ >  /sys/fs/cgroup/net_cls,net_prio/cgroup.procs 
# echo $$ >  /sys/fs/cgroup/devices/cgroup.procs 
# echo $$ >  /sys/fs/cgroup/hugetlb/cgroup.procs 
# echo $$ >  /sys/fs/cgroup/pids/cgroup.procs 
# echo $$ >  /sys/fs/cgroup/memory/cgroup.procs 
# echo $$ >  /sys/fs/cgroup/freezer/cgroup.procs 
# echo $$ >  /sys/fs/cgroup/perf_event/cgroup.procs 
# # Check which cgroup subsystems have child cgroups besides the root cgroup
# cat /proc/cgroups 
#subsys_name	hierarchy	num_cgroups	enabled
cpuset		6	1	1
debug		2	1	1
cpu		4	1	1
cpuacct		4	1	1
memory		10	1	1
devices		7	101	1
freezer		11	1	1
net_cls		5	1	1
blkio		3	1	1
perf_event	12	1	1
hugetlb		8	1	1
pids		9	106	1
net_prio	5	1	1 
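# # As shown above, the pids and devices subsystems already have child cgroups. Move the processes in those child cgroups
# # into the root cgroup and delete every cgroup except the root cgroup. Afterwards each subsystem reports a single cgroup, the root cgroup.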
# cat /proc/cgroups 
#subsys_name	hierarchy	num_cgroups	enabled
cpuset		6	1	1
debug		2	1	1
cpu		4	1	1
cpuacct		4	1	1
memory		10	1	1
devices		7	1	1
freezer		11	1	1
net_cls		5	1	1
blkio		3	1	1
perf_event	12	1	1
hugetlb		8	1	1
pids		9	1	1
net_prio	5	1	1
# # Then unmount these cgroup hierarchies
# umount /sys/fs/cgroup/net_cls,net_prio
# umount /sys/fs/cgroup/pids
# umount /sys/fs/cgroup/cpu,cpuacct
# umount /sys/fs/cgroup/freezer
# umount /sys/fs/cgroup/memory
# umount /sys/fs/cgroup/perf_event
# umount /sys/fs/cgroup/hugetlb
# umount /sys/fs/cgroup/debug
# umount /sys/fs/cgroup/blkio
# umount /sys/fs/cgroup/cpuset
# umount /sys/fs/cgroup/devices
# umount /sys/fs/cgroup/systemd
umount: /sys/fs/cgroup/systemd: target is busy.
        (In some cases useful info about processes that use
         the device is found by lsof(8) or fuser(1))
# mount | grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)

As we can see, every hierarchy was unmounted successfully except /sys/fs/cgroup/systemd. The output of /proc/cgroups is now:

# cat /proc/cgroups 
#subsys_name	hierarchy	num_cgroups	enabled
cpuset		0	1	1
debug		0	1	1
cpu		0	1	1
cpuacct		0	1	1
memory		0	1	1
devices		0	1	1
freezer		0	1	1
net_cls		0	1	1
blkio		0	1	1
perf_event	0	1	1
hugetlb		0	1	1
pids		0	1	1
net_prio	0	1	1

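As we can see, every subsystem now has hierarchy id 0 and exactly one cgroup, namely dummytop.

At this point we can use crash to examine how these data structures relate to each other:

  • Because the systemd cgroup hierarchy is still mounted, and rootnode, the dummy cgroupfs_root, also counts, there are two cgroupfs_root instances in total, so root_count = 2. Only the cgroupfs_root that belongs to systemd is linked onto the roots list.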
crash> p root_count
root_count = $1 = 2
crash> list -l cgroupfs_root.root_list -s cgroupfs_root.name -H roots
ffff9d7d56f94308
  name = "systemd\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
  • The subsys_list of rootnode should contain the 13 unmounted cgroup_subsys instances:
crash> struct -o cgroupfs_root.subsys_list rootnode
struct cgroupfs_root {
  [ffffffffa03ff020] struct list_head subsys_list;
}
crash> list  -l cgroup_subsys.sibling -s cgroup_subsys.name,subsys_id -H ffffffffa03ff020
ffffffff9ebf9f70
  name = 0xffffffff9ea9a8cc "pids"
  subsys_id = 11
ffffffff9ec83270
  name = 0xffffffff9eacb432 "devices"
  subsys_id = 5
ffffffff9ebfb590
  name = 0xffffffff9eaa6abb "cpuset"
  subsys_id = 0
ffffffff9ec98e10
  name = 0xffffffff9eac4e39 "blkio"
  subsys_id = 8
ffffffff9ebf7ff0
  name = 0xffffffff9ea99ec6 "debug"
  subsys_id = 1
ffffffff9ed2f730
  name = 0xffffffff9eaa7164 "hugetlb"
  subsys_id = 10
ffffffff9ec54230
  name = 0xffffffff9ea9e4ec "perf_event"
  subsys_id = 9
ffffffff9ed2f590
  name = 0xffffffff9eaaf4d8 "memory"
  subsys_id = 4
ffffffff9ebf9df0
  name = 0xffffffff9ea97da3 "freezer"
  subsys_id = 6
ffffffff9ebf1050
  name = 0xffffffff9ea97b50 "cpuacct"
  subsys_id = 3
ffffffff9ebeda90
  name = 0xffffffff9ea9baca "cpu"
  subsys_id = 2
ffffffff9ece34b0
  name = 0xffffffff9eb2289a "net_prio"
  subsys_id = 12
ffffffff9ece3dd0
  name = 0xffffffff9eb22a9e "net_cls"
  subsys_id = 7
  • The root member of each of these cgroup_subsys instances points to rootnode:
crash> struct -o cgroupfs_root.subsys_list rootnode
struct cgroupfs_root {
  [ffffffffa03ff020] struct list_head subsys_list;
}
crash> p &rootnode
$13 = (struct cgroupfs_root *) 0xffffffffa03ff000
crash>  list  -l cgroup_subsys.sibling -s cgroup_subsys.root  -H ffffffffa03ff020
ffffffff9ebf9f70
  root = 0xffffffffa03ff000
ffffffff9ec83270
  root = 0xffffffffa03ff000
ffffffff9ebfb590
  root = 0xffffffffa03ff000
ffffffff9ec98e10
  root = 0xffffffffa03ff000
ffffffff9ebf7ff0
  root = 0xffffffffa03ff000
ffffffff9ed2f730
  root = 0xffffffffa03ff000
ffffffff9ec54230
  root = 0xffffffffa03ff000
ffffffff9ed2f590
  root = 0xffffffffa03ff000
ffffffff9ebf9df0
  root = 0xffffffffa03ff000
ffffffff9ebf1050
  root = 0xffffffffa03ff000
ffffffff9ebeda90
  root = 0xffffffffa03ff000
ffffffff9ece34b0
  root = 0xffffffffa03ff000
ffffffff9ece3dd0
  root = 0xffffffffa03ff000
  • At this moment there is only one css_set in the system, init_css_set, and the css_set of every process points to it:
crash> p css_set_count
css_set_count = $1 = 1
  • The subsys members of dummytop and init_css_set point to the same css objects:
crash> p &rootnode.top_cgroup
$18 = (struct cgroup *) 0xffffffffa03ff030
crash> cgroup.subsys 0xffffffffa03ff030
  subsys = {0xffffffff9ebfb2a0, 0xffff9d7d5a913a00, 0xffffffff9f5ac1c0, 0xffffffff9ebf1560, 0xffff9d7d5a96d000, 0xffff9d7d5a919480, 0xffff9d7d5a919540, 0xffff9d7d5a913a80, 0xffffffff9ec990c0, 0xffff9d7d5a913b00, 0xffff9d7d5a919600, 0xffff9d7d5a9196c0, 0xffff9d7d5a913b80}
crash> css_set.subsys init_css_set
  subsys = {0xffffffff9ebfb2a0, 0xffff9d7d5a913a00, 0xffffffff9f5ac1c0, 0xffffffff9ebf1560, 0xffff9d7d5a96d000, 0xffff9d7d5a919480, 0xffff9d7d5a919540, 0xffff9d7d5a913a80, 0xffffffff9ec990c0, 0xffff9d7d5a913b00, 0xffff9d7d5a919600, 0xffff9d7d5a9196c0, 0xffff9d7d5a913b80}

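So after cgroup_init_early and cgroup_init have run, the relationships among these data structures are as shown in the figure below:

[Figure: relationships among the cgroup data structures after initialization]

Analysis of the /proc/cgroups implementation

cgroup_init, the cgroup initialization function, registers the /proc/cgroups interface with the following call: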

int __init cgroup_init(void) 
{
...
...
		proc_create("cgroups", 0, NULL, &proc_cgroupstats_operations);
...
...
}

proc_cgroupstats_operations is implemented as follows:

/* Display information about each subsystem and each hierarchy */
static int proc_cgroupstats_show(struct seq_file *m, void *v)
{
        int i;

        seq_puts(m, "#subsys_name\thierarchy\tnum_cgroups\tenabled\n");
        /*   
         * ideally we don't want subsystems moving around while we do this.
         * cgroup_mutex is also necessary to guarantee an atomic snapshot of
         * subsys/hierarchy state.
         */
        mutex_lock(&cgroup_mutex);
        for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
                struct cgroup_subsys *ss = subsys[i];
                if (ss == NULL) /* ss may be NULL */
                        continue;
                seq_printf(m, "%s\t%d\t%d\t%d\n",
                           ss->name, ss->root->hierarchy_id,
                           ss->root->number_of_cgroups, !ss->disabled);
        }    
        mutex_unlock(&cgroup_mutex);
        return 0;
}

static int cgroupstats_open(struct inode *inode, struct file *file)
{
        return single_open(file, proc_cgroupstats_show, NULL);
}

static const struct file_operations proc_cgroupstats_operations = {
        .open = cgroupstats_open,
        .read = seq_read,
        .llseek = seq_lseek,
        .release = single_release,
};

As shown above, all of this information comes from the subsys array and from each entry's root member (of type cgroupfs_root). An example of /proc/cgroups:

~  # cat /proc/cgroups
#subsys_name    hierarchy       num_cgroups     enabled
cpuset  	3       2       1
debug   	4       3       1
cpu     	5       40      1
cpuacct 	5       40      1
memory  	2       44      1
devices 	11      42      1
freezer 	10      2       1
net_cls 	12      2       1
blkio   	8       42      1
perf_event      6       2       1
hugetlb 	7       2       1
pids    	9       109     1
net_prio        12      2       1

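From the output above:

  • The first column is the name of the cgroup subsystem.
  • The second column, hierarchy, starts at 2. So which hierarchy has id 1? hierarchy_id is allocated dynamically. When the Linux system boots, a hierarchy named systemd with no subsystems attached is mounted first, so the hierarchy_id of systemd is 1.
  • The third column is the number of cgroups in the hierarchy that the subsystem is attached to.
  • The fourth column indicates whether the subsystem is enabled.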
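Since each row of /proc/cgroups is written with the format "%s\t%d\t%d\t%d", a userspace tool can read it back with a matching sscanf. The following is a minimal sketch, not part of the kernel; struct demo_cgroups_row and demo_parse_cgroups_line are hypothetical names introduced for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One parsed row of /proc/cgroups. */
struct demo_cgroups_row {
        char name[64];          /* subsys_name */
        int hierarchy;          /* hierarchy id, 0 when not mounted */
        int num_cgroups;        /* cgroups in that hierarchy */
        int enabled;            /* 1 if the subsystem is enabled */
};

/* Parse one line. Returns 0 on success and -1 for the '#' header line
 * or for a line that does not carry all four columns. */
static int demo_parse_cgroups_line(const char *line,
                                   struct demo_cgroups_row *row)
{
        if (line[0] == '#')
                return -1;
        /* Whitespace in the format matches the tab separators. */
        if (sscanf(line, "%63s %d %d %d", row->name, &row->hierarchy,
                   &row->num_cgroups, &row->enabled) != 4)
                return -1;
        return 0;
}
```

A caller would typically open /proc/cgroups with fopen, read it line by line with fgets, and pass each line to demo_parse_cgroups_line, skipping those for which it returns -1.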

In the kernel, the subsys array holds the information about the different cgroup subsystems in the system.

#define SUBSYS(_x) [_x ## _subsys_id] = &_x ## _subsys,
#define IS_SUBSYS_ENABLED(option) IS_BUILTIN(option)
#define ENABLE_NETPRIO_NOW
static struct cgroup_subsys *subsys[CGROUP_SUBSYS_COUNT] = {
#include <linux/cgroup_subsys.h>
};
#undef ENABLE_NETPRIO_NOW

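CGROUP_SUBSYS_COUNT is the number of cgroup subsystems supported by the system, including both those built into the kernel and those built as modules.

Note that, because of how IS_SUBSYS_ENABLED is defined, only subsystems built into the kernel are initialized in this array.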
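The SUBSYS() macro relies on the "X-macro" technique: one list of subsystem names is expanded several times, each time with a different definition of SUBSYS. The sketch below reproduces the idea in userspace. DEMO_SUBSYS_LIST is a hypothetical stand-in for the contents of linux/cgroup_subsys.h, and struct demo_subsys is a trimmed-down placeholder for struct cgroup_subsys.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Trimmed-down placeholder for struct cgroup_subsys. */
struct demo_subsys {
        const char *name;
        int subsys_id;
};

/* Hypothetical stand-in for the SUBSYS(...) lines in the included header. */
#define DEMO_SUBSYS_LIST \
        SUBSYS(cpuset)   \
        SUBSYS(debug)    \
        SUBSYS(cpu)

/* Pass 1: expand the list into enum constants (cpuset_subsys_id = 0, ...). */
#define SUBSYS(_x) _x ## _subsys_id,
enum demo_subsys_id {
        DEMO_SUBSYS_LIST
        DEMO_SUBSYS_COUNT
};
#undef SUBSYS

/* Pass 2: expand the list into one descriptor object per subsystem. */
#define SUBSYS(_x) \
        static struct demo_subsys _x ## _subsys = { #_x, _x ## _subsys_id };
DEMO_SUBSYS_LIST
#undef SUBSYS

/* Pass 3: same macro shape as in the kernel listing: a designated
 * initializer places each descriptor at the index given by its id. */
#define SUBSYS(_x) [_x ## _subsys_id] = &_x ## _subsys,
static struct demo_subsys *demo_subsys[DEMO_SUBSYS_COUNT] = {
        DEMO_SUBSYS_LIST
};
#undef SUBSYS
```

Because the array index and the subsys_id both come from the same list, the invariant checked in cgroup_init_early (ss->subsys_id equals the array index) holds by construction in this sketch.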

With the crash tool we can inspect the cgroup subsystems in the kernel:

crash> p subsys[0].name
$5 = 0xffffffffa3ea6abb "cpuset"
crash> p subsys[1].name
$6 = 0xffffffffa3e99ec6 "debug"
crash> p subsys[2].name
$7 = 0xffffffffa3e9baca "cpu"
crash> p subsys[3].name
$8 = 0xffffffffa3e97b50 "cpuacct"
crash> p subsys[4].name
$9 = 0xffffffffa3eaf4d8 "memory"
crash> p subsys[5].name
$10 = 0xffffffffa3ecb432 "devices"
crash> p subsys[6].name
$11 = 0xffffffffa3e97da3 "freezer"
crash> p subsys[7].name
$12 = 0xffffffffa3f22a9e "net_cls"
crash> p subsys[8].name
$13 = 0xffffffffa3ec4e39 "blkio"
crash> p subsys[9].name
$14 = 0xffffffffa3e9e4ec "perf_event"
crash> p subsys[10].name
$15 = 0xffffffffa3ea7164 "hugetlb"
crash> p subsys[11].name
$16 = 0xffffffffa3e9a8cc "pids"
crash> p subsys[12].name
$17 = 0xffffffffa3f2289a "net_prio"

In the kernel, every cgroup subsystem defines a struct cgroup_subsys instance. For the pids subsystem, it is defined as:

struct cgroup_subsys pids_subsys = { 
        .name           = "pids",
        .subsys_id      = pids_subsys_id,
        .css_alloc      = pids_css_alloc,
        .css_free       = pids_css_free,
        .can_attach     = pids_can_attach,
        .cancel_attach  = pids_cancel_attach,
        .can_fork       = pids_can_fork,
        .cancel_fork    = pids_cancel_fork,
        .fork           = pids_fork,
        .exit           = pids_exit,
        .base_cftypes   = pids_files,
};

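Using crash to examine the relationships among some cgroup data structures

NOTE: This is the default state after CentOS 7 has booted, where every cgroup subsystem has already been mounted by systemd.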

crash> struct -o cgroupfs_root rootnode
struct cgroupfs_root {
  [ffffffffa57ff000] struct super_block *sb;
  [ffffffffa57ff008] unsigned long subsys_mask;
  [ffffffffa57ff010] int hierarchy_id;
  [ffffffffa57ff018] unsigned long actual_subsys_mask;
  [ffffffffa57ff020] struct list_head subsys_list;
  [ffffffffa57ff030] struct cgroup top_cgroup;
  [ffffffffa57ff300] int number_of_cgroups;
  [ffffffffa57ff308] struct list_head root_list;
  [ffffffffa57ff318] struct list_head allcg_list;
  [ffffffffa57ff328] unsigned long flags;
  [ffffffffa57ff330] struct ida cgroup_ida;
  [ffffffffa57ff3a8] char release_agent_path[4096];
  [ffffffffa58003a8] char name[64];
}
SIZE: 5096
crash> list -l cgroup.allcg_node -s cgroup.name,id,root,count -H ffffffffa57ff318
ffffffffa57ff108
  name = 0xffffffffa3ff9740
  id = 0
  root = 0xffffffffa57ff000
  count = {
    counter = 47
  }
crash> cgroup_name.name 0xffffffffa3ff9740
  name = 0xffffffffa3ff9750 "/"
crash> p rootnode.number_of_cgroups
$1 = 1

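As shown above, the allcg_list member of rootnode links together all cgroups that belong to this cgroupfs_root. Normally only one cgroup belongs to rootnode, namely dummytop, whose name is /.

Once many cgroup hierarchies have been mounted, the global variable roots serves as the list head that links all hierarchies in the system together.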

crash> list -l cgroupfs_root.root_list  -s cgroupfs_root.hierarchy_id,name,actual_subsys_mask -H roots
ffff9047143be308
  hierarchy_id = 12
  name = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
  actual_subsys_mask = 4224
ffff9047143b8308
  hierarchy_id = 11
  name = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
  actual_subsys_mask = 32
ffff9047143ba308
  hierarchy_id = 10
  name = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
  actual_subsys_mask = 64
ffff9047143bc308
  hierarchy_id = 9
  name = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
  actual_subsys_mask = 2048
ffff9047143c0308
  hierarchy_id = 8
  name = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
  actual_subsys_mask = 256
ffff9047143c2308
  hierarchy_id = 7
  name = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
  actual_subsys_mask = 1024
ffff9047143c4308
  hierarchy_id = 6
  name = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
  actual_subsys_mask = 512
ffff9047143c6308
  hierarchy_id = 5
  name = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
  actual_subsys_mask = 12
ffff90471418c308
  hierarchy_id = 4
  name = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
  actual_subsys_mask = 2
ffff90471418e308
  hierarchy_id = 3
  name = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
  actual_subsys_mask = 1
ffff904714188308
  hierarchy_id = 2
  name = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
  actual_subsys_mask = 16
ffff90471418a308
  hierarchy_id = 1
  name = "systemd\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
  actual_subsys_mask = 0

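As we can see, the roots list links together all cgroupfs_root instances in the system. Note that this list does not include the cgroupfs_root rootnode: the hierarchy_id of rootnode is 0, and the list has no node whose hierarchy_id is 0.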
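The actual_subsys_mask values in the crash output are bitmasks indexed by subsys_id, so they can be decoded back into subsystem names. The sketch below assumes the subsys_id ordering printed by crash in this article (cpuset = 0 through net_prio = 12); demo_decode_mask is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Subsystem names indexed by subsys_id, as listed by crash in this article. */
static const char *const demo_subsys_names[] = {
        "cpuset", "debug", "cpu", "cpuacct", "memory", "devices", "freezer",
        "net_cls", "blkio", "perf_event", "hugetlb", "pids", "net_prio",
};

#define DEMO_SUBSYS_COUNT \
        (sizeof(demo_subsys_names) / sizeof(demo_subsys_names[0]))

/* Write the comma-separated names of the subsystems whose bit is set in
 * mask into buf. Returns the number of names written completely; decoding
 * stops early if buf is too small. */
static int demo_decode_mask(unsigned long mask, char *buf, size_t len)
{
        int found = 0;
        size_t used = 0;

        if (len == 0)
                return 0;
        buf[0] = '\0';
        for (size_t i = 0; i < DEMO_SUBSYS_COUNT; i++) {
                int n;

                if (!(mask & (1UL << i)))
                        continue;
                n = snprintf(buf + used, len - used, "%s%s",
                             found ? "," : "", demo_subsys_names[i]);
                if (n < 0 || (size_t)n >= len - used)
                        break;          /* buffer too small: stop early */
                used += (size_t)n;
                found++;
        }
        return found;
}
```

For example, the mask 4224 of hierarchy 12 is 4096 + 128, that is bits 12 and 7, which decodes to net_cls and net_prio; the mask 12 of hierarchy 5 is bits 2 and 3, cpu and cpuacct. Both agree with the /proc/cgroups listing earlier in the article.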

crash> p rootnode.hierarchy_id
$1 = 0

About dummytop:

crash> struct -o cgroupfs_root.top_cgroup  rootnode
struct cgroupfs_root {
  [ffffffffa57ff030] struct cgroup top_cgroup;
}
crash> struct -o cgroup ffffffffa57ff030
struct cgroup {
  [ffffffffa57ff030] unsigned long flags;
  [ffffffffa57ff038] atomic_t count;
  [ffffffffa57ff03c] int id;
  [ffffffffa57ff040] struct list_head sibling;
  [ffffffffa57ff050] struct list_head children;
  [ffffffffa57ff060] struct list_head files;
  [ffffffffa57ff070] struct cgroup *parent;
  [ffffffffa57ff078] struct dentry *dentry;
  [ffffffffa57ff080] struct cgroup_name *name;
  [ffffffffa57ff088] struct cgroup_subsys_state *subsys[13];
  [ffffffffa57ff0f0] struct cgroupfs_root *root;
  [ffffffffa57ff0f8] struct list_head css_sets;
  [ffffffffa57ff108] struct list_head allcg_node;
  [ffffffffa57ff118] struct list_head cft_q_node;
  [ffffffffa57ff128] struct list_head release_list;
  [ffffffffa57ff138] struct list_head pidlists;
  [ffffffffa57ff148] struct mutex pidlist_mutex;
  [ffffffffa57ff1f0] struct callback_head callback_head;
  [ffffffffa57ff200] struct work_struct free_work;
  [ffffffffa57ff250] struct list_head event_list;
  [ffffffffa57ff260] spinlock_t event_list_lock;
  [ffffffffa57ff2a8] struct simple_xattrs xattrs;
}
SIZE: 720
crash> list -H ffffffffa57ff040
(empty)
crash> list -H ffffffffa57ff050
(empty)
crash> list -H ffffffffa57ff060
(empty)
crash> p rootnode.top_cgroup.subsys
$2 = {0xffffffffa3ffb2a0, 0xffff90471e913a00, 0xffffffffa49ac1c0, 0xffffffffa3ff1560, 0xffff90471e96d000, 0xffff90471e919480, 0xffff90471e919540, 0xffff90471e913a80, 0xffffffffa40990c0, 0xffff90471e913b00, 0xffff90471e919600, 0xffff90471e9196c0, 0xffff90471e913b80}
crash> p rootnode.top_cgroup.name
$3 = (struct cgroup_name *) 0xffffffffa3ff9740
crash> cgroup_name.name 0xffffffffa3ff9740
  name = 0xffffffffa3ff9750 "/"

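dummytop is a special cgroup. It has no sibling or child cgroups, and its subsys array covers every controller subsystem, that is, no entry of rootnode.top_cgroup.subsys is NULL. These css objects are created in cgroup_init_subsys. At initialization, the cgroup member of each css points to dummytop; this is adjusted later when the individual cgroup subsystems are mounted. On mount, a new cgroupfs_root is created and the corresponding subsys is bound to it. On umount or remount, the subsystems that need to be removed are moved back to the rootnode cgroupfs_root.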
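The movement of a subsystem between rootnode and a mounted hierarchy can be pictured as moving one list entry between two subsys_list heads and updating the root pointer. The following userspace sketch shows only that bookkeeping under simplifying assumptions: the list helpers are minimal re-implementations, and demo_root, demo_subsys, and demo_rebind are hypothetical names, not the kernel's rebinding code.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace copy of a circular doubly linked list. */
struct list_head { struct list_head *next, *prev; };

static void demo_list_init(struct list_head *h) { h->next = h->prev = h; }

/* Insert e right after the head h, like the kernel's list_add(). */
static void demo_list_add(struct list_head *e, struct list_head *h)
{
        e->next = h->next;
        e->prev = h;
        h->next->prev = e;
        h->next = e;
}

static void demo_list_del(struct list_head *e)
{
        e->prev->next = e->next;
        e->next->prev = e->prev;
        demo_list_init(e);
}

static int demo_list_len(const struct list_head *h)
{
        int n = 0;
        for (const struct list_head *p = h->next; p != h; p = p->next)
                n++;
        return n;
}

/* Hypothetical, trimmed-down versions of the two structures. */
struct demo_root { struct list_head subsys_list; };
struct demo_subsys {
        const char *name;
        struct demo_root *root;         /* hierarchy currently bound to */
        struct list_head sibling;       /* entry in root->subsys_list */
};

static void demo_root_init(struct demo_root *r)
{
        demo_list_init(&r->subsys_list);
}

/* Boot: attach the subsystem to the dummy root. */
static void demo_subsys_init(struct demo_subsys *ss, struct demo_root *dummy)
{
        demo_list_add(&ss->sibling, &dummy->subsys_list);
        ss->root = dummy;
}

/* mount: move to a new root; umount: move back to the dummy root. */
static void demo_rebind(struct demo_subsys *ss, struct demo_root *new_root)
{
        demo_list_del(&ss->sibling);
        demo_list_add(&ss->sibling, &new_root->subsys_list);
        ss->root = new_root;
}
```

In this model a subsystem is always on exactly one subsys_list, which matches the crash output above: after the umount steps, all 13 entries are back on the list of rootnode and every root member points to it.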

use_id in cgroup subsys

Only the memory cgroup has use_id set to true.