Docker Series: Understanding cgroups

cikenerd, published 2019-06-28 16:38 / 1812 reads

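Preface

To understand Docker, it helps to look at five areas: namespaces, cgroups, union file systems, the runtime (runC), and networking. This series takes each of them in turn:

Docker Series: Understanding namespaces

Docker Series: Understanding cgroups

Docker Series: Understanding UnionFS

Docker Series: Understanding runC

Docker Series: Understanding network modes

Namespaces provide isolation. cgroups provide resource limits. The union file system handles layered image storage and management. runC is the runtime; it follows the OCI interface and is generally built on libcontainer. Networking covers Docker's single-host networking and multi-host communication modes.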

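Introduction to cgroups

What are cgroups?

Cgroup is short for control group. It is a Linux kernel feature for limiting and isolating the system resources used by a group of processes, in other words resource QoS. The resources covered are mainly CPU, memory, block I/O, and network bandwidth. Cgroups entered the mainline kernel in 2.6.24, and the major distributions now enable them by default.

Cgroups provide four main functions: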

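Resource limitation: cgroups can cap the total resources used by a group of processes. For example, you can set an upper bound on the memory an application may use at run time; once the group exceeds that quota, an OOM (Out of Memory) condition is raised.

Prioritization: the number of CPU time slices and the amount of disk I/O bandwidth allocated to a group effectively control the priority of its processes.

Accounting: cgroups can measure resource usage, such as CPU time consumed and memory used. This is well suited to billing.

Control: cgroups can suspend and resume a group of processes.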

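The three components of cgroups

cgroup (control group). A cgroup is a mechanism for managing processes in groups. One cgroup contains a set of processes, and Linux subsystem parameters can be configured on it, which ties a set of processes to a set of subsystem parameters.

subsystem. A subsystem is a module that controls one kind of resource. Subsystems are covered in detail below.

hierarchy. A hierarchy arranges a set of cgroups into a tree, and each such tree is one hierarchy. The tree structure gives cgroups inheritance. Suppose the system uses cgroup1 to limit the CPU usage of a group of scheduled jobs, and one of those jobs, a periodic log dump, also needs its disk I/O limited. To avoid affecting the other processes, you can create cgroup2 as a child of cgroup1 and limit disk I/O there. cgroup2 then inherits the CPU limit from cgroup1 and adds a disk I/O limit, without affecting the other processes in cgroup1.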

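cgroups subsystems

The subsystems implemented in cgroups, and what each one does:

devices: controls access permissions for devices.

cpuset: assigns specific CPUs and memory nodes.

cpu: controls CPU utilization.

cpuacct: accounts for CPU usage.

memory: caps memory usage.

freezer: freezes (suspends) the processes in a cgroup.

net_cls: works with tc (traffic control) to limit network bandwidth.

net_prio: sets the priority of a process's network traffic.

hugetlb: limits HugeTLB usage.

perf_event: lets the perf tool do performance monitoring per cgroup.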

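Each subsystem's directory contains more detailed settings, for example:
cpu:

CPU resources can be controlled under two policies. One is the Completely Fair Scheduler (CFS) policy, which offers both hard quotas and proportional shares. The other is the Real-Time Scheduler policy, which gives real-time processes a fixed run time per period. All time values are configured in microseconds (µs), indicated by us in the file names.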
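
As a rough illustration of the quota form of CFS control, the sketch below converts a quota/period pair into a number of CPUs. It assumes the cgroup v1 file names cpu.cfs_quota_us and cpu.cfs_period_us, which the article does not name explicitly; cfsCPULimit is a hypothetical helper, not part of any library.

```go
package main

import "fmt"

// cfsCPULimit converts a CFS quota/period pair, both in microseconds
// (assumed to come from cpu.cfs_quota_us and cpu.cfs_period_us), into a
// number of CPUs. A negative quota means "no limit", which is reported
// through the second return value.
func cfsCPULimit(quotaUs, periodUs int64) (cpus float64, limited bool) {
	if quotaUs < 0 || periodUs <= 0 {
		return 0, false
	}
	return float64(quotaUs) / float64(periodUs), true
}

func main() {
	// 50 ms of CPU time in every 100 ms period: half of one CPU.
	cpus, limited := cfsCPULimit(50000, 100000)
	fmt.Println("cpus:", cpus, "limited:", limited)

	// A quota of -1 leaves the group unlimited.
	cpus, limited = cfsCPULimit(-1, 100000)
	fmt.Println("cpus:", cpus, "limited:", limited)
}
```

A quota larger than the period simply yields more than one CPU, which is how a multi-core allowance would be expressed in this model.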

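cpuset (CPU pinning):

Besides limiting how much CPU is used, cgroups can pin tasks to specific CPUs so that they run only on those CPUs. This is what the cpuset subsystem does. In addition to CPUs, it can also pin memory nodes.
Before adding a task to a cpuset's tasks file, you must set the cpuset.cpus and cpuset.mems parameters.

cpuset.cpus: the CPUs that tasks in the cgroup may use, written as a comma-separated list in which a hyphen (-) denotes a range. For example, 0-2,7 means CPU cores 0, 1, 2, and 7.

cpuset.mems: the memory nodes that tasks in the cgroup may use, in the same format as cpuset.cpus.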
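
The list format described above can be made concrete with a small parser. This is only a sketch of the comma-and-range syntax; parseCPUList is a hypothetical helper and does not reflect how the kernel or runc parses the value.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUList expands a cpuset-style list such as "0-2,7" into the
// individual CPU (or memory node) numbers it names.
func parseCPUList(s string) ([]int, error) {
	s = strings.TrimSpace(s)
	var cpus []int
	if s == "" {
		return cpus, nil
	}
	for _, part := range strings.Split(s, ",") {
		// Each entry is either a single number or a "first-last" range.
		lo, hi, isRange := strings.Cut(strings.TrimSpace(part), "-")
		first, err := strconv.Atoi(lo)
		if err != nil {
			return nil, fmt.Errorf("invalid CPU list entry %q", part)
		}
		last := first
		if isRange {
			last, err = strconv.Atoi(hi)
			if err != nil || last < first {
				return nil, fmt.Errorf("invalid CPU range %q", part)
			}
		}
		for c := first; c <= last; c++ {
			cpus = append(cpus, c)
		}
	}
	return cpus, nil
}

func main() {
	cpus, err := parseCPUList("0-2,7")
	fmt.Println(cpus, err)
}
```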

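memory:

memory.limit_in_bytes: the hard limit on maximum memory usage. The value accepts the unit suffixes k, m, and g; -1 means no limit.

memory.soft_limit_in_bytes: the soft limit. It is only meaningful when set lower than the hard limit. The format is the same as above. When overall memory is tight, the memory a task can obtain is held within the soft limit, so that not too many processes starve for memory. As this shows, adding memory limits does not mean there is no longer competition for resources.

memory.memsw.limit_in_bytes: the limit on the sum of memory and swap usage. The format is the same as above.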
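
To make the value format concrete, here is a minimal sketch that turns a limit string with a k, m, or g suffix into a byte count, treating -1 as "no limit". parseMemoryLimit is a hypothetical helper; it assumes binary multiples (1k = 1024 bytes) and ignores overflow for very large values.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMemoryLimit converts strings such as "512m" or "1g" into bytes.
// It returns -1 for the special value "-1" (no limit).
func parseMemoryLimit(s string) (int64, error) {
	s = strings.TrimSpace(s)
	if s == "" {
		return 0, fmt.Errorf("empty memory limit")
	}
	if s == "-1" {
		return -1, nil
	}
	mult := int64(1)
	switch s[len(s)-1] {
	case 'k', 'K':
		mult = 1 << 10
	case 'm', 'M':
		mult = 1 << 20
	case 'g', 'G':
		mult = 1 << 30
	}
	digits := s
	if mult != 1 {
		digits = s[:len(s)-1] // drop the unit suffix
	}
	n, err := strconv.ParseInt(digits, 10, 64)
	if err != nil || n < 0 {
		return 0, fmt.Errorf("invalid memory limit %q", s)
	}
	return n * mult, nil
}

func main() {
	for _, v := range []string{"512m", "1g", "-1", "abc"} {
		n, err := parseMemoryLimit(v)
		fmt.Println(v, n, err)
	}
}
```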

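The parameters used for monitoring and statistics deserve a separate mention, for example the ones cAdvisor collects.

memory.usage_in_bytes: reports the total memory currently used by the processes in the cgroup, in bytes.

memory.max_usage_in_bytes: reports the peak memory usage of the processes in the cgroup.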
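
A monitoring agent could read these counters roughly as follows. This is a sketch, assuming a cgroup v1 memory hierarchy; the default directory is only a placeholder, and readCounter and parseCounter are hypothetical helpers rather than anything taken from cAdvisor.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseCounter parses the content of a single-value cgroup file such as
// memory.usage_in_bytes, which holds one decimal number and a newline.
func parseCounter(content string) (uint64, error) {
	return strconv.ParseUint(strings.TrimSpace(content), 10, 64)
}

// readCounter reads one single-value file from a cgroup directory.
func readCounter(cgroupDir, file string) (uint64, error) {
	data, err := os.ReadFile(filepath.Join(cgroupDir, file))
	if err != nil {
		return 0, err
	}
	return parseCounter(string(data))
}

func main() {
	// Placeholder directory (an assumption); pass a container's memory
	// cgroup directory as the first argument to inspect that container.
	dir := "/sys/fs/cgroup/memory"
	if len(os.Args) > 1 {
		dir = os.Args[1]
	}
	for _, f := range []string{"memory.usage_in_bytes", "memory.max_usage_in_bytes"} {
		v, err := readCounter(dir, f)
		if err != nil {
			fmt.Println(f, "unavailable:", err)
			continue
		}
		fmt.Printf("%s = %d bytes (%.2f MiB)\n", f, v, float64(v)/(1<<20))
	}
}
```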

How Docker uses cgroups

Create a container

# Run a container that will spawn 300 processes.
docker run cirocosta/stress pid  -n 300
Starting to spawn 300 blocking children
[1] Waiting for SIGINT

# Open another window and see that we have 300
# PIDS
docker stats
CONTAINER      …   MEM USAGE / LIMIT          PIDS
a730051832     …   21.02MiB / 1.951GiB     300

Verify that Docker has placed this container in some cgroups

# Let's get the ID of the container. Docker uses that ID
# to name things on the host, so we can probably use it to
# find the cgroup created for the container
# under the parent docker cgroup
docker ps
CONTAINER ID        IMAGE               COMMAND       
a730051832e7        cirocosta/stress    "pid -n 300"  

 # Having the prefix in hand, let's search for it under the
 # mountpoint for cgroups in our system
 find  /sys/fs/cgroup/ -name "a730051832e7*"
 
/sys/fs/cgroup/cpu,cpuacct/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
/sys/fs/cgroup/cpuset/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
/sys/fs/cgroup/devices/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
/sys/fs/cgroup/pids/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
/sys/fs/cgroup/freezer/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
/sys/fs/cgroup/perf_event/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
/sys/fs/cgroup/blkio/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
/sys/fs/cgroup/memory/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
/sys/fs/cgroup/net_cls,net_prio/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
/sys/fs/cgroup/hugetlb/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
/sys/fs/cgroup/systemd/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959

# There they are! Docker creates a control group with the name
# being the exact ID of the container under all the subsystems.

# What can we discover from this inspection? We can look at the
# subsystem that we want to place constraints on (PIDs), for instance:

 tree /sys/fs/cgroup/pids/docker/a730051832e7d...
/sys/fs/cgroup/pids/docker/a73005183...
├── cgroup.clone_children
├── cgroup.procs
├── notify_on_release
├── pids.current
├── pids.events
├── pids.max
└── tasks

# Which means that, if we want to know how many PIDs are in use right
# now we can look at "pids.current", to know the limits, "pids.max" and
# to know which processes have been assigned to this control group,
# look at tasks. Let's do it:
cat /sys/fs/cgroup/pids/docker/a730...c660a75a959/tasks 
5329
5371
5372
5373
5374
5375
5376
5377
(...)
# continues until the 300th entry - as we have 300 processes in this container

# 300 pids
cat /sys/fs/cgroup/pids/docker/a730051832e7d7764...9/pids.current
300

# no max set
cat /sys/fs/cgroup/pids/docker/a730051832e7d77.../pids.max 
max
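
The same inspection can be done from a program. The sketch below only parses file contents like those shown in the session above, where pids.max holds either a number or the literal max; parsePidsMax and countTasks are hypothetical helpers, and the sample strings are taken from the listing rather than read from a live system.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePidsMax parses the content of pids.max. The literal "max" means
// that no limit is set; otherwise the file holds a decimal number.
func parsePidsMax(content string) (limit int64, unlimited bool, err error) {
	s := strings.TrimSpace(content)
	if s == "max" {
		return 0, true, nil
	}
	limit, err = strconv.ParseInt(s, 10, 64)
	return limit, false, err
}

// countTasks counts the PIDs listed in the content of a tasks file,
// which holds one PID per line.
func countTasks(content string) int {
	return len(strings.Fields(content))
}

func main() {
	// Sample contents copied from the shell session above.
	limit, unlimited, err := parsePidsMax("max\n")
	fmt.Println(limit, unlimited, err)
	fmt.Println(countTasks("5329\n5371\n5372\n5373\n"))
}
```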

When installing k8s (Kubernetes), the following error comes up often:

create kubelet: misconfiguration: kubelet cgroup driver: "cgroupfs"
is different from docker cgroup driver: "systemd"

The message is already clear: Docker and the kubelet are configured with different cgroup drivers. Docker supports two drivers, systemd and cgroupfs. The runc code makes this easier to see.
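
One common way to resolve the mismatch is to set Docker's driver to the one the kubelet expects, through the exec-opts key of the daemon configuration file (usually /etc/docker/daemon.json). The sketch below only generates that JSON fragment; daemonConfig and daemonJSON are hypothetical names, and merging the result into an existing configuration and restarting Docker are left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// daemonConfig models only the one daemon.json key this sketch needs.
type daemonConfig struct {
	ExecOpts []string `json:"exec-opts"`
}

// daemonJSON renders a daemon.json fragment that selects the given
// cgroup driver ("systemd" or "cgroupfs").
func daemonJSON(driver string) (string, error) {
	cfg := daemonConfig{ExecOpts: []string{"native.cgroupdriver=" + driver}}
	out, err := json.Marshal(cfg)
	return string(out), err
}

func main() {
	s, err := daemonJSON("systemd")
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```

The alternative is to leave Docker alone and change the kubelet's cgroup driver setting instead; either way, both sides have to name the same driver.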

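cgroups can only limit CPU usage; they cannot guarantee it. In other words, --cpuset-cpus makes a container run on the specified CPUs or cores, but it does not ensure exclusive use of them. --cpu-shares is a relative value that only takes effect when CPU is scarce. When there is enough CPU, each container gets as much as it needs; when there is not, CPU is divided among the containers according to the specified weights.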
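
The proportional behaviour under contention is easy to work through numerically. The sketch below assumes every container is fully busy, so each one receives its shares divided by the sum of all shares; shareFractions is a hypothetical helper for the arithmetic, not a model of the scheduler itself.

```go
package main

import "fmt"

// shareFractions returns the fraction of CPU time each container would
// receive when all of them compete for the CPU at the same time.
func shareFractions(shares []uint64) []float64 {
	var total uint64
	for _, s := range shares {
		total += s
	}
	out := make([]float64, len(shares))
	if total == 0 {
		return out
	}
	for i, s := range shares {
		out[i] = float64(s) / float64(total)
	}
	return out
}

func main() {
	// Three containers started with --cpu-shares 1024, 1024, and 2048.
	fmt.Println(shareFractions([]uint64{1024, 1024, 2048}))
}
```

If one of the containers is idle, its portion is available to the others, which is why the weights only matter when CPU is scarce.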

For memory, cgroups can cap the maximum amount a container uses. The -m option sets that maximum.

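Code walkthrough

The cgroups code lives in runc, and it is worth reading in detail. Here we only give an outline.
A container is created by the factory's create method. On the cgroup side, the factory provides option functions that, depending on the configured cgroup driver, set how the CgroupsManager is constructed. There are two implementations, systemd and cgroupfs: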

// SystemdCgroups is an options func to configure a LinuxFactory to return
// containers that use systemd to create and manage cgroups.
func SystemdCgroups(l *LinuxFactory) error {
    l.NewCgroupsManager = func(config *configs.Cgroup, paths map[string]string) cgroups.Manager {
        return &systemd.Manager{
            Cgroups: config,
            Paths:   paths,
        }
    }
    return nil
}

// Cgroupfs is an options func to configure a LinuxFactory to return containers
// that use the native cgroups filesystem implementation to create and manage
// cgroups.
func Cgroupfs(l *LinuxFactory) error {
    l.NewCgroupsManager = func(config *configs.Cgroup, paths map[string]string) cgroups.Manager {
        return &fs.Manager{
            Cgroups: config,
            Paths:   paths,
        }
    }
    return nil
}
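
The two functions above follow the functional options pattern: each option receives the factory and swaps in a constructor. The toy program below shows how such options might be applied. All types here (Manager, Factory, Option, NewFactory) are simplified stand-ins written for this illustration, not runc's real definitions, and defaulting to cgroupfs is an assumption of the sketch.

```go
package main

import "fmt"

// Manager is a toy stand-in for a cgroup manager.
type Manager interface {
	Driver() string
}

type systemdManager struct{ paths map[string]string }
type fsManager struct{ paths map[string]string }

func (systemdManager) Driver() string { return "systemd" }
func (fsManager) Driver() string      { return "cgroupfs" }

// Factory holds a constructor that options are allowed to replace.
type Factory struct {
	NewManager func(paths map[string]string) Manager
}

// Option configures a Factory, in the style of SystemdCgroups and Cgroupfs.
type Option func(*Factory) error

func WithSystemd(f *Factory) error {
	f.NewManager = func(paths map[string]string) Manager {
		return systemdManager{paths: paths}
	}
	return nil
}

func WithCgroupfs(f *Factory) error {
	f.NewManager = func(paths map[string]string) Manager {
		return fsManager{paths: paths}
	}
	return nil
}

// NewFactory starts from the cgroupfs constructor (an assumption of this
// sketch) and then applies the caller's options in order.
func NewFactory(opts ...Option) (*Factory, error) {
	f := &Factory{}
	if err := WithCgroupfs(f); err != nil {
		return nil, err
	}
	for _, opt := range opts {
		if err := opt(f); err != nil {
			return nil, err
		}
	}
	return f, nil
}

func main() {
	f, err := NewFactory(WithSystemd)
	if err != nil {
		panic(err)
	}
	fmt.Println(f.NewManager(nil).Driver())
}
```

The benefit of the pattern is that the code creating containers only ever calls the stored constructor and never needs to know which driver was selected.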

The cgroup manager is abstracted as an interface:

type Manager interface {
    // Applies cgroup configuration to the process with the specified pid
    Apply(pid int) error

    // Returns the PIDs inside the cgroup set
    GetPids() ([]int, error)

    // Returns the PIDs inside the cgroup set & all sub-cgroups
    GetAllPids() ([]int, error)

    // Returns statistics for the cgroup set
    GetStats() (*Stats, error)

    // Toggles the freezer cgroup according with specified state
    Freeze(state configs.FreezerState) error

    // Destroys the cgroup set
    Destroy() error

    // The option func SystemdCgroups() and Cgroupfs() require following attributes:
    //     Paths   map[string]string
    //     Cgroups *configs.Cgroup
    // Paths maps cgroup subsystem to path at which it is mounted.
    // Cgroups specifies specific cgroup settings for the various subsystems

    // Returns cgroup paths to save in a state file and to be able to
    // restore the object later.
    GetPaths() map[string]string

    // Sets the cgroup as configured.
    Set(container *configs.Config) error
}

While creating a container, the methods of this interface are called. For example,
in container_linux.go:

func (c *linuxContainer) Set(config configs.Config) error {
    c.m.Lock()
    defer c.m.Unlock()
    status, err := c.currentStatus()
    if err != nil {
        return err
    }
    ...
    if err := c.cgroupManager.Set(&config); err != nil {
        // Set configs back
        if err2 := c.cgroupManager.Set(c.config); err2 != nil {
            logrus.Warnf("Setting back cgroup configs failed due to error: %v, your state.json and actual configs might be inconsistent.", err2)
        }
        return err
    }
...
}

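Next, we focus on the fs implementation.

In fs, essentially every subsystem has its own file, as shown in the figure above.

We look at memory.go, the memory subsystem; the other subsystems are similar.
The key method: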

func (s *MemoryGroup) Apply(d *cgroupData) (err error) {
    path, err := d.path("memory")
    if err != nil && !cgroups.IsNotFound(err) {
        return err
    } else if path == "" {
        return nil
    }
    if memoryAssigned(d.config) {
        if _, err := os.Stat(path); os.IsNotExist(err) {
            if err := os.MkdirAll(path, 0755); err != nil {
                return err
            }
            // Only enable kernel memory accounting when this cgroup
            // is created by libcontainer, otherwise we might get
            // error when people use `cgroupsPath` to join an existing
            // cgroup whose kernel memory is not initialized.
            if err := EnableKernelMemoryAccounting(path); err != nil {
                return err
            }
        }
    }
    defer func() {
        if err != nil {
            os.RemoveAll(path)
        }
    }()

    // We need to join memory cgroup after set memory limits, because
    // kmem.limit_in_bytes can only be set when the cgroup is empty.
    _, err = d.join("memory")
    if err != nil && !cgroups.IsNotFound(err) {
        return err
    }
    return nil
}

d.path("memory") looks up the cgroup's path in the memory subsystem:

func (raw *cgroupData) path(subsystem string) (string, error) {
    mnt, err := cgroups.FindCgroupMountpoint(subsystem)
    // If we didn't mount the subsystem, there is no point we make the path.
    if err != nil {
        return "", err
    }

    // If the cgroup name/path is absolute do not look relative to the cgroup of the init process.
    if filepath.IsAbs(raw.innerPath) {
        // Sometimes subsystems can be mounted together as "cpu,cpuacct".
        return filepath.Join(raw.root, filepath.Base(mnt), raw.innerPath), nil
    }

    // Use GetOwnCgroupPath instead of GetInitCgroupPath, because the creating
    // process could in container and shared pid namespace with host, and
    // /proc/1/cgroup could point to whole other world of cgroups.
    parentPath, err := cgroups.GetOwnCgroupPath(subsystem)
    if err != nil {
        return "", err
    }

    return filepath.Join(parentPath, raw.innerPath), nil
}
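
The two branches of path can be traced with plain strings. The sketch below reproduces only the joining logic, with the mountpoint and the parent path passed in as parameters instead of being discovered from the system; resolvePath and the abc container name are hypothetical.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// resolvePath mirrors the joining logic of path: an absolute inner path
// is placed under root plus the base name of the subsystem mountpoint,
// while a relative one is placed under the calling process's own cgroup.
func resolvePath(root, mountpoint, parentPath, innerPath string) string {
	if filepath.IsAbs(innerPath) {
		// Base keeps combined mounts such as "cpu,cpuacct" intact.
		return filepath.Join(root, filepath.Base(mountpoint), innerPath)
	}
	return filepath.Join(parentPath, innerPath)
}

func main() {
	// Absolute inner path, as with the /docker/<id> cgroups shown earlier.
	fmt.Println(resolvePath("/sys/fs/cgroup", "/sys/fs/cgroup/cpu,cpuacct", "", "/docker/abc"))

	// Relative inner path, resolved against a parent cgroup.
	fmt.Println(resolvePath("/sys/fs/cgroup", "/sys/fs/cgroup/memory",
		"/sys/fs/cgroup/memory/user.slice", "docker/abc"))
}
```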

d.join("memory") writes the pid under the memory path:

func (raw *cgroupData) join(subsystem string) (string, error) {
    path, err := raw.path(subsystem)
    if err != nil {
        return "", err
    }
    if err := os.MkdirAll(path, 0755); err != nil {
        return "", err
    }
    if err := cgroups.WriteCgroupProc(path, raw.pid); err != nil {
        return "", err
    }
    return path, nil
}
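
Joining a cgroup comes down to creating a directory and writing a PID into a file inside it. The sketch below imitates that against an ordinary temporary directory, so it has no effect on the kernel. It assumes the target file is cgroup.procs, the file seen in the tree listing earlier; joinCgroup and demoJoin are hypothetical helpers, not runc functions.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// joinCgroup creates dir if needed and writes pid into its cgroup.procs
// file. On a real cgroup filesystem this would move the process into
// the cgroup; on a normal directory it just writes a file.
func joinCgroup(dir string, pid int) error {
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	procs := filepath.Join(dir, "cgroup.procs")
	return os.WriteFile(procs, []byte(strconv.Itoa(pid)), 0o644)
}

// demoJoin runs joinCgroup inside a temporary directory and returns the
// content that ended up in cgroup.procs.
func demoJoin(pid int) (string, error) {
	tmp, err := os.MkdirTemp("", "cgroup-sketch")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(tmp)

	dir := filepath.Join(tmp, "memory", "docker", "example")
	if err := joinCgroup(dir, pid); err != nil {
		return "", err
	}
	data, err := os.ReadFile(filepath.Join(dir, "cgroup.procs"))
	return string(data), err
}

func main() {
	got, err := demoJoin(os.Getpid())
	if err != nil {
		panic(err)
	}
	fmt.Println("wrote pid:", got)
}
```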

Copyright belongs to the author. Do not repost without permission.

When reposting, please cite this article's address: https://www.ucloud.cn/yun/27450.html
