How Nova Collects Node Hardware Resource Statistics

derek_334892, published 2019-07-25 10:47

Introduction

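When we use cloud platform services built on OpenStack, the overview page usually has a prominent area showing the cluster's current resource usage: the total, used, and remaining amounts of CPU, memory, and disk. Whenever we scale out the cluster, the resource totals on the overview page grow automatically as well. We all know that the Nova service in OpenStack manages these compute resources, but have you ever wondered how Nova obtains them?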

How Nova collects resource statistics

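Resource collection is an internal mechanism of the Nova service. Because later operations (such as creating virtual machines or disks) depend on its results, we can infer that this mechanism must run before the service starts handling any other work.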

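Following that simple analysis, plus some necessary debugging, we find that the mechanism is triggered in the nova.service.WSGIService.start method (nova-compute itself is launched as nova.service.Service, whose start method calls the same init_host() and pre_start_hook() hooks):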

    def start(self):
        """Start serving this service using loaded configuration.
        Also, retrieve updated port number in case "0" was passed in, which
        indicates a random port should be used.
        :returns: None
        """
        if self.manager:
            self.manager.init_host()
            self.manager.pre_start_hook()
            if self.backdoor_port is not None:
                self.manager.backdoor_port = self.backdoor_port
        self.server.start()
        if self.manager:
            self.manager.post_start_hook()
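
The ordering in start() matters more than any single call: the manager's hooks run before the server starts listening. The sketch below only illustrates that ordering. RecordingManager, RecordingServer, and start_service are hypothetical stand-ins invented for this example, not Nova classes.

```python
class RecordingManager:
    """Hypothetical stand-in for a Nova manager; it only records hook calls."""

    def __init__(self):
        self.calls = []

    def init_host(self):
        self.calls.append("init_host")

    def pre_start_hook(self):
        # In nova-compute, this hook is where resources are first refreshed.
        self.calls.append("pre_start_hook")

    def post_start_hook(self):
        self.calls.append("post_start_hook")


class RecordingServer:
    """Hypothetical stand-in for the listening server."""

    def __init__(self, calls):
        self.calls = calls

    def start(self):
        self.calls.append("server.start")


def start_service(manager, server):
    """Mirror the hook order of the start() method shown above."""
    if manager:
        manager.init_host()
        manager.pre_start_hook()
    server.start()
    if manager:
        manager.post_start_hook()


manager = RecordingManager()
start_service(manager, RecordingServer(manager.calls))
print(manager.calls)
# ['init_host', 'pre_start_hook', 'server.start', 'post_start_hook']
```

Because pre_start_hook runs before server.start, the resource view is already populated by the time the service can receive its first request.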

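Here, self.manager.pre_start_hook() is the call that fetches the resource information. For the compute service it resolves to nova.compute.manager.ComputeManager.pre_start_hook, shown below: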

    def pre_start_hook(self):
        """After the service is initialized, but before we fully bring
        the service up by listening on RPC queues, make sure to update
        our available resources (and indirectly our available nodes).
        """
        self.update_available_resource(nova.context.get_admin_context())
...
    @periodic_task.periodic_task
    def update_available_resource(self, context):
        """See driver.get_available_resource()
        Periodic process that keeps that the compute host's understanding of
        resource availability and usage in sync with the underlying hypervisor.
        :param context: security context
        """
        new_resource_tracker_dict = {}
        nodenames = set(self.driver.get_available_nodes())
        for nodename in nodenames:
            rt = self._get_resource_tracker(nodename)
            rt.update_available_resource(context)
            new_resource_tracker_dict[nodename] = rt
        # Delete orphan compute node not reported by driver but still in db
        compute_nodes_in_db = self._get_compute_nodes_in_db(context,
                                                            use_slave=True)
        for cn in compute_nodes_in_db:
            if cn.hypervisor_hostname not in nodenames:
                LOG.audit(_("Deleting orphan compute node %s") % cn.id)
                cn.destroy()
        self._resource_tracker_dict = new_resource_tracker_dict
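
The periodic task above does two things: it builds one tracker per node reported by the driver, and it removes database records for nodes the driver no longer reports. A minimal sketch of that reconciliation, using plain strings and dicts instead of Nova objects, could look like this. The names reconcile_nodes and make_tracker are hypothetical and exist only for this illustration.

```python
def reconcile_nodes(reported_nodes, db_nodes, make_tracker):
    """Return (trackers, orphans) for one pass of the periodic task.

    reported_nodes: hostnames the driver currently reports
    db_nodes:       hostnames recorded in the database
    make_tracker:   factory that builds one tracker per reported node
    """
    nodenames = set(reported_nodes)
    # One tracker per node that the driver still reports.
    trackers = {name: make_tracker(name) for name in nodenames}
    # Anything in the database that the driver no longer reports is an orphan.
    orphans = sorted(name for name in db_nodes if name not in nodenames)
    return trackers, orphans


trackers, orphans = reconcile_nodes(
    reported_nodes=["node-1", "node-2"],
    db_nodes=["node-1", "node-2", "node-3"],
    make_tracker=lambda name: {"node": name},
)
print(sorted(trackers))  # ['node-1', 'node-2']
print(orphans)           # ['node-3']
```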

The rt.update_available_resource() call in the code above resolves to nova.compute.resource_tracker.ResourceTracker.update_available_resource(), shown below:

    def update_available_resource(self, context):
        """Override in-memory calculations of compute node resource usage based
        on data audited from the hypervisor layer.
        Add in resource claims in progress to account for operations that have
        declared a need for resources, but not necessarily retrieved them from
        the hypervisor layer yet.
        """
        LOG.audit(_("Auditing locally available compute resources"))
        resources = self.driver.get_available_resource(self.nodename)
        if not resources:
            # The virt driver does not support this function
            LOG.audit(_("Virt driver does not support "
                 "'get_available_resource'  Compute tracking is disabled."))
            self.compute_node = None
            return
        resources["host_ip"] = CONF.my_ip
        # TODO(berrange): remove this once all virt drivers are updated
        # to report topology
        if "numa_topology" not in resources:
            resources["numa_topology"] = None
        self._verify_resources(resources)
        
        self._report_hypervisor_resource_view(resources)
        return self._update_available_resource(context, resources)
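
Before the report is stored, the method above fills in defaults (host_ip, numa_topology) and verifies the report. The sketch below shows that normalize-then-verify pattern on a plain dict. The function name normalize_resources and the REQUIRED_KEYS tuple are assumptions made for this example; the keys actually checked by _verify_resources are not shown in this article.

```python
# Assumed subset of required keys, chosen only for illustration.
REQUIRED_KEYS = ("vcpus", "memory_mb", "local_gb")


def normalize_resources(resources, host_ip):
    """Fill defaults and validate a driver report before it is stored."""
    if not resources:
        # The driver reports nothing, so tracking is disabled for this node.
        return None
    normalized = dict(resources)
    normalized["host_ip"] = host_ip
    # Drivers that do not report NUMA topology get an explicit None.
    normalized.setdefault("numa_topology", None)
    missing = [key for key in REQUIRED_KEYS if key not in normalized]
    if missing:
        raise ValueError("missing resource keys: %s" % ", ".join(missing))
    return normalized


report = normalize_resources(
    {"vcpus": 8, "memory_mb": 16384, "local_gb": 200}, "192.0.2.10"
)
print(report["numa_topology"])  # None
```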

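In the code above, self._update_available_resource synchronizes the database records with the actual resource usage on the compute node; we will not expand on it here. self.driver.get_available_resource() is the call that collects the node's hardware resource information. With the libvirt driver, it resolves to: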

class LibvirtDriver(driver.ComputeDriver):
    def get_available_resource(self, nodename):
        """Retrieve resource information.
        This method is called when nova-compute launches, and
        as part of a periodic task that records the results in the DB.
        :param nodename: will be put in PCI device
        :returns: dictionary containing resource info
        """
        # Temporary: convert supported_instances into a string, while keeping
        # the RPC version as JSON. Can be changed when RPC broadcast is removed
        stats = self.get_host_stats(refresh=True)
        stats["supported_instances"] = jsonutils.dumps(
                stats["supported_instances"])
        return stats
        
    def get_host_stats(self, refresh=False):
        """Return the current state of the host.
        If "refresh" is True, run update the stats first.
        """
        return self.host_state.get_host_stats(refresh=refresh)
        
    def _get_vcpu_total(self):
        """Get available vcpu number of physical computer.
        :returns: the number of cpu core instances can be used.
        """
        if self._vcpu_total != 0:
            return self._vcpu_total
        try:
            total_pcpus = self._conn.getInfo()[2]
        except libvirt.libvirtError:
            LOG.warn(_LW("Cannot get the number of cpu, because this "
                         "function is not implemented for this platform. "))
            return 0
        if CONF.vcpu_pin_set is None:
            self._vcpu_total = total_pcpus
            return self._vcpu_total
        available_ids = hardware.get_vcpu_pin_set()
        if sorted(available_ids)[-1] >= total_pcpus:
            raise exception.Invalid(_("Invalid vcpu_pin_set config, "
                                      "out of hypervisor cpu range."))
        self._vcpu_total = len(available_ids)
        return self._vcpu_total
.....
class HostState(object):
    """Manages information about the compute node through libvirt."""
    def __init__(self, driver):
        super(HostState, self).__init__()
        self._stats = {}
        self.driver = driver
        self.update_status()
    def get_host_stats(self, refresh=False):
        """Return the current state of the host.
        If "refresh" is True, run update the stats first.
        """
        if refresh or not self._stats:
            self.update_status()
        return self._stats
        
    def update_status(self):
        """Retrieve status info from libvirt."""
        ...
        data["vcpus"] = self.driver._get_vcpu_total()
        data["memory_mb"] = self.driver._get_memory_mb_total()
        data["local_gb"] = disk_info_dict["total"]
        data["vcpus_used"] = self.driver._get_vcpu_used()
        data["memory_mb_used"] = self.driver._get_memory_mb_used()
        data["local_gb_used"] = disk_info_dict["used"]
        data["hypervisor_type"] = self.driver._get_hypervisor_type()
        data["hypervisor_version"] = self.driver._get_hypervisor_version()
        data["hypervisor_hostname"] = self.driver._get_hypervisor_hostname()
        data["cpu_info"] = self.driver._get_cpu_info()
        data["disk_available_least"] = _get_disk_available_least()
        ...
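
HostState.get_host_stats follows a refresh-on-demand caching pattern: it re-collects when the caller passes refresh=True or when nothing has been collected yet, and otherwise returns the cached dict. A minimal sketch of the pattern follows. CachedStats and collect are hypothetical names, and unlike HostState this sketch does not collect eagerly in its constructor.

```python
class CachedStats:
    """Cache a stats dict and re-collect it only when asked to."""

    def __init__(self, collect):
        self._collect = collect  # callable that returns a fresh stats dict
        self._stats = {}

    def get(self, refresh=False):
        # Re-collect on an explicit refresh, or when the cache is still empty.
        if refresh or not self._stats:
            self._stats = self._collect()
        return self._stats


calls = []


def collect():
    """Hypothetical collector; counts how often it is invoked."""
    calls.append(1)
    return {"vcpus": 8, "sample": len(calls)}


cache = CachedStats(collect)
cache.get()              # first call: cache is empty, so collect() runs
cache.get()              # served from the cache
cache.get(refresh=True)  # forced re-collection
print(len(calls))        # 2
```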

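Note the docstring of get_available_resource: it matches our initial inference exactly. Taking vcpus alone as an example, let us continue tracing the collection flow. self.driver._get_vcpu_total resolves to LibvirtDriver._get_vcpu_total (shown in the code above). If the vcpu_pin_set option is not set, _vcpu_total is self._conn.getInfo()[2]. Here self._conn can be thought of as an adapter for libvirt, representing an abstract connection to underlying virtualization tools such as KVM and QEMU. getInfo() is a thin wrapper around libvirtmod.virNodeGetInfo; it returns a list whose third element is the number of CPUs on the host, which Nova uses as the vcpus total. We can stop here: anything further down is libvirt's C code rather than Python.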
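
The decision logic of _get_vcpu_total can be sketched without a live libvirt connection by passing in a list shaped like the result of getInfo(). The sample values below are made up, and vcpu_total is a hypothetical helper written for this example, not a Nova function.

```python
def vcpu_total(node_info, pin_set=None):
    """Return the vcpus total from a getInfo()-style list.

    node_info: list whose third element (index 2) is the host CPU count
    pin_set:   optional non-empty set of CPU indexes guests may use
    """
    total_pcpus = node_info[2]
    if pin_set is None:
        # No pinning configured: every host CPU counts.
        return total_pcpus
    if max(pin_set) >= total_pcpus:
        raise ValueError("pin set is out of the hypervisor CPU range")
    return len(pin_set)


# Made-up sample: model, memory (MB), CPUs, MHz, NUMA nodes, sockets, cores, threads
sample_info = ["x86_64", 16384, 8, 2400, 1, 1, 4, 2]

print(vcpu_total(sample_info))                # 8
print(vcpu_total(sample_info, {0, 1, 2, 3}))  # 4
```

An index equal to or greater than the CPU count is rejected because CPU indexes start at 0, so on an 8-CPU host the largest valid index is 7.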

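On the other hand, if vcpu_pin_set is configured, hardware.get_vcpu_pin_set parses the option into a set of usable CPU indexes. After checking that the largest index falls within the host's CPU range, Nova takes the size of that set as the vcpus total.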
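
To make the parsing step concrete, here is a simplified sketch of turning a pin-set string into a set of CPU indexes. It assumes a comma-separated format with single indexes, inclusive ranges such as 4-12, and exclusions prefixed with ^. The function parse_pin_set is hypothetical, and Nova's own parser may accept or reject inputs differently.

```python
def parse_pin_set(spec):
    """Parse a string such as "4-12,^8,15" into a set of CPU indexes."""
    included, excluded = set(), set()
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        target = included
        if token.startswith("^"):
            # A leading caret marks indexes to exclude.
            target = excluded
            token = token[1:]
        if "-" in token:
            start, end = (int(part) for part in token.split("-", 1))
            if start > end:
                raise ValueError("invalid range: %s" % token)
            target.update(range(start, end + 1))  # ranges are inclusive
        else:
            target.add(int(token))
    return included - excluded


cpus = parse_pin_set("4-12,^8,15")
print(sorted(cpus))  # [4, 5, 6, 7, 9, 10, 11, 12, 15]
print(len(cpus))     # 9
```

The length of the resulting set is the value that would be reported as the vcpus total.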

That is the whole logic by which Nova collects a node's hardware resource statistics, using vcpus as the example.

When reposting, please cite the original address: https://www.ucloud.cn/yun/38158.html