Handling an InfiniBand Interface That Went Down Unexpectedly on an Exadata DB Node

Background

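A routine inspection found that the ib1 interface on a db node had gone down unexpectedly. The link state also showed down, and running ifconfig ib1 down/up did not bring it back.
  • Environment
    Exadata X8-2
    Image version: 21.2.6
1.1 The ib1 interface is not in the RUNNING state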

    ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 65520
    inet 192.168.XX.35 netmask 255.255.252.0 broadcast 192.168.XX.255
    inet6 fe80::ba59:9f03:91:7fd1 prefixlen 64 scopeid 0x20
    Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).

    infiniband 80:00:02:08:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
    RX packets 441801234 bytes 135108307679 (125.8 GiB)
    RX errors 0 dropped 0 overruns 0 frame 0
    TX packets 571055390 bytes 386080892808 (359.5 GiB)
    TX errors 0 dropped 200 overruns 0 carrier 0 collisions 0

    ib1: flags=4099<UP,BROADCAST,MULTICAST> mtu 65520
    Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
    infiniband 80:00:02:09:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)

    RX packets 44482253 bytes 15202598429 (14.1 GiB)
    RX errors 0 dropped 0 overruns 0 frame 0
    TX packets 16442693 bytes 4497934510 (4.1 GiB)
    TX errors 0 dropped 16 overruns 0 carrier 0 collisions 0

    ib0:P02: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 65520
    inet 192.168.XX.36 netmask 255.255.252.0 broadcast 192.168.XX.255
    Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
    infiniband 80:00:02:08:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)

    lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
    inet 127.0.0.1 netmask 255.0.0.0
    inet6 ::1 prefixlen 128 scopeid 0x10
    loop txqueuelen 1000 (Local Loopback)
    RX packets 64915590345 bytes 15540910225759 (14.1 TiB)
    RX errors 0 dropped 0 overruns 0 frame 0
    TX packets 64915590345 bytes 15540910225759 (14.1 TiB)
    TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
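
The numeric flags value on each interface line is a bitmask, so whether an interface is RUNNING can be read from the number alone. The sketch below decodes the three values that appear in the listing above, using the bit values from the Linux interface flag definitions; the helper name decode_flags is made up for this illustration.

```python
# Interface flag bits as defined by Linux (IFF_* constants).
# Only the bits that occur in the listing above are included.
IFF_FLAGS = {
    0x1: "UP",
    0x2: "BROADCAST",
    0x8: "LOOPBACK",
    0x40: "RUNNING",
    0x1000: "MULTICAST",
}


def decode_flags(value: int) -> list:
    """Return the names of the known flag bits set in an ifconfig flags value."""
    return [name for bit, name in IFF_FLAGS.items() if value & bit]


# Values copied from the ifconfig output in this section.
for iface, flags in {"ib0": 4163, "ib1": 4099, "lo": 73}.items():
    names = decode_flags(flags)
    status = "running" if "RUNNING" in names else "NOT running"
    print(f"{iface}: {flags} -> {','.join(names)} ({status})")
```

ib1 carries 4099, which is 4163 without the 0x40 (RUNNING) bit: the interface is administratively up but has no working link underneath it.
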
1.2 The link state shows down

    Infiniband device mlx4_0 port 1 status:
    default gid: fe80:0000:0000:0000:b859:9f03:0091:7fd1
    base lid: 0x2b
    sm lid: 0x13
    state: 4: ACTIVE
    phys state: 5: LinkUp
    rate: 40 Gb/sec (4X QDR)
    link_layer: InfiniBand

    Infiniband device mlx4_0 port 2 status:
    default gid: fe80:0000:0000:0000:b859:9f03:0091:7fd2
    base lid: 0x2c
    sm lid: 0x13
    state: 1: DOWN
    phys state: 2: Polling
    rate: 10 Gb/sec (4X)
    link_layer: InfiniBand
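
When many db nodes have to be checked, a port status listing like the one above can be scanned automatically. The following is a minimal sketch that assumes the text format shown in this section; the function names parse_ports and unhealthy are illustrative, and the sample is an abridged copy of the listing.

```python
import re

# Abridged copy of the port status listing from this section.
SAMPLE = """\
Infiniband device mlx4_0 port 1 status:
    base lid: 0x2b
    state: 4: ACTIVE
    phys state: 5: LinkUp
    rate: 40 Gb/sec (4X QDR)

Infiniband device mlx4_0 port 2 status:
    base lid: 0x2c
    state: 1: DOWN
    phys state: 2: Polling
    rate: 10 Gb/sec (4X)
"""

# The device name may or may not be quoted, depending on the tool version.
HEADER = re.compile(r"Infiniband device '?(\S+?)'? port (\d+) status:")


def parse_ports(text: str) -> dict:
    """Map (device, port) to its 'state' and 'phys state' names."""
    ports, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        m = HEADER.match(line)
        if m:
            current = ports.setdefault((m.group(1), int(m.group(2))), {})
        elif current is not None:
            key, sep, value = line.partition(":")
            if sep and key in ("state", "phys state"):
                # Values look like "4: ACTIVE"; keep the name after the code.
                current[key] = value.split(":", 1)[-1].strip()
    return ports


def unhealthy(ports: dict) -> list:
    """Return the ports that are not both ACTIVE and LinkUp."""
    return [
        port for port, fields in ports.items()
        if fields.get("state") != "ACTIVE" or fields.get("phys state") != "LinkUp"
    ]


print(unhealthy(parse_ports(SAMPLE)))
```

For the sample, only port 2 of mlx4_0 is reported, which matches the ib1 symptom.
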
1.3 The db node alert log reports an InfiniBand port fault

    359_1 2022-02-05T22:28:34+08:00 critical "InfiniBand port
    HCA-1:2 is in an invalid state. State : Down Physical Link 
    State : Polling Data Rate : 10 Gps Symbol Errors : 0 Received Errors : 0"
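
If alerts like this need to be collected from several nodes, the port name and link states can be pulled out with a regular expression. This is only a sketch built around the single alert shown above; the pattern and the name parse_ib_alert are assumptions for illustration and may not cover other alert wordings.

```python
import re

# The alert text from this section, wrapped across lines as in the listing.
RAW_ALERT = """\
359_1 2022-02-05T22:28:34+08:00 critical "InfiniBand port
HCA-1:2 is in an invalid state. State : Down Physical Link
State : Polling Data Rate : 10 Gps Symbol Errors : 0 Received Errors : 0"
"""

ALERT = re.compile(
    r'(?P<name>\S+) (?P<time>\S+) (?P<severity>\w+) '
    r'"InfiniBand port (?P<port>\S+) is in an invalid state\. '
    r'State : (?P<state>\w+) Physical Link State : (?P<phys_state>\w+)'
)


def parse_ib_alert(text: str):
    """Return the alert fields as a dict, or None if the text does not match."""
    flat = " ".join(text.split())  # undo the line wrapping
    m = ALERT.search(flat)
    return m.groupdict() if m else None


print(parse_ib_alert(RAW_ALERT))
```
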

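At this point a failed NIC can largely be ruled out: the port is in the Polling state, so it is still actively trying to bring the link up. The investigation moves on to the state of all the IB switches.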


Root Cause Analysis

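Running the health checks on every IB switch showed a problem on iba01: the environment test fails its Auto-link-disable check.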

    Environment test started:
    Starting Environment Daemon test:
    Environment daemon running
    Environment Daemon test returned OK
    Starting Voltage test:
    Voltage ECB OK
    Measured 3.3V Main = 3.25 V
    Measured 3.3V Standby = 3.37 V
    Measured 12V = 11.97 V
    Measured 5V = 4.99 V
    Measured VBAT = 3.03 V
    Measured 2.5V = 2.48 V
    Measured 1.8V = 1.78 V
    Measured I4 1.2V = 1.21 V
    Voltage test returned OK
    Starting PSU test:
    PSU 0 present OK
    PSU 1 present OK
    PSU test returned OK
    Starting Temperature test:
    Back temperature 29
    Front temperature 31
    SP temperature 46
    Switch temperature 57, maxtemperature 60
    Temperature test returned OK
    Starting FAN test:
    Fan 0 not present
    Fan 1 running at rpm 15478
    Fan 2 running at rpm 15696
    Fan 3 running at rpm 15696
    Fan 4 not present
    FAN test returned OK
    Starting Connector test:
    Connector test returned OK
    Starting Onboard ibdevice test:
    Switch OK
    All Internal ibdevices OK
    Onboard ibdevice test returned OK
    Starting SSD test:
    SSD test returned OK
    Starting Auto-link-disable test:
    WARNING Autodisabled ports
    Auto-link-disable test returned 1 faults
    Environment test FAILED
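
The environment test output is long, and the one failing check is easy to miss. A small filter, sketched here on the assumption that every sub-test ends with a line of the form "<name> test returned OK" or "<name> test returned N faults" as in the listing above, reduces it to the failures. The name failed_tests is illustrative.

```python
import re

# Abridged copy of the environment test output from this section.
SAMPLE = """\
Starting Voltage test:
Voltage ECB OK
Voltage test returned OK
Starting FAN test:
Fan 0 not present
FAN test returned OK
Starting Auto-link-disable test:
WARNING Autodisabled ports
Auto-link-disable test returned 1 faults
Environment test FAILED
"""

RESULT = re.compile(r"^(.+?) test returned (OK|(\d+) faults?)$")


def failed_tests(text: str) -> dict:
    """Map the name of each failing sub-test to its fault count."""
    failures = {}
    for line in text.splitlines():
        m = RESULT.match(line.strip())
        if m and m.group(2) != "OK":
            failures[m.group(1)] = int(m.group(3))
    return failures


print(failed_tests(SAMPLE))
```
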

    # listlinkup
    Connector 0A Present <-> Switch Port 20 is up (Enabled)
    Connector 1A Present <-> Switch Port 22 is down (AutomaticHighErrorRate)
    Connector 2A Present <-> Switch Port 24 is down (Enabled)
    Connector 3A Present <-> Switch Port 26 is up (Enabled)
    Connector 4A Present <-> Switch Port 28 is up (Enabled)
    Connector 5A Present <-> Switch Port 30 is up (Enabled)

    # showunhealthy
    WARNING Autodisabled ports
    FAILURE - 1 sensors NOT OK

    vendid=0x2c9
    devid=0x1003
    sysimgguid=0xb8599f0300917fd3
    caguid=0xb8599f0300917fd0
    Ca 2 "H-b8599f0300917fd0" # "exdadbadm10 S 192.168.XX.35,192.168.XX.36 HCA-1"
    [1](b8599f0300917fd1) "S-0010e0cdd353a0a0"[22] # lid 43 lmc 0 "SUN DCS 36P QDR exdasw-ibb01 168.168.XX.37" lid 26 4xQDR

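This confirms that switch port 22, the port ib1 is cabled to, has logged errors, and that the switch shut it down automatically because of the high error rate (AutomaticHighErrorRate).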


Resolution Steps

3.1 First, clear the error information

    [root@exdasw-iba01 ~]# ibclearerrors

    ## Summary: 35 nodes cleared 0 errors
    [root@exdasw-iba01 ~]#
    [root@exdasw-iba01 ~]#
    [root@exdasw-iba01 ~]# ibclearcounters

    ## Summary: 35 nodes cleared 0 errors
3.2 Enable the IB switch port and verify

    [root@exdasw-iba01 ~]# enableswitchport --automatic 22

    [root@exdasw-iba01 ~]# listlinkup
    Connector 0A Present <-> Switch Port 20 is up (Enabled)
    Connector 1A Present <-> Switch Port 22 is up (Enabled)

    Starting Onboard ibdevice test:
    Switch OK
    All Internal ibdevices OK
    Onboard ibdevice test returned OK
    Starting SSD test:
    SSD test returned OK
    Starting Auto-link-disable test:
    Auto-link-disable test returned OK
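The port state is back to normal.
Rechecking the ib1 interface on the db node shows that it has returned to the RUNNING state without any manual intervention.
3.3 Clear the fault alerts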

    [root@exdasw-iba01 ~]# spsh

    Oracle(R) Integrated Lights Out Manager

    Version 2.2.16-3 ILOM 3.2.11 r137127

    Copyright (c) 2020, Oracle and/or its affiliates. All rights reserved.

    Warning: HTTPS certificate is set to factory default.

    Hostname: exdasw-iba01

    -> show faulty
    Target | Property | Value
    ----------------------------------+----------------------------------------+-----------------------------------------------------------
    /SP/faultmgmt/0                   | fru | /SYS
    /SP/faultmgmt/0/faults/0          | class | fault.device.ib.auto-link-disable
    /SP/faultmgmt/0/faults/0          | sunw-msg-id | ---
    /SP/faultmgmt/0/faults/0          | component | /SYS
    /SP/faultmgmt/0/faults/0          | uuid | 8f0fd5cf-661b-e0f5-db37-cfbcbd7b673a
    /SP/faultmgmt/0/faults/0          | timestamp | 2022-02-02/10:18:11


    -> start /SP/faultmgmt/shell
    Are you sure you want to start /SP/faultmgmt/shell (y/n)? y

    faultmgmtsp> fmadm faulty
    ------------------- ------------------------------------ -------------- --------
    Time UUID msgid Severity
    ------------------- ------------------------------------ -------------- --------
    2022-02-05/22:22:57 57471550-2815-eba0-ef6b-f46fa7788bad IBSWITCH-8000-D6 Major

    Fault class : fault.chassis.device.ib.link-error

    FRU : /SYS
    (Part Number: 7305544)
    (Serial Number: AK00417399)

    Description : One or more ports have been auto-disabled due to high error
    rate or bad link speed or width.

    Response : Illuminate service-required LED on the chassis.

    Impact : One or more ports have been auto-disabled due to high error
    rate or bad link speed or width.

    Action : Please refer to the associated reference document at %s for
    the latest service procedures and policies regarding this
    diagnosis.

    faultmgmtsp> fmadm repair 57471550-2815-eba0-ef6b-f46fa7788bad
    faultmgmtsp> fmadm faulty
    No problems found
    faultmgmtsp> show faulty
    Invalid command show - type help for a list of commands.

    faultmgmtsp> exit
    -> show faulty
    Target | Property | Value
    ------------------------------------------------+--------------------------------------------------------+---------------------------------------------------------------------------------

    ->
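
The UUID passed to fmadm repair has to be copied from the fmadm faulty table. As a rough sketch, assuming the table layout shown in the session above, the open faults can be extracted and the matching repair commands printed for review. Nothing is executed here, and the names open_faults and repair_commands are illustrative.

```python
import re

# Table portion of the `fmadm faulty` output from this section.
SAMPLE = """\
------------------- ------------------------------------ -------------- --------
Time UUID msgid Severity
------------------- ------------------------------------ -------------- --------
2022-02-05/22:22:57 57471550-2815-eba0-ef6b-f46fa7788bad IBSWITCH-8000-D6 Major
"""

ROW = re.compile(
    r"^[ \t]*(\d{4}-\d{2}-\d{2}/\d{2}:\d{2}:\d{2})\s+"
    r"([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})\s+(\S+)\s+(\S+)[ \t]*$",
    re.MULTILINE,
)


def open_faults(text: str) -> list:
    """Return one dict per fault row: time, uuid, msgid, severity."""
    keys = ("time", "uuid", "msgid", "severity")
    return [dict(zip(keys, m.groups())) for m in ROW.finditer(text)]


def repair_commands(text: str) -> list:
    """Build the `fmadm repair <uuid>` line for each open fault (not executed)."""
    return [f"fmadm repair {fault['uuid']}" for fault in open_faults(text)]


for command in repair_commands(SAMPLE):
    print(command)
```

A fault should only be marked repaired after the underlying cause has been dealt with, as was done in steps 3.1 and 3.2.
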
Summary
When a NIC misbehaves, the hardware is not necessarily damaged; check the health of every component along the link.


Author: 汤杰 (上海新炬中北团队)

Source: the "IT那活儿" WeChat official account

Copyright belongs to the author. Do not repost without permission.

When reposting, please cite the original address: https://www.ucloud.cn/yun/129136.html
