资讯专栏INFORMATION COLUMN

Oracle数据库4031故障分析

不知名网友 / 3260人阅读
Oracle数据库4031故障分析
点击上方“IT那活儿”公众号,关注后了解更多内容,不管IT什么活儿,干就完了!!!

问题描述

数据库业务出现超时,数据库出现enq: TX - index contention等等待事件。


问题分析

2.1 数据库出现enq: TX - index contention等待
通过查看问题前的ash,发现在数据库异常之前,数据库的主要等待事件是:SGA: allocation forcing component growth。
Alert.log:
2022-10-03T00:16:41.029338+08:00
Errors in file /u01/app/oracle/diag/rdbms/*/trace/db2_ora_6519.trc
ORA-04031: unable to allocate 63496 bytes of shared memory ("shared pool","DBMS_BACKUP_RESTORE","PLMCD^a5f238a0","BAMIMA: Bam Buffer")
Incident details in: /u01/app/oracle/diag/rdbms/*/incident/incdir_218692/db2_ora_6519_i218692.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
2022-10-03T00:16:41.321084+08:00
ORA-04031 heap dump being written to trace file /u01/app/oracle/diag/rdbms/*/incident/incdir_218692/db2_ora_6519_i218692.trc
2022-10-03T00:19:38.948488+08:00
Errors in file /u01/app/oracle/diag/rdbms/*/
trace/b2_m000_8030.trc (incident=218693):
ORA-04031: unable to allocate 57768 bytes of shared memory ("shared pool","DBMS_STATS_INTERNAL","PLMCD^60dfd26c","BAMIMA: Bam Buffer")
Incident details in: /u01/app/oracle/diag/rdbms/*/incident/incdir_218693/b2_m000_8030_i218693.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
2022-10-03T00:19:42.233916+08:00
Errors in file /u01/app/oracle/diag/rdbms/*/
trace/db2_ora_8166.trc (incident=218700):
ORA-04031: unable to allocate 63496 bytes of shared memory ("shared pool","DBMS_BACKUP_RESTORE","PLMCD^a5f238a0","BAMIMA: Bam Buffer")
Incident details in: /u01/app/oracle/diag/rdbms/*/incident/incdir_218700/db2_ora_8166_i218700.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.

2022-10-03T04:23:42.937853+08:00
Errors in file /u01/app/oracle/diag/rdbms/*/
trace/db2_lmd0_7330.trc (incident=215852):
ORA-04031: unable to allocate 13840 bytes of shared memory ("shared pool","unknown object","sga heap(1,0)","ges resource dynamic")
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
2022-10-03T04:23:49.345977+08:00
Errors in file /u01/app/oracle/diag/rdbms/*/trace/db2_lmd0_7330.trc (incident=215853):
ORA-04031: unable to allocate 13840 bytes of shared memory ("shared pool","unknown object","sga heap(1,0)","ges resource dynamic")
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.

2022-10-03T08:58:07.793554+08:00
DDE: Problem Key ORA 4031 was completely flood controlled (0x6)
Further messages for this problem key will be suppressed for up to 10 minutes
2022-10-03T09:08:09.957747+08:00

DDE: Problem Key ORA 4031 was completely flood controlled (0x6)
Further messages for this problem key will be suppressed for up to 10 minutes
2022-10-03T09:18:12.005452+08:00
DDE: Problem Key ORA 4031 was completely flood controlled (0x6)
Further messages for this problem key will be suppressed for up to 10 minutes
2022-10-03T09:28:12.835118+08:00
DDE: Problem Key ORA 4031 was completely flood controlled (0x6)

022-10-03T09:38:13.206769+08:00
DDE: Problem Key ORA 4031 was completely flood controlled (0x6)
Further messages for this problem key will be suppressed for up to 10 minutes
2022-10-03T09:44:00.615729+08:00
Thread 2 advanced to log sequence 29428 (LGWR switch)
Current log# 8 seq# 29428 mem# 0: +DG_DATA_SSD_1/N*/
ONLINELOG/group_8.276.1008547021
Current log# 8 seq# 29428 mem# 1: +DG_DATA_SSD_1/N*/ONLINELOG/group_8.277.1008547025
2022-10-03T09:44:01.334950+08:00
TT03: Standby redo logfile selected for thread 2 sequence 29428 for destination LOG_ARCHIVE_DEST_2
2022-10-03T09:44:03.799630+08:00
Archived Log entry 119834 added for T-2.S-29427 ID 0x79a6de54 LAD:1
2022-10-03T09:48:17.418439+08:00
DDE: Problem Key ORA 4031 was completely flood controlled (0x6)
Further messages for this problem key will be suppressed for up to 10 minutes
2022-10-03T09:58:19.513479+08:00
DDE: Problem Key ORA 4031 was completely flood controlled (0x6)
Further messages for this problem key will be suppressed for up to 10 minutes
2022-10-03T10:08:21.022510+08:00
DDE: Problem Key ORA 4031 was completely flood controlled (0x6)
Further messages for this problem key will be suppressed for up to 10 minutes
2022-10-03T10:14:50.702688+08:00
LCK1 (ospid: 7342) waits for event libcache interrupt action by LCK for 73 secs.

2022-10-03T10:14:50.702867+08:00
LCK1 (ospid: 7342) is hung in an acceptable location (libcache 0x41.02).
2022-10-03T10:17:11.454316+08:00
LCK1 (ospid: 7342) waits for event libcache interrupt action by LCK for 214 secs.
2022-10-03T10:17:11.454494+08:00
LCK1 (ospid: 7342) is hung in an acceptable location (libcache 0x41.02).
2022-10-03T10:18:23.645094+08:00
DDE: Problem Key ORA 4031 was completely flood controlled (0x6)
Further messages for this problem key will be suppressed for up to 10 minutes
2022-10-03T10:21:42.845553+08:00
LCK1 (ospid: 7342) waits for event libcache interrupt action by LCK for 79 secs.
2022-10-03T10:21:42.845731+08:00
LCK1 (ospid: 7342) is hung in an acceptable location (libcache 0x41.02).
2022-10-03T10:24:20.692124+08:00
LCK1 (ospid: 7342) waits for event libcache interrupt action by LCK for 80 secs.
2022-10-03T10:24:20.692681+08:00
LCK1 (ospid: 7342) is hung in an acceptable location (libcache 0x41.02).
2022-10-03T10:27:57.458418+08:00
CKPT (ospid: 7350) waits for event enq: XR - database force logging for 78 secs.
2022-10-03T10:28:07.628285+08:00
CKPT (ospid: 7350) waits for event enq: XR - database force logging for 88 secs.
2022-10-03T10:28:17.668429+08:00
CKPT (ospid: 7350) waits for event enq: XR - database force logging for 98 secs.
2.2 查看数据库内存使用情况
通过ash趋势和alert.log发现,数据库在出现enq: TX - index contention之前,数据库主要等待是SGA: allocation forcing component growth和library cache lock,并且alert.log从0点开始出现ora-4031持续到异常时段。
Buffer cache的内存非常小,只有128m,而shared pool占用了32G。
Shared pool中ges resource dynamic和ges enqueues分别占用了12G和8.9G,这通常是不正常的。说明由于这两个组件占用了过多的内存,导致数据库的ORA-4031,进而出来sga resize,进而影响了数据库的整体性能。
分析lmhb.trc:
grep "library cache pin wait check" n*2_lmhb_7340.trc
kjgcr_ServiceGCR: KJGCR_METRICS: Local metric library cache pin wait check, id 11 succeed
kjgcr_ServiceGCR: KJGCR_METRICS: Local metric library cache pin wait check, id 11 failed
kjgcr_ServiceGCR: KJGCR_METRICS: Local metric library cache pin wait check, id 11 succeed
kjgcr_ServiceGCR: KJGCR_METRICS: Local metric library cache pin wait check, id 11 failed
kjgcr_ServiceGCR: KJGCR_METRICS: Local metric library cache pin wait check, id 11 succeed
kjgcr_ServiceGCR: KJGCR_METRICS: Local metric library cache pin wait check, id 11 failed
kjgcr_ServiceGCR: KJGCR_METRICS: Local metric library cache pin wait check, id 11 succeed
kjgcr_ServiceGCR: KJGCR_METRICS: Local metric library cache pin wait check, id 11 failed
kjgcr_ServiceGCR: KJGCR_METRICS: Local metric library cache pin wait check, id 11 succeed
kjgcr_ServiceGCR: KJGCR_METRICS: Local metric library cache pin wait check, id 11 failed
kjgcr_ServiceGCR: KJGCR_METRICS: Local metric library cache pin wait check, id 11 succeed

grep "check lck heartbeat" n*2_lmhb_7340.trc
kjgcr_ChkGlobalMetric: metric 7 (check lck heartbeat) failed for dbname N*, inst 2, node 2
kjgcr_ChkGlobalMetric: metric 7 (check lck heartbeat) failed for dbname N*, inst 2, node 2
kjgcr_ChkGlobalMetric: metric 7 (check lck heartbeat) failed for dbname N*, inst 2, node 2
kjgcr_ChkGlobalMetric: metric 7 (check lck heartbeat) failed for dbname N*, inst 2, node 2
kjgcr_ChkGlobalMetric: metric 7 (check lck heartbeat) failed for dbname N*, inst 2, node 2
kjgcr_ChkGlobalMetric: metric 7 (check lck heartbeat) failed for dbname N*, inst 2, node 2
kjgcr_ChkGlobalMetric: metric 7 (check lck heartbeat) failed for dbname N*, inst 2, node 2
kjgcr_ChkGlobalMetric: metric 7 (check lck heartbeat) failed for dbname N*, inst 2, node 2
kjgcr_GrowResourceCache: LCP waits high, allowing res cache growth
kjgcr_GrowResourceCache: LCP waits high, allowing res cache growth
kjgcr_GrowResourceCache: LCP waits high, allowing res cache growth
kjgcr_GrowResourceCache: LCP waits high, allowing res cache growth
kjgcr_GrowResourceCache: LCP waits high, allowing res cache growth
kjgcr_GrowResourceCache: LCP waits high, allowing res cache growth
kjgcr_GrowResourceCache: LCP waits high, allowing res cache growth
kjgcr_GrowResourceCache: LCP waits high, allowing res cache growth
kjgcr_GrowResourceCache: LCP waits high, allowing res cache growth
kjgcr_GrowResourceCache: LCP waits high, allowing res cache growth
kjgcr_GrowResourceCache: LCP waits high, allowing res cache growth
参考文档:Error ORA-04031 in the Shared Pool with Huge Allocation in Memory Type "ges resource dynamic" or "ges enqueues" memory" (Doc ID 2631592.1)
For Oracle Versions >= 12.2 but BELOW 19.1
Bug:26405036 - VERY HIGH "GES ENQUEUES" ON THE SHARED POOL
If a problem fulfills all of the condition below, it is a duplicate of this problem.
1. The LMHB process keeps reporting "memory load check" failure.
[Example]
============================================================
kjgcr_StatCheckMEM: memory is low, free memory 18%, average 19%
kjgcr_ServiceGCR: KJGCR_METRICS: Local metric memory load check, id 10 failed
============================================================
2. The LMHB process reported "library cache pin wait check" failure.
[Example]
============================================================
kjgcr_ServiceGCR: KJGCR_METRICS: Global metric library cache pin wait check, id 11 failed
============================================================
3. The heap dump of the shared pool contains many object of description "ges resource dynamic".
Bug:27824540 - ORA-04031 ("SHARED POOL","UNKNOWN OBJECT","SGA HEAP(1,0)","GES RESOURCE DYNAMIC"
If a problem fulfills all of the conditions below, it is a duplicate of this problem.
1. Many objects "ges resource dynamic" are allocated in the shared pool.
2. Trace file of LMHB process traced Action 11 (kjgcr_GrowResourceCache)
was executed, and was not resetted.
============================================================
kjgcr_GrowResourceCache: LCP waits high, allowing res cache growth
============================================================
3. Trace file of LMHB process keep tracing failure of metric 7 (check lck heartbeat).
认为命中Bug:26405036 - VERY HIGH "GES ENQUEUES" ON THE SHARED POOL和Bug:27824540 - ORA-04031 ("SHARED POOL","UNKNOWN OBJECT","SGA HEAP(1,0)","GES RESOURCE DYNAMIC"。
其中Bug:26405036 fix在12.2.0.1.200114  ,Bug:27824540 fix在12.2.0.1.190716。
2.3 对应的解决方案

For Oracle Versions >= 12.2 but BELOW 19.1
Apply fix for Bug:26405036; see NOTE:26405036.8
Workaround: on R12.2 or above, Start pseudo reconfiguration by below command is workaround.
SQL> oradebug setmypid
SQL> oradebug lkdebug -m reconfig lkdebug
Apply fix for Bug:27824540; see NOTE:27824540.8

Workaround: There are 2 possible workarounds.
1) Disable the action 11.
SQL> oradebug dyn_gcr -a 11 -disable

Note: This oradebug command is available on 12.2 and later.
2) Disable the GES resource cache; set the initialization parameter "_ges_direct_free" to TRUE.
Note: Completely disabling GES resource cache may lead to some other side affects like contention on TM lock for insert statements,
so use this workaround with caution if there is still a need.


问题总结

数据库由于shared pool中ges resource dynamic和ges enqueues占用内存过多,shared pool不够用引起ora-4031和sga resize,进而影响了数据库整体性能。
通过重启实例后,ges resource dynamic空间释放,但ges enqueues还有8g,目前看已经缓解问题。最终建议通过补丁解决。

建 议:

oracle 12.2已过支持期,但基于hp-ux平台oracle 12.2.0.1.180417的26405036、27824540的merge patch不存在并且可能无法申请出来。

  • 建议1

    只安装补丁26405036 - VERY HIGH "GES ENQUEUES" ON THE SHARED POOL看问题是否解决。

  • 建议2

    升级RU到12.2.0.1.200114以上版本,已修复以上问题。

  • 建议3
    尝试使用bug中的workaround或重启实例做为临时的解决方案:
SQL> oradebug setmypid
SQL> oradebug lkdebug -m reconfig lkdebug
优先建议使用建议1,小补丁窗口时间短并且可以rolling。


本文作者:张 芳(上海新炬中北团队)

本文来源:“IT那活儿”公众号

文章版权归作者所有,未经允许请勿转载,若此文章存在违规行为,您可以联系管理员删除。

转载请注明本文地址:https://www.ucloud.cn/yun/129090.html

相关文章

  • ORA-4031错误深入解析

    摘要:总结一下产生错误的背景条件大量的硬解析产生了很多小碎片产生了大量的小碎片后突然来了一条大的语句需要解析。 想要彻底理解4031错误发生的原因就要了解SQL语句的执行过程以及Oracle共享内存的结构 showImg(https://segmentfault.com/img/bVR5eU?w=750&h=329);客户端与Oracle之间通信过程图 客户端输入sql语句,sql语句通...

    UsherChen 评论0 收藏0
  • ORA-4031错误深入解析

    摘要:总结一下产生错误的背景条件大量的硬解析产生了很多小碎片产生了大量的小碎片后突然来了一条大的语句需要解析。 想要彻底理解4031错误发生的原因就要了解SQL语句的执行过程以及Oracle共享内存的结构 showImg(https://segmentfault.com/img/bVR5eU?w=750&h=329 src=https://static.segmentfault.com/v-...

    PumpkinDylan 评论0 收藏0
  • 记一次Oracle控制文件分析情况

    摘要:如果一个数据库的所有控制文件都出了问题,那么这个数据库就需要进行恢复。因此,实际的商用数据库至少需要个一般为个控制文件,为了防止磁盘的物理故障,这些控制文件最好放在不同的物理磁盘上,而且最好放在不同的物理磁盘控制器上。 最近两天在从头看一些与DBA有关的书籍,有一章讲到了控制文件,现在整理一下有关的知识点及实际对照生产库对比的情况 控制文件的定义及引入的目的 Oracle数据库的控制...

    hersion 评论0 收藏0
  • 数据面临的挑战:当大数据遭遇云计算

    摘要:在未来十年,管理数据仓库的服务器的数量将增加倍以便迎合倍的大数据增长。毫无疑问,大数据将挑战企业的存储架构及数据中心基础设施等,也会引发云计算数据仓库数据挖掘商业智能等应用的连锁反应。 大数据正在彻底改变IT世界。那么,什么样的数据谈得上数据呢?   根据IDC的报告,未来十年全球大数据将增加50倍。仅在2011年,我们就将看到1.8ZB(也就是1.8万亿GB)的大数据创建产生。这相当...

    forsigner 评论0 收藏0
  • 数据面临的挑战:当大数据遭遇云计算

    摘要:毫无疑问,大数据将挑战企业的存储架构及数据中心基础设施等,也会引发云计算数据仓库数据挖掘商业智能等应用的连锁反应。        大数据正在彻底改变IT世界。那么,什么样的数据谈得上数据呢?  根据IDC的报告,未来十年全球大数据将增加50倍。仅在2011年,我们就将看到1.8ZB(也就是1.8万亿GB)的大数据创建产生。这相当于每位美国人每分钟写3条Tweet,而且还是不停地写2.6976...

    Berwin 评论0 收藏0

发表评论

0条评论

不知名网友

|高级讲师

TA的文章

阅读更多
最新活动
阅读需要支付1元查看
<