资讯专栏INFORMATION COLUMN

RAC双节点crash回复一例

IT那活儿 / 1542人阅读
RAC双节点crash回复一例

客户现场两节点库crash告警。运维人员紧急将数据库拉起,应用恢复。但启动后alert log 报错ORA-16191和ORA-01031,为DataGuard主备库密码文件不一致所致, 重建密码文件后, 故障解决。

 分析alert log发现:16:32,节点1读取控制文件发现坏块,紧接着16:33分实例无法正常读取控制文件导致crash,然后实例2在16:35关闭。经检查控制文件并未存在坏块,初步判定为数据库短暂读取控制文件失败导致BUG。 

发起SR,经SSC人员及SR后台专家共同确认为bug 11698676,该bug与bug  9549042为重复bug,并在patch 9549042上被fixed。 

2. 故障分析/处理

2.1 故障处理 

  4月5日16:34, ssyy库两节点相继crash, 紧急接入后确认两实例已被彻底关闭、监听仍然开启,紧急startup将两实例拉起,应用恢复连接至生产库。

  重启实例后,检查节点1 alert log 发现: 

Check that the primary and standby are using a password file

and remote_login_passwordfile is set to SHARED or EXCLUSIVE, 

and that the SYS password is same in the password files.

returning error ORA-16191

    提示为SYS主备库上密码文件不一致导致, 于是决定主库重建密码文件,并将新生成的密码文件拷至备库节点应用(操作前备份原密码文件,并更改主库SYS密码).

  分别在primary-rac两个节点上执行密码文件创建语句.

orapwd file=/oracle/db/oracle/product/11.1.0/db/dbs/ssyydb1 entries=5 force=y  password=*********

orapwd file=/oracle/db/oracle/product/11.1.0/db/dbs/ssyydb2 entries=5 force=y  password=*********

       分别将ssyydb1和ssyydb2依次拷至standby-rac节点1和节点2.   

  primary-rac1节点alert log 仍持续报错:

Errors in file /oracle/db/diag/rdbms/ssyy/ssyy1/trace/ssyy1_arc2_4134.trc:

ORA-01031: insufficient privileges

PING[ARC2]: Heartbeat failed to connect to standby drdb. Error is 1031.

     此时,主库节点1无法向备库节点1传送archive log. 查询MOS,ORA-01031仍为主备库密码文件不一致导致,怀疑主库归档进程使用了主机缓存密码文件导致,因归档进程为非关键进程,kill -9 后会重新启动,对当前数据库无影响。 

  依次kill主库节点1和节点2所有归档进程,节点1仍持续报错ORA-01031。

  sqlplus连接确认主备库上SYS密码已更改.

  检查新生成的密码文件是否已被应用:

--主库节点

SQL> select * from  v$pwfile_users;

USERNAME                       SYSDB SYSOP SYSAS

------------------------------ ----- ----- -----

SYS                            TRUE  TRUE  FALSE

--备库节点

SQL> select * from  v$pwfile_users;

no rows selected

     显然,主库密码文件已被应用,备库密码文件未被应用。

     仔细检查备库密码文件, 文件名未满足orapw<$ORACLE_SID>命名规则, 密码文件沿      用主库密码文件,但备库实例名区别于主库实例名。

     修改备库密码文件名:

mv $ORACLE_HOME/dbs/ssyydb1 $ORACLE_HOME/dbs/orapwdrdb1 

mv $ORACLE_HOME/dbs/ssyydb2 $ORACLE_HOME/dbs/orapwdrdb2

     持续观察几分钟,ORA-01031错误未解决. 

  查询MOS,参照ORA-1031 for Remote Archive Destination on Primary (Doc ID 733793.1)解决方案操作.

1. Make sure parameter REMOTE_LOGIN_PASSWORDFILE is set to EXCLUSIVE or SHARED in both databases.  


2. Copy the password file again from primary : 


a. Defer the log_archive_dest_2 on primary: 

SQL> ALTER SYSTEM SET LOG_ARCHIVE_DEST_STATE_2 = DEFER; 


b. Copy/ftp the password file from primary to standby and rename it accordingly on the standby database. Creating the password file on standby with orapwd-utility is not supported for 11g anymore.

Make sure that name of password file on both primary and standby is : orapw. Name of the password file is case sensitive. If SID of database on standby is prod then name of the password file should be orapwprod, orapwPROD will not work. 


c. Enable the log_archive_dest_2 on primary: 

SQL> ALTER SYSTEM SET LOG_ARCHIVE_DEST_STATE_2 = ENABLE; 


d. Switch 2-3 log files on primary : 

SQL> ALTER SYSTEM SWITCH LOGFILE; 


e. Check the status of log_archive_dest_2 on primary. 

SQL> SELECT STATUS,ERROR FROM V$ARCHIVE_DEST WHERE DEST_ID =2; 

STATUS    ERROR 

--------- ----------------------------------------------------------------- 

VALID 

     持续跟踪主库节点alert log ,在持续ORA-01031报错3-5分钟后, 主库节点均能正常向备库节点传送archive log,备库实例也能正常应用archive log, 主库节点1和节点2 alert log 也未曾重现ORA-01031和ORA-16191.

     至此,故障全部解决! 

2.2 crash分析 

    

    首先,检查两节点syslog,无异常,排除主机因素。

     实例1 alert log:

Fri Apr 05 15:58:52 2013

Archived Log entry 34220 added for thread 1 sequence 12072 ID 0x9441c6d1 dest 1:

Fri Apr 05 16:32:39 2013

Read from controlfile member /dev/oravg/rlv_cntl1 has found a corrupted block (blk# 4, cf seq# 0)

Hex dump of (file 0, block 4) in trace file /oracle/db/diag/rdbms/ssyy/ssyy1/trace/ssyy1_lmon_22418.trc

Corrupt block relative dba: 0x00000004 (file 0, block 4)

Bad check value found during control file block read

Data in bad block:

type: 21 format: 2 rdba: 0x00000004

last change scn: 0x0000.00000000 seq: 0x1 flg: 0x04

spare1: 0x0 spare2: 0x0 spare3: 0x0

consistency value in tail: 0x00001501

check value in block header: 0x8f5d

computed block checksum: 0x2

Re-read from controlfile member /dev/oravg/rlv_cntl1 returned valid block 4

Hex dump of (file 0, block 4) in trace file /oracle/db/diag/rdbms/ssyy/ssyy1/trace/ssyy1_lmon_22418.trc

Errors in file /oracle/db/diag/rdbms/ssyy/ssyy1/trace/ssyy1_lmon_22418.trc:

ORA-00202: control file: /dev/oravg/rlv_cntl1

Errors in file /oracle/db/diag/rdbms/ssyy/ssyy1/trace/ssyy1_lmon_22418.trc  (incident=888259):

ORA-00227: corrupt block detected in control file: (block 4, # blocks 1)

ORA-00202: control file: /dev/oravg/rlv_cntl1

Incident details in: /oracle/db/diag/rdbms/ssyy/ssyy1/incident/incdir_888259/ssyy1_lmon_22418_i888259.trc

Fri Apr 05 16:33:24 2013

Errors in file /oracle/db/diag/rdbms/ssyy/ssyy1/trace/ssyy1_lmon_22418.trc:

ORA-00227: corrupt block detected in control file: (block 4, # blocks 1)

ORA-00202: control file: /dev/oravg/rlv_cntl1

LMON (ospid: 22418): terminating the instance due to error 227

     16:32:39,实例1在读控制文件/dev/oravg/rlv_cntl1的时候出错,发现坏块。

     16:33:24,实例1因无法正常读取控制文件导致实例crash。 

     检查三个控制文件,未发现坏块。

ssyy1: dbv file=/dev/datavg02/rlv_cntl1 blocksize=16384

ssyy1: dbv file=/dev/datavg02/rlv_cntl2 blocksize=16384

ssyy1: dbv file=/dev/datavg02/rlv_cntl3 blocksize=16384

     

     查看节点2 crsd.log: 16:35:23由于数据库异常offline,CRS停掉实例2.

2013-04-05 16:32:42.179: [  CRSRES][6345673] Resource recovery not purged:ora.ssyy.ssyy2.inst

2013-04-05 16:32:42.205: [  CRSRES][6345673] ora.ssyy.ssyy2.inst target set to OFFLINE before stop action

2013-04-05 16:32:42.206: [  CRSRES][6345673] StopResource: setting CLI values

2013-04-05 16:32:42.252: [  CRSRES][6345673] Attempting to stop `ora.ssyy.ssyy2.inst` on member `ssyy2`

2013-04-05 16:33:40.826: [    CRSD][54] SM: rE2Ec: 4

2013-04-05 16:33:40.896: [  CRSRES][6345681] ora.ssyy.db target set to OFFLINE before stop action

2013-04-05 16:33:40.896: [  CRSRES][6345681] StopResource: setting CLI values

2013-04-05 16:33:42.288: [    CRSD][6345681] SM:dE2Ec: all E2E cmds done. 0

2013-04-05 16:35:23.123: [  CRSRES][6345695] Resource recovery not purged:ora.ssyy.db

2013-04-05 16:35:23.124: [  CRSRES][6345695] `ora.ssyy.db` is already OFFLINE.

2013-04-05 16:35:23.173: [  CRSRES][6345673] Stop of `ora.ssyy.ssyy2.inst` on member `ssyy2` succeeded.

     

     初步怀疑为bug导致, 发起SR,经SSC人员及SR后台专家共同确认,命中bug 11698676。

     该bug与bug 9549042为重复bug, 在当前HP-UX Itanium 64 bit 平台下,有现成patch 9549042。

2.3 解决方案 

     官方建议,尽快打patch 9549042, 以规避此crash故障再现。


文章版权归作者所有,未经允许请勿转载,若此文章存在违规行为,您可以联系管理员删除。

转载请注明本文地址:https://www.ucloud.cn/yun/130244.html

相关文章

发表评论

0条评论

IT那活儿

|高级讲师

TA的文章

阅读更多
最新活动
阅读需要支付1元查看
<