Troubleshooting a Failure After a Successful Oracle Clusterware Installation
1. Environment
[grid@rac1 rac1]$ cat /etc/issue
Red Hat Enterprise Linux Server release 5.8 (Tikanga)
Kernel \r on an \m
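2. Problem description
While installing RAC, all nodes were shut down after grid (Clusterware) had been installed successfully. When the nodes were powered on again, checking the CRS resource status produced the following error:
[grid@rac1 ~]$ crs_stat -t -v
CRS-0184: Cannot communicate with the CRS daemon.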
3. Analysis and resolution
Check the CRS status:
[grid@rac1 ~]$ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services    # cannot communicate with CRS
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
Check the corresponding crsd log:
2014-11-21 15:18:13.490: [GIPCXCPT][1002185440] gipcShutdownF: skipping shutdown, count 2, from [ clsinet.c : 1732], ret gipcretSuccess (0)
2014-11-21 15:18:13.492: [GIPCXCPT][1002185440] gipcShutdownF: skipping shutdown, count 1, from [ clsgpnp0.c : 1021], ret gipcretSuccess (0)
2014-11-21 15:18:13.498: [ OCRASM][1002185440]proprasmo: Error in open/create file in dg [DATA]    # failed to open the disk group
[ OCRASM][1002185440]SLOS : SLOS: cat=7, opn=kgfoAl06, dep=15077, loc=kgfokge
ORA-15077: could not locate ASM instance serving a required diskgroup    # no ASM instance
2014-11-21 15:18:13.498: [ OCRASM][1002185440]proprasmo: kgfoCheckMount returned [7]
2014-11-21 15:18:13.498: [ OCRASM][1002185440]proprasmo: The ASM instance is down    # the ASM instance is shut down
2014-11-21 15:18:13.499: [ OCRRAW][1002185440]proprioo: Failed to open [+DATA]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2014-11-21 15:18:13.499: [ OCRRAW][1002185440]proprioo: No OCR/OLR devices are usable
2014-11-21 15:18:13.499: [ OCRASM][1002185440]proprasmcl: asmhandle is NULL
2014-11-21 15:18:13.499: [ OCRRAW][1002185440]proprinit: Could not open raw device
2014-11-21 15:18:13.499: [ OCRASM][1002185440]proprasmcl: asmhandle is NULL
2014-11-21 15:18:13.499: [ OCRAPI][1002185440]a_init:16!: Backend init unsuccessful : [26]
2014-11-21 15:18:13.499: [ CRSOCR][1002185440] OCR context init failure. Error: PROC-26: Error while accessing the physical storage ASM error [SLOS: cat=7, opn=kgfoAl06, dep=15077, loc=kgfokge
ORA-15077: could not locate ASM instance serving a required diskgroup
] [7]
2014-11-21 15:18:13.499: [ CRSD][1002185440][PANIC] CRSD exiting: Could not init OCR, code: 26
2014-11-21 15:18:13.499: [ CRSD][1002185440] Done.
The log shows that the ASM instance failed to start, which in turn prevented the crsd process from starting.
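When a crsd log is long, it can help to pull out just the distinct error codes before reading it line by line. The following is a minimal sketch, not part of the original troubleshooting session; the function name extract_error_codes and the choice of prefixes (ORA, PROC, CRS) are assumptions made for illustration, and the sample text is a short excerpt of the log above.

```python
import re

# Matches Oracle-style error codes such as ORA-15077 or PROC-26.
# The set of prefixes is an assumption chosen for this example.
ERROR_CODE = re.compile(r"\b(?:ORA|PROC|CRS)-\d+\b")


def extract_error_codes(log_text):
    """Return the distinct error codes in order of first appearance."""
    seen = []
    for match in ERROR_CODE.finditer(log_text):
        code = match.group(0)
        if code not in seen:
            seen.append(code)
    return seen


# Short excerpt of the crsd log quoted above.
sample = """\
2014-11-21 15:18:13.498: [ OCRASM][1002185440]proprasmo: Error in open/create file in dg [DATA]
ORA-15077: could not locate ASM instance serving a required diskgroup
2014-11-21 15:18:13.499: [ CRSOCR][1002185440] OCR context init failure. Error: PROC-26: Error while accessing the physical storage
2014-11-21 15:18:13.499: [ CRSD][1002185440][PANIC] CRSD exiting: Could not init OCR, code: 26
"""

print(extract_error_codes(sample))
```

For this excerpt the function reports ORA-15077 first and PROC-26 second, which matches the reading of the log: the OCR could not be opened because no ASM instance was serving the disk group.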
Try to start the ASM instance manually:
[grid@rac1 ~]$ asmcmd
Connected to an idle instance.
ASMCMD> startup
ORA-27154: post/wait create failed
ORA-27300: OS system dependent operation:semget failed with status: 28
ORA-27301: OS failure message: No space left on device
ORA-27302: failure occurred at: sskgpsemsper
Connected to an idle instance.
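The messages above show that the failing operation is semget.
semget obtains a semaphore set (get set of semaphores). "No space left on device" here does not refer to storage space but to semaphore resources.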
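The status 28 in ORA-27300 is an operating system errno value, and the text in ORA-27301 is simply the standard message for that value. As a small sketch (the helper name describe_errno is hypothetical), Python's standard errno and os modules can be used to look it up. On Linux, semget(2) returns this error when creating a set would exceed the SEMMNI or SEMMNS limit.

```python
import errno
import os


def describe_errno(status):
    """Return (symbolic_name, message) for an OS errno value."""
    name = errno.errorcode.get(status, "UNKNOWN")
    return name, os.strerror(status)


# 28 is the status reported in the ORA-27300 line above.
name, message = describe_errno(28)
print(name)     # symbolic errno name
print(message)  # message text supplied by the C library
```

On a Linux host the symbolic name is ENOSPC, the same code used for a full disk, which is why the message talks about a device even though the exhausted resource is semaphores.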
Check semaphore usage on the system:
[grid@rac1 ~]$ ipcs

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 3407873    root       644        80         2
0x00000000 3440643    root       644        16384      2
0x00000000 3473412    root       644        280        2

------ Semaphore Arrays --------
key        semid      owner      perms      nsems

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
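To compare current usage against the kernel limits, the semaphore section of this listing can be totalled. The sketch below assumes the column layout shown above (key, semid, owner, perms, nsems); the function name summarize_semaphores is hypothetical, and the rows in the second sample are invented purely to exercise the parser.

```python
def summarize_semaphores(ipcs_s_output):
    """Return (number_of_sets, total_semaphores) from `ipcs -s` style text."""
    sets = 0
    total = 0
    for line in ipcs_s_output.splitlines():
        fields = line.split()
        # Data rows have five columns and start with a hex key such as 0x00000000.
        if len(fields) == 5 and fields[0].startswith("0x"):
            sets += 1
            total += int(fields[4])  # nsems column
    return sets, total


# Semaphore section as it appeared in the session above (no sets allocated).
empty_listing = """\
------ Semaphore Arrays --------
key        semid      owner      perms      nsems
"""

# Hypothetical rows, not taken from the original system.
hypothetical_listing = """\
------ Semaphore Arrays --------
key        semid      owner      perms      nsems
0x00000000 98304      grid       640        152
0x00000000 131073     grid       640        125
"""

print(summarize_semaphores(empty_listing))
print(summarize_semaphores(hypothetical_listing))
```

For the listing captured in this session the result is zero sets and zero semaphores, so nothing was holding semaphores at the time.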
Nothing abnormal was found. Next, check semmns among the kernel parameters:
[root@rac1 ~]# sysctl -a|grep sem
kernel.sem = 250 100 32 128
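The four values are, in order:
semmsl --- the maximum number of semaphores per semaphore set; this value should be about 10 greater than the maximum number of Oracle processes
semmns --- the total number of semaphores in the system
semopm --- the maximum number of operations per semaphore operation call
semmni --- the number of semaphore set identifiers, which limits how many semaphore sets can exist at any one time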
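The field order is easy to mix up, so a small parser that names the four values can make the problem visible. This is an illustrative sketch only; SemLimits and parse_kernel_sem are names introduced here, and the input string is the kernel.sem value printed by sysctl above.

```python
from collections import namedtuple

# Field order of kernel.sem: semmsl, semmns, semopm, semmni.
SemLimits = namedtuple("SemLimits", "semmsl semmns semopm semmni")


def parse_kernel_sem(value):
    """Parse the four whitespace-separated integers of kernel.sem."""
    parts = value.split()
    if len(parts) != 4:
        raise ValueError("kernel.sem needs four integers, got: %r" % value)
    return SemLimits(*(int(p) for p in parts))


limits = parse_kernel_sem("250 100 32 128")
print(limits)

# semmns is the system-wide cap, so it bounds what all sets together may hold.
print("system-wide semaphores allowed:", limits.semmns)
# A single full-size set (semmsl semaphores) does not even fit under semmns here.
print("one full-size set fits:", limits.semmsl <= limits.semmns)
```

With these values the system-wide cap is 100 semaphores while a single set may ask for up to 250, so a large semget request can fail even though no semaphores are currently in use.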
With semmns at only 100, the system-wide semaphore limit is too low. Raise the semaphore limits in /etc/sysctl.conf:
kernel.sem = 256 32768 100 228
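Editing /etc/sysctl.conf does not by itself change the running kernel; the file is loaded at boot or with sysctl -p, and the live value can be read from /proc/sys/kernel/sem. The sketch below compares a configured value with a running one. The function names are hypothetical, and the strings passed in stand in for the contents of those two files.

```python
import re


def configured_kernel_sem(sysctl_conf_text):
    """Return the last kernel.sem setting in sysctl.conf-style text, or None."""
    value = None
    for line in sysctl_conf_text.splitlines():
        line = line.strip()
        # Lines starting with '#' or ';' are comments in sysctl.conf.
        if not line or line[0] in "#;":
            continue
        match = re.match(r"kernel\.sem\s*=\s*(.+)$", line)
        if match:
            value = tuple(int(p) for p in match.group(1).split())
    return value


def is_applied(sysctl_conf_text, running_text):
    """True when the running kernel.sem equals the configured one."""
    running = tuple(int(p) for p in running_text.split())
    return configured_kernel_sem(sysctl_conf_text) == running


conf = "# semaphore limits\nkernel.sem = 256 32768 100 228\n"
print(is_applied(conf, "250 100 32 128"))        # old values still active
print(is_applied(conf, "256\t32768\t100\t228"))  # /proc separates fields with tabs
```

On a real host the two arguments would come from reading /etc/sysctl.conf and /proc/sys/kernel/sem; a False result suggests the new limits have not been loaded yet.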
Start the ASM instance again:
ASMCMD> startup
ORA-03113: end-of-file on communication channel
Connected to an idle instance.
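Being in a hurry to continue the experiment, I simply rebooted both nodes at this point. After the reboot the ASM instance started normally and the CRS resources were in a normal state, so the problem was resolved.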
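After the experiment was finished I looked up ORA-03113. Possible causes of this error include:
1) Improper Unix kernel parameter settings
2) Incorrect permissions on the Oracle executables, or environment variable problems
3) The client cannot handle the communication correctly
4) Database server crash, operating system crash, or a killed process
5) An Oracle internal error
6) An error triggered by specific SQL or PL/SQL
7) Insufficient space
8) Firewall problems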
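Unfortunately, because the faulty environment no longer existed, this error could not be investigated and resolved. These notes are kept only for future reference.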
4. References
1) [Oracle 11g RAC CRS-4535/ORA-15077] http://blog.csdn.net/l106439814/article/details/8969060
2) [ASM startup error ORA-27300, ORA-27301 and ORA-27302: failure occurred at: sskgpsemsper] http://www.51itstudy.com/33735.html
3) [DBA notes: handling shared memory that cannot be released properly] http://www.eygle.com/archives/2011/03/ipcs_semaphore.html
4) [ORA-03113: end-of-file on communication channel, locating the error] http://www.51itstudy.com/6628.html