OGG的extract进程checkpoint时间点回到1988-01-01 00:00:00故障处理

1、故障现象

    Extract进程(SEXTR01)状态为running,但是Lag at Chkpt却达到5个多小时,且时间一直在增长,根本就不抽取新日志,状态信息如下:

GGSCI (calladgdb) 21> info all

 

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

 

MANAGER     RUNNING                                          

EXTRACT     RUNNING     DPEYWGL     00:00:00      00:00:04   

EXTRACT     RUNNING     SEXTR01     05:14:58      00:00:03   

 

2、故障分析

2.1 debug一下进程状态信息

GGSCI (calladgdb) 23> info sextr01, showch debug

 

EXTRACT    SEXTR01   Last Started 2015-01-25 22:51   Status RUNNING

Checkpoint Lag       05:21:26 (updated 00:00:04 ago)

Log Read Checkpoint  Oracle Redo Logs

                     2015-01-25 18:05:16  Thread 1, Seqno 5554, RBA 611373536

                     SCN 3126.136656462 (13426204423758)

Log Read Checkpoint  Oracle Redo Logs

                     1988-01-01 00:00:00  Thread 2, Seqno 4016, RBA 1024

                     SCN 44527.110828288 (191242119617280)

 

 

Current Checkpoint Detail:

 

Read Checkpoint #1

……

 

Read Checkpoint #2

 

  Oracle Threaded Redo Log

 

  Startup Checkpoint (starting position in the data source):

    Thread #: 2

    Sequence #: 4015

    RBA: 640984080

    Timestamp: 2015-01-25 18:03:51.000000

    SCN: 3126.136602577 (13426204369873)

    Redo File: +DGROUP1/calldb/onlinelog/group_15.267.799673791

 

  Recovery Checkpoint (position of oldest unprocessed transaction in the data source):

    Thread #: 2

    Sequence #: 4015

    RBA: 640984080

    Timestamp: 2015-01-25 18:03:51.000000

    SCN: 3126.136602577 (13426204369873)

    Redo File: +DGROUP1/calldb/onlinelog/group_15.267.799673791

 

  Current Checkpoint (position of last record read in the data source):

    Thread #: 2

    Sequence #: 4016

    RBA: 1024

    Timestamp: 1988-01-01 00:00:00.000000

    SCN: 44527.110828288 (191242119617280)

    Redo File: +DGROUP1/calldb/onlinelog/group_16.266.799673793

……

 

从这里发现一个不可思议的问题:

      Thread #: 2的current checkpoint 的timestamp竟然是1988-01-0100:00:00.000000SCN的十进制数,竟然为191242119617280,远远大于当前数据库的SCN,多出一位数。

      时间倒退,SCN号反而大涨,时间还那么有个性。这个问题挺有意思。

2.2 view extract进程的report

    使用view report sextr01查看进程的报告,提示有ERROR,thread#2 4016 archive log 不能打开。(由于笔者没有将信息复制下来,所以当时的状态只能这么描述了),但是4016号archivelog实际物理文件是存在的,而且权限正常。

2.3 将进程停止下来,showch其状态

GGSCI (calladgdb) 24>stop sextr01

GGSCI (calladgdb) 25> info SEXTR01, showch

EXTRACT    SEXTR01   Last Started 2015-01-26 01:00   Status STOPPED

Checkpoint Lag       06:56:55 (updated 00:00:58 ago)

Log Read Checkpoint  Oracle Redo Logs

                     2015-01-25 18:05:16  Thread 1, Seqno 5554, RBA 611373536

                     SCN 3126.136656462 (13426204423758)

Log Read Checkpoint  Oracle Redo Logs

                     1988-01-01 00:00:00  Thread 2, Seqno 4016, RBA 1024

                     SCN 44527.110828288 (191242119617280)

 

Current Checkpoint Detail:

 

Read Checkpoint #1

……

 

Read Checkpoint #2

 

  Oracle Threaded Redo Log

 

  Startup Checkpoint (starting position in the data source):

    Thread #: 2

    Sequence #: 4015

    RBA: 640984080

    Timestamp: 2015-01-25 18:03:51.000000

    SCN: 3126.136602577 (13426204369873)

    Redo File: +DGROUP1/calldb/onlinelog/group_15.267.799673791

 

  Recovery Checkpoint (position of oldest unprocessed transaction in the data source):

    Thread #: 2

    Sequence #: 4015

    RBA: 640984080

    Timestamp: 2015-01-25 18:03:51.000000

    SCN: 3126.136602577 (13426204369873)

    Redo File: +DGROUP1/calldb/onlinelog/group_15.267.799673791

 

  Current Checkpoint (position of last record read in the data source):

    Thread #: 2

    Sequence #: 4016

    RBA: 1024

    Timestamp: 1988-01-01 00:00:00.000000

    SCN: 44527.110828288 (191242119617280)

    Redo File: +DGROUP1/calldb/onlinelog/group_16.266.799673793

……


3 分析原因

       进程的thread#2的checkpoint time竟然是1988-01-01 00:00:00.000000,肯定是不正常,结合error信息为找不到4016归档日志文件,以及进程的current checkpoint thread#2的sequence#:4016,而RBA却为1024。

既然RBA都已经运行取1024了,就不应该找不到archivelog的情况,如果4016号归档日志文件还没有开始抽取,RBA号就应该是0。

      根据此分析,决定手动的将extract进程的chkeckpoint恢得到recover chkeckpoint点,以及next extseqno调整到4016的0号RBA试一下,看能否解决。


4、解决办法

4.1 调整netxt sequno为4016,RBA为0

GGSCI (calladgdb)2> alter SEXTR01, thread 2, extseqno 4016, extrba 0

4.2 调整ioextseqno与ioextrba到进程的rechvercheckpoint状态

GGSCI (calladgdb)3> alter SEXTR01, thread 2, ioextseqno 4015, ioextrba 640984080

4.3 验证调整结果

GGSCI (calladgdb)4> info sextr01, showch

EXTRACT    SEXTR01   Initialized   2015-01-26 01:48   Status STOPPED

Checkpoint Lag       07:43:41 (updated 00:00:51 ago)

Log Read Checkpoint  Oracle Redo Logs

                     2015-01-25 18:05:16  Thread 1, Seqno 5554, RBA 611373536

                     SCN 3126.136656462 (13426204423758)

Log Read Checkpoint  Oracle Redo Logs

                     1988-01-01 00:00:00  Thread 2, Seqno 4016, RBA 0

                     SCN 0.0 (0)

 

Current Checkpoint Detail:

 

Read Checkpoint #1

……

 

Read Checkpoint #2

 

  Oracle Threaded Redo Log

 

  Startup Checkpoint (starting position in the data source):

    Thread #: 2

    Sequence #: 4016

    RBA: 0

    Timestamp: 1988-01-01 00:00:00.000000

    SCN: Not available

    Redo File: +DGROUP1/calldb/onlinelog/group_16.266.799673793

 

  Recovery Checkpoint (position of oldest unprocessed transaction in the data source):

    Thread #: 2

    Sequence #: 4015

    RBA: 640984080

    Timestamp:

    SCN: Not available

    Redo File:

 

  Current Checkpoint (position of last record read in the data source):

    Thread #: 2

    Sequence #: 4016

    RBA: 0

    Timestamp: 1988-01-01 00:00:00.000000

    SCN: Not available

    Redo File: +DGROUP1/calldb/onlinelog/group_16.266.799673793

……

   Start checkpoint和current chkeckpoint的sequence#都变成了4016,RBA号都变成了0

4.4 启动extract进程sextr01,并做brreset操作

GGSCI (calladgdb)6> startSEXTR01, brreset

GGSCI (calladgdb) 6> start SEXTR01, brreset

 

Sending START request to MANAGER ...

EXTRACT SEXTR01 starting

 

GGSCI (calladgdb) 7> info all

 

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

 

MANAGER     RUNNING                                           

EXTRACT     RUNNING     DPEYWGL     00:00:00      00:00:01   

EXTRACT     STARTING    SEXTR01     07:43:41      00:03:47   

 

4.5 多做几次进程状态信息显示,验证延迟下降效果

GGSCI (calladgdb) 10> info all

 

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

 

MANAGER     RUNNING                                          

EXTRACT     RUNNING     DPEYWGL     04:46:20      00:00:01   

EXTRACT     RUNNING     SEXTR01     04:23:13      00:00:01   

 

GGSCI (calladgdb) 11> info all

 

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

 

MANAGER     RUNNING                                          

EXTRACT     RUNNING     DPEYWGL     01:53:22      00:00:09   

EXTRACT     RUNNING     SEXTR01     01:27:25      00:00:08   

 

GGSCI (calladgdb) 12> info all

 

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

 

MANAGER     RUNNING                                          

EXTRACT     RUNNING     DPEYWGL     00:00:00      00:00:01   

EXTRACT     RUNNING     SEXTR01     00:00:02      00:00:00   

 

   通过几次info all查看进程状态,看到lag at chkpt时间快速下降。问题解决。

 

 

 

本文作者:黎俊杰(网名:踩点),从事”系统架构、操作系统、存储设备、数据库、中间件、应用程序“六个层面系统性的性能优化工作

欢迎加入 系统性能优化专业群,共同探讨性能优化技术。群号:258187244

郑重声明:本站内容如果来自互联网及其他传播媒体,其版权均属原媒体及文章作者所有。转载目的在于传递更多信息及用于网络分享,并不代表本站赞同其观点和对其真实性负责,也不构成任何其他建议。