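1. Background
The OGG extract process runs on one server of a four-node Oracle RAC. The database's data files are managed by ASM, so the OGG extract has to read ASM files. The Oracle RAC is now being migrated as a whole, from the current four-node cluster to a three-node cluster in another data center. As an aside, OGG is actually very stable: for one-way replication of individual tables it had run for more than two years without a problem, right up until the colleagues who implemented and operated it had all left and this migration came along.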
2. Migration
After the RAC migration was complete, we packaged the OGG extract-side binaries and configuration and copied them over. We first started the OGG manager process, which came up normally. We then started the extract process ext1, which failed and abended. The log file showed the following (the clause "error retrieving redo file name for sequence 300499, archived = 0, use_alternate = 0" repeats many times in the original message and is shown once here):
2016-01-15 19:19:48 ERROR OGG-00446 Oracle GoldenGate Capture for Oracle, ext1.prm: , error retrieving redo file name for sequence 300499, archived = 0, use_alternate = 0Not able to establish initial position for sequence 300499, rba 5170704.
2016-01-15 19:19:48 ERROR OGG-01668 Oracle GoldenGate Capture for Oracle, ext1.prm: PROCESS ABENDING.
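(It had been a long time since any of us had looked at OGG, so this was going to be trouble.)
Let's go over the OGG configuration and environment. The only change is that the RAC has one node fewer. OGG cannot find the fourth instance because the new environment has no fourth node; in other words, the extract source is missing one group of redo log files, hence the error. After a long time on Google I found no procedure for this kind of node removal. Eventually I found a document on MetaLink that explains how to correct the error.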
How to Configure GoldenGate Extract When Adding or Removing Redo Log Threads in an Oracle RAC?, OGG-00446 (Doc ID 1267901.1)
Disabling redo log threads
The purpose is to remove an existing RAC thread from goldengate extract so that extract will not capture from that thread.
1) Edit the extract parameter file to either remove these parameters or specify what is required depending on which threads you wish to enable. See THREADOPTIONS PROCESSTHREADS description below.
2) Disable the redo log threads.
3) The extract will abend because these threads are not available. Simply restart the extract as you now have added the THREADOPTIONS PROCESSTHREADS in the extract parameter file.
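I skimmed it and thought it looked simple: just change one setting. So I went off to dinner, planning to continue afterwards. I had no idea the real suffering was about to begin.
Back from dinner, I followed the steps in the document: I disabled thread 4 in the RAC database and dropped the redo log files of thread 4. Then, in the extract parameter file, I set THREADOPTIONS PROCESSTHREADS EXCEPT 4, and finally restarted the extract. While it was starting I thought this should do it. But the extract abended again straight away. What had gone wrong this time? Reading further down the document, I saw the problem: OGG thread numbers do not necessarily match RAC thread numbers. The order returned by running 'select distinct thread# from v$log;' in the database differed from the thread order in OGG.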
RAC THREAD#   OGG thread
-----------   ----------
1             1
2             2
4             3
3             4
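To make the mismatch concrete, here is a minimal Python sketch. It is only an illustration: the mapping is hard-coded from the table above, and the helper name `except_clause` is hypothetical, not part of OGG.

```python
# RAC thread# -> OGG thread number, copied from the table above.
RAC_TO_OGG = {1: 1, 2: 2, 4: 3, 3: 4}


def except_clause(removed_rac_thread: int) -> str:
    """Build the THREADOPTIONS line that excludes the OGG thread
    corresponding to the RAC redo thread that no longer exists."""
    ogg_thread = RAC_TO_OGG[removed_rac_thread]
    return f"THREADOPTIONS PROCESSTHREADS EXCEPT {ogg_thread}"


# RAC thread 4 was removed, but OGG knows it as thread 3.
print(except_clause(4))
```

The point of the lookup is that the number written after EXCEPT is OGG's thread number, which is not necessarily the RAC thread# that was disabled.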
So I had disabled the wrong instance as far as OGG was concerned. From OGG's point of view, it is the redo log files of instance 3 that can no longer be accessed. I quickly changed the extract parameter file to THREADOPTIONS PROCESSTHREADS EXCEPT 3 and restarted the extract. It still failed. I wondered whether ext1 was broken because of my earlier wrong change, so I tried another extract process, kh_ext, and got the same error:
2016-01-15 20:33:04 INFO OGG-01643 Oracle GoldenGate Capture for Oracle, kh_ext.prm: BOUNDED RECOVERY: CANCELED: for object pool 3: p14422_Redo Thread 4.
2016-01-15 20:33:04 INFO OGG-01579 Oracle GoldenGate Capture for Oracle, kh_ext.prm: BOUNDED RECOVERY: VALID BCP: CP.KH_EXT.000001577.
2016-01-15 20:33:04 INFO OGG-01629 Oracle GoldenGate Capture for Oracle, kh_ext.prm: BOUNDED RECOVERY: PERSISTED OBJECTS RECOVERED: <>.
2016-01-15 20:33:05 INFO OGG-00546 Oracle GoldenGate Capture for Oracle, kh_ext.prm: Default thread stack size: 10485760.
2016-01-15 20:33:05 ERROR OGG-00446 Oracle GoldenGate Capture for Oracle, kh_ext.prm: The number of Oracle redo threads (3) is not the same as the number of checkpoint threads (4). EXTRACT groups on RAC systems should be created with the THREADS parameter (e.g., ADD EXT, TRANLOG, THREADS 3, BEGIN…).
2016-01-15 20:33:05 ERROR OGG-01668 Oracle GoldenGate Capture for Oracle, kh_ext.prm: PROCESS ABENDING.
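The two numbers in that OGG-00446 message are the useful part: 3 redo threads in the database versus 4 threads in the extract's checkpoint. As a rough sketch, they can be pulled out of the log line with a regular expression. The function name `thread_mismatch` is hypothetical, and the pattern is only known to match the wording quoted in this post.

```python
import re

# The OGG-00446 line from the log above, shortened to the relevant sentence.
LINE = (
    "2016-01-15 20:33:05 ERROR OGG-00446 Oracle GoldenGate Capture for Oracle, "
    "kh_ext.prm: The number of Oracle redo threads (3) is not the same as "
    "the number of checkpoint threads (4)."
)

MISMATCH = re.compile(
    r"number of Oracle redo threads \((\d+)\) is not the same as "
    r"the number of checkpoint threads \((\d+)\)"
)


def thread_mismatch(line):
    """Return (redo_threads, checkpoint_threads), or None if the line
    does not contain the mismatch message."""
    m = MISMATCH.search(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


print(thread_mismatch(LINE))  # (3, 4)
```

The checkpoint still records four threads because the extract group was created on the four-node cluster, which is why changing only the parameter file did not help.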
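The error message says the thread counts do not match. Well, how could they?
What now? A simple EXCEPT setting was not going to fix this.
Should I rebuild everything? With that many tables and that much data, could it all be loaded into the other database again? Well, I could at least try deleting and re-creating the extract. I found another document on how to re-create an extract.
First, run info ext ext1,showch and save the extract's key checkpoint information.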
GGSCI (webrac2) 27> info ext ext1,showch
EXTRACT EXT1 Last Started 2015-12-24 10:02 Status ABENDED
Checkpoint Lag 00:00:00 (updated 03:24:58 ago)
Log Read Checkpoint Oracle Redo Logs
2016-01-15 16:06:23 Thread 1, Seqno 378433, RBA 110834248
SCN 1508.2302597186 (6479113279554)
Log Read Checkpoint Oracle Redo Logs
2016-01-15 16:06:26 Thread 2, Seqno 319453, RBA 1128976
SCN 1508.2302597714 (6479113280082)
Log Read Checkpoint Oracle Redo Logs
2016-01-15 16:06:30 Thread 4, Seqno 300524, RBA 704000
SCN 1508.2302597781 (6479113280149)
Log Read Checkpoint Oracle Redo Logs
2016-01-15 16:06:30 Thread 3, Seqno 338237, RBA 813568
SCN 1508.2302597925 (6479113280293)
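Copying these values by hand is error-prone, so here is a small Python sketch that parses the checkpoint lines shown above. The function name `parse_checkpoints` is hypothetical. Two things are assumptions on my part: that the dotted SCN is wrap.base with SCN = wrap * 2^32 + base (the arithmetic does hold for all four values shown), and that the position of each entry in the output is the OGG thread number (this agrees with the mapping table earlier).

```python
import re

# Checkpoint lines copied from the showch output above.
SHOWCH = """\
Log Read Checkpoint Oracle Redo Logs
2016-01-15 16:06:23 Thread 1, Seqno 378433, RBA 110834248
SCN 1508.2302597186 (6479113279554)
Log Read Checkpoint Oracle Redo Logs
2016-01-15 16:06:26 Thread 2, Seqno 319453, RBA 1128976
SCN 1508.2302597714 (6479113280082)
Log Read Checkpoint Oracle Redo Logs
2016-01-15 16:06:30 Thread 4, Seqno 300524, RBA 704000
SCN 1508.2302597781 (6479113280149)
Log Read Checkpoint Oracle Redo Logs
2016-01-15 16:06:30 Thread 3, Seqno 338237, RBA 813568
SCN 1508.2302597925 (6479113280293)
"""

CHECKPOINT = re.compile(
    r"Thread (\d+), Seqno (\d+), RBA (\d+)\s+SCN (\d+)\.(\d+) \((\d+)\)"
)


def parse_checkpoints(text):
    """Return one dict per redo thread checkpoint found in the text."""
    rows = []
    for position, m in enumerate(CHECKPOINT.finditer(text), start=1):
        rac_thread, seqno, rba, wrap, base, scn = map(int, m.groups())
        # Sanity check: the dotted SCN should agree with the decimal SCN.
        if wrap * 2**32 + base != scn:
            raise ValueError(f"SCN mismatch for thread {rac_thread}")
        rows.append({
            "ogg_thread": position,    # assumption: order in the output
            "rac_thread": rac_thread,  # the "Thread N" value as printed
            "seqno": seqno,
            "rba": rba,
            "scn": scn,
        })
    return rows


for row in parse_checkpoints(SHOWCH):
    print(row)
```

Under these assumptions the third entry comes out as OGG thread 3 with RAC thread 4 and seqno 300524.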
Then delete and re-create the extract.
delete ext ext1
add ext ext1 ,begin now,tranlog,threads 3
alter extract ext1,tranlog,extseqno 378433,extrba 110834248,thread 1
alter extract ext1,tranlog,extseqno 319453,extrba 1128976,thread 2
alter extract ext1,tranlog,extseqno 338237,extrba 813568,thread 4
alter extract ext1,tranlog,extseqno 300524, extrba 704000, thread 3
alter extract ext1,ioextseqno 378433,ioextrba 110826000,thread 1
alter extract ext1,ioextseqno 319453,ioextrba 1047056,thread 2
alter extract ext1,ioextseqno 338237,ioextrba 812560,thread 4
alter extract ext1,ioextseqno 300499, ioextrba 5170704,thread 3
add exttrail /u01/app/ogg/ggstrail/et ,seqno 36154,rba 56591320,extract ext1
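There was a dilemma here: should thread 4 be created or not? Without it, the data would be inconsistent and some transactions would not match up. With it, where was I supposed to find instance 4?
I tried both ways and neither worked. At one point I had to kill the OGG operating-system processes by hand and restart OGG before I could continue.
By this time it was late at night.
We stopped and thought it over. Redo everything? With that much data? We dug out the document written when OGG was originally set up and read through it. On the OGG target side, the replicat had been started with start kh_rep, aftercsn xyz. In other words, it applies only transactions after the given SCN. So couldn't we simply skip the SCN range covered by the migration? We could be certain there had been no business activity during that window.
Re-create the EXTRACT on the source side, configured to begin extracting from now, with the thread count set to 3. That effectively achieves a rebuild, and the data on the target does not have to be deleted and reloaded.
So we looked up the current SCN of the new three-node database and re-created the EXTRACT. This time it succeeded quickly. Fortunately, the RAC migration had been done with a Data Guard (DG) role switch.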
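The approach can be summarized as a short Python sketch that only assembles the GGSCI command strings. It is a sketch of the idea, not a tested procedure. The function name `rebuild_commands` is hypothetical. The SCN passed in below is illustrative: it is the highest checkpoint SCN from the showch output, not the value we actually used. Trail handling (add exttrail and its position) is deliberately left out and depends on your setup.

```python
def rebuild_commands(extract, threads, replicat, after_scn):
    """Return (source_cmds, target_cmds) as lists of GGSCI command strings.

    source_cmds re-create the extract so that it starts from "now" with
    the new thread count; target_cmds start the replicat after the SCN.
    """
    source_cmds = [
        f"delete extract {extract}",
        f"add extract {extract}, tranlog, threads {threads}, begin now",
    ]
    target_cmds = [
        f"start replicat {replicat}, aftercsn {after_scn}",
    ]
    return source_cmds, target_cmds


# Names taken from this post; the SCN is a placeholder value.
source, target = rebuild_commands("ext1", 3, "kh_rep", 6479113280293)
for cmd in source + target:
    print(cmd)
```

Skipping a range of SCNs like this is only safe because no business transactions ran during the migration window.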
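Finally, we modified a row in one of the replicated tables on the source database and saw the change arrive on the target shortly afterwards, confirming that OGG replication was working.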
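3. Summary
For one-way replication of individual tables, OGG is quite stable in normal use. For a database migrated across clusters like this one, you can simply re-create the EXTRACT before starting it, without changing any other configuration.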