Oracle Standby Redo Transport SYNC mode Tracing

Standby database is replica of Primary database synchronized using redo log (archived redo log). Redo log generated in Primary is send to standby depending upon transport settings in primary. It can be send using Log Write or Archiver process.

There are 3 Dataguard protection modes. They are Maximum Availability, Maximum Performance and Maximum Protection. In Maximum Availability, SYNC mode with AFFIRM or NOAFIRM is supported. In Maximum Performance, ASYNC mode with NOAFFIRM is supported. In Maximum Protection, SYNC mode with AFFIRM is supported. Maximum Protection requires an ACK from RFS process in the standby side to the LGWR in the primary side to complete the redo transport.

One of the options that can be set for Redo Transport is SYNC and AFFIRM (Maximum Availability mode). In this article I’m trying debugging a SYNC mode transport done by log writer (LGWR). In a SYNC mode LOG_ARCHIVE_DEST_n is set using LGWR SYNC AFFIRM with NET_TIMEOUT set to 30.

Redo transport in Dataguard Standby environment is done using LGWR-> LNS(NSSn processes) -> RFS in destination side. LNS is sends as messages over network using SQL Net. LGWR signals one of the NSS(NetServer) process that it has redo blocks to be transferred over to the standby side. These blocks are packaged into messages and NSS process establish a connection to the RFS process in the standby side, then the message is transmitted.

There are other bunch of process used for pings and heartbeats. Some processes are used for Data guard broker synchronization. If you see excessive ‘log file sync’ wait due to standby, tracing will be useful to see what is going on under the hood. To enable redo transport tracing, debug level tracing can be enabled. These TRACING can generate huge trace files, so not advanced to enable more than few minutes.
In the LGWR trace, you will see

Relevant trace files:
/u01/app/oracle/diag/rdbms/prim/PRIM/trace/PRIM_lgwr_14921.trc
/u01/app/oracle/diag/rdbms/prim/PRIM/trace/PRIM_nss2_14995.trc
/u01/app/oracle/diag/rdbms/prim/PRIM/trace/ST2BY_rfs_67112.trc

LGWR – send message to NSS2:
===========================
*** 2018-11-06 15:30:55.716
kcrfwl: called by a message
kcrfwl: lst = 0, slt = 0
kcrfnl: ctxp = 0xaef49678, rtp = 0xaef494bc, dip = 0xaef49378, stucklog = 0xaef49674, ckp = 0xaef495d4
*** 2018-11-06 15:30:55.759 5278 krsl.c
Posting queued network I/O #0 [1,68,0x6022f400,512,0×44]
Post: queued network I/O #0 [1,68,0x6022f400,512] [0xf8c3dd98,0xf8c3dda8]
Submitting queued network I/O #0 [1,68,0x0x6022f400,512] [0xf8c3dd98,0xf8c3dda8]
krsl_redo_spool_int: [SLA] – Copying the primary info to the rfsdef for ptim 1002209454 pscn [0x0000.01a4860f] pseq 7600
Archiving block 68 count 1 block(s) to ‘st2by’
New Clock for LGWR-LNS ‘RFSWRITE’ for dest ‘LOG_ARCHIVE_DEST_2’:Begin time <03/11/2018 15:30:55> NET_TIMEOUT seconds
Maxconnections 0 for dest ‘LOG_ARCHIVE_DEST_2’ and mode=3
krsu_upi_rfs: mode:3 adip=0x7f9234999818, pdip=0x7f9233cf07f8, ocis=0x7f9233cf0a88, adip_lab=0, pdip_lab=0 mountid:0xd923ec1a
*** 2018-11-06 15:30:55.759 1073 krsu.c
Issuing standby archive of block 68 count 1 to ‘st2by’ to RFS pid [67112] instnum [1]
*** 2018-11-06 15:30:55.759 1274 krsu.c
RFS message number 288 (RFSWRITE)
Old rfsgdbn: PRIM
New rfsgdbn: PRIM
LGWR: Starting
rfsp: 0x7f9234999900
rfsmod: 3
rfsver: 10
rfsflag: 0x20882
rfscount:288
rfspclt: 3
rfscpid: 14921
rfsbuf: (nil)
rfsbufl: 0
rfsrfnl: 0
rfsrfn: (nil)
rfsrfn:
rfsdbi: 0xfdaf6ff5
rfsmid: 0xfff8fa3c
rfsacid: 0xfdb8d689
rfslno: 1
rfsthrd: 1
rfsseq: 7600
rfsabf: 10
rfsabz: 2048
rfscflg: 0x1
rfssblkn:68
rfsblksz:512
rfsblkct:1
rfsalosb:27559443
rfsalosw:0
rfsalost:1002209388
rfsalnsb:4294967295
rfsalnsw:65535
rfsalnst:0
rfsalbct:68
rfsaldst:2
rfsreopn:300
rfsdelay:0
rfscln: 2
rfscsq: 7600
rfsalsiz:102400
rfsrfx: 0x7f9234999878
rfsgdbn: PRIM
rfsnam: 0x7fffaef481c0
rfsnaml: 512
rfsnamrl: 0x7fffaef48444
rfsarb: 0xc10fc80
rfsarbl: 64
rfsarrl: (nil)
rfsarb.kccicrls:0:1
rfsarb.kccicrlc:963831477
rfsflg2: 0x20
rfscmp: 0x0
rfsdgid: 0x930db0e3
rfxp: 0x7f9234999878
rfxalmid: 0x0
rfxrflg: 0x0
rfxdgid:0x0
rfxogapflg:0x0
rfsarb.kccicrls:0:1
rfsarb.kccicrlc:963831477
rfsfob: (nil)
rfsfobl: 0
rfsfobrl: (nil)
rfsinm: (nil)
NetServer pid:14975
*** 2018-11-06 15:30:55.760 6700 krsu.c
Making upirfs mod=3 thr=1 seq=7600 blk=68 count=1 to NSS2 [pid=14975] asyncblks=0 syncparallel=1 nettimeout seconds begin_op_time <03/11/2018 15:30:55>
LGWR has 1 parts to send to NetServer
DEBUG_RFSARB: Copying rfsarb from LGWR to sync LNS
Sending part 0
Sent part 0 successfully

NSS2 – receive message from LGWR & send redo to RFS:
====================================================
*** 2018-11-06 15:30:55.763 2096 krsw.c
NSS2: Request to perform RFS operation (RFSWRITE) message number 288 [thr=1, seq=7600]
Checking queued network I/O #0 [1,68,0x6022f400,512]
NSS2: Sending current buffer size=512
NSS2: making the upirfs call
*** 2018-11-06 15:30:55.763 6408 krsu.c
Making upirfs call rfsmod=3 to RFS pid[67112] at instance number [1]
OCI HANDLE[krsu_wupi_rfs]:ocis=0x7f4c2e310d20, svrhp=0x7f4c2e1aa0c8, svchp=0x7f4c2e2fa870, errhp=0x7f4c2e2fa950, authhp=0x7f4c2e1a9610
Client sending SQLNET request data [kpurcs] oper=’Write’ flag=133250 thrd=1 seq=7600 msgid=288
*** 2018-11-06 15:30:55.763 5837 krsu.c
*** 2018-11-06 15:30:55.767 6058 krsu.c
… Client received SQLNET call [kpurcs] response oper=’Write’ flag=133250 thrd=1 seq=7600 msgid=288

RFS – recieve redo NSS2, complete write and send message to NSS2:
=================================================================
*** 2018-11-06 15:30:55.763 950 krsr.c
RFS message number 288 (RFSWRITE state:1 mode:3 flag:0x4000882) start
krsr_rfs_wda: Broadcast new conservative RTA boundary for thread 1 seq 7600
to block 68 for SYNC AFFIRM

*** 2018-11-06 15:30:55.764
Performing write of 1 blocks starting block 68
incoming data is not compressed

*** 2018-11-06 15:30:55.764
Calculate new RTA Threshold: [4752344]
krsr_rfs_dsp: – RFS mode 3, Current standby SCN [0x00b8.e8373a7f]
*** 2018-11-06 15:30:55.764 1468 krsr.c
RFS message number 288 (RFSWRITE state:1 mode:3 flag:0x4000882) complete

NSS2 – receive write complete from RFS send message to LGWR:
============================================================
*** 2018-11-06 15:30:55.767 6440 krsu.c
status of upirfs is 0
krsu_nss_wrp_upi_rfs: [BMR] – Retrieved valid apply SCN [0x0000.01a48658]
NSS2: Successful in sending current buffer
*** 2018-11-06 15:30:55.767 2318 krsw.c
NSS2: RFS operation (RFSWRITE) message number 288 [thr=1, seq=7600] completed. Status=0
NetServer: Successful
DEBUG_RFSARB: No copy from sync LNS to LGWR rfsarb
NSS2: sending response status=0 to LGWR

LGWR – acknowledge NSS2 message:
================================
*** 2018-11-06 15:30:55.767 6440 krsu.c
status of upirfs is 0
krsu_nss_wrp_upi_rfs: [BMR] – Retrieved valid apply SCN [0x0000.01a48658]
NSS2: Successful in sending current buffer
*** 2018-11-06 15:30:55.767 2318 krsw.c
NSS2: RFS operation (RFSWRITE) message number 288 [thr=1, seq=7600] completed. Status=0
NetServer: Successful
DEBUG_RFSARB: No copy from sync LNS to LGWR rfsarb
NSS2: sending response status=0 to LGWR
======================================================================

You can see how much time it takes from LGWR to RFSWRITE (standby log files). In the above experiment, it too approximately 50 milliseconds.
Started AT *** 2018-11-06 15:30:55.716
Complete *** 2018-11-06 15:30:55.767
LGWR will be waiting for all these time on ‘log file sync’. An additional wait event that you see related to this is “LGWR-LNS WAIT ON CHANNEL”.
These is very helpful to identify delay in terms of Network transport or Standby side Standby log file I/O.