RAC node not joining cluster – Grid Interprocess Communication (GIPC)

GIPC is the daemon used for inter-process communication between the nodes. It also supports Redundant Interconnect Usage. The GIPC daemon (gipcd.bin) has an endpoint of the form 'gipcha://nodename:xxx'. You can see this in the log entry below.

2019-03-14 23:26:56.429 : GIPCGEN:718243584: gipcRequestSaveInfo: client took long time 2300 ms for consuming request 0x7fbc2407e530 [000000001b23a413] { gipcReceiveRequest : peerName ‘gipcha://stone:nm2_primcluster/ed1f-1959-1be4-9d99’, data 0x7fbc24092df8, len 148, olen 148, off 0, parentEndp 0x7fbc4c065de0, ret gipcretSuccess (0), objFlags 0x0, reqFlags 0x4 } endp 0x7fbc4c065de0 [0000000000000517] { gipcEndpoint : localAddr ‘gipcha://roll:bdc3-a0d7-cd85-3ad9’, remoteAddr ‘gipcha://stone:nm2_primcluster/ed1f-1959-1be4-9d99’, numPend 1, numReady 0, numDone 3, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, readyRef 0x22973e0, ready 1, wobj 0x7fbc4c067d70, sendp (nil) status 0flags 0x20038606, flags-2 0x0, usrFlags 0x0 }
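
To pick the endpoint addresses out of an entry like this one, a small parser can help. The following is only a sketch: the function name gipc_endpoints and the regular expression are illustrative and not part of any Oracle tool, and the pattern assumes the field layout seen in the entry above (peerName, localAddr, remoteAddr followed by a quoted gipcha:// address).

```python
import re

# Matches fields such as: localAddr 'gipcha://roll:bdc3-a0d7-cd85-3ad9'
# Straight quotes and typographic quotes (U+2018, U+2019, U+2032) are both
# accepted, because logs pasted through a web page often carry the latter.
ENDPOINT_RE = re.compile(
    r"(peerName|localAddr|remoteAddr)\s+['\u2018\u2019]gipcha://"
    r"([^:'\u2018\u2019]+):([^'\u2018\u2019\u2032]+)['\u2018\u2019\u2032]"
)


def gipc_endpoints(log_line):
    """Return {field: (host, rest_of_address)} for each gipcha:// address found."""
    return {field: (host, rest) for field, host, rest in ENDPOINT_RE.findall(log_line)}


# Shortened copy of the entry above (hypothetical trimming, same addresses).
sample = (
    "gipcReceiveRequest : peerName 'gipcha://stone:nm2_primcluster/ed1f-1959-1be4-9d99', "
    "data 0x7fbc24092df8 } endp 0x7fbc4c065de0 { gipcEndpoint : "
    "localAddr 'gipcha://roll:bdc3-a0d7-cd85-3ad9', "
    "remoteAddr 'gipcha://stone:nm2_primcluster/ed1f-1959-1be4-9d99' }"
)
endpoints = gipc_endpoints(sample)
print(endpoints["localAddr"][0], "->", endpoints["remoteAddr"][0])
```

Here the local side is host roll and the remote side is host stone, which is how you tell which pair of nodes an entry refers to.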

Connection to this endpoint is very important when a new node is trying to bring up CRS. Assume a two-node cluster in which node 1 is up and running and node 2 is trying to join. Once the GIPC daemon has started successfully on node 2, the agent starts CSSD (node 2). CSSD (node 2) attempts to establish a connection with CSSD (node 1), and for that it needs to resolve the connect string of CSSD (node 1). CSSD (node 2) asks GIPCD (node 2) to resolve that connect string, and GIPCD (node 2) attempts to connect to GIPCD (node 1). Once the bootstrap connection between GIPCD (node 2) and GIPCD (node 1) is established, GIPCD (node 2) resolves the connect string of CSSD (node 1) and gets its interface details, and CSSD (node 2) then establishes its connection with CSSD (node 1).

Node 2 sends a gipchaUpperConnect request:

crsd.trc:2019-03-15 15:26:58.056 :GIPCHAUP:991872768: gipchaUpperConnect: initiated connect for umsg 0x7fcc2804fa50 { msg 0x7fcc280ce990, ret gipcretRequestPending (15), flags 0x6 }, msg 0x7fcc280ce990 { type gipchaMsgTypeConnect (3), srcPort ’23bb-7df3-f7df-0943′, dstPort ‘1ce0-223d-0365-c7c7′, srcCid 00000000-066ab136, cookie 00007fcc-2804fa50 } dataLen 0, endp 0x7fcc080101d0 [00000000066ab136] { gipchaEndpoint : port ’23bb-7df3-f7df-0943’, peer ‘:’, srcCid 00000000-066ab136, dstCid 00000000-00000000, numSend 0, maxSend 100, groupListType 1, hagroup 0x6000c50, priority 0, forceAckCount 0, usrFlags 0x4000, flags 0x0 } node 0x7fcc280564d0 { host ‘stone’, haName ’55a8-1766-1844-5cfb’, srcLuid 4f4247aa-20dc01d6, dstLuid 2dc39efd-83d14b11 numInf 1, sentRegister 1, localMonitor 0, baseStream 0x7fcc28037f60 type gipchaNodeType12001 (20), nodeIncarnation 6e8cc55a-df0ec036, incarnation 2, cssIncarnation 0, roundTripTime 4294967295 lastSeenPingAck 0 nextPingId 1 latencySrc 0 latencyDst 0 flags 0x80c}

Node 2 gets an ACK (gipchaUpperProcessConnectAck) from node 1 for the above request:

crsd.trc:2019-03-15 15:26:58.066 :GIPCHAUP:991872768: gipchaUpperProcessConnectAck: CONNACK completed umsg 0x7fcc2804fa50 { msg 0x7fcc280ce990, ret gipcretSuccess (0), flags 0xe }, msg 0x7fcc280ce990 { type gipchaMsgTypeConnect (3), srcPort ’23bb-7df3-f7df-0943′, dstPort ‘1ce0-223d-0365-c7c7′, srcCid 00000000-066ab136, cookie 00007fcc-2804fa50 } dataLen 0, hendp 0x7fcc080101d0 [00000000066ab136] { gipchaEndpoint : port ’23bb-7df3-f7df-0943’, peer ‘stone:1ce0-223d-0365-c7c7/440a-9c44-76e9-9e73’, srcCid 00000000-066ab136, dstCid 00000000-00001e8e, numSend 0, maxSend 100, groupListType 1, hagroup 0x6000c50, priority 0, forceAckCount 0, usrFlags 0x4000, flags 0x204 } node 0x7fcc280564d0 { host ‘stone’, haName ’55a8-1766-1844-5cfb’, srcLuid 4f4247aa-20dc01d6, dstLuid 2dc39efd-83d14b11 numInf 1, sentRegister 1, localMonitor 0, baseStream 0x7fcc28037f60 type gipchaNodeType12001 (20), nodeIncarnation 6e8cc55a-df0ec036, incarnation 2, cssIncarnation 0, roundTripTime 4294967295 lastSeenPingAck 0 nextPingId 1 latencySrc 0 latencyDst 0 flags 0x80c}

On node 1, GIPC has processed the request in gipchaUpperProcessAccept:

crsd.trc:2019-03-15 15:26:58.061 :GIPCHAUP:3581314816: gipchaUpperProcessAccept: completed new hastream 0x7f2dc003f770 { host ‘roll’, haName ‘6da7-92b8-0e29-3de4’ srcStreamId 00000000-00001e8e dstStreamId 00000000-066ab136 , hendp (nil) haNode 0x7f2dc4061f40 numInf 1, contigSeq 1, lastAck 0, lastValidAck 0, sendSeq [1 : 1], priority 0, duplicate recv 0, completed recv 0, completed send 0, total send 0, total recv 1, flags 0x1} for hendp 0x7f2dc003f920 [0000000000001e8e] { gipchaEndpoint : port ‘1ce0-223d-0365-c7c7/440a-9c44-76e9-9e73’, peer ‘:’, srcCid 00000000-00001e8e, dstCid 00000000-00000000, numSend 0, maxSend 100, groupListType 1, hagroup 0x558d830, priority 0, forceAckCount 0, usrFlags 0x4000, flags 0x0 }
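
The three entries above can be lined up by timestamp to confirm that the handshake completed. The sketch below assumes the trace layout shown in this post (timestamp, component, thread id, function name) and that the clocks on both nodes are in sync; the helper name handshake_timeline is made up for illustration.

```python
import re

# Timestamp, component, thread id and function name at the start of an entry.
# search() is used so that a "crsd.trc:" prefix added by grep is skipped.
ENTRY_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s*:\s*(\w+):(\d+):\s*(\w+):"
)
HANDSHAKE = (
    "gipchaUpperConnect",
    "gipchaUpperProcessAccept",
    "gipchaUpperProcessConnectAck",
)


def handshake_timeline(lines):
    """Return (timestamp, function) pairs for handshake entries, oldest first."""
    events = []
    for line in lines:
        match = ENTRY_RE.search(line)
        if match and match.group(4) in HANDSHAKE:
            events.append((match.group(1), match.group(4)))
    # The timestamp format sorts correctly as plain text.
    return sorted(events)


# Shortened copies of the entries above, in the order they appear in the post.
sample = [
    "crsd.trc:2019-03-15 15:26:58.056 :GIPCHAUP:991872768: gipchaUpperConnect: initiated connect",
    "crsd.trc:2019-03-15 15:26:58.066 :GIPCHAUP:991872768: gipchaUpperProcessConnectAck: CONNACK completed",
    "crsd.trc:2019-03-15 15:26:58.061 :GIPCHAUP:3581314816: gipchaUpperProcessAccept: completed new hastream",
]
timeline = handshake_timeline(sample)
for stamp, name in timeline:
    print(stamp, name)
```

With these entries the order is connect (node 2), accept (node 1), then connect ACK (node 2). If the accept or the ACK never shows up for a joining node, the GIPC connection to the surviving node is the place to look.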

If a node is not able to join the cluster and is stuck with CSSD not coming ONLINE, check that the GIPC daemon on every other node is online and able to communicate with the other nodes.

You may see that CRS is stuck at starting CSSD with the errors below in the CSSD log. The errors say that the network heartbeat (NHB) is missing, so verify that the private interconnect and multicast are working before turning to GIPC.

GIPCHALO:55916864: gipchaLowerSendEstablish: sending establish message for node '0x7fa4046ea260 { host ' rac-node-01', haName …}'
GIPCHALO:55916864: gipchaLowerSendEstablish: sending establish message for node '0x7fa4046d8540 { host ' rac-node-03', haName ..}'
CSSD:39119168: clssnmvDHBValidateNCopy: node 1, rac-node-01, has a disk HB, but no network HB, …
CSSD:39119168: clssnmvDHBValidateNCopy: node 3, rac-node-03, has a disk HB, but no network HB, …
CSSD:29653312: clssnmvDHBValidateNCopy: node 1, rac-node-01, has a disk HB, but no network HB, ..
CSSD:29653312: clssnmvDHBValidateNCopy: node 3, rac-node-03, has a disk HB, but no network HB, ..
CSSD:4227856704: clssnmSendingThread: Connection pending for node rac-node-01, number 1, flags 0x00000002
CSSD:136706368: clssscWaitOnEventValue: after CmInfo State val 3, eval 1 waited 1000 with cvtimewait status 4294967186
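
To see at a glance which nodes are reported with a disk heartbeat but no network heartbeat, the CSSD log can be scanned for the clssnmvDHBValidateNCopy message. This is a sketch only: the function name nodes_missing_network_hb is hypothetical and the pattern is built from the exact wording of the lines above.

```python
import re

NO_NHB_RE = re.compile(
    r"clssnmvDHBValidateNCopy: node (\d+), ([\w.-]+), has a disk HB, but no network HB"
)


def nodes_missing_network_hb(lines):
    """Return {node_number: node_name} for nodes seen on disk but not on the network."""
    found = {}
    for line in lines:
        match = NO_NHB_RE.search(line)
        if match:
            found[int(match.group(1))] = match.group(2)
    return found


# Lines taken from the CSSD log excerpt above.
sample = [
    "CSSD:39119168: clssnmvDHBValidateNCopy: node 1, rac-node-01, has a disk HB, but no network HB, ...",
    "CSSD:39119168: clssnmvDHBValidateNCopy: node 3, rac-node-03, has a disk HB, but no network HB, ...",
    "CSSD:4227856704: clssnmSendingThread: Connection pending for node rac-node-01, number 1, flags 0x00000002",
]
missing = nodes_missing_network_hb(sample)
print(missing)
```

In the excerpt, nodes 1 and 3 (rac-node-01 and rac-node-03) are the ones the joining node cannot reach over the network, so the interconnect to those nodes is what needs checking.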

You can follow the steps below.

On the surviving node, kill the gipcd.bin process (kill -15 <ospid of gipcd.bin>).

In 11.2, the evmd.bin, crsd.bin and ctssd.bin processes will also restart. Clusterware will respawn all of them automatically.

Once the gipcd.bin, evmd.bin, crsd.bin and ctssd.bin processes have been respawned on the surviving node, verify whether the other nodes join the cluster.

You will have to make sure that none of the above processes is left in a defunct (zombie) state, that is, killed but not cleaned up. Otherwise the other nodes may still try to connect to these processes.
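
One way to check for leftover zombies is to look at the process state column of ps. The sketch below assumes a Linux host where `ps -eo pid,stat,comm` is available and marks zombies with a state starting with Z; the function name defunct_processes and the PIDs in the sample are hypothetical.

```python
WATCHED = ("gipcd.bin", "evmd.bin", "crsd.bin", "ctssd.bin")


def defunct_processes(ps_output, watched=WATCHED):
    """Parse `ps -eo pid,stat,comm` output and return (pid, name) for watched zombies."""
    zombies = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        # Skip the header line and anything that is not "pid stat command".
        if len(parts) < 3 or not parts[0].isdigit():
            continue
        pid, stat, comm = parts
        # ps appends "<defunct>" to the command name of a zombie.
        name = comm.replace("<defunct>", "").strip()
        if stat.startswith("Z") and name in watched:
            zombies.append((int(pid), name))
    return zombies


# Hypothetical ps output for illustration; the PIDs are made up.
sample = """\
  PID STAT COMMAND
 4121 Ssl  gipcd.bin
 4188 Z    evmd.bin <defunct>
 4302 Ssl  crsd.bin
"""
zombies = defunct_processes(sample)
print(zombies)
```

On a live node the input could come from `subprocess.run(["ps", "-eo", "pid,stat,comm"], capture_output=True, text=True).stdout`. An empty result means none of the four processes is stuck as a zombie.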

Most of the time GI will start, but in case it does not, restart GI on the other nodes with the crsctl command.

If GI is still not starting, a restart of the whole cluster is required. Most of the time a rolling restart will work, but there is a chance that a complete shutdown and restart is required.
