Reboot-less node fencing in Oracle Clusterware 11g Release 2

在进行一次RAC的高可用性测试时,当private网卡的网线被拔掉之后,没有出现传说中的有一个节点被CRS强制重启,取而代之的是node2上面的ASM实例和RDBMS实例被关闭;当网线被重新插上时,node2上面的ASM实例和RDBMS实例自动重新启动。

基于上面的现象,在google上搜索,发现oracle在11.2.0.2版本之后引入了叫reboot-less node fencing的特性,即不使用重启节点的方式进行fencing。

下面引用oracle官方文档对reboot-less node fencing的介绍。

As mentioned, Oracle Clusterware uses a STONITH (Shoot The Other Node In The Head) comparable fencing algorithm to ensure data integrity in cases, in which cluster integrity is endangered and split-brain scenarios need to be prevented. In case of Oracle Clusterware, this means that a local process enforces the removal of one or more nodes from the cluster (fencing). 

Until Oracle Clusterware 11g Release 2, Patch Set One (11.2.0.2) the fencing of a node was performed by a “fast reboot” of the respective server. A “fast reboot” in this context summarizes a shutdown and restart procedure that does not wait for any IO to finish or for file systems to synchronize on shutdown. With Oracle Clusterware 11g Release 2, Patch Set One (11.2.0.2) this mechanism has been changed in order to prevent such a reboot as much as possible. 

Already with Oracle Clusterware 11g Release 2 this algorithm was improved so that failures of certain, Oracle RAC-required subcomponents in the cluster do not necessarily cause an immediate fencing (reboot) of a node. Instead, an attempt is made to clean up the failure within the cluster and to restart the failed subcomponent. Only, if a cleanup of the failed component appears to be unsuccessful, a node reboot is performed in order to force a cleanup. 

With Oracle Clusterware 11g Release 2, Patch Set One (11.2.0.2) further improvements were made so that Oracle Clusterware will try to prevent a split-brain without rebooting the node. It thereby implements a standing requirement from those customers, who were requesting to preserve the node and to prevent a reboot, since the node runs applications not managed by Oracle Clusterware, which would otherwise be forcibly shut down by the reboot of a node. 

With the new algorithm and when a decision is made to evict a node from the cluster, Oracle Clusterware will first attempt to shutdown all resources on the machine that was chosen to be the subject of an eviction. Especially IO generating processes are killed and it is ensured that those processes are completely stopped before continuing. If, for some reason, not all resources can be stopped or IO generating processes cannot be stopped completely, Oracle Clusterware will still perform a reboot or use IPMI to forcibly evict the node from the cluster. 

If all resources can be stopped and all IO generating processes can be killed, Oracle Clusterware will shut itself down on the respective node, but will attempt to restart after the stack has been stopped. The restart is initiated by the Oracle High Availability Services Daemon, which has been introduced with Oracle Clusterware 11g Release 2. 
    原文作者:花菜土豆粉
    原文地址: https://segmentfault.com/a/1190000006126984
    本文转自网络文章,转载此文章仅为分享知识,如有侵权,请联系博主进行删除。
点赞