Oracle Grid Infrastructure: Understanding Split-Brain Node Eviction (Doc ID 1546004.1)


In this Document

 What does "split brain" mean?
 Why is this a problem?
 How does the clusterware resolve a "split brain" situation?
 Identifying a split-brain eviction
 Finding the cohort
 Understanding the cohort message
 Using the cohort message to identify interconnect network issues
 Follow-up Action
 Community Discussions



Oracle Database - Enterprise Edition - Version and later
Information in this document applies to any platform.


The purpose of this note is to explain split-brain node evictions in Oracle Clusterware release 11.2.


The intended audience of this note is Oracle Clusterware 11.2 administrators at any level of expertise. As written, this note applies only to 11.2.


Missed network heartbeat (NHB) evictions happen when ocssd on the surviving node loses contact with the evicted node over the interconnect. The nodes must be able to communicate over the interconnect to avoid a "split brain" situation. In a "split brain" node eviction, one node aborts itself to avoid "split brain" because communication over the interconnect has been compromised.

What does "split brain" mean?

"Split brain" means that there are two or more distinct sets of nodes, or "cohorts", with no communication between the cohorts.

For example:
Suppose there are 4 nodes named A, B, C, D, in the following situation:
* Nodes A,B can talk to each other; nodes C,D can talk to each other
* But A and B cannot talk to C or D, and vice versa
Then there are two cohorts: {A, B} and {C, D}.
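
The grouping in this example can be reproduced with a small sketch. It is only an illustration of the definition, not how ocssd computes cohorts: it assumes a cohort is simply a group of nodes joined by working links, and the function name find_cohorts and its inputs are hypothetical.

```python
from collections import defaultdict


def find_cohorts(nodes, links):
    """Group nodes into cohorts: sets of nodes joined by working links.

    Assumption: a cohort is treated as a connected component of the
    "can talk to" relation, which is a simplification for illustration.
    """
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    cohorts, seen = [], set()
    for node in sorted(nodes):
        if node in seen:
            continue
        # Walk every working link reachable from this node.
        cohort, stack = set(), [node]
        while stack:
            current = stack.pop()
            if current in cohort:
                continue
            cohort.add(current)
            stack.extend(neighbours[current] - cohort)
        seen |= cohort
        cohorts.append(sorted(cohort))
    return cohorts


# A and B can talk to each other, C and D can talk to each other.
print(find_cohorts(["A", "B", "C", "D"], [("A", "B"), ("C", "D")]))
```

With only the A-B and C-D links working, the sketch reports the two cohorts {A, B} and {C, D} described above.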

Why is this a problem?

In a split-brain situation, there are in a sense two (or more) separate clusters working on the same shared storage. This has the potential for data corruption. So the split-brain must be resolved.

How does the clusterware resolve a "split brain" situation?

Oracle Clusterware handles the split-brain by terminating all the nodes in the SMALLER cohort.
If both cohorts are the same size, the cohort containing the lowest-numbered node survives.

The clusterware identifies the LARGEST cohort, and aborts all the nodes which do NOT belong to that cohort.
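
The rule stated above can be written down as a short sketch. It encodes only what this note says (the largest cohort survives, and a tie goes to the cohort with the lowest-numbered node); the real clusterware decision may take other inputs into account, and the function names here are hypothetical.

```python
def surviving_cohort(cohorts):
    """Pick the survivor: largest cohort, ties broken by lowest node number."""
    return min(cohorts, key=lambda cohort: (-len(cohort), min(cohort)))


def evicted_nodes(cohorts):
    """Return every node that does not belong to the surviving cohort."""
    survivor = set(surviving_cohort(cohorts))
    return sorted(n for cohort in cohorts for n in cohort if n not in survivor)


# One node against three: the single node is evicted.
print(surviving_cohort([[1], [2, 3, 4]]), evicted_nodes([[1], [2, 3, 4]]))

# Equal sizes: the cohort containing node 1 survives.
print(surviving_cohort([[3, 4], [1, 2]]), evicted_nodes([[3, 4], [1, 2]]))
```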

Identifying a split-brain eviction

In a split-brain node eviction, the following message is present in the ocssd log ($GRID_HOME/log/<hostname>/ocssd/ocssd.log) of the evicted node:

clssnmCheckDskInfo: Aborting local node to avoid splitbrain.

And earlier in the same log, within the 10 minutes before the "clssnmCheckDskInfo: Aborting local node" message:

clssnmPollingThread: node %s (%n) at <X>% heartbeat fatal, removal in...


Finding the cohort

The split-brain message in the ocssd.log will show "cohort" information. For example:

2012-12-28 20:26:25.803: [    CSSD][1111296320]clssnmCheckDskInfo: My cohort: 1
2012-12-28 20:26:25.803: [    CSSD][1111296320]clssnmCheckDskInfo: Surviving cohort: 2,3,4
2012-12-28 20:26:25.803: [    CSSD][1111296320](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 1, sprora01, is smaller than cohort of 3 nodes led by node 2, sprora02, based on map type 2
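
When an ocssd.log is large, the cohort lines can be pulled out with a short script. The sketch below is based only on the message format shown in the excerpt above; other versions may format these lines differently, and the names COHORT_RE and parse_cohorts are hypothetical.

```python
import re

# Matches the "My cohort" / "Surviving cohort" lines as they appear in the
# excerpt above; the pattern is an assumption drawn from that sample only.
COHORT_RE = re.compile(r"clssnmCheckDskInfo: (My|Surviving) cohort: (\d+(?:,\d+)*)")


def parse_cohorts(log_lines):
    """Return {'my': [...], 'surviving': [...]} for the cohort lines found."""
    result = {}
    for line in log_lines:
        match = COHORT_RE.search(line)
        if match:
            key = "my" if match.group(1) == "My" else "surviving"
            result[key] = [int(n) for n in match.group(2).split(",")]
    return result


sample = [
    "2012-12-28 20:26:25.803: [    CSSD][1111296320]clssnmCheckDskInfo: My cohort: 1",
    "2012-12-28 20:26:25.803: [    CSSD][1111296320]clssnmCheckDskInfo: Surviving cohort: 2,3,4",
]
print(parse_cohorts(sample))
```

To run it against a real log, pass an open file object instead of the sample list, since a file iterates line by line.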

Understanding the cohort message

In a split-brain situation, ocssd on each node records on the voting disk the set of nodes it can communicate with. Each set is known as a "cohort". When there are two (or more) mutually disjoint sets, we have a "split-brain" situation: two (or more) separate sets of nodes that cannot talk to each other over the interconnect.

For example, in the log excerpt above:

My cohort: 1
Surviving cohort: 2,3,4

These messages mean:

* "My cohort: 1" => The list of nodes I can communicate with: 1
* "Surviving cohort: 2,3,4" => From the voting disk, I know that nodes 2,3,4 can all communicate with each other.
* "Cohort of 1 nodes with leader 1, sprora01, is smaller than cohort of 3 nodes led by node 2, sprora02"
=> Oracle Clusterware has identified that the cohort {1} is smaller than the cohort {2,3,4}.

Oracle Clusterware handles the split-brain by terminating all the nodes in the SMALLER cohort. In this case, the smaller cohort is {1}. Therefore, ocssd on node 1 aborts the node.

Using the cohort message to identify interconnect network issues

The cohort message describes which nodes can communicate with each other.

Each cohort is a set of nodes that can talk to each other, and cannot talk to the nodes NOT in the cohort.

In the above example, the cohort message tells us that nodes {2,3,4} are all in communication; node 1 is not in communication with any of them.


Follow-up Action

The private network between node 1 and the other three nodes (2, 3, and 4) should be checked.
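
As a rough aid, the node pairs whose private network path is suspect can be listed directly from the two cohorts. This is a minimal sketch assuming that every pair spanning the two cohorts is worth checking; the function name links_to_check is hypothetical.

```python
from itertools import product


def links_to_check(evicted, survivors):
    """List every (evicted node, surviving node) pair across the two cohorts."""
    return sorted(product(evicted, survivors))


# Cohorts taken from the example message: {1} versus {2,3,4}.
for a, b in links_to_check([1], [2, 3, 4]):
    print(f"check private interconnect between node {a} and node {b}")
```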

Please refer to the following note to check the private interconnect network: Document 1534949.1 - Oracle Grid Infrastructure: How to Troubleshoot Missed Network Heartbeat Evictions
