Oracle Grid Infrastructure: Understanding Split-Brain Node Eviction (Doc ID 1546004.1)

本文主要是介绍Oracle Grid Infrastructure: Understanding Split-Brain Node Eviction (Doc ID 1546004.1),希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!

In this Document

 Purpose
 Scope
 Details
 What does "split brain" mean?
 Why is this a problem?
 How does the clusterware resolve a "split brain" situation?
 Identifying a split-brain eviction
 Finding the cohort
 Understanding the cohort message
 Using the cohort message to identify interconnect network issues
 Follow-up Action
 Community Discussions

 References

APPLIES TO:

Oracle Database - Enterprise Edition - Version 11.2.0.1 and later
Information in this document applies to any platform.

PURPOSE

The purpose of this note is to explain split-brain node evictions in Oracle Clusterware release 11.2

SCOPE

The intended audience of this note is Oracle Clusterware 11.2 administrators at any level of expertise. As written, this note applies only to 11.2.

DETAILS

Missed network heartbeat (NHB) evictions happen when ocssd of the surviving node loses contact with the evicted node over the interconnect. The nodes must be able to communicate over the interconnect to avoid a "split brain" situation. In the case of a "split brain" node eviction, one node aborted itself to avoid "split brain" when communication over the interconnect was compromised.

What does "split brain" mean?

"Split brain" means that there are 2 or more distinct sets of nodes, or "cohorts", with no communication between the two cohorts.

For example:
Suppose there are 4 nodes named A, B, C, D, in the following situation
* Nodes A,B can talk to each other; nodes C,D can talk to each other
* But A and B cannot talk to C or D, and vice versa
Then there are two cohorts: {A, B} and {C, D}.

Why is this a problem?

In a split-brain situation, there are in a sense two (or more) separate clusters working on the same shared storage. This has the potential for data corruption. So the split-brain must be resolved.

How does the clusterware resolve a "split brain" situation?

Oracle Clusterware handles the split-brain by terminating all the nodes in the SMALLER cohort.
If both of the cohorts are the same size, the cohort with the lowest numbered node in it survives.

The clusterware identifies the LARGEST cohort, and aborts all the nodes which do NOT belong to that cohort.

Identifying a split-brain eviction

In a split-brain node eviction, the following message is present in the ocssd log ($GRID_HOME/log/<hostname>/ocssd/ocssd.log) of the evicted node:

clssnmCheckDskInfo: Aborting local node to avoid splitbrain.

And earlier in the same log, within 10 minutes prior to "clssnmCheckDskInfo: Aborting local node" message:

clssnmPollingThread: node %s (%n) at <X>% heartbeat fatal, removal in...

 

Finding the cohort

The split-brain message in the ocssd.log will show "cohort" information. For example:

2012-12-28 20:26:25.803: [    CSSD][1111296320]clssnmCheckDskInfo: My cohort: 1
2012-12-28 20:26:25.803: [    CSSD][1111296320]clssnmCheckDskInfo: Surviving cohort: 2,3,4
2012-12-28 20:26:25.803: [    CSSD][1111296320](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 1, sprora01, is smaller than cohort of 3 nodes led by node 2, sprora02, based on map type 2

Understanding the cohort message

In a split-brain situation, ocssd on each node records on the voting disk the set of nodes it can communicate with. Each set is known as a "cohort". When there are two (or more) mutually non-intersecting sets, we have a "split-brain" situation. It means that there are two (or more) separate sets of nodes which cannot talk to each other over the interconnect. 

For example, in the above quote

My cohort: 1
Surviving cohort: 2,3,4


The meaning of these messages is

* "My cohort: 1" => The list of nodes I can communicate with: 1
* "Surviving cohort: 2,3,4" => From the voting disk, I know that nodes 2,3,4 can all communicate with each other.
* "Cohort of 1 nodes with leader 1, sprora01, is smaller than cohort of 3 nodes led by node 2, sprora02"
=> Oracle Clusterware has identified that the cohort {1} is smaller than the cohort {2,3,4}.

Oracle Clusterware handles the split-brain by terminating all the nodes in the SMALLER cohort. In this case, the smaller cohort is {1}. Therefore, ocssd on node {1} aborts the node.

Using the cohort message to identify interconnect network issues

The cohort message describes which nodes can communicate with each other.

Each cohort is a set of nodes that can talk to each other, and cannot talk to the nodes NOT in the cohort.

In the above example, the cohort message tells us that nodes {2,3,4} are all in communication; node 1 is not in communication with any of them.

 

Follow-up Action


The private network between node 1 and the other 3 nodes should be checked.

Please refer to the following note to check private interconnect network:  Document 1534949.1 - Oracle Grid Infrastructure: How to Troubleshoot Missed Network Heartbeat Evictions

这篇关于Oracle Grid Infrastructure: Understanding Split-Brain Node Eviction (Doc ID 1546004.1)的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!



http://www.chinasem.cn/article/632532

相关文章

mysql数据库重置表主键id的实现

《mysql数据库重置表主键id的实现》在我们的开发过程中,难免在做测试的时候会生成一些杂乱无章的SQL主键数据,本文主要介绍了mysql数据库重置表主键id的实现,具有一定的参考价值,感兴趣的可以了... 目录关键语法演示案例在我们的开发过程中,难免在做测试的时候会生成一些杂乱无章的SQL主键数据,当我们

Oracle存储过程里操作BLOB的字节数据的办法

《Oracle存储过程里操作BLOB的字节数据的办法》该篇文章介绍了如何在Oracle存储过程中操作BLOB的字节数据,作者研究了如何获取BLOB的字节长度、如何使用DBMS_LOB包进行BLOB操作... 目录一、缘由二、办法2.1 基本操作2.2 DBMS_LOB包2.3 字节级操作与RAW数据类型2.

查看Oracle数据库中UNDO表空间的使用情况(最新推荐)

《查看Oracle数据库中UNDO表空间的使用情况(最新推荐)》Oracle数据库中查看UNDO表空间使用情况的4种方法:DBA_TABLESPACES和DBA_DATA_FILES提供基本信息,V$... 目录1. 通过 DBjavascriptA_TABLESPACES 和 DBA_DATA_FILES

nvm如何切换与管理node版本

《nvm如何切换与管理node版本》:本文主要介绍nvm如何切换与管理node版本问题,具有很好的参考价值,希望对大家有所帮助,如有错误或未考虑完全的地方,望不吝赐教... 目录nvm切换与管理node版本nvm安装nvm常用命令总结nvm切换与管理node版本nvm适用于多项目同时开发,然后项目适配no

基于Python开发PDF转Doc格式小程序

《基于Python开发PDF转Doc格式小程序》这篇文章主要为大家详细介绍了如何基于Python开发PDF转Doc格式小程序,文中的示例代码讲解详细,感兴趣的小伙伴可以跟随小编一起学习一下... 用python实现PDF转Doc格式小程序以下是一个使用Python实现PDF转DOC格式的GUI程序,采用T

Oracle登录时忘记用户名或密码该如何解决

《Oracle登录时忘记用户名或密码该如何解决》:本文主要介绍如何在Oracle12c中忘记用户名和密码时找回或重置用户账户信息,文中通过代码介绍的非常详细,对同样遇到这个问题的同学具有一定的参... 目录一、忘记账户:二、忘记密码:三、详细情况情况 1:1.1. 登录到数据库1.2. 查看当前用户信息1.

mysql8.0无备份通过idb文件恢复数据的方法、idb文件修复和tablespace id不一致处理

《mysql8.0无备份通过idb文件恢复数据的方法、idb文件修复和tablespaceid不一致处理》文章描述了公司服务器断电后数据库故障的过程,作者通过查看错误日志、重新初始化数据目录、恢复备... 周末突然接到一位一年多没联系的妹妹打来电话,“刘哥,快来救救我”,我脑海瞬间冒出妙瓦底,电信火苲马扁.

Node.js net模块的使用示例

《Node.jsnet模块的使用示例》本文主要介绍了Node.jsnet模块的使用示例,net模块支持TCP通信,处理TCP连接和数据传输,具有一定的参考价值,感兴趣的可以了解一下... 目录简介引入 net 模块核心概念TCP (传输控制协议)Socket服务器TCP 服务器创建基本服务器服务器配置选项服

mac安装nvm(node.js)多版本管理实践步骤

《mac安装nvm(node.js)多版本管理实践步骤》:本文主要介绍mac安装nvm(node.js)多版本管理的相关资料,NVM是一个用于管理多个Node.js版本的命令行工具,它允许开发者在... 目录NVM功能简介MAC安装实践一、下载nvm二、安装nvm三、安装node.js总结NVM功能简介N

CSS3 最强二维布局系统之Grid 网格布局

《CSS3最强二维布局系统之Grid网格布局》CS3的Grid网格布局是目前最强的二维布局系统,可以同时对列和行进行处理,将网页划分成一个个网格,可以任意组合不同的网格,做出各种各样的布局,本文介... 深入学习 css3 目前最强大的布局系统 Grid 网格布局Grid 网格布局的基本认识Grid 网