Reliability, Availability, and Serviceability (RAS), for A-profile architecture

1 Introduction to RAS

1.1 Faults, errors, and failures


A failure is the event of deviation from correct service. This includes data corruption, data loss, and service loss.
An error is the deviation from correct service. An incorrect value that has an error is corrupt.
A fault is the cause of the error.

There are many sources of faults in a system, including both software and hardware faults:
• Hardware faults originate in, or affect, hardware.
• Software faults affect software, that is programs or data.

The RAS Extension and RAS System Architecture primarily address errors produced from hardware faults. These fall into two main areas:
• 1. Transient faults.
• 2. Non-transient or persistent faults.

1.2 General taxonomy of errors

1.2.1 Error detection
When a component accesses memory or other state, an error might be detected in that memory or state.
The error might be corrected or deferred by the component, or signaled to another component as either a deferred error or a detected error.

1.2.2 Error propagation
An error is propagated by deviations from correct service, including when any of the following occurs that would not have been permitted to occur had the fault not been activated:

• 1. A corrupt value is passed from producer to consumer.

• 2. A transaction or other operation occurs that should not have occurred.

• 3. A transaction or other operation that should have occurred does not occur.

• 4. A loss of uniprocessor semantics or any other loss of coherency in a multiprocessor coherent system is observed.

• 5. Changing the timing and/or order of transactions or other operations such that the timing and/or order of those transactions or operations is incorrect. In this case, the service interface defines acceptable timings and/or orders for transactions and other operations.

An error is silently propagated by the producer of a transaction if the consumer of the transaction cannot detect the error and consumes an undetected error because of the transaction. This might be because of one of the following:
• 1. The error is present on the transaction, but was not detected by the producer. The error is silently propagated by the producer.

• 2. The error is present on the transaction, but was not signaled to the consumer as an error. For example, a corrupt value was passed in the transaction with no indication that it was corrupt. The error is silently propagated by the producer.

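The difference between the two cases: in the first, the producer itself does not detect the error, so the error propagates; in the second, the producer does not mark the error for the consumer, so the error propagates unflagged.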

Errors might be propagated by components in a system until one of the following occurs:
• They are masked and do not affect the outcome of the system.
The error might be masked because a corrupt value is discarded or overwritten, or the error is detected and removed.

• They affect the service interface of the system and possibly cause failure. If the error has been silently propagated to the service interface then:
– This is a Silent Data Corruption (SDC).
– The rate of such failures, measured as the number of failures per billion device-hours of operation, is called the SDC Failure-in-Time (FIT) rate.
Alternatively, the error might have been detected, causing the system to invoke error handling and recovery.
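The FIT definition above is plain arithmetic, and a minimal sketch can make the unit concrete. The function name `fit_rate` and the fleet figures below are hypothetical and chosen only for illustration.

```python
def fit_rate(failures: int, devices: int, hours_per_device: float) -> float:
    """Return failures per billion (1e9) device-hours of operation."""
    device_hours = devices * hours_per_device
    if device_hours <= 0:
        raise ValueError("device-hours must be positive")
    return failures * 1e9 / device_hours


# Hypothetical fleet: 10,000 devices, each operated for 8,760 hours
# (one year), with 3 silent data corruptions observed in total.
sdc_fit = fit_rate(failures=3, devices=10_000, hours_per_device=8_760)
```

The same calculation applies to any failure class counted per device-hour, so it could equally be fed with counts of detected failures.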

1.2.3 Infected and poisoned

The state of a component becomes infected when the component consumes an uncorrected error that updates
the state.

A value is poisoned in the state of a component if it is marked as being in error, such that a subsequent access of
the state will detect the value is so marked and is treated as a detected error.

Poison is used to defer an error.
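A small software model may help picture how poison defers an error. This is only a sketch: `PoisonableMemory`, `DetectedError`, and the rule that a full overwrite clears the poison mark are assumptions of the model, not definitions from the architecture.

```python
class DetectedError(Exception):
    """Raised when an access finds a value that is marked as poisoned."""


class PoisonableMemory:
    """Toy memory in which each location can carry a poison mark."""

    def __init__(self):
        self._data = {}
        self._poisoned = set()

    def write(self, addr, value):
        # Assumption of this model: overwriting the whole location replaces
        # the corrupt value, so the deferred error is never consumed.
        self._data[addr] = value
        self._poisoned.discard(addr)

    def poison(self, addr):
        # Mark the value as being in error instead of reporting it now.
        self._poisoned.add(addr)

    def read(self, addr):
        # A later access sees the mark and treats it as a detected error.
        if addr in self._poisoned:
            raise DetectedError(f"poisoned value at {addr:#x}")
        return self._data[addr]
```

In this model the error is reported only when a reader actually touches the poisoned location, which is the point of deferral.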

1.2.4 Containable and uncontainable

An undetected error is uncontained at the component that failed to detect it.

A silently propagated error is uncontained at the component that silently propagated it.

A detected uncorrected error is uncontainable at the component if it might be uncontained at the component.

A detected uncorrected error is containable at the component if it is not uncontainable at the component. If
the component cannot determine whether a detected uncorrected error is uncontainable or containable at the
component, then the component treats the detected uncorrected error as uncontainable at the component.

An error that is uncontainable at a component might be containable at the system level.

Reporting an error as containable allows software to contain the error. This does not mean that hardware has
contained the error.
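The rule for a detected uncorrected error, including the conservative default when the component cannot decide, can be sketched as a small function. The names `Containment` and `classify_detected_uncorrected` are hypothetical, and `None` is used here to stand for "the component cannot determine".

```python
from enum import Enum
from typing import Optional


class Containment(Enum):
    CONTAINABLE = "containable"
    UNCONTAINABLE = "uncontainable"


def classify_detected_uncorrected(might_be_uncontained: Optional[bool]) -> Containment:
    """Classify a detected uncorrected error at one component.

    might_be_uncontained is True or False when the component can tell,
    and None when it cannot determine the answer.
    """
    if might_be_uncontained is None:
        # Unknown is treated as uncontainable at the component.
        return Containment.UNCONTAINABLE
    if might_be_uncontained:
        return Containment.UNCONTAINABLE
    return Containment.CONTAINABLE
```

The result describes the error at this component only; the same error might still be containable at the system level.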

1.3 Techniques for improving reliability, availability, and serviceability

1.3.1 Fault prevention and fault removal
Fault prevention and fault removal are two techniques for handling faults.

Fault prevention techniques are outside the scope of the architecture.

A fault that is removed is a corrected error and might be recorded and generate a fault handling interrupt, but it
is not propagated. This means that it is not consumed and does not cause service failure.

A common technique to detect and correct errors is the use of an Error Detection and Correction Code (EDAC),
more commonly referred to as simply an Error Correction Code (ECC). ECC schemes use mathematical codes
to detect and correct an error in a value in memory. The size of the value is the protection granule for the ECC.

The RAS Extension and RAS System Architecture do not require the implementation of any fault removal schemes,
including ECC.
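To make the idea of an ECC concrete, the following sketch uses a Hamming(7,4) code, a textbook single-error-correcting code with a 4-bit protection granule. It is chosen only for illustration; the notes above do not say which code any implementation uses, and the function names are hypothetical.

```python
def encode(data):
    """Encode 4 data bits as a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(codeword):
    """Return (data, corrected) for a 7-bit codeword.

    The syndrome is the XOR of the 1-based positions of all set bits.
    It is 0 for a valid codeword, and otherwise equals the position of
    a single flipped bit. A double-bit error would be miscorrected by
    this plain code, which has no extra overall parity bit.
    """
    bits = list(codeword)  # work on a copy so the caller's list is untouched
    syndrome = 0
    for position, bit in enumerate(bits, start=1):
        if bit:
            syndrome ^= position
    corrected = syndrome != 0
    if corrected:
        bits[syndrome - 1] ^= 1
    return [bits[2], bits[4], bits[5], bits[6]], corrected


word = encode([1, 0, 1, 1])
word[4] ^= 1  # inject a single-bit fault at position 5
data, corrected = decode(word)
```

Here `data` recovers the original four bits and `corrected` reports that a fault was removed, which is the kind of event that might be recorded as a corrected error.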

1.3.2 Error handling and recovery
A fault that is not removed gives rise to an uncorrected error. For example, 1-bit ECC errors can accumulate into a 2-bit ECC error that is not corrected.

Error recovery is the process by which software and hardware minimize the impact of an uncorrected error.

Error recovery methods include:
• Deferring an error from a fault. An error is deferred by hardware if hardware can make forward progress
without consuming the error. Deferring the error means:

– The fault might become masked later (fault removal). For example, because the corrupt value is
overwritten before it is consumed.

– If the deferred error is later consumed, then the error is reported at the point of consumption. For
example, if the deferred error is consumed by a Processing element (PE) then the consumer PE
generates an error exception. This can give better results in terms of error recovery in the case where
the original producer of the data is not known when the error was deferred. For example, because a
latent error was detected.

A common technique to defer an error is to replace the corrupt value with a poisoned value, for example in
memory or in a transaction.

• Preventing further propagation of the error, that is containing the error. In particular, preventing silent
propagation of the error.

• Reducing the severity of a failure by invoking a service failure mode:
– This is a Detected Uncorrected Error (DUE).
– The rate of such failures gives the DUE FIT rate.
– The type of service failure mode depends on what is acceptable to the service.

A software error recovery agent is typically invoked when hardware detects an error it cannot correct, defer, or

An error recovery agent also provides information to the operator through error logs to improve serviceability,
for example to help with the identification of a Field Replaceable Unit (FRU).

The RAS Extension and RAS System Architecture provide optional common programmers’ models to record
information about an error in an error record.

The RAS Extension describes the behavior of a PE when an error is signaled to it by the system, including
invoking a service failure mode by taking an error exception, and optional mechanisms to limit propagation of
an error.

The RAS Extension and RAS System Architecture do not require systems to implement error recovery
mechanisms, including poison, and do not require systems to limit the silent propagation of errors.

1.3.3 Fault handling
Fault handling by software is the process by which software diagnoses and responds to faults to improve

Fault handling methods include:

• 1. Predictive Failure Analysis (PFA), using information recorded by hardware to trigger pre-emptive action.
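One simple form of PFA is to count corrected errors and act once they become frequent. The sketch below assumes a sliding time window and a fixed threshold; the class name, the threshold policy, and the choice of hours as the time unit are all hypothetical and not taken from the architecture.

```python
from collections import deque


class CorrectedErrorMonitor:
    """Flag a unit once too many corrected errors fall within a time window."""

    def __init__(self, threshold: int, window_hours: float):
        self.threshold = threshold
        self.window_hours = window_hours
        self._times = deque()

    def record(self, time_hours: float) -> bool:
        """Record one corrected error; return True if action is advised."""
        self._times.append(time_hours)
        # Drop events that have aged out of the sliding window.
        while self._times and self._times[0] <= time_hours - self.window_hours:
            self._times.popleft()
        return len(self._times) >= self.threshold


monitor = CorrectedErrorMonitor(threshold=3, window_hours=24.0)
advice = [monitor.record(t) for t in (0, 1, 30, 31, 32)]
```

With these inputs only the last event brings three errors inside one 24-hour window, so only the final entry of `advice` is true. A real fault handling agent would feed such a monitor from recorded error data and decide what pre-emptive action to take.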

The RAS Extension and RAS System Architecture provide optional mechanisms to allow the reporting of errors
and warnings to a fault handling agent, and to record information about the fault in an error record. It is the
responsibility of the error recovery and fault handling processes to collate the error record data and write it to an
error log.

The detailed nature of the fault handling agent is outside the scope of this architecture. Fault handling and error
recovery might be independent agents.

2 RAS Extension for A-profile

2.1 PE error handling

2.1.1 PE error detection
When a PE accesses memory or other state, an error might be detected in that memory or state, and corrected,
deferred, or signaled to the PE as a detected error with an in-band error response.

When an error is detected by a component on a read or a cache maintenance operation from the PE:

– 1. If the error can be corrected, it is corrected and corrected data is returned.

– 2. If the error cannot be corrected and can be deferred, it is deferred. For example, on a load by poisoning
the PE state, if this is supported by the PE implementation.

– If the error cannot be corrected and if implemented and enabled at the component, the detected error
is signaled to the PE as an in-band error response.

When an error is detected by a component consuming a write from the PE:

– If the error can be corrected, it is corrected.

– If the error cannot be corrected and can be deferred, it is deferred to the consumer. For example, by
poisoning the location being written.

– If the error cannot be corrected and if implemented and enabled at the component, the detected error
is signaled to the PE as an in-band error response.
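The read and write cases above follow the same order of preference: correct, else defer, else signal in-band. A minimal sketch of that ordering follows. The names are hypothetical, and the final `NOT_COVERED` value is an assumption standing for the situation that the cases above do not describe.

```python
from enum import Enum


class Outcome(Enum):
    CORRECTED = "corrected"
    DEFERRED = "deferred"
    IN_BAND_ERROR = "in-band error response"
    NOT_COVERED = "not covered by the cases above"


def resolve_detected_error(correctable: bool, deferrable: bool,
                           in_band_enabled: bool) -> Outcome:
    """Pick the outcome for an error detected on a PE read or write."""
    if correctable:
        return Outcome.CORRECTED
    if deferrable:
        # For example, by poisoning the PE state or the written location.
        return Outcome.DEFERRED
    if in_band_enabled:
        # Only if in-band signaling is implemented and enabled at the component.
        return Outcome.IN_BAND_ERROR
    return Outcome.NOT_COVERED
```

Checking correction first reflects that a corrected error is not propagated at all, whereas the later outcomes leave work for the consumer or for software.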

2.1.2 PE error propagation
The program-visible architectural state of the PE, referred to as the PE state, includes:
• General-purpose, SIMD&FP, and SVE registers.
• System registers.
• Special-purpose registers.

An error is consumed by the PE by any of the following:

• 1. An instruction commits the corruption into the PE state.

• 2. The error is on an instruction fetch and the corrupt instruction is committed for execution.

• 3. The error is on a translation table walk for a committed load, store, or instruction fetch.

An error is propagated by the PE by one or more of the following occurring that would not have been permitted
to occur had the fault not been activated:

• Consumption of the corrupt value by any instruction, propagating the error to the target(s) of the instruction.
This includes:

– A store of a corrupt value.

– A write of a corrupt value to a System register, Special-purpose register, or PSTATE. Infecting a
System register state might mean that the PE generates transactions that would not otherwise be
permitted.

• Any operation occurring that should not have occurred, including:

– 1. A load, translation table walk, or instruction fetch that would not have been permitted, including those
from hardware speculation or prefetching.

– 2. A store to an incorrect address, or a store that would not have been made or not permitted.

– 3. A direct or indirect write to a Special-purpose or System register that would not have been made or
not permitted.

– 4. Assertion of any signal, such as an interrupt, that would not have been asserted.

• Any operation not occurring that should have occurred.

• Causing the PE to take an imprecise exception, other than an error exception in response to the error itself.
See the section Definition of a precise exception in the Arm® Architecture Reference Manual, for A-profile
architecture.

• The PE discarding data that it holds in a modified state.

• Any other loss of required uniprocessor semantics, ordering, or coherency.

An error propagated by the PE is silently propagated by the PE only if all of the following are true:

  1. The propagation is not part of the required operation of the PE in taking an error exception generated by
    the error.

  2. The propagation is not part of the required operation of the PE executing an ESB instruction that
    synchronizes the error.

  3. The error is not signaled to the consumer as a detected error or deferred error.

  4. Any of the following are true:
    • The corrupt value is held in other than the general-purpose, SIMD&FP, or SVE registers.

    • The error is propagated by an instruction in program order before either taking an error exception
      generated by the error or executing an ESB instruction that synchronizes the error, and is propagated
      to outside of the general-purpose, SIMD&FP, or SVE registers.

    • The error is propagated other than by an instruction that consumes the corrupt value as an input
      operand but otherwise behaves correctly.
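The silent propagation rule combines three conditions that must all hold with a fourth that is satisfied by any one of three alternatives. Written as a predicate, it might look like the sketch below; the field names are hypothetical labels for the conditions listed above.

```python
from dataclasses import dataclass


@dataclass
class Propagation:
    part_of_error_exception: bool    # negation of condition 1
    part_of_esb_synchronization: bool  # negation of condition 2
    signaled_to_consumer: bool       # negation of condition 3
    held_outside_gp_simd_sve: bool   # condition 4, first alternative
    escaped_before_synchronization: bool  # condition 4, second alternative
    not_by_operand_consumption: bool  # condition 4, third alternative


def is_silently_propagated(p: Propagation) -> bool:
    """True only if conditions 1 to 3 all hold and any alternative of 4 holds."""
    return (
        not p.part_of_error_exception
        and not p.part_of_esb_synchronization
        and not p.signaled_to_consumer
        and (
            p.held_outside_gp_simd_sve
            or p.escaped_before_synchronization
            or p.not_by_operand_consumption
        )
    )
```

For example, a propagation that is signaled to the consumer as a deferred error is never silent under this predicate, whatever the other fields say.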

2.1.3 Other errors – 2024.03.17: resume from here next week

2.2 Generating error exceptions

2.3 Taking error exceptions

2.4 Error synchronization event

2.5 Virtual SError interrupts

2.6 Error records in the PE

3 RAS System Architecture

3.1 Nodes

3.2 Detecting and consuming errors

3.3 Standard error record

3.4 Error recovery interrupt

3.5 Fault handling interrupt

3.6 In-band error response signaling (external aborts)

3.7 Critical error interrupt

3.8 Standard format Corrected error counter

3.9 Error recovery, fault handling, and critical error signaling

3.10 Error recovery reset

3.11 Timestamp extension

3.12 Common Fault Injection Model Extension

4 RAS Extension and RAS System Architecture Registers



