BPM Process Instances – Faults, Rollback & Recovery – Part 4

2023-11-01 20:08

Introduction

This is part 4 of a four-part blog series explaining how the BPM engine functions under the covers when “faults” occur, be they unhandled technical faults or failures at the engine level.

Part 1 can be found here.

Part 4: BPM Message Recovery

Idempotence

It is vitally important to understand the concept of idempotence, i.e. the ability to replay an activity more than once without any adverse impact. As an example, an activity that credits money to a bank account is not idempotent, whereas an activity that queries the balance of a bank account is. This matters because recovering process instances necessarily means that some activities will be replayed. It is a business decision as to whether this recovery of instances is valid or not.
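To make the bank account example concrete, here is a minimal Java sketch. The class, its methods and the request-id deduplication are all hypothetical illustrations and not part of the BPM engine or of the examples that accompany this post; deduplicating on a request id is just one common way to make a replayed credit harmless.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical in-memory account used only to illustrate idempotence;
// none of these names come from the BPM engine or the original examples.
final class IdempotenceSketch {
    private long balance;
    private final Set<String> appliedRequests = new HashSet<>();

    // Not idempotent: every replay credits the account again.
    void credit(long amount) {
        balance += amount;
    }

    // Idempotent variant: a replay carrying the same request id is ignored.
    void creditOnce(String requestId, long amount) {
        if (appliedRequests.add(requestId)) {
            balance += amount;
        }
    }

    // Idempotent by nature: reading the balance changes nothing.
    long balance() {
        return balance;
    }

    public static void main(String[] args) {
        IdempotenceSketch account = new IdempotenceSketch();
        account.credit(100);
        account.credit(100);             // replayed activity credits twice
        account.creditOnce("req-1", 50);
        account.creditOnce("req-1", 50); // replay has no further effect
        System.out.println(account.balance());
    }
}
```

If the activities in a process look like `credit`, recovering the instance may need a business sign-off; if they look like `creditOnce` or `balance`, a replay is safe.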

Recover Individual Instances

As a first step, let’s look at how to recover the individual instances that have failed.

Pattern 1 – Async – Async

Via the Enterprise Manager Console we can view the BPMN engine….

BPMR_45

…and from the “Recovery” tab we can see any uncompleted “Invoke” activities (excluding the last 5 minutes as this is an active table)….

BPMR_46

…and there we see the failed instance which we can recover.

Pattern 2 – Async – Sync

Via the Enterprise Manager Console as before…

BPMR_47

Pattern 3 – Async with Acknowledgement – Sync

Via the Enterprise Manager Console as before…

BPMR_48

…we can’t see the latest instance because it was not rolled back to an invoke. We can see the actual activity itself, but it is not recoverable from here….

BPMR_49

We know that it has rolled back to the timer activity, and we can recover it by simply clicking the “Refresh Alarm Table” button shown. Note that this is a bulk operation: it refreshes all timers. The button is only available in PS6 and later versions.

Bulk Recovery of Instances

Now that we have seen how to find and recover individual instances which have failed with various patterns, let’s look at how we can query and recover in bulk. It could be the case that a catastrophic failure has caused managed servers to crash resulting in potentially thousands of failed process instances and at multiple different activities within them.

How do we find all “stuck” instances? Which ones will recover automatically and which will have to be manually recovered?

Automatic Recovery – Undelivered Messages

By default, BPM instances are not recovered automatically on managed server restart, unlike BPEL instances, which are. This can be verified, or changed if required, in the Enterprise Manager Console (remember idempotence!)….

BPMR_50

BPMR_51

…i.e. on startup, recover all instances over a duration of 0 seconds, which means nothing is recovered automatically.

Automatic Recovery – Failed Timers

In contrast, on server restart all timers will re-fire, i.e. in “Pattern 3” the “catchEvent” timer activity will fire again on server restart. It is also worth noting that any timers which expired whilst the managed server was down will fire on restart too… this could cause a large spike in activity on restart if many instances with expired timers are retried.

Note also what exactly this “refresh” does – when a WORK_ITEM entry is created for a timer, be it a simple “sleep”….

BPMR_52

…or a boundary timer on an activity….

BPMR_53

…then an in-memory timer is created, scheduled from the transaction commit listener. When the in-memory timer expires, the WORK_ITEM table is updated.

A “refresh” re-invokes the active entries in the WORK_ITEM table, thus creating new instances of those in-memory timers. It does not, however, reset these timers to “zero”, i.e. it does not begin the time period again.
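The “not reset to zero” behaviour can be pictured with a small Java sketch. This is an illustration under stated assumptions, not engine code: it assumes the persisted row carries an absolute expiry date, and the class and method names are hypothetical. The re-created in-memory timer is scheduled for whatever time is left until that stored date.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical model of a persisted timer; the names are illustrative and
// do not mirror the real WORK_ITEM columns or any engine class.
final class TimerRefreshSketch {

    // Delay for the re-created in-memory timer: the time left until the
    // stored expiry date, or zero if that date has already passed.
    static Duration remainingDelay(Instant storedExpiry, Instant now) {
        Duration left = Duration.between(now, storedExpiry);
        return left.isNegative() ? Duration.ZERO : left;
    }

    public static void main(String[] args) {
        Instant created = Instant.parse("2013-08-01T10:00:00Z");
        Instant expiry = created.plus(Duration.ofMinutes(30)); // 30-minute timer
        Instant refreshAt = Instant.parse("2013-08-01T10:20:00Z");

        // Refreshed 20 minutes in: 10 minutes remain, not a fresh 30.
        System.out.println(remainingDelay(expiry, refreshAt));
        // Refreshed after the expiry date: the timer is due immediately.
        System.out.println(remainingDelay(expiry, expiry.plusSeconds(60)));
    }
}
```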

Recovery Queries

The above scenarios have covered some common patterns and the message recovery associated with them. The provided scripts cover all possible “stuck instance” scenarios, how to find the instances and how to recover them in bulk.

It is advisable to agree on a fixed point in time for the recovery window. This ensures that the various queries described below return a consistent set of results. The queries include “receive_date < systimestamp - interval '1' minute” to avoid picking up in-flight instances. You may adjust this to query for “stuck” messages up to a particular cut-off date, e.g. 1 August 2013.
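One way to keep the window fixed is to render the agreed cut-off once and reuse the same predicate in every query. The Java sketch below is a hypothetical helper, not one of the provided scripts; it assumes that comparing receive_date against an Oracle timestamp literal is acceptable in place of the relative “systimestamp - interval” test.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Sketch only: renders one agreed cut-off as an Oracle timestamp literal so
// that every recovery query filters on the same point in time.
final class RecoveryWindowSketch {

    private static final DateTimeFormatter ORACLE_TS =
        DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss");

    // Predicate to use in place of the relative systimestamp test.
    static String receivedBefore(LocalDateTime cutoff) {
        return "receive_date < timestamp '" + cutoff.format(ORACLE_TS) + "'";
    }

    public static void main(String[] args) {
        LocalDateTime cutoff = LocalDateTime.of(2013, 8, 1, 0, 0, 0);
        System.out.println(receivedBefore(cutoff));
    }
}
```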

Querying the DLV_MESSAGE table – Find the “unresolved” messages

As a reminder, the valid values for the STATE column.

BPMR_56

“Stuck” messages are those with a STATE of 0 or 4: “0 – STATE_UNRESOLVED”, as we saw earlier in our example scenarios, and “4 – STATE_MAX_RECOVERED”, which could occur if auto-recovery was switched on or if someone had repeatedly tried to resubmit the message from Enterprise Manager Console.

We have 2 types of messages – Invoke and Callback. Invoke → DLV_TYPE = 1, Callback → DLV_TYPE = 2

We can query the “stuck” messages for each type as follows….

Simple query on the DLV_MESSAGE table…

  • Group by the dlv_type, composite and component allowing us to isolate where the bulk of the “stuck” messages are.
  • Optionally separate the query into two parts, one for “Invoke”, one for “Callback”

BPMR_57

…when we run this for our failed scenarios we get the results as expected….

BPMR_58
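For readers without access to the images, a query of the kind described above might look roughly like the one this Java sketch builds. It is an approximation, not the original script: state, dlv_type and receive_date are the columns named in the text, while composite_name and component_name are assumed column names for “composite” and “component”.

```java
// Sketch only: builds the text of a "stuck message" query. The column names
// composite_name and component_name are assumptions; state, dlv_type and
// receive_date are the columns named in the text.
final class StuckMessageQuerySketch {

    static final int STATE_UNRESOLVED = 0;
    static final int STATE_MAX_RECOVERED = 4;

    static String stuckMessagesSql() {
        return String.join("\n",
            "select dlv_type, composite_name, component_name, count(*) as stuck",
            "from dlv_message",
            "where state in (" + STATE_UNRESOLVED + ", " + STATE_MAX_RECOVERED + ")",
            "and receive_date < systimestamp - interval '1' minute",
            "group by dlv_type, composite_name, component_name",
            "order by dlv_type, composite_name, component_name");
    }

    public static void main(String[] args) {
        System.out.println(stuckMessagesSql());
    }
}
```

Adding “and dlv_type = 1” or “and dlv_type = 2” to the where clause would give the optional split into Invoke and Callback queries.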

Querying the DLV_MESSAGE table – for FYI messages

In production scenarios there may be some rows that you can discard immediately. For example, FYI tasks cause messages to be written to the DLV_MESSAGE table, and these can essentially be ignored. You could use the following SQL to get the message_guid of such messages and then mark them as cancelled using the Recovery API. Use of the API is discussed later; for now it is enough to know that the example class is called BatchMessageCancelForFYI, included in the Java examples that accompany this document. You may also consider updating the human task (for FYI) as a workaround to avoid these extra messages. See Patch 16494888.

BPMR_59

Querying the DLV_MESSAGE table – Drilling Deeper

We can now concentrate on the “stuck” messages and drill a little deeper to get some context e.g. what activity caused the problem. To do this we can query further tables…. COMPOSITE_INSTANCE, COMPONENT_INSTANCE, CUBE_INSTANCE, WORK_ITEM, BPM_AUDIT_QUERY.

Drill down to COMPOSITE_INSTANCE data…

BPMR_60

…when we run this for our failed scenarios we get the following….

BPMR_61

…i.e. we can see here that “Pattern 1” at 05:01 failed (COMPOSITE_STATE = 2) in the “BPMAsyncService” process and that “Pattern 2” at 05:56 failed in the “BPMTestTimeoutAsyncClient”, exactly as shown in the failure images above.

Drill down to CUBE_INSTANCE data….

Here we can get information on cube state and scope size….

BPMR_62

…when we run this for our failed scenarios…

BPMR_63

…we can see that “Pattern 1” failed in “BPMAsyncService” and “Pattern 2” failed in “BPMTestTimeoutAsyncClient”, both with a CUBE_INSTANCE_STATE of “10 – STATE_CLOSED_ROLLED_BACK”.

Drill down to WORK_ITEM data….

We can now see to what activity we rolled back….

BPMR_64

…and for our failed scenarios….

BPMR_65

…this is interesting: we now only see the failure for “Pattern 1”, not for “Pattern 2”. Why? Well, remember that “Pattern 2” rolled back all the way to the “Start” message of the client process, so no active WORK_ITEM rows exist.

For “Pattern 1” we can see that we reached “SCOPE_ID=TestAsyncAsync_try.2” and “NODE_ID=ACT8144282777463”, looking at the original process model….

BPMR_66

…we can see that the passivation point on the “Receive” activity created a WORK_ITEM entry with state “3 – OPEN_PENDING_COMPLETE”.

Drill down to BPM_AUDIT_QUERY data….

If auditing was enabled, we can now see which was the last activity audited….

BPMR_67

…and for our failed scenarios…

BPMR_68

…we can see for “Pattern 1” the last audited activity before rollback was “ScriptTask”, i.e. we know that it was here we had a failing data association, and for “Pattern 2” the last audited activity was “DBCall”, i.e. it was here that the process timed out.

Timer Queries

Expired Timers

In the case where a server has crashed it can be very useful to know how many timers have expired in the downtime, given that on restart of the server they will all re-fire. We can query these as follows….

BPMR_69

…i.e. return all timers that had an expiry date in the past but are still “open_pending_complete” and whose composite instance is still “running”.

Failed Timers

The other area where timers could be incomplete is our scenario 3: although the timer fired, the transaction that completed it was rolled back. We can query these as follows….

BPMR_70

…i.e. return all timers that are still “open_pending_complete” and whose composite instance is “running with faults”.

For our failed scenario 3 we can see the results of this query….

BPMR_71
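The two timer queries above differ only in their conditions, which can be restated as a small Java sketch. This is an illustration, not the SQL itself: the enums are hypothetical labels for the states named in the text and deliberately avoid the numeric codes stored in the tables.

```java
import java.time.Instant;

// Sketch only: the enums are illustrative labels, not the numeric codes
// stored in the WORK_ITEM or COMPOSITE_INSTANCE tables.
final class TimerClassificationSketch {

    enum WorkItemState { OPEN_PENDING_COMPLETE, OTHER }
    enum CompositeState { RUNNING, RUNNING_WITH_FAULTS, OTHER }
    enum TimerStatus { EXPIRED, FAILED, NOT_STUCK }

    static TimerStatus classify(WorkItemState workItem, CompositeState composite,
                                Instant expiry, Instant now) {
        if (workItem != WorkItemState.OPEN_PENDING_COMPLETE) {
            return TimerStatus.NOT_STUCK;
        }
        if (composite == CompositeState.RUNNING_WITH_FAULTS) {
            return TimerStatus.FAILED;   // the completing transaction rolled back
        }
        if (composite == CompositeState.RUNNING && expiry.isBefore(now)) {
            return TimerStatus.EXPIRED;  // will re-fire on server restart
        }
        return TimerStatus.NOT_STUCK;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2013-08-01T12:00:00Z");
        Instant past = now.minusSeconds(600);
        System.out.println(classify(WorkItemState.OPEN_PENDING_COMPLETE,
                CompositeState.RUNNING, past, now));
        System.out.println(classify(WorkItemState.OPEN_PENDING_COMPLETE,
                CompositeState.RUNNING_WITH_FAULTS, past, now));
    }
}
```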

Recovery Queries Conclusion

From the above queries it is possible to get a view on which instances have failed, what activity they had reached when the failure occurred and to where they rolled back. With this information it is possible to determine whether recovery is possible from a business perspective (idempotence) and to infer patterns from failures in order to minimize recurrence.

Leveraging the Recovery API

Before running recovery, you may want to back up your SOA_INFRA database.

Briefly, this is the Recovery API example that goes with this blog….

BPMR_72

The previous sections described the SQL queries that find the messages in need of re-submission. Essentially, the result of these queries (message_guid) is fed into one of three operations: recover invoke, recover callback, or cancel the message.

These APIs are in addition to what’s documented for fault recovery here.

BPMR_73
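The routing from query result to operation can be sketched as follows. Note that the RecoveryClient interface here is a hypothetical stand-in and not the real Recovery API; the actual calls are in the example classes that accompany this post and may look quite different. Only the DLV_TYPE values (1 for Invoke, 2 for Callback) and the rule that FYI messages are cancelled come from the text.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: routes each message_guid to one of three operations.
final class RecoveryRoutingSketch {

    // Hypothetical stand-in for the Recovery API. The real calls are in the
    // example classes that accompany the post and may look different.
    interface RecoveryClient {
        void recoverInvoke(String messageGuid);
        void recoverCallback(String messageGuid);
        void cancel(String messageGuid);
    }

    static final int DLV_TYPE_INVOKE = 1;
    static final int DLV_TYPE_CALLBACK = 2;

    // FYI messages are cancelled; everything else is recovered by type.
    static void route(RecoveryClient client, String messageGuid, int dlvType, boolean fyi) {
        if (fyi) {
            client.cancel(messageGuid);
        } else if (dlvType == DLV_TYPE_INVOKE) {
            client.recoverInvoke(messageGuid);
        } else if (dlvType == DLV_TYPE_CALLBACK) {
            client.recoverCallback(messageGuid);
        } else {
            throw new IllegalArgumentException("Unknown DLV_TYPE: " + dlvType);
        }
    }

    // Fake client that only records which operation would have been called.
    static final class RecordingClient implements RecoveryClient {
        final List<String> calls = new ArrayList<>();
        @Override public void recoverInvoke(String guid) { calls.add("invoke:" + guid); }
        @Override public void recoverCallback(String guid) { calls.add("callback:" + guid); }
        @Override public void cancel(String guid) { calls.add("cancel:" + guid); }
    }

    public static void main(String[] args) {
        RecordingClient client = new RecordingClient();
        route(client, "guid-1", DLV_TYPE_INVOKE, false);
        route(client, "guid-2", DLV_TYPE_CALLBACK, false);
        route(client, "guid-3", DLV_TYPE_INVOKE, true);
        System.out.println(client.calls);
    }
}
```

Running the routing against a recording client first, as above, is a cheap way to review what a bulk recovery would do before it touches real messages.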

Cancel FYI Task messages in the DLV_Message Table

Extract from “BatchMessageCancelForFYI”…

BPMR_74

These cancelled messages will be picked up by the next SOA_INFRA purge assuming that a purge strategy is in place.

Batch Recovery of messages in the DLV_Message Table

Here we recover the message(s) using the “message_guid”, extract from “BatchMessageRecovery”….

BPMR_75

Refreshing Timers

Unlike the examples above, refreshing timers does not leverage the Recovery API. As previously mentioned, timers can be simply refreshed from the Enterprise Manager Console in PS6 and beyond, or with a simple API call to the “refreshAlarmTable” method on the BPMN Service Engine thus….

BPMR_76

Summary

In this four-part blog we have taken a deep dive into how the BPM engine handles messages, threads, rollbacks & recovery. Whenever we hear from a customer “my message is stuck” or “I’ve lost one of my process instances” we should now know where to look and how to recover it.

Attached to this blog is the JDev project with all SQL queries to find rolled back messages and all Java code to recover them….

InstanceRecoveryExample

 


Comments

  1. Nick Collier says:

    Hi Mark,

    Thanks for this Article, it has been really useful.

    With respect to the BPM Alarm ‘refresh’. When using the programmatic call described in your reschedule timers Java class,
    Do you have to target each SOA server in a multi node cluster?
    Or are you supposed to target the cluster name?

    Or should a call to one server suffice to refresh all of them?

    The ListTimers java project implies to get the instances associated to a server, you target each server individually in a comma separated list, which makes sense.
    I just wasn’t sure the same reasoning could be applied to the rescheduling.

    thanks,

    Nick
