Recovery (Magic xpi 4.5)
Magic xpi can recover automatically from many disaster and failure scenarios.
Before designing your project, it is recommended that you first understand the server architecture. For a visual explanation of how the server architecture works, click here.
You define the recovery policy in the Recovery policy property of the Flow Properties dialog box. The recovery policy only applies to the main linear flow. Recovery is carried out on three levels:
These are further explained below.
Note:
The ExceptionMessageBoxDisplay flag determines the behavior when a thread terminates abnormally. For more information, click here.
If a machine or instance running Magic xpi components fails, shuts down, or disconnects from the network, or if one of the grid components is aborted or stops functioning, the grid has recovery mechanisms that continue to provide service completely automatically. All grid services will continue to function uninterrupted, and no data will be lost as long as at least one machine is still running.
When a machine running Magic xpi workers fails, it triggers the recovery mechanisms for engine or worker failures. These mechanisms are discussed below in more detail.
If one of the Magic xpi engines or workers aborts unexpectedly, the recovery mechanism involves three parts:
- Identification: Each worker and engine reports its status to the Space using a Flow Request message. This covers parallel branches, stand-alone branches, and flow executions started by both triggers and auto start flows. The dedicated Magic processing unit monitors these reports. When the processing unit identifies a worker or engine that did not report within the defined timeout, it takes the following action:
  - In the case of a worker (either trigger or flow), the processing unit asks the engine running the worker for the worker status.
  - In the case of an engine, the processing unit asks the Grid Service Agent (GSA) that started the engine whether the engine is still alive.
- Handling the engines’ and workers’ recovery: When an unresponsive or non-existent engine is identified, the GSA is instructed to terminate the existing engine and start a new one to replace it. In the case of a thread crash, the engine is instructed to start a new worker under the same engine.
- Handling the work processes’ recovery: A work process refers to the execution tree of a root flow, or a set of flows, comprising a single business transaction. The recovery guidelines are as follows:
  - Recovery is defined for the root flow that starts the work process, and not for individual flows. (Recovery definitions in child flows or child contexts are ignored.)
  - Save Points are only saved on the linear branch of the root flow. (Any Save Points defined in child flows or child contexts are ignored.)
  - A work process can be run by one or more workers (in the case of parallel branches), running on one or more separate physical servers.
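The identification step described above can be illustrated with a small sketch. It is not Magic xpi code: the Reporter class, the function names, and the sample values are all hypothetical, and it only models the rule that an overdue worker is checked through its engine while an overdue engine is checked through the GSA that started it.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Reporter:
    """A worker or engine that reports its status (hypothetical model)."""
    name: str
    kind: str           # "worker" or "engine"
    parent: str         # engine name for a worker, GSA name for an engine
    last_report: float  # time of the last status report, in seconds


def find_overdue(reporters: List[Reporter], now: float, timeout: float) -> List[Reporter]:
    """Return every reporter whose last status report is older than `timeout`."""
    return [r for r in reporters if now - r.last_report > timeout]


def next_action(reporter: Reporter) -> Tuple[str, str]:
    """Return (who to ask, what to ask) for an overdue reporter."""
    if reporter.kind == "worker":
        # An overdue worker is checked through the engine that runs it.
        return (reporter.parent, f"query status of worker {reporter.name}")
    if reporter.kind == "engine":
        # An overdue engine is checked through the GSA that started it.
        return (reporter.parent, f"check whether engine {reporter.name} is alive")
    raise ValueError(f"unknown reporter kind: {reporter.kind}")


# Illustrative data only; names and times are made up.
reporters = [
    Reporter("worker-1", "worker", "engine-A", last_report=95.0),
    Reporter("worker-2", "worker", "engine-A", last_report=40.0),
    Reporter("engine-A", "engine", "gsa-1", last_report=30.0),
]

overdue = find_overdue(reporters, now=100.0, timeout=30.0)
actions = [next_action(r) for r in overdue]
for target, request in actions:
    print(f"ask {target}: {request}")
```

In this sketch worker-1 reported recently enough and is left alone, while worker-2 and engine-A are overdue and are followed up through engine-A and gsa-1 respectively.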
When triggered, the recovery mechanism works as follows:
Note:
If you do not select a recovery policy, the default is None.
Once the timeout set on the flow plus the FlowTimeoutGraceTime has elapsed, the relevant MgxpiServer.EXE processes terminate all the workers that serve the same Root Flow Sequence ID (FSID) and then execute the Timeout policy defined on the root flow.
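The timing rule above amounts to simple arithmetic, sketched here. The function names are hypothetical, and the sketch assumes that the flow timeout and FlowTimeoutGraceTime are expressed in the same unit (seconds are used for illustration).

```python
def termination_deadline(flow_start: float, flow_timeout: float, grace_time: float) -> float:
    """Moment at which workers of the root flow become eligible for termination.

    Assumption: all three values use the same unit (seconds here).
    """
    return flow_start + flow_timeout + grace_time


def should_terminate(now: float, flow_start: float, flow_timeout: float, grace_time: float) -> bool:
    """True once the flow timeout plus the grace time has elapsed."""
    return now >= termination_deadline(flow_start, flow_timeout, grace_time)


# Illustrative values only: a flow started at t=0 with a 60 s timeout
# and a 10 s grace time is left alone at t=65 and terminated by t=75.
print(should_terminate(now=65.0, flow_start=0.0, flow_timeout=60.0, grace_time=10.0))
print(should_terminate(now=75.0, flow_start=0.0, flow_timeout=60.0, grace_time=10.0))
```

The point of the sketch is that the flow's own timeout is not the moment of termination: the grace time is added on top before the workers are stopped and the Timeout policy runs.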
If a Magic xpi server fails to report its status, but the MgxpiServer.EXE process is still running, the MgxpiServer.EXE process will be terminated after a grace period that is automatically set by Magic xpi. The grace period is between 20 and 120 seconds, depending on the scenario. A GSA will then be instructed to start a new Magic xpi server to replace it. The recovery policy in the event of a Magic xpi server crash is to start a new Magic xpi server with its initial configuration. Threads that were terminated due to the server crash will be recovered according to their recovery policy.
The RequesterTimeoutSec setting in the Mgreq.ini file determines the timeout for the HTTP trigger. If RequesterTimeoutSec is shorter than the time it takes to execute the recovery policy on a crashed flow, the recovery will not happen.
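A quick way to reason about this constraint is to compare the configured value with an estimate of how long recovery takes. The sketch below is only an illustration: it assumes the setting appears in Mgreq.ini as a plain `RequesterTimeoutSec = <seconds>` line, the helper names are hypothetical, and the sample values are made up rather than product defaults.

```python
from typing import Optional


def read_requester_timeout(ini_text: str) -> Optional[int]:
    """Return the RequesterTimeoutSec value found in the text, or None.

    Assumption: the setting is written as a `key = value` line.
    Blank lines, comment lines and section headers are skipped.
    """
    for raw_line in ini_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith((";", "#", "[")):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip().lower() == "requestertimeoutsec":
            return int(value.strip())
    return None


def recovery_fits(requester_timeout_sec: int, estimated_recovery_sec: float) -> bool:
    """True if the recovery policy can finish before the HTTP trigger times out."""
    return estimated_recovery_sec <= requester_timeout_sec


# Illustrative content only; 120 is not a documented default.
sample = "; sample fragment\nRequesterTimeoutSec = 120\n"
timeout = read_requester_timeout(sample)
print(timeout, recovery_fits(timeout, 90), recovery_fits(timeout, 150))
```

If the second check fails for your own estimate, the HTTP request times out before the recovery policy can run, so the timeout would need to be raised or the recovery time reduced.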
Storing Recovery Data