Recovery (Magic xpi 4.1)
Magic xpi can recover automatically from many disaster and failure scenarios. You define the recovery policy in the Recovery policy property of the Flow Properties dialog box. Recovery is carried out on three levels, which are explained below.
Note: The ExceptionMessageBoxDisplay flag determines the behavior when a thread terminates abnormally.
If a machine or instance running Magic xpi components fails, shuts down, or disconnects from the network, or if one of the grid components is aborted or stops functioning, the grid has recovery mechanisms that continue to provide service completely automatically. All grid services will continue to function uninterrupted, and no data will be lost as long as at least one machine is still running.
When a machine running Magic xpi workers fails, it triggers the recovery mechanisms for engine or worker failures. These mechanisms are discussed below in more detail.
Magic xpi has two distinct types of workers:
- Flow worker: A flow worker performs flow steps. Flow workers start and wait for flow invocation requests. When a flow invocation request is created, a flow worker executes the requested flow, as defined in the project.
- Trigger worker: A trigger worker is responsible for “listening” to events that originate from external sources. Trigger workers start and wait for external events to occur. When such an event occurs, the trigger creates a flow invocation request with the event data. This is handled by available flow workers.
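The division of labor between the two worker types can be illustrated with a minimal Python sketch. It is not Magic xpi code: the queue, the function names, the `upper_case` flow, and the `None` stop sentinel are all hypothetical stand-ins, used only to show a trigger creating flow invocation requests that a waiting flow worker then executes.

```python
import queue
import threading

# All names here are hypothetical; none of this is the Magic xpi API.
invocation_requests = queue.Queue()  # pending flow invocation requests
results = []                         # output of the executed flows

def trigger_worker(external_events):
    """Create one flow invocation request, carrying the event data, per external event."""
    for event in external_events:
        invocation_requests.put({"flow": event["flow"], "data": event["data"]})

def flow_worker(flows):
    """Wait for flow invocation requests and execute the requested flow."""
    while True:
        request = invocation_requests.get()
        if request is None:  # sentinel used by this sketch to stop the worker
            break
        flow = flows[request["flow"]]
        results.append(flow(request["data"]))

# A stand-in for the flows defined in a project.
flows = {"upper_case": lambda data: data.upper()}

worker = threading.Thread(target=flow_worker, args=(flows,))
worker.start()  # the flow worker starts and waits for requests
trigger_worker([
    {"flow": "upper_case", "data": "order-1"},
    {"flow": "upper_case", "data": "order-2"},
])
invocation_requests.put(None)
worker.join()
```

Because the trigger only enqueues requests, any available flow worker can pick them up; the sketch uses a single worker to keep the execution order predictable.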
If one of the Magic xpi engines or workers aborts unexpectedly, the recovery mechanism involves three parts:
- Identification: Each worker and engine reports its status to the Space. The dedicated Magic processing unit monitors these reports. Once the processing unit identifies a worker or engine that did not report within the defined timeout, the following action is taken:
  - In the case of a worker (either trigger or flow), the processing unit asks the engine running the worker for the worker status.
  - In the case of an engine, the processing unit asks the Grid Service Agent (GSA) that started the engine if it is still alive.
- Handling the engines’ and workers’ recovery: When an unresponsive or non-existent engine is identified, the GSA will be instructed to terminate the existing engine and start a new one to replace the one that crashed. In case of a thread crash, the engine will be instructed to start a new worker under the same engine.
- Handling the work processes’ recovery: A work process refers to the execution tree of a root flow, or a set of flows, comprising a single business transaction. The recovery guidelines are as follows:
  - Recovery is defined for the root flow that starts the work process, and not for individual flows. (Recovery definitions in child flows or child contexts are ignored.)
  - Save Points are only saved on the linear branch of the root flow. (Any Save Points defined in child flows or child contexts are ignored.)
  - A work process can be run by one or more workers (in the case of parallel branches), running on one or more separate physical servers.
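The identification and handling steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the processing unit's actual logic: the `Report` type, the function names, and the timeout value are hypothetical, and the "no action" result for a component that turns out to be healthy is an assumption, since that case is not described here.

```python
from dataclasses import dataclass

@dataclass
class Report:
    """Hypothetical record of the last status report a component wrote to the Space."""
    name: str
    kind: str           # "worker" or "engine"
    last_report: float  # time of the last status report, in seconds

def find_silent(reports, now, timeout):
    """Identification: return components that did not report within the timeout."""
    return [r for r in reports if now - r.last_report > timeout]

def recovery_action(component, confirmed_failed):
    """Handling: pick the action once the follow-up query has been answered.

    confirmed_failed is the outcome of asking the owning engine (for a worker)
    or the GSA that started it (for an engine).
    """
    if not confirmed_failed:
        return "no action"  # assumption: this case is not covered by the text
    if component.kind == "engine":
        return "GSA terminates the engine and starts a new one"
    return "engine starts a new worker under the same engine"

reports = [
    Report("flow-worker-1", "worker", last_report=100.0),
    Report("engine-1", "engine", last_report=40.0),
    Report("trigger-worker-1", "worker", last_report=30.0),
]
silent = find_silent(reports, now=105.0, timeout=60.0)  # timeout value is illustrative
actions = {c.name: recovery_action(c, confirmed_failed=True) for c in silent}
```

Separating detection (`find_silent`) from the decision (`recovery_action`) mirrors the two-stage description: a missed report only triggers a query, and the recovery action depends on the answer.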
When triggered, the recovery mechanism works as follows:
Note: If you do not select a recovery policy, the default is None.
If a Magic xpi server fails to report its status, but the MgxpiServer.EXE process is still running, the MgxpiServer.EXE process will be terminated after a grace period that is automatically set by Magic xpi. The grace period is between 20 and 120 seconds depending on the scenario. A GSA will then be instructed to start a new Magic xpi server to replace it.
The recovery policy in the event of a Magic xpi server crash is to raise a new Magic xpi server with its initial configuration. Threads that were terminated due to the server crash will be recovered as per their recovery policy.
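The server-level sequence described above can be summarized in a short Python sketch. It is only a model of the ordering of steps: the function and parameter names are hypothetical, and because Magic xpi sets the grace period automatically, the sketch simply takes it as an input and checks it against the documented 20 to 120 second range.

```python
GRACE_MIN_SECONDS = 20   # lower bound stated in the text
GRACE_MAX_SECONDS = 120  # upper bound stated in the text

def server_recovery_steps(process_running, silent_for, grace_period):
    """Return the ordered recovery steps for a server that stopped reporting.

    process_running: whether MgxpiServer.EXE is still running
    silent_for:      seconds since the server stopped reporting (hypothetical input)
    grace_period:    seconds; chosen by Magic xpi, passed in here for illustration
    """
    if not GRACE_MIN_SECONDS <= grace_period <= GRACE_MAX_SECONDS:
        raise ValueError("grace period must be between 20 and 120 seconds")
    steps = []
    if process_running:
        if silent_for < grace_period:
            return ["wait for the grace period to elapse"]
        steps.append("terminate MgxpiServer.EXE")
    steps.append("GSA starts a new Magic xpi server with its initial configuration")
    steps.append("recover terminated threads per their recovery policy")
    return steps

hung_server = server_recovery_steps(process_running=True, silent_for=45, grace_period=30)
crashed_server = server_recovery_steps(process_running=False, silent_for=5, grace_period=30)
```

In this model a hung server and a crashed server differ only in the extra termination step; both end with a replacement server and per-policy thread recovery.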
Storing Recovery Data
Flow Timeouts