Oban.Pro.Plugins.DynamicLifeline
(Oban Pro v1.7.0)
The DynamicLifeline plugin uses producer records to periodically rescue orphaned jobs, i.e.
jobs that are stuck in the executing state because the node was shut down before the job could
finish. In addition, it performs the following maintenance tasks:
- Discard jobs left available with exhausted attempts due to rare edge cases
- Repair stuck workflows with deleted dependencies or missed scheduling events
- Repair stuck chains with deleted dependencies or missed scheduling events
- Repair jobs in partitioned queues that are missing a partition key
- Repair chunk jobs that are missing a computed chunk_id
Without DynamicLifeline you'll need to manually rescue stuck jobs or perform maintenance.
Using the Plugin
To use the DynamicLifeline plugin, add the module to your list of Oban plugins in
config.exs:
config :my_app, Oban,
  engine: Oban.Pro.Engines.Smart,
  plugins: [Oban.Pro.Plugins.DynamicLifeline]
  ...

No configuration is necessary; the defaults are tuned for most systems.
Options
- :repair_limit - the maximum number of jobs each repair operation (workflows, chains, partitions, chunks) will process per cycle. Defaults to 1000.
- :retry_exhausted - when true, jobs that have exhausted their attempts are rescued back to available with an incremented max_attempts instead of being discarded. Defaults to false. See "Rescuing Exhausted Jobs" for details.
- :timeout - the maximum time allowed for each rescue query, in milliseconds. Increase this if your system is under high load or produces a multitude of orphans. Defaults to 45_000.
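As an illustration of passing these options, the fragment below configures the plugin as a `{module, options}` tuple. The specific values (500 jobs, 60 seconds) are placeholders chosen for the example, not recommendations, and `:my_app` is the same application name used in the earlier snippet.

```elixir
import Config

config :my_app, Oban,
  engine: Oban.Pro.Engines.Smart,
  plugins: [
    {Oban.Pro.Plugins.DynamicLifeline,
     # Illustrative value: process at most 500 jobs per repair operation per cycle
     repair_limit: 500,
     # Illustrative value: allow each rescue query up to 60 seconds (in milliseconds)
     timeout: :timer.seconds(60)}
  ]
```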
Automatic Repairs
The plugin automatically performs several repair operations during each rescue cycle to keep workflows, chains, partitions, and chunks healthy.
Workflow repair: Jobs held waiting for dependencies that were deleted or missed a scheduling event are released. This handles edge cases where workflow jobs get stuck due to incomplete dependency resolution.
Chain repair: Similar to workflows, chain jobs waiting on deleted or stuck predecessors are released to continue processing.
Partition repair: Jobs in partitioned queues that are missing a partition_key in their metadata (e.g., jobs scheduled before a queue was partitioned) have their partition key computed and set automatically.
Chunk repair: Chunk jobs that are missing a chunk_id in their metadata have their chunk ID computed from the job's chunk_by configuration and set automatically. This can happen when chunk settings are changed after jobs are already enqueued.
Each repair operation processes up to repair_limit jobs per cycle.
Identifying Rescued Jobs
Rescued jobs can be identified by a rescued value in meta. Each rescue increments the
rescued count by one.
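A minimal sketch of reading that counter from a job is shown below. It assumes the value is stored under the string key "rescued" (job meta is JSON-decoded, so string keys are expected) and that the key is absent on jobs that were never rescued. The `MyApp.RescueInfo` module and its functions are hypothetical helpers, not part of Oban.

```elixir
defmodule MyApp.RescueInfo do
  # Hypothetical helper: works on any map or struct with a :meta map,
  # such as an %Oban.Job{}.

  # Number of times the job was rescued; 0 when the key is absent (assumption).
  def rescue_count(%{meta: meta}), do: Map.get(meta, "rescued", 0)

  # True when the job has been rescued at least once.
  def rescued?(job), do: rescue_count(job) > 0
end
```

Inside a worker, a check like `MyApp.RescueInfo.rescued?(job)` could be used to skip side effects that may already have happened before the node went down.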
Rescuing Exhausted Jobs
When a job's attempt matches its max_attempts, its retries are considered "exhausted".
Normally, the DynamicLifeline plugin transitions exhausted jobs to the discarded state and
they won't be retried again. It does this for a couple of reasons:
- To ensure at-most-once semantics. Suppose a long-running job interacted with a non-idempotent service and was shut down while waiting for a reply; you may not want that job to retry.
- To prevent infinitely crashing BEAM nodes. Poorly behaving jobs may crash the node (through NIFs, memory exhaustion, etc.). We don't want to repeatedly rescue and rerun a job that repeatedly crashes the entire node.
When exhausted jobs are discarded, the Oban.Pro.Worker.on_discarded/2 callback is called
with an :exhausted reason, which is useful for error reporting or notifications.
Discarding exhausted jobs may not always be desired. Use the retry_exhausted option if you'd
prefer to retry exhausted jobs when they are rescued, rather than discarding them:
plugins: [{Oban.Pro.Plugins.DynamicLifeline, retry_exhausted: true}]

During rescues, with retry_exhausted: true, a job's max_attempts is incremented and it is
moved back to the available state.
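The rescue rules described in this section can be modeled as a small pure function. This is a sketch of the documented behavior, not the plugin's implementation: `MyApp.LifelineModel` is hypothetical, jobs are plain maps, exhaustion is treated as `attempt >= max_attempts`, and the model assumes the "rescued" counter in meta is incremented whenever a job returns to available.

```elixir
defmodule MyApp.LifelineModel do
  # Hypothetical model of how an orphaned job is handled during a rescue.
  def rescue_job(job, opts \\ []) do
    retry_exhausted? = Keyword.get(opts, :retry_exhausted, false)

    cond do
      # Attempts remain: the job simply goes back to available.
      job.attempt < job.max_attempts ->
        make_available(job)

      # Exhausted, but retries were requested: raise the ceiling by one.
      retry_exhausted? ->
        make_available(%{job | max_attempts: job.max_attempts + 1})

      # Exhausted with the default setting: the job is discarded.
      true ->
        %{job | state: "discarded"}
    end
  end

  # Assumption: every rescue bumps the "rescued" counter in meta.
  defp make_available(job) do
    meta = Map.update(job.meta, "rescued", 1, &(&1 + 1))
    %{job | state: "available", meta: meta}
  end
end
```

Calling `MyApp.LifelineModel.rescue_job(job, retry_exhausted: true)` on an exhausted job yields a job in the available state with one extra allowed attempt, which mirrors the behavior described above.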
Instrumenting with Telemetry
The DynamicLifeline plugin adds the following metadata to the [:oban, :plugin, :stop] event:
- :rescued_jobs - a list of jobs transitioned back to available
- :discarded_jobs - a list of jobs transitioned to discarded
Note: jobs only include id, queue, and state fields.
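One way to consume this metadata is a telemetry handler that logs how many jobs were rescued or discarded in each cycle. The sketch below assumes the `:telemetry` library is available (it is a dependency of Oban) and that plugin events carry a `:plugin` key in their metadata identifying the emitting module. The `MyApp.LifelineTelemetry` module and the handler id are hypothetical names.

```elixir
defmodule MyApp.LifelineTelemetry do
  require Logger

  @event [:oban, :plugin, :stop]
  @plugin Oban.Pro.Plugins.DynamicLifeline

  # Attach once, e.g. from your application's start/2 callback.
  def attach do
    :telemetry.attach("my-app-lifeline", @event, &__MODULE__.handle_event/4, nil)
  end

  # Only react to stop events emitted by DynamicLifeline
  # (assumes the metadata includes a :plugin key).
  def handle_event(@event, _measurements, %{plugin: @plugin} = meta, _config) do
    {rescued, discarded} = counts(meta)
    Logger.info("DynamicLifeline rescued=#{rescued} discarded=#{discarded}")
  end

  # Ignore events from other plugins.
  def handle_event(_event, _measurements, _meta, _config), do: :ok

  # Count the jobs in each list, treating a missing key as an empty list.
  def counts(meta) do
    {length(Map.get(meta, :rescued_jobs, [])), length(Map.get(meta, :discarded_jobs, []))}
  end
end
```

Because the listed jobs only carry id, queue, and state, a handler that needs args or meta would have to load the full job by id.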
Summary
Types
@type option() ::
  {:conf, Oban.Config.t()}
  | {:name, Oban.name()}
  | {:repair_limit, pos_integer()}
  | {:retry_exhausted, boolean()}
  | {:rescue_interval, timeout()}
  | {:timeout, timeout()}