Changelog for Oban Pro v1.3

This release is entirely dedicated to Smart engine optimizations, from slashing queue transactions to boosting bulk insert performance.

📮 Async Tracking

Rather than synchronously recording updates (acks) in a separate transaction after jobs execute, the Smart engine now bundles acks together to minimize transactions and reduce load on the database.

Async tracking, combined with the other enhancements detailed below, showed the following improvements over the previous Smart engine when executing 1,000 jobs with concurrency set to 20:

  • Transactions reduced by 97% (1,501 to 51)
  • Queries reduced by 94% (3,153 to 203)

That means less Ecto pool contention, fewer transactions, fewer queries, and fewer writes to the oban_producers table! There are similar, albeit less flashy, improvements over the Basic engine as well.

Notes and Implementation Details

  • Acks are stored centrally per queue and flushed with the next transaction using a lock-free mechanism that never drops an operation.

  • Acks are grouped and executed as a single query whenever possible. This is most visible in high throughput queues.

  • Acks are preserved across transactions to guarantee nothing is lost in the event of a rollback or an exception.

  • Acks are flushed on shutdown and when the queue is paused to ensure data remains as consistent as the previous synchronous version.

  • Acking is synchronous in testing mode, when draining jobs, and when explicitly enabled by a flag provided to the queue.
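The notes above describe acks that are buffered per queue, written together in one transaction, kept on failure, and removed only once persisted. The sketch below models that flow in plain Python. It assumes a single-threaded caller. The class, method names, and the `run_transaction` callback are hypothetical, and it does not reproduce the Smart engine's lock-free mechanism.

```python
from typing import Callable, Dict, List, Tuple

Ack = Tuple[int, str]  # (job_id, new_state)


class AckBuffer:
    """Hypothetical per-queue store for deferred acks; not Oban's implementation."""

    def __init__(self) -> None:
        # Keyed by job id, loosely mirroring a per-producer table.
        self._pending: Dict[int, str] = {}

    def ack(self, job_id: int, state: str) -> None:
        # Record the outcome instead of opening a transaction right away.
        self._pending[job_id] = state

    def flush(self, run_transaction: Callable[[List[Ack]], None]) -> int:
        """Write every pending ack in one transaction and return how many were sent."""
        # Take everything at once rather than selecting by timestamp.
        batch: List[Ack] = list(self._pending.items())
        if not batch:
            return 0
        # If the transaction raises, the deletes below never run,
        # so every ack stays pending for the next flush.
        run_transaction(batch)
        # Delete persisted acks individually, keeping any that changed meanwhile.
        for job_id, state in batch:
            if self._pending.get(job_id) == state:
                del self._pending[job_id]
        return len(batch)

    def __len__(self) -> int:
        return len(self._pending)


if __name__ == "__main__":
    buffer = AckBuffer()
    transactions: List[List[Ack]] = []
    for job_id in range(1, 21):
        buffer.ack(job_id, "completed")
    buffer.flush(transactions.append)  # 20 acks, a single transaction
    print(len(transactions), len(buffer))  # 1 0
```

Because nothing is deleted until `run_transaction` returns, a rollback or exception leaves every ack in place for the next flush.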

See the Smart engine's async tracking section for more details and instructions on how to selectively opt out of async mode.

v1.3.5 — 2024-02-16

Enhancements

  • [DynamicLifeline] Track rescues with a counter in meta.

    Rescued jobs can now be identified by a rescued value in meta. Each rescue increments the rescued count by one.

  • [Smart] Skip taking unique advisory locks in testing mode.

    Advisory locks are global and apply across transactions. That can break async tests with overlapping unique jobs, because a lock held by one concurrent, sandboxed test also blocks the others.
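The DynamicLifeline rescue counter amounts to a read-increment-write on the job's `meta` map. This is a minimal sketch assuming `meta` is a plain JSON-style object; only the `rescued` key comes from the entry above, and the function name is hypothetical. It does not show how the plugin persists the change.

```python
def mark_rescued(meta: dict) -> dict:
    """Return a copy of meta with the "rescued" counter incremented by one."""
    updated = dict(meta)  # leave the caller's map untouched
    updated["rescued"] = updated.get("rescued", 0) + 1
    return updated


meta = mark_rescued({"tag": "import"})
meta = mark_rescued(meta)
print(meta)  # {'tag': 'import', 'rescued': 2}
```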

Bug Fixes

  • [Smart] Prevent stuck jobs with more reliable async ack management.

    Unhandled transaction failures or timeouts while acking could cause uncommitted acks to be lost, leaving jobs stuck in an executing state but unable to be rescued.

    Now acks are pulled from the producer's ETS table all at once, without a time-based select. Successfully persisted acks are deleted from the table individually rather than by time.

  • [Smart] Ensure uniqueness across args when no keys are specified regardless of insertion order.

  • [Smart] Force materializing the CTE when fetching jobs.

    The engine's fetch query uses a CTE as an "optimization fence" to prevent unwanted planner optimizations. Because the CTE is only referenced once, the Postgres optimizer may inline it, which removes the fence. We now force the CTE to be materialized so the fence stays in place.

  • [DynamicPartitioner] Determine date_partition? default lazily at runtime.

    Date partitioning should be disabled in the test environment because the plugin doesn't run there to pre-create partitions. However, using Mix.env at compilation time wasn't reliable enough to prevent sub-partitioning by date in testing environments. This switches the check to runtime.
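Uniqueness "regardless of insertion order" means two args maps with the same keys and values must be treated as equal even when their keys were added in a different order. One common way to get an order-independent comparison is to canonicalize before hashing, as sketched below. This is an illustrative technique, not necessarily what the Smart engine does, and `unique_args_key` is a hypothetical name.

```python
import hashlib
import json


def unique_args_key(args: dict) -> str:
    """Fingerprint a job's args so that key insertion order does not matter."""
    # sort_keys orders keys at every nesting level, and compact separators keep
    # the encoding stable, so equal maps always serialize to the same string.
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = unique_args_key({"user_id": 1, "kind": "welcome"})
b = unique_args_key({"kind": "welcome", "user_id": 1})
print(a == b)  # True
```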

v1.3.4 — 2024-02-06

Enhancements

  • [DynamicPartitioner] Improve partitioned structure and indexes for performance.

    The partitioner migration now creates fewer partitions to improve staging queries. To match, it creates more specific indexes for each table to prevent sequential scans during common queries.

    All partitions gain a primary key index, and the compound index for "completed" states now omits fields that are only needed for incomplete states.

    Indexes on existing partitioned tables should be updated as shown below. However, note that indexes can't be added to partitioned tables concurrently.

    -- This will cascade down to all partitions
    CREATE INDEX oban_jobs_pkey ON oban_jobs (id);
    
    -- Add an index that includes state for the available, retryable, and scheduled partitions
    -- (index names must be unique within a schema)
    CREATE INDEX oban_jobs_available_state_queue_priority_scheduled_at_id_index
      ON oban_jobs_available (state, queue, priority, scheduled_at, id);
    CREATE INDEX oban_jobs_retryable_state_queue_priority_scheduled_at_id_index
      ON oban_jobs_retryable (state, queue, priority, scheduled_at, id);
    CREATE INDEX oban_jobs_scheduled_state_queue_priority_scheduled_at_id_index
      ON oban_jobs_scheduled (state, queue, priority, scheduled_at, id);
    
    -- Add a simpler index to completed states
    CREATE INDEX oban_jobs_cancelled_queue_scheduled_at_id_index ON oban_jobs_cancelled (queue, scheduled_at, id);
    CREATE INDEX oban_jobs_completed_queue_scheduled_at_id_index ON oban_jobs_completed (queue, scheduled_at, id);
    CREATE INDEX oban_jobs_discarded_queue_scheduled_at_id_index ON oban_jobs_discarded (queue, scheduled_at, id);
    
    -- Drop the previous index that lacked the state
    DROP INDEX oban_jobs_queue_priority_scheduled_at_id_index;

  • [DynamicPartitioner] Allow using DynamicPruner with DynamicPartitioner

    There are situations where it's still useful to use DynamicPruner to aggressively prune a subset of jobs more frequently than partitioning allows.

  • [Smart] Augment all frequent queries with a state condition to aid partitioned queries.

    Partitioned table queries require the state for partition pruning, especially for an id only query. This changes the Smart engine's queries to run optimally with either a standard or partitioned table.
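To see why the state matters for an id-only query, here is a toy model of a table partitioned by state. It is not Postgres: the class and its scan counter are hypothetical and only illustrate the idea that a lookup with a state can skip every other partition, while a lookup by id alone must check them all.

```python
from typing import Dict, Optional


class StatePartitionedJobs:
    """Toy model of a jobs table partitioned by state (not real Postgres)."""

    def __init__(self) -> None:
        self.partitions: Dict[str, Dict[int, dict]] = {}
        self.partitions_scanned = 0

    def insert(self, job: dict) -> None:
        # Route each job to the partition for its state.
        self.partitions.setdefault(job["state"], {})[job["id"]] = job

    def find(self, job_id: int, state: Optional[str] = None) -> Optional[dict]:
        # With a state, only the matching partition is consulted (pruning).
        # Without one, partitions are checked until the job turns up.
        if state is not None:
            candidates = [self.partitions.get(state, {})]
        else:
            candidates = list(self.partitions.values())
        for partition in candidates:
            self.partitions_scanned += 1
            if job_id in partition:
                return partition[job_id]
        return None


jobs = StatePartitionedJobs()
jobs.insert({"id": 1, "state": "available"})
jobs.insert({"id": 2, "state": "completed"})
jobs.insert({"id": 3, "state": "executing"})
jobs.find(3, state="executing")  # consults 1 partition
jobs.find(3)                     # consults 3 partitions
print(jobs.partitions_scanned)   # 4
```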

Bug Fixes

  • [DynamicPartitioner] Override configured prefix during backfills

    DynamicPartitioner backfills retained the configured prefix, which defaults to public, without respecting the new_prefix option. Now the prefix is overridden and correctly escaped before usage.

  • [Worker] Make embed_one with required fields truly optional

    The recursive nature of embedded structs made it impossible to have an optional embed_one with required fields. Now the embedded fields are only validated when a value is provided.

  • [Workflow] Use t:name for :names in t:fetch_opts

    Names may be either an atom or a string, and they're always coerced before querying anyhow.
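Escaping a schema prefix before interpolating it into generated SQL usually means quoting it as a delimited identifier: wrap it in double quotes and double any embedded double quotes. The sketch below applies that SQL rule; the helper names are hypothetical and this is not DynamicPartitioner's code. Postgres's own `quote_ident()` follows the same rule, although it only adds quotes when they are needed.

```python
def quote_ident(name: str) -> str:
    """Quote a Postgres identifier, such as a schema prefix, as a delimited identifier."""
    # Embedded double quotes are escaped by doubling them.
    return '"' + name.replace('"', '""') + '"'


def qualified_table(prefix: str, table: str) -> str:
    # Hypothetical helper: build "prefix"."table" for a generated statement.
    return f"{quote_ident(prefix)}.{quote_ident(table)}"


print(qualified_table("private", "oban_jobs"))  # "private"."oban_jobs"
```

Quoting also makes the name case-sensitive, so this only matches a schema whose stored name is exactly the given string.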

v1.3.3 — 2024-01-23

This release depends on Oban v2.17.3 or greater

Enhancements

  • [Worker] Add after_process/3 hook callback that includes execution results.

    The new after_process/3 callback receives the job's return value as its third argument. That gives hooks immediate access to the result without recording it and then fetching it from the database.

  • [Testing] Ensure queues are started before returning from start_supervised_oban/1.

    This prevents race conditions between when start_supervised_oban returns and when queues are fully booted, i.e. actively listening to pause/resume/scale signals.

  • [DynamicPruner] Add by_state_timestamp option to DynamicPruner.

    In rare situations where the scheduled_at timestamp isn't accurate enough to identify prunable jobs, e.g. cancelling large swaths of jobs scheduled far into the future, the new by_state_timestamp: true option can be used for increased accuracy.

  • [DynamicQueues] Accept ack_async/refresh_interval in DynamicQueues.

    DynamicQueues now allows the ack_async and refresh_interval virtual fields for parity with standard queues.
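The by_state_timestamp entry describes a case where scheduled_at misrepresents a job's age: a job scheduled far into the future and then cancelled looks "new" by scheduled_at, even though it finished long ago. The check below sketches the difference. It assumes the `completed_at`, `cancelled_at`, and `discarded_at` job columns. The function itself is hypothetical, and it only illustrates the idea rather than reproducing DynamicPruner's queries.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Each finished state and the column that records when a job entered it.
STATE_TIMESTAMPS = {
    "completed": "completed_at",
    "cancelled": "cancelled_at",
    "discarded": "discarded_at",
}


def is_prunable(job: dict, cutoff: datetime, by_state_timestamp: bool = False) -> bool:
    """Decide whether a finished job is older than cutoff (illustrative only)."""
    column = STATE_TIMESTAMPS.get(job["state"])
    if column is None:
        return False  # this sketch only considers finished states
    key = column if by_state_timestamp else "scheduled_at"
    stamp: Optional[datetime] = job.get(key)
    return stamp is not None and stamp < cutoff


now = datetime(2024, 1, 23, tzinfo=timezone.utc)
job = {
    "state": "cancelled",
    "scheduled_at": now + timedelta(days=365),  # scheduled a year out
    "cancelled_at": now - timedelta(days=30),   # cancelled a month ago
}
cutoff = now - timedelta(days=7)
print(is_prunable(job, cutoff))                           # False
print(is_prunable(job, cutoff, by_state_timestamp=True))  # True
```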

Bug Fixes

  • [Smart] Serialize all meta updates through the producer.

    This fixes race conditions and stale-record issues caused by mismatches between the registry's meta and the producer's own meta.

  • [Smart] Force synchronous acking for all Batch jobs.

    Batches require accurate status counts to insert callbacks correctly. The slight delay from async acking can cause incorrect counts in highly active queues with batches.

v1.3.2 — 2024-01-19

Bug Fixes

  • [Smart] Ensure global queues keep running with ack_async: false.

    Global queues that are marked with ack_async: false must refresh the in-memory producer record between job fetches to keep the queue running. Otherwise, tracked jobs linger in the producer record despite successful acking.

  • [Smart] Prevent a race condition while pausing from stopping global queues.

    Pausing a global queue while there are pending acks could trigger a write-after-read race condition that lost tracking changes. Eventually, leaked changes could prevent the queue from fetching new jobs because it looked like the global limit was met.

  • [Smart] Always split completed ack queries for recorded jobs.

    Jobs with different recorded output could mistakenly be written with a single query if they completed within a few ms of each other. This changes the grouping mechanism to only bundle simple completions, never recorded completions.
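The grouping rule from the last fix can be sketched as follows: plain completions share one write, while every recorded completion is written on its own so one job's output can never land on another job. The ack shape and function name are hypothetical. For simplicity, the sketch also writes every non-completion ack separately, which is stricter than the engine's "group whenever possible" behavior.

```python
from typing import Any, Dict, List, Optional, Tuple

Operation = Tuple[str, List[int], Optional[Any]]  # (state, job_ids, recorded_output)


def group_acks(acks: List[Dict[str, Any]]) -> List[Operation]:
    """Bundle simple completions into one operation; never bundle recorded ones."""
    simple_ids: List[int] = []
    operations: List[Operation] = []
    for ack in acks:
        if ack["state"] == "completed" and ack.get("recorded") is None:
            simple_ids.append(ack["id"])
        else:
            # Recorded output (or any other state) gets its own write.
            operations.append((ack["state"], [ack["id"]], ack.get("recorded")))
    if simple_ids:
        operations.insert(0, ("completed", simple_ids, None))
    return operations


ops = group_acks([
    {"id": 1, "state": "completed"},
    {"id": 2, "state": "completed", "recorded": "report-a"},
    {"id": 3, "state": "completed"},
])
print(ops)  # [('completed', [1, 3], None), ('completed', [2], 'report-a')]
```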

v1.3.1 — 2024-01-17

Bug Fixes

  • [Smart] Default to synchronous acking when Oban is in a testing mode.

    Acking should always be synchronous during tests to prevent flickering failures from race conditions. Previously, acking relied on a failed registry lookup to switch to synchronous mode, which wasn't accurate enough.

  • [Smart] Default to synchronous acking for drain_jobs/2 and related test helpers.

    Draining runs synchronously in the test process, but not in testing mode. This explicitly disables ack_async when draining jobs.

  • [DynamicPartitioner] Only sub-partition by date in non-test environments.

    To prevent testing errors after migration, the completed, cancelled, and discarded states are sub-partitioned by date only in :dev and :prod environments.

    It's possible to enable date partitioning in other production-like environments with the new date_partition? flag.

  • [DynamicPartitioner] Rename existing args and meta indexes to allow index recreation

    When renaming the existing table to oban_jobs_old, the args and meta indexes weren't renamed. That prevented creating those indexes on the new partitioned table, because Postgres saw that indexes with those names already existed and skipped creating them.
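The date_partition? behavior (an explicit flag wins, otherwise enabled only in dev and prod, with the environment checked at runtime) can be sketched as a small resolver. This is a Python analogy for an Elixir plugin: the `APP_ENV` variable stands in for Mix.env and is purely hypothetical, as is the function name. Reading the environment inside the function, rather than once at load time, mirrors the v1.3.5 switch from a compile-time to a runtime check.

```python
import os
from typing import Mapping, Optional


def date_partition_enabled(opts: Mapping[str, object], env: Optional[str] = None) -> bool:
    """Resolve a date_partition? default lazily (hypothetical analogy)."""
    # An explicit option always takes precedence.
    if "date_partition?" in opts:
        return bool(opts["date_partition?"])
    # Otherwise look up the environment each time the function runs.
    current = env if env is not None else os.environ.get("APP_ENV", "dev")
    return current in ("dev", "prod")


print(date_partition_enabled({}, env="test"))                            # False
print(date_partition_enabled({"date_partition?": True}, env="staging"))  # True
```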

v1.3.0 — 2024-01-16

Enhancements

  • [Smart] Skip extra query to "touch" the producer when acking without global or rate limiting enabled. This change reduces overall producer updates from 1 per job to 2 per minute for standard queues.

  • [Smart] Avoid refetching the local producer's data when fetching new jobs.

    Async acking is centralized through the producer, which guarantees global and rate tracking data is up-to-date before fetching without an additional read.

  • [Smart] Optimize job insertion with fewer iterations.

    Iterating through job changesets as a map/reduce with fewer conversions improves the performance of inserting 1k jobs by 10% while reducing overall memory by 9%.

  • [Smart] Efficiently count changesets during insert_all.

    Prevent duplicate iterations through changesets to count unique jobs. Iterating through them once to accumulate multiple counts improved insertion by 3% and reduced overall memory by 2%.

  • [Smart] Acking cancelled jobs is done with a single operation and limited to queues with global limiting.
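The insert_all counting change replaces separate passes with one pass that accumulates several counts at once. The sketch below shows the pattern. The `unique` key on each changeset is a hypothetical stand-in for however unique options are represented. A single pass also works with lazy input, such as a generator, that can only be traversed once.

```python
from typing import Iterable, Tuple


def count_changesets(changesets: Iterable[dict]) -> Tuple[int, int]:
    """Return (total, unique) from a single pass over hypothetical changesets."""
    total = unique = 0
    for changeset in changesets:
        total += 1
        if changeset.get("unique"):
            unique += 1
    return total, unique


changesets = ({"unique": {"period": 60}} if i % 2 else {} for i in range(5))
print(count_changesets(changesets))  # (5, 2)
```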

Bug Fixes

  • [Smart] Always merge acked meta updates dynamically.

    All meta-updating queries are dynamically merged with existing meta. This prevents recorded jobs from clobbering other meta updates made while the job executed.

  • [Smart] Safely extract producer uuid from attempted_by with more than two elements

  • [DynamicCron] Preserve stored opts such as args, priority, etc., on reboot when no new opts are set.

  • [Relay] Skip attempting relay notifications when the associated Oban pid isn't alive.
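The meta fix contrasts two strategies: rebuilding meta from the copy a job started with (which clobbers updates made while it ran) versus merging only the changed keys into whatever is stored now. Both are sketched below with illustrative key names. The merge behaves like Postgres's top-level `jsonb ||` concatenation, where keys on the right replace keys on the left. This is an analogy, not the engine's actual query.

```python
def overwrite_meta(_stored: dict, started_with: dict, changes: dict) -> dict:
    # Clobbering: rebuild meta from the copy the job started with,
    # ignoring anything stored since then.
    return {**started_with, **changes}


def merge_meta(stored: dict, changes: dict) -> dict:
    # Dynamic merge: apply only the changed keys to the meta stored right now.
    return {**stored, **changes}


started_with = {"tag": "nightly"}
stored_now = {"tag": "nightly", "progress": 100}  # updated while the job ran
changes = {"recorded": True}
print(overwrite_meta(stored_now, started_with, changes))  # {'tag': 'nightly', 'recorded': True}
print(merge_meta(stored_now, changes))  # {'tag': 'nightly', 'progress': 100, 'recorded': True}
```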