Oban.Pro.Plugins.DynamicScaler (Oban Pro v1.5.0-rc.2)

The DynamicScaler examines queue throughput and issues commands to horizontally scale cloud infrastructure to optimize processing. With auto-scaling you can spin up additional nodes during high traffic events, and pare down to a single node during a lull. Beyond optimizing throughput, scaling may save money in environments with little to no usage at off-peak times, e.g. staging.

Horizontal scaling is applied at the node level, not the queue level, so you can distribute processing over more physical hardware.

Predictive Scaling — The optimal scale is calculated by predicting the future size of a queue based on recent trends. Multiple samples are then used to prevent overreacting to changes in queue depth or throughput. You provide an acceptable range of nodes and auto-scaling takes care of the rest.
Multi-Cloud — Cloud integration is provided by a simple, flexible behaviour that you implement for your specific environment and configure for each scaler.
Queue Filtering — By default, all queues are considered for scale calculations. However, you can restrict calculations to one or more critical queues.
Multiple Scalers — Some systems may restrict work to specific node types, e.g. generating exports or processing videos. Other hybrid systems may straddle multiple clouds. In either case, you can configure multiple independent scalers driven by distinct queues.
Non Linear — An optional step parameter allows you to conservatively scale up or down one node at a time, or optimize for responsiveness and jump from the min to the max in a single scaling period.
Prevent Thrashing — A cooldown period skips scaling when there was recent scale activity to prevent unnecessarily scaling nodes up or down. Nodes may take several minutes to start within an environment, so the default cooldown period is 2 minutes.

Using and Configuring

Clouds
There are ample hosting platforms, aka "clouds", out there and we can't support them all! Before you can begin dynamic scaling you'll need to implement a Cloud module for your environment. Don't worry, we have copy-and-paste examples for some popular platforms and a guide to walk through implementations for your environment.

With a cloud module in hand you're ready to add the DynamicScaler plugin to your Oban config:

config :my_app, Oban,
  plugins: [
    {Oban.Pro.Plugins.DynamicScaler, ...}
  ]

Then, add a scaler with a range to define the minimum and maximum nodes, and your cloud strategy:

{DynamicScaler, scalers: [range: 1..5, cloud: MyApp.Cloud]}

Now, every minute, DynamicScaler will calculate the optimal number of nodes for your queues' throughput and issue scaling commands accordingly.
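Putting the two snippets above together, a complete plugin entry might look like the following sketch. The :repo option and the MyApp.Repo and MyApp.Cloud names are assumptions for illustration; substitute the values from your own application.

```elixir
import Config

# MyApp.Repo and MyApp.Cloud are placeholder module names
config :my_app, Oban,
  repo: MyApp.Repo,
  plugins: [
    # Scale between 1 and 5 nodes using the (hypothetical) MyApp.Cloud module
    {Oban.Pro.Plugins.DynamicScaler, scalers: [range: 1..5, cloud: MyApp.Cloud]}
  ]
```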

Configuring Scalers

Scalers have options beyond :cloud and :range for more advanced systems or to constrain resource usage. Here's a breakdown of all options, followed by specific examples of each.

:cloud — A module or {module, options} tuple that interacts with the external cloud during scale events. Required.
:range — The range of compute units to scale between. For example, 1..3 declares a minimum of 1 node and a maximum of 3. The minimum must be 0 or more, and the maximum must be at least 1 and no less than the minimum. Required.
:cooldown — The minimum time between scaling events. Defaults to 120 seconds.
:lookback — The historic period used to calculate queue throughput. Defaults to 60 seconds.
:queues — Either :all or a list of queues to consider when measuring throughput and backlog. Defaults to :all.
:step — Either :none or the maximum nodes to scale up or down at once. Defaults to :none.
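
To make the interaction between :range and :step concrete, here is a minimal sketch of how a desired node count could be constrained. It is illustrative only and not the plugin's actual implementation: the MyApp.ScaleSketch module and constrain/4 function are hypothetical, and the sketch assumes the current node count already falls within the range.

```elixir
defmodule MyApp.ScaleSketch do
  @moduledoc false

  # Hypothetical helper, not part of Oban Pro. Constrains a desired node
  # count to the scaler's :range and, when :step is an integer, to at most
  # that many nodes above or below the current count.
  def constrain(desired, current, %Range{first: lo, last: hi}, step) do
    # Keep the desired count between the range's minimum and maximum
    clamped = desired |> max(lo) |> min(hi)

    case step do
      :none ->
        clamped

      limit when is_integer(limit) and limit > 0 ->
        # Move no more than `limit` nodes away from the current count
        clamped |> max(current - limit) |> min(current + limit)
    end
  end
end
```

With range: 1..3 and one node running, a desired count of 5 is constrained to 3 when step is :none, and to 2 when step is 1.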

Scaler Examples

Filter throughput queries to the :media queue:

scalers: [queues: :media, range: 1..3, cloud: MyApp.Cloud]

Filter throughput queries to both :audio and :video queues:

scalers: [queues: [:audio, :video], range: 1..3, cloud: MyApp.Cloud]

Configure scalers driven by different queues (note that queues may not overlap):

scalers: [
  [queues: :audio, range: 0..2, cloud: {MyApp.Cloud, asg: "my-audio-asg"}],
  [queues: :video, range: 0..5, cloud: {MyApp.Cloud, asg: "my-video-asg"}]
]

Limit scaling to one node up or down at a time:

scalers: [range: 1..3, step: 1, cloud: MyApp.Cloud]

Wait at least 5 minutes (300 seconds) between scaling events:

scalers: [range: 1..3, cloud: MyApp.Cloud, cooldown: 300]

Increase the period used to calculate historic throughput to 90 seconds:

scalers: [range: 1..3, cloud: MyApp.Cloud, lookback: 90]

Scaling Down to Zero Nodes

It's possible to scale down to zero nodes in staging environments or production applications with periods of downtime. However, it is only viable for multi-node setups with dedicated worker nodes and another instance type that isn't controlled by DynamicScaler. Without a separate "web" node, or something that is always running, you run the risk of scaling down without the ability to scale back up.

Cloud Modules

There are a lot of hosting platforms, aka "clouds", out there and we can't support them all! Even with optional dependencies, it would be a mess of libraries that may not agree with your application decisions. Instead, the Oban.Pro.Cloud behaviour defines two simple callbacks, and integrating with platforms typically takes a single HTTP request or library call.

The following links contain gists of full implementations for popular cloud platforms. Feel free to copy-and-paste to use them as-is or as the basis for your own cloud modules.

Let us know if an integration for your platform is missing (which is rather likely) and you'd like assistance. Otherwise, follow the guide below to write your own integration!

Writing Cloud Modules

Cloud callback modules must define an init/1 function to prepare configuration at runtime, and a scale/2 callback called with the desired number of nodes and the prepared configuration.
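
Before wiring up a real platform, it can help to see the two callbacks in their smallest form. The sketch below is a hypothetical "dry-run" module that only logs what it would do. The MyApp.DryRunCloud name is an assumption, and the {:ok, conf} return value mirrors the success case of the AWS example in this section.

```elixir
defmodule MyApp.DryRunCloud do
  @behaviour Oban.Pro.Cloud

  require Logger

  # Convert the keyword options from the scaler config into a map
  @impl Oban.Pro.Cloud
  def init(opts), do: Map.new(opts)

  # Log the requested node count instead of calling an external API
  @impl Oban.Pro.Cloud
  def scale(desired, conf) do
    Logger.info("[dry-run] would scale to #{desired} nodes with #{inspect(conf)}")

    {:ok, conf}
  end
end
```

A module like this could be used with a scaler in a development environment to observe scaling decisions without changing any infrastructure.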

The following example demonstrates a complete callback module for scaling EC2 Auto Scaling Groups on AWS using the SetDesiredCapacity action. It assumes you're using the ex_aws package with the proper credentials.

defmodule MyApp.ASG do
  @behaviour Oban.Pro.Cloud

  @impl Oban.Pro.Cloud
  def init(opts), do: Map.new(opts)

  @impl Oban.Pro.Cloud
  def scale(desired, conf) do
    params = %{
      "Action" => "SetDesiredCapacity",
      "AutoScalingGroupName" => conf.asg,
      "DesiredCapacity" => desired,
      "Version" => "2011-01-01"
    }

    query = %ExAws.Operation.Query{
      path: "",
      params: params,
      service: :autoscaling,
      action: :set_desired_capacity
    }

    with {:ok, _} <- ExAws.request(query), do: {:ok, conf}
  end
end

You'd then use your cloud module as a scaler option:

{DynamicScaler, scalers: [range: 1..3, cloud: {MyApp.ASG, asg: "my-asg-name"}]}

Clouds can also pull from the application or system environment to build configuration. If your module pulls from the environment exclusively, then you can pass the module name rather than a tuple:

{DynamicScaler, scalers: [range: 1..3, cloud: MyApp.ASG]}
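
One way to support both styles is to let explicit options win and fall back to the environment otherwise. The following sketch shows only an init/1 function. The MyApp.EnvASG module name and the ASG_NAME environment variable are hypothetical, and scale/2 would stay the same as in the MyApp.ASG example.

```elixir
defmodule MyApp.EnvASG do
  # Hypothetical variant of init/1. An :asg option passed in the scaler
  # config takes precedence; otherwise the name is read from the
  # (assumed) ASG_NAME environment variable, raising if it is unset.
  def init(opts) do
    opts
    |> Map.new()
    |> Map.put_new_lazy(:asg, fn -> System.fetch_env!("ASG_NAME") end)
  end
end
```

Reading the variable lazily means the environment is only consulted when no explicit :asg option is given.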

Optimizing Throughput Queries

While the scaler's throughput queries are optimized for a standard load, high-throughput queues or systems that retain a large volume of jobs may benefit from an additional index that aids in calculating throughput. Use the following migration to add an index if you find that scaling queries are too slow or timing out:

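# The module name is a placeholder; use whatever your migration generator produces
defmodule MyApp.Repo.Migrations.AddObanScalerIndex do
  use Ecto.Migration

  # Concurrent index creation can't run inside a transaction
  @disable_ddl_transaction true
  @disable_migration_lock true

  def change do
    create_if_not_exists index(
      :oban_jobs,
      [:state, :queue, :attempted_at, :attempted_by],
      concurrently: true,
      where: "attempted_at IS NOT NULL",
      prefix: "public"
    )
  end
end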

Alternatively, you can change the timeout used for scaler inspection queries:

{DynamicScaler, timeout: :timer.seconds(15), scalers: ...}

Instrumenting with Telemetry

The DynamicScaler plugin adds the following metadata to the [:oban, :plugin, :stop] event:

:scaler - details of the active scaler config with recent scaling values
:error — the value returned from scale/2 when scaling fails

When multiple scalers are configured, one event is emitted for each scaler.
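
As a minimal sketch of consuming that metadata, the handler below logs scaling results. The MyApp.ScalerLogger module and the handler id are hypothetical. It assumes the telemetry package is available, that the event metadata carries Oban's usual :plugin key for filtering, and that :error is either absent or nil when scaling succeeds.

```elixir
defmodule MyApp.ScalerLogger do
  require Logger

  # Attach once at application start, e.g. from Application.start/2
  def attach do
    :telemetry.attach(
      "my-app-scaler-logger",
      [:oban, :plugin, :stop],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  # Only react to stop events emitted by the DynamicScaler plugin
  def handle_event(
        [:oban, :plugin, :stop],
        _measurements,
        %{plugin: Oban.Pro.Plugins.DynamicScaler} = meta,
        _config
      ) do
    case meta do
      %{error: error} when not is_nil(error) ->
        Logger.error("DynamicScaler failed to scale: #{inspect(error)}")

      %{scaler: scaler} ->
        Logger.info("DynamicScaler ran: #{inspect(scaler)}")

      _other ->
        :ok
    end
  end

  # Ignore stop events from every other plugin
  def handle_event(_event, _measurements, _meta, _config), do: :ok
end
```

Because one event is emitted per scaler, the handler runs once for each configured scaler on every scaling pass.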