Orchestration

In the context of a server cluster, the term Orchestration refers to the automated process that manages the lifecycle of each node.

More specifically, in the Overcast Cluster the Orchestration process takes care of monitoring the state of the Game Nodes, adding, removing or recycling them depending on the traffic and load in the system.

In Overcast, the Orchestration process is made up of three principal components:

  • Conductor: the core logic responsible for reacting to changes in the cluster's state.
  • ScaleUp condition: a component that checks whether the cluster needs to scale up.
  • ScaleDown condition: a similar component that checks whether the cluster needs to scale down.

Configuration

The Orchestrator comes with a number of default settings that can be tweaked, when necessary, based on the characteristics of your application. Let's review them in the AdminTool's Cluster Configurator module:

The ScaleUp/ScaleDown properties provide settings for the respective conditions we have discussed:

  • Scale Up
    • ccuThreshold: the CCU threshold that triggers a ScaleUp event.
    • instanceVolume: the number of new servers that are launched when the event is triggered.
  • Scale Down
    • ccuThreshold: the CCU threshold that triggers a ScaleDown event.
    • instanceVolume: the number of servers that should be removed when the event is triggered.

The threshold values specified here are intended as the average load per server.
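
As an illustration, here is a minimal sketch in Python of how such a per-server average check can be expressed; the names (average_ccu, should_scale_up, ccu_threshold and so on) are purely hypothetical and not part of the Overcast API:

    # Illustrative only: evaluating the ScaleUp/ScaleDown conditions against
    # the average CCU load per Game Node.

    def average_ccu(node_ccus):
        """Average CCU across the currently active Game Nodes."""
        return sum(node_ccus) / len(node_ccus)

    def should_scale_up(node_ccus, ccu_threshold):
        # ScaleUp condition: average load per server exceeds the threshold
        return average_ccu(node_ccus) > ccu_threshold

    def should_scale_down(node_ccus, ccu_threshold):
        # ScaleDown condition: average load per server drops below the threshold
        return average_ccu(node_ccus) < ccu_threshold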

Scaling Up

Let's see an example: the cluster has 4500 CCU active on two Game Nodes. The average CCU load is thus 2250, which exceeds the ScaleUp threshold of 2000; an event is therefore triggered and the Conductor adds a number of new Game Nodes equal to the specified instanceVolume.
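
Plugging the numbers from this example into the hypothetical helpers sketched above:

    # 4500 CCU spread across two Game Nodes averages 2250 per node
    should_scale_up([2300, 2200], ccu_threshold=2000)   # -> True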

When adding new servers the Conductor first searches for existing servers that are currently inactive and eligible for recycling. These instances exist as a result of previous Scale Down events (described in the next section). If no recycling candidate is found, a new machine is spun up using the configured server snapshot.
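
As a rough sketch of that decision (again with hypothetical names, not the actual Conductor implementation):

    # Illustrative only: recycle a deactivated, empty node if one exists,
    # otherwise launch a fresh instance from the configured server snapshot.

    def add_game_nodes(cluster, instance_volume):
        for _ in range(instance_volume):
            candidate = next(
                (n for n in cluster.nodes if not n.active and n.ccu == 0),
                None,
            )
            if candidate is not None:
                candidate.active = True          # recycled: back in the Load Balancing pool
            else:
                cluster.launch_from_snapshot()   # spin up a new machine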

Scaling Down

Let's see an example for the Scale Down event: the cluster is running 80 CCU on two Game Nodes. The average CCU is thus 40, which is below the specified threshold of 100, therefore the Orchestrator will attempt to remove a number of servers equal to the configured instanceVolume (3 in this example).
In this case it's not possible to remove 3 servers: the system is running only 2 Game Nodes, and one Game Node must always be available in the Cluster, so only one of the two Game Nodes will be removed.
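
In other words the removal count is capped so that at least one Game Node always remains; a minimal sketch of that clamping, with hypothetical names:

    # Illustrative only: cap the removal count to keep at least one active node.
    def nodes_to_remove(active_nodes, instance_volume, min_nodes=1):
        return max(0, min(instance_volume, active_nodes - min_nodes))

    nodes_to_remove(active_nodes=2, instance_volume=3)   # -> 1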

When a Game Node leaves the cluster it is not literally removed (as in terminated) but rather deactivated, which simply removes the server from the Load Balancing pool. All games running on this node continue as if nothing had happened, but the node will not receive new players and will eventually become empty.

A Game Node that is both inactive and empty is eligible for recycling if a new Scale Up event is triggered.

Cooldown

The ScaleUp and ScaleDown events have a cooldown setting (expressed in seconds) that acts as a buffer while the new machines are being set up and launched. Without this mechanism the system would keep spamming ScaleUp events while waiting for the new servers, which would be problematic. We typically do not recommend changing these values unless you really know what you're doing, as doing so can have a negative impact on the cluster's performance and balance.
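
Conceptually the cooldown simply suppresses further scaling checks until the configured window has elapsed; a minimal sketch, assuming a hypothetical timestamp of the last ScaleUp event:

    # Illustrative only: skip new ScaleUp evaluations until the cooldown
    # window (in seconds) since the last ScaleUp event has elapsed.
    import time

    def scale_up_allowed(last_scale_up_time, cooldown_seconds):
        return time.time() - last_scale_up_time >= cooldown_seconds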

Server recycling

We have learned that deactivated servers eventually become empty and can be re-used by the system if a ScaleUp event is triggered. The Instance recycling minimum time setting specifies the number of minutes that an empty and inactive server should remain available for recycling. After that time the server is terminated.
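
Conceptually the check boils down to the following sketch (the names are illustrative, not the actual setting identifiers):

    # Illustrative only: an empty, deactivated node stays recyclable for the
    # configured number of minutes, after which it is terminated.
    def should_terminate(minutes_empty_and_inactive, recycling_minimum_time):
        return minutes_empty_and_inactive >= recycling_minimum_time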

Remote deployment

The last setting in this panel sets a timeout for remote Extension deployment. If you're not familiar with this topic, it is discussed in the Cluster Extension deployment document.

If one of the remote servers doesn't report a successful deployment within the expected time, the deployment is considered failed. To remedy this situation you may need to connect to the specific Game Node via its AdminTool and repeat the update via the Extension Manager module.

Custom Settings

Configuring the Scale Up and Scale Down parameters in the Cluster Configurator depends on two variables:

  • the size of your Game Nodes
  • the expected max CCU per Game Node

Since all Game Nodes use the same instance type, we need to determine a reasonable CCU limit (per instance) that allows each server to run effectively without getting overloaded. This is usually done by load testing your game on a single server and finding a sensible client limit, aiming to stay within 75-85% of its hardware resources (CPU and RAM).
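
For instance, the per-node limit can be read off the load test as the highest CCU level at which both CPU and RAM stay within the target range; the figures in this sketch are placeholders, not measured values or recommendations:

    # Illustrative only: pick the per-node CCU limit as the highest client
    # count at which both CPU and RAM stay within ~80% utilization.
    samples = [
        (1000, 0.30, 0.25),   # (ccu, cpu_utilization, ram_utilization)
        (2000, 0.55, 0.45),
        (3000, 0.82, 0.60),
    ]
    TARGET = 0.80   # aim for roughly 75-85% of hardware resources

    ccu_limit = max(ccu for ccu, cpu, ram in samples if cpu <= TARGET and ram <= TARGET)
    # -> 2000; this value can then inform the ScaleUp ccuThreshold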

You can learn more about load testing in our dedicated article.

The CCU limit can vary quite significantly based on the type of game you're developing (turn-based vs real-time) and the size of the game servers, so there isn't a precise rule for choosing these values.

What we can recommend is to avoid the extremes, such as running dozens of tiny game servers that can only handle a few hundred CCU each or, at the other end of the spectrum, running only a few massive servers dealing with hundreds of thousands of clients.

In the former case you would be spreading the load across too many servers with too few users, keeping the system in continuous need of scaling up and down. In the latter case you would be concentrating too many users on a few servers, with the risk of causing a significant disservice if one of them fails.