Cluster configuration

After launching a cluster we should invest a bit of time in fine tuning some of the important settings available in the configuration. In particular we will concentrate on the Load Balancing and Orchestration parameters available in the AdminTool.

Let's get started by connecting to the Lobby via its AdminTool and selecting the Cluster Configurator module.

Load Balancer configuration

The Load Balancer (in short LB) is responsible for distributing the traffic among Game Nodes in the cluster: every time a user wants to jump into a game, the LB is responsible for finding a suitable Game Node and running the match-making query on that node.

The two main LB algorithms provided by the SmartFoxServer Cluster are:

  • Least Connections LB: this is the default algorithm. It always searches for the least loaded Game Node (i.e. the one with the lowest CCU count among all active nodes) and sends the client there.
  • Most Connections LB: uses the opposite approach by looking for the most loaded server (i.e. the one with the highest CCU, within a defined limit) and sends the client there.

By default the Least Connections LB is used to ensure an even distribution of players among all active nodes. If you prefer the fill-one-server-at-a-time approach you can switch to the Most Connections LB.

It's also worth mentioning that you can write your own LB algorithm if you have special needs and require custom logic. We provide all the details for writing your own algorithms in a dedicated article.
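
To give a rough idea of what such custom logic can look like, here is a minimal, self-contained sketch of a least-connections style selection. The NodeInfo type and selectNode method are hypothetical stand-ins, not the actual SmartFoxServer Cluster API; please refer to the dedicated article for the real interface to implement.

```java
// Illustrative only: NodeInfo and selectNode are hypothetical stand-ins,
// not the actual SmartFoxServer Cluster API.
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class LeastConnectionsSketch
{
    // The minimal node information a balancer needs for its decision (hypothetical type)
    public record NodeInfo(String nodeId, int ccu, boolean active) {}

    // Pick the active node with the lowest CCU count, mirroring the default Least Connections behavior
    public Optional<NodeInfo> selectNode(List<NodeInfo> nodes)
    {
        return nodes.stream()
                    .filter(NodeInfo::active)
                    .min(Comparator.comparingInt(NodeInfo::ccu));
    }
}
```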

In order to change the default LB algorithm you can specify its fully qualified class name in the top field of the Load Balancer settings page.

Uniform Game Nodes CCU limit

One important aspect of keeping the cluster balanced is to set a uniform CCU limit for every Game Node. Depending on the nature of your project (turn-based vs real-time) and the hardware used for Game Nodes, you should be able to determine a reasonable per-server CCU count to set as the limit.

This is something that can be established via load testing and further fine-tuned in the live environment. For example, if you have determined that your Game Nodes can handle 1000 CCU each without problems, you can start with that value as the limit and push it a bit higher over time if you think the servers are still underused. Our recommendation is to keep CPU usage at or below 85% at peak load.

The Maximum # of users per Game Node parameter found in the screen above lets you redefine the CCU limit while the cluster is running. This is a non-destructive update, meaning that if you set a limit lower than the actual CCU count on the servers, nothing bad will happen: excess players will keep playing and new players will be directed to other Game Nodes. In case all nodes are full, the players will receive a Load Balancer Error and the system will launch new Game Node instances to accommodate them (we discuss how to handle these errors in the client examples section).

Most Connections LB

In order to switch to the Most Connections LB, we need to specify its class name in the field at the top of the settings panel: com.smartfoxserver.cluster.balancer.MostConnectionsLoadBalancer.

Additionally we need to add a custom property called ccuLimit and set its value to the same value we have used for the global CCU limit.

This is particularly important because the LB algorithm needs to know when it's time to switch to a different Game Node. Internally the LB uses a default value of ccuLimit = 500, so if you forget to set the property you will eventually see all your nodes loaded at ~500 CCU.

NOTE: we highly recommend setting the ccuLimit property and the Max number of CCU per node value to the same number.
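
To illustrate why ccuLimit matters, here is a minimal sketch of a most-connections style selection that skips nodes which have already reached the limit. The NodeInfo type and selectNode method are hypothetical stand-ins, not the actual MostConnectionsLoadBalancer source.

```java
// Illustrative only: not the actual MostConnectionsLoadBalancer source.
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class MostConnectionsSketch
{
    public record NodeInfo(String nodeId, int ccu, boolean active) {}

    private final int ccuLimit; // should match the global "Maximum # of users per Game Node" value

    public MostConnectionsSketch(int ccuLimit)
    {
        this.ccuLimit = ccuLimit;
    }

    // Pick the most loaded active node that is still below ccuLimit,
    // so the balancer knows when it's time to move on to another node
    public Optional<NodeInfo> selectNode(List<NodeInfo> nodes)
    {
        return nodes.stream()
                    .filter(NodeInfo::active)
                    .filter(n -> n.ccu() < ccuLimit)
                    .max(Comparator.comparingInt(NodeInfo::ccu));
    }
}
```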

Health Checks

The health check section comprises the last five settings in the Load Balancer tab and defines various parameters for monitoring the Game Nodes. In general we don't recommend changing these values unless you know exactly what you're doing. If you want to learn more, please consult the advanced documentation on Load Balancer customization.

Orchestrator configuration

In the context of a server cluster, the term Orchestration refers to the automated process that manages the life-cycle of each node.

More specifically, in the Overcast Cluster the Orchestration process takes care of monitoring the state of the Game Nodes, adding, removing or recycling them depending on the traffic and load in the system.

In Overcast, the Orchestration process is made up of three principal components:

  • Conductor: the core logic responsible for reacting to changes in the cluster's state.
  • ScaleUp condition: a component that verifies the need to scale up the cluster when certain conditions are met.
  • ScaleDown condition: another component, similar to the above, that checks the need to scale down the cluster.

The Orchestrator comes with a number of default settings that can be tweaked when necessary, based on the characteristics of your application. Let's see what these are from the AdminTool's Cluster Configurator module:

The ScaleUp/ScaleDown properties provide settings for the respective conditions we have discussed:

  • Scale Up
    • ccuThreshold: the CCU threshold that triggers a ScaleUp event.
    • instanceVolume: the number of new servers that are launched when the event is triggered.
  • Scale Down
    • ccuThreshold: the CCU threshold that triggers a ScaleDown event.
    • instanceVolume: the number of servers that should be removed when the event is triggered.

The threshold values specified here are intended as the average load per server.

Scaling Up

Let's see an example: the cluster has 4500 CCU active on two Game Nodes. The average CCU load is thus 2250, which exceeds the ScaleUp threshold of 2000; therefore a ScaleUp event is triggered and the Conductor adds a number of new Game Nodes equal to the specified instanceVolume.
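
The arithmetic behind this decision is straightforward; the sketch below reproduces it with the numbers from the example above. It is illustrative only and not the Conductor's actual code; the instanceVolume value of 2 is a hypothetical setting.

```java
// Reproduces the ScaleUp arithmetic from the example above (illustrative only,
// not the Conductor's actual code).
public class ScaleUpCheckSketch
{
    public static void main(String[] args)
    {
        int totalCcu = 4500;      // total CCU across the cluster
        int activeNodes = 2;      // active Game Nodes
        int ccuThreshold = 2000;  // ScaleUp ccuThreshold
        int instanceVolume = 2;   // hypothetical ScaleUp instanceVolume

        double avgCcu = (double) totalCcu / activeNodes;  // 2250
        if (avgCcu > ccuThreshold)
            System.out.println("ScaleUp: add " + instanceVolume + " Game Node(s)");
    }
}
```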

When adding new servers, the Conductor first searches for existing servers that are currently inactive and eligible for recycling. These instances exist as a result of previous Scale Down events (described in the next section). If no recycling candidate is found, a new machine is spun up using the configured server snapshot.

Scaling Down

Let's see an example for the Scale Down event: the cluster is running 80 CCU on two Game Nodes. The average CCU is thus 40, which is below the specified threshold of 100; therefore the Orchestrator will attempt to remove a number of servers equal to the configured instanceVolume (3 in this example).
In this case it's not possible to remove 3 servers: the system is running with only 2 nodes and one Game Node must always be available in the cluster, so the system will only remove one of the two Game Nodes.
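
The same arithmetic applies here, with one extra rule: the number of nodes to remove is capped so that at least one Game Node always remains active. The sketch below is illustrative only, using the numbers from the example above.

```java
// Reproduces the ScaleDown arithmetic from the example above, including the rule that
// at least one Game Node must always remain active (illustrative only).
public class ScaleDownCheckSketch
{
    public static void main(String[] args)
    {
        int totalCcu = 80;        // total CCU across the cluster
        int activeNodes = 2;      // active Game Nodes
        int ccuThreshold = 100;   // ScaleDown ccuThreshold
        int instanceVolume = 3;   // configured ScaleDown instanceVolume

        double avgCcu = (double) totalCcu / activeNodes;  // 40
        if (avgCcu < ccuThreshold)
        {
            // never deactivate so many nodes that no active Game Node would remain
            int removable = Math.min(instanceVolume, activeNodes - 1);
            System.out.println("ScaleDown: deactivate " + removable + " Game Node(s)"); // prints 1
        }
    }
}
```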

When a Game Node leaves the cluster it's not literally removed (as in terminated) but rather it is deactivated, which simply removes the server from the Load Balancing pool. All games running on this Node will continue as if nothing had happened, but the Node will not receive new players and eventually will become empty.

A Game Node that is both inactive and empty is eligible for recycling if a new Scale Up event is triggered.

Cooldown

The ScaleUp and ScaleDown events have a cooldown setting (expressed in seconds) that acts as a buffer while the new machines are being set up and launched. Without this mechanism the system would keep spamming ScaleUp events while waiting for the new servers, which would be problematic. Typically we do not recommend changing these values unless you really know what you're doing, as it can have a negative impact on the cluster's performance and balance.
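
Conceptually, a cooldown works like a simple time gate: a scaling event is accepted only if enough time has passed since the previous one. The following is a minimal sketch of that idea under our own naming (CooldownGateSketch, tryTrigger), not the Orchestrator's actual implementation.

```java
// Minimal sketch of a cooldown gate: a scaling event is accepted only if enough time
// has passed since the previous one. Names are ours, not the Orchestrator's.
public class CooldownGateSketch
{
    private final long cooldownMillis;
    private long lastEventMillis = 0;

    public CooldownGateSketch(long cooldownSeconds)
    {
        this.cooldownMillis = cooldownSeconds * 1000;
    }

    // Returns true and records the event if the cooldown window has elapsed
    public synchronized boolean tryTrigger(long nowMillis)
    {
        if (nowMillis - lastEventMillis < cooldownMillis)
            return false; // still cooling down: ignore this scaling request

        lastEventMillis = nowMillis;
        return true;
    }
}
```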

Server recycling

We have learned that deactivated servers eventually become empty and can be reused by the system when a ScaleUp event is triggered. The Instance recycling minimum time setting specifies the number of minutes that an empty and inactive server should remain available for recycling. After that time the server is terminated.

Remote deployment

The last setting in this panel sets a timeout for remote Extension deployment. This topic is discussed in the Cluster Extension deployment document if you're not familiar with it.

If one of the remote servers doesn't report back a successful deployment within the expected time, the deployment is considered failed. To remedy this situation you may need to connect to the specific Game Node via its AdminTool and redo the update via the Extension Manager module.

Custom Settings

Configuring the Scale Up and Scale Down parameters in the Cluster Configurator depends on a couple of variables:

  • the size of your Game Nodes
  • the expected max CCU per Game Node

Since all Game Nodes use the same instance type we need to figure out a reasonable CCU limit (per instance) that allows the server to run effectively without getting overloaded. This is usually done by load-testing your game on a single server and finding a sensible client limit, trying to stay within 75-85% of its hardware resources (CPU and RAM).

You can learn more about load testing in our dedicated article.
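
As a back-of-the-envelope illustration, once a per-node limit has been established you can estimate how many Game Nodes a given peak population requires. The numbers below are hypothetical.

```java
// Hypothetical sizing exercise: estimate how many Game Nodes a target peak population
// requires, given the per-node CCU limit found through load testing. Numbers are illustrative.
public class ClusterSizingSketch
{
    public static void main(String[] args)
    {
        int perNodeCcuLimit = 1000;   // limit found via load testing at ~75-85% CPU/RAM usage
        int expectedPeakCcu = 10_000; // hypothetical peak CCU for the whole game

        int nodesNeeded = (int) Math.ceil((double) expectedPeakCcu / perNodeCcuLimit);
        System.out.println("Game Nodes needed at peak: " + nodesNeeded); // 10
    }
}
```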

The CCU limit can vary quite significantly based on the type of game we're developing (turn-based vs real-time) and the size of the game servers, so there isn't a precise rule to decide what these values are.

What we can recommend is to avoid the extremes, such as running dozens of tiny game servers that can only handle a few hundred CCU each or, on the other side of the spectrum, running only a few massive servers dealing with hundreds of thousands of clients.

In the former case you would be spreading the load across too many servers with too few users each, keeping the system in continuous need of scaling up and down. In the latter case you would be concentrating too many users on a few servers, with the risk of causing a significant disruption if one of them fails.