There are a few instances when a full cluster restart is required versus being able to do a rolling update.
A rolling update is the process of replacing one version of the system with another without downtime. The changes can be new code, changed dependencies such as new Pekko version, or modified configuration.
In Apache Pekko, rolling updates are typically used for a stateful Pekko Cluster where you can’t run two separate clusters in parallel during the update, for example in blue green deployments.
For rolling updates related to Pekko dependency version upgrades and the migration guides, please see Rolling Updates and Pekko versions
There are two parts of Pekko that need careful consideration when performing an rolling update.
- Compatibility of remote message protocols. Old nodes may send messages to new nodes and vice versa.
- Serialization format of persisted events and snapshots. New nodes must be able to read old data, and during the update old nodes must be able to read data stored by new nodes.
There are many more application specific aspects for serialization changes during rolling updates to consider. For example based on the use case and requirements, whether to allow dropped messages or tear down the TCP connection when the manifest is unknown. When some message loss during a rolling update is acceptable versus a full shutdown and restart, assuming the application recovers afterwards * If a
java.io.NotSerializableException is thrown in
fromBinary this is treated as a transient problem, the issue logged and the message is dropped * If other exceptions are thrown it can be an indication of corrupt bytes from the underlying transport, and the connection is broken
For more zero-impact rolling updates, it is important to consider a strategy for serialization that can be evolved. One approach to retiring a serializer without downtime is described in two rolling update steps to switch to the new serializer. Additionally you can find advice on Persistence - Schema Evolution which also applies to remote messages when deploying with rolling updates.
During a rolling update, sharded entities receiving traffic may be moved, based on the pluggable allocation strategy and settings. When an old node is stopped the shards that were running on it are moved to one of the other remaining nodes in the cluster when messages are sent to those shards.
To make rolling updates as smooth as possible there is a configuration property that defines the version of the application. This is used by rolling update features to distinguish between old and new nodes. For example, the default
LeastShardAllocationStrategy avoids allocating shards to old nodes during a rolling update. The
LeastShardAllocationStrategy sees that there is rolling update in progress when there are members with different configured
To make use of this feature you need to define the
app-version and increase it for each rolling update.
pekko.cluster.app-version = 1.2.3
To understand which is old and new it compares the version numbers using normal conventions, see
Version for more details.
Rebalance is also disabled during rolling updates, since shards from stopped nodes are anyway supposed to be started on new nodes. Messages to shards that were stopped on the old nodes will allocate corresponding shards on the new nodes, without waiting for rebalance actions.
You should also enable the health check for Cluster Sharding if you use Pekko Management. The readiness check will delay incoming traffic to the node until Sharding has been initialized and can accept messages.
ShardCoordinator is itself a cluster singleton. To minimize downtime of the shard coordinator, see the strategies about ClusterSingleton rolling updates below.
A few specific changes to sharding configuration require a full cluster restart.
Cluster singletons are always running on the oldest node. To avoid moving cluster singletons more than necessary during a rolling update, it is recommended to upgrade the oldest node last. This way cluster singletons are only moved once during a full rolling update. Otherwise, in the worst case cluster singletons may be migrated from node to node which requires coordination and initialization overhead several times.
Kubernetes Deployments with
RollingUpdate strategy will roll out updates in this preferred order, from newest to oldest.
For rolling updates it is best to leave the Cluster gracefully via Coordinated Shutdown, which will run automatically on SIGTERM, when the Cluster node sees itself as
Exiting. Environments such as Kubernetes send a SIGTERM, however if the JVM is wrapped with a script ensure that it forwards the signal. Graceful shutdown of Cluster Singletons and Cluster Sharding similarly happen automatically.
In case of network failures it may still be necessary to set the node’s status to Down in order to complete the removal. Cluster Downing details downing nodes and downing providers. Split Brain Resolver can be used to ensure the cluster continues to function during network partitions and node failures. For example if there is an unreachability problem Split Brain Resolver would make a decision based on the configured downing strategy.
Configuration Compatibility Checks
During rolling updates the configuration from existing nodes should pass the Cluster configuration compatibility checks. For example, it is possible to migrate Cluster Sharding from Classic to Typed Actors in a rolling update using a two step approach:
- Deploy with the new nodes set to
pekko.cluster.configuration-compatibility-check.enforce-on-join = offand ensure all nodes are in this state
- Deploy again and with the new nodes set to
pekko.cluster.configuration-compatibility-check.enforce-on-join = on.
Full documentation about enforcing these checks on joining nodes and optionally adding custom checks can be found in
Pekko Cluster configuration compatibility checks.
When Shutdown Startup Is Required
There are a few instances when a full shutdown and startup is required versus being able to do a rolling update.
Cluster Sharding configuration change
If you need to change any of the following aspects of sharding it will require a full cluster restart versus a rolling update:
- The role that the shard regions run on
- The persistence mode - It’s important to use the same mode on all nodes in the cluster
number-of-shards- Note: changing the number of nodes in the cluster does not require changing the number of shards.
Cluster configuration change
- A full restart is required if you change the SBR strategy
Migrating from PersistentFSM to EventSourcedBehavior
If you’ve migrated from
EventSourcedBehavior and are using PersistenceFSM with Cluster Sharding, a full shutdown is required as shards can move between new and old nodes.
Migrating from classic remoting to Artery
If you’ve migrated from classic remoting to Artery which has a completely different protocol, a rolling update is not supported.
Changing remoting transport
Rolling update is not supported when changing the remoting transport.
Migrating from Classic Sharding to Typed Sharding
If you have been using classic sharding it is possible to do a rolling update to typed sharding using a 3 step procedure. The steps along with example commits are detailed in this sample PR