Node Failure Pattern

Priyanka Jadhav
3 min read · Nov 11, 2021


The node failure pattern focuses on how an application should respond when the compute node it is running on fails or shuts down.

This pattern takes the perspective of application code running on a node (i.e., a virtual machine) that suddenly fails or shuts down due to a hardware or software issue.

The application has the following three responsibilities:

  1. Prepare the application to minimize issues when nodes fail.
  2. Handle node shutdown gracefully.
  3. Recover once a node has failed.

Failure Scenarios

Handling Node Shutdown

  1. Node shutdown with minimal impact on the user experience
  2. Node shutdown without losing partially completed work
  3. Node shutdown without losing operational data
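Avoiding the loss of partially completed work usually comes down to checkpointing progress to storage that survives the node. A minimal sketch, assuming the state fits in a small JSON document; a real deployment would write to network-attached or object storage rather than a local temp directory, and the file name and state shape here are made up for illustration.

```python
import json
import os
import tempfile

CHECKPOINT = os.path.join(tempfile.gettempdir(), "work_checkpoint.json")

def save_checkpoint(state):
    # Write atomically: write to a temp file, then rename over the old
    # checkpoint, so a crash mid-write never leaves a corrupt file.
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CHECKPOINT)

def load_checkpoint():
    # A replacement node resumes from the last durable checkpoint,
    # or starts from the beginning if none exists.
    if not os.path.exists(CHECKPOINT):
        return {"next_item": 0}
    with open(CHECKPOINT) as f:
        return json.load(f)
```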

Node Failure Pattern: AWS and Azure

Node Failure Recovery of Single-Node Systems:

AWS

  • The recovery of failed nodes is an automatic process in AWS.
  • AWS starts a replacement node: the instance for the single-node system automatically restarts and is configured on another physical server.
  • After the system is back up, logons are enabled.
  • If a node hangs, it cannot recover automatically; the instance must be stopped and started again from the AWS Web Console.
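The manual stop/start cycle for a hung node can also be scripted. A sketch assuming a boto3 EC2 client is passed in (`stop_instances`, `start_instances`, and `get_waiter` are standard boto3 EC2 client calls); a stop/start cycle, unlike a reboot, lets AWS place the instance on a different physical host.

```python
def recover_hung_node(ec2, instance_id):
    """Stop and then start a hung EC2 instance.

    `ec2` is expected to behave like a boto3 EC2 client. Waiters block
    until the instance reaches the expected state before continuing.
    """
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
```

Injecting the client keeps the function easy to test against a stub and leaves credentials and region configuration to the caller.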

Azure

For single-node systems, recovering failed nodes is automatic in the following scenarios. You do not need to create a post-upgrade restore image.

  • If hardware issues occur on the host, Azure migrates the VMs to other hosts.
  • If platform issues occur, such as hypervisor or guest agent issues, the VM might restart.
  • If planned maintenance is scheduled, updates to the host might pause the VM for about 30 seconds. For some updates, the VM might restart.

For unexpected restarts, check the Azure Activity Log. If you do not find information related to the restart, create an Azure support request to see if Azure can look for issues in its backend availability logs.
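Checking the Activity Log for restart causes can be automated by filtering entries for restart-related operations. A sketch over plain dicts, such as the JSON returned by `az monitor activity-log list -o json`; the `operationName` values below are the Microsoft.Compute restart and redeploy operations, but treat the exact field names as assumptions to verify against your own log output.

```python
def find_restart_events(events):
    """Return activity-log entries for VM restart/redeploy operations.

    `events` is a list of dicts shaped like Azure Activity Log entries
    (field names here are illustrative and should be checked against
    real output).
    """
    restart_ops = {
        "Microsoft.Compute/virtualMachines/restart/action",
        "Microsoft.Compute/virtualMachines/redeploy/action",
    }
    return [e for e in events if e.get("operationName") in restart_ops]
```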

Node Failure Recovery of Multi-Node Systems:

AWS

  • The replacement node is based either on an image recorded when the system was first deployed, or on an updated image created after a software upgrade. The replacement node has the same secondary IP, elastic IP, and identifiers as the replaced node.
  • The primary IP address of the replacement node will be new. However, if you had allocated an elastic IP address to each node when you deployed your instance, the public IP address should be the same after the node recovery.
  • The secondary private IP addresses will be the same regardless of the elastic IP address settings you chose when you deployed the instance. At least one free IP address in the subnet is required.
  • You must create a new system image after a software upgrade. If you do not create this image, the software on the recovered instance will not match the upgraded system, and the database cannot start if a node failure occurs.
  • Node failure recovery is handled differently from on-premises systems. Unless you want failed nodes to continue running for diagnostic purposes, you should terminate the instance.
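Recording the post-upgrade system image can likewise be scripted. A sketch assuming a boto3 EC2 client; `create_image` with `InstanceId`, `Name`, and `NoReboot` is a standard boto3 call, while the label and return handling here are illustrative.

```python
def create_post_upgrade_image(ec2, instance_id, label):
    """Record a new system image after a software upgrade, so node
    recovery launches a replacement whose software matches the system.

    `ec2` is expected to behave like a boto3 EC2 client.
    """
    resp = ec2.create_image(
        InstanceId=instance_id,
        Name=label,
        # NoReboot avoids restarting the node; set it to False if you
        # need a fully quiesced, crash-consistent image instead.
        NoReboot=True,
    )
    return resp["ImageId"]
```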

Azure

  • On a Vantage multi-node system, node failure recovery (NFR) automatically replaces failed nodes with the same number of available hot standby nodes (HSNs). If there are no HSNs, or not enough are available, NFR automatically spins up one or more replacement nodes: it detaches the network-attached storage of the failed node, reattaches it to the new VM, migrates the secondary IPs from the failed node's NIC to the new VM's NIC, and reinstates the configuration.
  • The replacement node is deployed from a snapshot of the active (control) node.
  • NFR takes longer than a typical TPA reset. If your node does not automatically recover after 10 to 15 minutes, check the deployment logs in your Azure resource group.
