Finding the Root Cause of a Failed Job

Step 1: Log in to the AWS Console

Find the AWS Account where the action failed. Most actions are performed directly in an Environment, which is linked to a single AWS Account. The account ID is shown on the environment status page.

Other actions are performed in the Management AWS Account, using CloudFormation StackSets to deploy stacks into the member accounts of the AWS Organization.

Step 2: Go to CloudFormation

Go to CloudFormation and select the Region that the NX1 Environment uses. The region is shown on the environment status page.

For troubleshooting the Management Account, always set the region to N. Virginia (us-east-1).

Step 3: Find the failed stack

NX1 automatically deletes failed stacks, so to find the root cause of a failure, first change the stack filter to “Deleted”.
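The same lookup can be scripted. The following is a minimal sketch, assuming the boto3 library is installed and credentials for the affected account are configured; the function names `select_deleted_stacks` and `list_deleted_stacks` are hypothetical helpers, not part of NX1.

```python
from typing import Any, Dict, Iterable, List


def select_deleted_stacks(summaries: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only DELETE_COMPLETE stacks, most recently deleted first."""
    deleted = [s for s in summaries if s.get("StackStatus") == "DELETE_COMPLETE"]
    return sorted(deleted, key=lambda s: s["DeletionTime"], reverse=True)


def list_deleted_stacks(region: str) -> List[Dict[str, Any]]:
    """Fetch deleted stacks from CloudFormation in the given region."""
    import boto3  # imported here so the pure helper above works without boto3

    client = boto3.client("cloudformation", region_name=region)
    paginator = client.get_paginator("list_stacks")
    summaries: List[Dict[str, Any]] = []
    # Ask the API for deleted stacks only, mirroring the console's "Deleted" filter.
    for page in paginator.paginate(StackStatusFilter=["DELETE_COMPLETE"]):
        summaries.extend(page["StackSummaries"])
    return select_deleted_stacks(summaries)
```

Each returned summary carries a `StackId`, which is worth keeping: once a stack is deleted, CloudFormation only accepts the stack ID, not the stack name, when its events are requested.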

Once the stack appears in the list, click on it and select the “Events” tab.

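Find the chronologically first failed event in the list. The console shows the newest events at the top, so this is normally the failed event closest to the bottom of the list.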

The status reason of this event explains why the stack failed.
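As an illustration, the earliest failed event can also be picked out programmatically. This is a sketch under assumptions: it uses boto3's `describe_stack_events`, treats any resource status ending in `_FAILED` as a failure, and the helper names `first_failed_event` and `fetch_stack_events` are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Optional


def first_failed_event(events: Iterable[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Return the earliest event whose status ends in _FAILED, or None."""
    failed = [e for e in events if str(e.get("ResourceStatus", "")).endswith("_FAILED")]
    # Compare timestamps explicitly instead of relying on the order of the input.
    return min(failed, key=lambda e: e["Timestamp"], default=None)


def fetch_stack_events(stack_id: str, region: str) -> List[Dict[str, Any]]:
    """Fetch all events of a stack; deleted stacks must be addressed by stack ID."""
    import boto3  # imported here so first_failed_event works without boto3

    client = boto3.client("cloudformation", region_name=region)
    paginator = client.get_paginator("describe_stack_events")
    events: List[Dict[str, Any]] = []
    for page in paginator.paginate(StackName=stack_id):
        events.extend(page["StackEvents"])
    return events
```

A typical call would be `first_failed_event(fetch_stack_events(stack_id, region))`, where `stack_id` is the full ID of the deleted stack and `region` is the region of the NX1 Environment.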

One common reason, “Resource creation canceled”, usually appears when multiple stacks are being deployed at once and one of them fails, which cancels all the other stacks in that set. In this case the actual failure is in a different stack, so the best approach is to look for other deleted stacks in the Management, Log Archive, and Audit AWS Accounts.
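A small sketch of how this case could be detected from a stack's events follows. It assumes events in the dictionary shape that boto3's `describe_stack_events` returns, and it matches the reason text loosely because the exact wording and spelling of the message may vary; `is_cancellation` and `classify_failure` are hypothetical names.

```python
from typing import Any, Dict, Iterable, Optional, Tuple


def is_cancellation(event: Dict[str, Any]) -> bool:
    """True if the event's reason says resource creation was cancelled."""
    reason = event.get("ResourceStatusReason") or ""
    # Case-insensitive prefix match covers both "canceled" and "cancelled".
    return "creation cancel" in reason.lower()


def classify_failure(
    events: Iterable[Dict[str, Any]],
) -> Tuple[str, Optional[Dict[str, Any]]]:
    """Classify a stack's failure from its events.

    Returns ("root_cause", event) for the earliest failure that is not a
    cancellation, ("cancelled_only", event) if every failure is a cancellation
    (the real cause is probably in another stack), or ("no_failure", None).
    """
    failed = sorted(
        (e for e in events if str(e.get("ResourceStatus", "")).endswith("_FAILED")),
        key=lambda e: e["Timestamp"],
    )
    for event in failed:
        if not is_cancellation(event):
            return ("root_cause", event)
    if failed:
        return ("cancelled_only", failed[0])
    return ("no_failure", None)
```

A `cancelled_only` result is the signal to repeat the search in the deleted stacks of the Management, Log Archive, and Audit accounts.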

Step 4: Collect the data and contact support

Once the root failed event is found, send all the information collected from the AWS Console to our support team at support@nx1.io, and describe which NX1 action triggered the issue.
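To keep the message complete, the collected details can be gathered into a short plain-text summary. The sketch below is only a suggestion: the function name `format_support_report` and the layout are hypothetical, not a format required by support, and the event is assumed to be a dictionary in the shape that boto3's `describe_stack_events` returns.

```python
from typing import Any, Dict


def format_support_report(
    account_id: str, region: str, nx1_action: str, event: Dict[str, Any]
) -> str:
    """Build a plain-text summary of a failed stack event for a support email."""
    lines = [
        f"NX1 action: {nx1_action}",
        f"AWS account: {account_id}",
        f"Region: {region}",
        f"Stack: {event.get('StackName', 'unknown')}",
        f"Stack ID: {event.get('StackId', 'unknown')}",
        f"Resource: {event.get('LogicalResourceId', 'unknown')} "
        f"({event.get('ResourceType', 'unknown')})",
        f"Status: {event.get('ResourceStatus', 'unknown')}",
        f"Reason: {event.get('ResourceStatusReason', 'not provided')}",
        f"Timestamp: {event.get('Timestamp', 'unknown')}",
    ]
    return "\n".join(lines)
```

The resulting text can be pasted into the email alongside any screenshots taken from the “Events” tab.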