Responding to alerts and incidents
This section is a quick reference version of the guide for DM’s on incident management document.
** The technical guide for Devs & SREs is located in Google Drive **
Types of alert and where you’ll see them
Alerts come from different sources.
Sentry alerts
Emails are sent to the GovWifi developers mailbox. Notifications are sent to the #govwifi-monitoring Slack channel.
AWS Cloudwatch alerts
Emails are sent to GovWifi critical alerts mailbox. Notifications are sent to the #govwifi-monitoring Slack channel, Critical Alerts are sent to the #govwifi-alerts Slack channel, there are also Pager Duty Notifications for out of hours support.
StatusPage
The GovWifi status page shows updates about the service. You can log into the Status page with your work email account . “Choose Login With Google”
Alerts from the status page are sent as emails to the GovWifi support mailbox. Notifications are sent to the #govwifi Slack channel.
Notify alerts
Emails are sent to the GovWifi support mailbox and StatusPage.
Tell people if you see an alert
- Share a brief summary of what you’ve seen on the #govwifi Slack channel.
- Find out if anyone else is already investigating the issue.
- If the issue is security related, tell the Cyber Security Team. You can use the #cyber-security-help Slack channel.
Appoint an incident lead and a communication lead
Make sure the product manager, delivery manager, engagement manager or tech lead is aware of the incident.
One of them will be the communication lead. Appoint another person to be an incident lead. This will usually be a developer or an SRE, or a person actively working on fixing incident.
Categorise the incident
The incident lead should lead the categorisation of the incident and discuss it with the relevant people on the team.
P1 incidents (critical)
These are situations where there’s:
- a complete outage
- unauthorised access
The main criteria is that registered users cannot authenticate on RADIUS services and access the internet using GovWifi.
Possible examples are:
- A serious outage with the production platform, for example:
- Failure of one or both AWS regions leading to users being unable to authenticate
- Issue with AWS Elastic Load Balancer leading to increased traffic and authentication failure for users
- Loss of ownership of RADIUS Elastic IPs
- This would mean we’d have to ask organisations to re-configure their infrastructure to use new IP addresses, causing major remedial work and reputational damage
We must:
- respond within 30 minutes during business hours
- give an update every hour
In a complete outtage we can use this list of admin email addresses to contact organisations/network admins directly with additional instructions if we need to. The list is generated each night by an automated task.
P2 incidents (major)
These are situations where there’s a substantial degradation of the service.
The main criteria are:
- new organisations cannot register to use GovWifi
- new end users cannot sign up to use GovWifi
- existing admin users cannot access GovWifi admin
A possible example is a significant issue with the production platform on one of the AWS regions. For example, if the London region failed, GovWifi admin would not work and users would not be able to sign up to GovWifi. However, authentication for existing users would be fine.
We must:
- respond within 1 hour during business hours
- give an update every 2 hours
P3 incidents (significant)
These are situations where there’s intermittent or degraded service due to a platform issue.
The main criteria are:
- users experience intermittent or degraded service
- the website is intermittently unavailable, or there are assets missing
A possible example is a temporary outage of RADIUS authentication.
We must:
- respond within 2 hours during business hours
- give updates every business day
P4 incidents (minor)
These are situations where there’s a component failure that is not immediately affecting the service.
A possible example is a failure of one of the RADIUS servers (there are 3 in each region).
We must:
- respond within 1 day during business hours
- give updates every 2 business days
Update the Status Page
Add an entry briefly describing the service/s affected and level of impact.
Work to fix the issue
The tech lead and incident lead should lead the actions to fix the service. All developers - frontend and infrastructure - will help.
Tell the relevant people
The communications lead should make sure that the relevant people know about the incident.
The type of incident will affect who the ‘relevant people’ are and how often they should be updated. Some of the contact information is sensitive, so they have been stored in Google docs. You will need to be a member of the GovWifi team to access them.
P1 incidents
The communications lead needs to:
- Open an incident on the GovWifi status page and add regular updates. You can do this by logging in to the Status page with your work email account . “Choose Login With Google”
- If there’s been a data breach, tell the Data Protection Officer. More information about this process can be found in the Risk Assessment Document (the entire document does not have to be read during the incident, the most important thing is to alert the Data Protection Officier of the breach).
- Updating the status page is usually enough to inform all relevant parties of a service outage. However during a major outage (a P1 which lasts longer than 30 minutes) we should also send an email directly to the Senior Management team.
We do not contact individual GovWifi users. This is because we would need the technical team to extract user information from the database. This takes time, which would be better spent fixing the service. Individual users can subscribe to the status page for regular updates.
P2 incidents
The communication lead needs to:
- open an incident on the GovWifi status page and add regular updates
- if there’s been a data breach, tell the Data Protection Officer. More information about this process can be found in the Risk Assessment Document (the entire document does not have to be read during the incident, the most important thing is to alert the Data Protection Officier of the breach).
- talk to the team and decide what other communications are necessary
P3 and P4 incidents
The incident and communication lead should talk to the team and decide what communications are necessary.
The team should decide if an incident should be opened on the GovWifi status page.
Create an incident report
The incident and communication lead needs to create an incident report using the report template.
Throughout the incident, record actions taken, decisions, significant external communications in the report. Include timestamps.
After the incident
When the incident is resolved, the incident lead needs to record the completed template with a unique ID in the Incident Log.
The Delivery Manager should organise an incident retro with the team. Try to keep this within two weeks of the incident where possible. You do not have to have been involved in the Incident in order to attend the retro. Everyone on the GovWifi team is welcome to attend and learn.
Also ensure you add the incident to the GPA monthly reporting document.
Helping other organisation incident management teams with their incidents
From time to time GovWifi may be requested to assist an organisation that is in incident management mode. The following principles should help make the engagement easier.
Identify the admin user for the organisation
In GovWifi admin, locate the admin users listed for the organisation in question. The incident lead for the organisation is unlikely to be an admin user so invite at least one admin user to any resolution call.
Introduce everyone involved in the incident
When speaking to the organisation’s incident team, make sure everyone has been properly introduced. The team may have a very different structure to ours and we need to understand their roles and hierarchy.
Align the goals of any incident calls/meetings
By default GovWifi should focus on assisting to resolve the issue as quickly as possible for our end users. Organisations may be inclined to understand the root cause of the incident first. We must work with the incident team to ensure our calls/meetings are outcome focused.
Demonstrate empathy with the incident team
Emotions are likely to be running high for the incident team. This can lead to miscommunication. Keep calm and try to empathise with the incident team.
Consider the interoperability of unified communications tools
GovWifi uses Google Meet, other organisations may not use Google Meet. If the unified comms does not work for either team try and find one that works for everyone. Screen shares can be incredibly helpful to both teams so finding a common platform is important.
Be confident in your knowledge of GovWifi
Always remember you are the GovWifi subject matter expert and take confidence in that knowledge.