Introduction

Mike, your teammate at UpCommerce, has just pushed an update containing changes the development team made to the site's code. Ever since that push, you have been getting pager alerts that the UpCommerce.com service is down. As SRE team lead, it is your job to handle and resolve problems in the UpCommerce Kubernetes cluster. You will also need to keep your Incident Commander (IC) and Communications Lead (CL) up to date so they can manage communication with UpCommerce's users.

Getting your development environment ready

You will use GitHub Codespaces as your development environment, just like you did for the project in Week 1.

Steps

Create a fork of the week's repository. This repo contains the exact code that Mike pushed to UpCommerce's production Kubernetes cluster.
Once you have forked the repository, start a codespace on the main branch. The project's directory structure is the same as in Week 1.
Run the command below in your codespace's terminal to create a single-node Kubernetes cluster using Minikube:

minikube start

Once your Minikube cluster is running, enter the command below:

kubectl create namespace sre

This creates a namespace named sre in your Kubernetes cluster. You'll do all the tasks required for this project within this namespace.
Run the commands below to install and activate Prometheus in your sre namespace:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/prometheus \
  -f prometheus.yml \
  --namespace sre
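Before moving on, it can help to confirm that the Prometheus pods have actually started. The following is a minimal sketch, assuming the chart labels its pods with app.kubernetes.io/instance set to the release name (the same label the port-forward commands in this README rely on). The helper name wait_for_release is hypothetical and not part of the project.

```shell
# Sketch: block until every pod of a Helm release is Ready, or time out.
# wait_for_release is a hypothetical helper name used only for illustration.
wait_for_release() {
  local release="$1"             # Helm release name, e.g. prometheus
  local namespace="${2:-sre}"    # defaults to the sre namespace used here
  local timeout="${3:-180s}"     # how long kubectl waits before giving up
  kubectl wait pod \
    --namespace "$namespace" \
    --selector "app.kubernetes.io/instance=$release" \
    --for=condition=Ready \
    --timeout="$timeout"
}

# Example (run once the chart is installed):
#   wait_for_release prometheus sre 180s
```

The same helper should work for the Grafana release, since it only depends on the release name and the namespace.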

Run the commands below to install and activate Grafana in your sre namespace:

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install grafana grafana/grafana \
 --namespace sre \
 --set adminPassword="admin"
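If you later forget which password was set, you can read it back from the cluster. This is a sketch under the assumption that the release is named grafana, in which case the chart stores the credentials in a Secret that is also named grafana, under the admin-password key. The function name grafana_admin_password is hypothetical.

```shell
# Sketch: print the Grafana admin password stored by the Helm chart.
# Assumes the release is named "grafana", so the Secret is named "grafana"
# and keeps the password (base64-encoded) under the "admin-password" key.
grafana_admin_password() {
  local namespace="${1:-sre}"
  kubectl get secret grafana \
    --namespace "$namespace" \
    -o jsonpath='{.data.admin-password}' | base64 --decode
}

# Example:
#   grafana_admin_password sre
```

With the install command above, the value printed should be the password passed through --set adminPassword.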

For this task, you'll need the following services running:

a. Prometheus server

b. Prometheus Alertmanager

c. Prometheus PushGateway

d. Grafana server

You'll need a separate split terminal for each of the services above, because each kubectl port-forward command runs in the foreground while it forwards a local port to the service's pod so that you can view it. Activate the services with the following commands:

a. Prometheus server

export POD_NAME=$(kubectl get pods --namespace sre -l "app.kubernetes.io/name=prometheus,app.kubernetes.io/instance=prometheus" -o jsonpath="{.items[0].metadata.name}")

kubectl --namespace sre port-forward $POD_NAME 9090

b. Prometheus Alertmanager

export POD_NAME=$(kubectl get pods --namespace sre -l "app.kubernetes.io/name=alertmanager,app.kubernetes.io/instance=prometheus" -o jsonpath="{.items[0].metadata.name}")

kubectl --namespace sre port-forward $POD_NAME 9093

c. Prometheus PushGateway

export POD_NAME=$(kubectl get pods --namespace sre -l "app.kubernetes.io/instance=prometheus,app.kubernetes.io/name=prometheus-pushgateway" -o jsonpath="{.items[0].metadata.name}")

kubectl --namespace sre port-forward $POD_NAME 9091

d. Grafana server

export POD_NAME=$(kubectl get pods --namespace sre -l "app.kubernetes.io/name=grafana,app.kubernetes.io/instance=grafana" -o jsonpath="{.items[0].metadata.name}")

kubectl --namespace sre port-forward $POD_NAME 3000
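Once all four port-forwards are running, a quick probe from a fifth terminal can confirm that each service answers. The sketch below assumes the ports used in the commands above and the health endpoints these tools expose (/-/healthy for Prometheus, Alertmanager, and Pushgateway; /api/health for Grafana). The function names are hypothetical.

```shell
# Sketch: probe each forwarded service on its health endpoint.
# Ports match the port-forward commands above; function names are illustrative.
check_endpoint() {
  local name="$1" url="$2"
  # -f makes curl fail on HTTP errors; -s hides the progress meter.
  if curl -fs --max-time 5 -o /dev/null "$url"; then
    echo "$name: up"
  else
    echo "$name: DOWN ($url)"
    return 1
  fi
}

check_all() {
  local failed=0
  check_endpoint "Prometheus"   "http://localhost:9090/-/healthy"  || failed=1
  check_endpoint "Alertmanager" "http://localhost:9093/-/healthy"  || failed=1
  check_endpoint "Pushgateway"  "http://localhost:9091/-/healthy"  || failed=1
  check_endpoint "Grafana"      "http://localhost:3000/api/health" || failed=1
  return "$failed"
}

# Example:
#   check_all
```

check_all returns a non-zero status if any endpoint fails, so it can also be used in a loop while you wait for the pods to come up.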

Run the commands below to create a deployment and service:

kubectl apply -f deployment.yml -n sre
kubectl apply -f service.yml -n sre

Run the command below to see the status of your UpCommerce deployment:

kubectl get deployment -n sre

By now, you will see that your UpCommerce deployment is failing. This is the main reason why you have been getting firing alerts.
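The READY column of the command above can also be checked from a script. The following sketch compares ready and desired replicas for a single deployment. The deployment name is an assumption here: use whatever name kubectl get deployment -n sre prints. deployment_ready is a hypothetical helper.

```shell
# Sketch: compare ready replicas with desired replicas for one deployment.
# The deployment name passed in is an assumption; take the real one from
# `kubectl get deployment -n sre`.
deployment_ready() {
  local name="$1" namespace="${2:-sre}"
  local desired ready
  desired=$(kubectl get deployment "$name" -n "$namespace" -o jsonpath='{.spec.replicas}')
  ready=$(kubectl get deployment "$name" -n "$namespace" -o jsonpath='{.status.readyReplicas}')
  if [ -z "$desired" ]; then
    echo "$name: deployment not found in namespace $namespace" >&2
    return 2
  fi
  ready="${ready:-0}"   # the field is omitted when no replica is ready
  echo "$name: $ready/$desired replicas ready"
  [ "$ready" -eq "$desired" ]
}

# Example (hypothetical deployment name):
#   deployment_ready upcommerce sre
```

The function exits with a non-zero status while the deployment is failing, which makes it easy to re-run after each fix you try.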

Task

Troubleshoot your deployment and find out why it is failing. Once you have found the cause, write a memo (100–150 words) in a Google Doc to your Incident Commander and Communications Lead, informing them of the cause of the service outage.

[Extension]

Troubleshooting system outages using Grafana

Grafana is an open-source analytics and monitoring solution for time-series data. It allows you to query, visualize, alert on, and understand your metrics, no matter where they are stored. Grafana provides a powerful and flexible platform for creating rich, interactive dashboards that can be used for monitoring and analyzing various systems and applications.
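Before Grafana can chart anything for this project, it needs Prometheus registered as a data source. This can be done in the Grafana UI; the sketch below does the same through Grafana's HTTP API. It rests on several assumptions: Grafana is port-forwarded to localhost:3000, the admin password is the one set at install time, and the Prometheus chart exposes its server inside the cluster as a Service named prometheus-server. Check the Service name with kubectl get svc -n sre before relying on it. Both function names are hypothetical.

```shell
# Sketch: register Prometheus as a Grafana data source over the HTTP API.
# Assumptions: Grafana on localhost:3000, admin/admin credentials, and an
# in-cluster Service called "prometheus-server" in the sre namespace.
datasource_payload() {
  local url="${1:-http://prometheus-server.sre.svc.cluster.local}"
  printf '{"name":"Prometheus","type":"prometheus","access":"proxy","url":"%s","isDefault":true}\n' "$url"
}

add_prometheus_datasource() {
  # --data @- makes curl read the JSON body from standard input.
  datasource_payload "$@" | curl -fsS \
    -u admin:admin \
    -H 'Content-Type: application/json' \
    -X POST --data @- \
    http://localhost:3000/api/datasources
}

# Example:
#   add_prometheus_datasource
```

The URL is resolved by the Grafana pod, not by your terminal, which is why the sketch uses the in-cluster Service address rather than localhost:9090.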

Troubleshooting

Checking Pod status

kubectl get pods -n sre
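With Prometheus and Grafana running in the same namespace, the pod list gets long. A small filter can narrow it down to the pods that are not fully ready. This sketch assumes the default column layout of kubectl get pods (NAME, READY, STATUS, RESTARTS, AGE); unhealthy_pods is a hypothetical helper.

```shell
# Sketch: print only the pods that are not fully ready.
# Reads the output of `kubectl get pods --no-headers` from standard input
# (columns: NAME READY STATUS RESTARTS AGE).
unhealthy_pods() {
  awk '{
    split($2, r, "/")                       # READY column, e.g. "0/1"
    if (r[1] != r[2] && $3 != "Completed")  # skip finished Job pods
      print $1, $3
  }'
}

# Example:
#   kubectl get pods -n sre --no-headers | unhealthy_pods
```

Each line of output is a pod name followed by its STATUS value, which is usually the first hint about what kind of failure you are dealing with.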

Describing the `upcommerce` pod to get details
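A typical way to gather these details is sketched below: describe the pod, read its recent logs, and list the namespace events in order. The label selector in the example is an assumption; check the real labels with kubectl get pods -n sre --show-labels. inspect_pods is a hypothetical helper.

```shell
# Sketch: collect the usual evidence for a failing pod.
# The selector you pass in is an assumption; confirm the pod's labels first.
inspect_pods() {
  local selector="$1" namespace="${2:-sre}"
  # The Events section at the end of describe often names the failure.
  kubectl describe pods -n "$namespace" -l "$selector"
  # Recent container output; --tail limits how much is printed.
  kubectl logs -n "$namespace" -l "$selector" --tail=50 --all-containers
  # Namespace events in chronological order.
  kubectl get events -n "$namespace" --sort-by=.lastTimestamp
}

# Example (hypothetical label):
#   inspect_pods app=upcommerce
```

The logs command may report an error if the container never started; in that case the describe output and the events are the places to look.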

Incorrect configuration identified

Corrected configuration

UpCommerce now running

Additional Task

Grafana dashboards created

RCA Document

Document


aggarwal-tanushree/sre-week-two

Folders and files

Latest commit

History

Repository files navigation

Introduction

Getting your development environment ready

Steps

Task

[Extension]

Troubleshooting

Checking Pod status

Describing the upcommerce pod to get details

Incorrect configuration identified

Corrected configuration

Upcommerce now running

Additional Task

Grafana dashboards created

RCA Document

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Describing the `upcommerce` pod to get details

Packages