ECE750-Group-5/Proactive-Circuit-Breaking-For-Istio


Proactive Circuit Breaking For Istio

Inspired by TCP Reno and the Adaptive Concurrency Limit feature of the Envoy Proxy
Explore our presentation »

Table of Contents
  1. About The Project
  2. Getting Started
  3. Limitations
  4. License
  5. Acknowledgments

About The Project

Motivations

We got the idea from two sources. First, Mendonça et al., in their survey paper on building self-adaptive microservice systems, describe how self-adaptive methods could make today's cloud-native applications more resilient, and they identify self-adaptive circuit breakers as an open research topic.

Second, the Netflix Technology Blog has a well-known article, Performance under Load, which explains how circuit breakers can keep downstream services from being overwhelmed and mitigate cascading failures.

Problem Statement

Because of runtime uncertainty and frequent code changes, it is hard to set the right circuit breaking thresholds. Traditional circuit breaking methods use static thresholds and do not adapt to runtime changes.

The Envoy Proxy has a feature called Adaptive Concurrency Limit, which is a real-time adaptive circuit breaking mechanism. Inspired by TCP Vegas, a latency-based TCP congestion control algorithm, it uses latency as feedback to adjust the concurrency limit. However, this feature is not available in Istio, a popular service mesh platform (Istio Issue 25991).

The Envoy implementation also has to recalibrate its latency baseline with the concurrency limit set to 1 in every measurement window, which introduces extra parameters to tune and periods of artificial unavailability.
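To make the latency-feedback idea concrete, here is a minimal sketch of a gradient-style limit update in the spirit of TCP Vegas. It is a simplification for illustration and not Envoy's actual code: the function name, the square-root headroom, the gradient clamp, and the limit bounds are all assumptions.

```python
import math


def gradient_limit_update(limit, min_rtt_ms, sample_rtt_ms,
                          min_limit=1, max_limit=1000):
    """Return a new concurrency limit from a latency gradient (illustrative only)."""
    # A gradient below 1 means the sampled latency is above the baseline,
    # which is read as a sign of queueing in the upstream service.
    gradient = min_rtt_ms / sample_rtt_ms
    # Assumed clamp: never shrink by more than half or grow via the gradient.
    gradient = max(0.5, min(1.0, gradient))
    # Assumed headroom term so the limit can still grow when latency is flat.
    headroom = math.sqrt(limit)
    new_limit = int(gradient * limit + headroom)
    return max(min_limit, min(max_limit, new_limit))
```

The update depends on `min_rtt_ms`, the latency of an unloaded service. Keeping that baseline fresh is what forces the periodic recalibration described above.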

Solution

Our solution addresses this problem by mimicking the TCP Reno congestion control algorithm and using CPU utilization, a more immediate signal of saturation, as the feedback.

High-Level Design

architecture

State Machine Algorithm

  • Multiplicative Decrease Multiplicative Increase
  • Random Probing

state machine
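The two ingredients listed above can be sketched as a small controller. This is one possible reading of the state machine, not the project's actual implementation: the class name, the CPU thresholds, the increase and decrease factors, and the probing probability are all hypothetical values chosen for illustration.

```python
import random


class MdmiLimiter:
    """Hypothetical multiplicative-increase / multiplicative-decrease limiter."""

    def __init__(self, limit=10.0, min_limit=1.0, max_limit=1000.0,
                 cpu_low=0.5, cpu_high=0.8, increase=1.5, decrease=0.5,
                 probe_prob=0.1, rng=None):
        self.limit = float(limit)
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.cpu_low = cpu_low      # below this, assume clear spare capacity
        self.cpu_high = cpu_high    # at or above this, treat the service as saturated
        self.increase = increase    # assumed growth factor
        self.decrease = decrease    # assumed back-off factor
        self.probe_prob = probe_prob
        self.rng = rng or random.Random()
        self.state = "hold"

    def _grow(self):
        self.limit = min(self.max_limit, self.limit * self.increase)

    def step(self, cpu_utilization):
        """Consume one CPU sample (0.0 to 1.0) and return the new limit."""
        if cpu_utilization >= self.cpu_high:
            # Saturated: back off multiplicatively.
            self.limit = max(self.min_limit, self.limit * self.decrease)
            self.state = "decrease"
        elif cpu_utilization < self.cpu_low:
            # Plenty of headroom: grow multiplicatively.
            self._grow()
            self.state = "increase"
        elif self.rng.random() < self.probe_prob:
            # In between: occasionally probe a higher limit at random.
            self._grow()
            self.state = "probe"
        else:
            self.state = "hold"
        return self.limit


if __name__ == "__main__":
    limiter = MdmiLimiter(limit=10)
    for cpu in (0.9, 0.3, 0.6, 0.95):
        new_limit = limiter.step(cpu)
        print(f"cpu={cpu:.2f} state={limiter.state} limit={new_limit:.2f}")
```

Random probing keeps the limiter from settling permanently at a limit that is lower than what the service could sustain, without requiring a dedicated recalibration window.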

Experiment

Experiment Setup

We ran three experiment groups: Group A with proactive circuit breaking (timeline 21:50 to 22:00), Group B without any circuit breaking (timeline 22:00 to 22:20), and Group C with static circuit breaking at a predefined concurrency limit of 10 (timeline 22:25 to 22:35). For each group, we used Fortio to generate a constant load of 140 QPS against an HttpBin container with a fixed resource allocation of 20m CPU and 78 Mi memory. We used Prometheus and Grafana to monitor the CPU utilization and the QPS of the target service.
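For reference, a static limit like the one in Group C is typically expressed in Istio as a DestinationRule with connection pool settings. The sketch below builds such a manifest as a Python dict. The helper name, the resource name, and the way the limit of 10 is mapped onto the individual fields are assumptions, not the exact configuration used in the experiment.

```python
import json


def static_circuit_breaker_rule(host="httpbin", limit=10):
    """Build an Istio DestinationRule manifest as a plain dict (illustrative)."""
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "DestinationRule",
        # Hypothetical resource name, chosen for this example only.
        "metadata": {"name": f"{host}-circuit-breaker"},
        "spec": {
            "host": host,
            "trafficPolicy": {
                "connectionPool": {
                    # Assumed mapping of the concurrency limit onto pool settings.
                    "tcp": {"maxConnections": limit},
                    "http": {
                        "http1MaxPendingRequests": 1,
                        "http2MaxRequests": limit,
                    },
                }
            },
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests, so the output can be piped to
    # `kubectl apply -f -`.
    print(json.dumps(static_circuit_breaker_rule(), indent=2))
```

A proactive controller could regenerate the same manifest with a different `limit` on each adaptation step, whereas the static group keeps it fixed for the whole run.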

Results

CPU utilization

CPU utilization

Both CPU utilization and latency improve. However, latency did not improve as much as we expected, and we still need to investigate the root cause of its high variance.


Getting Started

Prerequisites

You need to have minikube installed. If you don't have it, you can install it by following the instructions here.

Installation

  1. Install Prometheus Operator, Prometheus, cAdvisor, Fortio, and Grafana. In the root directory, run the following commands:
     chmod +x set-up.sh
     ./set-up.sh
  2. (Optional) Configure the receiver for Prometheus Alertmanager. This is an example for Slack; you can use other receivers as well.
     kubectl apply -f monitoring/alert
  3. Deploy Httpbin:
     kubectl apply -f httpbin/httpbin.yaml
  4. Start the proactive circuit breaker MAPE loop:
     python3 analyzing_planning_executing/main.py
  5. Start the Fortio load test:
     kubectl exec "$FORTIO_POD" -c fortio -- /usr/bin/fortio load -c 140 -qps 140 -n 60000 -loglevel Warning http://httpbin:8000/get
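The MAPE loop started in the steps above needs a CPU utilization reading to act on. The sketch below shows how such a reading could be derived from Prometheus and cAdvisor data. It is not taken from `main.py`: the function names, the PromQL expression, and the pod regex are assumptions, and only the 20m CPU limit comes from the experiment setup.

```python
from urllib.parse import urlencode


def build_cpu_query_url(prometheus_url, pod_regex="httpbin.*", window="1m"):
    """Build an instant-query URL for the Prometheus HTTP API."""
    # Assumed PromQL: CPU cores consumed by matching pods, averaged over `window`.
    promql = ('sum(rate(container_cpu_usage_seconds_total'
              f'{{pod=~"{pod_regex}"}}[{window}]))')
    return f"{prometheus_url}/api/v1/query?" + urlencode({"query": promql})


def cores_from_response(payload):
    """Extract the sample value from an instant-query JSON response."""
    results = payload["data"]["result"]
    if not results:
        return 0.0
    # Each sample is [timestamp, "value-as-string"].
    return float(results[0]["value"][1])


def cpu_utilization(cores_used, cpu_limit_millicores=20):
    """Convert cores used into a fraction of the container's CPU limit."""
    return cores_used / (cpu_limit_millicores / 1000.0)


if __name__ == "__main__":
    # Hypothetical Prometheus address and a hand-written sample response.
    print(build_cpu_query_url("http://localhost:9090"))
    sample = {"status": "success",
              "data": {"resultType": "vector",
                       "result": [{"metric": {}, "value": [1700000000, "0.018"]}]}}
    print(round(cpu_utilization(cores_from_response(sample)), 3))
```

With a 20m limit, a reading of 0.018 cores corresponds to roughly 90% utilization, which a controller would treat as close to saturation.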


Limitations

  • The current implementation is a proof of concept and is not production-ready.
  • We haven't tested the system for system degradation and scaling-out events.
  • We could adopt the cubic increase function from TCP Cubic for more efficient adaptations.


License

Distributed under the MIT License. See LICENSE.txt for more information.


Acknowledgments

This project was developed as part of the ECE750 course at the University of Waterloo. We would like to thank our instructors, Prof. Landan and the TAs, for their guidance and support throughout the term.

We used DALL·E to generate our project logo and Copilot to help generate documentation.

References

  1. “Circuit Breaking.” n.d. Istio. Accessed December 1, 2023. https://istio.io/latest/docs/tasks/traffic-management/circuit-breaking/.
  2. Mendonça, Nabor C., Pooyan Jamshidi, David Garlan, and Claus Pahl. "Developing self-adaptive microservice systems: Challenges and directions." IEEE Software 38, no. 2 (2019): 70-79.
  3. Yanacek, David. n.d. “Using Load Shedding to Avoid Overload.” AWS. Accessed December 1, 2023. https://aws.amazon.com/builders-library/using-load-shedding-to-avoid-overload/.
  4. Landau, Eran, William Thurston, and Tim Bozarth. 2018. “Performance under Load.” Medium. Netflix Technology Blog. March 23, 2018. https://netflixtechblog.medium.com/performance-under-load-3e6fa9a60581.
  5. Netflix Opensource Software. 2023. “Concurrency Limit.” GitHub. November 29, 2023. https://github.com/Netflix/concurrency-limits/tree/master.
  6. Allen, Tony. 2020. “Envoy, Take the Wheel: Real-Time Adaptive Circuit Breaking.” YouTube. September 4, 2020. https://www.youtube.com/watch?v=CQvmSXlnyeQ.
  7. Allen, Tony. 2019. “Envoy GitHub Issue #7789: Adaptive Concurrency Control L7 Filter.” GitHub. July 31, 2019. envoyproxy/envoy#7789.
