Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Borrowing botocore implementation of get_response
TL;DR - The use of boto3 in #1759 resulted in relying on blocking (synchronous) HTTP requests, which caused the performance issue reported in #1783. `kombu` previously used to craft AWS requests manually as explained in detail in #1726, which resulted in an outage when botocore temporarily changed the default protocol to JSON (before rolling back due to the impact on celery and airflow.) To fix the issue, I submitted #1759, which changes `kombu` to use `boto3` instead of manually crafting AWS requests. This way when boto3 changes the default protocol, kombu won't be impacted. While working on #1759, I did extensive debugging to understand the multi-threading nature of kombu. What I discovered is that there isn't an actual multi-threading in the true sense of the word, but an event loop that runs on the same thread and process and orchestrate the communication with SQS. As such, it didn't appear to me that there is anything to worry about my change, and the testing I did didn't discover any issue. However, it turns out that while kombu's event loop doesn't have actual multi-threading, its [reliance on pycurl](https://github.com/celery/kombu/blob/main/kombu/asynchronous/http/curl.py#L48) (and thus libcurl) meant that the requests to AWS were being done asynchronously. On the other hand, boto3 requests are always done synchronously, i.e. they are blocking requests. The above meant that my change introduced blocking on the event loop of kombu. This is fine in most of the cases, since the requests to SQS are pretty fast. However, in the case of using long-polling, a call to SQS's ReceiveMessage can last up to 20 seconds (depending on the user configuration). To solve this problem, I rolled back my earlier changes and, instead, to address the issue reported in #1726, I now borrowed the implementation of `get_response` from botocore and changed the code such that the protocol is hard-coded to `query`. This way, when botocore changes the default protocol of SQS to JSON, kombu won't be impacted, since it crafts its own request and, after my change, it uses a hard-coded protocol based on the crafted requests. This solution shouldn't be the final solution, and it is more of a workaround that does the job for now. There are two problems with this approach: 1. It doesn't address the fundamental problem discussed in #1726, which is that `kombu` is using some stuff that are kind of internal to botocore, namely the `StreamingBody` class. 2. It is still making an assumption, namely that the protocol of communication is the `query` protocol. While this is true, and likely going to be true for some time, in theory nothing stops SQS (the backend, not client) from changing the default protocol, rendering the hard-coding of protocol in the new `get_response` method to `query` to be problematic. As such, the long term solution should be to completely rely on boto3 for any communication with AWS, and ensuring that all requests all async in nature (non-blocking.) This, however, is a fundamental change that requires a lot of testing, in particular performance testing.
- Loading branch information