
[proposal] Add support for SQS event source #884

Closed
jamesls opened this issue Jul 3, 2018 · 9 comments
jamesls (Member) commented Jul 3, 2018

This issue proposes adding support for the newly released SQS event source.

While the implementation will be different, the user-facing API is almost identical to the S3 and SNS events we recently added.

app.py changes

First, we'll add an @app.on_sqs_message decorator, which takes a queue name and an optional batch_size (default is 10, which is the same in the console):

@app.on_sqs_message(queue='myqueue', batch_size=10)
def handler(event):
    for record in event:
        print(record.body)
        print(record.to_dict())

Similar to the topic arg in the SNS PR, this allows us to eventually add support for queue URLs and possibly queue resources if needed (verified that queue URLs and queue names are distinct).

In terms of the event object being passed to the handler, it will still inherit from BaseLambdaEvent, except that it will also be iterable. Because we can now get multiple records in a single event, there are no helpers for accessing just the first record. Each item we iterate over is itself a BaseLambdaEvent and will have these values:

class SQSRecord(BaseLambdaEvent):
    body = ...  # type: str

I propose making each item a BaseLambdaEvent so you also get a to_dict() method on each record.
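As a rough sketch (not chalice's actual implementation), the classes described above might look like this, assuming a hypothetical BaseLambdaEvent base that stores the raw event dict and exposes to_dict():

```python
class BaseLambdaEvent(object):
    # Hypothetical base class: keeps the raw event dict around so
    # every event and record gets a to_dict() method for free.
    def __init__(self, event_dict):
        self._event_dict = event_dict
        self._extract_attributes(event_dict)

    def _extract_attributes(self, event_dict):
        # Subclasses pull out the fields they care about.
        pass

    def to_dict(self):
        return self._event_dict


class SQSRecord(BaseLambdaEvent):
    def _extract_attributes(self, event_dict):
        self.body = event_dict['body']  # type: str


class SQSEvent(BaseLambdaEvent):
    # Iterable, as proposed: each item yielded is itself a
    # BaseLambdaEvent (an SQSRecord).
    def __iter__(self):
        for record in self._event_dict['Records']:
            yield SQSRecord(record)
```

With these classes, the `for record in event: print(record.body)` loop from the decorator example works as written.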

Here's what the raw event looks like (showing just a single record but there can be multiple records):

{'Records': [{
        'attributes': {
            'ApproximateFirstReceiveTimestamp': '1530576251596',
            'ApproximateReceiveCount': '1',
            'SenderId': 'sender-id',
            'SentTimestamp': '1530576251595'
        },
        'awsRegion': 'us-west-2',
        'body': 'queue message body',
        'eventSource': 'aws:sqs',
        'eventSourceARN': 'arn:aws:sqs:us-west-2:12345:queue-name',
        'md5OfBody': '754ac2f7a12df38320e0c5eafd060145',
        'messageAttributes': {},
        'messageId': 'message-id',
        'receiptHandle': 'handle'
}]}
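One thing a handler can do with the raw fields above: md5OfBody is the MD5 checksum SQS computed for the message body, so a record's body can be verified on receipt. A sketch operating directly on the raw record dict (not a chalice API):

```python
import hashlib

def body_is_intact(record):
    # Compare the MD5 of the body we received against the checksum
    # SQS computed when the message was sent.
    digest = hashlib.md5(record['body'].encode('utf-8')).hexdigest()
    return digest == record['md5OfBody']
```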
kadrach (Member) commented Jul 3, 2018

... add support for [...] queue resources

My first thought on seeing both SNS and SQS support was deployment. In a SAM template, I can add CF resources and reference them, e.g. have the queue created as part of the stack or import a value from another stack.

Is this the intent of managed resources here?

kyleknap (Contributor) commented Jul 3, 2018

@jamesls this looks fine, especially since it looks like users will most likely get multiple messages in one event. The only other thing I wonder is whether we want the top-level event to have a to_dict() as well. I think right now we are only missing the Records key, but I'm not sure if there are other top-level fields that we would miss out on.

jamesls (Member, Author) commented Jul 3, 2018

I have an implementation of this proposal, and I've noticed one thing that might be problematic. As far as I can tell, if you have a batch size of more than 1, then either the entire batch succeeds (and all the messages are deleted) or the entire batch fails if an exception is raised. I don't see any way to handle partial failures of the message batch.

With that in mind, I've been experimenting with abstractions that could handle partial failures, but none of them are that great. It seems like we should provide some way for a user to indicate that an individual message was successfully handled. No matter what, we still need to raise an exception to let Lambda know not to delete messages, which I'd prefer to avoid.

I suppose this could be an additive thing later, but I wonder how problematic this is going to be.

@kadrach You are correct, creating the SNS/SQS resources is part of the managed resources proposal, which we still plan on doing, but we wanted to wrap up the event source work first.

EDIT: meant to say you're correct, not incorrect.

jamesls (Member, Author) commented Jul 3, 2018

@kyleknap both the top-level event passed in and each individual record subclass BaseLambdaEvent, so they both have a to_dict() method.

kyleknap (Contributor) commented Jul 3, 2018

@jamesls sounds good about the to_dict(). That is interesting about the partial failures. I am a little hesitant to make partial failures the default behavior. Seems a little too implicit as we are not polling for messages. We are given a list with multiple messages.

I do think it would be nice to have an abstraction to handle partial failures so you can make sure you have seen all of your messages at least once, but I think we should be explicit about it. What abstractions have you been considering?

Thinking out loud, one option that might be nice to add is support for dead-letter queues; that way, if the entire set of messages fails and gets deleted, they are not gone for good.

jamesls (Member, Author) commented Jul 3, 2018

I do think it would be nice to have an abstraction to handle partial failures so you can make sure you have seen all of your messages at least once

No, that's not the issue I'm getting at. The default behavior now (which comes for free from Lambda) is that the messages either all succeed if your Lambda function returns successfully, or they all fail. This means that out of a batch of 10, if I have 9 that I successfully handle and 1 that fails (and raises an exception to indicate the failure), then I will see all 10 messages again. There's never a case where you won't see your messages at least once.

The behavior that I want is that if 9 messages succeed and 1 fails, then I should only see the 1 failed message again. The other 9 are deleted.
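That all-or-nothing semantic can be simulated in a few lines (the names here are illustrative, not Lambda or chalice APIs):

```python
def invoke_batch(handler, bodies):
    # Mimics Lambda's SQS behavior: the batch is deleted only if
    # the handler returns without raising; otherwise every message,
    # including the ones that were handled fine, is redelivered.
    try:
        handler(bodies)
    except Exception:
        return list(bodies)  # all come back, not just the bad one
    return []                # success: the whole batch is deleted


def handler(bodies):
    # Handles every message except a "poison" one, which raises.
    for body in bodies:
        if body == 'poison':
            raise RuntimeError('could not process %r' % body)
```

Here `invoke_batch(handler, ['ok'] * 9 + ['poison'])` redelivers all 10 messages even though 9 were handled successfully, which is exactly the behavior in question.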

Seems a little too implicit as we are not polling for messages.

Can you elaborate on how this is too implicit? My suggestion was to let the user tell us somehow which messages succeeded and which didn't ("It seems like we should provide some way for a user to say that an individual message was successfully handled."). If a user tells us a message was handled successfully, it seems reasonable that they would expect not to see the message again.

The thing I'm not a fan of (and that we can't change) is that an exception must be raised to tell Lambda not to delete any messages. I would prefer a way to let Lambda know which messages succeeded and which failed without using exceptions. Essentially, I want uncaught exceptions to be reserved for cases where things really went wrong. I'd like something more like DynamoDB's BatchWriteItem (i.e., a list of items that succeeded and a list of items that failed).

Let me know if I'm still not making sense and I can try to put together a more concrete example.

jamesls (Member, Author) commented Jul 3, 2018

FWIW, I have a branch that's working if you want to play around with this feature: https://github.com/aws/chalice/compare/master...jamesls:sqs-events?expand=1. The only functionality missing is SAM support, which should just be a few lines.

There are still some things about it that need to be fixed before it's ready for a PR, but it's enough to test out.

kyleknap (Contributor) commented Jul 3, 2018

Talked offline about it. The partial failure behavior definitely seems concerning to me. It looks like there is some research we need to do here to figure out what the appropriate behavior should be before we can launch this in chalice.

jamesls (Member, Author) commented Jul 5, 2018

After mulling it over for a few days, I don't think we should add any built in support for partial failures just yet. The rationale being:

  • If you're willing to trade a higher request count, you can just set your batch size to 1. Arguably this should be the default in chalice; it's the default in the Lambda API (though the console defaults to 10). Then the whole issue of partial failures goes away.
  • If you have a batch size > 1, you can always just delete the messages yourself as you process them. This is essentially what chalice would be doing for you. To help, I think we should map the receipt handle as an attribute on the SQSRecord so you can just say sqs.delete_message(record.receipt_handle).
  • We can add this later in a backwards compatible way. Similar to what we did with the Response class for API gateway, we can eventually support some sort of SQSResponse that has the list of success/failures if we decide to support this.
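The delete-as-you-go idea from the second bullet could be sketched as below. An SQSRecord with a receipt_handle attribute is an assumption from this proposal, and the delete callable is injected to keep the sketch self-contained; in a real app it would wrap boto3's sqs.delete_message(QueueUrl=..., ReceiptHandle=...).

```python
class SQSRecord(object):
    # Stand-in for the proposed record type, which would expose
    # the receipt handle as an attribute.
    def __init__(self, body, receipt_handle):
        self.body = body
        self.receipt_handle = receipt_handle


def process_batch(records, handle, delete_message):
    # Handle each record and delete it immediately on success, so a
    # retry of the batch only redelivers the records that failed.
    failures = []
    for record in records:
        try:
            handle(record)
        except Exception:
            failures.append(record)
        else:
            delete_message(record.receipt_handle)
    if failures:
        # An exception is still required to tell Lambda not to
        # delete whatever is left of the batch.
        raise RuntimeError('%d record(s) failed' % len(failures))
```

On a retry, the already-deleted records are gone from the queue, so only the failed ones come back, which approximates partial-failure handling without any framework support.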
