Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding query stack fault to MSQ to capture native query errors. #13926

Merged
merged 4 commits into from
Apr 5, 2023

Conversation

cryptoe
Copy link
Contributor

@cryptoe cryptoe commented Mar 13, 2023

Description

  • Add a new fault "QueryRuntimeError" to MSQ engine to capture native query errors.
  • Fixed bug in MSQ fault tolerance where worker were being retried if UnexpectedMultiValueDimensionException was thrown.
  • An exception from the query runtime with org.apache.druid.query as the package name is thrown as a QueryRuntimeError

Release note

Add a new fault "QueryRuntimeError" to MSQ engine to capture native query errors.
Fixed bug in MSQ fault tolerance where worker were being retried if UnexpectedMultiValueDimensionException was thrown.


Key changed/added classes in this PR
  • MSQErrorReport
  • QueryRuntimeError

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.

@cryptoe cryptoe added the Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 label Mar 13, 2023
Copy link
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a new error code QueryStackError that is only used in one case: when the query stack throws UnexpectedMultiValueDimensionException. Did you mean to associate this error code with more kinds of problems?

If we keep this design, also, to me QueryRuntimeError makes more sense than QueryStackError. I think end users are more likely to think of the thing that runs queries as a runtime vs a stack. ("Stack of what?")

@cryptoe
Copy link
Contributor Author

cryptoe commented Mar 14, 2023

I earlier made changes in BaseLeafFrameProcessor#runIncrementally method but since that would also throw FrameSpecific exceptions I dropped that idea.

There is a new error code QueryStackError that is only used in one case: when the query stack throws UnexpectedMultiValueDimensionException. Did you mean to associate this error code with more kinds of problems?

I was thinking we would eventually add more errors and UnexpectedMultiValueDimensionException is just the start.

Another approach was to get the exception and check if the exception class name starts with org.apache.druid.query . It might be a hack but it may work for about 10 exceptions. WDYT ?

Screenshot 2023-03-14 at 12 42 18 PM

If we keep this design, also, to me QueryRuntimeError makes more sense than QueryStackError

Sure I can rename QueryStackError to QueryRuntimeError.

Copy link
Contributor

@LakshSingla LakshSingla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes 🚀 . Agreeing with @gianm's comments on 2 things:

  1. Renaming the fault class name
  2. This seems like a one-off case that we handle currently. Can we identify the cases where we defer to the native query engine for our processing, and wrap the errors there in this new Fault. For example, in the GroupByPreShuffleProcessor, we create the frame writer and iterate over it. I remember adding a similar fault a while back, and I ran into the same issue so I wrapped a large portion of the code (which might encapsulate the frame errors as well). Is there a way to change the design somehow (maybe in a separate PR), so that we can identify the frame errors v/s rest of the errors easily?

The second point above might not be possible in this PR, so I am okay with the changes, and you can add a comment in the base class or the MSQErrorReport's code to mention the possibility of a better design.

@@ -679,6 +679,7 @@ The following table describes error codes you may encounter in the `multiStageQu
| <a name="error_InsertTimeOutOfBounds">`InsertTimeOutOfBounds`</a> | A REPLACE query generated a timestamp outside the bounds of the TIMESTAMP parameter for your OVERWRITE WHERE clause.<br /> <br />To avoid this error, verify that the you specified is valid. | `interval`: time chunk interval corresponding to the out-of-bounds timestamp |
| <a name="error_InvalidNullByte">`InvalidNullByte`</a> | A string column included a null byte. Null bytes in strings are not permitted. | `column`: The column that included the null byte |
| <a name="error_QueryNotSupported">`QueryNotSupported`</a> | QueryKit could not translate the provided native query to a multi-stage query.<br /> <br />This can happen if the query uses features that aren't supported, like GROUPING SETS. | |
| <a name="error_QueryStackError">`QueryStackError`</a> | MSQ uses the native query engine to run the leaf stages. This error tells MSQ that error is in native query engine.<br /> <br /> Since this is a generic error, the user needs to look at the error message and stack trace to figure out the course of action. If the user is stuck, consider raising a github issue for assistance. | `baseErrorMessage` error message from the native stack. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should rename either the description here to mention QueryStack (which seems incomplete, or rename the fault to QueryStackErrorFault because all of the faults in the docs are derived from their class names by removing the -Fault suffix.

Copy link
Contributor

@LakshSingla LakshSingla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the tests, and the changes LGTM. We should look for a way to reduce the if..else conditions in the report generation in subsequent PRs.

Copy link
Contributor

@kfaraz kfaraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@cryptoe cryptoe merged commit e6a1170 into apache:master Apr 5, 2023
@clintropolis clintropolis added this to the 26.0 milestone Apr 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area - Documentation Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants