-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MSQ: Validate that strings and string arrays are not mixed. #15920
MSQ: Validate that strings and string arrays are not mixed. #15920
Conversation
When multi-value strings and string arrays coexist in the same column, it causes problems with "classic MVD" style queries such as: select * from wikipedia -- fails at runtime select count(*) from wikipedia where flags = 'B' -- fails at planning time select flags, count(*) from wikipedia group by 1 -- fails at runtime To avoid these problems, this patch adds type verification for INSERT and REPLACE. It is targeted: the only type changes that are blocked are string-to-array and array-to-string. There is also a way to exclude certain columns from the type checks, if the user really knows what they're doing.
This patch does not yet include docs and tests, so marking it draft. The idea is there though. |
Depending on which direction you're going in, the errors look like either:
Or:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
) | ||
{ | ||
if (targetTable == null) { | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose this doesn't necessarily need to be done as part of this PR, but I do want to see at some point that this would have a validation failure for the case where the output type has any ARRAY and arrayIngestMode
isn't set to array
, for the effect of preventing people from writing new insert queries that are incorrectly relying on the mvd behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking more about the validations that are in effect when arrayIngestMode: array
, since those validations would help us change the default.
Maybe when we change the default, we can also consider making it so when people explicitly set arrayIngestMode: mvd
, they need to use ARRAY_TO_MV
. (I think that's what you're suggesting here.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe when we change the default, we can also consider making it so when people explicitly set arrayIngestMode: mvd, they need to use ARRAY_TO_MV. (I think that's what you're suggesting here.)
yea, I'm mostly thinking about it in terms of forcing people to write the correct queries as soon as possible in either mode to make things less painful down the road, and was thinking it would be nice to prevent bad habits for new tables before they have a chance to get started since those habits are going to have to change.
Its ok to wait until we change the default i guess, which this change should allow us to do much sooner (druid 30 hopefully)
@@ -152,6 +154,7 @@ public class MultiStageQueryContext | |||
public static final String CTX_ARRAY_INGEST_MODE = "arrayIngestMode"; | |||
public static final ArrayIngestMode DEFAULT_ARRAY_INGEST_MODE = ArrayIngestMode.MVD; | |||
|
|||
public static final String CTX_SKIP_TYPE_VERIFICATION = "skipTypeVerification"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would we have docs for this parameter? If so, it'll be another parameter that we'd have to support in MSQ. If not, then I don't see a way for users to set this parameter, both of which seem ugly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we do without this parameter, since the user can use ARRAY_TO_MV or MV_TO_ARRAY functions to preserve the behaviour?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The purpose of this parameter is to allow someone to explicitly decide to ingest mismatched types. I don't think ARRAY_TO_MV
and MV_TO_ARRAY
would be helpful, since they're more something that users could use to make types line up properly.
One scenario for this parameter might be that someone is migrating from MVD to array, and has updated their queries to work with mismatched array and MVD types: they added MV_TO_ARRAY
around all their MVDs, and have migrated to UNNEST
, etc. They want to start ingesting arrays rather than MVDs for new data, but don't want to go back and rewrite all their old data, because that would be too expensive.
I do agree it's not really ideal for it to be a new context parameter. I just couldn't think of another approach to allowing validation bypass for specific columns, without a table-level catalog.
Maybe some new SQL syntax?
I wonder if you have any thoughts here? (or anyone else?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the context flag seems ok to me, especially since i don't think we're quite yet in a place where a lot of people will want to be using a mix of mvds and arrays, we are missing optimized versions of MV_TO_ARRAY
and ARRAY_TO_MV
to make things a lot more efficient to use a mix of stuff.
I feel like once we've transitioned to array
being default arrayIngestMode
, given some time for it to settle, and then eventually remove arrayIngestMode
to drop all of the other behaviors, we might be able to consider removing this validation in favor of pushing it to the catalog since the old habits would be gone and at that point it should also be a lot less painful to work with a mix of stuff. But we can revisit once arrayIngestMode
is removed.
final ColumnType newDruidType = | ||
DimensionSchemaUtils.getDimensionType(Calcites.getColumnTypeForRelDataType(newSqlType), arrayIngestMode); | ||
|
||
if (newDruidType.isArray() && oldDruidType.is(ValueType.STRING) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like we are building a DDL layer, but for specific types, and for MSQ ingestion, which seems kinda off, given that Druid allows a flexible schema. However, I also think that most of the cases triggering this would be accidental, rather than deliberate. Do we plan on getting rid of these validations, once arrayIngestMode = array is the only acceptable argument to MSQ ingestion, and we are not stuck between supporting incorrect behavior, and allowing only correct but incompatible behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This validation is really just to prevent accidents that break queries, due a pretty unique situation:
- due to the complexity around
arrayIngestMode
it is easy to accidentally ingest a list of strings as the "wrong" type accidentally (MVD when you meant ARRAY or vice-versa). - when a single column has both MVD and ARRAY in it, existing queries can break for the reasons I mentioned in the PR description.
This situation is problematic but also only really exists for the MVD/array combinaton. So I don't intent this to be the start of a DDL layer, just a point fix to this particular problem. I do think we want a catalog layer that can be part of more extensive type validations, but that should be opt-in & user-driven & more flexible than some hard-coded checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation!
Taking this out of draft since I've added tests and docs. I have also added error message examples to the PR description. |
This causes them to be stored as [multi-value strings](multi-value-dimensions.md), using the same `STRING` column type | ||
as regular scalar strings. SQL `BIGINT ARRAY` and `DOUBLE ARRAY` cannot be loaded under `arrayIngestMode: mvd`. This | ||
is the default behavior when `arrayIngestMode` is not provided in your query context, although the default behavior | ||
may change to `array` in a future release. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: will change
) EXTEND ( | ||
"timestamp" VARCHAR, | ||
"label" VARCHAR, | ||
"arrayString" VARCHAR ARRAY, | ||
"arrayLong" BIGINT ARRAY, | ||
"arrayDouble" DOUBLE ARRAY |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for fixing these up, i wrote these docs before i fixed up EXTEND
to support array types 🤘
) | ||
{ | ||
if (targetTable == null) { | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe when we change the default, we can also consider making it so when people explicitly set arrayIngestMode: mvd, they need to use ARRAY_TO_MV. (I think that's what you're suggesting here.)
yea, I'm mostly thinking about it in terms of forcing people to write the correct queries as soon as possible in either mode to make things less painful down the road, and was thinking it would be nice to prevent bad habits for new tables before they have a chance to get started since those habits are going to have to change.
Its ok to wait until we change the default i guess, which this change should allow us to do much sooner (druid 30 hopefully)
{ | ||
if (queryType == null) { | ||
// if schema information is not available, create a string dimension | ||
return ColumnType.STRING; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i wonder if this should return null
to make createDimensionSchema
decided to use the string schema (in case down the line we ever want to switch that to the auto schema)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I held this condition over from the previous code, but I'm not sure that the type can actually ever be null. This method is used in a couple places:
- by MSQ
ControllerImpl
, which checks for null before passing it; if it's null a "No type for column" exception is thrown. - by the new validation code I added; but here
null
only arises ifCalcites.getColumnTypeForRelDataType(newSqlType)
isnull
, which seems unlikely for a SQL type that appears in an already-planned INSERT or REPLACE statement. The fact that it's already been planned means we should have Druid types corresponding to all the SQL types.
So, maybe we could remove the @Nullable
and make this an error.
Should I try that now, or want to leave it for later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, I should have looked at the callers, that makes sense, leaving for now and making an error later seems fine to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just built and plaid with it. Looks great.
…5920) * MSQ: Validate that strings and string arrays are not mixed. When multi-value strings and string arrays coexist in the same column, it causes problems with "classic MVD" style queries such as: select * from wikipedia -- fails at runtime select count(*) from wikipedia where flags = 'B' -- fails at planning time select flags, count(*) from wikipedia group by 1 -- fails at runtime To avoid these problems, this patch adds type verification for INSERT and REPLACE. It is targeted: the only type changes that are blocked are string-to-array and array-to-string. There is also a way to exclude certain columns from the type checks, if the user really knows what they're doing. * Fixes. * Tests and docs and error messages. * More docs. * Adjustments. * Adjust message. * Fix tests. * Fix test in DV mode.
#15920) (#16160) * Cherry-picking 15920-to-29.0.1 * Fixing extra test case which got added as part of merge --------- Co-authored-by: Gian Merlino <[email protected]>
When multi-value strings and string arrays coexist in the same column, it causes problems with "classic MVD" style queries such as:
To avoid these problems, this patch adds type verification for INSERT and REPLACE. It is targeted: the only type changes that are blocked are string-to-array and array-to-string. There is also a way to exclude certain columns from the type checks, if the user really knows what they're doing.
The patch includes new documentation about
arrayIngestMode
; seedocs/querying/arrays.md
. Other mentions of it throughout the docs are updated to link to this section.Error scenarios:
Write SQL ARRAY to existing ARRAY field in
arrayIngestMode: mvd
Write SQL VARCHAR to existing ARRAY field in
arrayIngestMode: mvd
Write SQL VARCHAR to existing ARRAY field in
arrayIngestMode: array
Write SQL ARRAY to existing VARCHAR field in
arrayIngestMode: array