
Unnest functionality for Druid #13268

Merged: 38 commits merged into apache:master on Dec 3, 2022

Conversation

@somu-imply (Contributor) commented Oct 27, 2022

Implementation of unnest.
Unnest has been implemented as a data source. An unnest data source has the following fields (a minimal example of the dataSource block follows the list):

  • base (the base data source to be unnested)
  • column (the column in the base data source to be unnested)
  • outputName (the name under which the unnested column will be exposed)
  • allowList (optionally restricts unnesting to a subset of the values of a multi-value column)
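For illustration, a minimal sketch of just the dataSource block built from these fields (it uses the same table and column names as the query examples below):

{
  "type": "unnest",
  "base": { "type": "table", "name": "foo" },
  "column": "dim3",
  "outputName": "unnest-dim3",
  "allowList": null
}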

Segment references and storage adapters have been created. Two different cursors have also been added: one for dictionary-encoded columns (DimensionUnnestCursor) and one for column values without dictionary encoding (ColumnarValueUnnestCursor).

Queries supported are:

  1. Scan
    { "queryType": "scan", "dataSource": { "type": "unnest", "base": { "type": "table", "name": "foo" }, "column": "dim3", "outputName": "unnest-dim3" }, "intervals": { "type": "intervals", "intervals": [ "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z" ] }, "limit": 1000, "columns": [ "__time", "dim1", "dim2", "dim3", "m1", "m2", "unnest-dim3" ], "legacy": false, "granularity": { "type": "all" }, "context": { "debug": true, "useCache": false } }

(Screenshot: scan query results)

  2. GroupBy

{ "queryType": "groupBy", "dataSource": { "type": "unnest", "base": "foo", "column": "dim3", "outputName": "unnest-dim3", "allowList": null }, "intervals": ["-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"], "granularity": "all", "dimensions": [ "unnest-dim3" ], "limitSpec": { "type": "default", "columns": [ { "dimension": "unnest-dim3", "direction": "descending" } ], "limit": 1001 }, "context": { "debug": true } }

(Screenshot: groupBy query results)

  3. TopN

{ "queryType": "topN", "dataSource": { "type": "unnest", "base": { "type": "table", "name": "foo" }, "column": "dim3", "outputName": "unnest-dim3", "allowList": null }, "dimension": { "type": "default", "dimension": "dim2", "outputName": "d0", "outputType": "STRING" }, "metric": { "type": "inverted", "metric": { "type": "numeric", "metric": "a0" } }, "threshold": 3, "intervals": { "type": "intervals", "intervals": [ "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z" ] }, "granularity": { "type": "all" }, "aggregations": [ { "type": "floatMin", "name": "a0", "fieldName": "m1" } ], "context": { "debug": true } }

(Screenshot: topN query results)

Additionally, users can add allowLists:

(Screenshot: query using an allowList and its results)
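For example, a sketch of the same dataSource block with an allowList (the values shown are illustrative, not taken from the screenshot):

{
  "type": "unnest",
  "base": { "type": "table", "name": "foo" },
  "column": "dim3",
  "outputName": "unnest-dim3",
  "allowList": ["a", "b"]
}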

Filters can also be specified alongside allowLists:
(Screenshot: query combining a filter with an allowList, and its results)

This allows nesting as well, with one unnest data source as the base of another:

{
  "queryType": "scan",
  "dataSource": {
    "type": "unnest",
    "base": {
      "type": "unnest",
      "base": {
        "type": table,
        "name": foo1
      },
      "column": "dim3",
      "outputName": "unnest-dim3",
      "allowList": ["a"]
    },
    "column": "dim3",
    "outputName": "unnest-dim3-again",
    "allowList": ["b","d"]
  },
  "intervals": {
    "type": "intervals",
    "intervals": [
      "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"
    ]
  },
  "limit": 1000,
  "columns": [
    "__time",
    "dim1",
    "dim2",
    "dim3",
    "m1",
    "m2",
    "unnest-dim3",
    "unnest-dim3-again"
  ],
  "legacy": false,
  "granularity": {
    "type": "all"
  },
  "context": {
    "debug": true,
    "useCache": false
  }
}

Multiple levels can also be combined, involving unnests, joins, and query data sources:

{
  "queryType": "scan",
  "dataSource": {
    "type": "unnest",
    "base": {
        "type": "join",
        "left": {
          "type": "table",
          "name": "foo1"
        },
        "right": {
          "type": "query",
          "query": {
            "queryType": "scan",
            "dataSource": {
              "type": "table",
              "name": "foo1"
            },
            "intervals": {
              "type": "intervals",
              "intervals": [
                "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"
              ]
            },
            "virtualColumns": [
              {
                "type": "expression",
                "name": "v0",
                "expression": "\"m2\"",
                "outputType": "FLOAT"
              }
            ],
            "resultFormat": "compactedList",
            "columns": [
              "__time",
              "dim1",
              "dim2",
              "dim3",
              "m1",
              "m2",
              "v0"
            ],
            "legacy": false,
            "context": {
              "queryId": "65c2cd34-2aa8-4370-9a41-01835cacc5fd",
              "sqlOuterLimit": 1001,
              "sqlQueryId": "65c2cd34-2aa8-4370-9a41-01835cacc5fd",
              "useNativeQueryExplain": true
            },
            "granularity": {
              "type": "all"
            }
          }
        },
        "rightPrefix": "j0.",
        "condition": "(\"m1\" == \"j0.v0\")",
        "joinType": "INNER"
      },
    "column": "dim3",
    "outputName": "unnest-dim3",
    "allowList": []
    },
  "intervals": {
    "type": "intervals",
    "intervals": [
      "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"
    ]
  },
  "resultFormat": "compactedList",
  "limit": 1001,
  "columns": [
    "__time",
    "dim1",
    "dim2",
    "dim3",
    "j0.__time",
    "j0.dim1",
    "j0.dim2",
    "j0.dim3",
    "j0.m1",
    "j0.m2",
    "m1",
    "m2",
    "unnest-dim3"
  ],
  "legacy": false,
  "context": {
    "queryId": "65c2cd34-2aa8-4370-9a41-01835cacc5fd",
    "sqlOuterLimit": 1001,
    "sqlQueryId": "65c2cd34-2aa8-4370-9a41-01835cacc5fd",
    "useNativeQueryExplain": true
  },
  "granularity": {
    "type": "all"
  }
}

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@somu-imply changed the title from "Moving all unnest cursor code atop refactored code for unnest" to "Unnest functionality for Druid" on Nov 6, 2022
@somu-imply marked this pull request as ready for review on November 6, 2022 22:45
}

/**
* Create an unnest dataSource from a string condition.

Contributor:
What is this comment trying to tell me?

Comment on lines 142 to 159
return JvmUtils.safeAccumulateThreadCpuTime(
cpuTimeAccumulator,
() -> {
if (column == null) {
return Function.identity();
} else if (column.isEmpty()) {
return Function.identity();
} else {
return baseSegment ->
new UnnestSegmentReference(
baseSegment,
column,
outputName,
allowList
);
}
}
);

Contributor:
This code doesn't seem to be delegating to its child, do you have any tests that test for, e.g. nesting of these things?

@Override
public DataSource withUpdatedDataSource(DataSource newSource)
{
return null;

Contributor:
Is this never called? If it is, my guess is that it will produce an NPE. Maybe include a comment about why it is safe to do this?

@@ -125,6 +126,11 @@ public static DataSourceAnalysis forDataSource(final DataSource dataSource)
current = subQuery.getDataSource();
}

while (current instanceof UnnestDataSource) {

Contributor:
If there is a Query of an Unnest of a Query of an Unnest, the way that you have interleaved these is not going to completely unwrap the objects as expected.

This DataSourceAnalysis thing is probably another thing to move onto the DataSource object itself... Not sure if we should do that now or leave it as something to do for later though. either way, you need both conditions (check for Query and check for Unnest) on the while loop above.
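A sketch of the combined unwrapping the comment suggests (illustrative only; it assumes a getBase() accessor on UnnestDataSource and omits the sub-query bookkeeping that DataSourceAnalysis actually performs):

while (current instanceof QueryDataSource || current instanceof UnnestDataSource) {
  if (current instanceof QueryDataSource) {
    current = ((QueryDataSource) current).getQuery().getDataSource();
  } else {
    current = ((UnnestDataSource) current).getBase();
  }
}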

Comment on lines 46 to 51
public ColumnarValueUnnestCursor(
Cursor cursor,
String columnName,
String outputColumnName,
LinkedHashSet<String> allowSet
)

Contributor:
It should be safe to pass in the baseColumnSelectorFactory directly. Once you've made the decision to use this object, you should already have a good column selector factory to use.

Comment on lines 126 to 135
if (availableDimensions.contains(outputColumnName)) {
throw new IAE(
"Provided output name [%s] already exists in table to be unnested. Please use a different name.",
outputColumnName
);
} else {
availableDimensions.add(outputColumnName);
}
return new ListIndexed<>(Lists.newArrayList(availableDimensions));
}

Contributor:
Why is it bad for the output name to already exist?

@Override
public int getNumRows()
{
return 0;

Contributor:
I'm unsure if it's safe to return 0 from this... We should double check what uses this.

Member:
i think this is fine, only segment metadata uses it and some metrics about segment row counts


// TODO: Use placementish column in the QueryRunnerHelperTest
// and set up native queries
public class UnnestQueryRunnerTest extends InitializedNullHandlingTest

Contributor:
Is it intentional that this test is completely empty?

this.baseList = inputList;
}

void populateList()

Contributor:
I think what you want is a static method that builds a ListCursor rather than a method that can get called at any point in time to mutate and changes the internals of the ListCursor object.

Comment on lines 271 to 281
Object dimSelectorVal = dimSelector.getObject();
Assert.assertNotNull(dimSelector.getRow());
Assert.assertNotNull(dimSelector.getValueCardinality());
Assert.assertNotNull(dimSelector.makeValueMatcher(OUTPUT_COLUMN_NAME));
Assert.assertNotNull(dimSelector.idLookup());
Assert.assertNotNull(dimSelector.lookupName(0));
Assert.assertNotNull(dimSelector.defaultGetObject());
Assert.assertFalse(dimSelector.isNull());
if (dimSelectorVal == null) {
Assert.assertNull(dimSelectorVal);
}

Contributor:
The assertions here should be updated. You should be able to know and validate the specific ValueCardinality, and this seems to always be looking up the value for the 0 index, which shouldn't be correct. If the test isn't actually walking through rows with different dictionary values, it's not really validating what we need it to.

return baseColumnSelectorFactory.makeDimensionSelector(dimensionSpec);
}

//final DimensionSpec actualDimensionSpec = dimensionSpec.withDimension(columnName);

Contributor Author:
I'll remove this and the other commented out line

outputName,
allowList
return
segmentMapFn.andThen(

Contributor:
This is a style thing, but this sort of fluent style tends to produce hard-to-read stack traces if there are any errors. It creates stack traces with lines from the Function class rather than from UnnestDataSource. Generally speaking, only use a fluent style when the fluency doesn't go outside of the current scope of the code. If you are returning an object that is going to be used by someone else, create a closure.
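A minimal, self-contained sketch of the style point (names are illustrative, not Druid code): composing with andThen() produces stack frames inside java.util.function.Function, while an explicit closure keeps the enclosing class in the trace.

import java.util.function.Function;

class SegmentMapFnStyle
{
  // Fluent composition: failures surface as frames inside Function internals.
  static Function<String, String> fluent(Function<String, String> base)
  {
    return base.andThen(s -> s + "-unnested");
  }

  // Explicit closure: failures surface with this class in the stack trace.
  static Function<String, String> closure(Function<String, String> base)
  {
    return input -> {
      final String mapped = base.apply(input);
      return mapped + "-unnested";
    };
  }
}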

Contributor Author:
Addressed, creating new object now

if (!outputName.equals(dimensionSpec.getDimension())) {
return baseColumSelectorFactory.makeDimensionSelector(dimensionSpec);
}
return baseColumSelectorFactory.makeDimensionSelector(DefaultDimensionSpec.of(columnName));

Contributor:
I'm perhaps missing something, but this seems to be just delegating to the base and returning without attempting to do any unnesting?

You likely haven't run into this because you are doing a validation ahead of time for what the column can be. As such, the correct answer here might be to throw an UnsupportedOperationException instead, as you are expecting the user to be calling the ColumnValueSelector option instead.

Contributor Author:
Addressed, added a unit test for the exception as well

@abhishekagarwal87 (Contributor) left a comment:
would docs come in a follow-up PR?
I haven't yet reviewed the cursor classes though those too could use some javadocs to explain what they are doing.

@Override
public byte[] getCacheKey()
{
return null;

Contributor:
how does caching work for this data source?

@somu-imply (Contributor Author), Nov 30, 2022:
I have kept this as null for now. Caching can be turned on by changing this to:

public byte[] getCacheKey()
  {
    return new byte[0];
  }

This is similar to the other data sources that are involved in caching like TableDataSource

Member:
i think the column being unnested would need to be part of the cache key since the reason table datasources can get away with an empty cache key is because that is part of the segmentId. However here the results are dependent on what is being unnested, so we can't rely on just the datasource name, so a cache key would need to be non-empty
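A hedged sketch of what a non-empty key might look like (illustrative only, not the PR's implementation; a real key would probably also need to fold in the base data source's key and the allowList):

@Override
public byte[] getCacheKey()
{
  // Include the unnested column and output name so results of different unnests don't collide.
  final String key = "unnest:" + column + ":" + outputName;
  return key.getBytes(java.nio.charset.StandardCharsets.UTF_8);
}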

Contributor:
getCacheKey is documented as

  /**
   * Compute a cache key prefix for a data source. This includes the data sources that participate in the RHS of a
   * join as well as any query specific constructs associated with join data source such as base table filter. This key prefix
   * can be used in segment level cache or result level cache. The function can return following
   * - Non-empty byte array - If there is join datasource involved and caching is possible. The result includes
   * join condition expression, join type and cache key returned by joinable factory for each {@link PreJoinableClause}
   * - NULL - There is a join but caching is not possible. It may happen if one of the participating datasource
   * in the JOIN is not cacheable.
   *
   * @return the cache key to be used as part of query cache key
   */

Meaning that a null return type should disable caching. We should likely be even more explicit and set isCacheable to return false.

import java.util.LinkedHashSet;
import java.util.List;

public class ColumnarValueUnnestCursor implements Cursor

Contributor:
can you add some javadocs here about this class?

Contributor Author:
Added javadocs

this.baseAdapter = baseAdapter;
this.dimensionToUnnest = dimension;
this.outputColumnName = outputColumnName;
this.allowSet = allowSet;

Contributor:
what is special about allowSet that it gets its own variable? Is it just a filter or something more?

@somu-imply (Contributor Author), Nov 30, 2022:
An inFilter on an MVD returns the entire row value in case of any match:

(Screenshot: inFilter query returning whole multi-value rows)

The allowSet lets us filter inside an MVD so that only the values specified in the allowList are unnested and the others are ignored. So if we want to unnest only 'a' and 'b' here, we need to add them to the allowList like the one below:

(Screenshot: the same query with an allowList of 'a' and 'b')
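To make that concrete (reusing the rows from the cursor javadoc quoted below): for a row ['a', 'b', 'c'], an inFilter matching 'a' returns the whole row ['a', 'b', 'c'], whereas unnesting with allowList ["a", "b"] emits two output rows, 'a' and 'b', and skips 'c'.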

public ColumnCapabilities getColumnCapabilities(String column)
{
if (outputColumnName.equals(dimensionToUnnest)) {
return baseAdapter.getColumnCapabilities(column);

Contributor:
should the returned set of column capabilities always have hasMultipleValues to false?

Contributor Author:
This part delegates the column capabilities to those of the base adapter, so the properties depend on the column capabilities of the input column. I am not sure I understood this correctly though.

@Override
public DimensionSelector makeDimensionSelector(DimensionSpec dimensionSpec)
{
throw new UnsupportedOperationException("Dimension selector not applicable for column value selector");

Contributor:
Errr, you did too much! I was only talking about the one case where something asks for the column that is being unnested. It's totally possible that one of the other columns is being accessed as a DimensionSelector and you want to still allow for that.

Contributor Author:
fixed

Comment on lines 36 to 75
/**
* The cursor to help unnest MVDs with dictionary encoding.
* Consider a segment has 2 rows
* ['a', 'b', 'c']
* ['d', 'c']
*
* Considering dictionary encoding, these are represented as
*
* 'a' -> 0
* 'b' -> 1
* 'c' -> 2
* 'd' -> 3
*
* The baseCursor points to the row of IndexedInts [0, 1, 2]
* while the unnestCursor with each call of advance() moves over individual elements.
*
* advance() -> 0 -> 'a'
* advance() -> 1 -> 'b'
* advance() -> 2 -> 'c'
* advance() -> 3 -> 'd' (advances base cursor first)
* advance() -> 2 -> 'c'
*
* Total 5 advance calls above
*
* The allowSet if available helps skip over elements which are not in the allowList by moving the cursor to
* the next available match. The hashSet is converted into a bitset (during initialization) for efficiency.
* If allowSet is ['c', 'd'] then the advance moves over to the next available match
*
* advance() -> 2 -> 'c'
* advance() -> 3 -> 'd' (advances base cursor first)
* advance() -> 2 -> 'c'
*
* Total 3 advance calls in this case
*
* The index reference points to the index of each row that the unnest cursor is accessing
* The indexedInts for each row are held in the indexedIntsForCurrentRow object
*
* The needInitialization flag sets up the initial values of indexedIntsForCurrentRow at the beginning of the segment
*
*/

Contributor:
Awesome. 👍

public void advanceUninterruptibly()
{
do {
advanceAndUpdate();

Contributor:
What happens if baseCursor doesn't have any data? Once advanceAndUpdate is done, could indexedIntsForCurrentRow.get(index) in matchAndProceed throw an exception?

@somu-imply (Contributor Author), Nov 30, 2022:
If the base cursor does not have any data, execution never reaches this stage of unnest cursor creation, as the base cursor is already in an isDone==true state. Additionally, UnnestStorageAdapter ensures before cursor creation that the base cursor is non-null.

Contributor:
But the baseCursor is also advanced in advanceAndUpdate? So it's possible that baseCursor was not done before but got done during the invocation of this method. Maybe I am missing something, but advancing and then accessing the base cursor doesn't look right here.

Contributor:
Now that I think of it, it probably doesn't matter. What can happen is that index is reset to zero and indexedIntsForCurrentRow points to the last row, just before the loop is about to exit. matchAndProceed will not throw an exception since indexedIntsForCurrentRow would always have at least one entry.

Comment on lines 100 to 101
throw new UnsupportedOperationException(
"Dimension selector not applicable for column value selector for column " + outputName);

Contributor:
More design nits:

  1. We have a UOE that bundles String.format into the building of the exception use that.
  2. We encase interpolated values in [] to help differentiate things like extra spaces.
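A sketch of what those two nits point at, assuming Druid's UOE helper class (org.apache.druid.java.util.common.UOE), which accepts a String.format-style message; the exact message text here is illustrative:

// Sketch only: UOE formats the message, and the interpolated value is wrapped in [].
throw new UOE("Dimension selector is not applicable for column value selector for column[%s]", outputName);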

@somu-imply (Contributor Author) commented Dec 1, 2022

The ColumnarValueUnnestCursor was handling List or String values, but in the case of virtual columns the values are an array of objects. The change adds support for virtual columns and the following type of query. Thanks to @clintropolis for finding this.

{
  "queryType": "scan",
  "dataSource":{
    "type": "unnest",
    "base": {
      "type": "table",
      "name": "foo1"
    },
    "column": "v0",
    "outputName": "unnest-v0"
  },
  "intervals": {
    "type": "intervals",
    "intervals": [
      "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"
    ]
  },
  "virtualColumns": [
    {
      "type": "expression",
      "name": "v0",
      "expression": "array(\"m1\",\"m2\")",
      "outputType": "ARRAY<LONG>"
    }
  ],
  "resultFormat": "compactedList",
  "limit": 1001,
  "columns": [
    "unnest-v0"
  ],
  "legacy": false,
  "context": {
    "populateCache": false,
    "queryId": "d273facb-08cc-4de7-ac0b-d0b82173e531",
    "sqlOuterLimit": 1001,
    "sqlQueryId": "d273facb-08cc-4de7-ac0b-d0b82173e531",
    "useCache": false,
    "useNativeQueryExplain": true
  },
  "granularity": {
    "type": "all"
  }
}

(Screenshot: results of the virtual-column unnest query)

* The needInitialization flag sets up the initial values of unnestListForCurrentRow at the beginning of the segment
*
*/
public class ColumnarValueUnnestCursor implements Cursor

Member:
super nitpick, but why not just call this thing what it is doing, e.g. UnnestColumnValueSelectorCursor? Same thing with the other one, UnnestDimensionSelectorCursor

if (value == null) {
return 0;
}
return Double.valueOf((String) value);

Member:
I don't think you can count on casting to a string here, since it depends on the type of the underlying column value selector, same for other primitive numeric getters
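A hedged sketch of the kind of type-aware coercion the comment suggests (the helper is hypothetical, not part of the PR):

// Hypothetical helper: coerce the selector's object to a double without assuming it is a String.
static double coerceToDouble(Object value)
{
  if (value == null) {
    return 0;
  }
  if (value instanceof Number) {
    return ((Number) value).doubleValue();
  }
  return Double.parseDouble(String.valueOf(value));
}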

if (!outputName.equals(columnName)) {
baseColumSelectorFactory.getColumnCapabilities(column);
}
return baseColumSelectorFactory.getColumnCapabilities(columnName);

Member:
i don't think you want to strictly pass through the underlying capabilities. If the underlying column is a multi-value string, you need to return capabilities that have multiple values set to false since it is no longer a multi-value string, if the underlying capabilities is an ARRAY type, you need to return the element type of the array.

}
unnestListForCurrentRow.add(null);
} else {
if (currentVal instanceof List) {

Member:
I think you'll want to check for Object[] too, since that is the type we have been standardizing ARRAY types to deal in

Member:
ah this comment is stale now

// Helper class to help in returning
// getRow from the dimensionSelector
// This is set in the initialize method
private class SingleIndexInts implements IndexedInts

Contributor:
Because it reuses the index that is being incremented

Comment on lines 389 to 392
public int get(int idx)
{
return indexedIntsForCurrentRow.get(index);
}

Member:
this seems a bit confusing, passing this through to the underlying row's IndexedInts... size is 1, so the idx passed to this method should always be 0, no?

I guess I'm worried about silent bugs being possible by having it like this instead of the other SingleIndexInts, which can only possibly expose a single value.

Contributor:
It's not passing through the idx, it's using the index that gets incremented. This is required to preserve the semantics of the dictionary.

Comment on lines 102 to 104
ColumnCapabilities capabilities = cursor.getColumnSelectorFactory().getColumnCapabilities(dimensionToUnnest);
if (capabilities.isDictionaryEncoded() == ColumnCapabilities.Capable.TRUE
&& capabilities.areDictionaryValuesUnique() == ColumnCapabilities.Capable.TRUE) {

Member:
capabilities returned here are allowed to be null, suggest checking for nulls.

Also the statement can be slightly simplified
capabilities.isDictionaryEncoded().and(capabilities.areDictionaryValuesUnique()).isTrue()
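Putting the two suggestions together, a null-safe version of the check might look like this (illustrative only):

final ColumnCapabilities capabilities = cursor.getColumnSelectorFactory().getColumnCapabilities(dimensionToUnnest);
if (capabilities != null
    && capabilities.isDictionaryEncoded().and(capabilities.areDictionaryValuesUnique()).isTrue()) {
  // use the dictionary-encoded (DimensionSelector-based) unnest cursor
} else {
  // fall back to the ColumnValueSelector-based unnest cursor
}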

import java.util.List;

/**
* The cursor to help unnest MVDs without dictionary encoding.

Member:
this isn't specific to multi-value dimensions, since this also handles ARRAY typed selectors

public ColumnCapabilities getColumnCapabilities(String column)
{
if (!outputName.equals(columnName)) {
baseColumSelectorFactory.getColumnCapabilities(column);

Member:
missing return (also on the dim cursor)

@clintropolis (Member) left a comment:
going ahead and approving, but the column capabilities really need to be fixed up; it is ok with me if you do that as a follow-up...

it would also be nice to add native query tests to classes like GroupByQueryRunnerTest, TopNQueryRunnerTest, TimeseriesQueryRunnerTest, and ScanQueryRunnerTest to make sure everything works as expected with the different native query types, but it is fine to do that as a follow-up too. There is a multi-value dimension in the test data these tests use ('placementish'), and I believe there are some numeric columns as well, so numeric arrays can be tested with virtual columns.

return baseColumSelectorFactory.getColumnCapabilities(column);
}
// This currently returns the same type as of the column to be unnested
// This is fine for STRING types

Member:
nit: I think it's not really great for string types either, since if hasMultipleValues is set the engine will take less efficient paths and treat it as a multi-value string.

Btw, this is pretty easy to fix, all you need to do is something like this, but is fine to do as a follow-up too

      final ColumnCapabilities capabilities = baseColumSelectorFactory.getColumnCapabilities(columnName);
      if (capabilities.isArray()) {
        return ColumnCapabilitiesImpl.copyOf(capabilities).setType(capabilities.getElementType());
      }
      if (capabilities.hasMultipleValues()) {
        return ColumnCapabilitiesImpl.copyOf(capabilities).setHasMultipleValues(false);
      }
      return capabilities;

@somu-imply (Contributor Author), Dec 2, 2022:
This was a pretty minor change, so I'm adding it here. The rest will be added in the follow-up PR.

@clintropolis clintropolis merged commit 9177419 into apache:master Dec 3, 2022
@clintropolis clintropolis added this to the 26.0 milestone Apr 10, 2023