additional functionalities to support packed-struct encoding #3187
some facts:
fact 1:
from lance.file import LanceFileReader, LanceFileWriter
import pyarrow as pa
# Define the fields for the struct column
fields = [
    pa.field('x', pa.uint32()),
    pa.field('y', pa.uint32()),
]
# Create the struct type
struct_type = pa.struct(fields)
# Create 8 rows of data for the struct column
data = [
    {'x': 1, 'y': 2},
    {'x': 4, 'y': 5},
    {'x': 7, 'y': 8},
    {'x': 10, 'y': 11},
    {'x': 13, 'y': 14},
    {'x': 16, 'y': 17},
    {'x': 19, 'y': 20},
    {'x': 22, 'y': 23},
]
lance_file_path = "/home/x/packed-struct.lance"
# Convert the data to a list of structs
struct_array = pa.array(data, type=struct_type)
# Define the new int32 column
int32_column = pa.array([1, 2, 3, 4, 5, 6, 7, 8], type=pa.int32())
# Define the metadata
metadata = {b'packed': b'true'}
struct_field = pa.field('struct_col', struct_type, metadata=metadata)
int32_field = pa.field('int_col', pa.int32())
second_struct_field = pa.field('second_struct_col', struct_type)
# Create a schema with the updated fields
schema = pa.schema([struct_field, second_struct_field])
# Create a table from the packed struct column and a duplicate, non-packed struct column
table = pa.Table.from_arrays([struct_array, struct_array], schema=schema)
# Write the table to a Lance file
with LanceFileWriter(lance_file_path, version="2.1") as writer:
    writer.write_batch(table)
print("Data written to Lance file successfully.")
# Read the Lance file and display the contents
tab_lance = LanceFileReader(lance_file_path).read_all().to_table()
print(tab_lance.to_pandas())
has result:
observation: the result of
fact 2:
from lance.file import LanceFileReader, LanceFileWriter
import pyarrow as pa
# Define the fields for the struct column
fields = [
    pa.field('x', pa.uint64()),
    pa.field('y', pa.uint32()),
]
# Create the struct type
struct_type = pa.struct(fields)
# Create 8 rows of data for the struct column
data = [
    {'x': 1, 'y': 2},
    {'x': 4, 'y': 5},
    {'x': 7, 'y': 8},
    {'x': 10, 'y': 11},
    {'x': 13, 'y': 14},
    {'x': 16, 'y': 17},
    {'x': 19, 'y': 20},
    {'x': 22, 'y': 23},
]
lance_file_path = "/home/x/packed-struct.lance"
# Convert the data to a list of structs
struct_array = pa.array(data, type=struct_type)
# Define the new int32 column
int32_column = pa.array([1, 2, 3, 4, 5, 6, 7, 8], type=pa.int32())
# Define the metadata
metadata = {b'packed': b'true'}
struct_field = pa.field('struct_col', struct_type, metadata=metadata)
int32_field = pa.field('int_col', pa.int32())
second_struct_field = pa.field('second_struct_col', struct_type)
second_int32_field = pa.field('second_struct_col', pa.int32())
# Create a schema with the updated fields
schema = pa.schema([struct_field, second_struct_field])
# Create a table from the packed struct column and a duplicate, non-packed struct column
table = pa.Table.from_arrays([struct_array, struct_array], schema=schema)
# Write the table to a Lance file
with LanceFileWriter(lance_file_path, version="2.1") as writer:
    writer.write_batch(table)
print("Data written to Lance file successfully.")
# Read the Lance file and display the contents
tab_lance = LanceFileReader(lance_file_path).read_all().to_table()
print(tab_lance.to_pandas())
the loaded page:
observation: the content of the second page is the same as the content of the third page, but they should be different. Either we read the same page twice or we write one page twice.
The write path seems correct.
fact 3:
fact 4:
fact 5:
panic at:
pub fn is_all_valid(&self) -> bool {
    self.def_meaning[self.current_layer].is_all_valid()
}
During write: be able to write a struct with multiple fields to one page, and record this in the metadata.
During read: be able to detect that a column is a packed-struct column and that it maps to many column indices.
During scheduling: be able to map one physical page to many fields in the struct.
We should also allow reading a subset of the fields in a struct array.
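To make the requirements above concrete, here is a minimal pure-Python sketch of the general idea: all fields of a struct are written row by row into a single page, the field list is recorded in metadata, and a reader uses that metadata to decode only the requested fields. This is not Lance's actual on-disk format or API. The names `pack_struct_page`, `read_fields`, and `FIELD_FORMATS`, and the metadata layout, are hypothetical and exist only for illustration.

```python
import struct

# Hypothetical mapping from a type name to a `struct` module format code.
FIELD_FORMATS = {"uint32": "I", "uint64": "Q"}


def pack_struct_page(rows, fields):
    """Pack rows (a list of dicts) into one bytes page plus metadata.

    `fields` is a list of (name, type_name) pairs. Each row is stored with
    all of its fields back to back, so one page holds every struct field.
    """
    # "<" selects little-endian with no padding, so the row width is simply
    # the sum of the field widths.
    row_format = "<" + "".join(FIELD_FORMATS[t] for _, t in fields)
    page = b"".join(
        struct.pack(row_format, *(row[name] for name, _ in fields))
        for row in rows
    )
    # Hypothetical metadata: marks the page as packed and lists its fields.
    metadata = {"packed": True, "fields": list(fields), "num_rows": len(rows)}
    return page, metadata


def read_fields(page, metadata, wanted):
    """Decode only the `wanted` field names from a packed page."""
    # Compute the byte offset and format of each field within a row.
    offsets, offset = {}, 0
    for name, type_name in metadata["fields"]:
        fmt = "<" + FIELD_FORMATS[type_name]
        offsets[name] = (offset, fmt)
        offset += struct.calcsize(fmt)
    row_width = offset

    result = {name: [] for name in wanted}
    for row_index in range(metadata["num_rows"]):
        base = row_index * row_width
        for name in wanted:
            field_offset, fmt = offsets[name]
            (value,) = struct.unpack_from(fmt, page, base + field_offset)
            result[name].append(value)
    return result


fields = [("x", "uint64"), ("y", "uint32")]
rows = [{"x": 1, "y": 2}, {"x": 4, "y": 5}, {"x": 7, "y": 8}]
page, metadata = pack_struct_page(rows, fields)
print(len(page))                           # 36 (3 rows * 12 bytes)
print(read_fields(page, metadata, ["y"]))  # {'y': [2, 5, 8]}
```

The point of the sketch is that one physical page maps to several logical fields, so the reader needs the field list (and widths) from metadata before it can schedule or decode any single field.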