Motivation
As an effective way to manage data assets hierarchically, the Catalog concept is widely accepted in the industry, and many open-source engines implement it. The most typical example is the Apache Hive Catalog. With the growing popularity of LLMs, the industry is also actively using Catalogs to manage AI data assets, and even data assets of any type (for example, Unity Catalog and Gravitino). As a modern data format, Lance depends for its success on support from the many ecosystem components in AI and big data. The core of that integration is letting these components obtain Lance's information and metadata correctly and connect seamlessly with their own Catalog systems. It is therefore necessary to design a Catalog for Lance so that it can deliver greater value in the AI and big data ecosystem.
Investigation of Mainstream Catalog Systems
In big data and AI, there are several mainstream Catalog systems; typical examples are Unity Catalog and Apache Gravitino. The best way for Lance to expand its ecosystem is to integrate with them. Because no Catalog system has yet become a de facto industry standard, we compare them along several dimensions to support deeper discussion and decisions.
| Dimension | Unity Catalog (OSS) | Gravitino | Description |
| --- | --- | --- | --- |
| License | Apache v2.0 | Apache v2.0 | |
| Multilingual ecosystem | ❌ | ✅ | Unity Catalog: no Python client currently; Gravitino: Python SDK |
Conclusion: Given the current capabilities and maturity of the two projects, integrating Lance with Apache Gravitino may be the better choice. However, the Lance Catalog interface defined in the following text is intended to allow integration with either system.
Lance Catalog Conceptual Design
After surveying mainstream Catalog systems and table-format catalog designs, we propose a two-level conceptual design for the Lance Catalog:
Namespace: Organizes a group of Lance datasets. It is equivalent to the Schema (or Database) concept in an RDBMS and comparable to the Namespace in Iceberg.
Dataset: Corresponds to the existing Lance Dataset concept and is equivalent to a Table in an RDBMS.
The overall structure is a two-level hierarchy: a Catalog contains Namespaces, and each Namespace contains Datasets.
Based on the concepts above, we need to define the following entities and interfaces.
Entity
Namespace: Used for organizing datasets, similar to a database in an RDBMS.
DatasetIdentifier: Uniquely identifies a dataset within a catalog.
DatasetMetadata (optional): Encapsulates the metadata of a dataset (such as its schema, location, and extended attributes).
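As a rough illustration of these entities, the sketch below models them as immutable Java records (Java 16+). Everything here is an assumption made for the example rather than part of the proposal: the field names, the dotted `toString` form, the `CatalogEntities` wrapper class, and the use of a plain string as a stand-in for the schema. Only the `Namespace.of(...)` factory mirrors a call that appears in the interface Javadoc.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the catalog entities; names and fields are assumptions. */
public class CatalogEntities {

  /** Groups datasets; levels are ordered from outermost to innermost. */
  record Namespace(List<String> levels) {
    Namespace {
      levels = List.copyOf(levels); // defensive, immutable copy
    }

    static Namespace of(String... levels) {
      return new Namespace(List.of(levels));
    }

    boolean isEmpty() {
      return levels.isEmpty();
    }

    @Override
    public String toString() {
      return String.join(".", levels);
    }
  }

  /** Uniquely identifies a dataset within a catalog: namespace plus name. */
  record DatasetIdentifier(Namespace namespace, String name) {
    @Override
    public String toString() {
      return namespace.isEmpty() ? name : namespace + "." + name;
    }
  }

  /** Optional metadata holder; the schema is a plain string placeholder here. */
  record DatasetMetadata(String schema, String location, Map<String, String> properties) {
    DatasetMetadata {
      properties = Map.copyOf(properties); // defensive, immutable copy
    }
  }

  public static void main(String[] args) {
    DatasetIdentifier id = new DatasetIdentifier(Namespace.of("ml", "images"), "train");
    DatasetMetadata metadata =
        new DatasetMetadata("id: int64, image: binary", "/data/train.lance", Map.of("owner", "demo"));
    System.out.println(id);                  // ml.images.train
    System.out.println(metadata.location()); // /data/train.lance
  }
}
```

Records compare by value, so two identifiers built from the same namespace levels and name are equal, which is convenient when identifiers are used as map keys in a catalog implementation.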
Interface
Catalog: An abstract interface that specifies the basic operations a Lance Catalog must support.
DatasetOperation: An abstract interface for committing and refreshing dataset metadata.
Lance Catalog architecture design
At the architectural level, there are two approaches to implementing the Lance Catalog:
Option 1: Integration mode centered on Rust.
Option 2: Multi-language vertical integration mode.
The choice between these two designs involves trade-offs on several levels. We compare them along the following dimensions:
Design comparison
| Dimension | Option 1 | Option 2 | Explanation (mainly the reasons for the inferior option) |
| --- | --- | --- | --- |
| Complexity | ❌ | ✅ | The language complexity of Rust and the cost of writing and debugging binding calls are higher. |
| Controllability | ❌ | ✅ | The vertical mode only stipulates interfaces. Each language implements the logic itself, so a bug does not affect the integrations in every language. |
| Workload | ✅ | ❌ | Taking Hive Catalog as an example, the Lance SDK of each language must be adapted separately. |
| Maintainability | ✅ | ❌ | Vertical integration relies on conventions between the language implementations to keep interface semantics consistent, which is a relatively weak constraint. |
| Maturity (are there reference cases of this option in the same field?) | ❌ | ✅ | Iceberg adopted Option 2. The formats differ in their base programming languages, implementation paths, and historical background, but only the outcome is evaluated here. |
| Ease of integration with mainstream Catalog systems | ✅ | ✅ | Both can interact with Unity Catalog/Apache Gravitino through a RESTful API. |
Conclusion: The choice between the two options still requires further discussion.
Lance Catalog interface design
For the proof of concept (POC), the interface below is provisionally implemented following Option 2.
Catalog Interface
```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** A Catalog API for dataset create, drop, and load operations. */
public interface Catalog {

  /**
   * Return the name for this catalog.
   *
   * @return this catalog's name
   */
  String name();

  /** Create dataset with a given identifier and schema. */
  Dataset createDataset(DatasetIdentifier identifier, Schema schema);

  /** Create dataset with a given identifier, schema, location and properties. */
  Dataset createDataset(
      DatasetIdentifier identifier,
      Schema schema,
      String location,
      Map<String, String> properties);

  /**
   * Return all the identifiers under this namespace.
   *
   * @param namespace a namespace
   * @return a list of identifiers for datasets
   * @throws NoSuchNamespaceException if the namespace is not found
   */
  List<DatasetIdentifier> listDatasets(Namespace namespace);

  /**
   * Drop a dataset; optionally delete data and metadata files.
   *
   * <p>If purge is set to true the implementation should delete all data and metadata files.
   *
   * @param identifier a dataset identifier
   * @param purge if true, delete all data and metadata files in the dataset
   * @param storageOptions a map of storage options to use when deleting data and metadata files
   * @return true if the dataset was dropped, false if the dataset did not exist
   */
  boolean dropDataset(
      DatasetIdentifier identifier, boolean purge, Map<String, String> storageOptions);

  boolean dropDataset(DatasetIdentifier identifier);

  /**
   * Rename a dataset.
   *
   * @param from identifier of the dataset to rename
   * @param to new dataset name
   * @throws NoSuchDatasetException if the from dataset does not exist
   * @throws AlreadyExistsException if the to dataset already exists
   */
  void renameDataset(DatasetIdentifier from, DatasetIdentifier to);

  /**
   * Load a dataset.
   *
   * @param identifier a dataset identifier
   * @return instance of {@link Dataset} implementation referred by {@code identifier}
   * @throws NoSuchDatasetException if the dataset does not exist
   */
  Optional<Dataset> loadDataset(DatasetIdentifier identifier);

  /**
   * Invalidate cached dataset metadata from current catalog.
   *
   * <p>If the dataset is already loaded or cached, drop cached data. If the dataset does not exist
   * or is not cached, do nothing.
   *
   * @param identifier a dataset identifier
   */
  default void invalidateDataset(DatasetIdentifier identifier) {}

  /**
   * Register a dataset with the catalog if it does not exist.
   *
   * @param identifier a dataset identifier
   * @param metadataFileLocation the location of a metadata file
   * @return a dataset instance
   * @throws AlreadyExistsException if the dataset already exists in the catalog.
   */
  Dataset registerDataset(DatasetIdentifier identifier, String metadataFileLocation);

  DatasetBuilder buildDataset(DatasetIdentifier identifier, Schema schema);

  /**
   * Create a namespace in the catalog.
   *
   * @param namespace a namespace. {@link Namespace}.
   * @throws AlreadyExistsException If the namespace already exists
   * @throws UnsupportedOperationException If create is not a supported operation
   */
  void createNamespace(Namespace namespace);

  /**
   * Create a namespace in the catalog.
   *
   * @param namespace a multi-part namespace
   * @param metadata a string Map of properties for the given namespace
   * @throws AlreadyExistsException If the namespace already exists
   * @throws UnsupportedOperationException If create is not a supported operation
   */
  void createNamespace(Namespace namespace, Map<String, String> metadata);

  /**
   * List top-level namespaces from the catalog.
   *
   * <p>If an object such as a table, view, or function exists, its parent namespaces must also
   * exist and must be returned by this discovery method. For example, if table a.b.t exists, this
   * method must return ["a"] in the result array.
   *
   * @return a List of namespace {@link Namespace} names
   */
  List<Namespace> listNamespaces();

  /**
   * List child namespaces from the namespace.
   *
   * <p>For two existing tables named 'a.b.c.table' and 'a.b.d.table', this method returns:
   *
   * <ul>
   *   <li>Given: {@code Namespace.empty()}
   *   <li>Returns: {@code Namespace.of("a")}
   * </ul>
   *
   * <ul>
   *   <li>Given: {@code Namespace.of("a")}
   *   <li>Returns: {@code Namespace.of("a", "b")}
   * </ul>
   *
   * <ul>
   *   <li>Given: {@code Namespace.of("a", "b")}
   *   <li>Returns: {@code Namespace.of("a", "b", "c")} and {@code Namespace.of("a", "b", "d")}
   * </ul>
   *
   * <ul>
   *   <li>Given: {@code Namespace.of("a", "b", "c")}
   *   <li>Returns: empty list, because there are no child namespaces
   * </ul>
   *
   * @return a List of child {@link Namespace} names from the given namespace
   * @throws NoSuchNamespaceException If the namespace does not exist (optional)
   */
  List<Namespace> listNamespaces(Namespace namespace) throws NoSuchNamespaceException;

  /**
   * Load metadata properties for a namespace.
   *
   * @param namespace a namespace. {@link Namespace}
   * @return a string map of properties for the given namespace
   * @throws NoSuchNamespaceException If the namespace does not exist (optional)
   */
  Map<String, String> loadNamespaceMetadata(Namespace namespace) throws NoSuchNamespaceException;

  /**
   * Drop a namespace. If the namespace exists and was dropped, this will return true.
   *
   * @param namespace a namespace. {@link Namespace}
   * @return true if the namespace was dropped, false otherwise.
   * @throws NamespaceNotEmptyException If the namespace is not empty
   */
  boolean dropNamespace(Namespace namespace) throws NamespaceNotEmptyException;

  /**
   * Set a collection of properties on a namespace in the catalog.
   *
   * <p>Properties that are not in the given map are not modified or removed by this method.
   *
   * @param namespace a namespace. {@link Namespace}
   * @param properties a collection of metadata to apply to the namespace
   * @throws NoSuchNamespaceException If the namespace does not exist (optional)
   * @throws UnsupportedOperationException If namespace properties are not supported
   */
  boolean setProperties(Namespace namespace, Map<String, String> properties)
      throws NoSuchNamespaceException;

  /**
   * Remove a set of property keys from a namespace in the catalog.
   *
   * <p>Properties that are not in the given set are not modified or removed by this method.
   *
   * @param namespace a namespace. {@link Namespace}
   * @param properties the set of property keys to remove from the namespace
   * @throws NoSuchNamespaceException If the namespace does not exist (optional)
   * @throws UnsupportedOperationException If namespace properties are not supported
   */
  boolean removeProperties(Namespace namespace, Set<String> properties)
      throws NoSuchNamespaceException;

  /**
   * Checks whether the Namespace exists.
   *
   * @param namespace a namespace. {@link Namespace}
   * @return true if the Namespace exists, false otherwise
   */
  boolean namespaceExists(Namespace namespace);

  /**
   * Initialize a catalog given a custom name and a map of catalog properties.
   *
   * <p>A custom Catalog implementation must have a no-arg constructor. A compute engine like Spark
   * or Flink will first initialize the catalog without any arguments, and then call this method to
   * complete catalog initialization with properties passed into the engine.
   *
   * @param name a custom name for the catalog
   * @param properties catalog properties
   */
  default void initialize(String name, Map<String, String> properties) {}

  interface DatasetBuilder {
    DatasetBuilder withLocation(String location);

    DatasetBuilder withProperties(Map<String, String> properties);

    DatasetBuilder withProperty(String key, String value);

    // ....

    Dataset create();
  }
}
```
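The child-namespace rule documented on `listNamespaces(Namespace)` can be checked with a small standalone sketch. It is only an illustration under simplifying assumptions: namespaces are modeled as `List<String>` instead of the `Namespace` type, and `childNamespaces` is a hypothetical helper that derives the answer from the namespaces of existing datasets.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of the listNamespaces(Namespace) contract. */
public class ListNamespacesSketch {

  /** Returns the direct children of {@code parent} implied by the dataset namespaces. */
  static List<List<String>> childNamespaces(
      List<List<String>> datasetNamespaces, List<String> parent) {
    Set<List<String>> children = new LinkedHashSet<>(); // de-duplicates, keeps first-seen order
    for (List<String> ns : datasetNamespaces) {
      boolean isDescendant =
          ns.size() > parent.size() && ns.subList(0, parent.size()).equals(parent);
      if (isDescendant) {
        // Keep exactly one more level than the parent has.
        children.add(List.copyOf(ns.subList(0, parent.size() + 1)));
      }
    }
    return new ArrayList<>(children);
  }

  public static void main(String[] args) {
    // Namespaces of the two tables 'a.b.c.table' and 'a.b.d.table'.
    List<List<String>> existing = List.of(List.of("a", "b", "c"), List.of("a", "b", "d"));

    System.out.println(childNamespaces(existing, List.of()));              // [[a]]
    System.out.println(childNamespaces(existing, List.of("a")));           // [[a, b]]
    System.out.println(childNamespaces(existing, List.of("a", "b")));      // [[a, b, c], [a, b, d]]
    System.out.println(childNamespaces(existing, List.of("a", "b", "c"))); // []
  }
}
```

A real implementation would normally ask the backing catalog service for child namespaces rather than scan dataset identifiers; the scan is used here only to make the expected results easy to verify.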
DatasetOperation Interface
current: Return the dataset metadata that is currently loaded.
refresh: Refresh the dataset metadata.
commit: Replace the dataset's metadata with a new version. (The commit operation needs to consider locking.)
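One way to read these three operations is sketched below. The proposal does not fix any signatures, so everything here is an assumption: `commit(base, updated)` is modeled on the shape of Iceberg's `TableOperations`, `DatasetMetadata` is a two-field placeholder, and an `AtomicReference` stands in for a shared metadata store. The compare-and-set call plays the role of the lock: a commit succeeds only if the store still holds the exact metadata object the writer started from.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch of DatasetOperation with an optimistic commit. */
public class DatasetOperationSketch {

  /** Placeholder metadata; the fields are assumptions for illustration. */
  static final class DatasetMetadata {
    final long version;
    final String location;

    DatasetMetadata(long version, String location) {
      this.version = version;
      this.location = location;
    }
  }

  interface DatasetOperation {
    DatasetMetadata current();

    DatasetMetadata refresh();

    void commit(DatasetMetadata base, DatasetMetadata updated);
  }

  static final class InMemoryDatasetOperation implements DatasetOperation {
    private final AtomicReference<DatasetMetadata> store; // shared by all writers
    private DatasetMetadata current;                      // this writer's loaded view

    InMemoryDatasetOperation(AtomicReference<DatasetMetadata> store) {
      this.store = store;
      this.current = store.get();
    }

    @Override
    public DatasetMetadata current() {
      return current;
    }

    @Override
    public DatasetMetadata refresh() {
      current = store.get();
      return current;
    }

    @Override
    public void commit(DatasetMetadata base, DatasetMetadata updated) {
      // compareAndSet compares references, so base must be the object read from the store.
      if (!store.compareAndSet(base, updated)) {
        throw new IllegalStateException("Stale base metadata; refresh and retry");
      }
      current = updated;
    }
  }

  public static void main(String[] args) {
    AtomicReference<DatasetMetadata> store =
        new AtomicReference<>(new DatasetMetadata(1, "/data/train.lance"));
    DatasetOperation writerA = new InMemoryDatasetOperation(store);
    DatasetOperation writerB = new InMemoryDatasetOperation(store);

    writerA.commit(writerA.current(), new DatasetMetadata(2, "/data/train.lance"));
    try {
      // writerB still holds version 1, so this commit is rejected.
      writerB.commit(writerB.current(), new DatasetMetadata(2, "/data/train.lance"));
    } catch (IllegalStateException e) {
      DatasetMetadata latest = writerB.refresh();
      writerB.commit(latest, new DatasetMetadata(latest.version + 1, latest.location));
    }
    System.out.println(store.get().version); // 3
  }
}
```

Against a real object store or catalog service, the compare-and-set step would have to be replaced by whatever atomic primitive that backend offers, which is the locking question the note on `commit` raises.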
TODO(discussion)
Provide an InMemory Catalog for testing;
Provide index-related APIs and store index information in the metadata;
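For the first TODO item, a starting point could look like the sketch below. It is deliberately reduced and should be read as an assumption-laden illustration, not the planned implementation: namespaces and dataset names are plain strings, each dataset maps only to a location, standard Java exceptions stand in for `NoSuchNamespaceException`, `AlreadyExistsException`, and `NamespaceNotEmptyException`, and the class does not implement the full `Catalog` interface.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical, reduced in-memory catalog for tests. */
public class InMemoryCatalogSketch {

  // namespace name -> (dataset name -> dataset location)
  private final Map<String, Map<String, String>> namespaces = new TreeMap<>();

  public synchronized void createNamespace(String namespace) {
    if (namespaces.containsKey(namespace)) {
      throw new IllegalStateException("Namespace already exists: " + namespace);
    }
    namespaces.put(namespace, new TreeMap<>());
  }

  public synchronized boolean namespaceExists(String namespace) {
    return namespaces.containsKey(namespace);
  }

  public synchronized void createDataset(String namespace, String name, String location) {
    Map<String, String> datasets = requireNamespace(namespace);
    if (datasets.putIfAbsent(name, location) != null) {
      throw new IllegalStateException("Dataset already exists: " + namespace + "." + name);
    }
  }

  public synchronized List<String> listDatasets(String namespace) {
    return new ArrayList<>(requireNamespace(namespace).keySet());
  }

  /** Returns true if the dataset was dropped, false if it did not exist. */
  public synchronized boolean dropDataset(String namespace, String name) {
    Map<String, String> datasets = namespaces.get(namespace);
    return datasets != null && datasets.remove(name) != null;
  }

  /** Returns true if the namespace was dropped; refuses to drop a non-empty namespace. */
  public synchronized boolean dropNamespace(String namespace) {
    Map<String, String> datasets = namespaces.get(namespace);
    if (datasets == null) {
      return false;
    }
    if (!datasets.isEmpty()) {
      throw new IllegalStateException("Namespace is not empty: " + namespace);
    }
    namespaces.remove(namespace);
    return true;
  }

  private Map<String, String> requireNamespace(String namespace) {
    Map<String, String> datasets = namespaces.get(namespace);
    if (datasets == null) {
      throw new IllegalArgumentException("No such namespace: " + namespace);
    }
    return datasets;
  }

  public static void main(String[] args) {
    InMemoryCatalogSketch catalog = new InMemoryCatalogSketch();
    catalog.createNamespace("ml");
    catalog.createDataset("ml", "train", "/data/train.lance");
    catalog.createDataset("ml", "eval", "/data/eval.lance");
    System.out.println(catalog.listDatasets("ml"));         // [eval, train]
    System.out.println(catalog.dropDataset("ml", "train")); // true
    System.out.println(catalog.dropDataset("ml", "train")); // false
    catalog.dropDataset("ml", "eval");
    System.out.println(catalog.dropNamespace("ml"));        // true
  }
}
```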