Proposal: Introduce Catalog for Lance #3257

Open
yanghua opened this issue Dec 17, 2024 · 0 comments

Motivation

As an effective concept for hierarchically managing data assets, the Catalog has been widely adopted in the industry, and many open-source engines implement it; the most typical example is the Apache Hive Catalog. With the growing popularity of LLMs, the industry is also actively using Catalogs to manage AI data assets, and even data assets of any type (for example, Unity Catalog and Gravitino). As a modern data format, Lance depends for its success on support from the many ecosystem components in AI and big data. The core of that integration is letting these components obtain Lance's information and metadata correctly and connect seamlessly with their own Catalog systems. It is therefore necessary to design a Catalog for Lance so that it can deliver more value in the AI and big data ecosystem.

Investigation of Mainstream Catalog Systems

In the Big Data and AI fields, there are several mainstream Catalog systems (typical examples are Unity Catalog and Apache Gravitino). The best way for Lance to expand its ecosystem is to integrate with them. However, no Catalog system has yet become a de facto standard in the industry, so we compare them here along several dimensions to support further discussion and decisions.

Dimension | Unity Catalog (OSS) | Gravitino | Description
LICENSE | Apache v2.0 | Apache v2.0 |
Multilingual Ecosystem | | | Unity Catalog: no Python client currently; Gravitino: Python SDK
Unstructured Data Support | | |
Iceberg REST API | | | Unity Catalog: read-only, via UniForm
Security Control | | |
Engine Support | | | Unity Catalog: Spark, DuckDB, Trino, Daft, PuppyGraph, SpiceAI, XTable; Gravitino: Trino, Spark, Flink, PyTorch, Ray
Web GUI | | |

Conclusion: Given the current capabilities and maturity of the two projects, integrating Lance with Apache Gravitino may be the better choice. Nevertheless, the rest of this proposal defines Lance's own Catalog interface so that it can be integrated with either system.

Lance Catalog Conceptual Design

After researching the catalog designs of mainstream Catalog systems and table formats, we believe the Lance Catalog should mainly be a two-level conceptual design:

  • Namespace: Organizes a group of Lance datasets. It is equivalent to the Schema (or Database) concept in an RDBMS and comparable to the Namespace in Iceberg.
  • Dataset: Corresponding to the current Lance Dataset concept, it is equivalent to a Table in RDBMS.

The overall structure is illustrated as follows:

[Figure: catalog-concept-design]

Based on the above concept introduction, we need to define the following entities and interfaces.

Entity

  • Namespace: Used for organizing datasets, similar to a database in an RDBMS.
  • DatasetIdentifier: Used to uniquely identify a dataset in a catalog.
  • DatasetMetadata (optional): Used to encapsulate the metadata of a dataset (such as schema, location, and extended attributes).
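
To make the two identifying entities concrete, here is a minimal sketch of how Namespace and DatasetIdentifier could look as immutable value classes. The factory methods `Namespace.of`, `Namespace.empty`, and `DatasetIdentifier.of`, as well as the dotted `toString` form, are assumptions for illustration (modeled on the Iceberg-style usage that appears in the interface Javadoc), not a settled part of this proposal.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Objects;

/** Hypothetical multi-level namespace, e.g. Namespace.of("a", "b"). */
final class Namespace {
  private final List<String> levels;

  private Namespace(List<String> levels) {
    this.levels = Collections.unmodifiableList(levels);
  }

  static Namespace of(String... levels) {
    // Copy the array so later changes by the caller cannot leak in.
    return new Namespace(Arrays.asList(levels.clone()));
  }

  static Namespace empty() {
    return of();
  }

  List<String> levels() {
    return levels;
  }

  boolean isEmpty() {
    return levels.isEmpty();
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof Namespace && levels.equals(((Namespace) o).levels);
  }

  @Override
  public int hashCode() {
    return levels.hashCode();
  }

  @Override
  public String toString() {
    return String.join(".", levels);
  }
}

/** Hypothetical identifier: a namespace plus a dataset name, unique within one catalog. */
final class DatasetIdentifier {
  private final Namespace namespace;
  private final String name;

  private DatasetIdentifier(Namespace namespace, String name) {
    this.namespace = Objects.requireNonNull(namespace, "namespace");
    this.name = Objects.requireNonNull(name, "name");
  }

  static DatasetIdentifier of(Namespace namespace, String name) {
    return new DatasetIdentifier(namespace, name);
  }

  Namespace namespace() {
    return namespace;
  }

  String name() {
    return name;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof DatasetIdentifier)) {
      return false;
    }
    DatasetIdentifier other = (DatasetIdentifier) o;
    return namespace.equals(other.namespace) && name.equals(other.name);
  }

  @Override
  public int hashCode() {
    return Objects.hash(namespace, name);
  }

  @Override
  public String toString() {
    return namespace.isEmpty() ? name : namespace + "." + name;
  }
}
```

Value semantics (`equals`/`hashCode`) matter here because identifiers are likely to be used as map keys in catalog caches.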

Interface

  • Catalog: An abstract interface that specifies the basic operations a Lance Catalog must support.
  • DatasetOperation: An abstract interface that specifies how dataset metadata is committed and refreshed.

Lance Catalog architecture design

At the architectural level, there are two approaches to implementing the Lance Catalog, shown in the diagrams below:

Option 1: Rust-centric integration mode

[Figure: design-option-1]

Option 2: Multi-language vertical integration mode

[Figure: design-option-2]

Choosing between these two designs involves trade-offs at several levels. We compare them along the following dimensions:

Design comparison

Dimension | Option 1 | Option 2 | Explanation (mainly the reasons why one option is inferior)
Complexity | | | Rust's language complexity and the cost of writing and debugging bindings are higher.
Controllability | | | The vertical mode only specifies interfaces. Each language implements the logic itself, so a bug does not affect the integration scenarios of every language.
Workload | | | Taking the Hive Catalog as an example, the Lance SDK for each language has to be adapted separately.
Maintainability | | | Vertical integration relies on "conventions" between programming languages to "guarantee" consistent interface semantics, which is a relatively weak constraint.
Maturity (are there reference cases of this approach in the same field?) | | | Iceberg adopted Option 2. The underlying languages of the different formats led to different implementation paths and historical backgrounds, but only the outcome is evaluated here.
Ease of integration with mainstream Catalog systems | | | Both can interact with Unity Catalog/Apache Gravitino through a RESTful API.

Conclusion: The choice between the two options still requires further discussion.

Lance Catalog interface design

The following interfaces are provisionally written in the form of Option 2 as a POC.

Catalog Interface

import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** A Catalog API for dataset create, drop, and load operations. */
public interface Catalog {

  /**
   * Return the name for this catalog.
   *
   * @return this catalog's name
   */
  String name();

  /**
   * Create dataset with a given identifier and schema.
   */
  Dataset createDataset(DatasetIdentifier identifier, Schema schema);

  /**
   * Create dataset with a given identifier, schema, location and properties.
   */
  Dataset createDataset(
      DatasetIdentifier identifier,
      Schema schema,
      String location,
      Map<String, String> properties);

  /**
   * Return all the identifiers under this namespace.
   *
   * @param namespace a namespace
   * @return a list of identifiers for datasets
   * @throws NoSuchNamespaceException if the namespace is not found
   */
  List<DatasetIdentifier> listDatasets(Namespace namespace);

  /**
   * Drop a dataset; optionally delete data and metadata files.
   *
   * <p>If purge is set to true the implementation should delete all data and metadata files.
   *
   * @param identifier a dataset identifier
   * @param purge if true, delete all data and metadata files in the dataset
   * @param storageOptions a map of storage options to use when deleting data and metadata files
   * @return true if the dataset was dropped, false if the dataset did not exist
   */
  boolean dropDataset(
      DatasetIdentifier identifier, boolean purge, Map<String, String> storageOptions);

  boolean dropDataset(DatasetIdentifier identifier);

  /**
   * Rename a dataset.
   *
   * @param from identifier of the dataset to rename
   * @param to new identifier for the dataset
   * @throws NoSuchDatasetException if the from dataset does not exist
   * @throws AlreadyExistsException if the to dataset already exists
   */
  void renameDataset(DatasetIdentifier from, DatasetIdentifier to);

  /**
   * Load a dataset.
   *
   * @param identifier a dataset identifier
   * @return the {@link Dataset} referred to by {@code identifier}, or an empty Optional if the
   *     dataset does not exist
   */
  Optional<Dataset> loadDataset(DatasetIdentifier identifier);

  /**
   * Invalidate cached dataset metadata from current catalog.
   *
   * <p>If the dataset is already loaded or cached, drop cached data. If the dataset does not exist
   * or is not cached, do nothing.
   *
   * @param identifier a dataset identifier
   */
  default void invalidateDataset(DatasetIdentifier identifier) {}

  /**
   * Register a dataset with the catalog if it does not exist.
   *
   * @param identifier a dataset identifier
   * @param metadataFileLocation the location of a metadata file
   * @return a dataset instance
   * @throws AlreadyExistsException if the dataset already exists in the catalog.
   */
  Dataset registerDataset(DatasetIdentifier identifier, String metadataFileLocation);

  /** Start building a dataset with the given identifier and schema; see {@link DatasetBuilder}. */
  DatasetBuilder buildDataset(DatasetIdentifier identifier, Schema schema);

  /**
   * Create a namespace in the catalog.
   *
   * @param namespace a namespace. {@link Namespace}.
   * @throws AlreadyExistsException If the namespace already exists
   * @throws UnsupportedOperationException If create is not a supported operation
   */
  void createNamespace(Namespace namespace);

  /**
   * Create a namespace in the catalog.
   *
   * @param namespace a multi-part namespace
   * @param metadata a string Map of properties for the given namespace
   * @throws AlreadyExistsException If the namespace already exists
   * @throws UnsupportedOperationException If create is not a supported operation
   */
  void createNamespace(Namespace namespace, Map<String, String> metadata);

  /**
   * List top-level namespaces from the catalog.
   *
   * <p>If an object such as a dataset exists, its parent namespaces must also exist and must be
   * returned by this discovery method. For example, if dataset a.b.t exists, this method must
   * return ["a"] in the result array.
   *
   * @return a List of namespace {@link Namespace} names
   */
  List<Namespace> listNamespaces();

  /**
   * List child namespaces from the namespace.
   *
   * <p>For two existing datasets named 'a.b.c.table' and 'a.b.d.table', this method returns:
   *
   * <ul>
   *   <li>Given: {@code Namespace.empty()}
   *   <li>Returns: {@code Namespace.of("a")}
   * </ul>
   *
   * <ul>
   *   <li>Given: {@code Namespace.of("a")}
   *   <li>Returns: {@code Namespace.of("a", "b")}
   * </ul>
   *
   * <ul>
   *   <li>Given: {@code Namespace.of("a", "b")}
   *   <li>Returns: {@code Namespace.of("a", "b", "c")} and {@code Namespace.of("a", "b", "d")}
   * </ul>
   *
   * <ul>
   *   <li>Given: {@code Namespace.of("a", "b", "c")}
   *   <li>Returns: empty list, because there are no child namespaces
   * </ul>
   *
   * @return a List of child {@link Namespace} names from the given namespace
   * @throws NoSuchNamespaceException If the namespace does not exist (optional)
   */
  List<Namespace> listNamespaces(Namespace namespace) throws NoSuchNamespaceException;

  /**
   * Load metadata properties for a namespace.
   *
   * @param namespace a namespace. {@link Namespace}
   * @return a string map of properties for the given namespace
   * @throws NoSuchNamespaceException If the namespace does not exist (optional)
   */
  Map<String, String> loadNamespaceMetadata(Namespace namespace) throws NoSuchNamespaceException;

  /**
   * Drop a namespace. If the namespace exists and was dropped, this will return true.
   *
   * @param namespace a namespace. {@link Namespace}
   * @return true if the namespace was dropped, false otherwise.
   * @throws NamespaceNotEmptyException If the namespace is not empty
   */
  boolean dropNamespace(Namespace namespace) throws NamespaceNotEmptyException;

  /**
   * Set a collection of properties on a namespace in the catalog.
   *
   * <p>Properties that are not in the given map are not modified or removed by this method.
   *
   * @param namespace a namespace. {@link Namespace}
   * @param properties a collection of metadata to apply to the namespace
   * @throws NoSuchNamespaceException If the namespace does not exist (optional)
   * @throws UnsupportedOperationException If namespace properties are not supported
   */
  boolean setProperties(Namespace namespace, Map<String, String> properties)
      throws NoSuchNamespaceException;

  /**
   * Remove a set of property keys from a namespace in the catalog.
   *
   * <p>Properties that are not in the given set are not modified or removed by this method.
   *
   * @param namespace a namespace. {@link Namespace}
   * @param properties a set of property keys to remove from the namespace
   * @throws NoSuchNamespaceException If the namespace does not exist (optional)
   * @throws UnsupportedOperationException If namespace properties are not supported
   */
  boolean removeProperties(Namespace namespace, Set<String> properties)
      throws NoSuchNamespaceException;

  /**
   * Checks whether the Namespace exists.
   *
   * @param namespace a namespace. {@link Namespace}
   * @return true if the Namespace exists, false otherwise
   */
  boolean namespaceExists(Namespace namespace);

  /**
   * Initialize a catalog given a custom name and a map of catalog properties.
   *
   * <p>A custom Catalog implementation must have a no-arg constructor. A compute engine like Spark
   * or Flink will first initialize the catalog without any arguments, and then call this method to
   * complete catalog initialization with properties passed into the engine.
   *
   * @param name a custom name for the catalog
   * @param properties catalog properties
   */
  default void initialize(String name, Map<String, String> properties) {}

  interface DatasetBuilder {

    DatasetBuilder withLocation(String location);

    DatasetBuilder withProperties(Map<String, String> properties);

    DatasetBuilder withProperty(String key, String value);
    
    //....

    Dataset create();
  }

}
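
The child-namespace semantics described in the `listNamespaces(Namespace)` Javadoc can be sketched as a small standalone function. This is only an illustration under simplifying assumptions: namespaces and dataset names are represented as plain lists of strings rather than the proposed Namespace type, the class and method names (`NamespaceListing`, `listChildNamespaces`) are hypothetical, namespaces are derived only from existing datasets, and the optional NoSuchNamespaceException is not modeled.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class NamespaceListing {

  /**
   * Return the direct child namespaces of {@code parent}.
   *
   * @param datasets fully qualified dataset names split into levels, e.g. ["a", "b", "c", "table"]
   * @param parent levels of the parent namespace; an empty list means the catalog root
   */
  static List<List<String>> listChildNamespaces(List<List<String>> datasets, List<String> parent) {
    // LinkedHashSet removes duplicates while keeping first-seen order.
    Set<List<String>> children = new LinkedHashSet<>();
    int depth = parent.size();
    for (List<String> dataset : datasets) {
      // The last element is the dataset name, so the namespace has size - 1 levels.
      int namespaceLength = dataset.size() - 1;
      boolean underParent =
          namespaceLength > depth && dataset.subList(0, depth).equals(parent);
      if (underParent) {
        // Keep only one level below the parent: that is the direct child.
        children.add(new ArrayList<>(dataset.subList(0, depth + 1)));
      }
    }
    return new ArrayList<>(children);
  }
}
```

With datasets `a.b.c.table` and `a.b.d.table`, passing the parent `["a", "b"]` yields the two children `["a", "b", "c"]` and `["a", "b", "d"]`, and passing `["a", "b", "c"]` yields an empty list, matching the cases listed in the Javadoc.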

DatasetOperation Interface

  • current: Return the dataset metadata that is currently loaded.
  • refresh: Reload the dataset metadata from the catalog and return the latest version.
  • commit: Replace the dataset's metadata with a new version. (The commit operation needs to consider locking.)
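
As a starting point for discussion, the three operations could be sketched as follows. Note the assumptions: the `commit(base, updated)` signature, the simplified `DatasetMetadata` class (only a version and a location), and the `InMemoryDatasetOperation` class are all hypothetical. The sketch also swaps the locking mentioned above for an optimistic compare-and-set on an in-memory reference; a real catalog backend would need its own lock or atomic-swap mechanism.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for DatasetMetadata: only a version and a location. */
final class DatasetMetadata {
  final long version;
  final String location;

  DatasetMetadata(long version, String location) {
    this.version = version;
    this.location = location;
  }
}

/** Sketch of the DatasetOperation interface with the three operations listed above. */
interface DatasetOperation {
  DatasetMetadata current();

  DatasetMetadata refresh();

  /** Replace {@code base} with {@code updated}; fail if {@code base} is no longer the latest. */
  void commit(DatasetMetadata base, DatasetMetadata updated);
}

/** In-memory implementation; the AtomicReference plays the role of the catalog's pointer. */
final class InMemoryDatasetOperation implements DatasetOperation {
  private final AtomicReference<DatasetMetadata> store; // shared "catalog" state
  private DatasetMetadata current; // locally cached view of the metadata

  InMemoryDatasetOperation(AtomicReference<DatasetMetadata> store) {
    this.store = store;
    this.current = store.get();
  }

  @Override
  public DatasetMetadata current() {
    return current;
  }

  @Override
  public DatasetMetadata refresh() {
    current = store.get();
    return current;
  }

  @Override
  public void commit(DatasetMetadata base, DatasetMetadata updated) {
    // Optimistic concurrency: succeed only if nobody replaced `base` in the meantime.
    if (!store.compareAndSet(base, updated)) {
      throw new IllegalStateException("Commit failed: metadata was changed concurrently");
    }
    current = updated;
  }
}
```

A writer whose commit fails would typically call `refresh()`, reapply its change on top of the new metadata, and retry.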

TODO(discussion)

  • Provide an InMemory Catalog for testing;
  • Provide index-related APIs and store index information in the metadata;
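
For the first TODO item, a testing-oriented in-memory catalog could start out roughly like the sketch below. It is deliberately reduced and makes several assumptions: it does not implement the full Catalog interface, it uses plain strings for namespaces, dataset names, and locations, it supports only single-level namespaces, and it uses standard JDK exceptions in place of AlreadyExistsException, NoSuchNamespaceException, and NamespaceNotEmptyException.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical, simplified in-memory catalog intended only for tests. */
final class InMemoryCatalog {
  // namespace name -> (dataset name -> dataset location)
  private final Map<String, Map<String, String>> namespaces = new ConcurrentHashMap<>();

  void createNamespace(String namespace) {
    if (namespaces.putIfAbsent(namespace, new ConcurrentHashMap<>()) != null) {
      throw new IllegalStateException("Namespace already exists: " + namespace);
    }
  }

  boolean namespaceExists(String namespace) {
    return namespaces.containsKey(namespace);
  }

  void registerDataset(String namespace, String name, String location) {
    if (datasetsOf(namespace).putIfAbsent(name, location) != null) {
      throw new IllegalStateException("Dataset already exists: " + namespace + "." + name);
    }
  }

  /** Dataset names in the namespace, sorted for deterministic test output. */
  List<String> listDatasets(String namespace) {
    List<String> names = new ArrayList<>(datasetsOf(namespace).keySet());
    Collections.sort(names);
    return names;
  }

  /** Returns true if the dataset was dropped, false if it did not exist. */
  boolean dropDataset(String namespace, String name) {
    Map<String, String> datasets = namespaces.get(namespace);
    return datasets != null && datasets.remove(name) != null;
  }

  /** Returns true if the namespace was dropped; refuses to drop a non-empty namespace. */
  boolean dropNamespace(String namespace) {
    Map<String, String> datasets = namespaces.get(namespace);
    if (datasets == null) {
      return false;
    }
    if (!datasets.isEmpty()) {
      throw new IllegalStateException("Namespace is not empty: " + namespace);
    }
    namespaces.remove(namespace);
    return true;
  }

  private Map<String, String> datasetsOf(String namespace) {
    Map<String, String> datasets = namespaces.get(namespace);
    if (datasets == null) {
      throw new NoSuchElementException("No such namespace: " + namespace);
    }
    return datasets;
  }
}
```

Individual map operations are thread-safe, but multi-step operations such as `dropNamespace` are not atomic, which is acceptable for single-threaded tests only.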