Python based visualizations guideline

This document describes the architecture of Python based visualizations, the development guidelines for contributing new predefined visualizations to the Kubeflow Pipelines project, and the current limitations. Python based visualizations are a new method of generating visualizations within Kubeflow Pipelines that allows for rapid development, experimentation, and customization when visualizing results. For information about Python based visualizations and how to use them, please visit the documentation page.

Please check the developer guidelines for additional development guidance.

Architecture

Python based visualizations rely on three parts: the frontend, the API server, and the Python visualization service. The frontend is responsible for creating the visualization request and displaying the results of the created requests. The API server is responsible for translating the request provided by the frontend into a request that the Python visualization service understands, returning the result of the translated request to the frontend, and gracefully handling both incorrectly formatted requests from the frontend and any errors encountered with the Python visualization service. Finally, the Python visualization service is responsible for generating a visualization from a provided request.
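
The division of responsibilities described above can be illustrated with a toy model. This is a minimal sketch and not the project's implementation: the function names, the request fields (`type`, `source`, `arguments`), and the error format are all hypothetical and exist only to show how a request might flow from the frontend through the API server to the visualization service.

```python
import json

# Hypothetical stand-ins for the three parts. None of these names or
# request fields are taken from the Kubeflow Pipelines code base.


def frontend_build_request(vis_type, source, variables):
    """Frontend: create the visualization request."""
    return {"type": vis_type, "source": source, "arguments": json.dumps(variables)}


def visualization_service_generate(service_request):
    """Visualization service: generate HTML from a translated request."""
    return "<p>{type} for {source}</p>".format(**service_request)


def api_server_handle(request):
    """API server: validate and translate the request, call the service,
    and report problems as an error value instead of raising."""
    if not request.get("source"):
        return {"error": "source must not be empty"}
    try:
        variables = json.loads(request.get("arguments") or "{}")
    except ValueError:
        return {"error": "arguments must be valid JSON"}
    service_request = {
        "type": request.get("type", "").lower(),
        "source": request["source"],
        "variables": variables,
    }
    try:
        return {"html": visualization_service_generate(service_request)}
    except Exception as exc:  # any service failure is returned to the frontend
        return {"error": str(exc)}
```

In this sketch a malformed request never reaches the service: the API server returns a dictionary with an `error` key, which the frontend could display in place of the visualization.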

How to create predefined visualizations

  1. Determine if the visualization should become a predefined visualization. Consider the following:
    • How often will it be used?
      • Frequently used visualizations are good candidates for predefined visualizations.
    • How complex is it?
      • The complexity of a visualization can reduce its usability. Predefined visualizations are intended to be powerful and simple. Visualizations that require extensive or complex variables are not good candidates for predefined visualizations.
  2. Fork the Kubeflow Pipelines repository.
  3. Add a new type for the visualization within the visualization.proto file in the backend/api directory.
    • The name of the visualization should be in screaming snake case (that is, VISUALIZATION_NAME).
  4. Run ./generate_api.sh within the backend/api directory to generate the Swagger API definition for the backend.
  5. Download the Swagger Codegen jar file.
    • Currently, version 2.3.1 of the Swagger Codegen jar file is used to generate the frontend API. Should this become out of date, the version can be checked within the VERSION file for the visualization Swagger Codegen directory.
    • This step is only required if the Swagger Codegen jar file is not present in the frontend directory. If you already have the jar file, you can skip steps 6 and 7.
  6. Place the Swagger Codegen jar file in the frontend directory.
  7. Rename the Swagger Codegen jar file to swagger-codegen-cli.jar.
  8. Run npm run apis:visualization within the frontend directory to generate the Swagger API definition for the frontend.
  9. Create a new Python file that will be executed to generate a visualization.
    • Python 3 MUST be used.
    • The new Python file should be created within the backend/src/apiserver/visualization directory. It should have the same name as the type that was created earlier, but in snake case instead of screaming snake case (that is, visualization_name.py).
    • Dependency injection is used to pass variables from the Kubeflow Pipelines UI to a visualization.
      • To obtain a path or path pattern from the Kubeflow Pipelines UI, you can use the following syntax:

        # The variable "source" will be injected to any visualization. The
        # value of "source" will be provided by the Kubeflow Pipelines UI
        # and will never be an empty string.
        ...
        # Open a file with a provided path or path pattern from the
        # Kubeflow Pipelines UI and append DataFrame to an array of
        # DataFrames
        dfs = []
        for f in file_io.get_matching_files(source):
            dfs.append(pd.read_csv(f))
        ...
        # Get a path from the Kubeflow Pipelines UI and create a DataFrame
        df = pd.read_csv(source)
        ...
        • Additional details about how this is implemented can be found in the server.py file.
      • To obtain additional variables from the Pipelines UI, you can use the following syntax:

        # Get a value for a specified key
        key = variables.get("key")
        # Get a value for a specified key with a default
        key = variables.get("key", "default_value")
        # Check if a value for a specified key exists
        if variables.get("key", "default_value") == "default_value":
            # Value for a specified key does not exist
            pass
        else:
            # Value for a specified key does exist
            pass
  10. Add any new dependencies to the requirements.txt file in the backend/src/apiserver/visualization directory.
  11. Add any new dependencies to the third_party_licenses.csv file.
    • The following format is used:
      package_name,url_to_package_license,license_name
      
    • The columns of the csv are as follows:
      • package_name is the name of the package on pypi.
      • url_to_package_license is the url where the license of the package can be downloaded from.
      • license_name is the name of package license.
    • Examples for all the columns can be found in the third_party_licenses.csv file.
  12. Submit these changes as a Pull Request or build a Docker image for usage within your cluster.
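
Step 9 can be sketched as a small, self-contained script. This is only an illustration under stated assumptions: in a real visualization, `source` and `variables` are injected by the visualization service, so here they are passed as ordinary function arguments for local experimentation. The standard library modules `glob` and `csv` stand in for the `file_io` and `pd` calls shown in step 9, and the `render_table` helper and the `max_rows` key are hypothetical names chosen for this example.

```python
import csv
import glob
import html


def render_table(source, variables):
    """Return an HTML table built from every CSV file matching `source`."""
    # Assumption: "max_rows" is a made-up variable name used for illustration.
    max_rows = int(variables.get("max_rows", 10))

    # Collect rows from all files that match the path or path pattern.
    rows = []
    for path in sorted(glob.glob(source)):
        with open(path, newline="") as f:
            rows.extend(csv.reader(f))

    # Escape each cell so that file contents cannot inject markup.
    table_rows = [
        "<tr>" + "".join("<td>{}</td>".format(html.escape(c)) for c in row) + "</tr>"
        for row in rows[:max_rows]
    ]
    return "<table>" + "".join(table_rows) + "</table>"
```

To try it locally, call `render_table("data/*.csv", {"max_rows": 5})` with a pattern that matches your own files. Inside an actual predefined visualization the two arguments would be replaced by the injected `source` and `variables` names.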

Known limitations

  • Multiple visualizations cannot be generated concurrently.

    • This is because a single Python kernel is used to generate visualizations.
    • If visualizations are a major part of your workflow, it is recommended to increase the number of replicas within the visualization deployment YAML file or within the visualization service deployment itself.
      • Please note that this does not directly solve the issue; instead, it decreases the likelihood of experiencing delays when generating visualizations.
  • Visualizations that take longer than 30 seconds will fail to generate.

    • For visualizations that reach the 30 second timeout, you can add the TimeoutValue header to the request made by the frontend. As specified by the gRPC documentation, its value is a positive integer, written as an ASCII string of at most 8 digits, giving the length of time required to generate the visualization.
    • For visualizations that take longer than 100 seconds, you will have to specify a TimeoutValue within the request headers AND change the default kernel timeout of the visualization service. To change the default kernel timeout of the visualization service, set the KERNEL_TIMEOUT environment variable of the visualization service deployment to be the new timeout length in seconds within the visualization deployment YAML file or within the visualization service deployment itself.
    - env:
      - name: KERNEL_TIMEOUT
        value: "100"
  • The HTML content of the generated visualizations cannot be larger than 4MB.

    • By default, gRPC imposes a limit of 4MB on the size of a message that a server can send or receive. To allow visualizations larger than 4MB to be generated, you must manually set MaxCallRecvMsgSize for gRPC. This can be done by editing the options given to the gRPC server within main.go as follows:
    var maxCallRecvMsgSize = 4 * 1024 * 1024
    if serviceName == "Visualization" {
    	// Only change maxCallRecvMsgSize if it is for visualizations
    	maxCallRecvMsgSize = 50 * 1024 * 1024
    }
    opts := []grpc.DialOption{
    	grpc.WithDefaultCallOptions(grpc.MaxCallRecvMsgSize(maxCallRecvMsgSize)),
    	grpc.WithInsecure(),
    }
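
The TimeoutValue constraint mentioned above (a positive integer written as an ASCII string of at most 8 digits) is easy to violate when the header is built by hand. The helper below is a minimal sketch of that check. `format_timeout_value` is a hypothetical name and is not part of Kubeflow Pipelines or gRPC.

```python
def format_timeout_value(seconds):
    """Return `seconds` as a TimeoutValue string, or raise ValueError."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(seconds, bool) or not isinstance(seconds, int):
        raise ValueError("timeout must be an integer")
    if seconds <= 0:
        raise ValueError("timeout must be positive")
    text = str(seconds)
    if len(text) > 8:
        raise ValueError("timeout must have at most 8 digits")
    return text
```

Validating the value before the request is sent keeps a malformed header from surfacing later as a failure that is harder to diagnose.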

Update Python dependencies

  1. Edit requirements.in with additional changes.
  2. Run ./update_requirements.sh to re-resolve dependencies.
  3. Pinned dependencies are in requirements.txt.