This document describes the architecture of Python based visualizations, development guidelines for contributing new predefined visualizations to the Kubeflow Pipelines project, and current limitations. Python based visualizations are a new method of generating visualizations within Kubeflow Pipelines that allows for rapid development, experimentation, and customization when visualizing results. For information about Python based visualizations and how to use them, please visit the documentation page.
See the developer guidelines for additional guidance on development.
Python based visualizations rely on three parts: the frontend, the API server, and the Python visualization service. The frontend is responsible for creating the visualization request and displaying the results of the created requests. The API server is responsible for translating the request provided by the frontend into a request that the Python visualization service understands, and for returning the result of the translated request to the frontend. It also gracefully handles incorrectly formatted requests from the frontend and any errors encountered with the Python visualization service. Finally, the Python visualization service is responsible for generating a visualization from a provided request.
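The division of responsibilities described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `FrontendRequest`, `api_server`, and `visualization_service` are hypothetical names, the request fields are guesses at a plausible shape, and the real components communicate over the network rather than through direct function calls.

```python
from dataclasses import dataclass, field


@dataclass
class FrontendRequest:
    """Hypothetical shape of the request the frontend creates."""
    visualization_type: str                 # for example "VISUALIZATION_NAME"
    source: str                             # path or path pattern
    arguments: dict = field(default_factory=dict)


def visualization_service(module_name, source, variables):
    """Stand-in for the Python visualization service: returns HTML."""
    if not source:
        raise ValueError("source must not be empty")
    return f"<p>{module_name}: {source} ({len(variables)} variables)</p>"


def api_server(request):
    """Stand-in for the API server: validate, translate, forward, report errors."""
    if not isinstance(request, FrontendRequest) or not request.visualization_type:
        return {"error": "incorrectly formatted request"}
    try:
        html = visualization_service(
            request.visualization_type.lower(),  # VISUALIZATION_NAME -> visualization_name
            request.source,
            dict(request.arguments),
        )
    except Exception as exc:
        return {"error": f"visualization service error: {exc}"}
    return {"html": html}


# The frontend would display "html" on success or surface "error" otherwise.
ok = api_server(FrontendRequest("VISUALIZATION_NAME", "path/to/*.csv", {"key": "value"}))
failed = api_server(FrontendRequest("VISUALIZATION_NAME", ""))
```

The point of the sketch is that the API server sits between the other two parts, so both malformed requests and service failures are turned into a response that the frontend can display.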
- Determine if the visualization should become a predefined visualization.
  Consider the following:
  - How often will it be used?
    - Frequently used visualizations are good candidates for predefined visualizations.
  - How complex is it?
    - The complexity of a visualization can reduce its usability. Predefined visualizations are intended to be powerful and simple. Visualizations that require extensive or complex variables are not good candidates for predefined visualizations.
- Fork the Kubeflow Pipelines repository.
- Add a new type for the visualization within the visualization.proto file in the `backend/api` directory.
  - The name of the visualization should be in screaming snake case (that is, `VISUALIZATION_NAME`).
- Run `./generate_api.sh` within the `backend/api` directory to generate the Swagger API definition for the backend.
- Download the Swagger Codegen jar file.
  - Currently, version 2.3.1 of the Swagger Codegen jar file is used to generate the frontend API. Should this become out of date, the version can be checked within the VERSION file for the visualization Swagger Codegen directory.
  - This step is only required if the Swagger Codegen jar file is not present in the `frontend` directory. If you already have the jar file there, you can skip this step and the next two steps (placing and renaming the jar file).
- Place the Swagger Codegen jar file in the `frontend` directory.
- Rename the Swagger Codegen jar file to swagger-codegen-cli.jar.
- Run `npm run apis:visualization` within the `frontend` directory to generate the Swagger API definition for the frontend.
- Create a new Python file that will be executed to generate a visualization.
  - Python 3 MUST be used.
  - The new Python file should be created within the `backend/src/apiserver/visualization` directory. It should have the same name as the type that was created earlier, written in snake case instead of screaming snake case (that is, `visualization_name.py`).
  - Dependency injection is used to pass variables from the Kubeflow Pipelines UI to a visualization.
    - To obtain a path or path pattern from the Kubeflow Pipelines UI, you can use the following syntax:

      ```python
      # Imports assumed by the names used below (pd, file_io).
      import pandas as pd
      from tensorflow.python.lib.io import file_io

      # The variable "source" will be injected into any visualization. The
      # value of "source" will be provided by the Kubeflow Pipelines UI
      # and will never be an empty string.
      ...
      # Open every file matching a path or path pattern provided by the
      # Kubeflow Pipelines UI and append each DataFrame to a list of
      # DataFrames
      dfs = []
      for f in file_io.get_matching_files(source):
          dfs.append(pd.read_csv(f))
      ...
      # Get a path from the Kubeflow Pipelines UI and create a DataFrame
      df = pd.read_csv(source)
      ...
      ```

      - Additional details about how this is implemented can be found in the server.py file.
    - To obtain additional variables from the Kubeflow Pipelines UI, you can use the following syntax:

      ```python
      # Get a value for a specified key
      key = variables.get("key")
      # Get a value for a specified key with a default
      key = variables.get("key", "default_value")
      # Check if a value for a specified key exists. Compare with "==",
      # not "is": identity of equal strings is not guaranteed.
      if variables.get("key", "default_value") == "default_value":
          # Value for a specified key does not exist
          pass
      else:
          # Value for a specified key does exist
          pass
      ```

      - Additional details about how this is implemented can be found in the exporter.py file and the Python documentation.
- Add any new dependencies to the requirements.txt file in the `backend/src/apiserver/visualization` directory.
- Add any new dependencies to the third_party_licenses.csv file.
  - The following format is used: `package_name,url_to_package_license,license_name`
  - The columns of the CSV are as follows:
    - `package_name` is the name of the package on PyPI.
    - `url_to_package_license` is the URL from which the license of the package can be downloaded.
    - `license_name` is the name of the package license.
  - Examples for all the columns can be found in the third_party_licenses.csv file.
- Submit these changes as a pull request, or build a Docker image for use within your cluster.
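To make the dependency-injection step more concrete, here is a minimal sketch of the kind of logic a visualization file might contain. It is an assumption-laden illustration rather than the project's implementation: in the real service `source` and `variables` are injected, whereas here they are passed as function arguments so the code can run locally. It uses the standard library's `glob` and `csv` in place of `file_io.get_matching_files` and `pd.read_csv`, and `build_table_html` and the `max_rows` variable are hypothetical names.

```python
import csv
import glob
import html


def build_table_html(source, variables):
    """Render the first rows of every CSV file matching `source` as an HTML table."""
    # Optional variable with a default, mirroring variables.get("key", "default_value").
    max_rows = int(variables.get("max_rows", 10))

    # Collect rows from every file that matches the path or path pattern.
    rows = []
    for path in sorted(glob.glob(source)):
        with open(path, newline="") as handle:
            rows.extend(csv.reader(handle))

    # Escape cell contents so that data cannot inject markup into the output.
    table_rows = (
        "<tr>" + "".join(f"<td>{html.escape(cell)}</td>" for cell in row) + "</tr>"
        for row in rows[:max_rows]
    )
    return "<table>" + "".join(table_rows) + "</table>"
```

Keeping the number of variables small, as in this sketch, is in line with the guidance that predefined visualizations should stay simple.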
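Before submitting, it can help to check that new entries in third_party_licenses.csv follow the three-column format described above. The following is a small sketch of such a check, not a tool that ships with the project. `check_license_rows` is a hypothetical helper, and the requirement that the license URL starts with `http://` or `https://` is an assumption rather than a documented rule.

```python
import csv
import io

EXPECTED_COLUMNS = 3  # package_name, url_to_package_license, license_name


def check_license_rows(csv_text):
    """Return (line_number, problem) pairs for rows that break the expected format."""
    problems = []
    for line_number, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        if len(row) != EXPECTED_COLUMNS:
            problems.append((line_number, f"expected 3 columns, found {len(row)}"))
            continue
        name, url, license_name = (column.strip() for column in row)
        if not name or not license_name:
            problems.append((line_number, "empty package or license name"))
        # Assumption: license URLs are plain http(s) links.
        if not url.startswith(("http://", "https://")):
            problems.append((line_number, "license URL is not an http(s) URL"))
    return problems
```

An empty result means every row has the expected shape; it does not verify that the URL actually resolves to the package's license.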
- Multiple visualizations cannot be generated concurrently.
  - This is because a single Python kernel is used to generate visualizations.
  - If visualizations are a major part of your workflow, it is recommended to increase the number of replicas within the visualization deployment YAML file or within the visualization service deployment itself.
    - Please note that this does not directly solve the issue; instead, it decreases the likelihood of experiencing delays when generating visualizations.
- Visualizations that take longer than 30 seconds will fail to generate.
  - For visualizations where the 30 second timeout is reached, you can add the TimeoutValue header to the request made by the frontend. As specified by the gRPC documentation, its value is a positive integer, written as an ASCII string of at most 8 digits, that gives the length of time required to generate the visualization.
  - For visualizations that take longer than 100 seconds, you will have to specify a TimeoutValue within the request headers AND change the default kernel timeout of the visualization service. To change the default kernel timeout of the visualization service, set the KERNEL_TIMEOUT environment variable of the visualization service deployment to the new timeout length in seconds, either within the visualization deployment YAML file or within the visualization service deployment itself.

    ```yaml
    - env:
        - name: KERNEL_TIMEOUT
          # Kubernetes requires environment variable values to be strings.
          value: "100"
    ```
- The HTML content of the generated visualizations cannot be larger than 4MB.
  - gRPC by default imposes a limit of 4MB as the maximum size that can be sent and received by a server. To allow visualizations larger than 4MB to be generated, you must manually set MaxCallRecvMsgSize for gRPC. This can be done by editing the options given to the gRPC server within main.go to:

    ```go
    var maxCallRecvMsgSize = 4 * 1024 * 1024
    if serviceName == "Visualization" {
        // Only change the maxCallRecvMsgSize if it is for visualizations
        maxCallRecvMsgSize = 50 * 1024 * 1024
    }
    opts := []grpc.DialOption{
        grpc.WithDefaultCallOptions(grpc.MaxCallRecvMsgSize(maxCallRecvMsgSize)),
        grpc.WithInsecure(),
    }
    ```
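The timeout and size limits above can be summarized as a small preflight check. This is a sketch for reasoning about which settings a given visualization would need, not code from the project. `timeout_value` and `required_changes` are hypothetical helpers, the thresholds are the 30 second, 100 second, and 4MB figures quoted in the text, and how the header is actually attached to the frontend request is outside the scope of the sketch.

```python
REQUEST_TIMEOUT_S = 30                    # request timeout named in the text
KERNEL_TIMEOUT_THRESHOLD_S = 100          # beyond this, KERNEL_TIMEOUT must also change
GRPC_DEFAULT_MAX_BYTES = 4 * 1024 * 1024  # default gRPC message size limit (4MB)


def timeout_value(seconds):
    """Format a timeout as a positive integer ASCII string of at most 8 digits."""
    if isinstance(seconds, bool) or not isinstance(seconds, int) or seconds <= 0:
        raise ValueError("timeout must be a positive integer")
    text = str(seconds)
    if len(text) > 8:
        raise ValueError("timeout must have at most 8 digits")
    return text


def required_changes(expected_seconds, html_size_bytes):
    """List the settings that need attention for a visualization of this cost."""
    changes = []
    if expected_seconds > REQUEST_TIMEOUT_S:
        changes.append("set TimeoutValue to " + timeout_value(expected_seconds))
    if expected_seconds > KERNEL_TIMEOUT_THRESHOLD_S:
        changes.append("raise KERNEL_TIMEOUT")
    if html_size_bytes > GRPC_DEFAULT_MAX_BYTES:
        changes.append("raise MaxCallRecvMsgSize")
    return changes
```

For example, a visualization expected to take 120 seconds and produce 5MB of HTML would trigger all three entries, while one that takes 10 seconds and produces a few kilobytes would trigger none.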
- Edit `requirements.in` with additional changes.
- Run `./update_requirements.sh` to re-resolve dependencies.
- Pinned dependencies are in `requirements.txt`.
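After re-resolving, one way to confirm that every package named in `requirements.in` ended up pinned in `requirements.txt` is a quick comparison of the two files. The sketch below assumes both files use the usual pip requirement syntax with `==` pins in `requirements.txt`; that format, and the helper names `package_names` and `unpinned`, are assumptions and not something the update script is documented to guarantee.

```python
import re


def package_names(requirements_text):
    """Extract normalized package names from pip-style requirement lines."""
    names = set()
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments
        if not line or line.startswith("-"):   # skip blanks and options such as --hash
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            # Normalize so that scikit_learn and scikit-learn compare equal.
            names.add(re.sub(r"[-_.]+", "-", match.group(0)).lower())
    return names


def unpinned(requirements_in, requirements_txt):
    """Return packages listed in requirements.in that have no '==' pin."""
    pinned = set()
    for line in requirements_txt.splitlines():
        line = line.split("#", 1)[0].strip()
        if "==" in line:
            pinned |= package_names(line.split("==")[0])
    return sorted(package_names(requirements_in) - pinned)
```

An empty list means every direct dependency has a pinned version; transitive dependencies that appear only in `requirements.txt` are ignored by this check.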