This repository compiles prescriptive guidance and code samples that show how to operationalize the Google Research T5X framework using Google Cloud Vertex AI. Using T5X with Vertex AI enables streamlined experimentation, development, and deployment of natural language processing (NLP) solutions at scale.
The guidance assumes that you're familiar with ML concepts such as large language models (LLMs), and that you're generally familiar with Google Cloud features like Cloud Storage, Cloud TPUs, and Google Vertex AI.
T5X is a machine learning (ML) framework for developing high-performance sequence models, including large language models (LLMs). For more information about T5X, see the following resources:
T5X is a JAX-based library for training, evaluating, and inferring with sequence models. Its primary focus is Transformer-type language models. You can use T5X to pretrain language models and to fine-tune pretrained language models. The T5X GitHub repo includes references to a large number of pretrained Transformer models, including the T5 and Switch Transformer families of models.
T5X is streamlined, modular, and composable. You can implement pretraining, fine-tuning, evaluating, and inferring by configuring reusable components that are provided by T5X rather than having to develop custom Python modules.
Vertex AI is Google Cloud's unified ML platform that's designed to help data scientists and ML engineers increase their velocity of experimentation, deploy faster, and manage models with confidence.
.
├── configs
├── docs
├── examples
├── notebooks
├── tasks
├── Dockerfile
└── README.md
- /notebooks: Example notebooks demonstrating T5X fine-tuning, evaluating, and inferring scenarios.
- /configs: Configuration files for the scenarios demonstrated in notebooks.
- /scripts: Vertex AI Training T5X job configuration templates for selected fine-tuning, evaluating, or inferring scenarios.
- /tasks: Python modules implementing custom SeqIO Tasks.
- /docs: Technical guides compiling best practices for running T5X on Vertex AI.
The main folder also includes Dockerfiles for custom container images used by Vertex Training.
This section outlines the steps to configure the Google Cloud environment that is required to run the code samples in this repo.
- You use a user-managed instance of Vertex AI Workbench as your development environment and the primary interface to Vertex AI services.
- You run T5X training, evaluating, and inferring tasks as Vertex Training custom jobs using a custom training container image.
- You use Vertex AI Experiments and Vertex AI Tensorboard for job monitoring and experiment tracking.
- You use a regional Cloud Storage bucket to manage artifacts created by T5X jobs.
To set up the environment, complete the following steps.
In the Google Cloud Console, on the project selector page, select or create a Google Cloud project. You need to be a project owner in order to set up the environment.
From Cloud Shell, run the following commands to enable the required Cloud APIs:
export PROJECT_ID=<YOUR_PROJECT_ID>
gcloud config set project $PROJECT_ID
gcloud services enable \
cloudbuild.googleapis.com \
compute.googleapis.com \
cloudresourcemanager.googleapis.com \
iam.googleapis.com \
container.googleapis.com \
cloudapis.googleapis.com \
cloudtrace.googleapis.com \
containerregistry.googleapis.com \
iamcredentials.googleapis.com \
monitoring.googleapis.com \
logging.googleapis.com \
notebooks.googleapis.com \
aiplatform.googleapis.com \
storage.googleapis.com
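After the command completes, it can be useful to confirm that the APIs are actually enabled. The following is a minimal sketch and is not part of the repository: missing_services is a hypothetical helper that reads enabled service names on standard input, one per line, and prints every required service that is absent. The usage comment assumes that gcloud services list prints service names with the value(config.name) format.

```shell
# Hypothetical helper: print every required service (passed as arguments)
# that is missing from the list of enabled services read from stdin.
missing_services() {
  local enabled svc
  enabled="$(cat)"
  for svc in "$@"; do
    # -x matches the whole line, -F treats the name as a fixed string.
    if ! printf '%s\n' "$enabled" | grep -qxF "$svc"; then
      printf '%s\n' "$svc"
    fi
  done
}

# Example usage (requires gcloud and an active project):
#   gcloud services list --enabled --format='value(config.name)' \
#     | missing_services aiplatform.googleapis.com notebooks.googleapis.com
```

An empty result means that every service passed as an argument was found in the enabled list.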
Note: When you work with Vertex AI user-managed notebooks, make sure that all the services you use are provisioned in the same project and in the same compute region, and that the region is one where Vertex AI TPU pods are available. For a list of regions where TPU pods are available, see Locations in the Vertex AI documentation.
Some notebooks demonstrate scenarios that require as many as 128 TPU cores.
If you need an increase in Vertex AI TPU quota values, follow these steps:
- In the Cloud Console, navigate to the Quotas tab of the Vertex AI API page.
- In the Enter property name or value box that's next to the Filter label, add a filter that has the following conditions:
- Quota: Custom model training TPU V2 cores per region or Custom model training TPU V3 cores per region
- Dimensions (e.g. location): Region: <YOUR_REGION>
Note: Vertex AI TPUs are not available in all regions. If the Limit value in the listing is 8, TPUs are available, and you can request more by increasing the Quota value. If the Limit value is 0, no TPUs are available, and the Quota value cannot be changed.
- In the listing, select the quota that matches your filter criteria and then click Edit Quotas.
- In the New limit box, enter the required value and then submit the quota change request.
A quota increase doesn't directly affect your billing, because you still specify the number of TPU cores each time you submit a T5X task. Only tasks submitted with a large number of TPU cores result in higher billing.
You can create a user-managed notebooks instance from the command line.
Note: Make sure that you're following these steps in the same project as before.
In Cloud Shell, enter the following command. For <YOUR_INSTANCE_NAME>, enter a name that starts with a lowercase letter followed by lowercase letters, numbers, or dashes. For <YOUR_LOCATION>, add a zone (for example, us-central1-a or europe-west4-a).
PROJECT_ID=$(gcloud config list --format 'value(core.project)')
INSTANCE_NAME=<YOUR_INSTANCE_NAME>
LOCATION=<YOUR_LOCATION>
gcloud notebooks instances create $INSTANCE_NAME \
--vm-image-project=deeplearning-platform-release \
--vm-image-family=common-cpu-notebooks \
--machine-type=n1-standard-4 \
--location=$LOCATION
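The naming rule described above can be checked before the instance is created. The following is a minimal sketch: is_valid_instance_name is a hypothetical helper, not part of gcloud or of this repository, and it encodes only the rule stated in this section (a lowercase letter followed by lowercase letters, numbers, or dashes).

```shell
# Hypothetical helper: succeed only if the name starts with a lowercase
# letter and continues with lowercase letters, digits, or dashes.
is_valid_instance_name() {
  # LC_ALL=C keeps the [a-z] range limited to ASCII lowercase letters.
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z][a-z0-9-]*$'
}

# Example usage:
#   is_valid_instance_name "$INSTANCE_NAME" || echo "Invalid instance name" >&2
```

Other naming limits, such as a maximum length, might also apply and are not checked here.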
Vertex AI Workbench creates a user-managed notebooks instance based on the properties that you specified and then automatically starts the instance. When the instance is ready to use, Vertex AI Workbench activates an Open JupyterLab link next to the instance name in the Vertex AI Workbench Cloud Console page. To connect to your user-managed notebooks instance, click Open JupyterLab.
After JupyterLab launches on the Vertex AI Workbench user-managed notebooks instance, perform the following steps:
- On the Launcher page, start a new terminal session by clicking the Terminal icon.
- Clone the repository to your notebook instance:
git clone https://github.com/GoogleCloudPlatform/t5x-on-vertex-ai.git
- Install code dependencies:
cd t5x-on-vertex-ai
pip install -U pip
pip install "google-cloud-aiplatform[tensorboard]" tfds-nightly "t5[gcp]"
- Build the base T5X container image in Container Registry. For <YOUR_PROJECT_ID>, use the ID of the Google Cloud project that you are working with.
export PROJECT_ID=<YOUR_PROJECT_ID>
gcloud config set project $PROJECT_ID
IMAGE_NAME=t5x-base
IMAGE_URI=gcr.io/${PROJECT_ID}/${IMAGE_NAME}
gcloud builds submit --timeout "2h" --tag ${IMAGE_URI} . --machine-type=e2-highcpu-8
The notebooks in the repo require access to a Cloud Storage bucket that's used for staging and for managing ML artifacts created by the jobs submitted. The bucket must be in the same Google Cloud region as the region you will use to run Vertex AI custom jobs.
- In the JupyterLab terminal, create the bucket. For <YOUR_REGION>, specify the region. For <YOUR_BUCKET_NAME>, use a globally unique name.
REGION=<YOUR_REGION>
BUCKET_NAME=<YOUR_BUCKET_NAME>
gsutil mb -l $REGION -p $PROJECT_ID gs://$BUCKET_NAME
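Because the bucket, the notebook instance, and the Vertex AI custom jobs are expected to share a region, a quick consistency check can catch a mismatch early. The following is a minimal sketch: region_from_zone and zone_in_region are hypothetical helpers, and they assume that a zone name is its region name plus a final dash-separated suffix, as in the us-central1-a example above.

```shell
# Hypothetical helper: derive a region from a zone by dropping the final
# "-<suffix>" part, for example us-central1-a -> us-central1.
region_from_zone() {
  printf '%s\n' "${1%-*}"
}

# Hypothetical helper: succeed only if the zone ($1) belongs to the region ($2).
zone_in_region() {
  [ "$(region_from_zone "$1")" = "$2" ]
}

# Example usage, with the zone that was used for the notebook instance:
#   zone_in_region us-central1-a "$REGION" || echo "Zone and region differ" >&2
```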
In the JupyterLab terminal, create the Vertex AI Tensorboard instance:
DISPLAY_NAME=<YOUR_INSTANCE_NAME>
gcloud ai tensorboards create --display-name $DISPLAY_NAME --project $PROJECT_ID --region=$REGION
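If a later step asks for the Tensorboard ID rather than the display name, the ID can be read from the full resource name. The following is a minimal sketch: tensorboard_id_from_name is a hypothetical helper, and it assumes the standard Vertex AI resource name layout, in which the ID is the last path segment.

```shell
# Hypothetical helper: return the last path segment of a resource name such
# as projects/<number>/locations/<region>/tensorboards/<id>.
tensorboard_id_from_name() {
  printf '%s\n' "${1##*/}"
}

# Example usage (requires gcloud; prints one ID per Tensorboard instance):
#   gcloud ai tensorboards list --region="$REGION" --format='value(name)' \
#     | while read -r name; do tensorboard_id_from_name "$name"; done
```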
Before you walk through the example notebooks, make sure that you pre-build all the required TensorFlow Datasets (TFDS) datasets.
From the JupyterLab terminal:
BUCKET_NAME=<YOUR_BUCKET_NAME>
export TFDS_DATA_DIR=gs://${BUCKET_NAME}/datasets
tfds build --data_dir $TFDS_DATA_DIR --experimental_latest_version squad
tfds build --data_dir $TFDS_DATA_DIR --experimental_latest_version wmt_t2t_translate
tfds build --data_dir $TFDS_DATA_DIR --experimental_latest_version cnn_dailymail
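The three builds above can also be driven from a single loop, which makes it easier to add datasets or to preview the commands first. The following is a minimal sketch: build_tfds_datasets and the DRY_RUN variable are hypothetical and are not part of TFDS or of this repository. The function runs the same tfds build command that is shown above, once per dataset name.

```shell
# Hypothetical helper: print (when DRY_RUN=1) or run one "tfds build"
# command per dataset name, using the same flags as the commands above.
build_tfds_datasets() {
  local data_dir name
  data_dir="$1"
  shift
  for name in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      printf 'tfds build --data_dir %s --experimental_latest_version %s\n' \
        "$data_dir" "$name"
    else
      # Stop at the first failed build so that the error is not overlooked.
      tfds build --data_dir "$data_dir" --experimental_latest_version "$name" \
        || return 1
    fi
  done
}

# Example usage:
#   DRY_RUN=1 build_tfds_datasets "$TFDS_DATA_DIR" squad wmt_t2t_translate cnn_dailymail
```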
To build xsum you need to download and prepare the source data manually.
- Follow the instructions to create the xsum-extracts-from-downloads folder with source data.
- Create a tar archive from the xsum-extracts-from-downloads folder.
tar -czvf xsum-extracts-from-downloads.tar.gz xsum-extracts-from-downloads/
- Copy the archive to the TFDS manual downloads folder.
gsutil cp -r xsum-extracts-from-downloads.tar.gz ${TFDS_DATA_DIR}/downloads/manual/
- Build the dataset.
tfds build --data_dir $TFDS_DATA_DIR --experimental_latest_version xsum
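Before you copy the archive, it can help to confirm that it has the layout produced by the tar command above, with every entry under the xsum-extracts-from-downloads folder. The following is a minimal sketch: archive_has_single_root is a hypothetical helper, and it checks only the top-level folder name, not the contents that TFDS expects.

```shell
# Hypothetical helper: succeed only if every entry in the gzip-compressed
# tar archive ($1) is the expected top-level folder ($2) or a path under it.
archive_has_single_root() {
  local entries
  entries="$(tar -tzf "$1")" || return 1
  [ -n "$entries" ] || return 1
  # grep -v selects entries that fall outside the folder; any hit is a failure.
  ! printf '%s\n' "$entries" | grep -qvE "^$2(/|\$)"
}

# Example usage:
#   archive_has_single_root xsum-extracts-from-downloads.tar.gz \
#     xsum-extracts-from-downloads || echo "Unexpected archive layout" >&2
```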
The environment is ready.
Start by reading the Running and monitoring T5X jobs with Vertex AI guide and walking through the Getting Started notebook.
If you have any questions or find any problems with this repository, please report them through GitHub issues.