
feat: add support for gpu_sharing_config on nodepool #1874

Merged · 9 commits · May 24, 2024

Conversation

@jimgus (Contributor) commented Feb 13, 2024

Fixes #1506
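
(For context, issue #1506 asks for GPU sharing settings on module-managed node pools. The sketch below shows the kind of node_pools entry this feature enables; the gpu_sharing_strategy and max_shared_clients_per_gpu key names come from the issue title, while the module source and all other values are illustrative assumptions, not taken from this PR.)

module "gke" {
  source = "terraform-google-modules/kubernetes-engine/google"
  # project_id, name, region, and network inputs omitted for brevity

  node_pools = [
    {
      name                       = "gpu-pool"
      machine_type               = "n1-standard-4"
      accelerator_type           = "nvidia-tesla-t4"
      accelerator_count          = 1
      gpu_sharing_strategy       = "TIME_SHARING" # assumed key name, per issue #1506
      max_shared_clients_per_gpu = 2              # assumed key name, per issue #1506
    },
  ]
}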

@jimgus requested review from ericyz and a team as code owners · February 13, 2024 11:33
@jimgus (Contributor, Author) commented Feb 27, 2024

@ericyz Would it be possible to get a review on this PR?

@NissesSenap (Contributor) commented:

@apeabody do you have time to look at this?

@jimgus requested a review from gtsorbo as a code owner · March 13, 2024 16:53
@apeabody (Collaborator) commented:

/gcbrun

@apeabody (Collaborator) commented:

Integration test results:

-----> Setting up <node-pool-local>...
       Finished setting up <node-pool-local> (0m0.00s).
-----> Verifying <node-pool-local>...
$$$$$$ Reading the Terraform input variables from the Kitchen instance state...
$$$$$$ Finished reading the Terraform input variables from the Kitchen instance state.
$$$$$$ Reading the Terraform output variables from the Kitchen instance state...
$$$$$$ Finished reading the Terraform output variables from the Kitchen instance state.
$$$$$$ Verifying the systems...
$$$$$$ Verifying the 'node_pool' system...
WARN: Unresolved or ambiguous specs during Gem::Specification.reset:
      racc (>= 0)
      Available/installed versions of this gem:
      - 1.7.3
      - 1.6.2
WARN: Clearing out unresolved specs. Try 'gem cleanup <gem>'
Please report a bug if this causes problems.

Profile:   node_pool
Version:   (not specified)
Target:    local://
Target ID: ad86e15389e971002617c67a58841346aa7359db6f29de507ad5c7314325519e

  ×  gcloud: Google Compute Engine GKE configuration (1 failed)
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` exit_status is expected to eq 0
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` stderr is expected to eq ""
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` cluster-autoscaling has the expected cluster autoscaling settings
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools has 3
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-01 exists
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-01 is the expected machine type
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-01 has the expected image type
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-01 has autoscaling enabled
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-01 has the expected minimum node count
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-01 has autorepair enabled
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-01 has automatic upgrades enabled
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-01 has the expected metadata
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-01 has the expected labels
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-01 has the expected network tags
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-01 has the expected linux node config sysctls
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-02 exists
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-02 is the expected machine type
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-02 has autoscaling enabled
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-02 has the expected minimum node count
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-02 has the expected maximum node count
     ×  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-02 has the expected accelerators
     expected [{"config" => {"diskSizeGb" => 100, "diskType" => "pd-balanced", "imageType" => "COS_CONTAINERD", "loggingCon...RUNNING", "upgradeSettings" => {"maxSurge" => 1, "strategy" => "SURGE"}, "version" => "1.27.8-gke.1067004"}] to include (including {"name" => "pool-02", "config" => (including {"accelerators" => [{"acceleratorCount" => "1", "acceleratorType" => "nvidia-tesla-p4"}]})})
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-02 has the expected disk size
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-02 has the expected disk type
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-02 has the expected image type
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-02 has the expected labels
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-02 has the expected network tags
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-02 has the expected linux node config sysctls
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-03 exists
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-03 is the expected machine type
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-03 has autoscaling disabled
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-03 has the expected node count
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-03 has autorepair enabled
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-03 has automatic upgrades enabled
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-03 has the expected labels
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-03 has the expected network tags
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-03 has the expected pod range
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-03 has the expected image
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-03 has the expected kubelet config
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` node pools pool-03 has the expected linux node config sysctls
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` pool-03 has nodes in correct locations
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` exit_status is expected to eq 0
     ✔  Command: `gcloud beta --project=ci-gke-ed295aed-fd8n container clusters --zone=europe-west4 describe node-pool-cluster-ml00 --format=json` stderr is expected to eq ""
  ✔  kubectl: Kubernetes configuration
     ✔  kubernetes nodes pool-01 has the expected taints
     ✔  kubernetes nodes pool-02 has the expected taints
     ✔  kubernetes nodes pool-03 has the expected taints


Profile Summary: 1 successful control, 1 control failure, 0 controls skipped
Test Summary: 44 successful, 1 failure, 0 skipped
>>>>>> Verifying the 'node_pool' system failed:
	Running InSpec failed:
		Running InSpec failed due to a non-zero exit code of 1.

@jimgus (Contributor, Author) commented Mar 13, 2024

OK, I should apparently not have updated the examples. Didn't realize they were part of the tests. Will revert those changes.

@apeabody (Collaborator) commented Mar 13, 2024

> OK, I should apparently not have updated the examples. Didn't realize they were part of the tests. Will revert those changes.

Thanks @jimgus - Alternatively, you could update the tests to match the updated examples.

@apeabody (Collaborator) commented:

/gcbrun

@github-actions (bot) commented:

This PR is stale because it has been open for 60 days with no activity. Remove the Stale label or comment, or this will be closed in 7 days.

@github-actions bot added the Stale label May 14, 2024
@apeabody (Collaborator) left a comment

@github-actions bot removed the Stale label May 15, 2024
@SamuZad commented May 22, 2024

It seems like @jimgus has not been active here for a couple of months 😭, but it would be amazing to get this over the line.

@apeabody, would it be possible to finish this off? I'd be happy to submit a new PR as well, of course. Whatever helps this move forward, ideally without taking away from Jim's contribution.

@apeabody (Collaborator) commented:

/gcbrun

@apeabody (Collaborator) commented:

> It seems like @jimgus has not been active here for a couple of months 😭, but it would be amazing to get this over the line.
>
> @apeabody, would it be possible to finish this off? I'd be happy to submit a new PR as well, of course. Whatever helps this move forward, ideally without taking away from Jim's contribution.

Hi @SamuZad - Let me run the tests on this PR to see if it could be chained with a second PR.

@jimgus (Contributor, Author) commented May 23, 2024

Sorry for not being able to work on this issue for a while. I made an update to the README as suggested by @apeabody.

@apeabody (Collaborator) commented:

/gcbrun

@apeabody (Collaborator) commented:

/gcbrun

@apeabody (Collaborator) commented:

Going to re-trigger the test:

Step #32 - "apply simple-regional-with-networking-local": TestSimpleRegionalWithNetworking 2024-05-23T19:58:50Z command.go:185: 
Step #32 - "apply simple-regional-with-networking-local": TestSimpleRegionalWithNetworking 2024-05-23T19:58:50Z command.go:185: Error: NodePool default-node-pool was created in the error state "ERROR"
Step #32 - "apply simple-regional-with-networking-local": TestSimpleRegionalWithNetworking 2024-05-23T19:58:50Z command.go:185: 
Step #32 - "apply simple-regional-with-networking-local": TestSimpleRegionalWithNetworking 2024-05-23T19:58:50Z command.go:185:   with module.example.module.gke.google_container_node_pool.pools["default-node-pool"],
Step #32 - "apply simple-regional-with-networking-local": TestSimpleRegionalWithNetworking 2024-05-23T19:58:50Z command.go:185:   on ../../../cluster.tf line 462, in resource "google_container_node_pool" "pools":
Step #32 - "apply simple-regional-with-networking-local": TestSimpleRegionalWithNetworking 2024-05-23T19:58:50Z command.go:185:  462: resource "google_container_node_pool" "pools" {
Step #32 - "apply simple-regional-with-networking-local": TestSimpleRegionalWithNetworking 2024-05-23T19:58:50Z command.go:185: 
Step #32 - "apply simple-regional-with-networking-local": TestSimpleRegionalWithNetworking 2024-05-23T19:58:50Z retry.go:99: Returning due to fatal error: FatalError{Underlying: error while running command: exit status 1; 

@apeabody self-assigned this May 23, 2024
@apeabody (Collaborator) commented:

/gcbrun

@apeabody (Collaborator) left a comment:

Thanks for the contribution @jimgus!

@apeabody (Collaborator) commented:

Confirmed gpu_sharing_config is present in TPG (Terraform Provider Google) v5.9.0.
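
(For reference, in the google provider this setting maps to the gpu_sharing_config block nested under node_config.guest_accelerator on google_container_node_pool. The sketch below shows the provider-level shape; resource names and values are illustrative assumptions, not taken from this PR.)

resource "google_container_node_pool" "gpu_pool" {
  name     = "gpu-pool"
  cluster  = google_container_cluster.primary.name
  location = "europe-west4"

  node_config {
    machine_type = "n1-standard-4"

    guest_accelerator {
      type  = "nvidia-tesla-t4"
      count = 1

      # One physical GPU shared by up to two pods via time-sharing
      gpu_sharing_config {
        gpu_sharing_strategy       = "TIME_SHARING"
        max_shared_clients_per_gpu = 2
      }
    }
  }
}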

@apeabody merged commit b57387c into terraform-google-modules:master on May 24, 2024 · 4 checks passed
@SamuZad commented May 24, 2024

Thank you both so much for the swift turnaround! You have my (and others') eternal gratitude.

Successfully merging this pull request may close these issues.

gpu nodepool with gpu_sharing_strategy & max_shared_clients_per_gpu not possible