Workbench Stuck Provisioning

I created a Vertex AI Workbench instance, installed Python libraries, ran my model training, and so on. After the second or third startup, though, the instance gets stuck at "Provisioning". This has now happened three times: each time I set up a new workbench, and after a few startups it could no longer boot and got stuck at "Provisioning".

In the Health tab, the System Health Status says Unhealthy, but all of the other statuses show Healthy. Checking the logs, the entry below is the only error I've found. I haven't changed the configuration of the workbench since I started it, and the principal does have the Compute Admin (v1) and Compute Admin roles, so I don't understand why this is breaking.

{
  "protoPayload": {
    "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
    "status": {
      "code": 7,
      "message": "Required 'Current principal doesn't have permission to mutate this resource!' permission for '[INSTANCE-NAME-REDACTED]'"
    },
    "authenticationInfo": {
      "principalEmail": "-----------------------------@---------------com",
      "serviceAccountDelegationInfo": [
        {
          "firstPartyPrincipal": {
            "principalEmail": "service----------------@--------------com"
          }
        }
      ],
      "principalSubject": "serviceAccount:------------------------------@---------------com"
    },
    "requestMetadata": {
      "callerIp": "##.##.###.###",
      "callerSuppliedUserAgent": "google-cloud-sdk gcloud/469.0.0 command/gcloud.compute.instances.remove-metadata invocation-id/####################### environment/GCE environment-version/None client-os/LINUX client-os-ver/5.10.0 client-pltf-arch/x86_64 interactive/False from-script/True python/3.11.8 term/ (Linux 5.10.0-28-cloud-amd64),gzip(gfe)",
      "callerNetwork": "//############com/projects/-----------------------------/global/networks/__unknown__",
      "requestAttributes": {
        "time": "2024-04-11T21:21:11.897314Z",
        "auth": {}
      },
      "destinationAttributes": {}
    },
    "serviceName": "############com",
    "methodName": "v1.compute.instances.setMetadata",
    "authorizationInfo": [
      {
        "resource": "projects/-----------------------------/zones/us-west1-a/instances/[INSTANCE-NAME-REDACTED]",
        "permission": "compute.instances.setMetadata",
        "granted": true,
        "resourceAttributes": {
          "service": "compute",
          "name": "projects/-----------------------------/zones/us-west1-a/instances/[INSTANCE-NAME-REDACTED]",
          "type": "compute.instances"
        }
      }
    ],
    "resourceName": "projects/-----------------------------/zones/us-west1-a/instances/[INSTANCE-NAME-REDACTED]",
    "request": {
      "@type": "type.googleapis.com/compute.instances.setMetadata"
    },
    "response": {
      "error": {
        "errors": [
          {
            "domain": "global",
            "reason": "forbidden",
            "message": "Required 'Current principal doesn't have permission to mutate this resource!' permission for '[INSTANCE-NAME-REDACTED]'"
          }
        ],
        "code": 403,
        "message": "Required 'Current principal doesn't have permission to mutate this resource!' permission for '[INSTANCE-NAME-REDACTED]'"
      },
      "@type": "##########com/error"
    },
    "resourceLocation": {
      "currentLocations": [
        "us-west1-a"
      ]
    }
  },
  "insertId": "-###########",
  "resource": {
    "type": "gce_instance",
    "labels": {
      "instance_id": "##################",
      "zone": "us-west1-a",
      "project_id": "-----------------------------"
    }
  },
  "timestamp": "2024-04-11T21:21:11.514971Z",
  "severity": "ERROR",
  "labels": {
    "############com/root_trigger_id": "######################################"
  },
  "logName": "projects/-----------------------------/logs/############com%2Factivity",
  "receiveTimestamp": "2024-04-11T21:21:12.788839353Z"
}
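For readers skimming the entry above, here is a minimal Python sketch that pulls out the fields that matter for this question. It assumes the log entry has been saved as JSON and loaded into a dict; the helper names `summarize_audit_entry` and `iam_granted_but_denied` are hypothetical, and the sample dict is a trimmed copy of the log above with the same redactions. It only restates what the log already says: the IAM check for compute.instances.setMetadata is recorded as granted, yet the call still ends with status code 7 (PERMISSION_DENIED) and HTTP 403.

```python
# Trimmed copy of the audit log entry above (redacted values kept as-is).
entry = {
    "protoPayload": {
        "status": {"code": 7},
        "authenticationInfo": {
            "principalEmail": "-----------------------------@---------------com"
        },
        "methodName": "v1.compute.instances.setMetadata",
        "authorizationInfo": [
            {"permission": "compute.instances.setMetadata", "granted": True}
        ],
        "response": {"error": {"code": 403}},
    },
    "severity": "ERROR",
}


def summarize_audit_entry(entry):
    """Collect the method, principal, result codes and IAM checks of one entry."""
    payload = entry.get("protoPayload", {})
    return {
        "method": payload.get("methodName"),
        "principal": payload.get("authenticationInfo", {}).get("principalEmail"),
        "status_code": payload.get("status", {}).get("code"),
        "http_code": payload.get("response", {}).get("error", {}).get("code"),
        "permissions": [
            (check.get("permission"), check.get("granted"))
            for check in payload.get("authorizationInfo", [])
        ],
    }


def iam_granted_but_denied(summary):
    """True when every listed IAM check was granted but the call was still denied."""
    checks = summary["permissions"]
    all_granted = bool(checks) and all(granted for _, granted in checks)
    return all_granted and summary["status_code"] == 7  # 7 = PERMISSION_DENIED


summary = summarize_audit_entry(entry)
print(summary["method"], summary["permissions"], summary["http_code"])
print("granted by IAM but still denied:", iam_granted_but_denied(summary))
```

If the second line reports True for the real entry, the denial may not be caused by a missing role on the principal, which would be consistent with the roles already being in place. That reading is an inference from the log fields, not a confirmed diagnosis.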

2 Replies

The issue you're experiencing, with the Vertex AI Workbench getting stuck at "Provisioning" and the System Health Status showing "Unhealthy", is likely related to insufficient permissions. The error message indicates that the current principal does not have permission to mutate the resource.

Here are some steps to diagnose and potentially resolve the issue:

  1. Verify IAM Roles and Permissions: Ensure that the service account used by the Vertex AI Workbench has the necessary permissions. The error was raised on a compute.instances.setMetadata call. The required roles typically include:

    • Compute Admin (roles/compute.admin)
    • Vertex AI User (roles/aiplatform.user)
    • Service Account User (roles/iam.serviceAccountUser)

     Ensure these roles are correctly assigned to the service account.

  2. Check Service Account Permissions: Go to the IAM & Admin section in the Google Cloud Console and ensure that the service account has the correct permissions. Look for any custom roles that might be missing necessary permissions and compare them to predefined roles like Compute Admin.

  3. Examine Service Account Delegation: The log mentions serviceAccountDelegationInfo. Verify that the delegation settings are correct and that the service account has been properly configured to act on behalf of the required principal.

  4. Review Instance Metadata: The error involves setting metadata on the instance. Ensure that the instance metadata isn't being overridden or misconfigured by any startup scripts or custom settings. You can check the metadata configurations in the Compute Engine section.
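The role check in the first two steps can be sketched in Python as follows. This assumes the project's IAM policy has been exported as JSON (for example from the console or the CLI) in the standard shape of a `bindings` list with `role` and `members` fields. The function names, the service account address and the sample policy are hypothetical placeholders, not values from this thread.

```python
# Roles named in the list above.
REQUIRED_ROLES = {
    "roles/compute.admin",
    "roles/aiplatform.user",
    "roles/iam.serviceAccountUser",
}


def roles_for_member(policy, member):
    """Return the set of roles bound to `member` in a project-level IAM policy dict."""
    return {
        binding["role"]
        for binding in policy.get("bindings", [])
        if member in binding.get("members", [])
    }


def missing_roles(policy, member, required=REQUIRED_ROLES):
    """Return the required roles that `member` does not hold, sorted by name."""
    return sorted(set(required) - roles_for_member(policy, member))


# Hypothetical policy and service account, for illustration only.
sample_member = "serviceAccount:workbench-sa@example-project.iam.gserviceaccount.com"
sample_policy = {
    "bindings": [
        {"role": "roles/compute.admin", "members": [sample_member]},
        {"role": "roles/iam.serviceAccountUser", "members": [sample_member]},
        {"role": "roles/viewer", "members": ["user:someone@example.com"]},
    ]
}

print(missing_roles(sample_policy, sample_member))
```

This only inspects bindings in the one policy it is given. It does not evaluate IAM conditions on a binding, roles granted at the folder or organization level, or grants made directly on the service account resource, so an empty result is a hint rather than proof that permissions are complete.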

Hi Poala, thank you for replying.

I've checked, and the service account already had all of the listed roles except Vertex AI User. I added that one, but it made no difference.

The error seems to come from trying to set metadata (permission "compute.instances.setMetadata"), but the Compute Admin role should grant the service account this permission.

"Required 'Current principal doesn't have permission to mutate this resource!' permission for 'NAME-OF-VERTEX-AI-WORKBENCH'"

Now the machine won't stop. The status just says "Stopping" continuously...