Is there a place to report GPU errors?

While a single error is not significant, it would be nice to be able to collect core dumps in one place, so that GPUs that malfunction repeatedly can be identified from the lot.

2023-04-19 05:54:50.710228: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
  59/7504 [..............................] - ETA: 2:48:34 - total_loss: 73.0354 - rec_loss: 0.0428 - perc_loss: 0.2867 - smooth_loss: 0.0362 - warping_loss: 0.6300 - psnr: 24.0297 - ssim: 0.6890
2023-04-19 05:59:55.261140: F ./tensorflow/core/kernels/conv_2d_gpu.h:537] Non-OK-status: GpuLaunchKernel(ShuffleInTensor3Simple<T, 2, 1, 0>, config.block_count, config.thread_per_block, 0, d.stream(), config.virtual_thread_count, in.data(), combined_dims, out.data()) status: Internal: an illegal memory access was encountered
2023-04-19 05:59:55.261148: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Aborted (core dumped)
(tf-cuda) rac@gpu:~$ 
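
Until there is a central place for reports, crashes like the one above could at least be tallied from saved logs. Below is a minimal sketch, assuming the training output is kept as plain text; the function name `count_cuda_errors` and the `sample` lines are hypothetical, and the pattern only looks for `CUDA_ERROR_*` names as they appear in the log above.

```python
import re
from collections import Counter

# Matches CUDA driver error names such as CUDA_ERROR_ILLEGAL_ADDRESS.
CUDA_ERROR_RE = re.compile(r"\bCUDA_ERROR_[A-Z0-9_]+\b")


def count_cuda_errors(log_lines):
    """Count occurrences of each CUDA_ERROR_* name in an iterable of log lines."""
    counts = Counter()
    for line in log_lines:
        counts.update(CUDA_ERROR_RE.findall(line))
    return counts


# Hypothetical sample: two lines copied from the log in this post.
sample = [
    "2023-04-19 05:59:55.261148: E tensorflow/stream_executor/cuda/cuda_event.cc:29] "
    "Error polling for event status: failed to query event: "
    "CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered",
    "Aborted (core dumped)",
]
print(count_cuda_errors(sample))
```

Running this over the logs of each machine separately would give a rough per-host error count, which is the kind of comparison that could point to a GPU that fails more often than the rest.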

 

1 REPLY

You do realize this is googlecloudcommunity, area Infrastructure-Compute-Storage, and that this post is about errors inside Google Compute Engine instances?