Image source: Yeung, Gingfung, et al. "Towards GPU Utilization Prediction for Cloud Deep Learning." 12th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud '20), 2020.

Given a GPU kernel, how do you optimize its implementation? Given a PyTorch script, how do you pinpoint its performance bottlenecks?
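For the second question, a common starting point is PyTorch's built-in `torch.profiler`. The sketch below is a minimal illustration, not a full methodology: it assumes a CUDA-capable machine, and the two-layer model and tensor shapes are placeholders standing in for whatever script you are profiling.

```python
import torch
from torch.profiler import ProfilerActivity, profile

# Placeholder model and input; substitute the workload you actually want to profile.
# Assumes a CUDA-capable machine is available.
device = "cuda"
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
).to(device)
x = torch.randn(64, 1024, device=device)

# Record per-operator CPU and GPU timings over a few iterations.
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    for _ in range(10):
        model(x)

# Operators with the largest total GPU time are the first bottleneck candidates.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```

The summary table ranks operators by accumulated GPU time, which usually narrows the search to a handful of kernels worth inspecting more closely (for example with a dedicated GPU profiler).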