
Is the GPU faster at addition or multiplication?

This puzzle presents three different ways to add elements of a tensor. Can you figure out the fastest implementation?
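To make the trade-off concrete, here is a sketch of three summation strategies in plain Python (the puzzle's actual GPU implementations differ, but the access patterns mirror common device-side reduction strategies):

```python
def sum_sequential(xs):
    """One thread walks the whole array: O(n) serial steps."""
    total = 0.0
    for x in xs:
        total += x
    return total

def sum_tree(xs):
    """Pairwise (tree) reduction: O(log n) parallel steps on a GPU."""
    xs = list(xs)
    while len(xs) > 1:
        if len(xs) % 2:  # pad odd lengths with the identity element
            xs.append(0.0)
        xs = [xs[i] + xs[i + 1] for i in range(0, len(xs), 2)]
    return xs[0]

def sum_chunked(xs, chunks=4):
    """Each 'thread block' sums a contiguous chunk, then partials are combined."""
    n = len(xs)
    step = (n + chunks - 1) // chunks
    partials = [sum(xs[i:i + step]) for i in range(0, n, step)]
    return sum(partials)

data = [float(i) for i in range(1, 9)]  # 1..8 sums to 36
print(sum_sequential(data), sum_tree(data), sum_chunked(data))
```

All three compute the same value; on a GPU they differ in how much parallelism they expose and how they touch memory.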

The order of operations matters on the GPU. Can you find the faster ordering?
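One classic instance of ordering mattering is a matrix chain. The FLOP accounting below uses hypothetical sizes, not the puzzle's code:

```python
# Multiplying A (n x n), B (n x n), and a vector v (n x 1):
# (A @ B) @ v does a full matmul first; A @ (B @ v) only ever does matvecs.

def matmul_flops(m, k, n):
    """Multiply an (m x k) by a (k x n): ~2*m*k*n flops (mul + add)."""
    return 2 * m * k * n

n = 1024
left_first = matmul_flops(n, n, n) + matmul_flops(n, n, 1)   # (A @ B) @ v
right_first = matmul_flops(n, n, 1) + matmul_flops(n, n, 1)  # A @ (B @ v)
print(left_first, right_first)  # right-first is ~n/2 times cheaper
```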

When is matrix multiplication compute bound and when is it memory bandwidth bound on a GPU?
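A rough roofline-style answer: a matmul on M x K and K x N matrices does ~2·M·N·K flops and moves at least M·K + K·N + M·N elements, so the bound depends on how its arithmetic intensity compares with the GPU's flops-per-byte ratio. The hardware numbers below are hypothetical, for illustration only:

```python
PEAK_FLOPS = 100e12   # hypothetical GPU: 100 TFLOP/s
PEAK_BW = 1e12        # hypothetical GPU: 1 TB/s memory bandwidth
BYTES_PER_ELEM = 4    # fp32

def intensity(m, k, n):
    """Arithmetic intensity (flop/byte) of an (m x k) @ (k x n) matmul."""
    flops = 2 * m * k * n
    bytes_moved = (m * k + k * n + m * n) * BYTES_PER_ELEM
    return flops / bytes_moved

machine_balance = PEAK_FLOPS / PEAK_BW  # flops per byte the GPU can sustain
for m, k, n in [(4096, 4096, 4096), (4096, 4096, 1)]:
    ai = intensity(m, k, n)
    bound = "compute" if ai > machine_balance else "bandwidth"
    print(f"{m}x{k}x{n}: intensity {ai:.1f} flop/byte -> {bound} bound")
```

Large square matmuls land well above the machine balance (compute bound), while matrix-vector products fall far below it (bandwidth bound).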

What is the optimal way to do a matrix transpose on a GPU?
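The usual trick is tiling, sketched here in plain Python with a hypothetical tile size (a real GPU kernel stages each tile through shared memory so both the global-memory read and the write stay coalesced):

```python
def transpose_tiled(a, tile=32):
    """Transpose a 2-D list-of-lists matrix, one tile at a time."""
    rows, cols = len(a), len(a[0])
    out = [[0] * rows for _ in range(cols)]
    for i0 in range(0, rows, tile):
        for j0 in range(0, cols, tile):
            # one "thread block" handles this tile: the read and the
            # write each walk a small, contiguous region
            for i in range(i0, min(i0 + tile, rows)):
                for j in range(j0, min(j0 + tile, cols)):
                    out[j][i] = a[i][j]
    return out

a = [[r * 4 + c for c in range(4)] for r in range(3)]  # 3 x 4 matrix
print(transpose_tiled(a, tile=2))
```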

Can GPUs communicate and compute at the same time?
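Yes, via overlapping: while one chunk is being computed, the next is copied in. The double-buffering schedule below is a pure scheduling illustration, not real CUDA streams:

```python
def overlapped_schedule(n):
    """Pair the copy of chunk i+1 with the compute of chunk i."""
    steps = [("copy 0", "idle")]  # prologue: fetch the first chunk
    for i in range(n):
        nxt = f"copy {i + 1}" if i + 1 < n else "idle"
        steps.append((nxt, f"compute {i}"))
    return steps

for copy_op, compute_op in overlapped_schedule(3):
    print(f"copy engine: {copy_op:8s} | compute engine: {compute_op}")
```

With n chunks, the total time is one copy plus n compute steps instead of n copies plus n computes.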

Can the arithmetic intensity of a program be increased?
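One way is kernel fusion. The accounting sketch below (hypothetical names and sizes) compares an unfused pair of elementwise passes, t = x + b then y = max(t, 0), with the fused y = max(x + b, 0):

```python
BYTES = 4  # fp32

def intensity_unfused(n):
    flops = 2 * n                       # n adds + n max ops
    traffic = (3 * n + 2 * n) * BYTES   # pass 1 moves 3n elems, pass 2 moves 2n
    return flops / traffic

def intensity_fused(n):
    flops = 2 * n                       # same arithmetic
    traffic = 3 * n * BYTES             # read x and b, write y, once each
    return flops / traffic

n = 1 << 20
print(intensity_unfused(n), intensity_fused(n))  # fused is 5/3 higher
```

The flop count is unchanged; only the memory traffic shrinks, which is exactly what raises the intensity.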

Data can be transmitted in many ways, but can you find the most efficient one?