Marketing literature for GPUs stresses their high FLOPS. An A100 is advertised as capable of achieving $19.5$ FP32 TFLOPS. The code below performs a number of numerical operations on tensors: addition, scalar multiplication, transcendental functions, and matrix multiplications etc.
import torch from torch.profiler import ProfilerActivity, profile, tensorboard_trace_handler def flop_ops(): SIZE = 2**12 ones = torch.ones((SIZE, SIZE), device=torch.device('cuda:0')) torch.matmul(ones, ones, out=ones) # matrix multiplication ones.mul_(0.5) # in place multiplication result_mul = ones.mul(0.5) # out of place multiplication total = ones + result_mul # adding tensors result_sum = torch.sum(ones) # adding elements of a tensor result_sqrt = torch.sqrt(ones) # sqrt takes 7 ops result_sin = torch.sin(ones) # sin takes 17 ops (14 fp64, 3 fp32) result_sigmoid = torch.sigmoid(ones) # sigmoid takes 24 ops result = torch.log10(ones) # log10 takes 24 ops result = torch.pow(ones, 3.14159) # pow takes 142 ops # warmup flop_ops() trace_handler = tensorboard_trace_handler(dir_name="./flops_trace", use_gzip=True) with profile(activities = [ProfilerActivity.CPU, ProfilerActivity.CUDA], on_trace_ready = trace_handler, record_shapes = True, with_stack = True ) as prof: flop_ops()
The trace shown below indicates that other than matrix multiplication, all of the other operations are a tiny fraction (~$1\%$) of the advertised flops. Furthermore, simple operations like addition take exactly as long as complex ones like sine and log. Why?