## The Faster Way to Add?

The code snippet below sums the elements of a 1D CUDA tensor of size $4096$ in three different ways. Which implementation is the fastest, which is the slowest, and why?

```
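import torch


def first_sum(cuda_tensor):
    # Copy each element from GPU to CPU one at a time, then add on the CPU.
    total = 0.0
    for i in range(cuda_tensor.size()[0]):
        total += cuda_tensor[i].cpu()
    return total


def second_sum(cuda_tensor):
    # Keep the accumulator on the GPU and add each element there.
    total = torch.zeros(1, device='cuda')
    for i in range(cuda_tensor.size()[0]):
        total += cuda_tensor[i]
    return total


def third_sum(cuda_tensor):
    # Copy the whole tensor to the CPU once, then add element by element.
    total = 0.0
    tensor_on_cpu = cuda_tensor.cpu()
    for i in range(tensor_on_cpu.size()[0]):
        total += tensor_on_cpu[i]
    return total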
```
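
One way to check your answer empirically is to time each function. The sketch below is a minimal timing helper, not part of the original exercise: the names `time_fn` and `synchronize_if_cuda` are hypothetical, and the `t.sum()` workload is only a stand-in so the block runs on its own. It assumes wall-clock timing with `time.perf_counter` is accurate enough, and it calls `torch.cuda.synchronize()` before stopping the clock because CUDA kernels are launched asynchronously.

```python
import time

import torch


def synchronize_if_cuda(tensor):
    """Block until queued GPU work finishes; a no-op for CPU tensors."""
    if tensor.is_cuda:
        torch.cuda.synchronize()


def time_fn(fn, tensor, repeats=3):
    """Return (best wall-clock seconds over `repeats` runs, last result)."""
    fn(tensor)  # warm-up call so one-time CUDA setup cost is not measured
    synchronize_if_cuda(tensor)
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(tensor)
        # Wait for asynchronous kernels to finish before stopping the clock.
        synchronize_if_cuda(tensor)
        best = min(best, time.perf_counter() - start)
    return best, result


# Fall back to the CPU so the sketch still runs on a machine without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.ones(4096, device=device)

# Stand-in workload (assumption); swap in first_sum, second_sum, or third_sum.
elapsed, total = time_fn(lambda t: t.sum(), x)
print(f"{elapsed * 1e3:.3f} ms, total = {float(total)}")
```

On a machine with a GPU, calling `time_fn(first_sum, x)`, `time_fn(second_sum, x)`, and `time_fn(third_sum, x)` gives three numbers to compare against your prediction. Note that `second_sum` allocates its accumulator on `'cuda'` directly, so it will fail under the CPU fallback.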