This optimization works in the following ways:
1. Reduces the number of malloc calls to one. Instead of allocating an
extra array, we swap the top half with the bottom half of the one array.
2. Unrolls the inner for loop and removes a condition. Unrolling the loop
buys a small performance win, but the real goal was to remove the if
check and just set the alpha channel to 255 unconditionally.
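A minimal sketch of the two points above, not the code from this change.
It assumes a tightly packed 8-bit RGBA buffer; the name flip_rows_opaque
and its parameters are hypothetical. Rows are swapped within the single
buffer, and the per-pixel channel loop is written out by hand so alpha is
set to 255 without a branch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: flip a packed RGBA8 image vertically in place and
 * force every alpha byte to 255. No scratch array is allocated here. */
static void flip_rows_opaque(uint8_t *pixels, size_t width, size_t height)
{
    const size_t stride = width * 4; /* bytes per row, assuming no padding */

    /* Swap row y with its mirror row from the bottom half. */
    for (size_t y = 0; y < height / 2; y++) {
        uint8_t *top = pixels + y * stride;
        uint8_t *bot = pixels + (height - 1 - y) * stride;

        for (size_t x = 0; x < stride; x += 4) {
            /* Channel loop unrolled by hand: swap R, G, B ... */
            const uint8_t r = top[x], g = top[x + 1], b = top[x + 2];
            top[x]     = bot[x];
            top[x + 1] = bot[x + 1];
            top[x + 2] = bot[x + 2];
            bot[x]     = r;
            bot[x + 1] = g;
            bot[x + 2] = b;
            /* ... and write alpha directly, with no if check. */
            top[x + 3] = 255;
            bot[x + 3] = 255;
        }
    }

    /* With an odd height the middle row is not swapped, but it still
     * needs an opaque alpha channel. */
    if (height % 2 == 1) {
        uint8_t *mid = pixels + (height / 2) * stride;
        for (size_t x = 0; x < stride; x += 4) {
            mid[x + 3] = 255;
        }
    }
}
```

A caller would allocate the pixel buffer once, fill it, and then call
flip_rows_opaque(buf, width, height) on that same buffer.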
On my HiDPI arm64 laptop, I saw a ~60% improvement in performance in my
debug build (29 FPS vs 47 FPS). In an optimized build, the gain was
roughly 10% (75 FPS vs 83 FPS).
Signed-off-by: K. Adam Christensen <pope@shifteleven.com>