For blind BF, i.e. using monotonically incrementing keys (as opposed to checking a list of random known old keys or date-encoded lists), there are some optimisations possible in libdvbcsa's key schedule.
E.g. when incrementing only CW[6], you can isolate those elements of the expanded key which change each time CW[6] changes (and consequently CW[7], the checksum byte over CW[4..6]), and separate them from the parts that do not change.
That way, you can pre-calculate the parts which don't change across all 256 values of CW[6], and avoid repeating that work.
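To make the idea concrete, here is a minimal sketch in C. It is not libdvbcsa's code. It assumes the key schedule has the general shape "repeated 64-bit bit permutation, then XOR each round's bytes with the round index", and `permute64` below uses a made-up stand-in permutation, not the real CSA table. The names `expand_full`, `expand_linear` and `split_matches_full` are hypothetical. Because a bit permutation is linear over XOR, the expanded key splits into a fixed part (computed once per 256 keys) plus small per-byte deltas for CW[6] and CW[7].

```c
#include <stdint.h>
#include <string.h>

/* Stand-in bit permutation (ASSUMPTION: NOT the real DVB-CSA table).
 * 37 is coprime to 64, so i -> (i*37+11) % 64 is a bijection on bit indices. */
static uint64_t permute64(uint64_t x)
{
    uint64_t r = 0;
    for (int i = 0; i < 64; i++)
        if (x & (1ULL << i))
            r |= 1ULL << ((i * 37 + 11) % 64);
    return r;
}

/* Assumed schedule shape: 7 round keys, each the permutation of the next,
 * stored as 56 bytes with every byte XORed with its round index. */
static void expand_full(const uint8_t cw[8], uint8_t kk[56])
{
    uint64_t k[7];
    k[6] = 0;
    for (int j = 0; j < 8; j++)
        k[6] |= (uint64_t)cw[j] << (8 * j);
    for (int i = 6; i > 0; i--)
        k[i - 1] = permute64(k[i]);
    for (int i = 0; i < 7; i++)
        for (int j = 0; j < 8; j++)
            kk[i * 8 + j] = (uint8_t)((k[i] >> (8 * j)) ^ (uint64_t)i);
}

/* Same as expand_full but with the round-index constants removed,
 * i.e. only the part that is linear in the key bits. */
static void expand_linear(const uint8_t cw[8], uint8_t out[56])
{
    expand_full(cw, out);
    for (int i = 0; i < 7; i++)
        for (int j = 0; j < 8; j++)
            out[i * 8 + j] ^= (uint8_t)i;
}

/* Walks all 256 values of CW[6] (with CW[7] as checksum of CW[4..6]) and
 * checks that "fixed part XOR deltas" equals the full key schedule.
 * Returns 1 if every one of the 256 expanded keys matches, else 0. */
static int split_matches_full(const uint8_t base[8])
{
    static uint8_t d6[256][56], d7[256][56];
    uint8_t fixed_cw[8], fixed_kk[56];

    /* Fixed part: CW[0..5] only, computed once for the whole 256-key run. */
    memcpy(fixed_cw, base, 8);
    fixed_cw[6] = 0;
    fixed_cw[7] = 0;
    expand_full(fixed_cw, fixed_kk);

    /* Delta tables: contribution of each possible CW[6] / CW[7] value.
     * They do not depend on base, so real code would build them only once. */
    for (int v = 0; v < 256; v++) {
        uint8_t t[8] = {0};
        t[6] = (uint8_t)v;
        expand_linear(t, d6[v]);
        t[6] = 0;
        t[7] = (uint8_t)v;
        expand_linear(t, d7[v]);
    }

    for (int v = 0; v < 256; v++) {
        uint8_t cw[8], ref[56], fast[56];
        memcpy(cw, base, 8);
        cw[6] = (uint8_t)v;
        cw[7] = (uint8_t)(cw[4] + cw[5] + cw[6]);   /* checksum byte */
        expand_full(cw, ref);                        /* slow reference path */
        for (int n = 0; n < 56; n++)                 /* fast path: 2 XORs per byte */
            fast[n] = fixed_kk[n] ^ d6[cw[6]][n] ^ d7[cw[7]][n];
        if (memcmp(ref, fast, 56) != 0)
            return 0;
    }
    return 1;
}
```

Calling `split_matches_full(base)` with any 8-byte `base` compares the fast path against the full schedule for the whole run. In a real kernel you would of course drop the reference path and, since most of the 56 delta bytes are zero, only touch the positions that actually change.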
That should work well for kernels which run on a CPU. But, as I've come to expect, any hope of outsmarting the CUDA compiler fell flat here. On CUDA/GPU, the compiler seems to like simple, elegant source code, and it will then do its best to produce fast executables.
Great to see some thinking here on raw BF.
So, to understand it fully: it is the recalculation of the expanded key for the BC that you reduce, not the core BC rounds, correct?
On an FPGA, the permutation in the key schedule is "just" wires and takes no resources at all.
Yes, CUDA wants to unroll code and run it all in parallel, but the BC S-box needs to be a lookup in RAM, and when all the cores do that constantly at the same time, it is not so fast. Even on an FPGA, block RAMs are slow, but they are still great to use since they all work at the same time, giving a result on every clock cycle.