CSA Brute Force

K2TSET

Registered
Messages
125
For blind BF, i.e. using monotonically incrementing keys (as opposed to checking a list of random known old keys, or date-encoded lists), there are some optimisations to be had in libdvbcsa's key schedule.

e.g. when incrementing only CW[6] (and consequently CW[7], the checksum byte), you can isolate those elements of the expanded key which change each time, and separate the parts of the key expansion which change from those which do not.

That way, you can pre-calculate the parts which don't change across a run of 256 keys, and save the repetition.

That should work well for kernels which run on a CPU. But, as I've come to expect, any hope of outsmarting the CUDA compiler fell flat here: CUDA/GPU seems to prefer simple, elegant source code, and will then do its best to produce fast executables.

Great to see some thinking here on raw BF :)
So, to understand it fully: it's the recalculation of the expanded key for the BC that you reduce, not the core BC rounds, correct?

When using an FPGA, the permutations of the key schedule are "just" wires and take no resources at all.

Yes, CUDA wants to unroll code and run it all in parallel, but the BC S-box needs to be a lookup in RAM, and when all the cores do that at the same time, constantly, it's not so fast. Even in an FPGA, block RAMs are slow, but they're still great to use since they all work at the same time = a result on every clock cycle.
 

campag5242

Feed Hunter
Messages
2,585
So, to understand it fully: it's the recalculation of the expanded key for the BC that you reduce, not the core BC rounds, correct?
The optimisation I described was useful where the raw BF increments SW[5] (or, when expressed as an int, SW++). In that case CW[0], CW[1], CW[2] et seq do not change so quickly, and you can split the key expansion in two: one function (rarely called) covering the slowly-changing CW[0]...CW[5], the other (frequently called) involving only CW[6] and CW[7], plus the pre-computed results of the slowly-changing part.
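
In code, the split looks something like this. A minimal sketch only: csa_key_perm() is a placeholder (a plain rotate) standing in for the real key-schedule permutation hard-coded in libdvbcsa, and the only property relied on is that a fixed bit permutation is linear over GF(2), so partial expansions can be XOR-combined:

```cuda
#include <stdint.h>
#include <string.h>

/* PLACEHOLDER: any fixed bit permutation is linear over GF(2), which is
   all this sketch relies on. Substitute CSA's real key-schedule
   permutation from libdvbcsa. */
static uint64_t csa_key_perm(uint64_t kb)
{
    return (kb << 17) | (kb >> 47);   /* illustrative rotate only */
}

/* Expand one 64-bit value through the 7 schedule stages, *without* the
   final XOR-with-index, so partial results stay XOR-combinable. The CW
   sits at the top (kk[48..55]); each stage fills the 8 bytes below. */
static void expand_raw(uint64_t kb, uint8_t kk[56])
{
    for (int i = 6; i >= 0; i--) {
        memcpy(kk + 8 * i, &kb, 8);
        kb = csa_key_perm(kb);
    }
}

static uint8_t pre_static[56];               /* contribution of CW[0..5]    */
static uint8_t tab6[256][56], tab7[256][56]; /* every value of CW[6], CW[7] */

/* Rarely called. Note tab6/tab7 depend only on the byte value, so they
   never change at all; only pre_static needs refreshing when CW[0..5]
   move. */
void precompute(const uint8_t cw[8])
{
    uint8_t t[8];
    uint64_t kb;

    memcpy(t, cw, 8);
    t[6] = t[7] = 0;                         /* zero the fast-moving bytes */
    memcpy(&kb, t, 8);
    expand_raw(kb, pre_static);

    for (int v = 0; v < 256; v++) {
        memset(t, 0, 8);
        t[6] = (uint8_t)v; memcpy(&kb, t, 8); expand_raw(kb, tab6[v]);
        t[6] = 0;
        t[7] = (uint8_t)v; memcpy(&kb, t, 8); expand_raw(kb, tab7[v]);
    }
}

/* Frequently called: three XORs per byte instead of a full expansion. */
void expand_fast(uint8_t cw6, uint8_t cw7, uint8_t kk[56])
{
    for (int i = 0; i < 56; i++)
        kk[i] = pre_static[i] ^ tab6[cw6][i] ^ tab7[cw7][i] ^ (uint8_t)i;
}
```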

More fruitful seems to be the approach which you described (if I understand correctly): pre-compute the first four rounds of BC itself. In this case, we increment SW[2] (then SW[1], SW[0])... leaving SW[3] SW[4] SW[5] changing slowly.
Those first four rounds involve only expanded key [55] [54] [53] [52], which correspond to our mostly static CW[7] [6] [5] [4].
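
As a sketch, with bc_decrypt_round() as a placeholder for the real per-round body inside libdvbcsa's block decrypt (shown only so it compiles):

```cuda
#include <stdint.h>
#include <string.h>

/* PLACEHOLDER round: the real decrypt round (S-box, permutation, XOR
   network) lives in libdvbcsa's block-decrypt loop. */
static void bc_decrypt_round(uint8_t s[8], uint8_t kk_byte)
{
    uint8_t t = (uint8_t)(s[7] ^ kk_byte);   /* illustrative only */
    memmove(s + 1, s, 7);
    s[0] = t;
}

/* Once per change of the slow bytes: run decrypt rounds 55..52, which
   use only the mostly-static kk[55..52], and snapshot the state. */
void bc_head(const uint8_t ct[8], const uint8_t kk[56], uint8_t snap[8])
{
    memcpy(snap, ct, 8);
    for (int i = 55; i >= 52; i--)
        bc_decrypt_round(snap, kk[i]);
}

/* Once per candidate: resume from the snapshot and run the remaining
   52 rounds with the freshly expanded kk[51..0]. */
void bc_tail(const uint8_t snap[8], const uint8_t kk[56], uint8_t pt[8])
{
    memcpy(pt, snap, 8);
    for (int i = 51; i >= 0; i--)
        bc_decrypt_round(pt, kk[i]);
}
```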

What I have only just realised is the horrible performance of vanilla libdvbcsa's Stream Cypher on CUDA: all my seemingly fruitless optimisations to the Block Cypher have been made more or less invisible by the sluggish SC. Omitting the 800 MCW/s SC, BC decrypt has gone from 2770 MCW/s (vanilla libdvbcsa) to 3580 MCW/s on a GTX 1070 Ti.

All efforts are now on the stream cypher.
 

K2TSET

Registered
Messages
125
The optimisation I described was useful where the raw BF increments SW[5] (or, when expressed as an int, SW++). In that case CW[0], CW[1], CW[2] et seq do not change so quickly, and you can split the key expansion in two: one function (rarely called) covering the slowly-changing CW[0]...CW[5], the other (frequently called) involving only CW[6] and CW[7], plus the pre-computed results of the slowly-changing part.

More fruitful seems to be the approach which you described (if I understand correctly): pre-compute the first four rounds of BC itself. In this case, we increment SW[2] (then SW[1], SW[0])... leaving SW[3] SW[4] SW[5] changing slowly.
Those first four rounds involve only expanded key [55] [54] [53] [52], which correspond to our mostly static CW[7] [6] [5] [4].

What I have only just realised is the horrible performance of vanilla libdvbcsa's Stream Cypher on CUDA: all my seemingly fruitless optimisations to the Block Cypher have been made more or less invisible by the sluggish SC. Omitting the 800 MCW/s SC, BC decrypt has gone from 2770 MCW/s (vanilla libdvbcsa) to 3580 MCW/s on a GTX 1070 Ti.

All efforts are now on the stream cypher.

Yeah, you should only calculate the Perm and the Chk when the CW[x] byte actually changed.
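
For reference, the Chk part in code; a minimal sketch of the DVB CW layout, where bytes 3 and 7 are byte-wise sums of the three bytes before them:

```cuda
#include <stdint.h>

/* Build an 8-byte CW from a 6-byte BISS SW. Bytes 3 and 7 are checksums
   (byte-wise sums of the preceding three bytes); when only SW[3..5]
   move, only CW[4..7] need touching. */
void sw_to_cw(const uint8_t sw[6], uint8_t cw[8])
{
    cw[0] = sw[0]; cw[1] = sw[1]; cw[2] = sw[2];
    cw[3] = (uint8_t)(sw[0] + sw[1] + sw[2]);
    cw[4] = sw[3]; cw[5] = sw[4]; cw[6] = sw[5];
    cw[7] = (uint8_t)(sw[3] + sw[4] + sw[5]);
}
```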

Also, I recall from PC tests that unrolling the code made some improvement.

I'm surprised: as far as I know, in CudaBiss the SC is implemented as bitslicing and the S-boxes are done in parallel, and since the SC only has 32 init rounds, and you only need 12 run rounds to get the 24 bits for a result to XOR with the BC and 0x000001 and CB2, the SC should take way less time than the BC.

I recall way back I did some CUDA analysis (Nsight) on CudaBiss, and the SC was a fraction of the CUDA time compared to the BC.

Could be the newer GTX 1070 or 1080 has a better way of handling all the thousands of simultaneous BC S-box (8x8) lookups; that seemed to be the bottleneck before.

Guess you are aware that by looking at the binary of CudaBiss you can actually see, in plain ASCII, the code for the CUDA kernels in some of the compiled versions.
 

campag5242

Feed Hunter
Messages
2,585
Thanks - I didn't know about the ASCII code within CudaBiss. I took a quick look and only saw what looks like .asm in the versions I had.

What was more interesting was to run Nsight on it. I can only see one kernel, "decrypt21", running in the default stream. It seems there's 192KiB of data computed on the CPU and copied to device memory before every grid is executed. I wonder if that is 64K x 3 SC-decrypted bytes, ready to XOR with each kernel's BC decrypt?

I'll have a go at that workflow and see how it compares - looks like that's far more work for the otherwise idle CPU than I currently give it (those initial 4 rounds of BC decrypt).
 

campag5242

Feed Hunter
Messages
2,585
Yeah, that's in line with my experience, and has influenced the strategy I've taken thus far.
But seeing as I've hit a brick wall speed-wise, I'm willing to give it a try: no point having the CPU idling between grid launches.
 

K2TSET

Registered
Messages
125
So the idea is to do SC on CPU and BC on GPU?

If you do that, you will need to send the 24 bits over PCIe for each test to see if you hit 0x000001, for all cores running in parallel; it will be very demanding.

When I looked into CUDA, I tried to only offload the calculation and only return a response from a CUDA core if it had a hit, and then did the second test on the CPU.
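
That pattern is cheap to sketch; test24() below is only a placeholder for the real key-expansion + BC/SC work that yields the first 24 decrypted bits:

```cuda
#include <cstdint>

/* PLACEHOLDER for the per-candidate work: expand the key 'cand',
   decrypt, and return the first 24 plaintext bits. */
__device__ uint32_t test24(uint64_t cand)
{
    return (uint32_t)(cand & 0xffffff);   /* stand-in computation */
}

/* Each thread tests one candidate; only hits are written back, so the
   host sees no per-test PCIe traffic, just a tiny hit buffer that it
   re-checks on the CPU afterwards. */
__global__ void bf_kernel(uint64_t base, uint64_t *hits,
                          unsigned *n_hits, unsigned max_hits)
{
    uint64_t cand = base + blockIdx.x * (uint64_t)blockDim.x + threadIdx.x;
    if (test24(cand) == 0x000001u) {
        unsigned slot = atomicAdd(n_hits, 1u);
        if (slot < max_hits)
            hits[slot] = cand;
    }
}
```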

As I said before, the BC S-boxes were eating all the time on CUDA when I tested, since there is a lookup in each round, on all cores at the same time.

But it might have changed with the newer Nvidia cards. campag5242, are you willing to share the source code for the CUDA part?

The 1070 Ti does have 2432 CUDA cores at a 1.6 GHz clock. You need the 56 rounds: the 4 whose key bytes don't change often you can pre-compute, and the remaining 52 rounds have to be done in full.

Assuming you code 64-bit in VS, be alert that CUDA projects in VS are often built as 32-bit.

Since the BC is already 64-bit itself, it's a nice fit: you can do the XOR on 64 bits in one go, and you only need a little shifting for the S-box and Perm (make your S-box output 16 bits, so you have the Perm done in the same lookup).
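
A sketch of that widened table; the S-box values and permutation wiring here are placeholders (the real tables are in libdvbcsa), but the shape of the trick is as described:

```cuda
#include <stdint.h>

/* PLACEHOLDERS: the real CSA block-cipher S-box values and output bit
   permutation are omitted; these just keep the sketch self-contained. */
static const uint8_t sbox[256] = { 0 };
static const int PERM_BIT[8] = { 1, 5, 6, 0, 3, 7, 2, 4 };

/* Widened table: low byte is the raw S-box output, high byte the same
   output with the Perm already applied, so one lookup serves both. */
static uint16_t sbox_perm[256];

void build_sbox_perm(void)
{
    for (int v = 0; v < 256; v++) {
        uint8_t s = sbox[v], p = 0;
        for (int b = 0; b < 8; b++)
            if (s & (1u << b))
                p |= (uint8_t)(1u << PERM_BIT[b]);
        sbox_perm[v] = (uint16_t)((p << 8) | s);
    }
}
```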
 

campag5242

Feed Hunter
Messages
2,585
The idea of calculating the SC on the CPU is a no-go: I got only 0.58 MCW/s on one core. I way underestimated the work required to SC-decrypt those three bytes...

In that case I wonder what those 192KiB transfers in cudabiss are for?

Yes, I have been tripped up a couple of times by forgetting that CUDA is a 32-bit environment on the PC.
 

C0der

Registered
Messages
267
Those 192KiB could be some lookup tables. Just a guess.

But about lookups on CUDA:
They should use shared RAM for speed.
But shared RAM is limited.
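
For illustration, the usual staging pattern for a small table in shared RAM (table contents and per-thread work are placeholders):

```cuda
#include <cstdint>

__global__ void lookup_kernel(const uint8_t *g_table, const uint8_t *in,
                              uint8_t *out, int n)
{
    /* Stage the 256-entry table into shared memory once per block. */
    __shared__ uint8_t s_table[256];
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        s_table[i] = g_table[i];
    __syncthreads();

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = s_table[in[idx]];   /* placeholder per-thread work */
}
```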
 

campag5242

Feed Hunter
Messages
2,585
Some things go slower from shared (probably the cost of the transfer itself)... it's best to try both ways.

Identical data read by all threads in a grid are better in global, or so I find.

Anyways, if this idiot can get 695 MCW/s on a 1070 Ti, then I'm sure someone who knows what they are doing can get it to go way faster, which may explain this mystery BISS AU server.
 

K2TSET

Registered
Messages
125
No doubt the GTX 1070 or 1080 can go faster if the right people tweak it, but what is the goal?
Maybe it can be doubled, but that's still slow :confused:

With an FPGA with 32 cores running at 200 MHz, with a test on every clk, you test 6400 MCW/s; that gives a full-range test (2^48 keys) in about 12 hours, which is way faster than a 1080.

So to get under 10 sec, you would need over 4000 FPGAs.

I doubt the AU servers do BF; more likely a lookup of already-known CWs.

For real-time, some reduction of the CSA algo itself needs to be found.
Could be the AU guy has that and keeps it secret, but I doubt it.

A huge FPGA cluster in the cloud where you can get free access could also be the way to do it.
 

campag5242

Feed Hunter
Messages
2,585
My goal is to get it to work as fast as possible: for the fun of it; to keep learning cuda; so that any gains in the BC part might be used in my rbt code; and to stimulate discussion/ideas.

I've no idea about FPGAs. Is there a parallel rate of increase in clock rate on those devices cf. CPU/GPU (where it has stalled)? And in processing power (going wider / more parallel)? Is processing power per $ increasing similarly?
 

K2TSET

Registered
Messages
125
My goal is to get it to work as fast as possible: for the fun of it; to keep learning cuda; so that any gains in the BC part might be used in my rbt code; and to stimulate discussion/ideas.

I've no idea about FPGAs. Is there a parallel rate of increase in clock rate on those devices cf. CPU/GPU (where it has stalled)? And in processing power (going wider / more parallel)? Is processing power per $ increasing similarly?

I really like the attitude of trying for fun and to learn, and for sure you can use the experience for other work in the future :thum:

The great thing with an FPGA is that everything in the FPGA happens on every clk cycle, so massively parallel processing can be done.
The fMax of an FPGA is nowhere near that of a CPU/GPU, since a huge area is used for routing (connecting) the logic elements, which slows things down.

So with respect to pricing, it's best to find the right FPGA with the maximum logic vs. price, and often that doesn't have to be the latest part, since the latest parts often focus on numbers of high-speed inputs/outputs, which you do not need for an algo like CSA.

So 2 smaller FPGAs might be faster and cheaper than 1 bigger one.

An FPGA is not an 8-, 16-, 32-, 64- or 128-bit device; it acts as you program it, which is very different, and it saves you routines which you'd need in software, like all the permutations.

Another thing is the power consumption, which is very low compared to a CPU/GPU = no noise.

I think we should all learn and understand as much of the CSA algo as possible, so we can find some unseen shortcuts; that's the only way to get a game-changer in BF on the CSA.
 
Last edited:

campag5242

Feed Hunter
Messages
2,585
Turning off the SC part, I get somewhere near your 32-core FPGA's 6400 MCW/s: around 4000. I'm still using libdvbcsa's single-packet SC... if the bitsliced version can be made to work, and with little impact on registers-per-kernel, it might then not drag the performance down so much.

Maybe the Biss-AU guy has spotted known plaintexts other than the PUSI header (for BF) or bulk padding (for RBT)? I've identified some other known plains, suited to an rbt attack, but only for a small subset of the encoder types in use. This Biss-AU seems to work on all feeds/encoders.
 

K2TSET

Registered
Messages
125
Turning off the SC part, I get somewhere near your 32-core FPGA's 6400 MCW/s: around 4000. I'm still using libdvbcsa's single-packet SC... if the bitsliced version can be made to work, and with little impact on registers-per-kernel, it might then not drag the performance down so much.

Maybe the Biss-AU guy has spotted known plaintexts other than the PUSI header (for BF) or bulk padding (for RBT)? I've identified some other known plains, suited to an rbt attack, but only for a small subset of the encoder types in use. This Biss-AU seems to work on all feeds/encoders.
Could you tell what other plains you are looking at?
 

campag5242

Feed Hunter
Messages
2,585
The plains I've identified are mostly seen on EBU feeds/encoders, typically those you'd expect to see issuing B8hx030000h padding.

When not issuing bulk padding, i.e. when transmitting normal video, these encoders are liable to terminate a payload unit with 10 bytes: 00 00 03 00 00 03 00 00 00 00

Those 10 bytes are frequently aligned such that the residue is 3 bytes long: 00 00 00. That leaves 7 preceding *known* bytes, 00 00 03 00 00 03 00, prefixed by one random byte.

That random byte, for unknown reasons, is most often 80, and less frequently C0, E0, XY etc.

So: collect 08h-len C8s, filter by residue len 3, then rbt attack assuming the first byte is 80, or one of the others.

There are also (rarer) encoders which tack on 10 bytes of all 00, and you can make tables to enable a search for the sibling plains 80 00 00 00 00 00 00 00 & similar.
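
For illustration, the 8-byte block test that falls out of the above, byte values straight from this post (the surrounding rbt machinery not shown):

```cuda
#include <stdint.h>
#include <string.h>

/* Test a candidate decrypted 8-byte block against the known plain
   described above: one leading byte from a small set (80 most often,
   then C0, E0, ...) followed by the 7 known bytes 00 00 03 00 00 03 00. */
static int matches_known_plain(const uint8_t p[8])
{
    static const uint8_t tail[7] = { 0x00, 0x00, 0x03, 0x00, 0x00, 0x03, 0x00 };

    if (p[0] != 0x80 && p[0] != 0xC0 && p[0] != 0xE0)
        return 0;
    return memcmp(p + 1, tail, 7) == 0;
}
```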
 

campag5242

Feed Hunter
Messages
2,585
These AI-optimised devices offer tensor / matrix operations. How would you exploit those to accelerate CSA?

CUDA started out optimised for floating-point ops... eventually integer operations were added, and it's those which are employed when implementing CSA (logical ops: adds, shifts, ANDs, ORs, XORs etc), whilst fp is ignored (other than the S-box lookup in texture memory).

Maybe your post boils down to: can CSA be refactored to use matrix operations? I have no idea.
 