CSA Brute Force

kebien · Apr 18, 2017

dale para abajo
Why would you think they developed CSA v3 with 128 bits keys for? they know computer power moves fast past today's limits.

At the same time,your statement is true since the first day there was a software implementation of CSA (freedec).
At the time,a breakthrough,and a warning for security companies.

All you are saying is well known,and eventually everything phase out.
The point is you can learn something in the process,which might be invaluable.
As an example.....
you can use nagra 2 cards to open any encryption (irdeto,viaccess,cryptoworks,powervu,others),since you can modify codespace to your needs,and use embedded algos as needed.
Like that,many things are "proof of concept",not useful for most users,but you learn tons of things in the process.

This is the same,even knowing it possibly gets nowhere,the process teach you things you cannot learn otherwise.

The irony is people that had no interest in doing it,if there are some real results,they take advantage of the work of others,naturally.

K2TSET · Apr 28, 2017

Could someone confirm that the Residue block in the CSA only are used if the Adaption Field has been present and the length of AF are not Modulus 8?

Or will block 23 always be XOR with SC only?

In other words will the Residue block of 8 bytes only be used if the length of the encrypted payload +AF are not modulus 8?

So to be able to try to BF the SC only on block 23 will only be possible if an AF is present and have a non modulus 8 length?

It's just due to I found a pdf called "Distinguishing Attack on Common Scrambling Algorithm" where they discuss to look at the SC only and it should be possible to reduce the key space from 2^64 to 2^55.

I do not see how that should be done for a real BF if you only can look for AF with a non mod8 length

BeatrixKiddo · May 3, 2017

HI,
someone can post for download "cw bruteforce 0.5"?
I can't find it!!!!
thanx!!!

K2TSET · May 19, 2017

Some finding / understanding on the Residue part.

In a TS you will have the PUSI indikator where you start a new encoded block ... it's where we do normal do look for the 0x00 0x00 0x01.

If you look on tha TS block for the same PID just before you will typical see an Adaption Field AF (non encrypted) which does take up a lot of space with 0xFF this is typical to do stuffing of the TS bloc of 184 data.

After the AF there will be the last part of the encrypted payload and if the length does not fit with chunks of 8bytes you will have a residue part 1 to 6 bytes ... Thosere are ONLY encrypted with SC.

So I have looked into some data to understand it better.
so here are a TS block form some old Mpeg2 ts where I found the key by BF.

Code:

In ts file search for PUSI and PID 0x216
47 42 16
Then grab the block of 188 bytes just before with a AF length not equal to mod8



Known key
fb dc 95 6c 13 16 8d b6

Encrypted just before PUSI
47 02 16 FF 
[COLOR="RoyalBlue"]83 00 FF FF FF FF FF FF FF FF FF FF FF FF FF FF
FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF[/COLOR] [COLOR="Red"][B]E8 D9 AB 6B
D5 B3 2E 4E[/B] 02 8B 8D 30 C9 D3 75 51 69 79 E7 EC 71 4F C7 9C
86 FC EF 7B 97 F6 8F 36 A7 EC 48 E2 71 68 B3 8C D4 0C 80 B2 
C2 BC 88 FF [COLOR="Magenta"]D2 5D 97 18[/COLOR][/COLOR]


Clean just before PUSI
47 02 16 3F 83 00 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF 13 5B 46 8D 1A 30 FD A1 11 8E 38 C0 9A DA 34 68 D1 87 ED 08 8C 71 C6 04 D6 D1 A3 46 8C 3F 68 44 63 8E 30 26 B6 8D 1A 34 61 FB 42 23 1C 71 81 35 [B]80 00 00 00[/B]

Take  E8 D9 AB 6B D5 B3 2E 4E and do SC init (32 steps)+ 6 rounds of (32 steps)  SC run

    00  f1 4b 33 5a dd 7a 6c 03
    01  20 68 53 48 97 dc 2b d7
    02  fd 0b 57 b5 e8 38 64 57
    03  16 bb 36 e3 c6 2f 1b a9
    04  76 6e 97 f5 91 7a 46 2d
    05  [COLOR="Magenta"][B]52 5d 97 18[/B][/COLOR] ad dc f3 9a

So in this TS you can see the clean part ends with 0x80 0x00 0x00 0x00 so if this is normal or if we know what it can end with and we find AF with a length where there will be a Residue it should be possible to BF for the key only by using SC.
This might be faster than normale BF since it the BC which normal use a lot of time.

Let me hear what you think?

ViaHussun · May 19, 2017

BeatrixKiddo said:
HI,
someone can post for download "cw bruteforce 0.5"?
I can't find it!!!!
thanx!!!

CW Brute Force 0.5 :thum:

srijan20 · May 19, 2017

ViaHussun said:
CW Brute Force 0.5 :thum:
https://mega.nz/#!GIJVgKqR!VCi8A6S_d_IfQZPToCEgURcGksuvQJSEVMLhf5z_BoA

Attach it.

K2TSET · May 19, 2017

K2TSET said:
Some finding / understanding on the Residue part.

Forgot to say:
Encrypted "D2 5D 97 18" XOR with output of SC "52 5d 97 18" Equal to "80 00 00 00"

dale_para_bajo · May 19, 2017

K2TSET

I do have similar opinion that klims on breaking CSA. So you had seem my comments before.

But I will express in this.
In GPU stream cypher is as fast as block cypher.
In a normal CSA BF attack you do BF 23 block cyphers and no stream.

Now you are proposing to look for even a less common situation where you get this unusual bytes. But now in your proposed BF you have to do 23 stream cypher in order to do the BF.

If both stream and block are as fast then what is the gain? You still have to do 23 of them.

C0der · May 20, 2017

dale_para_bajo said:
In a normal CSA BF attack you do BF 23 block cyphers and no stream.

No Stream? Are you sure?

dale_para_bajo · May 20, 2017

HEHEHEHE

Listen guys lets keep it friendly. If I make mistakes just lauft about it. As I sometimes do mistakes.

1- @coder as it is easier.
*Lets call the first 8 encrypted bytes SB0. Yes where you name crypt8.
*In a TMTO Attack like CSA RBT, what we do is encrypt 184 bytes of 0x00 or 0xFF to get SB0.
*In that process we do 23 Block cypher ONLY as 1rst encrypted byte has no Stream Cypher.
*So that is what I intended to say.

2-*Clearly that is not the fastest search. The Fastest Search is what I think uses Cudabiss. I am not sure about it as cudabiss source has not been shown nor I have seem any explanation.
*But if I was to build my own, the bytes positions involved are SB0 & SB1. And the first 8 Clear bytes. The books call that 1rst clear 8 bytes - DB0.
*Now as we ONLY need to play with SB0, SB1 & DB0 ew ONLY need to do Block Cypher 1 and Stream Cypher 1. One of each. Well in fact not even the whole stream cypher as ONLY the first 3 bytes are needed(24 bits). Just look at the aycwabtu program I posted.

3-*Here is where I did in fact make a mistake. And for the edit of K2TSET post I guess he saw my mistake.
*See I had little or no experience with Adaptation Field. And I forgot that the 1rst Xbytes are unencrypted. Where the 0XFF are. So I was wrong when I said he still need to do 23 stream cyphers. That was incorrect. But I am learning too.

PD: I am no expert all this info is on the papers and books I had read on the net. So is open source knowledge, not mine.

dale_para_bajo · May 28, 2017

I am just tired of what I was doing so I decided to check on you guys here.

And I realized that I did not read this tread from the begining.
Now that I read from 1rst post I notice some points.

*On 1rst post K2TSET defined correctly what I think is using Cudabiss. So please ignore my comment previouly on that.

K2TSET said:
As far I know you still need
56 rounds of BC
32 rounds of SC init and
12 rounds of SC run could be stopped if result are wrong (easy in software hard in hardware if fully unrolled / pipeline core on fpga are used)

*Post #4 show a picture that have a section of a picture use by master colibri. This is sad to me as I have been searching for long that colibri pdf as I misplace to never found it again. HEHEHE well I did found it this week as I found my problem. It is a pdf writen in german. that is why I could not find it. It was nice to translate it again and reading to refresh colibris ideas. In particular may explained why K2TSET was interested in the TS with Adattation Filed + payload that have a payload not multiple of 8. Sadly even when that finding was aa real nice crack most operators are aware of the hole and put in place counter meaasurements so it no longer can be used. But here I am asking you guys what is new.

*In general I am not interested in the 10 sec solution as I belive will create the begining of the end for CSA.

*Now Post 16 is trouble for me.

K2TSET said:
...I recall someone did an analyse of the Cudabiss and found the time for threads of the BC vs SC are like 100:1.

Also on FPGA the absolute resource killer are the BC and in there it's the S-box killing resource/performance.

"BC vs SC are like 100:1."

And I all ready said that I do not have seen Cudabiss source but in my aproach a full stream = block time. So here we have VERY diferent results!!!!!.
K2TSET can you tell us your own findigs? Just to compare.

*About the 40bit. I seen nothing that back that out. But it does say " DVB-CSA used a 56-bit key with only 40 bits of entropy".
Both numbers seems wrong to me. it is not 56 but 64bits to beginn with. Then it reads "40 bits of entropy" witch could mean tha some how some one may have found a way to combine bits and eliminates variables. I guess I need to read that paper too.

*Post 22

K2TSET said:
...
In other words will the Residue block of 8 bytes only be used if the length of the encrypted payload +AF are not modulus 8?

Yes is on colibries pdf, where you get the picture of post1. Now colibri's pdf will not explain what you ask in post 24?
Please be aware that colibri's paper seems to be writen at the very early stages of his work in CSA RBT. And that paper is in fact an analisys for 30W TDD single pid Tunel as he name it. And that it is NO LONGER the case, at least in the logs that I was given. You will not find a SINGLE AF in news LOGS as it seem it was removed by the provider. I guess as counter measure. Please be aware that colibri makes many asumptions that are ONLY valid for the case of 30W TDD Tuneled PID.

So what are you guys up to this day?

dale_para_bajo · May 28, 2017

I did just try to find out about

Code:

Securing digital video. Techniques for DRM and Content Protection

I could not found a FREE download fro the PDF. But what I think is that the book talks about "DRM" Digital rights management.

In the early days of DVD technologies, a DRM system called Content Scrambling System (CSS) was developed by the DVD Forum on film DVDs. It uses a simple encryption scheme designed to prevent byte-for-byte copying of a MPEG (digital video) stream. The key for the encryption is split and separately stored in the lead-in area of the restricted DVD and the DVD drive itself, and the encryption algorithm employs a proprietary 40-bit stream cipher [Schneier99]..

So that 40 bit cipher I guess talks about DVD copy protection.

K2TSET · May 28, 2017

dale_para_bajo said:
*Now Post 16 is trouble for me.
"BC vs SC are like 100:1."

And I all ready said that I do not have seen Cudabiss source but in my aproach a full stream = block time. So here we have VERY diferent results!!!!!.
K2TSET can you tell us your own findigs? Just to compare.

If you want to look into CUDA kernel use, then NVIDIA do have a tool called "Visual Profiler" where you can see the calls for the cores incl. memory use etc.

With "BC vs SC are like 100:1." I do say Block Cipher takes like 100 times longer than the Stream Cipher.

Reason for that in the BC you have to do the 256 bytes (8bit) lookup and this part are not easy for bitslicing also you will need another lookup for the Perm. (in FPGA you just connect the wires as fixed Perm)

the problem when using a lookup table are the the CPU or GPU has to add the input value (offset) to get the output of the memory position and adding are slow.

I recall some test while ago that if you worked with 16bit ( 2 rounds of lookup) it even gets slower than 2 x 8 bit lookup.

The SC can be done as bitsclicing since the 7 sboxes can be done as logic

So to sum up
BC needs 56 rounds where you only can calculate 1 BC per thread.
CS needs 32+12 rounds where you can do 32 SC per thread (assuming 32 bit)

So you will need run 32 threads og BC to 1 of CS

If the memory access does slow down the lookup for BC you might end up with something like 100:1

By the way if you need speed in software try to unroll the code instead for using small loop's, there is a big difference.

Regarding only 40 Bit used in the Key, I totally agree that it must be wrong... we have to look for 48 bit!

BTW: nice to see activity and thoughts in this thread

dale_para_bajo · May 28, 2017

Let me start with what I need.

K2TSET said:
If you want to look into CUDA kernel use, then NVIDIA do have a tool called "Visual Profiler" where you can see the calls for the cores incl. memory use etc.

I have never install nor use CUDA. But It is my understanding that you need the source code cu to do any profiling. But If I am wrong PLEASE correct me. I love to profile any available code for comparison purpose. Remember I do not have Cuda Card. In AMD I can profile the Code without having a AMD GPU or running it. But I need the cl source code.

@ALL please lets speak freely without thinking we can make mistakes. Or I am not try to imposed my version of things over yours. Lets just read all this as if we are sharing info and comparing Ideas. So here I am learning from you and giving my thought on some points.

"BC vs SC are like 100:1." I proposed we do some profiling ourself. Just for comparison. I am not trying to get your code nor giving away my code. Noe I am not trying to show I am better programmer. But I want to know how much I can improve compare to others.

In AMD reading to tables slow down things too. But it should not have anything to do with adding or offsets. In general it been use as a rule that memory type slow things down by x10 factor. So a private register is 10 times faster as Local Memory(shared). Sadly in AMD Arrays are store in Local Memory even when it is not Shared. The you need to consider the Wave factor. Almost every time you order to read GPU Program Control will generate a new wave to a counter measurement to try hide the delay. But consecutive reading will not help. That is apparent reason on AMD. Now you do not need to read twice for two (8 bits) Sbox+Perm, instead read one vectorX2 (8 bits). Or just create a new table of short with the two (8 bits). The you change the 2 reads for only one (16 bit )short read.

In the other hand you are making comment on GPU and FGPA. That confuse me. So are you saying 100:1 in both GPU & FPGA?

K2TSET · May 29, 2017

The visual profiler will show info even without source, I doubt it will work without a CUDA card, I will suggest you get hold on a CUDA card just to be able to play with the different software like CudaBiss and RBT I know you want to do OpenGL and that's possible as well.

If I should go for a new card I think I would get the GTX1080ti but if you need a cheap one buy a second hand smaller one just to be able to test

I have not played with GPU for some years since my conclusion was it was way slower than FPGA

You are right that the S-box and Perm in BC can be done as 2x8 in parallel.

Memory access are a killer and for sure you need the RAM (constant lookup tables) to be as local as possible, but the main problem are all your kernals / cores want to do lookup more or less all the time.

When you have a memory with a table it will be on a RAM loacation and when you do a lookup for eg. input = 0x00 you will read the base ram location.
when you do a lookup for eg. input = 0xB7 you will read the base ram location + 0xB7 this is an addition (offset) if you like it or not.
only way to get rid of the addition are to have a Base address ending on 0x00 and then replace the last bytes with you input but that is not so easy.

If you try some like the above in C and then try to disasenble after you will see a lot of stuff are going on = waste of time. (depending on your compiler)
Alternatile are to go for ASM you can do that as inline and I guess you can get some speed there.

In FPGA when you go unrolled Pipelined you will get results on ever clk and depending on the size of FPGA you can have many cores.

But again the BC and the Sbox makes the problem. it need 8 bit input and most modern FPGA do have "only" 6bit full LUT's and we need 8bit so the logical way to go here are to use BRAM as dual ROM for 2 x 256bit but you will hit another limit and that is speed of the BRAM.

In the FPGA you can read the BC S-box in all the BRAM on the chip on every clk and that a lot but we still want more

The SC in FPGA can be done with LUT so it small and faster and uses less rounds.

If might not be 100:1 in FPGA but there is a big difference on BC and SC

The only shortcuts I know for CSA at the moment are to split the key in a high and low part 11, 22, 33 and 44,55,66 and only do the key-schedule and BC for the round 1-4 each time you have tested 11,22,33 for all combination (see my image in one the first post (round 1-4 marked with green)

And the other shortcut are to first calculate the BC then run 32 SC int and and when doing the SC run as soon the result does not fit stop the SC.
You will get 2 bit on every round of SC run so after the first round only 25% are left and after 2 round 6,25% are left so you will save some time here but not easy to implement in GPU and even worse in FPGA.

dale_para_bajo · May 30, 2017

Buy!!! HEHEHE
I am broke. I witch I had money for antenna 1rst. Then I have health problems. I would not bather you with my personal issues. Well good news is I have time to play.

Listen I have not build my Cudabiss version, I guess I need to finish it. Well I have all the components. Just have not set my system to handle the schema you have nicely show us. But do not misunderstood I will.

So last shortcut, BC 1rst then trying SC and test every output(2bits) well that I have done.

But I do not see you 1rst shortcut.
"to split the key in a high and low part 11, 22, 33 and 44,55,66 and only do the key-schedule and BC for the round 1-4 each time you have tested 11,22,33 for all combination (see my image in one the first post (round 1-4 marked with green)"

I guess I need to study that response. Splitting Key in 2 part HI & LOW. Wao, I guess for BS I need to study Key Schedule again and see how it is used. That is a Hell of a comment! Now for SC I do not see how you split Key and add to the registers, then perform Init. I guess more to study. I will as I take all comments seriously. You never know when a good Idea will show up.

dale_para_bajo · May 30, 2017

I see so many guys that can edit their post and I can not. Well here we go again with another post.

I forgot to ask you. As to get an Idea of a FPGA system. Please suggest me what FPGA Model will do CSA in 1 day. So I can estimate its cost. Just to compare with GPU card.

Stefan2k16 · May 30, 2017

If I should go for a new card I think I would get the GTX1080ti but if you need a cheap one buy a second hand smaller one just to be able to test

I'm sure the GTX 1080 Ti is a great card and probably runs cudabiss very fast, but you don't have to have such an expensive card. A GTX 1060 does quite well also and you don't need a card with a lot of memory. The GTX 750 Ti was a good card for cudabiss because it performed well and used little power and produced little heat. Cudabiss had a command like argument that told it which card, in a system with mutliple cuda cards, to run on. Therefore often times the best approach with cudabiss was to use multiple lower cost and lower power/heat output cards than to use one top of the line card.

Cudabiss was written back in the Fermi era of Nvidia and probably was optimized for it. When the Kepler GPU came out they didn't perform as well. At that point everyone was off chasing rainbows and I don't believe Beeone, the person who released cudabiss to the public, ever provided an updated version. When Maxwell cards came out they performed quite well on Cudabiss, even though the program was never optimized for anything past Fermi. Pascal also seems to perform well. It would be interesting to see how well a more modern version of cudabiss could perform, but as i said everyone went chasing rainbows and Beeone disappeared or at least quit releasing new versions of cudabiss.

Please suggest me what FPGA Model will do CSA in 1 day.

I cannot answer this question but I bet it's not the cheapest models. My guess is that when you buy the FPGA, a development board for it, and the development tools you'll need to program it, the cost would probably be comparable to buying 4 GTX 1060 cards. 4 GTX 1060s will probably not do CSA in a day but will probably find most control words in 2-3 days. Remember the time required to search the entire range is not the time it will take to find the control word unless you just happen to be unlucky and the control word is exactly at the end of your search range. On average your going to find control words in about half the time to search the entire range and with a little luck perhaps sooner. Also there are sometimes patterns to the selection of the control words used. If you find a few control words and start to recognize a pattern, sometimes you can narrow your search significantly. For a hint about such "patterns" see the different search modes of cudabiss. I think it might not be wise to openly discuss such patterns beyond this.

K2TSET · May 30, 2017

dale_para_bajo said:
Now for SC I do not see how you split Key and add to the registers, then perform Init. I guess more to study. I will as I take all comments seriously. You never know when a good Idea will show up.

You don't for the SC.
It's only for the first 4 rounds of BC you have the benefit.

I agree with Stefan2k16, CudaBiss is an old program made for early CUDA boards and it might be better with 4 times old card then a newer one. But they do use a lot of power and if you are about to write your own software I would go for the newer ones.

On the FPGA:

I use Altera Cyclone V E A9
1 chip running 32 cores @200Mhz clk will do a complete search in about 12Hours ... using about 15 Watt

Raw FPGA chip cost about $250
but it's almost impossible to build you own board due to BGA chips

Dev boards goes easy + $1000 but typical you will have more features on the FPGA and a lot of addon which you do not need but you get a board ready to play with.

Dev software are free for the Cyclone V serie

If you want to learn FPGA... see a lot youtube + some more
Buy a cheap FPGA board (not to small and avoid crazy Chinese boards) But for example one with a Altera MAX10 M50
Start playing ... and build some great stuff

BUT remember there is a heavy learning curve for FPGA's (it's even worse than GPU) but it's great for parallel processing like BF.
You will need to learn to use pipeline approach to get speed and results on every clk

kebien · May 30, 2017

A lot of interest with little money to invest becomes a moot point.
You can learn but you cannot develop.
Any hacking involves tons of spending,with FPGA being the most expensive of all.
A cuda card like GTX970 is cheap,has 1664 cuda cores,and is low power.
Of course,you cannot upgrade an old computer to this cards without upgrading power supply too,and maybe motheboard if there is not a PCI-E socket in it.
Again,everything involves investing in tools,without it is just no gain.

CSA Brute Force

kebien

Registered

K2TSET

Registered

BeatrixKiddo

Registered

K2TSET

Registered

ViaHussun

srijan20

Registered

K2TSET

Registered

dale_para_bajo

Registered

C0der

Registered

dale_para_bajo

Registered

dale_para_bajo

Registered

dale_para_bajo

Registered

K2TSET

Registered

dale_para_bajo

Registered

K2TSET

Registered

dale_para_bajo

Registered

dale_para_bajo

Registered

Stefan2k16

Registered

K2TSET

Registered

kebien

Registered