GPGPU using Opencl

cayoenrique

Senior Member
Messages
476
I do not have Nvidia. Well, I downloaded the CUDA Toolkit 6 32-bit but have had no chance to look at it yet. Hopefully I will at some point. The point is I have no experience with it and may be wrong.

@moonbase and any others.

You need to have the driver for your GPU. Then test that it is working by following my tutorial 01_Inspcecting_PC_ver_0.01.pdf

See what @dvlajkovic posted in post #45
lFu2uJo.png

You can see in the bottom left that it has a check mark for OpenCL & CUDA.

You need a working directory, so open a terminal:

Code:
[Windows Key]+R  then type CMD [ENTER]
mkdir C:\Apps\home\cryptodir\opencl
Leave the terminal open; you will use it soon.

Move the extracted OCL_TEST_02 folder to: C:\Apps\home\cryptodir\opencl\OCL_TEST_02\OCLBiss_014

This is my best guess at how to do something similar to GNU make. It may not work, and my code may need changes. Download this file:

visualc.zip (830.00 B)
Code:
https://workupload.com/file/hHzkmcWFgjY

Extract the content into C:\Apps\home\cryptodir\opencl\OCL_TEST_02\OCLBiss_014
Click Yes to replace the old makefile.

The next script will search for Visual C and set up the terminal to work like the Visual Studio Developer Command Prompt:
Code:
vsdevcmd.cmd

After that you should have something similar to:
command-prompt.png


Just in case you want to double-check, run:
Code:
cl

You should see something like Microsoft (R) C/C++ Optimizing Compiler Version ***

OK, now we are ready to try compiling OCLBiss.exe. Make sure you are in the right directory, then
type in the terminal:
Code:
cd C:\Apps\home\cryptodir\opencl\OCL_TEST_02\OCLBiss_014
NMAKE all

And if we are lucky you will have OCLBiss.exe

Now, for the first test, edit OCLBiss.cfg and change line 20 to DETECTDEVICEENABLE:1
We want 1 so that the program detects your platform and devices.
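
If you are curious what that detection step boils down to, here is a minimal stand-alone sketch of OpenCL platform/device enumeration in C. This is only an illustration, not the OCLBiss source, and the real program's output format differs; it just shows the clGetPlatformIDs / clGetDeviceIDs calls such a detection relies on.
Code:
/* clinfo_mini.c - minimal OpenCL device listing (illustrative sketch only).
 * Build for example with:  cl clinfo_mini.c OpenCL.lib
 */
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platforms[8];
    cl_device_id devices[8];
    cl_uint num_platforms = 0, num_devices = 0;
    cl_uint p, d, cu;
    char name[256];

    clGetPlatformIDs(8, platforms, &num_platforms);

    for (p = 0; p < num_platforms; p++) {
        clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, sizeof(name), name, NULL);
        printf("Platform %u: %s\n", p, name);

        num_devices = 0;
        clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 8, devices, &num_devices);

        for (d = 0; d < num_devices; d++) {
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
            clGetDeviceInfo(devices[d], CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(cu), &cu, NULL);
            printf("  Device %u: %s (%u compute units)\n", d, name, cu);
        }
    }
    return 0;
}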

I spent more than 45 minutes on this answer. Be kind, and when you respond asking for more help, PLEASE take a minute to copy & paste your screen output so that we can continue helping.
 
Last edited:

cayoenrique

Senior Member
Messages
476
The info/commands I posted were obtained from the net. But MS decided not to let you play with nmake, as they want you to use msbuild.
I worked for 14 long hours: installed Visual Studio 9.0, the CUDA Toolkit 6 32-bit, and the MS SDK. I found that "win32.mak" was removed, as they do not want you to play with it, and I spent a good few hours learning how to compile with nmake.
Now the BIG problem is that MSVC wants my code to change a LOT. To try to gain new adepts to the cause, I gave it a try, and after more LONG hours I finally changed many, many lines of code. And the program builds.

The sad part is that the program just starts and then closes within seconds...

Listen guys, I cannot throw my time into that adventure again. It is not hard to install MinGW + the OpenCL ICD, and you will be able to compile the program as many times as you want without effort. Now I am going to sleep, but I hope tomorrow at least to show you how to use nmake for compilation. But unless a newer MSVC/CUDA Toolkit makes a difference, do not expect to compile & RUN my program.

@all
UPDATE: before I tried and failed with MSVC, I spent lots of time figuring out how to split the kernel process into 2 parts, a Block Kernel + a Stream Kernel. I need at least a day or two to have it working. This time I hope @dvlajkovic sees an improvement. I am not sure, but less code = fewer registers = more speed, since more cores then actually do work. On the other hand, saving to and reading from global memory takes a toll, but that is memory we do not push to nor pull from our PC. And I have a few good ideas to improve the key schedule. Then there is always the bit-slice approach; I tried it in the past and it went slower!!! Finally, there is the option to launch multiple kernels at once. That is weird, because they claim it is a new thing only for new GPUs, but I recall seeing a post that in fact CUDABISS does that on old Nvidias!!
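
A rough idea of what that split looks like in OpenCL C is below. The names and bodies are purely illustrative, not the real OCLBiss kernels; the point is only that the intermediate state stays in GPU global memory, so nothing extra travels between GPU and PC.
Code:
/* Illustrative shape of the Block Kernel + Stream Kernel split.
 * The scratch buffer lives in GPU global memory only; the host never
 * reads it, it is just handed from the first kernel to the second. */
__kernel void block_kernel_demo(__global ulong *scratch)
{
    size_t gid = get_global_id(0);
    ulong block_state = (ulong)gid;   /* ...block-cipher part of the work... */
    scratch[gid] = block_state;       /* park the result in global memory    */
}

__kernel void stream_kernel_demo(__global const ulong *scratch,
                                 __global ulong *results)
{
    size_t gid = get_global_id(0);
    ulong block_state = scratch[gid]; /* pick the parked result back up      */
    results[gid] = block_state;       /* ...stream-cipher part + key checks  */
}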
 

C0der

Senior Member
Messages
270
The result of the split will be interesting.
How many bits did you use for Bit Slice?
And yeah, multiple threads is weird as a "new thing".
 

cayoenrique

Senior Member
Messages
476
@C0der
Splitting the .cl into two is the easiest part, as it is in fact already split; it only requires another section that reads "kernel void".
What is new to me is how to launch the kernels: telling Kernel B to STOP and wait for the results of Kernel A, and then scheduling Kernel A to wait for Kernel B and the reading of the results.
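
As a sketch of that ordering on the host side (illustrative only; queue, kernelA, kernelB, out_buf, out_size and host_out are assumed names, not the real OCLBiss ones), OpenCL events can express the "B waits for A, read-back waits for B" chain like this:
Code:
/* Chain Kernel A -> Kernel B -> read-back using OpenCL events.
 * Error checking omitted to keep the sketch short. */
cl_event evA, evB;
size_t global_size = 32768, local_size = 256;

/* Kernel A (the "Block Kernel") goes first. */
clEnqueueNDRangeKernel(queue, kernelA, 1, NULL, &global_size, &local_size,
                       0, NULL, &evA);

/* Kernel B (the "Stream Kernel") waits on Kernel A's event. */
clEnqueueNDRangeKernel(queue, kernelB, 1, NULL, &global_size, &local_size,
                       1, &evA, &evB);

/* The blocking read waits on Kernel B, so the host only sees finished data. */
clEnqueueReadBuffer(queue, out_buf, CL_TRUE, 0, out_size, host_out,
                    1, &evB, NULL);

clReleaseEvent(evA);
clReleaseEvent(evB);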
Then there is adding an atomic write to announce any key found. Atomic means only one core at a time gets access to write the variable that reports how many keys were found. All the kernels I have presented assume the probability of more than one core in the same wave finding a key at the same time is 0%, and that may not be true. If we allow that to happen, then 2 cores in the same wave may write almost at the same time to possibly the same location, and this may result in one of the two keys getting lost.
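
The atomic part can look roughly like this in OpenCL C (a sketch assuming OpenCL 1.1-style 32-bit global atomics; the names are invented, not the OCLBiss ones). Each finder reserves its own slot, so two cores in the same wave can no longer overwrite each other:
Code:
/* Each work-item that finds a key reserves a unique slot with atomic_inc;
 * the returned value is the OLD counter, unique per caller. */
__kernel void report_demo(__global ulong *found_keys,
                          volatile __global int *found_count,
                          const int max_found)
{
    ulong candidate = 0;   /* the key this work-item just tested */
    int hit = 0;           /* set to 1 when the key check passes */

    /* ... brute-force work that sets candidate and hit ... */

    if (hit) {
        int slot = atomic_inc(found_count);
        if (slot < max_found)
            found_keys[slot] = candidate;   /* no two finders share a slot */
    }
}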

My attempts in the past were always 32 bits, because that is the register size of all old cards. You have seen that we keep the same core looping over the next possible key for a long time; in fact @dvlajkovic loves to use 65536 for LOOPSPERTHREAD. This means that we MUST increase a COUNTER that keeps track of the next key.

Now, in a bit-slice scenario, we work with an array representing the transpose of a matrix. Increasing the value of the counter becomes very tricky business; it adds so much complexity that it slows down the looping inside a single thread.
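
To make the counter problem concrete, here is a rough C sketch of what "add 1 to all 32 bit-sliced counters" turns into: a ripple-carry over the bit planes instead of a single counter++. The 24-bit width is just an assumption for illustration, not taken from OCLBiss.
Code:
/* plane[i] holds bit i of 32 independent counters, one counter per bit
 * lane of the 32-bit word. Incrementing every lane by 1 is a half-adder
 * ripple across the planes. */
#define CTR_BITS 24

void bitsliced_increment(unsigned int plane[CTR_BITS])
{
    unsigned int carry = 0xFFFFFFFFu;          /* carry-in of 1 for all lanes */
    int i;
    for (i = 0; i < CTR_BITS && carry; i++) {
        unsigned int next = plane[i] & carry;  /* lanes that still carry      */
        plane[i] ^= carry;                     /* sum bit for this plane      */
        carry = next;
    }
}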

Now, multi-threading by one C host program sending multiple concurrent kernels to the same GPU at the same time is something I have never tried, so I do not know the results. It adds complexity in how we keep separate memory areas for input and output to and from the GPU.

Now, you have seen comments on how people launch more than one CUDABISS and claim they can add up the combined keys/second. But HOW the hell does each CUDABISS communicate with the others so that none of the parallel programs use the same memory on the GPU? Doing so would be the same problem I described in the previous paragraph: at some point, waves of different parallel CUDABISS instances would be writing to the same global memory. So you could get a report of two keys found while in reality only one key is shown, or a mix of the values of the two!! I keep reading these reports of parallel CUDABISS being launched, but I do not know if the developer put in code to prevent keys being lost due to memory corruption!!!

This is why you can find developers on the net saying it is effectively impossible to do on old GPUs, while others report it is perfectly OK. See, it all depends on how each program is coded: they need to share knowledge of the global memory each one uses, so that one does not overwrite the output of another parallel instance!! Do we know that CUDABISS can do that? Did the developer mention that it is perfectly fine to launch multiple CUDABISS instances in parallel? Or do the people that do it base everything on the hope that two instances NEVER report keys at the same time?
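
For the single-host-program variant, the usual pattern (again only a sketch, with invented names, assuming ctx, queue, kernel and out_size already exist) is to give every in-flight launch its own output buffer and its own found-counter, so two submissions can never touch the same global memory. With one in-order queue they still execute back to back; the point here is only the buffer separation:
Code:
/* Two submissions of the same kernel, each bound to its own buffers. */
enum { SLOTS = 2 };
cl_mem   out_buf[SLOTS], count_buf[SLOTS];
cl_event done[SLOTS];
size_t   global_size = 32768, local_size = 256;
int      zero = 0, s;

for (s = 0; s < SLOTS; s++) {
    out_buf[s]   = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, out_size, NULL, NULL);
    count_buf[s] = clCreateBuffer(ctx, CL_MEM_READ_WRITE, sizeof(int), NULL, NULL);
    clEnqueueWriteBuffer(queue, count_buf[s], CL_TRUE, 0, sizeof(int),
                         &zero, 0, NULL, NULL);

    /* Kernel arguments are captured at enqueue time, so each launch
     * keeps the buffers it was given here. */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &out_buf[s]);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &count_buf[s]);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size,
                           0, NULL, &done[s]);
}
clWaitForEvents(SLOTS, done);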

If you have seen our output, we get fake keys on just about every loop; see @dvlajkovic's report!!
Code:
Today is Sun Sep 10 15:37:46 2023
Connected to device:NVIDIA Corporation => NVIDIA GeForce RTX 4090

Device Kernel properties:
    number of cores:                 128
    recommended work group size (local threads):     32
    max work group size:                256


Number Loops per thread:                 65536
Number of keys per thread(x32):             65536
Local threads:                         256
Global threads:                     32768
Keys per kernel:                     0x80000000(2147483648)

BruteForcing for:
SB01:       

Range: 0000000080000000

Loop             From             To                kernel Time  Keys per seconds        #Keys Found

0000003B 00387C624F0000 00387CE24EFFFF [...] 12:39:28  kps:5.711649249351e+008 00000026         Key 001:38 7C AE 62 4F 05 3D 91 Key 002:38 7C 8F 43 52 08 13 6D Key 003:38 7C D2 86 7E 08 2E B4 Key 004:38 7C D2 86 57 70 99 60 Key 005:38 7C D3 87 10 76 45 CB Key 006:38 7C B1 65 F1 F3 98 7C Key 007:38 7C 67 1B 2E F3 92 B3 Key 008:38 7C 70 24 3A D0 57 61 Key 009:38 7C 6E 22 C4 E9 C1 6E Key 010:38 7C 80 34 BB 97 2A 7C Key 011:38 7C C2 76 8D DB E1 49 Key 012:38 7C B2 66 98 DC A5 19 Key 013:38 7C B2 66 F1 FD CC BA Key 014:38 7C CA 7E B8 E0 CE 66 Key 015:38 7C B5 69 FD F9 1B 11 Key 016:38 7C B0 64 52 E9 FA 35 Key 017:38 7C BE 72 90 EA FD 77 Key 018:38 7C BF 73 16 FD 5E 71 Key 019:38 7C BD 71 42 C8 78 82 Key 020:38 7C D4 88 B4 EE 2A CC Key 021:38 7C 6C 20 24 EF 94 A7 Key 022:38 7C AC 60 62 F2 09 5D Key 023:38 7C DE 92 B6 E3 EF 88 Key 024:38 7C 89 3D 1D CC B7 A0 Key 025:38 7C 93 47 21 D8 46 3F Key 026:38 7C 6A 1E 70 FD D8 45
0000003C 00387CE24F0000 00387D624EFFFF [...] 12:39:32  kps:5.696851920859e+008 00000014         Key 001:38 7D 01 B6 B9 00 36 EF Key 002:38 7D 33 E8 9B 03 14 B2 Key 003:38 7D 1E D3 58 04 2A 86 Key 004:38 7D 1E D3 72 48 33 ED Key 005:38 7D 1F D4 32 B6 05 ED Key 006:38 7C F0 A4 F9 51 D1 1B Key 007:38 7D 3A EF 99 C1 EA 44 Key 008:38 7C E5 99 27 C4 FB E6 Key 009:38 7C F5 A9 88 D9 74 D5 Key 010:38 7D 12 C7 A1 F0 F2 83 Key 011:38 7D 0A BF C6 AC F5 67 Key 012:38 7D 10 C5 3D E4 00 21 Key 013:38 7D 19 CE 2D E3 FE 0E Key 014:38 7D 27 DC 43 FD 8F CF
0000003D 00387D624F0000 00387DE24EFFFF [...] 12:39:36  kps:5.710754169933e+008 00000010         Key 001:38 7D 71 26 63 00 94 F7 Key 002:38 7D A0 55 9B 05 9F 3F Key 003:38 7D A0 55 9E 05 BD 60 Key 004:38 7D A0 55 C4 D1 89 1E Key 005:38 7D A0 55 B3 F9 F2 9E Key 006:38 7D AD 62 A9 F7 B6 56 Key 007:38 7D B8 6D DB F9 03 D7 Key 008:38 7D C6 7B AD 6B 82 9A Key 009:38 7D C6 7B 6B B0 8D A8 Key 010:38 7D D2 87 7B FB EA 60

50 fake keys in only 3 loops!!! This is something that bothers me. I have had no chance to analyze the previously posted results. I have a feeling that the reported kps is much higher than what we see in the report; maybe there is a computational error. In fact I put a limit of 65536, but this number could grow without bound on a GPU that is not being used for screen display.

So HOW does CUDABISS prevent these collisions???? Are these people that launch multiple CUDABISS instances at the same time really wasting their time? God only knows...
 
Last edited:

C0der

Senior Member
Messages
270
Yeah, running more than one kernel makes it a bit more complicated, but there should be a simple "howto" out there.
Yes, you must use atomic inc.

About "fake keys":
5E8 = 500000000 kps
0xffffff=16777215
Expected fake keys per second:
500000000/16777215=29

About parallel Cudabiss:
Those are different host processes. Memory handling is done by the driver or the OS.
 

cayoenrique

Senior Member
Messages
476
Let's use @dvlajkovic's data posted in #50, where we have more rounds.
Code:
***
Number Loops per thread: 16384
***
Keys per kernel: 0x20000000(536870912)
***
Loop             From             To                kernel Time  Keys per seconds        #Keys Found
00000001 0000385F66FF0000 0000385F86FEFFFF [...] 10:12:24 kps:4.57e+008 00000027          Key 001:38 5F 80 17 E4 C4 85 2D
***
0000001F 0000386326FF0000 0000386346FEFFFF [...] 10:12:52 kps:5.69e+008 00000019 Key 001:38 63 3A D5 F9 41 3D 77

Let's ignore the posted kps, as it may be off. Instead, let's use REAL exact data. We need to ignore the first round's info, except to take the initial time: 10:12:24. From the last posted round, 0x0000001F, we know the final time was 10:12:52.
1) So the total running time was 52-24 = 28 seconds.
2) 0x0000001F in decimal is 31, but we need to ignore the first round's data, so we count 30 rounds.
3) Every round we do 536870912 keys.
4) He found a total of 671 possible keys.
5) 2,11,24,24,25,13,26,20,10,25,29,27,24,35,35,25,19,29,19,17,19,27,28,9,24,25,18,31,12,19
Max #keys/round = 35, Min #keys/round = 9, Average = 22.3667
6) Average kps = 5.7*10^8

To conclude:
Real kps (counting TOTAL time) = 536870912 / 30 = 17895697.07
= 1.79*10^7 kps ???

Ahaaa, this may explain why he is not improving. There is a lot of CPU time compared to GPU time, so his GPU is resting a lot. About:

1.79*10^7 / 5.7*10^8 = 0.0314, or about 3.1% of the time!

@C0der, THANKS for forcing me to do these numbers; I am not sure what it means, but there are inconsistencies.

Now let's go back to what you said.
I do not really know if I understood clearly. What is 0xffffff? My best guess is that you mean the 3-byte position, or 000001? I guess you are saying 500000000 kps x (1 expected REAL key) / (16777215 total keys in 3 bytes). Interesting, it seems correct. Simple math. But that is not what is really happening.

You cannot forget that what we are rolling through is not all the possible numbers in 3 bytes. For others to understand: it is like saying that my loop goes 00 00 01, 00 00 02, 00 00 03, ..., FF FF FF. And this concept assumes we have a NORMAL distribution.

NOPE, that is not what is happening. We are rolling the keys, and as a result we obtain RANDOM data that follows NO NORMAL distribution. As a result we get any possible number of found keys; from the sample we go from 9 to 35. Now, you are correct in the sense that we are close to 29, since our average is 22.3667 keys per round and rounds are about 1 second apart.
 

cayoenrique

Senior Member
Messages
476

@Me2019H

You are showing no rolling. Plus, the selected GPU is in fact the HD4400, the one inside the CPU, which I can only guess is also busy with the screen display.

PLEASE use OCLBiss.cfg and ask it to detect your GPU so you know the number of your NVIDIA GPU.

NOW this raises an issue: WHY does the HD4400 not show up in Linux? I do not recall seeing that one in Xorg.conf, hmmm! Did you forget to add the 99-firmware to live\modules? ;)
 

moonbase

VIP
Donating Member
Messages
554
@C0der

So HOW does CUDABISS prevent these collisions???? Are these people that launch multiple CUDABISS instances at the same time really wasting their time? God only knows...


People that launch multiple CudaBISS instances at the same time get increased speed compared to a single instance; I can guarantee you this with 100% certainty.
It is a waste of time to run a single instance of CudaBISS; a single instance of CudaBISS is the method for the slow men.
 

cayoenrique

Senior Member
Messages
476
You guys are fun people.
The point is not whether you can launch more than one. The point is not whether it is faster. The point is not whether you guys feel powerful doing that.

The question is whether the original programmer thought that you guys were going to do that, and whether he found ways to prevent collisions in memory. See, cores are like an ARMY: they all march at the same time and in the same direction; in fact they all do the same thing. So imagine you have 100 hungry men to feed in your army, and your instruction is: go to table #1, sit, and eat. Then all 100 men will try to sit at table #1!!!

So you saw how table #1 broke.

You got smart. You assigned a number to each man and you bought 100 tables. The next time you say: please sit at the table matching your number. And you feel happy like a general, but you are just the cook.

Now the real general saw what you did. The general thinks: wait a minute, that table has 4 chairs; I can send 3 at a time and there will even be 1 chair per table to spare.

Now the 2-star general sees what the 1-star general did, and he asks his men to follow the same orders. But guess what: the 2-star general has 10 times as many men as the other one. Just guess what will happen; now you will have not 1 broken table but 100 of them... Just because it works for 100 or 300, it does not mean it will work faster with 10,000. LONG story; let's see if I got you sleeping.
 

cayoenrique

Senior Member
Messages
476
As I promised, here is the way to use nmake.

You need to have CUDA, MSVC & the MS SDK installed. The script I mentioned before will look for your installed programs and automatically set the directories for you, but as explained, it sometimes fails. So I am skipping that script and have manually created a build process and a new script.

1) Download the next file.
nmake_sample.zip (44.54 KB)
Code:
https://workupload.com/file/GT4txA4UYtu

It is csa_core_001_nmake. It was modified to build with MSVC and has a modified makefile. There are also a couple of extra files. The first one is setvcvars.bat. I initially copied it from
C:\Program Files\Microsoft Visual Studio 9.0\VC\bin\vcvars32.bat. Mine did not work as-is. Some people on the net also claim it is buggy, as it looks at some registry keys in Windows to check that your MSVC is a valid copy!!!
I followed some guy who claims that by commenting out 3 lines the validity check is skipped. You can open the file and carefully see what I did: I commented out lines 3, 4 & 5 and put my name on them so you can see the change.
Code:
  @echo Setting environment for using Microsoft Visual Studio 2010 x86 tools.

:: Added by Enrique @call :GetVSCommonToolsDir
:: Added by Enrique @if "%VS100COMNTOOLS%"=="" goto error_no_VS100COMNTOOLSDIR
:: Added by Enrique @call "%VS100COMNTOOLS%VCVarsQueryRegistry.bat" 32bit No64bit

@if "%VSINSTALLDIR%"=="" goto error_no_VSINSTALLDIR
***

2) The second file is setnmake.bat. I took most of the code from the original auto-detection that does not work. I removed the bad code that fails to detect and instead put in 3 variables you need to change manually, because they depend on which versions of CUDA, MSVC & the MS SDK you have.
Mines are located at:
Code:
C:\Program Files\Microsoft Visual Studio 9.0
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.0
C:\Program Files\Microsoft Platform SDK\

So use your favorite editor, open setnmake.bat, and carefully change those directory names.
Mine looks like this:
Code:
@if not defined _echo echo off

@SET MVS=Microsoft Visual Studio 9.0
@SET CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.0
@SET WINDOWSSDKDIR=C:\Program Files\Microsoft Platform SDK\
@SET SETVCVARDIR=C:\Windows\

Most likely the version numbers and paths are what will differ on your system. PLEASE notice that the MVS and CUDA_PATH lines do not have a SLASH at the end, but the WINDOWSSDKDIR and SETVCVARDIR lines do need a trailing \.
Save the file.

3) Now take both setvcvars.bat and setnmake.bat and MOVE them into C:\Windows. Why? So that you can call the script from any place on your disk; no need to keep copying them over and over.

4) Now let's test our new setup. Browse to where you extracted the files and move into the csa_core_001_nmake sample program.
Now open a terminal in that window. The easy way is to hold [Left SHIFT] and [Right-Click] on any empty area where you do not have files, then select "Open command window here".
Now you need to set ALL the variables required by MSVC. That is what the 2 scripts do. You only need to call one; the other is a file used by your CMD to set up the environment variables.

Now type setnmake and [Enter], and at the end you will see MSVC responding. Take a look, these are the commands:
Code:
setnmake
cl
nmake
Results:
Setnmake.png


Now let's execute it, then clean up all the created files:
Code:
main.exe
nmake clean

nmakeclean.png


That's it. Enjoy

Oops, I forgot: there are two more samples, the add_numbers OpenCL one and a basic clinfo.
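
For anyone who has never seen such a sample: an add_numbers style OpenCL kernel is normally just a few lines like the ones below (a generic sketch of the classic vector-add "hello world", not necessarily the exact code inside the zip).
Code:
/* Classic OpenCL "hello world": each work-item adds one pair of numbers. */
__kernel void add_numbers(__global const float *a,
                          __global const float *b,
                          __global float *c,
                          const unsigned int n)
{
    size_t i = get_global_id(0);
    if (i < n)
        c[i] = a[i] + b[i];
}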
 
Last edited:

moonbase

VIP
Donating Member
Messages
554
True that.
Launching multiple CudaBISS instances simultaneously would also wear out their graphics card much faster
AZjPGzN.gif


Most decent top end NVIDIA graphics cards come with three year warranties, often extendable to five years.
No worries about blowing up a card, simply get another one free under the warranty.
 

dvlajkovic

Senior Member
Messages
498
Most decent top end NVIDIA graphics cards come with three year warranties, often extendable to five years.
No worries about blowing up a card, simply get another one free under the warranty.
By all means go ahead, make my day
a_goodjob.gif

I'll catch a ride with Uber for slow men.
 

moonbase

VIP
Donating Member
Messages
554
.... The point is not whether it is faster. ...

The question is whether the original programmer thought that you guys were going to do that.


The point is that IT IS faster to run multiple instances of CudaBISS.

If you are searching for a CW using brute force for a feed, speed is the need.
If you are a slow man, the feed will have ended, you will not have found the CW, and you will not have viewed the feed.

I think you are doing some great work in trying to develop your OpenCL app and I wish you success.
However, if it is not faster than CudaBISS running multiple instances, I cannot see who is going to use your app.

If your app is faster than multiple instances of CudaBISS once it is finalised I think there will be a large number of users/downloaders/testers.
I would certainly be one of them and as I said above, I wish you well with your coding.
 
Last edited:

moonbase

VIP
Donating Member
Messages
554
By all means go ahead, make my day
a_goodjob.gif

I'll catch a ride with Uber for slow men.


No worries here, I always run multiple instances of CudaBISS when I have the need to BF a CW.
I have never had a graphics card blow up on me yet and I have been using 3080 Ti's and 3090's for several years for this process.

Try taking a bus if you want to be with the slow men, they are slower than Uber.
 

dvlajkovic

Senior Member
Messages
498
However, if it is not faster than CudaBISS running multiple instances, I cannot see who is going to use your app.
No need to put pressure or conditions here.
It was me who asked Enrique to develop a new BF app, and I'm gonna use it.
 

moonbase

VIP
Donating Member
Messages
554
No need to put pressure or conditions here.
It was me who asked Enrique to develop a new BF app, and I'm gonna use it.


I am not putting any pressure or conditions; where did you get that from?

All I said was the simple, truthful fact that if the new app is not faster than CudaBISS, then I cannot see a reason for people with NVIDIA cards to use it.
If users have AMD cards, the tool gives them an option for BF without the need for CUDA cores, in which case there will be quite a few downloaders/users/testers, subject to speeds.

If you want to use it, then that is up to you. You have previously said you are a slow man; I wish you a speedy journey in your Uber or bus.
 