GPGPU using OpenCL

cayoenrique

THEORY & DEFINITIONS

OpenCL Platform Model: composed of one HOST (the PC) and one or more OpenCL Devices (GPU or CPU).

Platform: a slightly odd concept; in practice, platforms are mostly divided by manufacturer.

Device: where the OpenCL work is performed. It contains its own memory plus several Compute Units. Each Compute Unit contains Processing Elements (PEs); these are the cores.

Kernel: a function, written in a C-like language, that executes on an OpenCL device. A single kernel launch can run on all or many of the PEs in parallel.

Program: works like a single C file; it holds one or more Kernels.

Context: the environment within which kernels execute and in which synchronization and memory management are defined.

Queue: keeps track of the different calls you make to the target device and keeps them in order. Most commands can be executed in either blocking or non-blocking mode.

Buffer object: defines a linear collection of bytes (e.g. input_buffer and output_buffer).
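
To make these definitions concrete, here is a minimal sketch of what a program file (a *.cl file) might look like: one Program holding two Kernels, with buffer objects appearing as __global pointers in the kernel arguments. The kernel names and arguments here are my own invention for illustration, not code from the template.
Code:
/* hypothetical program file: example.cl -- one Program holding two Kernels */

/* Kernel 1: copy input_buffer to output_buffer */
__kernel void copy_bytes(__global const uchar *input_buffer,
                         __global uchar       *output_buffer)
{
    size_t i = get_global_id(0);          /* index of this work-item (PE) */
    output_buffer[i] = input_buffer[i];
}

/* Kernel 2: XOR every byte with a constant mask */
__kernel void xor_bytes(__global const uchar *input_buffer,
                        __global uchar       *output_buffer,
                        uchar                 mask)
{
    size_t i = get_global_id(0);
    output_buffer[i] = input_buffer[i] ^ mask;
}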

Memory hierarchy
OpenCL defines a four-level memory hierarchy for the compute device:
1 per-element private memory (registers; __private).
10 local memory: shared by a group of processing elements (__local);
15 read-only memory: smaller, low latency, writable by the host CPU but not the compute devices (__constant);
100 global memory: shared by all processing elements, but has high access latency (__global);
This memory hierarchy is the most important part. Note that I personally numbered the levels 1, 10, 15, 100 instead of 1 to 4, to show roughly how much slower your program gets every time you read or write each kind of memory.
In general each step is about 10 times slower: 1, 10, or 100. __constant sits somewhere in the middle, so I gave it 15. Listen, this is my own idea; do not look for it on the net. Now you may ask: why not do everything in __private? Because resources get smaller the closer you get to the core (PE), so we are forced to use slower memories as our kernel gets bigger.
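
To see where each qualifier lives, here is a small hypothetical kernel annotated with my 1/10/15/100 scale. It is only a sketch for illustration; none of these names come from the template.
Code:
__constant uint LUT[4] = {0x11, 0x22, 0x33, 0x44};         /* ~15: read-only for the device  */

__kernel void hierarchy_demo(__global const uint *in,      /* ~100: global memory            */
                             __global uint       *out,
                             __local  uint       *scratch) /* ~10: shared by the work-group  */
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    uint x = in[gid];                 /* x lives in __private registers: ~1          */
    scratch[lid] = x ^ LUT[gid & 3];
    barrier(CLK_LOCAL_MEM_FENCE);     /* make the __local write visible to the group */

    out[gid] = scratch[lid];          /* one write back to slow __global memory      */
}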

The BIG IDEA of GPGPU is to take loops and split each iteration out as an independent piece of work, then assign each iteration to a different Processing Element (PE). In the end, a serial operation is transformed into many parallel operations.
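
The classic illustration of that idea (my own sketch, not code from the template): a serial loop on the CPU, and the same loop body turned into a kernel where each work-item handles one iteration.
Code:
/* Serial CPU version: one core walks the whole loop */
void add_arrays_cpu(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* OpenCL version: the loop disappears; each work-item (PE) handles one i */
__kernel void add_arrays(__global const float *a,
                         __global const float *b,
                         __global float       *c)
{
    int i = get_global_id(0);   /* which "iteration" this work-item handles */
    c[i] = a[i] + b[i];
}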

6 Simple steps in a basic host program:
1. Define the platform ... platform = devices+context+queues
2. Create and Build the program (dynamic library for kernels)
3. Setup memory objects
4. Define the kernel (attach arguments to kernel function)
5. Submit commands ... transfer memory objects and execute kernels
6. Read the results
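
As a reference, here is a condensed, hedged sketch of those six steps for a hypothetical vecadd kernel. Error checking is omitted and the names are made up; it is not the template code, just the shape of a basic host program.
Code:
#include <stdio.h>
#include <CL/cl.h>

/* Hypothetical kernel source: one Program holding one Kernel */
static const char *src =
    "__kernel void vecadd(__global const float *a, __global const float *b,"
    "                     __global float *c)"
    "{ int i = get_global_id(0); c[i] = a[i] + b[i]; }";

int main(void)
{
    enum { N = 1024 };
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    /* 1. Define the platform: device + context + queue */
    cl_platform_id platform;  cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

    /* 2. Create and build the program (the "dynamic library" of kernels) */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);

    /* 3. Set up memory objects (linear collections of bytes) */
    cl_mem ba = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof a, a, NULL);
    cl_mem bb = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof b, b, NULL);
    cl_mem bc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof c, NULL, NULL);

    /* 4. Define the kernel and attach its arguments */
    cl_kernel k = clCreateKernel(prog, "vecadd", NULL);
    clSetKernelArg(k, 0, sizeof(cl_mem), &ba);
    clSetKernelArg(k, 1, sizeof(cl_mem), &bb);
    clSetKernelArg(k, 2, sizeof(cl_mem), &bc);

    /* 5. Submit commands: run the kernel over N work-items */
    size_t global = N;
    clEnqueueNDRangeKernel(queue, k, 1, NULL, &global, NULL, 0, NULL, NULL);

    /* 6. Read the results back to the host (blocking read) */
    clEnqueueReadBuffer(queue, bc, CL_TRUE, 0, sizeof c, c, 0, NULL, NULL);
    printf("c[10] = %f\n", c[10]);

    /* Deallocate resources */
    clReleaseMemObject(ba); clReleaseMemObject(bb); clReleaseMemObject(bc);
    clReleaseKernel(k); clReleaseProgram(prog);
    clReleaseCommandQueue(queue); clReleaseContext(ctx);
    return 0;
}
Step 5 here covers both the memory transfer (done implicitly through CL_MEM_COPY_HOST_PTR at buffer creation) and the kernel execution.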

Listen guys, I will provide some e-books soon, but to keep people from getting scared off I will hold them for later.

Now, I know this seems slow, but to be honest I have been having issues with my new Win7 disk setup and new compilers. As I catch the mistakes I make, I will speed up.

Regarding T2MI: I have no time to work on it. Since this thread has a bigger audience, I have not been working on T2MI. Soon.
 

cayoenrique

OK, here is our first template.
OCLBiss_002.zip (7.33 KB)
Code:
https://workupload.com/file/ejrEvqqCm68

This does NO CALCULATION. It is just like clinfo.exe but with less info.
It has only one purpose: when you run it, it detects your devices:

Inspecting System for OpenCL devices...Platforms found: 1
00 AMD AMD TURKS (DRM 2.50.0 / 4.19.0-6-amd64, LLVM 7.0.1) GPU


Today is Mon Aug 28 11:22:08 2023
Connected to device:AMD => AMD TURKS (DRM 2.50.0 / 4.19.0-6-amd64, LLVM 7.0.1)

Device Kernel properties:
number of cores: 6
recommended work group size (local threads): 64
max work group size: 256

Leaving now OCLBiss

In the second line you see 00, which is PD, where P=platform and D=device. In this simple setup you can NOT have more than 9 platforms or devices!
Once you know which device is yours, you can disable detection to make it run faster. (A sketch of how such a detect routine works is shown after the code below.)

Code:
    int                     DETECTDEVICEENABLE=1;   // 0=OFF 1=Detect routine; the GPU will show at the left a 2-digit number PD, P=#platform, D=#device
    int                     SELECTEDPLATFORM=0;     // Remember your GPU/CPU shows PD, so P_ is the number of the platform you want to use here
    int                     SELECTEDDEVICE=0;       // Remember your GPU/CPU shows PD, so _D is the number of the device you want to use
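
For the curious, a detect routine like that can be written more or less like this. This is a hedged sketch of the general idea, not the exact code from the zip; the function name detect_devices is mine.
Code:
#include <stdio.h>
#include <CL/cl.h>

/* Sketch of a detect routine: prints "PD vendor name type" for every
   platform P and device D found (assumes fewer than 10 of each).      */
void detect_devices(void)
{
    cl_platform_id plats[9];
    cl_uint nplat = 0;
    clGetPlatformIDs(9, plats, &nplat);
    if (nplat > 9) nplat = 9;
    printf("Inspecting System for OpenCL devices...Platforms found: %u\n", nplat);

    for (cl_uint p = 0; p < nplat; p++) {
        cl_device_id devs[9];
        cl_uint ndev = 0;
        clGetDeviceIDs(plats[p], CL_DEVICE_TYPE_ALL, 9, devs, &ndev);
        if (ndev > 9) ndev = 9;

        for (cl_uint d = 0; d < ndev; d++) {
            char vendor[128], name[128];
            cl_device_type type;
            clGetDeviceInfo(devs[d], CL_DEVICE_VENDOR, sizeof vendor, vendor, NULL);
            clGetDeviceInfo(devs[d], CL_DEVICE_NAME,   sizeof name,   name,   NULL);
            clGetDeviceInfo(devs[d], CL_DEVICE_TYPE,   sizeof type,  &type,   NULL);
            printf("%u%u %s %s %s\n", p, d, vendor, name,
                   (type & CL_DEVICE_TYPE_GPU) ? "GPU" : "CPU");
        }
    }
}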
 

cayoenrique

As for how it does this, just look at the labels:
Code:
    /** User selectable to be place in a config file in the future **/

    /** Data and buffers **/

    /** Initialize data to be processed by the kernel **/

    /** Host/device data structures **/

    /** Program/kernel data structures **/

    /** Identify a platform **/

    /** Access a device **/

    /** Create the context **/

    /** Printing Device info **/

    /** Read program file and place content into buffer **/

    /** Create program from file **/

    /** Create kernel **/

    /** Printing GPU Properties **/

    /** Create CL buffers **/

    /** Create kernel arguments from the CL buffers **/

    /** Create a CL command queue for the device **/

    /** Enqueue the command queue to the device **/

    /** Read the result **/

    /** Time and print the result **/

    /** Deallocate resources **/

Yes, the last parts are missing. That is because most, if not all, of the preceding code will always be the same. So take a quick look and understand that there is a script to follow; it will never change.

The program prints the result to a log file, log.txt.
It will read the user selectables from a config file, OCLBiss.cfg.
It will print the last key it tried and failed on, and will save any possible key it finds.
It will light the keyboard NUM LOCK LED if it finds a possible key.
A new file, "lastsearch.log", will be added to hold the last key tried, so that the search can continue in the future.
 

dvlajkovic

After it is compiled and run:

[screenshot: output after compiling and running OCLBiss]
 

cayoenrique

00 10 11
0_ is platform NVIDIA Corporation
1_ is platform Microsoft

_0 is the NVIDIA GeForce RTX 4090
_1 is the Microsoft ... which must be the integrated GPU inside your Intel CPU.

So you know that what you want to use is 00.
In the future you turn detection off:

int DETECTDEVICEENABLE=0; // 0=OFF 1=Detect routine; the GPU will show at the left a 2-digit number PD, P=#platform, D=#device
int SELECTEDPLATFORM=0; // Remember your GPU/CPU shows PD, so P_ is the number of the platform you want to use here
int SELECTEDDEVICE=0; // Remember your GPU/CPU shows PD, so _D is the number of the device you want to use

Next we are going to add an extra C file called OCLB_toolbox.c to hold all the extra routines we need. We need an external config file plus a Timer function to measure speed; speed is a function of time. Later we will also add some print routines, byte/bit manipulation, and maybe some extra CSA routines to be used on the CPU side.
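
Since the makefile already compiles with -fopenmp, the timer in OCLB_toolbox.c could be as simple as the following sketch (the names timer_start/timer_elapsed are my own; the real toolbox may differ). Keys tested divided by elapsed seconds gives keys per second.
Code:
#include <omp.h>    /* the makefile already uses -fopenmp, so omp_get_wtime() is available */

/* Simple wall-clock timer: keys_tested / timer_elapsed() = keys per second */
static double t_start;

void   timer_start(void)   { t_start = omp_get_wtime(); }
double timer_elapsed(void) { return omp_get_wtime() - t_start; }

/* usage:
   timer_start();
   ... run the kernel over N keys ...
   printf("%.0f keys/s\n", N / timer_elapsed());
*/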
 

Me2019H

After running it with make:
root@live:~/Apps/home/cryptodir/opencl/OCLBiss_002# ls
add_numbers.cl log.txt makefile OCLBiss.c OCLBiss.cbp_Linux OCLBiss.cbp_win OCLBiss.layout
root@live:~/Apps/home/cryptodir/opencl/OCLBiss_002# make
Detected OS = Linux
CFLAGS = -Wall -g -fopenmp -std=gnu11 -DCL_TARGET_OPENCL_VERSION=110 -D__OPENCL_VERSION__=110
INC_DIRS =
LIB_DIRS =
LIBS = -lgomp -lOpenCL
CC = gcc
STRIP = strip
RM = rm -f
RRM = rm -f r
srcfiles = ./OCLBiss.c
objects = ./OCLBiss.o
BIN = ./OCLBiss
make: Warning: File 'OCLBiss.c' has modification time 10663 s in the future
gcc -Wall -g -fopenmp -std=gnu11 -DCL_TARGET_OPENCL_VERSION=110 -D__OPENCL_VERSION__=110 -c ./OCLBiss.c -o ./OCLBiss.o
CC OCLBiss.c
g++ ./OCLBiss.o -lgomp -lOpenCL -o ./OCLBiss
strip ./OCLBiss
make: warning: Clock skew detected. Your build may be incomplete.

root@live:~/Apps/home/cryptodir/opencl/OCLBiss_002# ls
add_numbers.cl log.txt makefile OCLBiss OCLBiss.c OCLBiss.cbp_Linux OCLBiss.cbp_win OCLBiss.layout OCLBiss.o
root@live:~/Apps/home/cryptodir/opencl/OCLBiss_002# ./OCLBiss


Inspecting System for OpenCL devices...Platforms found: 1
00 NVIDIA NVD7 GPU


Today is Mon Aug 28 13:20:18 2023
Connected to device:NVIDIA => NVD7
invalid source
root@live:~/Apps/home/cryptodir/opencl/OCLBiss_002#
Always "invalid source". And it did not work in Code::Blocks either.

@dvlajkovic How do you run it?
 

dvlajkovic

Huh. Two things cross my mind, @Me2019H:

1. Make sure you have the following folder present, with the source files inside, to be included while compiling OCLBiss: C:\Apps\home\add_numbers

2. Rename the file OCLBiss.cbp_Linux to OCLBiss.cbp, and only then start Code::Blocks, open OCLBiss.cbp, click Build, and then click Run.
 

Me2019H

Thank you.

It works, but with the same result; maybe the OpenCL driver is not compatible with the device:
Inspecting System for OpenCL devices...Platforms found: 1
00 NVIDIA NVD7 GPU


Today is Mon Aug 28 13:20:18 2023
Connected to device:NVIDIA => NVD7
invalid source
 

cayoenrique

Back to Windows users: an update on OCL.
OCLBiss_009.zip (12.52 KB)
Code:
https://workupload.com/file/j5mYqmufuJq

PLEASE do not expect the CSA part yet.
What is new: as I explained, the package has a new file, OCLB_toolbox.c, a new OCLB_toolbox.h, and a new makefile, because the previous makefile only worked with a single *.c file and this package has two *.c files.
We now also have an OCLBiss.cfg, where in the future we can tell the program to use a different *.cl, or even the same *.cl but a different kernel. You can already set DETECTDEVICEENABLE here, or the default GPU to use with SELECTEDDEVICE. You will also see some other settings that may mean nothing now but will be useful in the future.

The way OCLBiss.cfg works is as follows. The most important character is ":": it delimits the variable, which comes before the ":", from its value, which comes after. The second most important character is the space: once the program sees a space, it ignores whatever comes after it. The third most important is "#": once it sees a "#", it ignores that line. Empty lines are ignored. A sketch of such a parser is shown below.
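
This is a hedged sketch of a parser that follows those rules; the real OCLB_toolbox code may do it differently, and parse_cfg is just an illustrative name.
Code:
#include <stdio.h>
#include <string.h>

/* Sketch of the OCLBiss.cfg rules described above:
   NAME:VALUE        -> ':' separates the variable from its value
   NAME:VALUE junk   -> everything after the first space is ignored
   # comment         -> a line with '#' is ignored
   (empty line)      -> ignored                                      */
void parse_cfg(const char *path)
{
    FILE *f = fopen(path, "r");
    if (!f) return;

    char line[256];
    while (fgets(line, sizeof line, f)) {
        line[strcspn(line, "\r\n")] = '\0';        /* strip the newline      */
        char *sp = strchr(line, ' ');
        if (sp) *sp = '\0';                        /* ignore after the space */
        if (line[0] == '\0' || strchr(line, '#'))  /* empty or '#' line      */
            continue;
        char *colon = strchr(line, ':');
        if (!colon) continue;
        *colon = '\0';                             /* split NAME and VALUE   */
        printf("variable '%s' = '%s'\n", line, colon + 1);
        /* e.g. if (!strcmp(line, "SELECTEDDEVICE")) SELECTEDDEVICE = atoi(colon + 1); */
    }
    fclose(f);
}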

Now, what is new in the program? If you recall, I told you to look quickly over what I posted, since that code never changes. Then I drew a line:
Code:
/*********************************************************************/
I told you that between these delimiters is where we will make changes to adapt the program to do what we want.

Now, in this sample 009, I want you to see a complete program. To your disappointment, it does the same thing the previous add_numbers example did. Its only purpose is to show you that we can make it work by adding new stuff between the
Code:
/*********************************************************************/

Next we need to make it do something related to CSA.
Then we will try to make it LOOP, generating new keys, so that we can measure its speed.
PLEASE be patient. You will not see what you want right away; we will only get there step by step.
 

cayoenrique

Update
I am alive. I am just having trouble with our new Mesa setup. I am getting errors I have never seen or had to deal with before. In general, if I post failing code I will receive more questions than I can answer, so I am trying to learn first.

Anyway, this is still good enough material for a class. Remember, I am no expert; I will use words that can be understood.

One core is built around a chip (IC) that has an ALU (Arithmetic Logic Unit), so it can do math. But when you compute 1 + 2 = 3, the core needs one memory location to hold the 1, another to hold the 2, and possibly another for the result, 3.
So on the same chip, as close as possible to the core, there is a small amount of memory. Most of this small but fast memory is the registers.

For simplification, let's use CPU definitions. In a CPU, the main differences between L1, L2, and L3 cache memory are capacity and transfer speed: L1 is low capacity but extremely fast, L2 is slower but has more storage space, and L3 is the slowest of the three but usually has the largest capacity.

Be aware that capacities differ a lot between GPU models and manufacturers. Now, just like bees work in hives, GPU cores work in groups called waves. The most common size used to be 32, then 64, but we can have 128 or 256. So if you have 2000 cores, they do not all work at the same time: only a few waves are launched, and while one wave works, the next one prepares to launch. The cores in a wave share a common area of memory through which they can pass information between them. A group of cores working in one wave is called a Local Group (local threads), and the memory they share is called Local Memory.

Then, as you may expect, all cores, waves, and local groups also share one memory: the Global Memory. When you say that your GPU has 24 GB of memory, that is the Global Memory.

On the GPU we have:
__private is the memory used ONLY by one core
__local is the memory used by a group of cores in a wave
__global is the memory of the GPU that can be used by any core; its most common use is to hold input data or output data

Just as an example for understanding, let's say:
1 core has 10 registers of __private memory,
1 wave has 100 units of __local memory,
and all cores together have 1000 units of __global memory.
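
To make the wave / local group idea concrete, here is a small hypothetical kernel (not from the template) where one work-group cooperates through fast __local memory and only touches slow __global memory once per group. It assumes the local size is a power of two, e.g. 64.
Code:
/* Hypothetical kernel: each work-group (one wave / local group) sums its own
   slice of the input in fast __local memory; only one work-item per group
   writes the partial result to slow __global memory.                         */
__kernel void group_sum(__global const uint *in,
                        __global uint       *partial,   /* one slot per group      */
                        __local  uint       *scratch)   /* shared inside the group */
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);
    size_t lsz = get_local_size(0);

    scratch[lid] = in[gid];              /* each core drops its value into __local */
    barrier(CLK_LOCAL_MEM_FENCE);        /* wait for the whole group               */

    for (size_t step = lsz / 2; step > 0; step /= 2) {   /* tree reduction */
        if (lid < step)
            scratch[lid] += scratch[lid + step];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)
        partial[get_group_id(0)] = scratch[0];   /* one __global write per group */
}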

Here is where it gets tricky. You want your kernel to run as fast as possible, so you write all your code using __private memory. You are so smart, I give you an A.
But all your variables cannot fit in only 10 registers, so we say they spill over. The build process or the scheduler (I am not sure which) selects the most used variables and gives them the 10 available registers, then assigns the next ones to the much slower __local memory. If your variables still do not fit, they spill over to __global memory.

Now, in the AMD SDK all of this happens automatically. For some reason in Mesa I am getting an error:
enrique@live:$ ./OpenCLBiss
config: parsing file 'config.ini'
Connected to device:AMD => AMD TURKS (DRM 2.50.0 / 4.19.0-6-amd64, LLVM 7.0.1)
LLVM ERROR: ran out of registers during register allocation
enrique@live:$
 

cayoenrique

@dvlajkovic

Maybe we should check whether the NVIDIA Toolkit works better, without these issues. But for this I need to know where the OpenCL files are located. Can you help me?

1) Navigate into the folder "C:\ProgramData\NVIDIA Corporation\" in file explorer.
Press Shift, right-click mouse, and select "Open command window here".
Type tree /f /a > ProgramData_NVIDIA.txt and press Enter.
Move ProgramData_NVIDIA.txt to your desktop.

2) Navigate into the folder "C:\Program Files\NVIDIA GPU Computing Toolkit\" in file explorer.
Press Shift, right-click mouse, and select "Open command window here".
Type tree /f /a > ProgramFiles_NVIDIA.txt and press Enter.
Move ProgramFiles_NVIDIA.txt to your desktop.

3) Zip both ProgramFiles_NVIDIA.txt and ProgramData_NVIDIA.txt, upload the archive, and send me a PM telling me where I can get it.

Then we will see how to create a compiler setup using the NVIDIA Toolkit. Hopefully native NVIDIA will work without issues.

After that I will see how I can get that "NVIDIA GPU Computing Toolkit" folder myself, so that I can try it at home.
 

dvlajkovic

All these system paths are default ones and therefore no secret, so here is the zip file with all of them > link
I have installed the latest NVIDIA GPU COMPUTING TOOLKIT v12.2 along with Visual Studio 2022 for personal use (free).
If the latest version does not suit your NVIDIA GPU, please look for a suitable one here > link
 

moonbase

Has there been any attempt to code up an OpenCL app yet for BF or is this all just a theoretical discussion listing public documents?
 

dvlajkovic

Has there been any attempt to code up an OpenCL app yet for BF or is this all just a theoretical discussion listing public documents?
We're setting up various IDEs.
So far we've covered Code::Blocks + MinGW-w64, and it is working well.
I've also got the latest Visual Studio 2022 + NVIDIA GPU Computing Toolkit v12.2, and have already compiled and run some CUDA source files.
Besides all these, I also have Cygwin64, and it compiles everything.

What IDE have you installed to follow this venture or take part in development?
 

moonbase

We're setting up various IDEs.
So far we've covered Code::Blocks + MinGW-w64, and it is working well.
I've also got the latest Visual Studio 2022 + NVIDIA GPU Computing Toolkit v12.2, and have already compiled and run some CUDA source files.
Besides all these, I also have Cygwin64, and it compiles everything.

What IDE have you installed to follow this venture or take part in development?


I do not have any IDE installed; I am a maggot who wishes to download your app and use it when it is coded up.
Same as 99% of the other members of this forum.
 

cayoenrique

@dvlajkovic I had a busy day, but I will soon download what you posted.

@moonbase
Code:
 attempt to code up an OpenCL app yet for BF or is this all just a theoretical discussion
This thread is an attempt to teach how to write OpenCL so that it can be used to brute force. So at the moment your comment is correct: I have published only theoretical info. Why? I have been having difficulty executing kernels with any complexity, meaning kernels that require some amount of registers.

Now, if you require an app NOW, I will respond: please check on us in a few months, hopefully weeks. Again, why?
First, like I said, this is a teaching lesson. Then there is the reality: even though in the past I have played with the Block Cipher and the Stream Cipher, the truth is I never got to improve that work into functional final code, so I do not have an app to provide you. So PLEASE do not post text insinuating that there are people who do not want to share. At least I do want to share, but I do not have a CUDABISS mimic yet.

@all
I know, people are getting anxious. I guess I have been too slow, but like I said I do not have a lot of time. I will proceed with my original intention: I will post a little more in the Understanding CSA thread, going over the different parts of the C code. Then we will try to make the code fit into an OpenCL kernel.

@dvlajkovic
I cannot use any NVIDIA Toolkit. In order to install it I need an NVIDIA GPU, and the only one I have is a 96-core card running on Win7 32-bit. All the new toolkits are for 64-bit OSes; the highest 32-bit version I found is version 6. This is just in case my setup does not work and I have to move to the NVIDIA tools.
 

C0der

Maybe start with the BC only first and add the SC later?
That way we can also compare the speed to the CUDA implementation in the RB-tools (with payloadlen=8).
 