THEORY & DEFINITIONS
OpenCL Platform Model: Composed of one HOST (the PC) and one or more OpenCL Devices (GPU or CPU).
Platform is kind of weird; mostly they are divided by manufacturer (AMD, NVIDIA, Intel, ...).
Device: Where the OpenCL action is performed. It contains its own memory plus various Compute Units. Each Compute Unit can have several Processing Elements (PEs); these are the cores.
Kernel: A function, written in a C-like language, that executes on an OpenCL device. A single kernel execution can run on all or many of the PEs in parallel.
Program: Works like a single C file; it holds one or more kernels.
Context: The environment within which kernels execute and in which synchronization and memory management are defined.
Queue: The queue keeps track of the different calls you make to the target device and keeps them in order. Most commands can also be executed in either blocking or non-blocking mode.
Buffer object: Defines a linear collection of bytes (e.g. input_buffer & output_buffer).
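To make these definitions concrete, here is a minimal kernel sketch of my own (the name "square" and the buffer names are made up): one program source holding one kernel, where the host's input and output buffers show up as __global pointer arguments.

/* One program can hold many kernels; this one holds just "square".
 * Each work-item (running on one PE) squares one element. */
__kernel void square(__global const float *input_buffer,
                     __global float *output_buffer)
{
    size_t i = get_global_id(0);   /* which element this work-item owns */
    output_buffer[i] = input_buffer[i] * input_buffer[i];
}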
Memory hierarchy
OpenCL defines a four-level memory hierarchy for the compute device:
1 private memory: per-element (registers; __private);
10 local memory: shared by a group of processing elements (__local);
15 read-only memory: smaller, low latency, writable by the host CPU but not the compute devices (__constant);
100 global memory: shared by all processing elements, but has high access latency (__global).
This memory hierarchy is the most important part. Look, I personally numbered them 1 to 100 instead of 1 to 4. It is to show you roughly how much slower your program will go every time you read or write to one of these memories.
In general each level is about 10 times slower: 1, 10, or 100. __constant is somewhere in the middle, so I gave it 15. Listen, this is my own idea, do not look for it on the net. Now you ask: why not do everything in __private? Because resources get smaller as you get closer to the core (PE). So we are forced to use slower memories as our kernel gets bigger.
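As a rough illustration of the four address spaces and my 1/10/15/100 numbering, here is a small demo kernel of my own (the names and the copy-through logic are invented, just to show where each qualifier goes):

__constant float SCALE = 0.5f;                    /* "15": read-only for the device */

__kernel void hierarchy_demo(__global const float *in,   /* "100": big but slow */
                             __global float *out,
                             __local float *scratch)     /* "10": shared by one work-group */
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);
    float v = in[gid] * SCALE;     /* "1": v lives in __private (registers) */
    scratch[lid] = v;
    barrier(CLK_LOCAL_MEM_FENCE);  /* sync the group before reading __local */
    out[gid] = scratch[lid];
}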
The BIG IDEA of GPGPU is to take loop functions and split every single loop iteration out independently. Then assign each iteration to a different Processing Element (PE). In the end a serial operation is transformed into many parallel operations.
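For example, here is my own sketch of that transformation: the serial loop index simply becomes the work-item ID.

/* Serial version: one CPU core walks the whole loop. */
for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];

/* Parallel version: the loop disappears. Each work-item (PE)
 * handles exactly one former iteration, all at the same time. */
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c)
{
    int i = get_global_id(0);
    c[i] = a[i] + b[i];
}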
6 Simple steps in a basic host program (full sketch after the list):
1. Define the platform ... platform = devices+context+queues
2. Create and Build the program (dynamic library for kernels)
3. Setup memory objects
4. Define the kernel (attach arguments to kernel function)
5. Submit commands ... transfer memory objects and execute kernels
6. Read the results
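Putting the six steps together, here is a minimal host program sketch in C. Assumptions: one platform with one GPU device, the vec_add kernel from above; error checking and clRelease* cleanup are omitted to keep it short.

#include <stdio.h>
#include <CL/cl.h>

static const char *src =
    "__kernel void vec_add(__global const float *a,\n"
    "                      __global const float *b,\n"
    "                      __global float *c) {\n"
    "    int i = get_global_id(0);\n"
    "    c[i] = a[i] + b[i];\n"
    "}\n";

int main(void)
{
    float a[1024], b[1024], c[1024];
    size_t n = 1024, bytes = n * sizeof(float);
    for (size_t i = 0; i < n; i++) { a[i] = (float)i; b[i] = 2.0f; }

    /* 1. Define the platform: devices + context + queues. */
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

    /* 2. Create and build the program (the "dynamic library" of kernels). */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);

    /* 3. Set up memory objects. */
    cl_mem buf_a = clCreateBuffer(ctx, CL_MEM_READ_ONLY, bytes, NULL, NULL);
    cl_mem buf_b = clCreateBuffer(ctx, CL_MEM_READ_ONLY, bytes, NULL, NULL);
    cl_mem buf_c = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);

    /* 4. Define the kernel and attach its arguments. */
    cl_kernel kern = clCreateKernel(prog, "vec_add", NULL);
    clSetKernelArg(kern, 0, sizeof(cl_mem), &buf_a);
    clSetKernelArg(kern, 1, sizeof(cl_mem), &buf_b);
    clSetKernelArg(kern, 2, sizeof(cl_mem), &buf_c);

    /* 5. Submit commands: transfer memory objects and execute the kernel.
     * CL_TRUE = blocking transfer; use CL_FALSE for non-blocking. */
    clEnqueueWriteBuffer(queue, buf_a, CL_TRUE, 0, bytes, a, 0, NULL, NULL);
    clEnqueueWriteBuffer(queue, buf_b, CL_TRUE, 0, bytes, b, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, kern, 1, NULL, &n, NULL, 0, NULL, NULL);

    /* 6. Read the results (a blocking read also waits for the kernel). */
    clEnqueueReadBuffer(queue, buf_c, CL_TRUE, 0, bytes, c, 0, NULL, NULL);
    printf("c[10] = %f\n", c[10]);   /* expect 10 + 2 = 12.0 */
    return 0;
}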
Listen guys, I will provide some e-books soon. But to prevent some of you from getting scared, I will hold them for later.
Now I know it seems slow, but to be honest I have been having issues with my new setup (the Win7 disk and new compilers). But as I catch the mistakes I make, I will speed up.
Regarding T2MI: I have no time to work on it. Since this has more audience, I have not been working on T2MI. Soon.