In general, the first time you start with the values suggested by the GPU. You get them printed by selecting
Code:
LOCALWORKDROUPSIZE:0
GLOBALWORKDROUPSIZE:0
Make a note of the values it prints.
Then use multiples of the number reported for LOCALWORKDROUPSIZE.
Finally, GLOBALWORKDROUPSIZE should always be a multiple of LOCALWORKDROUPSIZE.
The program has logic to force GLOBALWORKDROUPSIZE to be a multiple of LOCALWORKDROUPSIZE, because otherwise the kernel will not load.
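To make the multiple-of rule concrete, here is a rough Python sketch. It is not the program's actual code: the function names are made up for illustration, and rounding up (rather than down) is my assumption about how the forcing works.

```python
def candidate_local_sizes(suggested: int, max_size: int) -> list[int]:
    """List the multiples of the GPU-suggested local size up to max_size.

    Hypothetical helper: 'suggested' is the number printed when
    LOCALWORKDROUPSIZE is set to 0.
    """
    if suggested <= 0:
        raise ValueError("suggested local size must be positive")
    return list(range(suggested, max_size + 1, suggested))


def force_global_multiple(global_size: int, local_size: int) -> int:
    """Round global_size up to the nearest multiple of local_size.

    Assumption: the program rounds up; it could just as well round down.
    """
    if local_size <= 0:
        raise ValueError("local size must be positive")
    if global_size <= 0:
        return local_size
    # Integer ceiling division, then scale back to a multiple.
    return ((global_size + local_size - 1) // local_size) * local_size


if __name__ == "__main__":
    print(candidate_local_sizes(64, 256))   # [64, 128, 192, 256]
    print(force_global_multiple(1000, 64))  # 1024
```

So with a suggested local size of 64, you would try 64, 128, 192 and so on, and a global size of 1000 would end up as 1024.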
MULTITHREADSIZE should make a difference. My best guess is close to 2x. But if you have already managed to manually select a config that keeps your GPU saturated, the improvement you see will be smaller, since the GPU is already saturated. You should still see some improvement because of the PCIe load/unload operations.