PMPP Compositional GPU

Reference: Fabian Czappa, Alexandru Calotoiu, Thomas Höhl, Heiko Mantel, Toni Nguyen, Felix Wolf: Design-Time Performance Modeling of Compositional Parallel Programs. Parallel Computing, 108:1–12, September 2021.

Validate this developed framework on GPU

1. GPU Function

We add 5 GPU kernel

nop-kernel, which do nothing. after being called, it will return.
min-kernel, which return the smallest element of a vector.
max-kernel, which return the biggest element of a vector.
scalarProduct-kernel, which return the scalar product of 2 vectors, or 2 same vectors.
matrix_double_kernel, which multiplies all elements in the matrix by 2.

2. Time for Single Funtion

By measuring we obtain the running time of all functions under different input sizes.

flag	function	time (of vector with 1 element/100 elements/10000 elements/65536 elements)
NK	nop-kernel	6.06e-07 sec
MIK	min-kernel	4.214e-04 sec /4.342e-04 sec/4.548e-04 sec/6.180e-04
MAK	max-kernel	4.193e-04 sec /4.410e-04 sec/ 4.543e-04 sec/ 5.930e-04 sec
SPK	scalarProduct-kernel	4.56e-04 sec/4.553e-04/4.905e-04 sec/4.944e-04 sec
MD	matrix_double_kernel	4.494e-04 sec/5.459e-04/8.197e-03 sec(1000)/0.189(3000)/0.347(5000)/1.264 sec(10000)
N	Nopper	1.774e-06 sec
I	Increaser	2.227e-06 sec/2.148e-06 sec/3.131e-06 sec/1.796e-05 sec
QS	QuickSorter	2.059e-06 sec/1.621e-06 sec/7.840e-05 sec/6.223e-04 sec
RA	ReduceAddVector	2.548e-06 sec/3.679e-06/1.849e-05/1.117e-04
RM	ReduceMin	2.031e-06 sec
RO	Reorderer	1.753e-06 sec/3.662e-06sec/7.88e-05 sec/4.788e-04 sec
SS	SelectionSorter	2.121e-06 sec/3.419e-06 sec/4.509e-03 sec/0.205 sec

It seems the time of function F with input size V is not equal to size(V)*time(F) in our experiments. The reason might be that the functions are all parallel but the vector size is not enough big.

So we choose bigger vector size for GPU kernel:

flag	function	time/s (of vector with 10000 element/100000 elements/1000000 elements/3000000 elements/5000000 elements/)
MIK	min-kernel	4.548e-04/6.294e-04/3.662e-03/8.365e-03/0.0161
MAK	max-kernel	4.543e-04/7.083e-04/2.859e-03/9.994e-03/0.0114
SPK	scalarProduct-kernel	4.905e-04/1.386e-03/9.247e-03/0.0337/0.0498

We can observe that each function has a base time of similar duration, which we guess is used for the loading of the task.

As the new result we get, min kernel, max kernel and scalar product kernel have linear growth with size of vector,

And matrix double kernel has super-linear growth with size of vector.

According to the data, we use least squares to get the estimation function of vector size x:

Flag	Function	function(s)
MIK	min-kernel	3.0286e-09 * x + 0.00036195
MAK	max-kernel	2.4106e-09 * x + 0.00058695
SPK	scalarProduct-kernel	1.01828e-08 * x + 0.00032437
MD	matrix_double_kernel	1.0473e-08 * x^2 + 2.1779e-05 * x - 0.0034685

The fitting curves are as following:

3. Time for Compositional GPU

We measure the running time of different combinations, including, the combination of GPU and GPU functions, the combination of GPU and CPU functions.

Composition Test:

function_A&function_B	size(vec_A)	size(vec_B)	time(A)/s	time(B)/s	time/s
MAK+NK	10000	1	4.543e-04	6.06e-07	4.640e-04
SPK+NK	100	1	4.553e-04	6.06e-07	4.702e-04
MIK + NK	10000	1	4.548e-04	6.06e-07	4.633e-04
N + NK	1	1	1.774e-06	6.06e-07	3.885e-06
I + MIK	10000	10000	3.131e-06	4.548e-04 sec	4.745e-04
QS+MAK	100	100	1.621e-06	4.410e-04	4.590e-04
SS+MAK	100	100	3.419e-06	4.410e-04	4.395e-04
RO + SPK	100	100	3.662e-06	4.553e-04	4.5046e-04
RM+NK	1	1	2.031e-06	6.06e-07	9.392e-06
SS+SPK	10000	10000	4.509e-03	4.944e-04	4.700e-03

Relative Error = (test_time - (timeA+timeB))/(timeA+timeB) * 100%

FunA & FunB	timeA+timeB/s	test_time/s	Relative error
MAK+NK	4.549e-04	4.640e.04	2%
SPK+NK	4.559e-04	4.702e-04	3.1%
MIK + NK	4.554e-04	4.633e-04	1.7%
N + NK	2.38e-06	3.885e-06	63.2%
I + MIK	4.579e-04	4.745e-04	3.6%
QS+MAK	4.426e-04	4.590e-04	3.7%
SS+MAK	4.444e-04	4.395e-04	-1.1%
RO + SPK	4.589e-04	4.5046e-04	-1.8%
SS+SPK	5.003e-03	4.700e-03	-6%

Conclusion:

Except for N+NK, the relative errors of the experimental results of all other groups are less than 10%, and it can be considered that the method of estimating the function combination by the sum of each single function is feasible.
We find that the relative error is positive for all GPU-to-GPU combinations,We think there may be other factors to consider in the GPU-to-GPU combination.
We observe that the relative error is positive for all experiments including the nop-kernel. Especially in the N+NK experiment, the error is as high as 63.2%.I think it can be deduced that there is still something not taken into account in the modeling with nop-kernel.

Prediction of unused size with model:

function(s)	size(vec_A)	size(vec_B)	time/s	model_time/s	Relative error
MAK+NK	60000	1	6.532e-04	7.322e-04	-10.78%
SPK+NK	200000	1	2.278e-03	2.236e-03	1.87%
MIK	8000000	-	0.0238564	0.0245906	-3.0%
MD	1500	-	0.0128734	0.05286	-75%

Conclusion:

We can see that for linear functions, the model obtained by combining the modeling methods is more accurate, but for superlinear functions, this accuracy may depend on the chosen fitting method.
Through experiments on a single GPU function and a combination of two GPU functions, we preliminarily verify the feasibility of combining performance modeling in GPU programming.

5. Problem found in experiments

GPU initialization

When we execute following code,we get following result
code:

        timer1(tp_nop);
        timer1(tp_nop);

result:

Testing Time of : taskpool(1,Nop_GPU)
repeats: 1000
size: 1
average time of taskpool(1,Nop_GPU) is 0.000282549 sec

Testing Time of : taskpool(1,Nop_GPU)
repeats: 1000
size: 1
average time of taskpool(1,Nop_GPU) is 6.06e-07 sec

I guess the first timer always includes the time that initialization of GPU using.

6. Conclusion

For composition between GPU&CPU and between GPU&GPU,our experiments initially proved that

C is a composition of A and B, so: time(C) = time(A) + time(B)

Further experiments preliminarily prove that if the functions A and B are linear functions:

A is function F1 with input size V1, B is function F2 with input size V2

=> time(C) = size(V1)*time(F1)+size(V2)*time(F2)(+constant_base_time)

So far, we have initially verified the application of compositional modeling on GPU.

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
source		source
CMakeLists.txt		CMakeLists.txt
README.md		README.md
draw.py		draw.py
myplot1.png		myplot1.png
myplot2.png		myplot2.png
myplot3.png		myplot3.png
myplot4.png		myplot4.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PMPP Compositional GPU

1. GPU Function

2. Time for Single Funtion

3. Time for Compositional GPU

5. Problem found in experiments

GPU initialization

6. Conclusion

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

Superone77/PMPP-Compositional-GPU

Folders and files

Latest commit

History

Repository files navigation

PMPP Compositional GPU

1. GPU Function

2. Time for Single Funtion

3. Time for Compositional GPU

5. Problem found in experiments

GPU initialization

6. Conclusion

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages