Parallel Dense Matrix Multiplication

This report is about the matrix multiplication methods and their differences. At the end of the report, there are compares of these methods from a time aspect. This report is divided into 3 parts:

Matrix Multiplication with traditional serial method
- for matrix size: 128, 256, 512, 1024, 2048, and 4096
Multithreaded Matrix Multiplication
- for thread size: 1, 2, 4, 8, and 16
- for matrix size: 128, 256, 512, 1024, 2048, and 4096
Cache Friendly Multithreaded Matrix Multiplication
- for thread size: 1, 2, 4, 8, and 16
- for matrix size: 128, 256, 512, 1024, 2048, and 4096
BONUS: How to run these codes?

Before all parts, let’s start with the simple program that generates matrix files for us. write_file.c is a C program that creates a folder called data if it doesn’t exist and fill this folder with .txt files which represent 2 different random generated matrices. These matrices have 128, 256, 512, 1024, 2048, and 4096 versions. The numbers in them, are randomly selected between 0-10. When this program runs, results will be like this:

Now, our different version matrices are ready. It is time to start Matrix Multiplication methods. All methods generate .txt files that represent multiplication results. At the end of the file, there is time information about them(total time of multiplication, writing to file and etc). Let`s start with the Traditional Serial Matrix Multiplication method.

1. Matrix Multiplication with traditional serial method

In the traditional serial method, the first matrix will be read from memory in row-wise order and the second matrix will be read from memory in column-wise order. This is classical matrix multiplication:

The program takes the matrices in all sizes and calculates multiplications of them. When this program runs, results will be like this:

2. Multithreaded Matrix Multiplication

In this method, the program will still use the traditional multiplication method but in extra, it will use the threads to operate multiplications at the same time on multiple threads which is called Multithreaded operation. In this section, the program will run the same .txt files for 1, 2, 4, 8, and 16 threads. When the program will run, results will be like this:

3. Cache Friendly Multithreaded Matrix Multiplication

This method is not using the traditional serial method. Cache Friendly Multithreaded Matrix Multiplication method is more complicated for humans but more efficient for computers.

C is a row-major language. That means it reaches out the data in row order. If a program needs the whole data in one column, it has to reach all row data(include useless ones) and take the data from it. Then the program will do a lot of cache miss because it can not store whole data in cache. It needs to go to memory to get specific data. The developer has to decrease the cache miss ratio to increase the performance of the program.

For these matrices, for accessing row data, cache miss almost decreases to 25%. But in column accesses cache miss is near to 100%. That means it gets the specific data always from the main memory instead of the cache. When we compare, it takes more time than normal cache access. In this method, there is not any column access. With the power of multithreaded operation, the program becomes almost the fastest version of itself.

As mentioned above, accessing specific data from the cache is efficient than accessing it from the main memory or RAM.

When the program will run, results will be like this:

In the traditional serial method, the multiplication of 4096 x 4096 matrices, takes 610 seconds (10 minutes). But with using 16 threads and a cache friendly algorithm, it decreases to 30 seconds (0.5 minutes). It is almost 20 times faster than the traditional one. Then, we can say that the most important point in multiplication is how the developer wrote it. The results are identically same but efficient one uses the resources in a better way. That made the multiplication time nearly 20 times faster. Let`s see the whole statistics in tables, that helps us to see more clearly.

The first one will be multiplication time comparison. There are 3 methods mentioned above and will start with traditional method, here is the results:

Matrix sizes	Traditinal serial method (Multiplication time)
128 x 128	0.008 seconds
256 x 256	0.059 seconds
512 x 512	0.479 seconds
1024 x 1024	5.084 seconds
2048 x 2048	63.21 seconds
4096 x 4096	610.01 seconds

Matrix sizes - thread size	Parallel(Multiplication time)	Cache friendly parallel(Multiplication time)
128 x 128 – 1 thread	0.005 seconds	0.007 seconds
128 x 128 – 2 threads	0.003 seconds	0.004 seconds
128 x 128 – 4 threads	0.002 seconds	0.003 seconds
128 x 128 – 8 threads	0.002 seconds	0.003 seconds
128 x 128 – 16 threads	0.003 seconds	0.001 seconds
256 x 256 – 1 thread	0.052 seconds	0.055 seconds
256 x 256 – 2 threads	0.024 seconds	0.028 seconds
256 x 256 – 4 threads	0.013 seconds	0.015 seconds
256 x 256 – 8 threads	0.013 seconds	0.008 seconds
256 x 256 – 16 threads	0.006 seconds	0.009 seconds
512 x 512 – 1 thread	0.425 seconds	0.439 seconds
512 x 512 – 2 threads	0.231 seconds	0.227 seconds
512 x 512 – 4 threads	0.110 seconds	0.120 seconds
512 x 512 – 8 threads	0.093 seconds	0.065 seconds
512 x 512 – 16 threads	0.061 seconds	0.068 seconds
1024 x 1024 – 1 thread	4.535 seconds	3.621 seconds
1024 x 1024 – 2 threads	2.637 seconds	1.818 seconds
1024 x 1024 – 4 threads	1.342 seconds	0.931 seconds
1024 x 1024 – 8 threads	0.687 seconds	0.518 seconds
1024 x 1024 – 16 threads	0.698 seconds	0.540 seconds
2048 x 2048 – 1 thread	56.92 seconds	28.13 seconds
2048 x 2048 – 2 threads	33.50 seconds	14.31 seconds
2048 x 2048 – 4 threads	17.36 seconds	7.384 seconds
2048 x 2048 – 8 threads	9.918 seconds	4.167 seconds
2048 x 2048 – 16 threads	9.888 seconds	4.135 seconds
4096 x 4096 – 1 thread	665.0 seconds	204.3 seconds
4096 x 4096 – 2 threads	343.7 seconds	113.6 seconds
4096 x 4096 – 4 threads	181.0 seconds	57.73 seconds
4096 x 4096 – 8 threads	107.5 seconds	33.82 seconds
4096 x 4096 – 16 threads	106.3 seconds	35.37 seconds

Let’s see the whole operation time (load files + multiplication + create/write files)

Matrix sizes - thread size	Parallel(Total operation time)	Cache friendly parallel(Total operation time)
128 x 128 – 1 thread	0.009 seconds	0.010 seconds
128 x 128 – 2 threads	0.007 seconds	0.007 seconds
128 x 128 – 4 threads	0.005 seconds	0.005 seconds
128 x 128 – 8 threads	0.005 seconds	0.008 seconds
128 x 128 – 16 threads	0.006 seconds	0.005 seconds
256 x 256 – 1 thread	0.065 seconds	0.069 seconds
256 x 256 – 2 threads	0.038 seconds	0.040 seconds
256 x 256 – 4 threads	0.025 seconds	0.028 seconds
256 x 256 – 8 threads	0.025 seconds	0.021 seconds
256 x 256 – 16 threads	0.019 seconds	0.022 seconds
512 x 512 – 1 thread	0.472 seconds	0.491 seconds
512 x 512 – 2 threads	0.280 seconds	0.275 seconds
512 x 512 – 4 threads	0.158 seconds	0.169 seconds
512 x 512 – 8 threads	0.141 seconds	0.113 seconds
512 x 512 – 16 threads	0.111 seconds	0.118 seconds
1024 x 1024 – 1 thread	4.725 seconds	3.818 seconds
1024 x 1024 – 2 threads	2.821 seconds	2.012 seconds
1024 x 1024 – 4 threads	1.512 seconds	1.124 seconds
1024 x 1024 – 8 threads	0.878 seconds	0.714 seconds
1024 x 1024 – 16 threads	0.882 seconds	0.743 seconds
2048 x 2048 – 1 thread	57.53 seconds	28.89 seconds
2048 x 2048 – 2 threads	34.12 seconds	15.08 seconds
2048 x 2048 – 4 threads	18.01 seconds	8.033 seconds
2048 x 2048 – 8 threads	10.66 seconds	4.933 seconds
2048 x 2048 – 16 threads	10.67 seconds	4.784 seconds
4096 x 4096 – 1 thread	668.1 seconds	207.0 seconds
4096 x 4096 – 2 threads	346.5 seconds	116.6 seconds
4096 x 4096 – 4 threads	184.1 seconds	60.71 seconds
4096 x 4096 – 8 threads	110.6 seconds	36.98 seconds
4096 x 4096 – 16 threads	109.1 seconds	38.34 seconds

These operations calculated on the SSD with using:
CPU: Intel(R) Core(TM) i7-9700K CPU 3.60GHz
RAM: 16.0 GB

BONUS: How to run these codes? (These instructions are for Ubuntu OS)

Creating matrix files

gcc -o cr_file cr_file.c
./cr_file

Now, files are created in the /data folder.

Run method 1: Traditional matrix multiplication

gcc -o trad_m trad_m.c
./trad_m

Run Method 2: Multithreaded Matrix Multiplication

gcc -o thread_m thread_m.c -lpthread
./thread_m

Run Method 3: Cache Friendly Multithreaded Matrix Multiplication

gcc -o thread_c_m thread_c_m.c -lpthread
./thread_c_m

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
assets		assets
code		code
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Parallel Dense Matrix Multiplication

1. Matrix Multiplication with traditional serial method

2. Multithreaded Matrix Multiplication

3. Cache Friendly Multithreaded Matrix Multiplication

BONUS: How to run these codes? (These instructions are for Ubuntu OS)

THANKS FOR READING

About

Uh oh!

Releases

Packages

Languages

ParvinEyvazov/multithreaded-matrix-multiplication-c

Folders and files

Latest commit

History

Repository files navigation

Parallel Dense Matrix Multiplication

1. Matrix Multiplication with traditional serial method

2. Multithreaded Matrix Multiplication

3. Cache Friendly Multithreaded Matrix Multiplication

BONUS: How to run these codes? (These instructions are for Ubuntu OS)

THANKS FOR READING

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages