**Parallel Matrix Multiply**

In the lecture I merely said write a parallel matrix multiplication program: C = A * B. Here I have added detail to make it easier for you.

The program should be able to accept any size square matrix and
utilize any
number of processors, although I will accept a fixed number of
processors.

Some comments on implementation and
submission:

1. Only one process (e.g., a process with rank = 0 acting as
the master) should read the data/matrix (or assign values to matrix A
and B) and then it should automatically
share this among the worker processes.

2. You can divide the work among processes in a number of
ways. Some suggested ways are: (a) treat computations of all elements
of each row of the resulting
matrix C as one work unit, (b) treat computations of all elements of
each column of the resulting
matrix C as one work unit, (c) computations of each element of
the resulting matrix C as one work unit, (d) create work units based on
the number of processes and the size of matrix and then assign them
different worker process.

3. Try
to do assign the work units to worker processes such that the load/work distribution among
them is balanced. The master process should collect the results from
worker processes and strores in matrix C.

4. Testing and Evaluation:

(a) Take the size of square matrix as an
input to the program by reading from the keyboard at runtime.

(b) Test your program to make sure that it works in
parallel!

(c) When the size of matrix (N) is larger than 5, it
is NOT be feasible for you to input elements of matrix A and B from the
keyboard. In such cases, simply assign some random values to elements
matrix A and B.

(d) Run your program for three different problem sizes:
matrix size: N = 100, 200, 300, 400, 500 on different combinations of
processes: 4, 8, 10, and 12 processes.

(e) Note the processing time by making use of appropriate
MPI calls and draw a graph for varying values of matrix size with "Number of CPUs/nodes on X column" and "the time taken " on Y column.

That is, use randomly generated matricies in a
range of sizes and produce a
report on the time taken for a range of different sized matricies in a
parallel programmin environment using MPI.

.

4. Submission:

- Write a report containing (a) details on your design and implementation and the way you did load balancing, (b) performance graphs and (c) program code

- Please email your report to Associate Lecturer only (by deadline). Please mention your name and student number on the first page of your report.
- Please bring a printed report and handover during the demo.

A better way of expressing the problem is: Write a function "square_dgemm", similar to the LAPACK "dgemm" routine but limited to square matrices, to compute the product of two matrices using MPI to distribute the task onto a number of processors. Note, only one processor should read the data/matrix and then the program should automatically share this among the processors.

There is a harness routine `matmul.c` in
http://www.hpc.unimelb.edu.au/cs/assignment2/
which provides checks on correctness and scalability. There are also
serial
basic and dlocked matrix multiiply sub-routines in the directory which
may
give assistance with methods of dealing with the odd bits left when the
number of processors does not divide the matrix size.