Obligatory task 1
Develop a parallel program that multiplies large double-precision matrices
A and B (C = A x B), with the matrices distributed over a rectangular mesh
of processors. Choose the message-passing algorithm that you consider best,
for example Cannon's algorithm or the DNS algorithm. (Do not copy Pacheco's
implementation of Fox's algorithm.) Use the MPI communication library.
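
To illustrate the data movement in Cannon's algorithm, here is a minimal
serial sketch in C. It is not an MPI implementation and not a solution to the
task: it assumes a q x q grid in which each simulated process holds a single
matrix element (block size 1), and the function name cannon_simulate is
hypothetical. In an MPI program the two shifts would become exchanges between
neighbouring processes, and the scalar multiply would become a local block
product.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Serial simulation of Cannon's algorithm on a q x q torus of "processes".
 * Assumption for illustration: each process (i, j) holds one element of A
 * and one of B, stored row-major. Returns 0 on success, -1 if out of memory. */
int cannon_simulate(int q, const double *A, const double *B, double *C)
{
    assert(q > 0);
    size_t n = (size_t)q * (size_t)q;
    double *a  = malloc(n * sizeof *a);   /* A element currently held */
    double *b  = malloc(n * sizeof *b);   /* B element currently held */
    double *ta = malloc(n * sizeof *ta);  /* scratch for the A shift  */
    double *tb = malloc(n * sizeof *tb);  /* scratch for the B shift  */
    if (!a || !b || !ta || !tb) {
        free(a); free(b); free(ta); free(tb);
        return -1;
    }

    /* Initial skew: row i of A moves left by i, column j of B moves up by j,
     * so process (i, j) starts with A[i][(i+j) mod q] and B[(i+j) mod q][j]. */
    for (int i = 0; i < q; i++) {
        for (int j = 0; j < q; j++) {
            a[i * q + j] = A[i * q + (i + j) % q];
            b[i * q + j] = B[((i + j) % q) * q + j];
            C[i * q + j] = 0.0;
        }
    }

    for (int step = 0; step < q; step++) {
        /* Local multiply-accumulate on every process. */
        for (int i = 0; i < q; i++)
            for (int j = 0; j < q; j++)
                C[i * q + j] += a[i * q + j] * b[i * q + j];

        /* Circular shift: A one position left, B one position up. */
        for (int i = 0; i < q; i++) {
            for (int j = 0; j < q; j++) {
                ta[i * q + j] = a[i * q + (j + 1) % q];
                tb[i * q + j] = b[((i + 1) % q) * q + j];
            }
        }
        memcpy(a, ta, n * sizeof *a);
        memcpy(b, tb, n * sizeof *b);
    }

    free(a); free(b); free(ta); free(tb);
    return 0;
}
```

After q multiply-and-shift steps every process (i, j) has accumulated all q
terms of C[i][j], which is why the algorithm needs a square process grid in
this basic form.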
You must write a report that includes:

Detailed description of your algorithm with analysis of its efficiency
and scalability

Detailed description of your code

Source listing of the code with detailed comments

Description of tests with graphs of performance in Megaflops (run on several
processor grids and with several matrix sizes)

Report on the best Megaflop rate per node attained with your code
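
For the performance figures, one common convention (assumed here, not
prescribed by the task) is to count 2*n^3 floating-point operations for the
product of two n x n matrices and to divide by the measured wall-clock time,
for example the difference of two MPI_Wtime() readings. The helper names
below are hypothetical.

```c
#include <assert.h>

/* Megaflop rate of an n x n by n x n matrix product that took `seconds`
 * of wall-clock time, assuming the conventional count of 2*n^3 operations
 * (n^3 multiplications and about n^3 additions). */
double matmul_mflops(int n, double seconds)
{
    assert(n > 0 && seconds > 0.0);
    double dn = (double)n;
    return 2.0 * dn * dn * dn / (seconds * 1.0e6);
}

/* The same rate divided evenly over the nprocs nodes that took part. */
double matmul_mflops_per_node(int n, double seconds, int nprocs)
{
    assert(nprocs > 0);
    return matmul_mflops(n, seconds) / (double)nprocs;
}
```

With this convention, a 1000 x 1000 product that takes 2 seconds on 4 nodes
runs at 1000 Megaflops in total, or 250 Megaflops per node.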

Make your source files accessible through the network
HINTS to reach maximum performance:
1. Code all local (per-node) matrix operations in terms of the BLAS library
(link with -lblas).
2. Look at the local page for compiler switches to use.
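
As a sketch of hint 1, the local block update C += A * B can be written as a
single dgemm call. The example below assumes a C interface to BLAS (cblas.h)
is installed and that blocks are square and stored row-major; the macro
USE_CBLAS and the function name local_block_multiply_add are hypothetical.
Without the macro, a plain loop with the same result is used, which is
convenient for checking correctness but will be much slower than a tuned BLAS.

```c
#include <assert.h>
#ifdef USE_CBLAS
#include <cblas.h>   /* assumption: a CBLAS header is available on the system */
#endif

/* Local update C += A * B for n x n row-major blocks held by one node. */
void local_block_multiply_add(int n, const double *A, const double *B, double *C)
{
    assert(n > 0);
#ifdef USE_CBLAS
    /* alpha = 1.0 and beta = 1.0 give C = 1.0*A*B + 1.0*C. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 1.0, C, n);
#else
    /* Reference fallback; the i-k-j order walks B and C along rows. */
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            double aik = A[i * n + k];
            for (int j = 0; j < n; j++)
                C[i * n + j] += aik * B[k * n + j];
        }
    }
#endif
}
```

Keeping the BLAS call behind one small function makes it easy to compare the
Megaflop rate of the tuned library against the reference loop.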