MPI Tutorials (IV)

This part of the tutorial was made after attending Jan Thorbecke’s class on MPI.

Job management and queuing

On a large system, many users run jobs simultaneously. What should happen when:

The system is full and you want to run your 512-CPU job?

You want to run 16 jobs; should others have to wait for them?

Should bigger jobs have priority over smaller jobs?

Should longer jobs have lower or higher priority?

The job manager and queueing system take care of this.

SLURM

Slurm is one such job manager.
Slurm commands:

sbatch: submit a job script to a queue
squeue: show the status of queued jobs (option -l for long format)
scancel: delete a job from the queue
sinfo: show information about nodes and partitions
sview: graphical (GUI) front end

(Note: we are using MPICH here, instead of OpenMPI as in the previous examples.)


#!/bin/bash
#SBATCH -J Name-of-job
#SBATCH --cpus-per-task=1
#SBATCH --ntasks=4
#SBATCH --nodes=1
#SBATCH -o mympijob-%A.out
#SBATCH --time=1:00:00
mpirun -np <number of nodes x number of tasks> ./your_mpi_executable


Submit the job with: sbatch job.scr
The output is written to mympijob-<jobid>.out

If you want to know more about the nodes and processors your job used, run:


mpirun -print-rank-map -np <number of nodes x number of tasks> ./your_mpi_executable


 

MPI tutorial (III)

This is a continuation of my notes on MPI. The previous two parts are here and here. The references are listed in Parts 1 and 2 of the notes.

Derived data types

MPI defines a set of primitive data types, for example MPI_CHAR, MPI_FLOAT, MPI_C_FLOAT_COMPLEX, etc. MPI also lets you define your own data types built from these primitives. MPI provides several methods for constructing derived data types:

  • Contiguous
  • Vector
  • Indexed
  • Struct

MPI_Type_contiguous : The simplest constructor. Produces a new data type by making count copies of an existing data type.

MPI_Type_vector : Similar to contiguous, but allows for regular gaps (stride) in the displacements. MPI_Type_hvector is identical to MPI_Type_vector except that stride is specified in bytes.

MPI_Type_vector (count,blocklength,stride,oldtype,&newtype)

MPI_Type_indexed : An array of displacements of the input data type is provided as the map for the new data type.

MPI_Type_indexed (count,blocklens[],offsets[],old_type,&newtype)

MPI_Type_struct/MPI_Type_create_struct : The new data type is formed according to completely defined map of the component data types. 

MPI_Type_struct (count,blocklens[],offsets[],old_types,&newtype)
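
To make the constructors concrete, here is a minimal sketch of my own (not from the class notes) that uses MPI_Type_vector to send one column of a row-major matrix as a single message. The 4x4 matrix and the two ranks are assumptions made purely for illustration:

#include <mpi.h>
#include <stdio.h>
#define NROWS 4
#define NCOLS 4

int main(int argc, char *argv[]) {
    int rank;
    float a[NROWS][NCOLS];          /* matrix stored row-major */
    float col[NROWS];               /* receive buffer for one column */
    MPI_Datatype coltype;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* one column = NROWS blocks of 1 float, each NCOLS floats apart */
    MPI_Type_vector(NROWS, 1, NCOLS, MPI_FLOAT, &coltype);
    MPI_Type_commit(&coltype);

    if (rank == 0) {
        for (int i = 0; i < NROWS; i++)
            for (int j = 0; j < NCOLS; j++)
                a[i][j] = i * NCOLS + j;
        /* send the second column (j = 1) as one message of the new type */
        MPI_Send(&a[0][1], 1, coltype, 1, 0, MPI_COMM_WORLD);
    }
    else if (rank == 1) {
        /* receive it as NROWS contiguous floats */
        MPI_Recv(col, NROWS, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Rank 1 received column: %.0f %.0f %.0f %.0f\n",
               col[0], col[1], col[2], col[3]);
    }

    MPI_Type_free(&coltype);
    MPI_Finalize();
    return 0;
}

Run it with at least two processes; rank 1 should print the column values 1 5 9 13.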

Groups and Communicators

https://computing.llnl.gov/tutorials/mpi/#Group_Management_Routines

Virtual Topologies

https://computing.llnl.gov/tutorials/mpi/#Virtual_Topologies

MPI tutorials (II)

This is a continuation of my notes on MPI. Part one of MPI tutorial notes is here. The sources and references are listed at the end of the article.

Communication

Point to point communication: One process sends a message to a single receiving process, addressed by the rank of that process and carrying a tag (identifier) so the receiver can process it accordingly. The crucial components are one sender and one receiver.

Collective communication: The scenario in which one process needs to pass a message to multiple other processes. For example, when a master process broadcasts a message to all of its worker processes, that is collective communication.

A combination of point-to-point and collective communication can further optimise and speed up calculations with MPI.

Point to point communication routine

There are different types of send and receive routines used for different purposes. For example:

  • Synchronous send
  • Blocking send / blocking receive
  • Non-blocking send / non-blocking receive
  • Buffered send
  • Combined send/receive
  • “Ready” send

Blocking vs Non-blocking

Most MPI communication routines come in blocking and non-blocking variants.

Blocking send : The sending process blocks until the message has been received by the receiving process or safely buffered by the system.

Non-blocking send : The sending process initiates the send and immediately resumes execution (asynchronous).

Blocking receive : The receiver blocks until a message is available.

Non-blocking receive : The receiver initiates the receive and continues execution; the message is delivered into the buffer whenever it arrives (asynchronous).

Point to point communication routine arguments

Blocking sends MPI_Send(buffer,count,type,dest,tag,comm)
Non-blocking sends MPI_Isend(buffer,count,type,dest,tag,comm,request)
Blocking receive MPI_Recv(buffer,count,type,source,tag,comm,status)
Non-blocking receive MPI_Irecv(buffer,count,type,source,tag,comm,request)

Buffer: Program (application) address space that references the data to be sent or received. In most cases, this is simply the name of the variable being sent/received. For C programs, this argument is passed by reference and usually must be prepended with an ampersand: &var1

The system buffer holds data that has been sent by the sending process but not yet received by the receiving process. Three kinds of queues can back a system buffer:

1. Zero capacity : The buffer has zero capacity and hence the link/ queue cannot hold any message in it.

2. Finite capacity : The link has finite length 'n'. If the link is not full, it holds the sender's message and the sending process continues execution.

3. Unbounded capacity : The link is of infinite length. It can hold infinitely many messages.

Data Count: Indicates the number of data elements of a particular type to be sent.

Data Type: For reasons of portability, MPI predefines its elementary data types.

Source: The rank of the sending process, used by the receive routine to identify where to accept the message from.
Tag: Arbitrary non-negative integer assigned by the programmer to uniquely identify a message. Send and receive operations should match message tags.
Communicator: Indicates the communication context, or set of processes for which the source or destination fields are valid. Unless the programmer is explicitly creating new communicators, the predefined communicator MPI_COMM_WORLD is usually used.
Status: For a receive operation, indicates the source of the message and the tag of the message. In C, this argument is a pointer to a predefined structure MPI_Status (ex. stat.MPI_SOURCE stat.MPI_TAG).
Request: Used by non-blocking send and receive operations. Since non-blocking operations may return before the requested system buffer space is obtained, the system issues a unique “request number”. The programmer uses this system assigned “handle” later (in a WAIT type routine) to determine completion of the non-blocking operation. 

Blocking communication routines

  • MPI_Send: Basic blocking send operation. Routine returns only after the application buffer in the sending task is free for reuse.
  • MPI_Recv: Receive a message and block until the requested data is available in the application buffer in the receiving task.
  • MPI_Wait: Blocks until a specified non-blocking send or receive operation has completed. For multiple non-blocking operations, the programmer can specify any, all or some completions.
  • MPI_Get_count: Returns the number of elements of the given datatype that were received.
    MPI_Get_count (&status,datatype,&count)

    In C, the status becomes a structure that stores the source and tag in status.MPI_SOURCE and status.MPI_TAG.


Example 2:

   #include "mpi.h"
   #include <stdio.h>

   int main(int argc, char *argv[])  {
   int numtasks, rank, dest, source, count, tag=1;
   char inmsg, outmsg='x';
   MPI_Status Stat;   // required variable for receive routines

   MPI_Init(&argc,&argv);
   MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);

   // task 0 sends to task 1 and waits to receive a return message
   if (rank == 0) {
     dest = 1;
     source = 1;
     MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
     MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
     } 

   // task 1 waits for task 0 message then returns a message
   else if (rank == 1) {
     dest = 0;
     source = 0;
     MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
     MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
     }

   // query receive Stat variable and print message details
   MPI_Get_count(&Stat, MPI_CHAR, &count);
   printf("Task %d: Received %d char(s) from task %d with tag %d \n",
          rank, count, Stat.MPI_SOURCE, Stat.MPI_TAG);

   MPI_Finalize();
   }

Output of

srun -p SHARED -N2 -n2 --mpi=pmi2 ./blockingsend
Task 1: Received 1 char(s) from task 0 with tag 1
Task 0: Received 1 char(s) from task 1 with tag 1

srun -p SHARED -N2 -n3 --mpi=pmi2 ./blockingsend
Task 2: Received 0 char(s) from task 4196211 with tag 0
Task 0: Received 1 char(s) from task 1 with tag 1
Task 1: Received 1 char(s) from task 0 with tag 1

(With three tasks, task 2 never enters the rank 0 or rank 1 branch, so its Stat variable is never filled in; the source, tag and count it prints come from uninitialised memory.)


Exercise 2

  • Have each task determine a unique partner task to send/receive with. One easy way to do this:
    In C:

    if (taskid < numtasks/2)
        partner = numtasks/2 + taskid;
    else
        partner = taskid - numtasks/2;
  • Each task sends its partner a single integer message: its taskid
  • Each task receives from its partner a single integer message: the partner’s taskid
  • For confirmation, after the send/receive, each task prints something like “Task ## is partner with ##” where ## is the taskid of the task and its partner.

#include <mpi.h>
#include <stdio.h>
#include <math.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(&argc, &argv);

    // Get the number of processes
    int world_size, partner, source, tag = 1;
    MPI_Status Stat;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of the process
    int world_rank, rec_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Lower half pairs with upper half: rank r <-> rank r + world_size/2
    if (world_rank < world_size/2)
    {
        partner = world_size/2 + world_rank;
        source = partner;
        MPI_Send(&world_rank, 1, MPI_INT, partner, tag, MPI_COMM_WORLD);
        MPI_Recv(&rec_rank, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &Stat);
    }
    else
    {
        partner = world_rank - world_size/2;
        source = partner;
        MPI_Send(&world_rank, 1, MPI_INT, partner, tag, MPI_COMM_WORLD);
        MPI_Recv(&rec_rank, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &Stat);
    }

    printf("Task id %d is partner with %d \n", world_rank, rec_rank);
    // Finalize the MPI environment.
    MPI_Finalize();
    return 0;
}


Output:

srun -p SHARED -n10 -N2 --mpi=pmi2 ./helloBsend
Task id 0 is partner with 5
Task id 5 is partner with 0
Task id 1 is partner with 6
Task id 3 is partner with 8
Task id 4 is partner with 9
Task id 2 is partner with 7
Task id 6 is partner with 1
Task id 7 is partner with 2
Task id 8 is partner with 3
Task id 9 is partner with 4


Non-blocking communication routines

  • MPI_Isend: Identifies an area in memory to serve as a send buffer. Processing continues immediately without waiting for the message to be copied out from the application buffer. A communication request handle is returned for handling the pending message status. The program should not modify the application buffer until subsequent calls to MPI_Wait or MPI_Test indicate that the non-blocking send has completed.
  • MPI_Irecv: Identifies an area in memory to serve as a receive buffer. Processing continues immediately without waiting for the message to arrive and be copied into the buffer. Use MPI_Wait or MPI_Test to determine when the non-blocking receive has completed and the data is safe to use.

Example 3:

#include <mpi.h>
#include <stdio.h>
#include <math.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(&argc, &argv);

    // Get the number of processes
    int numtasks, taskid, partner, tag = 1;
    MPI_Status stats[2];
    MPI_Request reqs[2];
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

    // Get the rank of the process
    int message;
    MPI_Comm_rank(MPI_COMM_WORLD, &taskid);

    if (taskid < numtasks/2)
        partner = numtasks/2 + taskid;
    else
        partner = taskid - numtasks/2;

    // post the non-blocking receive and send; both return immediately
    MPI_Irecv(&message, 1, MPI_INT, partner, tag, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&taskid, 1, MPI_INT, partner, tag, MPI_COMM_WORLD, &reqs[1]);

    /* now block until both requests are complete */
    MPI_Waitall(2, reqs, stats);

    /* print partner info and exit */
    printf("Task %d is partner with %d\n", taskid, message);
    // Finalize the MPI environment.
    MPI_Finalize();
    return 0;
}


srun -p SHARED -n10 -N2 --mpi=pmi2 ./helloNBsend
Task 0 is partner with 5
Task 5 is partner with 0
Task 1 is partner with 6
Task 6 is partner with 1
Task 3 is partner with 8
Task 8 is partner with 3
Task 7 is partner with 2
Task 2 is partner with 7
Task 4 is partner with 9
Task 9 is partner with 4


Collective communication routine

Collective action over a communicator. Collective communication routines must involve all processes within the scope of a communicator and are blocking. No tags are required.

Types of collective communication

  • Synchronisation: Processes wait until everyone has reached a synchronisation point. eg. MPI_Barrier
  • Data movement: Broadcast/ scatter/gather, all to all.
  • Collective computation (reduction): Performing an operation after data is collected by one process. eg. Global sum/ Global maximum or minimum.

The scope is all the processes in the communicator.

Collective Communication Routines

MPI_Barrier :

    Synchronization operation. Creates a barrier synchronization in a group. Each task, when reaching the MPI_Barrier call, blocks until all tasks in the group reach the same MPI_Barrier call. Then all tasks are free to proceed.
    For example: process 1 calls MPI_Barrier and from then on waits until every other process in the communicator has also called MPI_Barrier.
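
As an illustration (my own sketch, not from the original notes), the following toy program makes the barrier visible by letting ranks arrive at different times; the sleep call is only there to stagger the arrivals:

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    sleep(rank);                          /* ranks arrive at different times */
    printf("Rank %d reached the barrier\n", rank);

    MPI_Barrier(MPI_COMM_WORLD);          /* nobody proceeds until all have arrived */

    printf("Rank %d passed the barrier\n", rank);

    MPI_Finalize();
    return 0;
}

All ranks reach the barrier before any rank is allowed past it, although the exact interleaving of the printed lines depends on how the MPI runtime forwards output.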

MPI_Bcast :

      • Data movement operation. Broadcasts (sends) a message from the process with rank “root” to all other processes in the group.

 One process broadcasts a message to all the processes in the communicator. Note: it is not only the root process that calls MPI_Bcast; every process that needs to receive the data must call it as well.
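
A minimal broadcast sketch (my own illustration; the value 42 and root rank 0 are arbitrary choices):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, data = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        data = 42;                    /* only the root has the value initially */

    /* every rank calls MPI_Bcast; afterwards all ranks hold data == 42 */
    MPI_Bcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("Rank %d has data = %d\n", rank, data);

    MPI_Finalize();
    return 0;
}

Every rank, root and non-root alike, makes the same MPI_Bcast call; afterwards all of them hold the broadcast value.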

MPI_Scatter :

    • Data movement operation. Distributes distinct messages from a single source task to each task in the group.
      • sendcnt: number of elements sent to each process.
      • recvcnt: number of elements of the receive data type received by each process.

MPI_Gather: Opposite of MPI_Scatter; gathers distinct messages from each task in the group into a single destination task.

Collective computation / reduction operations: MPI_SUM, MPI_MAX, etc. The predefined reduction operations are listed below.


MPI Reduction Operation   Description              C Data Types                 Fortran Data Types
MPI_MAX                   maximum                  integer, float               integer, real, complex
MPI_MIN                   minimum                  integer, float               integer, real, complex
MPI_SUM                   sum                      integer, float               integer, real, complex
MPI_PROD                  product                  integer, float               integer, real, complex
MPI_LAND                  logical AND              integer                      logical
MPI_BAND                  bit-wise AND             integer, MPI_BYTE            integer, MPI_BYTE
MPI_LOR                   logical OR               integer                      logical
MPI_BOR                   bit-wise OR              integer, MPI_BYTE            integer, MPI_BYTE
MPI_LXOR                  logical XOR              integer                      logical
MPI_BXOR                  bit-wise XOR             integer, MPI_BYTE            integer, MPI_BYTE
MPI_MAXLOC                max value and location   float, double, long double   real, complex, double precision
MPI_MINLOC                min value and location   float, double, long double   real, complex, double precision

MPI_Reduce

    • Collective computation operation. Applies a reduction operation on all tasks in the group and places the result in one task.
MPI_Reduce (&sendbuf,&recvbuf,count,datatype,op,root,comm)

MPI_Allreduce

    Collective computation operation + data movement. Applies a reduction operation and places the result in all tasks in the group. This is equivalent to an MPI_Reduce followed by an MPI_Bcast.
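
A short sketch of my own showing both calls side by side; each task contributes its rank and the ranks are summed:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, numtasks, localval, rootsum = 0, allsum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

    localval = rank;    /* each task contributes its own rank */

    /* result ends up only on the root task (rank 0 here) */
    MPI_Reduce(&localval, &rootsum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    /* result ends up on every task */
    MPI_Allreduce(&localval, &allsum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("MPI_Reduce on root: sum of ranks = %d\n", rootsum);
    printf("MPI_Allreduce on rank %d: sum of ranks = %d\n", rank, allsum);

    MPI_Finalize();
    return 0;
}

After MPI_Reduce only rank 0 holds the sum; after MPI_Allreduce every rank does.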

Example 4

#include "mpi.h"
   #include <stdio.h>
   #define SIZE 4

   int main(int argc, char *argv[])  {
   int numtasks, rank, sendcount, recvcount, source;
   float sendbuf[SIZE][SIZE] = {
     {1.0, 2.0, 3.0, 4.0},
     {5.0, 6.0, 7.0, 8.0},
     {9.0, 10.0, 11.0, 12.0},
     {13.0, 14.0, 15.0, 16.0}  };
   float recvbuf[SIZE];

   MPI_Init(&argc,&argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

   if (numtasks == SIZE) {
     // define source task and elements to send/receive, then perform collective scatter
     source = 1;
     sendcount = SIZE;
     recvcount = SIZE;
     MPI_Scatter(sendbuf,sendcount,MPI_FLOAT,recvbuf,recvcount,
                 MPI_FLOAT,source,MPI_COMM_WORLD);

     printf("rank= %d  Results: %f %f %f %f\n",rank,recvbuf[0],
            recvbuf[1],recvbuf[2],recvbuf[3]);
     }
   else
     printf("Must specify %d processors. Terminating.\n",SIZE);

   MPI_Finalize();
   }

rank= 0 Results: 1.000000 2.000000 3.000000 4.000000 
rank= 1 Results: 5.000000 6.000000 7.000000 8.000000 
rank= 2 Results: 9.000000 10.000000 11.000000 12.000000 
rank= 3 Results: 13.000000 14.000000 15.000000 16.000000

Next part of MPI notes is here.


MPI tutorial (I)

I have taken notes from different sources (mentioned at the end of this article). The first part will cover the basics of MPI.

MPI stands for Message Passing Interface. MPI primarily addresses the message-passing parallel programming model: data is moved from the address space of one process to that of another process through cooperative operations on each process. It is not a language but just a library of functions.

Some familiar everyday analogues of message passing: WhatsApp and Twitter.

MPI buffers are data packets (staging areas in memory) that transfer information between the sender and the receiver.

Internally, an MPI implementation uses memory-based structures (system buffers) to hold data packets that are in transit between the sending and the receiving process.

Compiler wrappers used for MPI: mpicc, mpif90, etc.

Compilation looks like this:

$ mpicc -o prog prog.c

Running mpi code:

$ mpirun -np <no_of_processors> <prog_name>

$ srun -p SHARED -n<no_of_tasks> -N<no_of_nodes> --mpi=pmi2 <prog_name>

 

Structure of an MPI program and Environment management

General MPI Program Structure


Example 1:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {

    // Initialize the MPI environment
    MPI_Init(&argc, &argv);

    // Get the number of processes
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of the process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Get the name of the processor
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    if (world_rank == 0)
    {
    	// Print off a hello world message with process name only if rank is 0
    	printf("Hello world from processor %s, rank %d out of %d processors\n",processor_name, world_rank, world_size);
    }
    else
    {
    	// Print off only the hello world message
	printf("Yello world from processor %s, rank %d out of %d processors\n",processor_name, world_rank, world_size);
    }
    // Finalize the MPI environment.
    MPI_Finalize();

}

The easiest way to understand programming with MPI is a hello world application. MPI applications are usually designed so that multiple processes run the same code; every line of code is executed by each of those processes.
Header File:
  • mpi.h: Required for all programs that make MPI library calls.

MPI calls begin with MPI_xxx in C.

Environment Management Routines

Initialisation:

  • MPI_Init: Initializes the MPI execution environment. This function must be called in every MPI program, must be called before any other MPI functions and must be called only once in an MPI program. For C programs, MPI_Init may be used to pass the command line arguments to all processes, although this is not required by the standard and is implementation dependent.

Communicators or groups:

MPI uses objects called communicators and groups to define which collection of processes may communicate with each other. Within a communicator, the processes have ranks and address each other by rank. Most MPI routines require you to specify a communicator as an argument. Use MPI_COMM_WORLD whenever a communicator is required; it is the predefined communicator that includes all of your MPI processes.

  • MPI_Comm_size : Returns the total number of MPI processes in the specified communicator, such as MPI_COMM_WORLD. If the communicator is MPI_COMM_WORLD, then it represents the number of MPI tasks available to your application.

Ranks:

  • MPI_Comm_rank:
    • All processes run the same code, but we want them to do different things. To differentiate the work between processes, we can fetch their rank: a numerical ID of the current process. There is a process with rank 0, 1, 2 and so on, up to the number of processes minus one. The rank is defined within a communicator, and you can define your own communicators.
    • Within a communicator, every process has its own unique, integer identifier assigned by the system when the process initializes. A rank is sometimes also called a “task ID”. Ranks are contiguous and begin at zero. Used by the programmer to specify the source and destination of messages. Often used conditionally by the application to control program execution (if rank=0 do this / if rank=1 do that).
  • MPI_Get_processor_name: Returns the hostname of the compute node on which the process is running, so you can see where each rank ended up.
  • MPI_Finalize: Terminates the MPI execution environment; call it last to exit gracefully.

When you execute your code using

srun -p SHARED -n4 -N2 --mpi=pmi2 ./helloworld
Hello world from processor node14, rank 0 out of 4 processors
Yello world from processor node14, rank 1 out of 4 processors
Yello world from processor node15, rank 2 out of 4 processors
Yello world from processor node15, rank 3 out of 4 processors

Remember that all processes run the same code, so the print statement is executed once per process and you get multiple hello world lines. This example shows how the rank can be used to differentiate the work between processes: only the process with rank 0 prints the "Hello world" message, while every other process prints "Yello world".

Next part of the notes is here.
