MPI Tutorials (IV)

This part of the tutorial was made after attending Jan Thorbecke’s class on MPI.

Job management and queuing

On a large system, many users run jobs simultaneously. What should happen when:

The system is full and you want to run your 512-CPU job?

You want to run 16 jobs; should others have to wait for them?

Should bigger jobs have priority over smaller jobs?

Should longer jobs have lower or higher priority?

The job manager and queueing system take care of this.

SLURM

Slurm is one such job manager.
Slurm commands:

sbatch: submit a job script to a queue
squeue: show the status of queued jobs (option -l for long format)
scancel: delete a job from the queue
sinfo: show information about nodes and partitions
sview: graphical (GUI) front end

(Note: we are using MPICH here, instead of OpenMPI as in the previous examples.)


#!/bin/bash
#SBATCH -J Name-of-job
#SBATCH --cpus-per-task=1
#SBATCH --ntasks=4
#SBATCH --nodes=1
#SBATCH -o mympijob-%A.out
#SBATCH --time=1:00:00
mpirun -np <number of nodes x number of tasks> ./your_mpi_executable


Submit the job with: sbatch job.scr
The output is written to mympijob-<jobid>.out

If you want to know more about the nodes and processors your job used, run:


mpirun -print-rank-map -np <number of nodes x number of tasks> ./your_mpi_executable


 

MPI tutorial (III)

This is a continuation of my notes on MPI. The previous two parts are here and here. The references are listed in Parts 1 and 2 of the notes.

Derived data types

MPI defines a set of primitive data types, for example MPI_CHAR, MPI_FLOAT, MPI_C_FLOAT_COMPLEX, etc. MPI also lets you define your own data types built from these primitives. MPI provides several methods for constructing derived data types:

  • Contiguous
  • Vector
  • Indexed
  • Struct

MPI_Type_contiguous : The simplest constructor. Produces a new data type by making count copies of an existing data type.

MPI_Type_vector : Similar to contiguous, but allows for regular gaps (stride) in the displacements. MPI_Type_hvector is identical to MPI_Type_vector except that stride is specified in bytes.

MPI_Type_vector (count,blocklength,stride,oldtype,&newtype)

MPI_Type_indexed : An array of displacements of the input data type is provided as the map for the new data type.

MPI_Type_indexed (count,blocklens[],offsets[],old_type,&newtype)

MPI_Type_struct/MPI_Type_create_struct : The new data type is formed according to completely defined map of the component data types. 

MPI_Type_struct (count,blocklens[],offsets[],old_types,&newtype)
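
To make the constructors concrete, here is a minimal sketch of my own (not from the class notes) that uses MPI_Type_vector to send one column of a row-major matrix as a single message. The 4x4 matrix and the two ranks are assumptions made purely for illustration:

#include <mpi.h>
#include <stdio.h>
#define NROWS 4
#define NCOLS 4

int main(int argc, char *argv[]) {
    int rank;
    float a[NROWS][NCOLS];          /* matrix stored row-major */
    float col[NROWS];               /* receive buffer for one column */
    MPI_Datatype coltype;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* one column = NROWS blocks of 1 float, each NCOLS floats apart */
    MPI_Type_vector(NROWS, 1, NCOLS, MPI_FLOAT, &coltype);
    MPI_Type_commit(&coltype);

    if (rank == 0) {
        for (int i = 0; i < NROWS; i++)
            for (int j = 0; j < NCOLS; j++)
                a[i][j] = i * NCOLS + j;
        /* send the second column (j = 1) as one message of the new type */
        MPI_Send(&a[0][1], 1, coltype, 1, 0, MPI_COMM_WORLD);
    }
    else if (rank == 1) {
        /* receive it as NROWS contiguous floats */
        MPI_Recv(col, NROWS, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Rank 1 received column: %.0f %.0f %.0f %.0f\n",
               col[0], col[1], col[2], col[3]);
    }

    MPI_Type_free(&coltype);
    MPI_Finalize();
    return 0;
}

Run it with at least two processes; rank 1 should print the column values 1 5 9 13.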

Groups and Communicators

https://computing.llnl.gov/tutorials/mpi/#Group_Management_Routines

Virtual Topologies

https://computing.llnl.gov/tutorials/mpi/#Virtual_Topologies

MPI tutorials (II)

This is a continuation of my notes on MPI. Part one of MPI tutorial notes is here. The sources and references are listed at the end of the article.

Communication

Point to point communication: One process sends a message to a single receiving process, addressed by the rank of that process and carrying a tag (identifier) so the receiver can process it accordingly. The crucial components are one sender and one receiver.

Collective communication: The scenario in which one process needs to pass a message to multiple other processes. For example, when a master process broadcasts a message to all of its worker processes, that is collective communication.

A combination of point-to-point and collective communication can further optimise and speed up calculations with MPI.

Point to point communication routine

There are different types of send and receive routines used for different purposes. For example:

  • Synchronous send
  • Blocking send / blocking receive
  • Non-blocking send / non-blocking receive
  • Buffered send
  • Combined send/receive
  • “Ready” send

Blocking vs Non-blocking

Most MPI communication routines come in blocking and non-blocking variants.

Blocking send : The sending process blocks until the message has been received by the receiving process or safely buffered by the system.

Non-blocking send : The sending process initiates the send and immediately resumes execution (asynchronous).

Blocking receive : The receiver blocks until a message is available.

Non-blocking receive : The receiver initiates the receive and continues execution; the message is delivered into the buffer whenever it arrives (asynchronous).

Point to point communication routine arguments

Blocking sends MPI_Send(buffer,count,type,dest,tag,comm)
Non-blocking sends MPI_Isend(buffer,count,type,dest,tag,comm,request)
Blocking receive MPI_Recv(buffer,count,type,source,tag,comm,status)
Non-blocking receive MPI_Irecv(buffer,count,type,source,tag,comm,request)

Buffer: Program (application) address space that references the data to be sent or received. In most cases, this is simply the name of the variable being sent/received. For C programs, this argument is passed by reference and usually must be prepended with an ampersand: &var1

The system buffer holds data that has been sent by the sending process but not yet received by the receiving process. Three kinds of queues can back a system buffer:

1. Zero capacity : The buffer has zero capacity and hence the link/ queue cannot hold any message in it.

2. Finite capacity : The link has finite length 'n'. If the link is not full, it holds the sender's message and the sending process continues execution.

3. Unbounded capacity : The link is of infinite length. It can hold infinitely many messages.

Data Count: Indicates the number of data elements of a particular type to be sent.

Data Type: For reasons of portability, MPI predefines its elementary data types.

Source: The rank of the sending process, used by the receive routine to identify where to accept the message from.
Tag: Arbitrary non-negative integer assigned by the programmer to uniquely identify a message. Send and receive operations should match message tags.
Communicator: Indicates the communication context, or set of processes for which the source or destination fields are valid. Unless the programmer is explicitly creating new communicators, the predefined communicator MPI_COMM_WORLD is usually used.
Status: For a receive operation, indicates the source of the message and the tag of the message. In C, this argument is a pointer to a predefined structure MPI_Status (ex. stat.MPI_SOURCE stat.MPI_TAG).
Request: Used by non-blocking send and receive operations. Since non-blocking operations may return before the requested system buffer space is obtained, the system issues a unique “request number”. The programmer uses this system assigned “handle” later (in a WAIT type routine) to determine completion of the non-blocking operation. 

Blocking communication routines

  • MPI_Send: Basic blocking send operation. Routine returns only after the application buffer in the sending task is free for reuse.
  • MPI_Recv: Receive a message and block until the requested data is available in the application buffer in the receiving task.
  • MPI_Wait: Blocks until a specified non-blocking send or receive operation has completed. For multiple non-blocking operations, the programmer can specify any, all or some completions.
  • MPI_Get_count: Returns the number of elements of the given datatype that were received.
    MPI_Get_count (&status,datatype,&count)

    In C, the status becomes a structure that stores the source and tag in status.MPI_SOURCE and status.MPI_TAG.


Example 2:

   #include "mpi.h"
   #include <stdio.h>

   int main(int argc, char *argv[])  {
   int numtasks, rank, dest, source, count, tag=1;
   char inmsg, outmsg='x';
   MPI_Status Stat;   // required variable for receive routines

   MPI_Init(&argc,&argv);
   MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);

   // task 0 sends to task 1 and waits to receive a return message
   if (rank == 0) {
     dest = 1;
     source = 1;
     MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
     MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
     } 

   // task 1 waits for task 0 message then returns a message
   else if (rank == 1) {
     dest = 0;
     source = 0;
     MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
     MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
     }

   // query receive Stat variable and print message details
   MPI_Get_count(&Stat, MPI_CHAR, &count);
   printf("Task %d: Received %d char(s) from task %d with tag %d \n",
          rank, count, Stat.MPI_SOURCE, Stat.MPI_TAG);

   MPI_Finalize();
   }

Output of

srun -p SHARED -N2 -n2 --mpi=pmi2 ./blockingsend
Task 1: Received 1 char(s) from task 0 with tag 1
Task 0: Received 1 char(s) from task 1 with tag 1

srun -p SHARED -N2 -n3 --mpi=pmi2 ./blockingsend
Task 2: Received 0 char(s) from task 4196211 with tag 0
Task 0: Received 1 char(s) from task 1 with tag 1
Task 1: Received 1 char(s) from task 0 with tag 1

(With three tasks, task 2 never enters the rank 0 or rank 1 branch, so its Stat variable is never filled in; the source, tag and count it prints come from uninitialised memory.)


Exercise 2

  • Have each task determine a unique partner task to send/receive with. One easy way to do this:
    In C:

    if (taskid < numtasks/2)
        partner = numtasks/2 + taskid;
    else
        partner = taskid - numtasks/2;
  • Each task sends its partner a single integer message: its taskid
  • Each task receives from its partner a single integer message: the partner’s taskid
  • For confirmation, after the send/receive, each task prints something like “Task ## is partner with ##” where ## is the taskid of the task and its partner.

#include <mpi.h>
#include <stdio.h>
#include <math.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(&argc, &argv);

    // Get the number of processes
    int world_size, partner, source, tag = 1;
    MPI_Status Stat;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of the process
    int world_rank, rec_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Lower half pairs with upper half: rank r <-> rank r + world_size/2
    if (world_rank < world_size/2)
    {
        partner = world_size/2 + world_rank;
        source = partner;
        MPI_Send(&world_rank, 1, MPI_INT, partner, tag, MPI_COMM_WORLD);
        MPI_Recv(&rec_rank, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &Stat);
    }
    else
    {
        partner = world_rank - world_size/2;
        source = partner;
        MPI_Send(&world_rank, 1, MPI_INT, partner, tag, MPI_COMM_WORLD);
        MPI_Recv(&rec_rank, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &Stat);
    }

    printf("Task id %d is partner with %d \n", world_rank, rec_rank);
    // Finalize the MPI environment.
    MPI_Finalize();
    return 0;
}


Output:

srun -p SHARED -n10 -N2 --mpi=pmi2 ./helloBsend
Task id 0 is partner with 5
Task id 5 is partner with 0
Task id 1 is partner with 6
Task id 3 is partner with 8
Task id 4 is partner with 9
Task id 2 is partner with 7
Task id 6 is partner with 1
Task id 7 is partner with 2
Task id 8 is partner with 3
Task id 9 is partner with 4


Non-blocking communication routines

  • MPI_Isend: Identifies an area in memory to serve as a send buffer. Processing continues immediately without waiting for the message to be copied out from the application buffer. A communication request handle is returned for handling the pending message status. The program should not modify the application buffer until subsequent calls to MPI_Wait or MPI_Test indicate that the non-blocking send has completed.
  • MPI_Irecv: Identifies an area in memory to serve as a receive buffer. Processing continues immediately without waiting for the message to arrive and be copied into the buffer. Use MPI_Wait or MPI_Test to determine when the non-blocking receive has completed and the data is safe to use.

Example 3:

#include <mpi.h>
#include <stdio.h>
#include <math.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(&argc, &argv);

    // Get the number of processes
    int numtasks, taskid, partner, tag = 1;
    MPI_Status stats[2];
    MPI_Request reqs[2];
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

    // Get the rank of the process
    int message;
    MPI_Comm_rank(MPI_COMM_WORLD, &taskid);

    if (taskid < numtasks/2)
        partner = numtasks/2 + taskid;
    else
        partner = taskid - numtasks/2;

    // post the non-blocking receive and send; both return immediately
    MPI_Irecv(&message, 1, MPI_INT, partner, tag, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&taskid, 1, MPI_INT, partner, tag, MPI_COMM_WORLD, &reqs[1]);

    /* now block until both requests are complete */
    MPI_Waitall(2, reqs, stats);

    /* print partner info and exit */
    printf("Task %d is partner with %d\n", taskid, message);
    // Finalize the MPI environment.
    MPI_Finalize();
    return 0;
}


srun -p SHARED -n10 -N2 --mpi=pmi2 ./helloNBsend
Task 0 is partner with 5
Task 5 is partner with 0
Task 1 is partner with 6
Task 6 is partner with 1
Task 3 is partner with 8
Task 8 is partner with 3
Task 7 is partner with 2
Task 2 is partner with 7
Task 4 is partner with 9
Task 9 is partner with 4


Collective communication routine

Collective action over a communicator. Collective communication routines must involve all processes within the scope of a communicator and are blocking. No tags are required.

Types of collective communication

  • Synchronisation: Processes wait until everyone has reached a synchronisation point. eg. MPI_Barrier
  • Data movement: Broadcast/ scatter/gather, all to all.
  • Collective computation (reduction): Performing an operation after data is collected by one process. eg. Global sum/ Global maximum or minimum.

The scope is all the processes in the communicator.

Collective Communication Routines

MPI_Barrier :

    Synchronization operation. Creates a barrier synchronization in a group. Each task, when reaching the MPI_Barrier call, blocks until all tasks in the group reach the same MPI_Barrier call. Then all tasks are free to proceed.
    For example: process 1 calls MPI_Barrier and from then on waits until every other process in the communicator has also called MPI_Barrier.
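
As an illustration (my own sketch, not from the original notes), the following toy program makes the barrier visible by letting ranks arrive at different times; the sleep call is only there to stagger the arrivals:

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    sleep(rank);                          /* ranks arrive at different times */
    printf("Rank %d reached the barrier\n", rank);

    MPI_Barrier(MPI_COMM_WORLD);          /* nobody proceeds until all have arrived */

    printf("Rank %d passed the barrier\n", rank);

    MPI_Finalize();
    return 0;
}

All ranks reach the barrier before any rank is allowed past it, although the exact interleaving of the printed lines depends on how the MPI runtime forwards output.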

MPI_Bcast :

      • Data movement operation. Broadcasts (sends) a message from the process with rank “root” to all other processes in the group.

 One process broadcasts a message to all the processes in the communicator. Note: it is not only the root process that calls MPI_Bcast; every process that needs to receive the data must call it as well.
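
A minimal broadcast sketch (my own illustration; the value 42 and root rank 0 are arbitrary choices):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, data = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        data = 42;                    /* only the root has the value initially */

    /* every rank calls MPI_Bcast; afterwards all ranks hold data == 42 */
    MPI_Bcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("Rank %d has data = %d\n", rank, data);

    MPI_Finalize();
    return 0;
}

Every rank, root and non-root alike, makes the same MPI_Bcast call; afterwards all of them hold the broadcast value.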

MPI_Scatter :

    • Data movement operation. Distributes distinct messages from a single source task to each task in the group.
      • sendcnt: number of elements sent to each process.
      • recvcnt: number of elements of the receive data type received by each process.

MPI_Gather: Opposite of MPI_Scatter; gathers distinct messages from each task in the group into a single destination task.

Collective computation / reduction operations: MPI_SUM, MPI_MAX, etc. The predefined reduction operations are listed below.


MPI Reduction Operation   Description              C Data Types                 Fortran Data Types
MPI_MAX                   maximum                  integer, float               integer, real, complex
MPI_MIN                   minimum                  integer, float               integer, real, complex
MPI_SUM                   sum                      integer, float               integer, real, complex
MPI_PROD                  product                  integer, float               integer, real, complex
MPI_LAND                  logical AND              integer                      logical
MPI_BAND                  bit-wise AND             integer, MPI_BYTE            integer, MPI_BYTE
MPI_LOR                   logical OR               integer                      logical
MPI_BOR                   bit-wise OR              integer, MPI_BYTE            integer, MPI_BYTE
MPI_LXOR                  logical XOR              integer                      logical
MPI_BXOR                  bit-wise XOR             integer, MPI_BYTE            integer, MPI_BYTE
MPI_MAXLOC                max value and location   float, double, long double   real, complex, double precision
MPI_MINLOC                min value and location   float, double, long double   real, complex, double precision

MPI_Reduce

    • Collective computation operation. Applies a reduction operation on all tasks in the group and places the result in one task.
MPI_Reduce (&sendbuf,&recvbuf,count,datatype,op,root,comm)

MPI_Allreduce

    Collective computation operation + data movement. Applies a reduction operation and places the result in all tasks in the group. This is equivalent to an MPI_Reduce followed by an MPI_Bcast.
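
A short sketch of my own showing both calls side by side; each task contributes its rank and the ranks are summed:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, numtasks, localval, rootsum = 0, allsum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

    localval = rank;    /* each task contributes its own rank */

    /* result ends up only on the root task (rank 0 here) */
    MPI_Reduce(&localval, &rootsum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    /* result ends up on every task */
    MPI_Allreduce(&localval, &allsum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("MPI_Reduce on root: sum of ranks = %d\n", rootsum);
    printf("MPI_Allreduce on rank %d: sum of ranks = %d\n", rank, allsum);

    MPI_Finalize();
    return 0;
}

After MPI_Reduce only rank 0 holds the sum; after MPI_Allreduce every rank does.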

Example 4

#include "mpi.h"
   #include <stdio.h>
   #define SIZE 4

   int main(int argc, char *argv[])  {
   int numtasks, rank, sendcount, recvcount, source;
   float sendbuf[SIZE][SIZE] = {
     {1.0, 2.0, 3.0, 4.0},
     {5.0, 6.0, 7.0, 8.0},
     {9.0, 10.0, 11.0, 12.0},
     {13.0, 14.0, 15.0, 16.0}  };
   float recvbuf[SIZE];

   MPI_Init(&argc,&argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

   if (numtasks == SIZE) {
     // define source task and elements to send/receive, then perform collective scatter
     source = 1;
     sendcount = SIZE;
     recvcount = SIZE;
     MPI_Scatter(sendbuf,sendcount,MPI_FLOAT,recvbuf,recvcount,
                 MPI_FLOAT,source,MPI_COMM_WORLD);

     printf("rank= %d  Results: %f %f %f %f\n",rank,recvbuf[0],
            recvbuf[1],recvbuf[2],recvbuf[3]);
     }
   else
     printf("Must specify %d processors. Terminating.\n",SIZE);

   MPI_Finalize();
   }

rank= 0 Results: 1.000000 2.000000 3.000000 4.000000 
rank= 1 Results: 5.000000 6.000000 7.000000 8.000000 
rank= 2 Results: 9.000000 10.000000 11.000000 12.000000 
rank= 3 Results: 13.000000 14.000000 15.000000 16.000000

Next part of MPI notes is here.


MPI tutorial (I)

I have taken notes from different sources (mentioned at the end of this article). The first part will cover the basics of MPI.

MPI stands for Message Passing Interface. MPI primarily addresses the message-passing parallel programming model: data is moved from the address space of one process to that of another process through cooperative operations on each process. It is not a language but just a library of functions.

Some familiar everyday analogues of message passing: WhatsApp and Twitter.

MPI buffers are data packets (staging areas in memory) that transfer information between the sender and the receiver.

Internally, an MPI implementation uses memory-based structures (system buffers) to hold data packets that are in transit between the sending and the receiving process.

Compiler wrappers used for MPI: mpicc, mpif90, etc.

Compilation looks like this:

$ mpicc -o prog prog.c

Running mpi code:

$ mpirun -np <no_of_processors> <prog_name>

$ srun -p SHARED -n<no_of_tasks> -N<no_of_nodes> --mpi=pmi2 <prog_name>

 

Structure of an MPI program and Environment management

General MPI Program Structure


Example 1:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {

    // Initialize the MPI environment
    MPI_Init(&argc, &argv);

    // Get the number of processes
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of the process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Get the name of the processor
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    if (world_rank == 0)
    {
    	// Print off a hello world message with process name only if rank is 0
    	printf("Hello world from processor %s, rank %d out of %d processors\n",processor_name, world_rank, world_size);
    }
    else
    {
    	// Print off only the hello world message
	printf("Yello world from processor %s, rank %d out of %d processors\n",processor_name, world_rank, world_size);
    }
    // Finalize the MPI environment.
    MPI_Finalize();

}

The easiest way to understand programming with MPI is a hello world application. MPI applications are usually designed so that multiple processes run the same code; every line of code is executed by each of those processes.
Header File:
  • mpi.h: Required for all programs that make MPI library calls.

MPI calls begin with MPI_xxx in C.

Environment Management Routines

Initialisation:

  • MPI_Init: Initializes the MPI execution environment. This function must be called in every MPI program, must be called before any other MPI functions and must be called only once in an MPI program. For C programs, MPI_Init may be used to pass the command line arguments to all processes, although this is not required by the standard and is implementation dependent.

Communicators or groups:

MPI uses objects called communicators and groups to define which collection of processes may communicate with each other. Within a communicator, the processes have ranks and address each other by rank. Most MPI routines require you to specify a communicator as an argument. Use MPI_COMM_WORLD whenever a communicator is required; it is the predefined communicator that includes all of your MPI processes.

  • MPI_Comm_size : Returns the total number of MPI processes in the specified communicator, such as MPI_COMM_WORLD. If the communicator is MPI_COMM_WORLD, then it represents the number of MPI tasks available to your application.

Ranks:

  • MPI_Comm_rank:
    • All processes run the same code, but we want them to do different things. To differentiate the work between processes, we can fetch their rank: a numerical ID of the current process. There is a process with rank 0, 1, 2 and so on, up to the number of processes minus one. The rank is defined within a communicator, and you can define your own communicators.
    • Within a communicator, every process has its own unique, integer identifier assigned by the system when the process initializes. A rank is sometimes also called a “task ID”. Ranks are contiguous and begin at zero. Used by the programmer to specify the source and destination of messages. Often used conditionally by the application to control program execution (if rank=0 do this / if rank=1 do that).
  • MPI_Get_processor_name: Returns the hostname of the compute node on which the process is running, so you can see where each rank ended up.
  • MPI_Finalize: Terminates the MPI execution environment; call it last to exit gracefully.

When you execute your code using

srun -p SHARED -n4 -N2 --mpi=pmi2 ./helloworld
Hello world from processor node14, rank 0 out of 4 processors
Yello world from processor node14, rank 1 out of 4 processors
Yello world from processor node15, rank 2 out of 4 processors
Yello world from processor node15, rank 3 out of 4 processors

Remember that all processes run the same code, so the print statement is executed once per process and you get multiple hello world lines. This example shows how the rank can be used to differentiate the work between processes: only the process with rank 0 prints the "Hello world" message, while every other process prints "Yello world".

Next part of the notes is here.
