ANUSF
Fujitsu VPP300 Userguide

 


VPP300 Software

MPTools

MPTools is a profiling and performance analysis tool for VPP message passing programs. It provides a GUI, running on covpp, for analysing the profiling data created by a previous run of a parallel program on the VPP.

You will need to add the line
setenv USE_MPT
to your covpp .cshrc before the
source /opt/etc/system_cshrc
line. To use MPTools, simply:

  • run your MPI application on the VPP with the VPP_STATS environment variable set appropriately
  • when the job finishes, run mpt on covpp in the job directory (a GUI will appear)
  • click on the "GProf" file and use the menus to view profiling data.
If you have any problems, let ANUSF know.

Index

0. MPTools-Fujitsu Tools Box / Release notes

This is the online help of MPLib/MPTools, the basic communication kernel library and pre- and post-processing toolbox for the Fujitsu VPP series of supercomputers. Some functions, such as the local Print for each subwindow, are not yet fully supported in this release.

  1. Supported functions : This release provides all the basic functions needed to display the profile generated by the embedded MPLib parallel performance analyser :
    • Execution Summary : all times (user, system, vector,...) as well as memory usage and IOs cost,
    • Hardware Performance : the exact count of floating point operations for both the scalar and vector units,
    • Communication overhead : summary of the time spent waiting for a send to complete or for a new message,
    • CPU usage and IOs cost : graph of sampled and cumulated times and IOs cost,
    • Global profile : a unified profile at subroutine level, merged from all processes,
    • Top Subroutines : cost distribution of the 10 top subroutines over the processes,
    • Normal and Cumulated Process profile : for a given processor, a standard profile or a profile cumulated per subroutine call (from the stack),
    • Loops Profile : a merged profile at loop level, for all subroutines of a given process,
    • Lines Profile : a merged profile at line level, for all subroutines of a given process,
    • Call Graph : dynamic call graph display with sampling information,
    • Process replay : the replay functions use the stack information recorded during the run to replay the sampling as a time-space graph.

  2. Known problems :
    • The call graph reduction that uses the number of samples as a filter sometimes fails.

  3. Usage and limitations :
    To get a full profile from a run, it is recommended to set the environment variable VPP_STATS to 111 or 239. Using the extended profile (flag 16 of VPP_STATS) is quite costly. When loading a profile, both the GProf and DProf files have to be in the same directory. The GProf file contains all the global information about the parallel run. The DProf files contain process-specific information such as stack traces, symbol tables and sampling data. The global and process replay functions are currently based on interpreted sampling data, so the screen display can be slow on heavily loaded machines.

1. Performance Analysis overview

Two kinds of analysis are performed by the MPLib embedded part of MPTools :
  • statistical analysis, based on process sampling, with a stack dump performed on each sampling event to store the process context,
  • strict analysis, based on hardware counters and system call traps.
In both cases, a pre-processing phase is done in parallel, by each process, at the end of the run. This considerably reduces the post-processing workload and provides a self-contained set of profile files. Three kinds of files are generated :
  • GProf.jobid.out : contains all the global information about the parallel run (number of processors, times,...)
  • DProf.jobid.peid.out : these files contain, for each process, specific information such as stack traces, symbol tables and sampling data.
  • IOsTr.jobid.peid.out : contains the complete IOs profile of the parallel run (open, read, write, close,...)
All profile files are self-sufficient. There is no need to access binary and executable files to rebuild a profile.
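
As a rough illustration of the file naming above, the following Python sketch groups a directory listing by job id and reports which jobs have both a GProf file and at least one DProf file, which is what loading a profile requires. The function names `group_profiles` and `loadable_jobs` are hypothetical, and the patterns assume that the job id and PE id contain no dots; neither is part of MPTools.

```python
import re

# Assumption: job ids and PE ids contain no "." characters.
GPROF_RE = re.compile(r"^GProf\.([^.]+)\.out$")
DPROF_RE = re.compile(r"^DProf\.([^.]+)\.([^.]+)\.out$")


def group_profiles(filenames):
    """Map each job id to {'gprof': bool, 'pes': sorted list of PE ids}."""
    jobs = {}
    for name in filenames:
        gprof = GPROF_RE.match(name)
        dprof = DPROF_RE.match(name)
        if gprof:
            entry = jobs.setdefault(gprof.group(1), {"gprof": False, "pes": []})
            entry["gprof"] = True
        elif dprof:
            entry = jobs.setdefault(dprof.group(1), {"gprof": False, "pes": []})
            entry["pes"].append(dprof.group(2))
        # Any other file (IOsTr files, sources, ...) is ignored here.
    for entry in jobs.values():
        entry["pes"].sort()
    return jobs


def loadable_jobs(filenames):
    """Return the job ids that have a GProf file and at least one DProf file."""
    jobs = group_profiles(filenames)
    return sorted(job for job, entry in jobs.items()
                  if entry["gprof"] and entry["pes"])
```

In a job directory this could be called as `loadable_jobs(os.listdir("."))` after `import os`.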

2. Profilers usage and display

The embedded MPLib profilers are activated at run time when the VPP_STATS environment variable is set, according to the following values :
Profiler             VPP_STATS
Communications       1
Performance          2
Global profile       4
Full profile         8
Extended Profile     16
CPU Usage Profile    32
IOs Traces           64
No report            128
All values can be added. For example, VPP_STATS=6 means global profile and performance analysis performed with batch output, and VPP_STATS=31 means complete analysis without batch output. The recommended values are 111 or 239 to get a full profile without or with batch output on standard job output.

111 = 64(IO) + 32(CPU) + 8(Full) + 4(Global) + 2(Performance) + 1(Comm)
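
Because the values are distinct powers of two, any VPP_STATS setting can be broken back down into its individual flags. The Python sketch below does this using only the table above; the helper names `decode_vpp_stats` and `encode_vpp_stats` are hypothetical and are not part of MPLib or MPTools.

```python
# Flag values copied from the VPP_STATS table in this section.
VPP_STATS_FLAGS = {
    1: "Communications",
    2: "Performance",
    4: "Global profile",
    8: "Full profile",
    16: "Extended Profile",
    32: "CPU Usage Profile",
    64: "IOs Traces",
    128: "No report",
}


def decode_vpp_stats(value):
    """Return the names of the flags set in `value`, lowest flag first."""
    if not 0 <= value <= 255:
        raise ValueError("VPP_STATS must be between 0 and 255")
    return [name for bit, name in sorted(VPP_STATS_FLAGS.items())
            if value & bit]


def encode_vpp_stats(names):
    """Add up the flag values for the given flag names."""
    by_name = {name: bit for bit, name in VPP_STATS_FLAGS.items()}
    return sum(by_name[name] for name in set(names))


if __name__ == "__main__":
    for value in (6, 31, 111, 239):
        print(value, "=", " + ".join(decode_vpp_stats(value)))
```

Decoding 239 gives the same flags as 111 plus the value 128, and neither recommended value includes the Extended Profile flag (16).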

3. MPTools configuration

The MPTools functions need a startup file to work properly. This file is used to locate the MPTools environment (normally installed under `/usr/lang' or `/opt/FSUNmpt'), and also contains descriptions of the basic commands used on the system. The default configuration file is `/usr/lang/etc/mptrc' or `/opt/FSUNmpt/etc/mptrc'. It contains the following :
DEFINE 
EDIT      "xterm -T Text-Editor -n Edit -exec vi "
TERM      "xterm -T X-Term -n X-Term "
MENU File 
ENTRY     "Configuration"   "$EDIT $HOME/.mptrc"
ENTRY     "Edit File"       "$EDIT"
MENU Tools 
TITLE     "Run Informations"
MPT       "Execution Summary"     mpt_summary
MPT       "Performance Overview"  mpt_perf
MPT       "Communication Balance" mpt_com
MPT       "System Usage"          mpt_system
SEPARATOR 
TITLE     "Job Profile"
MPT       "Global Profile"        mpt_gprof
MPT       "Top Subroutines"       mpt_gdistrib
SEPARATOR 
TITLE     "Process Profile"
MPT       "Subroutines Profile"   mpt_sprof
MPT       "Cumulated Profile"     mpt_cprof
MPT       "Loops Profile"         mpt_Lprof
MPT       "Lines Profile"         mpt_lprof
MPT       "Call Graph"            mpt_graph
SEPARATOR 
TITLE     "Dynamic Replay"
MPT       "Process Replay"        mpt_replay
MENU Commands 
MPT       "Submit"                mpt_exec
ENTRY     "X-Term"                "$TERM"
MENU Misc 
ENTRY     "Unix man pages"        xman  
When MPTools is launched for the first time, a copy of this file is created in the user's home directory [ $HOME/.mptrc ].
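
To see which menu entries a customised $HOME/.mptrc defines, the file can be read with a few lines of Python. This is only a sketch: the grammar it assumes (a DEFINE block of name/value pairs, MENU lines that start a menu, ENTRY and MPT lines that add items, TITLE and SEPARATOR lines that only affect layout) is inferred from the listing above and not from any MPTools specification, and `parse_mptrc` is a hypothetical name.

```python
import shlex


def parse_mptrc(text):
    """Return (definitions, menus) parsed from mptrc-style text.

    definitions: dict of the names set in the DEFINE block.
    menus: dict mapping a menu name to a list of (keyword, label, command).
    """
    definitions = {}
    menus = {}
    in_define = False
    current_menu = None
    for line in text.splitlines():
        tokens = shlex.split(line)  # honours the double-quoted fields
        if not tokens:
            continue
        keyword = tokens[0]
        if keyword == "DEFINE":
            in_define = True
            current_menu = None
        elif keyword == "MENU" and len(tokens) > 1:
            in_define = False
            current_menu = tokens[1]
            menus[current_menu] = []
        elif in_define:
            definitions[keyword] = tokens[1] if len(tokens) > 1 else ""
        elif keyword in ("ENTRY", "MPT") and current_menu is not None:
            label = tokens[1] if len(tokens) > 1 else ""
            command = tokens[2] if len(tokens) > 2 else ""
            menus[current_menu].append((keyword, label, command))
        # TITLE and SEPARATOR lines are skipped: they carry no command.
    return definitions, menus
```

For example, `parse_mptrc(open(os.path.expanduser("~/.mptrc")).read())` (after `import os`) returns the definitions and the per-menu entries of the user's copy.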

4. Loops Profile

This window displays the profile of a specific process at loop level. The meaning of each column is as follows :
Subroutines   subroutine containing the referenced loop
From          starting line of the loop
To            last line of the loop
Hits          number of samples inside the loop
Nest          nest level inside the source code
Kind          loop kind : vector, scalar, mixed, inhib
Type          loop type : DO, WHILE, FOR, ARRAY
%Local        percentage of process cost
%Total        percentage of job cost
Min.VL.       minimum vector length
Max.VL.       maximum vector length
Average       average vector length
Mouse usage :
Selecting a subroutine name with the left mouse button displays a subwindow with the loops table of the selected subroutine only; selecting it with the middle button displays a subwindow with the lines table of the selected subroutine only.

5. Communication Cost

This window displays the processor-to-processor cost of the message passing layer. The window is divided into three areas :
Time spent waiting for SEND completion
    time spent by each processor to send a message to all the others
Barrier overhead
    time spent by each processor inside the barrier
Time spent waiting for RECV completion
    time spent by each processor to receive a message from the others
The scales are in seconds. The colors can differ between areas, since each area uses its own scale.

6. Cumulated Profile

This window displays the cumulated profile of a specific process. The meaning of each column is as follows :
Subroutines   subroutine name within the process
Run %         percentage of process cost
Samples       number of collected samples
Cumul         number of cumulated collected samples
Mouse usage :
Selecting a subroutine name with the left mouse button displays a subwindow with the loops table of the selected subroutine only; selecting it with the middle button displays a subwindow with the lines table of the selected subroutine only.

7. Interactive Execution

TO BE COMPLETED

8. Top Subroutines Distribution

This window displays, for each of the top subroutines, a curve showing the cost of that subroutine across the processors. Mouse usage : The left mouse button controls the zoom function: clicking once selects the first corner and clicking a second time fixes the second corner of the zoomed area. The middle button is used to unzoom.

9. Global Profile

This window displays the global profile, merged from all processes at subroutine level. The meaning of each column is as follows :
Subroutines   subroutine name within the process
Run %         percentage of process cost
Samples       number of collected samples
BarChart      a simple cost distribution display
Mouse usage :
Selecting a subroutine name with the left mouse button displays a subwindow with the pre-processor cost of that subroutine only; selecting it with the middle button displays a subwindow with the balance of the selected subroutine only.

10. Call Graph

This window displays the dynamic call graph of the selected process. The main area displays the call graph itself, indented by nest level. Selecting a subroutine name within the graph with the left mouse button opens a subwindow with the subgraph starting at the selected level. The search option in the top menubar lets you navigate forward and backward inside the call graph. The bottom part of the main window lets you filter the display by number of nest levels or by a threshold on the number of samples.

11. Execution Replay

TO BE COMPLETED (Execution Replay)

12. Lines Profile

This window displays the profile of a specific process at line level. The meaning of each column is as follows :
Lines-Subroutines   subroutine name within the process
Run %               percentage of process cost
Samples             number of collected samples
BarChart            a simple cost distribution display
Mouse usage :
Selecting a subroutine name with the left mouse button displays a subwindow with the loops table of the selected subroutine only; selecting it with the middle button displays a subwindow with the lines table of the selected subroutine only.

13. Communication Overview

TO BE COMPLETED (Communication Overview)

14. Performance Overview

This display shows three bar graphs :
Execution Times
    A preprocessor summary of CPU times: user, system, vector, cumulated message passing times and waiting time.
Number of Executed Instructions
    The exact count of instructions for the scalar unit and the vector unit (Add, Multiply, Divide).
Time spent doing Message-Passing
    Total time spent inside the message passing layer.
Mouse usage :
The left mouse button controls the zoom function: clicking once selects the first corner and clicking a second time fixes the second corner of the zoomed area. The middle button is used to unzoom.

15. Process Replay

This window displays the replay of the sampling done during the run. The top subroutines are displayed in the column on the left. The main area in the center displays one red spot for each sample recorded in the given subroutine. Inside the replay area, the left mouse button is used to display a

16. Subroutines Profile

This window displays the subroutine-level profile of a specific process. The meaning of each column is as follows :
Subroutines   subroutine name within the process
Run %         percentage of process cost
Samples       number of collected samples
Vector %
Mixed %
Scalar %
Avr.VLen
Cumulated %   number of cumulated collected samples
Mouse usage :
Selecting a subroutine name with the left mouse button displays a subwindow with the loops table of the selected subroutine only; selecting it with the middle button displays a subwindow with the lines table of the selected subroutine only.

17. Execution Summary

This window displays all the information about the usage of each processor. The meaning of each column is as follows :
PEs       processor number
Memory    maximum memory used
Elapsed   elapsed time of the process
User      total user time
Vector    total vector time
System    total system time
MFlops    sustained MFlops achieved
Send      total time spent sending messages
Receive   total time spent receiving messages
Barrier   total time spent inside the barrier
IOs       total time used to do IOs
The bottom part of the window displays the distribution of the previous values among the processors. They are selected with the switches on the right hand side. The left mouse button controls the zoom function: clicking once selects the first corner and clicking a second time fixes the second corner of the zoomed area. The middle button is used to unzoom.

18. System Usage

This window displays a graph of the cumulated values for the memory and for the user, system and vector times. The Details option in the top menubar is used to display two subwindows :
CPU load
    the sampled system, user and vector CPU usage
IOs profile
    complete timing of IOs events
Mouse usage : The left mouse button controls the zoom function: clicking once selects the first corner and clicking a second time fixes the second corner of the zoomed area. The middle button is used to unzoom.