CGPACK progress, MAR-2016


24-MAR-2016: Trying to set up TAU

First want to install PAPI.

Using papi-5.4.3. Configured with

./configure MPICC=mpiicc

to make sure that the Intel MPI C wrapper is used. Note the double "i" above, not a single "i": mpicc, with a single "i", is an MPI wrapper for the GCC C compiler.

newblue4> which mpiicc
/cm/shared/languages/Intel-Compiler-XE-16-U2/compilers_and_libraries_2016.2.181/linux/mpi/intel64/bin/mpiicc

newblue4> mpiicc --version
icc (ICC) 16.0.2 20160204
Copyright (C) 1985-2016 Intel Corporation.  All rights reserved.

newblue4> which mpicc
/cm/shared/languages/Intel-Compiler-XE-16-U2/compilers_and_libraries_2016.2.181/linux/mpi/intel64/bin/mpicc

newblue4> mpicc --version
gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-3)
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
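
The same check can be scripted, so that a build script can refuse to go on when a wrapper resolves to the wrong compiler. A minimal sketch, assuming that the first line of the `--version` banner starts with `icc` or `gcc`, as in the output above. The function name `wrapper_backend` is made up for illustration:

```shell
#!/bin/sh
# Classify the compiler behind an MPI wrapper from its `--version` banner.
# Reads the banner on stdin; prints "intel", "gnu" or "unknown".
# Assumption: only the "icc ..." and "gcc ..." banners seen above are
# recognised; anything else is reported as unknown.
wrapper_backend() {
    # Only the first line of the banner identifies the compiler.
    first=$(head -n 1)
    case "$first" in
        icc*) echo intel ;;
        gcc*) echo gnu ;;
        *)    echo unknown ;;
    esac
}
```

Used as `mpiicc --version | wrapper_backend`, this should print `intel` on this system, and `gnu` for `mpicc`.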

Here's the PAPI config.log.

Then

make

Here's the PAPI make.log.

Then

make test

and got errors:

ctests/zero
0x8000003b PAPI_TOT_CYC is not available.
0x80000034 PAPI_FP_INS is not available.
0x8000003b PAPI_TOT_CYC is not available.
0x80000066 PAPI_FP_OPS is not available.
0x8000003b PAPI_TOT_CYC is not available.
0x80000032 PAPI_TOT_INS is not available.
test_utils.c                           FAILED
Line # 717
Error: Not enough room to add an event!

Here's the full make test log papi-make-test.log.

No luck for now. So let's try using PAPI 5.3 provided via modules:

libraries/intel_builds/papi-5.3.0

The path to papi-5.3.0 is:

/cm/shared/libraries/intel_build/papi-5.3.0/bin

Let's see which events it supports:

$ papi_avail

Available events and hardware information.
--------------------------------------------------------------------------------
PAPI Version             : 5.3.0.0
Vendor string and code   : GenuineIntel (1)
Model string and code    : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz (45)
CPU Revision             : 7.000000
CPUID Info               : Family: 6  Model: 45  Stepping: 7
CPU Max Megahertz        : 2599
CPU Min Megahertz        : 2599
Hdw Threads per core     : 1
Cores per Socket         : 8
Sockets                  : 2
NUMA Nodes               : 2
CPUs per Node            : 8
Total CPUs               : 16
Running in a VM          : no
Number Hardware Counters : 11
Max Multiplex Counters   : 32
--------------------------------------------------------------------------------

    Name        Code    Avail Deriv Description (Note)
PAPI_L1_DCM  0x80000000  Yes   No   Level 1 data cache misses
PAPI_L1_ICM  0x80000001  Yes   No   Level 1 instruction cache misses
PAPI_L2_DCM  0x80000002  Yes   Yes  Level 2 data cache misses
PAPI_L2_ICM  0x80000003  Yes   No   Level 2 instruction cache misses
PAPI_L3_DCM  0x80000004  No    No   Level 3 data cache misses
PAPI_L3_ICM  0x80000005  No    No   Level 3 instruction cache misses
PAPI_L1_TCM  0x80000006  Yes   Yes  Level 1 cache misses
PAPI_L2_TCM  0x80000007  Yes   No   Level 2 cache misses
PAPI_L3_TCM  0x80000008  Yes   No   Level 3 cache misses
PAPI_CA_SNP  0x80000009  No    No   Requests for a snoop
PAPI_CA_SHR  0x8000000a  No    No   Requests for exclusive access to shared cache line
PAPI_CA_CLN  0x8000000b  No    No   Requests for exclusive access to clean cache line
PAPI_CA_INV  0x8000000c  No    No   Requests for cache line invalidation
PAPI_CA_ITV  0x8000000d  No    No   Requests for cache line intervention
PAPI_L3_LDM  0x8000000e  No    No   Level 3 load misses
PAPI_L3_STM  0x8000000f  No    No   Level 3 store misses
PAPI_BRU_IDL 0x80000010  No    No   Cycles branch units are idle
PAPI_FXU_IDL 0x80000011  No    No   Cycles integer units are idle
PAPI_FPU_IDL 0x80000012  No    No   Cycles floating point units are idle
PAPI_LSU_IDL 0x80000013  No    No   Cycles load/store units are idle
PAPI_TLB_DM  0x80000014  Yes   Yes  Data translation lookaside buffer misses
PAPI_TLB_IM  0x80000015  Yes   No   Instruction translation lookaside buffer misses
PAPI_TLB_TL  0x80000016  No    No   Total translation lookaside buffer misses
PAPI_L1_LDM  0x80000017  Yes   No   Level 1 load misses
PAPI_L1_STM  0x80000018  Yes   No   Level 1 store misses
PAPI_L2_LDM  0x80000019  No    No   Level 2 load misses
PAPI_L2_STM  0x8000001a  Yes   No   Level 2 store misses
PAPI_BTAC_M  0x8000001b  No    No   Branch target address cache misses
PAPI_PRF_DM  0x8000001c  No    No   Data prefetch cache misses
PAPI_L3_DCH  0x8000001d  No    No   Level 3 data cache hits
PAPI_TLB_SD  0x8000001e  No    No   Translation lookaside buffer shootdowns
PAPI_CSR_FAL 0x8000001f  No    No   Failed store conditional instructions
PAPI_CSR_SUC 0x80000020  No    No   Successful store conditional instructions
PAPI_CSR_TOT 0x80000021  No    No   Total store conditional instructions
PAPI_MEM_SCY 0x80000022  No    No   Cycles Stalled Waiting for memory accesses
PAPI_MEM_RCY 0x80000023  No    No   Cycles Stalled Waiting for memory Reads
PAPI_MEM_WCY 0x80000024  No    No   Cycles Stalled Waiting for memory writes
PAPI_STL_ICY 0x80000025  Yes   No   Cycles with no instruction issue
PAPI_FUL_ICY 0x80000026  No    No   Cycles with maximum instruction issue
PAPI_STL_CCY 0x80000027  No    No   Cycles with no instructions completed
PAPI_FUL_CCY 0x80000028  No    No   Cycles with maximum instructions completed
PAPI_HW_INT  0x80000029  No    No   Hardware interrupts
PAPI_BR_UCN  0x8000002a  Yes   Yes  Unconditional branch instructions
PAPI_BR_CN   0x8000002b  Yes   No   Conditional branch instructions
PAPI_BR_TKN  0x8000002c  Yes   Yes  Conditional branch instructions taken
PAPI_BR_NTK  0x8000002d  Yes   No   Conditional branch instructions not taken
PAPI_BR_MSP  0x8000002e  Yes   No   Conditional branch instructions mispredicted
PAPI_BR_PRC  0x8000002f  Yes   Yes  Conditional branch instructions correctly predicted
PAPI_FMA_INS 0x80000030  No    No   FMA instructions completed
PAPI_TOT_IIS 0x80000031  No    No   Instructions issued
PAPI_TOT_INS 0x80000032  Yes   No   Instructions completed
PAPI_INT_INS 0x80000033  No    No   Integer instructions
PAPI_FP_INS  0x80000034  Yes   Yes  Floating point instructions
PAPI_LD_INS  0x80000035  Yes   No   Load instructions
PAPI_SR_INS  0x80000036  Yes   No   Store instructions
PAPI_BR_INS  0x80000037  Yes   No   Branch instructions
PAPI_VEC_INS 0x80000038  No    No   Vector/SIMD instructions (could include integer)
PAPI_RES_STL 0x80000039  No    No   Cycles stalled on any resource
PAPI_FP_STAL 0x8000003a  No    No   Cycles the FP unit(s) are stalled
PAPI_TOT_CYC 0x8000003b  Yes   No   Total cycles
PAPI_LST_INS 0x8000003c  No    No   Load/store instructions completed
PAPI_SYC_INS 0x8000003d  No    No   Synchronization instructions completed
PAPI_L1_DCH  0x8000003e  No    No   Level 1 data cache hits
PAPI_L2_DCH  0x8000003f  Yes   Yes  Level 2 data cache hits
PAPI_L1_DCA  0x80000040  No    No   Level 1 data cache accesses
PAPI_L2_DCA  0x80000041  Yes   No   Level 2 data cache accesses
PAPI_L3_DCA  0x80000042  Yes   Yes  Level 3 data cache accesses
PAPI_L1_DCR  0x80000043  No    No   Level 1 data cache reads
PAPI_L2_DCR  0x80000044  Yes   No   Level 2 data cache reads
PAPI_L3_DCR  0x80000045  Yes   No   Level 3 data cache reads
PAPI_L1_DCW  0x80000046  No    No   Level 1 data cache writes
PAPI_L2_DCW  0x80000047  Yes   No   Level 2 data cache writes
PAPI_L3_DCW  0x80000048  Yes   No   Level 3 data cache writes
PAPI_L1_ICH  0x80000049  No    No   Level 1 instruction cache hits
PAPI_L2_ICH  0x8000004a  Yes   No   Level 2 instruction cache hits
PAPI_L3_ICH  0x8000004b  No    No   Level 3 instruction cache hits
PAPI_L1_ICA  0x8000004c  No    No   Level 1 instruction cache accesses
PAPI_L2_ICA  0x8000004d  Yes   No   Level 2 instruction cache accesses
PAPI_L3_ICA  0x8000004e  Yes   No   Level 3 instruction cache accesses
PAPI_L1_ICR  0x8000004f  No    No   Level 1 instruction cache reads
PAPI_L2_ICR  0x80000050  Yes   No   Level 2 instruction cache reads
PAPI_L3_ICR  0x80000051  Yes   No   Level 3 instruction cache reads
PAPI_L1_ICW  0x80000052  No    No   Level 1 instruction cache writes
PAPI_L2_ICW  0x80000053  No    No   Level 2 instruction cache writes
PAPI_L3_ICW  0x80000054  No    No   Level 3 instruction cache writes
PAPI_L1_TCH  0x80000055  No    No   Level 1 total cache hits
PAPI_L2_TCH  0x80000056  No    No   Level 2 total cache hits
PAPI_L3_TCH  0x80000057  No    No   Level 3 total cache hits
PAPI_L1_TCA  0x80000058  No    No   Level 1 total cache accesses
PAPI_L2_TCA  0x80000059  Yes   Yes  Level 2 total cache accesses
PAPI_L3_TCA  0x8000005a  Yes   No   Level 3 total cache accesses
PAPI_L1_TCR  0x8000005b  No    No   Level 1 total cache reads
PAPI_L2_TCR  0x8000005c  Yes   Yes  Level 2 total cache reads
PAPI_L3_TCR  0x8000005d  Yes   Yes  Level 3 total cache reads
PAPI_L1_TCW  0x8000005e  No    No   Level 1 total cache writes
PAPI_L2_TCW  0x8000005f  Yes   No   Level 2 total cache writes
PAPI_L3_TCW  0x80000060  Yes   No   Level 3 total cache writes
PAPI_FML_INS 0x80000061  No    No   Floating point multiply instructions
PAPI_FAD_INS 0x80000062  No    No   Floating point add instructions
PAPI_FDV_INS 0x80000063  Yes   No   Floating point divide instructions
PAPI_FSQ_INS 0x80000064  No    No   Floating point square root instructions
PAPI_FNV_INS 0x80000065  No    No   Floating point inverse instructions
PAPI_FP_OPS  0x80000066  Yes   Yes  Floating point operations
PAPI_SP_OPS  0x80000067  Yes   Yes  Floating point operations; optimized to count scaled single precision vector operations
PAPI_DP_OPS  0x80000068  Yes   Yes  Floating point operations; optimized to count scaled double precision vector operations
PAPI_VEC_SP  0x80000069  Yes   Yes  Single precision vector/SIMD instructions
PAPI_VEC_DP  0x8000006a  Yes   Yes  Double precision vector/SIMD instructions
PAPI_REF_CYC 0x8000006b  Yes   No   Reference clock cycles
-------------------------------------------------------------------------
Of 108 possible events, 50 are available, of which 17 are derived.

avail.c                                     PASSED
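
Note that the four presets which the self-built 5.4.3 tests rejected (PAPI_TOT_CYC, PAPI_FP_INS, PAPI_FP_OPS, PAPI_TOT_INS) are all marked as available here. To pull just the usable presets out of that table, something like the following might do. It is a sketch that assumes the column layout shown above (name in column 1, Avail flag in column 3); `papi_available` is a hypothetical helper name, not part of PAPI:

```shell
#!/bin/sh
# Print the names of the preset events marked "Yes" in the Avail column.
# Reads papi_avail output on stdin. Header lines are skipped because their
# first field does not start with "PAPI_".
papi_available() {
    awk '$1 ~ /^PAPI_/ && $3 == "Yes" { print $1 }'
}
```

Used as `papi_avail | papi_available`, this gives one event name per line.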

So now let's try to build TAU 2.25 with PAPI 5.3. TAU uses PDT, so let's build PDT 3.21 first. Seems to build fine.

Let's build TAU:

./configure -c++=icpc -cc=icc -fortran=intel -mpi -pdt=$HOME/pdtoolkit-3.21/ -papi=/cm/shared/libraries/intel_build/papi-5.3.0 -PROFILE -TRACE -slog2

make install

Seems fine. Let's validate TAU:

newblue4> cat parallel.sh 
#!/bin/bash
mpirun -np 4 ./simple
newblue4> setenv TAU_VALIDATE_PARALLEL `pwd`/parallel.sh ; ./tau_validate -v --html --table table.heml --timeout 180 x86_64 > & results.html

Here are the TAU validation results. No errors.

A simple demo that TAU + PAPI is working: for a small coarray program, looking at the PAPI_BR_NTK counter (conditional branches not taken):

TAU histogram of PAPI_BR_NTK

25-MAR-2016: Checking TAU tracing

Using this TAU conf file:

TAU_COMM_MATRIX=1
TAU_METRICS=PAPI_BR_CN,PAPI_BR_TKN,PAPI_BR_NTK,PAPI_BR_MSP,PAPI_BR_PRC  
TAU_TRACE=1
TAU_PROFILE=1
#TAU_CALLPATH=1
#TAU_CALLPATH_DEPTH=100
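
Before a long run it may be worth checking that every PAPI preset named in TAU_METRICS is marked available on the node. All five branch presets above are. A sketch, assuming the papi_avail output was saved to a file first; `missing_metrics` and the file name used below are hypothetical:

```shell
#!/bin/sh
# Print every PAPI_* metric from a comma-separated list that is NOT marked
# "Yes" in saved papi_avail output.
#   $1: metric list, e.g. the value of TAU_METRICS
#   $2: file holding the papi_avail output
# Entries that do not start with PAPI_ are skipped, since they are not
# PAPI presets and cannot be looked up in that table.
missing_metrics() {
    echo "$1" | tr ',' '\n' | while read -r m; do
        case "$m" in
            PAPI_*) ;;
            *) continue ;;
        esac
        # awk exits 0 only if the event is listed with Avail == "Yes".
        awk -v ev="$m" '$1 == ev && $3 == "Yes" { found = 1 }
                        END { exit !found }' "$2" || echo "$m"
    done
}
```

For example, after `papi_avail > avail.txt`, running `missing_metrics "$TAU_METRICS" avail.txt` should print nothing if all the requested presets are usable.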

Using this shell script to build an instrumented coarray executable. Note that TAU_OPTIONS selects compiler-based instrumentation (-optCompInst), not PDT:

#!/bin/sh

export TAU_OPTIONS="-optVerbose -optCompInst"
export TAU_MAKEFILE=$HOME/tau-2.25/x86_64/lib/Makefile.tau-icpc-papi-mpi-pdt-profile-trace

make clean
make all

Using this Makefile for shared coarray execution with Intel MPI:

TAU_MAKEFILE=   $(HOME)/tau-2.25/x86_64/lib/Makefile.tau-icpc-papi-mpi-pdt-profile-trace
include $(TAU_MAKEFILE)

RM=             /bin/rm

MOD_SRC=        m.f90
MOD_OBJ=        $(MOD_SRC:.f90=.o)
MOD_MOD=        $(MOD_SRC:.f90=.mod)
MOD_CLEAN=      $(MOD_OBJ) $(MOD_MOD)

PROG_SRC=       simple.f90
PROG_OBJ=       $(PROG_SRC:.f90=.o)
PROG_EXE=       $(PROG_SRC:.f90=.x)
PROG_CLEAN=     $(PROG_OBJ) $(PROG_EXE)

ALL_CLEAN=      $(MOD_CLEAN) $(PROG_CLEAN)

.SUFFIXES:
.SUFFIXES: .f90 .o .mod .x 

# Comment to disable TAU 
USE_TAU=1

F90=            tau_f90.sh # $(TAU_F90)
FFLAGS=         -coarray=shared -warn $(TAU_INCLUDE) $(TAU_MPI_INCLUDE) $(TAU_F90_SUFFIX)

LINKER=         $(TAU_F90)
#LINKER=                $(TAU_LINKER)
LDFLAGS=        -coarray=shared $(USER_OPT) $(TAU_LDFLAGS)

LIBS=           $(TAU_MPI_FLIBS) $(TAU_LIBS) $(TAU_CXXLIBS) 

PDTF90PARSE=    $(PDTDIR)/$(PDTARCHDIR)/bin/f95parse
TAUINSTR=       $(TAUROOTDIR)/$(CONFIG_ARCH)/bin/tau_instrumentor
CFLAGS=         $(TAU_INCLUDE) $(TAU_DEFS) $(TAU_MPI_INCLUDE)

all:    $(MOD_MOD) $(PROG_EXE)

.f90.o:
        $(F90) $(FFLAGS) -c $<

.f90.mod:
        $(F90) $(FFLAGS) -c $<

$(PROG_EXE): $(MOD_OBJ) $(PROG_OBJ)
        $(LINKER) $(LDFLAGS) $(MOD_OBJ) $(PROG_OBJ) -o $@ $(LIBS)

clean:
        $(RM) $(ALL_CLEAN)

Run the instrumented program, as a normal Intel shared memory coarray program:

./simple.x

This generates events and traces:

newblue2> ls events* tautrace*
events.0.edf   events.1.edf  events.8.edf         tautrace.13.0.0.trc  tautrace.6.0.0.trc
events.10.edf  events.2.edf  events.9.edf         tautrace.14.0.0.trc  tautrace.7.0.0.trc
events.11.edf  events.3.edf  tautrace.0.0.0.trc   tautrace.15.0.0.trc  tautrace.8.0.0.trc
events.12.edf  events.4.edf  tautrace.10.0.0.trc  tautrace.2.0.0.trc   tautrace.9.0.0.trc
events.13.edf  events.5.edf  tautrace.1.0.0.trc   tautrace.3.0.0.trc
events.14.edf  events.6.edf  tautrace.11.0.0.trc  tautrace.4.0.0.trc
events.15.edf  events.7.edf  tautrace.12.0.0.trc  tautrace.5.0.0.trc
newblue2> 
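
The listing shows one events.N.edf and one tautrace.N.0.0.trc per image, 16 of each. A quick way to confirm that no rank is missing its trace file before merging; this is only a sketch based on the file names seen above, and `unpaired_ranks` is a made-up helper name:

```shell
#!/bin/sh
# Print the rank of every events.N.edf file that has no matching
# tautrace.N.0.0.trc file in the given directory (default: current dir).
unpaired_ranks() {
    dir=${1:-.}
    for edf in "$dir"/events.*.edf; do
        [ -e "$edf" ] || continue      # glob matched nothing
        rank=${edf##*/events.}         # strip directory and "events."
        rank=${rank%.edf}              # strip ".edf"
        [ -e "$dir/tautrace.$rank.0.0.trc" ] || echo "$rank"
    done
}
```

Running `unpaired_ranks` in the run directory should print nothing when every event file has its trace.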

Need to prepare the tracing output for Jumpshot-4, following the instructions in the TAU User Guide, Sec. 4.3:

newblue2> tau_treemerge.pl
/panfs/panasas01/mech/mexas/tau-2.25/x86_64/bin/tau_merge -m tau.edf -e events.0.edf events.1.edf events.10.edf events.11.edf events.12.edf events.13.edf events.14.edf events.15.edf events.2.edf events.3.edf events.4.edf events.5.edf events.6.edf events.7.edf events.8.edf events.9.edf tautrace.0.0.0.trc tautrace.1.0.0.trc tautrace.10.0.0.trc tautrace.11.0.0.trc tautrace.12.0.0.trc tautrace.13.0.0.trc tautrace.14.0.0.trc tautrace.15.0.0.trc tautrace.2.0.0.trc tautrace.3.0.0.trc tautrace.4.0.0.trc tautrace.5.0.0.trc tautrace.6.0.0.trc tautrace.7.0.0.trc tautrace.8.0.0.trc tautrace.9.0.0.trc tau.trc
tau.trc exists; override [y]? y
tautrace.0.0.0.trc: 24 records read.
tautrace.1.0.0.trc: 975574 records read.
tautrace.10.0.0.trc: 975124 records read.
tautrace.11.0.0.trc: 975124 records read.
tautrace.12.0.0.trc: 975124 records read.
tautrace.13.0.0.trc: 975124 records read.
tautrace.14.0.0.trc: 975124 records read.
tautrace.15.0.0.trc: 975392 records read.
tautrace.2.0.0.trc: 975214 records read.
tautrace.3.0.0.trc: 975124 records read.
tautrace.4.0.0.trc: 975124 records read.
tautrace.5.0.0.trc: 975124 records read.
tautrace.6.0.0.trc: 975214 records read.
tautrace.7.0.0.trc: 975124 records read.
tautrace.8.0.0.trc: 975124 records read.
tautrace.9.0.0.trc: 975124 records read.
newblue2> 
newblue2> tau2slog2 tau.trc tau.edf -o tau.slog2
14627782 records initialized.  Processing.
1521 enters: 0 exits: 0
292555 Records read. 1% converted
585110 Records read. 3% converted
877665 Records read. 5% converted
1170220 Records read. 7% converted
1462775 Records read. 9% converted
1755330 Records read. 11% converted
2047885 Records read. 13% converted
2340440 Records read. 15% converted
2632995 Records read. 17% converted
2925550 Records read. 19% converted
3218105 Records read. 21% converted
3510660 Records read. 23% converted
3803215 Records read. 25% converted
4095770 Records read. 27% converted
4388325 Records read. 29% converted
4680880 Records read. 31% converted
4973435 Records read. 33% converted
5265990 Records read. 35% converted
5558545 Records read. 37% converted
5851100 Records read. 39% converted
6143655 Records read. 41% converted
6436210 Records read. 43% converted
6728765 Records read. 45% converted
7021320 Records read. 47% converted
7313875 Records read. 49% converted
7606430 Records read. 51% converted
7898985 Records read. 53% converted
8191540 Records read. 55% converted
8484095 Records read. 57% converted
8776650 Records read. 59% converted
9069205 Records read. 61% converted
9361760 Records read. 63% converted
9654315 Records read. 65% converted
9946870 Records read. 67% converted
10239425 Records read. 69% converted
10531980 Records read. 71% converted
10824535 Records read. 73% converted
11117090 Records read. 75% converted
11409645 Records read. 77% converted
11702200 Records read. 79% converted
11994755 Records read. 81% converted
12287310 Records read. 83% converted
12579865 Records read. 85% converted
12872420 Records read. 87% converted
13164975 Records read. 89% converted
13457530 Records read. 91% converted
13750085 Records read. 93% converted
1521 enters: 0 exits: 0
14042640 Records read. 95% converted
1521 enters: 0 exits: 0
1521 enters: 0 exits: 0
14335195 Records read. 97% converted
1521 enters: 0 exits: 0
1521 enters: 0 exits: 0
1521 enters: 0 exits: 0
1521 enters: 0 exits: 0
1521 enters: 0 exits: 0
1521 enters: 0 exits: 0
1521 enters: 0 exits: 0
1521 enters: 0 exits: 0
1521 enters: 0 exits: 0
1521 enters: 0 exits: 0
1521 enters: 0 exits: 0
14627750 Records read. 99% converted
1521 enters: 0 exits: 0
Reached end of trace file.
         SLOG-2 Header:
version = SLOG 2.0.6
NumOfChildrenPerNode = 2
TreeLeafByteSize = 65536
MaxTreeDepth = 10
MaxBufferByteSize = 609542
Categories  is FBinfo(2111 @ 50851451)
MethodDefs  is FBinfo(0 @ 0)
LineIDMaps  is FBinfo(164 @ 50853562)
TreeRoot    is FBinfo(380932 @ 50470519)
TreeDir     is FBinfo(42368 @ 50853726)
Annotations is FBinfo(0 @ 0)
Postamble   is FBinfo(0 @ 0)

1521 enters: 0 exits: 0


Number of Drawables = 1462957
timeElapsed between 1 & 2 = 36 msec
timeElapsed between 2 & 3 = 8337 msec
newblue2> 
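
A consistency check on the two logs above: the per-rank counts printed by tau_merge should add up to the number of records that tau2slog2 reports as initialized. For this run they do, 14627782 in both. A sketch with awk, assuming the merge output was captured to a file; `merged_record_total` and `merge.log` are made-up names:

```shell
#!/bin/sh
# Sum the per-file record counts from tau_merge output read on stdin.
# Only lines of the form "<file>: <count> records read." are counted; the
# tau2slog2 progress lines ("... Records read. N% converted") do not match.
merged_record_total() {
    awk '/records read\.$/ { total += $2 } END { printf "%d\n", total }'
}
```

With both output streams of tau_treemerge.pl saved to merge.log, `merged_record_total < merge.log` should print the same number as the "records initialized" line of tau2slog2.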

Then launch Jumpshot as

jumpshot tau.slog2

Jumpshot seems to hang a lot; perhaps my Java or my graphics card is too old. Also, I cannot see any communications shown, and I cannot figure out how to save images. So just an xwd screen dump for now:

Jumpshot4 timeline
