Cube GUI User Guide
(CubeGUI 4.8, revision ecc06d5d)
Introduction in Cube GUI and its usage
We focus on three metrics. The first, the L1 compute to data access ratio, measures the computational density, i.e. the average number of operations performed for each piece of loaded data. It can be used to judge how suitable an application is for the KNL architecture. Ideally, operations should be vectorized and each datum fetched from the L1 cache should be used for multiple operations.
Similarly, the L2 compute to data access ratio is calculated as the number of vector operations relative to the number of loads that initially miss the L1 cache. While the L1 metric is critical in estimating a code's general suitability, the L2 metric indicates whether the code is operating efficiently.
The thresholds for these metrics mark the limits at which an investigation into a code section's vectorization would be useful. These limits are based on Intel® recommendations for the KNL architecture. While they hold true for most applications running on KNL, they are only guidelines and should be applied with care.
An additional metric, the VPU intensity, offers a rule of thumb for how well a loop is vectorized: it is the proportion of vectorized operations among all arithmetic operations. This metric should be applied only to small pieces of code. Note that certain non-arithmetic operations, such as mask manipulation instructions, are counted as vector operations, which can skew the ratio.
The metrics are defined as ratios of hardware counters provided by the KNL architecture:
L1 compute to data access ratio = UOPS_RETIRED.PACKED_SIMD / MEM_UOPS_RETIRED.ALL_LOADS
L2 compute to data access ratio = UOPS_RETIRED.PACKED_SIMD / MEM_UOPS_RETIRED.L1_MISS_LOADS
VPU intensity = UOPS_RETIRED.PACKED_SIMD / (UOPS_RETIRED.PACKED_SIMD + UOPS_RETIRED.SCALAR_SIMD)
These counters can be accessed in Score-P through the PAPI metrics interface and can be measured at the call-path level on each thread. Calculating all derived metrics requires recording multiple native hardware counters. Since the KNL architecture provides only two general purpose counters per thread, several measurement runs are needed to obtain the full set of required counters.
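The three ratios above, and the need to spread the counters over several runs, can be illustrated with a minimal Python sketch. It is not part of Cube or Score-P: the function names `derive_metrics` and `plan_runs` are hypothetical, the counter values in the demo are made up for illustration, and the sketch assumes that the counts for one call path and thread have already been collected into a plain dictionary.

```python
# Counter names as listed in the text.
PACKED = "UOPS_RETIRED.PACKED_SIMD"
SCALAR = "UOPS_RETIRED.SCALAR_SIMD"
ALL_LOADS = "MEM_UOPS_RETIRED.ALL_LOADS"
L1_MISS = "MEM_UOPS_RETIRED.L1_MISS_LOADS"


def ratio(numerator, denominator):
    """Return numerator / denominator, or None if the denominator is zero."""
    return numerator / denominator if denominator else None


def derive_metrics(counters):
    """Compute the three derived metrics for one call path on one thread.

    `counters` maps counter names to raw counts (hypothetical input format).
    """
    packed = counters[PACKED]
    return {
        "l1_compute_to_data_access": ratio(packed, counters[ALL_LOADS]),
        "l2_compute_to_data_access": ratio(packed, counters[L1_MISS]),
        "vpu_intensity": ratio(packed, packed + counters[SCALAR]),
    }


def plan_runs(counter_names, per_run=2):
    """Split the counters into groups of at most `per_run` per measurement.

    The default of 2 reflects the two general purpose counters per thread
    mentioned in the text.
    """
    names = list(counter_names)
    return [names[i:i + per_run] for i in range(0, len(names), per_run)]


if __name__ == "__main__":
    # Illustrative, made-up counts; not measured data.
    sample = {PACKED: 800, SCALAR: 200, ALL_LOADS: 400, L1_MISS: 20}
    print(derive_metrics(sample))
    print(plan_runs([PACKED, SCALAR, ALL_LOADS, L1_MISS]))
```

With four distinct counters and two counters per run, the sketch yields two measurement groups. Combining counts from separate runs implicitly assumes that the application behaves similarly in each run, so the resulting ratios are best read as approximations.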
Copyright © 1998–2022 Forschungszentrum Jülich GmbH,
Jülich Supercomputing Centre
Copyright © 2009–2015 German Research School for Simulation Sciences GmbH, Laboratory for Parallel Programming