

# INTEL® ADVISOR

Part of Intel® Parallel Studio XE

# VECTOR SIMD PARALLELISM, VECTORIZATION



## **VECTORIZATION OF CODE**

Transform sequential code to exploit vector processing capabilities








# MODERNIZE YOUR CODE WITH INTEL® ADVISOR: OPTIMIZE VECTORIZATION, PROTOTYPE THREADING, CREATE & ANALYZE FLOW GRAPHS

The Difference Is Growing with Each New Hardware Generation



#### Modern Performant Code

- Vectorized (uses Intel® AVX-512/AVX2)
- Efficient memory access
- Threaded

#### Capabilities

- Adds & optimizes vectorization
- Analyzes memory patterns
- Quickly prototypes threading

Benchmark: Binomial Options Pricing Model https://software.intel.com/en-us/articles/binomial-options-pricing-model-code-for-intel-xeon-phi-coprocessor

Performance results are based on testing as of August 2017 and may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks. See "Vectorize & Thread or Performance Dies" Configurations for 2010-2017 Benchmarks in Backup. Testing by Intel as of August 2017.

Learn More: http://intel.ly/advisor-xe

## PERMISSION TO DESIGN FOR ALL LANES

#### Threading *and* vectorization are needed to fully utilize modern hardware





## INTEL® ADVISOR: VECTORIZATION OPTIMIZATION

#### Have you:

- Recompiled for AVX2 with little gain?
- Wondered where to vectorize?
- Recoded intrinsics for a new architecture?
- Struggled with compiler reports?

#### Data Driven Vectorization:

- What vectorization will pay off most?
- What's blocking vectorization? Why?
- Are my loops vector friendly?
- Will reorganizing data increase performance?
- Is it safe to just use pragma simd?



#### THE LAB ACTIVITIES

- Activity 0: Building Stencil
- Activity 1: Doing Survey
- Activity 2: Dealing with data type conversions
- Activity 3: Checking for dependencies
- Activity 4: Adding threading and trying to enable vectorization
- Activity 5: Checking Memory Access Patterns
- Activity 6: Making unit stride explicit
- Activity 7: Doing Roofline analysis
- Activity 8: Splitting task to tiles
- Activity 9: Enabling AVX512
- Activity 10: Comparing roofline charts





#### STENCIL CODE EXAMPLE

- Consider solving a differential equation with the finite-difference method on a 3-dimensional grid
- Example: calculating the Laplace operator of some field

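```
#include <stdint.h>
#include <stdlib.h>

/* Input field X and output field Y on a DIM x DIM x DIM grid */
uint64_t size = DIM * DIM * DIM * sizeof(float);
float * X = (float*) malloc(size);
float * Y = (float*) malloc(size);

/* Distance in elements between neighbours along each axis */
int iStride = 1;
int jStride = DIM;
int kStride = DIM * DIM;

/* 7-point stencil over the interior points */
for (int k = 1; k < DIM - 1; k++)
  for (int j = 1; j < DIM - 1; j++)
    for (int i = 1; i < DIM - 1; i++) {
      int ijk = i * iStride + j * jStride + k * kStride;
      Y[ijk] = -6.0 * X[ijk] +
               X[ijk - iStride] + X[ijk + iStride] +
               X[ijk - jStride] + X[ijk + jStride] +
               X[ijk - kStride] + X[ijk + kStride];
    }
```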

## **ACTIVITY 0: BUILDING STENCIL**



## **BUILD & RUN**

#### Purpose: Build an application, observe the performance

Launch Terminal:
 Right click -> Open Terminal





#### **BUILD & RUN**

- Setup environment:
  - \$ source /opt/intel/parallel\_studio\_xe\_2019/psxevars.sh intel64
- Go to working directory
  - \$ cd lab2
- Build application
  - \$ make -C ver0
- Run application
  - \$ ./stencil



### **ACTIVITY 0. SCREENSHOT**

```
[day1@clx-3 ~]$ source /opt/intel/parallel_studio_xe_2019/bin/psxevars.sh intel64
Intel(R) Parallel Studio XE 2019 Update 3 for Linux*
Copyright (C) 2009-2019 Intel Corporation. All rights reserved.
[day1@clx-3 ~]$
[day1@clx-3 ~]$ cd lab2
[day1@clx-3 lab4]$ make -C ver0
make: Entering directory `/home/day1/lab4/ver0'
icc -Ofast -qopenmp -no-ipo -fno-inline-functions -g -qopt-report=5 -c main.c -o main.o
icc: remark #10397: optimization reports are generated in *.optrpt files in the output location
icc -Ofast -qopenmp -no-ipo -fno-inline-functions -g -qopt-report=5 -c bench_stencil.c -o bench_stencil.o
icc: remark #10397: optimization reports are generated in *.optrpt files in the output location
icc -Ofast -qopenmp -no-ipo -fno-inline-functions -g -qopt-report=5 main.o bench_stencil.o -o stencil
icc: remark #10397: optimization reports are generated in *.optrpt files in the output location
mkdir -p ...
mv stencil ../stencil
make: Leaving directory `/home/day1/lab4/ver0'
[day1@clx-3 lab4]$
[day1@clx-3 lab4]$ ./stencil
              Naive: Dim= 512, nIterations= 10, Time= 0.000s, Useful GB/s= inf
```



## **ACTIVITY 1: DOING SURVEY**



## **LAUNCH ADVISOR**

#### Purpose: Run Survey analysis in Advisor to get the baseline version

- Open new terminal tab
  - File -> New Tab
- Setup environment:
  - \$ source ./advixe\_vars.sh
- Launch Advisor GUI:
  - \$ advixe-gui



## **CREATE ADVISOR PROJECT**



#### **SET UP PROJECT**

- Set the application to launch: /home/day1/lab2/stencil
- Press OK button





### **START SURVEY ANALYSIS**

Press "Collect" button in "1. Survey Target" section





## **ACTIVITY 1. SCREENSHOT**





## **CREATE A SNAPSHOT**





# ACTIVITY 2: DEALING WITH DATA TYPE CONVERSIONS



## **LOOK AT THE RECOMMENDATIONS**





## **ACTIVITY 2**

#### Purpose: Identify and fix data type conversion issue

- Build version without data type conversions
   \$ make -C ver1
- Re-run Survey analysis
- Create a snapshot
- Compare with previous version
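
A common source of such conversions in C is a double-precision literal in single-precision arithmetic: in an expression like `-6.0 * X[ijk]`, the `float` operand is promoted to `double` and the result is converted back to `float` on the store. The sketch below contrasts the two forms. It assumes the fix is of this kind (the slides do not show the ver1 source), and the function names are hypothetical.

```c
#include <stddef.h>

/* Mixed precision: 6.0 is a double literal, so by the language rules x[i]
   is promoted to double and the product is converted back to float. */
void scale_mixed(float *y, const float *x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = -6.0 * x[i];
}

/* Single precision throughout: the f suffix keeps the literal a float,
   so the loop body needs no float <-> double conversions. */
void scale_float(float *y, const float *x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = -6.0f * x[i];
}
```

Both functions produce the same values here; the difference is in the instructions the compiler may have to generate, which is what the Survey recommendations point at.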



## **ACTIVITY 2. VERSION COMPARISON**

#### 1.414x ↑







## **ACTIVITY 3: CHECKING FOR DEPENDENCIES**



### **ACTIVITY 3. COLLECT DATA TO GET DEPENDENCIES**

#### Purpose: Find loop-carried dependencies

- Select [loop in bench\_stencil at bench\_stencil.c:21]
- Press "Collect" button in "2.2 Check Dependencies" section
- Wait ~1 minute
- Create a snapshot





## **ACTIVITY 3. SNAPSHOT**






#### Assumed dependency present

The compiler assumed there is an anti-dependency (Write after read - WAR) or a true dependency (Read after write - RAW) in the loop. Improve performance by investigating the assumption and handling accordingly.

#### Enable vectorization

The Dependencies analysis shows there is no real dependency in the loop for the given workload. Tell the compiler it is safe to vectorize using the restrict keyword or a directive:

Example:

```
#pragma ivdep
...
```
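
As a fuller illustration of the two options named in the recommendation, the sketch below applies both the `restrict` keyword and the directive to a small loop. The function `add_neighbors` is hypothetical and is not taken from the lab sources; the directive is guarded so that compilers other than the Intel compiler simply skip it.

```c
#include <stddef.h>

/* restrict promises the compiler that dst and src never overlap,
   so it does not have to assume a dependency between the two arrays. */
void add_neighbors(float *restrict dst, const float *restrict src, size_t n)
{
    /* Intel compiler directive: ignore assumed (unproven) dependencies. */
#if defined(__INTEL_COMPILER)
#pragma ivdep
#endif
    for (size_t i = 1; i + 1 < n; i++)
        dst[i] = src[i - 1] + src[i + 1];
}
```

Either hint is only safe when the promise is true: if the arrays can overlap, the vectorized loop may produce wrong results, which is why the Dependencies analysis is run first.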



# ACTIVITY 4: ADDING THREADING AND TRYING TO ENABLE VECTORIZATION



## **ACTIVITY 4**

#### Purpose: Add threading and try to enable vectorization

- Build a version with threading and vectorization
   \$ make -C ver2
- Re-run Survey analysis
- Create a snapshot
- Compare with previous version
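
The build flags (`-qopenmp`) and the loop names Advisor reports indicate that threading is done with OpenMP. A minimal sketch of what such a change could look like on the stencil is shown below. It is an assumption, not the ver2 source: the function name `laplace_omp` and the choice of which loops carry which pragma are illustrative.

```c
/* Threads split the outer k loop; the innermost i loop is marked for SIMD.
   Without OpenMP support the pragmas are ignored and the code runs serially. */
void laplace_omp(float *restrict y, const float *restrict x, int dim)
{
    const int iStride = 1;
    const int jStride = dim;
    const int kStride = dim * dim;

    #pragma omp parallel for
    for (int k = 1; k < dim - 1; k++) {
        for (int j = 1; j < dim - 1; j++) {
            #pragma omp simd
            for (int i = 1; i < dim - 1; i++) {
                int ijk = i * iStride + j * jStride + k * kStride;
                y[ijk] = -6.0f * x[ijk]
                       + x[ijk - iStride] + x[ijk + iStride]
                       + x[ijk - jStride] + x[ijk + jStride]
                       + x[ijk - kStride] + x[ijk + kStride];
            }
        }
    }
}
```

Parallelizing over `k` gives each thread whole planes of the grid, so threads never write to the same element of `y`.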



### **ACTIVITY 4. VERSION COMPARISON**

#### 1.536x ↑





# ACTIVITY 5: CHECKING MEMORY ACCESS PATTERNS



#### **TYPES OF MEMORY ACCESS PATTERNS**

#### Unit-Stride access

```
for (i = 0; i < N; i++)
    A[i] = C[i] * D[i];
```

#### Constant stride access

#### Variable stride access
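
The three patterns can be sketched side by side as follows. The function names and the factor `2.0f` are illustrative only; the unit-stride loop mirrors the example above.

```c
#include <stddef.h>

/* Unit stride: consecutive iterations touch consecutive elements. */
void unit_stride(float *a, const float *c, const float *d, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = c[i] * d[i];
}

/* Constant stride: the step between accesses is fixed,
   but larger than one element. */
void constant_stride(float *a, const float *c, size_t n, size_t stride)
{
    for (size_t i = 0; i < n; i++)
        a[i * stride] = c[i * stride] * 2.0f;
}

/* Variable stride: the address depends on run-time data (an index array
   here), so vector code typically needs gather-style loads. */
void variable_stride(float *a, const float *c, const size_t *index, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = c[index[i]] * 2.0f;
}
```

Unit stride is the most vector-friendly case because one vector load or store covers a contiguous run of elements.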





## **ACTIVITY 5**

#### Purpose: Check memory access patterns

Select [loop in bench\_stencil\$omp\$parallel\_for@23 at bench\_stencil.c:26]



Press "Collect" button in "2.1 Check Memory Access Patterns" section



Wait ~1 minute



## **ACTIVITY 5. SCREENSHOTS**



```
LOOP BEGIN at bench_stencil.c(26,9)

remark #25084: Preprocess Loopnests: Moving Out Store [bench_stencil.c(26,34)]

remark #15335: loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override

remark #15329: vectorization support: non-unit strided store was emulated for the variable <new>,

stride is unknown to compiler [bench stencil.c(29,11)]

remark #15328: vectorization support: non-unit strided load was emulated for the variable <old>,

stride is unknown to compiler [bench stencil.c(29,30)]
```



## **ACTIVITY 6: MAKING UNIT STRIDE EXPLICIT**



#### Purpose: Make the unit stride explicit to improve the memory access pattern

- Build a version with explicit unit stride
  - \$ make -C ver3
- Re-run Survey analysis
- Create a snapshot
- Compare with previous version
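
The compiler report for this loop says the stride "is unknown to compiler". One way to remove that uncertainty is to write the innermost index with a literal stride of 1 instead of a run-time value. The sketch below shows the idea on a single row; it assumes ver3 does something along these lines, and both function names are hypothetical.

```c
/* Innermost stride passed in at run time:
   the compiler cannot prove that iStride == 1. */
void row_unknown_stride(float *restrict y, const float *restrict x,
                        int n, int iStride)
{
    for (int i = 1; i < n - 1; i++)
        y[i * iStride] = x[(i - 1) * iStride] + x[(i + 1) * iStride];
}

/* Same update with the unit stride written explicitly:
   neighbouring iterations visibly touch neighbouring elements. */
void row_unit_stride(float *restrict y, const float *restrict x, int n)
{
    for (int i = 1; i < n - 1; i++)
        y[i] = x[i - 1] + x[i + 1];
}
```

With `iStride == 1` the two functions compute the same result, but only the second lets the compiler emit plain contiguous vector loads and stores without a run-time check.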



## **ACTIVITY 6. VERSION COMPARISON**

#### 1.592x ↑





# ACTIVITY 7: DOING ROOFLINE ANALYSIS



# **ACTIVITY 7. COLLECT DATA TO GET ROOFLINE CHART**

#### Purpose: Characterize the application using the roofline model

- Select "With Callstacks" and "For all memory levels"
- Press "Collect" button in "Run Roofline" section
- Wait ~4 minutes
- Create a snapshot




## **ROOFLINE MODEL**

A roofline model helps you answer these questions:

- Does my application work optimally on the current hardware? If not, what is the most underutilized hardware resource?
- What limits performance? Is my application workload memory or compute bound?
- What is the right strategy to improve application performance?
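
The memory-bound versus compute-bound question can be made concrete with the basic roofline formula: attainable performance is the smaller of the compute peak and arithmetic intensity times memory bandwidth. The sketch below is a generic illustration of that formula, not Advisor's implementation; the function and parameter names are hypothetical.

```c
/* Attainable performance under the basic roofline model:
   min(compute peak, arithmetic intensity * memory bandwidth).
   peak_gflops:   compute roof in GFLOPS
   bandwidth_gbs: memory roof in GB/s
   ai:            arithmetic intensity in FLOP per byte */
double roofline_attainable(double peak_gflops, double bandwidth_gbs, double ai)
{
    double memory_roof = ai * bandwidth_gbs;
    return memory_roof < peak_gflops ? memory_roof : peak_gflops;
}

/* A kernel is memory bound when its intensity lies left of the ridge
   point peak / bandwidth. Returns 1 if memory bound, 0 otherwise. */
int is_memory_bound(double peak_gflops, double bandwidth_gbs, double ai)
{
    return ai < peak_gflops / bandwidth_gbs;
}
```

For example, with an illustrative peak of 100 GFLOPS and 20 GB/s of bandwidth, a kernel at 0.5 FLOP/byte is capped at 10 GFLOPS by memory, so raising its arithmetic intensity or reducing memory traffic pays off more than adding compute throughput.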





# **ACTIVITY 7. SCREENSHOT**





**ACTIVITY 7. ROOFLINE GUIDANCE** 





# **ACTIVITY 8: SPLITTING TASK TO TILES**



#### Purpose: Split the task into tiles to reduce the cache working set

- Build a version that splits the task into tiles
  - \$ make -C ver4
- Re-run Roofline analysis
- Create a snapshot
- Compare with previous version
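
A minimal sketch of tiling applied to the stencil is shown below. It is an assumption about the general technique rather than the ver4 source: the function name `laplace_tiled`, the tile sizes, and the choice to tile the j and k loops while leaving the i loop whole are all illustrative.

```c
/* Hypothetical tile sizes; in practice they are tuned to the cache sizes. */
enum { TILE_J = 8, TILE_K = 8 };

static int min_int(int a, int b) { return a < b ? a : b; }

void laplace_tiled(float *restrict y, const float *restrict x, int dim)
{
    const int jStride = dim, kStride = dim * dim;

    /* Walk the j-k plane tile by tile so that rows shared between
       neighbouring points are reused while they are still in cache.
       The i loop stays whole and unit-stride for vectorization. */
    for (int kk = 1; kk < dim - 1; kk += TILE_K)
        for (int jj = 1; jj < dim - 1; jj += TILE_J)
            for (int k = kk; k < min_int(kk + TILE_K, dim - 1); k++)
                for (int j = jj; j < min_int(jj + TILE_J, dim - 1); j++)
                    for (int i = 1; i < dim - 1; i++) {
                        int ijk = i + j * jStride + k * kStride;
                        y[ijk] = -6.0f * x[ijk]
                               + x[ijk - 1]       + x[ijk + 1]
                               + x[ijk - jStride] + x[ijk + jStride]
                               + x[ijk - kStride] + x[ijk + kStride];
                    }
}
```

The result is identical to the untiled loop nest; only the order in which grid points are visited changes.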



## **ACTIVITY 8. SCREENSHOT**





# **ACTIVITY 8. VERSION COMPARISON**

#### L2 bandwidth: 1.364x ↑

Before tiling: Data Transfers and Bandwidth

|          | Per Loop | Per Instance | Per Iteration | Float AI |
|----------|----------|--------------|---------------|----------|
| L1, Gb   | 42.95    | 4.10e-06     | 2.41e-07      | 0.21875  |
| L2, Gb   | 26.33    | 2.51e-06     | 1.48e-07      | 0.356847 |
| L3, Gb   | 24.37    | 2.32e-06     | 1.37e-07      | 0.385497 |
| DRAM, Gb | 20.11    | 1.92e-06     | 1.13e-07      | 0.467254 |

| Self bandwidth by memory levels | Gb/s    |
|---------------------------------|---------|
| L1                              | 66.0738 |
| L2                              | 40.5038 |
| L3                              | 37.4936 |
| DRAM                            | 30.9332 |

After tiling: Data Transfers and Bandwidth

|          | Per Loop | Per Instance | Per Iteration | Float AI |
|----------|----------|--------------|---------------|----------|
| L1, Gb   | 42.95    | 4.10e-06     | 2.41e-07      | 0.21875  |
| L2, Gb   | 28.18    | 2.69e-06     | 1.58e-07      | 0.333416 |
| L3, Gb   | 18.09    | 1.73e-06     | 1.02e-07      | 0.51924  |
| DRAM, Gb | 17.63    | 1.68e-06     | 9.89e-08      | 0.533059 |

| Self bandwidth by memory levels | Gb/s    |
|---------------------------------|---------|
| L1                              | 84.216  |
| L2                              | 55.2531 |
| L3                              | 35.4793 |
| DRAM                            | 34.5595 |



## **ACTIVITY 8. VERSION COMPARISON**

#### 1.185x ↑

Before tiling: Number of CPU Threads 4

| Effective OP/S and Bandwidth | Value  | Utilization | Hardware Peak                  |
|------------------------------|--------|-------------|--------------------------------|
| GFLOPS                       | 4.450  | 4.35%       | 102.218 (DP) FLOPS             |
|                              |        | 2.41%       | 184.999 (SP) FLOPS             |
| GINTOPS                      | 0.267  | 0.42%       | 64.232 (Int64) INTOPS          |
|                              |        | 0.21%       | 129.620 (Int32) INTOPS         |
| CPU <-> Memory [L1+NTS GB/s] | 21.012 | 3.47%       | 606.037 GB/s [bytes]           |
| L2 Bandwidth [GB/s]          | 12.822 | 5.43%       | 236.138 GB/s [cacheline bytes] |
| L3 Bandwidth [GB/s]          | 11.869 | 26.77%      | 44.330 GB/s [cacheline bytes]  |
| DRAM Bandwidth [GB/s]        | 9.791  | 45.82%      | 21.368 GB/s [cacheline bytes]  |

After tiling: Elapsed Time 1.78s, Vector Instruction Set SSE, Number of CPU Threads 4, GFLOPS 5.27, GINTOPS 0.32

| Effective OP/S and Bandwidth | Value  | Utilization | Hardware Peak                  |
|------------------------------|--------|-------------|--------------------------------|
| GFLOPS                       | 5.270  | 4.46%       | 118.063 (DP) FLOPS             |
|                              |        | 2.19%       | 240.529 (SP) FLOPS             |
| GINTOPS                      | 0.317  | 0.43%       | 74.470 (Int64) INTOPS          |
|                              |        | 0.22%       | 142.379 (Int32) INTOPS         |
| CPU <-> Memory [L1+NTS GB/s] | 24.863 | 5.32%       | 467.375 GB/s [bytes]           |
| L2 Bandwidth [GB/s]          | 16.002 | 6.29%       | 254.313 GB/s [cacheline bytes] |
| L3 Bandwidth [GB/s]          | 10.234 | 22.96%      | 44.580 GB/s [cacheline bytes]  |
| DRAM Bandwidth [GB/s]        | 9.968  | 45.70%      | 21.810 GB/s [cacheline bytes]  |



# **ACTIVITY 9: ENABLING AVX512**



#### Purpose: Set compilation options to use the highest available ISA

- Build a version with new compilation flags
  - \$ make -C ver5
- Re-run Survey analysis
- Create a snapshot
- Compare with previous version
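
The slides do not show which flags ver5 uses. With the Intel compiler, options such as `-xHost` or `-xCORE-AVX512` select the target instruction set (GCC and Clang have `-march=native` and `-mavx512f`). One way to confirm what a given build was allowed to target is to inspect the compiler's predefined macros, as in the hypothetical helper below.

```c
/* Report the widest vector ISA the compiler was allowed to target,
   based on the predefined macros set by the compilation flags. */
const char *compiled_vector_isa(void)
{
#if defined(__AVX512F__)
    return "AVX512";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__)
    return "SSE2";
#else
    return "scalar or unknown";
#endif
}
```

This reports what the compiler was permitted to generate, not what a particular loop actually uses; the Survey report's "Vector Instruction Set" column remains the place to check individual loops.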



## **ACTIVITY 9. VERSION COMPARISON**

#### 1.059x ↑

| Version          | Elapsed Time | Vector Instruction Set | Number of CPU Threads |
|------------------|--------------|------------------------|-----------------------|
| Previous version | 1.78s        | SSE                    | 4                     |
| New version      | 1.68s        | AVX512                 | 4                     |

Previous version: GFLOPS 5.27, GINTOPS 0.32



# ACTIVITY 10: COMPARING ROOFLINE CHARTS



#### Purpose: See the performance difference between the non-optimized and optimized versions

- Run Roofline analysis without additional options for ver0 and ver5
- Compare profiles



# **ACTIVITY 10. ROOFLINE COMPARISON**

- Hotspot elapsed time speedup: ~14x ↑
- Program elapsed time speedup: ~5x ↑
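
The gap between the hotspot speedup and the program speedup is what Amdahl's law predicts when only part of the run time is accelerated. The sketch below works through that arithmetic under the simplifying assumption that everything outside the hotspot is unchanged; the function names are hypothetical.

```c
/* Amdahl's law: overall speedup when a fraction f of the original run
   time is accelerated by a factor s and the rest is unchanged. */
double overall_speedup(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}

/* Inverse: fraction of the original run time the hotspot must have
   occupied for a hotspot speedup s to give an overall speedup S. */
double hotspot_fraction(double s, double S)
{
    return (1.0 - 1.0 / S) / (1.0 - 1.0 / s);
}
```

Under that assumption, `hotspot_fraction(14.0, 5.0)` is about 0.86: a hotspot that took roughly 86% of the original run time and became 14x faster yields about 5x overall.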







# Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © 2016, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

#### **Optimization Notice**

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804