Polly: Examples

Optimize Matrix Multiplication Manually

Polly does not yet focus on end users, but on research and the development of new optimizations. Users of Polly therefore often need to understand how it works internally. To introduce the different steps taken during polyhedral compilation, we give a step-by-step example of how to use the different Polly passes to optimize a simple matrix multiplication kernel. If you are looking for a more automated way of executing Polly, check out the pollycc tool in utils/pollycc.

The files used and created in this example are available here. They can be created automatically by running the runall.sh script.
  1. Create LLVM-IR from the C code

    Polly works on LLVM-IR, so the source files must first be translated into LLVM-IR. If more than one file should be optimized, the files can be combined into a single file with llvm-link.
    clang -S -emit-llvm matmul.c -o matmul.s
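
The source file matmul.c is not reproduced on this page. A minimal sketch of the kind of kernel the later steps work on could look like the following. The function name matmul_kernel, the flat row-major arrays, and the runtime size parameter n are assumptions made for illustration; the real file presumably uses fixed-size arrays and performs the multiplication directly in main. The names Stmt_4 and Stmt_6 in the comments refer to the statements that appear in the Polly output of the later steps.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel in matmul.c: computes C = A * B
 * for n x n matrices stored row-major in flat arrays. */
void matmul_kernel(size_t n, const float *A, const float *B, float *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            /* Initialization; plays the role of Stmt_4(i, j). */
            C[i * n + j] = 0.0f;
            for (size_t k = 0; k < n; k++) {
                /* Multiply-accumulate; plays the role of Stmt_6(i, j, k). */
                C[i * n + j] += A[i * n + k] * B[k * n + j];
            }
        }
    }
}
```

The initialization sits between the second and third loop, which is why the loop nest printed in step 4 shows one statement at depth two and one at depth three.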
  2. Load Polly automatically when calling the 'opt' tool

    Polly is not built into opt or bugpoint; it is a shared library that needs to be loaded into these tools explicitly. The Polly library is called LLVMPolly.so. For a cmake build it is available in the build/lib/ directory; autoconf creates the same file in build/tools/polly/{Release+Asserts|Asserts|Debug}/lib. For convenience we create an alias that automatically loads Polly whenever 'opt' is called.
    export PATH_TO_POLLY_LIB="$HOME/polly/build/lib"
    alias opt="opt -load ${PATH_TO_POLLY_LIB}/LLVMPolly.so"
  3. Prepare the LLVM-IR for Polly

    Polly can only work with code that matches a canonical form. To translate the LLVM-IR into this form we use a set of canonicalization passes. For this example only three passes are necessary. Getting good coverage on more complicated input files often requires more canonicalization passes; pollycc contains a list of passes that have proven beneficial.
    opt -S -mem2reg -loop-simplify -indvars matmul.s > matmul.preopt.ll
  4. Show the SCoPs detected by Polly (optional)

    To check whether Polly was able to detect SCoPs, we print the structure of the detected SCoPs. In our example two SCoPs are detected: one in 'init_array', the other in 'main'.
    opt -basicaa -polly-cloog -analyze -q matmul.preopt.ll
    init_array():
    for (c2=0;c2<=1023;c2++) {
      for (c4=0;c4<=1023;c4++) {
        Stmt_5(c2,c4);
      }
    }
    
    main():
    for (c2=0;c2<=1023;c2++) {
      for (c4=0;c4<=1023;c4++) {
        Stmt_4(c2,c4);
        for (c6=0;c6<=1023;c6++) {
          Stmt_6(c2,c4,c6);
        }
      }
    }
    
  5. Highlight the detected SCoPs in the CFGs of the program (requires graphviz/dotty)

    Polly can use graphviz to graphically show a CFG in which the detected SCoPs are highlighted. It can also create '.dot' files that can be translated by the 'dot' utility into various graphic formats.
    opt -basicaa -view-scops -disable-output matmul.preopt.ll
    opt -basicaa -view-scops-only -disable-output matmul.preopt.ll
    The output for the different functions:
    view-scops: main, init_array, print_array
    view-scops-only: main, init_array, print_array
  6. View the polyhedral representation of the SCoPs

    opt -basicaa -polly-scops -analyze matmul.preopt.ll
    [...]
    Printing analysis 'Polly - Create polyhedral description of Scops' for region:
    '%1 => %17' in function 'init_array':
       Context:
       { [] }
       Statements {
       	Stmt_5
               Domain :=
                   { Stmt_5[i0, i1] : i0 >= 0 and i0 <= 1023 and i1 >= 0 and i1 <= 1023 };
               Scattering :=
                   { Stmt_5[i0, i1] -> scattering[0, i0, 0, i1, 0] };
               WriteAccess := 
                   { Stmt_5[i0, i1] -> MemRef_A[1037i0 + i1] };
               WriteAccess := 
                   { Stmt_5[i0, i1] -> MemRef_B[1047i0 + i1] };
       	FinalRead
               Domain :=
                   { FinalRead[0] };
               Scattering :=
                   { FinalRead[i0] -> scattering[200000000, o1, o2, o3, o4] };
               ReadAccess := 
                   { FinalRead[i0] -> MemRef_A[o0] };
               ReadAccess := 
                   { FinalRead[i0] -> MemRef_B[o0] };
       }
    [...]
    Printing analysis 'Polly - Create polyhedral description of Scops' for region:
    '%1 => %17' in function 'main':
       Context:
       { [] }
       Statements {
       	Stmt_4
               Domain :=
                   { Stmt_4[i0, i1] : i0 >= 0 and i0 <= 1023 and i1 >= 0 and i1 <= 1023 };
               Scattering :=
                   { Stmt_4[i0, i1] -> scattering[0, i0, 0, i1, 0, 0, 0] };
               WriteAccess := 
                   { Stmt_4[i0, i1] -> MemRef_C[1067i0 + i1] };
       	Stmt_6
               Domain :=
                   { Stmt_6[i0, i1, i2] : i0 >= 0 and i0 <= 1023 and i1 >= 0 and i1 <= 1023 and i2 >= 0 and i2 <= 1023 };
               Scattering :=
                   { Stmt_6[i0, i1, i2] -> scattering[0, i0, 0, i1, 1, i2, 0] };
               ReadAccess := 
                   { Stmt_6[i0, i1, i2] -> MemRef_C[1067i0 + i1] };
               ReadAccess := 
                   { Stmt_6[i0, i1, i2] -> MemRef_A[1037i0 + i2] };
               ReadAccess := 
                   { Stmt_6[i0, i1, i2] -> MemRef_B[i1 + 1047i2] };
               WriteAccess := 
                   { Stmt_6[i0, i1, i2] -> MemRef_C[1067i0 + i1] };
       	FinalRead
               Domain :=
                   { FinalRead[0] };
               Scattering :=
                   { FinalRead[i0] -> scattering[200000000, o1, o2, o3, o4, o5, o6] };
               ReadAccess := 
                   { FinalRead[i0] -> MemRef_C[o0] };
               ReadAccess := 
                   { FinalRead[i0] -> MemRef_A[o0] };
               ReadAccess := 
                   { FinalRead[i0] -> MemRef_B[o0] };
       }
    [...]
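
The scattering maps every statement instance to a point in a multi-dimensional time space, and instances execute in lexicographic order of these points. As a minimal sketch (the helper names below are hypothetical and not part of Polly), the following C code builds the scattering vectors printed above for Stmt_4 and Stmt_6 in 'main' and compares them lexicographically.

```c
#include <assert.h>

#define SCATTER_DIMS 7 /* scattering dimensions printed for 'main' */

/* Stmt_4[i0, i1] -> scattering[0, i0, 0, i1, 0, 0, 0] */
void scatter_stmt4(int i0, int i1, int out[SCATTER_DIMS])
{
    assert(i0 >= 0 && i1 >= 0); /* lower bounds of the printed domain */
    const int v[SCATTER_DIMS] = {0, i0, 0, i1, 0, 0, 0};
    for (int d = 0; d < SCATTER_DIMS; d++)
        out[d] = v[d];
}

/* Stmt_6[i0, i1, i2] -> scattering[0, i0, 0, i1, 1, i2, 0] */
void scatter_stmt6(int i0, int i1, int i2, int out[SCATTER_DIMS])
{
    assert(i0 >= 0 && i1 >= 0 && i2 >= 0);
    const int v[SCATTER_DIMS] = {0, i0, 0, i1, 1, i2, 0};
    for (int d = 0; d < SCATTER_DIMS; d++)
        out[d] = v[d];
}

/* Lexicographic comparison of two scattering vectors: returns -1 if a is
 * scheduled before b, 1 if it is scheduled after b, and 0 if both map to
 * the same point in time. */
int scatter_cmp(const int a[SCATTER_DIMS], const int b[SCATTER_DIMS])
{
    for (int d = 0; d < SCATTER_DIMS; d++) {
        if (a[d] != b[d])
            return a[d] < b[d] ? -1 : 1;
    }
    return 0;
}
```

Stmt_4[i0, i1] and Stmt_6[i0, i1, 0] first differ in the fifth dimension (0 versus 1), so the initialization runs before the first accumulation for the same element. All instances of Stmt_6 for a given (i0, i1) in turn run before Stmt_4[i0, i1 + 1], because the fourth dimension already decides the order.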
    
  7. Show the dependences for the SCoPs

    opt -basicaa -polly-dependences -analyze matmul.preopt.ll
    Printing analysis 'Polly - Calculate dependences for SCoP' for region:
    'for.cond => for.end28' in function 'init_array':
       Must dependences:
           {  }
       May dependences:
           {  }
       Must no source:
           {  }
       May no source:
           {  }
    Printing analysis 'Polly - Calculate dependences for SCoP' for region:
    'for.cond => for.end48' in function 'main':
       Must dependences:
           {  Stmt_4[i0, i1] -> Stmt_6[i0, i1, 0] :
                  i0 >= 0 and i0 <= 1023 and i1 >= 0 and i1 <= 1023;
              Stmt_6[i0, i1, i2] -> Stmt_6[i0, i1, 1 + i2] :
                  i0 >= 0 and i0 <= 1023 and i1 >= 0 and i1 <= 1023 and i2 >= 0 and i2 <= 1022; 
              Stmt_6[i0, i1, 1023] -> FinalRead[0] :
                  i1 <= 1091540 - 1067i0 and i1 >= -1067i0 and i1 >= 0 and i1 <= 1023;
              Stmt_6[1023, i1, 1023] -> FinalRead[0] :
                  i1 >= 0 and i1 <= 1023 
           }
       May dependences:
           {  }
       Must no source:
           {  Stmt_6[i0, i1, i2] -> MemRef_A[1037i0 + i2] :
                  i0 >= 0 and i0 <= 1023 and i1 >= 0 and i1 <= 1023 and i2 >= 0 and i2 <= 1023; 
              Stmt_6[i0, i1, i2] -> MemRef_B[i1 + 1047i2] :
                  i0 >= 0 and i0 <= 1023 and i1 >= 0 and i1 <= 1023 and i2 >= 0 and i2 <= 1023; 
              FinalRead[0] -> MemRef_A[o0];
              FinalRead[0] -> MemRef_B[o0];
              FinalRead[0] -> MemRef_C[o0] :
                  o0 >= 1092565 or (exists (e0 = [(o0)/1067]: o0 <= 1091540 and o0 >= 0
                  and 1067e0 <= -1024 + o0 and 1067e0 >= -1066 + o0)) or o0 <= -1;
           }
       May no source:
           {  }
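
The must dependences Stmt_4[i0, i1] -> Stmt_6[i0, i1, 0] and Stmt_6[i0, i1, i2] -> Stmt_6[i0, i1, 1 + i2] arise because every accumulation step reads the element of C that the previous step for the same (i0, i1) wrote. The output above is symbolic; the following is only a brute-force sketch for a reduced problem size, assuming a row stride equal to n and using a hypothetical helper. It replays the original execution order, remembers the last writer of each element of C, and checks that every read in Stmt_6 has exactly the source listed above.

```c
#include <assert.h>
#include <stdlib.h>

/* One statement instance: i2 == -1 encodes Stmt_4[i0, i1],
 * i2 >= 0 encodes Stmt_6[i0, i1, i2]. */
struct instance {
    int i0, i1, i2;
};

/* Replays the original loop order for a reduced n x n problem.
 * Returns 1 if every read of C in Stmt_6[i0, i1, i2] sees the value
 * written by Stmt_4[i0, i1] (for i2 == 0) or by Stmt_6[i0, i1, i2 - 1]
 * (for i2 > 0), 0 if some read has a different source, and -1 if the
 * bookkeeping array cannot be allocated. */
int check_c_dependences(int n)
{
    assert(n > 0);
    size_t cells = (size_t)n * (size_t)n;
    struct instance *last_writer = malloc(cells * sizeof *last_writer);
    if (last_writer == NULL)
        return -1;

    int ok = 1;
    for (int i0 = 0; i0 < n; i0++) {
        for (int i1 = 0; i1 < n; i1++) {
            /* Offset of C[i0][i1] in a row-major array with stride n. */
            size_t cell = (size_t)i0 * (size_t)n + (size_t)i1;

            /* Stmt_4 writes C[i0][i1]. */
            last_writer[cell] = (struct instance){i0, i1, -1};

            for (int i2 = 0; i2 < n; i2++) {
                /* Stmt_6 reads C[i0][i1]: look up where the value came from. */
                struct instance src = last_writer[cell];
                if (src.i0 != i0 || src.i1 != i1 || src.i2 != i2 - 1)
                    ok = 0;
                /* Stmt_6 then writes C[i0][i1]. */
                last_writer[cell] = (struct instance){i0, i1, i2};
            }
        }
    }
    free(last_writer);
    return ok;
}
```

Because the dependence only runs along i2, any reordering that keeps the i2 iterations of a fixed (i0, i1) in their original order, and keeps Stmt_4 ahead of them, respects these dependences.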
    
  8. Export jscop files

    Polly can export the polyhedral representation to so-called jscop files, which store it in JSON format.
    opt -basicaa -polly-export-jscop matmul.preopt.ll
    Writing SCoP 'for.cond => for.end28' in function 'init_array' to './init_array___%for.cond---%for.end28.jscop'.
    Writing SCoP 'for.cond => for.end48' in function 'main' to './main___%for.cond---%for.end48.jscop'.
    
  9. Import the changed jscop files and print the updated SCoP structure (optional)

    Polly can reimport jscop files in which the schedules of the statements have been changed. These changed schedules are used to describe transformations. Different jscop files can be imported by providing the postfix of the jscop file to import.
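
The file names in the tool output on this page follow a visible pattern: the function name, the entry and exit blocks of the region, the suffix .jscop, and optionally the postfix. The helper below is a hypothetical sketch that reproduces this pattern as inferred from the printed messages; it is not taken from the Polly sources, and the function name and signature are assumptions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Builds a file name of the form seen in the tool output:
 *   ./<function>___<entry>---<exit>.jscop[.<postfix>]
 * A NULL or empty postfix yields the plain exported name.
 * Returns 0 on success and -1 if the buffer is too small. */
int jscop_filename(char *buf, size_t size, const char *function,
                   const char *entry, const char *exit_block,
                   const char *postfix)
{
    assert(buf != NULL && function != NULL);
    assert(entry != NULL && exit_block != NULL);

    int len;
    if (postfix != NULL && strlen(postfix) > 0) {
        len = snprintf(buf, size, "./%s___%s---%s.jscop.%s",
                       function, entry, exit_block, postfix);
    } else {
        len = snprintf(buf, size, "./%s___%s---%s.jscop",
                       function, entry, exit_block);
    }
    /* snprintf reports the length it wanted to write; anything that does
     * not fit (including the terminating NUL) counts as a failure. */
    return (len >= 0 && (size_t)len < size) ? 0 : -1;
}
```

For example, the arguments "main", "%1", "%17", and "interchanged" give './main___%1---%17.jscop.interchanged', the name that appears in the import messages further down.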

    We apply three different transformations to the SCoP in the main function. The jscop files describing these transformations are hand-written. If PoCC is installed, Polly can sometimes calculate such schedules fully automatically. However, this is still an area we are actively working on.

    No Polly

    As a baseline we do not call any Polly code generation, but only apply the normal -O3 optimizations.

    opt matmul.preopt.ll -basicaa \
        -polly-import-jscop \
        -polly-cloog -analyze
    
    [...]
    main():
    for (c2=0;c2<=1535;c2++) {
      for (c4=0;c4<=1535;c4++) {
        Stmt_4(c2,c4);
        for (c6=0;c6<=1535;c6++) {
          Stmt_6(c2,c4,c6);
        }
      }
    }
    [...]
    
    Interchange (and Fission to allow the interchange)

    We split the loops and can now apply an interchange of the loop dimensions that enumerate Stmt_6.

    opt matmul.preopt.ll -basicaa \
        -polly-import-jscop -polly-import-jscop-postfix=interchanged \
        -polly-cloog -analyze
    
    [...]
    Reading JScop '%1 => %17' in function 'main' from './main___%1---%17.jscop.interchanged'.
    [...]
    main():
    for (c2=0;c2<=1535;c2++) {
      for (c4=0;c4<=1535;c4++) {
        Stmt_4(c2,c4);
      }
    }
    for (c2=0;c2<=1535;c2++) {
      for (c4=0;c4<=1535;c4++) {
        for (c6=0;c6<=1535;c6++) {
          Stmt_6(c2,c6,c4);
        }
      }
    }
    [...]
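
In C terms, the CLooG output above corresponds roughly to the loop structure below. This is a minimal sketch: the function name matmul_interchanged, the flat row-major arrays, and the size parameter n are assumptions for illustration, and the statement bodies are inferred from the access relations printed in step 6.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C rendering of the schedule after fission and interchange. */
void matmul_interchanged(size_t n, const float *A, const float *B, float *C)
{
    assert(A != NULL && B != NULL && C != NULL);

    /* First nest: only the initialization, Stmt_4(c2, c4). */
    for (size_t c2 = 0; c2 < n; c2++)
        for (size_t c4 = 0; c4 < n; c4++)
            C[c2 * n + c4] = 0.0f;

    /* Second nest: Stmt_6(c2, c6, c4), so i0 = c2, i1 = c6, i2 = c4.
     * The two inner loops are swapped compared to the original order. */
    for (size_t c2 = 0; c2 < n; c2++)
        for (size_t c4 = 0; c4 < n; c4++)
            for (size_t c6 = 0; c6 < n; c6++)
                C[c2 * n + c6] += A[c2 * n + c4] * B[c4 * n + c6];
}
```

Fission is needed first because in the original nest Stmt_4 sits between the second and third loop, so those two loops cannot be swapped while the initialization is still inside. After the interchange the innermost loop walks along rows of B and C, and for each element of C the i2 values are still visited in increasing order, so the dependences from step 7 are respected.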
    
    Interchange + Tiling

    In addition to the interchange we now tile the second loop nest.

    opt matmul.preopt.ll -basicaa \
        -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled \
        -polly-cloog -analyze
    
    [...]
    Reading JScop '%1 => %17' in function 'main' from './main___%1---%17.jscop.interchanged+tiled'.
    [...]
    main():
    for (c2=0;c2<=1535;c2++) {
      for (c4=0;c4<=1535;c4++) {
        Stmt_4(c2,c4);
      }
    }
    for (c2=0;c2<=1535;c2+=64) {
      for (c3=0;c3<=1535;c3+=64) {
        for (c4=0;c4<=1535;c4+=64) {
          for (c5=c2;c5<=c2+63;c5++) {
            for (c6=c4;c6<=c4+63;c6++) {
              for (c7=c3;c7<=c3+63;c7++) {
                Stmt_6(c5,c7,c6);
              }
            }
          }
        }
      }
    }
    [...]
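
The tiled schedule above can be sketched in C as follows. The function name matmul_tiled, the flat row-major arrays, and the parameters n and tile are assumptions for illustration; with n = 1536 and tile = 64 the loop bounds match the printed output. The sketch assumes that n is a multiple of the tile size, as in the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C rendering of the interchanged and tiled schedule. */
void matmul_tiled(size_t n, size_t tile,
                  const float *A, const float *B, float *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    assert(tile > 0 && n % tile == 0); /* only full tiles are handled */

    /* First nest: initialization, Stmt_4(c2, c4), left untiled. */
    for (size_t c2 = 0; c2 < n; c2++)
        for (size_t c4 = 0; c4 < n; c4++)
            C[c2 * n + c4] = 0.0f;

    /* Tile loops c2, c3, c4 select a block; point loops c5, c6, c7
     * enumerate Stmt_6(c5, c7, c6) inside it (i0 = c5, i1 = c7, i2 = c6). */
    for (size_t c2 = 0; c2 < n; c2 += tile)
        for (size_t c3 = 0; c3 < n; c3 += tile)
            for (size_t c4 = 0; c4 < n; c4 += tile)
                for (size_t c5 = c2; c5 < c2 + tile; c5++)
                    for (size_t c6 = c4; c6 < c4 + tile; c6++)
                        for (size_t c7 = c3; c7 < c3 + tile; c7++)
                            C[c5 * n + c7] += A[c5 * n + c6] * B[c6 * n + c7];
}
```

Each block touches only tile x tile pieces of A, B, and C, which is what makes tiling attractive for cache reuse. For a fixed element of C the i2 values are still visited in increasing order, so the result is identical to the untiled version.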
    
    Interchange + Tiling + Strip-mining to prepare vectorization
    To allow vectorization later, we create a so-called trivially parallelizable loop: it is innermost, parallel, and has only four iterations, so it can be replaced by 4-element SIMD instructions.
    opt matmul.preopt.ll -basicaa \
        -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled+vector \
        -polly-cloog -analyze 
    [...]
    Reading JScop '%1 => %17' in function 'main' from './main___%1---%17.jscop.interchanged+tiled+vector'.
    [...]
    main():
    for (c2=0;c2<=1535;c2++) {
      for (c4=0;c4<=1535;c4++) {
        Stmt_4(c2,c4);
      }
    }
    for (c2=0;c2<=1535;c2+=64) {
      for (c3=0;c3<=1535;c3+=64) {
        for (c4=0;c4<=1535;c4+=64) {
          for (c5=c2;c5<=c2+63;c5++) {
            for (c6=c4;c6<=c4+63;c6++) {
              for (c7=c3;c7<=c3+63;c7+=4) {
                for (c8=c7;c8<=c7+3;c8++) {
                  Stmt_6(c5,c8,c6);
                }
              }
            }
          }
        }
      }
    }
    [...]
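
A C sketch of the strip-mined schedule above is given below. The function name matmul_strip_mined, the flat row-major arrays, and the parameters n and tile are assumptions for illustration; the constant VEC mirrors the four iterations of the innermost loop in the printed output. The sketch assumes that n is a multiple of tile and that tile is a multiple of VEC.

```c
#include <assert.h>
#include <stddef.h>

#define VEC 4 /* trip count of the innermost loop, as in the output above */

/* Hypothetical C rendering of the interchanged, tiled, and strip-mined
 * schedule. */
void matmul_strip_mined(size_t n, size_t tile,
                        const float *A, const float *B, float *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    assert(tile > 0 && n % tile == 0 && tile % VEC == 0);

    /* First nest: initialization, Stmt_4(c2, c4). */
    for (size_t c2 = 0; c2 < n; c2++)
        for (size_t c4 = 0; c4 < n; c4++)
            C[c2 * n + c4] = 0.0f;

    /* Second nest: Stmt_6(c5, c8, c6), so i0 = c5, i1 = c8, i2 = c6. */
    for (size_t c2 = 0; c2 < n; c2 += tile)
        for (size_t c3 = 0; c3 < n; c3 += tile)
            for (size_t c4 = 0; c4 < n; c4 += tile)
                for (size_t c5 = c2; c5 < c2 + tile; c5++)
                    for (size_t c6 = c4; c6 < c4 + tile; c6++)
                        for (size_t c7 = c3; c7 < c3 + tile; c7 += VEC)
                            /* Four independent updates of neighbouring
                             * elements of C: the candidate for SIMD. */
                            for (size_t c8 = c7; c8 < c7 + VEC; c8++)
                                C[c5 * n + c8] += A[c5 * n + c6] * B[c6 * n + c8];
}
```

Inside the c8 loop the value read from A does not change, while B and C are accessed at four consecutive addresses. Each of the four iterations updates a different element of C, so no dependence from step 7 connects them.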
    
  10. Generate code for the SCoPs

    This generates new code for the SCoPs detected by Polly. If -polly-import-jscop is present, the transformations specified in the imported jscop files are applied.

    opt matmul.preopt.ll | opt -O3 > matmul.normalopt.ll
    opt -basicaa \
        -polly-import-jscop -polly-import-jscop-postfix=interchanged \
        -polly-codegen matmul.preopt.ll \
       | opt -O3 > matmul.polly.interchanged.ll
    Reading JScop '%1 => %19' in function 'init_array' from
        './init_array___%1---%19.jscop.interchanged'.
    File could not be read: No such file or directory
    Reading JScop '%1 => %17' in function 'main' from
        './main___%1---%17.jscop.interchanged'.
    
    opt -basicaa \
        -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled \
        -polly-codegen matmul.preopt.ll \
       | opt -O3 > matmul.polly.interchanged+tiled.ll
    Reading JScop '%1 => %19' in function 'init_array' from
        './init_array___%1---%19.jscop.interchanged+tiled'.
    File could not be read: No such file or directory
    Reading JScop '%1 => %17' in function 'main' from
        './main___%1---%17.jscop.interchanged+tiled'.
    
    opt -basicaa \
        -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled+vector \
        -polly-codegen -enable-polly-vector matmul.preopt.ll \
       | opt -O3 > matmul.polly.interchanged+tiled+vector.ll
    Reading JScop '%1 => %19' in function 'init_array' from
        './init_array___%1---%19.jscop.interchanged+tiled+vector'.
    File could not be read: No such file or directory
    Reading JScop '%1 => %17' in function 'main' from
        './main___%1---%17.jscop.interchanged+tiled+vector'.
    
    opt -basicaa \
        -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled+vector \
        -polly-codegen -enable-polly-vector -enable-polly-openmp matmul.preopt.ll \
       | opt -O3 > matmul.polly.interchanged+tiled+vector+openmp.ll
    Reading JScop '%1 => %19' in function 'init_array' from
        './init_array___%1---%19.jscop.interchanged+tiled+vector'.
    File could not be read: No such file or directory
    Reading JScop '%1 => %17' in function 'main' from
        './main___%1---%17.jscop.interchanged+tiled+vector'.
    
  11. Create the executables

    Create one executable optimized with plain -O3 as well as a set of executables optimized in different ways with Polly: the first changes only the loop structure, the second adds tiling, the third adds vectorization, and the last additionally uses OpenMP parallelism.
    llc matmul.normalopt.ll -o matmul.normalopt.s && \
        gcc matmul.normalopt.s -o matmul.normalopt.exe
    llc matmul.polly.interchanged.ll -o matmul.polly.interchanged.s && \
        gcc matmul.polly.interchanged.s -o matmul.polly.interchanged.exe
    llc matmul.polly.interchanged+tiled.ll -o matmul.polly.interchanged+tiled.s && \
        gcc matmul.polly.interchanged+tiled.s -o matmul.polly.interchanged+tiled.exe
    llc matmul.polly.interchanged+tiled+vector.ll -o matmul.polly.interchanged+tiled+vector.s && \
        gcc matmul.polly.interchanged+tiled+vector.s -o matmul.polly.interchanged+tiled+vector.exe
    llc matmul.polly.interchanged+tiled+vector+openmp.ll -o matmul.polly.interchanged+tiled+vector+openmp.s && \
        gcc matmul.polly.interchanged+tiled+vector+openmp.s -lgomp -o matmul.polly.interchanged+tiled+vector+openmp.exe
  12. Compare the runtime of the executables

    Comparing the runtimes of the different executables shows that a simple loop interchange gives the largest performance boost here (42.68 s down to 4.33 s, roughly a 10x speedup). Tiling alone adds little (4.11 s), but vectorization (1.39 s) and OpenMP (0.66 s) improve performance significantly further.
    time ./matmul.normalopt.exe
    42.68 real, 42.55 user, 0.00 sys
    time ./matmul.polly.interchanged.exe
    04.33 real, 4.30 user, 0.01 sys
    time ./matmul.polly.interchanged+tiled.exe
    04.11 real, 4.10 user, 0.00 sys
    time ./matmul.polly.interchanged+tiled+vector.exe
    01.39 real, 1.36 user, 0.01 sys
    time ./matmul.polly.interchanged+tiled+vector+openmp.exe
    00.66 real, 2.58 user, 0.02 sys