BOLT is a post-link optimizer developed to speed up large applications. It achieves the improvements by optimizing the application's code layout based on an execution profile gathered by a sampling profiler, such as the Linux perf tool. An overview of the ideas implemented in BOLT, along with a discussion of its potential and current results, is available in an arXiv paper.
Input Binary Requirements
BOLT operates on X86-64 and AArch64 ELF binaries. At the minimum, the binaries should have an unstripped symbol table, and, to get maximum performance gains, they should be linked with relocations (--emit-relocs or -q linker flag).
BOLT disassembles functions and reconstructs the control flow graph (CFG) before it runs optimizations. Since this is a nontrivial task, especially when indirect branches are present, we rely on certain heuristics to accomplish it. These heuristics have been tested on code generated with the Clang and GCC compilers. The main requirement for C/C++ code is not to rely on code layout properties, such as function pointer deltas. Assembly code can be processed too. Requirements for it include a clear separation of code and data, with data objects being placed into data sections/segments. If indirect jumps are used for intra-function control transfer (e.g., jump tables), the code patterns should match those generated by Clang/GCC.
NOTE: BOLT is currently incompatible with the -freorder-blocks-and-partition compiler option. Since GCC8 enables this option by default, you have to explicitly disable it by adding the -fno-reorder-blocks-and-partition flag if you are compiling with GCC8.
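As a minimal sketch of the requirements above, the following shell fragment assembles a GCC command line that keeps relocations and disables block partitioning. The file names main.c and app are hypothetical, and -Wl,--emit-relocs is the usual way to forward a linker flag through the compiler driver. The command is printed rather than executed, so the fragment runs without a source tree.

```shell
#!/bin/sh
# Hypothetical source and output names; substitute your own.
SRC="main.c"
OUT="app"

# Required with GCC8, which enables block partitioning by default.
CFLAGS="-O2 -fno-reorder-blocks-and-partition"
# -Wl, forwards --emit-relocs from the compiler driver to the linker.
LDFLAGS="-Wl,--emit-relocs"

BUILD_CMD="gcc $CFLAGS $SRC -o $OUT $LDFLAGS"
echo "$BUILD_CMD"
```

Remember not to strip the symbol table from the resulting binary.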
PIE and .so support has been added recently. Please report bugs if you encounter any issues.
Installation
BOLT heavily uses LLVM libraries, and by design, it is built as one of the LLVM tools. The build process is not much different from a regular LLVM build. The following instructions assume that you are running under Linux.
Start with cloning LLVM and BOLT repos:
> git clone https://github.com/llvm-mirror/llvm llvm
> cd llvm/tools
> git checkout -b llvm-bolt f (ed) (db) (f) (b1c) (b7ffc0f4af)
> git clone https://github.com/facebookincubator/BOLT llvm-bolt
> cd ..
> patch -p 1
Proceed to a normal LLVM build using a compiler with C++14 support (for GCC use version 4.9 or later):
> cd ..
> mkdir build
> cd build
> cmake -G Ninja ../llvm -DLLVM_TARGETS_TO_BUILD="X86;AArch64" -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON
> ninja
llvm-bolt will be available under bin/. Add this directory to your path to ensure the rest of the commands in this tutorial work.
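One way to do this, sketched under the assumption that the current directory is the LLVM build directory, is to prepend bin/ to PATH and then check that the tools used later in this tutorial resolve. The loop only reports what it finds; it does not fail if a tool is missing.

```shell
#!/bin/sh
# Assumption: the current directory is the LLVM build directory.
BOLT_BIN="$PWD/bin"
PATH="$BOLT_BIN:$PATH"
export PATH

# Report whether each tool named in this tutorial can be found.
for tool in llvm-bolt perf2bolt merge-fdata; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: found at $(command -v "$tool")"
    else
        echo "$tool: not found on PATH"
    fi
done
```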
Note that we use a specific revision of LLVM as we currently rely on a set of patches that are not yet upstreamed.
Step 1: Collect Profile
Once you get the service deployed and warmed up, it is time to collect perf data with LBR (branch information). The exact perf command to use will depend on the service. E.g., to collect the data for all processes running on the server for the next 3 minutes use:
$ perf record -e cycles:u -j any,u -a -o perf.data -- sleep 180
Depending on the application, you may need more samples to be included with your profile. It's hard to tell upfront what would be a sweet spot for your application. We recommend the profile to cover 1B instructions as reported by the BOLT -dyno-stats option. If you need to increase the number of samples in the profile, you can either run the sleep command for longer or use the -F option with perf to increase the sampling frequency.
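The two knobs can be sketched as shell variables. The values below are illustrative assumptions, not recommendations from the BOLT authors, and the upper limit for -F depends on your kernel settings. The command is printed rather than run, since perf record with -a typically needs elevated privileges.

```shell
#!/bin/sh
# Hypothetical tuning values; adjust for your service.
MINUTES=3      # length of the collection window
FREQ=2000      # samples per second, passed to perf's -F option

DURATION_SECS=$((MINUTES * 60))

PERF_CMD="perf record -e cycles:u -j any,u -a -F $FREQ -o perf.data -- sleep $DURATION_SECS"
echo "$PERF_CMD"
```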
Note that for profile collection we recommend using cycle events and not BR_INST_RETIRED.* events. Empirically, we found cycle events to produce better results.
If the collection of a profile with branches is not available, e.g., when you run on a VM or on hardware that does not support it, then you can use only sample events, such as cycles. In this case, the quality of the profile information would not be as good, and performance gains with BOLT are expected to be lower.
Step 2: Convert Profile to BOLT Format
NOTE: you can skip this step and feed perf.data directly to BOLT using the experimental -p perf.data option.
For this step, you will need the perf.data file collected from the previous step and a copy of the binary that was running. The binary has to be either unstripped, or should have a symbol table intact (i.e., running strip -g is okay).
Make sure perf is in your PATH, and execute perf2bolt:
$ perf2bolt -p perf.data -o perf.fdata <executable>
This command will aggregate branch data from perf.data and store it in a format that is both more compact and more resilient to binary modifications.
If the profile was collected without LBRs, you will need to add the -nl flag to the command line above.
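The choice between an LBR and a non-LBR profile affects both collection and conversion, so it can help to derive the flags for both from one switch. The sketch below does that with two hypothetical helper functions; HAVE_LBR and the binary path ./app are assumptions made for illustration, and the commands are printed rather than run.

```shell
#!/bin/sh
# record_flags: perf record event flags. $1 is 1 if LBR is available, 0 otherwise.
record_flags() {
    if [ "$1" -eq 1 ]; then
        echo "-e cycles:u -j any,u"   # cycles with branch (LBR) sampling
    else
        echo "-e cycles:u"            # plain sample events only
    fi
}

# convert_flags: extra perf2bolt flags for the same switch.
convert_flags() {
    if [ "$1" -eq 1 ]; then
        :                             # LBR profile: no extra flag needed
    else
        echo "-nl"                    # profile was collected without LBRs
    fi
}

HAVE_LBR="${HAVE_LBR:-1}"   # set to 0 on VMs or hardware without LBR
BINARY="./app"              # hypothetical path to the profiled binary

echo "perf record $(record_flags "$HAVE_LBR") -a -o perf.data -- sleep 180"
echo "perf2bolt -p perf.data -o perf.fdata $(convert_flags "$HAVE_LBR") $BINARY"
```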
Step 3: Optimize with BOLT
Once you have perf.fdata ready, you can use it for optimizations with BOLT. Assuming your environment is set up to include the right path, execute llvm-bolt:
$ llvm-bolt <executable> -o <executable>.bolt -data=perf.fdata -reorder-blocks=cache -reorder-functions=hfsort -split-functions=2 -split-all-cold -split-eh -dyno-stats
If you need updated debug info, add the -update-debug-sections option to the command above. The processing time will be slightly longer.
For a full list of options see the -help/-help-hidden output.
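As a rough sketch, the Step 3 invocation can be wrapped so that the debug-info option is added only on request. The optimization flags are copied from the command above; the UPDATE_DEBUG switch and the binary path ./app are hypothetical. The command is printed rather than run.

```shell
#!/bin/sh
BINARY="./app"                      # hypothetical input binary
UPDATE_DEBUG="${UPDATE_DEBUG:-0}"   # set to 1 to request updated debug info

BOLT_FLAGS="-data=perf.fdata -reorder-blocks=cache -reorder-functions=hfsort -split-functions=2 -split-all-cold -split-eh -dyno-stats"
if [ "$UPDATE_DEBUG" -eq 1 ]; then
    # Slightly longer processing time, per the note above.
    BOLT_FLAGS="$BOLT_FLAGS -update-debug-sections"
fi

BOLT_CMD="llvm-bolt $BINARY -o $BINARY.bolt $BOLT_FLAGS"
echo "$BOLT_CMD"
```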
The input binary for this step does not have to exactly match the binary used for profile collection.
Multiple Profiles
Suppose your application can run in different modes, and you can generate a separate profile for each of them. To generate a single binary that can benefit all modes (assuming the profiles don't contradict each other) you can use the merge-fdata tool:
$ merge-fdata *.fdata > combined.fdata
Use combined.fdata for Step 3.
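When the merge is repeated in the same directory, a combined.fdata left over from an earlier run would also match a *.fdata pattern. The sketch below guards against that with a hypothetical helper, list_profiles, and only calls merge-fdata when it is on PATH and at least one profile exists. PROFILE_DIR is an assumed variable, and paths containing spaces are not handled.

```shell
#!/bin/sh
# list_profiles: print every .fdata file in directory $1,
# skipping the output of a previous merge.
list_profiles() {
    for f in "$1"/*.fdata; do
        [ -e "$f" ] || continue                        # pattern matched nothing
        [ "${f##*/}" = "combined.fdata" ] && continue  # earlier merge output
        echo "$f"
    done
}

PROFILE_DIR="${PROFILE_DIR:-.}"   # hypothetical: one .fdata file per mode
PROFILES=$(list_profiles "$PROFILE_DIR")

if [ -n "$PROFILES" ] && command -v merge-fdata >/dev/null 2>&1; then
    # Word splitting on $PROFILES is intentional here.
    merge-fdata $PROFILES > combined.fdata
else
    echo "nothing merged: no profiles found or merge-fdata not on PATH"
fi
```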