Feedback-directed optimization with GCC and perf


GCC 5.0 added support for AutoFDO, a flavor of feedback-directed optimization (FDO) that uses perf to generate the profile. The GCC manual documents it under -fauto-profile; to quote:

Enable sampling-based feedback-directed optimizations, and the following optimizations which are generally profitable only with profile feedback available: -fbranch-probabilities, -fvpt, -funroll-loops, -fpeel-loops, -ftracer, -ftree-vectorize, -finline-functions, -fipa-cp, -fipa-cp-clone, -fpredictive-commoning, -funswitch-loops, -fgcse-after-reload, and -ftree-loop-distribute-patterns.
path is the name of a file containing AutoFDO profile information. If omitted, it defaults to fbdata.afdo in the current directory.
Producing an AutoFDO profile data file requires running your program with the perf utility on a supported GNU/Linux target system. For more information, see .
    perf record -e br_inst_retired:near_taken -b -o \
    -- your_program
Then use the create_gcov tool to convert the raw profile data to a format that can be used by GCC. You must also supply the unstripped binary for your program to this tool. See .
    create_gcov --binary=your_program.unstripped -- \

However, the manual skims over a few details:

  • The event br_inst_retired:near_taken is not available under the name shown there. See this gcc thread for details.

    I used the raw event encoding instead:

    perf record \
    -e "cpu/event=0xc4,umask=0x20,name=br_inst_retired_near_taken,period=400009/pp" \
    -p ...  -b -o

    You can use ocperf from pmu-tools to look up the correct event encoding for your CPU (with its list subcommand).

  • create_gcov is not packaged with GCC; it is only available as part of AutoFDO from Google.

  • AutoFDO itself can be incompatible with recent versions of perf. I am using perf from Linux 4.0. You can apply the patches here.

    • I also have a GitHub branch with the patches applied here.
  • Finally, you can also run into a gcov version mismatch when GCC reads the profile:

    AutoFDO profile version 875575082 does match 1.

    To avoid this, pass the gcov version to create_gcov explicitly:

    create_gcov --binary=/pxc56/bin/mysqld -gcov_version 1
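
The number in that version message is easier to interpret once it is split into bytes. The sketch below is a small POSIX shell helper (the name decode_gcov_version is made up for illustration) that treats the 32-bit value as four packed ASCII characters. Reading the result as a gcov-style version stamp is an assumption on my part, not something the message itself states.

```shell
#!/bin/sh
# Hypothetical helper: print the four ASCII bytes packed into a 32-bit number,
# most significant byte first.
decode_gcov_version() {
    v=$1
    out=
    for s in 24 16 8 0; do
        # Extract one byte, then turn it into a character via an octal escape.
        byte=$(( ($v >> $s) & 255 ))
        out="$out$(printf "\\$(printf '%03o' "$byte")")"
    done
    printf '%s\n' "$out"
}

# The value from the error message above.
decode_gcov_version 875575082   # prints 407*
```

If that reading is right, the profile was stamped with a gcov-style version string rather than the plain 1 that GCC expects, which is what passing -gcov_version 1 corrects.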

With all the tools in place, the workflow is:

  1. Build the program. In my case, I built percona-xtradb-cluster with the RelWithDebInfo profile; the debug symbols are required.
  2. Run it against a representative workload. I used sysbench oltp for this:

    sysbench --test=/pxc56/db/oltp.lua --db-driver=mysql \
    --mysql-engine-trx=yes --mysql-table-engine=innodb \
    --mysql-user=root --mysql-password=test --oltp-table-size=100000 \
    --num-threads=4 --init-rng=on --max-requests=0 --oltp-auto-inc=off --max-time=60 \
    --max-requests=100000 --oltp-tables-count=5 run

  3. While the workload is running, run perf concurrently:

    perf record -e \
    "cpu/event=0xc4,umask=0x20,name=br_inst_retired_near_taken,period=400009/pp" \
    -p $(pidof mysqld)  -b -o

  4. After sysbench ends, stop perf and convert the profile to gcov format:

    create_gcov --binary=/pxc56/bin/mysqld \
    -gcov_version 1 --gcov=perf.ado

  5. Rebuild the program, this time with:

    export CFLAGS+=" -fauto-profile=/tmp/perf.ado "
    export CXXFLAGS+=" -fauto-profile=/tmp/perf.ado "

     The binary produced now is optimized with the hints/feedback from the profile captured by perf.

I have skipped the results for now; those are for another post, with actual benchmarking in place and a more representative workload.
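
The perf and create_gcov steps above can be collected into a small dry-run script. This is only a minimal sketch under stated assumptions: the helper names (afdo_event, missing_tools, print_commands) and the /tmp/perf.data location are made up for illustration, and the --profile flag is taken from the create_gcov example in the GCC manual rather than from the commands shown above. It prints the commands instead of running them.

```shell
#!/bin/sh
# Paths: the binary and the .ado file come from the post; the perf.data
# location is a hypothetical choice for this sketch.
BINARY=/pxc56/bin/mysqld
PERF_DATA=/tmp/perf.data
AFDO_PROFILE=/tmp/perf.ado

# Build the raw PMU event string used in place of br_inst_retired:near_taken.
# $1 = event code, $2 = umask, $3 = sampling period
afdo_event() {
    printf 'cpu/event=%s,umask=%s,name=br_inst_retired_near_taken,period=%s/pp\n' \
        "$1" "$2" "$3"
}

# Print the name of every requested tool that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Print (do not run) the profiling and conversion commands for a given PID.
print_commands() {
    pid=$1
    printf 'perf record -e "%s" -p %s -b -o %s\n' \
        "$(afdo_event 0xc4 0x20 400009)" "$pid" "$PERF_DATA"
    printf 'create_gcov --binary=%s --profile=%s --gcov=%s -gcov_version 1\n' \
        "$BINARY" "$PERF_DATA" "$AFDO_PROFILE"
}
```

A possible use is to call missing_tools perf create_gcov first, and then print_commands "$(pidof mysqld)" to review the two commands before running them by hand.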

To conclude: GCC has supported gcov-based profiling for a while, but it was not that convenient to use. perf is a low-overhead profiler already in use in many environments, so being able to feed its profiles to the compiler makes profile-based optimization considerably easier.