Clang's -O0 output: branch displacement and size increase

tl;dr Clang 19 will remove the -mrelax-all default at
-O0, significantly decreasing the text section size for
x86.

Span-dependent instructions

In assembly languages, some instructions with an immediate operand
can be encoded in two (or more) forms with different sizes. On x86-64, a
direct JMP/JCC can be encoded either in 2 bytes with an 8-bit relative
offset, or in 5 bytes (JMP) or 6 bytes (JCC) with a 32-bit relative
offset. A short jump is
preferred because it takes less space. However, when the target of the
jump is too far away (out of range for an 8-bit relative offset), a near
jump must be used.

ja foo    # jump near if above, 0f 87 <rel32> (foo is 128 bytes past a short encoding)
ja foo    # jump short if above, 77 <rel8> (foo is 126 bytes away)
.nops 126
foo: ret
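
The range check behind this choice can be sketched in a few lines of Python. This is only an illustration for conditional jumps: the function names are mine, and the sizes are the usual 2-byte (opcode plus rel8) and 6-byte (0f 8x plus rel32) encodings. The displacement is measured from the end of the jump instruction.

```python
SHORT_JCC_SIZE = 2  # opcode + rel8, e.g. 77 <rel8>
NEAR_JCC_SIZE = 6   # 0f 8x + rel32, e.g. 0f 87 <rel32>

def fits_rel8(displacement: int) -> bool:
    """rel8 is a signed byte, so it covers -128..127."""
    return -128 <= displacement <= 127

def jcc_size(displacement: int) -> int:
    """Smallest encoding for a conditional jump, given the displacement
    measured from the end of its short form."""
    return SHORT_JCC_SIZE if fits_rel8(displacement) else NEAR_JCC_SIZE

# 126 bytes between the end of the jump and its target: rel8 is enough.
print(jcc_size(126))  # 2
# 128 bytes (126 bytes of padding plus one more 2-byte jump): rel32 is needed.
print(jcc_size(128))  # 6
```

The catch is that the displacement itself depends on the sizes of the jumps lying between an instruction and its target, which is what makes the problem interesting.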

A 1978 paper by Thomas G. Szymanski (“Assembling Code for
Machines with Span-Dependent Instructions”) used the term
“span-dependent instructions” to refer to such instructions with short
and long forms. Assemblers grapple with the challenge of choosing the
optimal size for these instructions, often referred to as the “branch
displacement problem” since branches are the most common type. A good
resource for understanding Szymanski’s work is Assembling
Span-Dependent Instructions.

Start small and grow

Popular assemblers still used today tend to favor a “start small and
grow” approach, typically requiring one more pass than Szymanski’s
“start big and shrink” method. This approach often results in smaller
code and can handle additional complexities like alignment
directives.
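
A minimal sketch of the “start small and grow” idea, assuming a toy section made of conditional jumps, fixed-size byte runs and labels (the item format and the relax function are hypothetical, not LLVM's or any real assembler's code). Every jump starts short; each pass lays out the section and grows the jumps whose targets are out of rel8 range, until a pass changes nothing. Jumps never shrink, so the loop terminates.

```python
SHORT, NEAR = 2, 6  # sizes of a conditional jump with rel8 / rel32

def relax(items):
    """items: list of ("jcc", target) | ("bytes", n) | ("label", name).
    Returns (final jump sizes in program order, number of layout passes)."""
    sizes = {i: SHORT for i, item in enumerate(items) if item[0] == "jcc"}
    passes = 0
    while True:
        passes += 1
        # Lay out the section with the current size choices.
        offset, start, labels = 0, {}, {}
        for i, (kind, arg) in enumerate(items):
            start[i] = offset
            if kind == "label":
                labels[arg] = offset
            elif kind == "bytes":
                offset += arg
            else:
                offset += sizes[i]
        # Grow every short jump whose displacement does not fit in rel8.
        changed = False
        for i in list(sizes):
            if sizes[i] == SHORT:
                disp = labels[items[i][1]] - (start[i] + SHORT)
                if not -128 <= disp <= 127:
                    sizes[i] = NEAR
                    changed = True
        if not changed:
            return [sizes[i] for i in sorted(sizes)], passes

# Hypothetical input: the first jump fits at first (displacement 127), but
# growing the second jump pushes its target 4 bytes further away.
example = [
    ("jcc", "end"), ("jcc", "way_far"),
    ("bytes", 125), ("label", "end"),
    ("bytes", 200), ("label", "way_far"),
]
print(relax(example))  # ([6, 6], 3)
```

The example needs three passes: one to grow the second jump, one to notice that the first jump no longer fits, and a final pass that confirms nothing else changed.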

In LLVM, the MC
library (Machine Code) is responsible for assembly, disassembly, and
object file formats. Within MC, “assembler relaxation” deals with
span-dependent instructions. This is distinct from linker
relaxation.

Eli Bendersky provides a detailed explanation in a 2013
blog post and highlights an interesting behavior:

For example, when compiling with -O0, the LLVM assembler simply
relaxes all jumps it encounters on first sight. This allows it to put
all instructions immediately into data fragments, which ensures there’s
much fewer fragments overall, so the assembly process is faster and
consumes less memory.

When -O0 is enabled and the integrated assembler is used
(the common default), clangDriver passes the -mrelax-all
flag to the LLVM MC library. This sets the MCRelaxAll flag
in MCTargetOptions, instructing the assembler to
potentially start with the long form (near) for JMP and JCC instructions
on the X86 target only. Other instructions like ADD/SUB/CMP and non-x86
architectures remain unaffected.

`-mrelax-all` tradeoff

Here is an example:

void bar();

void foo(int a) {
  if (a) bar();
}

The assembly (clang -S) looks like:

foo:                                    # @foo
# %bb.0:                                # %entry
        pushq   %rbp
        movq    %rsp, %rbp
        subq    $16, %rsp
        movl    %edi, -4(%rbp)
        cmpl    $0, -4(%rbp)
        je      .LBB0_2
# %bb.1:                                # %if.then
        movb    $0, %al
        callq   bar@PLT
.LBB0_2:                                # %if.end
        addq    $16, %rsp
        popq    %rbp
        retq

The JE instruction assembles to either a short jump (8-bit relative
offset) or near jump (32-bit relative offset).

# -mrelax-all
MCSection
  MCDataFragment: empty
  MCAlignFragment: alignment=4
  MCDataFragment: instructions including JE (jump near if equal, 6 bytes)

# -mno-relax-all
MCSection
  MCDataFragment: empty
  MCAlignFragment: alignment=4
  MCDataFragment: instructions before JE (push; mov; sub; mov; cmp)
  MCRelaxableFragment: JE (jump short if equal, 2 bytes). This JE could be expanded, but not in this case.
  MCDataFragment: instructions after JE (mov; call; add; pop; ret)
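
The two layouts can be mimicked with a toy model. This is a simplified sketch, not MC's actual classes: DataFragment, RelaxableFragment and emit are stand-ins I made up, the leading empty data fragment and the align fragment are left out, and the byte sizes are the standard x86-64 encodings of the instructions in foo, filled in by me rather than taken from the listing.

```python
from dataclasses import dataclass

@dataclass
class DataFragment:
    size: int = 0  # bytes of instructions whose size is already final

@dataclass
class RelaxableFragment:
    size: int  # current (short) size; a later pass may grow it

# (mnemonic, size, near_size); near_size is None for fixed-size instructions.
FOO = [
    ("push", 1, None), ("mov", 3, None), ("sub", 4, None),
    ("mov", 3, None), ("cmp", 4, None),
    ("je", 2, 6),
    ("mov", 2, None), ("call", 5, None), ("add", 4, None),
    ("pop", 1, None), ("ret", 1, None),
]

def emit(instrs, relax_all):
    frags = []
    for _name, size, near_size in instrs:
        relaxable = near_size is not None
        if relaxable and not relax_all:
            # Keep the jump in its own fragment so it can still grow.
            frags.append(RelaxableFragment(size))
            continue
        # Fixed-size instruction, or a jump relaxed on sight: append to data.
        if not frags or not isinstance(frags[-1], DataFragment):
            frags.append(DataFragment())
        frags[-1].size += near_size if relaxable else size
    return frags

for flag in (True, False):
    frags = emit(FOO, relax_all=flag)
    print(flag, len(frags), sum(f.size for f in frags))
# True 1 34
# False 3 30
```

Under these assumptions, relaxing everything up front yields a single fragment but 4 extra bytes for this one branch; keeping the JE relaxable yields three fragments and the shorter encoding.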

The impact of -mrelax-all on text section size is
significant, especially when there are many branch instructions. In an
x86-64 release build of lld, -mrelax-all increased the
.text section size by 7.9%. This translates to a 5.4%
increase in VM size and a 4.6% increase in the overall file size.

Dean Michael Berris proposed removing the
-mrelax-all default for -O0 in 2016, but
the proposal stalled. -mrelax-all caused undesired interaction issues
with RISC-V’s conditional
branch transforms, leading Craig Topper to remove
-mrelax-all at -O0 for RISC-V
recently.

While -mrelax-all might have offered slight compile time
benefits in the past, the gains are negligible today. Benchmarking using
stage 2 builds of Clang showed no measurable difference between
-mrelax-all and -mno-relax-all. On
llvm-compile-time-tracker running the llvm-test-suite/CTMark benchmark,
dropping -mrelax-all actually increased compile time
slightly, by 0.62%, while the text section size decreased
by 4.44%.

A difference for assembly at different optimisation levels would be
quite surprising. GCC/GNU assembler don’t exhibit similar expansion of
JMP/JCC instructions even at -O0.

These arguments strengthen the case for removing
-mrelax-all as the default for -O0. My patch has
landed and will be included in the next major release, LLVM 19.1.

Understanding the compile time difference

I have studied a notoriously huge file,
llvm/lib/Target/X86/X86ISelLowering.cpp.

Fragment count: A significant difference exists in
the number of assembler fragments generated:

-mrelax-all: 89633
-mno-relax-all: 143852

With -mrelax-all, the number of
MCRelaxableFragments is substantially reduced (to zero when
building Clang). This reduction likely contributes to the compile time
difference.

Fixed-point iteration: -mrelax-all
ensures the fixed-point iteration algorithm (almost always) converges in
a single iteration. In contrast, with -mno-relax-all,
around 6% of sections require additional iterations. However, this
difference is likely not the primary factor affecting compile time.

// -mrelax-all
1: 13919
2: 1

// -mno-relax-all
1: 13103
2: 793
3: 23
4: 1
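
As a quick sanity check on the “around 6%” figure, here is the arithmetic, assuming each line above reads “number of iterations: number of sections” (my reading of the listing; the variable and function names are made up for this sketch).

```python
with_relax_all = {1: 13919, 2: 1}
without_relax_all = {1: 13103, 2: 793, 3: 23, 4: 1}

def share_needing_extra_iterations(histogram):
    """Fraction of sections that did not converge in a single iteration."""
    total = sum(histogram.values())
    return (total - histogram.get(1, 0)) / total

# Both runs cover the same number of sections.
print(sum(with_relax_all.values()), sum(without_relax_all.values()))  # 13920 13920
print(f"{share_needing_extra_iterations(without_relax_all):.2%}")     # 5.87%
print(f"{share_needing_extra_iterations(with_relax_all):.4%}")        # 0.0072%
```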

Why didn’t people complain about the code size increase?

Because people generally care less about -O0 code size.
-O0 is often used alongside -g for debugging
purposes. The total file size increase caused by
-mrelax-all might seem less significant in comparison.

In addition, not all projects can be successfully built with
-O0 optimization. This is typically due to issues like very
large programs or mandatory inlining behavior.

For a discussion on size reduction ideas in ELF relocatable files,
please check out my blog post about Light
ELF.

You might also be interested in my notes about GNU assembler and
LLVM integrated assembler.
