Notice both the register usage and the instruction count are halved. How do we support this in Mesa? Mesa pipes 16-bitness through NIR into our backend, so we must ensure types are respected throughout our backend compiler. To do so, we include types explicitly in our backend intermediate representation, which the NIR-to-BIR pass simply passes through from NIR. Certain backend optimizations have to be careful to respect these types, whereas others work with little change provided the IR is well-designed. Scheduling is mostly unaffected. But where are there major changes?
Enter register allocation.
Register allocation
A fundamental problem every backend compiler faces is register allocation, mapping abstract IR variables to concrete machine registers:
```
0 = load uniform #0
1 = load attribute #0
2 = add 0, 1
3 = mul 2, 2
```

becomes

```
R0 = load uniform #0
R1 = load attribute #0
R0 = add R0, R1
R0 = mul R0, R0
```

Traditionally, register allocation is modeled by a graph, where each IR variable is represented by a node. Any variables that are simultaneously live, in the sense that both of their values will be read later in the program, are said to interfere, since they require separate registers to avoid clobbering each other's data. Interference is represented by edges. Then the problem reduces to graph coloring: finding colors (registers) for each node (variable) that don't coincide with the colors (registers) of any nodes connected by edges (interfering variables). Initially, Panfrost used a graph coloring approach.
However, the algorithm above is difficult to adapt to irregular vector architectures. One alternative approach is to model the register file explicitly, modeling interference as constraints and using a constraint solver to allocate registers. For the above program, letting k(i) denote the register allocated to variable i, there is a single constraint on the registers, k(0) ≠ k(1), since (0, 1) is the only pair of interfering nodes, yielding a valid solution k(0) = 0, k(1) = 1, k(2) = 0, k(3) = 0, corresponding to the allocation above.
As-is, this approach is mathematically equivalent to graph coloring. However, unlike graph coloring, it extends easily to model vectorization, enabling per-component liveness tracking, so some components of a vector are allocated to a register while others are reused for another value. It also extends easily to vectors of varying type sizes, crucial for 16-bit support, whereas the corresponding generalization for graph coloring is much more complicated.
This work was originally conducted in October for the Midgard compiler, but the approach works just as well for Bifrost. Although Bifrost is conceptually "almost" scalar, there are enough corner cases where we dip into vector structures that a vector-aware register allocator is useful. In particular, 16-bit instructions involve subdividing 32-bit registers, and vectorized input/output requires contiguous registers; both are easily modeled with linear-style constraints.
Packing
The final stage of a backend compiler is packing (final code generation), taking the scheduled, register-allocated IR and assembling a final binary. Compared to Midgard, packing for Bifrost is incredibly complicated. Why?
Vectorized programs for vector architectures can be smaller than equivalent programs for scalar architectures. The above instruction sequence to add a 4-channel vector corresponds to just a single instruction on Midgard:
```
VADD.FADD R0.xyzw, R0.xyzw, R1.xyzw
```

We would like to minimize program size, since accesses to main memory are slow and increasing the size of the instruction cache is expensive. By minimizing size, smaller caches can be used with better efficiency. Unfortunately, naively scalarizing the architecture by a factor of 4 would appear to inflate program size by 4x, requiring a 4x larger cache for the same performance.
We can do better than simply duplicating instructions. First, by simplifying the vector semantics (since we know most code will be scalar or small contiguous vectors), we eliminate vector write masks and swizzles. But this may not be good enough.
Bifrost goes beyond this to save instruction bits any way it can, since small savings in instruction encoding accumulate quickly in complex shaders. For example, Bifrost separates medium-latency register file accesses from low-latency "port" accesses, loading registers into ports ahead of the instruction. There are 64 registers (32 bits each), requiring 6 bits to index a register number. The structure to specify which registers should be loaded looks like:
```
unsigned port_0 : 5;
unsigned port_1 : 6;
```

We have 6 bits to encode port 1, but only 5 bits for port 0. Does that mean port 0 can only load registers R0-R31, instead of the full range?
Actually, no. If port 0 is smaller than port 1, the port numbers are as you would expect. But Bifrost has a trick: if port 0 is larger than port 1, then the hardware subtracts each port from 63 to get the actual port number. In effect, the ordering of the ports is used as an implicit extra bit, storing 12 bits of port data in only 11 bits on the wire. (What if the port numbers are equal? Then just reuse the same port!)
Similar tricks permeate the architecture, a win for code size but a loss for compiler packing complexity. The good news is that our compiler’s packing code is self-contained and unit tested independent of the rest of the compiler.
Conclusion
Putting it all together, we have the beginnings of a Bifrost compiler, sufficient for the screenshots above. Next up is adding support for more complex instructions, and scheduling to support more complex shaders.
Architecturally, Bifrost turns Midgard on its head, ideally bringing performance improvements but rather complicating the driver. While Bifrost support in Panfrost is still early, it's progressing rapidly. The compiler code required for the screenshots above is all upstreamed, and the command stream code is working its way up. So stay tuned, and happy hacking.