Switch to native 512b vectors to improve performance #110
Open
Conversation
Per the AM027 AIE-ML v2 Architecture Manual, the full vector register is 512b. The AIE2P target provides an instruction for a full 512b load (`vldb x, [p], #64`) or a half-width load (`vldb wl, [p], #32`, which wastes bandwidth), along with optimal 32-lane bf16 `vmull`, `vadd`, `max`, etc. Some operations, such as the transcendentals `vtanh` and `vexp2`, only come in 16-float variants (there are no 32-bfloat forms). Moreover, AIE2 has no hardware `vtanh` instruction at all, which results in a LUT-based implementation.

On AIE2, a 32-element bf16 load requires two 256b load instructions (`vlda wl` + `vlda wh`, or two sequential `vldb wl`). On AIE2P, the same load is a single 512b instruction (`vldb x`). The interesting point is that, despite needing two load instructions, AIE2 still benefits from v32, because the vector ALU and accumulator paths operate on 512b registers natively anyway, reducing instructions per element.

Switching to 32-element bf16 vectors provides a consistent 1.3-1.5x speedup for gelu/sigmoid/silu/tanh/rope/dequant/layer_norm/rms_norm/mul/add on AIE2P. It should also improve performance on AIE2 (judging from the generated assembly; not verified on real hardware).

For tanh and exp2 the strategy is simple (see the first sketch below):

1. Load 32 bf16 elements.
2. Split: `vec.extract<16>(0)`, `vec.extract<16>(1)`.
3. Apply tanh to each half (2 calls).
4. Merge: `aie::concat(lo, hi)` -> 32 bf16 elements.
5. Continue with 32-wide bf16 operations.

For llama the improvements are there as well: a few percent for TTFT, and +12.5% for swiglu (2048x2048). Not a large improvement, since these operations are not a bottleneck, but still nice to have.

Other notes:

1. There was a minor bug in `aie2/relu.cc`: the pointer was incremented by 16 while working with v32 vectors (see the stride comment in the second sketch below).
2. `layer_norm.cc` was affected by Xilinx/llvm-aie#734 (an `unhandled case in copyPhysReg` after a few operations over `aie::accum<accfloat, 16>`), which was fixed recently, so it now works fine with 32-element vectors.
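A minimal sketch of the split/apply/merge strategy using the AIE API; `tanh16` and `tanh32` are hypothetical names here, with `tanh16` standing in for whatever 16-lane tanh primitive the kernel already uses (hardware `vtanh` on AIE2P, the LUT-based implementation on AIE2):

```cpp
#include <aie_api/aie.hpp>

// Hypothetical stand-in for the existing 16-lane tanh primitive:
// hardware vtanh on AIE2P, LUT-based implementation on AIE2.
aie::vector<bfloat16, 16> tanh16(aie::vector<bfloat16, 16> v);

// Split/apply/merge: 32 bf16 lanes in, 32 bf16 lanes out.
inline aie::vector<bfloat16, 32> tanh32(aie::vector<bfloat16, 32> v) {
    auto lo = v.extract<16>(0);                  // lanes 0..15
    auto hi = v.extract<16>(1);                  // lanes 16..31
    return aie::concat(tanh16(lo), tanh16(hi));  // back to one 512b vector
}
```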
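And a minimal sketch of the 32-wide elementwise pattern (an illustrative `mul32` kernel, not code from this PR), showing both a single `aie::load_v<32>` per operand, which the compiler lowers to one 512b `vldb x` on AIE2P or a pair of 256b loads on AIE2, and the 32-element pointer stride whose absence was the `relu.cc` bug:

```cpp
#include <aie_api/aie.hpp>

// Elementwise bf16 multiply; assumes n is a multiple of 32.
void mul32(const bfloat16 *__restrict a, const bfloat16 *__restrict b,
           bfloat16 *__restrict out, unsigned n) {
    for (unsigned i = 0; i < n; i += 32) {  // stride must match the vector
        auto va = aie::load_v<32>(a + i);   // width: advancing by 16 with
        auto vb = aie::load_v<32>(b + i);   // v32 vectors was the relu.cc bug
        aie::accum<accfloat, 32> acc = aie::mul(va, vb);
        aie::store_v(out + i, acc.to_vector<bfloat16>());
    }
}
```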