Switch to native 512b vectors to improve performance#110

Open
AngryLoki wants to merge 1 commit intoamd:develfrom
AngryLoki:vec32

Conversation

@AngryLoki

Per the AM027 AIE-ML v2 Architecture Manual, the full vector register is 512b. The AIE2P target provides a full 512b load instruction (`vldb x, [p], #64`) as well as a half-width load (`vldb wl, [p], #32`, which wastes bandwidth), along with 32-lane bf16 `vmull`, `vadd`, `max`, etc.

Some operations, such as the transcendentals `vtanh` and `vexp2`, are only available as 16-float instructions (there is no 32-bfloat variant). Moreover, on AIE2 there is no hardware `vtanh` instruction at all, which results in a LUT-based implementation. On AIE2, a 32-element bf16 load requires two 256b load instructions (`vlda wl` + `vlda wh`, or two sequential `vldb wl`). On AIE2P, the same load is a single 512b instruction (`vldb x`). The interesting point is that despite needing two load instructions, AIE2 still benefits from v32, because the vector ALU and accumulator paths operate on 512b registers natively anyway, reducing instructions per element.

Switching to 32-element bf16 vectors provides a consistent 1.3x-1.5x speedup for gelu/sigmoid/silu/tanh/rope/dequant/layer_norm/rms_norm/mul/add on AIE2P. AIE2 should also see improved performance (based on reading the generated assembly; not verified on real hardware).

For tanh and exp2 the strategy is as simple as:

  1. Load 32 bf16 elements
  2. Split: `vec.extract<16>(0)`, `vec.extract<16>(1)`
  3. Apply tanh to each half (2 calls)
  4. Merge: `aie::concat(lo, hi)` -> 32 bf16 elements
  5. Continue with 32-wide bf16 operations

For llama the improvements are also there (a few percent for TTFT); +12.5% for swiglu (2048x2048). Not a large gain, as these operations are not a bottleneck, but still nice to have.

Other notes:

  1. There was a minor issue in `aie2/relu.cc`: the pointer was incremented by 16 while working with v32 vectors.
  2. `layer_norm.cc` was affected by an unhandled case in `copyPhysReg` after a few operations over `aie::accum<accfloat, 16>` (Xilinx/llvm-aie#734), which was fixed recently, so it now works fine with 32-sized vectors.

PR Merge Checklist

  1. The PR is rebased on the latest devel commit and pointing to devel.
  2. Your PR has been reviewed and approved.
  3. All checks are passing.