Switch to native 512b vectors to improve performance #110
Open
Conversation
Per the AM027 AIE-ML v2 Architecture Manual, the full vector register is 512b. The AIE2P target provides an instruction for a full 512b load (`vldb x, [p], #64`) or a half-width load (`vldb wl, [p], #32`, which wastes bandwidth), along with optimal 32-lane bf16 `vmull`, `vadd`, `max`, etc. Some operations, such as the transcendentals `vtanh` and `vexp2`, only come in 16-float variants (there are no 32-bfloat forms). Moreover, AIE2 has no hardware `vtanh` instruction at all, which results in a LUT-based implementation.

On AIE2, a 32-element bf16 load requires two 256b load instructions (`vlda wl` + `vlda wh`, or two sequential `vldb wl`). On AIE2P, the same load is a single 512b instruction (`vldb x`). The interesting point is that, despite needing two load instructions, AIE2 still benefits from v32, because the vector ALU and accumulator paths operate on 512b registers natively anyway, reducing instructions per element.

Switching to 32-element bf16 vectors provides a consistent 1.3-1.5x speedup for gelu/sigmoid/silu/tanh/rope/dequant/layer_norm/rms_norm/mul/add on AIE2P. It should also improve performance on AIE2 (judging from the generated assembly; not verified on real hardware).

For tanh and exp2 the strategy is simple (see the first sketch below):

1. Load 32 bf16 elements.
2. Split: `vec.extract<16>(0)`, `vec.extract<16>(1)`.
3. Apply tanh to each half (2 calls).
4. Merge: `aie::concat(lo, hi)` -> 32 bf16 elements.
5. Continue with 32-wide bf16 operations.

For llama the improvements are there as well: a few percent for TTFT, and +12.5% for swiglu (2048x2048). Not a large improvement, since these operations are not a bottleneck, but still nice to have.

Other notes:

1. There was a minor bug in `aie2/relu.cc`: the pointer was incremented by 16 while working with v32 vectors (see the stride comment in the second sketch below).
2. `layer_norm.cc` was affected by Xilinx/llvm-aie#734 (an `unhandled case in copyPhysReg` after a few operations over `aie::accum<accfloat, 16>`), which was fixed recently, so it now works fine with 32-element vectors.
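A minimal sketch of the split/apply/merge strategy using the AIE API; `tanh16` and `tanh32` are hypothetical names here, with `tanh16` standing in for whatever 16-lane tanh primitive the kernel already uses (hardware `vtanh` on AIE2P, the LUT-based implementation on AIE2):

```cpp
#include <aie_api/aie.hpp>

// Hypothetical stand-in for the existing 16-lane tanh primitive:
// hardware vtanh on AIE2P, LUT-based implementation on AIE2.
aie::vector<bfloat16, 16> tanh16(aie::vector<bfloat16, 16> v);

// Split/apply/merge: 32 bf16 lanes in, 32 bf16 lanes out.
inline aie::vector<bfloat16, 32> tanh32(aie::vector<bfloat16, 32> v) {
    auto lo = v.extract<16>(0);                  // lanes 0..15
    auto hi = v.extract<16>(1);                  // lanes 16..31
    return aie::concat(tanh16(lo), tanh16(hi));  // back to one 512b vector
}
```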
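And a minimal sketch of the 32-wide elementwise pattern (an illustrative `mul32` kernel, not code from this PR), showing both a single `aie::load_v<32>` per operand, which the compiler lowers to one 512b `vldb x` on AIE2P or a pair of 256b loads on AIE2, and the 32-element pointer stride whose absence was the `relu.cc` bug:

```cpp
#include <aie_api/aie.hpp>

// Elementwise bf16 multiply; assumes n is a multiple of 32.
void mul32(const bfloat16 *__restrict a, const bfloat16 *__restrict b,
           bfloat16 *__restrict out, unsigned n) {
    for (unsigned i = 0; i < n; i += 32) {  // stride must match the vector
        auto va = aie::load_v<32>(a + i);   // width: advancing by 16 with
        auto vb = aie::load_v<32>(b + i);   // v32 vectors was the relu.cc bug
        aie::accum<accfloat, 32> acc = aie::mul(va, vb);
        aie::store_v(out + i, acc.to_vector<bfloat16>());
    }
}
```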