WIP: Dp4MatMulNBits accuracy level 4 matmul for WebGPU EP #23365

Open · sushraja-msft wants to merge 16 commits into main from user/sushraja/dp4_matmul
Conversation

@sushraja-msft (Contributor) commented on Jan 14, 2025

Description

This change implements accuracy level 4 (quantize A to int8) matmul for the WebGPU EP. The matmul kernel uses DP4A instructions for the multiplication; to keep the DP4A units fed, a co-operative matrix multiplication is implemented that preloads the rows/columns into local variables before the multiply operation.
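
For reference, a single DP4A step multiplies four packed int8 lanes and accumulates the products into a 32-bit integer, so a K-deep dot product costs K/4 such steps. Below is a minimal scalar sketch of that semantic in C++ (what WGSL exposes as dot4I8Packed); `Dp4a` is an illustrative helper name, not code from this PR.

```cpp
#include <cstdint>

// Scalar reference for one DP4A step: interpret each u32 as four packed
// int8 lanes, multiply lane-wise, and accumulate into an i32.
// Illustrative only; the PR's shader uses the hardware instruction.
int32_t Dp4a(uint32_t a, uint32_t b, int32_t acc) {
  for (int lane = 0; lane < 4; ++lane) {
    const auto av = static_cast<int8_t>((a >> (8 * lane)) & 0xFFu);
    const auto bv = static_cast<int8_t>((b >> (8 * lane)) & 0xFFu);
    acc += static_cast<int32_t>(av) * static_cast<int32_t>(bv);
  }
  return acc;
}
```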

Credits to @qjia7 for help with the quantizer shader.
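
For intuition on the quantizer: accuracy level 4 quantizes each block of A to int8 with a per-block scale before the DP4A matmul, and the int32 accumulators would then be rescaled by the A and weight block scales to recover float outputs. A minimal host-side sketch follows; the symmetric scaling, round-to-nearest behavior, and `QuantizeBlockToInt8` helper are assumptions for illustration, not necessarily what the shader does.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Sketch of symmetric per-block int8 quantization of the A matrix.
// Assumptions: symmetric range [-127, 127], round-to-nearest; the
// actual quantizer shader in this PR may differ.
void QuantizeBlockToInt8(const float* a, int block_size,
                         int8_t* out, float* scale) {
  float max_abs = 0.0f;
  for (int i = 0; i < block_size; ++i) {
    max_abs = std::max(max_abs, std::fabs(a[i]));
  }
  *scale = max_abs / 127.0f;
  const float inv = (*scale > 0.0f) ? 1.0f / *scale : 0.0f;
  for (int i = 0; i < block_size; ++i) {
    const float q = std::clamp(a[i] * inv, -127.0f, 127.0f);
    out[i] = static_cast<int8_t>(std::lround(q));
  }
}
```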

Performance metrics on an Intel ADL/TGL GPU:

```
PS C:\onnxruntime> C:\model_benchmark\model_benchmark.exe -i C:\Phi-3.5-mini-instruct-onnx-web\Phi-3.5-mini-instruct-onnx-web -l 500
Batch size: 1, prompt tokens: 501, tokens to generate: 128
Prompt processing (time to first token):
        avg (us):       2.76762e+06
        avg (tokens/s): 181.022   <<< prefill speed
        p50 (us):       2.74843e+06
        stddev (us):    41756.4
        n:              5 * 501 token(s)
Token generation:
        avg (us):       81500.7
        avg (tokens/s): 12.2698
        p50 (us):       81104.1
        stddev (us):    2961.31
        n:              635 * 1 token(s)
Token sampling:
        avg (us):       13.1836
        avg (tokens/s): 75851.9
        p50 (us):       12
        stddev (us):    6.47085
        n:              640 * 1 token(s)
E2E generation (entire generation loop):
        avg (ms):       13120
        p50 (ms):       13081.6
        stddev (ms):    114.689
        n:              5
Peak working set size (bytes): 5467533312
WebGPU device lost (2): Device was destroyed.
```

For a 500-token prefill, this kernel is 2.10x faster than its F16 counterpart: 181 tokens/s versus the previous prefill record of 86 tokens/s.

To support devices with subgroup sizes of 8 or 32, a no-subgroup version of the same shader is included. It is slower than the subgroup version on ADL:

```
PS C:\onnxruntime> C:\model_benchmark\model_benchmark.exe -i C:\Phi-3.5-mini-instruct-onnx-web\Phi-3.5-mini-instruct-onnx-web -l 500
Batch size: 1, prompt tokens: 501, tokens to generate: 128
Prompt processing (time to first token):
        avg (us):       4.11989e+06
        avg (tokens/s): 121.605
        p50 (us):       4.11847e+06
        stddev (us):    2147.48
        n:              5 * 501 token(s)
Token generation:
        avg (us):       81174.9
        avg (tokens/s): 12.3191
        p50 (us):       81301.1
        stddev (us):    2177.2
        n:              635 * 1 token(s)
Token sampling:
        avg (us):       14.7998
        avg (tokens/s): 67568.3
        p50 (us):       12.3
        stddev (us):    11.5481
        n:              640 * 1 token(s)
E2E generation (entire generation loop):
        avg (ms):       14431.1
        p50 (ms):       14433.8
        stddev (ms):    5.02473
        n:              5
Peak working set size (bytes): 5466480640
WebGPU device lost (2): Device was destroyed.
```

@sushraja-msft changed the title from "Dp4MatMulNBits low accuracy matmul for WebGPU EP" to "WIP: Dp4MatMulNBits low accuracy matmul for WebGPU EP" on Jan 14, 2025
@guschmue added the ep:WebGPU (ort-web webgpu provider) label on Jan 16, 2025
@sushraja-msft changed the title from "WIP: Dp4MatMulNBits low accuracy matmul for WebGPU EP" to "WIP: Dp4MatMulNBits accuracy level 4 matmul for WebGPU EP" on Jan 17, 2025
@sushraja-msft force-pushed the user/sushraja/dp4_matmul branch from 0dd9e67 to 73ee5d1 on Jan 17, 2025 at 20:23
Labels
ep:WebGPU ort-web webgpu provider