
[Triton Tutorial] Libdevice (tl_extra.libdevice) Functions

Triton is a language and compiler for parallel programming. It aims to provide a Python-based programming environment for efficiently writing custom DNN compute kernels that can run at maximal throughput on modern GPU hardware.

More Triton documentation in Chinese is available at https://triton.hyper.ai/

Triton can invoke custom functions from external libraries. In this example, we will use the libdevice library to apply asin to a tensor. Refer to the following links for details on the semantics of all available libdevice functions:

  • CUDA: https://docs.nvidia.com/cuda/libdevice-users-guide/index.html
  • HIP: https://github.com/ROCm/llvm-project/tree/amd-staging/amd/device-libs/ocml/src

In libdevice.py, we try to aggregate functions that perform the same computation but operate on different data types. For example, both __nv_asin and __nv_asinf compute the principal value of the arc sine of their input, but __nv_asin operates on doubles while __nv_asinf operates on floats. With Triton, you can simply call tl.math.asin; based on the input and output types, Triton automatically selects the correct underlying device function to invoke.

asin kernel

import torch

import triton
import triton.language as tl
from triton.language.extra import libdevice


@triton.jit
def asin_kernel(
    x_ptr,
    y_ptr,
    n_elements,
    BLOCK_SIZE: tl.constexpr,
):
    # Each program instance handles one contiguous block of BLOCK_SIZE elements.
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    # Call into libdevice; the __nv_asin / __nv_asinf variant is chosen
    # automatically from the element type of x.
    x = libdevice.asin(x)
    tl.store(y_ptr + offsets, x, mask=mask)

Using the default libdevice library path

We can use the default libdevice library path encoded in triton/language/math.py.

torch.manual_seed(0)
size = 98432
x = torch.rand(size, device='cuda')
output_triton = torch.zeros(size, device='cuda')
output_torch = torch.asin(x)
assert x.is_cuda and output_triton.is_cuda
n_elements = output_torch.numel()
# Launch enough program instances to cover all elements.
grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']), )
asin_kernel[grid](x, output_triton, n_elements, BLOCK_SIZE=1024)
print(output_torch)
print(output_triton)
print(f'The maximum difference between torch and triton is '
      f'{torch.max(torch.abs(output_torch - output_triton))}')

Out:

tensor([0.4105, 0.5430, 0.0249,  ..., 0.0424, 0.5351, 0.8149], device='cuda:0')
tensor([0.4105, 0.5430, 0.0249,  ..., 0.0424, 0.5351, 0.8149], device='cuda:0')
The maximum difference between torch and triton is 2.384185791015625e-07
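
Because the dispatch described earlier is driven by the element type, the same kernel also handles double precision, in which case libdevice.asin lowers to __nv_asin rather than __nv_asinf. The following is a minimal sketch reusing asin_kernel, grid, and n_elements from above; the float64 run is an illustration, not part of the original tutorial.

# Same kernel, double-precision input: Triton selects __nv_asin automatically.
x_fp64 = torch.rand(size, device='cuda', dtype=torch.float64)
output_fp64 = torch.empty_like(x_fp64)
asin_kernel[grid](x_fp64, output_fp64, n_elements, BLOCK_SIZE=1024)
# Should agree with torch.asin evaluated in double precision.
print(f'fp64 max difference: {torch.max(torch.abs(torch.asin(x_fp64) - output_fp64))}')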

Customizing the libdevice library path

We can also customize the path to the libdevice library by passing it to the asin kernel.

output_triton = torch.empty_like(x)
asin_kernel[grid](x, output_triton, n_elements, BLOCK_SIZE=1024)
print(output_torch)
print(output_triton)
print(f'The maximum difference between torch and triton is '
      f'{torch.max(torch.abs(output_torch - output_triton))}')

Out:

tensor([0.4105, 0.5430, 0.0249,  ..., 0.0424, 0.5351, 0.8149], device='cuda:0')
tensor([0.4105, 0.5430, 0.0249,  ..., 0.0424, 0.5351, 0.8149], device='cuda:0')
The maximum difference between torch and triton is 2.384185791015625e-07
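
The call above still relies on the default path. Below is a minimal sketch of passing an explicit path, assuming a Triton version whose kernel launch accepts an extern_libs mapping; the bitcode location shown is illustrative and should be adjusted to your installation.

# Hedged sketch: hand an explicit libdevice bitcode file to the launch.
# The path below is an assumed location; point it at the libdevice.10.bc
# shipped with your CUDA toolkit or Triton installation.
libdevice_path = '/usr/local/cuda/nvvm/libdevice/libdevice.10.bc'
extern_libs = {'libdevice': libdevice_path}

output_triton = torch.empty_like(x)
asin_kernel[grid](x, output_triton, n_elements, BLOCK_SIZE=1024, extern_libs=extern_libs)
print(f'Max difference with a custom libdevice path: '
      f'{torch.max(torch.abs(output_torch - output_triton))}')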

Download Jupyter notebook: 07-extern-functions.ipynb

Download Python source code: 07-extern-functions.py

Download zipped: 07-extern-functions.zip
