module llutil.pycuda
Integrates PyCUDA to PyTorch and ice.
Global Variables
- TYPE_CHECKING
class CUDAModule
Just-In-Time compilation of a set of CUDA kernel functions and device functions from source.
ice.CUDAModule works differently compared to pycuda's SourceModule in following ways:
-
Support efficient multi-dimensional torch.Tensor access with optional boundary check.
-
Automatically handle scalar data type conversion from python/pytorch to c++.
-
Compile error message will report the original source code.
-
Easier API.
-
Configurable in the ice-learn eco-system.
Example:
import ice
M, N, K = 4, 4, 1
a = torch.rand((M, K), dtype=torch.float32).cuda()
b = torch.rand((K, N), dtype=torch.float32).cuda()
c = torch.empty((M, N), dtype=torch.float32).cuda()
kernels = ice.CUDAModule(r"""
__global__ void matmul(Tensor<float, 2> *a, Tensor<float, 2> *b, Tensor<float, 2> *c, int M, int N, int K) {
int m = blockIdx.y * blockDim.y + threadIdx.y;
int n = blockIdx.x * blockDim.x + threadIdx.x;
float v = 0.f;
if (m >= M || n >= N) return;
for (int k = 0; k < K; ++k) {
v += (*a)[m][k] * (*b)[k][n];
}
(*c)[m][n] = v;
}
""", float_bits=32)
kernels.matmul(a, b, c, M, N, K, grid=(N // 32 + 1, M // 32 + 1), block=(32, 32, 1))
torch.cuda.synchronize()
assert torch.allclose(c, torch.mm(a, b))
method __init__
Setup the parameters for compiling a CUDA source.
Args:
-
source(str): CUDA C++ source string. -
float_bits(int): bit width of float values used as tensor scalar. -
int_bits(int, optional): bit width of default int scalar. Defaults to 32. -
include_dirs(list, optional): paths of extra include dirs. Defaults to []. -
boundscheck(bool, optional): enable out of bound check for tensors. Defaults to True. -
**kwds: other keyword args you would like to pass to pycuda'sSourceModule.
Note:
Direct written
floatandinttoken in the source string will be substituted to ensure the default scalar data type matches the tensors. If you do not want this to happen, use more specific CUDA typename such as__half,double,int16_t, etc.