# vllm.utils.torch_utils
### STR_DTYPE_TO_TORCH_DTYPE (module attribute)

```python
STR_DTYPE_TO_TORCH_DTYPE = {
    "float32": torch.float32,
    "half": torch.half,
    "bfloat16": torch.bfloat16,
    "float": torch.float,
    "fp8": torch.uint8,
    "fp8_e4m3": torch.uint8,
    "fp8_e5m2": torch.uint8,
    "int8": torch.int8,
    "fp8_inc": torch.float8_e4m3fn,
    "fp8_ds_mla": torch.uint8,
}
```
### TORCH_DTYPE_TO_NUMPY_DTYPE (module attribute)

```python
TORCH_DTYPE_TO_NUMPY_DTYPE = {
    torch.float16: np.float16,
    torch.float32: np.float32,
    torch.float64: np.float64,
    torch.uint8: np.uint8,
    torch.int32: np.int32,
    torch.int64: np.int64,
}
```
### _StreamPlaceholder
### _cuda_device_count_stateless (cached)
### _generate_random_fp8
### _get_precision_level
### _is_torch_equal
### _is_torch_equal_or_newer
### async_tensor_h2d

```python
async_tensor_h2d(
    data: list,
    dtype: dtype,
    target_device: str | device,
    pin_memory: bool,
) -> Tensor
```

Asynchronously create a tensor and copy it from host to device.
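A minimal usage sketch; the list contents are illustrative, not from the source:

```python
import torch

from vllm.utils.torch_utils import async_tensor_h2d

slot_mapping = [0, 1, 2, 3]  # hypothetical host-side data
gpu_tensor = async_tensor_h2d(
    slot_mapping,
    dtype=torch.int32,
    target_device="cuda",
    pin_memory=True,  # a pinned staging buffer is what makes the copy async
)
```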
### common_broadcastable_dtype

```python
common_broadcastable_dtype(dtypes: Collection[dtype])
```

Get the common dtype to which all of the given dtypes can be cast without losing any information.
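For example, assuming the usual lossless-cast rules between floating-point widths:

```python
import torch

from vllm.utils.torch_utils import common_broadcastable_dtype

# float16 casts to float32 without information loss, but not vice versa,
# so float32 is the common dtype for this collection.
dtype = common_broadcastable_dtype([torch.float16, torch.float32])
assert dtype == torch.float32
```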
### create_kv_caches_with_random

```python
create_kv_caches_with_random(
    num_blocks: int,
    block_size: int,
    num_layers: int,
    num_heads: int,
    head_size: int,
    cache_dtype: str | dtype | None,
    model_dtype: str | dtype | None = None,
    seed: int | None = None,
    device: str | None = "cuda",
) -> tuple[list[Tensor], list[Tensor]]
```
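A hedged sketch of calling this test helper; the sizes are illustrative, and the `"auto"` cache dtype (meaning: follow `model_dtype`) is an assumption:

```python
import torch

from vllm.utils.torch_utils import create_kv_caches_with_random

key_caches, value_caches = create_kv_caches_with_random(
    num_blocks=16,
    block_size=16,
    num_layers=2,
    num_heads=8,
    head_size=64,
    cache_dtype="auto",  # assumption: "auto" falls back to model_dtype
    model_dtype=torch.float16,
    seed=0,
    device="cuda",
)
assert len(key_caches) == len(value_caches) == 2  # one cache pair per layer
```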
### create_kv_caches_with_random_flash

```python
create_kv_caches_with_random_flash(
    num_blocks: int,
    block_size: int,
    num_layers: int,
    num_heads: int,
    head_size: int,
    cache_dtype: str | dtype | None,
    model_dtype: str | dtype | None = None,
    seed: int | None = None,
    device: str | None = "cuda",
    cache_layout: str | None = "NHD",
) -> tuple[list[Tensor], list[Tensor]]
```
### cuda_device_count_stateless

```python
cuda_device_count_stateless() -> int
```

Get the number of CUDA devices, caching the result based on the value of `CUDA_VISIBLE_DEVICES` at the time of the call.

This should be used instead of `torch.cuda.device_count()` unless `CUDA_VISIBLE_DEVICES` has already been set to the desired value.
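A sketch of the intended usage pattern (assumes a machine with at least two visible GPUs):

```python
import os

from vllm.utils.torch_utils import cuda_device_count_stateless

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
print(cuda_device_count_stateless())  # 2

# Unlike torch.cuda.device_count(), a later change to CUDA_VISIBLE_DEVICES
# is reflected, because the cache is keyed on the variable's current value.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
print(cuda_device_count_stateless())  # 1
```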
### current_stream

```python
current_stream() -> Stream
```

Replaces `torch.cuda.current_stream()` with `vllm.utils.current_stream()`. It turns out that `torch.cuda.current_stream()` is quite expensive, as it constructs a new stream object on every call. Here we patch `torch.cuda.set_stream` to keep track of the current stream directly, so that calls to `torch.cuda.current_stream()` can be avoided.

The underlying hypothesis is that we never call `torch._C._cuda_setStream` from C/C++ code.
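A small sketch comparing the two calls; the equality check relies on `torch.cuda.Stream` comparing by the underlying stream handle:

```python
import torch

from vllm.utils.torch_utils import current_stream

s = current_stream()             # cheap: returns the tracked stream object
t = torch.cuda.current_stream()  # expensive: builds a new Stream wrapper
assert s == t                    # both refer to the same underlying stream
```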
### direct_register_custom_op

```python
direct_register_custom_op(
    op_name: str,
    op_func: Callable,
    mutates_args: list[str] | None = None,
    fake_impl: Callable | None = None,
    target_lib: Library | None = None,
    dispatch_key: str | None = None,
    tags: tuple[Tag, ...] = (),
)
```

`torch.library.custom_op` can have significant overhead because it needs to consider complicated dispatching logic. This function directly registers a custom op and dispatches it to the CUDA backend. See https://gist.github.com/youkaichao/ecbea9ec9fc79a45d2adce1784d7a9a5 for more details.

By default, the custom op is registered to the vLLM library. If you want to register it to a different library, pass the library object via the `target_lib` argument.

IMPORTANT: the lifetime of the operator is tied to the lifetime of the library object. If you want to bind the operator to a different library, make sure the library object is alive when the operator is used.
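A hedged sketch of registering a toy op. `silly_add` and its fake (meta) implementation are hypothetical names; the default dispatch to CUDA and the registration under `torch.ops.vllm` follow the description above:

```python
import torch

from vllm.utils.torch_utils import direct_register_custom_op

def silly_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return x + y

def silly_add_fake(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Shape/dtype-only implementation used when tracing, e.g. by torch.compile.
    return torch.empty_like(x)

direct_register_custom_op(
    op_name="silly_add",  # hypothetical op name
    op_func=silly_add,
    mutates_args=[],
    fake_impl=silly_add_fake,
)

a = torch.randn(4, device="cuda")
b = torch.randn(4, device="cuda")
out = torch.ops.vllm.silly_add(a, b)  # registered to the vLLM library by default
```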
### get_cuda_view_from_cpu_tensor

Get a CUDA view of a CPU tensor using Unified Virtual Addressing (UVA).
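A sketch under the assumption that the CPU tensor must live in pinned memory for UVA to apply:

```python
import torch

from vllm.utils.torch_utils import get_cuda_view_from_cpu_tensor

cpu_tensor = torch.ones(4, pin_memory=True)  # assumption: pinned memory required
cuda_view = get_cuda_view_from_cpu_tensor(cpu_tensor)

cpu_tensor[0] = 7.0
print(cuda_view[0])  # the CUDA view observes the host write; no explicit copy
```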
### get_dtype_size
### get_kv_cache_torch_dtype

```python
get_kv_cache_torch_dtype(
    cache_dtype: str | dtype | None,
    model_dtype: str | dtype | None = None,
) -> dtype
```
### is_lossless_cast

Test whether it is lossless to cast a tensor from `src_dtype` to `tgt_dtype`.
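For example, widening a float is lossless while narrowing is not:

```python
import torch

from vllm.utils.torch_utils import is_lossless_cast

assert is_lossless_cast(torch.float16, torch.float32)      # widening
assert not is_lossless_cast(torch.float32, torch.float16)  # narrowing
```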
### is_torch_equal

Check if the installed torch version is == the target version.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `target` | `str` | A version string, like `"2.6.0"`. | required |

Returns:

| Type | Description |
| --- | --- |
| `bool` | Whether the condition is met. |
### is_torch_equal_or_newer

Check if the installed torch version is >= the target version.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `target` | `str` | A version string, like `"2.6.0"`. | required |

Returns:

| Type | Description |
| --- | --- |
| `bool` | Whether the condition is met. |
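Typical use is feature-gating on the installed PyTorch version:

```python
from vllm.utils.torch_utils import is_torch_equal_or_newer

if is_torch_equal_or_newer("2.6.0"):
    # Safe to rely on behavior introduced in torch 2.6.
    ...
```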
### kv_cache_dtype_str_to_dtype

```python
kv_cache_dtype_str_to_dtype(
    kv_cache_dtype: str, model_config: ModelConfig
) -> dtype
```
### make_ndarray_with_pad

```python
make_ndarray_with_pad(
    x: list[list[T]],
    pad: T,
    dtype: DTypeLike,
    *,
    max_len: int | None = None,
) -> NDArray
```

Make a padded array from 2D inputs.

The padding is applied to the end of each inner list until it reaches `max_len`.
### make_tensor_with_pad

```python
make_tensor_with_pad(
    x: list[list[T]],
    pad: T,
    dtype: dtype,
    *,
    max_len: int | None = None,
    device: str | device | None = None,
    pin_memory: bool = False,
) -> Tensor
```

Make a padded tensor from 2D inputs.

The padding is applied to the end of each inner list until it reaches `max_len`.
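For example:

```python
import torch

from vllm.utils.torch_utils import make_tensor_with_pad

# Rows of unequal length are right-padded with `pad` up to the longest row
# (or up to max_len when it is given).
t = make_tensor_with_pad([[1, 2, 3], [4]], pad=0, dtype=torch.int64)
# tensor([[1, 2, 3],
#         [4, 0, 0]])
```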
### set_default_torch_dtype

```python
set_default_torch_dtype(dtype: dtype)
```

Sets the default torch dtype to the given dtype.
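A sketch under the assumption that this is usable as a context manager that restores the previous default dtype on exit (an assumption; the rendered signature does not show it):

```python
import torch

from vllm.utils.torch_utils import set_default_torch_dtype

# Assumption: context-manager semantics with automatic restore.
with set_default_torch_dtype(torch.float16):
    w = torch.empty(8, 8)
assert w.dtype == torch.float16
```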
### set_default_torch_num_threads

```python
set_default_torch_num_threads(num_threads: int)
```

Sets the default number of threads for PyTorch to the given value.
### weak_ref_tensor

Create a weak reference to a tensor. The new tensor shares the same data as the original tensor but does not keep it alive.
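A sketch of the lifetime semantics described above:

```python
import torch

from vllm.utils.torch_utils import weak_ref_tensor

x = torch.randn(4, device="cuda")
ref = weak_ref_tensor(x)  # shares x's storage without keeping x alive

del x
# ref's storage may now be reclaimed by the allocator; reading it is unsafe.
```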
### weak_ref_tensors

```python
weak_ref_tensors(
    tensors: Tensor
    | list[Tensor]
    | tuple[Tensor]
    | IntermediateTensors,
) -> Tensor | list[Any] | tuple[Any] | Any
```

Convenience function to create weak references to tensors, for a single tensor, a list of tensors, or a tuple of tensors.