vllm.model_executor.warmup.kernel_warmup ¶
Warm up kernels used during model execution. This is especially useful for JIT-compiled kernels, since we don't want compilation to happen during model execution.
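The idea can be sketched in plain Python. This is an illustrative stand-in, not vLLM's implementation: `jit_kernel` mimics a JIT'ed kernel whose first call per input shape pays a one-time compilation cost, and the warmup pass pre-triggers that cost for expected shapes before serving begins.

```python
import functools
import time

# Hypothetical stand-in for a JIT'ed kernel: the first call for a given
# input shape pays a one-time "compilation" cost; later calls with the
# same shape hit the cache (mimics shape-specialized JIT compilation).
@functools.lru_cache(maxsize=None)
def _compile_kernel(shape):
    time.sleep(0.01)  # stand-in for expensive JIT compilation
    return lambda xs: [x * 2 for x in xs]

def jit_kernel(xs):
    return _compile_kernel(len(xs))(xs)

def kernel_warmup(expected_sizes=(8, 16)):
    """Invoke the kernel once per expected input shape so compilation
    happens now, rather than during model execution."""
    for n in expected_sizes:
        jit_kernel([0.0] * n)

kernel_warmup()
# Subsequent calls with warmed-up shapes avoid the compilation cost.
assert jit_kernel([1.0] * 8) == [2.0] * 8
```

After warmup, requests whose shapes were covered never stall on compilation; uncovered shapes still compile lazily on first use.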
flashinfer_autotune ¶
flashinfer_autotune(runner: GPUModelRunner) -> None
Autotune FlashInfer operations. FlashInfer has many implementations of the same operation; autotuning benchmarks each implementation and stores the results. The results are cached transparently, and future FlashInfer calls use the best implementation. Without autotuning, FlashInfer falls back on heuristics, which may be significantly slower.
Source code in vllm/model_executor/warmup/kernel_warmup.py
flashinfer_autotune_supported ¶
flashinfer_autotune_supported(
vllm_config: VllmConfig,
) -> bool
Record known issues with vLLM + FlashInfer autotune here. Return True if and only if FlashInfer autotune will run through without issues.
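A check of this shape typically enumerates known-bad configurations explicitly. The sketch below is hypothetical: `FakeVllmConfig` and both conditions are invented stand-ins, not vLLM's real `VllmConfig` fields or its actual known-issue list.

```python
from dataclasses import dataclass

# Hypothetical, simplified stand-in for VllmConfig: only the fields the
# check below inspects (invented for illustration).
@dataclass
class FakeVllmConfig:
    enforce_eager: bool = False
    quantization: str = ""

def flashinfer_autotune_supported(config: FakeVllmConfig) -> bool:
    """Return True iff no recorded issue applies. Each known-bad
    combination gets one explicit, commented check (illustrative
    conditions, not vLLM's real ones)."""
    if config.enforce_eager:
        return False  # assumed issue: autotune needs compiled graphs
    if config.quantization == "some_unsupported_scheme":
        return False  # assumed issue: scheme breaks autotune benchmarks
    return True

assert flashinfer_autotune_supported(FakeVllmConfig())
assert not flashinfer_autotune_supported(FakeVllmConfig(enforce_eager=True))
```

Keeping each known issue as its own guarded early return makes it easy to add a new issue, or delete one once fixed upstream, without touching the rest of the function.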
Source code in vllm/model_executor/warmup/kernel_warmup.py
kernel_warmup ¶
kernel_warmup(worker: Worker)