vllm.utils.mem_utils

DeviceMemoryProfiler
Source code in vllm/utils/mem_utils.py
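The members of DeviceMemoryProfiler are not listed on this page. The sketch below assumes the common pattern of a context manager that snapshots device memory on entry and exit and exposes the difference in bytes as a consumed_memory attribute; the attribute name and behavior are assumptions, not confirmed by the documentation above.

```python
import torch

from vllm.utils.mem_utils import DeviceMemoryProfiler

# Assumed usage: the profiler records device memory on __enter__ and __exit__
# and exposes the delta (in bytes) as `consumed_memory`.
with DeviceMemoryProfiler() as profiler:
    buf = torch.empty(1 << 28, dtype=torch.uint8, device="cuda")  # ~256 MiB

print(f"allocated inside the block: {profiler.consumed_memory} bytes")
```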
MemoryProfilingResult dataclass
Memory profiling result. All numbers are in bytes.
Source code in vllm/utils/mem_utils.py
after_profile class-attribute instance-attribute
after_profile: MemorySnapshot = field(
default_factory=MemorySnapshot
)
before_create class-attribute instance-attribute
before_create: MemorySnapshot = field(
default_factory=MemorySnapshot
)
before_profile class-attribute instance-attribute
before_profile: MemorySnapshot = field(
default_factory=MemorySnapshot
)
__init__
__init__(
non_kv_cache_memory: int = 0,
torch_peak_increase: int = 0,
non_torch_increase: int = 0,
weights_memory: float = 0,
before_create: MemorySnapshot = MemorySnapshot(),
before_profile: MemorySnapshot = MemorySnapshot(),
after_profile: MemorySnapshot = MemorySnapshot(),
profile_time: float = 0.0,
) -> None
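A minimal sketch of the fields, using the numbers from the quantitative example in the memory_profiling docstring below. In normal use these fields are filled in by memory_profiling() rather than by hand; the arithmetic shown (weights + peak torch increase + non-torch increase) is the relationship described in that docstring. print(result) uses the __repr__ documented next.

```python
from vllm.utils.mem_utils import MemoryProfilingResult

GiB = 1 << 30

# Illustration only: these values mirror the 2 + 2 + 1 GiB example below.
result = MemoryProfilingResult(
    weights_memory=2 * GiB,        # (a.) model weights
    torch_peak_increase=2 * GiB,   # (b.) peak activation tensors during profiling
    non_torch_increase=1 * GiB,    # (c.) e.g. NCCL + attention-backend buffers
)
result.non_kv_cache_memory = (result.weights_memory
                              + result.torch_peak_increase
                              + result.non_torch_increase)  # 5 GiB

print(result)  # __repr__ summarizes all numbers (in bytes)
```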
__repr__
__repr__() -> str
Source code in vllm/utils/mem_utils.py
MemorySnapshot dataclass
Memory snapshot.
Source code in vllm/utils/mem_utils.py
__init__
__init__(
torch_peak: int = 0,
free_memory: int = 0,
total_memory: int = 0,
cuda_memory: int = 0,
torch_memory: int = 0,
non_torch_memory: int = 0,
timestamp: float = 0.0,
auto_measure: bool = True,
) -> None
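A minimal sketch of taking snapshots, assuming a CUDA device is available. With the default auto_measure=True the fields are presumably populated at construction time via __post_init__ (documented next); with auto_measure=False the snapshot stays at zero until measure() is called. All values are in bytes.

```python
from vllm.utils.mem_utils import MemorySnapshot

# Default: auto_measure=True, so the snapshot is taken on construction.
snap = MemorySnapshot()
print(snap.cuda_memory, snap.torch_memory, snap.non_torch_memory)

# auto_measure=False: fields stay at 0 until measure() is called explicitly.
manual = MemorySnapshot(auto_measure=False)
manual.measure()
```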
__post_init__
__sub__
__sub__(other: MemorySnapshot) -> MemorySnapshot
Source code in vllm/utils/mem_utils.py
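A sketch of the assumed semantics: a field-wise difference between two snapshots, useful for measuring how much each memory category grew between two points in time.

```python
import torch

from vllm.utils.mem_utils import MemorySnapshot

before = MemorySnapshot()
x = torch.zeros(1 << 28, dtype=torch.uint8, device="cuda")  # allocate ~256 MiB
after = MemorySnapshot()

# Assumed to be a field-wise difference (in bytes) between the two snapshots.
delta = after - before
print(delta.torch_memory, delta.non_torch_memory)
```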
measure
Source code in vllm/utils/mem_utils.py
get_max_shared_memory_bytes cached
Returns the maximum shared memory per thread block in bytes.
Source code in vllm/utils/mem_utils.py
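A minimal usage sketch. Calling it with no arguments is assumed to query the default/current CUDA device; the result is cached, so repeated calls do not re-query the device.

```python
from vllm.utils.mem_utils import get_max_shared_memory_bytes

smem = get_max_shared_memory_bytes()
print(f"max shared memory per thread block: {smem} bytes")
```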
memory_profiling
memory_profiling(
baseline_snapshot: MemorySnapshot, weights_memory: int
) -> Generator[MemoryProfilingResult, None, None]
Memory profiling context manager.

baseline_snapshot: the memory snapshot taken before the current vLLM instance was created.

weights_memory: memory used by PyTorch when loading the model weights. Note that, before loading the model weights, we also initialize the device and the distributed environment, which may consume some memory. This part is not included in weights_memory because PyTorch does not control it.

The memory on one GPU can be classified into 3 categories:

1. memory used by anything other than the current vLLM instance.
2. memory used by torch in the current vLLM instance.
3. memory used in the current vLLM instance, but not by torch.

A quantitative example:

Before creating the current vLLM instance:

- category 1: 1 GiB
- category 2: 0 GiB
- category 3: 0 GiB

After creating the current vLLM instance and loading the model (i.e. before profiling):

- category 1: 1 GiB
- category 2: 2 GiB (model weights take 2 GiB)
- category 3: 0.5 GiB (memory used by NCCL)

During profiling (peak):

- category 1: 1 GiB
- category 2: 4 GiB (peak activation tensors take 2 GiB)
- category 3: 1 GiB (memory used by NCCL + buffers for some attention backends)

After profiling:

- category 1: 1 GiB
- category 2: 3 GiB (after garbage-collecting activation tensors)
- category 3: 1 GiB (memory used by NCCL + buffers for some attention backends)

In this case, the non-KV-cache memory takes 5 GiB in total, including:

a. 2 GiB used by the model weights (category 2)
b. 2 GiB reserved for the peak activation tensors (category 2)
c. 1 GiB used by non-torch components (category 3)

The memory used for loading weights (a.) is given directly by the weights_memory argument.

The increase of torch.cuda.memory_stats()["allocated_bytes.all.peak"] during profiling gives (b.).

The increase of non_torch_memory from creating the current vLLM instance until after profiling gives (c.).
Source code in vllm/utils/mem_utils.py
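A sketch of the intended flow, following the docstring above: take a baseline snapshot before the current instance allocates anything, record how many bytes the model weights took, then run a profiling forward pass inside the context manager. The Linear layer and the surrounding measurement are stand-ins for real model loading and a real dummy run.

```python
import torch

from vllm.utils.mem_utils import MemorySnapshot, memory_profiling

# Baseline: taken before the current instance allocates anything on the GPU.
baseline = MemorySnapshot()

# Stand-in for model loading; weights_bytes approximates the bytes PyTorch
# allocated for the weights.
allocated_before = torch.cuda.memory_allocated()
model = torch.nn.Linear(4096, 4096, device="cuda")
weights_bytes = torch.cuda.memory_allocated() - allocated_before

with memory_profiling(baseline, weights_memory=weights_bytes) as result:
    # Stand-in for the profiling forward pass (drives peak activation memory).
    model(torch.randn(8192, 4096, device="cuda"))

# All numbers are in bytes.
print(result.torch_peak_increase)   # (b.) peak torch allocation increase
print(result.non_torch_increase)    # (c.) non-torch memory increase
print(result.non_kv_cache_memory)   # a. + b. + c.
```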