Challenges in Implementing GPU‐Based Inference FaaS: Cold Start Latency
Deploying inference workloads on GPUs often results in low utilization rates, primarily due to the need for pre-provisioned resources that remain idle during periods without requests. In production environments, maintaining high availability necessitates additional backup inference instances, further diminishing GPU utilization. Reports indicate that average inference GPU utilization hovers around 10–20%, highlighting significant opportunities to enhance efficiency and reduce costs.
Function as a Service (FaaS) has emerged as a popular paradigm for CPU-based computations, allowing service providers to allocate compute resources on-demand, thereby improving utilization. Applying FaaS principles to GPU-based inference could theoretically bolster GPU utilization; however, several challenges impede this transition.
- Cold Start Latency: Inference services are typically interactive, necessitating low response times to ensure a seamless user experience. Cold start latency for GPU-based inference instances can range from 10 seconds to several minutes, rendering them unsuitable for applications requiring prompt responses.
- Warm Start Limitations: In CPU-based FaaS, warm start techniques retain execution instances in system memory after request completion, allowing immediate handling of subsequent requests. Applying this approach to GPU-based inference would involve keeping model data in GPU VRAM, effectively binding the GPU to a specific model and preventing it from processing other inferences. Consequently, traditional warm start strategies are not directly applicable to GPU-based inference.
To make GPU-based FaaS viable, cold start latency must be reduced to an acceptable level, with under 5 seconds being a commonly cited industry target. The key components contributing to cold start latency are:
- Container Startup Time: This phase involves a few stages:
- Namespace and Cgroup Initialization: Setting up isolated environments for the container.
- Container Image Preparation: Retrieving and extracting the container image to establish the filesystem. Image retrieval can be time-consuming, especially if downloading from a central registry. Pre-downloading images locally can mitigate this delay, potentially reducing startup time to approximately 100 milliseconds (see the pre-pull sketch after this list).
- Network Setup: Configuring network interfaces and assigning IP addresses.
- Model Loading Latency: For LLM serving, the model weights must be loaded into GPU memory before the instance can handle user inference requests. This latency is determined largely by the storage-to-GPU bandwidth (a back-of-envelope estimate follows this list).
- Inference Framework Startup Latency: In production environments, inference typically runs on top of an inference framework such as vLLM to improve performance, and starting the framework itself adds latency.
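As a concrete illustration of the image pre-download idea mentioned above, the sketch below pre-pulls a list of inference images when a node comes up, so that a cold start only has to create a container from a locally cached image. The image names and the use of the Docker CLI are assumptions for illustration; a containerd- or Kubernetes-based setup would achieve the same with its own tooling.

```python
# Hedged sketch: pre-pull inference container images on each node so that
# cold starts skip the registry download. Image names are illustrative.
import subprocess

IMAGES = [
    "vllm/vllm-openai:latest",
    # ... other inference images this node is expected to serve
]

def prefetch_images() -> None:
    for image in IMAGES:
        # `docker pull` returns quickly if the image is already cached locally.
        subprocess.run(["docker", "pull", image], check=True)

if __name__ == "__main__":
    prefetch_images()
```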
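For model loading latency, a rough estimate follows directly from model size and effective storage-to-GPU bandwidth. The bandwidth figures below are illustrative assumptions, not measurements.

```python
# Back-of-envelope estimate: load time = model size / effective bandwidth.
def load_time_seconds(model_size_gb: float, bandwidth_gb_per_s: float) -> float:
    # Both inputs are assumptions; bandwidth is in GB/s (bytes, not bits).
    return model_size_gb / bandwidth_gb_per_s

# ~2 GB of weights over ~1 GB/s effective local NVMe bandwidth -> ~2 s,
# consistent with the TinyLlama measurement discussed below.
print(load_time_seconds(2.05, 1.0))
# The same weights pulled from remote object storage at ~0.25 GB/s -> ~8 s.
print(load_time_seconds(2.05, 0.25))
```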
Extensive research has been conducted to address model loading latency in both industry and academia. For instance, Run:ai's Model Streamer uses multiple threads to read model weights from various storage types and stream them to the GPU in parallel, reporting up to a sixfold increase in model loading speed. Similarly, ServerlessLLM exploits the substantial near-GPU storage and memory capacity of inference servers for local checkpoint storage, minimizing remote checkpoint downloads and enabling fast checkpoint loading, which yields significant latency reductions.
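The following is a minimal sketch of the concurrency idea behind such loaders, assuming a model stored as multiple .safetensors shards. It is not the actual Model Streamer or ServerlessLLM implementation, and the model directory path is hypothetical.

```python
# Hedged sketch: load sharded model weights concurrently onto the GPU,
# overlapping storage reads and host-to-device copies across shards.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

from safetensors.torch import load_file  # pip install safetensors torch

def load_shard(path: Path, device: str = "cuda:0") -> dict:
    # Read one .safetensors shard and place its tensors on the target GPU.
    return load_file(str(path), device=device)

def load_model_concurrently(model_dir: str, workers: int = 8) -> dict:
    shards = sorted(Path(model_dir).glob("*.safetensors"))
    state_dict: dict = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for shard_tensors in pool.map(load_shard, shards):
            state_dict.update(shard_tensors)
    return state_dict

if __name__ == "__main__":
    # Hypothetical local checkpoint directory.
    weights = load_model_concurrently("/models/TinyLlama-1.1B-Chat-v1.0")
    print(f"Loaded {len(weights)} tensors")
```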
However, there is limited research focusing on reducing inference framework initialization latency. For example, an analysis of the TinyLlama-1.1B-Chat-v1.0 model's cold start process revealed a total startup time of 17 seconds, with approximately 2 seconds dedicated to loading the model's 2.0512 GB of data, and the remaining 15 seconds attributed to framework initialization.
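A breakdown of this kind can be approximated by timing the framework startup path directly. The sketch below uses vLLM's offline LLM class; it lumps weight loading into engine construction, so it only approximates the split reported above and is not the measurement methodology behind those numbers.

```python
# Hedged sketch: time vLLM engine startup versus the first inference.
import time

from vllm import LLM, SamplingParams

t0 = time.perf_counter()
# Engine construction covers framework initialization (CUDA context,
# memory profiling, KV-cache allocation, CUDA graph capture, ...) as
# well as loading the model weights onto the GPU.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
t1 = time.perf_counter()

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=8))
t2 = time.perf_counter()

print(f"engine startup (incl. weight load): {t1 - t0:.1f} s")
print(f"first inference:                    {t2 - t1:.1f} s")
```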
Even with optimizations that reduce container startup and model loading latencies to negligible levels, the framework initialization time of 15 seconds still exceeds the target cold start latency of 5 seconds. Therefore, without effective optimization of framework initialization, it becomes necessary to over-provision inference instances to achieve low-latency responses.
Numerous companies now offer serverless inference products that promise on-demand AI inference. However, because of the inherent cold start challenges of GPU-based inference, these systems typically still rely on pre-provisioned GPU resources to achieve the low-latency responses required in production. The promise of serverless and FaaS paradigms lies in on-demand, scalable resource utilization, yet GPU-based inference systems must keep resources warm to meet stringent latency targets. Ongoing research is actively addressing cold starts, but until framework initialization and overall GPU cold start latency can be drastically reduced, pre-provisioning will remain essential for low-latency inference. This reliance on pre-provisioning caps overall GPU utilization, making GPU-based inference less efficient than CPU-based FaaS, where resources can be allocated dynamically with minimal overhead.