Techniques for On-Demand Model Loading
The primary hurdle for serverless IaaS is the "Cold Start": the time required to pull a multi-gigabyte model from storage into GPU memory before the first request can be served.
To achieve "instant-on" performance, providers combine Lazy Loading with Layered Caching. Rather than waiting for the entire model to download, the inference engine loads just the first few layers and the embedding table, begins processing the initial tokens, and streams the remaining layers in the background.
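The overlap between computation and download can be sketched as follows. This is a minimal illustration using standard-library threading, with placeholder "weights" and a simulated network fetch; the class and method names are assumptions, not any provider's actual API.

```python
import threading
import time

class LazyModel:
    """Sketch of lazy layer loading: inference starts as soon as the
    first layers arrive, while later layers stream in the background."""

    def __init__(self, num_layers, fetch_delay=0.01):
        self.num_layers = num_layers
        self.fetch_delay = fetch_delay
        self.layers = {}                                      # index -> weights
        self.ready = [threading.Event() for _ in range(num_layers)]
        # Background thread simulates streaming layers in from storage.
        threading.Thread(target=self._stream, daemon=True).start()

    def _stream(self):
        for i in range(self.num_layers):
            time.sleep(self.fetch_delay)   # simulated network fetch
            self.layers[i] = i + 1         # placeholder "weights"
            self.ready[i].set()

    def forward(self, x):
        # Each layer blocks only until *its own* weights have arrived,
        # so early computation overlaps with the remaining downloads.
        for i in range(self.num_layers):
            self.ready[i].wait()
            x = x + self.layers[i]         # placeholder layer operation
        return x

model = LazyModel(num_layers=4)
result = model.forward(0)   # callable before all layers were resident
```

The key property is that `forward` never waits for the whole model, only for the next layer it actually needs.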
Additionally, IaaS platforms use Shared Memory Object Stores (such as Plasma) to keep common model weights "warm" across multiple containers. If two users request two different fine-tuned versions of the same base model (e.g., two different LoRA adapters for Llama-3), the system loads the base weights only once and swaps in just the small, adapter-specific weights, cutting startup time from minutes to milliseconds and enabling truly elastic scaling.
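The adapter-sharing pattern can be sketched in a few lines. This is a toy illustration, not a real serving stack: the function names and the tiny 2x2 "base model" are assumptions, and the low-rank adapter math follows the standard LoRA formulation, output = (W + B·A)·x, with the base matrix W cached and loaded at most once.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

BASE_LOADS = 0        # counts how often the "expensive" base load runs
_base_cache = None

def load_base():
    """Stand-in for pulling the multi-gigabyte base weights."""
    global BASE_LOADS
    BASE_LOADS += 1
    return [[1.0, 0.0], [0.0, 1.0]]   # placeholder 2x2 base matrix W

def get_base():
    """Return the shared base weights, loading them at most once."""
    global _base_cache
    if _base_cache is None:
        _base_cache = load_base()
    return _base_cache

def lora_forward(adapter, x):
    W = get_base()                    # shared across all tenants
    A, B = adapter                    # small low-rank factors
    delta = matvec(B, matvec(A, x))   # adapter path: B @ (A @ x)
    return [w + d for w, d in zip(matvec(W, x), delta)]

# Two "tenants" with different rank-1 adapters over one shared base.
adapter_1 = ([[1.0, 0.0]], [[0.5], [0.0]])   # A is 1x2, B is 2x1
adapter_2 = ([[0.0, 1.0]], [[0.0], [2.0]])

x = [1.0, 1.0]
y1 = lora_forward(adapter_1, x)
y2 = lora_forward(adapter_2, x)
```

Both requests produce different outputs, yet `load_base` ran only once; only the rank-1 factors differ per tenant, which is why swapping adapters is cheap.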

