How to Cut Hugging Face Model Load Time from 18 Minutes to 2

One of the most critical performance factors for both training and inference is how quickly models, weights, and supporting state are loaded into GPU memory. The reasons differ slightly: during training, fast loading ensures GPUs remain fully utilized every millisecond, while in inference it minimizes startup latency and prevents costly reloads. In both cases, delays in moving weights into GPU memory translate directly into wasted GPU cycles.
Because of this, AI workload performance often hinges on the efficiency of model loading. Understanding and optimizing how pre-trained models are delivered into GPU memory is essential to achieving high utilization and reliable performance.
The way a model loads into GPU memory is shaped by its size (number of parameters) and whether it’s untrained or pre-trained.
In this blog, we will explore key aspects of pre-trained models and examine how the process of loading weights, parameters, and state across the network can be optimized to achieve significantly better performance.
How to speed up pre-trained Hugging Face model loading
Problem statement: Model loading performance from network-attached storage is significantly slower than expected, creating a bottleneck in workflow efficiency.
When one of our customers reported that it was taking nearly 18 minutes to load a pre-trained Hugging Face 30B parameter model into GPU memory, we dug in to understand why.
The user was following the default approach:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("/path-to-model/shard-data")
At first glance, nothing looked unusual. But under the hood, two subtle defaults were creating a perfect storm for slow performance:
- Random I/O from memory mapping – Hugging Face’s safetensors library loads weights through memory mapping (mmap), which produces many small, random reads instead of larger sequential reads. On local NVMe this is fine, but over network-attached storage it can become a major bottleneck.
- Low shard count – The model was packaged into just 16 shards. Each shard was mmap’d separately, so the combination of a small number of large shards and random access patterns amplified latency and kept I/O throughput well below the available bandwidth.
The outcome was that GPUs were sitting idle, waiting on data, and expensive cycles were being wasted.
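To see what the loader is up against, it helps to look at the checkpoint directory itself. The snippet below is a minimal sketch (the path is a placeholder) that lists the safetensors shards and their sizes, which makes a low shard count easy to spot:

import glob
import os

# Hypothetical checkpoint directory; substitute your model path
model_dir = "/path-to-model/shard-data"

shards = sorted(glob.glob(os.path.join(model_dir, "*.safetensors")))
total_bytes = sum(os.path.getsize(s) for s in shards)

print(f"{len(shards)} shards, {total_bytes / 1e9:.1f} GB total")
for s in shards:
    print(f"  {os.path.basename(s)}  {os.path.getsize(s) / 1e9:.1f} GB")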
To address this, we experimented with different Hugging Face load-time parameters. The breakthrough came from a small but powerful tweak: switching to torch_dtype="auto". With this option, Hugging Face looks for a dtype setting in the model’s config file. If the setting is present, it loads the weights in the recommended dtype (fp32, fp16, or bf16), reducing memory usage and the amount of data that has to be moved. If the setting is missing, it falls back to float32 (full precision).
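You can check what torch_dtype="auto" will resolve to, without loading any weights, by inspecting the checkpoint’s config. A minimal sketch, reusing the placeholder path:

from transformers import AutoConfig

# Hypothetical path; config.json in this directory may carry a torch_dtype entry
config = AutoConfig.from_pretrained("/path-to-model/shard-data")
print(config.torch_dtype)  # e.g. torch.bfloat16 if set, otherwise None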
By pairing this with other optimizations, such as enabling safetensors and reducing CPU memory pressure, we cut load time from 18 minutes down to ~2 minutes.
Here’s the final load call that unlocked the performance:
model = AutoModelForCausalLM.from_pretrained(
    "/path-to-model/shard-data",
    use_safetensors=True,
    low_cpu_mem_usage=True,
    torch_dtype="auto",  # key improvement
)
This simple change not only improved raw throughput (bytes transferred per second) but also boosted goodput, the amount of useful model data actually delivered into GPU memory, by aligning access patterns with how the storage system performs best.
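If you want to verify the effect on your own storage, timing the load call directly is enough. A minimal sketch, again with a placeholder path:

import time
from transformers import AutoModelForCausalLM

start = time.perf_counter()
model = AutoModelForCausalLM.from_pretrained(
    "/path-to-model/shard-data",  # hypothetical path
    use_safetensors=True,
    low_cpu_mem_usage=True,
    torch_dtype="auto",
)
print(f"Load time: {time.perf_counter() - start:.1f} s")

For a fair before/after comparison, drop the client page cache between runs (for example, echo 3 > /proc/sys/vm/drop_caches as root); otherwise the second load is served from memory rather than from storage.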
The lesson is clear: default settings aren’t always optimal for large-scale AI workloads. By understanding how model files are sharded, memory-mapped, and delivered to GPUs, you can dramatically accelerate startup times and keep GPU utilization high.
You can find more detail on the model and configurations at: https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct
Additional configurations
NFS Mount and tuning options with VAST
By default, a Linux NFS mount points to a single IP or VIP, which can pin all I/O to a single VAST CNode. This creates a bottleneck, since every read and write flows through one front-end.
The first optimization is nconnect, which allows the NFS client to open multiple TCP sessions to the same VIP. This adds parallelism for reads and writes, but if all sessions are still pinned to a single VIP, throughput remains limited.
VAST solves this with the remoteports mount option. Instead of talking to just one VIP, remoteports lets the client use a range of VIPs, each mapped to a different CNode. The client spreads its TCP sessions across these VIPs, delivering true multipath access. When nconnect and remoteports are combined, you achieve full parallelism across the cluster, significantly increasing aggregate throughput.
Controlling I/O Chunk Size with rsize and wsize
The next tuning knobs are rsize (read size) and wsize (write size):
- rsize defines the maximum bytes per read RPC the client issues.
- wsize defines the maximum bytes per write RPC.
On modern Linux clients, supported values range from 4 KB up to a maximum of 1 MB (1048576 bytes), and the actual size is negotiated with the server at mount time. For our configuration, we set both rsize and wsize to 1 MB.
This doesn’t change the application’s block size, but it does control how the kernel splits I/O into RPCs. Larger values dramatically reduce the number of RPCs required. For example, reading a 1 GB checkpoint file:
- rsize=1M → 1,024 RPCs
- rsize=64K → 16,384 RPCs
That’s a 16× increase in RPC overhead with the smaller value.
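The same arithmetic applies to any checkpoint size; a small sketch of the calculation:

# RPC count = ceiling(file size / rsize), for the 1 GB checkpoint in the example
checkpoint_bytes = 1 * 1024**3

for rsize in (1024 * 1024, 64 * 1024):    # 1 MB and 64 KB
    rpcs = -(-checkpoint_bytes // rsize)  # ceiling division
    print(f"rsize={rsize // 1024} KB -> {rpcs:,} read RPCs")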
Boosting Sequential Throughput with Read-Ahead
Finally, we tuned read-ahead, one of the most impactful settings for AI and HPC workloads. Read-ahead tells the client to prefetch additional data beyond what the application explicitly requested, on the assumption that sequential blocks will soon be needed.
This reduces latency and increases effective throughput when streaming large, sequential files such as model checkpoints. For AI workloads, the optimal read-ahead window is typically 8–16 MB.
A mount command example
mount -t nfs -o rw,nfsvers=3,nconnect=32,remoteports=10.5.69.1-10.5.69.140,rsize=1048576,wsize=1048576,readahead=16M server:/data /mnt/data