Why use a flamethrower to light a candle when you can just use a match?
The Storage Bottleneck in AI Training
Training Large Language Models (LLMs) means feeding the model enormous amounts of data so it can learn language patterns and context. This demands substantial compute and storage capacity to hold both the dataset and the model’s parameters. As training proceeds, the model continually adjusts its weights, and this evolving state must be saved along the way. The larger and more complex the model, the more storage is needed to manage the data, intermediate results, and checkpoints.
Imagine you’re filling bottles with water on an assembly line. If the conveyor belt is too slow, workers stand idle, wasting time and money. AI training operates the same way—GPUs are incredibly expensive, and if they sit idle waiting for data, resources are wasted.
Many organizations address this AI storage problem by deploying parallel file systems, which distribute data across multiple storage nodes. While effective, these setups require significant infrastructure, increasing both cost and complexity.
Why Throw Too Much Hardware at Parallel File Systems?
Parallel file systems were originally designed for high-performance computing (HPC) workloads, which have very different data access patterns than AI training. That mismatch in access patterns, coupled with the additional hardware and software parallel file systems demand, makes such a storage solution less cost-effective for AI training.
To illustrate this point, consider the analogy of using a flamethrower to light a candle. While a flamethrower is undeniably effective at generating fire, it’s a wildly excessive and potentially dangerous solution for the simple task of lighting a candle. Similarly, while parallel file systems are powerful tools, they are often overkill for AI training. In many cases, a simpler and more efficient approach is a better option.
MLPerf Storage is a benchmark suite designed to measure how well storage systems handle the demands of machine learning workloads. It’s a joint effort by industry leaders and researchers to create a standardized way to compare storage solutions for AI and ML applications. Interestingly, in the v1.0 results, the leading parallel file system submission used 22 data-server instances plus 2 metadata-server instances, while the leading remote block storage submission used only 3 servers to achieve similar (normalized) results.
A Simpler and More Efficient Approach
Instead of relying on expensive parallel file systems, there’s a more streamlined solution: using a local file system mounted as read-only across multiple AI nodes. This can be achieved by placing the dataset on a remote block storage system that supports multi-attach. The result?
- Lower Infrastructure Costs – No need for additional metadata servers or storage controllers.
- Better Performance – Local file system reads avoid the metadata and network coordination overhead of a complex distributed file system.
- Simpler Management – Fewer components mean fewer failure points and easier administration.
How It Works
Easy. Simply:
- Store training data on a high-performance block storage system that supports multi-attach.
- Attach the block storage volume to multiple AI nodes simultaneously.
- Mount the volume as a local file system in read-only mode across all nodes.
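The three steps above can be sketched as follows. This is an illustrative sequence, not a definitive recipe: the transport (NVMe over TCP), target address, NQN, and device names are all placeholders that depend on your storage system.

```shell
# Run on each AI training node. Assumes NVMe over TCP; the address,
# port, and NQN below are placeholders for your environment.

# 1. Connect the node to the remote block volume.
nvme connect -t tcp -a 192.0.2.10 -s 4420 -n nqn.2024-01.example:dataset-vol

# 2. The volume appears as a local block device (e.g. /dev/nvme1n1).
#    The file system is created only ONCE, on a single writer node:
#    mkfs.xfs /dev/nvme1n1

# 3. Mount the volume read-only on every training node.
mkdir -p /mnt/dataset
mount -o ro,noatime /dev/nvme1n1 /mnt/dataset
```

Because every node mounts the same volume read-only, there is no need for distributed locking or a metadata service; the kernel treats it like any local file system.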
The remote block storage solution also makes dataset updates easy through its snapshot and clone capabilities. A dataset is written to a volume, and a snapshot is taken to preserve that specific version. The snapshot is then cloned, and the clones are attached across the training nodes. When an updated dataset is needed, the volume is updated, a new snapshot is taken, and fresh clones are created from it. Each node thus gets the latest version of the dataset without re-transferring it in full, saving time and resources.
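The snapshot-and-clone update flow might look like the following sketch. Note that `blockctl` is a hypothetical CLI standing in for whatever management interface your block storage system provides; the volume and node names are likewise illustrative.

```shell
# Hypothetical workflow: "blockctl" is a stand-in for your storage
# system's actual snapshot/clone commands.

# Preserve the current dataset version as an immutable snapshot.
blockctl snapshot create dataset-vol --name dataset-v2

# Create a clone from the snapshot; clones are cheap copy-on-write
# volumes, so no full data transfer takes place.
blockctl clone create dataset-v2 --name dataset-v2-clone

# On each training node: swap in the new clone and remount read-only.
umount /mnt/dataset
blockctl attach dataset-v2-clone --node train-node-01
mount -o ro /dev/nvme1n1 /mnt/dataset
```

The key point is that snapshots pin dataset versions and clones fan them out, so rolling nodes forward (or back) to a dataset version is a metadata operation rather than a bulk copy.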
Extra Perks of Remote Block Storage: Beyond Training Datasets
On top of the benefits for training datasets, remote block storage is also great for checkpointing and Retrieval-Augmented Generation (RAG) in AI.
Checkpointing involves periodically saving the state of a training run, allowing the training process to be resumed from that point in case of a failure or if the training needs to be interrupted. High-performance, disaggregated storage removes direct-attached storage (DAS) complexity, enables fast failure recovery, and provides sufficient write throughput to avoid idle time during the checkpointing process. Remote block storage provides a fast and reliable way to store and retrieve checkpoints, ensuring that the training process can be resumed quickly and efficiently.
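The checkpointing pattern described above can be sketched as a small shell script. This is a minimal illustration: `./checkpoints` stands in for the mounted checkpoint volume, and the saved "state" is a placeholder file rather than real model weights.

```shell
#!/bin/sh
# Directory standing in for the mounted checkpoint volume
# (in production this might be something like /mnt/ckpt).
CKPT_DIR=./checkpoints
mkdir -p "$CKPT_DIR"

# Save a checkpoint for a given training step ($1).
save_checkpoint() {
    # Write to a temp file, then rename: the rename is atomic, so a
    # crash mid-write never leaves a half-written checkpoint behind.
    printf 'step=%s\n' "$1" > "$CKPT_DIR/ckpt-$1.tmp"
    mv "$CKPT_DIR/ckpt-$1.tmp" "$CKPT_DIR/ckpt-$1"
    # Keep only the 3 most recent checkpoints to bound space usage.
    ls -1 "$CKPT_DIR" | sort -t- -k2 -rn | tail -n +4 |
        while read -r old; do rm "$CKPT_DIR/$old"; done
}

# Find the newest checkpoint to resume from after a failure.
latest_checkpoint() {
    ls -1 "$CKPT_DIR" | sort -t- -k2 -n | tail -n 1
}

for step in 100 200 300 400; do save_checkpoint "$step"; done
latest_checkpoint    # prints ckpt-400
```

The atomic rename plus keep-last-N rotation is what makes fast failure recovery safe: a resuming job always finds a complete, recent checkpoint on the shared volume.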
RAG improves the quality of generated output by retrieving relevant domain-specific knowledge at inference time. It typically relies on vector databases to store that knowledge as embeddings, which calls for low-latency, high-performance block storage to effectively augment Large Language Models (LLMs).
Unlocking the Benefits of Software-Defined Storage for AI
When it comes to AI storage, Software-Defined Storage (SDS) is a natural fit because it’s flexible, scalable, and cost-effective. Traditional storage appliances can’t always keep up with the huge data demands of AI, but SDS lets you manage and expand storage as needed. By decoupling the storage software from the underlying hardware, it can be tuned and optimized for your specific workloads. SDS also makes it much easier to repurpose or reuse existing hardware, saving money and reducing waste. Whether you’re upgrading your setup or making the most of what you have, SDS lets you adapt your storage without investing in entirely new hardware, and scale up smoothly as AI models grow while using resources more efficiently.
The Bottom Line
Throwing more hardware at a problem isn’t always the best solution. A simple, read-only shared file system on multi-attached block storage can provide the performance AI training requires while reducing both complexity and cost. Before jumping into a parallel file system, consider whether a leaner approach could keep your GPUs busy and training times short—and if it’s software-defined, even better, with added flexibility and scalability. The same storage can also serve checkpointing and RAG, making it even more powerful and efficient.