Scaling Out to Keep Data Out of the Lost and Found

Dataversity – August 29, 2024

How much data would you be comfortable with losing?

In the world of high-performance computing (HPC), the simple answer should be none. Because HPC systems involve massive amounts of data, any loss – whether big or small – can have a catastrophic impact on complex simulations, finances, customer and shareholder relationships, and organizational reputation. Any system that lacks durability is at heightened risk of data loss.

While data loss was more acceptable in the early days of HPC, today there is a zero-tolerance policy, with customers expecting uncompromising service level agreements (SLAs) that match those provided by hosting providers. Going forward, companies need to place as much urgency on data durability as they do on any other aspect of data services, such as availability.

Data loss is the unintentional or malicious alteration, deletion, or unavailability of data that is essential for computational operations. It can result from cyberattacks, human error, software defects, or hardware failures. A Verizon report estimates that a large-scale data loss (100 million-plus records) can cost an organization anywhere between $5 million and $15.6 million, underscoring that data protection is essential.

Data managers often prioritize ROI, performance, and data availability. However, durability is just as central to the ROI equation. Durability refers to the ability to protect data from loss or corruption during an outage or failure. Availability ensures data is accessible around the clock, but it depends on durability: you can’t access data that has been lost after an outage. Both are crucial for system integrity, and they should be considered together to guarantee a full return on investment and system efficiency.

Combat Data Loss and Scale Out Instead of Up 

As AI continues to shape industry needs, a company’s ability to scale its data infrastructure is crucial for success. While HPC environments have typically relied on the legacy practice of scaling up, this approach adds unnecessary risk of data loss.

Scale-up models rely on pairs of high-availability (HA) controllers to manage data and come with several drawbacks – chiefly, a single point of failure that can cause data loss across an entire cluster. To scale, these models require additional HA controllers. But when a single node can take down an entire cluster, adding more HA pairs for the sake of scaling only puts the data at greater risk. The larger the cluster, the worse its resiliency becomes, as the system’s complexity increases the likelihood of failure.

Now compare that to a scale-out model, which distributes data across multiple nodes in the cluster, eliminating single points of failure. With the scale-out model, if a server node fails, the cluster manager simply redistributes that node’s data across the remaining nodes, minimizing the performance impact. This approach enhances resiliency, as larger clusters can tolerate multiple node failures without significant performance degradation. The more nodes in the cluster, the greater the system’s overall resiliency and ability to maintain data protection.
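To make the idea concrete, here is a minimal Python sketch – not any particular vendor’s implementation – of how a scale-out placement layer might keep every shard replicated across distinct nodes and re-replicate copies when a node fails. The class and method names (Cluster, place_shard, fail_node) and the replica count are illustrative assumptions.

```python
import random
from collections import defaultdict

class Cluster:
    """Toy scale-out cluster: every shard is replicated on REPLICAS distinct nodes."""
    REPLICAS = 3

    def __init__(self, nodes):
        self.nodes = set(nodes)
        self.placement = defaultdict(set)  # shard_id -> set of nodes holding a copy

    def place_shard(self, shard_id):
        # Spread replicas across distinct nodes so no single node is a point of failure.
        self.placement[shard_id] = set(random.sample(sorted(self.nodes), self.REPLICAS))

    def fail_node(self, node):
        # Drop the failed node, then re-replicate any shard that lost a copy
        # onto one of the surviving nodes (the "redistribution" step).
        self.nodes.discard(node)
        for holders in self.placement.values():
            if node in holders:
                holders.discard(node)
                candidates = self.nodes - holders
                if candidates:
                    holders.add(random.choice(sorted(candidates)))

# Usage: 8 nodes, 100 shards; losing one node leaves every shard fully replicated.
cluster = Cluster(nodes=[f"node{i}" for i in range(8)])
for shard in range(100):
    cluster.place_shard(shard)
cluster.fail_node("node3")
assert all(len(holders) == Cluster.REPLICAS for holders in cluster.placement.values())
```

Because no node holds a unique copy of anything, the loss of one node triggers a repair rather than a loss, and the repair work is shared by every surviving node.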

The industry’s reliance on traditional scale-up strategies can actually increase the risk of data loss as systems grow. Linear scale-out reconstruction, by contrast, lets companies strengthen their data protection as they grow: the larger the cluster, the less time it takes to rebuild lost data, because the reconstruction work is spread across more nodes working in parallel. Scaling out offers several other benefits, including enhanced resilience and high availability. It also reduces the risk of overloading a small number of nodes: if too much data is concentrated on just a few nodes and one of them fails, the consequences are severe. Distributing the load across more nodes mitigates that risk and makes data recovery more efficient.
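A back-of-the-envelope model shows why recovery time shrinks as the cluster grows: when every surviving node rebuilds a share of a failed node’s data in parallel, the aggregate rebuild bandwidth grows with node count. The capacity and per-node bandwidth figures below are purely illustrative assumptions, not measured values.

```python
def rebuild_time_hours(node_capacity_tb, per_node_rebuild_gbps, total_nodes):
    """Rough estimate of the time to reconstruct one failed node's data when
    every surviving node rebuilds a share of it in parallel."""
    surviving = total_nodes - 1
    data_bits = node_capacity_tb * 8e12                      # TB -> bits
    aggregate_bw = per_node_rebuild_gbps * 1e9 * surviving   # bits/sec across survivors
    return data_bits / aggregate_bw / 3600

# Illustrative numbers only: 100 TB per node, 2 Gb/s of rebuild bandwidth per survivor.
for n in (4, 16, 64, 256):
    print(f"{n:>3} nodes -> ~{rebuild_time_hours(100, 2, n):.1f} h to rebuild one node")
# The estimate falls from roughly 37 hours at 4 nodes to well under an hour at
# 256 nodes, illustrating how recovery time drops as the cluster scales out.
```

The exact numbers depend on the file system and hardware, but the trend is the point: more nodes means more helpers per repair, so bigger clusters recover faster, not slower.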

Scaling out significantly enhances system durability by distributing data across multiple nodes, thereby mitigating the risk of data loss. This architecture ensures that even if individual nodes fail, the system can seamlessly redistribute their data, maintaining performance and availability.

By leveraging advanced data protection mechanisms and a distributed file system, scaling out provides a robust framework that safeguards critical information and ensures continuous business operations. This approach not only prevents data loss but also builds a durable infrastructure capable of adapting to growing data demands and unexpected failures.