Looking into the World’s Largest Data Warehouse from the inside


It’s not every day that you set a new world record, but that’s exactly what SAP did recently in conjunction with NetApp and several other partners at the SAP/Intel data center in Santa Clara, Calif. An independent audit documented that more than 12 petabytes (PB) of addressable storage had been created, and after reviewing the test data, the folks at Guinness World Records confirmed that the team had succeeded in creating the world’s largest data warehouse.

The data warehouse was based on the SAP® HANA in-memory data platform, SAP IQ (formerly Sybase IQ), and BMMsoft Federated EDMT. Specifically, the warehouse contained more than 221 trillion transactional records and more than 100 billion unstructured documents, including emails, SMS, and images. It also contained data from 30 billion sources, including users, smart sensors, and mobile devices.

To achieve these impressive results, a data warehouse environment was created by ingesting 3 PB per day of synthetic data for four consecutive days—a feat that required exceptional storage system performance and reliability. For that, SAP turned to NetApp® SAN storage.

Curious to learn more, I contacted NetApp Solutions Architect Jens Langer to find out what was involved in setting up and managing such a large-scale storage system.

First, Build a Massive SAN
In spite of the massive scale, the installation and configuration required to achieve a Guinness world record proved fairly routine for the storage team, which included Langer and Jamal Boudi, a NetApp consulting systems engineer for enterprise architectures. The routine nature of the project was due, in large part, to SAP’s clearly defined requirements for capacity and throughput.

Langer explained that deployment of the storage hardware (weighing nearly two tons!) occurred over several weeks as various storage components arrived at the lab. NetApp E-Series storage held the majority of the data: 5.4 petabytes of physical storage capacity spread across 20 E5460 storage arrays and 1,800 3-terabyte (TB) NL-SAS disk drives. The record 12.1 petabytes of total addressable capacity was achieved thanks in large part to an SAP in-system data compression rate of 85% for the 50/50 mix of structured and unstructured data.
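As a back-of-envelope check of those figures (my own arithmetic in decimal units, not from the audit report), the physical capacity and the effective expansion from compression work out as follows:

```python
# Back-of-envelope capacity math for the record configuration
# (decimal units: 1 PB = 1000 TB). Drive counts and capacities are
# from the article; the effective ratio is my own derivation.

drives = 1_800          # NL-SAS drives across 20 E5460 arrays
drive_tb = 3            # TB per drive

physical_pb = drives * drive_tb / 1000
print(f"Physical capacity: {physical_pb} PB")      # 5.4 PB

addressable_pb = 12.1   # audited total addressable capacity
ratio = addressable_pb / physical_pb
print(f"Effective expansion: {ratio:.2f}x")        # ~2.24x
```

The roughly 2.2x gap between physical and addressable capacity reflects compression after accounting for protection and system overhead.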

NetApp E-Series storage was selected for the project because of its proven 99.999% availability and its ability to handle the project’s data ingest requirement of 34.3 TB per hour. The E-Series SAN used a Fibre Channel fabric to support SAP IQ’s large and varied data needs: from hot, unstructured data to warm and cold, structured and unstructured data.
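To put those two requirements in perspective, "five nines" of availability allows only minutes of downtime per year, and the hourly ingest rate adds up quickly. A quick sketch (my own arithmetic for illustration, not figures from the audit):

```python
# What 99.999% availability and 34.3 TB/hour mean in practice.
# Illustrative arithmetic only.

availability = 0.99999
minutes_per_year = 365 * 24 * 60
downtime_min = (1 - availability) * minutes_per_year
print(f"Max downtime per year: {downtime_min:.2f} minutes")  # ~5.26

ingest_tb_per_hour = 34.3
print(f"Ingest per day: {ingest_tb_per_hour * 24:.1f} TB")   # ~823.2
```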

Next, Add NFS Storage
In addition to the massive SAN, a NetApp unified FAS storage system was deployed with three disk shelves, each containing 24 x 900 GB NL-SAS drives. The FAS system supported hot, structured NFS data running on four logical SAP HANA nodes and one failover node. Ease of management drove the decision to use NFS for the SAP HANA portion of the warehouse. Although NetApp E-Series–based SAN storage for FC HANA systems offers extremely high HANA node density, Langer’s prior experience with other SAP installations prompted his choice of NFS, based largely on its ease of use, setup, and maintenance. “If you used Fibre Channel and a HANA node failed, you’d have to go through many more steps to ensure failover,” he said. “With NFS, it’s just a simple matter of mounting and unmounting the HANA node.”

Then, Start It Up and Never Stop
Once the physical storage arrays were in place, several layers of logical abstraction were used to simplify data management. At one level, BMMsoft EDMT acted as an intermediary between the SAP platform and the underlying NetApp storage, treating the multiple E-Series storage arrays as one large, shared storage cluster. On the storage-array side, NetApp Dynamic Disk Pools (DDP) provided abstraction within the E-Series arrays, automatically balancing data across the underlying drives. Langer notes that DDP brought other benefits to the project, such as the ability to quickly exchange drives in the event of a real or potential drive failure.
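Conceptually, a declustered disk pool spreads stripe pieces across every drive, so when one drive fails its contents are rebuilt in parallel onto all survivors rather than onto a single hot spare. A toy model of that idea (my own illustration; NetApp's actual DDP placement algorithm is proprietary and far more sophisticated):

```python
import random

# Toy model of a declustered disk pool: stripe pieces are spread
# across all drives, so when one drive fails, the rebuild work is
# shared by every surviving drive instead of one dedicated spare.

random.seed(42)
drives = {d: [] for d in range(12)}             # 12-drive pool
for stripe in range(600):                       # place stripe pieces
    for d in random.sample(sorted(drives), 4):  # 4 pieces per stripe
        drives[d].append(stripe)

failed = 0
orphaned = drives.pop(failed)                   # drive 0 fails
shares = {d: 0 for d in drives}
for i, stripe in enumerate(orphaned):           # rebuild round-robin
    target = sorted(drives)[i % len(drives)]
    shares[target] += 1

# Spread between busiest and idlest survivor is at most 1 piece,
# i.e., rebuild work is shared almost perfectly evenly.
print(max(shares.values()) - min(shares.values()))
```

This even spreading is why declustered pools rebuild faster and tolerate drive swaps gracefully: no single drive becomes the rebuild bottleneck.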

Although the secure lab environment presented some unique challenges with respect to rack space, cooling, and remote monitoring, the storage systems delivered high performance and continuous availability, even when pushed past their official specifications for operating temperature. Over the official testing period, Langer was able to perform routine maintenance, such as replacing failed disk drives (two out of more than 1,800) and reseating cables to correct a potential path failure to a drive shelf—all while the SAP system was still running.

Just Like Your Data Warehouse, Only Bigger
According to the independent audit report by InfoSizing, attaining the world record offered an occasion to test the performance of real-world data warehouse scenarios based on both structured and unstructured data, including the use of unstructured data “representative of email, social messaging, images, documents, audio and videos.” InfoSizing compared the world-record configuration and results to similar data warehouse environments found in “worldwide financial trading networks, health payment systems, oil exploration and production operations or mobile device networks over multiple years.”

The performance, availability, and speed of NetApp storage made it a solid foundation for attaining the world record, and the same approach can be applied to enterprise applications of any size.

Henry Sapiecha
