Hadoop-based Big Data has already moved from development environments into small-scale and large-scale production. At the same time, the storage requirements of Big Data are what drive the core need for HPC storage. As demand for analytical processing, AI, machine learning, Industrial IoT, and similar workloads has risen, so has the need to handle massive amounts of data. That data must not only be stored but also processed at high speed for real-time analytics and decision making. This is where high-performance computing (HPC) comes in.
HPC has specific characteristics –
- High processor density – placing processors as close together as possible to reduce interconnection latency;
- High-speed I/O – to keep the processors fed with data;
- Power density management – to keep the infrastructure running;
- High-density cooling mechanisms – to ensure reliability and continuity;
- Automated high-availability solutions – to keep production going when a node fails.
Production HPC success is measured by time to value, expressed as overall job elapsed time. To minimize it, processors must be kept as busy as possible, which calls for a balanced system: one that keeps system overhead to a minimum and provides low-latency interconnects with just-in-time, highly parallel data movement between processors. An automated high-availability solution helps avoid data corruption and job restarts when a node fails. This matters because a single retry on a block of data can delay the work of hundreds of processors.
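To see why one retry can hold up hundreds of processors, here is a minimal sketch. It assumes a synchronized parallel step in which every processor handles one block and the step ends only when the slowest processor finishes. The function names, the processor count, and the timings are hypothetical values chosen for illustration, not measurements.

```python
def step_elapsed_time(block_times):
    """Elapsed time of one synchronized parallel step.

    Every processor works on its own block, and the step only ends
    when the slowest processor is done, so the result is the maximum.
    """
    return max(block_times)


def idle_processor_seconds(block_times):
    """Total time processors spend waiting for the slowest one."""
    finish = step_elapsed_time(block_times)
    return sum(finish - t for t in block_times)


# Hypothetical numbers: 200 processors, 10 seconds per block.
normal = [10.0] * 200

# Same step, but processor 0 has to retry its block once (+10 s).
with_retry = list(normal)
with_retry[0] += 10.0

print(step_elapsed_time(normal))           # 10.0
print(step_elapsed_time(with_retry))       # 20.0
print(idle_processor_seconds(with_retry))  # 1990.0
```

Under these assumptions, a single retry doubles the elapsed time of the step and leaves the other 199 processors idle for 10 seconds each.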
How can we compare HPC & Big Data Systems?
Production Big Data and production HPC share the same key metric: time to value. They also share the same key technical principle: keep the processors fed with data and avoid bottlenecks.
From the Big Data perspective, there are specific characteristics –
- It is massive
- It does not move easily
- It is not easy to virtualize, because of the impact on I/O throughput and processor performance.
Of job schedulers and virtualization, the former matters more for large-scale production. Data consolidation is needed as well.
Big Data is often built around Hadoop clusters, and there is no single deployment option for these environments. Big Data development systems are often deployed straight to cloud environments such as Amazon AWS or OpenStack Savanna (under development). When deploying Big Data systems to the cloud, however, practical issues around the cost and elapsed time of moving large amounts of data have often dictated the use of in-house solutions.
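The elapsed-time concern is easy to estimate with simple arithmetic. The sketch below is illustrative only: the function name `transfer_days`, the 100 TB data set, the 1 Gbps link, and the efficiency factor are all assumptions made for the example, not figures from any vendor or provider.

```python
def transfer_days(data_tb, link_gbps, efficiency=1.0):
    """Days needed to move data_tb terabytes over a link_gbps link.

    Uses decimal units (1 TB = 10**12 bytes, 1 Gbps = 10**9 bits/s).
    efficiency is the fraction of the raw link rate actually achieved.
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be > 0 and efficiency in (0, 1]")
    bits = data_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 86_400


# Hypothetical example: 100 TB over a dedicated 1 Gbps link.
print(round(transfer_days(100, 1.0), 1))        # 9.3
# The same data at 50% effective throughput takes twice as long.
print(round(transfer_days(100, 1.0, 0.5), 1))   # 18.5
```

Even with a fully utilized link, moving a data set of this size takes more than a week in this example, which is one reason large data sets tend to stay where they are.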
Large-scale in-house deployments often use a fast-growing class of servers dubbed “density optimized servers.” In response to that growth, most server vendors are pushing hard to deliver solutions that meet the specialized requirements of this market segment, including high availability.
Four significant Density Optimized Servers For Hadoop
Four density optimized server racks for Hadoop, one from each of four vendors, are highlighted here. They differ in approach:
- One of them deploys standard racks with a Hadoop reference architecture,
- one with deployment guides,
- one with integrated hardware and software mainly from the vendor,
- one integrated solution from an HPC base with specific hardware and mainly open-source integrated software.
The vendors are Dell, HP, Oracle, and SGI:
According to an IDC report, Dell has the early lead in this market segment, at approximately 60.5% in 2Q13, with its C-line “cloud servers.” Dell builds custom solutions in large volumes at low margins, which has led to many design wins at cloud providers. Dell has a Cloudera Solution Deployment Guide to help configure and deploy solutions.
HP is #1 in both the overall server and blade server markets and is a solid #2 in the density optimized market with several offerings, including the ProLiant SL-series launched in 2012 for Big Data and cloud deployments. HP has a reference architecture for Hadoop based on Hortonworks.
Oracle’s Big Data Appliance X3-2 is a pre-integrated full-rack configuration with 18 12-core x86 servers that includes InfiniBand and Ethernet connectivity. The Cloudera distribution of Apache Hadoop is included to acquire and organize data, together with Oracle NoSQL Database Community Edition. Additional integrated system software includes Oracle Linux, Oracle Java HotSpot VM, and an open-source distribution of R.
SGI specializes in high-performance computing and in providing converged infrastructure for that market. High-performance computing (HPC) is driven by highly parallel architectures with very large numbers of processors. SGI InfiniteData Cluster is a cluster-computing platform with high server and storage density, offering up to 1,920 cores and up to 1.9PB of data capacity per rack. The cluster is centrally managed using SGI Management Center. InfiniteData Cluster solutions for Apache Hadoop® are pre-racked, pre-configured, and pre-tested with all compute, storage, and network hardware, Red Hat® Enterprise Linux®, and Cloudera® software.
Two things are crucial here –
- the balance of the design – specifically the core/spindle ratio;
- environmental factors – space, power, and cooling.
Interestingly, each family of servers offers a wide variety of options since different workloads have different requirements:
- Hadoop environments are usually optimized at a 1:1 core/spindle ratio;
- NoSQL and MPP databases at about a 1:2 ratio;
- Object storage at 1:10.
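The ratios above can be turned into a quick sizing check. This is a minimal sketch that reads each ratio as cores:spindles, so 1:10 means ten drives per core. The workload labels, the function name `spindles_needed`, and the 16-core server are hypothetical and chosen only for this example.

```python
# Spindles (drives) per core, taken from the ratios in the list above.
# The workload keys are hypothetical labels chosen for this sketch.
SPINDLES_PER_CORE = {
    "hadoop": 1,           # 1:1 core/spindle
    "nosql_mpp": 2,        # about 1:2
    "object_storage": 10,  # 1:10
}


def spindles_needed(cores, workload):
    """Return the number of drives that balances the given core count."""
    if cores < 0:
        raise ValueError("cores must be non-negative")
    return cores * SPINDLES_PER_CORE[workload]


# Example: a hypothetical 16-core server.
for workload in SPINDLES_PER_CORE:
    print(workload, spindles_needed(16, workload))
# hadoop 16
# nosql_mpp 32
# object_storage 160
```

The same 16-core server would need ten times as many drives for object storage as for Hadoop, which is why each server family ships in so many configurations.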
Because Big Data deployments often involve multiple racks of gear at a time, the scalability, power, and cooling requirements of large server clusters are very similar to those of the high-performance computing (HPC) marketplace. While rack-level architectures are not new, most hyper-scale deployments have developed their own designs. Converged infrastructure vendors such as VCE have been geared more toward virtualization environments than scale-out applications.
Final verdict
From the discussion above, we can conclude that high-performance computing and large-scale Hadoop clusters have similar availability and high-density deployment requirements. All the vendors in the comparison are well established, with excellent products and services. Because large-scale production Big Data deployments are few and the technologies are not yet fully ready for prime time, experts recommend a converged infrastructure solution with Hadoop node HA built in. Operational costs are likely to be higher at this stage of platform maturity, but such a solution, together with integrated upgrades, offers a significant benefit.