Large-Scale Analytics

Using computers for analytics and modeling has become commonplace. Whether the topic is consumer behavior, biochemistry, financial risk, or product design, the application of large data sets and complex modeling has become routine. More compute capacity and faster job completion are always desirable; the benefits are more accurate, more comprehensive, and more timely results.
Data centers attempt to meet the challenge with arrays of servers, high-speed network fabrics, and giant storage systems. With improved speeds in each of these components, the time it takes to complete an analysis is increasingly a function of how long it takes to move data around the data center. Moving large volumes of data from storage to working memory on multiple nodes can demand excessive bandwidth. Models may take days to run when retrieving data from storage. Faced with slow execution times, users will constrain their analytics. Using a subset of data or partial models effectively trades thoroughness for time.

Servers can contribute RAM as Memory Hosts, or can access shared RAM as Memory Guests, or can play both roles at once.
The RNA Memory Cloud provides a breakthrough technology for analytics. It is a new foundation for data distribution in the data center that addresses the key performance issues for large-scale analytics and modeling. The RNA MVX product aggregates RAM from around the data center to build a cloud, or pool, of high-speed memory which can be strategically applied to the bottlenecks in analytic processing, with jobs typically completing 6X to 20X faster.
Fast access to large data sets
A central issue for analytics is the size of the data sets -- the original reference data, the working result set, and the intermediate results that are passed from one stage of processing to the next. Depending on the problem, one or more of these data sets will be larger than can fit in any single server’s RAM. The server will have to access the data from or offload it to a disk (either local or network-based), and that adds delay. RNA MVX provides three methods to use its Memory Cloud that match the three kinds of analytic data sets. Each method can be applied to any number of data sets, and each can use as much memory as you allocate to the Memory Cloud -- a terabyte or more of RAM.
- For reference data coming from a network storage server, MVX Memory Cache holds a copy of the most-used data in RAM. In many cases, the cache can be sized to hold the entire reference data set in RAM. This can be log data, sensor data, stock tick data, or geographic image data. This cache is particularly beneficial when performing parallel computing, as in Hadoop systems, since it removes the bottleneck of the storage tier and keeps the data within the compute tier, sharing it among nodes from RAM to RAM.
- For working results, MVX Memory Motion allows any server operating system to attach a “virtual swap” drive that runs far faster than a physical swap drive. This permits a server to address a very large address space, which can speed up algorithms that need to keep working results in RAM.
- For intermediate results, MVX Memory Store provides servers with a “virtual RAMdisk” to store results at blazingly fast speeds. The RAMdisk can also be shared as a clustered file system among multiple servers, so it can communicate intermediate results from one stage of processing to the next.

In each case, the compute node is given a RAM-based resource that can be far larger than what any one machine could hold in memory. Furthermore, the resource is made available to the application without changes in its programs or in the underlying hardware.
The results are often dramatic. A financial investment model cut its run time from 92 minutes to 5.3 minutes. One application utilized 3 TB of pooled RAM for a reference data set, almost eliminating storage wait times for a 300-node cluster. A time series analysis with 100 variables completed 6X faster, and an analysis with 300 variables finished 17X faster.
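To put the quoted run times on a common footing, the short Python sketch below converts before and after times into a speedup factor and a fraction of time saved. The function names are illustrative only and are not part of any RNA product. The per-node figure assumes the 3 TB pool is spread evenly across the 300 nodes, which the text above does not state.

```python
def speedup(before_minutes: float, after_minutes: float) -> float:
    """Return how many times faster the 'after' run is than the 'before' run."""
    if before_minutes <= 0 or after_minutes <= 0:
        raise ValueError("run times must be positive")
    return before_minutes / after_minutes


def time_saved_fraction(factor: float) -> float:
    """Fraction of the original run time eliminated by a given speedup factor."""
    if factor <= 0:
        raise ValueError("speedup factor must be positive")
    return 1.0 - 1.0 / factor


# Financial investment model: 92 minutes reduced to 5.3 minutes.
financial = speedup(92.0, 5.3)
print(f"financial model: {financial:.1f}X faster, "
      f"{time_saved_fraction(financial):.1%} of run time saved")

# Time series analyses quoted as 6X and 17X.
for factor in (6.0, 17.0):
    print(f"{factor:.0f}X speedup saves {time_saved_fraction(factor):.1%} of run time")

# Hypothetical even split of the 3 TB pool over 300 nodes (decimal units).
per_node_gb = 3 * 1000 / 300
print(f"pooled RAM per node if evenly spread: {per_node_gb:.0f} GB")
```

By this arithmetic the financial model's improvement is roughly 17X, which is in line with the 6X to 17X range quoted for the time series analyses.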
A unique solution for analytics
By improving the utilization of RAM, MVX exploits the advances in data center network fabric that have been made in recent years. In addition, it gains advantage from the data reuse that is common in analysis and modeling. MVX is unique in providing only software, utilizing existing hardware without application tuning, disruptive server upgrades, or additional layers of infrastructure.
The alternatives fall short of what MVX provides. SMP or global shared memory systems are costly, have scaling restrictions, and involve additional hardware that is used only for specialized analytical workloads. Configuring ‘fat’ nodes with large amounts of RAM is far more costly than using commodity servers, and such nodes are less flexible when needs change. Fast conventional storage is 100X slower than memory virtualization using MVX, while SSD is still 10X slower than MVX and is not a shared resource.
For companies running modeling and simulation applications, accommodating large data sets and getting results fast are crucial to staying competitive. When analytics can be made fast enough, demand for them usually grows, as more “what-if” questions are analyzed. By addressing fundamental capacity constraints, RNA MVX extends capabilities and adds system scale at extremely low cost.



