Eddie Dunn's Blog

Musings and Insights from the Mind of a Human


Big Everything – The new frontier of computing


Eddie Dunn – University of North Carolina Wilmington

I. Introduction

The term “Big Data” is on track to eclipse the venerable “Cloud Computing” buzzword of recent years. The term is not new: it is traditionally and technically defined as data too large to be processed by anything short of a “supercomputer”. In today’s terms that means exabytes, but it is a constantly moving target and we will soon be using zettabytes (1). I would suggest that this phenomenon is not limited to the space that we have defined. In fact, a better term might be “Big Everything”. We have arrived at a crossroads in our fast-changing discipline that requires us to rethink very basic aspects of the way our craft has been practiced for as long as most of us can remember. Hardware manufacturers have done a stellar job of keeping up with Moore’s law, even when the speed of light and power requirements forced them into the parallel and data-parallel computational models, and the benefits of that work are coming to fruition in a big way. What are still evolving are the software tools and training needed for the rest of us to use those resources effectively, not only to tackle “big” problems but also in our day-to-day jobs and lives. With techniques like MapReduce (2), its famous Apache Hadoop implementation and related ecosystem, and the proliferation of GPU and CPU cores, a computation that once took days can now theoretically be performed in minutes (or soon seconds) on hardware of roughly the same cost.

II. Architecture

In 1966 Michael Flynn came up with a taxonomy of computing that offers an excellent way to discuss the different modes of computation coming online from the hardware vendors. He identified four major types of computer architecture (3).

1. Single Instruction, Single Data stream (SISD) – This is the von Neumann model that is taught in basic computer architecture courses. One instruction is executed on one piece of data at any given time. The traditional x86 (Pentium-class) chip is an example of this.

2. Single Instruction, Multiple Data streams (SIMD) – Also known as data parallel. A single instruction is applied to multiple pieces of data at once. The GPU is an example of this architecture.

3. Multiple Instruction, Single Data stream (MISD) – In this model multiple instructions are executed on the same piece of data. This is the least common architecture type. The control systems in the Space Shuttle are a commonly cited example of this computing model.

4. Multiple Instruction, Multiple Data streams (MIMD) – In this architecture multiple separate instructions are executed on multiple separate pieces of data at the same time. Modern multi-core processors from AMD and Intel represent this model. This model is also commonly divided into sub-models: 1) Single Program, Multiple Data (SPMD), which is cited as the most common form of parallel computing; MapReduce in its most basic form follows this model. 2) Multiple Program, Multiple Data (MPMD). This can be either a shared-memory model, such as a modern multi-core processor, or a distributed-memory model, such as the MPI architecture.
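
To make the SPMD sub-model a little more concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the names sum_of_squares, partition, and spmd_sum_of_squares are hypothetical, and a thread pool stands in for the separate cores or nodes a real SPMD job would use.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_of_squares(values):
    """The single 'program' that every worker runs on its own slice of data."""
    return sum(v * v for v in values)


def partition(data, parts):
    """Split data into `parts` contiguous slices of nearly equal size."""
    size, extra = divmod(len(data), parts)
    slices, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        slices.append(data[start:end])
        start = end
    return slices


def spmd_sum_of_squares(data, workers=4):
    """SPMD: same program, multiple data partitions, partial results combined."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker applies the same function to a different partition.
        return sum(pool.map(sum_of_squares, partition(data, workers)))


data = list(range(1, 11))
sequential = sum_of_squares(data)     # SISD-style: one pass over one stream
parallel = spmd_sum_of_squares(data)  # SPMD-style: four partitions
```

Both calls produce the same answer; only the shape of the computation differs. In standard CPython, threads show the structure rather than deliver a speedup for CPU-bound work, so a process pool or a cluster framework would be the analogous choice in practice.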

In fact, many of the “rules of thumb” that we as a discipline have come to rely on turn out to cover only one use case of Flynn’s taxonomy (SISD), and this oversight is rapidly overtaking us. SISD has reached the end of its ability to bring novelty to our discipline. The power of SIMD and MIMD is here now. We need to gather the communities, tools, and paradigms to confront these challenges in a proactive manner. It is through the lens of the massively parallel, and the better model resolution it affords, that we can, as a discipline, keep up our end of the pragmatic and life-saving power that is modern technology.

III. Languages

On the coding side of this picture, it was once thought that in object-oriented languages we could solve the problems of concurrency, and the non-determinism it introduces, through bolt-on language features. This has taken us a large step towards that end. Still, anyone who has done any high-concurrency programming with these features can attest to the very subtle and hard-to-debug issues these bolt-on additions and constructs tend to create. We have heard for years now the buzz created by multi-paradigm languages such as Scala, F# and the like, and the mantra of why we all need closures and functional language features. Recent work concludes that, contrary to popular belief, functional languages do not incur the performance penalty once thought (4). There is no argument that the design patterns and anti-patterns that will best adapt to the new hardware architectures will need to use the paradigms afforded by functional languages, added to imperative (and even declarative) toolsets, but the problem is much more complex! The benefit of functional constructs is no new idea; however, tailoring a program’s composition, structure, and execution to best utilize the hardware topology is a very hard problem! The complexity of the hardware possibilities, and the correspondingly complex set of performance characteristics introduced as a result, is mind-boggling. The most commonly stated problem with bolt-on concurrency is that of unnecessarily broadly scoped variables and the race conditions they create when lots of execution units attempt access. We need the tools and design patterns (or anti-patterns) to productively write code that we can be confident will run as we expect in any of the multitude of execution environments that pervade the digital landscape.
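
The contrast between shared, broadly scoped state and a functional style can be sketched in a few lines of Python. This is a hedged illustration rather than a recommended pattern: the function names are hypothetical, and counting even numbers is just a stand-in workload.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def count_evens_shared(chunks):
    """Bolt-on style: every thread updates one shared variable, so a lock is needed."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        for x in chunk:
            if x % 2 == 0:
                # Without this lock the read-modify-write below could race.
                with lock:
                    total += 1

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def count_evens_functional(chunks):
    """Functional style: workers share nothing and return values to be combined."""
    def count(chunk):
        return sum(1 for x in chunk if x % 2 == 0)

    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(count, chunks))
    return reduce(lambda a, b: a + b, partials, 0)


chunks = [list(range(0, 100)), list(range(100, 200)), list(range(200, 300))]
shared_result = count_evens_shared(chunks)
functional_result = count_evens_functional(chunks)
```

The two versions agree, but the first is only correct because someone remembered the lock. The second has no shared variable to protect, which is the property that makes the functional approach attractive as core counts grow.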

What new skills and tools should we be imparting to future generations of computer scientists? The languages most commonly mentioned in this new context seem to be Erlang, Haskell, Clojure, Scala, and Go. These languages were designed with concurrency in mind. New languages and paradigms for concurrency are only a part of the puzzle. We also need fundamentally new ways of conceptualizing data processing on a massive scale.

IV. Storage and Processing

Now what about how to store, organize, and process all this data? While not without its dissenters (5), the MapReduce data processing paradigm, with the success that has followed in its wake, is just one example of many approaches being explored. The data stores used in these systems are typically of a key/value nature. Google’s Bigtable, Amazon S3, and Hadoop HDFS are all examples of this type of model. This model allows data to be loaded very quickly and in an ad hoc manner that is crucial to the success of these systems. The real business problem that these new analytics players are leveraging is that traditional business intelligence models typically have a process called Extract, Transform, Load (ETL) that transforms their transaction-based, mutable “live” database (along with other data of various types) into the OLAP cube the analysts use to provide actionable information to the leadership of the organization. Many organizations using traditional data warehousing products are finding it difficult for the ETL phase to stay current, and decisions have to be made based on progressively less current (i.e. less accurate) data. The real power of these new MapReduce-based storage, processing, and retrieval techniques is their ability to process previously unimaginable quantities of data with clusters of commodity hardware that can be rented by the hour. The database naysayers’ only argument is that you can pay (royally) to have RDBMS systems scale to whatever level of performance is desired. Unfortunately, the vast majority of organizations with data that could be put to use do not have the budget to make the big RDBMS systems do what can theoretically be done for a fraction of the cost in these new models. What is currently missing are people who have implementation skills and practical knowledge of the limitations of these new techniques.

One of the most commonly repeated criticisms of the big data trend is that it creates a divide between an elite group of folks who can leverage these tools and those who cannot (6) (7). Some are envisioning a unified system that will employ concepts and processing models from both the RDBMS and MapReduce spaces to provide a best-of-both-worlds system that delivers the desired performance characteristics for a given workload (8).
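
For readers who have not seen the MapReduce paradigm discussed in this section, the classic word count can be sketched on a single machine in plain Python. This is a toy under stated assumptions: the function names are my own, nothing here is the Hadoop API, and a real framework would run the map and reduce phases across many nodes and handle the shuffle itself.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) key/value pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group all emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle_phase(pairs).items())


counts = word_count(["big data big compute", "big everything"])
```

Every intermediate record is a key/value pair, which is why key/value stores fit this style of processing so naturally. Each call to map_phase and reduce_phase is independent of the others, so they can be farmed out to as many machines as are available.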

V. Analysts

With the realization of the analyst’s position in the Big Data puzzle, some have suggested packages such as R, Matlab, Octave, SPSS, and SAS as platforms very well suited to provide a familiar interface from the world of the analyst to the world of designing software and systems. This would bring us closer to realizing much more fully the power that is latent in an organization better understanding itself through its data. These packages can and should have the ability to abstract away the details of where data is, where it needs to be, and how code is executed. This is a natural marriage with big data, scientific computing, and the massively parallel. These packages are in a unique position to immediately effect change on a large segment of the world’s problem solvers, who have the skills in mathematics and statistics and are already familiar with these tools. In a recent article, “Efficient Statistical Computing on Multicore and MultiGPU Systems”, the authors describe just such a scenario: they rewrite common statistical algorithms (chi-squared distributions, the Pearson correlation coefficient, and unary linear regression) in MPI and CUDA and provide interfaces from the R language to use multiple CPU and GPU clusters to speed up calculations (9). They achieved a 20x speedup with a four-node MPI implementation and a 15x speedup with three GPUs in the CUDA implementation. The company Revolution Analytics has an open-core version of R, for which it sells support, that is attempting to close this gap from the industry side (10) (11) (12).
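
To give a feel for why a statistic like the Pearson correlation coefficient parallelizes well, here is a rough Python sketch. It is not the implementation from the cited article, which used MPI and CUDA behind an R interface; the function names are hypothetical, and the approach shown (per-partition sums combined at the end) is simply one common way to split the work.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def partial_sums(pairs):
    """Per-partition sufficient statistics: n, sum x, sum y, sum x^2, sum y^2, sum xy."""
    n = len(pairs)
    sx = sum(x for x, _ in pairs)
    sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    syy = sum(y * y for _, y in pairs)
    sxy = sum(x * y for x, y in pairs)
    return (n, sx, sy, sxx, syy, sxy)


def combine(a, b):
    """Merge two partial results by element-wise addition."""
    return tuple(p + q for p, q in zip(a, b))


def pearson(chunks):
    """Pearson r computed from partitions that are processed independently."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(partial_sums, chunks))
    n, sx, sy, sxx, syy, sxy = reduce(combine, partials)
    numerator = n * sxy - sx * sy
    denominator = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return numerator / denominator


# Hypothetical data: y = 2x + 1, split into four partitions of two points each.
points = [(x, 2 * x + 1) for x in range(1, 9)]
chunks = [points[i:i + 2] for i in range(0, len(points), 2)]
r = pearson(chunks)
```

Because each partition only has to report six numbers, the communication cost stays tiny no matter how large the partitions are. This sum-based formula can lose precision on badly scaled data, so a production version would likely use a more numerically stable update.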

VI. Deep Learning Networks

Areas such as machine learning and artificial intelligence, as well as more traditional statistical techniques, are big players in this game. In fact, it seems that these concepts and techniques will in all likelihood shed the most light on how we as designers of software systems can best leverage the resources available, and do our part to bring about a fuller realization of Moore’s Law, measured by the capability of our computer systems and associated physical devices to do human tasks. This all raises the question of what real business we are accomplishing in our field, beyond solely helping organizations make better decisions. Richard Bellman, with his 50-year-old dynamic programming theory, correctly pointed out what today’s AI and big data researchers are constantly reminded of: as the number of dimensions in a pattern classification application increases linearly, the computational complexity increases exponentially. He coined the term “the curse of dimensionality” to describe this observation (13). The traditional way to battle this curse is to pre-process the data into fewer dimensions (a process called feature extraction) in an effort to reduce the overall computational complexity. Recent discoveries in neuroscience suggest that the neocortex in our brain does not do feature extraction as we have come to know it, and actually seems to propagate unaltered signals through a hierarchy that over time learns to robustly represent observations in a spatio-temporal manner. A new area of artificial intelligence research called Deep Machine Learning has cropped up around this insight, with the goal of more accurately modeling the way our brain functions (14).
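
A quick way to see the curse of dimensionality is to count the cells in a grid that covers the feature space. The snippet below is a common textbook-style illustration rather than Bellman's own formulation, and the name grid_cells and the choice of 10 bins per dimension are assumptions made for the example.

```python
def grid_cells(bins_per_dimension, dimensions):
    """Number of cells needed to cover the space at a fixed resolution."""
    return bins_per_dimension ** dimensions


# Dimensions grow linearly (1, 2, 3, ...) while the cell count grows exponentially.
growth = {d: grid_cells(10, d) for d in range(1, 7)}
```

Adding one more feature multiplies the number of cells by ten, so the amount of data and computation needed to cover the space evenly explodes long before the feature count looks large. Feature extraction attacks this by shrinking the exponent.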

VII. Conclusion

We are experiencing exponential growth not only in the number of transistors on a silicon chip but also in our ability to bring technology to bear on automating many more human tasks (15). We have cars that drive themselves, and the largest thing holding back an automated transportation system is people’s fears. We are on the brink of having wearable “heads-up display” computers that can help us in our most basic daily interactions. Thanks to leading artificial intelligence researchers, Google (for publishing its secret sauce), and the multi-billion-dollar game industry for pushing the limits in 3-D virtual worlds, we now have at our fingertips the power to literally transform our reality. We just have to figure out how to write the software to bring it to bear!

Works Cited

1. Big Data: Issues and Challenges Moving Forward. Kaisler, Stephen, et al. Manoa, Hawaii: IEEE, 2013. 46th Hawaii International Conference on System Sciences. pp. 995-1004.

2. Dean, Jeffrey and Ghemawat, Sanjay. MapReduce: simplified data processing on large clusters. Communications of the ACM. January 2008, pp. 107-113.

3. Some Computer Organizations and Their Effectiveness. Flynn, Michael J. 1972, IEEE Transactions on Computers, pp. 948-960.

4. Combining Functional and Imperative Programming for Multicore Software: An Empirical Study Evaluating Scala and Java. Pankratius, Victor, Schmidt, Felix and Garreton, Gilda. Zurich, Switzerland: IEEE, 2012. International Conference on Software Engineering. pp. 123-133.

5. Kraska, Tim. Finding the Needle in the Big Data Haystack. IEEE Internet Computing. Jan-Feb 2013, pp. 84-86.

6. Leavitt, Neal. Bringing Big Analytics to the Masses. Computer. January 2013, pp. 20-23.

7. Six Provocations for Big Data. Boyd, Danah and Crawford, Kate. Oxford, UK: Oxford Internet Institute, 2011.

8. Beyond Simple Integration of RDBMS and MapReduce – Paving the Way toward a Unified System for Big Data Analytics: Vision and Progress. Qin, Xiongpai, et al. Xiangtan, China: IEEE, 2012. Second International Conference on Cloud and Green Computing. pp. 716-725.

9. Efficient Statistical Computing on Multicore and MultiGPU Systems. Ou, Yulong, et al. Melbourne, Australia: IEEE, 2012. 15th International Conference on Network-Based Information Systems. pp. 709-714.

10. Smith, David. R Competition Brings Out the Best in Data Analytics. s.l.: Revolution Analytics, 2011. White Paper.

11. Revolution Analytics. Advanced “Big Data” Analytics with R and Hadoop. s.l.: Revolution Analytics, 2011. White Paper.

12. A Platform for Parallel R-based Analytics on Cloud Infrastructure. Patel, Ishan, Rau-Chaplin, Andrew and Varghese, Blesson. Pittsburgh, PA: IEEE, 2012. 41st International Conference on Parallel Processing Workshops. pp. 188-193.

13. Bellman, Richard. Dynamic Programming. Princeton, NJ: Princeton University Press, 1957.

14. Arel, Itamar, Rose, Derek C and Karnowski, Thomas P. Deep Machine Learning – A New Frontier in Artificial Intelligence Research. IEEE Computational Intelligence Magazine. November 2010, pp. 13-18.

15. Kurzweil, Ray. How To Create a Mind. New York, NY: Viking Published by the Penguin Group, 2012.

Written by tmwsiy

March 4th, 2013 at 10:44 am