Will Data Eat the World? Yes, With Some Help from Open Source

Alfred Essa, VP, Analytics and R&D, McGraw-Hill Education

“Six decades into the computer revolution, four decades since the invention of the microprocessor, and two decades into the rise of the modern Internet, all of the technology required to transform industries through software finally works and can be widely delivered at global scale.”
                                                                                             – Marc Andreessen

In 2011 Marc Andreessen famously declared that “software is eating the world.” Five years later, software’s victory is complete. The world now runs on software, and every competitive company, at its core, is a software company.

Ironically, the software revolution is only just beginning. The next wave of the internet’s digital transformation is already upon us, and it is being fueled by data, lots and lots of data. Digital data, which underlies the next generation of software, has three defining characteristics: it is pervasive, it is connected, and it is cognitive.

The pervasiveness of digital data is being driven by the Internet of Things (IoT), which means that everything in the physical world can potentially transmit and receive data. The connectedness of digital data is being enabled by cloud computing, a digital fabric that connects data disparate in source and structure while also providing the computational power to deliver new services at global scale. The cognitive nature of digital data is emerging from AI-powered software, whereby new techniques in statistical science, machine learning, and deep learning allow us to create intelligent software.

What is intelligent software? Traditional software applications perform predefined tasks which are explicitly programmed by human programmers. Intelligent software goes beyond predefined tasks by sensing, adapting, and learning automatically from new data. A powerful example is the cognitive software underlying IBM’s Watson. Imagine that Watson offers an initial medical diagnosis based on machine learning, but then goes further by carrying on a dialog with the attending physician to confirm the diagnosis and, further still, to weigh various treatment options. The most exciting direction for intelligent software is therefore augmented intelligence, in which machines augment rather than replace the intelligence of humans.
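To make the distinction concrete, here is a minimal sketch in Python, assuming scikit-learn is installed; the fever rule and the toy data are hypothetical illustrations, not Watson’s actual machinery.

from sklearn.linear_model import LogisticRegression

# Traditional software: behavior is fixed by an explicitly programmed rule.
def rule_based_diagnosis(temperature_c):
    return "fever" if temperature_c >= 38.0 else "normal"

# Intelligent software: behavior is induced from data, and retraining on new
# observations can shift the decision boundary without reprogramming.
temps = [[36.5], [37.0], [38.2], [39.1], [36.8], [38.6]]
labels = ["normal", "normal", "fever", "fever", "normal", "fever"]
model = LogisticRegression().fit(temps, labels)

print(rule_based_diagnosis(38.4))    # always the same answer for 38.4
print(model.predict([[38.4]])[0])    # an answer learned from the data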

In the Valley of the Geeks (not to be confused with Silicon Valley), open source magicians are laying in place a number of the foundational innovations that will enable the next generation of intelligent software. The first software revolution was made possible by open source technologies such as Linux, Apache, MySQL, PHP, TCP/IP, and Ethernet. Industry creatively co-opted these open source innovations and made them the basis of the first wave of software innovation. A similar dynamic is at play today. Open source, which includes the academic research community, is spawning new technologies and methodologies that are now beginning to form the heart of data-driven intelligent software.

Hadley Wickham and Fernando Perez are two superheroes leading the open source charge in analytics and data science. Wickham, originally from New Zealand, comes from the world of statistics. He did a stint as a professor at Rice University and is now chief scientist at RStudio. Perez, originally from Colombia, comes from the world of physics, with expertise in applied mathematics. Wickham and Perez are leaders in the open source R and Python communities. R and Python are programming languages, but they are also vibrant ecosystems of code, ideas, and methodologies at the center of modern data science. Wickham and Perez have contributed to multiple projects in the R and Python communities, but each is responsible for at least one Big Idea in data science.

Wickham is best known in the R community as the author of ggplot2, arguably the most elegant and powerful graphics package in the data science world. John Tukey, the father of data science, emphasized the importance of visualization and exploratory data analysis as a necessary prelude to model and algorithm construction. ggplot2 embodies a deep philosophy of visualization called the “grammar of graphics.” A grammar is the “fundamental principles or rules of an art or science.” Instead of approaching every visualization as a one-off, each with its own set of rules and logic, the grammar of graphics imposes a well-defined structure for building any visualization. Much like in ikebana, the Japanese art of flower arrangement, a visualization is built up in a series of layers and aesthetics while adhering to a set of simple rules. Working within a well-defined grammar frees the data scientist to focus on revealing the underlying structure, pattern, and meaning of the data.
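To make the grammar concrete, here is a minimal sketch in Python using plotnine, an open source port of ggplot2’s grammar of graphics; the plotnine package and its bundled mtcars sample data are assumptions of this sketch.

from plotnine import ggplot, aes, geom_point, geom_smooth, labs
from plotnine.data import mtcars  # sample dataset bundled with plotnine

# Data, aesthetic mappings, and geometric layers compose by rule:
plot = (
    ggplot(mtcars, aes(x="wt", y="mpg"))  # data plus aesthetic mapping
    + geom_point()                        # layer 1: the raw observations
    + geom_smooth(method="lm")            # layer 2: a fitted linear trend
    + labs(x="Weight (1000 lbs)", y="Miles per gallon")
)
plot.save("mpg_by_weight.png")

Swapping in a different geometry, or adding a facet layer, changes the chart without changing the structure of the code, which is precisely what a grammar buys you.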

Wickham has recently gone further by creating a new set of powerful libraries called the “tidyverse,” built on similar grammatical principles. At the core of the tidyverse is a package called dplyr, which embodies a “grammar of data manipulation.” The “two-by-four” of the data science world is a data structure called a data frame. A large part of data analysis, especially during the exploratory phase, consists of manipulating, arranging, and re-arranging data frames. The set and sequence of operations on data frames is a core part of data analysis, and it has also tended to be tedious and arcane. Wickham’s dplyr package provides an elegant and consistent method for handling the most common data manipulation tasks faced by data scientists. Wickham’s Big Idea has been to take a grammatical approach to data science and make it the basis of a series of innovative R libraries. That Big Idea has fundamentally transformed how data scientists approach their daily job.
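dplyr itself lives in R, but the same grammar-of-data-manipulation idea can be sketched in Python, with pandas method chaining standing in for dplyr’s mutate, filter, group_by, and summarise verbs; the toy sales data here are illustrative.

import pandas as pd

sales = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "units":  [10, 15, 7, 12],
    "price":  [2.0, 2.0, 3.0, 3.0],
})

summary = (
    sales
    .assign(revenue=lambda d: d.units * d.price)  # mutate: derive a column
    .query("units > 8")                           # filter: keep matching rows
    .groupby("region", as_index=False)            # group_by: split by region
    .agg(total_revenue=("revenue", "sum"))        # summarise: one row per group
)
print(summary)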

Perez comes from the world of modern science, which is increasingly built on computation and data-intensive modes of scientific discovery. Turing Award winner Jim Gray referred to this new approach to scientific investigation as the Fourth Paradigm. In the detection and confirmation of gravitational waves, for example, much of the experimental work relied on sophisticated data analysis and data science, and the preferred tool set for that analysis is Python. Indeed, Python is now the de facto standard for scientific computing and data analysis.

As a graduate student, Perez, along with some of his colleagues, developed an interactive command shell called IPython, together with a web-based interface called the IPython notebook. Now called Project Jupyter, Perez’s work is used on a daily basis by tens of thousands of scientists and data scientists. Perez’s great insight, which led to his Big Idea, was that the available tools in the programming world did not support the iterative workflow of science. Empirical science progresses in a cycle: we pose questions, form hypotheses, conduct experiments, and collect and analyze data. Based on the data, we adjust our questions and hypotheses, and the cycle begins all over again. Scientific innovation is sparked in part by how rapidly we can iterate through each cycle.
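Here is a minimal Python sketch of one turn of that cycle, of the sort that might run cell by cell in a notebook; the simulated signal and the hypothesis are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

# Hypothesis (illustrative): the measured signal has a mean greater than zero.
signal = rng.normal(loc=0.3, scale=1.0, size=200)  # "experiment": collect data

# Analyze: estimate the mean and a rough standard error.
estimate = signal.mean()
stderr = signal.std(ddof=1) / np.sqrt(signal.size)
print(f"mean = {estimate:.3f} +/- {stderr:.3f}")

# Adjust and iterate: in a notebook, one changes the sample size or the model
# above and reruns just the affected cells rather than the whole program.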

As an open source innovator, Perez led the development of a tool set that closely matches the scientific workflow. As an added benefit, the Jupyter notebook paradigm provides a mechanism for ensuring reproducible research. Much like a laboratory notebook, the Jupyter environment gives researchers the ability to annotate their work, collaborate in teams, and share and disseminate their findings. It seems inevitable that research papers will soon be published in the style of a Jupyter notebook, where the narrative is only one part of the research: the published paper, in the form of an interactive notebook, will also contain the data (or links to it) and the computation in the form of code.

As data-driven intelligent software begins to dominate the internet, spawning further software innovations, we should acknowledge that these emerging technologies stand on the shoulders of giants like Hadley Wickham and Fernando Perez.
