Goodbye Age of Hadoop – Hello Cambrian Explosion of Deep Learning

Summary:  Some observations about new major trends and directions in data science drawn from the Strata+Hadoop conference in San Jose last week.

I’m fresh off my annual field trip to the Strata+Hadoop conference in San Jose last week.  It is always exciting, energizing, and exhausting, but it remains the single best place to pick up on what’s changing in our profession.

This conference is on a world tour with four more stops before repeating next year.  The New York show is supposed to be a little bigger (hard to imagine), but the San Jose show is closest to our intellectual birthplace.  After all, this is the place where calling yourself a nerd would be regarded as a humblebrag.

I’ll try to briefly share the major themes and changes I found this year and will write later in more depth about some of these.


End of the Era of Hadoop

From the time it went open source in 2007, Hadoop and its related technologies have been profound drivers of the growth of data science.  Doug Cutting remains one of the three Strata conference chairs.  However, we all know that Hadoop/MapReduce has made its mark but is no longer cutting edge.  In fact, Apache Spark has eclipsed Hadoop, and it would be fair to say that Spark was last year’s big news.

To put a stake in it, O’Reilly announced at this year’s Strata+Hadoop that the conference would henceforth be known as the Strata Data Conference.  So farewell age of Hadoop.


Artificial Intelligence

I am as jaded as the next guy, and maybe a little more so, about the overhyped furor around AI.  As I walked the conference floor I felt compelled to challenge any vendor with the temerity to put AI in their descriptors.

Actually, there was very little of this at the show.  AI tends to be most overhyped when we’re talking about apps, but Strata+Hadoop is more about tools than apps.  Two or three vendors had, I thought, applied the AI frosting a little thick, but there were a few others where the label was appropriate and the capabilities pretty interesting.  More about these good guys later.

In the learning program there were again two or three sessions aimed at AI use cases in business, and these were uniformly well reasoned.  Specifically, that means acknowledging that AI is in its infancy and that, while you should keep an eye on it, investing now would be speculative at best.


The Cambrian Explosion in Deep Learning

One of the general session speakers used this phrase to describe the hockey-stick-like growth we’ve been experiencing in Deep Learning and AI in general.  The original use of the phrase is credited to Gill Pratt, the DARPA program manager who oversaw the DARPA Robotics Challenge.

If you remember a little of your earth history, life trundled along as one-celled creatures for billions of years until about half a billion years ago when, at the beginning of the Cambrian period, it diversified in a way that can truly be characterized as an explosion.  The academic theory is that a very small change, like the evolution of sight organs, so changed the playing field that exploiting this new capability drove the development of additional capabilities that – you know – resulted in us.

So while data scientists are a little cautious about talking up the wonders of artificial intelligence, they are very enthusiastic about the new capabilities presented by Deep Learning.  This may seem a little paradoxical, but I invite you to think about it this way.

Robust AI is the accumulation of capabilities – speech, text, NLP, image processing, robotics, knowledge retrieval, and several other human-like capabilities – that at this point are very early in development and not at all well or easily integrated.

Deep Learning, however, is a group of tools we are applying to develop these capabilities, including Convolutional Neural Nets, Recurrent Neural Nets, Generative Adversarial Networks, and Reinforcement Learning, to name the most popular.  All of these are subsets of Deep Learning, and all are accessed through the newly emerging Deep Learning platforms like TensorFlow, MXNet, Theano, Torch, and several others.
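To make concrete what these platforms actually automate, here is a toy two-layer network learning XOR with plain NumPy – a generic sketch of the forward pass, gradient computation, and weight updates that frameworks like TensorFlow handle for you, not tied to any one platform’s API:

```python
import numpy as np

# A tiny two-layer network learning XOR by hand.
# Deep learning platforms automate exactly these steps:
# forward pass, gradient computation, and weight updates.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
W1 = rng.normal(size=(2, 8))   # input -> hidden weights
W2 = rng.normal(size=(8, 1))   # hidden -> output weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

initial_mse = np.mean((sigmoid(sigmoid(X @ W1) @ W2) - y) ** 2)
for _ in range(5000):
    h = sigmoid(X @ W1)                        # forward pass, hidden layer
    out = sigmoid(h @ W2)                      # forward pass, output layer
    grad_out = (out - y) * out * (1 - out)     # backprop through output
    grad_h = (grad_out @ W2.T) * h * (1 - h)   # backprop through hidden layer
    W2 -= 0.5 * h.T @ grad_out                 # gradient-descent updates
    W1 -= 0.5 * X.T @ grad_h
final_mse = np.mean((out - y) ** 2)
```

Writing even this much by hand for a real CNN or RNN is impractical, which is exactly why the platform war matters.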

Like all platform battles, whoever gains the most users will be the next iOS, Android, or Windows.  Right now Google’s TensorFlow appears to be in the lead, and at least four or five program sessions, some of them full-day, were oversubscribed providing both general guidance and hands-on training in TensorFlow.  So while the buzz around AI was appropriately subdued, the enthusiasm for learning TensorFlow was in full flower.  The emergence of Deep Learning platforms may be the slight evolutionary change that triggers the explosion of AI.


Platform Convergence

In the beginning you could pick a portion of the data science workflow and build a successful business there.  Many of today’s largest companies got their start this way.  Not anymore.  Now everybody wants to be an end-to-end platform, from data source to the deployment of models and other forms of exploitation.  Whoever gains the most users will win, and once a platform is adopted the pain of switching will be high – the same dynamic that keeps enterprise ERP systems so sticky.

Over the last few years we’ve seen analytic platforms like SAS and SPSS add full data access and blending capability.  We’ve seen blending platforms like Alteryx extend into analytics and visualization.  So here are two new and rather unexpected additions to the full-spectrum platform game:

Cloudera announces its own Data Science Workbench with capabilities in R, Python, and Scala.

Intel (yes, Intel), which just paid $15 billion for Mobileye to seize its place in the self-driving car space, is rolling out two data science platforms, Saffron and Nirvana, one aimed at IoT and the other at deep learning.


DataOps and Data Engineers

As recently as a year or so ago the term ‘data scientist’ covered both someone doing predictive analytics and the person you would turn to to implement Spark or a data lake.  Thankfully, in fairly short order we have come to differentiate Data Scientists from Data Engineers and to acknowledge the engineers’ special skill set, which blends traditional CS skills with the new disciplines needed to store, extract, and prepare data for data scientists.

Now that this differentiation is a little clearer, we see a parallel rise in a new category of tools and platforms best described as DataOps.  Philosophically similar to DevOps, DataOps tools and platforms are aimed at regularizing and simplifying the tasks of Data Engineers, particularly tasks that may need to be repeated dozens or even hundreds of times for different data sources and destinations.  Two startups, Nexla and DataKitchen, take a fairly narrow but deep view.  Others like Qubole are laying claim to this area by better defining capabilities within their existing platforms.


Emerging Productivity Enhancements for Data Scientists

We may think the business world is populated by companies with just a few (if any) data scientists working together, and for the most part we’d be right.  However, this is not the market most vendors at Strata are interested in.  They are pursuing wallet share among the Global 8000 – the 8000 companies with more than $1 billion in revenue and, almost without exception, a commitment to predictive analytics.

I haven’t seen any specific data, but an informal poll of vendors suggests these companies employ from 20 to several hundred data scientists each.  When you have that many data scientists in one place you have to start thinking about efficiency and productivity.  And there’s a major theme for this year – productivity enhancements for data scientists.

The list of vendors with this focus is too long for this article, and DataOps, just above, is part of it.  Here are just a few mentions of notable companies and their approaches.

DataRobot:  We reviewed DataRobot a year ago when it looked like predictive analytics was about to be fully automated and data scientists unemployed by 2025.  That was a little premature.  However, DataRobot has found a foothold by dramatically speeding up model development.  This is one-click-to-model.  Their platform cleans data, does feature engineering and identification, runs thousands of potential model/hyperparameter combinations in parallel, and deploys champion models in a fraction of the time it would take a team of data scientists.
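The core idea of searching model/hyperparameter combinations in parallel and keeping the champion can be sketched in miniature with scikit-learn – an illustrative toy, not DataRobot’s actual implementation, and at a tiny fraction of their scale:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# A small labeled dataset standing in for a real business problem.
X, y = load_breast_cancer(return_X_y=True)

# Try every hyperparameter combination in parallel (n_jobs=-1),
# score each with cross-validation, and keep the champion model.
search = GridSearchCV(
    LogisticRegression(max_iter=5000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Platforms in this space run the same loop across many model families and thousands of combinations, which is where the time savings come from.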

SQLstream:  Deploy a blazing-fast stream processing system in a fraction of the time and with a fraction of the compute resources that frameworks like Spark require.  Make it so easy to manage that very little is needed of data engineers, and make it easy to change the logic and models within the stream without a team of data scientists.

Bonsai:  TensorFlow is complex and tough to learn.  Bonsai is introducing a higher-level language that looks a lot like Python but manipulates deep learning algorithms across all the major deep learning platforms.  Their initial target is reinforcement learning for robotics, and the payoff is easing the shortage of deep-learning-qualified data scientists that bottlenecks development.

Qubole:  Makes it easy to almost instantly establish a big data repository and begin analytics.  You can’t completely replace data engineers, but with this SaaS implementation you can dramatically increase the number of data scientists each engineer can support.


Emergence of the Data Science Appliance

Similar to the productivity enhancements, but aimed at business users who want solutions without needing to know the underlying data science, is a group of offerings that intentionally hide the data science and focus on the business problem.

Anodot:  Delivers a sophisticated anomaly detection system that looks at all your streaming data and decides both what’s anomalous and what’s important.  This is catching on among ecommerce vendors and digital enterprises, some of whom have reportedly thrown out their internally developed anomaly detectors in favor of Anodot’s offering.

GeoStrategies:  This company uses GIS data for site location and market identification and penetration studies.  Lots of sophisticated platforms can do that too but GeoStrategies goes out of its way to hide the data science in favor of a UI that’s very intuitive for their business users.


Women in Data Science

Finally, my unscientific tally was that about 20% of attendees, and a slightly higher percentage of presenters, were women.  This may not be representative of our profession as a whole, since conference attendees may have a different profile than the industry at large.  Still, while we might wish it were closer to 50/50, I thought participation by our female colleagues was a reasonably strong showing.



About the author:  Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist and commercial predictive modeler since 2001.  He can be reached at:

Original post here

Posted by William Vorhies on March 20, 2017 at 4:48pm


The Big Data Trends of 2016

Solr + Hadoop = Big Data Love

2016’s BIG Three Trends are:

  1. Apache Spark production deployments
  2. Conversion from other platforms to Hadoop
  3. Leveraging Hadoop for advanced use cases

One of the real applications of next-generation parallel and distributed systems is big-data analytics.  Analysis tasks often have hard deadlines, and data quality is a major concern in many applications.  For most emerging applications, data-driven models and methods capable of operating at scale are still unknown.


Hadoop, a framework and collection of tools for processing enormous data sets, was originally designed to work on clusters of physical machines.  That has changed.

Distributed analytic frameworks such as MapReduce are evolving into distributed resource managers that are gradually transforming Hadoop into a general-purpose data operating system.  With these frameworks, one can perform a broad range of data manipulation and analytics operations by plugging them into Hadoop as the distributed file storage system.
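The MapReduce programming model itself is simple enough to sketch in a few lines.  This toy word count in plain Python mirrors the three phases Hadoop distributes across a cluster – map emits key/value pairs, the framework shuffles them by key, and reduce aggregates:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the document.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data love", "solr hadoop big data"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(d) for d in docs)))
```

Hadoop’s contribution is not the model but running these phases fault-tolerantly over petabytes spread across thousands of machines.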


The combination of big data and compute power also lets analysts explore new behavioral data throughout the day, such as websites visited or location.

Big data isn’t always particularly big; it can be as much about the complexity of processing the data as about volume or data types.


Hadoop is arguably the first data operating system, which is what makes it so powerful and why large enterprises are interested in it.  But maybe they’re not all converts yet.

Research shows that 45 percent of big companies say they’re doing a Hadoop proof of concept, with 16 percent using it in production.

Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware.  It provides massive storage for any kind of data, enormous processing power, and the ability to handle a virtually limitless number of concurrent tasks or jobs.

Many people use the open-source Hadoop project to process large data sets because it’s an excellent solution for scalable, reliable data-processing workflows.  Hadoop is by far the most popular system for handling big data, with companies using massive clusters to store and process petabytes of data on thousands of servers.



Solr is highly reliable, scalable, and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration, and more.  Solr powers the search and navigation features of many of the world’s largest internet sites.


  1. Uses the Lucene library for full-text search
  2. Faceted navigation
  3. Hit highlighting
  4. Query language supports structured as well as textual search
  5. Schema-less mode and Schema REST API
  6. JSON, XML, PHP, Ruby, Python, XSLT, Velocity and custom Java binary output formats over HTTP
  7. HTML administration interface
  8. Built-in security: Authentication, Authorization, SSL
  9. Replication to separate Solr servers – facilitates scaling QPS and High Availability
  10. Distributed Search through Sharding – enables scaling content volume
  11. Search results clustering based on Carrot2
  12. Extensible through plugins
  13. Flexible relevance – boost through function queries
  14. Caching – queries, filters, and documents
  15. Embeddable in a Java Application
  16. Geo-spatial search, including multiple points per document and polygons
  17. Automated management of large clusters through ZooKeeper
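As a small usage example of the faceted navigation and JSON output features listed above, a Solr select query is just an HTTP request with query parameters.  The sketch below only builds the URL, since sending it requires a running Solr instance; the collection name `products`, the field `brand`, and the default local port are all assumptions for illustration:

```python
from urllib.parse import urlencode

# Compose a faceted Solr select query as a plain URL.
params = {
    "q": "laptop",           # full-text query (Lucene syntax)
    "facet": "true",         # enable faceted navigation
    "facet.field": "brand",  # facet on the hypothetical 'brand' field
    "wt": "json",            # ask for a JSON response
    "rows": 10,              # return the top 10 hits
}
url = "http://localhost:8983/solr/products/select?" + urlencode(params)
```

Fetching that URL against a live instance would return the matching documents plus per-brand counts for building a faceted navigation sidebar.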


The opinions expressed in this article are the author’s own and do not reflect the view of the organization. Original post here


NoSQL and Hadoop for Solving Big Data #WhiteboardWalkthrough

In this week’s Whiteboard Walkthrough, Dale Kim, Director of Industry Solutions at MapR, gets you up to speed on Apache Hadoop and NoSQL. He talks about the similarities and differences between the two, but most importantly how both technologies should be a requirement for any true big data environment.



Hadoop Tutorial Build a Real Time Analytic dashboard with Solr Search and Spark Streaming

Search is a great way to interactively explore your data.  The Search App is continuously improving and now comes with better support for real time!


Coursera free training: Introduction to Big Data. University of California, San Diego


Interested in increasing your knowledge of the Big Data landscape? This course is for those new to data science and interested in understanding why the Big Data Era has come to be. It is for those who want to become conversant with the terminology and the core concepts behind big data problems, applications, and systems. It is for those who want to start thinking about how Big Data might be useful in their business or career. It provides an introduction to one of the most common frameworks, Hadoop, that has made big data analysis easier and more accessible — increasing the potential for data to transform our world!

At the end of this course, you will be able to:

* Describe the Big Data landscape, including examples of real-world big data problems and the three key sources of Big Data: people, organizations, and sensors.

* Explain the V’s of Big Data (volume, velocity, variety, veracity, valence, and value) and why each impacts data collection, monitoring, storage, analysis and reporting.

* Get value out of Big Data by using a 5-step process to structure your analysis.

* Identify what are and what are not big data problems and be able to recast big data problems as data science questions.

* Provide an explanation of the architectural components and programming models used for scalable big data analysis.

* Summarize the features and value of core Hadoop stack components including the YARN resource and job management system, the HDFS file system and the MapReduce programming model.

* Install and run a program using Hadoop!

This course is for those new to data science. No prior programming experience is needed, although the ability to install applications and utilize a virtual machine is necessary to complete the hands-on assignments.

Hardware Requirements:
(A) Quad Core Processor (VT-x or AMD-V support recommended), 64-bit; (B) 8 GB RAM; (C) 20 GB free disk.  How to find your hardware information: (Windows): Open System by clicking the Start button, right-clicking Computer, and then clicking Properties; (Mac): Open Overview by clicking on the Apple menu and clicking “About This Mac.”  Most computers with 8 GB RAM purchased in the last 3 years will meet the minimum requirements.  You will need a high-speed internet connection because you will be downloading files up to 4 GB in size.

Software Requirements:
This course relies on several open-source software tools, including Apache Hadoop.  All required software can be downloaded and installed free of charge.  Software requirements include: Windows 7+, Mac OS X 10.10+, Ubuntu 14.04+, or CentOS 6+; VirtualBox 5+.

University of California, San Diego

More info: