The Mandelbrot Set – The only video you need to see!

This video has been edited together from several other video documentaries to describe the Mandelbrot set: an incredibly simple mathematical formula that gives rise to fractals and fractal geometry! Several mathematicians and scientists explain the phenomenon in clear detail. Please enjoy!
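The rule behind the set is small enough to sketch in a few lines of Python (illustrative only, not from the video): a point c belongs to the Mandelbrot set if iterating z → z² + c from zero never escapes to infinity.

```python
def in_mandelbrot(c, max_iter=100):
    """True if c appears to stay bounded under the iteration z -> z*z + c."""
    z = 0j
    for _ in range(max_iter):
        z = z * z + c
        if abs(z) > 2:   # once |z| exceeds 2, the orbit escapes to infinity
            return False
    return True

print(in_mandelbrot(0j))      # True  (the origin never escapes)
print(in_mandelbrot(1 + 0j))  # False (0 -> 1 -> 2 -> 5 -> ...)
```

Sampling this function over a grid of complex numbers and coloring by escape time is all it takes to draw the familiar fractal.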

We are not the owners of these video clips nor do we claim to be. This video is for educational & entertainment purposes only.



Difference of Data Science, Machine Learning and Data Mining

Data is almost everywhere. The amount of digital data in existence is growing at a rapid pace, doubling roughly every two years, and it is transforming our basic mode of existence. According to a paper from IBM, about 2.5 billion gigabytes of data were generated daily in 2012. An article from Forbes adds that data is growing faster than ever, suggesting that by 2020 about 1.7 megabytes of new information will be created every second for every human inhabitant of the planet. As data grows at this pace, new terms associated with processing and handling it keep appearing, including data science, data mining and machine learning. In the following sections, we will take a detailed look at each of these terms.

What is data science?

Data science deals with both structured and unstructured data. It is a field that includes everything associated with the cleansing, preparation and final analysis of data, combining programming, logical reasoning, mathematics and statistics. It captures data in ingenious ways and encourages looking at it from different perspectives, as well as cleansing, preparing and aligning it. Put simply, data science is an umbrella term for the techniques used to extract information and insights from data. Data scientists are responsible for creating data products and other data-based applications that deal with data in ways conventional systems cannot.

What is data mining? 

Data mining is simply the process of extracting previously incomprehensible and unknown information from huge databases and then using it to make relevant business decisions. Put simply, data mining is a set of methods used in the knowledge-discovery process to distinguish previously unknown relationships and patterns. We can therefore view data mining as a confluence of other fields such as artificial intelligence, database management, pattern recognition, data visualization, machine learning and statistics. The primary goal of data mining is to extract information from various data sets and transform it into proper, understandable structures for eventual use. Data mining is thus a process used by data scientists and machine learning practitioners to convert large sets of data into something more usable.

What is machine learning? 

Machine learning is a kind of artificial intelligence that gives computers the ability to learn from new data sets without being explicitly programmed. It focuses primarily on developing computer programs that can change when exposed to new data. Machine learning and data mining follow much the same process, but they are not the same thing. Machine learning is a method of data analysis that automates model building: it uses algorithms that iteratively learn from data, letting computers find apparently hidden insights without being told explicitly where to look. To get the best results, these complex algorithms must be paired with the right processes and tools.
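"Iteratively learn from data" can be made concrete with a toy example (illustrative only): gradient descent repeatedly adjusting a single weight to fit data generated by y = 2x. No rule "multiply by 2" is ever programmed; it emerges from the iterations.

```python
# Illustrative only: gradient descent fitting y = w * x to data made with w = 2.
data = [(x, 2.0 * x) for x in range(1, 6)]

w = 0.0    # initial guess
lr = 0.01  # learning rate
for _ in range(200):
    # each iteration nudges w against the gradient of the squared error
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad

print(round(w, 3))  # 2.0
```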

What is the difference between these three terms?

As we mentioned earlier, data scientists are responsible for coming up with data-centric products and applications that handle data in ways conventional systems cannot. Data science is much more focused on the technical abilities needed to handle any type of data. Unlike data mining and machine learning, it is also responsible for assessing the impact of data on a specific product or organization.

While data science focuses on the science of data, data mining is concerned with the process: discovering new patterns in big data sets. It may appear similar to machine learning because both involve algorithms. However, in machine learning, algorithms are the means of gaining knowledge from data sets, whereas in data mining, algorithms are combined only as one part of a larger process; data mining does not focus on algorithms alone.


Original post here

Posted by Leonard Heiler on March 20, 2017 at 10:30am


Free Machine Learning eBooks – March 2017

Here are three eBooks available for free.


Edited by Abdelhamid Mellouk and Abdennacer Chebira

Machine learning can be defined in various ways, all relating to a scientific domain concerned with the design and development of theoretical and implementation tools that allow building systems with some human-like intelligent behaviour.

Machine Learning addresses more specifically the ability to improve automatically through experience.


by Shai Ben-David and Shai Shalev-Shwartz

Machine learning is one of the fastest growing areas of computer science, with far-reaching applications. The aim of this textbook is to introduce machine learning, and the algorithmic paradigms it offers, in a principled way.

The book provides an extensive theoretical account of the fundamental ideas underlying machine learning and the mathematical derivations that transform these principles into practical algorithms. Following a presentation of the basics of the field, the book covers a wide array of central topics that have not been addressed by previous textbooks. These include a discussion of the computational complexity of learning and the concepts of convexity and stability; important algorithmic paradigms including stochastic gradient descent, neural networks, and structured output learning; and emerging theoretical concepts such as the PAC-Bayes approach and compression-based bounds.

Designed for an advanced undergraduate or beginning graduate course, the text makes the fundamentals and algorithms of machine learning accessible to students and non-expert readers in statistics, computer science, mathematics, and engineering.


by D. Kriesel

The purpose of this book is to help you master the core concepts of neural networks, including modern techniques for deep learning.

After working through the book you will have written code that uses neural networks and deep learning to solve complex pattern recognition problems.

And you will have a foundation to use neural networks and deep learning to attack problems of your own devising.

To check those books and receive announcements when new free eBooks are published, click here.

Top DSC Resources

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge

Original post here

Posted by Emmanuelle Rieuf on March 20, 2017 at 4:00pm


Goodbye Age of Hadoop – Hello Cambrian Explosion of Deep Learning

Summary:  Some observations about new major trends and directions in data science drawn from the Strata+Hadoop conference in San Jose last week.

I’m fresh off my annual field trip to the Strata+Hadoop conference in San Jose last week.  This is always exciting, energizing, and exhausting, but it remains the single best place to pick up on what’s changing in our profession.

This conference is on a world tour with four more stops before repeating next year.  The New York show is supposed to be a little bigger (hard to imagine) but the San Jose show is closest to our intellectual birthplace.  After all this is the place where to call yourself a nerd would be regarded as a humble brag.

I’ll try to briefly share the major themes and changes I found this year and will write later in more depth about some of these.


End of the Era of Hadoop

From the time it went open source in 2007 Hadoop and its related technologies have been profound drivers of the growth of data science.  Doug Cutting remains one of the three Strata conference chairs.  However, we all know that Hadoop/MapReduce has made its mark but that it’s no longer cutting edge.  In fact we know that Apache Spark has eclipsed Hadoop and it would be fair to say that Spark was last year’s big news.

To put a stake in it, O’Reilly announced at this year’s Strata+Hadoop that the conference would henceforth be known as the Strata Data Conference.  So farewell age of Hadoop.


Artificial Intelligence

I am as jaded as the next guy, and maybe a little more so, about the overhyped furor around AI.  As I walked the conference floor I felt compelled to challenge any vendor with the temerity to put AI in their descriptors.

Actually there was very little of this at the show.  AI tends to be most overhyped when we’re talking about apps, but Strata+Hadoop is more about tools than apps.  There were two or three vendors that I thought had applied the AI frosting a little thick, but there were a few others where the label was appropriate and the capabilities pretty interesting.  More about these good guys later.

In the learning program there were again two or three sessions aimed at AI use cases in business and these were uniformly well reasoned.  Specifically that means acknowledging that this is in its infancy and while you should keep an eye on it, investing now would be speculative at best.


The Cambrian Explosion in Deep Learning

One of our general session speakers used this phrase to describe the hockey-stick like growth we’ve been experiencing in Deep Learning and AI in general.  The original use of the phrase is credited to Gill Pratt, the DARPA Program Manager who oversaw the DARPA Robotics Challenge.

If you remember a little of your earth history, we trundled along with one-celled creatures for billions of years until about half a billion years ago when, at the beginning of the Cambrian period, life diversified in a way that can truly be characterized as an explosion.  The prevailing academic theory is that very small changes, like the evolution of sight organs, so changed the playing field that the exploitation of this new capability drove the development of additional capabilities that – you know – resulted in us.

So while data scientists are a little cautious to talk about the wonders of artificial intelligence, they are very enthusiastic in talking about the new capabilities presented by Deep Learning.  This may seem a little paradoxical but I invite you to think about it this way.

Robust AI is the accumulated capabilities of speech, text, NLP, image processing, robotics, knowledge recovery, and several other human-like capabilities that at this point are very early in development and not at all well or easily integrated.

Deep Learning however is a group of tools that we are applying to develop these capabilities, including Convolutional Neural Nets, Recurrent Neural Nets, Generative Adversarial Neural Nets, and Reinforcement Learning to name the most popular.  All of these are subsets of Deep Learning and all are accessed through the newly emerging Deep Learning platforms like TensorFlow, MXNet, Theano, Torch, and several others.

Like all platform battles, the winner who gains the most users will be the next iOS, Android, or Windows.  Right now it appears Google’s TensorFlow is in the lead, and there were at least four or five program sessions, some of them full-day, that were oversubscribed, providing both general guidance and hands-on training in TensorFlow.  So while the buzz around AI was appropriately subdued, the enthusiasm for learning about TensorFlow was in full flower.  The emergence of Deep Learning platforms may be the slight evolutionary change that triggers the explosion of AI.


Platform Convergence

In the beginning you could pick a portion of the data science workflow and build a successful business there.  Many of today’s largest companies got their start this way.  Not anymore.  Now everybody wants to be an end-to-end platform, from data source to the deployment of models and other forms of exploitation.  Whoever has the most users will win, and once a platform is adopted, the pain of switching will be high.  It’s the same dynamic that keeps enterprise ERP systems so sticky – it’s too painful to switch.

We’ve seen in the last years analytic platforms like SAS and SPSS add full data access and blending capability.  We’ve seen blending platforms like Alteryx extend into analytics and visualization.  So here are two new and rather unexpected additions to the full spectrum platform game:

Cloudera announces its own Data Science Workbench with capabilities in R, Python, and Scala.

Intel (yes, Intel?), which just paid $15 billion for Mobileye to seize its place in the self-driving car space, is rolling out two data science platforms, Saffron and Nirvana, one aimed at IoT and the other at deep learning.


DataOps and Data Engineers

As recently as a year or so ago the term ‘data scientist’ applied both to someone doing predictive analytics and to the person you would ask to implement Spark or a data lake.  Thankfully, over not too long a period, we have come to differentiate Data Scientists from Data Engineers and to acknowledge the special skill set of the latter, which blends traditional CS skills with the new disciplines needed to store, extract, and prepare data for data scientists.

Now that this differentiation is a little clearer, we see a parallel rise in a new category of tools and platforms best described as DataOps.  Philosophically similar to DevOps, DataOps tools and platforms aim to regularize and simplify the tasks of Data Engineers, particularly repetitive tasks that may need to be repeated dozens or even hundreds of times for different data sources and destinations.  Two startups, Nexla and Data Kitchen, take a fairly narrow but deep view.  Others like Qubole are laying claim to this area by better defining capabilities within their existing platforms.


Emerging Productivity Enhancements for Data Scientists

We may think the business world is populated by companies with just a few (if any) data scientists working together and for the most part we’d be right.  However, this is not the market most vendors at Strata are interested in.  They are pursuing wallet share among the Global 8000.  That’s 8000 companies with more than $1 Billion in revenue and assuredly 100% commitment to predictive analytics.

I haven’t seen any specific data, but an informal poll of vendors suggests these companies employ from 20 to several hundred data scientists each.  When you have that many data scientists in one place you have to start thinking about efficiency and productivity.  And there’s a major theme for this year – productivity enhancements for data scientists.

The list of vendors with this focus is too long for this article and DataOps just above is part of this.  Here are just a few mentions of notable companies and their approach.

DataRobot:  We reviewed DataRobot a year ago when it looked like predictive analytics was about to be fully automated and data scientists unemployed by 2025.  That was a little premature.  However, DataRobot has found a foothold by dramatically speeding up model development.  This is one-click-to-model.  Their platform cleans data, does feature engineering and identification, runs thousands of potential model/hyperparameter combinations in parallel, and deploys champion models in a fraction of the time it would take a team of data scientists.
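The search such platforms automate can be sketched as a toy grid search. Everything below is a hypothetical stand-in, not DataRobot's method: `validation_error` plays the role of "train a model with these settings and score it on held-out data".

```python
from itertools import product

# Hypothetical stand-in for "train a model and score it on held-out data";
# lower is better. A real run would fit and evaluate an actual model here.
def validation_error(depth, lr):
    return (depth - 4) ** 2 + (lr - 0.1) ** 2

grid = {"depth": [2, 4, 8], "lr": [0.01, 0.1, 0.5]}
combos = list(product(grid["depth"], grid["lr"]))

# Evaluate every combination and keep the champion
best = min(combos, key=lambda c: validation_error(*c))
print(best)  # (4, 0.1)
```

Real systems add parallel execution, smarter search than an exhaustive grid, and automated deployment of the winning model, but the core loop is this one.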

SqlStream:  Deploy a blazing fast stream processing system in a fraction of the time and with a fraction of the compute resources distros like Spark require.  Make it so easy to manage that very little is needed of data engineers, and make it easy to change the logic and models within the stream without a team of data scientists.

Bansai:  TensorFlow is complex and tough to learn.  Bansai is introducing a higher level language that looks a lot like Python but manipulates deep learning algorithms in all the major deep learning platforms.  Their initial target is reinforcement learning for robotics and the payoff is to solve the shortage of deep learning-qualified data scientists that are a bottleneck for development.

Qubole:  Makes it easy to almost instantly establish a big data repository and begin analytics.  You can’t completely replace data engineers but you can dramatically increase the number of data scientists each engineer can support with this SaaS implementation.


Emergence of the Data Science Appliance

Similar to the productivity enhancements, but aimed at business users who want solutions without necessarily needing to know the underlying data science, is a group of offerings that intentionally hide the data science and focus on the business problem.

Anodot:  Delivers a sophisticated anomaly detection system that looks at all your streaming data and decides both what’s anomalous and what’s important.  This is catching on among ecommerce vendors and digital enterprises, some of whom have reportedly thrown out their internally developed anomaly detectors in favor of Anodot’s offering.

GeoStrategies:  This company uses GIS data for site location and market identification and penetration studies.  Lots of sophisticated platforms can do that too but GeoStrategies goes out of its way to hide the data science in favor of a UI that’s very intuitive for their business users.


Women in Data Science

Finally, my unscientific tally was that about 20% of attendees and a slightly higher percentage of presenters were women.  This may not be representative of our profession as a whole as folks who attend these conferences may have different profiles than the whole industry.  Still, while we might wish this was more like 50/50 I thought participation by our female members was a reasonably strong showing.



About the author:  Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist and commercial predictive modeler since 2001.  He can be reached at:

Original post here

Posted by William Vorhies on March 20, 2017 at 4:48pm


How to Make a Text Summarizer – Intro to Deep Learning #10

I’ll show you how you can turn an article into a one-sentence summary in Python with the Keras machine learning library. We’ll go over word embeddings, encoder-decoder architecture, and the role of attention in learning theory.
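The attention mechanism mentioned above can be sketched without any deep learning library: score each encoder state against the decoder's query, softmax the scores into weights, and take the weighted average of the values. This is an illustrative toy, not the Keras code from the video.

```python
import math

# Toy dot-product attention (illustrative; not the code from the video).
def attention(query, keys, values):
    """Weight each value by how well its key matches the query."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    exps = [math.exp(s - max(scores)) for s in scores]       # stable softmax
    weights = [e / sum(exps) for e in exps]
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
    return context, weights

# One decoder query attending over three encoder states (2-d toy vectors)
context, weights = attention([1.0, 0.0],
                             [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
                             [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print([round(w, 3) for w in weights])  # [0.422, 0.155, 0.422]
```

The two keys that match the query get most of the weight, so the summary "attends" to the most relevant encoder states.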

Code for this video (Challenge included):…

Jie’s Winning Code:…

More Learning resources:……………

Please subscribe! And like. And comment. That’s what keeps me going.

Join us in the Wizards Slack channel:

And please support me on Patreon:


What to Expect for Big Data and Apache Spark in 2017

Big data remains a rapidly evolving field with new applications and infrastructure appearing every year. In this talk, Matei Zaharia will cover new trends in 2016 / 2017 and how Apache Spark is moving to meet them. In particular, he will talk about work Databricks is doing to make Apache Spark interact better with native code (e.g. deep learning libraries), support heterogeneous hardware, and simplify production data pipelines in both streaming and batch settings through Structured Streaming.

This talk was originally presented at Spark Summit East 2017.

You can view the slides on Slideshare:…




Book: Evaluating Machine Learning Models

Data science today is a lot like the Wild West: there’s endless opportunity and excitement, but also a lot of chaos and confusion. If you’re new to data science and applied machine learning, evaluating a machine-learning model can seem pretty overwhelming. Now you have help. With this O’Reilly report, machine-learning expert Alice Zheng takes you through the model evaluation basics.

In this overview, Zheng first introduces the machine-learning workflow, and then dives into evaluation metrics and model selection. The latter half of the report focuses on hyperparameter tuning and A/B testing, which may benefit more seasoned machine-learning practitioners.

With this report, you will:

  • Learn the stages involved when developing a machine-learning model for use in a software application
  • Understand the metrics used for supervised learning models, including classification, regression, and ranking
  • Walk through evaluation mechanisms, such as hold-out validation, cross-validation, and bootstrapping
  • Explore hyperparameter tuning in detail, and discover why it’s so difficult
  • Learn the pitfalls of A/B testing, and examine a promising alternative: multi-armed bandits
  • Get suggestions for further reading, as well as useful software packages
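As a taste of the evaluation mechanisms the report covers, here is a minimal k-fold cross-validation splitter in pure Python (a sketch of the idea, not code from the report): each fold serves once as the hold-out set while the rest is used for training.

```python
# A minimal k-fold cross-validation splitter (a sketch, not code from the report).
def k_fold_splits(data, k):
    """Yield (train, test) pairs; each fold is the hold-out set exactly once."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

data = list(range(10))
for train, test in k_fold_splits(data, 5):
    assert len(test) == 2 and len(train) == 8
    assert sorted(train + test) == data  # every point appears exactly once
```

Averaging a model's score over all k test folds gives a less noisy estimate than a single hold-out split, at k times the training cost.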

Get the Free Ebook, here.

More information here

Posted by Emmanuelle Rieuf on March 10, 2017 at 8:00am



How We Combined Different Methods to Create Advanced Time Series Prediction

Today, businesses need to be able to predict demand and trends to stay in line with any sudden market changes and economy swings. This is exactly where forecasting tools, powered by Data Science, come into play, enabling organizations to successfully deal with strategic and capacity planning. Smart forecasting techniques can be used to reduce any possible risks and assist in making well-informed decisions. One of our customers, an enterprise from the Middle East, needed to predict their market demand for the upcoming twelve weeks. They required a market forecast to help them set their short-term objectives, such as production strategy, as well as assist in capacity planning and price control. So, we came up with an idea of creating a custom time series model capable of tackling the challenge. In this article, we will cover the modelling process as well as the pitfalls we had to overcome along the way.

There are a number of approaches to building time series predictions… and none of them fit us

With the emergence of powerful forecasting methods based on Machine Learning, future predictions have become more accurate. In general, forecasting techniques can be grouped into two categories: qualitative and quantitative. Qualitative forecasts are applied when there is no data available and prediction is based only on expert judgement. Quantitative forecasts are based on time series modeling. These models use historical data and are especially efficient in forecasting events that occur over periods of time: for example, prices, sales figures, production volumes, etc.

The existing models for time series prediction include ARIMA models, which are mainly used to model time series data without directly handling seasonality; VAR models; Holt-Winters seasonal methods; TAR models; and others. Unfortunately, these algorithms may fail to deliver the required level of prediction accuracy, as they can involve raw data that might be incomplete, inconsistent or contain errors. As quality decisions are based only on quality data, it is crucial to perform preprocessing to prepare the input for further processing.

Why combining models is an answer

It is clear that one particular forecasting technique cannot work in every situation. Each of the methods has its specific use case and can be applied with regard to many factors (the period over which the historical data is available, the time period that has to be observed, the size of the budget, the preferred level of accuracy) and the output required. So, we faced the question: which method/methods to use to obtain the desired result? As different approaches had their unique strengths and weaknesses, we decided to combine a number of methods and make them work together. In this way, we could build a time series model capable of providing trustworthy predictions to ensure data reliability and time/cost saving. And this is how we did it.

The modeling process; let’s dive into the details

The demand data depends on various factors that can influence the result of the forecast, such as the price and types of goods, geographical location, the country’s economics, manufacturing technology, etc. As we wanted our time series model to provide the customer with high-accuracy predictions, we used the interpolation method for missing values to ensure that the input is reliable.

When conducting the time series analysis in Python 2.7, we analyzed past data from 2010 to 2015 to precisely calculate demand and predict its future behavior.

Fig. 1. The demand data over the 2010-2015 timeframe

At first sight, it may seem that there is no constant demand pattern, as the variance goes up and down, making prediction hardly possible. But there is a method that can help here.

We used the decomposition method to separately extract trend (the increase or decrease in the series over a period of time), seasonality (the fluctuation that occurs within the series over each week, each month, etc.) and residuals (the data point that falls outside of the expected data range). With these three components we built the additive model:

y_t = T_t + S_t + R_t

where y_t is the data, T_t is the trend-cycle component, S_t is the seasonal component and R_t is the residual component, all defined over the time period t.
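To make the additive model concrete, here is a toy decomposition in pure Python on a synthetic series with a known period-4 seasonal pattern. It uses a centered moving average for the trend; the article's own pipeline used the Hodrick-Prescott filter and Loess, so this is only a simple stand-in.

```python
# Toy illustration of the additive model y_t = T_t + S_t + R_t on a synthetic
# series with known period-4 seasonality. The centered moving average here is
# only a stand-in for the Hodrick-Prescott filter and Loess used in the article.
period = 4
seasonal_true = [2.0, -1.0, -2.0, 1.0]
y = [10 + 0.5 * t + seasonal_true[t % period] for t in range(24)]

# Trend T_t: centered 2x4 moving average (the standard choice for an even period)
def trend(t):
    return (0.5 * y[t - 2] + y[t - 1] + y[t] + y[t + 1] + 0.5 * y[t + 2]) / period

# Detrend, then estimate the seasonal component S_t by averaging each phase
detrended = {t: y[t] - trend(t) for t in range(2, len(y) - 2)}
seasonal = [sum(v for t, v in detrended.items() if t % period == i)
            / sum(1 for t in detrended if t % period == i)
            for i in range(period)]

# The residual R_t = y_t - T_t - S_t is zero here because the toy series is exact
print([round(s, 2) for s in seasonal])  # recovers [2.0, -1.0, -2.0, 1.0]
```

On real demand data the residuals are not zero; they are exactly the part the article later models separately with machine learning.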

An important first step in describing various components of the series is smoothing, although it does not really provide you with a ready-to-use model. In the beginning, we estimated the trend (behavior) component. Such methods as Moving Average, Exponential Smoothing, Chow’s Adaptive Control, Winter’s Linear and Seasonal Exponential Smoothing methods did not provide us with the trend estimation accuracy we expected. The most reliable result was obtained using the Hodrick-Prescott Filter technique.

Fig. 2. The estimated trend

Then, we defined the seasonality from the available data. This component could change over time, so we applied a powerful tool for decomposing the time series – the Loess method. This approach can handle any type of seasonality, and the rate of change can be controlled by a user.

Fig. 3. Multi-seasonality

We obtained a multi-seasonal component with some high and low variances, causing large fluctuations.

After applying Elastic Net regression and a Fourier transformation, we built a forecast for the trend based on the results obtained. The trend is approximated by a formula of the form

T(t) ≈ Pn(t) + Σ k∈Ak (ak·cos(2πkt/N) + bk·sin(2πkt/N)),

where Pn(t) is a degree-n polynomial and Ak is a set of indexes, including the first k indexes with the highest amplitudes.

Then, we calculated the Fourier coefficients using the Discrete Fourier Transform (DFT).
Fig. 4. Example DFT code in Python
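The code in Fig. 4 is not reproduced in this post; a minimal DFT computing X[k] = Σ x[n]·e^(−2πikn/N) might look like the following sketch (illustrative, not the original code):

```python
import cmath
import math

def dft(x):
    """Discrete Fourier Transform: X[k] = sum_n x[n] * exp(-2j*pi*k*n/N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

# A pure cosine at frequency 1 concentrates its amplitude in bins 1 and N-1
signal = [math.cos(2 * math.pi * n / 8) for n in range(8)]
spectrum = dft(signal)
print([round(abs(c), 3) for c in spectrum])  # [0.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.0]
```

Selecting the bins with the largest magnitudes yields the highest-amplitude Fourier terms used as regressors. (This direct sum is O(N²); production code would use an FFT.)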

The effect of the Fourier terms, used as external regressors in the model, is visualised below.

Fig. 5. The visualised effect of Fourier terms

We built the trend prediction using the additive model.

Fig. 6. Trend prediction

When the trend and seasonal components are removed from the model, we can obtain the residuals (the difference between an observed value and its forecast based on other observations) from the remaining part to validate and fit our mathematical model.

Fig. 7. Obtained residuals

You may notice that some negative values are present, showing that something unusual was happening during those periods. We aimed to find the circumstances causing such behavior, so we compared the outliers against a simple calendar and discovered that the negative values correlate tightly with public holidays such as Ramadan and Eid al-Fitr. Having collected and summarized all the data, we applied machine learning methods, using previous data points as input features, following Machine Learning Strategies for Time Series Prediction.

After a few training sessions conducted with ML models, we built a prediction for residuals that can be observed below.

Fig. 8. Prediction for residuals

As a result, we got a final forecasting model that reduces the mean absolute percentage error (MAPE) to 6% for one particular city and 10% for the country as a whole.
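For reference, MAPE is simply the mean of the absolute percentage deviations between actual and forecast values; a minimal implementation (illustrative, not the authors' code) looks like this:

```python
# A minimal MAPE implementation (illustrative; not the authors' code).
def mape(actual, forecast):
    """Mean absolute percentage error, in percent; actual values must be nonzero."""
    return 100 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

print(round(mape([100, 200, 400], [94, 212, 400]), 3))  # 4.0
```

Note that MAPE is undefined when an actual value is zero and penalizes over- and under-forecasts asymmetrically, which is worth keeping in mind when comparing models.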

Fig. 9. The forecast at the original scale

A 24-times faster prediction? Yes, it’s possible

When building our model, we attempted not only to use the available information, but also to discover the factors that could affect the results. This approach helped us develop a model that generates more accurate forecasts faster than the existing models. For example, training the developed model to make predictions for 300 different cities takes about 15 minutes, while other methods require about 6 hours.

Also, since the deviation between the actual and predicted demand was only 6%, it became possible to resolve mismatches between supply and demand. The customer can now plan capacity more quickly and easily, minimize future risks and optimize inventory.

What’s next?

Well, the results are quite promising, and there is still a long way we can go in improving this model so that it provides accurate long-term forecasts as well. For now, the degree of error for long-term predictions remains quite high. Sounds like a challenge? Stay tuned: some new experiments are coming!

Posted by ELEKS on November 1, 2016 at 6:00am




The ROI of Machine Learning in Business – Infographics

This infographic comes from TechEmergence, a market research firm specializing in the applications and implications of artificial intelligence and machine learning. TechEmergence’s team recently polled a total of 30 artificial intelligence researchers and executives on the criteria a company must meet to derive maximum value from machine learning in solving business problems:

Read the original article here.

Top DSC Resources

Follow on Twitter: @DataScienceCtrl | @AnalyticBridge

Original post here




Magic Quadrant for Data Science Platforms


Data science platforms are engines for creating machine-learning solutions. Innovation in this market focuses on cloud, Apache Spark, automation, collaboration and artificial-intelligence capabilities. We evaluate 16 vendors to help you make the best choice for your organization.

Table of Contents

  • Market Definition/Description
  • Magic Quadrant
    • Vendor Strengths and Cautions
      • Alpine Data
      • Alteryx
      • Angoss
      • Dataiku
      • Domino Data Lab
      • FICO
      • IBM
      • KNIME
      • MathWorks
      • Microsoft
      • Quest
      • RapidMiner
      • SAP
      • SAS
      • Teradata
    • Vendors Added and Dropped
      • Added
      • Dropped
  • Inclusion and Exclusion Criteria
  • Evaluation Criteria
    • Ability to Execute
    • Completeness of Vision
    • Quadrant Descriptions
      • Leaders
      • Challengers
      • Visionaries
      • Niche Players
  • Context
  • Market Overview
  • Gartner Recommended Reading

More info at Gartner