Top 10 Data Mining Algorithms, Explained

Today, I’m going to explain in plain English the top 10 most influential data mining algorithms as voted on by 3 separate panels in this survey paper.

Once you know what they are, how they work, what they do and where you can find them, my hope is you’ll have this blog post as a springboard to learn even more about data mining.

What are we waiting for? Let’s get started!

Here are the algorithms:

  • 1. C4.5
  • 2. k-means
  • 3. Support vector machines
  • 4. Apriori
  • 5. EM
  • 6. PageRank
  • 7. AdaBoost
  • 8. kNN
  • 9. Naive Bayes
  • 10. CART

We also provide interesting resources at the end.

1. C4.5

What does it do? C4.5 constructs a classifier in the form of a decision tree. In order to do this, C4.5 is given a set of data representing things that are already classified.

Wait, what’s a classifier? A classifier is a tool in data mining that takes a bunch of data representing things we want to classify and attempts to predict which class the new data belongs to.

What’s an example of this? Sure, suppose a dataset contains a bunch of patients. We know various things about each patient like age, pulse, blood pressure, VO2max, family history, etc. These are called attributes.


Given these attributes, we want to predict whether the patient will get cancer. The patient can fall into 1 of 2 classes: will get cancer or won’t get cancer. C4.5 is told the class for each patient.

And here’s the deal:

Using a set of patient attributes and the patient’s corresponding class, C4.5 constructs a decision tree that can predict the class for new patients based on their attributes.

Cool, so what’s a decision tree? Decision tree learning creates something similar to a flowchart to classify new data. Using the same patient example, one particular path in the flowchart could be:

  • Patient has a history of cancer
  • Patient is expressing a gene highly correlated with cancer patients
  • Patient has tumors
  • Patient’s tumor size is greater than 5cm

The bottom line is:

At each point in the flowchart is a question about the value of some attribute, and depending on those values, he or she gets classified. You can find lots of examples of decision trees.

Is this supervised or unsupervised? This is supervised learning, since the training dataset is labeled with classes. Using the patient example, C4.5 doesn’t learn on its own that a patient will get cancer or won’t get cancer. We told it first, it generated a decision tree, and now it uses the decision tree to classify.

You might be wondering: how is C4.5 different from other decision tree systems?

  • First, C4.5 uses information gain (more precisely the gain ratio, a normalized form of information gain) when generating the decision tree.
  • Second, although other systems also incorporate pruning, C4.5 uses a single-pass pruning process to mitigate over-fitting. Pruning results in many improvements.
  • Third, C4.5 can work with both continuous and discrete data. My understanding is it does this by specifying ranges or thresholds for continuous data thus turning continuous data into discrete data.
  • Finally, C4.5 has its own way of dealing with incomplete data, meaning records with missing attribute values.
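To make information gain a little more concrete, here is a minimal sketch that measures how much a split on one attribute reduces the uncertainty (entropy) about the class. The tiny patient table, the attribute names and the function names are all made up for illustration. Keep in mind that C4.5 itself goes a step further and normalizes this value into the gain ratio, so this shows only the core idea, not C4.5's exact criterion.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after splitting on attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy data: two discrete attributes per patient.
patients = [
    {"family_history": "yes", "smoker": "yes"},
    {"family_history": "yes", "smoker": "no"},
    {"family_history": "no", "smoker": "yes"},
    {"family_history": "no", "smoker": "no"},
]
classes = ["cancer", "cancer", "no cancer", "no cancer"]

gain_history = information_gain(patients, classes, "family_history")
gain_smoker = information_gain(patients, classes, "smoker")
print(gain_history, gain_smoker)
```

In this toy table, splitting on family_history separates the classes perfectly, while splitting on smoker tells us nothing, so a tree builder would ask about family_history first.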

Why use C4.5? Arguably, the best selling point of decision trees is their ease of interpretation and explanation. They are also quite fast, quite popular and the output is human readable.

Where is it used? A popular open-source Java implementation can be found over at OpenTox. Orange, an open-source data visualization and analysis tool for data mining, implements C4.5 in their decision tree classifier.

Classifiers are great, but make sure to check out the next algorithm about clustering…

2. k-means

What does it do? k-means creates k groups from a set of objects so that the members of a group are more similar to each other than to members of other groups. It’s a popular cluster analysis technique for exploring a dataset.

Hang on, what’s cluster analysis? Cluster analysis is a family of algorithms designed to form groups such that the group members are more similar to each other than to non-group members. Clusters and groups are synonymous in the world of cluster analysis.

Is there an example of this? Definitely, suppose we have a dataset of patients. In cluster analysis, these would be called observations. We know various things about each patient like age, pulse, blood pressure, VO2max, cholesterol, etc. This is a vector representing the patient.


You can basically think of a vector as a list of numbers we know about the patient. This list can also be interpreted as coordinates in multi-dimensional space. Pulse can be one dimension, blood pressure another dimension and so forth.

You might be wondering:

Given this set of vectors, how do we cluster together patients that have similar age, pulse, blood pressure, etc?

Want to know the best part?

You tell k-means how many clusters you want. K-means takes care of the rest.

How does k-means take care of the rest? k-means has lots of variations to optimize for certain types of data.

At a high level, they all do something like this:

  1. k-means picks points in multi-dimensional space to represent each of the k clusters. These are called centroids.
  2. Every patient will be closest to 1 of these k centroids. They hopefully won’t all be closest to the same one, so they’ll form a cluster around their nearest centroid.
  3. What we have are k clusters, and each patient is now a member of a cluster.
  4. k-means then finds the center for each of the k clusters based on its cluster members (yep, using the patient vectors!).
  5. This center becomes the new centroid for the cluster.
  6. Since the centroid is in a different place now, patients might now be closer to other centroids. In other words, they may change cluster membership.
  7. Steps 2-6 are repeated until the centroids no longer change, and the cluster memberships stabilize. This is called convergence.
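The steps above can be sketched in a few lines of plain Python. This is a bare-bones illustration rather than a production implementation: the patient vectors (pulse, systolic blood pressure) are made up, the initial centroids are simply k randomly chosen patients, and the function names are my own.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of vectors."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: pick initial centroids
    clusters = []
    for _ in range(max_iterations):
        # Steps 2-3: every point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Steps 4-5: the center of each cluster becomes its new centroid
        # (an empty cluster keeps its old centroid).
        new_centroids = [mean_point(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        # Step 7: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids  # step 6: memberships may change next pass
    return centroids, clusters

# Hypothetical patients described by (pulse, systolic blood pressure).
patients = [(60, 110), (62, 115), (58, 112), (95, 150), (98, 155), (92, 148)]
centroids, clusters = k_means(patients, k=2)
print(centroids)
print(clusters)
```

On this toy data the two obvious groups (low pulse and pressure vs. high pulse and pressure) end up in separate clusters.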

Is this supervised or unsupervised? It depends, but most would classify k-means as unsupervised. Other than specifying the number of clusters, k-means “learns” the clusters on its own without any information about which cluster an observation belongs to. k-means can be semi-supervised.

Why use k-means? I don’t think many will have an issue with this:

The key selling point of k-means is its simplicity. Its simplicity means it’s generally faster and more efficient than other algorithms, especially over large datasets.

It gets better:

k-means can be used to pre-cluster a massive dataset followed by a more expensive cluster analysis on the sub-clusters. k-means can also be used to rapidly “play” with k and explore whether there are overlooked patterns or relationships in the dataset.

It’s not all smooth sailing:

Two key weaknesses of k-means are its sensitivity to outliers, and its sensitivity to the initial choice of centroids. One final thing to keep in mind is k-means is designed to operate on continuous data — you’ll need to do some tricks to get it to work on discrete data.

Where is it used? A ton of implementations for k-means clustering are available online.

If decision trees and clustering didn’t impress you, you’re going to love the next algorithm.

3. Support vector machines

What does it do? Support vector machine (SVM) learns a hyperplane to classify data into 2 classes. At a high level, SVM performs a task similar to C4.5’s, except SVM doesn’t use decision trees at all.

Whoa, a hyper-what? A hyperplane is a function like the equation for a line, y = mx + b. In fact, for a simple classification task with just 2 features, the hyperplane can be a line.

As it turns out…

SVM can perform a trick to project your data into higher dimensions. Once projected into higher dimensions…

…SVM figures out the best hyperplane which separates your data into the 2 classes.

Do you have an example? Absolutely, the simplest example I found starts with a bunch of red and blue balls on a table. If the balls aren’t too mixed together, you could take a stick and without moving the balls, separate them with the stick.

You see:

When a new ball is added on the table, by knowing which side of the stick the ball is on, you can predict its color.

What do the balls, table and stick represent? The balls represent data points, and the red and blue color represent 2 classes. The stick represents the simplest hyperplane which is a line.

And the coolest part?

SVM figures out the function for the hyperplane.

What if things get more complicated? Right, they frequently do. If the balls are mixed together, a straight stick won’t work.

Here’s the work-around:

Quickly lift up the table throwing the balls in the air. While the balls are in the air and thrown up in just the right way, you use a large sheet of paper to divide the balls in the air.

You might be wondering if this is cheating:

Nope, lifting up the table is the equivalent of mapping your data into higher dimensions. In this case, we go from the 2 dimensional table surface to the 3 dimensional balls in the air.

How does SVM do this? By using a kernel we have a nice way to operate in higher dimensions. The large sheet of paper is still called a hyperplane, but it is now a function for a plane rather than a line. Note from Yuval that once we’re in 3 dimensions, the hyperplane must be a plane rather than a line.
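Here is a small sketch of the "lifting" idea, one dimension lower than the table example. The data, the mapping and the hand-picked hyperplane are all assumptions for illustration: red points sit near zero on a line and blue points sit far from zero on both sides, so no single cut point on the line separates them. Mapping each point x to (x, x²) lifts the data into 2 dimensions where a straight line does the job. A real SVM with a kernel gets the same effect without computing the mapping explicitly, and it learns the hyperplane instead of having it handed over.

```python
def lift(x):
    """Explicit feature map: send a 1-D point to 2-D by adding x squared."""
    return (x, x * x)

def side(point, w, b):
    """Report which side of the hyperplane w . p + b = 0 the point falls on."""
    score = sum(wi * pi for wi, pi in zip(w, point)) + b
    return "blue" if score > 0 else "red"

# Hypothetical 1-D data: red in the middle, blue on both outer sides.
red = [-1.0, -0.5, 0.0, 0.5, 1.0]
blue = [-3.0, -2.0, 2.0, 3.0]

# Hand-picked hyperplane in the lifted space: the horizontal line x^2 = 2.5.
w, b = (0.0, 1.0), -2.5

predictions_red = [side(lift(x), w, b) for x in red]
predictions_blue = [side(lift(x), w, b) for x in blue]
print(predictions_red)
print(predictions_blue)
```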

I found this visualization super helpful:

Reddit also has 2 great threads on this in the ELI5 and ML subreddits.

How do balls on a table or in the air map to real-life data? A ball on a table has a location that we can specify using coordinates. For example, a ball could be 20cm from the left edge and 50cm from the bottom edge. Another way to describe the ball is as (x, y) coordinates or (20, 50). x and y are 2 dimensions of the ball.

Here’s the deal:

If we had a patient dataset, each patient could be described by various measurements like pulse, cholesterol level, blood pressure, etc. Each of these measurements is a dimension.

The bottom line is:

SVM does its thing, maps them into a higher dimension and then finds the hyperplane to separate the classes.

Margins are often associated with SVM. What are they? The margin is the distance between the hyperplane and the closest data point from each of the 2 classes. In the ball and table example, the distance between the stick and the closest red and blue balls is the margin.

The key is:

SVM attempts to maximize the margin, so that the hyperplane is just as far from the closest red ball as from the closest blue ball. In this way, it decreases the chance of misclassification.
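To see what maximizing the margin means, here is a rough sketch using the ball and table picture. The ball positions are made up, and instead of solving the optimization problem a real SVM solves, it simply compares three hand-picked vertical sticks and keeps the one whose closest ball is farthest away.

```python
import math

def distance_to_hyperplane(point, w, b):
    """Distance from a point to the hyperplane w . p + b = 0."""
    score = sum(wi * pi for wi, pi in zip(w, point)) + b
    return abs(score) / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical ball positions (x, y) on the table, in arbitrary units.
red_balls = [(1.0, 2.0), (2.0, 5.0), (3.0, 1.0)]
blue_balls = [(7.0, 3.0), (8.0, 6.0), (9.0, 2.0)]

def margin(w, b):
    """Distance from the stick to the closest ball of either color."""
    return min(distance_to_hyperplane(p, w, b) for p in red_balls + blue_balls)

# Candidate vertical sticks x = c, written as w = (1, 0) and b = -c.
candidates = {c: margin((1.0, 0.0), -c) for c in (4.0, 5.0, 6.0)}
best_stick = max(candidates, key=candidates.get)
print(candidates)
print(best_stick)
```

The stick at x = 5 wins because it sits halfway between the closest red ball (x = 3) and the closest blue ball (x = 7), which is exactly the "equally far from both" behavior described above.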

Where does SVM get its name from? Using the ball and table example, the hyperplane is equidistant from a red ball and a blue ball. These balls or data points are called support vectors, because they support the hyperplane.

Is this supervised or unsupervised? This is supervised learning, since a dataset is used to first teach the SVM about the classes. Only then is the SVM capable of classifying new data.

Why use SVM? SVM along with C4.5 are generally the 2 classifiers to try first. No classifier will be the best in all cases due to the No Free Lunch Theorem. In addition, kernel selection and interpretability are some weaknesses.

Where is it used? There are many implementations of SVM. A few of the popular ones are scikit-learn, MATLAB and of course libsvm.

The next algorithm is one of my favorites…

4. Apriori

What does it do? The Apriori algorithm learns association rules and is applied to a database containing a large number of transactions.

What are association rules? Association rule learning is a data mining technique for learning correlations and relations among variables in a database.

What’s an example of Apriori? Let’s say we have a database full of supermarket transactions. You can think of a database as a giant spreadsheet where each row is a customer transaction and every column represents a different grocery item.

[Figure: shopping database]

Here’s the best part:

By applying the Apriori algorithm, we can learn the grocery items that are purchased together, a.k.a. association rules.

The power of this is:

You can find those items that tend to be purchased together more frequently than other items — the ultimate goal being to get shoppers to buy more. Together, these items are called itemsets.

For example:

You can probably quickly see that chips + dip and chips + soda seem to frequently occur together. These are called 2-itemsets. With a large enough dataset, it will be much harder to “see” the relationships especially when you’re dealing with 3-itemsets or more. That’s precisely what Apriori helps with!

You might be wondering how Apriori works? Before getting into the nitty-gritty of the algorithm, you’ll need to define 3 things:

  1. The first is the size of your itemset. Do you want to see patterns for a 2-itemset, 3-itemset, etc.?
  2. The second is your support or the number of transactions containing the itemset divided by the total number of transactions. An itemset that meets the minimum support you chose is called a frequent itemset.
  3. The third is your confidence or the conditional probability of some item given you have certain other items in your itemset. A good example is given chips in your itemset, there is a 67% confidence of having soda also in the itemset.
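Support and confidence are easy to compute by hand on a small example. The sketch below uses 4 made-up transactions (not the ones from the shopping database above), chosen so that the confidence of soda given chips comes out to roughly 67%. The function names are my own.

```python
# Hypothetical supermarket transactions, one set of items per customer.
transactions = [
    {"chips", "dip", "soda"},
    {"chips", "soda"},
    {"chips", "dip"},
    {"bread", "milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(given, then):
    """Conditional probability of `then` appearing when `given` is present."""
    return support(given | then) / support(given)

print(support({"chips"}))               # 3 of the 4 transactions contain chips
print(support({"chips", "soda"}))       # 2 of the 4 contain both chips and soda
print(confidence({"chips"}, {"soda"}))  # 2 of the 3 chip transactions have soda
```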

The basic Apriori algorithm is a 3 step approach:

  1. Join. Scan the whole database for how frequent 1-itemsets are.
  2. Prune. Those itemsets that satisfy the minimum support move on to the next round for 2-itemsets. (Confidence comes into play afterwards, when rules are generated from the frequent itemsets.)
  3. Repeat. This is repeated for each itemset level until we reach our previously defined size.
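Here is a compact sketch of that level-by-level loop. It is a simplified illustration under a few assumptions: the transactions are made up, candidates are pruned by minimum support only, and the function and parameter names are my own rather than from any particular library.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size):
    """Return {itemset: support} for all frequent itemsets up to max_size."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = sorted({item for t in transactions for item in t})
    # Level 1: scan the database for frequent 1-itemsets.
    current = [frozenset([i]) for i in items
               if support(frozenset([i])) >= min_support]
    result = {}
    size = 1
    while current and size <= max_size:
        for itemset in current:
            result[itemset] = support(itemset)
        # Join: combine frequent itemsets into candidates one item larger.
        candidates = {a | b for a, b in combinations(current, 2)
                      if len(a | b) == size + 1}
        # Prune: keep a candidate only if all of its smaller subsets are
        # frequent and the candidate itself meets the minimum support.
        frequent = set(current)
        current = [c for c in candidates
                   if all(frozenset(sub) in frequent
                          for sub in combinations(c, size))
                   and support(c) >= min_support]
        size += 1  # repeat for the next itemset level
    return result

# Hypothetical transactions.
transactions = [
    {"chips", "dip", "soda"},
    {"chips", "soda"},
    {"chips", "dip"},
    {"bread", "milk"},
    {"chips", "dip", "soda"},
]
result = frequent_itemsets(transactions, min_support=0.4, max_size=3)
for itemset, value in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), value)
```

Bread and milk drop out in the first round, so no larger itemset containing them is ever counted. That shortcut is where the savings come from on a big database.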

Is this supervised or unsupervised? Apriori is generally considered an unsupervised learning approach, since it’s often used to discover or mine for interesting patterns and relationships.

But wait, there’s more…

Apriori can also be modified to do classification based on labelled data.

Why use Apriori? Apriori is well understood, easy to implement and has many derivatives.

On the other hand…

The algorithm can be quite memory, space and time intensive when generating itemsets.

Where is it used? Plenty of implementations of Apriori are available. Some popular ones are the ARtool, Weka, and Orange.

The next algorithm was the most difficult for me to understand, so take a look…

5. EM

What does it do? In data mining, expectation-maximization (EM) is generally used as a clustering algorithm (like k-means) for knowledge discovery.

In statistics, the EM algorithm iterates and optimizes the likelihood of seeing observed data while estimating the parameters of a statistical model with unobserved variables.

OK, hang on while I explain…

I’m not a statistician, so hopefully my simplification is both correct and helps with understanding.

Here are a few concepts that will make this way easier…

What’s a statistical model? I see a model as something that describes how observed data is generated. For example, the grades for an exam could fit a bell curve, so the assumption that the grades are generated via a bell curve (a.k.a. normal distribution) is the model.

Wait, what’s a distribution? A distribution represents the probabilities for all measurable outcomes. For example, the grades for an exam could fit a normal distribution. This normal distribution represents all the probabilities of a grade.

In other words, given a grade, you can use the distribution to determine how many exam takers are expected to get that grade.

Cool, what are the parameters of a model? A parameter describes a distribution which is part of a model. For example, a bell curve can be described by its mean and variance.

Using the exam scenario, the distribution of grades on an exam (the measurable outcomes) followed a bell curve (this is the distribution). The mean was 85 and the variance was 100.

So, all you need to describe a normal distribution are 2 parameters:

  1. The mean
  2. The variance

And likelihood? Going back to our previous bell curve example… suppose we have a bunch of grades and are told the grades follow a bell curve. However, we’re not given all the grades… only a sample.

Here’s the deal:

We don’t know the mean or variance of all the grades, but we can estimate them using the sample. The likelihood is the probability that the bell curve with the estimated mean and variance results in that bunch of grades.

In other words, given a set of measurable outcomes, let’s estimate the parameters. Using these estimated parameters, the hypothetical probability of the outcomes is called likelihood.

Remember, it’s the hypothetical probability of the existing grades, not the probability of a future grade.
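A quick sketch may help here. The 5 grades below are made up (picked so that the estimated mean is 85 and the estimated variance is 100), and the function names are my own. The code estimates the parameters from the sample, then compares how likely the sample is under the estimated bell curve versus a badly chosen one. It works with the log of the likelihood, which is the usual trick for keeping the numbers manageable.

```python
import math

def normal_pdf(x, mean, variance):
    """Density of a normal distribution at x."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def log_likelihood(sample, mean, variance):
    """Log of the likelihood of the sample under a normal model."""
    return sum(math.log(normal_pdf(x, mean, variance)) for x in sample)

# Hypothetical sample of exam grades.
grades = [70, 80, 85, 90, 100]

estimated_mean = sum(grades) / len(grades)
estimated_variance = sum((g - estimated_mean) ** 2 for g in grades) / len(grades)

good_fit = log_likelihood(grades, estimated_mean, estimated_variance)
poor_fit = log_likelihood(grades, 60, estimated_variance)
print(estimated_mean, estimated_variance)
print(good_fit, poor_fit)
```

The bell curve centered on the estimated mean makes the observed grades far more likely than one centered on 60, which is the sense in which the estimated parameters "fit" the sample.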

You’re probably wondering, what’s probability then?

Using the bell curve example, suppose we know the mean and variance. Then we’re told the grades follow a bell curve. The chance that we observe certain grades and how often they are observed is the probability.

In more general terms, given the parameters, let’s estimate what outcomes should be observed. That’s what probability does for us.

Great! Now, what’s the difference between observed and unobserved data? Observed data is the data that you saw or recorded. Unobserved data is data that is missing. There are a number of reasons that the data could be missing (not recorded, ignored, etc.).

Here’s the kicker:

For data mining and clustering, what’s important to us is looking at the class of a data point as missing data. We don’t know the class, so interpreting missing data this way is crucial for applying EM to the task of clustering.

Once again: The EM algorithm iterates and optimizes the likelihood of seeing observed data while estimating the parameters of a statistical model with unobserved variables. Hopefully, this is way more understandable now.

The best part is…

By optimizing the likelihood, EM generates an awesome model that assigns class labels to data points — sounds like clustering to me!

How does EM help with clustering? EM begins by making a guess at the model parameters.

Then it follows an iterative 3-step process:

  1. E-step: Based on the model parameters, it calculates the probabilities for assignments of each data point to a cluster.
  2. M-step: Update the model parameters based on the cluster assignments from the E-step.
  3. Repeat until the model parameters and cluster assignments stabilize (a.k.a. convergence).
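The loop above can be sketched for the simplest interesting case: 1-dimensional data assumed to come from a mix of 2 bell curves. Everything specific here is an assumption for illustration: the grades are made up, the initial guess (smallest and largest value as the 2 means) is just one convenient choice, and a fixed number of iterations stands in for a proper convergence check.

```python
import math

def normal_pdf(x, mean, variance):
    """Density of a normal distribution at x."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def em_two_clusters(data, iterations=50):
    # Initial guess at the model parameters.
    overall_mean = sum(data) / len(data)
    overall_variance = sum((x - overall_mean) ** 2 for x in data) / len(data)
    means = [min(data), max(data)]
    variances = [overall_variance, overall_variance]
    weights = [0.5, 0.5]
    memberships = []
    for _ in range(iterations):
        # E-step: probability that each data point belongs to each cluster.
        memberships = []
        for x in data:
            scores = [weights[k] * normal_pdf(x, means[k], variances[k])
                      for k in range(2)]
            total = sum(scores)
            memberships.append([s / total for s in scores])
        # M-step: update the parameters from those soft assignments.
        for k in range(2):
            column = [m[k] for m in memberships]
            size = sum(column)
            weights[k] = size / len(data)
            means[k] = sum(m * x for m, x in zip(column, data)) / size
            spread = sum(m * (x - means[k]) ** 2 for m, x in zip(column, data)) / size
            variances[k] = max(spread, 1e-6)  # guard against a collapsed cluster
    return means, variances, weights, memberships

# Hypothetical grades from two groups of students.
grades = [55, 58, 60, 62, 65, 85, 88, 90, 92, 95]
means, variances, weights, memberships = em_two_clusters(grades)
print(means)
print([round(m[0], 3) for m in memberships])
```

Notice the difference from k-means: each grade gets a probability of belonging to each cluster instead of a hard assignment.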

Is this supervised or unsupervised? Since we do not provide labeled class information, this is unsupervised learning.

Why use EM? A key selling point of EM is it’s simple and straight-forward to implement. In addition, not only can it optimize for model parameters, it can also iteratively make guesses about missing data.

This makes it great for clustering and generating a model with parameters. Knowing the clusters and model parameters, it’s possible to reason about what the clusters have in common and which cluster new data belongs to.

EM is not without weaknesses though…

  • First, EM is fast in the early iterations, but slow in the later iterations.
  • Second, EM doesn’t always find the optimal parameters and gets stuck in local optima rather than global optima.

Where is it used? The EM algorithm is available in Weka. R has an implementation in the mclust package. scikit-learn also has an implementation in its mixture module (the GaussianMixture class, called GMM in older releases).

What data mining does Google do? Take a look…

6. PageRank

What does it do? PageRank is a link analysis algorithm designed to determine the relative importance of some object linked within a network of objects.

Yikes.. what’s link analysis? It’s a type of network analysis looking to explore the associations (a.k.a. links) among objects.

Here’s an example: The most prevalent example of PageRank is Google’s search engine. Although their search engine doesn’t solely rely on PageRank, it’s one of the measures Google uses to determine a web page’s importance.

Let me explain:

Web pages on the World Wide Web link to each other. If a page links to a web page on CNN, a vote is added for the CNN page, indicating that the linking page finds the CNN web page relevant.

And it doesn’t stop there…

The linking page’s votes are in turn weighted by its own importance and relevance. In other words, any web page that votes for the linking page increases the linking page’s relevance.

The bottom line?

This concept of voting and relevance is PageRank. The linking page’s vote for CNN increases CNN’s PageRank, and the strength of the linking page’s PageRank influences how much its vote affects CNN’s PageRank.
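The voting idea can be sketched as a simple loop that keeps passing rank along links until the numbers settle. This is a simplified illustration, not Google's implementation: the 3-page link graph is made up, and the damping factor of 0.85 is the value commonly quoted for PageRank rather than something taken from this post.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Iteratively pass rank along links; `links` maps a page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to,
                # and the vote is weighted by the page's own rank.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over everyone.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph.
links = {
    "blog": ["cnn"],
    "news_aggregator": ["cnn", "blog"],
    "cnn": ["news_aggregator"],
}
ranks = pagerank(links)
for page, value in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(value, 3))
```

In this toy graph the cnn page collects votes from both other pages, so it ends up with the highest rank, and its single outgoing link in turn lifts news_aggregator above blog.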

What does a PageRank of 0, 1, 2, 3, etc. mean? Although the precise meaning of a PageRank number isn’t disclosed by Google, we can get a sense of its relative meaning.

And here’s how:

[Figure: PageRank table]

You see?

It’s a bit like a popularity contest. We all have a sense of which websites are relevant and popular in our minds. PageRank is just an uber elegant way to define it.

What other applications are there of PageRank? PageRank was specifically designed for the World Wide Web.

Think about it:

At its core, PageRank is really just a super effective way to do link analysis. The objects being linked don’t have to be web pages.

Here are 3 innovative applications of PageRank:

  1. Dr Stefano Allesina, from the University of Chicago, applied PageRank to ecology to determine which species are critical for sustaining ecosystems.
  2. Twitter developed WTF (Who-to-Follow) which is a personalized PageRank recommendation engine about who to follow.
  3. Bin Jiang, from The Hong Kong Polytechnic University, used a variant of PageRank to predict human movement rates based on topographical metrics in London.

Is this supervised or unsupervised? PageRank is generally considered an unsupervised learning approach, since it’s often used to discover the importance or relevance of a web page.

Why use PageRank? Arguably, the main selling point of PageRank is its robustness due to the difficulty of getting a relevant incoming link.

Simply stated:

If you have a graph or network and want to understand relative importance, priority, ranking or relevance, give PageRank a try.

Where is it used? The PageRank trademark is owned by Google. However, the PageRank algorithm is actually patented by Stanford University.

You might be wondering if you can use PageRank:

I’m not a lawyer, so best to check with an actual lawyer, but you can probably use the algorithm as long as it doesn’t commercially compete against Google/Stanford.

Here are 3 implementations of PageRank:

  1. C++ OpenSource PageRank Implementation
  2. Python PageRank Implementation
  3. igraph – The network analysis package (R)

With our powers combined, we are…

7. AdaBoost

What does it do? AdaBoost is a boosting algorithm which constructs a classifier.

As you probably remember, a classifier takes a bunch of data and attempts to predict or classify which class a new data element belongs to.

But what’s boosting? Boosting is an ensemble learning algorithm which takes multiple learning algorithms (e.g. decision trees) and combines them. The goal is to take an ensemble or group of weak learners and combine them to create a single strong learner.

What’s the difference between a strong and weak learner? A weak learner classifies with accuracy barely above chance. A popular example of a weak learner is the decision stump which is a one-level decision tree.


A strong learner has much higher accuracy, and an often used example of a strong learner is SVM.

What’s an example of AdaBoost? Let’s start with 3 weak learners. We’re going to train them in 10 rounds on a training dataset containing patient data. The dataset contains details about the patient’s medical records.

The question is…

How can we predict whether the patient will get cancer?

Here’s how AdaBoost answers the question…

In round 1: AdaBoost takes a sample of the training dataset and tests to see how accurate each learner is. The end result is we find the best learner.

In addition, samples that are misclassified are given a heavier weight, so that they have a higher chance of being picked in the next round.

One more thing, the best learner is also given a weight depending on its accuracy and incorporated into the ensemble of learners (right now there’s just 1 learner).

In round 2: AdaBoost again attempts to look for the best learner.

And here’s the kicker:

The sample of patient training data is now influenced by the heavier weights given to misclassified patients. In other words, previously misclassified patients have a higher chance of showing up in the sample.


It’s like getting to the second level of a video game and not having to start all over again when your character is killed. Instead, you start at level 2 and focus all your efforts on getting to level 3.

Likewise, the first learner likely classified some patients correctly. Instead of trying to classify them again, let’s focus all the efforts on getting the misclassified patients.

The best learner is again weighted and incorporated into the ensemble, misclassified patients are weighted so they have a higher chance of being picked and we rinse and repeat.

At the end of the 10 rounds: We’re left with an ensemble of weighted learners trained and then repeatedly retrained on misclassified data from the previous rounds.
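Here is a minimal sketch of those rounds using decision stumps as the weak learners. To keep it short it differs from the walkthrough above in a few ways, all of them my own simplifications: instead of drawing a weighted sample each round, it applies the weights directly when measuring each stump's error (the common textbook formulation), it searches over every possible threshold instead of choosing among 3 fixed learners, and it runs only 3 rounds on a made-up 1-dimensional dataset with classes +1 and -1.

```python
import math

def stump_predict(x, threshold, polarity):
    """One-level decision tree: predict `polarity` above the threshold."""
    return polarity if x > threshold else -polarity

def train_adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n
    values = sorted(set(xs))
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
    ensemble = []
    for _ in range(rounds):
        # Find the stump with the lowest weighted error this round.
        best = None
        for t in thresholds:
            for polarity in (1, -1):
                error = sum(w for x, y, w in zip(xs, ys, weights)
                            if stump_predict(x, t, polarity) != y)
                if best is None or error < best[0]:
                    best = (error, t, polarity)
        error, t, polarity = best
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid division by zero
        # The more accurate the learner, the heavier its vote in the ensemble.
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, t, polarity))
        # Misclassified points get heavier, correctly classified ones lighter.
        weights = [w * math.exp(-alpha * y * stump_predict(x, t, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all the stumps in the ensemble."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score > 0 else -1

# Hypothetical data that no single stump can classify correctly.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

ensemble = train_adaboost(xs, ys, rounds=3)
print(ensemble)
print([predict(ensemble, x) for x in xs])
```

Each stump on its own gets at least 3 of the 10 points wrong, but the weighted vote of the 3 stumps classifies all 10 correctly.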

Is this supervised or unsupervised? This is supervised learning, since each iteration trains the weak learners with the labelled dataset.

Why use AdaBoost? AdaBoost is simple. The algorithm is relatively straight-forward to program.

In addition, it’s fast! Weak learners are generally simpler than strong learners. Being simpler means they’ll likely execute faster.

Another thing…

It’s a super elegant way to auto-tune a classifier, since each successive AdaBoost round refines the weights for each of the best learners. All you need to specify is the number of rounds.

Finally, it’s flexible and versatile. AdaBoost can incorporate any learning algorithm, and it can work with a large variety of data.

Where is it used? AdaBoost has a ton of implementations and variants.

If you like Mr. Rogers, you’ll like the next algorithm…

8. kNN

What does it do? kNN, or k-Nearest Neighbors, is a classification algorithm. However, it differs from the classifiers previously described because it’s a lazy learner.

What’s a lazy learner? A lazy learner doesn’t do much during the training process other than store the training data. Only when new unlabeled data is input does this type of learner look to classify.

On the other hand, an eager learner builds a classification model during training. When new unlabeled data is input, this type of learner feeds the data into the classification model.

How do C4.5, SVM and AdaBoost fit into this? Unlike kNN, they are all eager learners.

Here’s why:

  1. C4.5 builds a decision tree classification model during training.
  2. SVM builds a hyperplane classification model during training.
  3. AdaBoost builds an ensemble classification model during training.

So what does kNN do? kNN builds no such classification model. Instead, it just stores the labeled training data.

When new unlabeled data comes in, kNN operates in 2 basic steps:

  1. First, it looks at the k closest labeled training data points — in other words, the k-nearest neighbors.
  2. Second, using the neighbors’ classes, kNN gets a better idea of how the new data should be classified.

You might be wondering…

How does kNN figure out what’s closer? For continuous data, kNN uses a distance metric like Euclidean distance. The choice of distance metric largely depends on the data. Some even suggest learning a distance metric based on the training data. There’s tons more details and papers on kNN distance metrics.

For discrete data, the idea is to transform discrete data into continuous data. 2 examples of this are:

  1. Using Hamming distance as a metric for the “closeness” of 2 text strings.
  2. Transforming discrete data into binary features.

These 2 Stack Overflow threads have some more suggestions on dealing with discrete data:

How does kNN classify new data when neighbors disagree? kNN has an easy time when all neighbors are the same class. The intuition is if all the neighbors agree, then the new data point likely falls in the same class.

I’ll bet you can guess where things get hairy…

How does kNN decide the class when neighbors don’t have the same class?

2 common techniques for dealing with this are:

  1. Take a simple majority vote from the neighbors. Whichever class has the greatest number of votes becomes the class for the new data point.
  2. Take a similar vote except give a heavier weight to those neighbors that are closer. A simple way to do this is to use reciprocal distance e.g. if the neighbor is 5 units away, then weight its vote 1/5. As the neighbor gets further away, the reciprocal distance gets smaller and smaller… exactly what we want!
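Both voting schemes fit in one short function. The training points and labels below are made up, Euclidean distance is assumed as the metric, and the function name is my own. The query point is picked so that the 2 schemes disagree.

```python
import math
from collections import defaultdict

def knn_classify(training, query, k, weighted=False):
    """Classify `query` by a vote among its k nearest labeled points."""
    by_distance = sorted((math.dist(point, query), label)
                         for point, label in training)
    votes = defaultdict(float)
    for distance, label in by_distance[:k]:
        if weighted:
            if distance == 0:
                return label  # an exact match decides the class outright
            votes[label] += 1.0 / distance  # closer neighbors count for more
        else:
            votes[label] += 1.0  # simple majority vote
    return max(votes, key=votes.get)

# Hypothetical labeled training data.
training = [
    ((1, 1), "A"), ((2, 2), "A"), ((2, 3), "A"),
    ((5, 5), "B"), ((9, 9), "B"),
]
query = (4.5, 4.5)

majority = knn_classify(training, query, k=3)
by_weight = knn_classify(training, query, k=3, weighted=True)
print(majority, by_weight)
```

The 3 nearest neighbors are one very close B and two more distant A points. The simple majority vote picks A, while the reciprocal-distance vote picks B because the close neighbor carries more weight.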

Is this supervised or unsupervised? This is supervised learning, since kNN is provided a labeled training dataset.

Why use kNN? Ease of understanding and implementing are 2 of the key reasons to use kNN. Depending on the distance metric, kNN can be quite accurate.

But that’s just part of the story…

Here are 5 things to watch out for:

  1. kNN can get very computationally expensive when trying to determine the nearest neighbors on a large dataset.
  2. Noisy data can throw off kNN classifications.
  3. Features with a larger range of values can dominate the distance metric relative to features that have a smaller range, so feature scaling is important.
  4. Since data processing is deferred, kNN generally has greater storage requirements than eager classifiers.
  5. Selecting a good distance metric is crucial to kNN’s accuracy.

Where is it used? A number of kNN implementations exist.

Spam? Fuhgeddaboudit! Read ahead to learn about the next algorithm…

9. Naive Bayes

What does it do? Naive Bayes is not a single algorithm, but a family of classification algorithms that share one common assumption:

Every feature of the data being classified is independent of all other features given the class.

What does independent mean? 2 features are independent when the value of one feature has no effect on the value of another feature.

For example:

Let’s say you have a patient dataset containing features like pulse, cholesterol level, weight, height and zip code. All features would be independent if the value of each feature had no effect on the value of any other. For this dataset, it’s reasonable to assume that the patient’s height and zip code are independent, since a patient’s height has little to do with their zip code.

But let’s not stop there, are the other features independent?

Sadly, the answer is no. Here are 3 feature relationships which are not independent:

  • If height increases, weight likely increases.
  • If cholesterol level increases, weight likely increases.
  • If cholesterol level increases, pulse likely increases as well.

In my experience, the features of a dataset are generally not all independent.

And that ties in with the next question…

Why is it called naive? The assumption that all features of a dataset are independent is precisely why it’s called naive — it’s generally not the case that all features are independent.

What’s Bayes? Thomas Bayes was an English statistician after whom Bayes’ Theorem is named.

In a nutshell, the theorem allows us to predict the class given a set of features using probability.

The simplified equation for classification looks something like this:

P(\textit{Class A}|\textit{Feature 1}, \textit{Feature 2}) = \dfrac{P(\textit{Feature 1}|\textit{Class A}) \cdot P(\textit{Feature 2}|\textit{Class A}) \cdot P(\textit{Class A})}{P(\textit{Feature 1}) \cdot P(\textit{Feature 2})}

Let’s dig deeper into this…

What does the equation mean? The equation finds the probability of Class A given Features 1 and 2. In other words, if you see Features 1 and 2, this is the probability the data is Class A.

The equation reads: The probability of Class A given Features 1 and 2 is a fraction.

  • The fraction’s numerator is the probability of Feature 1 given Class A multiplied by the probability of Feature 2 given Class A multiplied by the probability of Class A.
  • The fraction’s denominator is the probability of Feature 1 multiplied by the probability of Feature 2.

What is an example of Naive Bayes? Below is a great example taken from a Stack Overflow thread.

Here’s the deal:

  • We have a training dataset of 1,000 fruits.
  • The fruit can be a Banana, Orange or Other (these are the classes).
  • The fruit can be Long, Sweet or Yellow (these are the features).

Fruit Probabilities

What do you see in this training dataset?

  • Out of 500 bananas, 400 are long, 350 are sweet and 450 are yellow.
  • Out of 300 oranges, none are long, 150 are sweet and 300 are yellow.
  • Out of the remaining 200 fruit, 100 are long, 150 are sweet and 50 are yellow.

If we are given the length, sweetness and color of a fruit (without knowing its class), we can now calculate the probability of it being a banana, orange or other fruit.

Suppose we are told the unknown fruit is long, sweet and yellow.

Here’s how we calculate all the probabilities in 4 steps:

Step 1: To calculate the probability the fruit is a banana, let’s first recognize that this looks familiar. It’s the probability of the class Banana given the features Long, Sweet and Yellow or more succinctly:

P(Banana|Long, Sweet, Yellow)

This is exactly like the equation discussed earlier.

Step 2: Starting with the numerator, let’s plug everything in.

  • P(Long|Banana) = 400/500 = 0.8
  • P(Sweet|Banana) = 350/500 = 0.7
  • P(Yellow|Banana) = 450/500 = 0.9
  • P(Banana) = 500 / 1000 = 0.5

Multiplying everything together (as in the equation), we get:

0.8 \times 0.7 \times 0.9 \times 0.5 = 0.252

Step 3: Ignore the denominator, since it’ll be the same for all the other calculations.

Step 4: Do a similar calculation for the other classes:

  • P(Orange|Long, Sweet, Yellow) = 0
  • P(Other|Long, Sweet, Yellow) = 0.01875

Since 0.252 is greater than both 0.01875 and 0, Naive Bayes would classify this long, sweet and yellow fruit as a banana.
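
To make the arithmetic concrete, here is a minimal Python sketch of the same calculation. The counts come straight from the fruit example; the names `counts`, `numerator` and `posterior` are made up for illustration. The last step, dividing each score by the sum of all scores to get probabilities that add up to 1, is an assumption on my part and is not part of the 4 steps above.

```python
# Counts from the fruit example: class totals and per-class feature counts.
counts = {
    "Banana": {"total": 500, "Long": 400, "Sweet": 350, "Yellow": 450},
    "Orange": {"total": 300, "Long": 0, "Sweet": 150, "Yellow": 300},
    "Other": {"total": 200, "Long": 100, "Sweet": 150, "Yellow": 50},
}
n_fruits = sum(c["total"] for c in counts.values())  # 1000 fruits in total


def numerator(cls, features):
    """P(f1|cls) * P(f2|cls) * ... * P(cls): the numerator of the equation."""
    c = counts[cls]
    score = c["total"] / n_fruits  # prior P(cls)
    for f in features:
        score *= c[f] / c["total"]  # likelihood P(f|cls)
    return score


observed = ["Long", "Sweet", "Yellow"]
scores = {cls: numerator(cls, observed) for cls in counts}

# The class with the highest numerator wins (the shared denominator is ignored).
prediction = max(scores, key=scores.get)

# Assumption: normalise by the sum of the scores to get values that add up to 1.
total_score = sum(scores.values())
posterior = {cls: s / total_score for cls, s in scores.items()}

print(scores)
print(prediction)
```

The scores reproduce the numbers from Steps 2 and 4, and `prediction` comes out as Banana.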

Is this supervised or unsupervised? This is supervised learning, since Naive Bayes is provided a labeled training dataset in order to construct the tables.

Why use Naive Bayes? As you could see in the example above, Naive Bayes involves simple arithmetic. It’s just tallying up counts, multiplying and dividing.

Once the frequency tables are calculated, classifying an unknown fruit just involves calculating the probabilities for all the classes, and then choosing the highest probability.
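
The "tallying up counts" part can be sketched in a few lines of Python. This is only an illustration under my own assumptions: each training record is a set of observed features plus a label, the toy records are invented, and the function names `train` and `classify` are hypothetical rather than taken from any library.

```python
from collections import Counter, defaultdict


def train(records):
    """Tally class counts and per-class feature counts from (features, label) pairs."""
    class_counts = Counter()
    feature_counts = defaultdict(Counter)
    for features, label in records:
        class_counts[label] += 1
        for f in features:
            feature_counts[label][f] += 1
    return class_counts, feature_counts


def classify(features, class_counts, feature_counts):
    """Return the class with the highest P(class) * product of P(feature|class)."""
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, n in class_counts.items():
        score = n / total  # prior
        for f in features:
            score *= feature_counts[label][f] / n  # likelihood from the tallies
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy training data: (set of observed features, class label).
toy = [
    ({"Long", "Yellow"}, "Banana"),
    ({"Long", "Sweet", "Yellow"}, "Banana"),
    ({"Sweet"}, "Orange"),
    ({"Sweet", "Yellow"}, "Orange"),
    ({"Long"}, "Other"),
]
class_counts, feature_counts = train(toy)
label = classify({"Long", "Yellow"}, class_counts, feature_counts)
print(label)
```

Note that a single zero count wipes out a class entirely, just like the oranges in the example above. Real implementations usually smooth the counts to avoid this, which this sketch does not do.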

Despite its simplicity, Naive Bayes can be surprisingly accurate. For example, it’s been found to be effective for spam filtering.

Where is it used? Implementations of Naive Bayes can be found in Orange, scikit-learn, Weka and R.

Finally, check out the 10th algorithm…

10. CART

What does it do? CART stands for classification and regression trees. It is a decision tree learning technique that outputs either classification or regression trees. Like C4.5, CART is a classifier.

Is a classification tree like a decision tree? A classification tree is a type of decision tree. The output of a classification tree is a class.

For example, given a patient dataset, you might attempt to predict whether the patient will get cancer. The class would either be “will get cancer” or “won’t get cancer.”

What’s a regression tree? Unlike a classification tree which predicts a class, regression trees predict a numeric or continuous value e.g. a patient’s length of stay or the price of a smartphone.

Here’s an easy way to remember…

Classification trees output classes, regression trees output numbers.
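
One way to picture the difference is to look at what a single leaf of each tree returns. The sketch below assumes the common convention that a classification leaf predicts the majority class of its training rows and a regression leaf predicts their mean; the function names and the sample values are invented for illustration.

```python
from collections import Counter
from statistics import mean


def classification_leaf(labels):
    """Predict the most common class among the training rows in this leaf."""
    return Counter(labels).most_common(1)[0][0]


def regression_leaf(values):
    """Predict a number: here, the mean of the training values in this leaf."""
    return mean(values)


# Hypothetical training rows that ended up in one leaf of each kind of tree.
leaf_class = classification_leaf(
    ["won't get cancer", "will get cancer", "won't get cancer"]
)
leaf_days = regression_leaf([3, 5, 4])  # lengths of stay in days

print(leaf_class)
print(leaf_days)
```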

Since we’ve already covered how decision trees are used to classify data, let’s jump right into things…

How does this compare with C4.5?

  • Splitting: C4.5 uses information gain to segment data during decision tree generation. CART uses Gini impurity (not to be confused with the Gini coefficient). A good discussion of the differences between the impurity and the coefficient is available on Stack Overflow.
  • Pruning: C4.5 uses a single-pass pruning process to mitigate over-fitting. CART uses the cost-complexity method of pruning. Starting at the bottom of the tree, CART evaluates the misclassification cost with the node vs. without the node. If the cost doesn’t meet a threshold, the node is pruned away.
  • Branching: C4.5 decision nodes can have 2 or more branches. CART decision nodes have exactly 2 branches.
  • Missing values: C4.5 probabilistically distributes missing values to children. CART uses surrogates to distribute the missing values to children.
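
To get a feel for the two splitting criteria, here is a small Python sketch that scores the same candidate split with both Gini impurity and entropy-based information gain. The labels and the split are invented, and the helper names are hypothetical; real implementations of C4.5 and CART add many details that are left out here.

```python
from collections import Counter
from math import log2


def gini_impurity(labels):
    """1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the class proportions, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def weighted(measure, left, right):
    """Size-weighted average of a measure over the two children of a split."""
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)


# Invented example: 10 rows split into two children of 5 rows each.
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 4

gini_decrease = gini_impurity(parent) - weighted(gini_impurity, left, right)
information_gain = entropy(parent) - weighted(entropy, left, right)

print(gini_decrease)
print(information_gain)
```

Both numbers measure how much purer the children are than the parent, so a tree builder would pick the split that maximises whichever criterion it uses.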

Is this supervised or unsupervised? CART is a supervised learning technique, since it is provided a labeled training dataset in order to construct the classification or regression tree model.

Why use CART? Many of the reasons you’d use C4.5 also apply to CART, since they are both decision tree learning techniques. Things like ease of interpretation and explanation apply to CART as well.

Like C4.5, CART is also quite fast, quite popular and its output is human readable.

Where is it used? scikit-learn implements CART in their decision tree classifier. R’s tree package has an implementation of CART. Weka and MATLAB also have implementations.

Interesting Resources

Now it’s your turn…

Now that I’ve shared my thoughts and research around these data mining algorithms, I want to turn it over to you.

  • Are you going to give data mining a try?
  • Which data mining algorithms have you heard of but weren’t on the list?
  • Or maybe you have a question about an algorithm?

Let me know what you think by leaving a comment.


Bio: Raymond Li is a software engineer and data enthusiast who has been blogging for over a decade. He loves to learn, teach and grow. You’ll usually find him wrangling data, programming and lifehacking. Like what you’ve read? If so, subscribe via email to get started now!

By Raymond Li.

Original post here




Customer Insights From Zoomdata and Cloudera

Understanding and preventing churn (customer loss) first requires connecting to all customer touch points (transaction data, call logs, customer complaints, social media engagement) to create a complete customer view in real time. This demo then showcases how this US telco analyzes root causes of churn by discerning key behavior and the customer journey, identifies at-risk customer profiles accordingly, and executes a plan to prevent churn proactively.



How to Make a Text Summarizer – Intro to Deep Learning #10

I’ll show you how you can turn an article into a one-sentence summary in Python with the Keras machine learning library. We’ll go over word embeddings, encoder-decoder architecture, and the role of attention in learning theory.



Intro – Training a neural network to play a game with TensorFlow and OpenAI

This tutorial mini series is focused on training a neural network to play the OpenAI environment called CartPole.

The idea of CartPole is that there is a pole standing up on top of a cart. The goal is to balance this pole by wiggling/moving the cart from side to side to keep the pole balanced upright.


How We Combined Different Methods to Create Advanced Time Series Prediction

Today, businesses need to be able to predict demand and trends to stay in line with any sudden market changes and economy swings. This is exactly where forecasting tools, powered by Data Science, come into play, enabling organizations to successfully deal with strategic and capacity planning. Smart forecasting techniques can be used to reduce any possible risks and assist in making well-informed decisions. One of our customers, an enterprise from the Middle East, needed to predict their market demand for the upcoming twelve weeks. They required a market forecast to help them set their short-term objectives, such as production strategy, as well as assist in capacity planning and price control. So, we came up with an idea of creating a custom time series model capable of tackling the challenge. In this article, we will cover the modelling process as well as the pitfalls we had to overcome along the way.

There are a number of approaches to building time series predictions… and none of them fit us

With the emergence of powerful forecasting methods based on Machine Learning, future predictions have become more accurate. In general, forecasting techniques can be grouped into two categories: qualitative and quantitative. Qualitative forecasts are applied when there is no data available and prediction is based only on expert judgement. Quantitative forecasts are based on time series modeling. These kinds of models use historical data and are especially efficient at forecasting events that occur over periods of time: for example prices, sales figures, volume of production, etc.

The existing models for time series prediction include ARIMA models, which are mainly used to model time series data without directly handling seasonality; VAR models; Holt-Winters seasonal methods; TAR models; and others. Unfortunately, these algorithms may fail to deliver the required level of prediction accuracy, as they can involve raw data that might be incomplete, inconsistent or contain errors. As quality decisions are based only on quality data, it is crucial to perform preprocessing to prepare the input data for further processing.

Why combining models is an answer

It is clear that one particular forecasting technique cannot work in every situation. Each of the methods has its specific use case and can be applied with regard to many factors (the period over which the historical data is available, the time period that has to be observed, the size of the budget, the preferred level of accuracy) and the output required. So, we faced the question: which method/methods to use to obtain the desired result? As different approaches had their unique strengths and weaknesses, we decided to combine a number of methods and make them work together. In this way, we could build a time series model capable of providing trustworthy predictions to ensure data reliability and time/cost saving. And this is how we did it.

The modeling process; let’s dive into the details

The demand data depends on various factors that can influence the result of the forecast, such as the price and types of goods, geographical location, the country’s economics, manufacturing technology, etc. As we wanted our time series model to provide the customer with high-accuracy predictions, we used the interpolation method for missing values to ensure that the input is reliable.
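
As an illustration of filling in missing values, here is a minimal Python sketch of linear interpolation. The article does not say which interpolation variant was used, so linear interpolation is an assumption, and the function name `interpolate_missing` and the sample numbers are hypothetical.

```python
def interpolate_missing(series):
    """Fill None gaps by linear interpolation between the nearest known neighbours.

    Gaps at the very start or end of the series are left as None.
    """
    filled = list(series)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


# Invented weekly demand figures with three missing observations.
demand = [100, None, None, 130, 120, None, 140]
filled_demand = interpolate_missing(demand)
print(filled_demand)
```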

When conducting the time series analysis in Python 2.7, we analyzed past data from 2010 to 2015 to calculate the demand precisely and predict its future behavior.

Fig. 1. The demand data over the 2010-2015 timeframe

At first sight, it may seem that there is no constant demand value, as the variance goes up and down, making the prediction hardly possible. But, there is a method that can help here.

We used the decomposition method to separately extract trend (the increase or decrease in the series over a period of time), seasonality (the fluctuation that occurs within the series over each week, each month, etc.) and residuals (the data point that falls outside of the expected data range). With these three components we built the additive model:

y_t = T_t + S_t + R_t

where y_t is the data, T_t is the trend-cycle component, S_t is the seasonal component and R_t is the residual component, all defined over the time period t.
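
To show what an additive decomposition (data = trend + seasonal + residual) looks like in code, here is a deliberately simple Python sketch. It uses a centered moving average for the trend and per-position averages for the seasonality, which is a textbook substitute and not the Hodrick-Prescott or Loess approach described in this article. The function names and the synthetic series are made up for illustration.

```python
def moving_average_trend(y, period):
    """Centered moving average; None where the window does not fit (odd period assumed)."""
    half = period // 2
    trend = [None] * len(y)
    for t in range(half, len(y) - half):
        window = y[t - half : t + half + 1]
        trend[t] = sum(window) / period
    return trend


def decompose_additive(y, period):
    """Split y into trend, seasonal and residual parts so that y = T + S + R."""
    trend = moving_average_trend(y, period)
    # Average the detrended values for each position in the seasonal cycle.
    buckets = [[] for _ in range(period)]
    for t, (obs, tr) in enumerate(zip(y, trend)):
        if tr is not None:
            buckets[t % period].append(obs - tr)
    seasonal_means = [sum(b) / len(b) for b in buckets]
    offset = sum(seasonal_means) / period  # centre the seasonal effects on zero
    seasonal = [seasonal_means[t % period] - offset for t in range(len(y))]
    residual = [
        None if tr is None else obs - tr - s
        for obs, tr, s in zip(y, trend, seasonal)
    ]
    return trend, seasonal, residual


# Synthetic series: a linear trend plus a repeating 3-step seasonal pattern.
pattern = [5.0, -2.0, -3.0]
y = [10.0 + 2.0 * t + pattern[t % 3] for t in range(12)]
trend, seasonal, residual = decompose_additive(y, period=3)
print(trend)
print(seasonal[:3])
```

On this noise-free series the recovered trend and seasonal pattern match the ones used to build it, and the residuals are essentially zero.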

An important first step in describing various components of the series is smoothing, although it does not really provide you with a ready-to-use model. In the beginning, we estimated the trend (behavior) component. Such methods as Moving Average, Exponential Smoothing, Chow’s Adaptive Control, Winter’s Linear and Seasonal Exponential Smoothing methods did not provide us with the trend estimation accuracy we expected. The most reliable result was obtained using the Hodrick-Prescott Filter technique.

Fig. 2. The estimated trend

Then, we defined the seasonality from the available data. This component could change over time, so we applied a powerful tool for decomposing the time series – the Loess method. This approach can handle any type of seasonality, and the rate of change can be controlled by a user.

Fig. 3. Multi-seasonality

We obtained a multi-seasonal component with some high and low variances, causing large fluctuations.

After applying Elastic Net Regression and Fourier transformation, we built a forecast for the trend based on the results obtained. The approximation of the trend can be found from the formula below,

where P_n(t) is a polynomial of degree n and A_k is a set of indexes containing the k indexes with the highest amplitudes.

Then, we calculated the Fourier coefficients using the Discrete Fourier Transform (DFT).
Fig. 4. Example code for the DFT in Python
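
The original code figure is not reproduced here, so the following is an independent, minimal sketch of a DFT in plain Python, not the authors' code. It also picks the indexes with the highest amplitudes, in the spirit of the set A_k mentioned above. The helper `top_k_indexes` and the test signal are invented for illustration.

```python
import cmath
import math


def dft(x):
    """Discrete Fourier Transform: X[k] = sum over n of x[n] * exp(-2*pi*i*k*n/N)."""
    n_samples = len(x)
    return [
        sum(
            x[n] * cmath.exp(-2j * math.pi * k * n / n_samples)
            for n in range(n_samples)
        )
        for k in range(n_samples)
    ]


def top_k_indexes(coeffs, k):
    """Indexes of the k coefficients with the highest amplitudes.

    Index 0 (the mean term) and the mirrored second half of the spectrum
    are skipped, which is valid for real-valued input.
    """
    half = range(1, len(coeffs) // 2 + 1)
    return sorted(half, key=lambda i: abs(coeffs[i]), reverse=True)[:k]


# Invented signal: a strong wave at frequency 2 and a weaker one at frequency 5.
signal = [
    3 * math.sin(2 * math.pi * 2 * t / 16) + math.cos(2 * math.pi * 5 * t / 16)
    for t in range(16)
]
coeffs = dft(signal)
strongest = top_k_indexes(coeffs, 2)
print(strongest)
```

This direct version takes time proportional to N squared; for long series a Fast Fourier Transform routine from a numerical library would be the usual choice.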

The effect of the Fourier terms, used as external regressors in the model, is visualised below.

Fig. 5. The visualised effect of Fourier terms

We built the trend prediction using the additive model.

Fig. 6. Trend prediction

When the trend and seasonal components are removed from the model, we can obtain the residuals (the difference between an observed value and its forecast based on other observations) from the remaining part to validate and fit our mathematical model.

Fig. 7. Obtained residuals

You may notice that there are some negative values present, showing that something unusual was happening during those periods. To find out what caused this behavior, we matched the outliers against a simple calendar and discovered that the negative values correlate tightly with public holidays such as Ramadan, Eid Al Fitr and others. Having collected and summarized all the data, we applied Machine Learning methods that use previous data points as input features, along with Machine Learning Strategies for Time Series Prediction.

After a few training sessions conducted with ML models, we built a prediction for residuals that can be observed below.

Fig. 8. Prediction for residuals

As a result, we got a final forecasting model that brings the mean absolute percentage error (MAPE) down to 6% for one particular city and 10% for the entire country.

Fig. 9. The forecast at the original scale
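
For readers unfamiliar with MAPE, here is a minimal Python sketch of how it is computed. The numbers are invented and have nothing to do with the customer's data; the function assumes that no actual value is zero.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent. Assumes no actual value is zero."""
    errors = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(errors) / len(errors)


# Invented demand figures and forecasts.
actual = [100.0, 200.0, 400.0]
forecast = [110.0, 190.0, 400.0]
error_pct = mape(actual, forecast)
print(round(error_pct, 2))
```

Here the individual errors are 10%, 5% and 0%, so the MAPE works out to about 5%.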

A 24-times faster prediction? Yes, it’s possible

When building our model, we attempted not only to use the available information, but also discover the factors which could affect the results. This approach helped us develop the model generating more accurate forecasting results faster than the existing models. For example, to train the developed model to make a prediction for 300 different cities, we need about 15 minutes, while other methods require about 6 hours.

Also, the fact that the deviation between the actual demand and the predicted demand was only 6% resulted in possibilities to resolve mismatches between supply and demand. Now, the customer can more quickly and more easily plan the capacity, minimize future risks and optimize inventory.

What’s next?

Well, the results are quite promising. There is still a long way to go in improving this model so that it can provide accurate long-term forecasts as well. For now, the degree of error for long-term predictions is still quite high. Sounds like a challenge? So stay tuned! Some new experiments are coming!

Posted by ELEKS on November 1, 2016 at 6:00am




Deep Learning Libraries by Language


Python

  1. Theano is a Python library for defining and evaluating mathematical expressions with numerical arrays. It makes it easy to write deep learning algorithms in Python. Many more libraries are built on top of Theano.

    1. Keras is a minimalist, highly modular neural network library in the spirit of Torch, written in Python, that uses Theano under the hood for optimized tensor manipulation on GPU and CPU.

    2. Pylearn2 is a library that wraps a lot of models and training algorithms such as Stochastic Gradient Descent that are commonly used in Deep Learning. Its functional libraries are built on top of Theano.

    3. Lasagne is a lightweight library to build and train neural networks in Theano. It is governed by the principles of simplicity, transparency, modularity, pragmatism, focus and restraint.

    4. Blocks is a framework that helps you build neural network models on top of Theano.

  2. Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and by community contributors. Google’s DeepDream is based on the Caffe framework. This framework is a BSD-licensed C++ library with a Python interface.

  3. nolearn contains a number of wrappers and abstractions around existing neural network libraries, most notably Lasagne, along with a few machine learning utility modules.

  4. Gensim is a deep learning toolkit implemented in Python, intended for handling large text collections using efficient algorithms.

  5. Chainer bridges the gap between algorithms and implementations of deep learning. It is powerful, flexible and intuitive, and is considered a flexible framework for Deep Learning.

  6. deepnet is a GPU-based Python implementation of deep learning algorithms like Feed-forward Neural Nets, Restricted Boltzmann Machines, Deep Belief Nets, Autoencoders, Deep Boltzmann Machines and Convolutional Neural Nets.

  7. Hebel is a library for deep learning with neural networks in Python using GPU acceleration with CUDA through PyCUDA. It implements the most important types of neural network models and offers a variety of different activation functions and training methods such as momentum, Nesterov momentum, dropout, and early stopping.

  8. CXXNET is a fast, concise, distributed deep learning framework based on MShadow. It is a lightweight and easily extensible C++/CUDA neural network toolkit with a friendly Python/Matlab interface for training and prediction.

  9. DeepPy is a Pythonic deep learning framework built on top of NumPy.

  10. DeepLearning is a deep learning library developed with C++ and Python.

  11. Neon is Nervana’s Python-based Deep Learning framework.


Matlab

  1. ConvNet: a convolutional neural net is a type of deep learning classification algorithm that can learn useful features from raw data by itself, which it does by tuning its weights.

  2. DeepLearnToolBox is a Matlab/Octave toolbox for deep learning that includes Deep Belief Nets, Stacked Autoencoders and convolutional neural nets.

  3. cuda-convnet is a fast C++/CUDA implementation of convolutional (or more generally, feed-forward) neural networks. It can model arbitrary layer connectivity and network depth. Any directed acyclic graph of layers will do. Training is done using the backpropagation algorithm.

  4. MatConvNet is a MATLAB toolbox implementing Convolutional Neural Networks (CNNs) for computer vision applications. It is simple, efficient, and can run and learn state-of-the-art CNNs.


C++

  1. eblearn is an open-source C++ library of machine learning by New York University’s machine learning lab, led by Yann LeCun. In particular, it provides implementations of convolutional neural networks with energy-based models along with a GUI, demos and tutorials.

  2. SINGA is designed to be general enough to implement the distributed training algorithms of existing systems. It is supported by the Apache Software Foundation.

  3. NVIDIA DIGITS is a new system for developing, training and visualizing deep neural networks. It puts the power of deep learning into an intuitive browser-based interface, so that data scientists and researchers can quickly design the best DNN for their data using real-time network behavior visualization.

  4. Intel® Deep Learning Framework provides a unified framework for Intel® platforms accelerating Deep Convolutional Neural Networks.


Java

  1. N-Dimensional Arrays for Java (ND4J) is a set of scientific computing libraries for the JVM. They are meant to be used in production environments, which means routines are designed to run fast with minimum RAM requirements.

  2. Deeplearning4j is the first commercial-grade, open-source, distributed deep-learning library written for Java and Scala. It is designed to be used in business environments, rather than as a research tool.

  3. Encog is an advanced machine learning framework which supports Support Vector Machines, Artificial Neural Networks, Genetic Programming, Bayesian Networks, Hidden Markov Models and Genetic Algorithms.


JavaScript

  1. Convnet.js is a JavaScript library for training Deep Learning models (mainly Neural Networks) entirely in a browser. No software requirements, no compilers, no installations, no GPUs, no sweat.


Lua

  1. Torch is a scientific computing framework with wide support for machine learning algorithms. It is easy to use and efficient, thanks to a fast scripting language, LuaJIT, and an underlying C/CUDA implementation. Torch is based on the Lua programming language.


Julia

  1. Mocha is a Deep Learning framework for Julia, inspired by the C++ framework Caffe. Efficient implementations of general stochastic gradient solvers and common layers in Mocha can be used to train deep / shallow (convolutional) neural networks, with (optional) unsupervised pre-training via (stacked) auto-encoders. Its best features include a modular architecture, a high-level interface, portability with speed, compatibility and many more.


Lisp

  1. Lush (Lisp Universal Shell) is an object-oriented programming language designed for researchers, experimenters, and engineers interested in large-scale numerical and graphic applications. It comes with a rich set of deep learning libraries as part of its machine learning libraries.


Haskell

  1. DNNGraph is a deep neural network model generation DSL in Haskell.


.NET

  1. Accord.NET is a .NET machine learning framework combined with audio and image processing libraries completely written in C#. It is a complete framework for building production-grade computer vision, computer audition, signal processing and statistics applications.


R

  1. The darch package can be used for generating neural networks with many layers (deep architectures). Training methods include pre-training with the contrastive divergence method and fine-tuning with commonly known training algorithms like backpropagation or conjugate gradient.
  2. deepnet implements some deep learning architectures and neural network algorithms, including BP, RBM, DBN, Deep autoencoder and so on.

source from teglor




Computer Vision is a dynamic and rapidly growing field with countless high-profile applications that have been developed in recent years. The potential uses are diverse, and its integration with cutting edge research has already been validated with self-driving cars, facial recognition, 3D reconstructions, photo search and augmented reality. Artificial Intelligence has become a fundamental component of everyday technology, and visual recognition is a key aspect of that. It is a valuable tool for interpreting the wealth of visual data that surrounds us, on a scale impossible with natural vision.

This course covers the tasks and systems at the core of visual recognition with a detailed exploration of deep learning architectures. While there will be a brief introduction to computer vision and frameworks, such as Caffe, Torch, Theano and TensorFlow, the focus will be learning end-to-end models, particularly for image classification. Students will learn to implement, train and debug their own neural networks as well as gain a detailed understanding of cutting-edge research in computer vision.

The final assignment will include training a multi-million parameter convolutional neural network and applying it on the largest image classification dataset (ImageNet).


  • Justin Johnson Instructor, Computer Science


  • End-to-end models
  • Image classification, localization and detection
  • Implementation, training and debugging
  • Learning algorithms, such as backpropagation
  • Long Short Term Memory (LSTM)
  • Recurrent Neural Networks (RNN)
  • Supervised and unsupervised learning


3.0 – 4.0

Students enrolling under the non degree option are required to take the course for 4.0 units.


Proficiency in Python; familiarity with C/C++; CS131 and CS229 or equivalents; Math21 or equivalent, linear algebra.

More information


For equities Trump is not decisive – What do you think?

The new US president is also influencing the price of European equities with his rhetoric. However, investors should not focus on overseas comments; other factors are more crucial.

For Swiss equities, Trump is not decisive

Donald Trump has already made several trips to the car manufacturers in recent months. Shortly before his inauguration as US president, he also targeted the pharma sector: in mid-January, Trump announced plans to reduce drug prices in the US. “We’ll save billions of dollars,” he said.

Is there any big data or predictive analysis report on this?


Gartner 2017 Magic Quadrant for Data Science Platforms: gainers and losers

We compare Gartner 2017 Magic Quadrant for Data Science Platforms vs its 2016 version and identify notable changes for leaders and challengers, including IBM, SAS, RapidMiner, KNIME, MathWorks, Microsoft, and Quest.

Gartner’s new 2017 Magic Quadrant for Data Science Platforms (called “Advanced Analytics Platforms” in 2016) was published last week. The 2017 report evaluated a new set of 16 analytics and data science firms over 15 criteria and placed them in 4 quadrants, based on completeness of vision and ability to execute.

While open source platforms like Python and R play an important role in the Data Science market, Gartner’s research methodology does not include them, so this report evaluates only commercial vendors.

Fig. 1: Gartner 2017 Magic Quadrant for Data Science Platforms

Firms covered:

  • Leaders (4): IBM, SAS, RapidMiner, KNIME
  • Challengers (4): MathWorks (new), Quest (formerly Dell), Alteryx, Angoss
  • Visionaries (5): Microsoft, H2O.ai (new), Dataiku (new), Domino Data Lab (new), Alpine Data
  • Niche Players (3): FICO, SAP, Teradata (new)

Gartner notes that even the lowest-scoring vendors in MQ are still among the top 16 firms among over 100 vendors in the heated Data Science market.

5 new firms were added in 2017 (MathWorks, H2O.ai, Dataiku, Domino Data Lab, and Teradata) and 5 others present in the 2016 MQ were dropped: Lavastorm, Megaputer, Prognoz, Accenture, Predixion Software.

As we did in our popular post last year: Gartner 2016 Magic Quadrant for Advanced Analytics Platforms: gainers and losers, we compared 2017 Magic Quadrant with its 2016 version. Below we examine the changes, gainers, and losers.

Fig 2: Gartner Magic Quadrants for Data Science Platforms compared, 2017 vs 2016

Fig 2 shows a comparison of 2016 MQ (greyed background image) and 2017 MQ (foreground image), with arrows connecting circles for the same firm. Arrows are colored green if the firm position improved significantly (further away from origin), red if the position became weaker. Green circles indicate new firms, while red Xs mark vendors dropped in 2017.


Leaders:

The same 4 firms: IBM, SAS, RapidMiner, and KNIME, which were leaders in 2014-2016, are again in the Leaders quadrant. Dell Software, sold in Nov 2016 and renamed Quest, and its Statistica Analytics suite moved from the Leaders to the Challengers quadrant.

The 2017 MQ shows that IBM gained a little in vision, while SAS lost a little in ability to execute. RapidMiner gained in ability to execute, but both KNIME and RapidMiner lost in vision.

IBM is again a leader, based on SPSS Modeler and SPSS Statistics. Gartner did not evaluate IBM’s new Data Science Experience (DSx) platform, but DSx contributed to IBM’s improved position along the Vision axis. IBM’s strengths include its vast customer base and continued innovation of its data science and machine learning capabilities. However, IBM’s breadth of offerings (SPSS, IBM Watson, DSx, Cognos) creates some confusion in the market and the SPSS line has interoperability problems.

SAS provides many software products for analytics and data science. Gartner’s evaluation covers SAS Enterprise Miner (EM) and the SAS Visual Analytics Suite (VAS). SAS is more focused now on interactive modeling with VAS, but continues to support its traditional base SAS. SAS retains a strong position in the Leaders quad, but confusion about its multiple products and concerns about high cost led to a decline in Ability to Execute.

KNIME offers the open-source KNIME Analytics Platform with strong functionality for advanced data scientists. It is strong in several industries, especially in manufacturing and life sciences. It lost somewhat along the Vision dimension due to weaker marketing and innovation compared to other leaders.

RapidMiner offers a GUI-based data science platform, suitable for beginner and expert data scientists. It also offers access to open-source code. RapidMiner is available both as a free version and a commercial edition (with extra functionality for large data and connections to more data sources). RapidMiner is in the Leaders quad due to its market presence and well-rounded product.


Challengers:

  • MathWorks is a new entry for its MATLAB product, popular with engineers, and providing a rich set of toolboxes.
  • Quest, the result of the sale of Dell Software in 2016 to a private equity firm, now sells the Statistica Platform. Quest is in the Challengers quad (while Dell was in the Leaders quad) as a result of the second change of ownership of Statistica in 3 years and a lack of cloud-related product improvements (which however are on the roadmap).
  • Alteryx offers an easy-to-use data science platform, with self-service data preparation and advanced analytics. It also added simulation and optimization capabilities. Compared to 2016, it moved from the Visionaries to the Challengers quad due to its solid customer growth.
  • Angoss provides visual data mining and predictive analytics tools, as well as prescriptive analysis and optimization. Angoss remained in almost the same position in the Challengers quad as in 2016.


Visionaries:

  • Microsoft’s evaluation was based on the Azure Machine Learning platform, part of the Microsoft Cortana Intelligence Suite, which offers a strong cloud-based data science platform. Gartner kept Microsoft in the Visionaries quad, due to the lack of a comparable onsite solution.
  • H2O.ai, a new entry, offers an open-source data science platform with fast execution of Deep Learning and other advanced Machine Learning methods.
  • Dataiku is a new entry, with its Data Science Studio (DSS). It was placed in the Visionaries quad due to the innovative nature of DSS, openness, collaboration features, and suitability for different skill levels.
  • Domino Data Lab is a new entry with its Domino Data Science Platform, which focuses on collaboration and supports a wide range of open-source technologies.
  • Alpine Data offers a “citizen data science” platform, Chorus, enabling collaboration between business analysts and front-line users in building and running analytic workflows. Compared to 2016, Alpine remained in the Visionaries quadrant, but dropped in its ability to execute due to its struggle to gain market share.

Niche Players:

  • SAP renamed its data science platform to SAP BusinessObjects Predictive Analytics. SAP decline in ability to execute caused it to drop from Challengers quad to Niche quad, and it is lagging in Spark integration, open-source and Python support, and cloud deployment.
  • FICO Decision Management Suite (DMS) offers multiple analytic tools. It stayed in the Niche quadrant, as in 2016, and lost a little ground along both dimensions.
  • Teradata offers the Aster Analytics platform, with 3 layers: analytic engines, prebuilt analytic functions, and the Aster AppCenter for analysis and connectivity to external BI tools. It is in the Niche quad due to a low level of adoption.

You can download the Gartner 2017 Magic Quadrant report for Advanced Analytics Platforms from RapidMiner, Dataiku, and probably other vendors favorably mentioned in this report.

You can also download a related 2017 Gartner Magic Quadrant for BI & Analytics from Tableau.



6 Top Big Data and Data Science Trends 2017

We have recently stepped into 2017, and it’s time to draw conclusions about 2016. Our world depends more and more on data and information, and businesses are using data science and machine learning to handle enormous amounts of data. Deep learning and artificial intelligence applications really took off, and we expect those and other important developments to continue. So, let’s have a look at the top big data and data science trends for 2017.


1. More and more businesses move their predictive analytics needs to the cloud

Many companies are moving their data and applications to the cloud. This move is driven by the need for greater collaboration and flexibility, as well as by the reduced complexity of administering and configuring computing resources.

The majority of the top cloud providers have developed their own machine learning services in the cloud. This allows organizations to leverage machine learning technology without massive investments or the need to employ large data science teams. As a result, we are now observing the consumerization of predictive analytics technologies, supported by these cloud prediction services.

Here are the main examples of such machine learning and AI as a service (MLaaS and AIaaS) providers:

  • IBM Watson
  • Microsoft Azure Machine Learning API
  • Google Prediction API
  • Amazon Machine Learning API


Those working with data know very well that it is useless if it is not efficiently analyzed and turned into insights that support the decision-making process. In 2017, the increased usage of cloud ML services will help improve and accelerate the transition from data to action for many companies in various industries.

2. A greater number of organizations will be adopting the Hadoop and Spark big data platforms.

In 2017 we will see steady growth of big data technologies like Hadoop and Spark. The main advantage of Hadoop is its ability to collect vast amounts of data and then distribute that data to low-cost servers running in parallel.
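The idea of splitting data across workers and then combining their partial results can be sketched in plain Python. This is only a single-machine illustration of the map/reduce pattern, not Hadoop’s actual API: the function names (`split_into_chunks`, `map_chunk`, `word_count`) are hypothetical, and a thread pool stands in for the low-cost servers.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(lines, n_chunks):
    """Partition the input so each simulated worker gets one slice."""
    return [lines[i::n_chunks] for i in range(n_chunks)]


def map_chunk(chunk):
    """Map step: count the words within a single chunk."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def word_count(lines, n_workers=3):
    """Run the map step in parallel, then reduce the partial counts."""
    chunks = split_into_chunks(lines, n_workers)
    # Assumption: threads stand in for separate servers in this sketch.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(map_chunk, chunks))
    # Reduce step: merge the per-chunk counters into one result.
    return reduce(lambda a, b: a + b, partial_counts, Counter())


if __name__ == "__main__":
    data = ["big data big insights", "data in the cloud", "big cloud"]
    print(word_count(data))
```

Each chunk is processed independently, so in a real cluster the map step could run on whichever machine already holds that slice of the data.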

To make Hadoop work outside the test environment, many enterprises are going to shift it to the cloud in 2017 and separate storage from computation.

Let’s also take a look at Apache Spark: it is one of the fastest-growing big data platforms, and it uses in-memory computation.

According to recent studies, the share of users running Spark on the public cloud (61%) was higher than the share using Hadoop YARN (36%), and this trend is expected to continue in 2017. In addition, Spark in 2017 continues to advance its operating model of “in-memory analytics.” This approach allows much faster computations, in some cases up to 100 times faster than the original Hadoop MapReduce framework.
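Why keeping data in memory helps can be shown with a small sketch. It is not Spark code: `SlowStore` is a hypothetical stand-in for disk-backed storage that simply counts how often it is read, and the two functions compare an iterative job that reloads the data on every pass with one that loads it once and reuses it.

```python
class SlowStore:
    """Hypothetical disk-backed store; counts how often it is read."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._records)


def iterate_from_disk(store, n_iterations):
    """Re-read the dataset from storage on every pass."""
    total = 0
    for _ in range(n_iterations):
        total += sum(store.load())
    return total


def iterate_in_memory(store, n_iterations):
    """Load once, keep the working set in memory, and reuse it."""
    cached = store.load()
    total = 0
    for _ in range(n_iterations):
        total += sum(cached)
    return total


if __name__ == "__main__":
    disk, mem = SlowStore([1, 2, 3]), SlowStore([1, 2, 3])
    print(iterate_from_disk(disk, 10), disk.reads)
    print(iterate_in_memory(mem, 10), mem.reads)
```

Both functions compute the same result, but the in-memory version touches storage once instead of once per iteration, which is where iterative workloads such as machine learning training tend to gain the most.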


3. The need for stronger data security is rising

The increasing number of cyber attacks is drawing more attention to data security, which has become more relevant in recent years. Security analytics costs are expected to rise. Thus, it is critical to keep up with new trends in the field of data security.

One of the recent trends in security is the increased usage of machine learning algorithms, including deep learning, for anomaly detection and other security data science tasks in various business domains.
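To make anomaly detection concrete, here is a minimal sketch that flags unusual values with a modified z-score based on the median absolute deviation. It is a simple statistical stand-in, not a deep learning model; the function name `find_anomalies`, the sample login counts, and the conventional 3.5 threshold are assumptions chosen for illustration.

```python
from statistics import median


def find_anomalies(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds threshold."""
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)  # median absolute deviation
    if mad == 0:
        return []  # no spread to measure against in this simple sketch
    # 0.6745 rescales the MAD so the score is comparable to a z-score.
    return [i for i, d in enumerate(abs_dev) if 0.6745 * d / mad > threshold]


if __name__ == "__main__":
    # Hypothetical failed-login counts per hour for one account.
    logins = [12, 15, 11, 14, 13, 12, 95, 14]
    for i in find_anomalies(logins):
        print(f"hour {i}: {logins[i]} failed logins looks anomalous")
```

Median-based statistics are used here because a single extreme value barely shifts them, whereas it would inflate a mean and standard deviation and could hide itself.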

Another noticeable trend in security data science is the use of AI conversational interfaces (a.k.a. bots) to automate assistance and security responses to emerging threats. Such bots are already being used on several platforms, such as Slack.

Behavioral biometrics is one of today’s quickly developing areas. It can be used standalone, but in combination with ML we should expect faster results, which should reduce costs and improve efficiency. It can be used to identify the person on the other end and what they’re trying to do, as well as to distinguish a human user from a bot.

In the future, many new types of attacks may emerge, so the requirements for cyber security are becoming more complicated, and security specialists will need to adapt to the new threats.

4. Deep Learning technology is becoming mainstream.

Deep learning got a lot of attention in 2016, as many notable results were achieved by using it in important applications such as machine translation and other forms of language processing, automatic image caption generation, object classification and detection in images, facial recognition, and automatic game playing.

In 2017 we expect AI to keep improving and to move more and more towards artificial general intelligence (AGI). It is no wonder that for data scientists deep learning will be one of the most important approaches for solving machine learning problems.

Furthermore, deep learning is now becoming widely available thanks to increased computing power and the accessibility of this technology through various open-source frameworks, such as TensorFlow and Deeplearning4j.

Top companies such as Google, Facebook, Microsoft and Baidu are driving research further with their strong artificial intelligence R&D teams.

So we are going to see a lot of new deep learning applications that could significantly impact our lives, like automatic photo correction and professional photo processing using ML methods. Among others, we can mention the rise of new portable devices that help disabled people see, hear, and otherwise sense their environment by utilizing machine learning.


5. The movement towards conversational interfaces and chatbots will accelerate

For a long time after computers appeared in our lives, we were forced to speak their language to communicate with them. Now, thanks to the latest discoveries and technology development, we can teach computers how to communicate in our language. Conversational systems have moved from a model in which the user adapts to the computer to one where the computer “listens” and adjusts to the user’s desired result.

The new possibilities in conversational interfaces (CI) and chatbots will require a lot of new skills from developers. First of all, they will be a huge challenge for UX designers, as they are very different from current user interfaces. This year, many programmers and designers will aim to improve their skills in the design and development of chatbots.

Chatbots themselves are going to be a big trend for 2017, and we expect to see more creative solutions bringing chatbots to the next level. Chatbot technology has already been integrated into many consumer apps. The next step is to see it move into other industries like retail banking, e-commerce, enterprise solutions, or even marketing campaigns.

So why have bots become popular recently? Although bots have existed for a long time, only now has AI development reached a level where it is possible to create advanced products, many of which utilize machine learning. Prominent examples of conversational AI include Google Assistant and Siri for iOS, which have become almost indispensable for smartphone users and have already gone far beyond being just a fun application or a quirk.


6. 2017 is the Year of Self-Driving Cars

Finally, one of the incredibly popular trends in 2017 is self-driving cars and autonomous technology. This technology has been under development for several years, but now it is almost ready to hit the market. More and more car manufacturers are moving towards automating driving and launching their hands-free models. This year may be a big milestone for the area.

A lot of giants in the auto market have already presented their projects. Among them are BMW, Volvo, General Motors, Nissan and of course Tesla and Google. Companies and the greatest minds of our time believe that production of such cars will reduce the number of road accidents. On the other hand, we should not forget how many jobs would be lost as this technology develops. For example, in the USA, driving is the most common profession in half of the states. This topic has been the subject of hot debate, but mostly among sociologists, economists and Silicon Valley gurus.

Electric cars are nothing new or surprising nowadays, and they are gradually becoming part of our lives as the technology develops and prices fall. The time is coming for the rise of self-driving technology. However, there are still many open questions regarding safety and proper legislation in this area.



2017 is going to be very exciting and full of new discoveries and improvements in the field of artificial intelligence. This year is going to be “The Year of Intelligence,” as AI and machine learning applications go mainstream, contribute to every part of the organization and every business area, and become one of the key competitive advantages for companies that integrate machine learning into their operations.

We hope you’ve enjoyed reading this post about new developments and trends in the data science field for 2017. We do not claim this is a definitive list, as so many things are evolving quickly in the technology realm, so we encourage you to share your vision of the main trends in data and data science in the comments section below.

Originally published at: