This tutorial mini-series focuses on training a neural network to play the OpenAI environment called CartPole.

The idea of CartPole is that a pole stands upright on top of a cart. The goal is to keep the pole balanced by moving the cart from side to side.
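To make the balancing idea concrete before any neural network is involved, here is a minimal hand-written rule, not the trained network the series is about. It assumes the environment reports the pole angle and angular velocity, that positive values mean "tilting right", and that actions are encoded as 0 for left and 1 for right. The 0.5 weight is an arbitrary illustrative choice.

```python
LEFT, RIGHT = 0, 1  # assumed action encoding: 0 pushes the cart left, 1 pushes it right

def choose_action(pole_angle, pole_angular_velocity):
    """Push the cart toward the side the pole is falling to."""
    # Positive values are assumed to mean the pole leans or rotates to the right.
    lean = pole_angle + 0.5 * pole_angular_velocity  # 0.5 is an arbitrary weight
    return RIGHT if lean > 0 else LEFT

print(choose_action(0.05, 0.0))    # pole leans right, so push right
print(choose_action(-0.02, -0.3))  # pole falls left, so push left
```

A learned policy replaces this fixed rule with a network that maps the same observations to an action.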

Let’s build our own language translator using Tensorflow! We’ll go over several translation methods and talk about how Google Translate is able to achieve state of the art performance.

This video has been edited together from several other video documentaries to describe the Mandelbrot set! An incredible mathematical formula explaining fractals and geometry! Several mathematicians and scientists explain this phenomenon in clear detail. Please enjoy!

We are not the owners of these video clips nor do we claim to be. This video is for educational & entertainment purposes only.

Data is almost everywhere. The amount of digital data in existence is growing rapidly: it doubles roughly every two years, and it is transforming how we live. According to a paper from IBM, about 2.5 billion gigabytes of data were generated every day in 2012. An article from Forbes reports that data is growing faster than ever, and suggests that by 2020 about 1.7 megabytes of new information will be created every second for every human being on the planet. As data grows, new terms associated with processing and handling it keep coming up. These include data science, data mining and machine learning. In the following sections we explain each of these terms in detail.

What is data science?

Data science deals with both structured and unstructured data. It is a field that covers everything associated with the cleansing, preparation and final analysis of data. Data science combines programming, logical reasoning, mathematics and statistics. It captures data in ingenious ways and encourages looking at things from a different perspective. It also cleanses, prepares and aligns the data. To put it more simply, data science is an umbrella term for several techniques used to extract information and insights from data. Data scientists are responsible for creating data products and other data-based applications that handle data in ways conventional systems cannot.

What is data mining?

Data mining is the process of extracting previously unknown information from huge databases and then using that information to make relevant business decisions. To put it more simply, data mining is a set of methods used in knowledge discovery to identify relationships and patterns that were previously unknown. We can therefore describe data mining as a confluence of other fields such as artificial intelligence, database management, pattern recognition, data visualization, machine learning, statistics and so on. The primary goal of data mining is to extract information from various data sets and transform it into proper, understandable structures for eventual use. Data mining is thus a process used by data scientists and machine learning enthusiasts to convert large data sets into something more usable.

What is machine learning?

Machine learning is a kind of artificial intelligence that gives computers the ability to learn from new data sets without being explicitly programmed. It focuses primarily on developing computer programs that can change when exposed to new data. Machine learning and data mining follow broadly the same process, but the two are not the same. Machine learning is a method of data analysis that automates analytical model building. It uses algorithms that iteratively learn from data, and in doing so it lets computers find hidden insights without being explicitly told where to look. To get the best results from data mining, complex algorithms are paired with the right processes and tools.

What is the difference between these three terms?

As we mentioned earlier, data scientists are responsible for coming up with data-centric products and applications that handle data in ways conventional systems cannot. Data science is much more focused on the technical ability to handle any type of data. Unlike data mining and machine learning, it is also responsible for assessing the impact of data on a specific product or organization.

While data science focuses on the science of data, data mining is concerned with the process. It deals with discovering new patterns in big data sets. It may appear similar to machine learning because it also relies on algorithms. However, unlike in machine learning, algorithms are only one part of data mining. In machine learning, algorithms are used to gain knowledge from data sets. In data mining, algorithms are just one component of a larger process, so it does not focus completely on algorithms.

Edited by Abdelhamid Mellouk and Abdennacer Chebira

Machine Learning can be defined in various ways related to a scientific domain concerned with the design and development of theoretical and implementation tools that allow building systems with some Human Like intelligent behaviour.

Machine Learning addresses more specifically the ability to improve automatically through experience.

UNDERSTANDING MACHINE LEARNING

by Shai Ben-David and Shai Shalev-Shwartz

Machine learning is one of the fastest growing areas of computer science, with far-reaching applications. The aim of this textbook is to introduce machine learning, and the algorithmic paradigms it offers, in a principled way.

The book provides an extensive theoretical account of the fundamental ideas underlying machine learning and the mathematical derivations that transform these principles into practical algorithms. Following a presentation of the basics of the field, the book covers a wide array of central topics that have not been addressed by previous textbooks. These include a discussion of the computational complexity of learning and the concepts of convexity and stability; important algorithmic paradigms including stochastic gradient descent, neural networks, and structured output learning; and emerging theoretical concepts such as the PAC-Bayes approach and compression-based bounds.

Designed for an advanced undergraduate or beginning graduate course, the text makes the fundamentals and algorithms of machine learning accessible to students and non-expert readers in statistics, computer science, mathematics, and engineering.

NEURAL NETWORKS

by D. Kriesel

The purpose of this book is to help you master the core concepts of neural networks, including modern techniques for deep learning.

After working through the book you will have written code that uses neural networks and deep learning to solve complex pattern recognition problems.

And you will have a foundation to use neural networks and deep learning to attack problems of your own devising.


Summary: Some observations about new major trends and directions in data science drawn from the Strata+Hadoop conference in San Jose last week.

I’m fresh off my annual field trip to the Strata+Hadoop conference in San Jose last week. This is always exciting, enervating, and exhausting but it remains the single best place to pick up on what’s changing in our profession.

This conference is on a world tour with four more stops before repeating next year. The New York show is supposed to be a little bigger (hard to imagine) but the San Jose show is closest to our intellectual birthplace. After all this is the place where to call yourself a nerd would be regarded as a humble brag.

I’ll try to briefly share the major themes and changes I found this year and will write later in more depth about some of these.

End of the Era of Hadoop

From the time it went open source in 2007 Hadoop and its related technologies have been profound drivers of the growth of data science. Doug Cutting remains one of the three Strata conference chairs. However, we all know that Hadoop/MapReduce has made its mark but that it’s no longer cutting edge. In fact we know that Apache Spark has eclipsed Hadoop and it would be fair to say that Spark was last year’s big news.

To put a stake in it, O’Reilly announced at this year’s Strata+Hadoop that the conference would henceforth be known as the Strata Data Conference. So farewell age of Hadoop.

Artificial Intelligence

I am as jaded as the next guy and maybe a little more so at the over hyped furor around AI. As I walked the conference floor I felt compelled to challenge any vendor with the temerity to put AI in their descriptors.

Actually there was very little of this at the show. AI tends to be most over hyped when we’re talking about apps but Strata+Hadoop is more about tools than apps. There were two or three vendors that I thought had applied the AI frosting a little thick but there were a few others where the label was appropriate and the capabilities pretty interesting. More about these good guys later.

In the learning program there were again two or three sessions aimed at AI use cases in business and these were uniformly well reasoned. Specifically that means acknowledging that this is in its infancy and while you should keep an eye on it, investing now would be speculative at best.

The Cambrian Explosion in Deep Learning

One of our general session speakers used this phrase to describe the hockey-stick like growth we’ve been experiencing in Deep Learning and AI in general. The original use of the phrase is credited to Gill Pratt, the DARPA Program Manager who oversaw the DARPA Robotics Challenge.

If you remember a little about your earth history, we trundled along with one-celled creatures for billions of years until about a half-a-billion years ago when, at the beginning of the Cambrian period, life diversified in a way that can truly be characterized as an explosion. Academic theory is that very small changes like the evolution of sight organs so changed the playing field that the exploitation of this new capability drove the development of additional capabilities that – you know – resulted in us.

So while data scientists are a little cautious to talk about the wonders of artificial intelligence, they are very enthusiastic in talking about the new capabilities presented by Deep Learning. This may seem a little paradoxical but I invite you to think about it this way.

Robust AI is the accumulated capabilities of speech, text, NLP, image processing, robotics, knowledge recovery, and several other human-like capabilities that at this point are very early in development and not at all well or easily integrated.

Deep Learning however is a group of tools that we are applying to develop these capabilities, including Convolutional Neural Nets, Recurrent Neural Nets, Generative Adversarial Neural Nets, and Reinforcement Learning to name the most popular. All of these are subsets of Deep Learning and all are accessed through the newly emerging Deep Learning platforms like TensorFlow, MXNet, Theano, Torch, and several others.

Like all platform battles, the winner who gains the most users will be the next iOS, Android, or Windows. Right now it appears Google’s TensorFlow is in the lead and there were at least four or five program sessions, some of them full-day, that were oversubscribed providing both general guidance as well as hands-on training in TensorFlow. So while the buzz around AI was appropriately subdued, the enthusiasm for learning about TensorFlow was in full flower. The emergence of Deep Learning platforms may be the slight evolutionary change that triggers the explosion of AI.

Platform Convergence

In the beginning you could pick a portion of the data science workflow and build a successful business there. Many of today’s largest companies got their start this way. Not anymore. Now everybody wants to be an end-to-end platform from data source to the deployment of models and other forms of exploitation. He with the most users will win and once adopted the pain of switching will be high. The same dynamic that continues to make enterprise ERP systems so sticky – it’s too painful to switch.

We’ve seen in the last years analytic platforms like SAS and SPSS add full data access and blending capability. We’ve seen blending platforms like Alteryx extend into analytics and visualization. So here are two new and rather unexpected additions to the full spectrum platform game:

Cloudera announces its own Data Science Workbench with capabilities in R, Python, and Scala.

Intel (yes, Intel), which just paid $15 Billion for Mobileye to seize its place in the self-driving car space, is rolling out two data science platforms, Saffron and Nervana, one aimed at IoT and the other at deep learning.

DataOps and Data Engineers

As recently as a year or so ago the term ‘data scientist’ applied to someone doing predictive analytics as well as the person you would turn to to implement Spark or a data lake. Thankfully over not too long a period we have come to differentiate Data Scientists from Data Engineers and acknowledge their special skill set that blends traditional CS skills with the new disciplines needed to store, extract, and utilize data for data scientists.

Now that this differentiation is a little clearer, we see a parallel rise in a new category of tools and platforms best described as DataOps. Philosophically similar to DevOps, DataOps tools and platforms are aimed at regularizing and simplifying the tasks of Data Engineers, particularly repetitive tasks that may need to be performed dozens or even hundreds of times for different data sources and different data destinations. Two new companies, both startups, Nexla and DataKitchen, take a fairly narrow but deep view. Others like Qubole are laying claim to this area by better defining capabilities within their existing platforms.

Emerging Productivity Enhancements for Data Scientists

We may think the business world is populated by companies with just a few (if any) data scientists working together and for the most part we’d be right. However, this is not the market most vendors at Strata are interested in. They are pursuing wallet share among the Global 8000. That’s 8000 companies with more than $1 Billion in revenue and assuredly 100% commitment to predictive analytics.

I haven’t seen any specific data but an informal poll of vendors says these companies employ from 20 to several hundred data scientists each. When you have that many data scientists in one place you have to start thinking about efficiency and productivity. And there’s a major theme for this year – productivity enhancements for data scientists.

The list of vendors with this focus is too long for this article and DataOps just above is part of this. Here are just a few mentions of notable companies and their approach.

DataRobot: We reviewed DataRobot a year ago when it looked like predictive analytics was about to be fully automated and data scientists unemployed by 2025. That was a little premature. However, DataRobot has found a foothold by dramatically speeding up model development. This is one-click-to-model. Their platform cleans data, does feature engineering and identification, runs thousands of potential models/hyper parameter combinations in parallel, and deploys champion models in a fraction of the time it would take a team of data scientists.

SqlStream: Deploy a blazing fast stream processing system in a fraction of the time and with a fraction of the compute resources distros like Spark require. Make it so easy to manage that very little is needed of data engineers, and make it easy to change the logic and models within the stream without a team of data scientists.

Bonsai: TensorFlow is complex and tough to learn. Bonsai is introducing a higher level language that looks a lot like Python but manipulates deep learning algorithms in all the major deep learning platforms. Their initial target is reinforcement learning for robotics and the payoff is to solve the shortage of deep learning-qualified data scientists that is a bottleneck for development.

Qubole: Makes it easy to almost instantly establish a big data repository and begin analytics. You can’t completely replace data engineers but you can dramatically increase the number of data scientists each engineer can support with this SaaS implementation.

Emergence of the Data Science Appliance

Similar to productivity enhancements but aimed at business users who want solutions without necessarily needing to know the underlying data science are a group of offerings that intentionally hide the data science and focus on the business problem.

Anodot: Delivers a sophisticated anomaly detection system that looks at all your streaming data and decides both what’s anomalous and what’s important. This is catching on among ecommerce vendors and digital enterprises, some of whom have reportedly thrown out their internally developed anomaly detectors in favor of Anodot’s offering.

GeoStrategies: This company uses GIS data for site location and market identification and penetration studies. Lots of sophisticated platforms can do that too but GeoStrategies goes out of its way to hide the data science in favor of a UI that’s very intuitive for their business users.

Women in Data Science

Finally, my unscientific tally was that about 20% of attendees and a slightly higher percentage of presenters were women. This may not be representative of our profession as a whole as folks who attend these conferences may have different profiles than the whole industry. Still, while we might wish this was more like 50/50 I thought participation by our female members was a reasonably strong showing.

About the author: Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist and commercial predictive modeler since 2001. He can be reached at:

Today, I’m going to explain in plain English the top 10 most influential data mining algorithms as voted on by 3 separate panels in this survey paper.

Once you know what they are, how they work, what they do and where you can find them, my hope is you’ll have this blog post as a springboard to learn even more about data mining.

What are we waiting for? Let’s get started!

Here are the algorithms:

1. C4.5

2. k-means

3. Support vector machines

4. Apriori

5. EM

6. PageRank

7. AdaBoost

8. kNN

9. Naive Bayes

10. CART

We also provide interesting resources at the end.

1. C4.5

What does it do? C4.5 constructs a classifier in the form of a decision tree. In order to do this, C4.5 is given a set of data representing things that are already classified.

Wait, what’s a classifier? A classifier is a tool in data mining that takes a bunch of data representing things we want to classify and attempts to predict which class the new data belongs to.

What’s an example of this? Sure, suppose a dataset contains a bunch of patients. We know various things about each patient like age, pulse, blood pressure, VO2 max, family history, etc. These are called attributes.

Now:

Given these attributes, we want to predict whether the patient will get cancer. The patient can fall into 1 of 2 classes: will get cancer or won’t get cancer. C4.5 is told the class for each patient.

And here’s the deal:

Using a set of patient attributes and the patient’s corresponding class, C4.5 constructs a decision tree that can predict the class for new patients based on their attributes.

Cool, so what’s a decision tree? Decision tree learning creates something similar to a flowchart to classify new data. Using the same patient example, one particular path in the flowchart could be:

Patient has a history of cancer

Patient is expressing a gene highly correlated with cancer patients

Patient has tumors

Patient’s tumor size is greater than 5cm

The bottom line is:

At each point in the flowchart is a question about the value of some attribute, and depending on those values, the patient gets classified. You can find lots of examples of decision trees.
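One way to picture a single path through such a flowchart is as a chain of attribute checks. The sketch below is only an illustration: the dictionary keys are hypothetical names for the four questions in the example path, and a real tree would have many such paths, each ending in a class.

```python
def follows_example_path(patient):
    """Return True if the patient answers 'yes' at every question on the example path."""
    return bool(
        patient["history_of_cancer"]       # question 1
        and patient["expresses_gene"]      # question 2
        and patient["has_tumors"]          # question 3
        and patient["tumor_size_cm"] > 5   # question 4: a threshold on a continuous value
    )

# Made-up patient record, for illustration only
patient = {
    "history_of_cancer": True,
    "expresses_gene": True,
    "has_tumors": True,
    "tumor_size_cm": 6.2,
}
print(follows_example_path(patient))
```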

Is this supervised or unsupervised? This is supervised learning, since the training dataset is labeled with classes. Using the patient example, C4.5 doesn’t learn on its own that a patient will get cancer or won’t get cancer. We told it first, it generated a decision tree, and now it uses the decision tree to classify.

You might be wondering how C4.5 is different from other decision tree systems?

First, C4.5 uses information gain when generating the decision tree.

Second, C4.5 prunes the tree after building it, which helps mitigate over-fitting.

Third, C4.5 can work with both continuous and discrete data. My understanding is it does this by specifying ranges or thresholds for continuous data thus turning continuous data into discrete data.

Finally, incomplete data is dealt with in its own ways.
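To give a feel for what information gain measures, here is a small sketch of the textbook calculation: the entropy of the class labels minus the weighted entropy left after splitting on one attribute. The patient rows and attribute names are made up for illustration, and this is not C4.5’s full attribute-selection procedure.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after splitting on attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy data, invented for this example
rows = [
    {"family_history": "yes", "high_pulse": "yes"},
    {"family_history": "yes", "high_pulse": "no"},
    {"family_history": "no", "high_pulse": "yes"},
    {"family_history": "no", "high_pulse": "no"},
]
labels = ["cancer", "cancer", "no cancer", "no cancer"]

print(information_gain(rows, labels, "family_history"))  # splits the classes perfectly
print(information_gain(rows, labels, "high_pulse"))      # tells us nothing about the class
```

In this toy data, splitting on family history separates the two classes completely, so it would be chosen first. The other attribute leaves the classes as mixed as before.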

Why use C4.5? Arguably, the best selling point of decision trees is their ease of interpretation and explanation. They are also quite fast, quite popular and the output is human readable.

Where is it used? A popular open-source Java implementation can be found over at OpenTox. Orange, an open-source data visualization and analysis tool for data mining, implements C4.5 in their decision tree classifier.

Classifiers are great, but make sure to checkout the next algorithm about clustering…

2. k-means

What does it do? k-means creates k groups from a set of objects so that the members of a group are more similar to each other than to members of other groups. It’s a popular cluster analysis technique for exploring a dataset.

Hang on, what’s cluster analysis? Cluster analysis is a family of algorithms designed to form groups such that the group members are more similar to each other than to non-group members. Clusters and groups are synonymous in the world of cluster analysis.

Is there an example of this? Definitely, suppose we have a dataset of patients. In cluster analysis, these would be called observations. We know various things about each patient like age, pulse, blood pressure, VO2 max, cholesterol, etc. This is a vector representing the patient.

Look:

You can basically think of a vector as a list of numbers we know about the patient. This list can also be interpreted as coordinates in multi-dimensional space. Pulse can be one dimension, blood pressure another dimension and so forth.

You might be wondering:

Given this set of vectors, how do we cluster together patients that have similar age, pulse, blood pressure, etc?

Want to know the best part?

You tell k-means how many clusters you want. K-means takes care of the rest.

How does k-means take care of the rest? k-means has lots of variations to optimize for certain types of data.

At a high level, they all do something like this:

1. k-means picks points in multi-dimensional space to represent each of the k clusters. These are called centroids.

2. Every patient will be closest to 1 of these k centroids. They hopefully won’t all be closest to the same one, so they’ll form a cluster around their nearest centroid.

3. What we have are k clusters, and each patient is now a member of a cluster.

4. k-means then finds the center for each of the k clusters based on its cluster members (yep, using the patient vectors!).

5. This center becomes the new centroid for the cluster.

6. Since the centroid is in a different place now, patients might now be closer to other centroids. In other words, they may change cluster membership.

7. Steps 2-6 are repeated until the centroids no longer change, and the cluster memberships stabilize. This is called convergence.
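The loop described above can be sketched in a few lines of plain Python. This is a minimal version of the basic procedure, assuming squared Euclidean distance, starting centroids drawn at random from the data, and made-up two-number patient vectors. Real implementations add smarter initialization and other optimizations.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise average of a list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, seed=0, max_iterations=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k starting centroids from the data
    for _ in range(max_iterations):
        # Assign every point to the cluster of its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the center of its members (keep it if the cluster is empty)
        new_centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new_centroids == centroids:  # nothing moved: convergence
            break
        centroids = new_centroids
    return centroids, clusters

# Invented patient vectors with two measurements each
patients = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
centroids, clusters = k_means(patients, k=2)
print(centroids)
print(clusters)
```

The `seed` argument only controls which points are picked as starting centroids. On messier data, different seeds can lead to different final clusters.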

Is this supervised or unsupervised? It depends, but most would classify k-means as unsupervised. Other than specifying the number of clusters, k-means “learns” the clusters on its own without any information about which cluster an observation belongs to. k-means can be semi-supervised.

Why use k-means? I don’t think many will have an issue with this:

The key selling point of k-means is its simplicity. Its simplicity means it’s generally faster and more efficient than other algorithms, especially over large datasets.

It gets better:

k-means can be used to pre-cluster a massive dataset followed by a more expensive cluster analysis on the sub-clusters. k-means can also be used to rapidly “play” with k and explore whether there are overlooked patterns or relationships in the dataset.

It’s not all smooth sailing:

Two key weaknesses of k-means are its sensitivity to outliers, and its sensitivity to the initial choice of centroids. One final thing to keep in mind is k-means is designed to operate on continuous data — you’ll need to do some tricks to get it to work on discrete data.

Where is it used? A ton of implementations for k-means clustering are available online:

If decision trees and clustering didn’t impress you, you’re going to love the next algorithm.

3. Support vector machines

What does it do? Support vector machine (SVM) learns a hyperplane to classify data into 2 classes. At a high level, SVM performs a task similar to C4.5, except SVM doesn’t use decision trees at all.

Whoa, a hyper-what? A hyperplane is a function like the equation for a line, y = mx + b. In fact, for a simple classification task with just 2 features, the hyperplane can be a line.

As it turns out…

SVM can perform a trick to project your data into higher dimensions. Once projected into higher dimensions…

…SVM figures out the best hyperplane which separates your data into the 2 classes.

Do you have an example? Absolutely, the simplest example I found starts with a bunch of red and blue balls on a table. If the balls aren’t too mixed together, you could take a stick and without moving the balls, separate them with the stick.

You see:

When a new ball is added on the table, by knowing which side of the stick the ball is on, you can predict its color.

What do the balls, table and stick represent? The balls represent data points, and the red and blue color represent 2 classes. The stick represents the simplest hyperplane which is a line.
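In code, "which side of the stick" comes down to the sign of a simple expression. The sketch below assumes the stick is already known and is described by hypothetical weights `w` and offset `b`. Finding those values is the part SVM does for you.

```python
def predict_color(ball, w, b):
    """Classify a ball by which side of the line w[0]*x + w[1]*y + b = 0 it lies on."""
    score = sum(wi * xi for wi, xi in zip(w, ball)) + b
    return "red" if score > 0 else "blue"

# Hypothetical stick: the line y = x, written as 1*x - 1*y + 0 = 0
w, b = (1.0, -1.0), 0.0

print(predict_color((30, 10), w, b))  # below the line
print(predict_color((10, 30), w, b))  # above the line
```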

And the coolest part?

SVM figures out the function for the hyperplane.

What if things get more complicated? Right, they frequently do. If the balls are mixed together, a straight stick won’t work.

Here’s the work-around:

Quickly lift up the table throwing the balls in the air. While the balls are in the air and thrown up in just the right way, you use a large sheet of paper to divide the balls in the air.

You might be wondering if this is cheating:

Nope, lifting up the table is the equivalent of mapping your data into higher dimensions. In this case, we go from the 2 dimensional table surface to the 3 dimensional balls in the air.

How does SVM do this? By using a kernel we have a nice way to operate in higher dimensions. The large sheet of paper is still called a hyperplane, but it is now a function for a plane rather than a line. Note from Yuval that once we’re in 3 dimensions, the hyperplane must be a plane rather than a line.
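Here is a toy version of "lifting the table", assuming red balls sit near the middle and blue balls surround them, so no straight stick can split them. The mapping and the height of the sheet of paper are chosen by hand for illustration. A kernel gets the same effect without ever computing the lifted coordinates explicitly.

```python
def lift(ball):
    """Map a ball at (x, y) on the table to (x, y, z), where z grows with distance from the center."""
    x, y = ball
    return (x, y, x * x + y * y)

# Invented positions: red balls in the middle, blue balls in a ring around them
red = [(0, 1), (1, 0), (-1, 0), (0, -1)]
blue = [(3, 0), (0, 3), (-3, 0), (0, -3)]

sheet_height = 4  # hypothetical "sheet of paper": the flat plane z = 4

print([lift(p)[2] for p in red])   # all below the sheet
print([lift(p)[2] for p in blue])  # all above the sheet
```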

I found this visualization super helpful:

Reddit also has 2 great threads on this in the ELI5 and ML subreddits.

How do balls on a table or in the air map to real-life data? A ball on a table has a location that we can specify using coordinates. For example, a ball could be 20cm from the left edge and 50cm from the bottom edge. Another way to describe the ball is as (x, y) coordinates or (20, 50). x and y are 2 dimensions of the ball.

Here’s the deal:

If we had a patient dataset, each patient could be described by various measurements like pulse, cholesterol level, blood pressure, etc. Each of these measurements is a dimension.

The bottom line is:

SVM does its thing, maps them into a higher dimension and then finds the hyperplane to separate the classes.

Margins are often associated with SVM? What are they? The margin is the distance between the hyperplane and the 2 closest data points from each respective class. In the ball and table example, the distance between the stick and the closest red and blue ball is the margin.

The key is:

SVM attempts to maximize the margin, so that the hyperplane is just as far away from the red ball as from the blue ball. In this way, it decreases the chance of misclassification.
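To see why the placement of the stick matters, this sketch measures the margin for two hand-picked sticks that both separate the same balls. The ball positions and stick coefficients are invented for illustration, and nothing here searches for the best stick the way SVM does.

```python
from math import hypot

def distance_to_line(point, w, b):
    """Distance from a 2-D point to the line w[0]*x + w[1]*y + b = 0."""
    x, y = point
    return abs(w[0] * x + w[1] * y + b) / hypot(w[0], w[1])

def margin(red, blue, w, b):
    """Smaller of the two closest-ball distances (one per color)."""
    closest_red = min(distance_to_line(p, w, b) for p in red)
    closest_blue = min(distance_to_line(p, w, b) for p in blue)
    return min(closest_red, closest_blue)

red = [(1, 1), (0, 0)]
blue = [(4, 4), (6, 6)]

centered_stick = ((1.0, 1.0), -5.0)  # the line x + y = 5, midway between the closest balls
lopsided_stick = ((1.0, 1.0), -3.0)  # the line x + y = 3, hugging the red balls

print(margin(red, blue, *centered_stick))
print(margin(red, blue, *lopsided_stick))
```

Both sticks separate the colors, but the centered one leaves more room on each side, which is the one SVM would prefer.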

Where does SVM get its name from? Using the ball and table example, the hyperplane is equidistant from a red ball and a blue ball. These balls or data points are called support vectors, because they support the hyperplane.

Is this supervised or unsupervised? This is supervised learning, since a dataset is used to first teach the SVM about the classes. Only then is the SVM capable of classifying new data.

Why use SVM? SVM along with C4.5 are generally the 2 classifiers to try first. No classifier will be the best in all cases due to the No Free Lunch Theorem. In addition, kernel selection and interpretability are some weaknesses.

Where is it used? There are many implementations of SVM. A few of the popular ones are scikit-learn, MATLAB and of course libsvm.

The next algorithm is one of my favorites…

4. Apriori

What does it do? The Apriori algorithm learns association rules and is applied to a database containing a large number of transactions.

What are association rules? Association rule learning is a data mining technique for learning correlations and relations among variables in a database.

What’s an example of Apriori? Let’s say we have a database full of supermarket transactions. You can think of a database as a giant spreadsheet where each row is a customer transaction and every column represents a different grocery item.

Here’s the best part:

By applying the Apriori algorithm, we can learn the grocery items that are purchased together, a.k.a. association rules.

The power of this is:

You can find those items that tend to be purchased together more frequently than other items — the ultimate goal being to get shoppers to buy more. Together, these items are called itemsets.

For example:

You can probably quickly see that chips + dip and chips + soda seem to frequently occur together. These are called 2-itemsets. With a large enough dataset, it will be much harder to “see” the relationships especially when you’re dealing with 3-itemsets or more. That’s precisely what Apriori helps with!

You might be wondering how Apriori works? Before getting into the nitty-gritty of the algorithm, you’ll need to define 3 things:

The first is the size of your itemset. Do you want to see patterns for a 2-itemset, 3-itemset, etc.?

The second is your support, or the number of transactions containing the itemset divided by the total number of transactions. An itemset that meets the minimum support you set is called a frequent itemset.

The third is your confidence or the conditional probability of some item given you have certain other items in your itemset. A good example is given chips in your itemset, there is a 67% confidence of having soda also in the itemset.
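Support and confidence are both simple ratios, so they are easy to compute by hand. The sketch below uses four made-up transactions, chosen so that the confidence of soda given chips works out to about 67%.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(given, then, transactions):
    """Conditional probability of seeing `then` in a transaction that contains `given`."""
    return support(given | then, transactions) / support(given, transactions)

# Invented supermarket transactions
transactions = [
    {"chips", "dip", "soda"},
    {"chips", "soda"},
    {"chips", "dip"},
    {"bread", "milk"},
]

print(support({"chips", "soda"}, transactions))                 # 2 of 4 transactions
print(round(confidence({"chips"}, {"soda"}, transactions), 2))  # 2 of the 3 chips transactions
```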

The basic Apriori algorithm is a 3 step approach:

1. Join. Scan the whole database for how frequent 1-itemsets are.

2. Prune. Those itemsets that satisfy the support and confidence move onto the next round for 2-itemsets.

3. Repeat. This is repeated for each itemset level until we reach our previously defined size.
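The join, prune and repeat loop can be sketched as follows. This is a simplified version under a few assumptions: it prunes on minimum support only (confidence would be applied afterwards, when turning frequent itemsets into rules), it rescans the whole transaction list for every count, and the transactions are made up.

```python
from itertools import combinations

def apriori(transactions, min_support, max_size):
    """Return a dict mapping each frequent itemset (up to max_size items) to its support."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = {}
    size = 1
    while current and size <= max_size:
        for itemset in current:
            frequent[itemset] = support(itemset)
        # Join: combine surviving itemsets into candidates that are one item larger
        candidates = {a | b for a in current for b in current if len(a | b) == size + 1}
        # Prune: every smaller subset must be frequent, and the candidate must meet the support
        current = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, size))
            and support(c) >= min_support
        }
        size += 1  # Repeat for the next itemset level
    return frequent

# Invented supermarket transactions
transactions = [
    {"chips", "dip", "soda"},
    {"chips", "soda"},
    {"chips", "dip"},
    {"bread", "milk"},
]
for itemset, s in apriori(transactions, min_support=0.5, max_size=2).items():
    print(sorted(itemset), s)
```

The pruning step is what keeps the search manageable: an itemset can only be frequent if all of its smaller subsets are, so most candidates are discarded before they are ever counted.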

Is this supervised or unsupervised? Apriori is generally considered an unsupervised learning approach, since it’s often used to discover or mine for interesting patterns and relationships.

But wait, there’s more…

Apriori can also be modified to do classification based on labelled data.

Why use Apriori? Apriori is well understood, easy to implement and has many derivatives.

On the other hand…

The algorithm can be quite memory, space and time intensive when generating itemsets.

Where is it used? Plenty of implementations of Apriori are available. Some popular ones are the ARtool, Weka, and Orange.

The next algorithm was the most difficult for me to understand. Take a look…

5. EM

What does it do? In data mining, expectation-maximization (EM) is generally used as a clustering algorithm (like k-means) for knowledge discovery.

In statistics, the EM algorithm iterates and optimizes the likelihood of seeing observed data while estimating the parameters of a statistical model with unobserved variables.

OK, hang on while I explain…

I’m not a statistician, so hopefully my simplification is both correct and helps with understanding.

Here are a few concepts that will make this way easier…

What’s a statistical model? I see a model as something that describes how observed data is generated. For example, the grades for an exam could fit a bell curve, so the assumption that the grades are generated via a bell curve (a.k.a. normal distribution) is the model.

Wait, what’s a distribution? A distribution represents the probabilities for all measurable outcomes. For example, the grades for an exam could fit a normal distribution. This normal distribution represents all the probabilities of a grade.

In other words, given a grade, you can use the distribution to determine how many exam takers are expected to get that grade.

Cool, what are the parameters of a model? A parameter describes a distribution which is part of a model. For example, a bell curve can be described by its mean and variance.

Using the exam scenario, the distribution of grades on an exam (the measurable outcomes) followed a bell curve (this is the distribution). The mean was 85 and the variance was 100.

So, all you need to describe a normal distribution are 2 parameters:

The mean

The variance

And likelihood? Going back to our previous bell curve example… suppose we have a bunch of grades and are told the grades follow a bell curve. However, we’re not given all the grades… only a sample.

Here’s the deal:

We don’t know the mean or variance of all the grades, but we can estimate them using the sample. The likelihood is the probability that the bell curve with the estimated mean and variance results in that bunch of grades.

In other words, given a set of measurable outcomes, let’s estimate the parameters. Using these estimated parameters, the hypothetical probability of the outcomes is called likelihood.

Remember, it’s the hypothetical probability of the existing grades, not the probability of a future grade.
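Here is a small sketch of that idea. The grades are a hypothetical sample, and the function names are my own. It estimates the mean and variance from the sample, then compares the log-likelihood of the sample under the estimated bell curve against a bell curve with a deliberately bad mean.

```python
import math

def normal_pdf(x, mean, variance):
    """Density of a normal distribution with the given mean and variance at x."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def log_likelihood(sample, mean, variance):
    """Log of the hypothetical probability of the sample under the bell curve."""
    return sum(math.log(normal_pdf(x, mean, variance)) for x in sample)

# Hypothetical sample of exam grades.
grades = [72, 85, 90, 78, 95, 88, 81, 91]

# Estimate the parameters from the sample.
est_mean = sum(grades) / len(grades)
est_variance = sum((g - est_mean) ** 2 for g in grades) / len(grades)

ll_estimated = log_likelihood(grades, est_mean, est_variance)
ll_bad_guess = log_likelihood(grades, 60, est_variance)
print(est_mean, est_variance, ll_estimated, ll_bad_guess)
```

The estimated parameters give a higher log-likelihood than the bad guess, which is exactly what "optimizing the likelihood" is after.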

You’re probably wondering, what’s probability then?

Using the bell curve example, suppose we know the mean and variance. Then we’re told the grades follow a bell curve. The chance that we observe certain grades and how often they are observed is the probability.

In more general terms, given the parameters, let’s estimate what outcomes should be observed. That’s what probability does for us.

Great! Now, what’s the difference between observed and unobserved data? Observed data is the data that you saw or recorded. Unobserved data is data that is missing. There are a number of reasons that the data could be missing (not recorded, ignored, etc.).

Here’s the kicker:

For data mining and clustering, what’s important to us is looking at the class of a data point as missing data. We don’t know the class, so interpreting missing data this way is crucial for applying EM to the task of clustering.

Once again: The EM algorithm iterates and optimizes the likelihood of seeing observed data while estimating the parameters of a statistical model with unobserved variables. Hopefully, this is way more understandable now.

The best part is…

By optimizing the likelihood, EM generates an awesome model that assigns class labels to data points — sounds like clustering to me!

How does EM help with clustering? EM begins by making a guess at the model parameters.

Then it follows an iterative 3-step process:

E-step: Based on the model parameters, it calculates the probabilities for assignments of each data point to a cluster.

M-step: Update the model parameters based on the cluster assignments from the E-step.

Repeat until the model parameters and cluster assignments stabilize (a.k.a. convergence).
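The E-step and M-step can be sketched for a very simple case: one-dimensional data and two bell-curve clusters. Everything here is an assumption made for illustration: the data, the starting guess (means at the smallest and largest values), and the fixed number of iterations used instead of a real convergence check.

```python
import math

def normal_pdf(x, mean, variance):
    """Density of a normal distribution with the given mean and variance at x."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def em_two_clusters(data, iterations=50):
    """Fit two 1-D Gaussian clusters with EM; returns (means, variances, labels)."""
    n = len(data)
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n

    # Initial guess at the model parameters (an arbitrary but simple choice).
    means = [min(data), max(data)]
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]

    for _ in range(iterations):
        # E-step: probability that each data point belongs to each cluster.
        resp = []
        for x in data:
            joint = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(joint)
            resp.append([j / total for j in joint])

        # M-step: update the parameters using those soft assignments.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)  # guard against a collapsing cluster

    # Assign each point to its more probable cluster.
    labels = [0 if r[0] >= r[1] else 1 for r in resp]
    return means, variances, labels

# Hypothetical data with two obvious groups.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, variances, labels = em_two_clusters(data)
print(means, labels)
```

On this toy data the two means settle near 1 and 5, and the labels split the points into the two groups you would pick by eye.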

Is this supervised or unsupervised? Since we do not provide labeled class information, this is unsupervised learning.

Why use EM? A key selling point of EM is it’s simple and straight-forward to implement. In addition, not only can it optimize for model parameters, it can also iteratively make guesses about missing data.

This makes it great for clustering and generating a model with parameters. Knowing the clusters and model parameters, it’s possible to reason about what the clusters have in common and which cluster new data belongs to.

EM is not without weaknesses though…

First, EM is fast in the early iterations, but slow in the later iterations.

Second, EM doesn’t always find the optimal parameters and gets stuck in local optima rather than global optima.

Where is it used? The EM algorithm is available in Weka. R has an implementation in the mclust package. scikit-learn also has an implementation in its mixture module (the GaussianMixture class).

What data mining does Google do? Take a look…

6. PageRank

What does it do? PageRank is a link analysis algorithm designed to determine the relative importance of some object linked within a network of objects.

Yikes.. what’s link analysis? It’s a type of network analysis looking to explore the associations (a.k.a. links) among objects.

Here’s an example: The most prevalent example of PageRank is Google’s search engine. Although their search engine doesn’t solely rely on PageRank, it’s one of the measures Google uses to determine a web page’s importance.

Let me explain:

Web pages on the World Wide Web link to each other. If rayli.net links to a web page on CNN, a vote is added for the CNN page indicating rayli.net finds the CNN web page relevant.

And it doesn’t stop there…

rayli.net’s votes are in turn weighted by rayli.net’s importance and relevance. In other words, any web page that votes for rayli.net increases rayli.net’s relevance.

The bottom line?

This concept of voting and relevance is PageRank. rayli.net’s vote for CNN increases CNN’s PageRank, and the strength of rayli.net’s PageRank influences how much its vote affects CNN’s PageRank.
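The voting idea can be sketched as a simple repeated update. This is a simplified illustration, not Google’s implementation. The link graph is hypothetical (cnn.com and blog.example are stand-in names), and the damping factor of 0.85 is an assumed, commonly quoted value that the text above does not mention.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to (its votes)."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page splits its own rank evenly among the pages it votes for.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph.
links = {
    "rayli.net": ["cnn.com"],
    "blog.example": ["rayli.net", "cnn.com"],
    "cnn.com": [],
}
ranks = pagerank(links)
print(ranks)
```

Here cnn.com ends up with the highest rank because it collects votes from both other pages, and rayli.net’s vote counts for more than blog.example’s because rayli.net is itself voted for.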

What does a PageRank of 0, 1, 2, 3, etc. mean? Although the precise meaning of a PageRank number isn’t disclosed by Google, we can get a sense of its relative meaning.

It’s a bit like a popularity contest. We all have a sense of which websites are relevant and popular in our minds. PageRank is just an uber elegant way to define it.

What other applications are there of PageRank? PageRank was specifically designed for the World Wide Web.

Think about it:

At its core, PageRank is really just a super effective way to do link analysis. The objects being linked don’t have to be web pages.

Here are some innovative applications of PageRank:

Dr Stefano Allesina, from the University of Chicago, applied PageRank to ecology to determine which species are critical for sustaining ecosystems.

Twitter developed WTF (Who-to-Follow) which is a personalized PageRank recommendation engine about who to follow.

Is this supervised or unsupervised? PageRank is generally considered an unsupervised learning approach, since it’s often used to discover the importance or relevance of a web page.

Why use PageRank? Arguably, the main selling point of PageRank is its robustness due to the difficulty of getting a relevant incoming link.

Simply stated:

If you have a graph or network and want to understand relative importance, priority, ranking or relevance, give PageRank a try.

Where is it used? The PageRank trademark is owned by Google. However, the PageRank algorithm is actually patented by Stanford University.

You might be wondering if you can use PageRank:

I’m not a lawyer, so best to check with an actual lawyer, but you can probably use the algorithm as long as it doesn’t commercially compete against Google/Stanford.

7. AdaBoost

What does it do? AdaBoost is a boosting algorithm which constructs a classifier.

As you probably remember, a classifier takes a bunch of data and attempts to predict or classify which class a new data element belongs to.

But what’s boosting? Boosting is an ensemble learning algorithm which takes multiple learning algorithms (e.g. decision trees) and combines them. The goal is to take an ensemble or group of weak learners and combine them to create a single strong learner.

What’s the difference between a strong and weak learner? A weak learner classifies with accuracy barely above chance. A popular example of a weak learner is the decision stump which is a one-level decision tree.

Alternatively…

A strong learner has much higher accuracy, and an often used example of a strong learner is SVM.

What’s an example of AdaBoost? Let’s start with 3 weak learners. We’re going to train them in 10 rounds on a training dataset containing patient data. The dataset contains details about the patient’s medical records.

The question is…

How can we predict whether the patient will get cancer?

Here’s how AdaBoost answers the question…

In round 1: AdaBoost takes a sample of the training dataset and tests to see how accurate each learner is. The end result is we find the best learner.

In addition, samples that are misclassified are given a heavier weight, so that they have a higher chance of being picked in the next round.

One more thing, the best learner is also given a weight depending on its accuracy and incorporated into the ensemble of learners (right now there’s just 1 learner).

In round 2: AdaBoost again attempts to look for the best learner.

And here’s the kicker:

The sample of patient training data is now influenced by the heavier weights on the misclassified patients. In other words, previously misclassified patients have a higher chance of showing up in the sample.

Why?

It’s like getting to the second level of a video game and not having to start all over again when your character is killed. Instead, you start at level 2 and focus all your efforts on getting to level 3.

Likewise, the first learner likely classified some patients correctly. Instead of trying to classify them again, let’s focus all the efforts on getting the misclassified patients.

The best learner is again weighted and incorporated into the ensemble, misclassified patients are weighted so they have a higher chance of being picked and we rinse and repeat.

At the end of the 10 rounds: We’re left with an ensemble of weighted learners trained and then repeatedly retrained on misclassified data from the previous rounds.
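Here is a rough sketch of those rounds using decision stumps as the weak learners. It makes a few assumptions worth flagging: the data is a made-up one-feature toy set rather than patient records, labels are +1 and -1, each round reweights every sample instead of drawing a new random sample, and it runs 3 rounds rather than 10.

```python
import math

def train_adaboost(xs, ys, rounds):
    """xs: numeric feature values, ys: labels (+1 or -1). Returns the ensemble."""
    n = len(xs)
    weights = [1 / n] * n
    ensemble = []  # list of (learner_weight, threshold, polarity)
    thresholds = sorted(set(xs))

    for _ in range(rounds):
        # Find the best stump: the one with the lowest weighted error.
        best = None
        for thr in thresholds:
            for polarity in (1, -1):
                preds = [polarity if x < thr else -polarity for x in xs]
                err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
                if best is None or err < best[0]:
                    best = (err, thr, polarity, preds)
        err, thr, polarity, preds = best

        # Weight the best learner by its accuracy and add it to the ensemble.
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thr, polarity))

        # Give misclassified samples a heavier weight for the next round.
        weights = [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, ys, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps in the ensemble."""
    score = sum(alpha * (polarity if x < thr else -polarity)
                for alpha, thr, polarity in ensemble)
    return 1 if score >= 0 else -1

# Toy data that no single stump can classify perfectly.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = train_adaboost(xs, ys, rounds=3)
predictions = [predict(ensemble, x) for x in xs]
print(predictions)
```

Each individual stump gets at least 3 of the 10 points wrong, but the weighted vote of the three stumps classifies all of them correctly.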

Is this supervised or unsupervised? This is supervised learning, since each iteration trains the weak learners with the labelled dataset.

Why use AdaBoost? AdaBoost is simple. The algorithm is relatively straight-forward to program.

In addition, it’s fast! Weak learners are generally simpler than strong learners. Being simpler means they’ll likely execute faster.

Another thing…

It’s a super elegant way to auto-tune a classifier, since each successive AdaBoost round refines the weights for each of the best learners. All you need to specify is the number of rounds.

Finally, it’s flexible and versatile. AdaBoost can incorporate any learning algorithm, and it can work with a large variety of data.

Where is it used? AdaBoost has a ton of implementations and variants.

If you like Mr. Rogers, you’ll like the next algorithm…

8. kNN

What does it do? kNN, or k-Nearest Neighbors, is a classification algorithm. However, it differs from the classifiers previously described because it’s a lazy learner.

What’s a lazy learner? A lazy learner doesn’t do much during the training process other than store the training data. Only when new unlabeled data is input does this type of learner look to classify.

On the other hand, an eager learner builds a classification model during training. When new unlabeled data is input, this type of learner feeds the data into the classification model.

How do C4.5, SVM and AdaBoost fit into this? Unlike kNN, they are all eager learners.

Here’s why:

C4.5 builds a decision tree classification model during training.

SVM builds a hyperplane classification model during training.

AdaBoost builds an ensemble classification model during training.

So what does kNN do? kNN builds no such classification model. Instead, it just stores the labeled training data.

When new unlabeled data comes in, kNN operates in 2 basic steps:

First, it looks at the closest labeled training data points — in other words, the k-nearest neighbors.

Second, using the neighbors’ classes, kNN gets a better idea of how the new data should be classified.

You might be wondering…

How does kNN figure out what’s closer? For continuous data, kNN uses a distance metric like Euclidean distance. The choice of distance metric largely depends on the data. Some even suggest learning a distance metric based on the training data. There’s tons more details and papers on kNN distance metrics.

For discrete data, the idea is to transform discrete data into continuous data. 2 examples of this are:

Using Hamming distance as a metric for the “closeness” of 2 text strings.

Transforming discrete data into binary features.

How does kNN classify new data when neighbors disagree? kNN has an easy time when all neighbors are the same class. The intuition is if all the neighbors agree, then the new data point likely falls in the same class.

I’ll bet you can guess where things get hairy…

How does kNN decide the class when neighbors don’t have the same class?

2 common techniques for dealing with this are:

Take a simple majority vote from the neighbors. Whichever class has the greatest number of votes becomes the class for the new data point.

Take a similar vote except give a heavier weight to those neighbors that are closer. A simple way to do this is to use reciprocal distance e.g. if the neighbor is 5 units away, then weight its vote 1/5. As the neighbor gets further away, the reciprocal distance gets smaller and smaller… exactly what we want!
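Both voting schemes can be sketched in a few lines. The training points and the function name `knn_classify` are hypothetical, and Euclidean distance (via `math.dist`, available in Python 3.8+) is just one possible choice of metric.

```python
import math
from collections import defaultdict

def knn_classify(training, query, k, weighted=False):
    """training: list of (features, label) pairs. Returns the predicted label."""
    # Lazy learner: all the work happens here, at classification time.
    neighbors = sorted(training, key=lambda item: math.dist(item[0], query))[:k]

    votes = defaultdict(float)
    for features, label in neighbors:
        if weighted:
            # Reciprocal distance: closer neighbors get a heavier vote.
            d = math.dist(features, query)
            votes[label] += 1 / d if d > 0 else float("inf")
        else:
            # Simple majority: every neighbor gets one vote.
            votes[label] += 1
    return max(votes, key=votes.get)

# Hypothetical labeled training data.
training = [
    ((1, 1), "A"),
    ((1, 2), "A"),
    ((5, 5), "B"),
    ((6, 5), "B"),
    ((6, 6), "B"),
]
query = (2, 2)

majority_label = knn_classify(training, query, k=5)
weighted_label = knn_classify(training, query, k=5, weighted=True)
print(majority_label, weighted_label)
```

With k = 5, the simple majority vote picks B (3 votes to 2), while the distance-weighted vote picks A because the two A points are much closer to the query.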

Is this supervised or unsupervised? This is supervised learning, since kNN is provided a labeled training dataset.

Why use kNN? Ease of understanding and implementing are 2 of the key reasons to use kNN. Depending on the distance metric, kNN can be quite accurate.

But that’s just part of the story…

Here are 5 things to watch out for:

kNN can get very computationally expensive when trying to determine the nearest neighbors on a large dataset.

Noisy data can throw off kNN classifications.

Features with a larger range of values can dominate the distance metric relative to features that have a smaller range, so feature scaling is important.

Since data processing is deferred, kNN generally has greater storage requirements than eager classifiers.

Selecting a good distance metric is crucial to kNN’s accuracy.

Where is it used? A number of kNN implementations exist.

Spam? Fuhgeddaboudit! Read ahead to learn about the next algorithm…

9. Naive Bayes

What does it do? Naive Bayes is not a single algorithm, but a family of classification algorithms that share one common assumption:

Every feature of the data being classified is independent of all other features given the class.

What does independent mean? 2 features are independent when the value of one feature has no effect on the value of another feature.

For example:

Let’s say you have a patient dataset containing features like pulse, cholesterol level, weight, height and zip code. All features would be independent if the value of each feature has no effect on the values of the others. For this dataset, it’s reasonable to assume that the patient’s height and zip code are independent, since a patient’s height has little to do with their zip code.

But let’s not stop there, are the other features independent?

Sadly, the answer is no. Here are 3 feature relationships which are not independent:

If height increases, weight likely increases.

If cholesterol level increases, weight likely increases.

If cholesterol level increases, pulse likely increases as well.

In my experience, the features of a dataset are generally not all independent.

And that ties in with the next question…

Why is it called naive? The assumption that all features of a dataset are independent is precisely why it’s called naive — it’s generally not the case that all features are independent.

What’s Bayes? Thomas Bayes was an English statistician after whom Bayes’ Theorem is named.

In a nutshell, the theorem allows us to predict the class given a set of features using probability.

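The simplified equation for classification looks something like this:

$$P(\text{Class A} \mid \text{Feature 1}, \text{Feature 2}) = \frac{P(\text{Feature 1} \mid \text{Class A}) \cdot P(\text{Feature 2} \mid \text{Class A}) \cdot P(\text{Class A})}{P(\text{Feature 1}) \cdot P(\text{Feature 2})}$$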

Let’s dig deeper into this…

What does the equation mean? The equation finds the probability of Class A given Features 1 and 2. In other words, if you see Features 1 and 2, this is the probability the data is Class A.

The equation reads: The probability of Class A given Features 1 and 2 is a fraction.

The fraction’s numerator is the probability of Feature 1 given Class A multiplied by the probability of Feature 2 given Class A multiplied by the probability of Class A.

The fraction’s denominator is the probability of Feature 1 multiplied by the probability of Feature 2.

What is an example of Naive Bayes? Below is a great example taken from a Stack Overflow thread.

Here’s the deal:

We have a training dataset of 1,000 fruits.

The fruit can be a Banana, Orange or Other (these are the classes).

The fruit can be Long, Sweet or Yellow (these are the features).

What do you see in this training dataset?

Out of 500 bananas, 400 are long, 350 are sweet and 450 are yellow.

Out of 300 oranges, none are long, 150 are sweet and 300 are yellow.

Out of the remaining 200 fruit, 100 are long, 150 are sweet and 50 are yellow.

If we are given the length, sweetness and color of a fruit (without knowing its class), we can now calculate the probability of it being a banana, orange or other fruit.

Suppose we are told the unknown fruit is long, sweet and yellow.

Here’s how we calculate all the probabilities in 4 steps:

Step 1: To calculate the probability the fruit is a banana, let’s first recognize that this looks familiar. It’s the probability of the class Banana given the features Long, Sweet and Yellow or more succinctly:

$$P(\text{Banana} \mid \text{Long}, \text{Sweet}, \text{Yellow})$$

This is exactly like the equation discussed earlier.

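Step 2: Starting with the numerator, let’s plug everything in.

$$P(\text{Long} \mid \text{Banana}) = 400/500 = 0.8$$

$$P(\text{Sweet} \mid \text{Banana}) = 350/500 = 0.7$$

$$P(\text{Yellow} \mid \text{Banana}) = 450/500 = 0.9$$

$$P(\text{Banana}) = 500/1000 = 0.5$$

Multiplying everything together (as in the equation), we get:

$$0.8 \times 0.7 \times 0.9 \times 0.5 = 0.252$$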

Step 3: Ignore the denominator, since it’ll be the same for all the other calculations.

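Step 4: Do a similar calculation for the other classes:

$$P(\text{Orange} \mid \text{Long}, \text{Sweet}, \text{Yellow}) \propto 0 \times 0.5 \times 1.0 \times 0.3 = 0$$

$$P(\text{Other} \mid \text{Long}, \text{Sweet}, \text{Yellow}) \propto 0.5 \times 0.75 \times 0.25 \times 0.2 = 0.01875$$

Since the banana value of 0.252 is greater than both 0 and 0.01875, Naive Bayes would classify this long, sweet and yellow fruit as a banana.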

Is this supervised or unsupervised? This is supervised learning, since Naive Bayes is provided a labeled training dataset in order to construct the tables.

Why use Naive Bayes? As you could see in the example above, Naive Bayes involves simple arithmetic. It’s just tallying up counts, multiplying and dividing.

Once the frequency tables are calculated, classifying an unknown fruit just involves calculating the probabilities for all the classes, and then choosing the highest probability.
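The fruit example boils down to a few lines of arithmetic. This sketch uses the counts from the training dataset above. The names `counts` and `numerator` are my own, and it skips the denominator (it is the same for every class) and does no smoothing for zero counts.

```python
# Frequency table built from the 1,000-fruit training dataset.
counts = {
    "Banana": {"total": 500, "Long": 400, "Sweet": 350, "Yellow": 450},
    "Orange": {"total": 300, "Long": 0, "Sweet": 150, "Yellow": 300},
    "Other": {"total": 200, "Long": 100, "Sweet": 150, "Yellow": 50},
}
n_fruits = sum(c["total"] for c in counts.values())

def numerator(fruit_class, features):
    """P(class) multiplied by P(feature | class) for each observed feature."""
    c = counts[fruit_class]
    score = c["total"] / n_fruits  # probability of the class
    for feature in features:
        score *= c[feature] / c["total"]  # probability of the feature given the class
    return score

observed = ["Long", "Sweet", "Yellow"]
scores = {fruit_class: numerator(fruit_class, observed) for fruit_class in counts}
best_class = max(scores, key=scores.get)
print(scores, best_class)
```

The scores come out to roughly 0.252 for Banana, 0 for Orange and 0.01875 for Other, so the unknown fruit is classified as a banana.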

Despite its simplicity, Naive Bayes can be surprisingly accurate. For example, it’s been found to be effective for spam filtering.

Where is it used? Implementations of Naive Bayes can be found in Orange, scikit-learn, Weka and R.

Finally, check out the 10th algorithm…

10. CART

What does it do? CART stands for classification and regression trees. It is a decision tree learning technique that outputs either classification or regression trees. Like C4.5, CART is a classifier.

Is a classification tree like a decision tree? A classification tree is a type of decision tree. The output of a classification tree is a class.

For example, given a patient dataset, you might attempt to predict whether the patient will get cancer. The class would either be “will get cancer” or “won’t get cancer.”

What’s a regression tree? Unlike a classification tree which predicts a class, regression trees predict a numeric or continuous value e.g. a patient’s length of stay or the price of a smartphone.

Here’s an easy way to remember…

Classification trees output classes, regression trees output numbers.
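To see the difference in outputs, here is a toy one-split tree (a stump). The patient data, the function names and the hand-picked age threshold are all hypothetical. A real CART implementation would search for the best split rather than being handed one.

```python
from collections import Counter
from statistics import mean

def fit_stump(values, targets, threshold, kind):
    """One-split tree. kind is 'classification' or 'regression'."""
    left = [t for v, t in zip(values, targets) if v < threshold]
    right = [t for v, t in zip(values, targets) if v >= threshold]
    if kind == "classification":
        # A classification leaf outputs the most common class.
        summarize = lambda ts: Counter(ts).most_common(1)[0][0]
    else:
        # A regression leaf outputs the average of the numeric targets.
        summarize = mean
    return threshold, summarize(left), summarize(right)

def predict(stump, value):
    threshold, left_output, right_output = stump
    return left_output if value < threshold else right_output

# Hypothetical patient data.
ages = [25, 32, 47, 51, 62, 68]
gets_cancer = ["no", "no", "no", "yes", "yes", "no"]
length_of_stay = [2, 3, 4, 6, 7, 8]

class_stump = fit_stump(ages, gets_cancer, threshold=50, kind="classification")
regression_stump = fit_stump(ages, length_of_stay, threshold=50, kind="regression")

predicted_class = predict(class_stump, 65)
predicted_stay = predict(regression_stump, 65)
print(predicted_class, predicted_stay)
```

The same split gives a class ("yes" or "no") in one case and a number (an average length of stay) in the other.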

Since we’ve already covered how decision trees are used to classify data, let’s jump right into things…

How does this compare with C4.5?

C4.5: Uses information gain to segment data during decision tree generation.

CART: Uses the cost-complexity method of pruning. Starting at the bottom of the tree, CART evaluates the misclassification cost with the node vs. without the node. If the cost doesn’t meet a threshold, it is pruned away.

CART: Uses surrogates to distribute the missing values to children.

Is this supervised or unsupervised? CART is a supervised learning technique, since it is provided a labeled training dataset in order to construct the classification or regression tree model.

Why use CART? Many of the reasons you’d use C4.5 also apply to CART, since they are both decision tree learning techniques. Things like ease of interpretation and explanation also apply to CART as well.

Like C4.5, CART is also quite fast and quite popular, and its output is human readable.

Where is it used? scikit-learn implements CART in their decision tree classifier. R’s tree package has an implementation of CART. Weka and MATLAB also have implementations.

Bio: Raymond Li is a software engineer and data enthusiast who has been blogging at rayli.net for over a decade. He loves to learn, teach and grow. You’ll usually find him wrangling data, programming and lifehacking.

I’ll show you how you can turn an article into a one-sentence summary in Python with the Keras machine learning library. We’ll go over word embeddings, encoder-decoder architecture, and the role of attention in learning theory.

Big data remains a rapidly evolving field with new applications and infrastructure appearing every year. In this talk, Matei Zaharia will cover new trends in 2016 / 2017 and how Apache Spark is moving to meet them. In particular, he will talk about work Databricks is doing to make Apache Spark interact better with native code (e.g. deep learning libraries), support heterogeneous hardware, and simplify production data pipelines in both streaming and batch settings through Structured Streaming.

This talk was originally presented at Spark Summit East 2017.

Data science today is a lot like the Wild West: there’s endless opportunity and excitement, but also a lot of chaos and confusion. If you’re new to data science and applied machine learning, evaluating a machine-learning model can seem pretty overwhelming. Now you have help. With this O’Reilly report, machine-learning expert Alice Zheng takes you through the model evaluation basics.

In this overview, Zheng first introduces the machine-learning workflow, and then dives into evaluation metrics and model selection. The latter half of the report focuses on hyperparameter tuning and A/B testing, which may benefit more seasoned machine-learning practitioners.

With this report, you will:

Learn the stages involved when developing a machine-learning model for use in a software application

Understand the metrics used for supervised learning models, including classification, regression, and ranking

Walk through evaluation mechanisms, such as hold-out validation, cross-validation, and bootstrapping

Explore hyperparameter tuning in detail, and discover why it’s so difficult

Learn the pitfalls of A/B testing, and examine a promising alternative: multi-armed bandits

Get suggestions for further reading, as well as useful software packages