Spark Reading and Writing to Parquet Storage Format

 

Spark: Reading and Writing to Parquet Format
————————————————————————–

– Uses the Spark DataFrame save capability
– The same code works on a local disk and in HDFS environments

Related video:
Introduction to Apache Spark and Parquet, https://www.youtube.com/watch?v=itm0T…

Code for demo

// Run in spark-shell, where sc and the toDF implicits are already in scope
case class Person(name: String, age: Int, sex: String)
val data = Seq(Person("Jack", 25, "M"), Person("Jill", 25, "F"), Person("Jess", 24, "F"))
val df = data.toDF()

import org.apache.spark.sql.SaveMode
// Append the rows to a Parquet dataset under /tmp/person
df.select("name", "age", "sex").write.mode(SaveMode.Append).format("parquet").save("/tmp/person")

// Write the same rows again, this time into one subdirectory per value of sex
df.select("name", "age", "sex").write.partitionBy("sex").mode(SaveMode.Append).format("parquet").save("/tmp/person_partitioned/")

// Read the Parquet data back into a DataFrame
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val dfPerson = sqlContext.read.parquet("/tmp/person")
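
The demo writes a partitioned copy of the data but never reads it back. The sketch below is one way to do that, assuming Spark 2.x or later, where a SparkSession replaces SQLContext. The path /tmp/person_partitioned_demo and the app name are made up for illustration, and the write uses SaveMode.Overwrite rather than Append so that re-running the snippet does not duplicate rows.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// In spark-shell a session named `spark` already exists; getOrCreate() simply reuses it.
val spark = SparkSession.builder().appName("parquet-partition-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical output path, separate from the paths used in the post.
val path = "/tmp/person_partitioned_demo"

// Same three people as in the demo, built from tuples instead of a case class.
val people = Seq(("Jack", 25, "M"), ("Jill", 25, "F"), ("Jess", 24, "F")).toDF("name", "age", "sex")

// partitionBy("sex") creates one subdirectory per value: sex=F/ and sex=M/.
people.write.partitionBy("sex").mode(SaveMode.Overwrite).parquet(path)

// The partition column is rebuilt from the directory names when reading,
// and a filter on it lets Spark skip the directories that do not match.
val females = spark.read.parquet(path).filter($"sex" === "F")
females.show()
```

Listing the output directory after the write should show the sex=F and sex=M subdirectories, which is what makes filtering on the partition column cheap.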


Using Spark and Hive – PART 1: Spark as ETL tool

Part 1: Scenario – Spark as ETL tool
Write to Parquet file using Spark

Part 2: SparkSQL to query data from Hive
Read Hive table data from Spark

Create an External Table
Query the data from Hive
Add new data
Query the data from Hive

// Run in spark-shell, where the toDF implicits are already in scope
case class Person(name: String, age: Int, sex: String)
val data = Seq(Person("Jack", 25, "M"), Person("Jill", 25, "F"), Person("Jess", 24, "F"))

val df = data.toDF()

import org.apache.spark.sql.SaveMode
df.select("name", "age", "sex").write.mode(SaveMode.Append).format("parquet").save("/tmp/person")

// Add new data: append one more row to the same Parquet location
val newData = Seq(Person("John", 25, "M"))
val newDf = newData.toDF()
newDf.select("name", "age", "sex").write.mode(SaveMode.Append).format("parquet").save("/tmp/person")

CREATE EXTERNAL TABLE person (name STRING, age INT, sex STRING)
STORED AS PARQUET
LOCATION '/tmp/person';
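
The write, append, and query flow can also be tried from Spark alone, without a Hive metastore, by registering the Parquet directory as a temporary view. This is only a sketch under that assumption: the path /tmp/person_append_demo and the view name person_demo are hypothetical, and the first write uses SaveMode.Overwrite so that the result is the same on every run.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// In spark-shell a session named `spark` already exists; getOrCreate() simply reuses it.
val spark = SparkSession.builder().appName("parquet-append-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical path, separate from the /tmp/person location used by the Hive table.
val path = "/tmp/person_append_demo"

// Initial load: Overwrite gives a clean starting point on every run.
val initial = Seq(("Jack", 25, "M"), ("Jill", 25, "F"), ("Jess", 24, "F")).toDF("name", "age", "sex")
initial.write.mode(SaveMode.Overwrite).parquet(path)

// Add new data: Append writes extra Parquet files next to the existing ones.
val added = Seq(("John", 25, "M")).toDF("name", "age", "sex")
added.write.mode(SaveMode.Append).parquet(path)

// Query the combined data with SQL through a temporary view.
spark.read.parquet(path).createOrReplaceTempView("person_demo")
val males = spark.sql("SELECT name FROM person_demo WHERE sex = 'M' ORDER BY name")
males.show()
```

Because an external table points at a directory rather than at specific files, files appended later are the reason the second Hive query in the outline above sees the new row.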


Big Data for the Masses – Talend Open Studio for Big Data: Part #2 – Removing duplicate records

On February 29, 2012, Talend announced the availability of Talend Open Studio for Big Data, to be released under the Apache Software License. You can download a preview from http://www.talend.com/products-big-da…

Talend Open Studio for Big Data is a powerful open source solution for big data integration that addresses the needs of the data analyst by providing a graphical tool that abstracts the underlying complexities of big data technologies. It provides a palette of easy-to-configure components that automatically generate code for Hadoop Distributed File System (HDFS), Pig, HBase, Sqoop and Hive.
All of this is available under an Apache License, which is fully open source and free to use.

This video demonstration shows how to read an Excel file of records and pipe these to Hadoop. For data quality purposes, you will also learn how to add a “unique row” component to remove any duplicates from the input.

Download Talend’s solutions: http://www.talend.com/download.php?sr…


For equities Trump is not decisive – What do you think?

The new US president’s rhetoric is also influencing the price of European equities. However, investors should not focus on comments from overseas; other factors are more crucial.

For Swiss equities, Trump is not decisive

Donald Trump has already made several trips to the car manufacturers in recent months. Shortly before taking office as US president, he also targeted the pharma sector: in mid-January, Trump announced plans to reduce drug prices in the US. “We’ll save billions of dollars,” he said.

Is there any big data or predictive analysis report on this?


Take a tour of F#

A straightforward way to break into functional programming in a Microsoft shop is via F#. Phillip Carter, Maira Wenzel, and team put together a pretty cool Tour of F# for anybody interested in getting their feet wet. They offer plenty of annotated code examples to show off some of the key features of the language.


Cloud based Analytics Solution – TapestryKPI

TapestryKPI: Cloud-based Analytics reporting and digital data management solution built to unify the ever-growing and fragmented digital data space. The system is organized by four digital verticals.

Visit the link to learn more about the cloud-based analytics solution TapestryKPI: http://www.softcrylic.com/tapestrykpi/


Gartner 2017 Magic Quadrant for Data Science Platforms: gainers and losers

We compare the Gartner 2017 Magic Quadrant for Data Science Platforms with its 2016 version and identify notable changes for leaders and challengers, including IBM, SAS, RapidMiner, KNIME, MathWorks, Microsoft, and Quest.

Gartner’s new 2017 Magic Quadrant for Data Science Platforms (called “Advanced Analytics Platforms” in 2016) was published last week. The 2017 report evaluated a new set of 16 analytics and data science firms on 15 criteria and placed them in 4 quadrants, based on completeness of vision and ability to execute.

While open source platforms like Python and R play an important role in the Data Science market, Gartner’s research methodology does not include them, so this report evaluates only commercial vendors.

Fig. 1: Gartner 2017 Magic Quadrant for Data Science Platforms

Firms covered:

  • Leaders (4): IBM, SAS, RapidMiner, KNIME
  • Challengers (4): MathWorks (new), Quest (formerly Dell), Alteryx, Angoss
  • Visionaries (5): Microsoft, H2O.ai (new), Dataiku (new), Domino Data Lab (new), Alpine Data
  • Niche Players (3): FICO, SAP, Teradata (new)

Gartner notes that even the lowest-scoring vendors in the MQ are still among the top 16 of over 100 vendors in the heated Data Science market.

5 new firms were added in 2017 (MathWorks, H2O.ai, Dataiku, Domino Data Lab, and Teradata) and 5 others present in 2016 MQ were dropped: Lavastorm, Megaputer, Prognoz, Accenture, Predixion Software.

As we did in our popular post last year: Gartner 2016 Magic Quadrant for Advanced Analytics Platforms: gainers and losers, we compared 2017 Magic Quadrant with its 2016 version. Below we examine the changes, gainers, and losers.

Fig 2: Gartner Magic Quadrants for Data Science Platforms compared, 2017 vs 2016

Fig 2 shows a comparison of 2016 MQ (greyed background image) and 2017 MQ (foreground image), with arrows connecting circles for the same firm. Arrows are colored green if the firm position improved significantly (further away from origin), red if the position became weaker. Green circles indicate new firms, while red Xs mark vendors dropped in 2017.

Leaders:

The same 4 firms that were leaders in 2014-2016 (IBM, SAS, RapidMiner, and KNIME) are again in the Leaders quadrant. Dell Software, sold in Nov 2016 and renamed Quest, moved with its Statistica analytics suite from the Leaders to the Challengers quadrant.

The 2017 MQ shows that IBM gained a little in vision, while SAS lost a little in ability to execute. RapidMiner gained in ability to execute, but both KNIME and RapidMiner lost in vision.

IBM is again a leader, based on SPSS Modeler and SPSS Statistics. Gartner did not evaluate IBM’s new Data Science Experience (DSx) platform, but DSx contributed to IBM’s improved position along the Vision axis. IBM’s strengths include its vast customer base and continued innovation in its data science and machine learning capabilities. However, IBM’s breadth of offerings (SPSS, IBM Watson, DSx, Cognos) creates some confusion in the market, and the SPSS line has interoperability problems.

SAS provides many software products for analytics and data science. Gartner’s evaluation covers SAS Enterprise Miner (EM) and the SAS Visual Analytics Suite (VAS). SAS is now more focused on interactive modeling with VAS, but continues to support its traditional Base SAS. SAS retains a strong position in the Leaders quad, but confusion about its multiple products and concerns about high cost led to a decline in Ability to Execute.

KNIME offers the open-source KNIME Analytics Platform, with strong functionality for advanced data scientists. It is strong in several industries, especially manufacturing and life sciences. It lost somewhat along the Vision dimension due to weaker marketing and innovation compared to other leaders.

RapidMiner offers a GUI-based data science platform, suitable for both beginner and expert data scientists. It also offers access to open-source code. RapidMiner is available both as a free version and as a commercial edition (with extra functionality for large data and connections to more data sources). RapidMiner is in the Leaders quad due to its market presence and well-rounded product.

Challengers:

  • MathWorks is a new entry for its MATLAB product, popular with engineers, and providing a rich set of toolboxes.
  • Quest, the result of the sale of Dell Software in 2016 to a private equity firm, now sells the Statistica Platform. Quest is in the Challengers quad (while Dell was in the Leaders quad) as a result of the second change of ownership of Statistica in 3 years and a lack of cloud-related product improvements (which, however, are on the roadmap).
  • Alteryx offers an easy-to-use data science platform, with self-service data preparation and advanced analytics. It also added simulation and optimization capabilities. Compared to 2016, it moved from the Visionaries to the Challengers quad due to its solid customer growth.
  • Angoss provides visual data mining and predictive analytics tools, as well as prescriptive analysis and optimization. Angoss remained in almost the same position in the Challengers quad as in 2016.

Visionaries:

  • Microsoft’s evaluation was based on the Azure Machine Learning platform, part of the Microsoft Cortana Intelligence Suite, which offers a strong cloud-based data science platform. Gartner kept Microsoft in the Visionaries quad due to the lack of a comparable on-premises solution.
  • H2O.ai, a new entry, offers an open-source data science platform with fast execution of Deep Learning and other advanced Machine Learning methods.
  • Dataiku is a new entry, with its Data Science Studio (DSS). It was placed in the Visionaries quad due to the innovative nature of DSS, its openness, collaboration features, and suitability for different skill levels.
  • Domino Data Lab is a new entry with its Domino Data Science Platform, which focuses on collaboration and supports a wide range of open-source technologies.
  • Alpine Data offers a “citizen data science” platform, Chorus, enabling collaboration between business analysts and front-line users in building and running analytic workflows. Compared to 2016, Alpine remained in the Visionaries quadrant, but dropped in its ability to execute due to its struggle to gain market share.

Niche Players:

  • SAP renamed its data science platform to SAP BusinessObjects Predictive Analytics. SAP’s decline in ability to execute caused it to drop from the Challengers quad to the Niche quad, and it is lagging in Spark integration, open-source and Python support, and cloud deployment.
  • FICO Decision Management Suite (DMS) offers multiple analytic tools. It stayed in the Niche quadrant, as in 2016, and lost a little along both dimensions.
  • Teradata offers the Aster Analytics platform, with 3 layers: analytic engines, prebuilt analytic functions, and the Aster AppCenter for analysis and connectivity to external BI tools. It is in the Niche quad due to a low level of adoption.

You can download the Gartner 2017 Magic Quadrant report for Data Science Platforms from RapidMiner, Dataiku, and probably other vendors favorably mentioned in this report.

You can also download a related 2017 Gartner Magic Quadrant for BI & Analytics from Tableau.


6 Top Big Data and Data Science Trends 2017


With 2017 under way, it is time to take stock of 2016. Our world depends more and more on data and information, and businesses are using data science and machine learning to handle enormous amounts of data. Deep learning and artificial intelligence applications really took off, and we expect those and other important developments to continue. So let’s have a look at the top big data and data science trends for 2017.

 

1. More and more businesses move their predictive analytics needs to the cloud

Many companies are moving their data and applications to the cloud. This shift is driven by better collaboration and flexibility, as well as by less complexity in administering and configuring computing resources.

The majority of the top cloud providers have developed their own machine learning services in the cloud. This allows organizations to leverage machine learning technology without massive investments or the need to employ large data science teams. As a result, we are now observing the consumerization of predictive analytics technologies, supported by these cloud prediction services.

Here are the main examples of such machine learning and AI as a service (MLaaS and AIaaS) providers:

IBM Watson

Microsoft Azure Machine Learning API

Google Prediction API

Amazon Machine Learning API

BigML

Those who work with data know very well that data is useless unless it is efficiently analyzed and turned into insights that support the decision-making process. In 2017, the increased usage of cloud ML services will help improve and accelerate the transition from data to action for many companies in various industries.

2. A greater number of organizations will adopt the Hadoop and Spark big data platforms

In 2017 we will see steady growth of Big Data technologies like Hadoop and Spark. The main advantage of Hadoop is its ability to collect vast amounts of data and then distribute that data across low-cost servers running in parallel.

To make Hadoop actually work outside of the test environment, many enterprises are going to shift it to the cloud in 2017 and separate storage from computation.

Now let’s take a look at Apache Spark: it is one of the fastest-growing big data platforms, and it uses in-memory computation.

According to recent studies, the percentage of users running Spark on the public cloud (61%) was higher than the percentage using Hadoop YARN (36%), and this trend will continue in 2017. In addition, Spark in 2017 continues to develop its operating model around “in-memory analytics.” This approach allows much faster computations, in some cases up to 100 times faster than the original Hadoop MapReduce framework.

 

3. Need for stronger data security is rising

The increasing number of cyber attacks is drawing more attention to data security, which has become more relevant in recent years. Spending on security analytics is expected to rise. Thus, it is critical to keep up with new trends in the field of data security.

One recent trend in security is the increased use of machine learning algorithms, including deep learning, for anomaly detection and other data science security tasks in various business domains.

Another noticeable trend in security data science is the use of AI conversational interfaces (a.k.a. bots) to automate assistance and security responses to emerging threats. Such bots are already in use on several platforms, such as Slack.

Behavioral biometrics is one of today’s quickly developing areas. It can be used standalone, but in combination with ML we should expect faster results, which should reduce costs and improve efficiency. It can be used to identify the person on the other end and what they’re trying to do, as well as to distinguish a human user from a robot.

In the future, many new types of attacks may emerge, so the requirements for cyber security are getting more complicated, and security specialists will need to adapt to the new threats.

4. Deep Learning technology is becoming mainstream

Deep learning got a lot of attention in 2016, as notable results were achieved with it in many important applications, such as machine translation and other forms of language processing, automatic image caption generation, object classification and detection in images, facial recognition, and automatic game playing.

In 2017 we expect AI to keep improving and to move more and more toward artificial general intelligence (AGI). It is no wonder that, for data scientists, deep learning will be one of the most important approaches to solving machine learning problems.

Furthermore, deep learning is becoming widely available now, because of the increased computing power and accessibility of this technology through various open-source frameworks, such as TensorFlow and Deeplearning4j.

Top companies, such as Google, Facebook, Microsoft and Baidu are driving research further with their strong artificial intelligence R&D teams.

So we are going to see a lot of new deep learning applications that could significantly impact our lives, like automatic photo correction and professional photo processing using ML methods. Among others, we can mention the rise of new portable devices that help disabled people to see, hear, and otherwise sense their environment by utilizing machine learning.

 

5. The movement towards conversational interfaces and chatbots will accelerate

For a long time after computers appeared in our lives, we were forced to speak their language to communicate with them. Now, thanks to the latest discoveries and technology development, we can teach computers to communicate in our language. Conversational systems have moved from a model in which the user adapts to the computer to one in which the computer “listens” and adjusts to the user’s desired result.

The new possibilities in conversational interfaces (CI) and chatbots will require a lot of new skills from developers. First of all, they will be a huge challenge for UX designers, as they are very different from current user interfaces. This year, many programmers and designers will aim to improve their skills in designing and developing chatbots.

Chatbots themselves are going to be a big trend for 2017, and we expect to see more creative solutions that bring chatbots to the next level. Chatbot technology has already been integrated into many consumer apps. The next step is to see it moving into other industries like retail banking, e-commerce, enterprise solutions, or even marketing campaigns.

So why have bots become popular recently? Although bots have existed for a long time, only now has AI development reached a level where it is possible to create advanced products, many of which utilize machine learning. Prominent examples of conversational AI include Google Assistant and Siri for iOS, which have become almost indispensable for smartphone users and have already gone far beyond being just a fun application or a quirk.

 

6. 2017 is the Year of Self-Driving Cars

Finally, one of the most popular trends in 2017 is self-driving cars and autonomous technology. This trend has been under development for several years, but now the technology is almost ready to hit the market. More and more car manufacturers are moving towards automating driving and launching hands-free models. This year may be a big milestone for the area.

Many giants of the auto market have already presented their projects. Among them are BMW, Volvo, General Motors, Nissan and of course Tesla and Google. Companies and the greatest minds of our time believe that production of such cars will reduce the number of road accidents. On the other hand, we should not forget how many jobs would be lost as this technology develops. For example, in the USA, driver is the most common occupation in half of the states. This topic has been the subject of hot debate, but mostly among sociologists, economists, and Silicon Valley gurus.

Electric cars are nothing new or surprising nowadays, and they are gradually becoming a bigger part of our lives as technology develops and prices fall. The time is coming for the rise of self-driving technology. However, there are still many open questions regarding safety and proper legislation in this area.

 

Conclusions

2017 is going to be very exciting and full of new discoveries and improvements in the artificial intelligence field. This year is going to be “The Year of Intelligence,” as AI and machine learning applications are going mainstream, contributing to every part of organizations and business areas, and becoming one of the key competitive advantages for companies that integrate machine learning into their operations.

We hope you’ve enjoyed reading this post about new developments and trends in the data science field for 2017. We do not claim this is a definitive list, as so many things are evolving quickly in the technology realm, so we encourage you to share your view of the main trends in data and data science in the comments section below.

Originally published at: http://activewizards.com/blog/6-top-big-data-and-data-science-trends-2017/
