What to Expect for Big Data and Apache Spark in 2017

Big data remains a rapidly evolving field with new applications and infrastructure appearing every year. In this talk, Matei Zaharia will cover new trends in 2016 / 2017 and how Apache Spark is moving to meet them. In particular, he will talk about work Databricks is doing to make Apache Spark interact better with native code (e.g. deep learning libraries), support heterogeneous hardware, and simplify production data pipelines in both streaming and batch settings through Structured Streaming.

This talk was originally presented at Spark Summit East 2017.

You can view the slides on Slideshare:
http://www.slideshare.net/databricks/…

Get Rid of Traditional ETL, Move to Spark! (Bas Geerdink)

ETL has been around since the 90s, supporting a whole ecosystem of BI tools and practices. While traditional ETL has proven its value, it’s time to move on to modern ways of getting your data from A to B. As BI moved to big data, data warehouses became data lakes, and applications became microservices, ETL is next on our list of obsolete terms. Spark provides an ideal middleware framework for writing code that gets the job done fast, reliably, and readably. In this session I will support this statement with some nice ‘old vs new’ diagrams, code examples and use cases. Please join if you want to know more about the NoETL paradigm, or just want to be convinced of the possibilities of Spark in this area!
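As a rough illustration of this “Spark as ETL” idea (a minimal sketch, not code from the session; the input path, output path and column names are made up), a single Spark job can extract, transform and load data in a few readable lines:

// Minimal "Spark as ETL" sketch for spark-shell, where sc and sqlContext are predefined.
// The paths and column names below are hypothetical.
import org.apache.spark.sql.SaveMode

// Extract: read raw JSON events from HDFS
val raw = sqlContext.read.json("/data/raw/events")

// Transform: keep valid events and select only the columns downstream consumers need
val cleaned = raw.filter(raw("status") === "ok").select("user_id", "event_type", "timestamp")

// Load: write the result as Parquet for BI tools and other jobs to query
cleaned.write.mode(SaveMode.Overwrite).format("parquet").save("/data/curated/events")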

Facebooktwittergoogle_plusredditpinterestlinkedintumblrmail

Spark Reading and Writing to Parquet Storage Format

Spark: Reading and Writing to Parquet Format

- Using the Spark DataFrame save capability
- Code/approach works in both local filesystem and HDFS environments

Related video:
Introduction to Apache Spark and Parquet, https://www.youtube.com/watch?v=itm0T…

Code for demo

// Run in spark-shell, where sc is predefined and sqlContext.implicits._ (needed for toDF) is already imported
case class Person(name: String, age: Int, sex: String)
val data = Seq(Person("Jack", 25, "M"), Person("Jill", 25, "F"), Person("Jess", 24, "F"))
val df = data.toDF()

import org.apache.spark.sql.SaveMode

// Append the DataFrame as Parquet files under /tmp/person
df.select("name", "age", "sex").write.mode(SaveMode.Append).format("parquet").save("/tmp/person")

// Write the same data again, partitioned on disk by the "sex" column
df.select("name", "age", "sex").write.partitionBy("sex").mode(SaveMode.Append).format("parquet").save("/tmp/person_partitioned/")

// Read the Parquet files back into a DataFrame
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val dfPerson = sqlContext.read.parquet("/tmp/person")
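A small follow-up (not part of the original demo; the filter is just for illustration) shows why the partitioned copy is useful: filtering on the partition column lets Spark prune directories and scan only the matching sex=… files.

// Read the partitioned copy and filter on the partition column
val dfPartitioned = sqlContext.read.parquet("/tmp/person_partitioned/")
val females = dfPartitioned.filter(dfPartitioned("sex") === "F")  // only the sex=F directory is scanned
females.show()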


Using Spark and Hive – PART 1: Spark as ETL tool

Part 1: Scenario – Spark as ETL tool
Write to Parquet file using Spark

Part 2: SparkSQL to query data from Hive
Read Hive table data from Spark

Create an External Table
Query the data from Hive
Add new data
Query the data from Hive

// Spark (spark-shell, with sqlContext.implicits._ available for toDF): write the initial data as Parquet
case class Person(name: String, age: Int, sex: String)
val data = Seq(Person("Jack", 25, "M"), Person("Jill", 25, "F"), Person("Jess", 24, "F"))

val df = data.toDF()

import org.apache.spark.sql.SaveMode
df.select("name", "age", "sex").write.mode(SaveMode.Append).format("parquet").save("/tmp/person")

// Add new data and append it to the same Parquet location
val newData = Seq(Person("John", 25, "M"))
val newDf = newData.toDF()
newDf.select("name", "age", "sex").write.mode(SaveMode.Append).format("parquet").save("/tmp/person")

-- Hive DDL: create an external table over the Parquet files written above
CREATE EXTERNAL TABLE person (name String, age Int, sex String)
STORED AS PARQUET
LOCATION '/tmp/person';
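Part 2 of the scenario (querying the data from Hive and reading the Hive table back from Spark) is only outlined above; a minimal sketch, assuming a Spark build with Hive support, could look like this:

-- Hive shell: query the external table; rows appended by Spark under /tmp/person
-- appear in the results of the next query
SELECT name, age, sex FROM person;

// spark-shell (Spark compiled with Hive support): read the same Hive table via SparkSQL
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val persons = hiveContext.sql("SELECT name, age, sex FROM person")
persons.show()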


Big Data for the Masses – Talend Open Studio for Big Data: Part #2 – Removing duplicate records

On February 29, 2012, Talend announced the availability of Talend Open Studio for Big Data, to be released under the Apache Software License. You can download a preview from http://www.talend.com/products-big-da…

Talend Open Studio for Big Data is a powerful open source solution for big data integration that addresses the needs of the data analyst by providing a graphical tool that abstracts the underlying complexities of big data technologies. It provides a palette of easy-to-configure components that automatically generate code for the Hadoop Distributed File System (HDFS), Pig, HBase, Sqoop and Hive.
All of this is available under the Apache License, fully open source and free to use.

This video demonstration shows how to read an Excel file of records and pipe them into Hadoop. For data quality purposes, you will also learn how to add a “unique row” component to remove any duplicates from the input.
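Talend’s “unique row” component is configured graphically rather than coded by hand, but for readers following the Spark posts above, a rough Scala equivalent of the same deduplication step (an analogy under assumed column names, not Talend’s generated code) could be:

// spark-shell sketch: deduplicate records before landing them in Hadoop.
// The Record fields and output path are hypothetical stand-ins for the Excel input in the demo.
import org.apache.spark.sql.SaveMode

case class Record(name: String, age: Int, sex: String)
val raw = Seq(Record("Jack", 25, "M"), Record("Jill", 25, "F"), Record("Jack", 25, "M")).toDF()

// Keep one row per distinct (name, age, sex) combination, like the "unique row" component
val deduped = raw.dropDuplicates(Seq("name", "age", "sex"))

// Write the cleaned records to HDFS as Parquet
deduped.write.mode(SaveMode.Append).format("parquet").save("/tmp/records_deduped")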

Download Talend’s solutions: http://www.talend.com/download.php?sr…
