Big data remains a rapidly evolving field with new applications and infrastructure appearing every year. In this talk, Matei Zaharia will cover new trends in 2016 / 2017 and how Apache Spark is moving to meet them. In particular, he will talk about work Databricks is doing to make Apache Spark interact better with native code (e.g. deep learning libraries), support heterogeneous hardware, and simplify production data pipelines in both streaming and batch settings through Structured Streaming.
This talk was originally presented at Spark Summit East 2017.
ETL has been around since the 90s, supporting a whole ecosystem of BI tools and practices. While traditional ETL has proven its value, it’s time to move on to modern ways of getting your data from A to B. BI moved to big data, data warehouses became data lakes, and applications became microservices; ETL is next on our list of obsolete terms. Spark provides an ideal middleware framework for writing code that gets the job done fast, reliably, and readably. In this session I will support this statement with some nice ‘old vs new’ diagrams, code examples and use cases. Please join if you want to know more about the NoETL paradigm, or just want to be convinced of the possibilities of Spark in this area!
On February 29, 2012, Talend announced the availability of Talend Open Studio for Big Data, to be released under the Apache Software License. You can download a preview from http://www.talend.com/products-big-da…
Talend Open Studio for Big Data is a powerful open source solution for big data integration that addresses the needs of data analysts by providing them with a graphical tool that abstracts the underlying complexities of big data technologies. It provides a palette of easy-to-configure components that automatically generate code for the Hadoop Distributed File System (HDFS), Pig, HBase, Sqoop and Hive.
All of this is available under the Apache License, making it fully open source and free to use.
This video demonstration shows how to read an Excel file of records and pipe them to Hadoop. For data quality purposes, you will also learn how to add a “unique row” component to remove any duplicates from the input.
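The “unique row” step in the demo can be sketched in plain Python. This is an illustrative stand-in for the component Talend generates, not its actual code: the sample records and the in-memory output are assumptions, and in the real flow the result would be written to HDFS rather than a local buffer.

```python
import csv
import io

def unique_rows(rows):
    """Drop duplicate rows, keeping the first occurrence and preserving order."""
    seen = set()
    out = []
    for row in rows:
        key = tuple(row)  # rows are lists; convert to a hashable tuple
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

# Illustrative records, standing in for the rows read from the Excel file.
rows = [
    ["1001", "Alice", "NY"],
    ["1002", "Bob", "CA"],
    ["1001", "Alice", "NY"],  # duplicate that the "unique row" step removes
]

deduped = unique_rows(rows)

# A local CSV buffer stands in for the write to Hadoop (HDFS) shown in the video.
buf = io.StringIO()
csv.writer(buf).writerows(deduped)
print(len(deduped))  # 2 unique rows remain
```

The key choice is deduplicating on the full row (all columns); a real pipeline might instead key on a subset of columns, such as a record ID.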