Spark Summit Europe 2017 Session Summaries (Developer Track)

To download all of the videos and slides, follow the WeChat official account (bigdata_summit) and click the "视频下载" (video download) menu.

A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets

by Jules Damji, Databricks
video, slide
Of all the developers’ delights, none is more attractive than a set of APIs that make developers productive, that are easy to use, and that are intuitive and expressive. Apache Spark offers these APIs across components such as Spark SQL, Streaming, Machine Learning, and Graph Processing to operate on large data sets in languages such as Scala, Java, Python, and R for doing distributed big data processing at scale. In this talk, I will explore the evolution of the three sets of APIs (RDDs, DataFrames, and Datasets) available in Apache Spark 2.x. In particular, I will emphasize three takeaways: 1) why and when you should use each set as a best practice; 2) the performance and optimization benefits of each; and 3) the scenarios in which to use DataFrames and Datasets instead of RDDs for your distributed big data processing. Through simple notebook demonstrations with API code examples, you’ll learn how to process big data using RDDs, DataFrames, and Datasets and interoperate among them. (This talk is a spoken version of the blog post, together with the latest developments in the Apache Spark 2.x DataFrame/Dataset and Spark SQL APIs: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html) Session hashtag: #EUdev12
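
As a quick illustration of how the three APIs interoperate, here is a minimal Scala sketch (the Person case class and the local session are made up for this example, not taken from the talk):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type, used only for this example.
case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("three-apis-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// RDD: functional transformations on JVM objects, no Catalyst optimization.
val rdd = spark.sparkContext.parallelize(Seq(Person("Ann", 34), Person("Bo", 29)))

// Dataset: typed and compile-time checked, still optimized by Catalyst.
val ds = rdd.toDS()

// DataFrame: Dataset[Row] with untyped columns, convenient for SQL-style queries.
val df = ds.toDF()
df.filter($"age" > 30).show()

// Interoperating in the other direction: DataFrame -> Dataset -> RDD.
val typedAgain = df.as[Person]
val rddAgain   = typedAgain.rdd
```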

An Adaptive Execution Engine For Apache Spark SQL

by Carson Wang, Intel
video, slide
Catalyst is an excellent optimizer in Spark SQL that provides an open interface for rule-based optimization in the planning stage. However, static (rule-based) optimization does not take the actual data distribution at runtime into account. A technology called Adaptive Execution has been available since Spark 2.0 and aims to cover this gap, but it is still at an early stage. We enhanced the existing Adaptive Execution feature, focusing on adjusting the execution plan at runtime according to the intermediate outputs of each stage: setting partition numbers for joins and aggregations, avoiding unnecessary data shuffling and disk I/O, handling data-skew cases, and even optimizing the join order as a cost-based optimizer (CBO) would. In our benchmark comparison experiments, this feature saves huge manual effort in tuning parameters such as the shuffle partition number, which is error-prone and misleading. In this talk, we will present the new adaptive execution framework, task scheduling, the failover retry mechanism, runtime plan switching, and more. Finally, we will also share our experience benchmarking TPCx-BB at the 100-300 TB scale on a bare-metal Spark cluster with hundreds of nodes. Session hashtag: #EUdev4
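
For context, Spark 2.x already ships an experimental form of adaptive execution that can be switched on through configuration; the sketch below shows those settings (property names as found in Spark 2.x, values chosen only for illustration):

```scala
import org.apache.spark.sql.SparkSession

// A minimal sketch: turn on Spark 2.x's experimental adaptive execution so the
// post-shuffle partition count is decided at runtime rather than fixed up front.
val spark = SparkSession.builder()
  .appName("adaptive-execution-sketch")
  .config("spark.sql.adaptive.enabled", "true")  // experimental in Spark 2.x
  // Target roughly 64 MB per post-shuffle partition (illustrative value).
  .config("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "67108864")
  .getOrCreate()

// Without adaptive execution, the same knob has to be tuned by hand per job:
// spark.conf.set("spark.sql.shuffle.partitions", "2000")
```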

Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Methodologies

by Luca Canali, CERN
video, slide
This talk is about methods and tools for troubleshooting Spark workloads at scale and is aimed at developers, administrators, and performance practitioners. You will find examples illustrating the importance of using the right tools and the right methodologies for measuring and understanding performance, in particular highlighting the importance of using data and root cause analysis to understand and improve the performance of Spark applications. The talk has a strong focus on practical examples and on tools for collecting data relevant to performance analysis. This includes tools for collecting Spark metrics and tools for collecting OS metrics. Among others, the talk will cover sparkMeasure, a tool developed by the author to collect Spark task metrics and SQL metrics data, tools for analysing I/O and network workloads, tools for analysing CPU usage and memory bandwidth, and tools for profiling CPU usage and for Flame Graph visualization. Session hashtag: #EUdev2
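
To give a flavor of the raw data such tools work with, here is a minimal sketch that uses Spark's built-in SparkListener to print a few task metrics; sparkMeasure aggregates the same task metrics, but its own API is not reproduced here:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("task-metrics-sketch").master("local[*]").getOrCreate()

// Print a few metrics for every finished task; this is roughly the raw data
// that tools like sparkMeasure collect and aggregate for you.
spark.sparkContext.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {
      println(s"stage=${taskEnd.stageId} " +
        s"runTime=${m.executorRunTime} ms " +
        s"cpuTime=${m.executorCpuTime / 1000000} ms " +   // executorCpuTime is in nanoseconds
        s"gcTime=${m.jvmGCTime} ms " +
        s"shuffleRead=${m.shuffleReadMetrics.totalBytesRead} B")
    }
  }
})

// Any action now produces per-task metric lines on the driver.
spark.range(0, 10000000).selectExpr("sum(id)").show()
```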

Dr. Elephant: Achieving Quicker, Easier, and Cost-Effective Big Data Analytics

by Akshay Rai, LinkedIn
video, slide
Is your job running slower than usual? Do you want to make sense of the thousands of Hadoop & Spark metrics? Do you want to monitor the performance of your flows, get alerts, and auto-tune them? These are common questions every Hadoop user asks, but there is no single solution that addresses them. We at LinkedIn faced lots of such issues and have built a simple self-serve tool for Hadoop users called Dr. Elephant. Dr. Elephant, which is already open sourced, is a performance monitoring and tuning tool for Hadoop and Spark. It tries to improve developer productivity and cluster efficiency by making it easier to tune jobs. Since being open sourced, it has been adopted by multiple organizations and followed with a lot of interest in the Hadoop and Spark community. In this talk, we will discuss Dr. Elephant and outline our efforts to expand its scope into a comprehensive monitoring, debugging, and tuning tool for Hadoop and Spark applications. We will talk about how Dr. Elephant performs exception analysis, gives clear and specific tuning suggestions, tracks metrics, and monitors their historical trends. Open source: https://github.com/linkedin/dr-elephant Session hashtag: #EUdev9

Extending Apache Spark SQL Data Source APIs with Join Push Down

by Ioana Delaney, IBM
video, slide
When Spark applications operate on distributed data coming from disparate data sources, they often have to directly query data sources external to Spark such as backing relational databases or data warehouses. For that, Spark provides Data Source APIs, which are a pluggable mechanism for accessing structured data through Spark SQL. Data Source APIs are tightly integrated with the Spark Optimizer. They provide optimizations such as filter push down to the external data source and column pruning. While these optimizations significantly speed up Spark query execution, depending on the data source, they only provide a subset of the functionality that can be pushed down and executed at the data source. As part of our ongoing project to provide a generic data source push down API, this presentation will show our work related to join push down. An example is star-schema join, which can be simply viewed as filters applied to the fact table. Today, Spark Optimizer recognizes star-schema joins based on heuristics and executes star-joins using efficient left-deep trees. An alternative execution proposed by this work is to push down the star-join to the external data source in order to take advantage of multi-column indexes defined on the fact tables, and other star-join optimization techniques implemented by the relational data source. Session hashtag: #EUdev7
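
For reference, the existing push-down hooks in the Data Source API look roughly like the sketch below: Spark hands the relation the pruned columns and the filters it could push down. The class and table details are invented for illustration:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical external relation. Spark passes in the pruned column list and the
// filters it managed to push down; filters not handled here are re-applied by Spark.
class MyExternalRelation(override val sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType = StructType(Seq(
    StructField("id", IntegerType),
    StructField("name", StringType)))

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    // A real source would translate `filters` into the remote system's query
    // language and fetch only `requiredColumns`; joins cannot be pushed down here yet.
    println(s"pushed columns=${requiredColumns.mkString(",")} filters=${filters.mkString(",")}")
    sqlContext.sparkContext.emptyRDD[Row]
  }
}
```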

Extending Apache Spark’s Ingestion: Building Your Own Java Data Source

by Jean Georges Perrin, Oplo
video, slide
Apache Spark is a wonderful platform for running your analytics jobs. It has great built-in ingestion features for CSV, Hive, JDBC, and more; however, you may have your own data sources or formats you want to use. Your solution could be to convert your data to a CSV or JSON file and then ask Spark to ingest it through its built-in tools. However, for better performance, we will explore how to build a data source, in Java, that extends Spark’s ingestion capabilities. We will first understand how Spark handles ingestion, then walk through the development of this data source plug-in. Targeted audience: software and data engineers who need to expand Spark’s ingestion capability. Key takeaways: requirements, needs & architecture (15%); building the required tool set in Java (85%). Session hashtag: #EUdev6
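
The talk builds its data source in Java; the sketch below exercises the same Data Source API from Scala for brevity, with made-up class names, to show the overall shape of a minimal read-only source:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical provider: Spark instantiates it when the format below is requested.
class MyFormatProvider extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new MyFormatRelation(sqlContext, parameters("path"))
}

// Hypothetical relation that exposes each line of a text file as one Row.
class MyFormatRelation(override val sqlContext: SQLContext, path: String)
  extends BaseRelation with TableScan {

  override def schema: StructType = StructType(Seq(StructField("line", StringType)))

  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.textFile(path).map(Row(_))
}

// Usage sketch, assuming the classes above are on the classpath:
// val df = spark.read.format(classOf[MyFormatProvider].getName).load("/some/path")
```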

Fire in the Sky: An Introduction to Monitoring Apache Spark in the Cloud

by Michael McCune, Red Hat
video, slide
Writing intelligent cloud-native applications is hard enough when things go well, but what happens when performance and debugging issues arise in production? Inspecting the logs is a good start, but what if the logs don’t show the whole picture? Now you have to go deeper, examining the live performance metrics that are generated by Spark, or even deploying specialized microservices to monitor and act upon that data. Spark provides several built-in sinks for exposing metrics data about the internal state of its executors and drivers, but getting at that information when your cluster is in the cloud can be a time-consuming and arduous process. In this presentation, Michael McCune will walk through the options available for gaining access to the metrics data even when a Spark cluster lives in a cloud-native containerized environment. Attendees will see demonstrations of techniques that will help them integrate a full-fledged metrics story into their deployments. Michael will also discuss the pain points and challenges around publishing this data outside of the cloud and explain how to overcome them. In this talk you will learn about: deploying metrics sinks as microservices, common configuration options, and accessing metrics data through a variety of mechanisms. Session hashtag: #EUdev11
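
As one concrete example of those configuration options, here is a hedged sketch that wires up two of Spark's built-in metrics sinks. These entries normally live in conf/metrics.properties; passing them through SparkConf with the spark.metrics.conf. prefix is also supported in Spark 2.x, but verify this against your deployment:

```scala
import org.apache.spark.sql.SparkSession

// A minimal sketch: expose driver and executor metrics through two built-in sinks.
// The same key/value pairs can be written to conf/metrics.properties instead of SparkConf.
val spark = SparkSession.builder()
  .appName("metrics-sinks-sketch")
  .config("spark.metrics.conf.*.sink.jmx.class", "org.apache.spark.metrics.sink.JmxSink")
  .config("spark.metrics.conf.*.sink.console.class", "org.apache.spark.metrics.sink.ConsoleSink")
  .config("spark.metrics.conf.*.sink.console.period", "30")      // report every 30 ...
  .config("spark.metrics.conf.*.sink.console.unit", "seconds")   // ... seconds
  .getOrCreate()
```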

From Pipelines to Refineries: Building Complex Data Applications with Apache Spark

by Tim Hunter, Databricks
video, slide
Big data tools are challenging to combine into a larger application: ironically, big data applications themselves do not tend to scale very well. These issues of integration and data management are only magnified by increasingly large volumes of data. Apache Spark provides strong building blocks for batch processes, streams and ad-hoc interactive analysis. However, users face challenges when putting together a single coherent pipeline that could involve hundreds of transformation steps, especially when confronted by the need of rapid iterations. This talk explores these issues through the lens of functional programming. It presents an experimental framework that provides full-pipeline guarantees by introducing more laziness to Apache Spark. This framework allows transformations to be seamlessly composed and alleviates common issues, thanks to whole program checks, auto-caching, and aggressive computation parallelization and reuse. Session hashtag: #EUdev1
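
The framework itself is experimental, but the underlying idea of treating pipeline steps as composable, lazily evaluated functions can already be sketched with plain Spark's Dataset.transform; the framework then layers whole-program checks and auto-caching on top:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("composable-pipeline-sketch").master("local[*]").getOrCreate()

// Each step is a pure function DataFrame => DataFrame, so a pipeline is just
// function composition; Spark's laziness defers all work until an action runs.
def withDoubled(df: DataFrame): DataFrame = df.withColumn("doubled", col("id") * 2)
def onlyEven(df: DataFrame): DataFrame    = df.filter(col("id") % 2 === 0)

val pipeline = (df: DataFrame) => df.transform(withDoubled).transform(onlyEven)

pipeline(spark.range(0, 10).toDF()).show()
```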

Lessons From the Field: Applying Best Practices to Your Apache Spark Applications

by Silvio Fiorito, Databricks
video, slide
Apache Spark is an excellent tool to accelerate your analytics, whether you’re doing ETL, Machine Learning, or Data Warehousing. However, to really make the most of Spark it pays to understand best practices for data storage, file formats, and query optimization. This talk will cover best practices I’ve applied over years in the field helping customers write Spark applications as well as identifying what patterns make sense for your use case. Session hashtag: #EUdev5
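
As a small illustration of the storage-layout side of those best practices, here is a hedged sketch of writing partitioned Parquet and reading it back with partition pruning and filter push-down (paths and column names are invented):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("storage-layout-sketch").getOrCreate()

// Hypothetical raw events; write them as Parquet partitioned by date so that
// queries filtering on event_date only read the matching directories.
val events = spark.read.json("/data/raw/events")   // made-up input path
events.write
  .mode("overwrite")
  .partitionBy("event_date")
  .parquet("/data/curated/events")

// Partition pruning plus Parquet column pruning and filter push-down on read.
val recent = spark.read.parquet("/data/curated/events")
  .filter(col("event_date") === "2017-10-25")
  .select("user_id", "event_type")
recent.explain()   // inspect the physical plan to confirm pruning happened
```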

Optimal Strategies for Large-Scale Batch ETL Jobs

by Emma Tang, Neustar
video, slide
The ad tech industry processes large volumes of pixel and server-to-server data for each online user’s clicks, impressions, and conversions. At Neustar, we process 10+ billion events per day, and all of our events are fed through a number of Spark ETL batch jobs. Many of our Spark jobs process over 100 terabytes of data per run, and each job runs to completion in around 3.5 hours. This means we needed to optimize our jobs in specific ways to achieve massive parallelization while keeping memory usage (and cost) as low as possible. Our talk is focused on strategies for dealing with extremely large data. We will talk about the things we learned and the mistakes we made. This includes: optimizing memory usage using Ganglia; optimizing partition counts for different types of stages and effective joins; counterintuitive strategies for materializing data to maximize efficiency; Spark default settings that matter specifically for large-scale jobs; running Spark on Amazon EMR with more than 3200 cores; reviewing the different types of errors and stack traces that occur in large-scale jobs and how to read and handle them; how to deal with the large amount of map output status data when 100k partitions are joined with 100k partitions; how to prevent serialization buffer overflow as well as map output status buffer overflow, which can easily happen when data is extremely large; and how to effectively use partitioners to combine stages and minimize shuffle. Session hashtag: #EUdev3
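
A couple of the points above can be sketched in code; the property names are standard Spark settings, but the values below are illustrative only, not the ones used at Neustar:

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("large-etl-sketch")
  .config("spark.sql.shuffle.partitions", "100000")    // large jobs need far more than the default 200
  .config("spark.kryoserializer.buffer.max", "512m")   // guard against serialization buffer overflow
  .getOrCreate()
val sc = spark.sparkContext

// Pre-partitioning both sides with the same partitioner lets the join reuse the
// existing partitioning instead of shuffling both datasets again.
val partitioner = new HashPartitioner(1000)
val left  = sc.parallelize(0 until 100000).map(k => (k, "left")).partitionBy(partitioner)
val right = sc.parallelize(0 until 100000).map(k => (k, "right")).partitionBy(partitioner)
val joined = left.join(right)   // co-partitioned join avoids a second full shuffle
println(joined.count())
```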

Storage Engine Considerations for Your Apache Spark Applications

by Mladen Kovacevic, Cloudera
video, slide
You have the perfect use case for your Spark applications, whether it be batch processing or super-fast near-real-time streaming. Now, where to store your valuable data? In this talk we take a look at four storage options: HDFS, HBase, Solr, and Kudu. With so many to choose from, which will fit your use case? What considerations should be taken into account? What are the pros and cons, what are the similarities and differences, and how do they fit in with your Spark application? Learn the answers to these questions and more with a look at design patterns and techniques, and sample code to integrate into your application immediately. Walk away with the confidence to propose the right architecture for your use cases and the development know-how to implement and deliver it with success. Session hashtag: #EUdev10

Supporting Highly Multitenant Spark Notebook Workloads: Best Practices and Useful Patches

by Brad Kaiser, IBM
video, slide
Notebooks: they enable our users, but they can cripple our clusters. Let’s fix that. Notebooks have soared in popularity at companies world-wide because they provide an easy, user-friendly way of accessing the cluster-computing power of Spark. But the more users you have hitting a cluster, the harder it is to manage the cluster resources as big, long-running jobs start to starve out small, short-running jobs. While you could have users spin up EMR-style clusters, this reduces the ability to take advantage of the collaborative nature of notebooks. It also quickly becomes expensive as clusters sit idle for long periods of time waiting on single users. What we want is fair, efficient resource utilization on a large single cluster for a large number of users. In this talk we’ll discuss dynamic allocation and the best practices for configuring the current version of Spark as-is to help solve this problem. We’ll also present new improvements we’ve made to address this use case. These include: decommissioning executors without losing cached data, proactively shutting down executors to prevent starvation, and improving the start times of new executors. Session hashtag: #EUdev8
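
For reference, here is a minimal sketch of the stock dynamic-allocation settings that the talk starts from before its patches; the property names are standard Spark 2.x, the values are placeholders:

```scala
import org.apache.spark.sql.SparkSession

// A minimal sketch of dynamic allocation for a shared notebook cluster; it
// requires the external shuffle service to be running on the worker nodes.
val spark = SparkSession.builder()
  .appName("notebook-session")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "1")
  .config("spark.dynamicAllocation.maxExecutors", "50")
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
  // Executors holding cached data are kept around longer before being released.
  .config("spark.dynamicAllocation.cachedExecutorIdleTimeout", "30m")
  .getOrCreate()
```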

Original author: 大数据技术峰会解读
Source: https://zhuanlan.zhihu.com/p/30844422