Spark Summit Europe 2017 Session Digest (Engineering Track)

To download all the videos and slides, follow the WeChat public account (bigdata_summit) and select the "Video Download" menu.

Apache Spark Pipelines in the Cloud with Alluxio

by Gene Pang, Alluxio, Inc.
video, slide
Organizations commonly use Apache Spark to gain actionable insight from their large amounts of data. Often, these analytics take the form of data processing pipelines, where a series of processing stages each perform a particular function and the output of one stage is the input of the next. There are several examples of such pipelines, including log processing, IoT pipelines, and machine learning. The common attribute among different pipelines is the sharing of data between stages. It is also common for Spark pipelines to process data stored in the public cloud, such as Amazon S3, Microsoft Azure Blob Storage, or Google Cloud Storage. The global availability and cost effectiveness of these public cloud storage services make them the preferred storage for data. However, running pipeline jobs while sharing data via cloud storage can be expensive in terms of increased network traffic, slower data sharing, and longer job completion times. Using Alluxio, a memory-speed virtual distributed storage system, enables sharing data between different stages or jobs at memory speed. By reading and writing data in Alluxio, the data can stay in memory for the next stage of the pipeline, resulting in significant performance gains. In this talk, we discuss how Alluxio can be deployed and used with a Spark data processing pipeline in the cloud. We show how pipeline stages can share data through Alluxio memory for improved performance, and how Alluxio can improve completion times and reduce performance variability for Spark pipelines in the cloud. Session hashtag: #EUde5
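
To make the data-sharing pattern concrete, below is a minimal Spark (Scala) sketch of two pipeline stages exchanging an intermediate dataset through Alluxio instead of going back to cloud storage. The bucket, Alluxio master address, and paths are hypothetical, and the Alluxio client jar must be on the Spark classpath for the alluxio:// scheme to resolve.

```scala
import org.apache.spark.sql.SparkSession

object AlluxioPipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("alluxio-pipeline").getOrCreate()

    // Stage 1: read raw logs from cloud storage, clean them, and write the
    // intermediate result to Alluxio so it can stay in memory for the next stage.
    val raw = spark.read.json("s3a://my-bucket/raw-logs/")            // hypothetical bucket
    val cleaned = raw.filter("status IS NOT NULL")
    cleaned.write.mode("overwrite")
      .parquet("alluxio://alluxio-master:19998/pipeline/cleaned")     // hypothetical Alluxio path

    // Stage 2 (possibly a separate job): read the intermediate data back from
    // Alluxio at memory speed instead of re-fetching it from cloud storage.
    val stageTwoInput = spark.read
      .parquet("alluxio://alluxio-master:19998/pipeline/cleaned")
    stageTwoInput.groupBy("status").count().show()

    spark.stop()
  }
}
```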

Beyond Unit Tests: End-to-End Testing for Spark Workflows

by Anant Nag, LinkedIn
video, slide
As a Spark developer, do you want to quickly develop your Spark workflows? Do you want to test your workflows in a sandboxed environment similar to production? Do you want to write end-to-end tests for your workflows and add assertions on top of them? In just a few years, the number of users writing Spark jobs at LinkedIn has grown from tens to hundreds, and the number of jobs running every day has grown from hundreds to thousands. With the ever-increasing number of users and jobs, it becomes crucial to reduce the development time for these jobs. It is also important to test these jobs thoroughly before they go to production. Currently, there is no way for users to test their Spark jobs end-to-end; the only option is to divide the Spark jobs into functions and unit-test those functions. We've tried to address these issues by creating a testing framework for Spark workflows. The testing framework enables users to run their jobs in an environment similar to the production environment, on data sampled from the original data. The framework consists of a test deployment system, a data generation pipeline to generate the sampled data, a data management system to help users manage and search the sampled data, and an assertion engine to validate the test output. In this talk, we will discuss the motivation behind the testing framework before diving deep into its design. We will further discuss how the testing framework is helping Spark users at LinkedIn be more productive. Session hashtag: #EUde12
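
LinkedIn's framework itself is not public, but as a rough sketch of the end-to-end idea: factor the job logic into a callable function, run it against sampled input in a local SparkSession, and validate the output with assertions. All names, columns, and data below are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

// Hypothetical job logic, factored into a function so it can be exercised both
// in production and from a test harness.
object SessionCountJob {
  def transform(events: DataFrame): DataFrame =
    events.groupBy("memberId").agg(count("*").as("eventCount"))
}

// End-to-end style test: run the whole transformation against sampled input in
// a local SparkSession and assert on the output, instead of only unit-testing
// individual helper functions.
object SessionCountJobE2ETest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("session-count-e2e-test")
      .getOrCreate()
    import spark.implicits._

    // Sampled input standing in for production data.
    val sampledEvents = Seq(("m1", "click"), ("m1", "view"), ("m2", "click"))
      .toDF("memberId", "eventType")

    val output = SessionCountJob.transform(sampledEvents)

    // Assertions on the end-to-end output.
    assert(output.count() == 2, "expected one row per member")
    assert(output.filter($"memberId" === "m1").head().getLong(1) == 2)

    spark.stop()
  }
}
```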

High Performance Enterprise Data Processing with Apache Spark

by Sandeep Varma, ZS Associates
video, slide
Data engineering to support reporting and analytics for commercial life sciences groups consists of very complex, interdependent processing with highly complex business rules (thousands of transformations on hundreds of data sources). We will talk about our experiences building a very high-performance data processing platform powered by Spark that balances the considerations of extreme performance, speed of development, and cost of maintenance. We will touch upon optimizing enterprise-grade Spark architecture for data warehousing and data mart applications, optimizing end-to-end pipelines for extreme performance, running hundreds of jobs in parallel in Spark, orchestrating across multiple Spark clusters, and some guidelines for high-speed platform and application development within enterprises. Key takeaways: an example architecture for complex data warehousing and data mart applications on Spark; an architecture for building high-performance Spark platforms for enterprises that balances functionality with total cost of ownership; orchestrating multiple elastic Spark clusters while running hundreds of jobs in parallel; and the business benefits of high-performance data engineering, especially for life sciences. Session hashtag: #EUde3
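
One technique mentioned above, running many independent jobs in parallel inside a single Spark application, is commonly done with the FAIR scheduler plus concurrent action submission. A minimal sketch under those assumptions follows; the table names, paths, and pool names are made up.

```scala
import org.apache.spark.sql.SparkSession
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object ParallelJobsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parallel-jobs")
      .config("spark.scheduler.mode", "FAIR")     // allow concurrent jobs to share executors fairly
      .getOrCreate()

    val tables = Seq("sales", "claims", "calls")  // hypothetical source tables

    // Submit each job's action from its own Future so they run concurrently.
    val jobs = tables.map { table =>
      Future {
        // Each concurrent job can be assigned to its own scheduler pool.
        spark.sparkContext.setLocalProperty("spark.scheduler.pool", table)
        spark.read.parquet(s"s3a://warehouse/$table")          // hypothetical input path
          .groupBy("region").count()
          .write.mode("overwrite")
          .parquet(s"s3a://marts/${table}_summary")            // hypothetical output path
      }
    }

    Await.result(Future.sequence(jobs), Duration.Inf)
    spark.stop()
  }
}
```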

How to Share State Across Multiple Apache Spark Jobs using Apache Ignite

by Akmal Chaudhri, GridGain
video, slide
Attend this session to learn how to easily share state in memory across multiple Spark jobs, either within the same application or between different Spark applications, using the implementation of the Spark RDD abstraction provided in Apache Ignite. During the talk, attendees will learn in detail how IgniteRDD, an implementation of the native Spark RDD and DataFrame APIs, shares the state of the RDD across other Spark jobs, applications and workers. Examples will show how IgniteRDD, with its advanced in-memory indexing capabilities, allows execution of SQL queries many times faster than native Spark RDDs or DataFrames. Session hashtag: #EUde9
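
As a rough illustration (not the speaker's exact code), the sketch below uses Ignite's Spark integration to write pairs into a named Ignite cache via IgniteRDD and read them back in a later job. The cache name and the default IgniteConfiguration are assumptions, and the ignite-spark module must be on the classpath.

```scala
import org.apache.ignite.configuration.IgniteConfiguration
import org.apache.ignite.spark.IgniteContext
import org.apache.spark.sql.SparkSession

object IgniteSharedStateSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ignite-shared-state").getOrCreate()
    val sc = spark.sparkContext

    // IgniteContext starts (or connects to) Ignite nodes alongside the Spark workers.
    val igniteContext = new IgniteContext(sc, () => new IgniteConfiguration())

    // Job 1: write results into a named Ignite cache so they outlive this job.
    val sharedRdd = igniteContext.fromCache[String, Int]("sharedState")
    sharedRdd.savePairs(sc.parallelize(1 to 1000).map(i => (s"key-$i", i)))

    // Job 2 (could equally be a separate Spark application pointing at the same
    // cache): read the shared state back and continue processing.
    val readBack = igniteContext.fromCache[String, Int]("sharedState")
    println(readBack.filter(_._2 > 990).count())

    igniteContext.close(true)   // shut down the Ignite nodes started for this sketch
    spark.stop()
  }
}
```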

Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production

by Brandon Carl, Facebook
video, slide
With more than 700 million monthly active users, Instagram continues to make it easier for people across the globe to join the community, share their experiences, and strengthen connections to their friends and passions. Powering Instagram's various products requires the use of machine learning, high-performance ranking services, and most importantly large amounts of data. At Instagram, we use Apache Spark for several critical production pipelines, including generating labeled training data for our machine learning models. In this session, you'll learn how one of Instagram's largest Spark pipelines has evolved over time in order to process ~300 TB of input and ~90 TB of shuffle data. We'll discuss the experience of building and managing such a large production pipeline and some tips and tricks we've learned along the way for managing Spark at scale. Topics include migrating from RDD to Dataset for better memory efficiency, splitting up long-running pipelines in order to better tune intermediate shuffle data, and dealing with data skew that changes over time. Finally, we will also go over some optimizations we have made in order to maintain the reliability of this critical data pipeline. Session hashtag: #EUde0
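
To illustrate the RDD-to-Dataset migration mentioned above, here is a small sketch showing the same aggregation written both ways; Datasets let Spark keep rows in its compact encoded (Tungsten) format rather than as JVM objects, which typically reduces memory pressure. The schema and data are made up.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type used by both versions.
case class Impression(userId: Long, mediaId: Long, liked: Boolean)

object RddToDatasetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("rdd-to-dataset")
      .getOrCreate()
    import spark.implicits._

    // RDD version: rows live on the JVM heap as full Java objects.
    val rdd = spark.sparkContext.parallelize(Seq(
      Impression(1L, 10L, liked = true),
      Impression(2L, 10L, liked = false)
    ))
    val likedPerMediaRdd = rdd.filter(_.liked).map(i => (i.mediaId, 1L)).reduceByKey(_ + _)

    // Dataset version: same logic, but rows are stored in Spark's encoded
    // binary format and the optimizer can see the operations.
    val ds = rdd.toDS()
    val likedPerMediaDs = ds.filter(_.liked).groupByKey(_.mediaId).count()

    likedPerMediaRdd.collect().foreach(println)
    likedPerMediaDs.show()

    spark.stop()
  }
}
```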

Real-Time Detection of Anomalies in the Database Infrastructure using Apache Spark

by Daniel Lanza, CERN
video, slide
At CERN, the biggest physics laboratory in the world, large volumes of data are generated every hour, which poses serious challenges for storing and processing all of it. An important part of this responsibility falls to the database group, which provides not only RDBMS services but also scalable systems such as Hadoop, Spark and HBase. Since the databases are critical, they need to be monitored; to that end we have built a highly scalable, secure, central repository that stores consolidated audit data and the listener, alert and OS log events generated by the databases. This central platform is used for reporting, alerting and security policy management. The database group wants to further exploit the information available in this central repository to build an intrusion detection system and enhance the security of the database infrastructure, and additionally to build pattern detection models that flush out anomalies using the monitoring and performance metrics available in the central repository. Finally, this platform also helps us with capacity planning for the database deployment. The audience will get first-hand experience of how to build a real-time Apache Spark application that is deployed in production. They will hear about the challenges faced and the decisions taken while developing the application and troubleshooting Apache Spark and Spark Streaming applications in production. Session hashtag: #EUde13
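
The actual CERN pipeline is not public, but a minimal Structured Streaming sketch in the same spirit might read database audit events from Kafka, count failed logins per user in short windows, and flag users above a threshold. The topic, message format, fields, and threshold below are all assumptions, and the Spark Kafka source connector is assumed to be on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object AuditAnomalySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("audit-anomaly-detection").getOrCreate()
    import spark.implicits._

    // Read raw audit events from a Kafka topic (hypothetical broker and topic).
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "db-audit-events")
      .load()
      .selectExpr("CAST(value AS STRING) AS raw", "timestamp")

    // Assume one event per message in the form "<user>,<action>".
    val parsed = events
      .withColumn("user", split($"raw", ",").getItem(0))
      .withColumn("action", split($"raw", ",").getItem(1))

    // Flag users with too many failed logins within a 5-minute window.
    val suspicious = parsed
      .filter($"action" === "LOGON_FAILED")
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window($"timestamp", "5 minutes"), $"user")
      .count()
      .filter($"count" > 20)   // hypothetical alert threshold

    val query = suspicious.writeStream
      .outputMode("update")
      .format("console")       // in production this would feed an alerting sink
      .start()

    query.awaitTermination()
  }
}
```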

The State of Apache Spark in the Cloud

by Nicolas Poggi, Barcelona Super Computing
video, slide
Cloud providers currently offer convenient on-demand managed big data clusters (PaaS) with a pay-as-you-go model. In PaaS, analytical engines such as Spark and Hive come ready to use, with general-purpose configuration and upgrade management. Over the last year, the Spark framework and APIs have been evolving very rapidly, with major improvements in performance and the release of v2, making it challenging to keep up-to-date production services both on-premises and in the cloud for compatibility and stability. Nicolas Poggi evaluates the out-of-the-box support for Spark and compares the offerings, reliability, scalability, and price-performance of major PaaS providers, including Azure HDInsight, Amazon Web Services EMR, and Google Dataproc, with an on-premises commodity cluster as a baseline. Nicolas uses BigBench, the new standard (TPCx-BB) for big data systems, with both Spark and Hive implementations for benchmarking the systems. BigBench combines SQL queries, MapReduce, user code (UDFs), and machine learning, which makes it ideal for stressing Spark libraries (Spark SQL, DataFrames, MLlib, etc.). The work is framed within the ALOJA research project, which features an open source benchmarking and analysis platform that has recently been extended to support SQL-on-Hadoop engines and BigBench. The ALOJA project aims to lower the total cost of ownership (TCO) of big data deployments and study their performance characteristics for optimization. Nicolas highlights how advanced users can easily repeat the benchmarks through ALOJA and benefit from BigBench to optimize their Spark clusters. The work is a continuation of a paper to be published at the IEEE Big Data 16 conference. Session hashtag: #EUde6

Using Apache Spark in the Cloud—A Devops Perspective

by Telmo Oliveira, Toon
video, slide
Toon is a leading brand in the European smart energy market, currently expanding internationally, providing energy usage insights, eco-friendly energy management and smart thermostat use for the connected home. As value-added services become ever more relevant in this market, we need to ensure that we can easily and safely on-board new tenants onto our data platform. In this talk we're going to guide you through a less-discussed side of using Spark in production: devops. We will speak about our journey from an on-premise cluster to a managed solution in the cloud. A lot of moving parts were involved: ETL flows, data sharing with third parties, and data migration to the new environment. Add to this the need to have a multi-tenant environment, revamp our toolset, and deploy a live, public-facing service. It's easy to find great examples of how Spark is used for data-science purposes; on the data engineering side, we need to deploy production services, ensure data is cleaned, secured and available, and keep the data-science teams happy. We'd like to share some of the choices we made and some of the lessons learned from this (ongoing) transition. Session hashtag: #EUde10

Working with Skewed Data: The Iterative Broadcast

by Fokko Driesprong, GoDataDriven
video, slide
Skewed data is the enemy when joining tables using Spark. It shuffles a large proportion of the data onto a few overloaded nodes, bottlenecking Spark's parallelism and resulting in out-of-memory errors. The go-to answer is to use broadcast joins, leaving the large, skewed dataset in place and transmitting a smaller table to every machine in the cluster for joining. But what happens when your second table is too large to broadcast and does not fit into memory? Or even worse, when a single key is bigger than the total size of your executor? First, we will give an introduction to the problem. Second, we will explain the current ways of fighting the problem, including why these solutions are limited. Finally, we will demonstrate a new technique, the iterative broadcast join, developed while processing ING Bank's global transaction data. This technique, implemented on top of the Spark SQL API, allows multiple large and highly skewed datasets to be joined successfully while retaining a high level of parallelism. This is something that is not possible with existing Spark join types. Session hashtag: #EUde11
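
A compact sketch of the iterative broadcast join idea on the Spark SQL API (not necessarily the authors' exact implementation): hash the join key to split the too-big-to-broadcast table into slices, broadcast-join each slice against the large skewed table, and union the partial results. Table names and the number of passes are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object IterativeBroadcastJoinSketch {

  def iterativeBroadcastJoin(large: DataFrame,
                             medium: DataFrame,
                             key: String,
                             passes: Int): DataFrame = {
    // Assign each row of the medium table to one of `passes` slices by hashing the key.
    val sliced = medium.withColumn("__pass", pmod(hash(col(key)), lit(passes)))
    (0 until passes)
      .map { p =>
        // Each slice is small enough to broadcast, so the big skewed table is
        // never shuffled on the join key.
        large.join(broadcast(sliced.filter(col("__pass") === p).drop("__pass")), key)
      }
      .reduce(_ union _)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("iterative-broadcast").getOrCreate()
    import spark.implicits._

    val transactions = Seq((1, 100.0), (1, 50.0), (2, 75.0)).toDF("accountId", "amount") // skewed in practice
    val accounts = Seq((1, "NL"), (2, "BE")).toDF("accountId", "country")

    iterativeBroadcastJoin(transactions, accounts, "accountId", passes = 2).show()
  }
}
```

Because only small slices are ever broadcast, hot keys in the large table no longer concentrate their data on a few executors, which is what preserves parallelism in this approach.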

    Original author: 大数据技术峰会解读
    Original source: https://zhuanlan.zhihu.com/p/30772353