Spark Summit Europe 2017 Talk Summaries (Spark Ecosystem Category)

To download all of the videos and slides, follow the WeChat official account (bigdata_summit) and click the "Video Download" menu.

A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP

by Artem Aliev, DataStax
video,
Graph is on the rise and it’s time to start learning about scalable graph analytics! In this session we will go over two Spark-based graph analytics frameworks: Tinkerpop and GraphFrames. While both frameworks can express very similar traversals, they have different performance characteristics and APIs. In this deep-dive-by-example presentation, we will demonstrate some common traversals and explain how, at the Spark level, each traversal is actually computed under the hood! Learn both the fluent Gremlin API and the powerful GraphFrame motif API as we show examples of both side by side. No need to be familiar with graphs or Spark for this presentation, as we’ll be explaining everything from the ground up! Session hashtag: #EUeco3
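
To make the GraphFrame side concrete, here is a minimal motif query in Scala. This is a sketch using the public GraphFrames API (it assumes the graphframes package is on the classpath); the toy vertices, edges and column values are invented for illustration, and the equivalent Gremlin traversal from the talk is not reproduced here.

```scala
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

val spark = SparkSession.builder().appName("motif-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Toy graph: people and "follows" relationships (illustrative data only).
val vertices = Seq(("a", "Alice"), ("b", "Bob"), ("c", "Carol")).toDF("id", "name")
val edges = Seq(("a", "b", "follows"), ("b", "a", "follows"), ("b", "c", "follows"))
  .toDF("src", "dst", "relationship")

val g = GraphFrame(vertices, edges)

// Motif: pairs of vertices that follow each other in both directions.
val mutual = g.find("(x)-[e1]->(y); (y)-[e2]->(x)")
mutual.show()
```

Under the hood a motif like this expands into a sequence of DataFrame joins, which is exactly the kind of physical execution the talk walks through.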

Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark

by Emily Curtin, The Weather Company / IBM
video, slide
spark-bench is an open-source benchmarking tool, and it’s also so much more. spark-bench is a flexible system for simulating, comparing, testing, and benchmarking Spark applications and Spark itself. spark-bench originally began as a benchmarking suite to get timing numbers on very specific algorithms, mostly in the machine learning domain. Since then it has morphed into a highly configurable and flexible framework suitable for many use cases. This talk will discuss the high-level design and capabilities of spark-bench before walking through some major, practical use cases. Use cases include, but are certainly not limited to: regression testing changes to Spark; comparing performance of different hardware and Spark tuning options; simulating multiple notebook users hitting a cluster at the same time; comparing parameters of a machine learning algorithm on the same set of data; providing insight into bottlenecks through use of compute-intensive and I/O-intensive workloads; and, yes, even benchmarking. In particular, this talk will address the use of spark-bench in developing new features for Spark core. Session hashtag: #EUeco8
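
spark-bench itself is driven by configuration files rather than code, so the sketch below is not spark-bench’s API; it only illustrates, in plain Scala, the kind of measurement the tool automates for one of the use cases above: timing the same workload under two different Spark tuning options. The workload, option values and app names are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Time one synthetic aggregation under a given shuffle-partition setting.
// Purely illustrative; spark-bench expresses this kind of comparison declaratively.
def timeWorkload(shufflePartitions: Int): Long = {
  val spark = SparkSession.builder()
    .appName(s"tuning-comparison-$shufflePartitions")
    .master("local[*]")
    .config("spark.sql.shuffle.partitions", shufflePartitions.toString)
    .getOrCreate()
  import spark.implicits._

  val start = System.nanoTime()
  spark.range(0, 10000000L).toDF("id")
    .groupBy(($"id" % 1000).as("bucket"))
    .count()
    .collect()                       // force execution
  val elapsedMs = (System.nanoTime() - start) / 1000000L
  spark.stop()
  elapsedMs
}

Seq(50, 200).foreach { p =>
  println(s"spark.sql.shuffle.partitions=$p -> ${timeWorkload(p)} ms")
}
```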

Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBase through Spark SQL

by Weiqing Yang, Hortonworks
video, slide
Both Spark and HBase are widely used, but how to use them together with high performance and simplicity is a very challenging topic. The Spark HBase Connector (SHC) provides feature-rich and efficient access to HBase through Spark SQL. It bridges the gap between the simple HBase key-value store and complex relational SQL queries, and enables users to perform complex data analytics on top of HBase using Spark. SHC implements the standard Spark data source APIs and leverages the Spark Catalyst engine for query optimization. To achieve high performance, SHC constructs the RDD from scratch instead of using the standard HadoopRDD. With the customized RDD, all critical techniques can be applied and fully implemented, such as partition pruning, column pruning, predicate pushdown and data locality. The design makes maintenance easy, while achieving a good tradeoff between performance and simplicity. In addition to fully supporting all the Avro schemas natively, SHC has also integrated natively with Phoenix data types. With SHC, Spark can execute batch jobs to read/write data from/into Phoenix tables. Phoenix can also read/write data from/into HBase tables created by SHC. For example, users can run a complex SQL query on top of an HBase table created by Phoenix inside Spark, perform a table join against a DataFrame which reads the data from a Hive table, or integrate with Spark Streaming to implement a more complicated system. In this talk, apart from explaining why SHC is of great use, we will also demo how SHC works, how to use SHC in secure/non-secure clusters, how SHC works with multiple secure HBase clusters, etc. This talk will also benefit people who use Spark and other data sources (besides HBase), as it inspires them with ideas of how to support high-performance data source access at the Spark DataFrame level. Session hashtag: #EUeco7
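
For orientation, a typical SHC read looks roughly like the sketch below. The catalog JSON maps DataFrame columns onto an HBase row key and column families; the table, families and columns here are invented, and the data source class name follows SHC’s documentation, so treat the details as assumptions for your particular build.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

val spark = SparkSession.builder().appName("shc-demo").getOrCreate()

// Catalog for an (invented) HBase table "Contacts": row key plus two
// columns in column family "Office", exposed as a three-column DataFrame.
val catalog =
  s"""{
     |  "table":   {"namespace":"default", "name":"Contacts"},
     |  "rowkey":  "key",
     |  "columns": {
     |    "id":      {"cf":"rowkey", "col":"key",     "type":"string"},
     |    "name":    {"cf":"Office", "col":"Name",    "type":"string"},
     |    "address": {"cf":"Office", "col":"Address", "type":"string"}
     |  }
     |}""".stripMargin

val contacts = spark.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()

// Column pruning and predicate pushdown are handled by SHC's custom RDD.
contacts.select("id", "name").filter(contacts("id") === "42").show()
```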

Best Practices for Using Alluxio with Apache Spark

by Gene Pang, Alluxio, Inc.
video, slide
Alluxio, formerly Tachyon, is a memory-speed virtual distributed storage system that leverages memory for storing data and accelerating access to data in different storage systems. Many organizations and deployments use Alluxio with Apache Spark, and some of them scale out to petabytes of data. Alluxio can enable Spark to be even more effective, in both on-premise and public cloud deployments. Alluxio bridges Spark applications with various storage systems and further accelerates data-intensive applications. In this talk, we briefly introduce Alluxio and present different ways Alluxio can help Spark jobs. We discuss best practices for using Alluxio with Spark, including RDDs and DataFrames, as well as on-premise deployments and public cloud deployments. Session hashtag: #EUeco2
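
One of the recurring best practices is to keep shared or hot datasets in Alluxio and have Spark read and write them through an alluxio:// path, rather than re-caching large DataFrames on the JVM heap in every job. A minimal sketch, assuming the Alluxio client jar is on Spark’s classpath and with placeholder master address and paths:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("alluxio-demo").getOrCreate()

// Placeholder Alluxio master address and dataset paths.
val input  = "alluxio://alluxio-master:19998/datasets/events.parquet"
val output = "alluxio://alluxio-master:19998/datasets/events_errors.parquet"

// Reads go through Alluxio, so data hot in Alluxio memory is served at
// memory speed and survives across separate Spark applications.
val events = spark.read.parquet(input)

events.filter(events("status") === "ERROR")
  .write.mode("overwrite").parquet(output)
```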

Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spark Job-Server

by Arvind Heda, InfoEdge Ltd.
video, slide
Kapil Malik and Arvind Heda will discuss a solution for interactive querying of large-scale structured data, stored in a distributed file system (HDFS/S3), in a scalable and reliable manner using a unique combination of Spark SQL, Apache Zeppelin and Spark Job-server (SJS) on YARN. The solution is production tested and can cater to thousands of queries processing terabytes of data every day. It contains the following components:
1. Zeppelin server: a custom interpreter is deployed, which decouples the Spark context from the user notebooks. It connects to the remote Spark context on Spark Job-server. A rich set of APIs is exposed for the users. The user input is parsed, validated and executed remotely on SJS.
2. Spark Job-server: a custom application is deployed, which implements the set of APIs exposed by the Zeppelin custom interpreter as one or more Spark jobs.
3. Context router: it routes different user queries from the custom interpreter to one of many Spark Job-servers / contexts.
The solution has the following characteristics:
* Multi-tenancy: there are hundreds of users, each having one or more Zeppelin notebooks. All these notebooks connect to the same set of Spark contexts for running a job.
* Fault tolerance: the notebooks do not use the Spark interpreter, but a custom interpreter connecting to a remote context. If one Spark context fails, the context router sends user queries to another context.
* Load balancing: the context router identifies which contexts are under heavy load or responding slowly, and selects the most suitable context for serving a user query (an illustrative routing sketch follows below).
* Efficiency: we use Alluxio for caching common datasets.
* Elastic resource usage: we use Spark dynamic allocation for the contexts. This ensures that cluster resources are blocked by this application only when it is doing actual work.
Session hashtag: #EUeco9
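
The routing idea can be pictured with a small, entirely illustrative sketch: keep a registry of Spark Job-server contexts with a health flag and a load signal, and dispatch each query to the least loaded healthy one. The class and field names below are invented; Indicium’s actual implementation is not described at this level of detail in the abstract.

```scala
// Illustrative context router for SJS contexts; names and fields are invented.
final case class ContextEndpoint(name: String, url: String,
                                 healthy: Boolean, runningJobs: Int)

class ContextRouter(contexts: Seq[ContextEndpoint]) {
  /** Pick the healthiest, least busy context for a query, if any exists.
    * A real router would also track response latency and per-user affinity. */
  def route(query: String): Option[ContextEndpoint] =
    contexts.filter(_.healthy).sortBy(_.runningJobs).headOption
}

val router = new ContextRouter(Seq(
  ContextEndpoint("ctx-1", "http://sjs-1:8090", healthy = true,  runningJobs = 7),
  ContextEndpoint("ctx-2", "http://sjs-2:8090", healthy = true,  runningJobs = 2),
  ContextEndpoint("ctx-3", "http://sjs-3:8090", healthy = false, runningJobs = 0)
))

// Routes to ctx-2; returns None if no context is healthy.
println(router.route("SELECT count(*) FROM events"))
```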

Running Spark Inside Docker Containers: From Workload to Cluster

by Haohai Ma, IBM
video, slide
This presentation describes the journey we went through in containerizing Spark workloads into multiple elastic Spark clusters in a multi-tenant Kubernetes environment. Initially we deployed Spark binaries onto a host-level filesystem, so that the Spark drivers, executors and master can transparently migrate to run inside a Docker container by automatically mounting host-level volumes. In this environment, we do not need to prepare a specific Spark image in order to run Spark workloads in containers. We then utilized Kubernetes Helm charts to deploy a Spark cluster. The administrator can further create a Spark instance group for each tenant. A Spark instance group, which is akin to the Spark notion of a tenant, is logically an independent kingdom for a tenant’s Spark applications, in which they own dedicated Spark masters, history server, shuffle service and notebooks. Once a Spark instance group is created, it automatically generates its image and commits it to a specified repository. Meanwhile, from Kubernetes’ perspective, each Spark instance group is a first-class deployment and thus the administrator can scale its size up or down according to the tenant’s SLA and demand. In a cloud-based data center, each Spark cluster can provide Spark as a service while sharing the Kubernetes cluster. Each tenant that is registered into the service gets a fully isolated Spark instance group. In an on-prem Kubernetes cluster, each Spark cluster can map to a business unit, and thus each user in the BU can get a dedicated Spark instance group. The next step on this journey will address resource sharing across Spark instance groups by leveraging new Kubernetes features (Kubernetes31068/9), as well as elastic workload containers depending on job demands (Spark18278). Demo: https://www.youtube.com/watch?v=eFYu6o3-Ea4&t=5s Session hashtag: #EUeco5

SMACK Stack and Beyond—Building Fast Data Pipelines

by Jorg Schad, Mesosphere
video, slide
There is an ever-increasing number of use cases, like online fraud detection, for which the response times of traditional batch processing are too slow. In order to react to such events in close to real time, you need to go beyond classical batch processing and utilize stream processing systems such as Apache Spark Streaming, Apache Flink, or Apache Storm. These systems, however, are not sufficient on their own. For an efficient and fault-tolerant setup, you also need a message queue and a storage system. One common example of setting up a fast data pipeline is the SMACK stack. SMACK stands for:
* Spark (Streaming) – the stream processing system
* Mesos – the cluster orchestrator
* Akka – the system providing custom actors for reacting to the analyses
* Cassandra – the storage system
* Kafka – the message queue
Setting up this kind of pipeline in a scalable, efficient and fault-tolerant manner is not trivial. First, this workshop will discuss the different components in the SMACK stack. Then, participants will get hands-on experience in setting up and maintaining data pipelines. Session hashtag: #EUeco1
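
To make the stack concrete, a skeletal fast-data job that reads events from Kafka with Spark Streaming and writes flagged records to Cassandra might look like the sketch below. It assumes the spark-streaming-kafka-0-10 and spark-cassandra-connector artifacts; broker, topic, keyspace and table names are placeholders, and the Akka and Mesos pieces of the stack are omitted.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import com.datastax.spark.connector._              // SomeColumns and writer implicits
import com.datastax.spark.connector.streaming._    // saveToCassandra on DStreams

val conf = new SparkConf()
  .setAppName("smack-fraud-pipeline")
  .set("spark.cassandra.connection.host", "cassandra")   // placeholder host
val ssc = new StreamingContext(conf, Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "kafka:9092",                   // placeholder broker
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "fraud-detector",
  "auto.offset.reset"  -> "latest")

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("transactions"), kafkaParams))

// Minimal "analysis": records are "accountId,amount"; flag large amounts.
stream.map(_.value.split(","))
  .filter(f => f.length == 2 && f(1).toDouble > 10000.0)
  .map(f => (f(0), f(1).toDouble))
  .saveToCassandra("fraud", "suspicious_transactions", SomeColumns("account_id", "amount"))

ssc.start()
ssc.awaitTermination()
```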

Testing Apache Spark—Avoiding the Fail Boat Beyond RDDs

by Holden Karau, IBM
video,
As Spark continues to evolve, we need to revisit our testing techniques to support Datasets, streaming, and more. This talk expands on “Beyond Parallelize and Collect” (not required to have been seen) to discuss how to create large-scale test jobs while supporting Spark’s latest features. We will explore the difficulties of testing streaming programs, options for setting up integration testing with Spark beyond just local mode, and best practices for acceptance tests. Session hashtag: #EUeco4
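
As a baseline, before reaching for a dedicated helper library, the simplest Spark test is an ordinary ScalaTest suite that spins up a local-mode session, runs the transformation, and asserts on collected results. A minimal sketch (the word-count transformation under test is invented for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, split}
import org.scalatest.funsuite.AnyFunSuite

class WordCountSuite extends AnyFunSuite {

  test("word counts are computed per word") {
    // Local-mode session: fine for unit tests, not a substitute for
    // integration tests against a real cluster.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("wordcount-test")
      .getOrCreate()
    import spark.implicits._

    val counts = Seq("spark testing", "spark").toDS()
      .select(explode(split($"value", " ")).as("word"))
      .groupBy("word")
      .count()
      .collect()
      .map(r => r.getString(0) -> r.getLong(1))
      .toMap

    assert(counts("spark") === 2L)
    assert(counts("testing") === 1L)
    spark.stop()
  }
}
```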

Variant-Apache Spark for Bioinformatics

by Piotr Szul, CSIRO Data61
video, slide
This talk will showcase work done by the bioinformatics team at CSIRO in Sydney, Australia, to make Spark more useful and usable for the bioinformatics community. They have created a custom library, variant-spark, which provides a DSL and also a custom implementation of random forests via Spark ML for genomic pipeline processing. We’ve created a demo, using their ‘Hipster-genome’ and a Databricks notebook, to better explain their library to the worldwide bioinformatics community. The notebook also compares results with another popular genomics library (http://HAIL.io). Session hashtag: #EUeco6
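
variant-spark’s own DSL is not reproduced here, but the underlying building block, a random forest trained with Spark ML over a sample-by-feature matrix, looks roughly like this standard Spark ML sketch. The data, labels and column names are placeholders standing in for genomic variants, not the variant-spark API.

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rf-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Placeholder samples: a label (e.g. phenotype) and a small feature vector
// standing in for encoded variants.
val training = Seq(
  (0.0, Vectors.dense(0.0, 1.0, 0.0)),
  (1.0, Vectors.dense(2.0, 0.0, 1.0)),
  (0.0, Vectors.dense(0.0, 0.0, 1.0)),
  (1.0, Vectors.dense(2.0, 1.0, 0.0))
).toDF("label", "features")

val rf = new RandomForestClassifier()
  .setNumTrees(100)
  .setLabelCol("label")
  .setFeaturesCol("features")

val model = rf.fit(training)

// Per-feature importances: the kind of signal that importance analyses
// over variants are built on.
println(model.featureImportances)
```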

    Original author: 大数据技术峰会解读
    Original article: https://zhuanlan.zhihu.com/p/30914194
    This article is reposted from the web and shared only for the purpose of sharing knowledge. If there is any infringement, please contact the blogger for removal.