【Data Flow】The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing

Preface

This is the final installment of the three-part series.

Main text

3. IMPLEMENTATION & DESIGN

3.1 Implementation

We have implemented this model internally in FlumeJava, with MillWheel used as the underlying execution engine for streaming mode; additionally, an external reimplementation for Cloud Dataflow is largely complete at the time of writing. Due to prior characterization of those internal systems in the literature, as well as Cloud Dataflow being publicly available, details of the implementations themselves are elided here for the sake of brevity. One interesting note is that the core windowing and triggering code is quite general, and a significant portion of it is shared across batch and streaming implementations; that system itself is worthy of a more detailed analysis in future work.

3.2 Design Principles

Though much of our design was motivated by the real-world experiences detailed in Section 3.3 below, it was also guided by a core set of principles that we believed our model should embody:
• Never rely on any notion of completeness.
• Be flexible, to accommodate the diversity of known use cases, and those to come in the future.
• Not only make sense, but also add value, in the context of each of the envisioned execution engines.
• Encourage clarity of implementation.
• Support robust analysis of data in the context in which they occurred.
While the experiences below informed specific features of the model, these principles informed the overall shape and character of it, and we believe ultimately led to a more comprehensive and general result.

3.3 Motivating Experiences

As we designed the Dataflow Model, we took into consideration our real-world experiences with FlumeJava and MillWheel over the years. Things which worked well, we made sure to capture in the model; things which worked less well motivated changes in approach. Here are brief summaries of some of these experiences that influenced our design.

3.3.1 Large Scale Backfills & The Lambda Architecture: Unified Model
A number of teams run log joining pipelines on MillWheel. One particularly large log join pipeline runs in streaming mode on MillWheel by default, but has a separate FlumeJava batch implementation used for large scale backfills. A much nicer setup would be to have a single implementation written in a unified model that could run in both streaming and batch mode without modification. This became the initial motivating use case for unification across batch, micro-batch, and streaming engines, and was highlighted in Figures 10−12.
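
Translator's note: the paper expresses its pipelines in FlumeJava-style pseudocode. Purely as an illustrative sketch of the "one implementation for streaming and for backfills" idea, here is the same shape written against the Apache Beam Java SDK, the open-source descendant of the Dataflow Model. The class name, method name, key/value types, and window size are all hypothetical; only the source and the chosen runner would differ between the streaming job and the batch backfill.

```java
// Hypothetical sketch: the same windowed aggregation serves both the streaming
// pipeline and the large-scale batch backfill, without modification.
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class UnifiedLogJoin {
  // `events` may be bounded (files read for a backfill) or unbounded (a live stream);
  // event-time windowing behaves identically in either mode.
  static PCollection<KV<String, Long>> eventsPerUserPerHour(
      PCollection<KV<String, Long>> events) {
    return events
        .apply(Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardHours(1))))
        .apply(Sum.<String>longsPerKey());
  }
}
```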

Another motivation for the unified model came from an experience with the Lambda Architecture. Though most data processing use cases at Google are handled exclusively by a batch or streaming system, one MillWheel customer ran their streaming pipeline in weak consistency mode, with a nightly MapReduce to generate truth. They found that customers stopped trusting the weakly consistent results over time, and as a result reimplemented their system around strong consistency so they could provide reliable, low latency results. This experience further motivated the desire to support fluid choice amongst execution engines.

3.3.2 Unaligned Windows: Sessions

From the outset, we knew we needed to support sessions; this in fact is the main contribution of our windowing model over existing models. Sessions are an extremely important use case within Google (and were in fact one of the reasons MillWheel was created), and are used across a number of product areas, including search, ads, analytics, social, and YouTube. Pretty much anyone that cares about correlating bursts of otherwise disjoint user activity over a period of time does so by calculating sessions. Thus, support for sessions became paramount in our design. As shown in Figure 14, generating sessions in the Dataflow Model is trivial.
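
Translator's note: Figure 14 in the paper is written in the model's own pseudocode. As a rough illustration only, the equivalent windowing in the Apache Beam Java SDK is shown below; the 30-minute gap, the key/value types, and the names are assumptions on my part.

```java
// Hypothetical sketch: group each user's activity into sessions separated by
// 30 minutes of inactivity, then sum a per-event score within each session.
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.Sessions;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class SessionScores {
  static PCollection<KV<String, Integer>> scorePerSession(
      PCollection<KV<String, Integer>> userEvents) {
    return userEvents
        .apply(Window.<KV<String, Integer>>into(
            Sessions.withGapDuration(Duration.standardMinutes(30))))
        .apply(Sum.<String>integersPerKey());
  }
}
```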

3.3.3 Billing: Triggers, Accumulation, & Retraction

Two teams with billing pipelines built on MillWheel experienced issues that motivated parts of the model. Recommended practice at the time was to use the watermark as a completion metric, with ad hoc logic to deal with late data or changes in source data. Lacking a principled system for updates and retractions, a team that processed resource utilization statistics ended up leaving our platform to build a custom solution (the model for which ended up being quite similar to the one we developed concurrently). Another billing team had significant issues with watermark lags caused by stragglers in their input. These shortcomings became major motivators in our design, and influenced the shift of focus from one of targeting completeness to one of adaptability over time. The results were twofold: triggers, which allow the concise and flexible specification of when results are materialized, as evidenced by the variety of output patterns possible over the same data set in Figures 7−14; and incremental processing support via accumulation (Figures 7 and 8) and retractions (Figure 14).
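
Translator's note: the following is an illustrative sketch in the Apache Beam Java SDK, not the billing teams' actual code. A per-record trigger materializes an updated result every time an element arrives, and the accumulation mode controls whether each pane is a delta or a refined running total; retractions, described in the paper, are not shown. The window size, lateness horizon, and names are assumptions.

```java
// Hypothetical sketch of incremental results under two accumulation modes.
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class BillingUpdates {
  // Shared windowing: fire an updated pane for every arriving record, and keep
  // window state for a day so that very late records still produce refinements.
  private static Window<KV<String, Long>> perRecordUpdates() {
    return Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
        .withAllowedLateness(Duration.standardDays(1));
  }

  // Discarding mode: each pane carries only the delta since the previous firing.
  static PCollection<KV<String, Long>> deltas(PCollection<KV<String, Long>> amounts) {
    return amounts.apply(perRecordUpdates().discardingFiredPanes())
                  .apply(Sum.<String>longsPerKey());
  }

  // Accumulating mode: each pane carries the refined running total so far.
  static PCollection<KV<String, Long>> runningTotals(PCollection<KV<String, Long>> amounts) {
    return amounts.apply(perRecordUpdates().accumulatingFiredPanes())
                  .apply(Sum.<String>longsPerKey());
  }
}
```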

3.3.4 Statistics Calculation: Watermark Triggers

Many MillWheel pipelines calculate aggregate statistics (e.g. latency averages). For them, 100% accuracy is not required, but having a largely complete view of their data in a reasonable amount of time is. Given the high level of accuracy we achieve with watermarks for structured input sources like log files, such customers find watermarks very effective in triggering a single, highly-accurate aggregate per window. Watermark triggers are highlighted in Figure 12.
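
Translator's note: the watermark-trigger pattern of Figure 12 might look roughly like the following Apache Beam Java sketch (again, not the paper's own code): one on-time pane per window, emitted when the watermark passes the end of the window, with anything arriving behind the watermark simply dropped here. The two-minute window, the `Mean` combiner, and the names are assumptions; MillWheel's percentile watermarks, discussed next, have no direct equivalent in this sketch.

```java
// Hypothetical sketch: a single, highly complete aggregate per key per window,
// materialized once when the watermark believes the window's input is complete.
import org.apache.beam.sdk.transforms.Mean;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class LatencyStats {
  static PCollection<KV<String, Double>> averageLatency(
      PCollection<KV<String, Long>> latencyMillis) {
    return latencyMillis
        .apply(Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardMinutes(2)))
            .triggering(AfterWatermark.pastEndOfWindow())
            .withAllowedLateness(Duration.ZERO)   // records behind the watermark are discarded
            .discardingFiredPanes())
        .apply(Mean.<String, Long>perKey());
  }
}
```
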
A number of abuse detection pipelines run on MillWheel. Abuse detection is another example of a use case where processing a majority of the data quickly is much more useful than processing 100% of the data more slowly. As such, they are heavy users of MillWheel’s percentile watermarks, and were a strong motivating case for being able to support percentile watermark triggers in the model.

Relatedly, a pain point with batch processing jobs is stragglers that create a long tail in execution time. While dynamic rebalancing can help with this issue, FlumeJava has a custom feature that allows for early termination of a job based on overall progress. One of the benefits of the unified model for batch mode is that this sort of early termination criteria is now naturally expressible using the standard triggers mechanism, rather than requiring a custom feature.

3.3.5 Recommendations: Processing Time Triggers

Another pipeline that we considered built trees of user activity (essentially session trees) across a large Google property. These trees were then used to build recommendations tailored to users’ interests. The pipeline was noteworthy in that it used processing-time timers to drive its output. This was due to the fact that, for their system, having regularly updated, partial views on the data was much more valuable than waiting until mostly complete views were ready once the watermark passed the end of the session. It also meant that lags in watermark progress due to a small amount of slow data would not affect timeliness of output for the rest of the data. This pipeline thus motivated inclusion of processing-time triggers shown in Figures 7 and 8.
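
Translator's note: as an illustrative Apache Beam Java sketch of that output pattern (not the recommendation pipeline's real code), the panes below are driven by a processing-time trigger, so a refined partial view of each in-progress session is emitted periodically instead of waiting for the watermark to pass the end of the session. The ten-minute refresh interval, the integer stand-in for the session-tree aggregate, and the names are assumptions.

```java
// Hypothetical sketch: regularly updated partial views of in-progress sessions.
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Sessions;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class ActivityTrees {
  static PCollection<KV<String, Integer>> partialSessionViews(
      PCollection<KV<String, Integer>> userActivity) {
    return userActivity
        .apply(Window.<KV<String, Integer>>into(
                Sessions.withGapDuration(Duration.standardMinutes(30)))
            // Fire roughly every ten minutes of processing time after the first
            // element of the pane, for as long as the session stays open.
            .triggering(Repeatedly.forever(
                AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(10))))
            .withAllowedLateness(Duration.standardDays(1))
            .accumulatingFiredPanes())
        .apply(Sum.<String>integersPerKey());
  }
}
```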

3.3.6 Anomaly Detection: Data-Driven & Composite Triggers

In the MillWheel paper, we described an anomaly detection pipeline used to track trends in Google web search queries. When developing triggers, their diff detection system motivated data-driven triggers. These differs observe the stream of queries and calculate statistical estimates of whether a spike exists or not. When they believe a spike is happening, they emit a start record, and when they believe it has ceased, they emit a stop. Though you could drive the differ output with something periodic like Trill’s punctuations, for anomaly detection you ideally want output as soon as you are confident you have discovered an anomaly; the use of punctuations essentially transforms the streaming system into micro-batch, introducing additional latency. While practical for a number of use cases, it ultimately is not an ideal fit for this one, thus motivating support for custom data-driven triggers. It was also a motivating case for trigger composition, because in reality, the system runs multiple differs at once, multiplexing the output of them according to a well-defined set of logic. The AtCount trigger used in Figure 9 exemplified data-driven triggers; figures 10−14 utilized composite triggers.
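
Translator's note: a rough Apache Beam Java illustration of the two ideas, not the anomaly-detection pipeline's real code. `AfterPane.elementCountAtLeast` plays the role of the AtCount-style data-driven trigger from Figure 9, and `AfterFirst` plus `Repeatedly` compose it with a processing-time fallback so that a pane fires on whichever condition is met first; a trigger that inspects element contents the way the differs do would need something custom and is beyond this sketch. Thresholds, durations, and names are assumptions.

```java
// Hypothetical sketch: a repeated composite trigger that fires when EITHER 50
// records have arrived (data-driven) OR five minutes of processing time have passed.
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.AfterFirst;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class SpikeCounts {
  static PCollection<KV<String, Long>> queryCounts(
      PCollection<KV<String, String>> queriesByTopic) {
    return queriesByTopic
        .apply(Window.<KV<String, String>>into(FixedWindows.of(Duration.standardMinutes(10)))
            .triggering(Repeatedly.forever(AfterFirst.of(
                AfterPane.elementCountAtLeast(50),
                AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(5)))))
            .withAllowedLateness(Duration.standardHours(1))
            .accumulatingFiredPanes())
        .apply(Count.<String, String>perKey());
  }
}
```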

4. CONCLUSIONS

The future of data processing is unbounded data. Though bounded data will always have an important and useful place, it is semantically subsumed by its unbounded counterpart. Furthermore, the proliferation of unbounded data sets across modern business is staggering. At the same time, consumers of processed data grow savvier by the day, demanding powerful constructs like event-time ordering and unaligned windows. The models and systems that exist today serve as an excellent foundation on which to build the data processing tools of tomorrow, but we firmly believe that a shift in overall mindset is necessary to enable those tools to comprehensively address the needs of consumers of unbounded data.

Based on our many years of experience with real-world, massive-scale, unbounded data processing within Google, we believe the model presented here is a good step in that direction. It supports the unaligned, event-time-ordered windows modern data consumers require. It provides flexible triggering and integrated accumulation and retraction, refocusing the approach from one of finding completeness in data to one of adapting to the ever present changes manifest in real-world datasets. It abstracts away the distinction of batch vs. micro-batch vs. streaming, allowing pipeline builders a more fluid choice between them, while shielding them from the system-specific constructs that inevitably creep into models targeted at a single underlying system. Its overall flexibility allows pipeline builders to appropriately balance the dimensions of correctness, latency, and cost to fit their use case, which is critical given the diversity of needs in existence. And lastly, it clarifies pipeline implementations by separating the notions of what results are being computed, where in event time they are being computed, when in processing time they are materialized, and how earlier results relate to later refinements. We hope others will find this model useful as we all continue to push forward the state of the art in this fascinating, remarkably complex field.
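
Translator's note: to make the four questions concrete, here is one annotated sketch in the Apache Beam Java SDK, the open-source descendant of this model. The window size, trigger delays, and names are illustrative assumptions rather than anything prescribed by the paper.

```java
// Hypothetical sketch: one pipeline, annotated with the what/where/when/how questions.
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class WhatWhereWhenHow {
  static PCollection<KV<String, Integer>> scores(PCollection<KV<String, Integer>> input) {
    return input
        // WHERE in event time: fixed two-minute event-time windows.
        .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2)))
            // WHEN in processing time: speculative firings every minute before the
            // watermark, an on-time firing at the watermark, and one firing per late record.
            .triggering(AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(1)))
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            .withAllowedLateness(Duration.standardDays(1))
            // HOW refinements relate: later panes accumulate over earlier ones.
            .accumulatingFiredPanes())
        // WHAT is computed: a per-key integer sum.
        .apply(Sum.<String>integersPerKey());
  }
}
```
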
5. ACKNOWLEDGMENTS

We thank all of our faithful reviewers for their dedication, time, and thoughtful comments: Atul Adya, Ben Birt, Ben Chambers, Cosmin Arad, Matt Austern, Lukasz Cwik, Grzegorz Czajkowski, Walt Drummond, Jeff Gardner, Anthony Mancuso, Colin Meek, Daniel Myers, Sunil Pedapudi, Amy Unruh, and William Vambenepe. We also wish to recognize the impressive and tireless efforts of everyone on the Google Cloud Dataflow, FlumeJava, MillWheel, and related teams that have helped bring this work to life.

Afterword

Most of this article I translated myself; for a few longer, harder sentences I consulted the Eudic dictionary, Google Translate, and, most importantly, the already-translated portions of the Alibaba Cloud article below!

Featured stream-computing translation (流计算精品翻译): The Dataflow Model (Alibaba Cloud)

A bit off topic... Lately I have not talked much with my friends from the mechanical engineering department, since we no longer live together and everyone has started at their labs. Even my closest roommate only managed barbecue and a movie with me tonight. Over at the lab, the people I know have not arrived yet, the people sharing my room hardly talk, and when they do I can't join in... well, what they discuss is rather academic and I don't really follow it. And my girlfriend isn't paying me much attention either. Ah, the world is too cold. I'll just go back to my books...

    Original author: HustWolf
    Original article: https://www.jianshu.com/p/de7b1656dfa0