SWAB segmentation algorithm for time series data

I am trying to understand how to segment a set of time series data (daily stock prices, temperatures, etc.) and came across a book that explains the SWAB (Sliding Window And Bottom-up) segmentation algorithm, but I don't quite understand it. The segmentation is part of a sonification algorithm. The following text is from "Multimedia Data Mining and Analytics: Disruptive Innovation".

The SWAB segmentation
algorithm gets four parameters—the input file (time series data), the output
file (segmented data), the maximal error, and the indication of nominal attributes.
After running a number of experiments on time series of different sizes with different
values for the number of segments, we chose the appropriate default number of
segments as follows. 25–50 % of time series size for time series with less than 100
observations, 20–35 % for time series with 100–200 observations, and 15–25 % for
time series with more than 200 observations. If the user is not interested to use the
default value for any reason, he can enter his own number of segments as a parameter
to the algorithm.
Starting with the default values for the minimum and the maximum error, we
run the segmentation algorithm for the first time and get the minimum number of
segments for a given time series (the higher the maximum error, the fewer segments
will be found). Then we decrease the maximum error (and so increase the number
of found segments) trying to narrow the upper and the lower bounds of error by
dividing the base by powers of 2 (like in binary search). Every time after running
the segmentation algorithm with the current maximal error, we test whether this
value gives a better approximation for the optimal number of segments, and so is a
better upper or lower bound for the optimal maximum error. If so, we advance the
appropriate bound to this value. In the beginning, only the upper bound is affected.
However, once we found the lower bound that provides more segments than the
optimum, we continue to look for the optimal number of segments by smaller steps:
the next maximum error is the mean between the current upper and lower bounds.
As follows from our experience with many different time series databases, the
optimal maximal error is usually found within 3–4 iterations. The convergence rate
depends on the input time series database itself. If the algorithm has not converged
within 20 iterations, we stop searching and proceed with the next sonification steps
using the segments found at the 20th iteration.
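The outer search described above (halving the maximal error like in binary search, then bisecting between the bounds) can be sketched as follows. This is my own reading of the passage, not code from the book; the name `count_segments` and the choice of initial bound are assumptions:

```python
def find_max_error(series, target_segments, count_segments,
                   initial_error, max_iterations=20):
    """Tune the maximal error so the segmentation yields roughly
    target_segments segments. count_segments(series, max_error) is
    assumed to run the segmentation and return the number of segments
    it produced (a higher max error gives fewer segments)."""
    lower, upper = None, initial_error  # bounds on the optimal max error
    error = initial_error
    for _ in range(max_iterations):
        n = count_segments(series, error)
        if n < target_segments:
            upper = error               # error too large: too few segments
        elif n > target_segments:
            lower = error               # error too small: too many segments
        else:
            return error                # hit the target segment count
        if lower is None:
            error = upper / 2           # halve until a lower bound appears
        else:
            error = (lower + upper) / 2  # then bisect between the bounds
    # did not converge within max_iterations: proceed with the last value
    return error
```

As in the book, if no value hits the target within the iteration limit, the search simply stops and the last segmentation is used.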

So, for example, if I have time series data with 150 observations (which falls into the 20-35 % default segment count range), what exact steps do I need to take to segment the data?

Any help is appreciated, thanks.

Accepted answer: The exact steps

Here is a brief description of the method:

The Sliding Window algorithm works by anchoring the left point of a
potential segment at the first data point of a time series,
then attempting to approximate the data to the right with
increasing longer segments. At some point i, the error for the
potential segment is greater than the user-specified threshold, so
the subsequence from the anchor to i -1 is transformed into
a segment. The anchor is moved to location i, and the process repeats
until the entire time series has been transformed into a piecewise
linear approximation.

Based on this, pseudocode for the algorithm follows. See my comments in the code for the specifics.

//function takes a set of points T and a max error
function Sliding_Window(T, max_error)
  anchor = 1;
  while (not finished segmenting time series) {
    i = 2;

    //keep making subsets of progressively larger size
    //until the error of the subset is greater than the max error
    //T[anchor: anchor + i] represents the elements of the set
    //from index (anchor) to index (anchor + i)
    //this could be implemented as an array
    while (calculate_error(T[anchor: anchor + i]) < max_error) {
      i = i + 1;
    }

    //add the newly found segment to the set of all segments
    Seg_TS = concat(Seg_TS, create_segment(T[anchor: anchor + (i - 1)]));

    //and increment the anchor in preparation for creating a new segment
    anchor = anchor + i;
  }
}
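To make that concrete, here is a minimal runnable Python version of the same idea. The error function is passed in as a parameter so you can plug in whichever fit measure you choose (see the next section); the non-overlapping slicing is my interpretation of the pseudocode's indexing:

```python
def sliding_window(T, max_error, calculate_error):
    """Split time series T (a list of values) into segments whose fit
    error, as judged by calculate_error, stays below max_error.
    Returns a list of sublists of T."""
    segments = []
    anchor = 0
    while anchor < len(T):
        i = 2  # start with the smallest possible segment (two points)
        # grow the candidate segment until its error reaches max_error
        # or we run out of data
        while (anchor + i <= len(T)
               and calculate_error(T[anchor:anchor + i]) < max_error):
            i += 1
        # the last size that still fit was i - 1 points
        segments.append(T[anchor:anchor + i - 1])
        anchor = anchor + i - 1
    return segments
```

For example, with a toy error function that measures the spread of values, `sliding_window([1, 1, 1, 1, 10, 10, 10], 1, lambda s: max(s) - min(s))` splits the series into the flat runs `[1, 1, 1, 1]` and `[10, 10, 10]`.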

The definition of "error"

One thing you seem unclear about is what "error" means in this context. The following paragraph explains it well:

All segmentation algorithms also need some method to evaluate the
quality of fit for a potential segment. A measure commonly used in
conjunction with linear regression is the sum of squares, or the
residual error. This is calculated by taking all the vertical
differences between the best-fit line and the actual data points,
squaring them and then summing them together. Another commonly
used measure of goodness of fit is the distance between the best fit
line and the data point furthest away in the vertical direction.

In other words, several methods can be used to represent the "error" here. Two common choices from statistics are the sum of squared residuals and the maximum vertical distance. In theory you could even write your own function for this, as long as it returns a number indicating how well the segment represents the given set of points.

More on the sum-of-squares method here: https://en.wikipedia.org/wiki/Residual_sum_of_squares

If you want to implement it yourself, some pseudocode might look like this:

function calculateSegmentErrorUsingSumOfSquares() {
  sum = 0;
  for each (point in set_approximated_by_segment) {
    difference = point.y_coordinate - approximation_segment.y_at_x(point.x_coordinate);
    sum = sum + (difference * difference);
  }
  return sum;
}
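For reference, here is a self-contained Python version of the same idea, fitting the ordinary least-squares line explicitly rather than assuming a fitting library. The fallback for all-equal x values is my own addition:

```python
def best_fit_line(points):
    """Ordinary least-squares line through (x, y) points.
    Returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    denom = sum((x - mean_x) ** 2 for x, _ in points)
    if denom == 0:  # all x equal: degenerate, fall back to a flat line
        return 0.0, mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / denom
    return slope, mean_y - slope * mean_x

def sum_of_squares_error(points):
    """Residual sum of squares of the points around their best-fit line:
    square each vertical difference and sum them, as described above."""
    slope, intercept = best_fit_line(points)
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points)
```

Points lying exactly on a line give an error of zero, and the error grows as the points scatter further from the best-fit line.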

Note that whichever method you use will have certain advantages and disadvantages. See Jason's comment below for more information and references, but the key point is: make sure the error function you choose responds well to the kind of data you expect.

Sources

https://www.cs.rutgers.edu/~pazzani/Publications/survey.pdf
