Sampling methods for data science
Context
In most studies, it is hard (and sometimes impossible) to analyse a whole population, so researchers use samples instead. In statistics, survey sampling is the process of drawing a sample from a population in order to conduct a survey. As data scientists, we usually work with data that was collected previously, so we don’t spend much time thinking about how the sampling was actually done. As we will see in this article, however, our data can carry different biases depending on how it was sampled, so you had better understand the implications of each of these sampling designs. There are many ways of drawing samples and, depending on the context, some are better than others.
Probability vs. non-probability
There are two broad categories of sampling designs: probability and non-probability. In probability sampling, each element of the population has a known, non-zero probability of being in the sample. This method is usually preferable, since its properties, such as bias and sampling error, are usually known. In non-probability sampling, some elements of the population may have no chance of being selected, and there is a great risk that the sample is not representative of the population as a whole. However, probability sampling is not always possible, and sometimes it is simply cheaper to sample non-randomly.
Let’s now take a look at some of the different sampling designs in each category and their properties.
Probability sampling
Simple random sampling without replacement (SRSWR)
This is probably the most obvious sampling method there is: if you have a population of 1000 individuals and you can only analyse 100, you randomly select one individual at a time, never picking the same individual twice, until you have your sample of 100. This gives each individual the same probability of being in the sample (here 100/1000 = 10%).
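The procedure above can be sketched in a few lines of Python. This is a minimal illustration, assuming the population is simply a list of 1000 integer IDs (a made-up stand-in for real individuals); it relies on the standard library's `random.Random.sample`, which draws without replacement.

```python
import random

# Hypothetical population: 1000 individuals identified by the integers 0..999.
population = list(range(1000))

rng = random.Random(42)  # fixed seed so the draw is reproducible

# sample() draws without replacement, so no individual can appear twice.
sample = rng.sample(population, k=100)

print(len(sample), len(set(sample)))  # prints "100 100"
```

In a real study, `population` would be your sampling frame, that is, the list of everyone you could possibly select.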
SRSWR is an unbiased sampling design, meaning that estimates calculated from the sample, such as the sample mean, match the corresponding population values on average. It is often the preferable sampling design, but with a small caveat: you risk getting a really bad sample, purely out of bad luck, and ending up with results that are not at all representative of your population. In this case, stratifying your sample might help (we’ll get to that later).
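Both points, unbiased on average but occasionally unlucky, can be checked with a small simulation. The sketch below assumes a synthetic population of 1000 scores drawn uniformly between 0 and 10; the data and the number of repetitions are illustrative choices, not results from a real survey.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical population: 1000 satisfaction scores between 0 and 10.
population = [rng.uniform(0, 10) for _ in range(1000)]
true_mean = statistics.mean(population)

# Draw many independent samples of 100 (each without replacement)
# and record the mean of each one.
sample_means = [
    statistics.mean(rng.sample(population, k=100))
    for _ in range(2000)
]

# On average the sample mean sits very close to the population mean...
average_of_means = statistics.mean(sample_means)
# ...but a single unlucky sample can still be noticeably off.
worst_error = max(abs(m - true_mean) for m in sample_means)

print(f"population mean:         {true_mean:.3f}")
print(f"average of sample means: {average_of_means:.3f}")
print(f"largest single error:    {worst_error:.3f}")
```

The gap between the average of the sample means and the population mean should be tiny, while the largest single error shows how far one bad draw can land.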
In practice, however, it is not that simple to get an actual simple random sample. For election polls, for instance, how do you do it? You can’t realistically get a list of every person in the country to select from at random. You can, for instance, take a list of all the available personal phone numbers and select from there. My point is that you probably need a list of your whole population to do this. If you are randomly interviewing people in the streets, your sample is actually not completely random: depending on which location you choose to go to, it might yield different results.
Poisson sampling
In a Poisson sampling design, every element in your population goes through an independent Bernoulli trial that decides whether it will be in the sample or not. If the probability is the same for every element in the population, this is a special case called Bernoulli sampling. Like simple random sampling, it depends on having a list of every element in your population. Let’s say you have a list of all the companies in your country, and you want to survey them. You could assign the same probability p of being in your sample to each one of them, or a different probability to each, depending on their size, for instance (you might want to give greater weight to bigger companies). Note that, in this case, you can’t know the exact size of your sample beforehand: this is what we call a random size sampling design.
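Here is a minimal sketch of both variants. The company list, the helper name `poisson_sample`, and the size-based probability rule (scaled so that the largest company has a 20% chance) are all assumptions made for illustration.

```python
import random

rng = random.Random(1)

# Hypothetical company list: (name, number_of_employees).
companies = [(f"company_{i}", rng.randint(1, 500)) for i in range(1000)]


def poisson_sample(units, inclusion_prob, rng):
    """Keep each unit independently, with its own inclusion probability."""
    return [u for u in units if rng.random() < inclusion_prob(u)]


# Bernoulli sampling: the same probability p for every company.
p = 0.1
bernoulli = poisson_sample(companies, lambda c: p, rng)

# Poisson sampling with size-dependent probabilities: bigger companies
# are more likely to be selected. The 0.2 scaling is an arbitrary choice.
max_size = max(size for _, size in companies)


def by_size(company):
    return 0.2 * company[1] / max_size


poisson = poisson_sample(companies, by_size, rng)

# The sample size is random; only its expectation is known in advance:
# it is the sum of the inclusion probabilities.
expected_size = sum(by_size(c) for c in companies)
print(len(bernoulli), len(poisson), round(expected_size, 1))
```

Running it with different seeds gives different sample sizes each time, which is exactly the random size property described above.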
Stratified sampling
Under certain conditions, it can be useful to stratify your population according to some of its features. Let’s say you want to survey your company’s 1000 employees to see how happy they are at their jobs, but you only have the time to interview 100 of them, so you take a sample. With SRSWR, you risk getting 50 people from accounting and no data scientists. This would make you think your company’s employees are much unhappier than they actually are, since data scientists are the happiest people at their jobs, and accountants… well, they are accountants. What you can do in this case is split your population into departments and then sample randomly from each department, taking samples that are proportional to the department size.
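Proportional stratified sampling could look like the sketch below. The department names and sizes are invented for the example, and `stratified_sample` is a hypothetical helper, not a library function.

```python
import random
from collections import Counter, defaultdict

rng = random.Random(7)

# Hypothetical staff list: 1000 employees, each tagged with a department.
department_sizes = {
    "accounting": 300,
    "engineering": 400,
    "data_science": 100,
    "sales": 200,
}
employees = [
    (f"{dept}_{i}", dept)
    for dept, size in department_sizes.items()
    for i in range(size)
]


def stratified_sample(units, stratum_of, fraction, rng):
    """Draw a simple random sample of the same fraction in every stratum."""
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for members in strata.values():
        # Proportional allocation; rounding can shift the total slightly
        # when stratum sizes are not multiples of 1/fraction.
        k = round(fraction * len(members))
        sample.extend(rng.sample(members, k))
    return sample


sample = stratified_sample(employees, lambda e: e[1], 0.1, rng)
counts = Counter(dept for _, dept in sample)
print(dict(counts))
# prints {'accounting': 30, 'engineering': 40, 'data_science': 10, 'sales': 20}
```

Every department is now guaranteed its share of the 100 interviews, so the "50 accountants and no data scientists" outcome cannot happen.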
This method can be really useful under some conditions:
Variability within strata is small (you know, from previous studies, that people within the same department tend to feel more or less the same in terms of happiness at work)
Variability between strata is large (your level of happiness at work depends a lot on your department)
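To see why these two conditions matter, the following simulation compares how much the estimated mean fluctuates under simple random sampling and under proportional stratified sampling. The department means, sizes and the within-department standard deviation of 0.5 are made-up values, chosen so that both conditions hold.

```python
import random
import statistics

rng = random.Random(3)

# Hypothetical happiness scores: department means differ a lot (large
# between-strata variability) while people inside a department are
# close to each other (small within-strata variability).
dept_mean = {"accounting": 3.0, "engineering": 6.0, "data_science": 9.0}
dept_size = {"accounting": 500, "engineering": 400, "data_science": 100}
strata = {
    dept: [rng.gauss(dept_mean[dept], 0.5) for _ in range(dept_size[dept])]
    for dept in dept_mean
}
population = [score for scores in strata.values() for score in scores]


def srs_mean():
    """Mean of a simple random sample of 100 employees."""
    return statistics.mean(rng.sample(population, 100))


def stratified_mean():
    """Mean of a sample made of 10% of each department (100 in total)."""
    pooled = []
    for scores in strata.values():
        pooled.extend(rng.sample(scores, len(scores) // 10))
    # With proportional allocation, the plain mean of the pooled sample
    # estimates the population mean.
    return statistics.mean(pooled)


# Repeat each design many times and measure the spread of the estimates.
srs_spread = statistics.stdev([srs_mean() for _ in range(1000)])
strat_spread = statistics.stdev([stratified_mean() for _ in range(1000)])

print(f"spread of the estimate, simple random sampling: {srs_spread:.3f}")
print(f"spread of the estimate, stratified sampling:    {strat_spread:.3f}")
```

Under these assumptions the stratified estimate should vary far less from one sample to the next. If you set all department means to the same value, the advantage largely disappears.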
However, in practice, it can be expensive and complicated to implement. Since it needs prior information about your population, it is most useful when you conduct smaller studies in between broader, more expensive ones (e.g., if your country runs a census every 10 years and you need intermediate information every 5 years, you can use the census data to design the smaller, intermediate studies).
Non-probability sampling
Volunteer sampling
This is a widely used method: it’s what you get when you post a survey form in a Facebook group and ask people to fill it in for you. It’s easy and cheap, but it can lead to a lot of bias, since you are actually sampling people who are on Facebook, who saw your post and, most importantly, who are willing to fill in that form for you. This might oversample people who like you, or people who have enough free time to fill in the form.
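The self-selection effect can be illustrated with a toy simulation. Everything here is an assumption: the 0-10 "how much they like you" score, and the response rule in which the chance of filling in the form grows linearly from 0% to 30% with that score.

```python
import random
import statistics

rng = random.Random(5)

# Hypothetical population: how much each person likes the survey's
# author, on a 0-10 scale.
population = [rng.uniform(0, 10) for _ in range(10_000)]

# Assumed response behaviour: the more someone likes you, the more
# likely they are to answer (0% at score 0, up to 30% at score 10).
volunteers = [score for score in population if rng.random() < 0.03 * score]

population_mean = statistics.mean(population)
volunteer_mean = statistics.mean(volunteers)

print(f"respondents:     {len(volunteers)} of {len(population)}")
print(f"population mean: {population_mean:.2f}")
print(f"volunteer mean:  {volunteer_mean:.2f}")
```

Even with well over a thousand respondents, the volunteer mean ends up clearly above the population mean: collecting more answers does not remove this kind of bias.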
It can be used as a first validation step, to see whether it is worth pursuing more expensive methods later on.
Judgement sampling
In this sampling design, you choose your sample based on your existing domain knowledge. If you want to survey potential customers for a new online coding course, you might already have an idea of the type of people who would like it, and start looking for them on LinkedIn.
It goes without saying that this method is prone to your own biases, and you should not draw definitive conclusions from its results. It can be used under the same circumstances as volunteer sampling.
Conclusion
Now you know some of the most common sampling designs, when to use them and their caveats. Survey sampling is a whole field of expertise in itself, especially useful for those who work as statisticians for government agencies, but it is good for data scientists to know the basics in order to understand the implications of their collection methods, or to conduct surveys themselves.
Once you have sampled your data, then what? Well, you will need to apply some feature engineering to make sense of it. Additionally, you might like this article on project management workflows for data scientists.
Feel free to reach out to me on LinkedIn if you would like to discuss further; it would be a pleasure (honestly).
Translated from: https://towardsdatascience.com/sampling-methods-for-data-science-ddfeb5b3c8ed