2016-08-13

By Jean-Nicholas Hould, JeanNicholasHould.com

Yesterday, I was reading a thread on Quora. The people in that thread were answering the following question: What are 20 questions to detect fake data scientists? The most upvoted answer contained a list of questions that could catch a good number of data scientists off guard.

In that thread, my attention was drawn to one particular question, not because it was especially hard, but because I doubt many data scientists can answer it. Yet most of them, whether they know it or not, use this concept on a daily basis.

The question was: What is the Central Limit Theorem? Why is it important?



Explain the Theorem Like I’m Five


Let’s say you are studying the population of beer drinkers in the US. You’d like to understand the mean age of those people but you don’t have time to survey the entire US population.

Instead of surveying the whole population, you collect one sample of 100 beer drinkers in the US. With this data, you are able to calculate an arithmetic mean. Maybe for this sample the mean age is 35 years old. Say you collect another sample of 100 beer drinkers; for that new sample, the mean age is 39 years old. As you collect more and more of these means from samples of 100 beer drinkers, you get what is called a sampling distribution. The sampling distribution is the distribution of the sample means. In this example, 35 and 39 would be two observations in that sampling distribution.

The statement of the theorem says that the sampling distribution, the distribution of the sample means you collected, will approximately take the shape of a bell curve around the population mean. This shape is also known as a normal distribution. Don’t get the statement wrong: the CLT is not saying that any population will have a normal distribution. It says the sampling distribution will.

As your samples get bigger, the sampling distribution will tend to look more and more like a normal distribution. The theorem holds for any population, regardless of its distribution. There are some important conditions for the theorem to hold, but I won’t cover them in this post.
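To make the sampling distribution concrete, here is a minimal simulation sketch in Python using NumPy. The skewed, exponentially distributed “population” of ages is purely illustrative (made-up numbers, not real data); the point is that even though the population is not normal, the means of repeated samples of 100 cluster around the population mean in a bell-like shape.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical skewed "population" of beer-drinker ages (illustrative, not real data):
# ages starting at 21 with a long right tail, so the population itself is NOT normal.
population = 21 + rng.exponential(scale=14, size=1_000_000)

sample_size = 100
n_samples = 10_000

# Draw many samples of 100 drinkers and record each sample's mean.
sample_means = np.array([
    rng.choice(population, size=sample_size).mean()
    for _ in range(n_samples)
])

print(f"population mean:          {population.mean():.2f}")
print(f"mean of sample means:     {sample_means.mean():.2f}")   # clusters around the population mean
print(f"std of sample means:      {sample_means.std():.2f}")    # close to population std / sqrt(100)
print(f"population std / sqrt(n): {population.std() / np.sqrt(sample_size):.2f}")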

Why is it important?


The Central Limit Theorem is at the core of what every data scientist does daily: make statistical inferences about data.

The theorem gives us the ability to quantify the likelihood that our sample will deviate from the population without having to take any new samples to compare it with. We don’t need the characteristics of the whole population to understand the likelihood of our sample being representative of it.

The concepts of confidence intervals and hypothesis testing are based on the CLT. Because our sample mean fits somewhere in a normal distribution, we know that about 68 percent of sample means lie within one standard deviation of the population mean, about 95 percent within two standard deviations, and so on.
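As a quick check of those figures, the standard normal CDF gives the probability of falling within k standard deviations of the mean. A small sketch, assuming SciPy is available:

from scipy.stats import norm

# P(-k < Z < k) for a standard normal variable: the familiar 68-95-99.7 rule.
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} standard deviation(s): {p:.4f}")
# within 1 standard deviation(s): 0.6827
# within 2 standard deviation(s): 0.9545
# within 3 standard deviation(s): 0.9973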

The CLT is not limited to making inferences from a sample about a population. There are four kinds of inferences we can make based on the CLT:

  • We have the information of a valid sample. We can make accurate assumptions about its population.
  • We have the information of the population. We can make accurate assumptions about a valid sample from that population.
  • We have the information of a population and a valid sample. We can accurately infer whether the sample was drawn from that population.
  • We have the information about two different valid samples. We can accurately infer whether the two samples were drawn from the same population.

As a data scientist, you should be able to deeply understand this theorem. You should be able to explain it and understand why it’s so important. This post skips many important aspects of the theorem, such as its mathematical proof, the criteria for it to be valid, and the details of the statistical inferences that can be made from it. These elements are material for another post.



All replies
2016-8-13 22:48:27
Thanks for sharing.

2016-8-14 13:00:04
Thanks for sharing.

2016-8-14 15:06:43
The above is Part 1: The Theorem Every Data Scientist Should Know.

This is Part 2: The Theorem Every Data Scientist Should Know (Part 2).


The Theorem Every Data Scientist Should Know (Part 2)

13 Jul 2016

Last week, I wrote a post about the Central Limit Theorem. In that post, I explained through examples what the theorem is and why it’s so important when working with data. If you haven’t read it yet, go do it now. To keep the post short and focused, I didn’t go into many details. The goal of that post was to communicate the general concept of the theorem. In the days following its publication, I received many messages. People wanted me to go into more details.

Today, I’ll dive into more specifics. I’ll be focusing on answering the following question: How do we calculate confidence intervals and margins of error with the CLT?

By the end of this post, you should be able to explain how we calculate confidence intervals to your colleagues.

More Details On The CLT

The theorem states that if we collect a large enough sample from a population, the sample mean should be, more or less, equal to the population mean. If we collect a large number of different sample means, the distribution of those means should take the shape of a normal distribution, no matter what the population distribution is. We call this distribution of means the sampling distribution.

Knowing that the sampling distribution will take the shape of a normal distribution is what makes the theorem so powerful. With a little information about a sample, we are able to calculate the probability that the sample mean will differ from the population mean, and by how much. Sound familiar? The Central Limit Theorem is foundational to the concept of confidence intervals and margins of error in frequentist statistics.

When explaining the theorem, we keep referring to two distributions: the population distribution and the sampling distribution of the mean. The reason we keep referring to those two distributions is that they are connected:

[Figure from the original post: Population Distribution]

The mean of the sampling distribution will cluster around the population mean.

The standard deviation of the population distribution is tied to the standard deviation of the sampling distribution. With the standard deviation of the sampling distribution and the sample size, we are able to calculate the standard deviation of the population distribution. The standard deviation of the sampling distribution is called the standard error.

OK, so technically, how does calculating confidence intervals work?

Beer, beer, beer…

Let’s go back to the beer example from my previous post. Say we are studying American beer drinkers and we want to know the average age of the US beer drinker population. We hire a firm to conduct a survey of 100 random American beer drinkers. From that sample, we get the following (totally made up) results:

  • n (sample size): 100
  • Standard Deviation of Age: 15
  • Arithmetic Mean of Age: 40

What can we infer about the population with this information? Quite a lot, actually.

With this data at hand and based on what we learned about the CLT, our best guess is that the population mean is more or less equal to 40, the mean of our sample. However, how can we be confident about this number? What are the chances that we are wrong?

What is the probability that the mean age of the US beer drinker population is between 38 and 42? (I selected those values to keep the example simple. By the end of this post, you should be able to calculate this for any range.)

Standard Error & Standard Deviation

Here’s an important bit of information I haven’t provided you with yet. This formula describes the relationship between the Standard Error of the Mean and the Standard Deviation of the Population. We need this formula in order to calculate confidence intervals and margins of error.

Standard Error of the Mean = Standard Deviation of Population / √n

The challenge is that, with the data provided above, we have neither the Standard Error nor the Standard Deviation of the Population. To solve this, we can use our best estimator of the population standard deviation in its place. In this case, our best estimator is the sample standard deviation.

Standard Error of the Mean = 15 / √100 = 1.5

We now know that our best estimate for the Standard Error of the mean is 1.5. This is equivalent to saying the standard deviation of the sampling distribution of the mean is 1.5. This value is essential in calculating the probability of us being wrong.
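In code, the same arithmetic is trivial; this is just the formula above applied to the made-up sample numbers:

import math

n = 100          # sample size
sample_sd = 15   # sample standard deviation of age, used as our estimate of the population SD

standard_error = sample_sd / math.sqrt(n)
print(standard_error)  # 1.5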

Probability of an observation

Armed with the standard error, we can now calculate the probability of our population mean being between 38 and 42. When working with a distribution such as the normal distribution, we generally want to normalize absolute values in terms of standard deviations. What does the range of 2 years above and below our arithmetic mean represent in terms of standard deviations? We can normalize this range by dividing the 2 years by the standard error. It represents 2 / 1.5, or 1.33 standard deviations, above or below the sample mean.

Since the normal distribution is a probability distribution that has been studied extensively, there is a table, called the Z-table, that documents the probability of observing a given statistic. With the Z-table, we can easily find the probability that an observation will occur above or below a certain number of standard deviations. We can look up the probability of an observation being within 1.33 standard deviations of our mean.

In this case, the table tells us that the probability that the mean age of the US beer drinker population is between 38 and 42 is 81.64%. This is equivalent to saying that we are approximately 81.64% confident that the population mean is within 2 years of our sample mean. There you have it: a confidence interval and a margin of error.
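Here is a small sketch of the same calculation using SciPy’s normal distribution in place of a printed Z-table. Note that the post rounds the z-score to 1.33 before the table lookup, which gives 81.64%; the unrounded z-score gives roughly 81.8%.

from scipy.stats import norm

sample_mean = 40
standard_error = 1.5
low, high = 38, 42

# Convert the interval bounds into z-scores (standard errors away from the sample mean).
z_low = (low - sample_mean) / standard_error    # -1.33...
z_high = (high - sample_mean) / standard_error  # +1.33...

prob = norm.cdf(z_high) - norm.cdf(z_low)
print(f"P(38 < population mean < 42) ≈ {prob:.4f}")
# ≈ 0.8176 with the unrounded z; the post's Z-table lookup at z = 1.33 gives 0.8164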

This example is fairly simple, I agree. It’s important to remember that a good portion of a data scientist’s work is just arithmetic. Understanding the fundamentals is essential if you want to interpret data. It will also help you do a better job of teaching your colleagues about it. As a data scientist, a major part of your job is to clearly communicate statistical concepts to people with various levels of statistical knowledge.

2016-9-2 00:37:22
oliyiyi posted on 2016-8-13 20:25:
By Jean-Nicholas Hould, JeanNicholasHould.comYesterday, I was reading a thread on Quora. The people  ...
Thanks to the OP for the material.

2017-2-5 07:06:55
Great!
