Python Tutorial: The importance of EDA: Anscombe's quartet
Author: DataCamp
Uploaded: 2020-04-11
Views: 1206
Description:
Want to learn more? Take the full course at https://learn.datacamp.com/courses/st... at your own pace. More than a video, you'll learn hands-on coding & quickly apply skills to your daily work.
---
In 1973, statistician Francis Anscombe published a paper that contained four fictitious x-y data sets, plotted here. He used these data sets to make an important point. That point becomes clear if we blindly go about doing parameter estimation on these data sets.
First, let's look at the average x-values of the four data sets. They are all the same. How about the average y-values? Again, all the same. And what if we do a linear regression on each of the data sets? They all have the same line!
Surely some of the fits are less optimal than others. Let's look at the sum of the squares of the residuals. Oh my, they are all basically the same as well.
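The comparison described above can be reproduced in a few lines. What follows is a minimal sketch, not the course's own code: it uses only the Python standard library, the values are the quartet as published by Anscombe, and the helper names `fit_line` and `summarize` are made up for this illustration. In the course's setting, something like NumPy's `np.polyfit(x, y, 1)` would likely serve the same purpose as `fit_line`.

```python
from statistics import fmean

# Anscombe's quartet (Anscombe, 1973). Sets I-III share the same x values.
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(x, y):
    """Return (slope, intercept) of the ordinary least-squares line (hypothetical helper)."""
    x_bar, y_bar = fmean(x), fmean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar


def summarize(x, y):
    """Collect the summary numbers discussed in the transcript (hypothetical helper)."""
    slope, intercept = fit_line(x, y)
    # Sum of the squares of the residuals around the fitted line.
    rss = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    return {"mean_x": fmean(x), "mean_y": fmean(y),
            "slope": slope, "intercept": intercept, "rss": rss}


summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, s in summaries.items():
    print(f"{name:>3}: mean x = {s['mean_x']:.2f}, mean y = {s['mean_y']:.2f}, "
          f"slope = {s['slope']:.3f}, intercept = {s['intercept']:.3f}, "
          f"RSS = {s['rss']:.2f}")
```

Every row should report a mean x of 9, a mean y close to 7.5, a slope close to 0.5, an intercept close to 3, and a residual sum of squares between 13.7 and 13.8, even though the four scatter plots look nothing alike.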
Of course, Anscombe constructed the data sets so that this would happen. The point he was making is very important. You already have some powerful tools for statistical inference. You can compute summary statistics and optimal parameters, including linear regression parameters, and by the end of the course, you will be able to construct confidence intervals which quantify uncertainty about the parameter estimates. These are crucial skills for any data analysis, no doubt. But look before you leap!
This is a powerful reminder to do some graphical exploratory data analysis before you start computing and making judgments about your data. For example, this data set might be well modeled with a line, and the regression parameters will be meaningful. The same is true of this data set, but the outlier throws off the slope and intercept. After doing EDA, you should look into what is causing that outlier.
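To see how much a single outlier can move the fit, here is a small sketch under stated assumptions. It assumes the data set with the outlier is the one usually labeled III in Anscombe's quartet, and it flags the outlier as the point with the largest absolute residual, which is a simple heuristic chosen for this illustration. Dropping the point is done here only to show its influence; the advice in the transcript is to investigate the cause, not to delete data. The helper `fit_line` is hypothetical.

```python
from statistics import fmean

# Data set III of Anscombe's quartet (assumed to be the one with the outlier).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]


def fit_line(x, y):
    """Return (slope, intercept) of the ordinary least-squares line (hypothetical helper)."""
    x_bar, y_bar = fmean(x), fmean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar


# Fit with all points, then locate the point farthest from that line.
slope_all, intercept_all = fit_line(x, y)
residuals = [yi - (slope_all * xi + intercept_all) for xi, yi in zip(x, y)]
outlier_idx = max(range(len(x)), key=lambda i: abs(residuals[i]))

# Refit without that one point to see how strongly it pulled the line.
x_trim = [xi for i, xi in enumerate(x) if i != outlier_idx]
y_trim = [yi for i, yi in enumerate(y) if i != outlier_idx]
slope_trim, intercept_trim = fit_line(x_trim, y_trim)

print(f"suspected outlier: x = {x[outlier_idx]}, y = {y[outlier_idx]}")
print(f"all points:      slope = {slope_all:.3f}, intercept = {intercept_all:.3f}")
print(f"without outlier: slope = {slope_trim:.3f}, intercept = {intercept_trim:.3f}")
```

With the outlier included, the slope is about 0.5 and the intercept about 3. Without it, the remaining ten points give a slope of roughly 0.345 and an intercept of roughly 4, so one point changes both parameters substantially.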
This data set might also have a linear relationship between x and y, but from the plot, you can conclude that you should try to acquire more data for intermediate x values to make sure that it does. And this data set is definitely not linear, and you need to choose another model.
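A plot is the quickest way to spot the curvature, but the same signal also shows up in the residuals of the straight-line fit. The sketch below assumes the nonlinear data set is the one usually labeled II in Anscombe's quartet, and it prints the sign of each residual in order of increasing x. This check is an illustration added here, not a substitute for looking at the plot, and `fit_line` is a hypothetical helper.

```python
from statistics import fmean

# Data set II of Anscombe's quartet (assumed to be the clearly nonlinear one).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def fit_line(x, y):
    """Return (slope, intercept) of the ordinary least-squares line (hypothetical helper)."""
    x_bar, y_bar = fmean(x), fmean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar


slope, intercept = fit_line(x, y)

# Sort the points by x so the residuals read from left to right on the plot.
pairs = sorted(zip(x, y))
residuals = [yi - (slope * xi + intercept) for xi, yi in pairs]

# A good linear model leaves residuals with no systematic pattern;
# long runs of the same sign suggest the line is missing some curvature.
pattern = "".join("+" if r > 0 else "-" for r in residuals)
print(pattern)
```

The residuals are negative at both ends of the x range and positive throughout the middle, which is the arch-shaped pattern you would expect when a straight line is fitted to curved data.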
Explore your data first.
I'll let you prove to yourself that these data sets give the same regression parameters. It will be good practice, and seeing is believing!
#DataCamp #PythonTutorial #StatisticalThinkinginPython #StatisticalThinkinginPythonPart2