
Regression Analysis (Regression)


Abstract: Depending on how many independent variables are involved, regression analysis is divided into simple (one-variable) regression and multiple regression; depending on the type of relationship between the independent and dependent variables, it is divided into linear regression analysis and nonlinear regression analysis.

Original article: 2016-09-28, IBM intern Hao Jianyong, IBM Data Scientist

Overview

Regression analysis is a statistical method for determining the quantitative relationship of interdependence between two or more variables, and it is very widely used. Put simply, it fits an equation between a set of influencing factors and an outcome; the equation can then be applied to other events of the same kind to make predictions. Depending on how many independent variables are involved, regression analysis is divided into simple and multiple regression; depending on the type of relationship between the independent and dependent variables, it is divided into linear and nonlinear regression analysis. Starting from the basic concepts, this article introduces the basic principles and solution methods of regression analysis and gives a regression example in Python, to give the reader a more intuitive picture.

1. The Problem Regression Analysis Studies

Regression analysis applies statistical methods to organize, analyze, and study large amounts of observed data in order to draw conclusions that reflect the internal regularities of the phenomenon, and then uses those conclusions to predict the outcome of similar events. It is widely applicable, in fields such as psychology, medicine, and economics.

2. Basic Concepts of Regression Analysis

3. Simple Linear Regression
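As a standard reference (a sketch, not the author's original derivation), simple linear regression models the response as a straight-line function of a single predictor, and the least-squares estimates of the slope and intercept have a closed form:

$$y_i=\beta_0+\beta_1 x_i+\varepsilon_i,\qquad \hat{\beta}_1=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2},\qquad \hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}$$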



4. Multiple Linear Regression
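Likewise, as a standard reference, multiple linear regression with several predictors is usually written in matrix form, and the ordinary least-squares solution follows from the normal equations:

$$y=X\beta+\varepsilon,\qquad \hat{\beta}=(X^{T}X)^{-1}X^{T}y$$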





5. Univariate Polynomial Regression
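For reference, a univariate polynomial regression of degree m is still linear in its coefficients, so it can be estimated exactly like a multiple linear regression by treating each power of x as a separate feature:

$$y=\beta_0+\beta_1 x+\beta_2 x^{2}+\cdots+\beta_m x^{m}+\varepsilon$$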

6. Multivariate Polynomial Regression
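Multivariate polynomial regression extends this by including powers and cross terms of several predictors. For illustration (the degree and the number of variables here are assumptions), a degree-2 model in two variables is:

$$y=\beta_0+\beta_1 x_1+\beta_2 x_2+\beta_3 x_1^{2}+\beta_4 x_1 x_2+\beta_5 x_2^{2}+\varepsilon$$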

7. A Polynomial Regression Example in Python

Doing polynomial regression in Python is very convenient. If we wanted to write the model ourselves, we could follow the methods and formulas introduced above and then train and predict. Notably, the many matrix operations in those formulas can be implemented with NumPy, so implementing the polynomial regression model above is actually quite simple. NumPy is a Python library for scientific computing, an open-source numerical extension of Python. It can store and process large matrices efficiently; one view is that NumPy turns Python into a free and more powerful alternative to MATLAB.
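For instance, the matrix formula $\hat{\beta}=(X^{T}X)^{-1}X^{T}y$ from the multiple-regression section maps directly onto NumPy. A minimal sketch (the numbers are purely illustrative):

```python
import numpy as np

# Design matrix with a leading column of ones for the intercept (values are illustrative)
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([2.1, 3.9, 6.2, 8.1])

# Solve the normal equations (X^T X) beta = X^T y.
# np.linalg.lstsq is numerically safer in practice, but this mirrors the formula above.
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)  # [estimated intercept, estimated slope]
```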

Back to the point: the example shown here does not implement the model from scratch; it uses the linear model from scikit-learn.

The experimental data look like this. When the training data are given as a text file, the file contains one sample per line: the last column of each line is the y value (the dependent variable) and the preceding columns are the independent variables. If the training data have only two columns (one independent variable and one dependent variable), the linear model yields a univariate polynomial regression equation; otherwise it yields a multivariate polynomial regression equation. When the training data are given as a Python list, the features take the form [[1, 2], [3, 4], [5, 6], [7, 8]] and the targets take the form [3, 7, 8, 9]. Because the two data sources differ, the way the model is trained also differs slightly.

First, load the text data into the format the linear model needs:
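The original listing is not preserved here; a minimal sketch of this step with NumPy, assuming a whitespace-separated numeric text file (the file name "train.txt" is hypothetical):

```python
import numpy as np

def load_data(path):
    # Each row: feature columns first, the dependent variable y in the last column
    data = np.loadtxt(path)
    X = data[:, :-1]
    y = data[:, -1]
    return X, y

X_train, y_train = load_data("train.txt")  # "train.txt" is a hypothetical file name
```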

Next, train the model:
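A sketch of the training step using scikit-learn's linear model together with PolynomialFeatures; the polynomial degree of 2 is an assumption, not something stated in the original:

```python
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)   # assumed degree
X_poly = poly.fit_transform(X_train)  # expand the features into polynomial terms
model = LinearRegression()
model.fit(X_poly, y_train)            # ordinary least squares on the expanded features
```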

Then the code to print the regression equation is as follows:
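One way to assemble and print the fitted equation from the learned coefficients (a sketch; get_feature_names_out requires scikit-learn 1.0 or later):

```python
terms = poly.get_feature_names_out()  # e.g. "1", "x0", "x1", "x0^2", "x0 x1", ...
body = " + ".join(f"{c:.4f}*{t}" for c, t in zip(model.coef_, terms) if t != "1")
print(f"y = {model.intercept_:.4f} + {body}")
```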

The trained model can also be used to predict values for new input data, as follows:
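New inputs must go through the same polynomial expansion before calling predict; a sketch (the input values are illustrative only):

```python
def predict_values(model, poly, rows):
    # rows: a list of feature rows, e.g. [[9, 10], [11, 12]]
    return model.predict(poly.transform(rows))

print(predict_values(model, poly, [[9, 10]]))
```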

Invoking the test procedure:
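A small self-contained driver exercising the list-style input described above (the feature and target lists come from the example in the text; the degree is again an assumed 2):

```python
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# List-style input: features and targets given directly as Python lists
X_list = [[1, 2], [3, 4], [5, 6], [7, 8]]
y_list = [3, 7, 8, 9]

poly = PolynomialFeatures(degree=2)   # assumed degree
model = LinearRegression()
model.fit(poly.fit_transform(X_list), y_list)
print("R^2 on the training data:", model.score(poly.transform(X_list), y_list))
```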

Example of the test output:

Why choose SSE as the loss function?

$$\text{minimize}\ \sum_{\text{all training points}}(\text{actual}-\text{predicted})$$ (positive and negative errors cancel each other out)

$$\text{minimize}\ \sum_{\text{all training points}}|\text{actual}-\text{predicted}|$$ (the absolute value is not differentiable everywhere, so it is harder to optimize)

$$\text{minimize}\ \sum_{\text{all training points}}(\text{actual}-\text{predicted})^{2}$$ (smooth and differentiable, which is why SSE is the usual choice)

Drawbacks of SSE

The value of SSE grows with the amount of data, so by itself it does not reflect how good the fit is.

If we want to compare regression performance on two different data sets, we need the R² score (R-squared).
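In scikit-learn this comparison can be done with the r2_score metric; a sketch with illustrative numbers:

```python
from sklearn.metrics import r2_score

y_true = [3.0, 5.0, 7.0, 9.0]    # illustrative actual values
y_pred = [2.8, 5.3, 6.9, 9.2]    # illustrative predictions from some model
print(r2_score(y_true, y_pred))  # closer to 1 means more of the variance is explained
```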

What Is R-squared?

R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression.

The definition of R-squared is fairly straight-forward; it is the percentage of the response variable variation that is explained by a linear model. Or:

R-squared = Explained variation / Total variation

R-squared is always between 0 and 100%:

0% indicates that the model explains none of the variability of the response data around its mean.
100% indicates that the model explains all the variability of the response data around its mean.
In general, the higher the R-squared, the better the model fits your data. However, there are important caveats to this guideline, discussed below.

The Coefficient of Determination, r-squared

Here"s a plot illustrating a very weak relationship between y and x. There are two lines on the plot, a horizontal line placed at the average response, $bar{y}$, and a shallow-sloped estimated regression line, $hat{y}$. Note that the slope of the estimated regression line is not very steep, suggesting that as the predictor x increases, there is not much of a change in the average response y. Also, note that the data points do not "hug" the estimated regression line:


$$SSR=\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2=119.1$$

$$SSE=\sum_{i=1}^{n}(y_i-\hat{y}_i)^2=1708.5$$

$$SSTO=\sum_{i=1}^{n}(y_i-\bar{y})^2=1827.6$$

The calculations on the right of the plot show contrasting "sums of squares" values:

SSR is the "regression sum of squares" and quantifies how far the estimated sloped regression line, $\hat{y}_i$, is from the horizontal "no relationship line," the sample mean or $\bar{y}$.

SSE is the "error sum of squares" and quantifies how much the data points, $y_i$, vary around the estimated regression line, $\hat{y}_i$.

SSTO is the "total sum of squares" and quantifies how much the data points, $y_i$, vary around their mean, $\bar{y}$.

Note that SSTO = SSR + SSE. The sums of squares appear to tell the story pretty well. They tell us that most of the variation in the response y (SSTO = 1827.6) is just due to random variation (SSE = 1708.5), not due to the regression of y on x (SSR = 119.1). You might notice that SSR divided by SSTO is 119.1/1827.6 or 0.065. Do you see where this quantity appears on Minitab's fitted line plot?

Contrast the above example with the following one in which the plot illustrates a fairly convincing relationship between y and x. The slope of the estimated regression line is much steeper, suggesting that as the predictor x increases, there is a fairly substantial change (decrease) in the response y. And, here, the data points do "hug" the estimated regression line:

$$SSR=\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2=6679.3$$

$$SSE=\sum_{i=1}^{n}(y_i-\hat{y}_i)^2=1708.5$$

$$SSTO=\sum_{i=1}^{n}(y_i-\bar{y})^2=8487.8$$

The sums of squares for this data set tell a very different story, namely that most of the variation in the response y (SSTO = 8487.8) is due to the regression of y on x (SSR = 6679.3), not just due to random error (SSE = 1708.5). And, SSR divided by SSTO is 6679.3/8487.8 or 0.799, which again appears on Minitab's fitted line plot.

The previous two examples have suggested how we should define the measure formally. In short, the "coefficient of determination" or "r-squared value," denoted $r^2$, is the regression sum of squares divided by the total sum of squares. Alternatively, since SSTO = SSR + SSE, the quantity $r^2$ also equals one minus the ratio of the error sum of squares to the total sum of squares:

$$r^2=\frac{SSR}{SSTO}=1-\frac{SSE}{SSTO}$$

Here are some basic characteristics of the measure:

Since $r^2$ is a proportion, it is always a number between 0 and 1.

If $r^2$ = 1, all of the data points fall perfectly on the regression line. The predictor x accounts for all of the variation in y!

If $r^2$ = 0, the estimated regression line is perfectly horizontal. The predictor x accounts for none of the variation in y!

We"ve learned the interpretation for the two easy cases — when r2 = 0 or r2 = 1 — but, how do we interpret r2 when it is some number between 0 and 1, like 0.23 or 0.57, say? Here are two similar, yet slightly different, ways in which the coefficient of determination r2 can be interpreted. We say either:

$r^2$ ×100 percent of the variation in y is reduced by taking into account predictor x

or:

$r^2$ ×100 percent of the variation in y is "explained by" the variation in predictor x.

Many statisticians prefer the first interpretation. I tend to favor the second. The risk with using the second interpretation — and hence why "explained by" appears in quotes — is that it can be misunderstood as suggesting that the predictor x causes the change in the response y. Association is not causation. That is, just because a data set is characterized by having a large r-squared value, it does not imply that x causes the changes in y. As long as you keep the correct meaning in mind, it is fine to use the second interpretation. A variation on the second interpretation is to say, "$r^2$ ×100 percent of the variation in y is accounted for by the variation in predictor x."

Students often ask: "What's considered a large r-squared value?" It depends on the research area. Social scientists, who are often trying to learn something about the huge variation in human behavior, will tend to find it very hard to get r-squared values much above, say, 25% or 30%. Engineers, on the other hand, who tend to study more exact systems, would likely find an r-squared value of just 30% unacceptably low. The moral of the story is to read the literature to learn what typical r-squared values are for your research area!

Key Limitations of R-squared

R-squared cannot determine whether the coefficient estimates and predictions are biased, which is why you must assess the residual plots.

R-squared does not indicate whether a regression model is adequate. You can have a low R-squared value for a good model, or a high R-squared value for a model that does not fit the data!

The R-squared in your output is a biased estimate of the population R-squared.
