## Wednesday, July 23, 2014

### The Linear Regression Model y = a + bx

This is part of Mike's Regression and Model Fitting Tutorial

Linear regression of the form $$y = a + b x$$ is the typical "go to" regression method. It is taught in many basic statistics and non-statistical mathematics courses. There are a large number of problems where linear regression of the form $$y = a + bx$$ provides a correct answer and a large number of problems where it provides an acceptable answer. In future posts, we'll look at other models that may fit other data better.

By definition, the line $$y = a + b x$$ is a straight line with the following characteristics:

• The y axis intercept (the model evaluated at $$x = 0$$, in mathematical notation $$y = a + bx |_{x=0}$$) is equal to $$a$$.
• The x axis intercept (the model evaluated at $$y = 0$$ and solved for x, in mathematical notation $$y = a + bx|_{y=0}$$) is equal to $$\frac{-a}{b}$$, provided $$b \neq 0$$.
• The slope of the line is equal to $$b$$. This can be shown with either the typical "rise over run" or using the first derivative (there's really not a lot of distinction here, but I recognize that some readers may not have a calculus background).
• Let's look at the "rise over run" part first. To make the math easy, let's compare the change in y (call it $$\sigma$$) that results from some arbitrary change in x (call it $$\delta$$). Start with $$y = a + b x$$ and the shifted point $$y + \sigma = a + b(x + \delta)$$. Now let's look at the change: $$y + \sigma - y = a + b (x + \delta) - (a + bx )$$ Simplifying, we get $$\sigma = b \delta$$. The "rise over run" is equal to $$\frac{\sigma}{\delta} = b$$. To take this a step further, if $$\delta = 1$$, the change in y is exactly $$b$$.
• Going back to basic first semester calculus, this can also be shown using the first derivative (for non-calculus readers, the derivative measures how quickly y changes as x changes, which is the slope of the curve at a given point): $$\frac{d}{dx} (a + bx) = b$$
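
The three properties above can be checked numerically. The following is a minimal sketch in Python; the function names (`line`, `x_intercept`, `rise_over_run`) and the sample coefficients are illustrative assumptions and are not part of the original tutorial.

```python
def line(a, b, x):
    """Evaluate the model y = a + b*x."""
    return a + b * x


def x_intercept(a, b):
    """Solve 0 = a + b*x for x. Only defined when b is non-zero."""
    if b == 0:
        raise ValueError("a horizontal line has no single x axis intercept")
    return -a / b


def rise_over_run(a, b, x, delta):
    """Change in y divided by the change in x (delta must be non-zero)."""
    sigma = line(a, b, x + delta) - line(a, b, x)
    return sigma / delta


# Sample coefficients chosen purely for illustration.
a, b = 3.0, 2.0
print(line(a, b, 0))                  # y axis intercept, equal to a
print(x_intercept(a, b))              # x axis intercept, equal to -a/b
print(rise_over_run(a, b, 1.0, 0.5))  # slope, equal to b
```

Whatever starting point `x` and step `delta` are chosen, the last value should come out equal to `b`, which is the point of the "rise over run" argument.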

How do I calculate $$a$$ and $$b$$ for the line $$y = a + b x$$?

Let's get into the calculation of the $$a$$ and $$b$$ values for the $$y= a + b x$$ model. Firstly, we need to set up a matrix $$A$$  with the relevant transformations of our input data. We'll get to that in a minute below. First, let's answer the question "How do I find a line between 2 points in the $$(x,y)$$ plane?"

For a second, let's consider two of the points in our data: $$(x_1, y_1 )$$ and $$(x_2, y_2 )$$. If we use just these two points (with $$x_1 \neq x_2$$), we can calculate $$a$$ and $$b$$ directly. First, let's define a couple of equations:

$$y_1 = a + b x_1$$
$$y_2 = a + b x_2$$

Doing a little bit of reorganization, let's solve for b first:

$$y_2 - y_1 = a + b x_2 - (a + b x_1)$$

$$a$$ cancels out and the right side simplifies to  $$b x_2 - b x_1 = b ( x_2 - x_1 )$$. Solving for $$b$$:

$$b = \frac{y_2 - y_1}{x_2 - x_1}$$

Either equation can be used to solve for $$a$$. Using the first, $$a = y_1 - b x_1$$; using the second, $$a = y_2 - b x_2$$. Now, let's consider a matrix solution to the same problem. We'll set up matrices to solve the equation
$$A z = B$$

Here, let's define $$z$$ and $$B$$. $$z$$ contains our unknowns, namely $$a$$ and $$b$$. $$B$$ contains our y values, namely $$y_1$$ and $$y_2$$.

$$z = \begin{pmatrix} a \\ b \end {pmatrix} \quad \quad B = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}$$

Let's take a little bit of extra time to talk about $$A$$. Each column in $$A$$ has to be a function of the data $$x_i$$. Let's go back to our original equations and rewrite them slightly...

$$y_1 = a + b x_1 \iff y_1 = a \mathbf{x_1^0} + b x_1^{\mathbf{1}}$$
$$y_2 = a + b x_2 \iff y_2 = a \mathbf{x_2^0} + b x_2^{\mathbf{1}}$$

We know generally that almost anything raised to the "zero" power is equal to 1. Anything raised to the first power is equal to itself. Let's put our rewritten equations into their equivalent matrix format:

$$A = \begin{pmatrix} x_1^0 & x_1^1 \\ x_2^0 & x_2^1 \end{pmatrix} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \end{pmatrix}$$

Our resulting matrix $$A$$ contains all of the data. The first column contains the data raised to the 0 power and the second column contains the data raised to the first power. Let's write our system of equations out in matrix form:

$$A z = B$$
$$\begin{pmatrix} 1 & x_1 \\ 1 & x_2 \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}$$

This is a 2x2 system, so we can look up the inverse of a 2x2 matrix on my post here: http://mikemstech.blogspot.com/2014/07/inverse-of-2x2-matrix.html. We'll use the fact that

$$(A^{-1}) A z = (A^{-1}) B$$
$$z = (A^{-1}) B$$

Calculating $$A^{-1} B$$ yields

$$\begin{pmatrix} \frac{y_1 x_2 - y_2 x_1}{ x_2 - x_1 } \\ \frac{y_2 - y_1}{x_2-x_1} \end{pmatrix}$$

Namely, $$a = \frac{y_1 x_2 - y_2 x_1}{ x_2 - x_1 }$$ and $$b = \frac{y_2 - y_1}{x_2-x_1}$$ for our two point example. A fair amount of algebra can be used to show the equivalence of the answers above and the answers to the matrix equations for $$a$$ ( $$b$$ is the same with either approach).
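
To see that the direct solution and the matrix solution agree, here is a small Python sketch that computes $$a$$ and $$b$$ both ways. The function names are hypothetical, and the two sample points are simply picked for illustration.

```python
def line_through_two_points(p1, p2):
    """Direct solution: b from the difference quotient, then a = y1 - b*x1."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("x1 and x2 must differ")
    b = (y2 - y1) / (x2 - x1)
    a = y1 - b * x1
    return a, b


def line_via_matrix_inverse(p1, p2):
    """Matrix solution z = A^{-1} B with A = [[1, x1], [1, x2]]."""
    (x1, y1), (x2, y2) = p1, p2
    det = x2 - x1  # determinant of A
    if det == 0:
        raise ValueError("A is singular when x1 == x2")
    # A^{-1} = (1/det) * [[x2, -x1], [-1, 1]], multiplied by B = [y1, y2]
    a = (x2 * y1 - x1 * y2) / det
    b = (y2 - y1) / det
    return a, b


p1, p2 = (1, 5), (3, 9)  # sample points, chosen for illustration
print(line_through_two_points(p1, p2))
print(line_via_matrix_inverse(p1, p2))
```

Both calls should print the same pair, which is the numerical counterpart of the algebra mentioned above.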

How to find $$a$$ and $$b$$ with more than two points.

We used the two point example as a conceptual introduction to how we set up the matrices, and now we want to consider the case with more than 2 points. We set up our system of equations using the least squares approach (minimizing the total sum of squared error for the model generated).

$$A^T A z = A^T B$$

In this case,
$$A = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \quad z = \begin{pmatrix} a \\ b \end{pmatrix} \quad B = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}$$

Now for the calculation of $$A^T A$$ and $$A^T B$$

$$A^T A = \begin{pmatrix} \sum \limits _{i = 1}^n 1 & \sum \limits _{i=1}^n x_i \\ \sum \limits _{i=1}^n x_i & \sum \limits _{i=1}^n x_i^2 \end{pmatrix} = \begin{pmatrix} n & \sum \limits _{i=1}^n x_i \\ \sum \limits _{i=1}^n x_i & \sum \limits _{i=1}^n x_i^2 \end{pmatrix} \quad \quad A^T B = \begin{pmatrix} \sum \limits _{i=1}^n y_i \\ \sum \limits _{i=1}^n x_i \cdot y_i \end{pmatrix}$$

This is a 2x2 system, so we can look up the inverse of a 2x2 matrix on my post here: http://mikemstech.blogspot.com/2014/07/inverse-of-2x2-matrix.html. Again, we'll use the following:

$$(A^TA)^{-1} A^TA z = (A^TA)^{-1} A^T B$$
$$z = (A^TA)^{-1} A^T B$$

Finding the inverse of $$A^T A$$ yields

$$(A^T A)^{-1} = \frac { 1 } { n \sum \limits _{i=1}^n x_i^2 - \left ( \sum \limits_{i=1}^n x_i \right )^2 } \begin{bmatrix} \sum \limits_{i=1}^n x_i^2 & -1 \cdot \sum \limits _{i=1}^n x_i \\ -1 \cdot \sum \limits _{i=1}^n x_i & n \end{bmatrix}$$

Calculating $$(A^T A)^{-1} A^T B$$ yields

$$(A^T A)^{-1} A^T B = \begin{pmatrix} \frac{\sum \limits_{i=1}^n x_i^2 \sum \limits_{i=1}^n y_i - \sum \limits_{i=1}^n x_i \sum \limits_{i=1}^n x_i y_i}{n \sum \limits _{i=1}^n x_i^2 - \left ( \sum \limits_{i=1}^n x_i \right )^2 } \\ \frac{n \sum \limits_{i=1}^n x_i y_i - \sum \limits_{i=1}^n y_i \sum \limits_{i=1}^n x_i }{n \sum \limits _{i=1}^n x_i^2 - \left ( \sum \limits_{i=1}^n x_i \right )^2 } \end{pmatrix}$$

So, for the regression model $$y = a + bx$$
$$a = \frac{\sum \limits_{i=1}^n x_i^2 \sum \limits_{i=1}^n y_i - \sum \limits_{i=1}^n x_i \sum \limits_{i=1}^n x_i y_i}{n \sum \limits _{i=1}^n x_i^2 - \left ( \sum \limits_{i=1}^n x_i \right )^2 } \quad \quad b = \frac{n \sum \limits_{i=1}^n x_i y_i - \sum \limits_{i=1}^n y_i \sum \limits_{i=1}^n x_i }{n \sum \limits _{i=1}^n x_i^2 - \left ( \sum \limits_{i=1}^n x_i \right )^2 }$$
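
The two closed-form expressions translate almost directly into code. The sketch below is one possible Python rendering; the function name `fit_line` and the sample data are assumptions made for illustration only.

```python
def fit_line(xs, ys):
    """Least squares fit of y = a + b*x using the closed-form sums."""
    n = len(xs)
    if n != len(ys):
        raise ValueError("xs and ys must have the same length")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    # Shared denominator: n * sum(x^2) - (sum(x))^2
    denom = n * sum_xx - sum_x ** 2
    if denom == 0:
        raise ValueError("need at least two distinct x values")

    a = (sum_xx * sum_y - sum_x * sum_xy) / denom
    b = (n * sum_xy - sum_y * sum_x) / denom
    return a, b


# Made-up data that does not lie exactly on a line.
xs = [0, 1, 2, 3]
ys = [1, 3, 4, 8]
a, b = fit_line(xs, ys)
print("a =", a, "b =", b)
```

The denominator is zero exactly when all of the $$x_i$$ are equal (or there are fewer than two points), which is the case where $$A^T A$$ has no inverse.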

Example: Calculate A Regression Line for 3 Collinear Points

Problem statement: Calculate a line in the form $$y = a + b x$$ that goes through the points $$(1,5),(2,7),(3,9)$$.

We derived the formula above, so now we need to focus on calculation.

$$a = \frac{\sum \limits_{i=1}^n x_i^2 \sum \limits_{i=1}^n y_i - \sum \limits_{i=1}^n x_i \sum \limits_{i=1}^n x_i y_i}{n \sum \limits _{i=1}^n x_i^2 - \left ( \sum \limits_{i=1}^n x_i \right )^2 } \quad \quad b = \frac{n \sum \limits_{i=1}^n x_i y_i - \sum \limits_{i=1}^n y_i \sum \limits_{i=1}^n x_i }{n \sum \limits _{i=1}^n x_i^2 - \left ( \sum \limits_{i=1}^n x_i \right )^2 }$$

If calculating by hand, the easiest way is to organize the calculations in a table.

| Point | $$x_i$$ | $$y_i$$ | $$x_i^2$$ | $$x_i y_i$$ |
|---|---|---|---|---|
| $$(1,5)$$ | 1 | 5 | 1 | 5 |
| $$(2,7)$$ | 2 | 7 | 4 | 14 |
| $$(3,9)$$ | 3 | 9 | 9 | 27 |
| | $$\sum \limits_{i=1}^3 x_i = 1 + 2 + 3 = 6$$ | $$\sum \limits_{i=1}^3 y_i = 5 + 7 + 9 = 21$$ | $$\sum \limits_{i=1}^3 x_i^2 = 1 + 4 + 9 = 14$$ | $$\sum \limits_{i=1}^3 x_i y_i = 5 + 14 + 27 = 46$$ |

Now, for the calculation of $$a$$

$$a = \frac{\sum \limits_{i=1}^n x_i^2 \sum \limits_{i=1}^n y_i - \sum \limits_{i=1}^n x_i \sum \limits_{i=1}^n x_i y_i}{n \sum \limits _{i=1}^n x_i^2 - \left ( \sum \limits_{i=1}^n x_i \right )^2 } = \frac{ 14 \cdot 21 - 6 \cdot 46 }{ 3 \cdot 14 - 6^2 } = \frac{18}{6} = 3$$

Now, for the calculation of $$b$$

$$b = \frac{n \sum \limits_{i=1}^n x_i y_i - \sum \limits_{i=1}^n y_i \sum \limits_{i=1}^n x_i }{n \sum \limits _{i=1}^n x_i^2 - \left ( \sum \limits_{i=1}^n x_i \right )^2 } = \frac{3 \cdot 46 - 21 \cdot 6}{3 \cdot 14 - 6^2 } = \frac{12}{6} = 2$$

The final solution for $$y= a + b x$$ that fits these three points is $$y = 3 + 2 x$$. Because the three points are collinear, the line passes through each of them exactly.
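
The same example can be worked through the matrix route, $$A^T A z = A^T B$$, as a cross-check on the hand calculation. This is only a sketch: the helper names `normal_equations` and `solve_2x2` are hypothetical, and Python's `fractions.Fraction` is used so the arithmetic stays exact.

```python
from fractions import Fraction


def normal_equations(points):
    """Build A^T A (2x2) and A^T B (length 2) for the model y = a + b*x."""
    A = [[Fraction(1), Fraction(x)] for x, _ in points]
    B = [Fraction(y) for _, y in points]
    ata = [[sum(row[i] * row[j] for row in A) for j in range(2)]
           for i in range(2)]
    atb = [sum(row[i] * y for row, y in zip(A, B)) for i in range(2)]
    return ata, atb


def solve_2x2(m, v):
    """Solve m z = v using the explicit inverse of a 2x2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if det == 0:
        raise ValueError("matrix is singular")
    inv = [[m[1][1] / det, -m[0][1] / det],
           [-m[1][0] / det, m[0][0] / det]]
    return [inv[0][0] * v[0] + inv[0][1] * v[1],
            inv[1][0] * v[0] + inv[1][1] * v[1]]


points = [(1, 5), (2, 7), (3, 9)]
ata, atb = normal_equations(points)
a, b = solve_2x2(ata, atb)
print(a, b)  # prints "3 2"

# Every point should sit exactly on the fitted line.
print(all(a + b * x == y for x, y in points))  # prints "True"
```

The entries of `ata` and `atb` are the same sums that appear in the table: 3, 6, 14 and 21, 46.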

## Sunday, July 20, 2014

### Mike's Regression and Model Fitting Tutorial

So, I noticed on a brief look around the Internet that there are not a lot of good, multi-platform tutorials for regression and model fitting... particularly examples in the tools that most people have access to in the workplace. In this tutorial series I will present explanations and derivations of the models and procedures for fitting models of various levels of complexity in various applications and programming languages. The applications will include the following: Microsoft Excel, OpenOffice.org, Office 365, and Google Spreadsheets. The programming languages will include C#.Net, VB.Net, Java, SQL, and R. Over time I may expand to other applications and programming languages.

As a special note, the focus of this tutorial series is on presenting methods to implement linear regression in an application. In some cases, there may be a built in way to calculate a model in the application, but the focus of the posts is using an application to implement the model directly (since in some cases you may need to tweak the model for your specific data set, and in my experience the built-in features of an application don't usually support model customization).

Below is a list of models that I've developed tutorials for:
• Linear Models
  • $$y = a$$ (sometimes called the "null" model)
    • Understanding the model for $$y = a$$
    • How to calculate $$y=a$$ models in Microsoft Excel
    • How to calculate $$y=a$$ models in OpenOffice.org
    • How to calculate $$y=a$$ models in Google Spreadsheets
    • How to calculate $$y=a$$ models in Office 365
    • How to calculate $$y=a$$ models in C#.Net
    • How to calculate $$y=a$$ models in VB.Net
    • How to calculate $$y=a$$ models in Java
    • How to calculate $$y=a$$ models in SQL
    • How to calculate $$y=a$$ models in R
  • $$y = a + bx$$
    • Understanding the model for $$y = a + bx$$
    • How to perform linear regression in Microsoft Excel
    • How to perform linear regression in OpenOffice.org
    • How to perform linear regression in Office 365
    • How to perform linear regression in C#.Net
    • How to perform linear regression in VB.Net
    • How to perform linear regression in Java
    • How to perform linear regression in SQL
    • How to perform linear regression in R
  • $$y = a + b x + c x^2$$
    • Understanding the model for $$y = a + b x + c x^2$$
    • How to perform quadratic regression in Microsoft Excel
    • How to perform quadratic regression in OpenOffice.org
    • How to perform quadratic regression in Office 365
    • How to perform quadratic regression in C#.Net
    • How to perform quadratic regression in VB.Net
    • How to perform quadratic regression in Java
    • How to perform quadratic regression in SQL
    • How to perform quadratic regression in R

### The Linear Model y = a

This is part of Mike's Regression and Model Fitting Tutorial.

The simplest linear model that I am going to discuss in this series is the model $$y = a$$. By the end of this post, I hope you'll walk away knowing what the model represents and how it is often used, and that you'll be able to answer the following questions:
• What is the "null" model?
• How is the "null" model used?
• Why is the solution to the linear model $$y = a$$ equal to the mean?

For a minute, let's consider a scenario where we have a set of interval or ratio data that we want to learn more about. In this scenario we are attempting to describe the relationship in the data that we have collected. We may or may not be interested in predicting other values. We are essentially asking the question "Why are things the way that they are?" and may be asking the question "How might things be if we collect more data?" As we work through our model, we will be attempting to describe the relationship between dependent and independent variables.

If you're new to statistics, the words in the previous paragraph might not have a lot of meaning. Let's look at them in detail...

Interval/Ratio Data - The words "Interval" and "Ratio" refer to the measurement scale of the data in question. Most of the models in the tutorial will require an interval/ratio dependent variable. Some clues that help us end up at a description of this measurement level include:

• "Zero on the scale" - the measurement scale typically has a 0 at some point. For ratio data, there is an absolute "zero" to the scale (i.e. it would be impossible for a collected data point to fall below "zero").
• Sub-interval equality - The distance between subsequent points on the scale is equal. For example, we would generally consider the difference between 1 degree F and 2 degrees F to be equivalent to the difference between 3 degrees F and 4 degrees F.
Console.WriteLine("\\begin{equation*}");
Console.WriteLine(B[i]);
Console.WriteLine("\\end{equation*}");

}

}

//End of file
Console.WriteLine("\\end{document}");

}

public string GetCombinedArrays()
{
    // Build the LaTeX for the augmented matrix ( A | A_inverse | B ).
    StringBuilder output = new StringBuilder();

    output.Append("\\left ( \\begin{array}{" + ArrayFormat + "}");

    for (int i = 0; i < rank; i++)
    {
        // Separate rows with the LaTeX row break.
        if (i > 0) { output.AppendLine("\\\\"); }

        // Columns of A.
        for (int j = 0; j < rank; j++)
        {
            if (j > 0) output.Append("&");
            output.AppendLine(A[i, j]);
        }

        // Columns of the inverse.
        for (int j = 0; j < rank; j++)
        {
            output.Append("&");
            output.AppendLine(A_inverse[i, j]);
        }

        // Entry of the solution vector for this row.
        output.Append("&");
        output.Append(B[i]);
    }

    output.Append("\\end{array} \\right )");
    return output.ToString();
}

public string NtoC(int input)
{
    return NtoC(input.ToString());
}

// Map each decimal digit to a letter (0 -> a, 1 -> b, ..., 9 -> j).
public string NtoC(string input)
{
    return input.Replace("0", "a")
                .Replace("1", "b")
                .Replace("2", "c")
                .Replace("3", "d")
                .Replace("4", "e")
                .Replace("5", "f")
                .Replace("6", "g")
                .Replace("7", "h")
                .Replace("8", "i")
                .Replace("9", "j");
}

public static void PrintUsage()
{
Console.WriteLine(@"
LatexMatrixInverse C#.Net Edition 1.0
Developed by Mike Burr
This application is provided without any warranties, express or implied.

Usage: LatexMatrixInverse <rank>

<rank>: The rank of the matrix. Also the number of the diagonal elements in a square matrix.
ex. A 2x2 matrix has rank 2, a 3x3 matrix has rank 3, etc...

The latex code for the matrix inverse and corresponding solution vector
of a matrix with given rank will be written to standard out along with supporting work
code (ex. row transformations and additions). The end user should then adjust the matrix
values in the resulting latex code to get the desired result for their specific problem.

The matrix inversion is performed using Gauss-Jordan elimination.

Note that singular matrices still won't be invertible, but matrices with inverses
should be invertible using the latex code generated with the occasional initial
row substitution. Simplification of the resulting formulas is the end user's
responsibility.

A recommended use of the application would be to redirect the output to a file.
ex. The inverse for a 3x3 matrix written to m3x3.tex:

LatexMatrixInverse 3 > m3x3.tex
");
}
}
}

-->