## Sunday, July 20, 2014

### The Linear Model y = a

This is part of Mike's Big Data, Data Mining, and Analytics Tutorial

The simplest linear model that I am going to discuss in this series is the model $y = a$. By the end of this post, I hope you'll know what the model represents and how it is often used, and that you'll be able to answer the following questions:
• What is the "null" model?
• How is the "null" model used?
• Why is the solution to the linear model $y = a$ equal to the mean?
For a minute, let's consider a scenario where we have a set of interval or ratio data that we want to learn more about. In this scenario we are attempting to describe the relationship in the data that we have collected. We may or may not be interested in predicting other values. We are essentially asking the question "Why are things the way that they are?" and may be asking the question "How might things be if we collect more data?" As we work through our model, we will be attempting to describe the relationship between dependent and independent variables.

If you're new to statistics, the words in the previous paragraph might not have a lot of meaning. Let's look at them in detail...

Interval/Ratio Data - The words "Interval" and "Ratio" refer to the measurement scale of the data in question. Most of the models in the tutorial will require an interval/ratio dependent variable. Some clues that help us end up at a description of this measurement level include:

• "Zero on the scale" - the measurement scale typically has a 0 at some point. For ratio data, there is an absolute "zero" to the scale (i.e. it would be impossible for a collected data point to fall below "zero").
• Sub-interval equality - The distance between subsequent points on the scale is equal. For example, we would generally consider the difference between 1 degree F and 2 degrees F to be equivalent to the difference between 3 degrees F and 4 degrees F.
• "Lots" of possible values on the scale. For example, in a bank account, we could have virtually any amount above or below zero, probably down to a resolution of \$0.01 (for accounts denominated in the U.S. Dollar)

Some examples of interval/ratio data include:

• Temperature: degrees of temperature are typically considered interval or ratio. Scales such as Celsius and Fahrenheit are typically considered interval (they have a "zero" on the scale, but it is not an absolute minimum, so values below zero are possible). Scales such as Kelvin and Rankine are ratio scales: they have an "absolute" zero, meaning there can't reasonably be values less than 0 on these scales.
• Measurements of mass/weight/volume: Measurements of mass, weight, and volume typically have interval/ratio properties.
• Measurements of economic value: Typical measurements of economic value are quoted in a currency or an amount of a good.
Some examples of things that aren't interval/ratio:

• Demographic variables such as religion, sexual preference, and gender: These can't be given non-arbitrary numeric values. Additionally, the numeric "differences" between the codes applied to these categories have no meaning. For example, say Republican is coded as a 1, Democrat is coded as a 2, and Independent is coded as a 3. The difference between 1 and 2 has no meaningful interpretation, much less a comparison of that difference with the difference between 2 and 3. These are sometimes referred to as "nominal" or "categorical" data.
• Ratings and rankings: These typically arise from surveys/questionnaires. These might take the form of the common "agree/disagree" and "satisfied/dissatisfied" scales. In this case, we can say that certain values are different (and potentially less than or greater than) other values, but we don't have homogeneity in the scale. For example, someone responding "agree" to a question might not be the same as someone else responding "agree," but after repeated measurements of the same person we could probably conclude that "strongly agree" is larger in magnitude than "somewhat agree" or just "agree." These are sometimes referred to as "ordinal" data.
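To make the point about arbitrary codes concrete, here is a minimal Python sketch. The responses and the second coding scheme are made up for illustration; only the first coding (1, 2, 3) comes from the example above. It shows that the "average" of nominal codes changes when the labels are renumbered, which is why an average only makes sense for interval/ratio data.

```python
# Hypothetical survey responses (made-up data for illustration only).
responses = ["Republican", "Democrat", "Independent", "Democrat"]

# The coding used in the text, plus an equally valid (hypothetical) renumbering.
coding_a = {"Republican": 1, "Democrat": 2, "Independent": 3}
coding_b = {"Republican": 3, "Democrat": 1, "Independent": 2}

def mean_code(values, coding):
    """Average the numeric codes assigned to a list of category labels."""
    codes = [coding[v] for v in values]
    return sum(codes) / len(codes)

mean_a = mean_code(responses, coding_a)  # (1 + 2 + 3 + 2) / 4
mean_b = mean_code(responses, coding_b)  # (3 + 1 + 2 + 1) / 4

# Same people, same answers, different "averages": the number is meaningless.
print(mean_a, mean_b)
```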
Relationships in Data/Prediction:

We don't generally set off on a course of research without some sort of purpose. This purpose is typically to understand or optimize something. We might be asking questions of the form "Is there a relationship between X and Y?" or "Are A and B correlated?" or "Can I use C or D (or some combination) to predict E?" The purpose of each of these questions is to help us understand relationships in the real world based on data collected from our observations.

We should be careful to remember that correlation does not imply causation. When we develop predictive models, we will be careful not to say things like "A is causing B" because we probably don't have sufficient basis to conclude such a thing... To be able to describe causation, we would likely need to step away from regression/classification and enter the realms of experimental design.

Dependent/Independent Variables

For our purposes, the dependent variable is the variable we are interested in describing. The independent variables are the variables that we could potentially use to describe the dependent variable.
Examples:

• If we are predicting asset prices (ex. stocks, bonds, commodities), our dependent variable might be "price in the future", and we might be using independent variables such as "price in the past," economic variables, financial data, etc.
• If we are predicting a student's "GPA in a course" (dependent variable), we might use the following as independent variables: "hours spent per week," "percentage of lectures attended," scores on standardized tests such as the SAT, ACT, GED, GRE, MCAT, etc.
So now, let's dig into the "null model."

At a basic level, the null model is a model that minimizes the error when describing the dependent variable with a single number. Effectively, the model is the horizontal (or vertical) line that minimizes the overall error, measured as the sum of squared distances from the line to each of the points. Below is a graphical depiction of the null model (generated in R):

The null model represents the best guess that we could use to describe the data if we didn't have or didn't use any of the possible independent variables to describe the dependent variable (y in the chart above).

Intuitively, we might conclude that this is the mean (or average) of the dependent variable. We can prove this below.
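Before the proof, a quick numerical check can build some confidence in that intuition. The sketch below is only an illustration: the data values are hypothetical, the "error" is assumed to be the sum of squared distances, and the search is a crude grid over candidate constants rather than anything clever.

```python
# Hypothetical sample data; any list of numbers would do.
y = [2.0, 4.0, 9.0]

def sum_squared_error(a, values):
    """Sum of squared distances from the horizontal line y = a to each point."""
    return sum((v - a) ** 2 for v in values)

# Try candidate constants from 0.00 to 10.00 in steps of 0.01.
candidates = [i / 100 for i in range(0, 1001)]
best_a = min(candidates, key=lambda a: sum_squared_error(a, y))

mean_y = sum(y) / len(y)

# The best constant found by brute force coincides with the mean.
print(best_a, mean_y)
```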

Without sinking too much into the proofs involved in linear algebra, let's first state the (provable) assumption that we can develop a "minimum error" solution to our system of equations by solving the linear system:

$$A^T A x = A^T B$$

Here, $A$ is a matrix that contains a transformation of the independent variables, $B$ is a matrix containing all of the dependent variable values, and $x$ holds the unknown coefficients (for this model, the single value $a$).

$$A = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} \quad B = \begin{pmatrix}y_1 \\ \vdots \\ y_n \end {pmatrix}$$

In this case, we don't have (or aren't using) any of the independent variable data to describe the dependent variable. Now let's calculate two of the items needed in the first equation:

$$A^T A = \sum_{i=1}^n 1 = n \quad \quad A^T B = \sum_{i =1}^n 1 \cdot y_i$$

So... $A^T A$ is simply the sample size and $A^T B$ is the sum of the dependent variable observations. If we take the inverse of $A^T A$ and multiply both sides by it, we end up with our answer for $a$. In this case, $A^T A$ is a 1x1 matrix. I show how to find the inverse here: http://mikemstech.blogspot.com/2014/07/inverse-of-1x1-matrix.html

$$(A^T A) ^{-1} (A^T A) x = (A^T A)^{-1} (A^T B)$$
$$x = (A^T A)^{-1} (A^T B)$$

By the matrix inversion formula for the 1x1 matrix (linked above), we see that

$$(A^T A)^{-1} (A^T B) = \frac{ \sum_{i=1}^n y_i}{n} = \bar{y}$$

Here we see that the answer to the null model is just the average of the dependent variable values.
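
As a cross-check, the same answer falls out of a short calculus argument. This assumes the error being minimized is the sum of squared deviations (the criterion behind the linear system above); the symbol $S(a)$ is introduced here only for illustration.

$$S(a) = \sum_{i=1}^n (y_i - a)^2$$

$$\frac{dS}{da} = -2 \sum_{i=1}^n (y_i - a) = 0 \quad \Rightarrow \quad n a = \sum_{i=1}^n y_i \quad \Rightarrow \quad a = \frac{1}{n} \sum_{i=1}^n y_i$$

Because $\frac{d^2 S}{da^2} = 2n > 0$, this stationary point is a minimum, so the error-minimizing constant is the mean of the dependent variable values.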

Let's do a small example. Say that we have the following data:
$$B = \begin{pmatrix} y_1 = 0 \\ y_2 = 5 \\ y_3 = 10 \end{pmatrix}$$

Let's determine the null model for this data.

$$A = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}$$

$$A^T A = \sum_{i=1}^{3} 1 = 3 \quad \quad A^T B = 1 \cdot 0 + 1 \cdot 5 + 1 \cdot 10 = 15$$

$$(A^T A)^{-1} = \frac { 1} {3}$$

$$(A^T A)^{-1} A^T B = \frac { 15} {3} = 5$$

Here the null model (average) for the data is y = 5.
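
The worked example can be reproduced in a few lines of plain Python. This is a minimal sketch, not a general least-squares solver: because $A$ has a single column of ones, $A^T A$ and $A^T B$ collapse to ordinary sums, and "inverting" the 1x1 matrix is just a division. The variable names `AtA`, `AtB`, and `a` are chosen here for readability.

```python
# Dependent variable values from the worked example.
B = [0.0, 5.0, 10.0]

# Design "matrix" for the null model: a single column of ones.
A = [1.0] * len(B)

# With one column, the matrix products reduce to sums.
AtA = sum(a_i * a_i for a_i in A)                # equals the sample size n
AtB = sum(a_i * y_i for a_i, y_i in zip(A, B))   # equals the sum of the y values

# Multiplying by the inverse of a 1x1 matrix is the same as dividing by its entry.
a = AtB / AtA

print(AtA, AtB, a)
```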

The null model, besides being an interesting conceptual introduction, is also important for other practical reasons. When we evaluate other models, we will typically determine their goodness of fit based on a comparison with the null model for the same data. We are conceptually asking the question "Does our 'complicated' model describe the dependent variable better than the average?" There are tools that we can use to answer this question that will be discussed with the other models. Here it is also useful to initially define the total sum of squares.

The total sum of squares (TSS) for the model is the sum of the squared deviations from the average.

$$TSS = \sum_{i=1}^n (y_i - \bar{y})^2$$

This is also equal to

$$TSS = \sum_{i=1}^n (\bar{y} - y_i)^2$$
As an example, let's consider our data from above and calculate the TSS.

$$TSS = \sum_{i=1}^n (y_i - \bar{y})^2 = (0 - 5)^2 + (5-5)^2 + (10-5)^2 = 50$$

The ending value of 50 doesn't tell us a lot until we compare it with another model (normally to see if the other model "explains" more of the variability in the dependent variable).
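
The TSS calculation is easy to check in code. The sketch below uses the same three data points; the helper name `sse_constant` and the alternative guesses 4 and 6 are hypothetical, added only to show that any constant other than the average gives a larger sum of squares.

```python
# Data from the example above.
y = [0.0, 5.0, 10.0]
y_bar = sum(y) / len(y)

def sse_constant(a, values):
    """Sum of squared deviations of the values from the constant a."""
    return sum((v - a) ** 2 for v in values)

# TSS is the sum of squared deviations from the average.
tss = sse_constant(y_bar, y)

# Flipping the order of subtraction does not change the result.
tss_flipped = sum((y_bar - v) ** 2 for v in y)

print(tss, tss_flipped)

# Hypothetical alternative constants do worse than the average.
for guess in (4.0, 6.0):
    print(guess, sse_constant(guess, y))
```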
