Heteroskedasticity

Definition

The definition of heteroskedasticity is the circumstance in which the variability of a variable is unequal across the range of values of a second variable that predicts it when the theory suggests there should be no such variation.

Introduction

Heteroskedasticity is best demonstrated with an example:

These graphs show us, on the left, the scatterplot of the lot or plot or site area of a property and its selling price in AUD millions. The right hand graph shows the independent, X, variable against the scatter plot of the residuals of the model derived from the data set. This is a basic ordinary least squaes (OLS) demonstration and there is nothing difficult about it. However, consider the following, revised version, of the residuals plot:

Source: Excellent review of Heteroskedasticity

This residual plot shows what is called the heteroskedastic fan and in turn that shows us that as the lot size increases the residual or prediction error of our model tends to increase too. According to the definition of heteeroskedasticity, there should be no such pattern, rather, residuals should look more like this:

Source: Excellent review of Heteroskedasticity

Where we see here is a more or less random distribution of residuals and that is what we want! Please note, to get he results shown in this third diagram, we need to transform our data by logarithms and I will demonstrate that shortly.

In the next section, we will discuss the ways in which we might eliminate the problem of heteroskedasticity.

Removing Heteroskedasticity

In this post I am demonstrating these methods of dealing with heteroskedasticity:

    Breusch-Pagan Technique
    Weighted Least Squares
    Logarithmic Transformation

I do not have the data that is use on the YouTube channel that I cited above so I created my own data set from an Australian property site and I use that data set for all demonstrations. You can download my spreadsheet from the link at the bottom of the page to follow along with me.

    Breusch-Pagan Technique

The technique is the most involved of the three techniques discussed here: not difficult, just more involved. We do the following.

Open your Excel file with the data already present or, of course, add it!

Carry out the OLS analysis of the data using the Data Analysis Toolpak to obtain the following:

We can see from the residuals plot, even though my data set is only 33 rows deep, that there is a heteroskedastic fan there. We need to prove that what appears to be true, is true: the model is showing heteroskedasticity. Run the Breush-Pagan test like this:

Add a column Residuals Squared to the table of Predictions and Residuals you can see in the range E44:F76 and onwards: in G43 the formula is, simply, =F43^2 which we copy down to the end of the table, G76:

Create a new regression model using the Toolpak by using G43:G76 as the Y variable and A4:B37 as the X variables. The other settings are as usual, as you can see in my file.

That will give you the following output:

Finally, we add this test statistic formula to cell N64=COUNT(J66:J98)*J45 and this p Value formula to cell N65=CHISQ.DIST.RT(N64,2) which are the test statistics and p Values for this revised model and in this case, our conclusion confirms our earlier finding :

Because this p-value is less than 0.05, we reject the null hypothesis and conclude that there is heteroscedasticity present in the original regression model.

    Weighted Least Squares

By comparison, the Weighted least squares method, WLS, is much more straightforward than the Breusch-Pagan method.

We leave the input data alone and add two columns, as shown below:

The formulas to use are C5=A5/SQRT($B5) and D5=B5/SQRT($B5). That is the Selling Price and the Land Areas divided by the square root of the Land Area, the X Variable. We now run the regression model with Selling Prices as the Y variable and columns B:D as the X variables. You should see this:


as well as these graphs:

We already have the original residuals plot from diagram 4, above and we can see from these three new graphs that the heteroskasticity is not so evident. We can probably claim to have eliminated it by using WLS.

I should conclude this section by saying that there are other, more complex WLS methods and if you need something more advanced that what you see here, please search for them.

    Logarithmic Transformation

The previous two techniques are probably new to you but logarithmic transformations might not be: the methodology is simple, take a look:

Create two new columns, D and E by entering the following formulas D5=LN(A5) and E5=LN(B5). Fill down those formulas to D37 and E37 respectively. You can see the graphs: firstly the logarithmic graph with an almost straight set of points as opposed to the scatterplot of the original data, which we have already seen, showing wide dispersion of data. This is exactly what we wanted to see: almost a straight line!

We now run the regression method on these new columns to see:

And here we see the original residuals plot again, starting in cell W18 and our LN residuals plot, starting in cell W5: this LN plot conforms almost exactly to the dream output we saw in diagram 3, with randomly distributed points.

The log-log model seems to be perfectly ueful!

Conclusions

Heteroskedasticity is a serious problem for statisticians and anyone creating regression models. If we don't find a solution to the problem, any model that we accept and use will provide inadequate, wrong, results and we do not want that.

Use any of these three methods we have discussed here and feel more confident of your analysis.


Download my Excel file ... sydney_property_incl_heterosked.xlsx


Duncan Williamson

16th November 2020 Revised 4th May 2021

No comments: