Linear Regression Pt. 2

In my last blog, I began to discuss Linear Regression using weight/height data. Linear regression is among the simplest models to learn and consequently is usually the first to be taught in a machine learning curriculum. Because of its simplicity, regression analysis has a wide variety of applications in scientific disciplines and in business.

What is Linear Regression? At its simplest, it is used to predict the value of a variable, called the target, from the value of one or more other variables, called predictors (also called features, or independent variables). If we use one variable to explain or predict the target, the model is called a simple linear regression. If we use more than one variable, it is called a multiple linear regression. For example, using the data I introduced last time, the kind of question linear regression can address is: given sample data of weight and height, can we predict the weight given the height?

Regression models work when there is a strong relationship between the predictors and the target. To perform a linear regression, there are four critical assumptions your data will need to fulfill:

  • Linearity: There is a Linear Relationship
  • Normality: Error is Normally Distributed
  • Homoscedasticity: The Variance is Homogeneous
  • Independence: The Features are Independent of Each Other

Let's look at each of these.

Linearity means that there is a linear relationship between the target variable and the feature(s). When the value of the feature changes, the target variable changes at a roughly constant rate (increasing for a positive relationship, decreasing for a negative one); when you plot the data, the points fall roughly along a straight line. You can usually judge this by eye with a scatterplot. The following scatterplots demonstrate data sets with different degrees of linearity:

The Figure is taken from Gene Sprechini: http://lycofs01.lycoming.edu/~sprgene/M123/Text/UNIT_09.pdf
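To get a feel for what "different degrees of linearity" means in numbers, here is a minimal sketch using synthetic data (not the weight/height data set). The underlying line y = 2x + 1, the noise levels, and the helper name pearson_r_for_noise are all my own assumptions for illustration. The idea is simply that the more noise you add around a straight line, the weaker the measured linear correlation becomes.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable

# Synthetic data, NOT the weight/height data set: y = 2x + 1 plus noise.
x = np.linspace(0, 10, 200)

def pearson_r_for_noise(noise_sd):
    """Return Pearson's r between x and a noisy linear function of x."""
    y = 2.0 * x + 1.0 + rng.normal(scale=noise_sd, size=x.size)
    return np.corrcoef(x, y)[0, 1]

# Small, medium, and large noise around the same underlying line.
noise_levels = [0.5, 5.0, 50.0]
r_values = [pearson_r_for_noise(sd) for sd in noise_levels]

for sd, r in zip(noise_levels, r_values):
    print(f"noise sd = {sd:5.1f}  ->  r = {r:.3f}")
```

With little noise the points hug the line and r should be close to 1; as the noise grows, the scatter looks more like a cloud and r drifts toward 0.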

So what does our data look like? I’ll construct a scatterplot:

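import pandas as pd
import plotly.express as px

# Load the weight/height data
df = pd.read_csv('weight-height.csv')

# Scatterplot of Weight vs. Height, colored by Gender, with violin plots
# on the margins and an OLS trendline (trendline='ols' requires statsmodels)
fig = px.scatter(df, x='Height', y='Weight', color='Gender', marginal_y='violin', marginal_x='violin', trendline='ols')
fig.show()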

I used a few other parameters that I thought would be helpful: color='Gender' plots each point colored according to gender; marginal_x and marginal_y = 'violin' use violin plots to show the distributions of the x and y values respectively; I also included trendline='ols' to draw the trendline using ordinary least squares. Because I effectively categorized the data by gender, Plotly draws two trendlines, one for males and one for females. Because the points are so dense, both trendlines are obscured except at the beginning and the end of the distribution. Nonetheless, it's clear from the distribution of points alone that height and weight have a strong positive correlation. We can quantify this correlation with:

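# Gender is a text column, so restrict the correlation to numeric columns (pandas 1.5+)
df.corr(numeric_only=True)['Weight']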

So weight correlates perfectly with itself, and strongly with height. Clearly, the data meets the requirement for linearity.
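If you are curious what the correlation call is doing under the hood, here is a rough sketch that computes Pearson's correlation coefficient from its definition and compares it with pandas. The data frame demo is a hypothetical stand-in with made-up numbers (I am not reproducing the real data set here), and pearson_r is a helper name I invented for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)  # fixed seed so the run is repeatable

# Hypothetical stand-in for the weight/height data (made-up numbers).
height = rng.normal(loc=66, scale=4, size=500)
weight = 6.0 * height - 235 + rng.normal(scale=12, size=500)
demo = pd.DataFrame({"Height": height, "Weight": weight})

def pearson_r(a, b):
    """Pearson's r: co-variation of a and b scaled by their individual spreads."""
    a_dev = a - a.mean()
    b_dev = b - b.mean()
    return (a_dev * b_dev).sum() / np.sqrt((a_dev ** 2).sum() * (b_dev ** 2).sum())

r_manual = pearson_r(demo["Height"], demo["Weight"])
r_pandas = demo.corr()["Weight"]["Height"]   # same lookup as in the post
r_self = pearson_r(demo["Weight"], demo["Weight"])

print(f"manual r = {r_manual:.4f}, pandas r = {r_pandas:.4f}, self r = {r_self:.1f}")
```

The manual value and the pandas value should agree to many decimal places, and any variable compared with itself comes out at 1, which is why Weight shows a perfect correlation with Weight.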

What about Normality?