Regression belongs like classification to the field of supervised learning.
# Common imports
import numpy as np
import os
import pandas as pd
# To plot pretty figures
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import matplotlib as mpl
import matplotlib.pyplot as plt
#%matplotlib notebook
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)
import seaborn as sns
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=DeprecationWarning)
warnings.filterwarnings = lambda *a, **kw: None
# to make this notebook's output identical at every run
np.random.seed(42)
preprocessing.StandardScaler
transform()
method.fit_transform()
both fits an estimator and returns the transformed input data set.LinearRegression
, after trainingpredict()
method forms the predictions.score()
method that measures the quality of the predictions given a test set.lin_reg.coef_
)# Scikit-Learn ≥0.20 is required
import sklearn
We use as an example the Boston housing data (from sklearn
), which contains 13 attributes of housing markets around Boston. The data was collected in 1978 and each of the 506 entries represent aggregated data about 14 features for homes from various suburbs in Boston, Massachusetts.
The objective is to predict the value of prices of the house using the given features
from sklearn.datasets import load_boston
data = load_boston() # object is a dictionary
data.keys()
Data Set Characteristics:
print(data['DESCR'])
X_full, y_full = data.data, data.target
n_samples = X_full.shape[0]
n_features = X_full.shape[1]
X_df=pd.DataFrame(X_full, columns=data['feature_names']) # to dataframe format
X_df.head()
X_df.isnull().sum()
There is none
target
or $y$)¶Before the regression, let us inspect the features and their distributions.
y_full.shape
sns.set(rc={'figure.figsize':(11.7,8.27)})
plt.hist(y_full, bins=30)
plt.xlabel("House prices in $1000", size=15)
plt.ylabel('count', size=15)
plt.title('Distribution of median price in each neighborhood', size=20)
plt.show()
X_full.shape
Histogram plots to look at the distribution
X_df.hist(bins=50, figsize=(20,15))
plt.show()
Boston Correlation Heatmap Example with Seaborn
The seaborn package offers a heatmap that will allow a two-dimensional graphical representation of the Boston data. The heatmap will represent the individual values that are contained in a matrix are represented as colors.
import pandas as pd
import matplotlib.pyplot as plt
sns.set(rc={'figure.figsize':(8.5,5)})
correlation_matrix = X_df.corr().round(2)
sns.heatmap(correlation_matrix) #annot=True
plt.show()
An important point in selecting features for a linear regression model is to check for multicolinearity.
The features RAD, TAX have a correlation of 0.91. These feature pairs are strongly correlated to each other. This can affect the model.
Same goes for the features DIS and AGE which have a correlation of -0.75.
from pandas.plotting import scatter_matrix
scatter_matrix(X_df[['DIS', 'AGE','RAD', 'TAX']], figsize=(12, 8))
for feature_name in X_df.columns:
plt.figure(figsize=(4, 3))
plt.scatter(X_df[feature_name], y_full, alpha=0.1)
plt.ylabel('Price', size=15)
plt.xlabel(feature_name, size=15)
plt.tight_layout()
The prices increase as the value of RM increases linearly. There are few outliers and the data seems to be capped at 50.
The prices tend to decrease with an increase in LSTAT. Though it doesn’t look to be following exactly a linear line.
Drop the observations with price >=50 (because of the right censure)
mask=y_full<50
y_full=y_full[mask==True]
X_full=X_full[mask==True]
X_df=X_df[mask==True]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_full, y_full,test_size=0.2, random_state=1)
print("train data", X_train.shape, y_train.shape)
print("test data", X_test.shape, y_test.shape)
The missing features should be:
Most common scaling methods:
0
and 1
)from sklearn.preprocessing import StandardScaler, MinMaxScaler
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# our first machine learning model
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
scikit-learn
API
In scikit-learn
all regression algorithms have:
fit()
method to learn from data, andpredict()
method for predicting numbers from input features.lin_reg.fit(X_train, y_train)
print("R-squared for training dataset:{}".
format(np.round(lin_reg.score(X_train, y_train), 2)))
lin_reg.fit(X_train_scaled, y_train)
print("R-squared for training dataset & scaled features:{}".
format(np.round(lin_reg.score(X_train_scaled, y_train), 2)))
features = list(X_df.columns)
print('The coefficients of the features from the linear model:')
print(dict(zip(features, [round(x, 2) for x in lin_reg.coef_])))
scikit-learn
offers the following metrics for measuring regression quality:
Taking absolute values before adding up the deviatons assures that deviations with different signs can not cancel out.
neg_mean_absolute_error
in scikit-learn
.
Here we replace the absolute difference by its squared difference. Squaring also insures positive differeces.
This measure is more sensitive to outliers: A few larger differences contribute more significantly to a larger mean squared error.
neg_mean_squared_error
in scikit-learn
.
Here we replace mean calculation by median.
This measure is less sensitive to outliers: A few larger differences will not contribute significantly to a larger error value.
neg_median_absolute_error
in scikit-learn
.
The formula for this metric can be found here.
This metric is recommended when your target values are distributed over a huge range of values, like popoluation numbers.
The previous error metrics would put a larger weight on large target values.
The name is neg_mean_squared_log_error
from sklearn.metrics import mean_squared_error
y_train_pred = lin_reg.predict(X_train_scaled)
train_mse = mean_squared_error(y_train, y_train_pred)
train_rmse = np.sqrt(train_mse)
print("RMSE: %s" % train_rmse) # = np.sqrt(np.mean((predicted - expected) ** 2))
Two other scores to mention are explained variance and $R^2$-score. For both larger values indicate better regression results.
The $R^2$-score corresponds to the proportion of variance (of $y$) that has been explained by the independent variables in the model.
It takes values in the range $0 .. 1$. The name within scikit-learn
is R2
.
The formula for explained variance, the score takes values up to $1$. The name within scikit-learn
is explained_variance
.
from sklearn.metrics import r2_score
r2=round(r2_score(y_test, y_test_pred), 2)
print("R2: %s" % r2)
# Regplot
g=sns.regplot(x= y_test_pred, y=y_test, x_bins=100)
g=g.set_title("Test sample")
plt.xlabel("Predicted prices: $\hat{Y}_i$")
plt.ylabel("Prices: $Y_i$")
plt.annotate('R2={}'.format(r2),
xy=(1, 0), xycoords='axes fraction',
horizontalalignment='right',
verticalalignment='bottom')
plt.annotate('Notes: 100 binned',
xy=(0, 0), xycoords='figure fraction',
horizontalalignment='left',
verticalalignment='bottom')
plt.plot([0, 50], [0, 50], '--k')
plt.axis('tight')
plt.tight_layout()
plt.show(g)
#Let us plot how good given and predicted values match on the training data set (sic !).
def plot_fit_quality(values_test, predicted):
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
x = np.arange(len(predicted))
plt.scatter(x, predicted - values_test, color='steelblue', marker='o')
plt.plot([0, len(predicted)], [0, 0], "k:")
max_diff = np.max(np.abs(predicted - values_test))
plt.ylim([-max_diff, max_diff])
plt.ylabel("error")
plt.xlabel("sample id")
plt.subplot(1, 2, 2)
plt.scatter(x, (predicted - values_test) / values_test, color='steelblue', marker='o')
plt.plot([0, len(predicted)], [0, 0], "k:")
plt.ylim([-.5, .5])
plt.ylabel("relative error")
plt.xlabel("sample id")
plot_fit_quality(y_test, y_test_pred)
from sklearn.linear_model import Ridge
print('The coefficients of the features from the Ridge model:')
print(dict(zip(features, [round(x, 2) for x in ridge_reg.coef_])))
from sklearn.preprocessing import PolynomialFeatures
poly_features=PolynomialFeatures(degree=2)
X_train_poly=poly_features.fit_transform(X_train)
X_test_poly=poly_features.fit_transform(X_test)
lin_reg = LinearRegression()
lin_reg.fit(X_train_poly, y_train)
y_test_pred = lin_reg.predict(X_test_poly)
test_rmse = mean_squared_error(y_test,y_test_pred)
test_rmse = np.sqrt(test_rmse)
print("test RMS: %s" % test_rmse)
print("train R2: %s" % round(r2_score(y_train, y_train_pred), 2))
print("test R2: %s" % round(r2_score(y_test, y_test_pred), 2))
from sklearn.linear_model import Lasso
lasso_reg=Lasso(alpha=1)
lasso_reg.fit(X_train, y_train)
y_test_pred = lasso_reg.predict(X_test)
test_mse = mean_squared_error(y_test, y_test_pred)
test_rmse = np.sqrt(test_mse)
print("test RMS: %s" % test_rmse)
print("train R2: %s" % round(r2_score(y_train, y_train_pred), 2))
print("test R2: %s" % round(r2_score(y_test, y_test_pred), 2))
from sklearn.linear_model import Lasso
lasso_reg=Lasso(alpha=1)
lasso_reg.fit(X_train_scaled, y_train)
y_test_pred = lasso_reg.predict(X_test_scaled)
test_mse = mean_squared_error(y_test, y_test_pred)
test_rmse = np.sqrt(test_mse)
print("test RMS: %s" % test_rmse)
print("train R2: %s" % round(r2_score(y_train, y_train_pred), 2))
print("test R2: %s" % round(r2_score(y_test, y_test_pred), 2))
print('The coefficients of the features from the Lasso model:')
print(dict(zip(features, [round(x, 2) for x in lasso_reg.coef_])))
from sklearn.linear_model import ElasticNet
elanet_reg=ElasticNet(random_state=0)
elanet_reg.fit(X_train_scaled, y_train)
y_test_pred = elanet_reg.predict(X_test_scaled)
test_mse = mean_squared_error(y_test, y_test_pred)
test_rmse = np.sqrt(test_mse)
print("test RMS: %s" % test_rmse)
print("train R2: %s" % round(r2_score(y_train, y_train_pred), 2))
print("test R2: %s" % round(r2_score(y_test, y_test_pred), 2))
print('The coefficients of the features from the Lasso model:')
print(dict(zip(features, [round(x, 2) for x in elanet_reg.coef_])))
alphas=np.logspace(-6, 6, 13)
from sklearn import linear_model
lassocv_reg = linear_model.LassoCV(alphas=alphas)
lassocv_reg.fit(X_train, y_train)
alpha=lassocv_reg.alpha_
print("Best alpha", alpha)
lasso_reg=Lasso(alpha=alpha)
lasso_reg.fit(X_train_scaled, y_train)
y_train_pred=lasso_reg.predict(X_train_scaled)
y_test_pred = lasso_reg.predict(X_test_scaled)
test_mse = mean_squared_error(y_test, y_test_pred)
test_rmse = np.sqrt(test_mse)
print("test RMS: %s" % test_rmse)
print("train R2: %s" % round(r2_score(y_train, y_train_pred), 2))
print("test R2: %s" % round(r2_score(y_test, y_test_pred), 2))
from sklearn.model_selection import cross_val_score, cross_val_predict
# Perform 6-fold cross validation
scores = cross_val_score(elanet_reg, X_train_scaled, y_train,
scoring="neg_mean_squared_error", cv=5)
scores
y_train_pred_cv = cross_val_predict(elanet_reg, X_train_scaled, y_train, cv=5)
## plt.figure(figsize=(4, 3))
plt.scatter(y_train, y_train_pred_cv)
plt.plot([0, 50], [0, 50], '--k')
plt.axis('tight')
plt.xlabel('True price ($1000s)')
plt.ylabel('Predicted price ($1000s)')
plt.tight_layout()
accuracy =r2_score(y_train, y_train_pred_cv)
print('Cross-Predicted Accuracy:', accuracy)
from sklearn.model_selection import GridSearchCV
param_grid = [
{'alpha': [0.0001, 0.001, 0.01, 0.1 ,1, 10],
'l1_ratio':[.1,.5,.9,1]}
]
# train across 5 folds, that's a total of (12+6)*5=90 rounds of training
grid_search = GridSearchCV(elanet_reg, param_grid, cv=5,
scoring='neg_mean_squared_error',
return_train_score=True)
grid_search.fit(X_train, y_train)
grid_search.best_params_
grid_search.best_estimator_
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
print(np.sqrt(-mean_score), params)
df_cvres=pd.DataFrame(cvres)
df_cvres['mean_test_score_pos_sqrt']=df_cvres['mean_test_score'].apply(lambda x: np.sqrt(-x))
df_cvres['log_param_alpha']=df_cvres['param_alpha'].apply(lambda x: np.log(x))
df_cvres.head()
_, ax = plt.subplots(1,1)
plt.plot(df_cvres["log_param_alpha"], df_cvres["mean_test_score_pos_sqrt"])
ax.set_title("Grid Search", fontsize=20, fontweight='bold')
ax.set_xlabel("$\log (alpha)$", fontsize=16)
ax.set_ylabel('Avg. mean test score', fontsize=16)
from sklearn.model_selection import RandomizedSearchCV