Saturday, March 26, 2016

Deriving the Cosine Distance for Machine Learning Distance Metrics

Deriving the Cosine Distance for Machine Learning



Cosine distance is used heavily in machine learning and is a very common distance metric alongside the euclidean distance. The biggest advantage of the cosine distance is its robustness against data scale. I.e. if you have various features that have different min and max scales to it, you can directly apply cosine distance and its notion of similarity.


This blog post is to help understand how the cosine distance is derived.


Figure 1: Illustration of the Angular Distance of Theta


In Figure 1 we see vectors A and B in 2 dimensions. In practical situations, you may have many more dimensions but that just means that your vector is longer in length. Now the equation used for cosine distance is typically:




The challenge is how do we get this equation. Note that this equation is very nice since it only depends on dot products and a division. It is extremely efficient to calculate and cos of 90 degrees is 0 and cos of 0 is 1 and cos of 180 is -1, which works perfectly inline with how we see A and B as similar or dissimilar.


It comes from applying vector math on the usual Cosine Rule but with a little trick.
Cosine Rule - c2 = a2 + b2 - 2ab cos theta.
Now lets look at that formula but in vector math notation. Note that vectors are basically each data point with each feature as an independent dimension. E.g. A = (f1,f2,f3...fN)

I couldnt figure out how to add equations in blogpost so forgive the screen shot I had to take just to paste the equations...
Note that ||a-b||^2 is the length of the line between the two points. This is the same as c^2 in the cosine law equation. 

Which is what was the definition of the cosine distance we had in the beginning…

Thursday, August 28, 2014

A Methodology to Perform Linear Regression

CapitalOneInterviewAssignment_Part 1

Building a Linear Regression Predictor

The challenge is to build a predictor that takes in a set of tab delimited floating values and predicts a floating value.

This is fundamentally a linear regression problem and requires some special treatment to ensure that the data that goes into the regression model is clean and does not have issues such as multi-colinearity and hetereoskedacitic. Also, it would be important to detect if there are external factors that affect the prediction but are not captured as features.

The data science approach and results is as follows:

  1. Data exploration.
    • Summary statistics on the data.
    • Detect colinearity, heteroskedacity issues. The goal here is to check if the feature set is not polluted by either redundant features or indications of external influences.
    • Identify NAN's and replace with mean values
  2. Data Clean Up
    • The only clean up is to replace with mean values.
  3. Pre-processing algorithms
    • Cross-train split from cross-validation (CV). There are two levels of splitting. One CV split is to evaluate between different predictors. The other CV split is to perform hyper-parameter selection using gridsearch.
  4. Model evaluation
    • Compare with using R^2.
  5. Future work
    • Apply broader hyper-parameter optimization using gridsearch
    • Using Hadoop, Spark or GraphLab in order to deal with large datasets. This would allow larger versions of the dataset without having to load the entire dataset into one continuous memory block.
    • Look at spearman vs. pearson correlation to identify whether the independent variable is linear or non linear. If non linear, perhaps introduce a polynomial term as well.
    • Try PCA for dimensionality reduction
In [19]:
from sklearn.grid_search import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
import pandas as pd
from sklearn.cross_validation import train_test_split
import numpy as np
import statsmodels.stats.diagnostic as std
import matplotlib.pyplot as plt
import seaborn as sns

Data Exploration

In [20]:
df = pd.read_csv("codetest_train.txt",sep='\t')
In [21]:
df.head()
Out[21]:
target f_0 f_1 f_2 f_3 f_4 f_5 f_6 f_7 f_8 ... f_244 f_245 f_246 f_247 f_248 f_249 f_250 f_251 f_252 f_253
0 3.066056 -0.652557 0.255256 -0.615412 -1.833283 -0.735654 NaN 1.115108 -0.170707 -0.350879 ... -1.607174 -1.400451 -0.920319 -0.198487 -0.944919 -0.572555 0.170449 -0.417764 -1.244350 -0.502889
1 -1.910473 1.178667 -0.093398 -0.555568 0.811153 -0.467931 -0.005488 -0.116258 -1.242854 1.984955 ... 1.281957 0.032356 -0.060830 NaN -0.061294 -0.301641 1.280799 -0.849742 0.821387 -0.259742
2 7.830711 0.180984 -0.778108 -0.919190 0.112607 0.887144 -0.761681 1.872443 -1.708934 0.134782 ... -0.236893 -0.659965 1.072506 -0.192872 0.570371 -0.267201 1.434515 1.331823 -1.146947 2.579905
3 -2.180862 0.745228 -0.245268 -1.343348 1.163072 -0.168847 -0.150516 -1.099532 0.225346 1.223015 ... 0.709195 -0.203496 -0.135737 -0.570835 1.682078 0.243042 -0.380941 0.612880 1.032715 0.400305
4 5.462784 1.217383 -1.323946 -0.958151 0.448290 -2.873187 -0.855653 0.602568 0.763372 0.019770 ... 0.892303 -0.432659 -0.877101 0.289258 0.653759 1.230144 0.456956 -0.754051 -0.025100 -0.931225

5 rows × 255 columns

In [22]:
y = df.pop("target")
df.describe()
Out[22]:
f_0 f_1 f_2 f_3 f_4 f_5 f_6 f_7 f_8 f_9 ... f_244 f_245 f_246 f_247 f_248 f_249 f_250 f_251 f_252 f_253
count 4903.000000 4928.000000 4908.000000 4910.000000 4907.000000 4912.000000 4897.000000 4904.000000 4893.000000 4911.000000 ... 4910.000000 4883.000000 4914.000000 4894.000000 4902.000000 4886.000000 4900.000000 4921.000000 4904.000000 4904.000000
mean -0.000432 0.002564 0.028875 -0.005436 -0.006764 0.005568 0.001536 -0.001016 0.009748 -0.002622 ... 0.013529 0.004937 0.023262 -0.018450 -0.009832 0.016954 -0.004944 0.016878 -0.001356 0.010331
std 0.999737 0.997931 1.019343 0.990349 1.006289 0.995800 1.004624 0.997360 0.988310 1.005418 ... 1.001454 0.997329 0.996465 1.005001 0.989225 1.011337 0.991580 1.001362 1.003412 1.006887
min -3.940976 -3.847382 -3.818473 -3.434423 -3.400202 -4.051005 -3.179496 -3.889719 -3.856546 -3.910096 ... -3.585086 -3.493958 -3.484578 -4.012033 -3.252239 -3.821459 -3.376073 -3.373395 -3.949648 -3.727771
25% -0.673040 -0.684705 -0.650854 -0.654998 -0.685200 -0.659947 -0.671828 -0.679356 -0.662404 -0.675319 ... -0.666017 -0.676178 -0.661783 -0.692160 -0.662535 -0.647772 -0.679635 -0.647463 -0.695102 -0.676679
50% -0.010543 -0.002652 0.047116 0.003579 -0.007036 -0.007822 -0.003055 -0.021591 0.016544 -0.002122 ... 0.025606 -0.027898 0.026912 -0.034966 -0.010708 0.002441 0.009770 0.019531 0.002471 0.015306
75% 0.677278 0.674600 0.718753 0.667814 0.653662 0.648748 0.678594 0.670028 0.698020 0.685997 ... 0.671749 0.670035 0.702553 0.655212 0.650979 0.709693 0.660469 0.692202 0.672384 0.705512
max 3.830801 3.995597 3.198775 4.962191 3.106121 4.295765 4.165960 3.798185 4.195049 4.842176 ... 3.365021 3.455789 3.881038 3.689781 3.629413 4.144275 3.873226 3.186501 3.724332 3.955565

8 rows × 250 columns

In [64]:
# Highest correlated features with target column. Should see if they are linear correlations
corr_matrix["target"].iloc[np.argsort(corr_matrix["target"])][::-1][0:10]
Out[64]:
target    1.000000
f_175     0.559211
f_161     0.395634
f_205     0.363730
f_195     0.263667
f_35      0.184608
f_218     0.164864
f_47      0.139391
Mexico    0.115002
f_75      0.097581
Name: target, dtype: float64
In [66]:
pd.scatter_matrix(df[["f_175","f_161","f_205","target"]])
Out[66]:
array([[<matplotlib.axes.AxesSubplot object at 0xa610c14c>,
        <matplotlib.axes.AxesSubplot object at 0xaa0315cc>,
        <matplotlib.axes.AxesSubplot object at 0xa5e2272c>,
        <matplotlib.axes.AxesSubplot object at 0xa5dbbd4c>],
       [<matplotlib.axes.AxesSubplot object at 0xa5d8fa4c>,
        <matplotlib.axes.AxesSubplot object at 0xa5d4514c>,
        <matplotlib.axes.AxesSubplot object at 0xa5cfdecc>,
        <matplotlib.axes.AxesSubplot object at 0xa5d1aaec>],
       [<matplotlib.axes.AxesSubplot object at 0xa5c70dec>,
        <matplotlib.axes.AxesSubplot object at 0xa5c2faec>,
        <matplotlib.axes.AxesSubplot object at 0xa5c63f2c>,
        <matplotlib.axes.AxesSubplot object at 0xa5c290ec>],
       [<matplotlib.axes.AxesSubplot object at 0xa5bd866c>,
        <matplotlib.axes.AxesSubplot object at 0xa5b9682c>,
        <matplotlib.axes.AxesSubplot object at 0xa5bac86c>,
        <matplotlib.axes.AxesSubplot object at 0xa5b0a54c>]], dtype=object)
In [65]:
df[["f_175","f_161","f_205","target"]].corr()
Out[65]:
f_175 f_161 f_205 target
f_175 1.000000 0.698535 0.008231 0.559211
f_161 0.698535 1.000000 -0.006862 0.395634
f_205 0.008231 -0.006862 1.000000 0.363730
target 0.559211 0.395634 0.363730 1.000000

From the scatter matrix and correlation matrix, notice that there is little correlation between the features and the target. Also note that the relationship might be non linear. It would be quite time consuming to do this with visual inspection. However, we can use the difference between pearson and spearman correlation to detect nonlinear relationships that we can then perform feature transformation upon in order to linearize prior to performing the linear regression.

In [79]:
import scipy.stats as st
print st.spearmanr(df[["f_175","target"]])
print st.pearsonr(df["f_175"],df["target"])
(0.56502924196593451, 0.0)
(0.55921068208270208, 0.0)

In [81]:
print st.spearmanr(df[["f_161","target"]])
print st.pearsonr(df["f_161"],df["target"])
(0.39713888083122312, 1.4784063942274236e-188)
(0.39563355008574125, 5.105752113862635e-187)

In [82]:
print st.spearmanr(df[["f_205","target"]])
print st.pearsonr(df["f_205"],df["target"])
(0.36077413223262689, 1.4329521566736594e-153)
(0.36373023641314262, 3.0004829719354247e-156)

Pearson vs. Spearman correlation

Above, we have done a few comparisons between Pearson and Spearman correlation. They are almost equal which implies that the data set is eliptical in distribution. There is unlikely a strong benefit of using a non-linear model for this. Let us move on to looking at the feature datatypes to see if there are any non floating value features...

In [23]:
df.dtypes[df.dtypes!="float64"]
Out[23]:
f_61     object
f_121    object
f_215    object
f_237    object
dtype: object

The features above are categorical columns. Categorical columns can be converted to binary data and merged back.

In [24]:
df = pd.read_csv("codetest_train.txt",sep='\t')
dummies_df = pd.get_dummies(df["f_237"])
dummies_df2 = pd.get_dummies(df["f_215"])
dummies_df.reset_index(inplace=True)
dummies_df2.reset_index(inplace=True)
x = dummies_df.merge(dummies_df2,left_on="index",right_on="index")
x.set_index("index")
df.reset_index(inplace=True)
df = df.merge(x,left_on="index",right_on="index")
df.head()
df.pop("index")
df.head()
Out[24]:
target f_0 f_1 f_2 f_3 f_4 f_5 f_6 f_7 f_8 ... f_251 f_252 f_253 Canada Mexico USA blue orange red yellow
0 3.066056 -0.652557 0.255256 -0.615412 -1.833283 -0.735654 NaN 1.115108 -0.170707 -0.350879 ... -0.417764 -1.244350 -0.502889 1 0 0 0 0 1 0
1 -1.910473 1.178667 -0.093398 -0.555568 0.811153 -0.467931 -0.005488 -0.116258 -1.242854 1.984955 ... -0.849742 0.821387 -0.259742 1 0 0 1 0 0 0
2 7.830711 0.180984 -0.778108 -0.919190 0.112607 0.887144 -0.761681 1.872443 -1.708934 0.134782 ... 1.331823 -1.146947 2.579905 1 0 0 0 1 0 0
3 -2.180862 0.745228 -0.245268 -1.343348 1.163072 -0.168847 -0.150516 -1.099532 0.225346 1.223015 ... 0.612880 1.032715 0.400305 0 0 1 1 0 0 0
4 5.462784 1.217383 -1.323946 -0.958151 0.448290 -2.873187 -0.855653 0.602568 0.763372 0.019770 ... -0.754051 -0.025100 -0.931225 1 0 0 0 1 0 0

5 rows × 262 columns

In [25]:
x = df.dtypes[df.dtypes!="float64"].index
[df.pop(col) for col in x.values]
Out[25]:
[0       b
 1       a
 2       b
 3       a
 4       b
 5       d
 6       d
 7       d
 8       a
 9       c
 10      c
 11      d
 12    NaN
 13      e
 14      c
 ...
 4985      d
 4986      a
 4987    NaN
 4988      c
 4989      c
 4990      a
 4991      b
 4992      e
 4993      d
 4994      a
 4995      e
 4996      c
 4997      d
 4998      d
 4999      a
 Name: f_61, Length: 5000, dtype: object, 0     D
 1     A
 2     B
 3     C
 4     E
 5     F
 6     A
 7     F
 8     B
 9     D
 10    B
 11    A
 12    A
 13    C
 14    D
 ...
 4985    C
 4986    E
 4987    A
 4988    A
 4989    F
 4990    F
 4991    E
 4992    B
 4993    F
 4994    E
 4995    B
 4996    F
 4997    F
 4998    C
 4999    B
 Name: f_121, Length: 5000, dtype: object, 0        red
 1       blue
 2     orange
 3       blue
 4     orange
 5       blue
 6     yellow
 7        red
 8     orange
 9       blue
 10    yellow
 11       NaN
 12       red
 13       red
 14    orange
 ...
 4985       red
 4986    orange
 4987       red
 4988    orange
 4989      blue
 4990    yellow
 4991    yellow
 4992       red
 4993    yellow
 4994    orange
 4995      blue
 4996       red
 4997    yellow
 4998      blue
 4999       red
 Name: f_215, Length: 5000, dtype: object, 0     Canada
 1     Canada
 2     Canada
 3        USA
 4     Canada
 5        USA
 6        USA
 7     Mexico
 8     Mexico
 9     Mexico
 10    Canada
 11    Mexico
 12    Canada
 13    Canada
 14    Mexico
 ...
 4985       USA
 4986       USA
 4987    Canada
 4988       USA
 4989       USA
 4990    Canada
 4991    Canada
 4992    Canada
 4993    Mexico
 4994       USA
 4995    Canada
 4996    Canada
 4997    Mexico
 4998       USA
 4999    Mexico
 Name: f_237, Length: 5000, dtype: object]
In [26]:
df = df.fillna(df.mean()) # Fill missing values with the mean of that column. Other approaches can be considered as well
corr_matrix = df.corr()
In [27]:
def Detect_Colinearity (corr_matrix,corr_threshold):
    col_set =set()
    for col in corr_matrix.columns:
        column = corr_matrix.loc[:,col][corr_matrix.loc[:,col]!=1]
        if max(column) > corr_threshold:
            tup = tuple(sort ((np.argmax(column), col)))
            col_set.add(tup)
            print np.argmax(column), " max correlation with " ,col ," is ", max(column)
    return col_set
col_set = Detect_Colinearity(corr_matrix,0.7)
f_218  max correlation with  f_75  is  0.701528006836
f_205  max correlation with  f_195  is  0.700222900414
f_195  max correlation with  f_205  is  0.700222900414
f_75  max correlation with  f_218  is  0.701528006836

In [28]:
col_set
Out[28]:
{('f_195', 'f_205'), ('f_218', 'f_75')}
In [29]:
def draw_scatter(colA,colB):
    plt.scatter(df[colA],df[colB])
    plt.xlabel(colA)
    plt.ylabel(colB)
    plt.show()

[draw_scatter(colA,colB) for colA,colB in col_set]
#draw_scatter("f_35","f_47")
Out[29]:
[None, None]

Notice that the above features are highly correlated to each other and not heteroskedastic as well. Hence, it would be prudent to remove one of those pairs of features since having just one of them provides sufficient explantory power.

Data Clean-Up

In [30]:
[df.pop(colA) for colA,colB in col_set]
# Remove one of the pair features that are highly correlated
Out[30]:
[0     0.539137
 1     0.561481
 2     0.750839
 3    -0.320563
 4     0.609882
 5     0.060647
 6    -0.362968
 7     0.847730
 8     0.305817
 9    -0.856485
 10    1.606802
 11   -0.953853
 12    0.876890
 13    0.914836
 14    0.267366
 ...
 4985   -0.015994
 4986    0.666057
 4987   -0.021838
 4988    0.550384
 4989   -0.368570
 4990    0.134997
 4991   -0.970546
 4992    0.010943
 4993    0.416705
 4994   -0.553957
 4995   -0.193834
 4996   -0.375043
 4997   -0.015489
 4998    0.381753
 4999    0.884792
 Name: f_195, Length: 5000, dtype: float64, 0    -0.081401
 1    -0.422620
 2    -0.559654
 3    -0.166179
 4     0.953225
 5    -0.835506
 6    -0.974140
 7    -0.879044
 8     0.101077
 9    -0.954207
 10    0.627886
 11   -0.956676
 12   -0.550488
 13   -1.348389
 14   -0.423770
 ...
 4985    0.597803
 4986   -1.274587
 4987    0.712450
 4988    0.040832
 4989   -0.408023
 4990    0.738290
 4991    0.483821
 4992    0.641688
 4993    0.486608
 4994   -0.986703
 4995   -1.677835
 4996   -0.005801
 4997   -0.156283
 4998    0.471449
 4999   -0.093550
 Name: f_218, Length: 5000, dtype: float64]

Now that we have done the data clean up, lets put it all into a single function that can be applied on the test and train set.

In [31]:
def data_clean_up (input,data_type="train",col_set=None):
    # This function cleans up the data in the following manner:
# 1. Convert all classification columns to binarized columns and merge it back
# 2. Remove the classification columns as well as the first column of the binarized (this is redundant since the value can be deduced from the other columns)
# 3. Detect colinearity and hetroskedacity and remove those columns

    df = pd.read_csv(input,sep='\t')
    x = df.dtypes[df.dtypes!="float64"].index
    total_dummies = pd.get_dummies(df[x[0]])
    total_dummies.reset_index(inplace=True)

    for col in x[1:]:
        dummies_df = pd.get_dummies(df[col])
        dummies_df.reset_index(inplace=True)
        asd = range(0,dummies_df.shape[1])
        asd.pop(1)
        dummies_df = dummies_df.iloc[:,asd]
        total_dummies = total_dummies.merge(dummies_df2,left_on="index",right_on="index")

        
    df.reset_index(inplace=True)
    df = df.merge(total_dummies,left_on="index",right_on="index")
    df.pop("index")

    if data_type == "test":
        # remove all the features identified by training clean-up
        [df.pop(colA) for colA,colB in col_set]
    if data_type=="train": 
        col_set = set()
        y = df.pop("target")
        [col_set.add(tuple([val,val])) for val in x.values]
    df = df.fillna(df.mean()) #df.mean()) # Fill missing values with the mean of that column. Other approaches can be considered as well
    corr_matrix = df.corr()
    if data_type=="train": 
        # find all the columns to drop based on colinearity/hetreskedacity detection and non-floats
        ss= Detect_Colinearity(corr_matrix,1)
        col_set = col_set|ss
        [df.pop(colA) for colA,colB in col_set]
        return df,y,col_set
    return df

df_train,y_train,col_set = data_clean_up("./codetest_train.txt")
print "Features to be removed due to colinearity and non floats are " , [a for a,b in col_set]
df_test = data_clean_up("./codetest_test.txt",data_type="test",col_set=col_set)
Features to be removed due to colinearity and non floats are  ['f_61', 'f_121', 'f_215', 'f_237']

Pre-Processing and Model Evaluation

Model evaluation is done using R^2 score. Note that CV version of regressors are used in order to further improve the generalization strength.

In [32]:
import sklearn.linear_model as lm

xtrain, xtest, ytrain, ytest = train_test_split(df_train.values, y_train.values, test_size=0.3, random_state=42)

clf = lm.LassoCV(cv=10)
clf.fit(xtrain,ytrain)
ypred = clf.predict(xtest)
Lasso_score = clf.score(xtest,ytest)
print "score for Lasso Linear Regression is " , Lasso_score


clf = lm.RidgeCV(cv=10)
clf.fit(xtrain,ytrain)
ypred = clf.predict(xtest)
Ridge_score = clf.score(xtest,ytest)
print "score for Ridge Linear Regression is " ,Ridge_score 

clf = lm.LassoLarsCV(cv=10)
clf.fit(xtrain,ytrain)
ypred = clf.predict(xtest)
LassoLars_score = clf.score(xtest,ytest)
print "score for LassoLarsCV Linear Regression is " , LassoLars_score
score for Lasso Linear Regression is  0.56259194772
score for Ridge Linear Regression is  0.544774032938
score for LassoLarsCV Linear Regression is  0.563229331306

Lasso Regression is the best. Now train with the complete dataset to gain maximum amount of info before performing the prediction on the test set.

In [15]:
clf = lm.LassoCV(cv=60)
clf.fit(df_train.values, y_train.values)
cols = np.where(clf.coef_>0)
print "Selected most important features based on Lasso Regression is " ,df.columns[cols]
Selected most important features based on Lasso Regression is  Index([u'f_10', u'f_16', u'f_19', u'f_28', u'f_34', u'f_35', u'f_53', u'f_66', u'f_69', u'f_73', u'f_85', u'f_91', u'f_93', u'f_100', u'f_115', u'f_136', u'f_141', u'f_146', u'f_168', u'f_174', u'f_192', u'f_194', u'f_196', u'f_200', u'f_205', u'f_216', u'f_219', u'f_220', u'f_233', u'f_236', u'f_239', u'f_243', u'Canada', u'Mexico'], dtype='object')

In [16]:
clf.score(xtest,ytest)
Out[16]:
0.57499834755624368

Output Results and Linear Regression predictor to file

In [17]:
results = clf.predict(df_test.values)
fid = open("results.csv",'w')
results.tofile(fid,sep=',',format='%f')
In [18]:
pd.to_pickle(clf,"./Part1LassoRegression.pickle")

Conclusion

The score / R^2 was improved significantly by maintaining the classification data. In terms of performance, the clean-up and training does take a polynomial amount of time especially as the number of cross validation runs increase. (40 CV's were done here)

Followers