<!DOCTYPE html>

ST 558 Project 2

library(rmarkdown)
library(usethis)
use_git_config(user.name="Mandy Liesch", user.email="amliesch@ncsu.edu")

Introduction

Online News Popularity Data Set summarizes a heterogeneous set of features about articles published by Mashable in a period of two years. The goal is to predict the number of shares in social networks (popularity). Here we first showed some summary statistics and plots about the data grouped by weekdays. Then we create several models to predict the response, shares in different channels. The performance of these models will be evaluated by RMSE. The model having the lowest RMSE will be selected as a winner. The methods of modeling include:

Regression Tree
Log Transformed Full Linear Regression Model
Linear Regression Model Without Day of the Week
Subset Linear Regression Model #1
Subset Linear Regression Model #2
Backward Selection Linear Regression
Bagged Regression Tree
Random Forest Model
Boosted Tree Model

Data preparation

Subset Data by Channel

library(tidyverse)

data_whole<-read_csv("OnlineNewsPopularity/OnlineNewsPopularity.csv")

#create a new variable, channel, to help with the subsetting.
data_whole$channel <- names(data_whole[14:19])[apply(data_whole[14:19],1, match, x = 1)]
data_whole$channel <-sub("data_channel_is_", "", data_whole$channel)

#Subset the data to work on the data channel of interest
#channel_interest = params[[1]]$team

#Get the important data
data_interest<-data_whole%>%
  filter(channel==x[[2]]$team)%>%
  select(-c(1,14:19,62))

Establish Training Data

Split the data into a training (70% of the data) and test set (30% of the data)

library(caret)
library(rsample)
set.seed(14)
index <- initial_split(data_interest,
                       prop = 0.7)
train <- training(index)
test <- testing(index)

Data Summaries

Correlation Plots

This graphical function looks at the correlation of all of the different variables against each other.

library(corrplot)
#drop values that are not important (the days of the week)
newTrain<-train[ -c(25:31) ]
lmNewTest<-test[ -c(25:31) ]
#drop the predictor variables
predictTrain<-newTrain[ -c(47) ]
#Calculate the correlation Matrix and round it
res <- cor(predictTrain)

#Plot the correlation matrix values by cluster
corrplot(res, type = "upper", order = "hclust",
         tl.col = "black", tl.cex = 0.5)

From the results of this spot, it appears that we likely have some clusters of colinearity.

Table Summary

We summarize the train data of interest in tables grouped by weekdays, showing the pattern of shares in a week.

#create a new variable, weekday, to help with the creating plots.
train$weekday <- names(train[25:31])[apply(train[25:31],1, match, x = 1)]
train$weekday <-sub("weekday_is_", "", train$weekday)

#summarize the train data by weekday.knitr::kable(
summary<-train%>%group_by(weekday)%>%
  summarise(Avg=round(mean(shares),0),Sd=round(sd(shares),0),Median=median(shares),IQR=round(IQR(shares),0))
knitr::kable(summary)

weekday	Avg	Sd	Median	IQR
friday	2824	6717	1200	1046
monday	2578	5356	1100	1002
saturday	3627	7328	1600	1300
sunday	3887	6593	1700	2700
thursday	2731	8082	1100	1204
tuesday	2742	6617	1100	1123
wednesday	2800	8354	1100	1229

We summarize the train data of interest in the plots below. The histogram of shares shows that it is not a normal distribution. After log transformation, the distribution of log(share) is more close to a normal distribution.

#histogram of shares and log(shares).
hist(train$shares)

hist(log(train$shares))

Data Plots

Box Plots

We use box plots to show the difference in shares and num_images between weekdays and weekends.If the boxes of weekends are higher than the ones of weekdays, then articles be shared more often during weekends.

g1<-ggplot(train, aes(x=factor(is_weekend,labels=c("No", "Yes")),y=shares))
g1+geom_boxplot(fill="white", width=0.5,lwd=1.5,color='black',outlier.shape = NA)+
   scale_y_continuous(limits = quantile(train$shares, c(0.1, 0.9)))+
   labs(subtitle = "Shares on weekend",x="On weekend or not")

g2<-ggplot(train, aes(x=factor(is_weekend,labels=c("No", "Yes")),y=num_imgs))
g2+geom_boxplot(fill="white", width=0.5,lwd=1.5,color='black',outlier.shape = NA)+
   scale_y_continuous(limits = quantile(train$num_imgs, c(0, 0.95)))+
   labs(subtitle = "number of images on weekend",x="On weekend or not")

Linear Model

We can inspect the trend of shares as a function of num_images. If the points show an upward trend, then articles with more images tend to be shared more often. If we see a negative trend then articles with more images tend to be shared less often. We can also observe the difference after the log transformation.

g3<-ggplot(train,aes(x=num_imgs,y=shares))
g3+geom_point()+
  labs(subtitle = "num_imgs vs shares")+
  scale_y_continuous(limits = quantile(train$shares, c(0, 0.9)))+
  scale_x_continuous(limits = quantile(train$num_imgs, c(0, 0.9)))+
  geom_smooth(method="lm")

g4<-ggplot(train,aes(x=num_imgs,y=log(shares)))
g4+geom_point()+
  labs(subtitle = "num_imgs vs log(shares)")+
  scale_y_continuous(limits = quantile(log(train$shares), c(0, 0.9)))+
  scale_x_continuous(limits = quantile(train$num_imgs, c(0, 0.9)))+
  geom_smooth(method="lm")

#remove weekday from data set
train<-train%>%select(-weekday)

Models

Regression Tree

Classification trees are machine learning algorithms that have several benefits, including the ease of operation, and less pre-processing. Data does not require normalization, scaling, and removal of missing values. The results are usually easy to explain, and stakeholders usually can understand them. A regression tree is a tree that uses numerical values to predict the nodes and tree branches. Despite all of the benefits, the Decision Tree algorithm can’t be used for regression and predicting continuous values, it also does not transfer well to other datasets.

library(tree)
tree.news<-tree(shares~., data=train)
summary(tree.news)

## 
## Regression tree:
## tree(formula = shares ~ ., data = train)
## Variables actually used in tree construction:
## [1] "kw_avg_avg"               
## [2] "global_sentiment_polarity"
## Number of terminal nodes:  3 
## Residual mean deviance:  4.6e+07 = 2.271e+11 / 4936 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu. 
## -52380.0  -1829.0  -1520.0      0.0   -620.1 
##     Max. 
## 140300.0

plot(tree.news)
text(tree.news, pretty=0)

yhat.regTree<- predict(tree.news, newdata = test)
yhat.test<-test["shares"]
yhat.regTree<-as.data.frame(yhat.regTree)
meanRegTree<-mean((yhat.regTree$yhat.regTree-yhat.test$shares)^2)

RMSE_regTree<-sqrt(meanRegTree)

These results can vary widely depending on the datasets.

Linear Models

Linear models are very valuable and powerful tools, and are very versatile, and can be applied to many situations. Multiple regression examines the relationship between several independent variables and one dependent variable (in this case, total Shares). Regression models give users the ability to determine the relative influence of one or more predictor variables to the predictor, and it also allows users to identify outliers, or anomalies. The main disadvantages have to do with the input quality of data. Input that is incomplete may lead to wrong conclusions. It also assumes that data is independent, which is not always the case.

There are several different types of linear models. In this project, we use multiple different multiple regression values that were log transformed, representing the full dataset, and several partial subsets with multiple variables removed at different points for multicolinearity reasons.

There are also several different types of variable selection, including forward, backward, and stepwise, which user predefined criteria set the entry and/or exit criteria of the models. Backwards selection starts with a full model, and then removes variables that are least significant one at a time, until the model criteria defined by the user are hit. Forward regression does the opposite, and is not represented here.

Linear Regression After Log Transformation

Transform the response with log, then fit a linear regression model with all the variables. Then calculate the RMSE of the model.

lm<- lm(log(shares)~.,train)
summary(lm)

## 
## Call:
## lm(formula = log(shares) ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8917 -0.5435 -0.1980  0.3181  4.6430 
## 
## Coefficients: (3 not defined because of singularities)
##                                Estimate
## (Intercept)                   1.576e+02
## timedelta                     2.681e-04
## n_tokens_title                1.360e-03
## n_tokens_content             -1.410e-05
## n_unique_tokens               3.709e-01
## n_non_stop_words             -2.674e-01
## n_non_stop_unique_tokens     -2.010e-01
## num_hrefs                     4.520e-03
## num_self_hrefs               -5.239e-03
## num_imgs                      2.561e-03
## num_videos                    2.070e-03
## average_token_length         -4.234e-02
## num_keywords                  1.344e-02
## kw_min_min                    1.106e-03
## kw_max_min                    7.034e-05
## kw_avg_min                   -4.732e-04
## kw_min_max                   -1.471e-06
## kw_max_max                    4.055e-08
## kw_avg_max                   -2.319e-07
## kw_min_avg                   -5.177e-05
## kw_max_avg                   -3.733e-05
## kw_avg_avg                    3.492e-04
## self_reference_min_shares     3.321e-06
## self_reference_max_shares    -9.333e-07
## self_reference_avg_sharess    5.607e-06
## weekday_is_monday            -4.397e-01
## weekday_is_tuesday           -4.503e-01
## weekday_is_wednesday         -4.318e-01
## weekday_is_thursday          -4.201e-01
## weekday_is_friday            -3.557e-01
## weekday_is_saturday          -9.213e-02
## weekday_is_sunday                    NA
## is_weekend                           NA
## LDA_00                       -1.511e+02
## LDA_01                       -1.509e+02
## LDA_02                       -1.510e+02
## LDA_03                       -1.510e+02
## LDA_04                       -1.511e+02
## global_subjectivity           4.548e-01
## global_sentiment_polarity    -3.740e-01
## global_rate_positive_words   -1.004e+00
## global_rate_negative_words    1.113e+00
## rate_positive_words           2.456e-01
## rate_negative_words                  NA
## avg_positive_polarity         3.266e-01
## min_positive_polarity         1.889e-01
## max_positive_polarity        -1.017e-03
## avg_negative_polarity         3.118e-01
## min_negative_polarity        -6.797e-02
## max_negative_polarity        -7.999e-02
## title_subjectivity            2.211e-02
## title_sentiment_polarity      1.042e-01
## abs_title_subjectivity        2.194e-02
## abs_title_sentiment_polarity -5.832e-02
##                              Std. Error t value
## (Intercept)                   3.894e+02   0.405
## timedelta                     8.438e-05   3.177
## n_tokens_title                6.404e-03   0.212
## n_tokens_content              4.899e-05  -0.288
## n_unique_tokens               4.361e-01   0.851
## n_non_stop_words              3.802e-01  -0.703
## n_non_stop_unique_tokens      3.638e-01  -0.552
## num_hrefs                     1.195e-03   3.782
## num_self_hrefs                4.879e-03  -1.074
## num_imgs                      1.537e-03   1.666
## num_videos                    2.408e-03   0.859
## average_token_length          5.739e-02  -0.738
## num_keywords                  8.238e-03   1.632
## kw_min_min                    4.185e-04   2.642
## kw_max_min                    1.718e-05   4.095
## kw_avg_min                    8.510e-05  -5.561
## kw_min_max                    4.955e-07  -2.969
## kw_max_max                    1.511e-07   0.268
## kw_avg_max                    2.224e-07  -1.043
## kw_min_avg                    1.797e-05  -2.880
## kw_max_avg                    6.257e-06  -5.966
## kw_avg_avg                    3.346e-05  10.438
## self_reference_min_shares     2.761e-06   1.203
## self_reference_max_shares     9.936e-07  -0.939
## self_reference_avg_sharess    2.949e-06   1.902
## weekday_is_monday             5.411e-02  -8.127
## weekday_is_tuesday            5.463e-02  -8.242
## weekday_is_wednesday          5.471e-02  -7.893
## weekday_is_thursday           5.515e-02  -7.619
## weekday_is_friday             5.715e-02  -6.223
## weekday_is_saturday           7.028e-02  -1.311
## weekday_is_sunday                    NA      NA
## is_weekend                           NA      NA
## LDA_00                        3.894e+02  -0.388
## LDA_01                        3.894e+02  -0.388
## LDA_02                        3.894e+02  -0.388
## LDA_03                        3.894e+02  -0.388
## LDA_04                        3.894e+02  -0.388
## global_subjectivity           1.871e-01   2.432
## global_sentiment_polarity     3.714e-01  -1.007
## global_rate_positive_words    1.640e+00  -0.612
## global_rate_negative_words    2.916e+00   0.382
## rate_positive_words           2.631e-01   0.934
## rate_negative_words                  NA      NA
## avg_positive_polarity         2.946e-01   1.109
## min_positive_polarity         2.757e-01   0.685
## max_positive_polarity         9.634e-02  -0.011
## avg_negative_polarity         2.588e-01   1.205
## min_negative_polarity         9.453e-02  -0.719
## max_negative_polarity         2.155e-01  -0.371
## title_subjectivity            5.686e-02   0.389
## title_sentiment_polarity      5.017e-02   2.076
## abs_title_subjectivity        7.673e-02   0.286
## abs_title_sentiment_polarity  7.977e-02  -0.731
##                              Pr(>|t|)    
## (Intercept)                  0.685630    
## timedelta                    0.001497 ** 
## n_tokens_title               0.831839    
## n_tokens_content             0.773478    
## n_unique_tokens              0.395057    
## n_non_stop_words             0.481913    
## n_non_stop_unique_tokens     0.580652    
## num_hrefs                    0.000157 ***
## num_self_hrefs               0.283032    
## num_imgs                     0.095846 .  
## num_videos                   0.390134    
## average_token_length         0.460735    
## num_keywords                 0.102830    
## kw_min_min                   0.008278 ** 
## kw_max_min                   4.30e-05 ***
## kw_avg_min                   2.83e-08 ***
## kw_min_max                   0.002998 ** 
## kw_max_max                   0.788416    
## kw_avg_max                   0.297178    
## kw_min_avg                   0.003991 ** 
## kw_max_avg                   2.61e-09 ***
## kw_avg_avg                    < 2e-16 ***
## self_reference_min_shares    0.229150    
## self_reference_max_shares    0.347618    
## self_reference_avg_sharess   0.057293 .  
## weekday_is_monday            5.55e-16 ***
## weekday_is_tuesday            < 2e-16 ***
## weekday_is_wednesday         3.60e-15 ***
## weekday_is_thursday          3.06e-14 ***
## weekday_is_friday            5.29e-10 ***
## weekday_is_saturday          0.189934    
## weekday_is_sunday                  NA    
## is_weekend                         NA    
## LDA_00                       0.697951    
## LDA_01                       0.698303    
## LDA_02                       0.698200    
## LDA_03                       0.698108    
## LDA_04                       0.697939    
## global_subjectivity          0.015071 *  
## global_sentiment_polarity    0.313966    
## global_rate_positive_words   0.540543    
## global_rate_negative_words   0.702793    
## rate_positive_words          0.350596    
## rate_negative_words                NA    
## avg_positive_polarity        0.267666    
## min_positive_polarity        0.493110    
## max_positive_polarity        0.991580    
## avg_negative_polarity        0.228358    
## min_negative_polarity        0.472169    
## max_negative_polarity        0.710524    
## title_subjectivity           0.697391    
## title_sentiment_polarity     0.037918 *  
## abs_title_subjectivity       0.774906    
## abs_title_sentiment_polarity 0.464733    
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8849 on 4888 degrees of freedom
## Multiple R-squared:  0.08898,    Adjusted R-squared:  0.07966 
## F-statistic: 9.548 on 50 and 4888 DF,  p-value: < 2.2e-16

yhat_lm<-predict(lm,test)
RMSE_lm<-sqrt(mean((test$shares - exp(yhat_lm))^2))

Plot the lm Residuals

par(mfrow=c(2,2))
plot(lm)

Looking at our residuals, there seems to be skewing in both direction, indicating that the data, even after transformation, has extreme outliers in both directions.

Model Removing the Day Variable

#look at the data for multicolinearity
lmNewTest<-test[ -c(25:31) ]
lm2<- lm(log(shares)~.,newTrain)
yhat_lm2<-predict(lm2,lmNewTest)
RMSE_lm2<-sqrt(mean((lmNewTest$shares - exp(yhat_lm2))^2))

library(mctest)
omcdiag(lm2)

## 
## Call:
## omcdiag(mod = lm2)
## 
## 
## Overall Multicollinearity Diagnostics
## 
##                           MC Results detection
## Determinant |X'X|:      0.000000e+00         1
## Farrar Chi-Square:      4.113519e+05         1
## Red Indicator:          1.657000e-01         0
## Sum of Lambda Inverse: -6.862627e+15         0
## Theil's Method:         2.838680e+01         1
## Condition Number:       3.332843e+07         1
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test

imcdiag(lm2)

## 
## Call:
## imcdiag(mod = lm2)
## 
## 
## All Individual Multicollinearity Diagnostics Result
## 
##                                       VIF
## timedelta                    1.982400e+00
## n_tokens_title               1.129200e+00
## n_tokens_content             4.023800e+00
## n_unique_tokens              1.189446e+05
## n_non_stop_words                      Inf
## n_non_stop_unique_tokens     7.113844e+04
## num_hrefs                    1.497900e+00
## num_self_hrefs               1.398800e+00
## num_imgs                     1.951500e+00
## num_videos                   1.417300e+00
## average_token_length         1.388170e+01
## num_keywords                 1.522900e+00
## kw_min_min                   4.300400e+00
## kw_max_min                   2.031970e+01
## kw_avg_min                   2.161280e+01
## kw_min_max                   1.770400e+00
## kw_max_max                   5.439100e+00
## kw_avg_max                   3.627400e+00
## kw_min_avg                   2.512200e+00
## kw_max_avg                   6.047200e+00
## kw_avg_avg                   8.626300e+00
## self_reference_min_shares    2.506300e+00
## self_reference_max_shares    5.235800e+00
## self_reference_avg_sharess   7.755000e+00
## is_weekend                   1.027400e+00
## LDA_00                                Inf
## LDA_01                                Inf
## LDA_02                                Inf
## LDA_03                                Inf
## LDA_04                                Inf
## global_subjectivity          2.764000e+00
## global_sentiment_polarity    8.447000e+00
## global_rate_positive_words   5.004900e+00
## global_rate_negative_words   8.361700e+00
## rate_positive_words                   Inf
## rate_negative_words          9.007199e+15
## avg_positive_polarity        5.902800e+00
## min_positive_polarity        1.728100e+00
## max_positive_polarity        3.574400e+00
## avg_negative_polarity        7.002800e+00
## min_negative_polarity        4.953700e+00
## max_negative_polarity        2.888000e+00
## title_subjectivity           2.150800e+00
## title_sentiment_polarity     1.258200e+00
## abs_title_subjectivity       1.365500e+00
## abs_title_sentiment_polarity 2.150300e+00
##                                 TOL
## timedelta                    0.5044
## n_tokens_title               0.8856
## n_tokens_content             0.2485
## n_unique_tokens              0.0000
## n_non_stop_words             0.0000
## n_non_stop_unique_tokens     0.0000
## num_hrefs                    0.6676
## num_self_hrefs               0.7149
## num_imgs                     0.5124
## num_videos                   0.7055
## average_token_length         0.0720
## num_keywords                 0.6566
## kw_min_min                   0.2325
## kw_max_min                   0.0492
## kw_avg_min                   0.0463
## kw_min_max                   0.5648
## kw_max_max                   0.1839
## kw_avg_max                   0.2757
## kw_min_avg                   0.3981
## kw_max_avg                   0.1654
## kw_avg_avg                   0.1159
## self_reference_min_shares    0.3990
## self_reference_max_shares    0.1910
## self_reference_avg_sharess   0.1289
## is_weekend                   0.9733
## LDA_00                       0.0000
## LDA_01                       0.0000
## LDA_02                       0.0000
## LDA_03                       0.0000
## LDA_04                       0.0000
## global_subjectivity          0.3618
## global_sentiment_polarity    0.1184
## global_rate_positive_words   0.1998
## global_rate_negative_words   0.1196
## rate_positive_words          0.0000
## rate_negative_words          0.0000
## avg_positive_polarity        0.1694
## min_positive_polarity        0.5787
## max_positive_polarity        0.2798
## avg_negative_polarity        0.1428
## min_negative_polarity        0.2019
## max_negative_polarity        0.3463
## title_subjectivity           0.4649
## title_sentiment_polarity     0.7948
## abs_title_subjectivity       0.7323
## abs_title_sentiment_polarity 0.4651
##                                        Wi
## timedelta                    1.068217e+02
## n_tokens_title               1.404790e+01
## n_tokens_content             3.287878e+02
## n_unique_tokens              1.293314e+07
## n_non_stop_words                      Inf
## n_non_stop_unique_tokens     7.735011e+06
## num_hrefs                    5.414020e+01
## num_self_hrefs               4.336820e+01
## num_imgs                     1.034649e+02
## num_videos                   4.537980e+01
## average_token_length         1.400674e+03
## num_keywords                 5.685970e+01
## kw_min_min                   3.588582e+02
## kw_max_min                   2.100693e+03
## kw_avg_min                   2.241297e+03
## kw_min_max                   8.376790e+01
## kw_max_max                   4.826727e+02
## kw_avg_max                   2.856806e+02
## kw_min_avg                   1.644257e+02
## kw_max_avg                   5.487991e+02
## kw_avg_avg                   8.292296e+02
## self_reference_min_shares    1.637872e+02
## self_reference_max_shares    4.605679e+02
## self_reference_avg_sharess   7.344981e+02
## is_weekend                   2.978800e+00
## LDA_00                                Inf
## LDA_01                                Inf
## LDA_02                                Inf
## LDA_03                                Inf
## LDA_04                                Inf
## global_subjectivity          1.918054e+02
## global_sentiment_polarity    8.097402e+02
## global_rate_positive_words   4.354640e+02
## global_rate_negative_words   8.004645e+02
## rate_positive_words                   Inf
## rate_negative_words          9.793828e+17
## avg_positive_polarity        5.330965e+02
## min_positive_polarity        7.916550e+01
## max_positive_polarity        2.799224e+02
## avg_negative_polarity        6.527015e+02
## min_negative_polarity        4.299006e+02
## max_negative_polarity        2.052902e+02
## title_subjectivity           1.251335e+02
## title_sentiment_polarity     2.807830e+01
## abs_title_subjectivity       3.974660e+01
## abs_title_sentiment_polarity 1.250757e+02
##                                        Fi
## timedelta                    1.092718e+02
## n_tokens_title               1.437010e+01
## n_tokens_content             3.363289e+02
## n_unique_tokens              1.322977e+07
## n_non_stop_words                      Inf
## n_non_stop_unique_tokens     7.912424e+06
## num_hrefs                    5.538200e+01
## num_self_hrefs               4.436290e+01
## num_imgs                     1.058380e+02
## num_videos                   4.642070e+01
## average_token_length         1.432801e+03
## num_keywords                 5.816390e+01
## kw_min_min                   3.670890e+02
## kw_max_min                   2.148875e+03
## kw_avg_min                   2.292704e+03
## kw_min_max                   8.568920e+01
## kw_max_max                   4.937434e+02
## kw_avg_max                   2.922331e+02
## kw_min_avg                   1.681971e+02
## kw_max_avg                   5.613865e+02
## kw_avg_avg                   8.482491e+02
## self_reference_min_shares    1.675438e+02
## self_reference_max_shares    4.711316e+02
## self_reference_avg_sharess   7.513447e+02
## is_weekend                   3.047100e+00
## LDA_00                                Inf
## LDA_01                                Inf
## LDA_02                                Inf
## LDA_03                                Inf
## LDA_04                                Inf
## global_subjectivity          1.962047e+02
## global_sentiment_polarity    8.283127e+02
## global_rate_positive_words   4.454519e+02
## global_rate_negative_words   8.188242e+02
## rate_positive_words                   Inf
## rate_negative_words          1.001846e+18
## avg_positive_polarity        5.453237e+02
## min_positive_polarity        8.098130e+01
## max_positive_polarity        2.863428e+02
## avg_negative_polarity        6.676721e+02
## min_negative_polarity        4.397609e+02
## max_negative_polarity        2.099988e+02
## title_subjectivity           1.280036e+02
## title_sentiment_polarity     2.872230e+01
## abs_title_subjectivity       4.065820e+01
## abs_title_sentiment_polarity 1.279445e+02
##                              Leamer
## timedelta                    0.7102
## n_tokens_title               0.9411
## n_tokens_content             0.4985
## n_unique_tokens              0.0029
## n_non_stop_words             0.0000
## n_non_stop_unique_tokens     0.0037
## num_hrefs                    0.8171
## num_self_hrefs               0.8455
## num_imgs                     0.7158
## num_videos                   0.8400
## average_token_length         0.2684
## num_keywords                 0.8103
## kw_min_min                   0.4822
## kw_max_min                   0.2218
## kw_avg_min                   0.2151
## kw_min_max                   0.7516
## kw_max_max                   0.4288
## kw_avg_max                   0.5251
## kw_min_avg                   0.6309
## kw_max_avg                   0.4067
## kw_avg_avg                   0.3405
## self_reference_min_shares    0.6317
## self_reference_max_shares    0.4370
## self_reference_avg_sharess   0.3591
## is_weekend                   0.9866
## LDA_00                       0.0000
## LDA_01                       0.0000
## LDA_02                       0.0000
## LDA_03                       0.0000
## LDA_04                       0.0000
## global_subjectivity          0.6015
## global_sentiment_polarity    0.3441
## global_rate_positive_words   0.4470
## global_rate_negative_words   0.3458
## rate_positive_words          0.0000
## rate_negative_words          0.0000
## avg_positive_polarity        0.4116
## min_positive_polarity        0.7607
## max_positive_polarity        0.5289
## avg_negative_polarity        0.3779
## min_negative_polarity        0.4493
## max_negative_polarity        0.5884
## title_subjectivity           0.6819
## title_sentiment_polarity     0.8915
## abs_title_subjectivity       0.8558
## abs_title_sentiment_polarity 0.6819
##                                      CVIF Klein
## timedelta                    2.016100e+00     1
## n_tokens_title               1.148400e+00     1
## n_tokens_content             4.092100e+00     1
## n_unique_tokens              1.209636e+05     1
## n_non_stop_words                      Inf     1
## n_non_stop_unique_tokens     7.234593e+04     1
## num_hrefs                    1.523300e+00     1
## num_self_hrefs               1.422600e+00     1
## num_imgs                     1.984700e+00     1
## num_videos                   1.441400e+00     1
## average_token_length         1.411740e+01     1
## num_keywords                 1.548800e+00     1
## kw_min_min                   4.373300e+00     1
## kw_max_min                   2.066460e+01     1
## kw_avg_min                   2.197960e+01     1
## kw_min_max                   1.800400e+00     1
## kw_max_max                   5.531400e+00     1
## kw_avg_max                   3.688900e+00     1
## kw_min_avg                   2.554800e+00     1
## kw_max_avg                   6.149800e+00     1
## kw_avg_avg                   8.772700e+00     1
## self_reference_min_shares    2.548900e+00     1
## self_reference_max_shares    5.324600e+00     1
## self_reference_avg_sharess   7.886700e+00     1
## is_weekend                   1.044800e+00     0
## LDA_00                                Inf     1
## LDA_01                                Inf     1
## LDA_02                                Inf     1
## LDA_03                                Inf     1
## LDA_04                                Inf     1
## global_subjectivity          2.810900e+00     1
## global_sentiment_polarity    8.590400e+00     1
## global_rate_positive_words   5.089800e+00     1
## global_rate_negative_words   8.503700e+00     1
## rate_positive_words                   Inf     1
## rate_negative_words          9.160086e+15     1
## avg_positive_polarity        6.003000e+00     1
## min_positive_polarity        1.757400e+00     1
## max_positive_polarity        3.635100e+00     1
## avg_negative_polarity        7.121600e+00     1
## min_negative_polarity        5.037800e+00     1
## max_negative_polarity        2.937000e+00     1
## title_subjectivity           2.187300e+00     1
## title_sentiment_polarity     1.279600e+00     1
## abs_title_subjectivity       1.388700e+00     1
## abs_title_sentiment_polarity 2.186800e+00     1
##                                IND1   IND2
## timedelta                    0.0045 0.7050
## n_tokens_title               0.0080 0.1628
## n_tokens_content             0.0022 1.0691
## n_unique_tokens              0.0000 1.4227
## n_non_stop_words             0.0000 1.4227
## n_non_stop_unique_tokens     0.0000 1.4227
## num_hrefs                    0.0060 0.4729
## num_self_hrefs               0.0064 0.4056
## num_imgs                     0.0046 0.6937
## num_videos                   0.0063 0.4189
## average_token_length         0.0006 1.3202
## num_keywords                 0.0059 0.4885
## kw_min_min                   0.0021 1.0919
## kw_max_min                   0.0004 1.3527
## kw_avg_min                   0.0004 1.3569
## kw_min_max                   0.0051 0.6191
## kw_max_max                   0.0017 1.1611
## kw_avg_max                   0.0025 1.0305
## kw_min_avg                   0.0036 0.8564
## kw_max_avg                   0.0015 1.1874
## kw_avg_avg                   0.0010 1.2578
## self_reference_min_shares    0.0036 0.8551
## self_reference_max_shares    0.0017 1.1510
## self_reference_avg_sharess   0.0012 1.2392
## is_weekend                   0.0088 0.0379
## LDA_00                       0.0000 1.4227
## LDA_01                       0.0000 1.4227
## LDA_02                       0.0000 1.4227
## LDA_03                       0.0000 1.4227
## LDA_04                       0.0000 1.4227
## global_subjectivity          0.0033 0.9080
## global_sentiment_polarity    0.0011 1.2543
## global_rate_positive_words   0.0018 1.1384
## global_rate_negative_words   0.0011 1.2526
## rate_positive_words          0.0000 1.4227
## rate_negative_words          0.0000 1.4227
## avg_positive_polarity        0.0015 1.1817
## min_positive_polarity        0.0052 0.5994
## max_positive_polarity        0.0025 1.0247
## avg_negative_polarity        0.0013 1.2195
## min_negative_polarity        0.0018 1.1355
## max_negative_polarity        0.0031 0.9301
## title_subjectivity           0.0042 0.7612
## title_sentiment_polarity     0.0071 0.2920
## abs_title_subjectivity       0.0066 0.3808
## abs_title_sentiment_polarity 0.0042 0.7611
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test
## 
## n_tokens_title , n_tokens_content , n_unique_tokens , n_non_stop_words , n_non_stop_unique_tokens , num_self_hrefs , num_imgs , num_videos , average_token_length , num_keywords , kw_max_max , kw_avg_max , self_reference_min_shares , self_reference_max_shares , self_reference_avg_sharess , LDA_00 , LDA_01 , LDA_02 , LDA_03 , LDA_04 , global_sentiment_polarity , global_rate_positive_words , global_rate_negative_words , rate_positive_words , rate_negative_words , avg_positive_polarity , min_positive_polarity , max_positive_polarity , avg_negative_polarity , min_negative_polarity , max_negative_polarity , title_sentiment_polarity , abs_title_subjectivity , coefficient(s) are non-significant may be due to multicollinearity
## 
## R-square of y on all x: 0.0877 
## 
## * use method argument to check which regressors may be the reason of collinearity
## ===================================

Looking at all of the VIF values, we are going to start by removing all of the LDA Values, and the positive word rate to remove all “infinite” VIF values.

First Multicolinearity Trim

The mctest package was used to calculate the VIF values of multicolinearity.

toRemove<-c( "LDA_01", "LDA_02", "LDA_03", "LDA_04", "rate_positive_words")
trimTrain1 <- newTrain[, ! names(newTrain) %in% toRemove, drop = F]
lmNewTest3<-lmNewTest[, ! names(newTrain) %in% toRemove, drop = F]

#Repeat linear Model process
lm3<- lm(log(shares)~., trimTrain1)
yhat_lm3<-predict(lm3,lmNewTest3)
RMSE_lm3<-sqrt(mean((lmNewTest3$shares - exp(yhat_lm3))^2))

imcdiag(lm3)

## 
## Call:
## imcdiag(mod = lm3)
## 
## 
## All Individual Multicollinearity Diagnostics Result
## 
##                                     VIF    TOL
## timedelta                        1.9486 0.5132
## n_tokens_title                   1.1172 0.8951
## n_tokens_content                 4.0023 0.2499
## n_unique_tokens              94984.1446 0.0000
## n_non_stop_words             26332.0942 0.0000
## n_non_stop_unique_tokens     56807.5774 0.0000
## num_hrefs                        1.4609 0.6845
## num_self_hrefs                   1.3854 0.7218
## num_imgs                         1.9458 0.5139
## num_videos                       1.3751 0.7272
## average_token_length             3.2074 0.3118
## num_keywords                     1.4705 0.6800
## kw_min_min                       4.2851 0.2334
## kw_max_min                      20.2581 0.0494
## kw_avg_min                      21.4867 0.0465
## kw_min_max                       1.7560 0.5695
## kw_max_max                       5.3425 0.1872
## kw_avg_max                       3.3125 0.3019
## kw_min_avg                       2.4368 0.4104
## kw_max_avg                       5.9388 0.1684
## kw_avg_avg                       8.2863 0.1207
## self_reference_min_shares        2.5015 0.3998
## self_reference_max_shares        5.2341 0.1911
## self_reference_avg_sharess       7.7533 0.1290
## is_weekend                       1.0251 0.9755
## LDA_00                           1.0441 0.9578
## global_subjectivity              2.6060 0.3837
## global_sentiment_polarity        8.4084 0.1189
## global_rate_positive_words       4.7358 0.2112
## global_rate_negative_words       8.1052 0.1234
## rate_negative_words              9.5537 0.1047
## avg_positive_polarity            5.8754 0.1702
## min_positive_polarity            1.6914 0.5912
## max_positive_polarity            3.5599 0.2809
## avg_negative_polarity            6.9929 0.1430
## min_negative_polarity            4.9159 0.2034
## max_negative_polarity            2.8579 0.3499
## title_subjectivity               2.1479 0.4656
## title_sentiment_polarity         1.2561 0.7961
## abs_title_subjectivity           1.3614 0.7345
## abs_title_sentiment_polarity     2.1492 0.4653
##                                        Wi
## timedelta                    1.161524e+02
## n_tokens_title               1.435600e+01
## n_tokens_content             3.676343e+02
## n_unique_tokens              1.163069e+07
## n_non_stop_words             3.224242e+06
## n_non_stop_unique_tokens     6.955965e+06
## num_hrefs                    5.643190e+01
## num_self_hrefs               4.718840e+01
## num_imgs                     1.158152e+02
## num_videos                   4.592660e+01
## average_token_length         2.702928e+02
## num_keywords                 5.761510e+01
## kw_min_min                   4.022590e+02
## kw_max_min                   2.358149e+03
## kw_avg_min                   2.508599e+03
## kw_min_max                   9.257620e+01
## kw_max_max                   5.317418e+02
## kw_avg_max                   2.831604e+02
## kw_min_avg                   1.759329e+02
## kw_max_avg                   6.047609e+02
## kw_avg_avg                   8.922124e+02
## self_reference_min_shares    1.838646e+02
## self_reference_max_shares    5.184610e+02
## self_reference_avg_sharess   8.269389e+02
## is_weekend                   3.070900e+00
## LDA_00                       5.396700e+00
## global_subjectivity          1.966565e+02
## global_sentiment_polarity    9.071634e+02
## global_rate_positive_words   4.574439e+02
## global_rate_negative_words   8.700284e+02
## rate_negative_words          1.047403e+03
## avg_positive_polarity        5.969939e+02
## min_positive_polarity        8.466310e+01
## max_positive_polarity        3.134568e+02
## avg_negative_polarity        7.338344e+02
## min_negative_polarity        4.794968e+02
## max_negative_polarity        2.274952e+02
## title_subjectivity           1.405599e+02
## title_sentiment_polarity     3.136110e+01
## abs_title_subjectivity       4.425230e+01
## abs_title_sentiment_polarity 1.407227e+02
##                                        Fi
## timedelta                    1.191550e+02
## n_tokens_title               1.472710e+01
## n_tokens_content             3.771378e+02
## n_unique_tokens              1.193134e+07
## n_non_stop_words             3.307591e+06
## n_non_stop_unique_tokens     7.135780e+06
## num_hrefs                    5.789060e+01
## num_self_hrefs               4.840820e+01
## num_imgs                     1.188091e+02
## num_videos                   4.711380e+01
## average_token_length         2.772799e+02
## num_keywords                 5.910450e+01
## kw_min_min                   4.126575e+02
## kw_max_min                   2.419109e+03
## kw_avg_min                   2.573448e+03
## kw_min_max                   9.496930e+01
## kw_max_max                   5.454875e+02
## kw_avg_max                   2.904802e+02
## kw_min_avg                   1.804808e+02
## kw_max_avg                   6.203942e+02
## kw_avg_avg                   9.152764e+02
## self_reference_min_shares    1.886176e+02
## self_reference_max_shares    5.318635e+02
## self_reference_avg_sharess   8.483156e+02
## is_weekend                   3.150300e+00
## LDA_00                       5.536200e+00
## global_subjectivity          2.017402e+02
## global_sentiment_polarity    9.306139e+02
## global_rate_positive_words   4.692690e+02
## global_rate_negative_words   8.925190e+02
## rate_negative_words          1.074479e+03
## avg_positive_polarity        6.124265e+02
## min_positive_polarity        8.685170e+01
## max_positive_polarity        3.215598e+02
## avg_negative_polarity        7.528044e+02
## min_negative_polarity        4.918920e+02
## max_negative_polarity        2.333761e+02
## title_subjectivity           1.441934e+02
## title_sentiment_polarity     3.217180e+01
## abs_title_subjectivity       4.539630e+01
## abs_title_sentiment_polarity 1.443605e+02
##                              Leamer       CVIF
## timedelta                    0.7164     1.9782
## n_tokens_title               0.9461     1.1342
## n_tokens_content             0.4999     4.0632
## n_unique_tokens              0.0032 96429.3409
## n_non_stop_words             0.0062 26732.7405
## n_non_stop_unique_tokens     0.0042 57671.9122
## num_hrefs                    0.8274     1.4831
## num_self_hrefs               0.8496     1.4064
## num_imgs                     0.7169     1.9754
## num_videos                   0.8528     1.3960
## average_token_length         0.5584     3.2562
## num_keywords                 0.8246     1.4929
## kw_min_min                   0.4831     4.3503
## kw_max_min                   0.2222    20.5663
## kw_avg_min                   0.2157    21.8136
## kw_min_max                   0.7546     1.7828
## kw_max_max                   0.4326     5.4238
## kw_avg_max                   0.5494     3.3629
## kw_min_avg                   0.6406     2.4738
## kw_max_avg                   0.4103     6.0292
## kw_avg_avg                   0.3474     8.4124
## self_reference_min_shares    0.6323     2.5396
## self_reference_max_shares    0.4371     5.3137
## self_reference_avg_sharess   0.3591     7.8712
## is_weekend                   0.9877     1.0407
## LDA_00                       0.9787     1.0600
## global_subjectivity          0.6195     2.6457
## global_sentiment_polarity    0.3449     8.5364
## global_rate_positive_words   0.4595     4.8078
## global_rate_negative_words   0.3513     8.2285
## rate_negative_words          0.3235     9.6991
## avg_positive_polarity        0.4126     5.9648
## min_positive_polarity        0.7689     1.7171
## max_positive_polarity        0.5300     3.6140
## avg_negative_polarity        0.3782     7.0993
## min_negative_polarity        0.4510     4.9907
## max_negative_polarity        0.5915     2.9013
## title_subjectivity           0.6823     2.1806
## title_sentiment_polarity     0.8922     1.2752
## abs_title_subjectivity       0.8571     1.3821
## abs_title_sentiment_polarity 0.6821     2.1819
##                              Klein   IND1
## timedelta                        1 0.0042
## n_tokens_title                   1 0.0073
## n_tokens_content                 1 0.0020
## n_unique_tokens                  1 0.0000
## n_non_stop_words                 1 0.0000
## n_non_stop_unique_tokens         1 0.0000
## num_hrefs                        1 0.0056
## num_self_hrefs                   1 0.0059
## num_imgs                         1 0.0042
## num_videos                       1 0.0059
## average_token_length             1 0.0025
## num_keywords                     1 0.0056
## kw_min_min                       1 0.0019
## kw_max_min                       1 0.0004
## kw_avg_min                       1 0.0004
## kw_min_max                       1 0.0047
## kw_max_max                       1 0.0015
## kw_avg_max                       1 0.0025
## kw_min_avg                       1 0.0034
## kw_max_avg                       1 0.0014
## kw_avg_avg                       1 0.0010
## self_reference_min_shares        1 0.0033
## self_reference_max_shares        1 0.0016
## self_reference_avg_sharess       1 0.0011
## is_weekend                       0 0.0080
## LDA_00                           0 0.0078
## global_subjectivity              1 0.0031
## global_sentiment_polarity        1 0.0010
## global_rate_positive_words       1 0.0017
## global_rate_negative_words       1 0.0010
## rate_negative_words              1 0.0009
## avg_positive_polarity            1 0.0014
## min_positive_polarity            1 0.0048
## max_positive_polarity            1 0.0023
## avg_negative_polarity            1 0.0012
## min_negative_polarity            1 0.0017
## max_negative_polarity            1 0.0029
## title_subjectivity               1 0.0038
## title_sentiment_polarity         1 0.0065
## abs_title_subjectivity           1 0.0060
## abs_title_sentiment_polarity     1 0.0038
##                                IND2
## timedelta                    0.7730
## n_tokens_title               0.1666
## n_tokens_content             1.1912
## n_unique_tokens              1.5879
## n_non_stop_words             1.5879
## n_non_stop_unique_tokens     1.5879
## num_hrefs                    0.5009
## num_self_hrefs               0.4417
## num_imgs                     0.7718
## num_videos                   0.4331
## average_token_length         1.0928
## num_keywords                 0.5081
## kw_min_min                   1.2173
## kw_max_min                   1.5095
## kw_avg_min                   1.5140
## kw_min_max                   0.6837
## kw_max_max                   1.2907
## kw_avg_max                   1.1085
## kw_min_avg                   0.9363
## kw_max_avg                   1.3205
## kw_avg_avg                   1.3963
## self_reference_min_shares    0.9531
## self_reference_max_shares    1.2845
## self_reference_avg_sharess   1.3831
## is_weekend                   0.0388
## LDA_00                       0.0670
## global_subjectivity          0.9786
## global_sentiment_polarity    1.3991
## global_rate_positive_words   1.2526
## global_rate_negative_words   1.3920
## rate_negative_words          1.4217
## avg_positive_polarity        1.3176
## min_positive_polarity        0.6491
## max_positive_polarity        1.1419
## avg_negative_polarity        1.3608
## min_negative_polarity        1.2649
## max_negative_polarity        1.0323
## title_subjectivity           0.8486
## title_sentiment_polarity     0.3238
## abs_title_subjectivity       0.4215
## abs_title_sentiment_polarity 0.8491
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test
## 
## n_tokens_title , n_tokens_content , n_unique_tokens , n_non_stop_words , n_non_stop_unique_tokens , num_self_hrefs , num_imgs , num_videos , average_token_length , num_keywords , kw_max_max , kw_avg_max , self_reference_min_shares , self_reference_max_shares , self_reference_avg_sharess , LDA_00 , global_sentiment_polarity , global_rate_positive_words , global_rate_negative_words , rate_negative_words , avg_positive_polarity , min_positive_polarity , max_positive_polarity , avg_negative_polarity , min_negative_polarity , max_negative_polarity , title_subjectivity , abs_title_subjectivity , abs_title_sentiment_polarity , coefficient(s) are non-significant may be due to multicollinearity
## 
## R-square of y on all x: 0.0866 
## 
## * use method argument to check which regressors may be the reason of collinearity
## ===================================

This improves the model multicolinearity, but we are still left with some. We then pare down and select those models with the next highest VIF removed one at a time, until all values are below 5.

Second Mulitcolinearity Trim

toRemove<-c("self_reference_avg_sharess", "kw_avg_min", "n_unique_tokens", "rate_negative_words", "kw_avg_avg", "n_non_stop_words", "global_sentiment_polarity", "avg_negative_polarity", "kw_max_max")
trimTrain2 <- trimTrain1[, ! names(trimTrain1) %in% toRemove, drop = F]

#Repeat linear Model process
lm4<- lm(log(shares)~., trimTrain2)

imcdiag(lm4)

## 
## Call:
## imcdiag(mod = lm4)
## 
## 
## All Individual Multicollinearity Diagnostics Result
## 
##                                 VIF    TOL
## timedelta                    1.7835 0.5607
## n_tokens_title               1.1147 0.8971
## n_tokens_content             2.4126 0.4145
## n_non_stop_unique_tokens     1.0144 0.9858
## num_hrefs                    1.3945 0.7171
## num_self_hrefs               1.3690 0.7305
## num_imgs                     1.4872 0.6724
## num_videos                   1.2979 0.7705
## average_token_length         2.1902 0.4566
## num_keywords                 1.3484 0.7416
## kw_min_min                   1.7865 0.5598
## kw_max_min                   1.6248 0.6155
## kw_min_max                   1.7194 0.5816
## kw_avg_max                   2.4462 0.4088
## kw_min_avg                   1.5730 0.6357
## kw_max_avg                   1.7006 0.5880
## self_reference_min_shares    1.0870 0.9200
## self_reference_max_shares    1.1181 0.8944
## is_weekend                   1.0199 0.9805
## LDA_00                       1.0390 0.9625
## global_subjectivity          2.4380 0.4102
## global_rate_positive_words   1.6757 0.5968
## global_rate_negative_words   1.5732 0.6356
## avg_positive_polarity        3.5549 0.2813
## min_positive_polarity        1.6436 0.6084
## max_positive_polarity        3.3739 0.2964
## min_negative_polarity        1.9729 0.5069
## max_negative_polarity        1.2655 0.7902
## title_subjectivity           2.1444 0.4663
## title_sentiment_polarity     1.2280 0.8143
## abs_title_subjectivity       1.3584 0.7362
## abs_title_sentiment_polarity 2.1264 0.4703
##                                    Wi       Fi
## timedelta                    124.0235 128.1838
## n_tokens_title                18.1529  18.7618
## n_tokens_content             223.5984 231.0988
## n_non_stop_unique_tokens       2.2860   2.3627
## num_hrefs                     62.4495  64.5443
## num_self_hrefs                58.4095  60.3688
## num_imgs                      77.1237  79.7107
## num_videos                    47.1492  48.7308
## average_token_length         188.3955 194.7150
## num_keywords                  55.1408  56.9904
## kw_min_min                   124.4928 128.6688
## kw_max_min                    98.9016 102.2192
## kw_min_max                   113.8798 117.6998
## kw_avg_max                   228.9205 236.5993
## kw_min_avg                    90.7006  93.7430
## kw_max_avg                   110.8908 114.6105
## self_reference_min_shares     13.7707  14.2326
## self_reference_max_shares     18.6975  19.3247
## is_weekend                     3.1439   3.2493
## LDA_00                         6.1670   6.3739
## global_subjectivity          227.6208 235.2561
## global_rate_positive_words   106.9514 110.5390
## global_rate_negative_words    90.7323  93.7758
## avg_positive_polarity        404.4187 417.9844
## min_positive_polarity        101.8805 105.2979
## max_positive_polarity        375.7605 388.3650
## min_negative_polarity        154.0013 159.1671
## max_negative_polarity         42.0236  43.4332
## title_subjectivity           181.1424 187.2186
## title_sentiment_polarity      36.0862  37.2967
## abs_title_subjectivity        56.7251  58.6279
## abs_title_sentiment_polarity 178.2941 184.2748
##                              Leamer   CVIF
## timedelta                    0.7488 1.7889
## n_tokens_title               0.9472 1.1181
## n_tokens_content             0.6438 2.4199
## n_non_stop_unique_tokens     0.9929 1.0175
## num_hrefs                    0.8468 1.3988
## num_self_hrefs               0.8547 1.3732
## num_imgs                     0.8200 1.4917
## num_videos                   0.8778 1.3018
## average_token_length         0.6757 2.1968
## num_keywords                 0.8612 1.3524
## kw_min_min                   0.7482 1.7919
## kw_max_min                   0.7845 1.6297
## kw_min_max                   0.7626 1.7247
## kw_avg_max                   0.6394 2.4536
## kw_min_avg                   0.7973 1.5778
## kw_max_avg                   0.7668 1.7057
## self_reference_min_shares    0.9591 1.0903
## self_reference_max_shares    0.9457 1.1215
## is_weekend                   0.9902 1.0230
## LDA_00                       0.9811 1.0421
## global_subjectivity          0.6404 2.4454
## global_rate_positive_words   0.7725 1.6808
## global_rate_negative_words   0.7973 1.5780
## avg_positive_polarity        0.5304 3.5657
## min_positive_polarity        0.7800 1.6486
## max_positive_polarity        0.5444 3.3841
## min_negative_polarity        0.7119 1.9789
## max_negative_polarity        0.8889 1.2693
## title_subjectivity           0.6829 2.1509
## title_sentiment_polarity     0.9024 1.2317
## abs_title_subjectivity       0.8580 1.3625
## abs_title_sentiment_polarity 0.6858 2.1328
##                              Klein   IND1
## timedelta                        1 0.0035
## n_tokens_title                   1 0.0057
## n_tokens_content                 1 0.0026
## n_non_stop_unique_tokens         0 0.0062
## num_hrefs                        1 0.0045
## num_self_hrefs                   1 0.0046
## num_imgs                         1 0.0042
## num_videos                       1 0.0049
## average_token_length             1 0.0029
## num_keywords                     1 0.0047
## kw_min_min                       1 0.0035
## kw_max_min                       1 0.0039
## kw_min_max                       1 0.0037
## kw_avg_max                       1 0.0026
## kw_min_avg                       1 0.0040
## kw_max_avg                       1 0.0037
## self_reference_min_shares        1 0.0058
## self_reference_max_shares        1 0.0057
## is_weekend                       0 0.0062
## LDA_00                           0 0.0061
## global_subjectivity              1 0.0026
## global_rate_positive_words       1 0.0038
## global_rate_negative_words       1 0.0040
## avg_positive_polarity            1 0.0018
## min_positive_polarity            1 0.0038
## max_positive_polarity            1 0.0019
## min_negative_polarity            1 0.0032
## max_negative_polarity            1 0.0050
## title_subjectivity               1 0.0029
## title_sentiment_polarity         1 0.0051
## abs_title_subjectivity           1 0.0047
## abs_title_sentiment_polarity     1 0.0030
##                                IND2
## timedelta                    1.2448
## n_tokens_title               0.2915
## n_tokens_content             1.6590
## n_non_stop_unique_tokens     0.0403
## num_hrefs                    0.8016
## num_self_hrefs               0.7637
## num_imgs                     0.9283
## num_videos                   0.6503
## average_token_length         1.5398
## num_keywords                 0.7320
## kw_min_min                   1.2474
## kw_max_min                   1.0896
## kw_min_max                   1.1856
## kw_avg_max                   1.6752
## kw_min_avg                   1.0322
## kw_max_avg                   1.1673
## self_reference_min_shares    0.2268
## self_reference_max_shares    0.2993
## is_weekend                   0.0552
## LDA_00                       0.1063
## global_subjectivity          1.6712
## global_rate_positive_words   1.1425
## global_rate_negative_words   1.0324
## avg_positive_polarity        2.0364
## min_positive_polarity        1.1096
## max_positive_polarity        1.9936
## min_negative_polarity        1.3973
## max_negative_polarity        0.5944
## title_subjectivity           1.5121
## title_sentiment_polarity     0.5260
## abs_title_subjectivity       0.7475
## abs_title_sentiment_polarity 1.5009
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test
## 
## n_tokens_title , n_tokens_content , n_non_stop_unique_tokens , num_self_hrefs , num_videos , self_reference_max_shares , LDA_00 , global_rate_positive_words , global_rate_negative_words , avg_positive_polarity , min_positive_polarity , max_positive_polarity , min_negative_polarity , max_negative_polarity , title_subjectivity , title_sentiment_polarity , abs_title_subjectivity , abs_title_sentiment_polarity , coefficient(s) are non-significant may be due to multicollinearity
## 
## R-square of y on all x: 0.063 
## 
## * use method argument to check which regressors may be the reason of collinearity
## ===================================

After removing 15 more variables for obvious multicolinearity via VIF (>5), we need to replot the correlation matrix, which shows a much lower clustering rate of high correlations.

Replot Correlation

#Remove the predictor
train_cor<-trimTrain2[1:31]
res <- cor(train_cor)
palette = colorRampPalette(c("green", "white", "red")) (20)
heatmap(x = res, col = palette, symm = TRUE, cexRow=0.5, cexCol = 0.5)

The new heatmap appears to have less prominent clustering values.

Final Model Fit Prediction

#trim the testing data
newTest1<-test[ -c(25:31) ]
toRemove<-c( "LDA_01", "LDA_02", "LDA_03", "LDA_04", "rate_positive_words", "self_reference_avg_sharess", "kw_avg_min", "n_unique_tokens", "rate_negative_words", "kw_avg_avg", "n_non_stop_words", "global_sentiment_polarity", "avg_negative_polarity", "kw_max_max")

trimTest4 <- newTest1[, ! names(newTest1) %in% toRemove, drop = F]

yhat_lm4<-predict(lm4,trimTest4)
RMSE_lm4<-sqrt(mean((trimTest4$shares - exp(yhat_lm4))^2))

Open Parallel Processing

library(parallel)
library(doParallel)
cores<-detectCores()
cl <- makeCluster(cores-1)  
registerDoParallel(cl)

Backward Regression Selection

Transform the response with log, then fit a linear regression model with the variables after backward selection.

#backward selection after log transformation
library(leaps)
backward<- regsubsets(log(shares)~., trimTrain1, nvmax = 31, method = "backward")
backward_summary<-summary(backward)

#backward_summary[["which"]][size, ]
par(mfrow=c(1,3))
plot(backward_summary$cp, xlab = "Size", ylab = "backward Cp", type = "l")
plot(backward_summary$bic, xlab = "Size", ylab = "backward bic", type = "l")
plot(backward_summary$adjr2, xlab = "Size", ylab = "backward adjR2", type = "l")

coef(backward, which.min(backward_summary$cp))

##                (Intercept) 
##               6.359931e+00 
##                  timedelta 
##               2.248744e-04 
##            n_unique_tokens 
##               2.460419e-01 
##           n_non_stop_words 
##              -1.639688e-01 
##                  num_hrefs 
##               3.949443e-03 
##                   num_imgs 
##               2.802931e-03 
##       average_token_length 
##              -3.251348e-02 
##                 kw_min_min 
##               9.892610e-04 
##                 kw_max_min 
##               7.506077e-05 
##                 kw_avg_min 
##              -4.979849e-04 
##                 kw_min_max 
##              -1.575758e-06 
##                 kw_avg_max 
##              -3.675655e-07 
##                 kw_min_avg 
##              -6.084209e-05 
##                 kw_max_avg 
##              -3.928801e-05 
##                 kw_avg_avg 
##               3.665636e-04 
##  self_reference_min_shares 
##               5.092764e-06 
## self_reference_avg_sharess 
##               3.115176e-06 
##                 is_weekend 
##               3.834877e-01 
##        global_subjectivity 
##               4.009758e-01 
##   title_sentiment_polarity 
##               8.662371e-02

coef(backward, which.max(backward_summary$adjr2))

##                (Intercept) 
##               6.275777e+00 
##                  timedelta 
##               2.372227e-04 
##            n_unique_tokens 
##               2.359931e-01 
##           n_non_stop_words 
##              -1.571340e-01 
##                  num_hrefs 
##               4.390640e-03 
##             num_self_hrefs 
##              -6.530367e-03 
##                   num_imgs 
##               2.730878e-03 
##       average_token_length 
##              -3.634229e-02 
##               num_keywords 
##               1.096753e-02 
##                 kw_min_min 
##               9.840874e-04 
##                 kw_max_min 
##               7.101020e-05 
##                 kw_avg_min 
##              -4.744792e-04 
##                 kw_min_max 
##              -1.497845e-06 
##                 kw_avg_max 
##              -3.054739e-07 
##                 kw_min_avg 
##              -5.262641e-05 
##                 kw_max_avg 
##              -3.814651e-05 
##                 kw_avg_avg 
##               3.540103e-04 
##  self_reference_min_shares 
##               4.831342e-06 
## self_reference_avg_sharess 
##               3.177461e-06 
##                 is_weekend 
##               3.797716e-01 
##        global_subjectivity 
##               4.278686e-01 
##  global_sentiment_polarity 
##              -2.743433e-01 
##      avg_positive_polarity 
##               3.187604e-01 
##      avg_negative_polarity 
##               1.800516e-01 
##   title_sentiment_polarity 
##               8.969644e-02

#get best subset of the specified size with min cp.
sub <- backward_summary$which[which.min(backward_summary$cp), ]

# Create test model matrix, predcition, test error
test_model <- model.matrix(log(shares)~ ., data = lmNewTest3)
model <- test_model[, sub]
yhat_back<-model %*% coef(backward, which.min(backward_summary$cp))
RMSE_back<-sqrt(mean((test$shares - exp(yhat_back))^2))

Random Forests

As previously mentioned in the regression trees section, the random forest builds an entire forest of these trees, and merges them together to get a more accurate and stable predictions than one off trees. It is usually trained using the bagging method. Unlike regression trees, which are prone to overfitting, only a random subset of the features is taken into consideration by the algorithm for splitting a node (used CV to find the perfect amount of variables to use). This builds in additional error and makes a more robust prediction.

The manual dimensional reduction was necessary to have the processing speeds to handle the random forests model.

library(randomForest)
#single bagged model
tree.train<-randomForest(shares~., data=trimTrain1, mtry=32, importance=TRUE)
tree.train

## 
## Call:
##  randomForest(formula = shares ~ ., data = trimTrain1, mtry = 32,      importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 32
## 
##           Mean of squared residuals: 52277702
##                     % Var explained: -4.43

#single bagged regression tree error prediction
tree.test<-lmNewTest3["shares"]
yhat.bag<-predict(tree.train, newdata=lmNewTest3)
yhat.bag<-as.data.frame(yhat.bag)
yhat_bag<-mean((yhat.bag$yhat.bag-tree.test$shares)^2)
RMSE_bag<-sqrt(yhat_bag)

#run parallel processing to determine the best mtry value
control <- trainControl(method="repeatedcv", number=15, repeats=3, search="random")
mtry <- sqrt(ncol(trimTrain1))
rf_random <- train(shares~., data=trimTrain1, method="rf", tuneLength=15, trControl=control)
print(rf_random)

## Random Forest 
## 
## 4939 samples
##   41 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (15 fold, repeated 3 times) 
## Summary of sample sizes: 4609, 4609, 4610, 4610, 4609, 4609, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared    MAE     
##    2    6629.011  0.03706536  2822.009
##    4    6669.485  0.03526208  2880.715
##    6    6718.484  0.03211413  2921.189
##   13    6785.964  0.03274811  2971.154
##   15    6812.224  0.03070878  2983.034
##   16    6816.179  0.03059629  2985.856
##   17    6817.829  0.03149012  2987.973
##   25    6864.298  0.03101920  3013.743
##   27    6879.914  0.02988849  3019.211
##   28    6873.506  0.03044771  3018.112
##   34    6908.863  0.02948317  3026.094
##   37    6921.538  0.02939710  3033.242
##   40    6942.481  0.02839888  3037.883
## 
## RMSE was used to select the optimal model
##  using the smallest value.
## The final value used for the model was mtry = 2.

plot(rf_random)

mtry<-which.min(rf_random$results$RMSE)

#USe a model to determine the best number of trees
control <- trainControl(method="repeatedcv", number=5, repeats=3, search="grid")
tunegrid <- expand.grid(.mtry=mtry)
modellist <- list()
for (ntree in c(500, 1000, 1500, 2000)) {
fit <- train(shares~., data=trimTrain1, method="rf", tuneGrid=tunegrid, trControl=control, ntree=ntree)
key <- toString(ntree)
modellist[[key]] <- fit
}
results <- resamples(modellist)
summary(results)

## 
## Call:
## summary.resamples(object = results)
## 
## Models: 500, 1000, 1500, 2000 
## Number of resamples: 15 
## 
## MAE 
##          Min.  1st Qu.   Median     Mean
## 500  2545.326 2709.938 2757.728 2763.906
## 1000 2511.312 2679.102 2713.901 2758.375
## 1500 2439.440 2662.122 2835.812 2763.393
## 2000 2517.474 2615.096 2676.085 2764.019
##       3rd Qu.     Max. NA's
## 500  2822.120 3012.633    0
## 1000 2848.338 3058.273    0
## 1500 2871.649 3016.605    0
## 2000 2896.648 3075.132    0
## 
## RMSE 
##          Min.  1st Qu.   Median     Mean
## 500  5327.972 6211.123 6634.984 6896.760
## 1000 4775.656 6095.685 6712.796 6879.728
## 1500 4323.084 5466.882 6645.817 6815.589
## 2000 5033.916 5611.289 5905.956 6810.231
##       3rd Qu.     Max. NA's
## 500  7503.016 9386.653    0
## 1000 7951.741 8655.002    0
## 1500 7811.525 9349.352    0
## 2000 7924.521 9898.586    0
## 
## Rsquared 
##             Min.    1st Qu.     Median
## 500  0.014468658 0.02183457 0.03149408
## 1000 0.014062916 0.02263286 0.03069537
## 1500 0.005736438 0.01784515 0.03438637
## 2000 0.013776481 0.02000829 0.02904484
##            Mean    3rd Qu.       Max. NA's
## 500  0.02944354 0.03554660 0.05104536    0
## 1000 0.03180291 0.03930603 0.06591972    0
## 1500 0.03529606 0.04050349 0.09869772    0
## 2000 0.03081709 0.03489194 0.06315269    0

#Apply best fit parameters to model. 
#random forests model
tree.trainRF<-randomForest(shares~., data=trimTrain1, mtry=mtry,  importance=TRUE)
tree.trainRF

## 
## Call:
##  randomForest(formula = shares ~ ., data = trimTrain1, mtry = mtry,      importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 48715196
##                     % Var explained: 2.68

#random forest error prediction
tree.trainRF<-randomForest(shares~., data=trimTrain1, ntree=500, mtry=2, importance=TRUE)
tree.trainRF

## 
## Call:
##  randomForest(formula = shares ~ ., data = trimTrain1, ntree = 500,      mtry = 2, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 48709814
##                     % Var explained: 2.69

plot(tree.trainRF)

#Calculate Prediction
yhat.rf<-predict(tree.trainRF, newdata = lmNewTest3)
yhat.rf<-as.data.frame(yhat.rf)
yhat_rf<-mean((yhat.rf$yhat.rf-tree.test$shares)^2)
RMSE_rfTrimmed<-sqrt(yhat_rf)
varImpPlot(tree.trainRF)

Boosted Tree

Boosting is a general approach that can be applied to many statistical learning methods for regression or classification. The trees in boosting are grown sequentially : each tree is grown using information from previously grown trees. Boosting does not involve bootstrap sampling; instead each tree is fit on a modified version of the original data set.

Procedure (for regression trees):
1.Initialize predictions as 0,
2.Find the residuals (observed-predicted), call the set of them
3.Fit a tree with splits (d+1 terminal nodes) treating the residuals as the response (which they are for the first fit)
4.Update predictions
5.Update residuals for new predictions and repeat B times

Tune parameters must be chosen shrinkage, B and d in the boosting tree model.

cvcontrol <- trainControl(method="repeatedcv", number = 10,
                          allowParallel=TRUE)
grid <- expand.grid(n.trees = c(1000,1500), 
                    interaction.depth=c(1:3), 
                    shrinkage=c(0.01,0.05,0.1), 
                    n.minobsinnode=c(20))
capture<-capture.output(train.gbm <- train(log(shares) ~ ., 
                   data=train,
                   method="gbm",
                   trControl=cvcontrol,
                   tuneGrid = grid))
train.gbm

## Stochastic Gradient Boosting 
## 
## 4939 samples
##   53 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 4443, 4446, 4445, 4445, 4445, 4445, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage
##   0.01     
##   0.01     
##   0.01     
##   0.01     
##   0.01     
##   0.01     
##   0.05     
##   0.05     
##   0.05     
##   0.05     
##   0.05     
##   0.05     
##   0.10     
##   0.10     
##   0.10     
##   0.10     
##   0.10     
##   0.10     
##   interaction.depth  n.trees
##   1                  1000   
##   1                  1500   
##   2                  1000   
##   2                  1500   
##   3                  1000   
##   3                  1500   
##   1                  1000   
##   1                  1500   
##   2                  1000   
##   2                  1500   
##   3                  1000   
##   3                  1500   
##   1                  1000   
##   1                  1500   
##   2                  1000   
##   2                  1500   
##   3                  1000   
##   3                  1500   
##   RMSE       Rsquared  
##   0.8831526  0.08358297
##   0.8828966  0.08326902
##   0.8811912  0.08686116
##   0.8815343  0.08632587
##   0.8803682  0.08859167
##   0.8805245  0.08876009
##   0.8849067  0.08109300
##   0.8865315  0.08020106
##   0.8857669  0.08439775
##   0.8903216  0.08201295
##   0.8931726  0.07775127
##   0.8989670  0.07523577
##   0.8903901  0.07584644
##   0.8934907  0.07478507
##   0.9021662  0.07009803
##   0.9114919  0.06577996
##   0.9111418  0.06912994
##   0.9236946  0.06350299
##   MAE      
##   0.6432767
##   0.6431562
##   0.6416479
##   0.6420692
##   0.6411333
##   0.6413508
##   0.6448725
##   0.6459971
##   0.6464764
##   0.6518540
##   0.6533379
##   0.6593740
##   0.6505821
##   0.6548596
##   0.6639618
##   0.6722264
##   0.6686273
##   0.6800523
## 
## Tuning
##  held constant at a value
##  of 20
## RMSE was used to select
##  the optimal model using
##  the smallest value.
## The final values used for
##  = 3, shrinkage = 0.01
##  and n.minobsinnode = 20.

boostPred <- predict(train.gbm, newdata = test)
RMSE_boost <- sqrt(mean((test$shares - exp(boostPred))^2))

stopCluster(cl)

Comparison

Generally, the model with the lowest RMSE is the best on comparison.

comparison<-data.frame(RMSE_lm, RMSE_lm2, RMSE_lm3, RMSE_lm4, RMSE_back,  RMSE_bag, RMSE_rfTrimmed, RMSE_boost, RMSE_regTree)

comparison

##    RMSE_lm RMSE_lm2 RMSE_lm3
## 1 9473.409 9473.886 9472.282
##   RMSE_lm4 RMSE_back RMSE_bag
## 1 9532.991  9453.578 9576.123
##   RMSE_rfTrimmed RMSE_boost
## 1       9271.053   9492.716
##   RMSE_regTree
## 1     9542.537

which.min(comparison)

## RMSE_rfTrimmed 
##              7

The overall prediction error rate for this data set is very high. This is likely due to the high values of outlier articles with freakishly high shares, that are timely AND viral. These values were NOT removed from analysis, as these are the share metrics that a company would likely want to evaluate for emulation.

ST558Project2