<!DOCTYPE html>

ST 558 Project 2

library(rmarkdown)
library(usethis)
use_git_config(user.name="Mandy Liesch", user.email="amliesch@ncsu.edu")

Introduction

Online News Popularity Data Set summarizes a heterogeneous set of features about articles published by Mashable in a period of two years. The goal is to predict the number of shares in social networks (popularity). Here we first showed some summary statistics and plots about the data grouped by weekdays. Then we create several models to predict the response, shares in different channels. The performance of these models will be evaluated by RMSE. The model having the lowest RMSE will be selected as a winner. The methods of modeling include:

Regression Tree
Log Transformed Full Linear Regression Model
Linear Regression Model Without Day of the Week
Subset Linear Regression Model #1
Subset Linear Regression Model #2
Backward Selection Linear Regression
Bagged Regression Tree
Random Forest Model
Boosted Tree Model

Data preparation

Subset Data by Channel

library(tidyverse)

data_whole<-read_csv("OnlineNewsPopularity/OnlineNewsPopularity.csv")

#create a new variable, channel, to help with the subsetting.
data_whole$channel <- names(data_whole[14:19])[apply(data_whole[14:19],1, match, x = 1)]
data_whole$channel <-sub("data_channel_is_", "", data_whole$channel)

#Subset the data to work on the data channel of interest
#channel_interest = params[[1]]$team

#Get the important data
data_interest<-data_whole%>%
  filter(channel==x[[2]]$team)%>%
  select(-c(1,14:19,62))

Establish Training Data

Split the data into a training (70% of the data) and test set (30% of the data)

library(caret)
library(rsample)
set.seed(14)
index <- initial_split(data_interest,
                       prop = 0.7)
train <- training(index)
test <- testing(index)

Data Summaries

Correlation Plots

This graphical function looks at the correlation of all of the different variables against each other.

library(corrplot)
#drop values that are not important (the days of the week)
newTrain<-train[ -c(25:31) ]
lmNewTest<-test[ -c(25:31) ]
#drop the predictor variables
predictTrain<-newTrain[ -c(47) ]
#Calculate the correlation Matrix and round it
res <- cor(predictTrain)

#Plot the correlation matrix values by cluster
corrplot(res, type = "upper", order = "hclust",
         tl.col = "black", tl.cex = 0.5)

From the results of this spot, it appears that we likely have some clusters of colinearity.

Table Summary

We summarize the train data of interest in tables grouped by weekdays, showing the pattern of shares in a week.

#create a new variable, weekday, to help with the creating plots.
train$weekday <- names(train[25:31])[apply(train[25:31],1, match, x = 1)]
train$weekday <-sub("weekday_is_", "", train$weekday)

#summarize the train data by weekday.knitr::kable(
summary<-train%>%group_by(weekday)%>%
  summarise(Avg=round(mean(shares),0),Sd=round(sd(shares),0),Median=median(shares),IQR=round(IQR(shares),0))
knitr::kable(summary)

weekday	Avg	Sd	Median	IQR
friday	3896	5122	2200	2400
monday	4008	5391	2450	3025
saturday	3520	3784	2450	2275
sunday	4676	6409	2650	3175
thursday	2965	2793	2000	2150
tuesday	3369	4412	1900	2300
wednesday	3414	4062	2100	2500

We summarize the train data of interest in the plots below. The histogram of shares shows that it is not a normal distribution. After log transformation, the distribution of log(share) is more close to a normal distribution.

#histogram of shares and log(shares).
hist(train$shares)

hist(log(train$shares))

Data Plots

Box Plots

We use box plots to show the difference in shares and num_images between weekdays and weekends.If the boxes of weekends are higher than the ones of weekdays, then articles be shared more often during weekends.

g1<-ggplot(train, aes(x=factor(is_weekend,labels=c("No", "Yes")),y=shares))
g1+geom_boxplot(fill="white", width=0.5,lwd=1.5,color='black',outlier.shape = NA)+
   scale_y_continuous(limits = quantile(train$shares, c(0.1, 0.9)))+
   labs(subtitle = "Shares on weekend",x="On weekend or not")

g2<-ggplot(train, aes(x=factor(is_weekend,labels=c("No", "Yes")),y=num_imgs))
g2+geom_boxplot(fill="white", width=0.5,lwd=1.5,color='black',outlier.shape = NA)+
   scale_y_continuous(limits = quantile(train$num_imgs, c(0, 0.95)))+
   labs(subtitle = "number of images on weekend",x="On weekend or not")

Linear Model

We can inspect the trend of shares as a function of num_images. If the points show an upward trend, then articles with more images tend to be shared more often. If we see a negative trend then articles with more images tend to be shared less often. We can also observe the difference after the log transformation.

g3<-ggplot(train,aes(x=num_imgs,y=shares))
g3+geom_point()+
  labs(subtitle = "num_imgs vs shares")+
  scale_y_continuous(limits = quantile(train$shares, c(0, 0.9)))+
  scale_x_continuous(limits = quantile(train$num_imgs, c(0, 0.9)))+
  geom_smooth(method="lm")

g4<-ggplot(train,aes(x=num_imgs,y=log(shares)))
g4+geom_point()+
  labs(subtitle = "num_imgs vs log(shares)")+
  scale_y_continuous(limits = quantile(log(train$shares), c(0, 0.9)))+
  scale_x_continuous(limits = quantile(train$num_imgs, c(0, 0.9)))+
  geom_smooth(method="lm")

#remove weekday from data set
train<-train%>%select(-weekday)

Models

Regression Tree

Classification trees are machine learning algorithms that have several benefits, including the ease of operation, and less pre-processing. Data does not require normalization, scaling, and removal of missing values. The results are usually easy to explain, and stakeholders usually can understand them. A regression tree is a tree that uses numerical values to predict the nodes and tree branches. Despite all of the benefits, the Decision Tree algorithm can’t be used for regression and predicting continuous values, it also does not transfer well to other datasets.

library(tree)
tree.news<-tree(shares~., data=train)
summary(tree.news)

## 
## Regression tree:
## tree(formula = shares ~ ., data = train)
## Variables actually used in tree construction:
##  [1] "self_reference_avg_sharess"  
##  [2] "min_positive_polarity"       
##  [3] "abs_title_sentiment_polarity"
##  [4] "kw_avg_max"                  
##  [5] "LDA_00"                      
##  [6] "n_non_stop_unique_tokens"    
##  [7] "global_sentiment_polarity"   
##  [8] "num_videos"                  
##  [9] "kw_avg_min"                  
## [10] "kw_min_max"                  
## [11] "LDA_02"                      
## [12] "LDA_03"                      
## Number of terminal nodes:  13 
## Residual mean deviance:  15310000 = 2.47e+10 / 1613 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu. 
## -17000.0  -1733.0   -968.7      0.0    444.0 
##     Max. 
##  37440.0

plot(tree.news)
text(tree.news, pretty=0)

yhat.regTree<- predict(tree.news, newdata = test)
yhat.test<-test["shares"]
yhat.regTree<-as.data.frame(yhat.regTree)
meanRegTree<-mean((yhat.regTree$yhat.regTree-yhat.test$shares)^2)

RMSE_regTree<-sqrt(meanRegTree)

These results can vary widely depending on the datasets.

Linear Models

Linear models are very valuable and powerful tools, and are very versatile, and can be applied to many situations. Multiple regression examines the relationship between several independent variables and one dependent variable (in this case, total Shares). Regression models give users the ability to determine the relative influence of one or more predictor variables to the predictor, and it also allows users to identify outliers, or anomalies. The main disadvantages have to do with the input quality of data. Input that is incomplete may lead to wrong conclusions. It also assumes that data is independent, which is not always the case.

There are several different types of linear models. In this project, we use multiple different multiple regression values that were log transformed, representing the full dataset, and several partial subsets with multiple variables removed at different points for multicolinearity reasons.

There are also several different types of variable selection, including forward, backward, and stepwise, which user predefined criteria set the entry and/or exit criteria of the models. Backwards selection starts with a full model, and then removes variables that are least significant one at a time, until the model criteria defined by the user are hit. Forward regression does the opposite, and is not represented here.

Linear Regression After Log Transformation

Transform the response with log, then fit a linear regression model with all the variables. Then calculate the RMSE of the model.

lm<- lm(log(shares)~.,train)
summary(lm)

## 
## Call:
## lm(formula = log(shares) ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8031 -0.4854 -0.0801  0.3906  3.4079 
## 
## Coefficients: (3 not defined because of singularities)
##                                Estimate
## (Intercept)                   6.545e+00
## timedelta                     4.318e-04
## n_tokens_title               -3.308e-03
## n_tokens_content              2.749e-06
## n_unique_tokens              -6.531e-01
## n_non_stop_words              3.421e-01
## n_non_stop_unique_tokens     -6.008e-01
## num_hrefs                    -5.360e-03
## num_self_hrefs               -1.197e-03
## num_imgs                     -5.212e-03
## num_videos                    1.058e-02
## average_token_length         -2.886e-02
## num_keywords                  4.856e-02
## kw_min_min                    7.956e-04
## kw_max_min                   -1.829e-05
## kw_avg_min                    7.622e-05
## kw_min_max                   -1.173e-06
## kw_max_max                    7.596e-08
## kw_avg_max                    5.840e-07
## kw_min_avg                   -1.363e-05
## kw_max_avg                   -3.077e-05
## kw_avg_avg                    2.774e-04
## self_reference_min_shares    -7.309e-08
## self_reference_max_shares    -9.991e-07
## self_reference_avg_sharess    3.819e-06
## weekday_is_monday            -1.085e-01
## weekday_is_tuesday           -2.453e-01
## weekday_is_wednesday         -2.309e-01
## weekday_is_thursday          -2.957e-01
## weekday_is_friday            -1.020e-01
## weekday_is_saturday          -1.759e-01
## weekday_is_sunday                    NA
## is_weekend                           NA
## LDA_00                        5.704e-01
## LDA_01                       -5.812e-01
## LDA_02                        4.556e-02
## LDA_03                       -1.695e-01
## LDA_04                               NA
## global_subjectivity          -1.402e-01
## global_sentiment_polarity     2.535e-01
## global_rate_positive_words    5.764e-01
## global_rate_negative_words   -6.943e+00
## rate_positive_words           6.558e-01
## rate_negative_words           1.248e+00
## avg_positive_polarity        -5.645e-01
## min_positive_polarity        -9.534e-01
## max_positive_polarity        -1.564e-01
## avg_negative_polarity         8.285e-02
## min_negative_polarity        -1.711e-01
## max_negative_polarity        -1.671e-01
## title_subjectivity            3.270e-02
## title_sentiment_polarity     -8.096e-02
## abs_title_subjectivity        2.817e-01
## abs_title_sentiment_polarity  3.364e-01
##                              Std. Error
## (Intercept)                   3.623e-01
## timedelta                     1.527e-04
## n_tokens_title                9.410e-03
## n_tokens_content              7.120e-05
## n_unique_tokens               6.221e-01
## n_non_stop_words              9.102e-01
## n_non_stop_unique_tokens      5.597e-01
## num_hrefs                     2.045e-03
## num_self_hrefs                4.203e-03
## num_imgs                      3.408e-03
## num_videos                    5.159e-03
## average_token_length          8.535e-02
## num_keywords                  1.145e-02
## kw_min_min                    4.569e-04
## kw_max_min                    2.356e-05
## kw_avg_min                    8.261e-05
## kw_min_max                    2.716e-07
## kw_max_max                    1.677e-07
## kw_avg_max                    3.096e-07
## kw_min_avg                    2.541e-05
## kw_max_avg                    8.900e-06
## kw_avg_avg                    5.378e-05
## self_reference_min_shares     2.332e-06
## self_reference_max_shares     8.901e-07
## self_reference_avg_sharess    2.785e-06
## weekday_is_monday             9.654e-02
## weekday_is_tuesday            9.313e-02
## weekday_is_wednesday          9.394e-02
## weekday_is_thursday           9.314e-02
## weekday_is_friday             9.670e-02
## weekday_is_saturday           1.091e-01
## weekday_is_sunday                    NA
## is_weekend                           NA
## LDA_00                        1.169e-01
## LDA_01                        1.847e-01
## LDA_02                        1.269e-01
## LDA_03                        1.336e-01
## LDA_04                               NA
## global_subjectivity           2.937e-01
## global_sentiment_polarity     5.284e-01
## global_rate_positive_words    2.276e+00
## global_rate_negative_words    5.208e+00
## rate_positive_words           7.906e-01
## rate_negative_words           8.367e-01
## avg_positive_polarity         4.631e-01
## min_positive_polarity         3.686e-01
## max_positive_polarity         1.421e-01
## avg_negative_polarity         4.271e-01
## min_negative_polarity         1.588e-01
## max_negative_polarity         3.437e-01
## title_subjectivity            9.924e-02
## title_sentiment_polarity      9.859e-02
## abs_title_subjectivity        1.253e-01
## abs_title_sentiment_polarity  1.463e-01
##                              t value Pr(>|t|)
## (Intercept)                   18.066  < 2e-16
## timedelta                      2.828  0.00475
## n_tokens_title                -0.352  0.72524
## n_tokens_content               0.039  0.96921
## n_unique_tokens               -1.050  0.29390
## n_non_stop_words               0.376  0.70705
## n_non_stop_unique_tokens      -1.073  0.28324
## num_hrefs                     -2.621  0.00885
## num_self_hrefs                -0.285  0.77574
## num_imgs                      -1.530  0.12633
## num_videos                     2.050  0.04050
## average_token_length          -0.338  0.73532
## num_keywords                   4.241 2.36e-05
## kw_min_min                     1.741  0.08183
## kw_max_min                    -0.777  0.43748
## kw_avg_min                     0.923  0.35635
## kw_min_max                    -4.319 1.67e-05
## kw_max_max                     0.453  0.65066
## kw_avg_max                     1.886  0.05943
## kw_min_avg                    -0.536  0.59169
## kw_max_avg                    -3.457  0.00056
## kw_avg_avg                     5.158 2.81e-07
## self_reference_min_shares     -0.031  0.97500
## self_reference_max_shares     -1.122  0.26183
## self_reference_avg_sharess     1.371  0.17049
## weekday_is_monday             -1.124  0.26127
## weekday_is_tuesday            -2.634  0.00852
## weekday_is_wednesday          -2.458  0.01409
## weekday_is_thursday           -3.175  0.00153
## weekday_is_friday             -1.055  0.29172
## weekday_is_saturday           -1.612  0.10710
## weekday_is_sunday                 NA       NA
## is_weekend                        NA       NA
## LDA_00                         4.880 1.17e-06
## LDA_01                        -3.146  0.00168
## LDA_02                         0.359  0.71959
## LDA_03                        -1.269  0.20467
## LDA_04                            NA       NA
## global_subjectivity           -0.477  0.63319
## global_sentiment_polarity      0.480  0.63143
## global_rate_positive_words     0.253  0.80008
## global_rate_negative_words    -1.333  0.18269
## rate_positive_words            0.830  0.40694
## rate_negative_words            1.492  0.13589
## avg_positive_polarity         -1.219  0.22311
## min_positive_polarity         -2.587  0.00978
## max_positive_polarity         -1.100  0.27140
## avg_negative_polarity          0.194  0.84623
## min_negative_polarity         -1.077  0.28144
## max_negative_polarity         -0.486  0.62694
## title_subjectivity             0.330  0.74177
## title_sentiment_polarity      -0.821  0.41168
## abs_title_subjectivity         2.248  0.02474
## abs_title_sentiment_polarity   2.299  0.02165
##                                 
## (Intercept)                  ***
## timedelta                    ** 
## n_tokens_title                  
## n_tokens_content                
## n_unique_tokens                 
## n_non_stop_words                
## n_non_stop_unique_tokens        
## num_hrefs                    ** 
## num_self_hrefs                  
## num_imgs                        
## num_videos                   *  
## average_token_length            
## num_keywords                 ***
## kw_min_min                   .  
## kw_max_min                      
## kw_avg_min                      
## kw_min_max                   ***
## kw_max_max                      
## kw_avg_max                   .  
## kw_min_avg                      
## kw_max_avg                   ***
## kw_avg_avg                   ***
## self_reference_min_shares       
## self_reference_max_shares       
## self_reference_avg_sharess      
## weekday_is_monday               
## weekday_is_tuesday           ** 
## weekday_is_wednesday         *  
## weekday_is_thursday          ** 
## weekday_is_friday               
## weekday_is_saturday             
## weekday_is_sunday               
## is_weekend                      
## LDA_00                       ***
## LDA_01                       ** 
## LDA_02                          
## LDA_03                          
## LDA_04                          
## global_subjectivity             
## global_sentiment_polarity       
## global_rate_positive_words      
## global_rate_negative_words      
## rate_positive_words             
## rate_negative_words             
## avg_positive_polarity           
## min_positive_polarity        ** 
## max_positive_polarity           
## avg_negative_polarity           
## min_negative_polarity           
## max_negative_polarity           
## title_subjectivity              
## title_sentiment_polarity        
## abs_title_subjectivity       *  
## abs_title_sentiment_polarity *  
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7616 on 1575 degrees of freedom
## Multiple R-squared:  0.2238, Adjusted R-squared:  0.1992 
## F-statistic: 9.084 on 50 and 1575 DF,  p-value: < 2.2e-16

yhat_lm<-predict(lm,test)
RMSE_lm<-sqrt(mean((test$shares - exp(yhat_lm))^2))

Plot the lm Residuals

par(mfrow=c(2,2))
plot(lm)

Looking at our residuals, there seems to be skewing in both direction, indicating that the data, even after transformation, has extreme outliers in both directions.

Model Removing the Day Variable

#look at the data for multicolinearity
lmNewTest<-test[ -c(25:31) ]
lm2<- lm(log(shares)~.,newTrain)
yhat_lm2<-predict(lm2,lmNewTest)
RMSE_lm2<-sqrt(mean((lmNewTest$shares - exp(yhat_lm2))^2))

library(mctest)
omcdiag(lm2)

## 
## Call:
## omcdiag(mod = lm2)
## 
## 
## Overall Multicollinearity Diagnostics
## 
##                           MC Results detection
## Determinant |X'X|:      0.000000e+00         1
## Farrar Chi-Square:      1.123337e+05         1
## Red Indicator:          1.699000e-01         0
## Sum of Lambda Inverse: -1.852839e+15         0
## Theil's Method:         2.401140e+01         1
## Condition Number:                NaN        NA
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test

imcdiag(lm2)

## 
## Call:
## imcdiag(mod = lm2)
## 
## 
## All Individual Multicollinearity Diagnostics Result
## 
##                                  VIF    TOL
## timedelta                     2.6385 0.3790
## n_tokens_title                1.1231 0.8904
## n_tokens_content              4.3346 0.2307
## n_unique_tokens              15.8991 0.0629
## n_non_stop_words             14.1605 0.0706
## n_non_stop_unique_tokens     11.9212 0.0839
## num_hrefs                     2.8901 0.3460
## num_self_hrefs                1.9459 0.5139
## num_imgs                      1.9782 0.5055
## num_videos                    1.2102 0.8263
## average_token_length          3.9457 0.2534
## num_keywords                  1.7864 0.5598
## kw_min_min                    3.8111 0.2624
## kw_max_min                   10.6723 0.0937
## kw_avg_min                   10.4727 0.0955
## kw_min_max                    2.9877 0.3347
## kw_max_max                    4.7543 0.2103
## kw_avg_max                    5.6922 0.1757
## kw_min_avg                    2.9120 0.3434
## kw_max_avg                    9.4605 0.1057
## kw_avg_avg                   11.7101 0.0854
## self_reference_min_shares     7.1598 0.1397
## self_reference_max_shares     4.2655 0.2344
## self_reference_avg_sharess   12.6587 0.0790
## is_weekend                    1.1414 0.8761
## LDA_00                           Inf 0.0000
## LDA_01                           Inf 0.0000
## LDA_02                           Inf 0.0000
## LDA_03                           Inf 0.0000
## LDA_04                           Inf 0.0000
## global_subjectivity           2.1811 0.4585
## global_sentiment_polarity     7.2628 0.1377
## global_rate_positive_words    3.9485 0.2533
## global_rate_negative_words    7.0883 0.1411
## rate_positive_words          36.4128 0.0275
## rate_negative_words          34.3847 0.0291
## avg_positive_polarity         5.6245 0.1778
## min_positive_polarity         2.0946 0.4774
## max_positive_polarity         2.7954 0.3577
## avg_negative_polarity         8.4530 0.1183
## min_negative_polarity         6.1390 0.1629
## max_negative_polarity         3.5893 0.2786
## title_subjectivity            2.6708 0.3744
## title_sentiment_polarity      1.8543 0.5393
## abs_title_subjectivity        1.4851 0.6734
## abs_title_sentiment_polarity  3.2414 0.3085
##                                     Wi
## timedelta                      57.5288
## n_tokens_title                  4.3220
## n_tokens_content              117.0824
## n_unique_tokens               523.1255
## n_non_stop_words              462.0781
## n_non_stop_unique_tokens      383.4559
## num_hrefs                      66.3638
## num_self_hrefs                 33.2104
## num_imgs                       34.3470
## num_videos                      7.3799
## average_token_length          103.4258
## num_keywords                   27.6118
## kw_min_min                     98.6997
## kw_max_min                    339.6063
## kw_avg_min                    332.5959
## kw_min_max                     69.7909
## kw_max_max                    131.8162
## kw_avg_max                    164.7500
## kw_min_avg                     67.1318
## kw_max_avg                    297.0575
## kw_avg_avg                    376.0431
## self_reference_min_shares     216.2791
## self_reference_max_shares     114.6551
## self_reference_avg_sharess    409.3507
## is_weekend                      4.9652
## LDA_00                             Inf
## LDA_01                             Inf
## LDA_02                             Inf
## LDA_03                             Inf
## LDA_04                             Inf
## global_subjectivity            41.4698
## global_sentiment_polarity     219.8945
## global_rate_positive_words    103.5268
## global_rate_negative_words    213.7678
## rate_positive_words          1243.3815
## rate_negative_words          1172.1731
## avg_positive_polarity         162.3726
## min_positive_polarity          38.4311
## max_positive_polarity          63.0377
## avg_negative_polarity         261.6822
## min_negative_polarity         180.4358
## max_negative_polarity          90.9136
## title_subjectivity             58.6631
## title_sentiment_polarity       29.9966
## abs_title_subjectivity         17.0309
## abs_title_sentiment_polarity   78.6987
##                                     Fi Leamer
## timedelta                      58.8735 0.6156
## n_tokens_title                  4.4230 0.9436
## n_tokens_content              119.8192 0.4803
## n_unique_tokens               535.3533 0.2508
## n_non_stop_words              472.8790 0.2657
## n_non_stop_unique_tokens      392.4190 0.2896
## num_hrefs                      67.9151 0.5882
## num_self_hrefs                 33.9867 0.7169
## num_imgs                       35.1499 0.7110
## num_videos                      7.5524 0.9090
## average_token_length          105.8434 0.5034
## num_keywords                   28.2572 0.7482
## kw_min_min                    101.0068 0.5122
## kw_max_min                    347.5445 0.3061
## kw_avg_min                    340.3701 0.3090
## kw_min_max                     71.4222 0.5785
## kw_max_max                    134.8973 0.4586
## kw_avg_max                    168.6010 0.4191
## kw_min_avg                     68.7009 0.5860
## kw_max_avg                    304.0011 0.3251
## kw_avg_avg                    384.8329 0.2922
## self_reference_min_shares     221.3346 0.3737
## self_reference_max_shares     117.3351 0.4842
## self_reference_avg_sharess    418.9190 0.2811
## is_weekend                      5.0813 0.9360
## LDA_00                             Inf 0.0000
## LDA_01                             Inf 0.0000
## LDA_02                             Inf 0.0000
## LDA_03                             Inf 0.0000
## LDA_04                             Inf 0.0000
## global_subjectivity            42.4391 0.6771
## global_sentiment_polarity     225.0344 0.3711
## global_rate_positive_words    105.9467 0.5032
## global_rate_negative_words    218.7645 0.3756
## rate_positive_words          1272.4450 0.1657
## rate_negative_words          1199.5722 0.1705
## avg_positive_polarity         166.1679 0.4217
## min_positive_polarity          39.3294 0.6910
## max_positive_polarity          64.5111 0.5981
## avg_negative_polarity         267.7989 0.3439
## min_negative_polarity         184.6534 0.4036
## max_negative_polarity          93.0387 0.5278
## title_subjectivity             60.0343 0.6119
## title_sentiment_polarity       30.6977 0.7344
## abs_title_subjectivity         17.4290 0.8206
## abs_title_sentiment_polarity   80.5383 0.5554
##                                 CVIF Klein
## timedelta                     3.3363     1
## n_tokens_title                1.4201     0
## n_tokens_content              5.4811     1
## n_unique_tokens              20.1042     1
## n_non_stop_words             17.9057     1
## n_non_stop_unique_tokens     15.0742     1
## num_hrefs                     3.6545     1
## num_self_hrefs                2.4605     1
## num_imgs                      2.5015     1
## num_videos                    1.5303     0
## average_token_length          4.9892     1
## num_keywords                  2.2589     1
## kw_min_min                    4.8190     1
## kw_max_min                   13.4950     1
## kw_avg_min                   13.2425     1
## kw_min_max                    3.7779     1
## kw_max_max                    6.0117     1
## kw_avg_max                    7.1978     1
## kw_min_avg                    3.6822     1
## kw_max_avg                   11.9627     1
## kw_avg_avg                   14.8072     1
## self_reference_min_shares     9.0535     1
## self_reference_max_shares     5.3937     1
## self_reference_avg_sharess   16.0068     1
## is_weekend                    1.4433     0
## LDA_00                           Inf     1
## LDA_01                           Inf     1
## LDA_02                           Inf     1
## LDA_03                           Inf     1
## LDA_04                           Inf     1
## global_subjectivity           2.7580     1
## global_sentiment_polarity     9.1837     1
## global_rate_positive_words    4.9929     1
## global_rate_negative_words    8.9631     1
## rate_positive_words          46.0434     1
## rate_negative_words          43.4789     1
## avg_positive_polarity         7.1121     1
## min_positive_polarity         2.6485     1
## max_positive_polarity         3.5347     1
## avg_negative_polarity        10.6887     1
## min_negative_polarity         7.7627     1
## max_negative_polarity         4.5386     1
## title_subjectivity            3.3772     1
## title_sentiment_polarity      2.3448     1
## abs_title_subjectivity        1.8778     1
## abs_title_sentiment_polarity  4.0987     1
##                                IND1   IND2
## timedelta                    0.0105 0.8470
## n_tokens_title               0.0248 0.1495
## n_tokens_content             0.0064 1.0493
## n_unique_tokens              0.0018 1.2781
## n_non_stop_words             0.0020 1.2676
## n_non_stop_unique_tokens     0.0023 1.2495
## num_hrefs                    0.0096 0.8920
## num_self_hrefs               0.0143 0.6630
## num_imgs                     0.0141 0.6745
## num_videos                   0.0230 0.2369
## average_token_length         0.0071 1.0183
## num_keywords                 0.0156 0.6004
## kw_min_min                   0.0073 1.0060
## kw_max_min                   0.0026 1.2361
## kw_avg_min                   0.0027 1.2337
## kw_min_max                   0.0093 0.9074
## kw_max_max                   0.0059 1.0770
## kw_avg_max                   0.0049 1.1243
## kw_min_avg                   0.0096 0.8955
## kw_max_avg                   0.0029 1.2198
## kw_avg_avg                   0.0024 1.2475
## self_reference_min_shares    0.0039 1.1734
## self_reference_max_shares    0.0065 1.0442
## self_reference_avg_sharess   0.0022 1.2562
## is_weekend                   0.0244 0.1690
## LDA_00                       0.0000 1.3639
## LDA_01                       0.0000 1.3639
## LDA_02                       0.0000 1.3639
## LDA_03                       0.0000 1.3639
## LDA_04                       0.0000 1.3639
## global_subjectivity          0.0128 0.7386
## global_sentiment_polarity    0.0038 1.1761
## global_rate_positive_words   0.0070 1.0185
## global_rate_negative_words   0.0039 1.1715
## rate_positive_words          0.0008 1.3265
## rate_negative_words          0.0008 1.3243
## avg_positive_polarity        0.0049 1.1214
## min_positive_polarity        0.0133 0.7128
## max_positive_polarity        0.0100 0.8760
## avg_negative_polarity        0.0033 1.2026
## min_negative_polarity        0.0045 1.1418
## max_negative_polarity        0.0078 0.9839
## title_subjectivity           0.0104 0.8532
## title_sentiment_polarity     0.0150 0.6284
## abs_title_subjectivity       0.0187 0.4455
## abs_title_sentiment_polarity 0.0086 0.9431
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test
## 
## n_tokens_title , n_tokens_content , n_unique_tokens , n_non_stop_words , n_non_stop_unique_tokens , num_self_hrefs , num_imgs , average_token_length , kw_min_min , kw_max_min , kw_avg_min , kw_max_max , kw_avg_max , kw_min_avg , self_reference_min_shares , self_reference_max_shares , self_reference_avg_sharess , is_weekend , LDA_02 , LDA_03 , LDA_04 , global_subjectivity , global_sentiment_polarity , global_rate_positive_words , global_rate_negative_words , rate_positive_words , rate_negative_words , min_positive_polarity , max_positive_polarity , avg_negative_polarity , min_negative_polarity , max_negative_polarity , title_subjectivity , coefficient(s) are non-significant may be due to multicollinearity
## 
## R-square of y on all x: 0.2159 
## 
## * use method argument to check which regressors may be the reason of collinearity
## ===================================

Looking at all of the VIF values, we are going to start by removing all of the LDA Values, and the positive word rate to remove all “infinite” VIF values.

First Multicolinearity Trim

The mctest package was used to calculate the VIF values of multicolinearity.

toRemove<-c( "LDA_01", "LDA_02", "LDA_03", "LDA_04", "rate_positive_words")
trimTrain1 <- newTrain[, ! names(newTrain) %in% toRemove, drop = F]
lmNewTest3<-lmNewTest[, ! names(newTrain) %in% toRemove, drop = F]

#Repeat linear Model process
lm3<- lm(log(shares)~., trimTrain1)
yhat_lm3<-predict(lm3,lmNewTest3)
RMSE_lm3<-sqrt(mean((lmNewTest3$shares - exp(yhat_lm3))^2))

imcdiag(lm3)

## 
## Call:
## imcdiag(mod = lm3)
## 
## 
## All Individual Multicollinearity Diagnostics Result
## 
##                                  VIF    TOL
## timedelta                     2.6264 0.3807
## n_tokens_title                1.1158 0.8962
## n_tokens_content              4.3256 0.2312
## n_unique_tokens              15.4512 0.0647
## n_non_stop_words              4.8511 0.2061
## n_non_stop_unique_tokens     11.7375 0.0852
## num_hrefs                     2.7925 0.3581
## num_self_hrefs                1.9153 0.5221
## num_imgs                      1.9562 0.5112
## num_videos                    1.2023 0.8318
## average_token_length          3.8729 0.2582
## num_keywords                  1.7220 0.5807
## kw_min_min                    3.7473 0.2669
## kw_max_min                   10.6639 0.0938
## kw_avg_min                   10.4650 0.0956
## kw_min_max                    2.9635 0.3374
## kw_max_max                    4.5710 0.2188
## kw_avg_max                    5.6033 0.1785
## kw_min_avg                    2.7818 0.3595
## kw_max_avg                    8.9034 0.1123
## kw_avg_avg                   10.5816 0.0945
## self_reference_min_shares     7.1123 0.1406
## self_reference_max_shares     4.2611 0.2347
## self_reference_avg_sharess   12.6238 0.0792
## is_weekend                    1.1383 0.8785
## LDA_00                        1.1841 0.8445
## global_subjectivity           2.1502 0.4651
## global_sentiment_polarity     7.2308 0.1383
## global_rate_positive_words    3.8535 0.2595
## global_rate_negative_words    6.9925 0.1430
## rate_negative_words           8.0273 0.1246
## avg_positive_polarity         5.6172 0.1780
## min_positive_polarity         2.0855 0.4795
## max_positive_polarity         2.7770 0.3601
## avg_negative_polarity         8.4364 0.1185
## min_negative_polarity         6.1223 0.1633
## max_negative_polarity         3.5673 0.2803
## title_subjectivity            2.6609 0.3758
## title_sentiment_polarity      1.8521 0.5399
## abs_title_subjectivity        1.4725 0.6791
## abs_title_sentiment_polarity  3.2370 0.3089
##                                    Wi       Fi
## timedelta                     64.4466  66.1408
## n_tokens_title                 4.5874   4.7080
## n_tokens_content             131.7788 135.2430
## n_unique_tokens              572.6308 587.6842
## n_non_stop_words             152.6009 156.6125
## n_non_stop_unique_tokens     425.4743 436.6592
## num_hrefs                     71.0283  72.8955
## num_self_hrefs                36.2698  37.2233
## num_imgs                      37.8897  38.8857
## num_videos                     8.0143   8.2249
## average_token_length         113.8403 116.8329
## num_keywords                  28.6101  29.3622
## kw_min_min                   108.8620 111.7238
## kw_max_min                   382.9332 392.9998
## kw_avg_min                   375.0503 384.9097
## kw_min_max                    77.8026  79.8479
## kw_max_max                   141.5008 145.2206
## kw_avg_max                   182.4062 187.2013
## kw_min_avg                    70.6054  72.4615
## kw_max_avg                   313.1730 321.4057
## kw_avg_avg                   379.6695 389.6503
## self_reference_min_shares    242.1989 248.5658
## self_reference_max_shares    129.2224 132.6195
## self_reference_avg_sharess   460.5949 472.7030
## is_weekend                     5.4784   5.6224
## LDA_00                         7.2952   7.4870
## global_subjectivity           45.5768  46.7749
## global_sentiment_polarity    246.8963 253.3867
## global_rate_positive_words   113.0694 116.0418
## global_rate_negative_words   237.4533 243.6955
## rate_negative_words          278.4569 285.7770
## avg_positive_polarity        182.9554 187.7650
## min_positive_polarity         43.0124  44.1432
## max_positive_polarity         70.4131  72.2641
## avg_negative_polarity        294.6687 302.4150
## min_negative_polarity        202.9725 208.3082
## max_negative_polarity        101.7302 104.4045
## title_subjectivity            65.8119  67.5419
## title_sentiment_polarity      33.7641  34.6517
## abs_title_subjectivity        18.7233  19.2155
## abs_title_sentiment_polarity  88.6404  90.9706
##                              Leamer    CVIF
## timedelta                    0.6170  3.1221
## n_tokens_title               0.9467  1.3263
## n_tokens_content             0.4808  5.1420
## n_unique_tokens              0.2544 18.3673
## n_non_stop_words             0.4540  5.7667
## n_non_stop_unique_tokens     0.2919 13.9527
## num_hrefs                    0.5984  3.3195
## num_self_hrefs               0.7226  2.2768
## num_imgs                     0.7150  2.3254
## num_videos                   0.9120  1.4291
## average_token_length         0.5081  4.6039
## num_keywords                 0.7620  2.0470
## kw_min_min                   0.5166  4.4545
## kw_max_min                   0.3062 12.6765
## kw_avg_min                   0.3091 12.4400
## kw_min_max                   0.5809  3.5228
## kw_max_max                   0.4677  5.4337
## kw_avg_max                   0.4225  6.6608
## kw_min_avg                   0.5996  3.3068
## kw_max_avg                   0.3351 10.5837
## kw_avg_avg                   0.3074 12.5786
## self_reference_min_shares    0.3750  8.4545
## self_reference_max_shares    0.4844  5.0653
## self_reference_avg_sharess   0.2815 15.0063
## is_weekend                   0.9373  1.3531
## LDA_00                       0.9190  1.4076
## global_subjectivity          0.6820  2.5560
## global_sentiment_polarity    0.3719  8.5955
## global_rate_positive_words   0.5094  4.5807
## global_rate_negative_words   0.3782  8.3122
## rate_negative_words          0.3530  9.5423
## avg_positive_polarity        0.4219  6.6773
## min_positive_polarity        0.6925  2.4791
## max_positive_polarity        0.6001  3.3011
## avg_negative_polarity        0.3443 10.0286
## min_negative_polarity        0.4041  7.2778
## max_negative_polarity        0.5295  4.2406
## title_subjectivity           0.6130  3.1630
## title_sentiment_polarity     0.7348  2.2016
## abs_title_subjectivity       0.8241  1.7504
## abs_title_sentiment_polarity 0.5558  3.8479
##                              Klein   IND1
## timedelta                        1 0.0096
## n_tokens_title                   0 0.0226
## n_tokens_content                 1 0.0058
## n_unique_tokens                  1 0.0016
## n_non_stop_words                 1 0.0052
## n_non_stop_unique_tokens         1 0.0022
## num_hrefs                        1 0.0090
## num_self_hrefs                   1 0.0132
## num_imgs                         1 0.0129
## num_videos                       0 0.0210
## average_token_length             1 0.0065
## num_keywords                     1 0.0147
## kw_min_min                       1 0.0067
## kw_max_min                       1 0.0024
## kw_avg_min                       1 0.0024
## kw_min_max                       1 0.0085
## kw_max_max                       1 0.0055
## kw_avg_max                       1 0.0045
## kw_min_avg                       1 0.0091
## kw_max_avg                       1 0.0028
## kw_avg_avg                       1 0.0024
## self_reference_min_shares        1 0.0035
## self_reference_max_shares        1 0.0059
## self_reference_avg_sharess       1 0.0020
## is_weekend                       0 0.0222
## LDA_00                           0 0.0213
## global_subjectivity              1 0.0117
## global_sentiment_polarity        1 0.0035
## global_rate_positive_words       1 0.0065
## global_rate_negative_words       1 0.0036
## rate_negative_words              1 0.0031
## avg_positive_polarity            1 0.0045
## min_positive_polarity            1 0.0121
## max_positive_polarity            1 0.0091
## avg_negative_polarity            1 0.0030
## min_negative_polarity            1 0.0041
## max_negative_polarity            1 0.0071
## title_subjectivity               1 0.0095
## title_sentiment_polarity         1 0.0136
## abs_title_subjectivity           1 0.0171
## abs_title_sentiment_polarity     1 0.0078
##                                IND2
## timedelta                    0.9224
## n_tokens_title               0.1546
## n_tokens_content             1.1452
## n_unique_tokens              1.3932
## n_non_stop_words             1.1825
## n_non_stop_unique_tokens     1.3627
## num_hrefs                    0.9561
## num_self_hrefs               0.7119
## num_imgs                     0.7281
## num_videos                   0.2506
## average_token_length         1.1050
## num_keywords                 0.6246
## kw_min_min                   1.0921
## kw_max_min                   1.3499
## kw_avg_min                   1.3472
## kw_min_max                   0.9869
## kw_max_max                   1.1637
## kw_avg_max                   1.2237
## kw_min_avg                   0.9541
## kw_max_avg                   1.3223
## kw_avg_avg                   1.3488
## self_reference_min_shares    1.2801
## self_reference_max_shares    1.1400
## self_reference_avg_sharess   1.3716
## is_weekend                   0.1809
## LDA_00                       0.2316
## global_subjectivity          0.7968
## global_sentiment_polarity    1.2836
## global_rate_positive_words   1.1030
## global_rate_negative_words   1.2765
## rate_negative_words          1.3040
## avg_positive_polarity        1.2244
## min_positive_polarity        0.7753
## max_positive_polarity        0.9532
## avg_negative_polarity        1.3130
## min_negative_polarity        1.2463
## max_negative_polarity        1.0720
## title_subjectivity           0.9298
## title_sentiment_polarity     0.6853
## abs_title_subjectivity       0.4780
## abs_title_sentiment_polarity 1.0294
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test
## 
## n_tokens_title , n_tokens_content , n_unique_tokens , n_non_stop_words , n_non_stop_unique_tokens , num_self_hrefs , num_imgs , average_token_length , kw_min_min , kw_max_min , kw_avg_min , kw_max_max , kw_avg_max , kw_min_avg , self_reference_min_shares , self_reference_max_shares , self_reference_avg_sharess , is_weekend , global_subjectivity , global_sentiment_polarity , global_rate_positive_words , global_rate_negative_words , rate_negative_words , avg_positive_polarity , max_positive_polarity , avg_negative_polarity , min_negative_polarity , max_negative_polarity , title_subjectivity , title_sentiment_polarity , coefficient(s) are non-significant may be due to multicollinearity
## 
## R-square of y on all x: 0.2087 
## 
## * use method argument to check which regressors may be the reason of collinearity
## ===================================

This improves the model multicolinearity, but we are still left with some. We then pare down and select those models with the next highest VIF removed one at a time, until all values are below 5.

Second Mulitcolinearity Trim

toRemove<-c("self_reference_avg_sharess", "kw_avg_min", "n_unique_tokens", "rate_negative_words", "kw_avg_avg", "n_non_stop_words", "global_sentiment_polarity", "avg_negative_polarity", "kw_max_max")
trimTrain2 <- trimTrain1[, ! names(trimTrain1) %in% toRemove, drop = F]

#Repeat linear Model process
lm4<- lm(log(shares)~., trimTrain2)

imcdiag(lm4)

## 
## Call:
## imcdiag(mod = lm4)
## 
## 
## All Individual Multicollinearity Diagnostics Result
## 
##                                 VIF    TOL
## timedelta                    2.4987 0.4002
## n_tokens_title               1.1018 0.9076
## n_tokens_content             3.1074 0.3218
## n_non_stop_unique_tokens     2.3360 0.4281
## num_hrefs                    2.5410 0.3935
## num_self_hrefs               1.8691 0.5350
## num_imgs                     1.8628 0.5368
## num_videos                   1.1666 0.8572
## average_token_length         1.4686 0.6809
## num_keywords                 1.5997 0.6251
## kw_min_min                   1.8372 0.5443
## kw_max_min                   1.2628 0.7919
## kw_min_max                   2.6483 0.3776
## kw_avg_max                   4.5831 0.2182
## kw_min_avg                   1.4173 0.7056
## kw_max_avg                   1.4686 0.6809
## self_reference_min_shares    1.4277 0.7004
## self_reference_max_shares    1.7479 0.5721
## is_weekend                   1.1259 0.8882
## LDA_00                       1.1786 0.8485
## global_subjectivity          1.9334 0.5172
## global_rate_positive_words   1.3992 0.7147
## global_rate_negative_words   1.4836 0.6740
## avg_positive_polarity        3.3090 0.3022
## min_positive_polarity        1.9154 0.5221
## max_positive_polarity        2.6551 0.3766
## min_negative_polarity        2.0094 0.4977
## max_negative_polarity        1.2212 0.8188
## title_subjectivity           2.6509 0.3772
## title_sentiment_polarity     1.7995 0.5557
## abs_title_subjectivity       1.4675 0.6815
## abs_title_sentiment_polarity 3.1556 0.3169
##                                    Wi       Fi
## timedelta                     77.0635  79.6823
## n_tokens_title                 5.2366   5.4146
## n_tokens_content             108.3587 112.0409
## n_non_stop_unique_tokens      68.6988  71.0332
## num_hrefs                     79.2386  81.9312
## num_self_hrefs                44.6909  46.2096
## num_imgs                      44.3659  45.8735
## num_videos                     8.5686   8.8598
## average_token_length          24.0971  24.9159
## num_keywords                  30.8363  31.8841
## kw_min_min                    43.0505  44.5134
## kw_max_min                    13.5150  13.9743
## kw_min_max                    84.7539  87.6340
## kw_avg_max                   184.2405 190.5013
## kw_min_avg                    21.4584  22.1876
## kw_max_avg                    24.0957  24.9145
## self_reference_min_shares     21.9925  22.7399
## self_reference_max_shares     38.4578  39.7646
## is_weekend                     6.4725   6.6925
## LDA_00                         9.1844   9.4965
## global_subjectivity           47.9963  49.6273
## global_rate_positive_words    20.5278  21.2254
## global_rate_negative_words    24.8679  25.7130
## avg_positive_polarity        118.7255 122.7600
## min_positive_polarity         47.0678  48.6672
## max_positive_polarity         85.1031  87.9951
## min_negative_polarity         51.9030  53.6667
## max_negative_polarity         11.3763  11.7629
## title_subjectivity            84.8856  87.7702
## title_sentiment_polarity      41.1099  42.5069
## abs_title_subjectivity        24.0363  24.8531
## abs_title_sentiment_polarity 110.8374 114.6038
##                              Leamer   CVIF
## timedelta                    0.6326 2.7710
## n_tokens_title               0.9527 1.2219
## n_tokens_content             0.5673 3.4460
## n_non_stop_unique_tokens     0.6543 2.5906
## num_hrefs                    0.6273 2.8179
## num_self_hrefs               0.7314 2.0728
## num_imgs                     0.7327 2.0658
## num_videos                   0.9258 1.2938
## average_token_length         0.8252 1.6287
## num_keywords                 0.7906 1.7740
## kw_min_min                   0.7378 2.0375
## kw_max_min                   0.8899 1.4005
## kw_min_max                   0.6145 2.9369
## kw_avg_max                   0.4671 5.0826
## kw_min_avg                   0.8400 1.5718
## kw_max_avg                   0.8252 1.6287
## self_reference_min_shares    0.8369 1.5833
## self_reference_max_shares    0.7564 1.9384
## is_weekend                   0.9424 1.2486
## LDA_00                       0.9211 1.3071
## global_subjectivity          0.7192 2.1441
## global_rate_positive_words   0.8454 1.5517
## global_rate_negative_words   0.8210 1.6453
## avg_positive_polarity        0.5497 3.6696
## min_positive_polarity        0.7226 2.1241
## max_positive_polarity        0.6137 2.9444
## min_negative_polarity        0.7054 2.2284
## max_negative_polarity        0.9049 1.3543
## title_subjectivity           0.6142 2.9397
## title_sentiment_polarity     0.7455 1.9956
## abs_title_subjectivity       0.8255 1.6274
## abs_title_sentiment_polarity 0.5629 3.4994
##                              Klein   IND1
## timedelta                        1 0.0078
## n_tokens_title                   0 0.0177
## n_tokens_content                 1 0.0063
## n_non_stop_unique_tokens         1 0.0083
## num_hrefs                        1 0.0077
## num_self_hrefs                   1 0.0104
## num_imgs                         1 0.0104
## num_videos                       0 0.0167
## average_token_length             1 0.0132
## num_keywords                     1 0.0122
## kw_min_min                       1 0.0106
## kw_max_min                       1 0.0154
## kw_min_max                       1 0.0073
## kw_avg_max                       1 0.0042
## kw_min_avg                       1 0.0137
## kw_max_avg                       1 0.0132
## self_reference_min_shares        1 0.0136
## self_reference_max_shares        1 0.0111
## is_weekend                       0 0.0173
## LDA_00                           0 0.0165
## global_subjectivity              1 0.0101
## global_rate_positive_words       1 0.0139
## global_rate_negative_words       1 0.0131
## avg_positive_polarity            1 0.0059
## min_positive_polarity            1 0.0102
## max_positive_polarity            1 0.0073
## min_negative_polarity            1 0.0097
## max_negative_polarity            0 0.0159
## title_subjectivity               1 0.0073
## title_sentiment_polarity         1 0.0108
## abs_title_subjectivity           1 0.0133
## abs_title_sentiment_polarity     1 0.0062
##                                IND2
## timedelta                    1.4080
## n_tokens_title               0.2170
## n_tokens_content             1.5920
## n_non_stop_unique_tokens     1.3426
## num_hrefs                    1.4237
## num_self_hrefs               1.0916
## num_imgs                     1.0873
## num_videos                   0.3353
## average_token_length         0.7491
## num_keywords                 0.8800
## kw_min_min                   1.0698
## kw_max_min                   0.4886
## kw_min_max                   1.4611
## kw_avg_max                   1.8353
## kw_min_avg                   0.6912
## kw_max_avg                   0.7491
## self_reference_min_shares    0.7033
## self_reference_max_shares    1.0045
## is_weekend                   0.2625
## LDA_00                       0.3558
## global_subjectivity          1.1333
## global_rate_positive_words   0.6698
## global_rate_negative_words   0.7652
## avg_positive_polarity        1.6381
## min_positive_polarity        1.1219
## max_positive_polarity        1.4633
## min_negative_polarity        1.1792
## max_negative_polarity        0.4253
## title_subjectivity           1.4619
## title_sentiment_polarity     1.0430
## abs_title_subjectivity       0.7478
## abs_title_sentiment_polarity 1.6036
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test
## 
## n_tokens_title , n_tokens_content , num_self_hrefs , num_imgs , num_videos , average_token_length , kw_min_min , kw_max_min , self_reference_max_shares , is_weekend , global_subjectivity , global_rate_positive_words , global_rate_negative_words , avg_positive_polarity , max_positive_polarity , min_negative_polarity , max_negative_polarity , title_subjectivity , title_sentiment_polarity , coefficient(s) are non-significant may be due to multicollinearity
## 
## R-square of y on all x: 0.1875 
## 
## * use method argument to check which regressors may be the reason of collinearity
## ===================================

After removing 15 more variables for obvious multicolinearity via VIF (>5), we need to replot the correlation matrix, which shows a much lower clustering rate of high correlations.

Replot Correlation

#Remove the predictor
train_cor<-trimTrain2[1:31]
res <- cor(train_cor)
palette = colorRampPalette(c("green", "white", "red")) (20)
heatmap(x = res, col = palette, symm = TRUE, cexRow=0.5, cexCol = 0.5)

The new heatmap appears to have less prominent clustering values.

Final Model Fit Prediction

#trim the testing data
newTest1<-test[ -c(25:31) ]
toRemove<-c( "LDA_01", "LDA_02", "LDA_03", "LDA_04", "rate_positive_words", "self_reference_avg_sharess", "kw_avg_min", "n_unique_tokens", "rate_negative_words", "kw_avg_avg", "n_non_stop_words", "global_sentiment_polarity", "avg_negative_polarity", "kw_max_max")

trimTest4 <- newTest1[, ! names(newTest1) %in% toRemove, drop = F]

yhat_lm4<-predict(lm4,trimTest4)
RMSE_lm4<-sqrt(mean((trimTest4$shares - exp(yhat_lm4))^2))

Backward Regression Selection

Transform the response with log, then fit a linear regression model with the variables after backward selection.

#backward selection after log transformation
library(leaps)
backward<- regsubsets(log(shares)~., trimTrain1, nvmax = 31, method = "backward")
backward_summary<-summary(backward)

#backward_summary[["which"]][size, ]
par(mfrow=c(1,3))
plot(backward_summary$cp, xlab = "Size", ylab = "backward Cp", type = "l")
plot(backward_summary$bic, xlab = "Size", ylab = "backward bic", type = "l")
plot(backward_summary$adjr2, xlab = "Size", ylab = "backward adjR2", type = "l")

coef(backward, which.min(backward_summary$cp))

##                  (Intercept) 
##                 6.331146e+00 
##                    timedelta 
##                 4.278536e-04 
##              n_unique_tokens 
##                -1.264842e+00 
##             n_non_stop_words 
##                 8.155583e-01 
##                    num_hrefs 
##                -5.579965e-03 
##                     num_imgs 
##                -4.733042e-03 
##                   num_videos 
##                 1.039906e-02 
##                 num_keywords 
##                 5.086320e-02 
##                   kw_min_min 
##                 5.370027e-04 
##                   kw_min_max 
##                -1.277926e-06 
##                   kw_avg_max 
##                 5.885611e-07 
##                   kw_max_avg 
##                -2.782200e-05 
##                   kw_avg_avg 
##                 2.447049e-04 
##   self_reference_avg_sharess 
##                 2.556476e-06 
##                   is_weekend 
##                 9.672698e-02 
##                       LDA_00 
##                 6.579277e-01 
##   global_rate_negative_words 
##                -6.486962e+00 
##          rate_negative_words 
##                 5.708355e-01 
##        avg_positive_polarity 
##                -7.618779e-01 
##        min_positive_polarity 
##                -8.260824e-01 
##       abs_title_subjectivity 
##                 2.558424e-01 
## abs_title_sentiment_polarity 
##                 3.114417e-01

coef(backward, which.max(backward_summary$adjr2))

##                  (Intercept) 
##                 6.352630e+00 
##                    timedelta 
##                 4.255111e-04 
##              n_unique_tokens 
##                -7.571428e-01 
##             n_non_stop_words 
##                 9.800345e-01 
##     n_non_stop_unique_tokens 
##                -6.207916e-01 
##                    num_hrefs 
##                -5.375455e-03 
##                     num_imgs 
##                -5.446052e-03 
##                   num_videos 
##                 1.093032e-02 
##                 num_keywords 
##                 4.876860e-02 
##                   kw_min_min 
##                 5.018445e-04 
##                   kw_min_max 
##                -1.259228e-06 
##                   kw_avg_max 
##                 5.458451e-07 
##                   kw_max_avg 
##                -2.599411e-05 
##                   kw_avg_avg 
##                 2.436406e-04 
##    self_reference_max_shares 
##                -1.013196e-06 
##   self_reference_avg_sharess 
##                 3.787873e-06 
##                   is_weekend 
##                 9.944649e-02 
##                       LDA_00 
##                 6.548547e-01 
##   global_rate_negative_words 
##                -6.386244e+00 
##          rate_negative_words 
##                 4.691810e-01 
##        avg_positive_polarity 
##                -4.950408e-01 
##        min_positive_polarity 
##                -9.788910e-01 
##        max_positive_polarity 
##                -1.512775e-01 
##        min_negative_polarity 
##                -9.984699e-02 
##       abs_title_subjectivity 
##                 2.544754e-01 
## abs_title_sentiment_polarity 
##                 2.959709e-01

#get best subset of the specified size with min cp.
sub <- backward_summary$which[which.min(backward_summary$cp), ]

# Create test model matrix, predcition, test error
test_model <- model.matrix(log(shares)~ ., data = lmNewTest3)
model <- test_model[, sub]
yhat_back<-model %*% coef(backward, which.min(backward_summary$cp))
RMSE_back<-sqrt(mean((test$shares - exp(yhat_back))^2))

Random Forests

As previously mentioned in the regression trees section, the random forest builds an entire forest of these trees, and merges them together to get a more accurate and stable predictions than one off trees. It is usually trained using the bagging method. Unlike regression trees, which are prone to overfitting, only a random subset of the features is taken into consideration by the algorithm for splitting a node (used CV to find the perfect amount of variables to use). This builds in additional error and makes a more robust prediction.

The manual dimensional reduction was necessary to have the processing speeds to handle the random forests model.

library(randomForest)
#single bagged model
tree.train<-randomForest(shares~., data=trimTrain1, mtry=32, importance=TRUE)
tree.train

## 
## Call:
##  randomForest(formula = shares ~ ., data = trimTrain1, mtry = 32,      importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 32
## 
##           Mean of squared residuals: 18761497
##                     % Var explained: 5.65

#single bagged regression tree error prediction
tree.test<-lmNewTest3["shares"]
yhat.bag<-predict(tree.train, newdata=lmNewTest3)
yhat.bag<-as.data.frame(yhat.bag)
yhat_bag<-mean((yhat.bag$yhat.bag-tree.test$shares)^2)
RMSE_bag<-sqrt(yhat_bag)

#random forests model
tree.trainRF<-randomForest(shares~., data=trimTrain1, mtry=12, importance=TRUE)
tree.trainRF

## 
## Call:
##  randomForest(formula = shares ~ ., data = trimTrain1, mtry = 12,      importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 12
## 
##           Mean of squared residuals: 18168449
##                     % Var explained: 8.63

#random forest error prediction
yhat.rf<-predict(tree.trainRF, newdata = lmNewTest3)
yhat.rf<-as.data.frame(yhat.rf)
yhat_rf<-mean((yhat.rf$yhat.rf-tree.test$shares)^2)
RMSE_rfTrimmed<-sqrt(yhat_rf)

varImpPlot(tree.trainRF)

Boosted Tree

Boosting is a general approach that can be applied to many statistical learning methods for regression or classification. The trees in boosting are grown sequentially : each tree is grown using information from previously grown trees. Boosting does not involve bootstrap sampling; instead each tree is fit on a modified version of the original data set.

Procedure (for regression trees):
1.Initialize predictions as 0,
2.Find the residuals (observed-predicted), call the set of them
3.Fit a tree with splits (d+1 terminal nodes) treating the residuals as the response (which they are for the first fit)
4.Update predictions
5.Update residuals for new predictions and repeat B times

Tune parameters must be chosen shrinkage, B and d in the boosting tree model.

cvcontrol <- trainControl(method="repeatedcv", number = 10,
                          allowParallel=TRUE)
grid <- expand.grid(n.trees = c(1000,1500), 
                    interaction.depth=c(1:3), 
                    shrinkage=c(0.01,0.05,0.1), 
                    n.minobsinnode=c(20))
capture<-capture.output(train.gbm <- train(log(shares) ~ ., 
                   data=train,
                   method="gbm",
                   trControl=cvcontrol,
                   tuneGrid = grid))
train.gbm

## Stochastic Gradient Boosting 
## 
## 1626 samples
##   53 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 1464, 1465, 1465, 1463, 1462, 1463, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.trees
##   0.01       1                  1000   
##   0.01       1                  1500   
##   0.01       2                  1000   
##   0.01       2                  1500   
##   0.01       3                  1000   
##   0.01       3                  1500   
##   0.05       1                  1000   
##   0.05       1                  1500   
##   0.05       2                  1000   
##   0.05       2                  1500   
##   0.05       3                  1000   
##   0.05       3                  1500   
##   0.10       1                  1000   
##   0.10       1                  1500   
##   0.10       2                  1000   
##   0.10       2                  1500   
##   0.10       3                  1000   
##   0.10       3                  1500   
##   RMSE       Rsquared   MAE      
##   0.7560010  0.2172366  0.5692852
##   0.7539938  0.2208591  0.5682640
##   0.7506120  0.2272458  0.5653875
##   0.7509714  0.2278141  0.5658774
##   0.7493491  0.2302280  0.5653395
##   0.7502291  0.2293532  0.5663041
##   0.7568733  0.2218942  0.5700131
##   0.7618649  0.2156406  0.5740581
##   0.7618550  0.2168585  0.5749532
##   0.7697712  0.2091459  0.5829126
##   0.7700809  0.2078451  0.5847125
##   0.7809774  0.1956284  0.5942251
##   0.7720252  0.2042129  0.5836454
##   0.7823973  0.1946235  0.5913195
##   0.7832403  0.1949746  0.5907059
##   0.8023347  0.1773247  0.6089851
##   0.7994487  0.1809024  0.6105605
##   0.8137505  0.1703397  0.6233481
## 
## Tuning parameter 'n.minobsinnode' was
##  held constant at a value of 20
## RMSE was used to select the optimal
##  model using the smallest value.
## The final values used for the model
##  were n.trees = 1000, interaction.depth =
##  3, shrinkage = 0.01 and n.minobsinnode = 20.

boostPred <- predict(train.gbm, newdata = test)
RMSE_boost <- sqrt(mean((test$shares - exp(boostPred))^2))

Comparison

Generally, the model with the lowest RMSE is the best on comparison.

comparison<-data.frame(RMSE_lm, RMSE_lm2, RMSE_lm3, RMSE_lm4, RMSE_back,  RMSE_bag, RMSE_rfTrimmed, RMSE_boost, RMSE_regTree)

comparison

##    RMSE_lm RMSE_lm2 RMSE_lm3 RMSE_lm4
## 1 9711.245 9375.817 9081.466 7619.852
##   RMSE_back RMSE_bag RMSE_rfTrimmed RMSE_boost
## 1  8021.181 7332.038       7296.658   7340.778
##   RMSE_regTree
## 1     7600.915

which.min(comparison)

## RMSE_rfTrimmed 
##              7

The overall prediction error rate for this data set is very high. This is likely due to the high values of outlier articles with freakishly high shares, that are timely AND viral. These values were NOT removed from analysis, as these are the share metrics that a company would likely want to evaluate for emulation.

ST558Project2