
ST 558 Project 2

library(rmarkdown)
library(usethis)
use_git_config(user.name="Mandy Liesch", user.email="amliesch@ncsu.edu")

Introduction

The Online News Popularity data set summarizes a heterogeneous set of features about articles published by Mashable over a period of two years. The goal is to predict the number of shares in social networks (popularity). We first show summary statistics and plots of the data grouped by weekday. We then fit several models to predict the response, shares, for each data channel. Model performance is evaluated by RMSE on the test set, and the model with the lowest RMSE is selected as the winner. The modeling methods include:

Regression Tree
Log Transformed Full Linear Regression Model
Linear Regression Model Without Day of the Week
Subset Linear Regression Model #1
Subset Linear Regression Model #2
Backward Selection Linear Regression
Bagged Regression Tree
Random Forest Model
Boosted Tree Model

Data preparation

Subset Data by Channel

library(tidyverse)

data_whole<-read_csv("OnlineNewsPopularity/OnlineNewsPopularity.csv")

#create a new variable, channel, to help with the subsetting.
data_whole$channel <- names(data_whole[14:19])[apply(data_whole[14:19],1, match, x = 1)]
data_whole$channel <-sub("data_channel_is_", "", data_whole$channel)

#Subset the data to work on the data channel of interest
#channel_interest = params[[1]]$team

#Get the important data
data_interest<-data_whole%>%
  filter(channel==x[[2]]$team)%>%
  select(-c(1,14:19,62))
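
The channel variable above is built by finding, for each row, which indicator column holds a 1. The sketch below shows the same idea on a small made-up data frame; the `toy` data and its three column names are illustrative assumptions, not the real column layout.

```r
# Toy data (hypothetical): exactly one indicator column equals 1 in each row
toy <- data.frame(
  data_channel_is_tech  = c(1, 0, 0),
  data_channel_is_world = c(0, 1, 0),
  data_channel_is_bus   = c(0, 0, 1)
)

# For each row, match(x = 1, table = row) returns the position of the first 1;
# indexing names() by that position gives the matching column name
idx <- apply(toy, 1, match, x = 1)
toy$channel <- sub("data_channel_is_", "", names(toy)[idx])

print(toy$channel)
```

A row with no 1 in any indicator column would get `NA`, because `match()` returns `NA` when the value is not found.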

Establish Training Data

Split the data into a training set (70% of the data) and a test set (30% of the data).

library(caret)
library(rsample)
set.seed(14)
index <- initial_split(data_interest,
                       prop = 0.7)
train <- training(index)
test <- testing(index)
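
To make the idea of a 70/30 split concrete, here is a minimal base R sketch on simulated data. It is a stand-in for illustration only and is not how `rsample` implements `initial_split()`; the `toy` data frame and its columns are assumptions.

```r
set.seed(14)

# Simulated data (hypothetical): 100 rows with an id and a numeric response
toy <- data.frame(id = 1:100, y = rnorm(100))

# Draw 70% of the row indices at random for training
n_train    <- round(0.7 * nrow(toy))
train_rows <- sample(seq_len(nrow(toy)), size = n_train)

toy_train <- toy[train_rows, ]
toy_test  <- toy[-train_rows, ]   # everything not sampled goes to the test set

c(train = nrow(toy_train), test = nrow(toy_test))
```

Every row lands in exactly one of the two sets, so the test set never overlaps with the data used for fitting.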

Data Summaries

Correlation Plots

This plot shows the pairwise correlations among the predictor variables.

library(corrplot)
#drop the day-of-week indicator columns, which are not needed here
newTrain<-train[ -c(25:31) ]
lmNewTest<-test[ -c(25:31) ]
#drop the response variable (shares)
predictTrain<-newTrain[ -c(47) ]
#Calculate the correlation matrix
res <- cor(predictTrain)

#Plot the correlation matrix values by cluster
corrplot(res, type = "upper", order = "hclust",
         tl.col = "black", tl.cex = 0.5)

From this plot, it appears that we likely have some clusters of collinear variables.
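
A correlation matrix can also be scanned numerically for strongly correlated pairs. The sketch below does this on simulated data; the variables `x1`, `x2`, `x3` and the 0.8 cutoff are assumptions chosen for illustration, not values taken from this analysis.

```r
set.seed(1)

# Simulated predictors (hypothetical): x2 is nearly a copy of x1, x3 is unrelated
x1  <- rnorm(200)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(200, sd = 0.1),
  x3 = rnorm(200)
)

res_toy <- cor(toy)

# Blank out the diagonal and lower triangle so each pair is listed once
res_toy[lower.tri(res_toy, diag = TRUE)] <- NA

# Positions where the absolute correlation exceeds the (assumed) 0.8 cutoff
high <- which(abs(res_toy) > 0.8, arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(res_toy)[high[, "row"]],
  var2 = colnames(res_toy)[high[, "col"]],
  r    = res_toy[high]
)
print(high_pairs)
```

Only the `x1`/`x2` pair should be flagged here, since `x3` is generated independently of the other two.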

Table Summary

We summarize the train data of interest in tables grouped by weekdays, showing the pattern of shares in a week.

#create a new variable, weekday, to help with the creating plots.
train$weekday <- names(train[25:31])[apply(train[25:31],1, match, x = 1)]
train$weekday <-sub("weekday_is_", "", train$weekday)

#summarize the train data by weekday.
summary<-train%>%group_by(weekday)%>%
  summarise(Avg=round(mean(shares),0),Sd=round(sd(shares),0),Median=median(shares),IQR=round(IQR(shares),0))
knitr::kable(summary)

weekday	Avg	Sd	Median	IQR
friday	2332	4740	1400	1341
monday	3592	24700	1400	1554
saturday	3743	4677	2600	2025
sunday	3574	5567	2100	2300
thursday	3180	15285	1400	1373
tuesday	3041	12209	1400	1372
wednesday	2768	8815	1300	1311

We summarize the train data of interest in the plots below. The histogram of shares shows that it is not normally distributed. After log transformation, the distribution of log(shares) is closer to a normal distribution.

#histogram of shares and log(shares).
hist(train$shares)

hist(log(train$shares))
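
The effect of the log transformation on skewness can be checked numerically. The sketch below uses simulated right-skewed data rather than the article data; the lognormal parameters and the `skew()` helper are assumptions made for illustration.

```r
set.seed(42)

# Simulated right-skewed counts (hypothetical), standing in for shares
shares_sim <- rlnorm(5000, meanlog = 7, sdlog = 1)

# Simple moment-based sample skewness; roughly 0 for symmetric data
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

c(raw = skew(shares_sim), logged = skew(log(shares_sim)))
```

The raw values should show a large positive skewness, while the logged values should be close to zero.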

Data Plots

Box Plots

We use box plots to show the difference in shares and num_imgs between weekdays and weekends. If the boxes for weekends are higher than the ones for weekdays, then articles are shared more often during weekends.

g1<-ggplot(train, aes(x=factor(is_weekend,labels=c("No", "Yes")),y=shares))
g1+geom_boxplot(fill="white", width=0.5,lwd=1.5,color='black',outlier.shape = NA)+
   scale_y_continuous(limits = quantile(train$shares, c(0.1, 0.9)))+
   labs(subtitle = "Shares on weekend",x="On weekend or not")

g2<-ggplot(train, aes(x=factor(is_weekend,labels=c("No", "Yes")),y=num_imgs))
g2+geom_boxplot(fill="white", width=0.5,lwd=1.5,color='black',outlier.shape = NA)+
   scale_y_continuous(limits = quantile(train$num_imgs, c(0, 0.95)))+
   labs(subtitle = "number of images on weekend",x="On weekend or not")

Linear Model

We can inspect the trend of shares as a function of num_imgs. If the points show an upward trend, then articles with more images tend to be shared more often. If we see a negative trend, then articles with more images tend to be shared less often. We can also observe the difference after the log transformation.

g3<-ggplot(train,aes(x=num_imgs,y=shares))
g3+geom_point()+
  labs(subtitle = "num_imgs vs shares")+
  scale_y_continuous(limits = quantile(train$shares, c(0, 0.9)))+
  scale_x_continuous(limits = quantile(train$num_imgs, c(0, 0.9)))+
  geom_smooth(method="lm")

g4<-ggplot(train,aes(x=num_imgs,y=log(shares)))
g4+geom_point()+
  labs(subtitle = "num_imgs vs log(shares)")+
  scale_y_continuous(limits = quantile(log(train$shares), c(0, 0.9)))+
  scale_x_continuous(limits = quantile(train$num_imgs, c(0, 0.9)))+
  geom_smooth(method="lm")

#remove weekday from data set
train<-train%>%select(-weekday)

Models

Regression Tree

Decision trees are machine learning algorithms with several benefits, including ease of use and little need for pre-processing. The data generally do not require normalization or scaling. The results are usually easy to explain, and stakeholders can usually understand them. A regression tree is a decision tree that predicts a continuous response: it repeatedly splits the data on predictor values and predicts the mean response of the training observations in each terminal node. Despite these benefits, a single tree has high variance: small changes in the training data can produce a very different tree, so it often does not transfer well to other datasets.

library(tree)
tree.news<-tree(shares~., data=train)
summary(tree.news)

## 
## Regression tree:
## tree(formula = shares ~ ., data = train)
## Variables actually used in tree construction:
## [1] "kw_avg_min" "LDA_03"    
## Number of terminal nodes:  3 
## Residual mean deviance:  185900000 = 8.135e+11 / 4377 
## Distribution of residuals:
##      Min.   1st Qu.    Median 
## -145600.0   -1955.0   -1509.0 
##      Mean   3rd Qu.      Max. 
##       0.0    -409.4  543100.0

plot(tree.news)
text(tree.news, pretty=0)

yhat.regTree<- predict(tree.news, newdata = test)
yhat.test<-test["shares"]
yhat.regTree<-as.data.frame(yhat.regTree)
meanRegTree<-mean((yhat.regTree$yhat.regTree-yhat.test$shares)^2)

RMSE_regTree<-sqrt(meanRegTree)

These results can vary widely depending on the datasets.
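
Since every model in this report is compared by RMSE, the calculation can be wrapped in a small helper. This is a minimal sketch; the function name `rmse` and the example numbers are hypothetical and not part of the original analysis.

```r
# Root mean squared error between observed and predicted values
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Made-up values: the errors are -10, 10 and -30
actual    <- c(100, 200, 300)
predicted <- c(110, 190, 330)

rmse(actual, predicted)
```

Because the errors are squared before averaging, a single large miss (such as the 30 here) dominates the result, which matters for a response with extreme outliers like shares.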

Linear Models

Linear models are valuable, versatile tools that can be applied to many situations. Multiple regression examines the relationship between several independent variables and one dependent variable (in this case, total shares). Regression models let users determine the relative influence of one or more predictor variables on the response, and they also help identify outliers or anomalies. The main disadvantages have to do with the quality of the input data: incomplete input may lead to wrong conclusions. Linear regression also assumes that the observations are independent, which is not always the case.

There are several different types of linear models. In this project, we fit several multiple regression models to the log-transformed response: one using the full set of predictors, and several subsets with variables removed at different points to reduce multicollinearity.

There are also several approaches to variable selection, including forward, backward, and stepwise selection, in which user-defined criteria determine when variables enter or leave the model. Backward selection starts with the full model and removes the least significant variable one at a time until the user-defined stopping criterion is met. Forward selection does the opposite, starting from an empty model and adding variables; it is not used here.
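
As a small illustration of backward selection, the sketch below uses base R's `step()` on the built-in `mtcars` data. Note the assumptions: `step()` uses AIC as its removal criterion, and both the data set and the criterion may differ from the backward selection procedure applied elsewhere in this report.

```r
# Start from the full model with every available predictor
full <- lm(mpg ~ ., data = mtcars)

# Drop one term at a time for as long as doing so lowers the AIC
reduced <- step(full, direction = "backward", trace = 0)

# Compare the number of coefficients before and after selection
c(full = length(coef(full)), reduced = length(coef(reduced)))

# The formula of the selected model
formula(reduced)
```

Setting `trace = 0` suppresses the step-by-step log; leaving it at the default prints each candidate removal and its AIC.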

Linear Regression After Log Transformation

Log-transform the response, then fit a linear regression model with all the variables. Then calculate the RMSE of the model on the test set, after exponentiating the predictions to return them to the original scale.

lm<- lm(log(shares)~.,train)
summary(lm)

## 
## Call:
## lm(formula = log(shares) ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median 
## -8.1925 -0.4663 -0.1125 
##      3Q     Max 
##  0.3627  5.3852 
## 
## Coefficients: (3 not defined because of singularities)
##                                Estimate
## (Intercept)                   6.086e+00
## timedelta                     2.534e-05
## n_tokens_title               -8.081e-04
## n_tokens_content              1.728e-04
## n_unique_tokens               1.211e-01
## n_non_stop_words             -1.626e-01
## n_non_stop_unique_tokens      3.378e-01
## num_hrefs                     7.512e-03
## num_self_hrefs               -4.511e-03
## num_imgs                      9.702e-03
## num_videos                    7.629e-04
## average_token_length         -1.686e-01
## num_keywords                  3.940e-02
## kw_min_min                    1.562e-03
## kw_max_min                    7.168e-05
## kw_avg_min                   -3.336e-04
## kw_min_max                   -3.744e-07
## kw_max_max                    1.528e-07
## kw_avg_max                    1.366e-07
## kw_min_avg                    4.826e-06
## kw_max_avg                   -3.881e-05
## kw_avg_avg                    2.919e-04
## self_reference_min_shares    -2.407e-06
## self_reference_max_shares    -1.806e-06
## self_reference_avg_sharess    5.675e-06
## weekday_is_monday            -2.708e-01
## weekday_is_tuesday           -2.809e-01
## weekday_is_wednesday         -3.148e-01
## weekday_is_thursday          -2.889e-01
## weekday_is_friday            -2.829e-01
## weekday_is_saturday           3.834e-02
## weekday_is_sunday                    NA
## is_weekend                           NA
## LDA_00                        3.079e-01
## LDA_01                        1.001e-01
## LDA_02                        5.375e-02
## LDA_03                        1.504e-01
## LDA_04                               NA
## global_subjectivity           3.425e-01
## global_sentiment_polarity     3.112e-01
## global_rate_positive_words    1.351e-01
## global_rate_negative_words    6.531e+00
## rate_positive_words           5.995e-01
## rate_negative_words           1.945e-01
## avg_positive_polarity        -7.637e-03
## min_positive_polarity        -4.165e-01
## max_positive_polarity        -2.440e-01
## avg_negative_polarity        -2.494e-01
## min_negative_polarity        -7.268e-02
## max_negative_polarity         4.467e-01
## title_subjectivity            6.242e-02
## title_sentiment_polarity      8.541e-02
## abs_title_subjectivity        3.176e-01
## abs_title_sentiment_polarity  8.440e-02
##                              Std. Error
## (Intercept)                   2.588e-01
## timedelta                     8.295e-05
## n_tokens_title                5.798e-03
## n_tokens_content              5.378e-05
## n_unique_tokens               4.273e-01
## n_non_stop_words              8.421e-01
## n_non_stop_unique_tokens      3.701e-01
## num_hrefs                     1.922e-03
## num_self_hrefs                4.830e-03
## num_imgs                      3.781e-03
## num_videos                    3.646e-03
## average_token_length          4.965e-02
## num_keywords                  8.054e-03
## kw_min_min                    3.306e-04
## kw_max_min                    2.495e-05
## kw_avg_min                    1.273e-04
## kw_min_max                    1.706e-07
## kw_max_max                    1.178e-07
## kw_avg_max                    1.823e-07
## kw_min_avg                    1.539e-05
## kw_max_avg                    4.232e-06
## kw_avg_avg                    2.569e-05
## self_reference_min_shares     1.331e-06
## self_reference_max_shares     8.325e-07
## self_reference_avg_sharess    1.990e-06
## weekday_is_monday             5.810e-02
## weekday_is_tuesday            5.832e-02
## weekday_is_wednesday          5.785e-02
## weekday_is_thursday           5.777e-02
## weekday_is_friday             6.099e-02
## weekday_is_saturday           7.935e-02
## weekday_is_sunday                    NA
## is_weekend                           NA
## LDA_00                        7.984e-02
## LDA_01                        1.316e-01
## LDA_02                        1.293e-01
## LDA_03                        1.467e-01
## LDA_04                               NA
## global_subjectivity           1.808e-01
## global_sentiment_polarity     3.852e-01
## global_rate_positive_words    1.508e+00
## global_rate_negative_words    3.579e+00
## rate_positive_words           7.851e-01
## rate_negative_words           8.059e-01
## avg_positive_polarity         2.912e-01
## min_positive_polarity         2.444e-01
## max_positive_polarity         8.732e-02
## avg_negative_polarity         2.847e-01
## min_negative_polarity         1.017e-01
## max_negative_polarity         2.366e-01
## title_subjectivity            6.478e-02
## title_sentiment_polarity      5.970e-02
## abs_title_subjectivity        8.157e-02
## abs_title_sentiment_polarity  9.171e-02
##                              t value
## (Intercept)                   23.514
## timedelta                      0.305
## n_tokens_title                -0.139
## n_tokens_content               3.213
## n_unique_tokens                0.283
## n_non_stop_words              -0.193
## n_non_stop_unique_tokens       0.913
## num_hrefs                      3.908
## num_self_hrefs                -0.934
## num_imgs                       2.566
## num_videos                     0.209
## average_token_length          -3.397
## num_keywords                   4.892
## kw_min_min                     4.725
## kw_max_min                     2.873
## kw_avg_min                    -2.621
## kw_min_max                    -2.194
## kw_max_max                     1.297
## kw_avg_max                     0.749
## kw_min_avg                     0.313
## kw_max_avg                    -9.171
## kw_avg_avg                    11.363
## self_reference_min_shares     -1.808
## self_reference_max_shares     -2.169
## self_reference_avg_sharess     2.852
## weekday_is_monday             -4.660
## weekday_is_tuesday            -4.817
## weekday_is_wednesday          -5.441
## weekday_is_thursday           -5.001
## weekday_is_friday             -4.638
## weekday_is_saturday            0.483
## weekday_is_sunday                 NA
## is_weekend                        NA
## LDA_00                         3.857
## LDA_01                         0.761
## LDA_02                         0.416
## LDA_03                         1.025
## LDA_04                            NA
## global_subjectivity            1.894
## global_sentiment_polarity      0.808
## global_rate_positive_words     0.090
## global_rate_negative_words     1.825
## rate_positive_words            0.764
## rate_negative_words            0.241
## avg_positive_polarity         -0.026
## min_positive_polarity         -1.704
## max_positive_polarity         -2.795
## avg_negative_polarity         -0.876
## min_negative_polarity         -0.715
## max_negative_polarity          1.888
## title_subjectivity             0.964
## title_sentiment_polarity       1.431
## abs_title_subjectivity         3.893
## abs_title_sentiment_polarity   0.920
##                              Pr(>|t|)
## (Intercept)                   < 2e-16
## timedelta                    0.760047
## n_tokens_title               0.889155
## n_tokens_content             0.001322
## n_unique_tokens              0.776901
## n_non_stop_words             0.846879
## n_non_stop_unique_tokens     0.361476
## num_hrefs                    9.46e-05
## num_self_hrefs               0.350381
## num_imgs                     0.010327
## num_videos                   0.834281
## average_token_length         0.000688
## num_keywords                 1.04e-06
## kw_min_min                   2.37e-06
## kw_max_min                   0.004081
## kw_avg_min                   0.008794
## kw_min_max                   0.028283
## kw_max_max                   0.194744
## kw_avg_max                   0.453641
## kw_min_avg                   0.753943
## kw_max_avg                    < 2e-16
## kw_avg_avg                    < 2e-16
## self_reference_min_shares    0.070687
## self_reference_max_shares    0.030116
## self_reference_avg_sharess   0.004360
## weekday_is_monday            3.25e-06
## weekday_is_tuesday           1.51e-06
## weekday_is_wednesday         5.60e-08
## weekday_is_thursday          5.93e-07
## weekday_is_friday            3.62e-06
## weekday_is_saturday          0.628971
## weekday_is_sunday                  NA
## is_weekend                         NA
## LDA_00                       0.000116
## LDA_01                       0.446822
## LDA_02                       0.677554
## LDA_03                       0.305221
## LDA_04                             NA
## global_subjectivity          0.058339
## global_sentiment_polarity    0.419230
## global_rate_positive_words   0.928649
## global_rate_negative_words   0.068142
## rate_positive_words          0.445141
## rate_negative_words          0.809275
## avg_positive_polarity        0.979078
## min_positive_polarity        0.088455
## max_positive_polarity        0.005219
## avg_negative_polarity        0.381013
## min_negative_polarity        0.474818
## max_negative_polarity        0.059120
## title_subjectivity           0.335313
## title_sentiment_polarity     0.152607
## abs_title_subjectivity       0.000100
## abs_title_sentiment_polarity 0.357502
##                                 
## (Intercept)                  ***
## timedelta                       
## n_tokens_title                  
## n_tokens_content             ** 
## n_unique_tokens                 
## n_non_stop_words                
## n_non_stop_unique_tokens        
## num_hrefs                    ***
## num_self_hrefs                  
## num_imgs                     *  
## num_videos                      
## average_token_length         ***
## num_keywords                 ***
## kw_min_min                   ***
## kw_max_min                   ** 
## kw_avg_min                   ** 
## kw_min_max                   *  
## kw_max_max                      
## kw_avg_max                      
## kw_min_avg                      
## kw_max_avg                   ***
## kw_avg_avg                   ***
## self_reference_min_shares    .  
## self_reference_max_shares    *  
## self_reference_avg_sharess   ** 
## weekday_is_monday            ***
## weekday_is_tuesday           ***
## weekday_is_wednesday         ***
## weekday_is_thursday          ***
## weekday_is_friday            ***
## weekday_is_saturday             
## weekday_is_sunday               
## is_weekend                      
## LDA_00                       ***
## LDA_01                          
## LDA_02                          
## LDA_03                          
## LDA_04                          
## global_subjectivity          .  
## global_sentiment_polarity       
## global_rate_positive_words      
## global_rate_negative_words   .  
## rate_positive_words             
## rate_negative_words             
## avg_positive_polarity           
## min_positive_polarity        .  
## max_positive_polarity        ** 
## avg_negative_polarity           
## min_negative_polarity           
## max_negative_polarity        .  
## title_subjectivity              
## title_sentiment_polarity        
## abs_title_subjectivity       ***
## abs_title_sentiment_polarity    
## ---
## Signif. codes:  
##   0 '***' 0.001 '**' 0.01
##   '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7729 on 4329 degrees of freedom
## Multiple R-squared:  0.1705, Adjusted R-squared:  0.1609 
## F-statistic:  17.8 on 50 and 4329 DF,  p-value: < 2.2e-16

yhat_lm<-predict(lm,test)
RMSE_lm<-sqrt(mean((test$shares - exp(yhat_lm))^2))
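
The note "3 not defined because of singularities" in the summary above arises when a predictor is an exact linear combination of others, for example a full set of indicator columns that always sums to 1, or an indicator that is the sum of two others. The sketch below reproduces that behavior on made-up data; the three-day `toy` data frame is an assumption for illustration.

```r
set.seed(7)

# Hypothetical data with three possible days, 20 rows each
day <- rep(c("sat", "sun", "mon"), each = 20)
toy <- data.frame(
  is_sat = as.numeric(day == "sat"),
  is_sun = as.numeric(day == "sun"),
  is_mon = as.numeric(day == "mon")
)

# is_weekend is exactly is_sat + is_sun, so it carries no new information
toy$is_weekend <- toy$is_sat + toy$is_sun
toy$y <- 1 + 0.5 * toy$is_weekend + rnorm(60)

# is_sat + is_sun + is_mon equals the intercept column, so is_mon is also redundant
fit <- lm(y ~ is_sat + is_sun + is_mon + is_weekend, data = toy)
coef(fit)
```

`lm()` reports `NA` for the redundant terms rather than failing, which is why the fit above still produces predictions.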

Plot the lm Residuals

par(mfrow=c(2,2))
plot(lm)

Looking at the residual plots, the tails deviate in both directions, indicating that the data, even after transformation, still contain extreme outliers in both directions.

Model Removing the Day Variable

#refit without the day-of-week columns, then check for multicollinearity
lmNewTest<-test[ -c(25:31) ]
lm2<- lm(log(shares)~.,newTrain)
yhat_lm2<-predict(lm2,lmNewTest)
RMSE_lm2<-sqrt(mean((lmNewTest$shares - exp(yhat_lm2))^2))

library(mctest)
omcdiag(lm2)

## 
## Call:
## omcdiag(mod = lm2)
## 
## 
## Overall Multicollinearity Diagnostics
## 
##                          MC Results
## Determinant |X'X|:     0.000000e+00
## Farrar Chi-Square:     2.935100e+05
## Red Indicator:         1.638000e-01
## Sum of Lambda Inverse: 9.914284e+14
## Theil's Method:        2.453890e+01
## Condition Number:      3.746208e+07
##                        detection
## Determinant |X'X|:             1
## Farrar Chi-Square:             1
## Red Indicator:                 0
## Sum of Lambda Inverse:         1
## Theil's Method:                1
## Condition Number:              1
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test
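
One of the per-variable diagnostics commonly used alongside these overall measures is the variance inflation factor (VIF). For predictor j it equals 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the others. The sketch below computes it by hand on the built-in `mtcars` data; the helper name `vif_manual` and the choice of columns are assumptions, and this is not the `mctest` implementation.

```r
# VIF for one variable: regress it on the remaining predictors, then 1 / (1 - R^2)
vif_manual <- function(data, var) {
  others <- setdiff(names(data), var)
  fit    <- lm(reformulate(others, response = var), data = data)
  1 / (1 - summary(fit)$r.squared)
}

# A few numeric predictors from mtcars (chosen only for illustration)
predictors <- mtcars[, c("disp", "hp", "wt", "qsec")]

vifs <- sapply(names(predictors), vif_manual, data = predictors)
round(vifs, 2)
```

A VIF of 1 means the predictor is uncorrelated with the others. When a predictor is an exact linear combination of the others, R_j^2 is 1 and the VIF is infinite.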

imcdiag(lm2)

## 
## Call:
## imcdiag(mod = lm2)
## 
## 
## All Individual Multicollinearity Diagnostics Result
## 
##                                  VIF
## timedelta                     2.1757
## n_tokens_title                1.1435
## n_tokens_content              3.9679
## n_unique_tokens              13.8126
## n_non_stop_words             21.2526
## n_non_stop_unique_tokens      9.3499
## num_hrefs                     1.9298
## num_self_hrefs                1.2541
## num_imgs                      1.2475
## num_videos                    1.2117
## average_token_length          2.9043
## num_keywords                  1.8275
## kw_min_min                    4.2388
## kw_max_min                   13.8143
## kw_avg_min                   15.0843
## kw_min_max                    1.4571
## kw_max_max                    5.0276
## kw_avg_max                    5.2986
## kw_min_avg                    2.0548
## kw_max_avg                    7.7935
## kw_avg_avg                   10.2498
## self_reference_min_shares     9.1779
## self_reference_max_shares    13.0450
## self_reference_avg_sharess   29.5144
## is_weekend                    1.0818
## LDA_00                           Inf
## LDA_01                           Inf
## LDA_02                           Inf
## LDA_03                           Inf
## LDA_04                           Inf
## global_subjectivity           1.6952
## global_sentiment_polarity     7.2633
## global_rate_positive_words    4.4712
## global_rate_negative_words    6.9114
## rate_positive_words          97.2666
## rate_negative_words          92.5996
## avg_positive_polarity         4.1145
## min_positive_polarity         1.9264
## max_positive_polarity         2.6586
## avg_negative_polarity         7.5702
## min_negative_polarity         5.7358
## max_negative_polarity         3.0602
## title_subjectivity            2.7112
## title_sentiment_polarity      1.5039
## abs_title_subjectivity        1.7727
## abs_title_sentiment_polarity  2.7329
##                                 TOL
## timedelta                    0.4596
## n_tokens_title               0.8745
## n_tokens_content             0.2520
## n_unique_tokens              0.0724
## n_non_stop_words             0.0471
## n_non_stop_unique_tokens     0.1070
## num_hrefs                    0.5182
## num_self_hrefs               0.7974
## num_imgs                     0.8016
## num_videos                   0.8253
## average_token_length         0.3443
## num_keywords                 0.5472
## kw_min_min                   0.2359
## kw_max_min                   0.0724
## kw_avg_min                   0.0663
## kw_min_max                   0.6863
## kw_max_max                   0.1989
## kw_avg_max                   0.1887
## kw_min_avg                   0.4867
## kw_max_avg                   0.1283
## kw_avg_avg                   0.0976
## self_reference_min_shares    0.1090
## self_reference_max_shares    0.0767
## self_reference_avg_sharess   0.0339
## is_weekend                   0.9244
## LDA_00                       0.0000
## LDA_01                       0.0000
## LDA_02                       0.0000
## LDA_03                       0.0000
## LDA_04                       0.0000
## global_subjectivity          0.5899
## global_sentiment_polarity    0.1377
## global_rate_positive_words   0.2237
## global_rate_negative_words   0.1447
## rate_positive_words          0.0103
## rate_negative_words          0.0108
## avg_positive_polarity        0.2430
## min_positive_polarity        0.5191
## max_positive_polarity        0.3761
## avg_negative_polarity        0.1321
## min_negative_polarity        0.1743
## max_negative_polarity        0.3268
## title_subjectivity           0.3688
## title_sentiment_polarity     0.6649
## abs_title_subjectivity       0.5641
## abs_title_sentiment_polarity 0.3659
##                                     Wi
## timedelta                     113.2373
## n_tokens_title                 13.8182
## n_tokens_content              285.8444
## n_unique_tokens              1233.9913
## n_non_stop_words             1950.5481
## n_non_stop_unique_tokens      804.1921
## num_hrefs                      89.5479
## num_self_hrefs                 24.4746
## num_imgs                       23.8383
## num_videos                     20.3892
## average_token_length          183.4089
## num_keywords                   79.6961
## kw_min_min                    311.9330
## kw_max_min                   1234.1587
## kw_avg_min                   1356.4785
## kw_min_max                     44.0259
## kw_max_max                    387.9048
## kw_avg_max                    414.0039
## kw_min_avg                    101.5854
## kw_max_avg                    654.2867
## kw_avg_avg                    890.8597
## self_reference_min_shares     787.6217
## self_reference_max_shares    1160.0713
## self_reference_avg_sharess   2746.2573
## is_weekend                      7.8767
## LDA_00                             Inf
## LDA_01                             Inf
## LDA_02                             Inf
## LDA_03                             Inf
## LDA_04                             Inf
## global_subjectivity            66.9576
## global_sentiment_polarity     603.2208
## global_rate_positive_words    334.3134
## global_rate_negative_words    569.3334
## rate_positive_words          9271.5440
## rate_negative_words          8822.0623
## avg_positive_polarity         299.9627
## min_positive_polarity          89.2256
## max_positive_polarity         159.7404
## avg_negative_polarity         632.7785
## min_negative_polarity         456.1134
## max_negative_polarity         198.4165
## title_subjectivity            164.8044
## title_sentiment_polarity       48.5307
## abs_title_subjectivity         74.4159
## abs_title_sentiment_polarity  166.8955
##                                     Fi
## timedelta                     115.8376
## n_tokens_title                 14.1355
## n_tokens_content              292.4083
## n_unique_tokens              1262.3277
## n_non_stop_words             1995.3390
## n_non_stop_unique_tokens      822.6590
## num_hrefs                      91.6042
## num_self_hrefs                 25.0366
## num_imgs                       24.3857
## num_videos                     20.8574
## average_token_length          187.6205
## num_keywords                   81.5261
## kw_min_min                    319.0960
## kw_max_min                   1262.4990
## kw_avg_min                   1387.6277
## kw_min_max                     45.0368
## kw_max_max                    396.8123
## kw_avg_max                    423.5107
## kw_min_avg                    103.9181
## kw_max_avg                    669.3112
## kw_avg_avg                    911.3167
## self_reference_min_shares     805.7081
## self_reference_max_shares    1186.7103
## self_reference_avg_sharess   2809.3203
## is_weekend                      8.0576
## LDA_00                             Inf
## LDA_01                             Inf
## LDA_02                             Inf
## LDA_03                             Inf
## LDA_04                             Inf
## global_subjectivity            68.4952
## global_sentiment_polarity     617.0727
## global_rate_positive_words    341.9904
## global_rate_negative_words    582.4071
## rate_positive_words          9484.4488
## rate_negative_words          9024.6455
## avg_positive_polarity         306.8508
## min_positive_polarity          91.2745
## max_positive_polarity         163.4085
## avg_negative_polarity         647.3092
## min_negative_polarity         466.5872
## max_negative_polarity         202.9728
## title_subjectivity            168.5889
## title_sentiment_polarity       49.6451
## abs_title_subjectivity         76.1247
## abs_title_sentiment_polarity  170.7280
##                              Leamer
## timedelta                    0.6779
## n_tokens_title               0.9352
## n_tokens_content             0.5020
## n_unique_tokens              0.2691
## n_non_stop_words             0.2169
## n_non_stop_unique_tokens     0.3270
## num_hrefs                    0.7199
## num_self_hrefs               0.8930
## num_imgs                     0.8953
## num_videos                   0.9085
## average_token_length         0.5868
## num_keywords                 0.7397
## kw_min_min                   0.4857
## kw_max_min                   0.2691
## kw_avg_min                   0.2575
## kw_min_max                   0.8284
## kw_max_max                   0.4460
## kw_avg_max                   0.4344
## kw_min_avg                   0.6976
## kw_max_avg                   0.3582
## kw_avg_avg                   0.3124
## self_reference_min_shares    0.3301
## self_reference_max_shares    0.2769
## self_reference_avg_sharess   0.1841
## is_weekend                   0.9615
## LDA_00                       0.0000
## LDA_01                       0.0000
## LDA_02                       0.0000
## LDA_03                       0.0000
## LDA_04                       0.0000
## global_subjectivity          0.7680
## global_sentiment_polarity    0.3711
## global_rate_positive_words   0.4729
## global_rate_negative_words   0.3804
## rate_positive_words          0.1014
## rate_negative_words          0.1039
## avg_positive_polarity        0.4930
## min_positive_polarity        0.7205
## max_positive_polarity        0.6133
## avg_negative_polarity        0.3635
## min_negative_polarity        0.4175
## max_negative_polarity        0.5716
## title_subjectivity           0.6073
## title_sentiment_polarity     0.8154
## abs_title_subjectivity       0.7511
## abs_title_sentiment_polarity 0.6049
##                                  CVIF
## timedelta                      3.1123
## n_tokens_title                 1.6357
## n_tokens_content               5.6760
## n_unique_tokens               19.7584
## n_non_stop_words              30.4012
## n_non_stop_unique_tokens      13.3748
## num_hrefs                      2.7605
## num_self_hrefs                 1.7940
## num_imgs                       1.7845
## num_videos                     1.7333
## average_token_length           4.1546
## num_keywords                   2.6142
## kw_min_min                     6.0635
## kw_max_min                    19.7609
## kw_avg_min                    21.5777
## kw_min_max                     2.0844
## kw_max_max                     7.1919
## kw_avg_max                     7.5795
## kw_min_avg                     2.9393
## kw_max_avg                    11.1483
## kw_avg_avg                    14.6621
## self_reference_min_shares     13.1287
## self_reference_max_shares     18.6605
## self_reference_avg_sharess    42.2195
## is_weekend                     1.5475
## LDA_00                            Inf
## LDA_01                            Inf
## LDA_02                            Inf
## LDA_03                            Inf
## LDA_04                            Inf
## global_subjectivity            2.4250
## global_sentiment_polarity     10.3899
## global_rate_positive_words     6.3959
## global_rate_negative_words     9.8866
## rate_positive_words          139.1370
## rate_negative_words          132.4610
## avg_positive_polarity          5.8857
## min_positive_polarity          2.7557
## max_positive_polarity          3.8030
## avg_negative_polarity         10.8289
## min_negative_polarity          8.2049
## max_negative_polarity          4.3775
## title_subjectivity             3.8782
## title_sentiment_polarity       2.1513
## abs_title_subjectivity         2.5357
## abs_title_sentiment_polarity   3.9093
##                              Klein
## timedelta                        1
## n_tokens_title                   0
## n_tokens_content                 1
## n_unique_tokens                  1
## n_non_stop_words                 1
## n_non_stop_unique_tokens         1
## num_hrefs                        1
## num_self_hrefs                   1
## num_imgs                         1
## num_videos                       1
## average_token_length             1
## num_keywords                     1
## kw_min_min                       1
## kw_max_min                       1
## kw_avg_min                       1
## kw_min_max                       1
## kw_max_max                       1
## kw_avg_max                       1
## kw_min_avg                       1
## kw_max_avg                       1
## kw_avg_avg                       1
## self_reference_min_shares        1
## self_reference_max_shares        1
## self_reference_avg_sharess       1
## is_weekend                       0
## LDA_00                           1
## LDA_01                           1
## LDA_02                           1
## LDA_03                           1
## LDA_04                           1
## global_subjectivity              1
## global_sentiment_polarity        1
## global_rate_positive_words       1
## global_rate_negative_words       1
## rate_positive_words              1
## rate_negative_words              1
## avg_positive_polarity            1
## min_positive_polarity            1
## max_positive_polarity            1
## avg_negative_polarity            1
## min_negative_polarity            1
## max_negative_polarity            1
## title_subjectivity               1
## title_sentiment_polarity         1
## abs_title_subjectivity           1
## abs_title_sentiment_polarity     1
##                                IND1
## timedelta                    0.0047
## n_tokens_title               0.0089
## n_tokens_content             0.0026
## n_unique_tokens              0.0007
## n_non_stop_words             0.0005
## n_non_stop_unique_tokens     0.0011
## num_hrefs                    0.0053
## num_self_hrefs               0.0081
## num_imgs                     0.0081
## num_videos                   0.0084
## average_token_length         0.0035
## num_keywords                 0.0056
## kw_min_min                   0.0024
## kw_max_min                   0.0007
## kw_avg_min                   0.0007
## kw_min_max                   0.0070
## kw_max_max                   0.0020
## kw_avg_max                   0.0019
## kw_min_avg                   0.0049
## kw_max_avg                   0.0013
## kw_avg_avg                   0.0010
## self_reference_min_shares    0.0011
## self_reference_max_shares    0.0008
## self_reference_avg_sharess   0.0003
## is_weekend                   0.0094
## LDA_00                       0.0000
## LDA_01                       0.0000
## LDA_02                       0.0000
## LDA_03                       0.0000
## LDA_04                       0.0000
## global_subjectivity          0.0060
## global_sentiment_polarity    0.0014
## global_rate_positive_words   0.0023
## global_rate_negative_words   0.0015
## rate_positive_words          0.0001
## rate_negative_words          0.0001
## avg_positive_polarity        0.0025
## min_positive_polarity        0.0053
## max_positive_polarity        0.0038
## avg_negative_polarity        0.0013
## min_negative_polarity        0.0018
## max_negative_polarity        0.0033
## title_subjectivity           0.0037
## title_sentiment_polarity     0.0067
## abs_title_subjectivity       0.0057
## abs_title_sentiment_polarity 0.0037
##                                IND2
## timedelta                    0.7721
## n_tokens_title               0.1793
## n_tokens_content             1.0687
## n_unique_tokens              1.3253
## n_non_stop_words             1.3615
## n_non_stop_unique_tokens     1.2759
## num_hrefs                    0.6884
## num_self_hrefs               0.2895
## num_imgs                     0.2835
## num_videos                   0.2496
## average_token_length         0.9368
## num_keywords                 0.6469
## kw_min_min                   1.0917
## kw_max_min                   1.3253
## kw_avg_min                   1.3340
## kw_min_max                   0.4482
## kw_max_max                   1.1446
## kw_avg_max                   1.1591
## kw_min_avg                   0.7334
## kw_max_avg                   1.2454
## kw_avg_avg                   1.2893
## self_reference_min_shares    1.2731
## self_reference_max_shares    1.3192
## self_reference_avg_sharess   1.3803
## is_weekend                   0.1080
## LDA_00                       1.4287
## LDA_01                       1.4287
## LDA_02                       1.4287
## LDA_03                       1.4287
## LDA_04                       1.4287
## global_subjectivity          0.5859
## global_sentiment_polarity    1.2320
## global_rate_positive_words   1.1092
## global_rate_negative_words   1.2220
## rate_positive_words          1.4141
## rate_negative_words          1.4133
## avg_positive_polarity        1.0815
## min_positive_polarity        0.6871
## max_positive_polarity        0.8913
## avg_negative_polarity        1.2400
## min_negative_polarity        1.1796
## max_negative_polarity        0.9619
## title_subjectivity           0.9018
## title_sentiment_polarity     0.4787
## abs_title_subjectivity       0.6228
## abs_title_sentiment_polarity 0.9059
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test
## 
## timedelta , n_tokens_title , n_unique_tokens , n_non_stop_words , n_non_stop_unique_tokens , num_self_hrefs , num_videos , kw_max_max , kw_avg_max , kw_min_avg , self_reference_min_shares , LDA_01 , LDA_02 , LDA_03 , LDA_04 , global_subjectivity , global_sentiment_polarity , global_rate_positive_words , global_rate_negative_words , rate_positive_words , rate_negative_words , avg_positive_polarity , max_positive_polarity , avg_negative_polarity , min_negative_polarity , max_negative_polarity , title_subjectivity , abs_title_subjectivity , coefficient(s) are non-significant may be due to multicollinearity
## 
## R-square of y on all x: 0.1702 
## 
## * use method argument to check which regressors may be the reason of collinearity
## ===================================

Looking at the VIF values, we start by removing LDA_01 through LDA_04 and rate_positive_words. Dropping the four LDA variables removes the infinite VIF values (LDA_00 is kept), and rate_positive_words has the largest finite CVIF, about 139.
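One plausible reason for the infinite values, assuming the five LDA columns are topic proportions that sum to one for each article, is that together they reproduce the intercept column exactly. The sketch below uses simulated proportions rather than the news data (`raw` and `P` are made-up names) to show the resulting rank deficiency and how dropping a column restores full rank.

```r
# Simulated topic proportions; illustrative only, not the news data.
set.seed(3)
n   <- 100
raw <- matrix(runif(n * 3), ncol = 3)
P   <- raw / rowSums(raw)            # each row now sums to 1

# Design matrix with an intercept: 4 columns but only 3 independent ones
full_rank    <- qr(cbind(1, P))$rank
# Dropping one proportion column removes the exact dependence
reduced_rank <- qr(cbind(1, P[, -1]))$rank

print(c(columns = 4, rank = full_rank))
print(c(columns = 3, rank = reduced_rank))
```

Dropping a single column is enough to break an exact dependence of this kind; the trim in this report removes four of the five LDA variables.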

First Multicollinearity Trim

The mctest package was used to calculate the multicollinearity diagnostics, including the VIF values.
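As a reminder of what the VIF column measures: for each predictor, VIF is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the others, and the tolerance (TOL) is its reciprocal. The following is a minimal base R sketch on simulated data, not the mctest implementation; `vif_by_hand` and the `x1`, `x2`, `x3` columns are hypothetical names used only for illustration.

```r
# Hand-rolled VIF on simulated data; all names here are illustrative.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2 + rnorm(n, sd = 0.1)   # nearly a linear combination of x1 and x2
X  <- data.frame(x1, x2, x3)

vif_by_hand <- function(X) {
  sapply(names(X), function(v) {
    # Regress predictor v on the remaining predictors and take R^2
    others <- setdiff(names(X), v)
    r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
    1 / (1 - r2)
  })
}

vif <- vif_by_hand(X)
tol <- 1 / vif                       # tolerance is the reciprocal of VIF

print(round(vif, 2))
print(round(tol, 4))
```

Because `x3` is almost fully determined by `x1` and `x2`, all three predictors get large VIF values, which is the same pattern that groups of related keyword and token variables show in the diagnostics.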

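#Variables to drop: four of the five LDA topics and rate_positive_words
toRemove<-c("LDA_01", "LDA_02", "LDA_03", "LDA_04", "rate_positive_words")
trimTrain1 <- newTrain[, !names(newTrain) %in% toRemove, drop = FALSE]
#Index the test set by its own column names rather than those of newTrain
lmNewTest3 <- lmNewTest[, !names(lmNewTest) %in% toRemove, drop = FALSE]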

#Repeat linear Model process
lm3<- lm(log(shares)~., trimTrain1)
yhat_lm3<-predict(lm3,lmNewTest3)
RMSE_lm3<-sqrt(mean((lmNewTest3$shares - exp(yhat_lm3))^2))

imcdiag(lm3)

## 
## Call:
## imcdiag(mod = lm3)
## 
## 
## All Individual Multicollinearity Diagnostics Result
## 
##                                  VIF
## timedelta                     2.1637
## n_tokens_title                1.1423
## n_tokens_content              3.9547
## n_unique_tokens              13.7044
## n_non_stop_words              3.6273
## n_non_stop_unique_tokens      9.2998
## num_hrefs                     1.9273
## num_self_hrefs                1.2451
## num_imgs                      1.2433
## num_videos                    1.1902
## average_token_length          2.8796
## num_keywords                  1.8195
## kw_min_min                    4.2318
## kw_max_min                   13.7813
## kw_avg_min                   15.0235
## kw_min_max                    1.4497
## kw_max_max                    5.0066
## kw_avg_max                    5.1841
## kw_min_avg                    2.0443
## kw_max_avg                    7.7040
## kw_avg_avg                   10.0804
## self_reference_min_shares     9.1525
## self_reference_max_shares    13.0262
## self_reference_avg_sharess   29.4418
## is_weekend                    1.0817
## LDA_00                        1.1403
## global_subjectivity           1.6838
## global_sentiment_polarity     7.2594
## global_rate_positive_words    4.4370
## global_rate_negative_words    6.8701
## rate_negative_words           9.0766
## avg_positive_polarity         4.1111
## min_positive_polarity         1.9225
## max_positive_polarity         2.6542
## avg_negative_polarity         7.5461
## min_negative_polarity         5.7282
## max_negative_polarity         3.0527
## title_subjectivity            2.7097
## title_sentiment_polarity      1.5030
## abs_title_subjectivity        1.7705
## abs_title_sentiment_polarity  2.7321
##                                 TOL
## timedelta                    0.4622
## n_tokens_title               0.8755
## n_tokens_content             0.2529
## n_unique_tokens              0.0730
## n_non_stop_words             0.2757
## n_non_stop_unique_tokens     0.1075
## num_hrefs                    0.5189
## num_self_hrefs               0.8031
## num_imgs                     0.8043
## num_videos                   0.8402
## average_token_length         0.3473
## num_keywords                 0.5496
## kw_min_min                   0.2363
## kw_max_min                   0.0726
## kw_avg_min                   0.0666
## kw_min_max                   0.6898
## kw_max_max                   0.1997
## kw_avg_max                   0.1929
## kw_min_avg                   0.4892
## kw_max_avg                   0.1298
## kw_avg_avg                   0.0992
## self_reference_min_shares    0.1093
## self_reference_max_shares    0.0768
## self_reference_avg_sharess   0.0340
## is_weekend                   0.9245
## LDA_00                       0.8770
## global_subjectivity          0.5939
## global_sentiment_polarity    0.1378
## global_rate_positive_words   0.2254
## global_rate_negative_words   0.1456
## rate_negative_words          0.1102
## avg_positive_polarity        0.2432
## min_positive_polarity        0.5202
## max_positive_polarity        0.3768
## avg_negative_polarity        0.1325
## min_negative_polarity        0.1746
## max_negative_polarity        0.3276
## title_subjectivity           0.3690
## title_sentiment_polarity     0.6653
## abs_title_subjectivity       0.5648
## abs_title_sentiment_polarity 0.3660
##                                     Wi
## timedelta                     126.2361
## n_tokens_title                 15.4317
## n_tokens_content              320.5086
## n_unique_tokens              1378.1060
## n_non_stop_words              284.9962
## n_non_stop_unique_tokens      900.3232
## num_hrefs                     100.5862
## num_self_hrefs                 26.5889
## num_imgs                       26.3971
## num_videos                     20.6332
## average_token_length          203.8867
## num_keywords                   88.8982
## kw_min_min                    350.5674
## kw_max_min                   1386.4492
## kw_avg_min                   1521.2003
## kw_min_max                     48.7786
## kw_max_max                    434.6155
## kw_avg_max                    453.8756
## kw_min_avg                    113.2765
## kw_max_avg                    727.2113
## kw_avg_avg                    984.9967
## self_reference_min_shares     884.3443
## self_reference_max_shares    1304.5450
## self_reference_avg_sharess   3085.2201
## is_weekend                      8.8620
## LDA_00                         15.2155
## global_subjectivity            74.1745
## global_sentiment_polarity     678.9848
## global_rate_positive_words    372.8255
## global_rate_negative_words    636.7594
## rate_negative_words           876.1099
## avg_positive_polarity         337.4762
## min_positive_polarity         100.0663
## max_positive_polarity         179.4343
## avg_negative_polarity         710.0900
## min_negative_polarity         512.8949
## max_negative_polarity         222.6647
## title_subjectivity            185.4600
## title_sentiment_polarity       54.5680
## abs_title_subjectivity         83.5766
## abs_title_sentiment_polarity  187.8872
##                                     Fi
## timedelta                     129.5028
## n_tokens_title                 15.8310
## n_tokens_content              328.8026
## n_unique_tokens              1413.7678
## n_non_stop_words              292.3712
## n_non_stop_unique_tokens      923.6212
## num_hrefs                     103.1891
## num_self_hrefs                 27.2770
## num_imgs                       27.0802
## num_videos                     21.1671
## average_token_length          209.1627
## num_keywords                   91.1986
## kw_min_min                    359.6392
## kw_max_min                   1422.3269
## kw_avg_min                   1560.5650
## kw_min_max                     50.0409
## kw_max_max                    445.8622
## kw_avg_max                    465.6207
## kw_min_avg                    116.2078
## kw_max_avg                    746.0297
## kw_avg_avg                   1010.4858
## self_reference_min_shares     907.2288
## self_reference_max_shares    1338.3032
## self_reference_avg_sharess   3165.0576
## is_weekend                      9.0913
## LDA_00                         15.6092
## global_subjectivity            76.0939
## global_sentiment_polarity     696.5552
## global_rate_positive_words    382.4733
## global_rate_negative_words    653.2371
## rate_negative_words           898.7814
## avg_positive_polarity         346.2093
## min_positive_polarity         102.6558
## max_positive_polarity         184.0776
## avg_negative_polarity         728.4652
## min_negative_polarity         526.1673
## max_negative_polarity         228.4266
## title_subjectivity            190.2592
## title_sentiment_polarity       55.9801
## abs_title_subjectivity         85.7393
## abs_title_sentiment_polarity  192.7492
##                              Leamer
## timedelta                    0.6798
## n_tokens_title               0.9357
## n_tokens_content             0.5029
## n_unique_tokens              0.2701
## n_non_stop_words             0.5251
## n_non_stop_unique_tokens     0.3279
## num_hrefs                    0.7203
## num_self_hrefs               0.8962
## num_imgs                     0.8968
## num_videos                   0.9166
## average_token_length         0.5893
## num_keywords                 0.7413
## kw_min_min                   0.4861
## kw_max_min                   0.2694
## kw_avg_min                   0.2580
## kw_min_max                   0.8305
## kw_max_max                   0.4469
## kw_avg_max                   0.4392
## kw_min_avg                   0.6994
## kw_max_avg                   0.3603
## kw_avg_avg                   0.3150
## self_reference_min_shares    0.3305
## self_reference_max_shares    0.2771
## self_reference_avg_sharess   0.1843
## is_weekend                   0.9615
## LDA_00                       0.9365
## global_subjectivity          0.7706
## global_sentiment_polarity    0.3712
## global_rate_positive_words   0.4747
## global_rate_negative_words   0.3815
## rate_negative_words          0.3319
## avg_positive_polarity        0.4932
## min_positive_polarity        0.7212
## max_positive_polarity        0.6138
## avg_negative_polarity        0.3640
## min_negative_polarity        0.4178
## max_negative_polarity        0.5723
## title_subjectivity           0.6075
## title_sentiment_polarity     0.8157
## abs_title_subjectivity       0.7515
## abs_title_sentiment_polarity 0.6050
##                                 CVIF
## timedelta                     3.0027
## n_tokens_title                1.5852
## n_tokens_content              5.4881
## n_unique_tokens              19.0183
## n_non_stop_words              5.0338
## n_non_stop_unique_tokens     12.9059
## num_hrefs                     2.6746
## num_self_hrefs                1.7279
## num_imgs                      1.7255
## num_videos                    1.6517
## average_token_length          3.9962
## num_keywords                  2.5251
## kw_min_min                    5.8727
## kw_max_min                   19.1251
## kw_avg_min                   20.8490
## kw_min_max                    2.0118
## kw_max_max                    6.9479
## kw_avg_max                    7.1943
## kw_min_avg                    2.8369
## kw_max_avg                   10.6912
## kw_avg_avg                   13.9892
## self_reference_min_shares    12.7015
## self_reference_max_shares    18.0773
## self_reference_avg_sharess   40.8580
## is_weekend                    1.5011
## LDA_00                        1.5824
## global_subjectivity           2.3367
## global_sentiment_polarity    10.0742
## global_rate_positive_words    6.1574
## global_rate_negative_words    9.5340
## rate_negative_words          12.5961
## avg_positive_polarity         5.7052
## min_positive_polarity         2.6679
## max_positive_polarity         3.6833
## avg_negative_polarity        10.4722
## min_negative_polarity         7.9494
## max_negative_polarity         4.2364
## title_subjectivity            3.7604
## title_sentiment_polarity      2.0859
## abs_title_subjectivity        2.4570
## abs_title_sentiment_polarity  3.7915
##                              Klein
## timedelta                        1
## n_tokens_title                   0
## n_tokens_content                 1
## n_unique_tokens                  1
## n_non_stop_words                 1
## n_non_stop_unique_tokens         1
## num_hrefs                        1
## num_self_hrefs                   1
## num_imgs                         1
## num_videos                       0
## average_token_length             1
## num_keywords                     1
## kw_min_min                       1
## kw_max_min                       1
## kw_avg_min                       1
## kw_min_max                       1
## kw_max_max                       1
## kw_avg_max                       1
## kw_min_avg                       1
## kw_max_avg                       1
## kw_avg_avg                       1
## self_reference_min_shares        1
## self_reference_max_shares        1
## self_reference_avg_sharess       1
## is_weekend                       0
## LDA_00                           0
## global_subjectivity              1
## global_sentiment_polarity        1
## global_rate_positive_words       1
## global_rate_negative_words       1
## rate_negative_words              1
## avg_positive_polarity            1
## min_positive_polarity            1
## max_positive_polarity            1
## avg_negative_polarity            1
## min_negative_polarity            1
## max_negative_polarity            1
## title_subjectivity               1
## title_sentiment_polarity         1
## abs_title_subjectivity           1
## abs_title_sentiment_polarity     1
##                                IND1
## timedelta                    0.0043
## n_tokens_title               0.0081
## n_tokens_content             0.0023
## n_unique_tokens              0.0007
## n_non_stop_words             0.0025
## n_non_stop_unique_tokens     0.0010
## num_hrefs                    0.0048
## num_self_hrefs               0.0074
## num_imgs                     0.0074
## num_videos                   0.0077
## average_token_length         0.0032
## num_keywords                 0.0051
## kw_min_min                   0.0022
## kw_max_min                   0.0007
## kw_avg_min                   0.0006
## kw_min_max                   0.0064
## kw_max_max                   0.0018
## kw_avg_max                   0.0018
## kw_min_avg                   0.0045
## kw_max_avg                   0.0012
## kw_avg_avg                   0.0009
## self_reference_min_shares    0.0010
## self_reference_max_shares    0.0007
## self_reference_avg_sharess   0.0003
## is_weekend                   0.0085
## LDA_00                       0.0081
## global_subjectivity          0.0055
## global_sentiment_polarity    0.0013
## global_rate_positive_words   0.0021
## global_rate_negative_words   0.0013
## rate_negative_words          0.0010
## avg_positive_polarity        0.0022
## min_positive_polarity        0.0048
## max_positive_polarity        0.0035
## avg_negative_polarity        0.0012
## min_negative_polarity        0.0016
## max_negative_polarity        0.0030
## title_subjectivity           0.0034
## title_sentiment_polarity     0.0061
## abs_title_subjectivity       0.0052
## abs_title_sentiment_polarity 0.0034
##                                IND2
## timedelta                    0.8501
## n_tokens_title               0.1969
## n_tokens_content             1.1809
## n_unique_tokens              1.4653
## n_non_stop_words             1.1448
## n_non_stop_unique_tokens     1.4106
## num_hrefs                    0.7605
## num_self_hrefs               0.3112
## num_imgs                     0.3094
## num_videos                   0.2526
## average_token_length         1.0317
## num_keywords                 0.7119
## kw_min_min                   1.2071
## kw_max_min                   1.4659
## kw_avg_min                   1.4754
## kw_min_max                   0.4903
## kw_max_max                   1.2649
## kw_avg_max                   1.2757
## kw_min_avg                   0.8074
## kw_max_avg                   1.3754
## kw_avg_avg                   1.4238
## self_reference_min_shares    1.4079
## self_reference_max_shares    1.4593
## self_reference_avg_sharess   1.5269
## is_weekend                   0.1194
## LDA_00                       0.1944
## global_subjectivity          0.6419
## global_sentiment_polarity    1.3629
## global_rate_positive_words   1.2244
## global_rate_negative_words   1.3505
## rate_negative_words          1.4065
## avg_positive_polarity        1.1961
## min_positive_polarity        0.7584
## max_positive_polarity        0.9851
## avg_negative_polarity        1.3711
## min_negative_polarity        1.3047
## max_negative_polarity        1.0628
## title_subjectivity           0.9973
## title_sentiment_polarity     0.5290
## abs_title_subjectivity       0.6878
## abs_title_sentiment_polarity 1.0021
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test
## 
## timedelta , n_tokens_title , n_unique_tokens , n_non_stop_words , n_non_stop_unique_tokens , num_self_hrefs , num_videos , kw_max_max , kw_avg_max , kw_min_avg , self_reference_min_shares , global_sentiment_polarity , global_rate_positive_words , global_rate_negative_words , rate_negative_words , avg_positive_polarity , min_positive_polarity , avg_negative_polarity , min_negative_polarity , max_negative_polarity , title_subjectivity , title_sentiment_polarity , abs_title_sentiment_polarity , coefficient(s) are non-significant may be due to multicollinearity
## 
## R-square of y on all x: 0.1698 
## 
## * use method argument to check which regressors may be the reason of collinearity
## ===================================

This reduces the multicollinearity, but some remains. We then removed the variable with the highest remaining VIF, one at a time, refitting after each removal until all VIF values were below 5.
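The one-at-a-time removal described above can be sketched as a small loop. This is an illustration on simulated data under stated assumptions, not the procedure actually run for this report (the removed variables here are listed by hand); `vif_all`, `prune_by_vif`, and the `z*` columns are hypothetical names.

```r
# Stepwise VIF pruning on simulated data; every name here is illustrative.
set.seed(2)
n  <- 300
z1 <- rnorm(n)
z2 <- rnorm(n)
z3 <- rnorm(n)
X  <- data.frame(
  z1  = z1,
  z2  = z2,
  z3  = z3,
  z12 = z1 + z2 + rnorm(n, sd = 0.05),  # almost exactly z1 + z2
  z1b = z1 + rnorm(n, sd = 1)           # only moderately related to z1
)

# VIF of each column: 1 / (1 - R^2) from regressing it on the other columns
vif_all <- function(X) {
  sapply(names(X), function(v) {
    others <- setdiff(names(X), v)
    r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
    1 / (1 - r2)
  })
}

# Drop the column with the largest VIF, recompute, and repeat
prune_by_vif <- function(X, threshold = 5) {
  dropped <- character(0)
  repeat {
    v <- vif_all(X)
    if (max(v) < threshold || ncol(X) <= 2) break
    worst   <- names(which.max(v))
    dropped <- c(dropped, worst)
    X       <- X[, setdiff(names(X), worst), drop = FALSE]
  }
  list(kept = names(X), dropped = dropped, vif = vif_all(X))
}

result <- prune_by_vif(X, threshold = 5)
print(result$dropped)
print(round(result$vif, 2))
```

Recomputing the VIFs after every removal matters because dropping one variable lowers the VIF of the variables it was correlated with, so a variable that looked problematic at first may be fine to keep.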

Second Multicollinearity Trim

toRemove<-c("self_reference_avg_sharess", "kw_avg_min", "n_unique_tokens", "rate_negative_words", "kw_avg_avg", "n_non_stop_words", "global_sentiment_polarity", "avg_negative_polarity", "kw_max_max")
trimTrain2 <- trimTrain1[, !names(trimTrain1) %in% toRemove, drop = FALSE]

#Repeat linear Model process
lm4<- lm(log(shares)~., trimTrain2)

imcdiag(lm4)

## 
## Call:
## imcdiag(mod = lm4)
## 
## 
## All Individual Multicollinearity Diagnostics Result
## 
##                                 VIF
## timedelta                    1.8213
## n_tokens_title               1.1307
## n_tokens_content             2.7149
## n_non_stop_unique_tokens     2.0964
## num_hrefs                    1.8407
## num_self_hrefs               1.2044
## num_imgs                     1.2047
## num_videos                   1.1198
## average_token_length         1.4163
## num_keywords                 1.6474
## kw_min_min                   2.5520
## kw_max_min                   1.0904
## kw_min_max                   1.4115
## kw_avg_max                   3.7702
## kw_min_avg                   1.4495
## kw_max_avg                   1.2600
## self_reference_min_shares    1.4091
## self_reference_max_shares    1.5869
## is_weekend                   1.0708
## LDA_00                       1.1240
## global_subjectivity          1.4647
## global_rate_positive_words   1.5587
## global_rate_negative_words   1.4034
## avg_positive_polarity        2.5132
## min_positive_polarity        1.7910
## max_positive_polarity        2.5514
## min_negative_polarity        1.9323
## max_negative_polarity        1.2232
## title_subjectivity           2.6907
## title_sentiment_polarity     1.4747
## abs_title_subjectivity       1.7636
## abs_title_sentiment_polarity 2.7090
##                                 TOL
## timedelta                    0.5491
## n_tokens_title               0.8844
## n_tokens_content             0.3683
## n_non_stop_unique_tokens     0.4770
## num_hrefs                    0.5433
## num_self_hrefs               0.8303
## num_imgs                     0.8301
## num_videos                   0.8930
## average_token_length         0.7060
## num_keywords                 0.6070
## kw_min_min                   0.3918
## kw_max_min                   0.9171
## kw_min_max                   0.7084
## kw_avg_max                   0.2652
## kw_min_avg                   0.6899
## kw_max_avg                   0.7936
## self_reference_min_shares    0.7097
## self_reference_max_shares    0.6301
## is_weekend                   0.9339
## LDA_00                       0.8897
## global_subjectivity          0.6828
## global_rate_positive_words   0.6416
## global_rate_negative_words   0.7126
## avg_positive_polarity        0.3979
## min_positive_polarity        0.5584
## max_positive_polarity        0.3919
## min_negative_polarity        0.5175
## max_negative_polarity        0.8176
## title_subjectivity           0.3716
## title_sentiment_polarity     0.6781
## abs_title_subjectivity       0.5670
## abs_title_sentiment_polarity 0.3691
##                                    Wi
## timedelta                    115.1882
## n_tokens_title                18.3352
## n_tokens_content             240.5233
## n_non_stop_unique_tokens     153.7839
## num_hrefs                    117.9216
## num_self_hrefs                28.6644
## num_imgs                      28.7118
## num_videos                    16.8008
## average_token_length          58.3945
## num_keywords                  90.7980
## kw_min_min                   217.6802
## kw_max_min                    12.6843
## kw_min_max                    57.7225
## kw_avg_max                   388.5408
## kw_min_avg                    63.0459
## kw_max_avg                    36.4720
## self_reference_min_shares     57.3800
## self_reference_max_shares     82.3222
## is_weekend                     9.9336
## LDA_00                        17.3948
## global_subjectivity           65.1712
## global_rate_positive_words    78.3628
## global_rate_negative_words    56.5791
## avg_positive_polarity        212.2324
## min_positive_polarity        110.9400
## max_positive_polarity        217.6009
## min_negative_polarity        130.7636
## max_negative_polarity         31.2987
## title_subjectivity           237.1350
## title_sentiment_polarity      66.5745
## abs_title_subjectivity       107.1025
## abs_title_sentiment_polarity 239.6958
##                                    Fi
## timedelta                    119.0552
## n_tokens_title                18.9507
## n_tokens_content             248.5979
## n_non_stop_unique_tokens     158.9465
## num_hrefs                    121.8803
## num_self_hrefs                29.6266
## num_imgs                      29.6757
## num_videos                    17.3649
## average_token_length          60.3548
## num_keywords                  93.8462
## kw_min_min                   224.9879
## kw_max_min                    13.1101
## kw_min_max                    59.6603
## kw_avg_max                   401.5845
## kw_min_avg                    65.1624
## kw_max_avg                    37.6964
## self_reference_min_shares     59.3063
## self_reference_max_shares     85.0859
## is_weekend                    10.2671
## LDA_00                        17.9788
## global_subjectivity           67.3591
## global_rate_positive_words    80.9935
## global_rate_negative_words    58.4785
## avg_positive_polarity        219.3573
## min_positive_polarity        114.6643
## max_positive_polarity        224.9060
## min_negative_polarity        135.1535
## max_negative_polarity         32.3494
## title_subjectivity           245.0959
## title_sentiment_polarity      68.8095
## abs_title_subjectivity       110.6981
## abs_title_sentiment_polarity 247.7427
##                              Leamer
## timedelta                    0.7410
## n_tokens_title               0.9404
## n_tokens_content             0.6069
## n_non_stop_unique_tokens     0.6907
## num_hrefs                    0.7371
## num_self_hrefs               0.9112
## num_imgs                     0.9111
## num_videos                   0.9450
## average_token_length         0.8403
## num_keywords                 0.7791
## kw_min_min                   0.6260
## kw_max_min                   0.9576
## kw_min_max                   0.8417
## kw_avg_max                   0.5150
## kw_min_avg                   0.8306
## kw_max_avg                   0.8909
## self_reference_min_shares    0.8424
## self_reference_max_shares    0.7938
## is_weekend                   0.9664
## LDA_00                       0.9432
## global_subjectivity          0.8263
## global_rate_positive_words   0.8010
## global_rate_negative_words   0.8441
## avg_positive_polarity        0.6308
## min_positive_polarity        0.7472
## max_positive_polarity        0.6260
## min_negative_polarity        0.7194
## max_negative_polarity        0.9042
## title_subjectivity           0.6096
## title_sentiment_polarity     0.8235
## abs_title_subjectivity       0.7530
## abs_title_sentiment_polarity 0.6076
##                                CVIF
## timedelta                    2.2296
## n_tokens_title               1.3842
## n_tokens_content             3.3235
## n_non_stop_unique_tokens     2.5665
## num_hrefs                    2.2534
## num_self_hrefs               1.4744
## num_imgs                     1.4748
## num_videos                   1.3708
## average_token_length         1.7339
## num_keywords                 2.0167
## kw_min_min                   3.1241
## kw_max_min                   1.3349
## kw_min_max                   1.7280
## kw_avg_max                   4.6155
## kw_min_avg                   1.7745
## kw_max_avg                   1.5425
## self_reference_min_shares    1.7250
## self_reference_max_shares    1.9427
## is_weekend                   1.3109
## LDA_00                       1.3760
## global_subjectivity          1.7930
## global_rate_positive_words   1.9082
## global_rate_negative_words   1.7180
## avg_positive_polarity        3.0766
## min_positive_polarity        2.1925
## max_positive_polarity        3.1235
## min_negative_polarity        2.3655
## max_negative_polarity        1.4974
## title_subjectivity           3.2940
## title_sentiment_polarity     1.8053
## abs_title_subjectivity       2.1590
## abs_title_sentiment_polarity 3.3163
##                              Klein
## timedelta                        1
## n_tokens_title                   0
## n_tokens_content                 1
## n_non_stop_unique_tokens         1
## num_hrefs                        1
## num_self_hrefs                   1
## num_imgs                         1
## num_videos                       0
## average_token_length             1
## num_keywords                     1
## kw_min_min                       1
## kw_max_min                       0
## kw_min_max                       1
## kw_avg_max                       1
## kw_min_avg                       1
## kw_max_avg                       1
## self_reference_min_shares        1
## self_reference_max_shares        1
## is_weekend                       0
## LDA_00                           0
## global_subjectivity              1
## global_rate_positive_words       1
## global_rate_negative_words       1
## avg_positive_polarity            1
## min_positive_polarity            1
## max_positive_polarity            1
## min_negative_polarity            1
## max_negative_polarity            1
## title_subjectivity               1
## title_sentiment_polarity         1
## abs_title_subjectivity           1
## abs_title_sentiment_polarity     1
##                                IND1
## timedelta                    0.0039
## n_tokens_title               0.0063
## n_tokens_content             0.0026
## n_non_stop_unique_tokens     0.0034
## num_hrefs                    0.0039
## num_self_hrefs               0.0059
## num_imgs                     0.0059
## num_videos                   0.0064
## average_token_length         0.0050
## num_keywords                 0.0043
## kw_min_min                   0.0028
## kw_max_min                   0.0065
## kw_min_max                   0.0051
## kw_avg_max                   0.0019
## kw_min_avg                   0.0049
## kw_max_avg                   0.0057
## self_reference_min_shares    0.0051
## self_reference_max_shares    0.0045
## is_weekend                   0.0067
## LDA_00                       0.0063
## global_subjectivity          0.0049
## global_rate_positive_words   0.0046
## global_rate_negative_words   0.0051
## avg_positive_polarity        0.0028
## min_positive_polarity        0.0040
## max_positive_polarity        0.0028
## min_negative_polarity        0.0037
## max_negative_polarity        0.0058
## title_subjectivity           0.0026
## title_sentiment_polarity     0.0048
## abs_title_subjectivity       0.0040
## abs_title_sentiment_polarity 0.0026
##                                IND2
## timedelta                    1.2359
## n_tokens_title               0.3169
## n_tokens_content             1.7312
## n_non_stop_unique_tokens     1.4334
## num_hrefs                    1.2518
## num_self_hrefs               0.4651
## num_imgs                     0.4657
## num_videos                   0.2932
## average_token_length         0.8056
## num_keywords                 1.0770
## kw_min_min                   1.6667
## kw_max_min                   0.2273
## kw_min_max                   0.7991
## kw_avg_max                   2.0138
## kw_min_avg                   0.8499
## kw_max_avg                   0.5656
## self_reference_min_shares    0.7957
## self_reference_max_shares    1.0137
## is_weekend                   0.1813
## LDA_00                       0.3024
## global_subjectivity          0.8695
## global_rate_positive_words   0.9824
## global_rate_negative_words   0.7878
## avg_positive_polarity        1.6502
## min_positive_polarity        1.2104
## max_positive_polarity        1.6665
## min_negative_polarity        1.3223
## max_negative_polarity        0.5000
## title_subjectivity           1.7221
## title_sentiment_polarity     0.8822
## abs_title_subjectivity       1.1867
## abs_title_sentiment_polarity 1.7290
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test
## 
## timedelta , n_tokens_title , num_self_hrefs , num_videos , self_reference_min_shares , global_rate_negative_words , avg_positive_polarity , min_positive_polarity , min_negative_polarity , max_negative_polarity , title_subjectivity , title_sentiment_polarity , abs_title_sentiment_polarity , coefficient(s) are non-significant may be due to multicollinearity
## 
## R-square of y on all x: 0.142 
## 
## * use method argument to check which regressors may be the reason of collinearity
## ===================================

After removing 15 more variables with obvious multicollinearity (VIF > 5), we replot the correlation matrix, which now shows far fewer clusters of highly correlated variables.
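The VIF cutoff used above can be illustrated with a small base R sketch. For predictor j, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all of the others. The helper `vif_manual` and the simulated columns `x1`, `x2`, and `x3` are hypothetical names chosen for illustration; this is not the code that produced the diagnostics above.

```r
# Compute a VIF for each column by regressing it on all the other columns.
# vif_manual is a hypothetical helper for illustration, base R only.
vif_manual <- function(X) {
  X <- as.data.frame(X)
  sapply(names(X), function(v) {
    others <- setdiff(names(X), v)
    fit <- lm(reformulate(others, response = v), data = X)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Simulated predictors (an assumption, not the news data)
set.seed(14)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)   # nearly a copy of x1, so both get large VIFs
x3 <- rnorm(100)                  # unrelated to the other two

vifs <- vif_manual(data.frame(x1, x2, x3))

# Columns that exceed the VIF > 5 cutoff
names(vifs)[vifs > 5]
```

Only one of a pair like `x1` and `x2` needs to be dropped, so in practice the VIFs are usually recomputed after each removal.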

Replot Correlation

#Keep the first 31 columns for the correlation matrix
train_cor<-trimTrain2[1:31]
res <- cor(train_cor)
palette = colorRampPalette(c("green", "white", "red")) (20)
heatmap(x = res, col = palette, symm = TRUE, cexRow=0.5, cexCol = 0.5)

The new heatmap shows less prominent clusters of highly correlated variables.

Final Model Fit Prediction

#trim the testing data
newTest1<-test[ -c(25:31) ]
toRemove<-c( "LDA_01", "LDA_02", "LDA_03", "LDA_04", "rate_positive_words", "self_reference_avg_sharess", "kw_avg_min", "n_unique_tokens", "rate_negative_words", "kw_avg_avg", "n_non_stop_words", "global_sentiment_polarity", "avg_negative_polarity", "kw_max_max")

trimTest4 <- newTest1[, !names(newTest1) %in% toRemove, drop = FALSE]

yhat_lm4<-predict(lm4,trimTest4)
RMSE_lm4<-sqrt(mean((trimTest4$shares - exp(yhat_lm4))^2))

Open Parallel Processing

library(parallel)
library(doParallel)
cores<-detectCores()
cl <- makeCluster(cores-1)  
registerDoParallel(cl)

Backward Regression Selection

Transform the response with log, then fit a linear regression model with the variables after backward selection.

#backward selection after log transformation
library(leaps)
backward<- regsubsets(log(shares)~., trimTrain1, nvmax = 31, method = "backward")
backward_summary<-summary(backward)

#backward_summary[["which"]][size, ]
par(mfrow=c(1,3))
plot(backward_summary$cp, xlab = "Size", ylab = "backward Cp", type = "l")
plot(backward_summary$bic, xlab = "Size", ylab = "backward bic", type = "l")
plot(backward_summary$adjr2, xlab = "Size", ylab = "backward adjR2", type = "l")

coef(backward, which.min(backward_summary$cp))

##                (Intercept) 
##               6.026741e+00 
##           n_tokens_content 
##               1.819146e-04 
##   n_non_stop_unique_tokens 
##               5.199117e-01 
##                  num_hrefs 
##               7.145266e-03 
##                   num_imgs 
##               9.952843e-03 
##       average_token_length 
##              -1.295631e-01 
##               num_keywords 
##               3.588396e-02 
##                 kw_min_min 
##               1.522989e-03 
##                 kw_max_min 
##               7.164718e-05 
##                 kw_avg_min 
##              -3.395589e-04 
##                 kw_min_max 
##              -3.277207e-07 
##                 kw_max_max 
##               1.679088e-07 
##                 kw_max_avg 
##              -4.061436e-05 
##                 kw_avg_avg 
##               3.065877e-04 
##  self_reference_min_shares 
##              -2.513487e-06 
##  self_reference_max_shares 
##              -1.902204e-06 
## self_reference_avg_sharess 
##               5.896310e-06 
##                 is_weekend 
##               3.015607e-01 
##                     LDA_00 
##               2.665259e-01 
##        global_subjectivity 
##               4.772389e-01 
## global_rate_negative_words 
##               7.373213e+00 
##        rate_negative_words 
##              -5.300122e-01 
##      min_positive_polarity 
##              -3.568706e-01 
##      max_positive_polarity 
##              -1.915775e-01 
##      avg_negative_polarity 
##              -3.675670e-01 
##      max_negative_polarity 
##               5.050231e-01 
##         title_subjectivity 
##               9.705537e-02 
##   title_sentiment_polarity 
##               1.217799e-01 
##     abs_title_subjectivity 
##               3.173377e-01

coef(backward, which.max(backward_summary$adjr2))

##                (Intercept) 
##               5.909880e+00 
##           n_tokens_content 
##               1.693521e-04 
##           n_non_stop_words 
##               3.658614e-01 
##   n_non_stop_unique_tokens 
##               4.196801e-01 
##                  num_hrefs 
##               7.254519e-03 
##                   num_imgs 
##               9.549880e-03 
##       average_token_length 
##              -1.636281e-01 
##               num_keywords 
##               3.611458e-02 
##                 kw_min_min 
##               1.530708e-03 
##                 kw_max_min 
##               7.297875e-05 
##                 kw_avg_min 
##              -3.459285e-04 
##                 kw_min_max 
##              -3.176385e-07 
##                 kw_max_max 
##               1.701103e-07 
##                 kw_max_avg 
##              -4.064527e-05 
##                 kw_avg_avg 
##               3.065108e-04 
##  self_reference_min_shares 
##              -2.487701e-06 
##  self_reference_max_shares 
##              -1.909126e-06 
## self_reference_avg_sharess 
##               5.893345e-06 
##                 is_weekend 
##               3.038499e-01 
##                     LDA_00 
##               2.627707e-01 
##        global_subjectivity 
##               3.545519e-01 
##  global_sentiment_polarity 
##               3.373240e-01 
## global_rate_negative_words 
##               7.178969e+00 
##        rate_negative_words 
##              -4.251939e-01 
##      min_positive_polarity 
##              -4.047131e-01 
##      max_positive_polarity 
##              -2.418988e-01 
##      avg_negative_polarity 
##              -4.437618e-01 
##      max_negative_polarity 
##               5.471188e-01 
##         title_subjectivity 
##               1.018941e-01 
##   title_sentiment_polarity 
##               1.126524e-01 
##     abs_title_subjectivity 
##               3.195560e-01

#get the best subset (the model size with the minimum Cp)
best_size <- which.min(backward_summary$cp)
coef_back <- coef(backward, best_size)

#create the test model matrix, the predictions, and the test error
#columns are matched by name so they line up with the coefficients
test_model <- model.matrix(log(shares)~ ., data = lmNewTest3)
yhat_back <- test_model[, names(coef_back)] %*% coef_back
RMSE_back <- sqrt(mean((lmNewTest3$shares - exp(yhat_back))^2))
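For comparison, backward elimination can also be sketched with base R's `step()`. Note the difference in technique: `step()` drops one term at a time by AIC, whereas the code above uses `regsubsets()` and picks the size by Cp. The built-in `mtcars` data and the log-transformed `mpg` response are assumptions used only to keep the example self-contained.

```r
# Backward elimination by AIC with base R (not the regsubsets/Cp approach above).
# mtcars stands in for the news data purely for illustration.
full <- lm(log(mpg) ~ ., data = mtcars)
back <- step(full, direction = "backward", trace = 0)

# Terms kept by the backward search
kept <- names(coef(back))
kept

# Predictions are on the log scale, so exponentiate before computing RMSE
yhat <- exp(predict(back, newdata = mtcars))
rmse_back <- sqrt(mean((mtcars$mpg - yhat)^2))
```

Because `step()` returns an ordinary `lm` object, `predict()` works directly and no manual model matrix is needed.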

Random Forests

As mentioned in the regression tree section, a random forest grows many trees and averages their predictions, which gives more accurate and stable predictions than any single tree. Each tree is fit to a bootstrap sample of the training data, as in bagging. Unlike bagging, only a random subset of the predictors is considered at each split (cross-validation is used below to choose how many). This decorrelates the trees, so averaging them reduces variance and makes the prediction less prone to the overfitting seen in a single regression tree.

The manual dimension reduction above was needed to keep the random forest training time manageable.

library(randomForest)
#bagged-style model: 32 predictors are tried at each split (full bagging would try all of them)
tree.train<-randomForest(shares~., data=trimTrain1, mtry=32, importance=TRUE)
tree.train

## 
## Call:
##  randomForest(formula = shares ~ ., data = trimTrain1, mtry = 32,      importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 32
## 
##           Mean of squared residuals: 225836201
##                     % Var explained: -7.8

#single bagged regression tree error prediction
tree.test<-lmNewTest3["shares"]
yhat.bag<-predict(tree.train, newdata=lmNewTest3)
yhat.bag<-as.data.frame(yhat.bag)
yhat_bag<-mean((yhat.bag$yhat.bag-tree.test$shares)^2)
RMSE_bag<-sqrt(yhat_bag)

#run parallel processing to determine the best mtry value
control <- trainControl(method="repeatedcv", number=15, repeats=3, search="random")
mtry <- sqrt(ncol(trimTrain1))
rf_random <- train(shares~., data=trimTrain1, method="rf", tuneLength=15, trControl=control)
print(rf_random)

## Random Forest 
## 
## 4380 samples
##   41 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (15 fold, repeated 3 times) 
## Summary of sample sizes: 4087, 4089, 4089, 4087, 4088, 4089, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared    MAE     
##    6    11064.20  0.03506399  2856.553
##    8    11149.86  0.03278844  2874.201
##   10    11336.61  0.03119916  2910.572
##   14    11494.60  0.02843014  2948.545
##   17    11656.99  0.02554926  2969.380
##   18    11687.01  0.02619229  2975.067
##   19    11654.92  0.02618212  2971.074
##   20    11809.23  0.02424829  2994.641
##   27    12065.56  0.02198172  3037.255
##   30    12097.93  0.02159392  3035.926
##   31    12186.67  0.02046357  3051.066
##   35    12277.83  0.02117312  3056.618
##   39    12504.33  0.02050646  3083.314
##   41    12419.78  0.02062526  3086.602
## 
## RMSE was used to select the optimal model
##  using the smallest value.
## The final value used for the model was mtry = 6.

plot(rf_random)

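#NOTE: which.min() returns the row index of the lowest RMSE (1 here), not the
#mtry value in that row (6); rf_random$bestTune$mtry holds the tuned value
mtry<-which.min(rf_random$results$RMSE)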
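The difference between a row position and the value stored in that row matters when pulling a tuned parameter out of a results table. A minimal sketch, using the first three rows of the tuning table printed above; the names `results`, `best_row`, and `best_mtry` are hypothetical:

```r
# First three rows of the tuning results printed above (mtry and RMSE columns)
results <- data.frame(mtry = c(6, 8, 10),
                      RMSE = c(11064.20, 11149.86, 11336.61))

best_row  <- which.min(results$RMSE)   # position of the lowest RMSE in the table
best_mtry <- results$mtry[best_row]    # the mtry value stored at that position

best_row
best_mtry
```

Here `best_row` is 1 while `best_mtry` is 6, which is why a forest fit with the row index reports one variable tried at each split.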

#Use a model to determine the best number of trees
control <- trainControl(method="repeatedcv", number=5, repeats=3, search="grid")
tunegrid <- expand.grid(.mtry=mtry)
modellist <- list()
for (ntree in c(500, 1000, 1500, 2000)) {
  fit <- train(shares~., data=trimTrain1, method="rf", tuneGrid=tunegrid, trControl=control, ntree=ntree)
  key <- toString(ntree)
  modellist[[key]] <- fit
}
results <- resamples(modellist)
summary(results)

## 
## Call:
## summary.resamples(object = results)
## 
## Models: 500, 1000, 1500, 2000 
## Number of resamples: 15 
## 
## MAE 
##          Min.  1st Qu.   Median     Mean
## 500  2279.469 2437.125 2623.993 2669.576
## 1000 2160.764 2487.919 2566.709 2648.974
## 1500 2166.048 2435.301 2679.688 2650.837
## 2000 2073.480 2269.924 2462.833 2652.750
##       3rd Qu.     Max. NA's
## 500  2807.488 3474.814    0
## 1000 2821.177 3448.537    0
## 1500 2767.883 3238.394    0
## 2000 2927.455 4092.169    0
## 
## RMSE 
##          Min.  1st Qu.    Median     Mean
## 500  4979.982 6417.385 11582.589 12673.30
## 1000 4448.666 6308.064 11429.829 12408.53
## 1500 4721.711 7035.141 11937.270 12618.13
## 2000 4296.065 5033.412  7198.068 11734.82
##       3rd Qu.     Max. NA's
## 500  14065.30 26461.74    0
## 1000 15464.89 27600.19    0
## 1500 12622.53 26038.11    0
## 2000 14992.14 30098.36    0
## 
## Rsquared 
##             Min.     1st Qu.     Median
## 500  0.007362031 0.011030561 0.01413348
## 1000 0.006240405 0.013242814 0.01905264
## 1500 0.006128803 0.009252216 0.01442754
## 2000 0.003513309 0.014457394 0.02573561
##            Mean    3rd Qu.       Max. NA's
## 500  0.02308878 0.02442855 0.09893498    0
## 1000 0.02642723 0.03961135 0.06218658    0
## 1500 0.02685317 0.02582982 0.14669403    0
## 2000 0.03573336 0.05556318 0.10011027    0

#Apply best fit parameters to model. 
#random forests model
tree.trainRF<-randomForest(shares~., data=trimTrain1, mtry=mtry,  importance=TRUE)
tree.trainRF

## 
## Call:
##  randomForest(formula = shares ~ ., data = trimTrain1, mtry = mtry,      importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 208082709
##                     % Var explained: 0.67

#refit the random forest with ntree=500 and mtry=2; this fit is used for the test predictions
tree.trainRF<-randomForest(shares~., data=trimTrain1, ntree=500, mtry=2, importance=TRUE)
tree.trainRF

## 
## Call:
##  randomForest(formula = shares ~ ., data = trimTrain1, ntree = 500,      mtry = 2, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 208432947
##                     % Var explained: 0.5

plot(tree.trainRF)

#Calculate Prediction
yhat.rf<-predict(tree.trainRF, newdata = lmNewTest3)
yhat.rf<-as.data.frame(yhat.rf)
yhat_rf<-mean((yhat.rf$yhat.rf-tree.test$shares)^2)
RMSE_rfTrimmed<-sqrt(yhat_rf)
varImpPlot(tree.trainRF)

Boosted Tree

Boosting is a general approach that can be applied to many statistical learning methods for regression or classification. The trees in boosting are grown sequentially: each tree is grown using information from the previously grown trees. Boosting does not involve bootstrap sampling; instead, each tree is fit on a modified version of the original data set.

Procedure (for regression trees):
1. Initialize predictions as 0.
2. Find the residuals (observed - predicted); call the set of them r.
3. Fit a tree with d splits (d + 1 terminal nodes), treating the residuals as the response (which they are for the first fit).
4. Update predictions by adding the new tree's predictions, scaled by the shrinkage parameter.
5. Update residuals for the new predictions and repeat B times.

Three tuning parameters must be chosen for the boosted tree model: the shrinkage, the number of trees B, and the number of splits per tree d.
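The procedure above can be sketched in base R with one-split trees (d = 1, often called stumps). This is a minimal illustration on simulated data, not the `gbm` implementation used below; the helpers `fit_stump` and `predict_stump`, the data, and the values of `lambda` and `B` are all assumptions.

```r
# Minimal boosting sketch with stumps (d = 1) on simulated data; base R only.
set.seed(1)
n <- 200
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.3)

# Fit a one-split tree to response r: choose the cut point with the lowest SSE.
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between adjacent x values
  sse <- sapply(cuts, function(cut) {
    left <- r[x <= cut]
    right <- r[x > cut]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  best <- cuts[which.min(sse)]
  list(cut = best, left = mean(r[x <= best]), right = mean(r[x > best]))
}

# Predict the mean of whichever side of the cut each x falls on.
predict_stump <- function(stump, x) {
  ifelse(x <= stump$cut, stump$left, stump$right)
}

lambda <- 0.1   # shrinkage (assumed value)
B <- 200        # number of trees (assumed value)

yhat <- rep(0, n)          # step 1: initialize predictions as 0
r <- y - yhat              # step 2: residuals
train_rmse <- numeric(B)
for (b in seq_len(B)) {
  stump <- fit_stump(x, r)                          # step 3: fit a tree to the residuals
  yhat <- yhat + lambda * predict_stump(stump, x)   # step 4: shrunken update
  r <- y - yhat                                     # step 5: update residuals
  train_rmse[b] <- sqrt(mean(r^2))
}

# Training RMSE after the first and the last tree
train_rmse[c(1, B)]
```

A smaller `lambda` makes each tree contribute less, so more trees are needed; that trade-off is why shrinkage and the number of trees are tuned together.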

cvcontrol <- trainControl(method="repeatedcv", number = 10,
                          allowParallel=TRUE)
grid <- expand.grid(n.trees = c(1000,1500), 
                    interaction.depth=c(1:3), 
                    shrinkage=c(0.01,0.05,0.1), 
                    n.minobsinnode=c(20))
capture<-capture.output(train.gbm <- train(log(shares) ~ ., 
                   data=train,
                   method="gbm",
                   trControl=cvcontrol,
                   tuneGrid = grid))
train.gbm

## Stochastic Gradient Boosting 
## 
## 4380 samples
##   53 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 3941, 3942, 3942, 3943, 3942, 3941, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.trees
##   0.01       1                  1000   
##   0.01       1                  1500   
##   0.01       2                  1000   
##   0.01       2                  1500   
##   0.01       3                  1000   
##   0.01       3                  1500   
##   0.05       1                  1000   
##   0.05       1                  1500   
##   0.05       2                  1000   
##   0.05       2                  1500   
##   0.05       3                  1000   
##   0.05       3                  1500   
##   0.10       1                  1000   
##   0.10       1                  1500   
##   0.10       2                  1000   
##   0.10       2                  1500   
##   0.10       3                  1000   
##   0.10       3                  1500   
##   RMSE       Rsquared   MAE      
##   0.7681557  0.1732328  0.5523774
##   0.7660428  0.1772920  0.5498350
##   0.7634276  0.1827888  0.5464896
##   0.7624344  0.1850689  0.5456519
##   0.7612713  0.1875911  0.5441152
##   0.7610773  0.1880597  0.5433628
##   0.7645608  0.1813511  0.5483374
##   0.7651187  0.1812789  0.5489533
##   0.7668069  0.1796954  0.5489326
##   0.7719152  0.1729540  0.5536085
##   0.7686746  0.1783743  0.5504686
##   0.7743425  0.1725768  0.5554977
##   0.7676056  0.1780561  0.5527836
##   0.7733872  0.1695549  0.5570542
##   0.7835695  0.1574489  0.5606110
##   0.7901480  0.1525438  0.5668374
##   0.7902980  0.1532283  0.5662682
##   0.8004034  0.1448349  0.5743237
## 
## Tuning parameter 'n.minobsinnode' was
##  held constant at a value of 20
## RMSE was used to select the optimal model
##  using the smallest value.
## The final values used for the model
##  were n.trees = 1500, interaction.depth =
##  3, shrinkage = 0.01 and n.minobsinnode = 20.

boostPred <- predict(train.gbm, newdata = test)
RMSE_boost <- sqrt(mean((test$shares - exp(boostPred))^2))

stopCluster(cl)

Comparison

The model with the lowest test RMSE is taken as the best.

comparison<-data.frame(RMSE_lm, RMSE_lm2, RMSE_lm3, RMSE_lm4, RMSE_back,  RMSE_bag, RMSE_rfTrimmed, RMSE_boost, RMSE_regTree)

comparison

##    RMSE_lm RMSE_lm2 RMSE_lm3 RMSE_lm4 RMSE_back
## 1 94872.86 97810.15  93811.9 192485.1  70876.98
##   RMSE_bag RMSE_rfTrimmed RMSE_boost
## 1  17372.6       16091.67   16231.99
##   RMSE_regTree
## 1     18192.27

which.min(comparison)

## RMSE_rfTrimmed 
##              7
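The comparison above depends on every RMSE being computed on the original `shares` scale, including for the models fit to `log(shares)`. A minimal sketch of a helper that makes the back-transformation explicit; the function `rmse`, its `log_scale` argument, and the toy numbers are hypothetical and not part of the analysis:

```r
# RMSE on the original scale; set log_scale = TRUE when predictions are of log(y).
rmse <- function(actual, predicted, log_scale = FALSE) {
  if (log_scale) predicted <- exp(predicted)   # back-transform before comparing
  sqrt(mean((actual - predicted)^2))
}

# Toy values for illustration only
actual <- c(100, 200, 400)
scores <- c(
  linear = rmse(actual, c(110, 190, 420)),
  logged = rmse(actual, log(c(100, 200, 400)), log_scale = TRUE)
)

# Name of the model with the lowest RMSE
names(which.min(scores))
```

Keeping the scores in a named vector means `which.min()` reports the winning model by name as well as by position.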

The prediction error for this data set is very high for every model. This is likely driven by outlier articles with extremely high share counts, those that are both timely and viral. These observations were not removed from the analysis, because they are exactly the articles a company would want to study and emulate.

ST558Project2