
ST 558 Project 2
library(rmarkdown)
library(usethis)
use_git_config(user.name="Mandy Liesch", user.email="amliesch@ncsu.edu")

Introduction

The Online News Popularity data set summarizes a heterogeneous set of features about articles published by Mashable over a period of two years. The goal is to predict the number of shares in social networks (popularity). We first show summary statistics and plots of the data grouped by weekday. We then fit several models to predict the response, shares, for each data channel. Model performance is evaluated by RMSE, and the model with the lowest RMSE is selected as the winner. The modeling methods include:

  1. Regression Tree
  2. Log Transformed Full Linear Regression Model
  3. Linear Regression Model Without Day of the Week
  4. Subset Linear Regression Model #1
  5. Subset Linear Regression Model #2
  6. Backward Selection Linear Regression
  7. Bagged Regression Tree
  8. Random Forest Model
  9. Boosted Tree Model

Data preparation

Subset Data by Channel

library(tidyverse)

data_whole<-read_csv("OnlineNewsPopularity/OnlineNewsPopularity.csv")

#create a new variable, channel, to help with the subsetting.
data_whole$channel <- names(data_whole[14:19])[apply(data_whole[14:19],1, match, x = 1)]
data_whole$channel <-sub("data_channel_is_", "", data_whole$channel)

#Subset the data to work on the data channel of interest
#channel_interest = params[[1]]$team

#Get the important data
data_interest<-data_whole%>%
  filter(channel==x[[2]]$team)%>%
  select(-c(1,14:19,62))
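The channel label above is built from one-hot indicator columns. The following is a minimal sketch of the same `apply`/`match` idiom on a toy data frame; the toy values are made up for illustration and are not from the Mashable data. Note that a row with no indicator set to 1 ends up with `NA` as its channel.

```r
# Toy data (hypothetical): three indicator columns, at most one of which is 1 per row
toy <- data.frame(
  data_channel_is_tech  = c(1, 0, 0, 0),
  data_channel_is_world = c(0, 1, 0, 0),
  data_channel_is_bus   = c(0, 0, 1, 0)
)

# For each row, find the position of the first 1 (NA if the row has no 1)
pos <- apply(toy, 1, match, x = 1)

# Look up the matching column name and strip the common prefix
toy$channel <- sub("data_channel_is_", "", names(toy)[pos])

print(toy$channel)
```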

Establish Training Data

Split the data into a training set (70% of the data) and a test set (30% of the data).

library(caret)
library(rsample)
set.seed(14)
index <- initial_split(data_interest,
                       prop = 0.7)
train <- training(index)
test <- testing(index)
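To make the idea of a 70/30 split concrete, here is a base R sketch on toy data. It only illustrates the concept; it is not a claim about how `initial_split()` selects rows internally, and the data frame `toy` is hypothetical.

```r
set.seed(14)

# Toy data (hypothetical): 100 rows with an id and a numeric response
n <- 100
toy <- data.frame(id = seq_len(n), y = rnorm(n))

# Draw 70% of the row indices at random for training; the rest form the test set
train_rows <- sample(n, size = round(0.7 * n))
toy_train  <- toy[train_rows, ]
toy_test   <- toy[-train_rows, ]

# The two sets are disjoint and together cover every row
nrow(toy_train)
nrow(toy_test)
```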

Data Summaries

Correlation Plots

This graphical function looks at the correlation of all of the different variables against each other.

library(corrplot)
#drop the day-of-week indicator columns (weekday_is_*)
newTrain<-train[ -c(25:31) ]
lmNewTest<-test[ -c(25:31) ]
#drop the response variable (shares), leaving only the predictors
predictTrain<-newTrain[ -c(47) ]
#Calculate the correlation Matrix and round it
res <- cor(predictTrain)

#Plot the correlation matrix values by cluster
corrplot(res, type = "upper", order = "hclust",
         tl.col = "black", tl.cex = 0.5)

From the results of this plot, it appears that we likely have some clusters of collinearity.
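A correlation plot shows clusters visually; listing the strongly correlated pairs can complement it. The sketch below does this on toy data. The 0.8 cutoff and the variable names are assumptions chosen for illustration, not values used in this report.

```r
# Toy predictors (hypothetical): x2 is built to track x1 closely, x3 is independent noise
set.seed(1)
x1  <- rnorm(200)
toy <- data.frame(x1 = x1, x2 = x1 + rnorm(200, sd = 0.1), x3 = rnorm(200))

cm <- cor(toy)

# Keep each pair once by blanking the lower triangle and the diagonal
cm[lower.tri(cm, diag = TRUE)] <- NA

# Find pairs whose absolute correlation exceeds the (assumed) cutoff of 0.8
idx <- which(abs(cm) > 0.8, arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
print(high_pairs)
```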

Table Summary

We summarize the training data of interest in a table grouped by weekday, showing the pattern of shares over the week.

#create a new variable, weekday, to help with the creating plots.
train$weekday <- names(train[25:31])[apply(train[25:31],1, match, x = 1)]
train$weekday <-sub("weekday_is_", "", train$weekday)

#summarize the training data by weekday
summary<-train%>%group_by(weekday)%>%
  summarise(Avg=round(mean(shares),0),Sd=round(sd(shares),0),Median=median(shares),IQR=round(IQR(shares),0))
knitr::kable(summary)
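| weekday   | Avg  | Sd    | Median | IQR  |
|-----------|------|-------|--------|------|
| friday    | 3197 | 4735  | 1600   | 2100 |
| monday    | 4624 | 16315 | 1700   | 2400 |
| saturday  | 4082 | 5502  | 2000   | 2750 |
| sunday    | 3669 | 4714  | 2000   | 2600 |
| thursday  | 3954 | 6662  | 1700   | 2600 |
| tuesday   | 4115 | 14886 | 1500   | 2050 |
| wednesday | 2983 | 4374  | 1500   | 1800 |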

We summarize the training data of interest in the plots below. The histogram of shares shows that it is not normally distributed. After a log transformation, the distribution of log(shares) is closer to a normal distribution.

#histogram of shares and log(shares).
hist(train$shares)

hist(log(train$shares))
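The effect of the log transformation can also be checked numerically. The sketch below compares sample skewness before and after taking logs on simulated right-skewed data; `shares_toy` and the `skew()` helper are hypothetical and stand in for the real `train$shares`.

```r
set.seed(2)

# Simulated right-skewed counts (hypothetical stand-in for train$shares)
shares_toy <- rlnorm(1000, meanlog = 7, sdlog = 1)

# Simple moment-based sample skewness; values near 0 indicate a symmetric distribution
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skew(shares_toy)
skew_log <- skew(log(shares_toy))

print(c(raw = skew_raw, log = skew_log))
```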

Data Plots

Box Plots

We use box plots to show the difference in shares and num_imgs between weekdays and weekends. If the weekend boxes sit higher than the weekday boxes, then articles are shared more often on weekends.

g1<-ggplot(train, aes(x=factor(is_weekend,labels=c("No", "Yes")),y=shares))
g1+geom_boxplot(fill="white", width=0.5,lwd=1.5,color='black',outlier.shape = NA)+
   scale_y_continuous(limits = quantile(train$shares, c(0.1, 0.9)))+
   labs(subtitle = "Shares on weekend",x="On weekend or not")

g2<-ggplot(train, aes(x=factor(is_weekend,labels=c("No", "Yes")),y=num_imgs))
g2+geom_boxplot(fill="white", width=0.5,lwd=1.5,color='black',outlier.shape = NA)+
   scale_y_continuous(limits = quantile(train$num_imgs, c(0, 0.95)))+
   labs(subtitle = "number of images on weekend",x="On weekend or not")

Linear Model

We can inspect the trend of shares as a function of num_imgs. If the points show an upward trend, then articles with more images tend to be shared more often. If we see a negative trend, then articles with more images tend to be shared less often. We can also observe the difference after the log transformation.

g3<-ggplot(train,aes(x=num_imgs,y=shares))
g3+geom_point()+
  labs(subtitle = "num_imgs vs shares")+
  scale_y_continuous(limits = quantile(train$shares, c(0, 0.9)))+
  scale_x_continuous(limits = quantile(train$num_imgs, c(0, 0.9)))+
  geom_smooth(method="lm")

g4<-ggplot(train,aes(x=num_imgs,y=log(shares)))
g4+geom_point()+
  labs(subtitle = "num_imgs vs log(shares)")+
  scale_y_continuous(limits = quantile(log(train$shares), c(0, 0.9)))+
  scale_x_continuous(limits = quantile(train$num_imgs, c(0, 0.9)))+
  geom_smooth(method="lm")

#remove weekday from data set
train<-train%>%select(-weekday)

Models

Regression Tree

Decision trees are machine learning algorithms with several benefits, including ease of use and little pre-processing: the data generally do not need to be normalized or scaled, and some implementations can handle missing values. The results are usually easy to explain, and stakeholders can usually understand them. A regression tree is a decision tree with a numeric response: it repeatedly splits the predictor space and predicts the mean response of the training observations in each terminal node. Despite these benefits, a single tree has drawbacks: its predictions are piecewise constant, it is prone to overfitting, and small changes in the training data can produce a very different tree, so it often does not transfer well to other datasets.

library(tree)
tree.news<-tree(shares~., data=train)
summary(tree.news)
## 
## Regression tree:
## tree(formula = shares ~ ., data = train)
## Variables actually used in tree construction:
## [1] "num_videos"      "LDA_00"          "kw_avg_avg"     
## [4] "n_unique_tokens"
## Number of terminal nodes:  5 
## Residual mean deviance:  79550000 = 1.165e+11 / 1464 
## Distribution of residuals:
##      Min.   1st Qu.    Median      Mean   3rd Qu. 
## -43560.00  -2278.00  -1678.00      0.00    -78.36 
##      Max. 
## 163200.00
plot(tree.news)
text(tree.news, pretty=0)

yhat.regTree<- predict(tree.news, newdata = test)
yhat.test<-test["shares"]
yhat.regTree<-as.data.frame(yhat.regTree)
meanRegTree<-mean((yhat.regTree$yhat.regTree-yhat.test$shares)^2)

RMSE_regTree<-sqrt(meanRegTree)

These results can vary widely depending on the datasets.
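Every model in this report is scored with the same quantity, the square root of the mean squared prediction error. A small helper such as the one sketched below would avoid repeating the formula; the name `rmse` is an assumption and does not appear in the original code.

```r
# Root mean squared error between observed and predicted values
# (hypothetical helper; the report computes this inline for each model)
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Tiny worked example: errors are 1, 0 and -2, so the mean squared error is 5/3
example_rmse <- rmse(c(3, 5, 7), c(2, 5, 9))
print(example_rmse)
```

With this helper, the regression tree score above could be written as `rmse(test$shares, predict(tree.news, newdata = test))`.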

Linear Models

Linear models are valuable, versatile tools that can be applied to many situations. Multiple regression examines the relationship between several independent variables and one dependent variable (in this case, total shares). Regression models let users determine the relative influence of one or more predictor variables on the response, and they also help identify outliers or anomalies. The main disadvantages have to do with the quality of the input data: incomplete input may lead to wrong conclusions. Linear regression also assumes that the observations are independent, which is not always the case.

There are several different types of linear models. In this project, we fit several multiple regression models to the log-transformed response: one using the full set of predictors, and several using subsets from which variables were removed at different points to reduce multicollinearity.

There are also several types of variable selection, including forward, backward, and stepwise, in which user-defined criteria control when variables enter and/or leave the model. Backward selection starts with the full model and removes the least significant variable one at a time until the user-defined stopping criterion is met. Forward selection does the opposite, starting from an empty model and adding variables, and is not represented here.
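As a rough illustration of backward selection, the sketch below uses base R's `step()` on the built-in `mtcars` data. This is an assumption-laden example: `step()` drops terms by AIC rather than by p-values, and it may not match the selection procedure or criteria used for the models in this report.

```r
# Start from the full model with every predictor in the built-in mtcars data
full <- lm(mpg ~ ., data = mtcars)

# Backward elimination: repeatedly drop the term whose removal lowers AIC the most,
# stopping when no single removal improves AIC (trace = 0 silences the step log)
reduced <- step(full, direction = "backward", trace = 0)

# Compare the number of coefficients before and after selection
print(formula(reduced))
print(c(full = length(coef(full)), reduced = length(coef(reduced))))
```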

Linear Regression After Log Transformation

Transform the response with log, then fit a linear regression model with all the variables. Then calculate the RMSE of the model.

lm<- lm(log(shares)~.,train)
summary(lm)
## 
## Call:
## lm(formula = log(shares) ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9490 -0.5683 -0.1498  0.4752  4.1539 
## 
## Coefficients: (4 not defined because of singularities)
##                                Estimate Std. Error
## (Intercept)                   6.903e+00  4.121e-01
## timedelta                     8.946e-04  1.947e-04
## n_tokens_title                1.280e-02  1.335e-02
## n_tokens_content             -7.124e-06  7.247e-05
## n_unique_tokens              -1.128e+00  8.469e-01
## n_non_stop_words             -1.808e+00  7.583e-01
## n_non_stop_unique_tokens      1.099e+00  7.246e-01
## num_hrefs                     5.406e-03  2.731e-03
## num_self_hrefs               -8.279e-03  9.033e-03
## num_imgs                      1.403e-02  4.652e-03
## num_videos                    4.451e-02  1.601e-02
## average_token_length          1.726e-01  1.045e-01
## num_keywords                  3.096e-02  1.725e-02
## kw_min_min                    2.820e-04  5.527e-04
## kw_max_min                    1.573e-05  2.131e-05
## kw_avg_min                   -1.184e-04  1.495e-04
## kw_min_max                    4.595e-07  1.920e-06
## kw_max_max                    6.914e-08  2.097e-07
## kw_avg_max                    3.758e-07  5.836e-07
## kw_min_avg                   -3.432e-05  3.022e-05
## kw_max_avg                   -3.370e-05  8.627e-06
## kw_avg_avg                    2.432e-04  5.410e-05
## self_reference_min_shares     1.531e-05  7.497e-06
## self_reference_max_shares     3.339e-06  4.731e-06
## self_reference_avg_sharess   -1.055e-05  1.125e-05
## weekday_is_monday            -9.674e-02  9.918e-02
## weekday_is_tuesday           -2.289e-01  1.000e-01
## weekday_is_wednesday         -2.564e-01  9.563e-02
## weekday_is_thursday          -1.049e-01  9.762e-02
## weekday_is_friday            -1.750e-01  9.989e-02
## weekday_is_saturday           1.218e-01  1.104e-01
## weekday_is_sunday                    NA         NA
## is_weekend                           NA         NA
## LDA_00                        1.054e-01  9.933e-02
## LDA_01                        1.407e-01  2.711e-01
## LDA_02                       -1.306e-01  2.347e-01
## LDA_03                       -2.486e-01  1.637e-01
## LDA_04                               NA         NA
## global_subjectivity          -1.049e-01  3.952e-01
## global_sentiment_polarity    -9.504e-01  7.686e-01
## global_rate_positive_words    5.815e+00  3.257e+00
## global_rate_negative_words   -5.904e-01  7.377e+00
## rate_positive_words          -1.391e-01  5.881e-01
## rate_negative_words                  NA         NA
## avg_positive_polarity        -2.487e-01  6.187e-01
## min_positive_polarity         1.038e+00  4.863e-01
## max_positive_polarity         5.477e-03  1.887e-01
## avg_negative_polarity        -1.333e-02  5.457e-01
## min_negative_polarity        -1.552e-01  1.854e-01
## max_negative_polarity        -2.218e-01  4.560e-01
## title_subjectivity            6.015e-03  1.110e-01
## title_sentiment_polarity     -1.837e-01  1.059e-01
## abs_title_subjectivity        1.299e-01  1.553e-01
## abs_title_sentiment_polarity  9.282e-02  1.527e-01
##                              t value Pr(>|t|)    
## (Intercept)                   16.750  < 2e-16 ***
## timedelta                      4.594 4.72e-06 ***
## n_tokens_title                 0.959  0.33789    
## n_tokens_content              -0.098  0.92170    
## n_unique_tokens               -1.332  0.18315    
## n_non_stop_words              -2.385  0.01721 *  
## n_non_stop_unique_tokens       1.517  0.12955    
## num_hrefs                      1.980  0.04793 *  
## num_self_hrefs                -0.917  0.35954    
## num_imgs                       3.016  0.00261 ** 
## num_videos                     2.780  0.00551 ** 
## average_token_length           1.651  0.09889 .  
## num_keywords                   1.794  0.07298 .  
## kw_min_min                     0.510  0.60993    
## kw_max_min                     0.738  0.46063    
## kw_avg_min                    -0.792  0.42865    
## kw_min_max                     0.239  0.81090    
## kw_max_max                     0.330  0.74162    
## kw_avg_max                     0.644  0.51976    
## kw_min_avg                    -1.136  0.25618    
## kw_max_avg                    -3.906 9.81e-05 ***
## kw_avg_avg                     4.495 7.53e-06 ***
## self_reference_min_shares      2.043  0.04124 *  
## self_reference_max_shares      0.706  0.48052    
## self_reference_avg_sharess    -0.938  0.34845    
## weekday_is_monday             -0.975  0.32953    
## weekday_is_tuesday            -2.289  0.02223 *  
## weekday_is_wednesday          -2.681  0.00743 ** 
## weekday_is_thursday           -1.074  0.28286    
## weekday_is_friday             -1.752  0.07993 .  
## weekday_is_saturday            1.103  0.27002    
## weekday_is_sunday                 NA       NA    
## is_weekend                        NA       NA    
## LDA_00                         1.061  0.28881    
## LDA_01                         0.519  0.60395    
## LDA_02                        -0.556  0.57799    
## LDA_03                        -1.519  0.12903    
## LDA_04                            NA       NA    
## global_subjectivity           -0.265  0.79082    
## global_sentiment_polarity     -1.236  0.21649    
## global_rate_positive_words     1.786  0.07435 .  
## global_rate_negative_words    -0.080  0.93622    
## rate_positive_words           -0.237  0.81306    
## rate_negative_words               NA       NA    
## avg_positive_polarity         -0.402  0.68776    
## min_positive_polarity          2.134  0.03304 *  
## max_positive_polarity          0.029  0.97685    
## avg_negative_polarity         -0.024  0.98051    
## min_negative_polarity         -0.837  0.40269    
## max_negative_polarity         -0.486  0.62679    
## title_subjectivity             0.054  0.95679    
## title_sentiment_polarity      -1.734  0.08314 .  
## abs_title_subjectivity         0.836  0.40303    
## abs_title_sentiment_polarity   0.608  0.54341    
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9049 on 1419 degrees of freedom
## Multiple R-squared:  0.1083, Adjusted R-squared:  0.07755 
## F-statistic: 3.519 on 49 and 1419 DF,  p-value: 1.548e-14
yhat_lm<-predict(lm,test)
RMSE_lm<-sqrt(mean((test$shares - exp(yhat_lm))^2))
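Because the model is fit to log(shares), its predictions are on the log scale and are passed through `exp()` before the RMSE is computed, so that the score is in the same units as the other models. The sketch below shows the difference between the two scales on simulated data; all names and values in it are hypothetical.

```r
set.seed(3)

# Simulated data (hypothetical): the log of y is linear in x plus noise
n   <- 200
x   <- runif(n)
toy <- data.frame(x = x, y = exp(1 + 2 * x + rnorm(n, sd = 0.5)))
toy_train <- toy[1:140, ]
toy_test  <- toy[141:200, ]

# Fit on the log scale, then predict for the held-out rows
fit      <- lm(log(y) ~ x, data = toy_train)
pred_log <- predict(fit, newdata = toy_test)

# RMSE on the log scale versus RMSE after back-transforming with exp()
rmse_log      <- sqrt(mean((log(toy_test$y) - pred_log)^2))
rmse_original <- sqrt(mean((toy_test$y - exp(pred_log))^2))

print(c(log_scale = rmse_log, original_scale = rmse_original))
```

The two numbers are in different units and cannot be compared with each other; only the back-transformed value is comparable with the RMSE of a model fit directly to the untransformed response.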

Plot the lm Residuals

par(mfrow=c(2,2))
plot(lm)

Looking at the residual plots, the residuals deviate from normality in both tails, indicating that the data, even after transformation, contain extreme outliers in both directions.

Model Removing the Day Variable

#refit without the day-of-week columns, then check for multicollinearity
lmNewTest<-test[ -c(25:31) ]
lm2<- lm(log(shares)~.,newTrain)
yhat_lm2<-predict(lm2,lmNewTest)
RMSE_lm2<-sqrt(mean((lmNewTest$shares - exp(yhat_lm2))^2))

library(mctest)
omcdiag(lm2)
## 
## Call:
## omcdiag(mod = lm2)
## 
## 
## Overall Multicollinearity Diagnostics
## 
##                          MC Results detection
## Determinant |X'X|:     0.000000e+00         1
## Farrar Chi-Square:     1.456565e+05         1
## Red Indicator:         1.790000e-01         0
## Sum of Lambda Inverse: 8.953332e+15         1
## Theil's Method:        2.865380e+01         1
## Condition Number:               NaN        NA
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test
imcdiag(lm2)
## 
## Call:
## imcdiag(mod = lm2)
## 
## 
## All Individual Multicollinearity Diagnostics Result
## 
##                                       VIF    TOL
## timedelta                    3.295300e+00 0.3035
## n_tokens_title               1.145400e+00 0.8731
## n_tokens_content             3.298000e+00 0.3032
## n_unique_tokens              1.508940e+01 0.0663
## n_non_stop_words             4.503600e+15 0.0000
## n_non_stop_unique_tokens     1.300410e+01 0.0769
## num_hrefs                    1.697800e+00 0.5890
## num_self_hrefs               1.301400e+00 0.7684
## num_imgs                     2.871400e+00 0.3483
## num_videos                   1.122700e+00 0.8907
## average_token_length         5.957700e+00 0.1679
## num_keywords                 1.488800e+00 0.6717
## kw_min_min                   3.800900e+00 0.2631
## kw_max_min                   2.079440e+01 0.0481
## kw_avg_min                   2.138410e+01 0.0468
## kw_min_max                   1.863900e+00 0.5365
## kw_max_max                   5.344300e+00 0.1871
## kw_avg_max                   5.703400e+00 0.1753
## kw_min_avg                   2.639000e+00 0.3789
## kw_max_avg                   7.156300e+00 0.1397
## kw_avg_avg                   1.011760e+01 0.0988
## self_reference_min_shares    1.186230e+01 0.0843
## self_reference_max_shares    3.404480e+01 0.0294
## self_reference_avg_sharess   6.179350e+01 0.0162
## is_weekend                   1.132000e+00 0.8834
## LDA_00                                Inf 0.0000
## LDA_01                                Inf 0.0000
## LDA_02                                Inf 0.0000
## LDA_03                                Inf 0.0000
## LDA_04                                Inf 0.0000
## global_subjectivity          2.508200e+00 0.3987
## global_sentiment_polarity    8.185800e+00 0.1222
## global_rate_positive_words   4.507700e+00 0.2218
## global_rate_negative_words   7.763300e+00 0.1288
## rate_positive_words                   Inf 0.0000
## rate_negative_words          9.007199e+15 0.0000
## avg_positive_polarity        5.007500e+00 0.1997
## min_positive_polarity        1.765300e+00 0.5665
## max_positive_polarity        2.822700e+00 0.3543
## avg_negative_polarity        6.476500e+00 0.1544
## min_negative_polarity        4.344500e+00 0.2302
## max_negative_polarity        3.239400e+00 0.3087
## title_subjectivity           2.369800e+00 0.4220
## title_sentiment_polarity     1.681400e+00 0.5948
## abs_title_subjectivity       1.519200e+00 0.6583
## abs_title_sentiment_polarity 2.714300e+00 0.3684
##                                        Wi           Fi
## timedelta                    7.258170e+01 7.428340e+01
## n_tokens_title               4.596800e+00 4.704600e+00
## n_tokens_content             7.266860e+01 7.437240e+01
## n_unique_tokens              4.455366e+02 4.559827e+02
## n_non_stop_words             1.424138e+17 1.457529e+17
## n_non_stop_unique_tokens     3.795974e+02 3.884974e+02
## num_hrefs                    2.206690e+01 2.258420e+01
## num_self_hrefs               9.530100e+00 9.753500e+00
## num_imgs                     5.917730e+01 6.056480e+01
## num_videos                   3.881200e+00 3.972200e+00
## average_token_length         1.567724e+02 1.604481e+02
## num_keywords                 1.545550e+01 1.581790e+01
## kw_min_min                   8.857000e+01 9.064670e+01
## kw_max_min                   6.259414e+02 6.406172e+02
## kw_avg_min                   6.445920e+02 6.597051e+02
## kw_min_max                   2.731850e+01 2.795910e+01
## kw_max_max                   1.373761e+02 1.405970e+02
## kw_avg_max                   1.487309e+02 1.522181e+02
## kw_min_avg                   5.183000e+01 5.304520e+01
## kw_max_avg                   1.946754e+02 1.992398e+02
## kw_avg_avg                   2.883196e+02 2.950796e+02
## self_reference_min_shares    3.434896e+02 3.515430e+02
## self_reference_max_shares    1.044950e+03 1.069450e+03
## self_reference_avg_sharess   1.922425e+03 1.967498e+03
## is_weekend                   4.173600e+00 4.271500e+00
## LDA_00                                Inf          Inf
## LDA_01                                Inf          Inf
## LDA_02                                Inf          Inf
## LDA_03                                Inf          Inf
## LDA_04                                Inf          Inf
## global_subjectivity          4.769130e+01 4.880950e+01
## global_sentiment_polarity    2.272298e+02 2.325575e+02
## global_rate_positive_words   1.109205e+02 1.135211e+02
## global_rate_negative_words   2.138705e+02 2.188849e+02
## rate_positive_words                   Inf          Inf
## rate_negative_words          2.848277e+17 2.915057e+17
## avg_positive_polarity        1.267255e+02 1.296967e+02
## min_positive_polarity        2.420010e+01 2.476750e+01
## max_positive_polarity        5.763820e+01 5.898950e+01
## avg_negative_polarity        1.731792e+02 1.772395e+02
## min_negative_polarity        1.057619e+02 1.082416e+02
## max_negative_polarity        7.081520e+01 7.247550e+01
## title_subjectivity           4.331560e+01 4.433110e+01
## title_sentiment_polarity     2.154670e+01 2.205190e+01
## abs_title_subjectivity       1.641740e+01 1.680230e+01
## abs_title_sentiment_polarity 5.421080e+01 5.548190e+01
##                              Leamer         CVIF Klein
## timedelta                    0.5509 3.549700e+00     1
## n_tokens_title               0.9344 1.233800e+00     1
## n_tokens_content             0.5506 3.552700e+00     1
## n_unique_tokens              0.2574 1.625450e+01     1
## n_non_stop_words             0.0000 4.851357e+15     1
## n_non_stop_unique_tokens     0.2773 1.400830e+01     1
## num_hrefs                    0.7675 1.828900e+00     1
## num_self_hrefs               0.8766 1.401900e+00     1
## num_imgs                     0.5901 3.093100e+00     1
## num_videos                   0.9438 1.209400e+00     1
## average_token_length         0.4097 6.417700e+00     1
## num_keywords                 0.8196 1.603700e+00     1
## kw_min_min                   0.5129 4.094400e+00     1
## kw_max_min                   0.2193 2.240000e+01     1
## kw_avg_min                   0.2162 2.303540e+01     1
## kw_min_max                   0.7325 2.007800e+00     1
## kw_max_max                   0.4326 5.757000e+00     1
## kw_avg_max                   0.4187 6.143800e+00     1
## kw_min_avg                   0.6156 2.842800e+00     1
## kw_max_avg                   0.3738 7.708900e+00     1
## kw_avg_avg                   0.3144 1.089890e+01     1
## self_reference_min_shares    0.2903 1.277830e+01     1
## self_reference_max_shares    0.1714 3.667370e+01     1
## self_reference_avg_sharess   0.1272 6.656500e+01     1
## is_weekend                   0.9399 1.219400e+00     1
## LDA_00                       0.0000          Inf     1
## LDA_01                       0.0000          Inf     1
## LDA_02                       0.0000          Inf     1
## LDA_03                       0.0000          Inf     1
## LDA_04                       0.0000          Inf     1
## global_subjectivity          0.6314 2.701800e+00     1
## global_sentiment_polarity    0.3495 8.817900e+00     1
## global_rate_positive_words   0.4710 4.855700e+00     1
## global_rate_negative_words   0.3589 8.362800e+00     1
## rate_positive_words          0.0000          Inf     1
## rate_negative_words          0.0000 9.702715e+15     1
## avg_positive_polarity        0.4469 5.394100e+00     1
## min_positive_polarity        0.7526 1.901600e+00     1
## max_positive_polarity        0.5952 3.040700e+00     1
## avg_negative_polarity        0.3929 6.976600e+00     1
## min_negative_polarity        0.4798 4.680000e+00     1
## max_negative_polarity        0.5556 3.489600e+00     1
## title_subjectivity           0.6496 2.552800e+00     1
## title_sentiment_polarity     0.7712 1.811200e+00     1
## abs_title_subjectivity       0.8113 1.636500e+00     1
## abs_title_sentiment_polarity 0.6070 2.923900e+00     1
##                                IND1   IND2
## timedelta                    0.0092 0.9615
## n_tokens_title               0.0263 0.1752
## n_tokens_content             0.0091 0.9618
## n_unique_tokens              0.0020 1.2889
## n_non_stop_words             0.0000 1.3804
## n_non_stop_unique_tokens     0.0023 1.2742
## num_hrefs                    0.0178 0.5673
## num_self_hrefs               0.0232 0.3197
## num_imgs                     0.0105 0.8996
## num_videos                   0.0269 0.1509
## average_token_length         0.0051 1.1487
## num_keywords                 0.0203 0.4532
## kw_min_min                   0.0079 1.0172
## kw_max_min                   0.0015 1.3140
## kw_avg_min                   0.0014 1.3158
## kw_min_max                   0.0162 0.6398
## kw_max_max                   0.0056 1.1221
## kw_avg_max                   0.0053 1.1383
## kw_min_avg                   0.0114 0.8573
## kw_max_avg                   0.0042 1.1875
## kw_avg_avg                   0.0030 1.2439
## self_reference_min_shares    0.0025 1.2640
## self_reference_max_shares    0.0009 1.3398
## self_reference_avg_sharess   0.0005 1.3580
## is_weekend                   0.0267 0.1609
## LDA_00                       0.0000 1.3804
## LDA_01                       0.0000 1.3804
## LDA_02                       0.0000 1.3804
## LDA_03                       0.0000 1.3804
## LDA_04                       0.0000 1.3804
## global_subjectivity          0.0120 0.8300
## global_sentiment_polarity    0.0037 1.2117
## global_rate_positive_words   0.0067 1.0741
## global_rate_negative_words   0.0039 1.2025
## rate_positive_words          0.0000 1.3804
## rate_negative_words          0.0000 1.3804
## avg_positive_polarity        0.0060 1.1047
## min_positive_polarity        0.0171 0.5984
## max_positive_polarity        0.0107 0.8913
## avg_negative_polarity        0.0047 1.1672
## min_negative_polarity        0.0069 1.0626
## max_negative_polarity        0.0093 0.9542
## title_subjectivity           0.0127 0.7979
## title_sentiment_polarity     0.0179 0.5594
## abs_title_subjectivity       0.0199 0.4717
## abs_title_sentiment_polarity 0.0111 0.8718
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test
## 
## n_tokens_title , n_tokens_content , n_unique_tokens , n_non_stop_unique_tokens , num_hrefs , num_self_hrefs , average_token_length , num_keywords , kw_min_min , kw_max_min , kw_avg_min , kw_min_max , kw_max_max , kw_avg_max , kw_min_avg , self_reference_max_shares , self_reference_avg_sharess , LDA_00 , LDA_01 , LDA_02 , LDA_03 , LDA_04 , global_subjectivity , global_sentiment_polarity , global_rate_positive_words , global_rate_negative_words , rate_positive_words , avg_positive_polarity , min_positive_polarity , max_positive_polarity , avg_negative_polarity , min_negative_polarity , max_negative_polarity , title_subjectivity , title_sentiment_polarity , coefficient(s) are non-significant may be due to multicollinearity
## 
## R-square of y on all x: 0.1038 
## 
## * use method argument to check which regressors may be the reason of collinearity
## ===================================

Looking at the VIF values, we start by removing LDA_01 through LDA_04 and the positive word rate (rate_positive_words) so that no “infinite” VIF values remain.
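The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others, so a predictor that is an exact linear combination of the others gets an enormous or infinite VIF. The sketch below computes VIFs by hand on toy data in which three columns are proportions that sum to one. The helper `vif_manual` and the toy columns are hypothetical. The infinite values above would be consistent with LDA_00 through LDA_04 behaving like such proportions, though that is an assumption here.

```r
# Manual VIF (hypothetical helper): regress each column on all the others
vif_manual <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    r2 <- summary(lm(reformulate(others, response = v), data = data))$r.squared
    1 / (1 - r2)
  })
}

set.seed(4)

# Toy data: p1 + p2 + p3 = 1 in every row (perfect collinearity), z is unrelated noise
raw   <- matrix(runif(300), ncol = 3)
props <- raw / rowSums(raw)
toy   <- data.frame(p1 = props[, 1], p2 = props[, 2], p3 = props[, 3], z = rnorm(100))

# R may warn about an essentially perfect fit; that is the collinearity showing up
vifs <- suppressWarnings(vif_manual(toy))
print(vifs)

# Dropping one of the proportions removes the exact linear dependence
vifs_trimmed <- vif_manual(toy[, c("p1", "p2", "z")])
print(vifs_trimmed)
```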

First Multicollinearity Trim

The mctest package was used to calculate the VIF values used to assess multicollinearity.

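#columns to drop from both the training and test sets
toRemove<-c( "LDA_01", "LDA_02", "LDA_03", "LDA_04", "rate_positive_words")
trimTrain1 <- newTrain[, ! names(newTrain) %in% toRemove, drop = FALSE]
#index the test set by its own column names rather than those of newTrain
lmNewTest3<-lmNewTest[, ! names(lmNewTest) %in% toRemove, drop = FALSE]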

#Repeat linear Model process
lm3<- lm(log(shares)~., trimTrain1)
yhat_lm3<-predict(lm3,lmNewTest3)
RMSE_lm3<-sqrt(mean((lmNewTest3$shares - exp(yhat_lm3))^2))

imcdiag(lm3)
## 
## Call:
## imcdiag(mod = lm3)
## 
## 
## All Individual Multicollinearity Diagnostics Result
## 
##                                  VIF    TOL        Wi
## timedelta                     3.2699 0.3058   81.0364
## n_tokens_title                1.1414 0.8761    5.0475
## n_tokens_content              3.2836 0.3045   81.5259
## n_unique_tokens              14.8633 0.0673  494.9185
## n_non_stop_words              8.7883 0.1138  278.0435
## n_non_stop_unique_tokens     12.9177 0.0774  425.4610
## num_hrefs                     1.6628 0.6014   23.6631
## num_self_hrefs                1.2777 0.7826    9.9150
## num_imgs                      2.6940 0.3712   60.4747
## num_videos                    1.1015 0.9079    3.6218
## average_token_length          5.8274 0.1716  172.3376
## num_keywords                  1.4467 0.6912   15.9478
## kw_min_min                    3.7898 0.2639   99.5961
## kw_max_min                   20.5350 0.0487  697.3991
## kw_avg_min                   21.0562 0.0475  716.0054
## kw_min_max                    1.8457 0.5418   30.1913
## kw_max_max                    5.1766 0.1932  149.1035
## kw_avg_max                    5.4687 0.1829  159.5309
## kw_min_avg                    2.5959 0.3852   56.9745
## kw_max_avg                    6.9219 0.1445  211.4109
## kw_avg_avg                    9.4362 0.1060  301.1732
## self_reference_min_shares    11.8401 0.0845  386.9910
## self_reference_max_shares    34.0382 0.0294 1179.4627
## self_reference_avg_sharess   61.7741 0.0162 2169.6349
## is_weekend                    1.1259 0.8881    4.4959
## LDA_00                        1.0808 0.9252    2.8848
## global_subjectivity           2.4784 0.4035   52.7806
## global_sentiment_polarity     8.1764 0.1223  256.1982
## global_rate_positive_words    4.4920 0.2226  124.6634
## global_rate_negative_words    7.7206 0.1295  239.9259
## rate_negative_words          10.4657 0.0956  337.9239
## avg_positive_polarity         5.0049 0.1998  142.9765
## min_positive_polarity         1.7589 0.5685   27.0922
## max_positive_polarity         2.8101 0.3559   64.6221
## avg_negative_polarity         6.4400 0.1553  194.2086
## min_negative_polarity         4.3363 0.2306  119.1073
## max_negative_polarity         3.2259 0.3100   79.4645
## title_subjectivity            2.3690 0.4221   48.8717
## title_sentiment_polarity      1.6788 0.5956   24.2347
## abs_title_subjectivity        1.5120 0.6614   18.2782
## abs_title_sentiment_polarity  2.7087 0.3692   61.0014
##                                     Fi Leamer    CVIF
## timedelta                      83.1724 0.5530  3.4432
## n_tokens_title                  5.1806 0.9360  1.2019
## n_tokens_content               83.6749 0.5519  3.4576
## n_unique_tokens               507.9642 0.2594 15.6507
## n_non_stop_words              285.3725 0.3373  9.2539
## n_non_stop_unique_tokens      436.6759 0.2782 13.6020
## num_hrefs                      24.2869 0.7755  1.7509
## num_self_hrefs                 10.1764 0.8847  1.3454
## num_imgs                       62.0688 0.6093  2.8367
## num_videos                      3.7172 0.9528  1.1598
## average_token_length          176.8803 0.4143  6.1361
## num_keywords                   16.3682 0.8314  1.5234
## kw_min_min                    102.2214 0.5137  3.9906
## kw_max_min                    715.7820 0.2207 21.6229
## kw_avg_min                    734.8788 0.2179 22.1717
## kw_min_max                     30.9871 0.7361  1.9435
## kw_max_max                    153.0338 0.4395  5.4508
## kw_avg_max                    163.7360 0.4276  5.7584
## kw_min_avg                     58.4763 0.6207  2.7335
## kw_max_avg                    216.9836 0.3801  7.2886
## kw_avg_avg                    309.1119 0.3255  9.9361
## self_reference_min_shares     397.1918 0.2906 12.4674
## self_reference_max_shares    1210.5524 0.1714 35.8415
## self_reference_avg_sharess   2226.8249 0.1272 65.0468
## is_weekend                      4.6145 0.9424  1.1856
## LDA_00                          2.9609 0.9619  1.1381
## global_subjectivity            54.1719 0.6352  2.6098
## global_sentiment_polarity     262.9514 0.3497  8.6096
## global_rate_positive_words    127.9495 0.4718  4.7300
## global_rate_negative_words    246.2502 0.3599  8.1296
## rate_negative_words           346.8313 0.3091 11.0201
## avg_positive_polarity         146.7453 0.4470  5.2701
## min_positive_polarity          27.8063 0.7540  1.8521
## max_positive_polarity          66.3255 0.5965  2.9590
## avg_negative_polarity         199.3278 0.3941  6.7812
## min_negative_polarity         122.2469 0.4802  4.5661
## max_negative_polarity          81.5591 0.5568  3.3968
## title_subjectivity             50.1599 0.6497  2.4945
## title_sentiment_polarity       24.8735 0.7718  1.7678
## abs_title_subjectivity         18.7600 0.8133  1.5921
## abs_title_sentiment_polarity   62.6093 0.6076  2.8522
##                              Klein   IND1   IND2
## timedelta                        1 0.0086 1.0529
## n_tokens_title                   1 0.0245 0.1879
## n_tokens_content                 1 0.0085 1.0549
## n_unique_tokens                  1 0.0019 1.4148
## n_non_stop_words                 1 0.0032 1.3442
## n_non_stop_unique_tokens         1 0.0022 1.3994
## num_hrefs                        1 0.0168 0.6046
## num_self_hrefs                   1 0.0219 0.3297
## num_imgs                         1 0.0104 0.9538
## num_videos                       0 0.0254 0.1397
## average_token_length             1 0.0048 1.2565
## num_keywords                     1 0.0194 0.4684
## kw_min_min                       1 0.0074 1.1166
## kw_max_min                       1 0.0014 1.4430
## kw_avg_min                       1 0.0013 1.4448
## kw_min_max                       1 0.0152 0.6950
## kw_max_max                       1 0.0054 1.2238
## kw_avg_max                       1 0.0051 1.2395
## kw_min_avg                       1 0.0108 0.9325
## kw_max_avg                       1 0.0040 1.2977
## kw_avg_avg                       1 0.0030 1.3561
## self_reference_min_shares        1 0.0024 1.3887
## self_reference_max_shares        1 0.0008 1.4723
## self_reference_avg_sharess       1 0.0005 1.4923
## is_weekend                       1 0.0249 0.1697
## LDA_00                           0 0.0259 0.1134
## global_subjectivity              1 0.0113 0.9048
## global_sentiment_polarity        1 0.0034 1.3313
## global_rate_positive_words       1 0.0062 1.1791
## global_rate_negative_words       1 0.0036 1.3204
## rate_negative_words              1 0.0027 1.3719
## avg_positive_polarity            1 0.0056 1.2138
## min_positive_polarity            1 0.0159 0.6544
## max_positive_polarity            1 0.0100 0.9771
## avg_negative_polarity            1 0.0043 1.2813
## min_negative_polarity            1 0.0065 1.1670
## max_negative_polarity            1 0.0087 1.0466
## title_subjectivity               1 0.0118 0.8765
## title_sentiment_polarity         1 0.0167 0.6133
## abs_title_subjectivity           1 0.0185 0.5136
## abs_title_sentiment_polarity     1 0.0103 0.9568
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test
## 
## n_tokens_title , n_tokens_content , n_unique_tokens , n_non_stop_unique_tokens , num_hrefs , num_self_hrefs , average_token_length , num_keywords , kw_min_min , kw_max_min , kw_avg_min , kw_min_max , kw_max_max , kw_avg_max , kw_min_avg , self_reference_max_shares , self_reference_avg_sharess , LDA_00 , global_subjectivity , global_sentiment_polarity , global_rate_positive_words , global_rate_negative_words , rate_negative_words , avg_positive_polarity , max_positive_polarity , avg_negative_polarity , min_negative_polarity , max_negative_polarity , title_subjectivity , title_sentiment_polarity , abs_title_subjectivity , abs_title_sentiment_polarity , coefficient(s) are non-significant may be due to multicollinearity
## 
## R-square of y on all x: 0.102 
## 
## * use method argument to check which regressors may be the reason of collinearity
## ===================================

This reduces the multicollinearity in the model, but some still remains. We then remove the variable with the next-highest VIF, one at a time, refitting after each removal, until all VIF values are below 5.
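The one-at-a-time pruning described above can be sketched as a small loop. This is only an illustration under stated assumptions: `vif_values()` and `prune_by_vif()` are hypothetical helpers written here in base R (they are not part of the package that provides `imcdiag()`), and the built-in `mtcars` data stands in for the news data.

```r
# Hypothetical helper: VIF of each column of a numeric predictor data frame.
# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing x_j on the rest.
vif_values <- function(X) {
  sapply(names(X), function(v) {
    fit <- lm(reformulate(setdiff(names(X), v), response = v), data = X)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Hypothetical helper: drop the predictor with the largest VIF, recompute,
# and repeat until every VIF is below `cutoff` (or only two columns remain).
prune_by_vif <- function(X, cutoff = 5) {
  repeat {
    v <- vif_values(X)
    if (max(v) < cutoff || ncol(X) <= 2) break
    X <- X[, names(X) != names(which.max(v)), drop = FALSE]
  }
  X
}

# Illustration on a built-in data set (not the news data)
X    <- mtcars[, c("cyl", "disp", "hp", "drat", "wt", "qsec")]
kept <- prune_by_vif(X, cutoff = 5)
round(vif_values(kept), 2)
```

Recomputing the VIFs after every removal matters, because dropping one variable changes the VIF of every variable that was correlated with it.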

Second Multicollinearity Trim

toRemove<-c("self_reference_avg_sharess", "kw_avg_min", "n_unique_tokens", "rate_negative_words", "kw_avg_avg", "n_non_stop_words", "global_sentiment_polarity", "avg_negative_polarity", "kw_max_max")
trimTrain2 <- trimTrain1[, ! names(trimTrain1) %in% toRemove, drop = FALSE]

#Repeat linear Model process
lm4<- lm(log(shares)~., trimTrain2)

imcdiag(lm4)
## 
## Call:
## imcdiag(mod = lm4)
## 
## 
## All Individual Multicollinearity Diagnostics Result
## 
##                                 VIF    TOL       Wi
## timedelta                    3.1261 0.3199  98.5571
## n_tokens_title               1.1293 0.8855   5.9946
## n_tokens_content             2.1811 0.4585  54.7488
## n_non_stop_unique_tokens     2.5044 0.3993  69.7360
## num_hrefs                    1.6451 0.6079  29.9020
## num_self_hrefs               1.2170 0.8217  10.0603
## num_imgs                     2.0991 0.4764  50.9478
## num_videos                   1.0844 0.9221   3.9146
## average_token_length         2.0090 0.4978  46.7702
## num_keywords                 1.3784 0.7255  17.5397
## kw_min_min                   2.0783 0.4812  49.9863
## kw_max_min                   1.8482 0.5411  39.3180
## kw_min_max                   1.8004 0.5554  37.1028
## kw_avg_max                   4.1915 0.2386 147.9433
## kw_min_avg                   1.8070 0.5534  37.4063
## kw_max_avg                   1.9736 0.5067  45.1306
## self_reference_min_shares    1.4124 0.7080  19.1161
## self_reference_max_shares    1.3616 0.7344  16.7635
## is_weekend                   1.1131 0.8984   5.2415
## LDA_00                       1.0769 0.9286   3.5627
## global_subjectivity          2.0956 0.4772  50.7878
## global_rate_positive_words   1.4562 0.6867  21.1460
## global_rate_negative_words   1.4788 0.6762  22.1926
## avg_positive_polarity        3.1148 0.3210  98.0319
## min_positive_polarity        1.6464 0.6074  29.9648
## max_positive_polarity        2.6356 0.3794  75.8169
## min_negative_polarity        1.7657 0.5663  35.4938
## max_negative_polarity        1.2841 0.7788  13.1680
## title_subjectivity           2.3469 0.4261  62.4332
## title_sentiment_polarity     1.6307 0.6133  29.2339
## abs_title_subjectivity       1.5049 0.6645  23.4056
## abs_title_sentiment_polarity 2.6650 0.3752  77.1814
##                                    Fi Leamer   CVIF
## timedelta                    101.9132 0.5656 3.2184
## n_tokens_title                 6.1987 0.9410 1.1626
## n_tokens_content              56.6132 0.6771 2.2454
## n_non_stop_unique_tokens      72.1107 0.6319 2.5783
## num_hrefs                     30.9203 0.7797 1.6936
## num_self_hrefs                10.4029 0.9065 1.2529
## num_imgs                      52.6826 0.6902 2.1610
## num_videos                     4.0479 0.9603 1.1164
## average_token_length          48.3628 0.7055 2.0682
## num_keywords                  18.1370 0.8518 1.4190
## kw_min_min                    51.6884 0.6937 2.1397
## kw_max_min                    40.6569 0.7356 1.9027
## kw_min_max                    38.3663 0.7453 1.8535
## kw_avg_max                   152.9812 0.4884 4.3152
## kw_min_avg                    38.6801 0.7439 1.8603
## kw_max_avg                    46.6674 0.7118 2.0318
## self_reference_min_shares     19.7671 0.8414 1.4541
## self_reference_max_shares     17.3343 0.8570 1.4018
## is_weekend                     5.4200 0.9478 1.1459
## LDA_00                         3.6840 0.9637 1.1086
## global_subjectivity           52.5173 0.6908 2.1575
## global_rate_positive_words    21.8661 0.8287 1.4991
## global_rate_negative_words    22.9483 0.8223 1.5224
## avg_positive_polarity        101.3701 0.5666 3.2067
## min_positive_polarity         30.9851 0.7793 1.6950
## max_positive_polarity         78.3986 0.6160 2.7133
## min_negative_polarity         36.7025 0.7526 1.8178
## max_negative_polarity         13.6164 0.8825 1.3220
## title_subjectivity            64.5592 0.6528 2.4161
## title_sentiment_polarity      30.2294 0.7831 1.6788
## abs_title_subjectivity        24.2026 0.8152 1.5493
## abs_title_sentiment_polarity  79.8096 0.6126 2.7436
##                              Klein   IND1   IND2
## timedelta                        1 0.0069 1.6528
## n_tokens_title                   1 0.0191 0.2783
## n_tokens_content                 1 0.0099 1.3160
## n_non_stop_unique_tokens         1 0.0086 1.4598
## num_hrefs                        1 0.0131 0.9529
## num_self_hrefs                   1 0.0177 0.4334
## num_imgs                         1 0.0103 1.2725
## num_videos                       0 0.0199 0.1892
## average_token_length             1 0.0107 1.2205
## num_keywords                     1 0.0157 0.6671
## kw_min_min                       1 0.0104 1.2609
## kw_max_min                       1 0.0117 1.1153
## kw_min_max                       1 0.0120 1.0804
## kw_avg_max                       1 0.0051 1.8504
## kw_min_avg                       1 0.0119 1.0853
## kw_max_avg                       1 0.0109 1.1989
## self_reference_min_shares        1 0.0153 0.7096
## self_reference_max_shares        1 0.0158 0.6454
## is_weekend                       1 0.0194 0.2469
## LDA_00                           0 0.0200 0.1734
## global_subjectivity              1 0.0103 1.2706
## global_rate_positive_words       1 0.0148 0.7613
## global_rate_negative_words       1 0.0146 0.7868
## avg_positive_polarity            1 0.0069 1.6500
## min_positive_polarity            1 0.0131 0.9542
## max_positive_polarity            1 0.0082 1.5081
## min_negative_polarity            1 0.0122 1.0539
## max_negative_polarity            1 0.0168 0.5376
## title_subjectivity               1 0.0092 1.3947
## title_sentiment_polarity         1 0.0132 0.9399
## abs_title_subjectivity           1 0.0143 0.8154
## abs_title_sentiment_polarity     1 0.0081 1.5183
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test
## 
## n_tokens_title , n_tokens_content , n_non_stop_unique_tokens , num_self_hrefs , average_token_length , num_keywords , kw_min_min , kw_max_min , kw_min_max , kw_avg_max , kw_min_avg , kw_max_avg , self_reference_max_shares , LDA_00 , global_subjectivity , global_rate_positive_words , global_rate_negative_words , avg_positive_polarity , max_positive_polarity , max_negative_polarity , title_subjectivity , title_sentiment_polarity , abs_title_subjectivity , abs_title_sentiment_polarity , coefficient(s) are non-significant may be due to multicollinearity
## 
## R-square of y on all x: 0.0813 
## 
## * use method argument to check which regressors may be the reason of collinearity
## ===================================

After removing nine more variables with high VIF (> 5), we replot the correlation matrix, which shows far fewer clusters of highly correlated variables.

Replot Correlation

#Remove the response (shares) so that only the predictors are correlated
train_cor<-trimTrain2[, names(trimTrain2) != "shares", drop = FALSE]
res <- cor(train_cor)
palette = colorRampPalette(c("green", "white", "red")) (20)
heatmap(x = res, col = palette, symm = TRUE, cexRow=0.5, cexCol = 0.5)

The new heatmap shows much less prominent clusters of high correlation.

Final Model Fit Prediction

#trim the testing data
newTest1<-test[ -c(25:31) ]
toRemove<-c( "LDA_01", "LDA_02", "LDA_03", "LDA_04", "rate_positive_words", "self_reference_avg_sharess", "kw_avg_min", "n_unique_tokens", "rate_negative_words", "kw_avg_avg", "n_non_stop_words", "global_sentiment_polarity", "avg_negative_polarity", "kw_max_max")

trimTest4 <- newTest1[, ! names(newTest1) %in% toRemove, drop = FALSE]

yhat_lm4<-predict(lm4,trimTest4)
RMSE_lm4<-sqrt(mean((trimTest4$shares - exp(yhat_lm4))^2))

Backward Regression Selection

Log-transform the response, then use backward selection to choose the variables for a linear regression model.

#backward selection after log transformation
library(leaps)
backward<- regsubsets(log(shares)~., trimTrain1, nvmax = 31, method = "backward")
backward_summary<-summary(backward)

#backward_summary[["which"]][size, ]
par(mfrow=c(1,3))
plot(backward_summary$cp, xlab = "Size", ylab = "backward Cp", type = "l")
plot(backward_summary$bic, xlab = "Size", ylab = "backward bic", type = "l")
plot(backward_summary$adjr2, xlab = "Size", ylab = "backward adjR2", type = "l")

coef(backward, which.min(backward_summary$cp))
##                (Intercept)                  timedelta 
##               7.484522e+00               6.530856e-04 
##           n_non_stop_words                  num_hrefs 
##              -9.269018e-01               5.345769e-03 
##                   num_imgs                 num_videos 
##               1.063025e-02               4.352489e-02 
##                 kw_max_avg                 kw_avg_avg 
##              -2.646323e-05               1.914506e-04 
##  self_reference_min_shares                 is_weekend 
##               7.230813e-06               2.108812e-01 
##  global_sentiment_polarity global_rate_positive_words 
##              -1.392669e+00               5.976567e+00 
##      min_positive_polarity   title_sentiment_polarity 
##               6.126773e-01              -1.434343e-01
coef(backward, which.max(backward_summary$adjr2))
##                (Intercept)                  timedelta 
##               7.227471e+00               6.894014e-04 
##            n_unique_tokens           n_non_stop_words 
##              -1.081231e+00              -2.016536e+00 
##   n_non_stop_unique_tokens                  num_hrefs 
##               1.135138e+00               4.220366e-03 
##                   num_imgs                 num_videos 
##               1.185756e-02               4.047826e-02 
##       average_token_length               num_keywords 
##               1.701021e-01               2.298952e-02 
##                 kw_max_avg                 kw_avg_avg 
##              -2.869676e-05               2.016231e-04 
##  self_reference_min_shares self_reference_avg_sharess 
##               1.084693e-05              -3.029088e-06 
##                 is_weekend                     LDA_00 
##               2.230891e-01               1.266156e-01 
##  global_sentiment_polarity global_rate_positive_words 
##              -1.170166e+00               5.246845e+00 
##      min_positive_polarity      min_negative_polarity 
##               8.984881e-01              -1.477351e-01 
##   title_sentiment_polarity 
##              -1.508617e-01
#get best subset of the specified size with min cp.
sub <- backward_summary$which[which.min(backward_summary$cp), ]

# Create test model matrix, prediction, test error
test_model <- model.matrix(log(shares)~ ., data = lmNewTest3)
model <- test_model[, sub]
yhat_back<-model %*% coef(backward, which.min(backward_summary$cp))
RMSE_back<-sqrt(mean((test$shares - exp(yhat_back))^2))
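The prediction step above multiplies selected model-matrix columns by the selected coefficients, which only works if the two line up. A minimal sketch of a name-based version follows. `predict_regsubsets()` is a hypothetical helper (the leaps package does not ship a predict method for `regsubsets` objects), and `mtcars` is used purely for illustration.

```r
library(leaps)

# Hypothetical helper: predict from a regsubsets fit for the model with `id` variables.
predict_regsubsets <- function(object, formula, newdata, id) {
  mat  <- model.matrix(formula, data = newdata)
  beta <- coef(object, id = id)
  # Index by coefficient name so column order cannot get out of step
  drop(mat[, names(beta), drop = FALSE] %*% beta)
}

# Illustration on a built-in data set (not the news data)
set.seed(14)
idx  <- sample(nrow(mtcars), 22)
tr   <- mtcars[idx, ]
te   <- mtcars[-idx, ]
form <- log(mpg) ~ .

fit  <- regsubsets(form, data = tr, nvmax = 10, method = "backward")
best <- which.min(summary(fit)$cp)          # model size with the lowest Cp
yhat <- predict_regsubsets(fit, form, te, id = best)

# Back-transform before computing RMSE on the original scale
rmse <- sqrt(mean((te$mpg - exp(yhat))^2))
rmse
```

Selecting columns by `names(beta)` rather than by a logical vector keeps each coefficient paired with its own column even if the two orderings differ.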

Random Forests

As mentioned in the regression trees section, a random forest builds an entire forest of trees and averages them to get more accurate and stable predictions than a single tree. It is usually trained with the bagging method. Unlike a single regression tree, which is prone to overfitting, a random forest considers only a random subset of the features when splitting each node (cross-validation was used to choose how many variables to try). This adds randomness that decorrelates the trees and makes the prediction more robust.

The manual dimension reduction was necessary to keep processing time manageable for the random forest models.

library(randomForest)
#single bagged model
tree.train<-randomForest(shares~., data=trimTrain1, mtry=32, importance=TRUE)
tree.train
## 
## Call:
##  randomForest(formula = shares ~ ., data = trimTrain1, mtry = 32,      importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 32
## 
##           Mean of squared residuals: 113382819
##                     % Var explained: -20.57
#single bagged regression tree error prediction
tree.test<-lmNewTest3["shares"]
yhat.bag<-predict(tree.train, newdata=lmNewTest3)
yhat.bag<-as.data.frame(yhat.bag)
yhat_bag<-mean((yhat.bag$yhat.bag-tree.test$shares)^2)
RMSE_bag<-sqrt(yhat_bag)

#random forests model
tree.trainRF<-randomForest(shares~., data=trimTrain1, mtry=12, importance=TRUE)
tree.trainRF
## 
## Call:
##  randomForest(formula = shares ~ ., data = trimTrain1, mtry = 12,      importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 12
## 
##           Mean of squared residuals: 101170138
##                     % Var explained: -7.58
#random forest error prediction
yhat.rf<-predict(tree.trainRF, newdata = lmNewTest3)
yhat.rf<-as.data.frame(yhat.rf)
yhat_rf<-mean((yhat.rf$yhat.rf-tree.test$shares)^2)
RMSE_rfTrimmed<-sqrt(yhat_rf)

varImpPlot(tree.trainRF)
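The only difference between the bagged model and the random forest above is `mtry`, the number of predictors tried at each split. The sketch below shows this on the built-in `mtcars` data. The choices `ntree = 200` and `mtry = floor(p / 3)` are assumptions made for the illustration, not the values used in this analysis.

```r
library(randomForest)

set.seed(14)
p <- ncol(mtcars) - 1   # number of predictors (mpg is the response)

# Bagging: every predictor is a candidate at every split
bag <- randomForest(mpg ~ ., data = mtcars, mtry = p, ntree = 200)

# Random forest: only a random subset of predictors is tried at each split
rf <- randomForest(mpg ~ ., data = mtcars, mtry = max(floor(p / 3), 1), ntree = 200)

# Out-of-bag MSE after the last tree, one value per model
c(bagging = tail(bag$mse, 1), random_forest = tail(rf$mse, 1))
```

Setting `mtry` equal to the number of predictors in the training data is what makes a `randomForest()` call a bagged model, so it is worth checking that value against `ncol()` of the training frame minus one.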

Boosted Tree

Boosting is a general approach that can be applied to many statistical learning methods for regression or classification. The trees in boosting are grown sequentially: each tree is grown using information from previously grown trees. Boosting does not involve bootstrap sampling; instead, each tree is fit on a modified version of the original data set.

Procedure (for regression trees):
  1. Initialize the predictions as 0.
  2. Find the residuals (observed - predicted).
  3. Fit a tree with d splits (d+1 terminal nodes), treating the residuals as the response (which they are for the first fit).
  4. Update the predictions by adding a shrunken version of the new tree's predictions.
  5. Update the residuals for the new predictions and repeat B times.

Three tuning parameters must be chosen for the boosted tree model: the shrinkage, the number of trees B, and the number of splits d in each tree.
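The procedure above can be written out by hand in a few lines. This is a minimal sketch under stated assumptions, not how the gbm package implements boosting: `boost_sketch()` is a hypothetical helper, it uses `rpart` trees with squared-error loss, and `maxdepth = d` only approximates "d splits" (the two match exactly when d = 1). The built-in `mtcars` data is used for illustration.

```r
library(rpart)

# Hypothetical helper: boosting for regression with squared-error loss.
# X is a data frame of predictors, y the numeric response.
boost_sketch <- function(X, y, B = 200, d = 1, lambda = 0.05) {
  pred  <- rep(0, length(y))     # 1. initialize predictions at 0
  r     <- y                     # 2. residuals = observed - predicted
  trees <- vector("list", B)
  for (b in seq_len(B)) {
    # 3. fit a small tree to the current residuals
    trees[[b]] <- rpart(r ~ ., data = cbind(X, r = r),
                        control = rpart.control(maxdepth = d, cp = 0, minsplit = 5))
    step <- lambda * predict(trees[[b]], newdata = X)
    pred <- pred + step          # 4. add a shrunken version of the new tree
    r    <- r - step             # 5. update residuals, then repeat
  }
  list(trees = trees, lambda = lambda, fitted = pred)
}

# Illustration on a built-in data set (not the news data)
X   <- mtcars[, c("disp", "hp", "wt", "qsec")]
fit <- boost_sketch(X, mtcars$mpg, B = 200, d = 1, lambda = 0.05)
sqrt(mean((mtcars$mpg - fit$fitted)^2))   # training RMSE
```

In the `caret` grid used in this section, `shrinkage` plays the role of `lambda`, `n.trees` the role of B, and `interaction.depth` the role of d.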

cvcontrol <- trainControl(method="repeatedcv", number = 10,
                          allowParallel=TRUE)
grid <- expand.grid(n.trees = c(1000,1500), 
                    interaction.depth=c(1:3), 
                    shrinkage=c(0.01,0.05,0.1), 
                    n.minobsinnode=c(20))
capture<-capture.output(train.gbm <- train(log(shares) ~ ., 
                   data=train,
                   method="gbm",
                   trControl=cvcontrol,
                   tuneGrid = grid))
train.gbm
## Stochastic Gradient Boosting 
## 
## 1469 samples
##   53 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 1323, 1322, 1323, 1321, 1323, 1322, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.trees  RMSE     
##   0.01       1                  1000     0.9237667
##   0.01       1                  1500     0.9258402
##   0.01       2                  1000     0.9250762
##   0.01       2                  1500     0.9281214
##   0.01       3                  1000     0.9231966
##   0.01       3                  1500     0.9261154
##   0.05       1                  1000     0.9412686
##   0.05       1                  1500     0.9485568
##   0.05       2                  1000     0.9528973
##   0.05       2                  1500     0.9642633
##   0.05       3                  1000     0.9638626
##   0.05       3                  1500     0.9771778
##   0.10       1                  1000     0.9564422
##   0.10       1                  1500     0.9688196
##   0.10       2                  1000     0.9856722
##   0.10       2                  1500     1.0017541
##   0.10       3                  1000     0.9971562
##   0.10       3                  1500     1.0110305
##   Rsquared    MAE      
##   0.03977513  0.7003342
##   0.03912668  0.7001502
##   0.04119647  0.6992808
##   0.04153391  0.7003171
##   0.04693402  0.6962659
##   0.04812201  0.6978224
##   0.03803374  0.7120079
##   0.03745736  0.7163856
##   0.03907406  0.7185559
##   0.03754556  0.7305598
##   0.03744277  0.7295671
##   0.03547716  0.7425868
##   0.03795484  0.7248907
##   0.03629342  0.7354428
##   0.03275410  0.7517526
##   0.03016025  0.7629391
##   0.03346575  0.7585675
##   0.03276467  0.7713565
## 
## Tuning parameter 'n.minobsinnode' was held constant at
##  a value of 20
## RMSE was used to select the optimal model using
##  the smallest value.
## The final values used for the model were n.trees =
##  1000, interaction.depth = 3, shrinkage = 0.01
##  and n.minobsinnode = 20.
boostPred <- predict(train.gbm, newdata = test)
RMSE_boost <- sqrt(mean((test$shares - exp(boostPred))^2))

Comparison

The model with the lowest test RMSE is considered the best.

comparison<-data.frame(RMSE_lm, RMSE_lm2, RMSE_lm3, RMSE_lm4, RMSE_back,  RMSE_bag, RMSE_rfTrimmed, RMSE_boost, RMSE_regTree)

comparison  
##    RMSE_lm RMSE_lm2 RMSE_lm3 RMSE_lm4 RMSE_back RMSE_bag
## 1 6715.992 6694.652  6665.73 6709.166  6726.546 8423.705
##   RMSE_rfTrimmed RMSE_boost RMSE_regTree
## 1       7197.748   6644.584     7352.023
which.min(comparison)
## RMSE_boost 
##          8
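The RMSE values computed in this section are on the original shares scale, so predictions from the log-response models are passed through `exp()` first. A minimal sketch of that calculation follows. The `rmse()` helper is hypothetical and `mtcars` stands in for the news data.

```r
# Hypothetical helper: root mean squared error on whatever scale it is given
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

# Illustration on a built-in data set (not the news data)
fit_log <- lm(log(mpg) ~ wt + hp, data = mtcars)

# Back-transform log-scale predictions before comparing with the raw response
rmse_raw <- rmse(mtcars$mpg, exp(predict(fit_log, newdata = mtcars)))

# RMSE on the log scale is in different units and is not comparable with rmse_raw
rmse_log <- rmse(log(mtcars$mpg), predict(fit_log, newdata = mtcars))

c(raw_scale = rmse_raw, log_scale = rmse_log)
```

Models fit on `log(shares)` and models fit on `shares` can be ranked together only when both errors are computed on the same scale, as they are in the comparison table.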

The overall prediction error for this data set is very high. This is likely due to outlier articles with extremely high share counts: articles that are both timely and viral. These values were not removed from the analysis, because they are the share metrics that a company would most likely want to study and emulate.