
ST 558 Project 2

Introduction

The Online News Popularity data set summarizes a heterogeneous set of features about articles published by Mashable over a period of two years. The goal is to predict the number of shares in social networks (popularity). Here we first show some summary statistics and plots about the data, grouped by weekday. We then create several models to predict the response, shares, within each data channel. The performance of these models is evaluated by RMSE (a small helper is sketched after the list below); the model with the lowest RMSE is selected as the winner. The modeling methods include:

  1. Regression Tree
  2. Log Transformed Full Linear Regression Model
  3. Linear Regression Model Without Day of the Week
  4. Subset Linear Regression Model #1
  5. Subset Linear Regression Model #2
  6. Backward Selection Linear Regression
  7. Bagged Regression Tree
  8. Random Forest Model
  9. Boosted Tree Model
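
As a point of reference, RMSE can be computed with a helper like the minimal sketch below; obs and pred are hypothetical argument names for the observed and predicted shares, and the sections that follow compute RMSE inline instead.

#minimal RMSE helper (illustrative only)
rmse <- function(obs, pred) {
  sqrt(mean((obs - pred)^2))
}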

Data preparation

Subset Data by Channel

library(tidyverse)

data_whole<-read_csv("OnlineNewsPopularity/OnlineNewsPopularity.csv")

#create a new variable, channel, from the six data_channel_is_* dummy columns:
#for each row, find which dummy equals 1, then strip the column-name prefix
data_whole$channel <- names(data_whole[14:19])[apply(data_whole[14:19], 1, match, x = 1)]
data_whole$channel <- sub("data_channel_is_", "", data_whole$channel)

#Subset the data to work on the data channel of interest
channel_interest <- params[[1]]$team

#Get the important data
data_interest<-data_whole%>%
  filter(channel==channel_interest)%>%
  select(-c(1,14:19,62))

Establish Training Data

Split the data into a training set (70% of the data) and a test set (30% of the data).

library(caret)
library(rsample)
set.seed(14)
index <- initial_split(data_interest,
                       prop = 0.7)
train <- training(index)
test <- testing(index)

Data Summaries

Correlation Plots

This plot examines the correlation of all of the different variables against each other.

library(corrplot)
#drop values that are not important (the days of the week)
newTrain<-train[ -c(25:31) ]
lmNewTest<-test[ -c(25:31) ]
#drop the response variable (shares) so only predictors remain
predictTrain<-newTrain[ -c(47) ]
#Calculate the correlation matrix
res <- cor(predictTrain)

#Plot the correlation matrix values by cluster
corrplot(res, type = "upper", order = "hclust",
         tl.col = "black", tl.cex = 0.5)

From the results of this plot, it appears that we likely have some clusters of collinearity.

Table Summary

We summarize the training data of interest in a table grouped by weekday, showing the pattern of shares across the week.

#create a new variable, weekday, to help with creating the plots.
train$weekday <- names(train[25:31])[apply(train[25:31],1, match, x = 1)]
train$weekday <-sub("weekday_is_", "", train$weekday)

#summarize the train data by weekday
summary<-train%>%group_by(weekday)%>%
  summarise(Avg=round(mean(shares),0),Sd=round(sd(shares),0),Median=median(shares),IQR=round(IQR(shares),0))
knitr::kable(summary)
| weekday   |  Avg |   Sd | Median |  IQR |
|:----------|-----:|-----:|-------:|-----:|
| friday    | 2227 | 6305 |   1100 |  954 |
| monday    | 2284 | 5829 |   1100 |  984 |
| saturday  | 2913 | 5393 |   1600 | 1600 |
| sunday    | 2357 | 2774 |   1400 | 1200 |
| thursday  | 2211 | 4870 |   1100 |  914 |
| tuesday   | 2127 | 4741 |   1100 |  937 |
| wednesday | 1891 | 2995 |   1100 |  911 |

We summarize the train data of interest in the plots below. The histogram of shares shows that shares are not normally distributed. After a log transformation, the distribution of log(shares) is much closer to normal.

#histogram of shares and log(shares).
hist(train$shares)

hist(log(train$shares))

Data Plots

Box Plots

We use box plots to show the difference in shares and num_imgs between weekdays and weekends. If the boxes for weekends sit higher than those for weekdays, then articles are shared more often during weekends.

g1<-ggplot(train, aes(x=factor(is_weekend,labels=c("No", "Yes")),y=shares))
g1+geom_boxplot(fill="white", width=0.5,lwd=1.5,color='black',outlier.shape = NA)+
   scale_y_continuous(limits = quantile(train$shares, c(0.1, 0.9)))+
   labs(subtitle = "Shares on weekend",x="On weekend or not")

g2<-ggplot(train, aes(x=factor(is_weekend,labels=c("No", "Yes")),y=num_imgs))
g2+geom_boxplot(fill="white", width=0.5,lwd=1.5,color='black',outlier.shape = NA)+
   scale_y_continuous(limits = quantile(train$num_imgs, c(0, 0.95)))+
   labs(subtitle = "number of images on weekend",x="On weekend or not")

Linear Model

We can inspect the trend of shares as a function of num_imgs. If the points show an upward trend, then articles with more images tend to be shared more often; if we see a negative trend, then articles with more images tend to be shared less often. We can also observe the difference after the log transformation.

g3<-ggplot(train,aes(x=num_imgs,y=shares))
g3+geom_point()+
  labs(subtitle = "num_imgs vs shares")+
  scale_y_continuous(limits = quantile(train$shares, c(0, 0.9)))+
  scale_x_continuous(limits = quantile(train$num_imgs, c(0, 0.9)))+
  geom_smooth(method="lm")

g4<-ggplot(train,aes(x=num_imgs,y=log(shares)))
g4+geom_point()+
  labs(subtitle = "num_imgs vs log(shares)")+
  scale_y_continuous(limits = quantile(log(train$shares), c(0, 0.9)))+
  scale_x_continuous(limits = quantile(train$num_imgs, c(0, 0.9)))+
  geom_smooth(method="lm")

#remove weekday from data set
train<-train%>%select(-weekday)

Models

Regression Tree

Decision trees are machine learning algorithms with several benefits, including ease of use and minimal pre-processing: the data do not require normalization, scaling, or removal of missing values. The results are usually easy to explain, and stakeholders can generally understand them. A regression tree is a decision tree that predicts a continuous numerical response at its terminal nodes. Despite all of these benefits, a single decision tree is prone to overfitting, is sensitive to small changes in the data, and does not transfer well to other datasets.

library(tree)
tree.news<-tree(shares~., data=train)
summary(tree.news)
## 
## Regression tree:
## tree(formula = shares ~ ., data = train)
## Variables actually used in tree construction:
## [1] "num_imgs"              "average_token_length" 
## [3] "avg_negative_polarity" "LDA_02"               
## [5] "LDA_01"               
## Number of terminal nodes:  7 
## Residual mean deviance:  21430000 = 1.262e+11 / 5891 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -32680.0  -1167.0   -848.2      0.0   -201.1 138100.0
plot(tree.news)
text(tree.news, pretty=0)

#predict on the test set and compute the test RMSE
yhat.regTree<- predict(tree.news, newdata = test)
yhat.test<-test["shares"]
yhat.regTree<-as.data.frame(yhat.regTree)
meanRegTree<-mean((yhat.regTree$yhat.regTree-yhat.test$shares)^2)

RMSE_regTree<-sqrt(meanRegTree)

These results can vary widely depending on the dataset.
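
A natural follow-up is cost-complexity pruning at a size chosen by cross-validation, sketched below with cv.tree() and prune.tree() from the tree package; the unpruned tree above is what enters the final comparison.

#prune the tree at the size chosen by cross-validation (illustrative only)
cv.news <- cv.tree(tree.news)
best.size <- cv.news$size[which.min(cv.news$dev)]
pruned.news <- prune.tree(tree.news, best = best.size)
plot(pruned.news)
text(pruned.news, pretty = 0)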

Linear Models

Linear models are valuable, powerful, and versatile tools that can be applied to many situations. Multiple regression examines the relationship between several independent variables and one dependent variable (in this case, total shares). Regression models let users determine the relative influence of one or more predictor variables on the response, and they also allow users to identify outliers or anomalies. The main disadvantages have to do with the quality of the input data: incomplete input may lead to wrong conclusions. Multiple regression also assumes that observations are independent, which is not always the case.

There are several different types of linear models. In this project, we fit multiple regression models on the log-transformed response: one on the full set of predictors, and several on partial subsets with variables removed at different points for multicollinearity reasons.

There are also several different types of variable selection, including forward, backward, and stepwise, in which user-defined criteria set the entry and/or exit conditions of the model. Backward selection starts with a full model and removes the least significant variable one at a time until the user-defined stopping criteria are met. Forward selection does the opposite, and is not represented here.
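
As an illustration only, backward elimination can be run with step(), which drops terms by AIC; this sketch assumes a predictor set without exact linear dependencies (such as trimTrain1, constructed later), and the selection actually used in this report is regsubsets() from the leaps package instead.

#illustrative backward elimination by AIC (not used in the comparison)
full_fit <- lm(log(shares) ~ ., data = trimTrain1)
back_fit <- step(full_fit, direction = "backward", trace = 0)
length(coef(back_fit))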

Linear Regression After Log Transformation

Transform the response with log, then fit a linear regression model with all the variables, and calculate the RMSE of the model.

lm<- lm(log(shares)~.,train)
summary(lm)
## 
## Call:
## lm(formula = log(shares) ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4318 -0.4428 -0.1150  0.2926  4.5366 
## 
## Coefficients: (4 not defined because of singularities)
##                                Estimate Std. Error
## (Intercept)                   6.929e+00  1.725e-01
## timedelta                     4.249e-05  7.046e-05
## n_tokens_title                1.207e-02  5.196e-03
## n_tokens_content             -1.022e-05  4.431e-05
## n_unique_tokens               4.999e-01  3.946e-01
## n_non_stop_words              1.183e+00  3.180e-01
## n_non_stop_unique_tokens     -5.029e-01  3.300e-01
## num_hrefs                     7.042e-03  1.351e-03
## num_self_hrefs                4.730e-03  4.290e-03
## num_imgs                      1.274e-02  2.317e-03
## num_videos                    1.907e-02  7.274e-03
## average_token_length         -3.665e-01  4.905e-02
## num_keywords                 -3.081e-03  6.790e-03
## kw_min_min                    1.297e-03  3.770e-04
## kw_max_min                    3.305e-06  1.292e-05
## kw_avg_min                   -9.782e-06  8.535e-05
## kw_min_max                   -8.005e-07  7.249e-07
## kw_max_max                    2.864e-07  1.358e-07
## kw_avg_max                   -5.764e-07  1.916e-07
## kw_min_avg                   -6.469e-05  1.836e-05
## kw_max_avg                   -4.243e-05  6.333e-06
## kw_avg_avg                    3.263e-04  3.319e-05
## self_reference_min_shares    -8.470e-06  2.961e-06
## self_reference_max_shares    -4.066e-06  1.574e-06
## self_reference_avg_sharess    1.427e-05  4.305e-06
## weekday_is_monday            -2.499e-01  4.636e-02
## weekday_is_tuesday           -3.062e-01  4.561e-02
## weekday_is_wednesday         -3.002e-01  4.560e-02
## weekday_is_thursday          -2.886e-01  4.556e-02
## weekday_is_friday            -2.440e-01  4.680e-02
## weekday_is_saturday           4.363e-02  5.642e-02
## weekday_is_sunday                    NA         NA
## is_weekend                           NA         NA
## LDA_00                        3.419e-01  1.145e-01
## LDA_01                       -2.132e-01  1.361e-01
## LDA_02                       -2.464e-01  6.690e-02
## LDA_03                       -4.818e-02  1.119e-01
## LDA_04                               NA         NA
## global_subjectivity           5.229e-01  1.594e-01
## global_sentiment_polarity    -1.723e-01  3.677e-01
## global_rate_positive_words    7.561e-02  1.706e+00
## global_rate_negative_words    1.707e+00  2.903e+00
## rate_positive_words           2.455e-01  2.145e-01
## rate_negative_words                  NA         NA
## avg_positive_polarity        -3.594e-02  2.522e-01
## min_positive_polarity         7.066e-02  2.270e-01
## max_positive_polarity        -8.266e-02  7.502e-02
## avg_negative_polarity         1.368e-01  2.140e-01
## min_negative_polarity         2.276e-02  7.355e-02
## max_negative_polarity        -3.293e-01  2.044e-01
## title_subjectivity            8.410e-02  5.207e-02
## title_sentiment_polarity      1.349e-01  4.491e-02
## abs_title_subjectivity        1.483e-01  6.785e-02
## abs_title_sentiment_polarity  4.200e-02  7.478e-02
##                              t value Pr(>|t|)    
## (Intercept)                   40.178  < 2e-16 ***
## timedelta                      0.603 0.546503    
## n_tokens_title                 2.323 0.020232 *  
## n_tokens_content              -0.231 0.817517    
## n_unique_tokens                1.267 0.205259    
## n_non_stop_words               3.720 0.000201 ***
## n_non_stop_unique_tokens      -1.524 0.127576    
## num_hrefs                      5.213 1.92e-07 ***
## num_self_hrefs                 1.102 0.270303    
## num_imgs                       5.500 3.96e-08 ***
## num_videos                     2.621 0.008787 ** 
## average_token_length          -7.471 9.14e-14 ***
## num_keywords                  -0.454 0.650037    
## kw_min_min                     3.441 0.000583 ***
## kw_max_min                     0.256 0.798128    
## kw_avg_min                    -0.115 0.908759    
## kw_min_max                    -1.104 0.269505    
## kw_max_max                     2.109 0.034950 *  
## kw_avg_max                    -3.008 0.002638 ** 
## kw_min_avg                    -3.523 0.000430 ***
## kw_max_avg                    -6.699 2.29e-11 ***
## kw_avg_avg                     9.830  < 2e-16 ***
## self_reference_min_shares     -2.860 0.004250 ** 
## self_reference_max_shares     -2.583 0.009822 ** 
## self_reference_avg_sharess     3.316 0.000920 ***
## weekday_is_monday             -5.391 7.28e-08 ***
## weekday_is_tuesday            -6.715 2.06e-11 ***
## weekday_is_wednesday          -6.584 4.97e-11 ***
## weekday_is_thursday           -6.335 2.55e-10 ***
## weekday_is_friday             -5.212 1.93e-07 ***
## weekday_is_saturday            0.773 0.439344    
## weekday_is_sunday                 NA       NA    
## is_weekend                        NA       NA    
## LDA_00                         2.988 0.002824 ** 
## LDA_01                        -1.566 0.117319    
## LDA_02                        -3.684 0.000232 ***
## LDA_03                        -0.431 0.666708    
## LDA_04                            NA       NA    
## global_subjectivity            3.280 0.001044 ** 
## global_sentiment_polarity     -0.469 0.639317    
## global_rate_positive_words     0.044 0.964645    
## global_rate_negative_words     0.588 0.556626    
## rate_positive_words            1.145 0.252434    
## rate_negative_words               NA       NA    
## avg_positive_polarity         -0.143 0.886667    
## min_positive_polarity          0.311 0.755630    
## max_positive_polarity         -1.102 0.270595    
## avg_negative_polarity          0.639 0.522729    
## min_negative_polarity          0.309 0.756976    
## max_negative_polarity         -1.611 0.107229    
## title_subjectivity             1.615 0.106337    
## title_sentiment_polarity       3.004 0.002673 ** 
## abs_title_subjectivity         2.185 0.028924 *  
## abs_title_sentiment_polarity   0.562 0.574392    
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7771 on 5848 degrees of freedom
## Multiple R-squared:  0.1161, Adjusted R-squared:  0.1087 
## F-statistic: 15.67 on 49 and 5848 DF,  p-value: < 2.2e-16
#back-transform predictions with exp() before computing RMSE on the original scale
yhat_lm<-predict(lm,test)
RMSE_lm<-sqrt(mean((test$shares - exp(yhat_lm))^2))
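
One caveat worth noting: exp(yhat) tends to underestimate the conditional mean of shares when the log-scale errors are sizable. A standard correction, sketched below but not used in the comparison, is Duan's smearing estimator.

#Duan's smearing correction for back-transformed predictions (illustrative)
smear <- mean(exp(residuals(lm)))
yhat_smeared <- exp(yhat_lm) * smear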

Plot the lm Residuals

par(mfrow=c(2,2))
plot(lm)

Looking at our residuals, there seems to be skewing in both directions, indicating that the data, even after transformation, contain extreme outliers in both directions.

Model Removing the Day Variable

#look at the data for multicollinearity
lmNewTest<-test[ -c(25:31) ]
lm2<- lm(log(shares)~.,newTrain)
yhat_lm2<-predict(lm2,lmNewTest)
RMSE_lm2<-sqrt(mean((lmNewTest$shares - exp(yhat_lm2))^2))

library(mctest)
omcdiag(lm2)
## 
## Call:
## omcdiag(mod = lm2)
## 
## 
## Overall Multicollinearity Diagnostics
## 
##                           MC Results detection
## Determinant |X'X|:      0.000000e+00         1
## Farrar Chi-Square:               NaN        NA
## Red Indicator:          1.835000e-01         0
## Sum of Lambda Inverse: -2.117766e+14         0
## Theil's Method:         2.692120e+01         1
## Condition Number:       1.170127e+09         1
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test
imcdiag(lm2)
## 
## Call:
## imcdiag(mod = lm2)
## 
## 
## All Individual Multicollinearity Diagnostics Result
## 
##                                       VIF    TOL
## timedelta                    2.024200e+00 0.4940
## n_tokens_title               1.128600e+00 0.8861
## n_tokens_content             3.397500e+00 0.2943
## n_unique_tokens              2.308410e+01 0.0433
## n_non_stop_words             9.007199e+15 0.0000
## n_non_stop_unique_tokens     2.240020e+01 0.0446
## num_hrefs                    1.527600e+00 0.6546
## num_self_hrefs               1.246800e+00 0.8021
## num_imgs                     1.419700e+00 0.7044
## num_videos                   1.101100e+00 0.9082
## average_token_length         1.733450e+01 0.0577
## num_keywords                 1.613700e+00 0.6197
## kw_min_min                   4.319200e+00 0.2315
## kw_max_min                   1.823890e+01 0.0548
## kw_avg_min                   1.727100e+01 0.0579
## kw_min_max                   1.660100e+00 0.6024
## kw_max_max                   5.385300e+00 0.1857
## kw_avg_max                   2.790900e+00 0.3583
## kw_min_avg                   2.385500e+00 0.4192
## kw_max_avg                   7.479000e+00 0.1337
## kw_avg_avg                   7.769400e+00 0.1287
## self_reference_min_shares    1.641230e+01 0.0609
## self_reference_max_shares    1.451010e+01 0.0689
## self_reference_avg_sharess   4.460790e+01 0.0224
## is_weekend                   1.020600e+00 0.9798
## LDA_00                                Inf 0.0000
## LDA_01                                Inf 0.0000
## LDA_02                                Inf 0.0000
## LDA_03                                Inf 0.0000
## LDA_04                                Inf 0.0000
## global_subjectivity          2.818700e+00 0.3548
## global_sentiment_polarity    8.898600e+00 0.1124
## global_rate_positive_words   5.861800e+00 0.1706
## global_rate_negative_words   6.808500e+00 0.1469
## rate_positive_words          9.007199e+15 0.0000
## rate_negative_words          4.503600e+15 0.0000
## avg_positive_polarity        5.694400e+00 0.1756
## min_positive_polarity        1.857400e+00 0.5384
## max_positive_polarity        3.110900e+00 0.3215
## avg_negative_polarity        5.690900e+00 0.1757
## min_negative_polarity        4.228700e+00 0.2365
## max_negative_polarity        2.200900e+00 0.4544
## title_subjectivity           2.579500e+00 0.3877
## title_sentiment_polarity     1.134200e+00 0.8817
## abs_title_subjectivity       1.454300e+00 0.6876
## abs_title_sentiment_polarity 2.275700e+00 0.4394
##                                        Wi           Fi
## timedelta                    1.331925e+02 1.362429e+02
## n_tokens_title               1.672010e+01 1.710310e+01
## n_tokens_content             3.117792e+02 3.189196e+02
## n_unique_tokens              2.871921e+03 2.937693e+03
## n_non_stop_words             1.171336e+18 1.198162e+18
## n_non_stop_unique_tokens     2.782976e+03 2.846711e+03
## num_hrefs                    6.860640e+01 7.017760e+01
## num_self_hrefs               3.209010e+01 3.282500e+01
## num_imgs                     5.458440e+01 5.583450e+01
## num_videos                   1.315170e+01 1.345290e+01
## average_token_length         2.124213e+03 2.172862e+03
## num_keywords                 7.981220e+01 8.164000e+01
## kw_min_min                   4.316426e+02 4.415281e+02
## kw_max_min                   2.241826e+03 2.293169e+03
## kw_avg_min                   2.115948e+03 2.164407e+03
## kw_min_max                   8.584380e+01 8.780980e+01
## kw_max_max                   5.702837e+02 5.833444e+02
## kw_avg_max                   2.329030e+02 2.382370e+02
## kw_min_avg                   1.801806e+02 1.843071e+02
## kw_max_avg                   8.425625e+02 8.618588e+02
## kw_avg_avg                   8.803243e+02 9.004855e+02
## self_reference_min_shares    2.004283e+03 2.050186e+03
## self_reference_max_shares    1.756919e+03 1.797156e+03
## self_reference_avg_sharess   5.670968e+03 5.800845e+03
## is_weekend                   2.676000e+00 2.737300e+00
## LDA_00                                Inf          Inf
## LDA_01                                Inf          Inf
## LDA_02                                Inf          Inf
## LDA_03                                Inf          Inf
## LDA_04                                Inf          Inf
## global_subjectivity          2.365149e+02 2.419316e+02
## global_sentiment_polarity    1.027165e+03 1.050690e+03
## global_rate_positive_words   6.322449e+02 6.467246e+02
## global_rate_negative_words   7.553682e+02 7.726677e+02
## rate_positive_words          1.171336e+18 1.198162e+18
## rate_negative_words          5.856681e+17 5.990811e+17
## avg_positive_polarity        6.104825e+02 6.244638e+02
## min_positive_polarity        1.115049e+02 1.140585e+02
## max_positive_polarity        2.745077e+02 2.807945e+02
## avg_negative_polarity        6.100199e+02 6.239906e+02
## min_negative_polarity        4.198798e+02 4.294959e+02
## max_negative_polarity        1.561745e+02 1.597512e+02
## title_subjectivity           2.054049e+02 2.101091e+02
## title_sentiment_polarity     1.744680e+01 1.784630e+01
## abs_title_subjectivity       5.907720e+01 6.043020e+01
## abs_title_sentiment_polarity 1.658955e+02 1.696949e+02
##                              Leamer        CVIF Klein
## timedelta                    0.7029 2.26130e+00     1
## n_tokens_title               0.9413 1.26070e+00     0
## n_tokens_content             0.5425 3.79530e+00     1
## n_unique_tokens              0.2081 2.57875e+01     1
## n_non_stop_words             0.0000 1.00620e+16     1
## n_non_stop_unique_tokens     0.2113 2.50234e+01     1
## num_hrefs                    0.8091 1.70640e+00     1
## num_self_hrefs               0.8956 1.39280e+00     1
## num_imgs                     0.8393 1.58600e+00     1
## num_videos                   0.9530 1.23010e+00     0
## average_token_length         0.2402 1.93645e+01     1
## num_keywords                 0.7872 1.80270e+00     1
## kw_min_min                   0.4812 4.82500e+00     1
## kw_max_min                   0.2342 2.03748e+01     1
## kw_avg_min                   0.2406 1.92935e+01     1
## kw_min_max                   0.7761 1.85450e+00     1
## kw_max_max                   0.4309 6.01600e+00     1
## kw_avg_max                   0.5986 3.11780e+00     1
## kw_min_avg                   0.6475 2.66490e+00     1
## kw_max_avg                   0.3657 8.35490e+00     1
## kw_avg_avg                   0.3588 8.67930e+00     1
## self_reference_min_shares    0.2468 1.83343e+01     1
## self_reference_max_shares    0.2625 1.62094e+01     1
## self_reference_avg_sharess   0.1497 4.98318e+01     1
## is_weekend                   0.9899 1.14010e+00     0
## LDA_00                       0.0000         Inf     1
## LDA_01                       0.0000         Inf     1
## LDA_02                       0.0000         Inf     1
## LDA_03                       0.0000         Inf     1
## LDA_04                       0.0000         Inf     1
## global_subjectivity          0.5956 3.14880e+00     1
## global_sentiment_polarity    0.3352 9.94070e+00     1
## global_rate_positive_words   0.4130 6.54820e+00     1
## global_rate_negative_words   0.3832 7.60590e+00     1
## rate_positive_words          0.0000 1.00620e+16     1
## rate_negative_words          0.0000 5.03100e+15     1
## avg_positive_polarity        0.4191 6.36130e+00     1
## min_positive_polarity        0.7337 2.07500e+00     1
## max_positive_polarity        0.5670 3.47520e+00     1
## avg_negative_polarity        0.4192 6.35730e+00     1
## min_negative_polarity        0.4863 4.72400e+00     1
## max_negative_polarity        0.6741 2.45870e+00     1
## title_subjectivity           0.6226 2.88160e+00     1
## title_sentiment_polarity     0.9390 1.26700e+00     1
## abs_title_subjectivity       0.8292 1.62460e+00     1
## abs_title_sentiment_polarity 0.6629 2.54220e+00     1
##                                IND1   IND2
## timedelta                    0.0036 0.7250
## n_tokens_title               0.0065 0.1632
## n_tokens_content             0.0022 1.0111
## n_unique_tokens              0.0003 1.3708
## n_non_stop_words             0.0000 1.4329
## n_non_stop_unique_tokens     0.0003 1.3689
## num_hrefs                    0.0048 0.4949
## num_self_hrefs               0.0059 0.2836
## num_imgs                     0.0052 0.4236
## num_videos                   0.0067 0.1316
## average_token_length         0.0004 1.3502
## num_keywords                 0.0046 0.5449
## kw_min_min                   0.0017 1.1011
## kw_max_min                   0.0004 1.3543
## kw_avg_min                   0.0004 1.3499
## kw_min_max                   0.0044 0.5697
## kw_max_max                   0.0014 1.1668
## kw_avg_max                   0.0026 0.9195
## kw_min_avg                   0.0031 0.8322
## kw_max_avg                   0.0010 1.2413
## kw_avg_avg                   0.0009 1.2484
## self_reference_min_shares    0.0004 1.3456
## self_reference_max_shares    0.0005 1.3341
## self_reference_avg_sharess   0.0002 1.4007
## is_weekend                   0.0072 0.0289
## LDA_00                       0.0000 1.4329
## LDA_01                       0.0000 1.4329
## LDA_02                       0.0000 1.4329
## LDA_03                       0.0000 1.4329
## LDA_04                       0.0000 1.4329
## global_subjectivity          0.0026 0.9245
## global_sentiment_polarity    0.0008 1.2718
## global_rate_positive_words   0.0013 1.1884
## global_rate_negative_words   0.0011 1.2224
## rate_positive_words          0.0000 1.4329
## rate_negative_words          0.0000 1.4329
## avg_positive_polarity        0.0013 1.1812
## min_positive_polarity        0.0040 0.6614
## max_positive_polarity        0.0024 0.9723
## avg_negative_polarity        0.0013 1.1811
## min_negative_polarity        0.0017 1.0940
## max_negative_polarity        0.0033 0.7818
## title_subjectivity           0.0028 0.8774
## title_sentiment_polarity     0.0065 0.1695
## abs_title_subjectivity       0.0051 0.4476
## abs_title_sentiment_polarity 0.0032 0.8032
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test
## 
## timedelta , n_tokens_content , n_unique_tokens , n_non_stop_unique_tokens , num_self_hrefs , num_keywords , kw_max_min , kw_avg_min , kw_min_max , LDA_01 , LDA_03 , global_subjectivity , global_sentiment_polarity , global_rate_positive_words , global_rate_negative_words , rate_positive_words , rate_negative_words , avg_positive_polarity , min_positive_polarity , max_positive_polarity , avg_negative_polarity , min_negative_polarity , title_sentiment_polarity , coefficient(s) are non-significant may be due to multicollinearity
## 
## R-square of y on all x: 0.1152 
## 
## * use method argument to check which regressors may be the reason of collinearity
## ===================================

Looking at all of the VIF values, we start by removing all of the LDA values and the positive word rate, which eliminates the "infinite" VIF values.

First Multicollinearity Trim

The mctest package was used to calculate the VIF values for multicollinearity.

toRemove<-c( "LDA_01", "LDA_02", "LDA_03", "LDA_04", "rate_positive_words")
trimTrain1 <- newTrain[, ! names(newTrain) %in% toRemove, drop = F]
lmNewTest3<-lmNewTest[, ! names(newTrain) %in% toRemove, drop = F]

#Repeat linear Model process
lm3<- lm(log(shares)~., trimTrain1)
yhat_lm3<-predict(lm3,lmNewTest3)
RMSE_lm3<-sqrt(mean((lmNewTest3$shares - exp(yhat_lm3))^2))

imcdiag(lm3)
## 
## Call:
## imcdiag(mod = lm3)
## 
## 
## All Individual Multicollinearity Diagnostics Result
## 
##                                  VIF    TOL        Wi
## timedelta                     1.9793 0.5052  143.3981
## n_tokens_title                1.1269 0.8874   18.5752
## n_tokens_content              3.3895 0.2950  349.8795
## n_unique_tokens              22.9825 0.0435 3218.7912
## n_non_stop_words             26.3429 0.0380 3710.8314
## n_non_stop_unique_tokens     22.3663 0.0447 3128.5657
## num_hrefs                     1.5255 0.6555   76.9500
## num_self_hrefs                1.2448 0.8033   35.8503
## num_imgs                      1.4006 0.7140   58.6572
## num_videos                    1.0725 0.9324   10.6180
## average_token_length         17.1660 0.0583 2367.1050
## num_keywords                  1.5785 0.6335   84.7110
## kw_min_min                    4.3129 0.2319  485.0978
## kw_max_min                   18.1931 0.0550 2517.4927
## kw_avg_min                   17.2177 0.0581 2374.6711
## kw_min_max                    1.6536 0.6048   95.6964
## kw_max_max                    5.2773 0.1895  626.2994
## kw_avg_max                    2.6523 0.3770  241.9374
## kw_min_avg                    2.2501 0.4444  183.0446
## kw_max_avg                    7.2032 0.1388  908.3108
## kw_avg_avg                    6.9730 0.1434  874.6032
## self_reference_min_shares    16.3887 0.0610 2253.2967
## self_reference_max_shares    14.5006 0.0690 1976.8238
## self_reference_avg_sharess   44.5629 0.0224 6378.7009
## is_weekend                    1.0199 0.9805    2.9096
## LDA_00                        1.0289 0.9719    4.2282
## global_subjectivity           2.8137 0.3554  265.5692
## global_sentiment_polarity     8.8797 0.1126 1153.7901
## global_rate_positive_words    5.8543 0.1708  710.7860
## global_rate_negative_words    6.7969 0.1471  848.8054
## rate_negative_words          11.7807 0.0849 1578.5581
## avg_positive_polarity         5.6867 0.1759  686.2439
## min_positive_polarity         1.8556 0.5389  125.2824
## max_positive_polarity         3.1106 0.3215  309.0474
## avg_negative_polarity         5.6671 0.1765  683.3843
## min_negative_polarity         4.2227 0.2368  471.8794
## max_negative_polarity         2.1978 0.4550  175.3920
## title_subjectivity            2.5780 0.3879  231.0624
## title_sentiment_polarity      1.1327 0.8828   19.4326
## abs_title_subjectivity        1.4489 0.6902   65.7319
## abs_title_sentiment_polarity  2.2733 0.4399  186.4417
##                                     Fi Leamer    CVIF
## timedelta                     147.1001 0.7108  2.1098
## n_tokens_title                 19.0547 0.9420  1.2011
## n_tokens_content              358.9120 0.5432  3.6129
## n_unique_tokens              3301.8880 0.2086 24.4974
## n_non_stop_words             3806.6307 0.1948 28.0793
## n_non_stop_unique_tokens     3209.3332 0.2114 23.8406
## num_hrefs                      78.9365 0.8096  1.6261
## num_self_hrefs                 36.7758 0.8963  1.3269
## num_imgs                       60.1715 0.8450  1.4929
## num_videos                     10.8921 0.9656  1.1432
## average_token_length         2428.2145 0.2414 18.2975
## num_keywords                   86.8979 0.7959  1.6826
## kw_min_min                    497.6212 0.4815  4.5972
## kw_max_min                   2582.4847 0.2344 19.3922
## kw_avg_min                   2435.9760 0.2410 18.3526
## kw_min_max                     98.1669 0.7777  1.7625
## kw_max_max                    642.4680 0.4353  5.6251
## kw_avg_max                    248.1833 0.6140  2.8271
## kw_min_avg                    187.7701 0.6667  2.3984
## kw_max_avg                    931.7599 0.3726  7.6780
## kw_avg_avg                    897.1821 0.3787  7.4327
## self_reference_min_shares    2311.4682 0.2470 17.4690
## self_reference_max_shares    2027.8578 0.2626 15.4564
## self_reference_avg_sharess   6543.3744 0.1498 47.5003
## is_weekend                      2.9847 0.9902  1.0871
## LDA_00                          4.3373 0.9859  1.0967
## global_subjectivity           272.4251 0.5962  2.9991
## global_sentiment_polarity    1183.5765 0.3356  9.4650
## global_rate_positive_words    729.1357 0.4133  6.2401
## global_rate_negative_words    870.7183 0.3836  7.2449
## rate_negative_words          1619.3103 0.2914 12.5572
## avg_positive_polarity         703.9600 0.4193  6.0615
## min_positive_polarity         128.5167 0.7341  1.9779
## max_positive_polarity         317.0259 0.5670  3.3157
## avg_negative_polarity         701.0266 0.4201  6.0407
## min_negative_polarity         484.0615 0.4866  4.5010
## max_negative_polarity         179.9199 0.6745  2.3427
## title_subjectivity            237.0275 0.6228  2.7480
## title_sentiment_polarity       19.9343 0.9396  1.2074
## abs_title_subjectivity         67.4289 0.8308  1.5444
## abs_title_sentiment_polarity  191.2549 0.6632  2.4231
##                              Klein   IND1   IND2
## timedelta                        1 0.0035 0.7843
## n_tokens_title                   0 0.0061 0.1784
## n_tokens_content                 1 0.0020 1.1175
## n_unique_tokens                  1 0.0003 1.5162
## n_non_stop_words                 1 0.0003 1.5250
## n_non_stop_unique_tokens         1 0.0003 1.5143
## num_hrefs                        1 0.0045 0.5461
## num_self_hrefs                   1 0.0055 0.3118
## num_imgs                         1 0.0049 0.4534
## num_videos                       0 0.0064 0.1072
## average_token_length             1 0.0004 1.4928
## num_keywords                     1 0.0043 0.5809
## kw_min_min                       1 0.0016 1.2176
## kw_max_min                       1 0.0004 1.4980
## kw_avg_min                       1 0.0004 1.4931
## kw_min_max                       1 0.0041 0.6265
## kw_max_max                       1 0.0013 1.2848
## kw_avg_max                       1 0.0026 0.9875
## kw_min_avg                       1 0.0030 0.8807
## kw_max_avg                       1 0.0009 1.3651
## kw_avg_avg                       1 0.0010 1.3578
## self_reference_min_shares        1 0.0004 1.4884
## self_reference_max_shares        1 0.0005 1.4758
## self_reference_avg_sharess       1 0.0002 1.5496
## is_weekend                       0 0.0067 0.0309
## LDA_00                           0 0.0066 0.0445
## global_subjectivity              1 0.0024 1.0218
## global_sentiment_polarity        1 0.0008 1.4066
## global_rate_positive_words       1 0.0012 1.3144
## global_rate_negative_words       1 0.0010 1.3519
## rate_negative_words              1 0.0006 1.4506
## avg_positive_polarity            1 0.0012 1.3064
## min_positive_polarity            1 0.0037 0.7309
## max_positive_polarity            1 0.0022 1.0755
## avg_negative_polarity            1 0.0012 1.3054
## min_negative_polarity            1 0.0016 1.2097
## max_negative_polarity            1 0.0031 0.8639
## title_subjectivity               1 0.0026 0.9703
## title_sentiment_polarity         1 0.0060 0.1857
## abs_title_subjectivity           1 0.0047 0.4911
## abs_title_sentiment_polarity     1 0.0030 0.8878
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test
## 
## timedelta , n_tokens_content , n_unique_tokens , n_non_stop_unique_tokens , num_self_hrefs , num_keywords , kw_max_min , kw_avg_min , kw_min_max , global_sentiment_polarity , global_rate_positive_words , global_rate_negative_words , rate_negative_words , avg_positive_polarity , min_positive_polarity , max_positive_polarity , avg_negative_polarity , min_negative_polarity , max_negative_polarity , title_subjectivity , abs_title_sentiment_polarity , coefficient(s) are non-significant may be due to multicollinearity
## 
## R-square of y on all x: 0.1129 
## 
## * use method argument to check which regressors may be the reason of collinearity
## ===================================

This improves the model's multicollinearity, but some remains. We then pare the model down, removing the variable with the next-highest VIF one at a time, until all values are below 5.
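
The sketch below shows one way this pruning could be automated with vif() from the car package; the function name pruneByVIF is hypothetical, the car dependency is an assumption, and the trims in this report were actually done manually in batches.

#iteratively drop the predictor with the largest VIF until all are below 5
#(illustrative; assumes library(car) and no aliased coefficients)
library(car)
pruneByVIF <- function(df, response = "shares", cutoff = 5) {
  repeat {
    fit <- lm(reformulate(setdiff(names(df), response), response), data = df)
    v <- vif(fit)
    if (max(v) < cutoff) return(df)
    df <- df[, names(df) != names(which.max(v)), drop = FALSE]
  }
}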

Second Multicollinearity Trim

toRemove<-c("self_reference_avg_sharess", "kw_avg_min", "n_unique_tokens", "rate_negative_words", "kw_avg_avg", "n_non_stop_words", "global_sentiment_polarity", "avg_negative_polarity", "kw_max_max")
trimTrain2 <- trimTrain1[, ! names(trimTrain1) %in% toRemove, drop = F]

#Repeat linear Model process
lm4<- lm(log(shares)~., trimTrain2)

imcdiag(lm4)
## 
## Call:
## imcdiag(mod = lm4)
## 
## 
## All Individual Multicollinearity Diagnostics Result
## 
##                                 VIF    TOL       Wi
## timedelta                    1.7739 0.5637 146.4432
## n_tokens_title               1.1057 0.9044  20.0000
## n_tokens_content             2.2194 0.4506 230.7345
## n_non_stop_unique_tokens     4.5488 0.2198 671.5243
## num_hrefs                    1.4165 0.7060  78.8137
## num_self_hrefs               1.2176 0.8213  41.1765
## num_imgs                     1.3162 0.7598  59.8364
## num_videos                   1.0530 0.9497  10.0208
## average_token_length         4.1121 0.2432 588.8884
## num_keywords                 1.4795 0.6759  90.7326
## kw_min_min                   1.8596 0.5378 162.6550
## kw_max_min                   2.0424 0.4896 197.2479
## kw_min_max                   1.6281 0.6142 118.8556
## kw_avg_max                   2.0572 0.4861 200.0526
## kw_min_avg                   1.4678 0.6813  88.5134
## kw_max_avg                   2.0913 0.4782 206.5000
## self_reference_min_shares    1.5139 0.6606  97.2348
## self_reference_max_shares    1.5342 0.6518 101.0794
## is_weekend                   1.0163 0.9839   3.0888
## LDA_00                       1.0217 0.9788   4.0997
## global_subjectivity          2.4836 0.4026 280.7339
## global_rate_positive_words   1.8660 0.5359 163.8656
## global_rate_negative_words   1.4861 0.6729  91.9893
## avg_positive_polarity        3.5263 0.2836 478.0412
## min_positive_polarity        1.7280 0.5787 137.7648
## max_positive_polarity        3.0311 0.3299 384.3438
## min_negative_polarity        1.7947 0.5572 150.3777
## max_negative_polarity        1.2328 0.8111  44.0551
## title_subjectivity           2.5720 0.3888 297.4680
## title_sentiment_polarity     1.1194 0.8933  22.6000
## abs_title_subjectivity       1.4455 0.6918  84.3000
## abs_title_sentiment_polarity 2.2647 0.4416 239.3212
##                                    Fi Leamer   CVIF
## timedelta                    151.3504 0.7508 1.8198
## n_tokens_title                20.6702 0.9510 1.1343
## n_tokens_content             238.4663 0.6713 2.2767
## n_non_stop_unique_tokens     694.0267 0.4689 4.6664
## num_hrefs                     81.4548 0.8402 1.4531
## num_self_hrefs                42.5563 0.9062 1.2491
## num_imgs                      61.8415 0.8716 1.3502
## num_videos                    10.3566 0.9745 1.0802
## average_token_length         608.6217 0.4931 4.2184
## num_keywords                  93.7730 0.8221 1.5177
## kw_min_min                   168.1054 0.7333 1.9077
## kw_max_min                   203.8576 0.6997 2.0952
## kw_min_max                   122.8384 0.7837 1.6702
## kw_avg_max                   206.7562 0.6972 2.1104
## kw_min_avg                    91.4795 0.8254 1.5057
## kw_max_avg                   213.4197 0.6915 2.1453
## self_reference_min_shares    100.4931 0.8128 1.5530
## self_reference_max_shares    104.4665 0.8074 1.5738
## is_weekend                     3.1923 0.9919 1.0426
## LDA_00                         4.2371 0.9893 1.0481
## global_subjectivity          290.1411 0.6345 2.5478
## global_rate_positive_words   169.3567 0.7321 1.9142
## global_rate_negative_words    95.0719 0.8203 1.5246
## avg_positive_polarity        494.0602 0.5325 3.6175
## min_positive_polarity        142.3812 0.7607 1.7727
## max_positive_polarity        397.2229 0.5744 3.1095
## min_negative_polarity        155.4168 0.7465 1.8411
## max_negative_polarity         45.5314 0.9006 1.2647
## title_subjectivity           307.4360 0.6235 2.6385
## title_sentiment_polarity      23.3573 0.9452 1.1484
## abs_title_subjectivity        87.1249 0.8317 1.4829
## abs_title_sentiment_polarity 247.3407 0.6645 2.3233
##                              Klein   IND1   IND2
## timedelta                        1 0.0030 1.1119
## n_tokens_title                   1 0.0048 0.2436
## n_tokens_content                 1 0.0024 1.4002
## n_non_stop_unique_tokens         1 0.0012 1.9883
## num_hrefs                        1 0.0037 0.7494
## num_self_hrefs                   1 0.0043 0.4555
## num_imgs                         1 0.0040 0.6123
## num_videos                       0 0.0050 0.1282
## average_token_length             1 0.0013 1.9288
## num_keywords                     1 0.0036 0.8260
## kw_min_min                       1 0.0028 1.1781
## kw_max_min                       1 0.0026 1.3007
## kw_min_max                       1 0.0032 0.9832
## kw_avg_max                       1 0.0026 1.3097
## kw_min_avg                       1 0.0036 0.8122
## kw_max_avg                       1 0.0025 1.3299
## self_reference_min_shares        1 0.0035 0.8651
## self_reference_max_shares        1 0.0034 0.8874
## is_weekend                       0 0.0052 0.0409
## LDA_00                           0 0.0052 0.0540
## global_subjectivity              1 0.0021 1.5224
## global_rate_positive_words       1 0.0028 1.1828
## global_rate_negative_words       1 0.0036 0.8337
## avg_positive_polarity            1 0.0015 1.8259
## min_positive_polarity            1 0.0031 1.0738
## max_positive_polarity            1 0.0017 1.7078
## min_negative_polarity            1 0.0029 1.1285
## max_negative_polarity            1 0.0043 0.4813
## title_subjectivity               1 0.0021 1.5577
## title_sentiment_polarity         1 0.0047 0.2719
## abs_title_subjectivity           1 0.0037 0.7855
## abs_title_sentiment_polarity     1 0.0023 1.4233
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test
## 
## timedelta , n_tokens_content , n_non_stop_unique_tokens , num_self_hrefs , kw_max_min , kw_min_max , kw_avg_max , self_reference_min_shares , global_rate_negative_words , avg_positive_polarity , min_positive_polarity , max_positive_polarity , title_subjectivity , abs_title_sentiment_polarity , coefficient(s) are non-significant may be due to multicollinearity
## 
## R-square of y on all x: 0.0856 
## 
## * use method argument to check which regressors may be the reason of collinearity
## ===================================

After removing nine more variables for obvious multicollinearity via VIF (> 5), for a total of 14 removed, we replot the correlation matrix, which shows much less clustering of high correlations.

Replot Correlation

#Remove the response variable to keep only predictors
train_cor<-trimTrain2[1:31]
res <- cor(train_cor)
palette = colorRampPalette(c("green", "white", "red")) (20)
heatmap(x = res, col = palette, symm = TRUE, cexRow=0.5, cexCol = 0.5)

The new heatmap shows less prominent clustering of highly correlated variables.

Final Model Fit Prediction

#trim the testing data
newTest1<-test[ -c(25:31) ]
toRemove<-c( "LDA_01", "LDA_02", "LDA_03", "LDA_04", "rate_positive_words", "self_reference_avg_sharess", "kw_avg_min", "n_unique_tokens", "rate_negative_words", "kw_avg_avg", "n_non_stop_words", "global_sentiment_polarity", "avg_negative_polarity", "kw_max_max")

trimTest4 <- newTest1[, ! names(newTest1) %in% toRemove, drop = F]

yhat_lm4<-predict(lm4,trimTest4)
RMSE_lm4<-sqrt(mean((trimTest4$shares - exp(yhat_lm4))^2))

Backward Selection Linear Regression

Transform the response with log, then fit a linear regression model with the variables chosen by backward selection.

#backward selection after log transformation
library(leaps)
backward<- regsubsets(log(shares)~., trimTrain1, nvmax = 31, method = "backward")
backward_summary<-summary(backward)

#backward_summary[["which"]][size, ]
par(mfrow=c(1,3))
plot(backward_summary$cp, xlab = "Size", ylab = "backward Cp", type = "l")
plot(backward_summary$bic, xlab = "Size", ylab = "backward bic", type = "l")
plot(backward_summary$adjr2, xlab = "Size", ylab = "backward adjR2", type = "l")

coef(backward, which.min(backward_summary$cp))
##                (Intercept)             n_tokens_title 
##               6.402754e+00               1.258547e-02 
##            n_unique_tokens           n_non_stop_words 
##               7.475270e-01               1.439049e+00 
##   n_non_stop_unique_tokens                  num_hrefs 
##              -5.830464e-01               7.030187e-03 
##                   num_imgs                 num_videos 
##               1.276199e-02               2.061951e-02 
##       average_token_length                 kw_min_min 
##              -3.837005e-01               1.327585e-03 
##                 kw_max_max                 kw_avg_max 
##               2.969066e-07              -7.210723e-07 
##                 kw_min_avg                 kw_max_avg 
##              -8.348655e-05              -4.614613e-05 
##                 kw_avg_avg  self_reference_min_shares 
##               3.617624e-04              -8.186538e-06 
##  self_reference_max_shares self_reference_avg_sharess 
##              -3.723456e-06               1.375061e-05 
##                 is_weekend                     LDA_00 
##               2.968016e-01               5.153536e-01 
##        global_subjectivity        rate_negative_words 
##               5.066961e-01              -1.684708e-01 
##      max_positive_polarity         title_subjectivity 
##              -1.097451e-01               1.056518e-01 
##   title_sentiment_polarity     abs_title_subjectivity 
##               1.312288e-01               1.455139e-01
coef(backward, which.max(backward_summary$adjr2))
##                (Intercept)             n_tokens_title 
##               6.409383e+00               1.235084e-02 
##            n_unique_tokens           n_non_stop_words 
##               6.398954e-01               1.395293e+00 
##   n_non_stop_unique_tokens                  num_hrefs 
##              -5.287521e-01               6.892278e-03 
##             num_self_hrefs                   num_imgs 
##               4.304481e-03               1.260402e-02 
##                 num_videos       average_token_length 
##               2.051658e-02              -3.808777e-01 
##                 kw_min_min                 kw_max_max 
##               1.320596e-03               2.933964e-07 
##                 kw_avg_max                 kw_min_avg 
##              -7.206657e-07              -8.445304e-05 
##                 kw_max_avg                 kw_avg_avg 
##              -4.597711e-05               3.616949e-04 
##  self_reference_min_shares  self_reference_max_shares 
##              -8.416760e-06              -4.019270e-06 
## self_reference_avg_sharess                 is_weekend 
##               1.427997e-05               2.969454e-01 
##                     LDA_00        global_subjectivity 
##               5.147310e-01               5.531846e-01 
##        rate_negative_words      max_positive_polarity 
##              -1.238529e-01              -9.722964e-02 
##      avg_negative_polarity      max_negative_polarity 
##               1.376864e-01              -3.175778e-01 
##         title_subjectivity   title_sentiment_polarity 
##               1.087234e-01               1.288545e-01 
##     abs_title_subjectivity 
##               1.449588e-01
#get best subset of the specified size with min cp.
sub <- backward_summary$which[which.min(backward_summary$cp), ]

# Create test model matrix, prediction, and test error
test_model <- model.matrix(log(shares)~ ., data = lmNewTest3)
model <- test_model[, sub]
yhat_back<-model %*% coef(backward, which.min(backward_summary$cp))
RMSE_back<-sqrt(mean((test$shares - exp(yhat_back))^2))

Random Forests

As previously mentioned in the regression trees section, a random forest builds an entire forest of trees and merges them together to get more accurate and stable predictions than a single tree. It is usually trained using the bagging method. Unlike regression trees, which are prone to overfitting, only a random subset of the features is considered by the algorithm when splitting a node (cross-validation can be used to choose how many variables to consider). This decorrelates the individual trees and makes the prediction more robust.

The manual dimension reduction above was necessary to keep the random forest model computationally tractable.
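
A hedged sketch of choosing mtry by cross-validation with caret is given below; it is illustrative only (and slow on data of this size), and the fitted models that enter the comparison use fixed mtry values instead.

#cross-validated tuning of mtry (illustrative only; not run for the comparison)
rfTune <- train(shares ~ ., data = trimTrain1,
                method = "rf",
                trControl = trainControl(method = "cv", number = 5),
                tuneGrid = expand.grid(mtry = c(6, 12, 20)))
rfTune$bestTune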

library(randomForest)
#bagged-style model: mtry is set near the number of predictors
tree.train<-randomForest(shares~., data=trimTrain1, mtry=32, importance=TRUE)
tree.train
## 
## Call:
##  randomForest(formula = shares ~ ., data = trimTrain1, mtry = 32,      importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 32
## 
##           Mean of squared residuals: 25689972
##                     % Var explained: -6.26
#single bagged regression tree error prediction
tree.test<-lmNewTest3["shares"]
yhat.bag<-predict(tree.train, newdata=lmNewTest3)
yhat.bag<-as.data.frame(yhat.bag)
yhat_bag<-mean((yhat.bag$yhat.bag-tree.test$shares)^2)
RMSE_bag<-sqrt(yhat_bag)

#random forests model
tree.trainRF<-randomForest(shares~., data=trimTrain1, mtry=12, importance=TRUE)
tree.trainRF
## 
## Call:
##  randomForest(formula = shares ~ ., data = trimTrain1, mtry = 12,      importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 12
## 
##           Mean of squared residuals: 24336624
##                     % Var explained: -0.66
#random forest error prediction
yhat.rf<-predict(tree.trainRF, newdata = lmNewTest3)
yhat.rf<-as.data.frame(yhat.rf)
yhat_rf<-mean((yhat.rf$yhat.rf-tree.test$shares)^2)
RMSE_rfTrimmed<-sqrt(yhat_rf)

varImpPlot(tree.trainRF)

Boosted Tree

Boosting is a general approach that can be applied to many statistical learning methods for regression or classification. The trees in boosting are grown sequentially: each tree is grown using information from previously grown trees. Boosting does not involve bootstrap sampling; instead, each tree is fit on a modified version of the original data set.

Procedure (for regression trees):

  1. Initialize the predictions as 0.
  2. Find the residuals (observed minus predicted) and treat them as the new response.
  3. Fit a tree with d splits (d + 1 terminal nodes), treating the residuals as the response (which they are for the first fit).
  4. Update the predictions by adding a shrunken version of the new tree's predictions.
  5. Update the residuals using the new predictions and repeat B times.

The tuning parameters, namely the shrinkage, the number of trees B, and the interaction depth d, must be chosen for the boosted tree model.
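
The toy function below implements the loop above for squared-error loss; it is a minimal sketch, assuming the tree package, and the names boostToy, x_df, B, and lambda are hypothetical. The model actually fit below uses gbm through caret.

#toy boosting loop (illustrative only; the real fit uses caret/gbm below)
library(tree)
boostToy <- function(x_df, y, B = 100, lambda = 0.1) {
  pred <- rep(0, length(y))                              #1. initialize at 0
  for (b in seq_len(B)) {
    r <- y - pred                                        #2. residuals as response
    fit <- tree(r ~ ., data = data.frame(x_df, r = r))   #3. fit a small tree
    pred <- pred + lambda * predict(fit, newdata = x_df) #4. shrunken update
  }
  pred                                                   #5. after B rounds
}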

cvcontrol <- trainControl(method="repeatedcv", number = 10,
                          allowParallel=TRUE)
grid <- expand.grid(n.trees = c(1000,1500), 
                    interaction.depth=c(1:3), 
                    shrinkage=c(0.01,0.05,0.1), 
                    n.minobsinnode=c(20))
capture<-capture.output(train.gbm <- train(log(shares) ~ ., 
                   data=train,
                   method="gbm",
                   trControl=cvcontrol,
                   tuneGrid = grid))
train.gbm
## Stochastic Gradient Boosting 
## 
## 5898 samples
##   53 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 5309, 5309, 5307, 5309, 5309, 5309, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.trees
##   0.01       1                  1000   
##   0.01       1                  1500   
##   0.01       2                  1000   
##   0.01       2                  1500   
##   0.01       3                  1000   
##   0.01       3                  1500   
##   0.05       1                  1000   
##   0.05       1                  1500   
##   0.05       2                  1000   
##   0.05       2                  1500   
##   0.05       3                  1000   
##   0.05       3                  1500   
##   0.10       1                  1000   
##   0.10       1                  1500   
##   0.10       2                  1000   
##   0.10       2                  1500   
##   0.10       3                  1000   
##   0.10       3                  1500   
##   RMSE       Rsquared   MAE      
##   0.7704382  0.1253222  0.5444390
##   0.7677343  0.1292380  0.5425947
##   0.7643665  0.1370172  0.5405721
##   0.7622319  0.1408862  0.5398244
##   0.7617169  0.1424937  0.5392324
##   0.7600554  0.1455175  0.5391943
##   0.7658797  0.1328486  0.5446494
##   0.7662343  0.1330982  0.5464222
##   0.7658943  0.1350531  0.5460199
##   0.7679272  0.1333796  0.5485868
##   0.7619294  0.1449790  0.5442774
##   0.7655236  0.1414328  0.5487261
##   0.7681046  0.1305679  0.5491781
##   0.7722241  0.1251168  0.5525444
##   0.7727992  0.1287123  0.5536598
##   0.7789730  0.1230637  0.5601528
##   0.7833594  0.1176739  0.5637165
##   0.7919707  0.1128793  0.5737120
## 
## Tuning parameter 'n.minobsinnode' was
##  held constant at a value of 20
## RMSE was used to select the optimal
##  model using the smallest value.
## The final values used for the model
##  were n.trees = 1500, interaction.depth =
##  3, shrinkage = 0.01 and n.minobsinnode = 20.
boostPred <- predict(train.gbm, newdata = test)
RMSE_boost <- sqrt(mean((test$shares - exp(boostPred))^2))

Comparison

Generally, the model with the lowest RMSE wins the comparison.

comparison<-data.frame(RMSE_lm, RMSE_lm2, RMSE_lm3, RMSE_lm4, RMSE_back,  RMSE_bag, RMSE_rfTrimmed, RMSE_boost, RMSE_regTree)

comparison  
##    RMSE_lm RMSE_lm2 RMSE_lm3 RMSE_lm4
## 1 8250.091  8257.97 8260.785 8229.648
##   RMSE_back RMSE_bag RMSE_rfTrimmed RMSE_boost
## 1  8259.882  8220.41       8129.113   8199.403
##   RMSE_regTree
## 1     8299.325
which.min(comparison)
## RMSE_rfTrimmed 
##              7

The overall prediction error for this data set is very high. This is likely due to outlier articles with extremely high share counts, articles that are both timely AND viral. These values were NOT removed from the analysis, as these are exactly the share metrics that a company would want to evaluate for emulation.