ST 558 Project 2
Yan Liu & Mandy Liesch
library(rmarkdown)
library(usethis)
use_git_config(user.name="Mandy Liesch", user.email="amliesch@ncsu.edu")
Introduction
The Online News Popularity data set summarizes a heterogeneous set of features about articles published by Mashable over a period of two years. The goal is to predict the number of shares in social networks (popularity). We first present summary statistics and plots of the training data grouped by day of the week, and then fit several models to predict the response, shares, for each data channel. Model performance is evaluated with RMSE on a held-out test set, and the model with the lowest RMSE is selected as the winner (a small RMSE helper is sketched after the list of methods below). The modeling methods include:
- Regression Tree
- Log Transformed Full Linear Regression Model
- Linear Regression Model Without Day of the Week
- Subset Linear Regression Model #1
- Subset Linear Regression Model #2
- Backward Selection Linear Regression
- Bagged Regression Tree
- Random Forest Model
- Boosted Tree Model
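Model performance throughout the report is compared with test-set RMSE on the original shares scale. As a minimal sketch (the helper name rmse is our own and is not reused in the code below), the calculation is:
#Root mean squared error between observed and predicted shares.
#Predictions from log-scale models are back-transformed with exp()
#before this calculation, as done throughout the report.
rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}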
Data preparation
Subset Data by Channel
library(tidyverse)
data_whole<-read_csv("OnlineNewsPopularity/OnlineNewsPopularity.csv")
#create a new variable, channel, to help with the subsetting.
data_whole$channel <- names(data_whole[14:19])[apply(data_whole[14:19],1, match, x = 1)]
data_whole$channel <-sub("data_channel_is_", "", data_whole$channel)
#Subset the data to work on the data channel of interest
#Set the data channel of interest from the report parameters
channel_interest <- params[[1]]$team
#Get the important data
data_interest<-data_whole%>%
filter(channel==channel_interest)%>%
select(-c(1,14:19,62))
Establish Training Data
Split the data into a training (70% of the data) and test set (30% of the data)
library(caret)
library(rsample)
set.seed(14)
index <- initial_split(data_interest,
prop = 0.7)
train <- training(index)
test <- testing(index)
Data Summaries
Correlation Plots
This plot examines the pairwise correlations of all of the numeric predictor variables against each other.
library(corrplot)
#drop values that are not important (the days of the week)
newTrain<-train[ -c(25:31) ]
lmNewTest<-test[ -c(25:31) ]
#drop the response variable, shares
predictTrain<-newTrain[ -c(47) ]
#Calculate the correlation Matrix and round it
res <- cor(predictTrain)
#Plot the correlation matrix values by cluster
corrplot(res, type = "upper", order = "hclust",
tl.col = "black", tl.cex = 0.5)
From this plot, it appears that we likely have some clusters of collinearity.
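To make those clusters concrete, one optional check (a sketch only, not part of the modeling below; it uses the caret package loaded earlier, and the 0.8 cutoff is an arbitrary illustrative choice) is to ask findCorrelation() which predictors are involved in high pairwise correlations:
#List predictors involved in pairwise correlations above 0.8
highCorr <- caret::findCorrelation(res, cutoff = 0.8, names = TRUE)
highCorr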
Table Summary
We summarize the training data of interest in a table grouped by weekday, showing the pattern of shares across the week.
#create a new variable, weekday, to help with the creating plots.
train$weekday <- names(train[25:31])[apply(train[25:31],1, match, x = 1)]
train$weekday <-sub("weekday_is_", "", train$weekday)
#summarize the train data by weekday
summary<-train%>%group_by(weekday)%>%
summarise(Avg=round(mean(shares),0),Sd=round(sd(shares),0),Median=median(shares),IQR=round(IQR(shares),0))
knitr::kable(summary)
| weekday | Avg | Sd | Median | IQR |
|---|---|---|---|---|
| friday | 3110 | 5914 | 1800 | 1800 |
| monday | 2934 | 4040 | 1600 | 2000 |
| saturday | 3671 | 5946 | 2300 | 2100 |
| sunday | 3862 | 6318 | 2200 | 2300 |
| thursday | 2761 | 4216 | 1600 | 1700 |
| tuesday | 2854 | 4486 | 1600 | 1600 |
| wednesday | 3449 | 21274 | 1600 | 1800 |
We summarize the training data of interest in the plots below. The histogram of shares shows that the distribution is strongly right-skewed rather than normal. After a log transformation, the distribution of log(shares) is much closer to normal.
#histogram of shares and log(shares).
hist(train$shares)
hist(log(train$shares))
Data Plots
Box Plots
We use box plots to show the difference in shares and num_imgs between weekdays and weekends. If the boxes for weekends are higher than the ones for weekdays, then articles are shared more often on weekends.
g1<-ggplot(train, aes(x=factor(is_weekend,labels=c("No", "Yes")),y=shares))
g1+geom_boxplot(fill="white", width=0.5,lwd=1.5,color='black',outlier.shape = NA)+
scale_y_continuous(limits = quantile(train$shares, c(0.1, 0.9)))+
labs(subtitle = "Shares on weekend",x="On weekend or not")
g2<-ggplot(train, aes(x=factor(is_weekend,labels=c("No", "Yes")),y=num_imgs))
g2+geom_boxplot(fill="white", width=0.5,lwd=1.5,color='black',outlier.shape = NA)+
scale_y_continuous(limits = quantile(train$num_imgs, c(0, 0.95)))+
labs(subtitle = "number of images on weekend",x="On weekend or not")
Linear Model
We can inspect the trend of shares as a function of num_imgs. If the points show an upward trend, then articles with more images tend to be shared more often; if we see a negative trend, then articles with more images tend to be shared less often. We can also observe the difference after the log transformation.
g3<-ggplot(train,aes(x=num_imgs,y=shares))
g3+geom_point()+
labs(subtitle = "num_imgs vs shares")+
scale_y_continuous(limits = quantile(train$shares, c(0, 0.9)))+
scale_x_continuous(limits = quantile(train$num_imgs, c(0, 0.9)))+
geom_smooth(method="lm")
g4<-ggplot(train,aes(x=num_imgs,y=log(shares)))
g4+geom_point()+
labs(subtitle = "num_imgs vs log(shares)")+
scale_y_continuous(limits = quantile(log(train$shares), c(0, 0.9)))+
scale_x_continuous(limits = quantile(train$num_imgs, c(0, 0.9)))+
geom_smooth(method="lm")
#remove weekday from data set
train<-train%>%select(-weekday)
Models
Regression Tree
Tree-based models are machine learning algorithms with several benefits, including ease of use and minimal pre-processing: the data do not require normalization, scaling, or removal of missing values. The results are usually easy to explain, and stakeholders can generally understand them. A regression tree predicts a continuous response by recursively splitting the predictor space and predicting the mean response within each terminal node. Despite these benefits, a single decision tree tends to overfit, is sensitive to small changes in the data, and often does not transfer well to other datasets.
library(tree)
tree.news<-tree(shares~., data=train)
summary(tree.news)
##
## Regression tree:
## tree(formula = shares ~ ., data = train)
## Variables actually used in tree construction:
## [1] "n_unique_tokens" "abs_title_subjectivity"
## Number of terminal nodes: 3
## Residual mean deviance: 90380000 = 4.645e+11 / 5139
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu.
## -134000.0 -1879.0 -1279.0 0.0 21.1
## Max.
## 527800.0
plot(tree.news)
text(tree.news, pretty=0)
yhat.regTree<- predict(tree.news, newdata = test)
yhat.test<-test["shares"]
yhat.regTree<-as.data.frame(yhat.regTree)
meanRegTree<-mean((yhat.regTree$yhat.regTree-yhat.test$shares)^2)
RMSE_regTree<-sqrt(meanRegTree)
These results can vary widely depending on the data set and channel.
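A single tree can often be simplified by cost-complexity pruning. As a hedged sketch (not part of the original analysis), cv.tree() from the tree package can suggest a tree size before pruning with prune.tree():
#Cross-validate deviance over candidate tree sizes
cv.news <- cv.tree(tree.news, FUN = prune.tree)
#Size with the smallest cross-validated deviance
best.size <- cv.news$size[which.min(cv.news$dev)]
#Prune back to that size (only if more than a single node is kept)
if (best.size > 1) {
  pruned.news <- prune.tree(tree.news, best = best.size)
  summary(pruned.news)
}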
Linear Models
Linear models are valuable, powerful, and versatile tools that can be applied to many situations. Multiple regression examines the relationship between several independent variables and one dependent variable (in this case, total shares). Regression models let users determine the relative influence of one or more predictor variables on the response, and they also help identify outliers and anomalies. The main disadvantages have to do with the quality of the input data: incomplete input may lead to wrong conclusions, and the models assume the observations are independent, which is not always the case.
There are several different types of linear models. In this project, we fit multiple regression models on the log-transformed response: one on the full set of predictors, and several on partial subsets with variables removed at different points for multicollinearity reasons.
There are also several types of variable selection, including forward, backward, and stepwise selection, in which user-defined criteria set the entry and/or exit rules for variables. Backward selection starts with a full model and removes the least significant variables one at a time until the user-defined criteria are met. Forward selection does the opposite and is not represented here. A small sketch of an automated backward elimination is shown below; the report itself uses regsubsets() in a later section.
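This is only a hedged illustration of backward elimination using AIC via step(), not the regsubsets() procedure used later. It assumes newTrain (the training data without the weekday indicators) has been created as above, and it first drops the exactly collinear columns so the full model has no aliased coefficients:
#Drop columns that are exact linear combinations of others
sketchTrain <- newTrain %>%
  select(-LDA_04, -rate_negative_words)
#Fit the full model on the log response, then step backward by AIC
full_fit <- lm(log(shares) ~ ., data = sketchTrain)
back_fit <- step(full_fit, direction = "backward", trace = 0)
#Variables retained by the AIC criterion
names(coef(back_fit))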
Linear Regression After Log Transformation
We log-transform the response, fit a linear regression model with all the variables, and then calculate the RMSE of the model on the test set (back-transforming the predictions with exp).
lm<- lm(log(shares)~.,train)
summary(lm)
##
## Call:
## lm(formula = log(shares) ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4486 -0.5194 -0.1451 0.3952 5.3243
##
## Coefficients: (4 not defined because of singularities)
## Estimate Std. Error
## (Intercept) 7.622e+00 2.559e-01
## timedelta -2.772e-04 9.976e-05
## n_tokens_title -2.818e-03 5.484e-03
## n_tokens_content 6.372e-05 4.267e-05
## n_unique_tokens -8.789e-01 3.898e-01
## n_non_stop_words -2.137e-01 3.806e-01
## n_non_stop_unique_tokens 6.384e-01 3.237e-01
## num_hrefs 7.893e-03 1.815e-03
## num_self_hrefs -1.104e-02 2.837e-03
## num_imgs 3.013e-04 2.105e-03
## num_videos 2.109e-02 7.368e-03
## average_token_length -4.045e-02 4.824e-02
## num_keywords -1.292e-02 7.549e-03
## kw_min_min 1.333e-03 2.742e-04
## kw_max_min -1.268e-06 1.473e-05
## kw_avg_min -5.477e-05 1.011e-04
## kw_min_max -4.527e-07 3.841e-07
## kw_max_max 6.054e-08 9.746e-08
## kw_avg_max -5.500e-07 1.973e-07
## kw_min_avg -3.420e-05 1.588e-05
## kw_max_avg -3.536e-05 7.254e-06
## kw_avg_avg 3.620e-04 3.759e-05
## self_reference_min_shares -2.369e-07 1.320e-06
## self_reference_max_shares -6.340e-07 8.720e-07
## self_reference_avg_sharess 1.915e-06 2.056e-06
## weekday_is_monday -2.056e-01 5.479e-02
## weekday_is_tuesday -2.711e-01 5.333e-02
## weekday_is_wednesday -2.771e-01 5.369e-02
## weekday_is_thursday -2.996e-01 5.431e-02
## weekday_is_friday -2.101e-01 5.624e-02
## weekday_is_saturday 1.937e-02 6.242e-02
## weekday_is_sunday NA NA
## is_weekend NA NA
## LDA_00 1.015e-01 1.024e-01
## LDA_01 -1.179e-01 1.186e-01
## LDA_02 -5.780e-02 8.177e-02
## LDA_03 -9.400e-04 1.256e-01
## LDA_04 NA NA
## global_subjectivity 2.361e-01 1.795e-01
## global_sentiment_polarity -9.445e-02 3.662e-01
## global_rate_positive_words -3.011e+00 1.493e+00
## global_rate_negative_words 2.210e+00 3.591e+00
## rate_positive_words 3.952e-02 2.596e-01
## rate_negative_words NA NA
## avg_positive_polarity -9.459e-02 2.895e-01
## min_positive_polarity -2.357e-01 2.410e-01
## max_positive_polarity 1.316e-02 7.807e-02
## avg_negative_polarity -5.793e-01 2.685e-01
## min_negative_polarity 2.162e-01 9.651e-02
## max_negative_polarity 6.815e-01 2.319e-01
## title_subjectivity 1.637e-02 5.325e-02
## title_sentiment_polarity 1.058e-01 6.065e-02
## abs_title_subjectivity 1.632e-01 7.231e-02
## abs_title_sentiment_polarity 1.607e-01 8.703e-02
## t value Pr(>|t|)
## (Intercept) 29.788 < 2e-16 ***
## timedelta -2.778 0.005484 **
## n_tokens_title -0.514 0.607362
## n_tokens_content 1.493 0.135430
## n_unique_tokens -2.255 0.024204 *
## n_non_stop_words -0.561 0.574594
## n_non_stop_unique_tokens 1.972 0.048672 *
## num_hrefs 4.349 1.39e-05 ***
## num_self_hrefs -3.890 0.000102 ***
## num_imgs 0.143 0.886201
## num_videos 2.863 0.004217 **
## average_token_length -0.838 0.401828
## num_keywords -1.711 0.087120 .
## kw_min_min 4.863 1.19e-06 ***
## kw_max_min -0.086 0.931381
## kw_avg_min -0.542 0.587964
## kw_min_max -1.179 0.238594
## kw_max_max 0.621 0.534535
## kw_avg_max -2.787 0.005334 **
## kw_min_avg -2.153 0.031339 *
## kw_max_avg -4.874 1.12e-06 ***
## kw_avg_avg 9.631 < 2e-16 ***
## self_reference_min_shares -0.180 0.857531
## self_reference_max_shares -0.727 0.467269
## self_reference_avg_sharess 0.931 0.351757
## weekday_is_monday -3.753 0.000177 ***
## weekday_is_tuesday -5.083 3.84e-07 ***
## weekday_is_wednesday -5.161 2.54e-07 ***
## weekday_is_thursday -5.516 3.63e-08 ***
## weekday_is_friday -3.735 0.000190 ***
## weekday_is_saturday 0.310 0.756293
## weekday_is_sunday NA NA
## is_weekend NA NA
## LDA_00 0.991 0.321702
## LDA_01 -0.994 0.320135
## LDA_02 -0.707 0.479692
## LDA_03 -0.007 0.994030
## LDA_04 NA NA
## global_subjectivity 1.315 0.188548
## global_sentiment_polarity -0.258 0.796455
## global_rate_positive_words -2.017 0.043731 *
## global_rate_negative_words 0.615 0.538380
## rate_positive_words 0.152 0.879011
## rate_negative_words NA NA
## avg_positive_polarity -0.327 0.743862
## min_positive_polarity -0.978 0.328031
## max_positive_polarity 0.169 0.866115
## avg_negative_polarity -2.157 0.031038 *
## min_negative_polarity 2.240 0.025121 *
## max_negative_polarity 2.938 0.003313 **
## title_subjectivity 0.307 0.758583
## title_sentiment_polarity 1.744 0.081179 .
## abs_title_subjectivity 2.257 0.024053 *
## abs_title_sentiment_polarity 1.847 0.064815 .
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7756 on 5092 degrees of freedom
## Multiple R-squared: 0.09692, Adjusted R-squared: 0.08823
## F-statistic: 11.15 on 49 and 5092 DF, p-value: < 2.2e-16
yhat_lm<-predict(lm,test)
RMSE_lm<-sqrt(mean((test$shares - exp(yhat_lm))^2))
Plot the lm Residuals
par(mfrow=c(2,2))
plot(lm)
Looking at our residuals, there is skew in both directions, indicating that even after the log transformation the data contain extreme outliers in both directions.
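As a hedged follow-up (not part of the original analysis), Cook's distance from the fitted model can flag the most influential of those outlying observations; the 4/n cutoff below is a common rule of thumb, not a requirement:
#Flag observations with Cook's distance above 4/n
cooks <- cooks.distance(lm)
influential <- which(cooks > 4 / length(cooks))
length(influential)
head(sort(cooks[influential], decreasing = TRUE))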
Model Removing the Day Variable
#look at the data for multicollinearity
lmNewTest<-test[ -c(25:31) ]
lm2<- lm(log(shares)~.,newTrain)
yhat_lm2<-predict(lm2,lmNewTest)
RMSE_lm2<-sqrt(mean((lmNewTest$shares - exp(yhat_lm2))^2))
library(mctest)
omcdiag(lm2)
##
## Call:
## omcdiag(mod = lm2)
##
##
## Overall Multicollinearity Diagnostics
##
## MC Results detection
## Determinant |X'X|: 0.000000e+00 1
## Farrar Chi-Square: NaN NA
## Red Indicator: 1.640000e-01 0
## Sum of Lambda Inverse: 2.470923e+15 1
## Theil's Method: 2.772590e+01 1
## Condition Number: NaN NA
##
## 1 --> COLLINEARITY is detected by the test
## 0 --> COLLINEARITY is not detected by the test
imcdiag(lm2)
##
## Call:
## imcdiag(mod = lm2)
##
##
## All Individual Multicollinearity Diagnostics Result
##
## VIF TOL
## timedelta 3.684700e+00 0.2714
## n_tokens_title 1.148700e+00 0.8706
## n_tokens_content 3.758700e+00 0.2661
## n_unique_tokens 1.446920e+01 0.0691
## n_non_stop_words 1.501200e+15 0.0000
## n_non_stop_unique_tokens 1.087190e+01 0.0920
## num_hrefs 1.972000e+00 0.5071
## num_self_hrefs 1.690300e+00 0.5916
## num_imgs 1.827000e+00 0.5473
## num_videos 1.059700e+00 0.9437
## average_token_length 2.526200e+00 0.3959
## num_keywords 1.436200e+00 0.6963
## kw_min_min 3.386700e+00 0.2953
## kw_max_min 9.687000e+00 0.1032
## kw_avg_min 9.552800e+00 0.1047
## kw_min_max 1.255300e+00 0.7966
## kw_max_max 4.080000e+00 0.2451
## kw_avg_max 4.559000e+00 0.2193
## kw_min_avg 2.383300e+00 0.4196
## kw_max_avg 4.657100e+00 0.2147
## kw_avg_avg 6.123800e+00 0.1633
## self_reference_min_shares 9.996100e+00 0.1000
## self_reference_max_shares 1.582280e+01 0.0632
## self_reference_avg_sharess 3.554400e+01 0.0281
## is_weekend 1.047100e+00 0.9550
## LDA_00 Inf 0.0000
## LDA_01 Inf 0.0000
## LDA_02 Inf 0.0000
## LDA_03 Inf 0.0000
## LDA_04 Inf 0.0000
## global_subjectivity 1.601500e+00 0.6244
## global_sentiment_polarity 6.778400e+00 0.1475
## global_rate_positive_words 4.085600e+00 0.2448
## global_rate_negative_words 7.653600e+00 0.1307
## rate_positive_words 9.007199e+15 0.0000
## rate_negative_words 4.503600e+15 0.0000
## avg_positive_polarity 4.081100e+00 0.2450
## min_positive_polarity 1.833900e+00 0.5453
## max_positive_polarity 2.427900e+00 0.4119
## avg_negative_polarity 6.616500e+00 0.1511
## min_negative_polarity 5.273100e+00 0.1896
## max_negative_polarity 2.867400e+00 0.3487
## title_subjectivity 2.298000e+00 0.4352
## title_sentiment_polarity 1.720800e+00 0.5811
## abs_title_subjectivity 1.596800e+00 0.6262
## abs_title_sentiment_polarity 2.849800e+00 0.3509
## Wi Fi
## timedelta 3.040301e+02 3.110009e+02
## n_tokens_title 1.683530e+01 1.722130e+01
## n_tokens_content 3.124025e+02 3.195652e+02
## n_unique_tokens 1.525309e+03 1.560281e+03
## n_non_stop_words 1.700025e+17 1.739004e+17
## n_non_stop_unique_tokens 1.117940e+03 1.143572e+03
## num_hrefs 1.100786e+02 1.126025e+02
## num_self_hrefs 7.817000e+01 7.996230e+01
## num_imgs 9.365800e+01 9.580540e+01
## num_videos 6.760900e+00 6.915900e+00
## average_token_length 1.728315e+02 1.767941e+02
## num_keywords 4.940060e+01 5.053320e+01
## kw_min_min 2.702772e+02 2.764741e+02
## kw_max_min 9.837571e+02 1.006313e+03
## kw_avg_min 9.685577e+02 9.907648e+02
## kw_min_max 2.891010e+01 2.957300e+01
## kw_max_max 3.487978e+02 3.567950e+02
## kw_avg_max 4.030393e+02 4.122802e+02
## kw_min_avg 1.566526e+02 1.602443e+02
## kw_max_avg 4.141464e+02 4.236420e+02
## kw_avg_avg 5.802458e+02 5.935497e+02
## self_reference_min_shares 1.018761e+03 1.042119e+03
## self_reference_max_shares 1.678596e+03 1.717083e+03
## self_reference_avg_sharess 3.911913e+03 4.001605e+03
## is_weekend 5.334300e+00 5.456600e+00
## LDA_00 Inf Inf
## LDA_01 Inf Inf
## LDA_02 Inf Inf
## LDA_03 Inf Inf
## LDA_04 Inf Inf
## global_subjectivity 6.811760e+01 6.967940e+01
## global_sentiment_polarity 6.543704e+02 6.693738e+02
## global_rate_positive_words 3.494317e+02 3.574435e+02
## global_rate_negative_words 7.534884e+02 7.707643e+02
## rate_positive_words 1.020015e+18 1.043402e+18
## rate_negative_words 5.100076e+17 5.217011e+17
## avg_positive_polarity 3.489158e+02 3.569158e+02
## min_positive_polarity 9.443420e+01 9.659940e+01
## max_positive_polarity 1.617020e+02 1.654095e+02
## avg_negative_polarity 6.360365e+02 6.506195e+02
## min_negative_polarity 4.839063e+02 4.950013e+02
## max_negative_polarity 2.114766e+02 2.163253e+02
## title_subjectivity 1.469959e+02 1.503662e+02
## title_sentiment_polarity 8.162390e+01 8.349530e+01
## abs_title_subjectivity 6.758710e+01 6.913670e+01
## abs_title_sentiment_polarity 2.094809e+02 2.142839e+02
## Leamer CVIF Klein
## timedelta 0.5210 3.967200e+00 1
## n_tokens_title 0.9330 1.236700e+00 1
## n_tokens_content 0.5158 4.046800e+00 1
## n_unique_tokens 0.2629 1.557820e+01 1
## n_non_stop_words 0.0000 1.616268e+15 1
## n_non_stop_unique_tokens 0.3033 1.170530e+01 1
## num_hrefs 0.7121 2.123200e+00 1
## num_self_hrefs 0.7692 1.819800e+00 1
## num_imgs 0.7398 1.967100e+00 1
## num_videos 0.9714 1.140900e+00 0
## average_token_length 0.6292 2.719800e+00 1
## num_keywords 0.8344 1.546300e+00 1
## kw_min_min 0.5434 3.646300e+00 1
## kw_max_min 0.3213 1.042950e+01 1
## kw_avg_min 0.3235 1.028500e+01 1
## kw_min_max 0.8925 1.351500e+00 1
## kw_max_max 0.4951 4.392800e+00 1
## kw_avg_max 0.4683 4.908500e+00 1
## kw_min_avg 0.6478 2.566000e+00 1
## kw_max_avg 0.4634 5.014100e+00 1
## kw_avg_avg 0.4041 6.593200e+00 1
## self_reference_min_shares 0.3163 1.076230e+01 1
## self_reference_max_shares 0.2514 1.703560e+01 1
## self_reference_avg_sharess 0.1677 3.826840e+01 1
## is_weekend 0.9772 1.127400e+00 0
## LDA_00 0.0000 Inf 1
## LDA_01 0.0000 Inf 1
## LDA_02 0.0000 Inf 1
## LDA_03 0.0000 Inf 1
## LDA_04 0.0000 Inf 1
## global_subjectivity 0.7902 1.724300e+00 1
## global_sentiment_polarity 0.3841 7.298000e+00 1
## global_rate_positive_words 0.4947 4.398800e+00 1
## global_rate_negative_words 0.3615 8.240300e+00 1
## rate_positive_words 0.0000 9.697605e+15 1
## rate_negative_words 0.0000 4.848803e+15 1
## avg_positive_polarity 0.4950 4.393900e+00 1
## min_positive_polarity 0.7384 1.974500e+00 1
## max_positive_polarity 0.6418 2.614000e+00 1
## avg_negative_polarity 0.3888 7.123600e+00 1
## min_negative_polarity 0.4355 5.677300e+00 1
## max_negative_polarity 0.5905 3.087200e+00 1
## title_subjectivity 0.6597 2.474200e+00 1
## title_sentiment_polarity 0.7623 1.852700e+00 1
## abs_title_subjectivity 0.7914 1.719200e+00 1
## abs_title_sentiment_polarity 0.5924 3.068300e+00 1
## IND1 IND2
## timedelta 0.0023 1.0471
## n_tokens_title 0.0073 0.1860
## n_tokens_content 0.0022 1.0548
## n_unique_tokens 0.0006 1.3378
## n_non_stop_words 0.0000 1.4371
## n_non_stop_unique_tokens 0.0008 1.3049
## num_hrefs 0.0043 0.7084
## num_self_hrefs 0.0050 0.5869
## num_imgs 0.0046 0.6505
## num_videos 0.0080 0.0810
## average_token_length 0.0033 0.8682
## num_keywords 0.0059 0.4365
## kw_min_min 0.0025 1.0128
## kw_max_min 0.0009 1.2888
## kw_avg_min 0.0009 1.2867
## kw_min_max 0.0067 0.2923
## kw_max_max 0.0021 1.0849
## kw_avg_max 0.0019 1.1219
## kw_min_avg 0.0035 0.8341
## kw_max_avg 0.0018 1.1285
## kw_avg_avg 0.0014 1.2024
## self_reference_min_shares 0.0008 1.2934
## self_reference_max_shares 0.0005 1.3463
## self_reference_avg_sharess 0.0002 1.3967
## is_weekend 0.0081 0.0646
## LDA_00 0.0000 1.4371
## LDA_01 0.0000 1.4371
## LDA_02 0.0000 1.4371
## LDA_03 0.0000 1.4371
## LDA_04 0.0000 1.4371
## global_subjectivity 0.0053 0.5398
## global_sentiment_polarity 0.0012 1.2251
## global_rate_positive_words 0.0021 1.0854
## global_rate_negative_words 0.0011 1.2494
## rate_positive_words 0.0000 1.4371
## rate_negative_words 0.0000 1.4371
## avg_positive_polarity 0.0021 1.0850
## min_positive_polarity 0.0046 0.6535
## max_positive_polarity 0.0035 0.8452
## avg_negative_polarity 0.0013 1.2199
## min_negative_polarity 0.0016 1.1646
## max_negative_polarity 0.0029 0.9359
## title_subjectivity 0.0037 0.8118
## title_sentiment_polarity 0.0049 0.6020
## abs_title_subjectivity 0.0053 0.5371
## abs_title_sentiment_polarity 0.0030 0.9328
##
## 1 --> COLLINEARITY is detected by the test
## 0 --> COLLINEARITY is not detected by the test
##
## n_tokens_title , n_tokens_content , n_non_stop_words , num_imgs , average_token_length , num_keywords , kw_max_min , kw_avg_min , kw_min_max , kw_max_max , self_reference_min_shares , self_reference_max_shares , self_reference_avg_sharess , LDA_00 , LDA_01 , LDA_02 , LDA_03 , LDA_04 , global_subjectivity , global_rate_positive_words , global_rate_negative_words , rate_positive_words , rate_negative_words , avg_positive_polarity , min_negative_polarity , max_negative_polarity , title_sentiment_polarity , coefficient(s) are non-significant may be due to multicollinearity
##
## R-square of y on all x: 0.0952
##
## * use method argument to check which regressors may be the reason of collinearity
## ===================================
Looking at all of the VIF values, we start by removing four of the five LDA variables and the positive word rate, so that no "infinite" VIF values remain.
First Multicollinearity Trim
The mctest package was used to calculate the VIF values used to diagnose multicollinearity.
toRemove<-c( "LDA_01", "LDA_02", "LDA_03", "LDA_04", "rate_positive_words")
trimTrain1 <- newTrain[, ! names(newTrain) %in% toRemove, drop = F]
lmNewTest3<-lmNewTest[, ! names(newTrain) %in% toRemove, drop = F]
#Repeat linear Model process
lm3<- lm(log(shares)~., trimTrain1)
yhat_lm3<-predict(lm3,lmNewTest3)
RMSE_lm3<-sqrt(mean((lmNewTest3$shares - exp(yhat_lm3))^2))
imcdiag(lm3)
##
## Call:
## imcdiag(mod = lm3)
##
##
## All Individual Multicollinearity Diagnostics Result
##
## VIF TOL Wi
## timedelta 3.6718 0.2723 340.7160
## n_tokens_title 1.1476 0.8714 18.8209
## n_tokens_content 3.7568 0.2662 351.5559
## n_unique_tokens 14.4349 0.0693 1713.2837
## n_non_stop_words 2.8516 0.3507 236.1231
## n_non_stop_unique_tokens 10.8390 0.0923 1254.7152
## num_hrefs 1.9563 0.5112 121.9472
## num_self_hrefs 1.6697 0.5989 85.4072
## num_imgs 1.8189 0.5498 104.4257
## num_videos 1.0422 0.9595 5.3857
## average_token_length 2.4301 0.4115 182.3765
## num_keywords 1.4342 0.6973 55.3673
## kw_min_min 3.3840 0.2955 304.0190
## kw_max_min 9.5926 0.1042 1095.7735
## kw_avg_min 9.4391 0.1059 1076.2004
## kw_min_max 1.2547 0.7970 32.4867
## kw_max_max 4.0776 0.2452 392.4651
## kw_avg_max 4.5313 0.2207 450.3298
## kw_min_avg 2.3643 0.4230 173.9785
## kw_max_avg 4.6020 0.2173 459.3483
## kw_avg_avg 6.0357 0.1657 642.1750
## self_reference_min_shares 9.9815 0.1002 1145.3721
## self_reference_max_shares 15.8060 0.0633 1888.1309
## self_reference_avg_sharess 35.4964 0.0282 4399.1553
## is_weekend 1.0462 0.9558 5.8949
## LDA_00 1.0355 0.9658 4.5223
## global_subjectivity 1.5981 0.6257 76.2790
## global_sentiment_polarity 6.7730 0.1476 736.2034
## global_rate_positive_words 4.0821 0.2450 393.0401
## global_rate_negative_words 7.6297 0.1311 845.4538
## rate_negative_words 9.4452 0.1059 1076.9777
## avg_positive_polarity 4.0728 0.2455 391.8632
## min_positive_polarity 1.8333 0.5455 106.2691
## max_positive_polarity 2.4267 0.4121 181.9451
## avg_negative_polarity 6.5932 0.1517 713.2700
## min_negative_polarity 5.2691 0.1898 544.4203
## max_negative_polarity 2.8632 0.3493 237.6074
## title_subjectivity 2.2960 0.4355 165.2694
## title_sentiment_polarity 1.7206 0.5812 91.8944
## abs_title_subjectivity 1.5965 0.6264 76.0699
## abs_title_sentiment_polarity 2.8492 0.3510 235.8172
## Fi Leamer CVIF
## timedelta 349.5208 0.5219 3.9267
## n_tokens_title 19.3073 0.9335 1.2273
## n_tokens_content 360.6408 0.5159 4.0176
## n_unique_tokens 1757.5585 0.2632 15.4370
## n_non_stop_words 242.2250 0.5922 3.0495
## n_non_stop_unique_tokens 1287.1397 0.3037 11.5914
## num_hrefs 125.0985 0.7150 2.0921
## num_self_hrefs 87.6143 0.7739 1.7856
## num_imgs 107.1243 0.7415 1.9451
## num_videos 5.5249 0.9795 1.1146
## average_token_length 187.0894 0.6415 2.5988
## num_keywords 56.7981 0.8350 1.5337
## kw_min_min 311.8755 0.5436 3.6189
## kw_max_min 1124.0906 0.3229 10.2586
## kw_avg_min 1104.0117 0.3255 10.0944
## kw_min_max 33.3263 0.8927 1.3419
## kw_max_max 402.6072 0.4952 4.3606
## kw_avg_max 461.9673 0.4698 4.8459
## kw_min_avg 178.4745 0.6504 2.5284
## kw_max_avg 471.2188 0.4661 4.9215
## kw_avg_avg 658.7701 0.4070 6.4547
## self_reference_min_shares 1174.9709 0.3165 10.6745
## self_reference_max_shares 1936.9242 0.2515 16.9033
## self_reference_avg_sharess 4512.8386 0.1678 37.9607
## is_weekend 6.0472 0.9777 1.1189
## LDA_00 4.6391 0.9827 1.1073
## global_subjectivity 78.2502 0.7910 1.7091
## global_sentiment_polarity 755.2285 0.3842 7.2432
## global_rate_positive_words 403.1970 0.4949 4.3654
## global_rate_negative_words 867.3021 0.3620 8.1594
## rate_negative_words 1104.8090 0.3254 10.1009
## avg_positive_polarity 401.9897 0.4955 4.3556
## min_positive_polarity 109.0153 0.7386 1.9606
## max_positive_polarity 186.6470 0.6419 2.5952
## avg_negative_polarity 731.7024 0.3895 7.0509
## min_negative_polarity 558.4893 0.4356 5.6349
## max_negative_polarity 243.7477 0.5910 3.0620
## title_subjectivity 169.5403 0.6600 2.4554
## title_sentiment_polarity 94.2692 0.7624 1.8400
## abs_title_subjectivity 78.0357 0.7914 1.7073
## abs_title_sentiment_polarity 241.9112 0.5924 3.0470
## Klein IND1 IND2
## timedelta 1 0.0021 1.1691
## n_tokens_title 1 0.0068 0.2066
## n_tokens_content 1 0.0021 1.1790
## n_unique_tokens 1 0.0005 1.4954
## n_non_stop_words 1 0.0027 1.0432
## n_non_stop_unique_tokens 1 0.0007 1.4584
## num_hrefs 1 0.0040 0.7854
## num_self_hrefs 1 0.0047 0.6444
## num_imgs 1 0.0043 0.7233
## num_videos 0 0.0075 0.0651
## average_token_length 1 0.0032 0.9455
## num_keywords 1 0.0055 0.4864
## kw_min_min 1 0.0023 1.1319
## kw_max_min 1 0.0008 1.4392
## kw_avg_min 1 0.0008 1.4364
## kw_min_max 1 0.0062 0.3262
## kw_max_max 1 0.0019 1.2126
## kw_avg_max 1 0.0017 1.2521
## kw_min_avg 1 0.0033 0.9271
## kw_max_avg 1 0.0017 1.2575
## kw_avg_avg 1 0.0013 1.3405
## self_reference_min_shares 1 0.0008 1.4457
## self_reference_max_shares 1 0.0005 1.5050
## self_reference_avg_sharess 1 0.0002 1.5614
## is_weekend 0 0.0075 0.0710
## LDA_00 0 0.0076 0.0550
## global_subjectivity 1 0.0049 0.6013
## global_sentiment_polarity 1 0.0012 1.3694
## global_rate_positive_words 1 0.0019 1.2131
## global_rate_negative_words 1 0.0010 1.3961
## rate_negative_words 1 0.0008 1.4366
## avg_positive_polarity 1 0.0019 1.2122
## min_positive_polarity 1 0.0043 0.7303
## max_positive_polarity 1 0.0032 0.9446
## avg_negative_polarity 1 0.0012 1.3630
## min_negative_polarity 1 0.0015 1.3017
## max_negative_polarity 1 0.0027 1.0455
## title_subjectivity 1 0.0034 0.9069
## title_sentiment_polarity 1 0.0046 0.6729
## abs_title_subjectivity 1 0.0049 0.6003
## abs_title_sentiment_polarity 1 0.0028 1.0428
##
## 1 --> COLLINEARITY is detected by the test
## 0 --> COLLINEARITY is not detected by the test
##
## n_tokens_title , n_tokens_content , n_non_stop_words , num_imgs , average_token_length , num_keywords , kw_max_min , kw_avg_min , kw_min_max , kw_max_max , self_reference_min_shares , self_reference_max_shares , self_reference_avg_sharess , LDA_00 , global_subjectivity , global_sentiment_polarity , global_rate_negative_words , rate_negative_words , avg_positive_polarity , min_positive_polarity , max_positive_polarity , title_subjectivity , title_sentiment_polarity , abs_title_sentiment_polarity , coefficient(s) are non-significant may be due to multicollinearity
##
## R-square of y on all x: 0.0949
##
## * use method argument to check which regressors may be the reason of collinearity
## ===================================
This improves the multicollinearity, but some remains. We then pare down further, removing the variable with the next highest VIF one at a time until all values are below 5; a sketch of automating that loop follows.
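The pruning above was done by hand. As a hedged sketch (the car package and its vif() function are assumptions here; they are not used elsewhere in this report), the same loop could be automated:
#Repeatedly drop the predictor with the largest VIF until all VIFs < 5
library(car)
pruneTrain <- trimTrain1
repeat {
  fit  <- lm(log(shares) ~ ., data = pruneTrain)
  vifs <- vif(fit)
  if (max(vifs) < 5) break
  worst <- names(which.max(vifs))
  pruneTrain <- pruneTrain[, names(pruneTrain) != worst, drop = FALSE]
}
names(pruneTrain)   #columns kept (including the response, shares)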
Second Multicollinearity Trim
toRemove<-c("self_reference_avg_sharess", "kw_avg_min", "n_unique_tokens", "rate_negative_words", "kw_avg_avg", "n_non_stop_words", "global_sentiment_polarity", "avg_negative_polarity", "kw_max_max")
trimTrain2 <- trimTrain1[, ! names(trimTrain1) %in% toRemove, drop = F]
#Repeat linear Model process
lm4<- lm(log(shares)~., trimTrain2)
imcdiag(lm4)
##
## Call:
## imcdiag(mod = lm4)
##
##
## All Individual Multicollinearity Diagnostics Result
##
## VIF TOL Wi
## timedelta 3.5454 0.2821 419.5772
## n_tokens_title 1.1274 0.8870 20.9951
## n_tokens_content 2.7964 0.3576 296.1212
## n_non_stop_unique_tokens 2.3556 0.4245 223.4568
## num_hrefs 1.8736 0.5337 144.0071
## num_self_hrefs 1.6126 0.6201 100.9834
## num_imgs 1.7464 0.5726 123.0423
## num_videos 1.0332 0.9678 5.4771
## average_token_length 1.2973 0.7709 49.0006
## num_keywords 1.4075 0.7105 67.1699
## kw_min_min 1.5630 0.6398 92.8066
## kw_max_min 1.6397 0.6099 105.4424
## kw_min_max 1.2440 0.8038 40.2252
## kw_avg_max 4.0822 0.2450 508.0591
## kw_min_avg 1.2655 0.7902 43.7704
## kw_max_avg 1.6936 0.5904 114.3388
## self_reference_min_shares 1.3771 0.7262 62.1556
## self_reference_max_shares 1.3996 0.7145 65.8749
## is_weekend 1.0394 0.9621 6.4902
## LDA_00 1.0314 0.9695 5.1809
## global_subjectivity 1.4150 0.7067 68.4151
## global_rate_positive_words 1.4020 0.7133 66.2673
## global_rate_negative_words 1.4385 0.6951 72.2900
## avg_positive_polarity 2.4546 0.4074 239.7751
## min_positive_polarity 1.7172 0.5823 118.2298
## max_positive_polarity 2.3487 0.4258 222.3262
## min_negative_polarity 1.9011 0.5260 148.5340
## max_negative_polarity 1.1940 0.8376 31.9708
## title_subjectivity 2.2879 0.4371 212.2915
## title_sentiment_polarity 1.7065 0.5860 116.4632
## abs_title_subjectivity 1.5934 0.6276 97.8078
## abs_title_sentiment_polarity 2.8410 0.3520 303.4711
## Fi Leamer CVIF
## timedelta 433.6480 0.5311 3.6585
## n_tokens_title 21.6992 0.9418 1.1633
## n_tokens_content 306.0518 0.5980 2.8857
## n_non_stop_unique_tokens 230.9506 0.6516 2.4308
## num_hrefs 148.8364 0.7306 1.9334
## num_self_hrefs 104.3700 0.7875 1.6641
## num_imgs 127.1686 0.7567 1.8022
## num_videos 5.6608 0.9838 1.0662
## average_token_length 50.6438 0.8780 1.3387
## num_keywords 69.4225 0.8429 1.4524
## kw_min_min 95.9189 0.7999 1.6129
## kw_max_min 108.9785 0.7809 1.6920
## kw_min_max 41.5742 0.8966 1.2837
## kw_avg_max 525.0971 0.4949 4.2124
## kw_min_avg 45.2383 0.8889 1.3059
## kw_max_avg 118.1732 0.7684 1.7477
## self_reference_min_shares 64.2400 0.8522 1.4210
## self_reference_max_shares 68.0841 0.8453 1.4443
## is_weekend 6.7079 0.9809 1.0725
## LDA_00 5.3547 0.9846 1.0644
## global_subjectivity 70.7095 0.8406 1.4602
## global_rate_positive_words 68.4896 0.8445 1.4468
## global_rate_negative_words 74.7143 0.8338 1.4845
## avg_positive_polarity 247.8161 0.6383 2.5329
## min_positive_polarity 122.1947 0.7631 1.7721
## max_positive_polarity 229.7820 0.6525 2.4237
## min_negative_polarity 153.5152 0.7253 1.9618
## max_negative_polarity 33.0430 0.9152 1.2321
## title_subjectivity 219.4108 0.6611 2.3609
## title_sentiment_polarity 120.3689 0.7655 1.7610
## abs_title_subjectivity 101.0878 0.7922 1.6442
## abs_title_sentiment_polarity 313.6482 0.5933 2.9317
## Klein IND1 IND2
## timedelta 1 0.0017 1.9266
## n_tokens_title 1 0.0054 0.3032
## n_tokens_content 1 0.0022 1.7238
## n_non_stop_unique_tokens 1 0.0026 1.5443
## num_hrefs 1 0.0032 1.2512
## num_self_hrefs 1 0.0038 1.0194
## num_imgs 1 0.0035 1.1469
## num_videos 0 0.0059 0.0863
## average_token_length 1 0.0047 0.6149
## num_keywords 1 0.0043 0.7769
## kw_min_min 1 0.0039 0.9666
## kw_max_min 1 0.0037 1.0469
## kw_min_max 1 0.0049 0.5264
## kw_avg_max 1 0.0015 2.0261
## kw_min_avg 1 0.0048 0.5630
## kw_max_avg 1 0.0036 1.0990
## self_reference_min_shares 1 0.0044 0.7348
## self_reference_max_shares 1 0.0043 0.7662
## is_weekend 0 0.0058 0.1017
## LDA_00 0 0.0059 0.0818
## global_subjectivity 1 0.0043 0.7871
## global_rate_positive_words 1 0.0043 0.7694
## global_rate_negative_words 1 0.0042 0.8181
## avg_positive_polarity 1 0.0025 1.5902
## min_positive_polarity 1 0.0035 1.1208
## max_positive_polarity 1 0.0026 1.5409
## min_negative_polarity 1 0.0032 1.2719
## max_negative_polarity 1 0.0051 0.4359
## title_subjectivity 1 0.0027 1.5105
## title_sentiment_polarity 1 0.0036 1.1110
## abs_title_subjectivity 1 0.0038 0.9993
## abs_title_sentiment_polarity 1 0.0021 1.7389
##
## 1 --> COLLINEARITY is detected by the test
## 0 --> COLLINEARITY is not detected by the test
##
## n_tokens_title , n_non_stop_unique_tokens , num_imgs , num_keywords , kw_min_max , kw_avg_max , self_reference_min_shares , self_reference_max_shares , LDA_00 , global_subjectivity , global_rate_negative_words , avg_positive_polarity , min_positive_polarity , max_positive_polarity , min_negative_polarity , title_subjectivity , abs_title_sentiment_polarity , coefficient(s) are non-significant may be due to multicollinearity
##
## R-square of y on all x: 0.073
##
## * use method argument to check which regressors may be the reason of collinearity
## ===================================
After removing these additional variables with high VIF (> 5) for multicollinearity, we replot the correlation matrix, which shows much weaker clustering of high correlations.
Replot Correlation
#Remove the predictor
train_cor<-trimTrain2[1:31]
res <- cor(train_cor)
palette = colorRampPalette(c("green", "white", "red")) (20)
heatmap(x = res, col = palette, symm = TRUE, cexRow=0.5, cexCol = 0.5)
The new heatmap shows much less prominent clustering of highly correlated variables.
Final Model Fit Prediction
#trim the testing data
newTest1<-test[ -c(25:31) ]
toRemove<-c( "LDA_01", "LDA_02", "LDA_03", "LDA_04", "rate_positive_words", "self_reference_avg_sharess", "kw_avg_min", "n_unique_tokens", "rate_negative_words", "kw_avg_avg", "n_non_stop_words", "global_sentiment_polarity", "avg_negative_polarity", "kw_max_max")
trimTest4 <- newTest1[, ! names(newTest1) %in% toRemove, drop = F]
yhat_lm4<-predict(lm4,trimTest4)
RMSE_lm4<-sqrt(mean((trimTest4$shares - exp(yhat_lm4))^2))
Backward Regression Selection
We log-transform the response, then fit a linear regression model using the variables chosen by backward selection (with Cp, BIC, and adjusted R-squared used to compare subset sizes).
#backward selection after log transformation
library(leaps)
backward<- regsubsets(log(shares)~., trimTrain1, nvmax = 31, method = "backward")
backward_summary<-summary(backward)
#backward_summary[["which"]][size, ]
par(mfrow=c(1,3))
plot(backward_summary$cp, xlab = "Size", ylab = "backward Cp", type = "l")
plot(backward_summary$bic, xlab = "Size", ylab = "backward bic", type = "l")
plot(backward_summary$adjr2, xlab = "Size", ylab = "backward adjR2", type = "l")
coef(backward, which.min(backward_summary$cp))
## (Intercept)
## 7.199575e+00
## timedelta
## -2.837382e-04
## n_unique_tokens
## -1.231715e+00
## n_non_stop_unique_tokens
## 7.216471e-01
## num_hrefs
## 8.358289e-03
## num_self_hrefs
## -1.065347e-02
## num_videos
## 2.156525e-02
## num_keywords
## -1.216520e-02
## kw_min_min
## 1.203018e-03
## kw_avg_min
## -6.605619e-05
## kw_avg_max
## -5.840642e-07
## kw_min_avg
## -3.890582e-05
## kw_max_avg
## -3.671516e-05
## kw_avg_avg
## 3.724595e-04
## self_reference_avg_sharess
## 8.809493e-07
## is_weekend
## 2.678237e-01
## global_rate_positive_words
## -2.534546e+00
## avg_negative_polarity
## -6.013428e-01
## min_negative_polarity
## 1.648343e-01
## max_negative_polarity
## 7.541788e-01
## title_sentiment_polarity
## 8.939202e-02
## abs_title_subjectivity
## 1.530936e-01
## abs_title_sentiment_polarity
## 1.844868e-01
coef(backward, which.max(backward_summary$adjr2))
## (Intercept)
## 7.285686e+00
## timedelta
## -2.695623e-04
## n_tokens_content
## 5.748316e-05
## n_unique_tokens
## -8.349475e-01
## n_non_stop_unique_tokens
## 5.679596e-01
## num_hrefs
## 8.116934e-03
## num_self_hrefs
## -1.103329e-02
## num_videos
## 2.107119e-02
## average_token_length
## -6.080935e-02
## num_keywords
## -1.251591e-02
## kw_min_min
## 1.211841e-03
## kw_avg_min
## -6.404539e-05
## kw_min_max
## -4.629200e-07
## kw_avg_max
## -5.269550e-07
## kw_min_avg
## -3.487580e-05
## kw_max_avg
## -3.617401e-05
## kw_avg_avg
## 3.692974e-04
## self_reference_avg_sharess
## 8.622768e-07
## is_weekend
## 2.688252e-01
## LDA_00
## 1.053850e-01
## global_subjectivity
## 1.926311e-01
## global_rate_positive_words
## -3.185817e+00
## global_rate_negative_words
## 2.156132e+00
## min_positive_polarity
## -3.048485e-01
## avg_negative_polarity
## -5.875342e-01
## min_negative_polarity
## 2.171501e-01
## max_negative_polarity
## 6.862207e-01
## title_sentiment_polarity
## 1.010882e-01
## abs_title_subjectivity
## 1.518165e-01
## abs_title_sentiment_polarity
## 1.705095e-01
#get best subset of the specified size with min cp.
sub <- backward_summary$which[which.min(backward_summary$cp), ]
# Create test model matrix, prediction, test error
test_model <- model.matrix(log(shares)~ ., data = lmNewTest3)
model <- test_model[, sub]
yhat_back<-model %*% coef(backward, which.min(backward_summary$cp))
RMSE_back<-sqrt(mean((test$shares - exp(yhat_back))^2))
Random Forests
As mentioned in the regression tree section, a random forest builds an entire forest of trees and averages them to obtain more accurate and stable predictions than a single tree. It is trained with bootstrap aggregation (bagging). Unlike bagged trees, which consider every predictor at every split, a random forest considers only a random subset of the features when splitting a node; the subset size (mtry) can be tuned, for example with cross-validation as sketched below. This decorrelates the trees and makes the averaged prediction more robust.
The manual dimension reduction above was necessary to keep the random forest model computationally tractable.
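The models below fix mtry at 32 (a bagged-style fit) and 12. As a hedged sketch (not run here, since it is slow; the grid values and ntree = 200 are arbitrary illustrative choices), mtry could instead be tuned with cross-validation through caret:
#Tune mtry by 5-fold cross-validation (illustrative grid; slow to run)
rfControl <- trainControl(method = "cv", number = 5)
rfGrid <- expand.grid(mtry = c(6, 12, 20, 32))
rf.cv <- train(shares ~ ., data = trimTrain1,
               method = "rf",
               trControl = rfControl,
               tuneGrid = rfGrid,
               ntree = 200)
rf.cv$bestTune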
library(randomForest)
#single bagged model
tree.train<-randomForest(shares~., data=trimTrain1, mtry=32, importance=TRUE)
tree.train
##
## Call:
## randomForest(formula = shares ~ ., data = trimTrain1, mtry = 32, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 32
##
## Mean of squared residuals: 118439971
## % Var explained: -10.2
#single bagged regression tree error prediction
tree.test<-lmNewTest3["shares"]
yhat.bag<-predict(tree.train, newdata=lmNewTest3)
yhat.bag<-as.data.frame(yhat.bag)
yhat_bag<-mean((yhat.bag$yhat.bag-tree.test$shares)^2)
RMSE_bag<-sqrt(yhat_bag)
#random forests model
tree.trainRF<-randomForest(shares~., data=trimTrain1, mtry=12, importance=TRUE)
tree.trainRF
##
## Call:
## randomForest(formula = shares ~ ., data = trimTrain1, mtry = 12, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 12
##
## Mean of squared residuals: 112985027
## % Var explained: -5.13
#random forest error prediction
yhat.rf<-predict(tree.trainRF, newdata = lmNewTest3)
yhat.rf<-as.data.frame(yhat.rf)
yhat_rf<-mean((yhat.rf$yhat.rf-tree.test$shares)^2)
RMSE_rfTrimmed<-sqrt(yhat_rf)
varImpPlot(tree.trainRF)
Boosted Tree
Boosting is a general approach that can be applied to many statistical learning methods for regression or classification. The trees in boosting are grown sequentially: each tree is grown using information from previously grown trees. Boosting does not involve bootstrap sampling; instead, each tree is fit on a modified version of the original data set.
Procedure (for regression trees):
1. Initialize the predictions as 0, so the residuals equal the observed responses.
2. Find the residuals (observed minus predicted) and call this set r.
3. Fit a tree with d splits (d + 1 terminal nodes), treating the residuals r as the response (which they are for the first fit).
4. Update the predictions by adding the new tree's predictions, scaled by the shrinkage parameter.
5. Update the residuals with the new predictions and repeat steps 2-5 a total of B times.
The tuning parameters that must be chosen for the boosted tree model are the shrinkage, the number of trees B, and the interaction depth d. A small from-scratch sketch of this loop is shown below, before the tuned gbm fit.
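The following is only a hedged, from-scratch illustration of the five steps above (the function boost_sketch and all of its defaults are our own; the actual analysis uses the tuned gbm model fit next):
#Minimal gradient-boosting loop for a regression tree base learner
library(rpart)
boost_sketch <- function(data, B = 100, lambda = 0.01, d = 2) {
  y <- log(data$shares)
  X <- data[, names(data) != "shares", drop = FALSE]
  pred <- rep(0, nrow(data))        #step 1: start all predictions at 0
  r <- y                            #residuals begin as the observed response
  for (b in seq_len(B)) {
    df <- cbind(r = r, X)
    fit <- rpart(r ~ ., data = df,  #steps 2-3: fit a shallow tree to residuals
                 control = rpart.control(maxdepth = d, cp = 0))
    pred <- pred + lambda * predict(fit, X)   #step 4: shrunken update
    r <- y - pred                             #step 5: recompute residuals
  }
  pred
}
#Example call (small B so it runs quickly):
#head(boost_sketch(train, B = 50))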
cvcontrol <- trainControl(method="repeatedcv", number = 10,
allowParallel=TRUE)
grid <- expand.grid(n.trees = c(1000,1500),
interaction.depth=c(1:3),
shrinkage=c(0.01,0.05,0.1),
n.minobsinnode=c(20))
capture<-capture.output(train.gbm <- train(log(shares) ~ .,
data=train,
method="gbm",
trControl=cvcontrol,
tuneGrid = grid))
train.gbm
## Stochastic Gradient Boosting
##
## 5142 samples
## 53 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 4627, 4628, 4627, 4629, 4628, 4628, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.trees RMSE
## 0.01 1 1000 0.7681918
## 0.01 1 1500 0.7647089
## 0.01 2 1000 0.7599081
## 0.01 2 1500 0.7570735
## 0.01 3 1000 0.7559407
## 0.01 3 1500 0.7536862
## 0.05 1 1000 0.7609538
## 0.05 1 1500 0.7618861
## 0.05 2 1000 0.7588687
## 0.05 2 1500 0.7619148
## 0.05 3 1000 0.7584658
## 0.05 3 1500 0.7636395
## 0.10 1 1000 0.7637864
## 0.10 1 1500 0.7654240
## 0.10 2 1000 0.7647569
## 0.10 2 1500 0.7719003
## 0.10 3 1000 0.7716365
## 0.10 3 1500 0.7799070
## Rsquared MAE
## 0.1099130 0.5889274
## 0.1157004 0.5861461
## 0.1274264 0.5827007
## 0.1324360 0.5802111
## 0.1360213 0.5795454
## 0.1400805 0.5781561
## 0.1232025 0.5832450
## 0.1221163 0.5846752
## 0.1305829 0.5818999
## 0.1270942 0.5845462
## 0.1336602 0.5817103
## 0.1280051 0.5855592
## 0.1196104 0.5863451
## 0.1186703 0.5885118
## 0.1260543 0.5868815
## 0.1195385 0.5940197
## 0.1218293 0.5942420
## 0.1173440 0.5989953
##
## Tuning parameter 'n.minobsinnode' was held constant at
## a value of 20
## RMSE was used to select the optimal model using
## the smallest value.
## The final values used for the model were n.trees =
## 1500, interaction.depth = 3, shrinkage = 0.01
## and n.minobsinnode = 20.
boostPred <- predict(train.gbm, newdata = test)
RMSE_boost <- sqrt(mean((test$shares - exp(boostPred))^2))
Comparison
Generally, the model with the lowest RMSE is the best in the comparison.
comparison<-data.frame(RMSE_lm, RMSE_lm2, RMSE_lm3, RMSE_lm4, RMSE_back, RMSE_bag, RMSE_rfTrimmed, RMSE_boost, RMSE_regTree)
comparison
## RMSE_lm RMSE_lm2 RMSE_lm3 RMSE_lm4 RMSE_back RMSE_bag
## 1 4601.276 4600.126 4601.318 4611.745 4601.781 7382.078
## RMSE_rfTrimmed RMSE_boost RMSE_regTree
## 1 7072.759 4535.213 4543.594
which.min(comparison)
## RMSE_boost
## 8
The overall prediction error for this data set is very high. This is likely due to outlier articles with extremely high share counts, articles that were both timely and viral. These values were NOT removed from the analysis, because they are exactly the share metrics a company would want to evaluate for emulation. A quick look at the upper quantiles of shares, sketched below, illustrates how heavy that upper tail is.
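As a small, hedged check (assuming the training data are still in memory), the upper quantiles of shares show how extreme the tail is:
#Upper tail of shares in the training data
quantile(train$shares, probs = c(0.50, 0.90, 0.99, 1))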