
An Alternative to R pairs with reshape2 and ggplot2

Base plot function: “pairs”

You may be aware of the pairs plotting function in R. Here is an example with the Boston data set from the MASS package. We will use just a few variables, since there is not enough room here to show the plots for all of them.

library(MASS) # provides the Boston data set
miniBoston <- Boston[, c("crim", "lstat", "rm", "medv")]

Let us inspect this new data frame:

> head(miniBoston)
     crim lstat    rm medv
1 0.00632  4.98 6.575 24.0
2 0.02731  9.14 6.421 21.6
3 0.02729  4.03 7.185 34.7
4 0.03237  2.94 6.998 33.4
5 0.06905  5.33 7.147 36.2
6 0.02985  5.21 6.430 28.7

Let’s plot this with the pairs function now:

pairs(miniBoston)

This produces the following plot:
[Figure: pairs plot of crim, lstat, rm, and medv]

This is great for seeing associations between all the variables at once. However, when you have many variables and a lot of data, it is slow to generate and wastes space, since each scatterplot appears twice. That is a poor trade if all you really want to see is how one specific variable relates to the others.

Sensible Alternative with reshape2 and ggplot2

If you do not have reshape2 and ggplot2 installed already, install them first:

install.packages("reshape2")
install.packages("ggplot2")

Let us say we are interested in how the variable medv is related to the other variables. We proceed with:

library(reshape2)
# keep medv as the id variable; stack the rest into (variable, value) pairs
meltBoston <- melt(miniBoston, id.vars = "medv")

If you inspect meltBoston, you see

> str(meltBoston)
'data.frame':	1518 obs. of  3 variables:
 $ medv    : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
 $ variable: Factor w/ 3 levels "crim","lstat",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ value   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...

> head(meltBoston)
  medv variable   value
1 24.0     crim 0.00632
2 21.6     crim 0.02731
3 34.7     crim 0.02729
4 33.4     crim 0.03237
5 36.2     crim 0.06905
6 28.7     crim 0.02985

Now we can plot like this:

library(ggplot2)
ggplot(data = meltBoston, aes(x = value, y = medv)) + 
  geom_point(size = 0.3, pch = 1) +  # small open circles
  facet_wrap(~ variable,             # one panel per predictor
             ncol = 3, 
             scales = "free_x")      # each panel gets its own x-axis scale

You get a plot like this, which uses the space far more effectively:

[Figure: faceted scatterplots of medv against crim, lstat, and rm]
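If you also want a rough trend line in each panel, geom_smooth drops in naturally. A minimal sketch (the loess smoother is my choice here, not part of the original plot):

ggplot(data = meltBoston, aes(x = value, y = medv)) +
  geom_point(size = 0.3, pch = 1) +
  geom_smooth(method = "loess") +   # add a loess trend per panel
  facet_wrap(~ variable, ncol = 3, scales = "free_x")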


Grid Search, Cross Validation and XGBoost in R

This post takes the chapter on tree-based methods in the well-regarded Introduction to Statistical Learning text to the next level with grid search, cross-validation, and XGBoost.

I assume that you have read the chapter on Tree-Based Methods (chapter 8 in my copy) and executed the code in the lab section. Here we look into regression trees only. It should be straightforward to apply the same ideas to a classification problem by changing the objective function (for example, "binary:logistic" instead of "reg:linear" in the XGBoost section below).

The last statement of the Boosting sub-section (of the lab) states that

using lambda of 0.2 leads to a slightly lower test MSE than lambda of 0.001.

It is a pity that the authors do not teach how to select the optimal lambda via grid search and cross-validation. Here is an attempt to do that.

Grid Search with Cross Validation

Let us first split the Boston dataset into 80% training and 20% test (unlike the 50-50 split the authors do).

library(MASS) # for the Boston data set
set.seed(1234)
trainInd <- sample(1:nrow(Boston), floor(0.8 * nrow(Boston)))
dfTrain <- Boston[ trainInd,]
dfTest  <- Boston[-trainInd,]
dim(dfTrain) # 404 14
dim(dfTest)  # 102 14

For selecting the best parameters for gbm, we need the caret package, along with the gbm package that caret drives under the hood. Install both first:

install.packages(c("caret", "gbm"))

Here is the code for a grid search over the main hyper-parameters of gbm. The grid below has 3 × 3 × 3 × 1 = 27 combinations, and 5-fold cross-validation fits each of them five times:

library(caret)

# set a grid of parameters to optimize upon;
# all these are gbm params and must be specified
gbmGrid <- expand.grid(
  n.trees = c(250, 500, 1000),
  shrinkage = c(0.001, 0.01, 0.1), # lambda
  interaction.depth = c(1, 2, 4), 
  n.minobsinnode = c(10) # left at default
)

# 5-fold CV; set method to "repeatedcv" 
# and set repeats for repeated CV
gbmTrControl <- trainControl(
  method = "cv",
  number = 5,
  #repeats = 4,
  verboseIter = FALSE,
  allowParallel = TRUE
)
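Note that allowParallel = TRUE only takes effect if a parallel backend is registered first. A minimal sketch using the doParallel package (the worker count is an assumption; adjust for your machine):

library(doParallel)
cl <- makePSOCKcluster(4) # assumption: 4 workers
registerDoParallel(cl)
# ... run train() below ...
# stopCluster(cl) when done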

# do the training
gbmTrain <- train(
  x = as.matrix(dfTrain[, ! names(dfTrain) 
                              %in% c("medv")]), 
  y = dfTrain$medv,
  trControl = gbmTrControl,
  tuneGrid = gbmGrid,
  method = "gbm"
)

# get the top 5 models
head(gbmTrain$results[with(gbmTrain$results, 
                           order(RMSE)), ], 5)
# MSE for best model = 3.42 ^ 2 = 11.7

# get the best model's parameters
gbmTrain$bestTune
#   n.trees interaction.depth shrinkage n.minobsinnode
#       500                 4       0.1             10
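If you want to tie this back to the lab's gbm() call, you can refit a standalone model with these winning parameters. A sketch, assuming the bestTune values shown above:

library(gbm)
# refit a plain gbm with the CV-chosen parameters,
# mirroring the boost.boston call from the ISL lab
boostBest <- gbm(medv ~ ., data = dfTrain,
                 distribution = "gaussian",
                 n.trees = 500, interaction.depth = 4,
                 shrinkage = 0.1, n.minobsinnode = 10)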

yhatGbm <- predict(gbmTrain, newdata = dfTest)
mean((yhatGbm - dfTest$medv) ^ 2) # 7.28

# predicted (y-axis) vs actual (x-axis) in red;
# the blue line is y = x, i.e. perfect predictions
plot(dfTest$medv, yhatGbm, col = "red")
abline(0, 1, col = "blue")

The plot looks like this:

[Figure: GBM predictions vs actual medv on the test set]

In this case, the test MSE (7.28) is actually much lower than the cross-validated estimate (11.7). With only 102 test observations, some of that gap is likely down to the luck of the split.

Feel free to expand the grid search to larger ranges and select better parameters.
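For example, a follow-up grid zoomed in around the winning combination might look like this (the ranges are illustrative, not tuned):

# a finer grid centered on the winning region from the first search
gbmGrid2 <- expand.grid(
  n.trees = c(500, 750, 1000, 1500),
  shrinkage = c(0.05, 0.1, 0.2),
  interaction.depth = c(3, 4, 5, 6),
  n.minobsinnode = c(5, 10)
)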

XGBoost

Next, we see how to do the same optimal parameter selection for XGBoost.

Install the XGBoost package using

install.packages("xgboost")

The following code then searches for the best hyper-parameters for XGBoost. Be warned: the grid below has 3 × 3 × 3 × 3 × 3 × 2 = 486 combinations, and repeated 5-fold CV fits each one ten times, so this takes a while.

library(xgboost)

xgbGrid <- expand.grid(
  nrounds = c(250, 500, 1000),
  max_depth = c(1, 2, 4),
  eta = c(0.001, 0.003, 0.01),
  gamma = c(0, 1, 2),
  colsample_bytree = c(1, 0.5, 0.25),
  min_child_weight = c(1, 2),
  subsample = c(1) # recent caret versions require subsample in the grid too
)

xgbTrControl <- trainControl(
  method = "repeatedcv",
  number = 5,
  repeats = 2,
  verboseIter = FALSE,
  returnData = FALSE,
  allowParallel = TRUE
)

xgbTrain <- train(
  x = as.matrix(dfTrain[, ! names(dfTrain) %in% c("medv")]), 
  y = dfTrain$medv,
  objective = "reg:linear",
  trControl = xgbTrControl,
  tuneGrid = xgbGrid,
  method = "xgbTree"
)

# get the top model and its results
head(xgbTrain$results[with(xgbTrain$results, 
                             order(RMSE)), ], 1)
# MSE for best model = 3.12 ^ 2 = 9.73
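As with gbm, the winning parameter combination is stored in the fitted train object:

xgbTrain$bestTune # the best row of xgbGrid found by CV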

yhatXgb <- predict(xgbTrain, newdata = dfTest)
mean((yhatXgb - dfTest$medv) ^ 2) # 10.49

# predicted vs actual in red; the blue line is y = x
plot(dfTest$medv, yhatXgb, col = "red")
abline(0, 1, col = "blue")

# Variable importance (note: the feature vector is named featNames
# so we do not mask base::names)
featNames <- names(dfTrain)[! names(dfTrain) %in% c("medv")]
importanceMatrix <- xgb.importance(featNames, 
                                   model = xgbTrain$finalModel)
xgb.plot.importance(importanceMatrix[1:10, ])

The variable importance plot looks like this:

[Figure: XGBoost variable importance for the top 10 features]

Again, we can see that lstat and rm are the two main features.

The test MSEs for the two models are lower than the ones presented in the text. This could be because our training set is larger, and also because of the hyper-parameter tuning we did. For a fair comparison, you should rerun the code from the text with an 80% training split like the one used here.

PS: I could not easily figure out how to plot the variable importance for the GBM model. One option that appears to work is sketched below.
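caret's generic varImp() accepts the trained object and, for gbm, reports the model's relative influence measures. A minimal sketch (my suggestion, not tested against this exact fit):

gbmImp <- varImp(gbmTrain) # relative influence from the underlying gbm fit
plot(gbmImp, top = 10)     # plot the 10 most influential features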