Monthly Archives: November 2016

R bookdown on mac El Capitan

I was excited to see the R bookdown package from Yihui Xie of RStudio, but when I tried to get started, I ran into a problem.

I am using (mac) OS X El Capitan 10.11.6 and when I tried to “build the book”, I got the following error:

xelatex: command not found.

I realized I had never installed LaTeX on my Mac, so I tried to download MacTeX 2016 from the main site, but that link did not download the package for me! It just displayed some text response.

Looking at the MacTeX FAQ, I found the list of mirror sites and could then download the package from a mirror close to me. The package was 2.7 GB. With that, I installed LaTeX.

There is one more step required though: the directory containing xelatex has to be on the PATH variable. I edited my ~/.bashrc to add /Library/TeX/texbin to PATH. Then I restarted RStudio. I was then able to build the book 🙂
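If editing ~/.bashrc is not an option, the same effect can be approximated from inside R for the current session only. The following is a minimal sketch, not part of bookdown itself: it assumes MacTeX put its binaries in /Library/TeX/texbin (the directory mentioned above), and the variable names `tex_bin` and `path_entries` are made up for illustration.

```r
# Assumption: MacTeX's binaries live in the directory named in this post
tex_bin <- "/Library/TeX/texbin"

# Split the current PATH into its individual directories
path_entries <- strsplit(Sys.getenv("PATH"), ":", fixed = TRUE)[[1]]

# Prepend the TeX directory only if it is not already listed
if (!(tex_bin %in% path_entries)) {
  Sys.setenv(PATH = paste(c(tex_bin, path_entries), collapse = ":"))
}

# Full path to xelatex, or "" if it still cannot be found
Sys.which("xelatex")
```

This only affects the running R session, so it has to be repeated (or placed in a startup file such as ~/.Rprofile) each time R is restarted.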

Beta Binomial Model in R

I have been reading Kevin Murphy’s book Machine Learning: a Probabilistic Perspective to study machine learning in more depth. I love the book and it has been an exciting experience so far, but I wish it could have been better.

Once I started reading the first topic that was new to me, Sec. 3.3 The Beta Binomial Model, I noticed that the notation in the formulae is inconsistent, especially for the shape parameters of the Beta distribution. I am not going to get into the details here, but if you read that section, you will understand the sloppiness.

Here, I thought I could share the R code for producing the plot in Figure 3.7. It is quite unfortunate that the Matlab code for the book (pmtk3) also uses inconsistent variable names. Fig. 3.7 is using M = 10, which is not even mentioned in the text! For the plug-in approximation you need to use the MAP estimate given in Eq. (3.21) and a binomial distribution.
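For reference, the quantities behind the two panels can be written in one consistent notation. This is a sketch that follows the variable names used in the R code in this post rather than quoting the book: a and b are the prior shape parameters, N_1 and N_0 are the observed numbers of heads and tails, N = N_1 + N_0, and x is the number of heads in M future tosses.

```latex
\begin{align}
p(\theta \mid D) &= \mathrm{Beta}(\theta \mid a + N_1,\; b + N_0) \\
p(x \mid D, M) &= \binom{M}{x}\,
  \frac{B(x + a + N_1,\; M - x + b + N_0)}{B(a + N_1,\; b + N_0)} \\
\hat{\theta}_{\mathrm{MAP}} &= \frac{a + N_1 - 1}{a + b + N - 2} \\
p(x \mid \hat{\theta}_{\mathrm{MAP}}, M) &= \binom{M}{x}\,
  \hat{\theta}_{\mathrm{MAP}}^{\,x}\,\bigl(1 - \hat{\theta}_{\mathrm{MAP}}\bigr)^{M - x}
\end{align}
```

The second line is the posterior predictive distribution and the last line is the plug-in approximation. With a = b = 2, N_1 = 3 and N_0 = 17, the MAP estimate is 4/22, or about 0.18.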

I hope the code below will help map the variable names to the ones defined in the text correctly and will clarify how the plots are produced.

##########################################
# beta binomial model
##########################################

library(ggplot2)

# prior is beta(a, b) with a = 2, b = 2
a <- 2
b <- 2

# likelihood; 17 tails and 3 heads
N0 <- 17
N1 <- 3
N <- N0 + N1

# predict number of heads in M = 10 more tosses
M <- 10

# posterior predictive distribution (ppd)
ppd <- rep(NA, M + 1)  # one entry for each k = 0, 1, ..., M
for (k in 0:M) {
  # Eq (3.34) - (a, b) replaced by (a + N1, b + N0)
  ppd[k + 1] <- choose(M, k) * 
    beta(a + k + N1, M - k + N0 + b) / 
    beta(a + N1, b + N0)
}
# store results in a data frame for plotting with ggplot
# store the title for faceting
pltDf <- data.frame(x = 0:M, d = ppd, 
                    title = rep("Posterior Pred", M + 1))

# plug-in approx.
# first get MAP estimate = mode of posterior. Eq (3.21)
thetaHat <- (a + N1 - 1) / (a + b + N - 2)
pluginApprox <- rep(NA, M + 1)
for (k in 0:M) {
  pluginApprox[k + 1] <- choose(M, k) * 
    thetaHat ^ k * 
    (1 - thetaHat) ^ (M - k)
}
pluginDf <- data.frame(x = 0:M, d = pluginApprox, 
                       title = rep("Plug-in Approx", M + 1))

pltDf <- rbind(pltDf, pluginDf)

ggplot(pltDf, aes(x, d)) + 
  geom_bar(stat = "identity", fill = "navyblue", 
           alpha = 0.8) + 
  scale_x_continuous(breaks = pltDf$x) + 
  facet_wrap(~title)

That produces the following plot:

[Figure fig3-7: bar charts of the "Posterior Pred" and "Plug-in Approx" distributions over 0 to 10 heads]
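As a cross-check on the loops, the same two distributions can be computed without explicit iteration. This is a minimal sketch using the same numbers as above. Working on the log scale with `lchoose` and `lbeta` is my own choice for numerical stability, not something the book's code does, and the name `plugin` is introduced here for illustration.

```r
# Same setup as the post: Beta(2, 2) prior, 3 heads, 17 tails, M = 10 future tosses
a <- 2; b <- 2
N1 <- 3; N0 <- 17
M <- 10
k <- 0:M

# Posterior predictive (beta-binomial), computed on the log scale
ppd <- exp(lchoose(M, k) +
           lbeta(a + N1 + k, b + N0 + M - k) -
           lbeta(a + N1, b + N0))

# Plug-in approximation: binomial with the MAP estimate of theta
thetaHat <- (a + N1 - 1) / (a + b + N0 + N1 - 2)
plugin <- dbinom(k, size = M, prob = thetaHat)

# Both are probability distributions over 0..M, so each must sum to 1
stopifnot(isTRUE(all.equal(sum(ppd), 1)),
          isTRUE(all.equal(sum(plugin), 1)))

# Predictive means: M * (a + N1) / (a + b + N) versus M * thetaHat
sum(k * ppd)
sum(k * plugin)
```

The sum-to-one check is a cheap way to catch an indexing or parameter mix-up, which is exactly the kind of mistake inconsistent notation invites.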

Using Variables in PostgreSQL PgAdmin III Client

There was a requirement in my office to be able to use variables for table names and field values in PostgreSQL queries.

If you have used Oracle DB, then you know you can define variables using define and later use those variables in your queries. Here are two variable declarations, one for a table name and another for a field value.

define mytbl = 'sometable';
define maxid = 2500;

Once those variables are defined, we can use them in our queries like this:

select count(*) from &mytbl where id <= &maxid;

The same thing is possible in PostgreSQL with the pgAdmin III client, using its pgScript feature.

Here is the code to do this. First, you would enter something like this in the SQL Editor:

declare @mytbl, @maxid; 
set @mytbl = 'sometable'; 
set @maxid = 2500; 
set @res = select count(*) from @mytbl where id <= @maxid;
print @res;

Instead of executing this script as usual, you need to click the icon next to “execute query”. That icon is “execute pgScript”, and it shows a P with a GS below it. Once the script runs, it will show both the query (with the table name and field value substituted in) and the results of your query.

So if you ever need to define variables at the top of a PostgreSQL script and use them later inside complicated SQL statements, this is one way to do it.