Inference
Suppose that you want to evaluate a joint probability density function \( f \) over an arbitrary region. Conventional statistical techniques tell you to compute the integral of \( f \) over that region to find the probability mass it carries. Now suppose the function is complex enough that calculating the integral directly is not a practical way to get that probability. This is where Monte Carlo inference can be deployed: it helps us compute the integral implicitly, without ever integrating the function explicitly. Monte Carlo inference is a handy technique that trades a difficult integral for simple arithmetic and (possibly expensive) computation, at the cost of a little care in tuning a few constants and parameters.
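Before getting to rejection sampling, here is a tiny toy sketch in R of computing an integral implicitly from random draws; the integrand \(g\) and the interval \((0,1)\) are arbitrary choices of mine, purely for illustration. The integral of \(g\) over \((a,b)\) is approximated by \((b-a)\) times the average of \(g\) at uniform samples.
set.seed(1)
g <- function(x) exp(-x^2 / 2)     # an arbitrary integrand, for illustration only
n <- 100000
x <- runif(n, 0, 1)                # uniform draws on (0, 1)
(1 - 0) * mean(g(x))               # approximates the integral, roughly 0.856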
The term Monte Carlo inference refers to a general family of techniques that replace an integral with a sum under certain assumptions. Rejection sampling is one instance of Monte Carlo inference; it requires a proposal distribution, say \( q(x) \). With rejection sampling, even an unnormalized function \( \tilde{f} \) can be approximated, as long as \( M q(x) \geq \tilde{f}(x)\) for all values of \( x \) and some constant \(M\). Let's start with an example to demonstrate how rejection sampling works. A bivariate joint density function \(f(x_1, x_2) \propto \exp\{ -\frac{1}{2} [ \frac{x_1^2}{100} + (x_2 + 0.05x_1^2 - 5)^2 ]\}\) is given on \( \mathbf{R}^2\) up to an unknown constant, and I'm asked to compute the probability of the rectangle \( \{(x_1, x_2) : (x_1, x_2) \in (-3, -1) \times (5, 6) \} \). To do so, I should compute the integral of the exact function \(f(x_1, x_2)\) over this rectangle. I have two problems in doing so: 1) I do not have the exact function, only its unnormalized version \( \tilde{f}(x_1, x_2)\); 2) even if I had \(f\), I would be too lazy to review my Calculus II notes and work out the 2D integral again! So this is where Monte Carlo inference comes in to approximate the integral with a sum. Now I want to find an upper-bound function \(q(x)\), an envelope of \(\tilde{f}\), that satisfies the constraint above. To find this envelope, I can plot \(\tilde{f}\) to get a sense of how the function looks. So here you go,
The above graphs show that the function \( \tilde{f}(x_1, x_2)\) is unimodal with two tails, so a bivariate normal distribution can easily cover it as a proposal distribution with an appropriate \(M\). In the following graphs, the proposed envelope is depicted, where \(M=\max_{x_1, x_2}\frac{\tilde{f}(x_1, x_2)}{q(x_1, x_2)}\).
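In practice, \(M\) can be approximated numerically by maximizing the ratio \(\tilde{f}/q\) over a large cloud of points. The snippet below is a rough sketch of my own for this step, assuming `f` is the R implementation of \(\tilde{f}\) given in the appendix and taking the proposal to be the bivariate normal with the mean and covariance suggested there; `dmvnorm` comes from the mvtnorm package.
library(mvtnorm)
mn <- c(0, 10)                                    # proposal mean (from the appendix)
sigma <- matrix(c(1000, 100, 100, 250), 2, 2)     # proposal covariance (from the appendix)
x1 <- runif(100000, -20, 20)                      # a large cloud of evaluation points
x2 <- runif(100000, -15, 8)
M <- max(f(x1, x2) / dmvnorm(cbind(x1, x2), mn, sigma))   # numerical approximation of M
Since the maximum is taken over finitely many points, this only approximates the true supremum; a denser cloud (or a regular grid) gives a safer \(M\).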
Now, with the proposal function \(q\) and the constant \(M\) fixed, we can draw samples from the proposal and approximate the unnormalized function \(\tilde{f}\) by keeping the accepted ones according to the following rule.
Iterate this procedure a fairly large number of times (a minimal R sketch follows the list):
1) Sample a single random number \(u\) from the uniform distribution on \((0,1)\).
2) Sample a bivariate random point \((x_1, x_2)\) from the proposal \(q(x_1, x_2)\).
3) If \(u>\frac{\tilde{f}(x_1, x_2)}{Mq(x_1, x_2)}\), reject the point. Otherwise, accept it.
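A vectorized version of these three steps might look like the following; this is a condensed sketch of my own, assuming `f` is the R implementation of \(\tilde{f}\) from the appendix, `mn` and `sigma` are the proposal's mean and covariance from the appendix, `M` is the constant found above, and `rmvnorm`/`dmvnorm` come from the mvtnorm package.
n <- 10000
p <- rmvnorm(n, mn, sigma)                                       # step 2, done n times at once
u <- runif(n, 0, 1)                                              # step 1
accept <- u <= f(p[, 1], p[, 2]) / (M * dmvnorm(p, mn, sigma))   # step 3
samples <- p[accept, ]                                           # accepted points, distributed (approximately) as f
Note that the acceptance rate shrinks as \(M\) grows, so a tighter envelope means fewer wasted draws.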
In the following graph, a fairly large number of samples is drawn from the proposal function \(q(x_1, x_2)\).
Then the accepted ones are depicted in the figures below,
As can be seen, the accepted samples capture the dense areas of the function \(\tilde{f}\) quite well. It's worth mentioning that the black squares are the accepted samples falling in the rectangle \( \{(x_1, x_2) : (x_1, x_2) \in (-3, -1) \times (5, 6) \} \); thus the proportion of these black samples to the entire set of accepted samples gives the probability of this region under the density associated with \(\tilde{f}\).
This was a short introduction to the rejection sampling technique through an example, showing how one can perform approximate inference in the absence of the normalizing constant of a given density function, without any integration.
Appendix
In this example, \(\mu=(0, 10)\) is suggested as the mean vector of the proposal distribution, with \(\begin{bmatrix}1000 & 100 \\ 100 & 250 \end{bmatrix}\) as the covariance matrix. The following code snippets could help you reproduce the figures above.
The function definition is implemented here,
f <- function(x1, x2) {
  phi <- -0.5 * (x1^2 / 100 + (x2 + 0.05 * x1^2 - 5)^2)   # exponent of the unnormalized density
  return(exp(phi))
}
Then \(n\) data points are generated at random to compute the corresponding values of the functions \(\tilde{f}\) and \(q\).
n=10000
x1=runif(n, -20, 20)
x2=runif(n, -15, 8)
mn <- c(0, 10)
sigma <- matrix(c(1000, 100, 100, 250), 2, 2)
fv <- f(x1, x2)                                    # unnormalized target at the random points
nrm <- dmvnorm(cbind(x1, x2), mn, sigma)           # proposal density q at the same points
cols <- myColorRamp(c("red", "yellow", "green", "blue"), fv)
open3d()
plot3d(x = x1, y = x2, z = fv, col = cols)
plot3d(x = x1, y = x2, z = max(fv / nrm) * nrm, col = "blue", add = TRUE)   # envelope M*q over the target
Finally, we should sample some data from \(q\), accept the points that satisfy the criterion, and then the proportion of the accepted data falling in the given rectangle, relative to the entire set of accepted samples, identifies the probability of the region.
p=rmvnorm(n, mn, sigma)
nrm <- dmvnorm(p, mn, sigma)                       # proposal density q at the drawn candidates
fv <- f(p[, 1], p[, 2])                            # unnormalized target at the drawn candidates
r <- runif(n, 0, 1)
tt <- (fv / (max(fv / nrm) * nrm))                 # acceptance probabilities; max(fv/nrm) plays the role of M
index <- r < (tt)                                  # accepted candidates
x1 <- p[index, 1]
x2 <- p[index, 2]
fv <- fv[index]
cols <- myColorRamp(c("red", "yellow", "green", "blue"), fv)
open3d()
plot3d(x1, x2, fv, col = cols)
idx <- ((x1 >= -3 & x1 <= -1) & (x2 >= 5 & x2 <= 6))
x11=x1[idx]
x22=x2[idx]
fv1=fv[idx]
plot3d( x11, x22, fv1, col = "black" , size=5, add=TRUE)
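Finally, the probability itself is just the fraction of accepted samples that fall in the rectangle; assuming `idx` and `x1` refer to the accepted samples as in the code above, one line closes the loop,
sum(idx) / length(x1)    # proportion of accepted samples falling inside the rectangle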
In addition, the color spectrum is a handy way to show the density of a function,
myColorRamp <- function(colors, values) {
  v <- (values - min(values)) / diff(range(values))   # rescale values to [0, 1]
  x <- colorRamp(colors)(v)
  rgb(x[, 1], x[, 2], x[, 3], maxColorValue = 255)
}
Comments
Good! Too many pictures! It would have been better if the subject was less technical.
Thanks for your feedback, sure, I will keep it simpler and more intuitive for the next post.
I do like this post! Thank you for the clear and concise explanation. But I am wondering why you use "tt=(fv/(max(fv/nrm)*nrm))" to decide the argument of the maximum.
Hi Mei,
Thanks for stopping by my post!
Actually, tt contains the probability of accepting or rejecting each of the random samples in the vector r generated in the previous line, and max(fv/nrm) is the constant M that I defined in the description of the algorithm. The samples whose value in r is less than the corresponding value in tt are accepted, and these accepted samples approximate the unnormalized function f.
For more details, I refer you to the following excellent paper, which has a great introduction to MCMC and its applications in machine learning:
http://www.cs.ubc.ca/~arnaud/andrieu_defreitas_doucet_jordan_intromontecarlomachinelearning.pdf
If you have any further questions or comments, you are more than welcome to share them with me; I would be glad to help.