**Step 1: Compile Bazel from source**

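A minimal sketch of the typical steps for building Bazel from its distribution archive; the version number is an illustrative assumption, not necessarily the author's exact commands:

```shell
# Illustrative version; use the release your TensorFlow version requires.
BAZEL_VERSION=0.5.2
wget https://github.com/bazelbuild/bazel/releases/download/${BAZEL_VERSION}/bazel-${BAZEL_VERSION}-dist.zip
mkdir bazel-dist && unzip bazel-${BAZEL_VERSION}-dist.zip -d bazel-dist
cd bazel-dist
./compile.sh                      # produces output/bazel
export PATH="$PWD/output:$PATH"   # make the bazel binary available
```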

**Step 2: Compile Swig from source**

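A sketch of the usual configure-and-make flow; the version, download URL, and install prefix are assumptions (Swig also needs the PCRE headers installed):

```shell
SWIG_VERSION=3.0.12
wget https://downloads.sourceforge.net/swig/swig-${SWIG_VERSION}.tar.gz
tar xzf swig-${SWIG_VERSION}.tar.gz && cd swig-${SWIG_VERSION}
./configure --prefix="$HOME/local"   # install without root
make && make install
export PATH="$HOME/local/bin:$PATH"
```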

**Step 3: Compile Tensorflow**

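A sketch following TensorFlow's standard from-source procedure; the tag and output paths are illustrative:

```shell
git clone https://github.com/tensorflow/tensorflow
cd tensorflow
git checkout v1.2.0   # or a master commit; see Troubleshooting below
./configure           # answer the Python/CUDA prompts
bazel build --config=opt //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip install /tmp/tensorflow_pkg/tensorflow-*.whl
```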

**Troubleshooting**

**Error 1: protoc failed, version `GLIBCXX_3.4.18' not found (solution)**

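A commonly reported fix (an assumption here, not necessarily the author's original solution) is to make a libstdc++ that actually provides `GLIBCXX_3.4.18` visible to the loader:

```shell
# Check which GLIBCXX versions the system libstdc++ provides.
strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX
# If 3.4.18 is missing, point the loader at a newer gcc runtime
# (the path below is a placeholder for your own installation).
export LD_LIBRARY_PATH=/path/to/newer/gcc/lib64:$LD_LIBRARY_PATH
```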

**Error 2: image_ops_gpu.cu build error**


**Solution:**

This error occurred when I was trying to compile v1.2.0. Instead, I compiled from master (commit 7c10b24de3cb2408441dfd98e1a1a1e8f43f3a7d) and the problem was resolved.

**Revision History**

Jun 19, 2017: updated for TF 1.2.0

**Add user at the server**

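A sketch following the Pro Git chapter in the references (the user name git is the usual convention):

```shell
sudo adduser git                 # dedicated account for repository access
sudo su - git                    # continue as the new user
mkdir ~/.ssh && chmod 700 ~/.ssh
touch ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys
```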

**Generate public and private keys**

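On the local machine (the comment string is an example):

```shell
ssh-keygen -t rsa -b 4096 -C "you@example.com"
# Keys land in ~/.ssh/: id_rsa (private) and id_rsa.pub (public).
```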

**Set up public key on the server**

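Either of the following works; the hostname is a placeholder:

```shell
ssh-copy-id git@server.example.com
# Or append the key manually:
cat ~/.ssh/id_rsa.pub | ssh git@server.example.com "cat >> ~/.ssh/authorized_keys"
```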

**Set up private key locally**

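Locally, keep the private key readable only by you and verify the login (hostname again a placeholder):

```shell
chmod 600 ~/.ssh/id_rsa
ssh git@server.example.com   # should log in without a password prompt
```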

**Create bare repo on the server**

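On the server, as the git user (the repository name is an example):

```shell
mkdir -p ~/project.git && cd ~/project.git
git init --bare
```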

**Add remote to local repo and push**

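On the local machine, inside an existing repository (server and repository names are examples):

```shell
git remote add origin git@server.example.com:project.git
git push origin master
```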

**Add more users**

To add more users, you simply append their public keys to the git user's ~/.ssh/authorized_keys.

**Limit git user privilege**

To limit git users to only push and pull operations, we can change the shell of user git to git-shell.

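A sketch, again following the Pro Git chapter:

```shell
cat /etc/shells                             # is git-shell already listed?
which git-shell | sudo tee -a /etc/shells   # if not, register it
sudo chsh git -s "$(which git-shell)"       # restrict the git user
```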

**References**:

https://git-scm.com/book/en/v2/Git-on-the-Server-Setting-Up-the-Server

**Roadmap**

This post starts by introducing energy-based models, including their graphical representation and learning by gradient descent on the log-likelihood. It then obtains Boltzmann machines by choosing a specific energy function, and finally discusses Restricted Boltzmann Machines, which restrict the connections of Boltzmann machines.

**Notations**

There are three basic notations in this post: \(x\), \( h\) and \( y\). \(x\) represents an input variable that takes the form of a vector \( x=\left \{ x_{1}, x_{2},\dots ,x_{N} \right \}\), where \( x_{i}\) denotes the i-th feature of \(x\). \( h\) represents a hidden variable that also takes the form of a vector \( h=\left \{ h_{1}, h_{2},\dots ,h_{N} \right \}\), where \( h_{i}\) denotes the i-th hidden variable. Hidden variables are sometimes called latent variables in more statistical language. \( y\) represents the label of a given input to be predicted.

As an example, in image recognition, \( x\) is an image of interest, where the \( x_{i}\) are its individual pixels. \( h\) contains the hidden features/descriptors that serve as high-level representations of the image. Finally, \( y\) is the label of the image.

**Energy-Based Models**

Different from predictive/supervised models such as the multilayer perceptron (a feedforward neural network), *energy-based models* capture the joint probability distribution \( P(\mathbf{x}) \) over the configurations of the input variables \( \left \{ x_{1}, x_{2},\dots ,x_{N} \right \}\), rather than the conditional probability \( P(y|\mathbf{x}) \) used in supervised learning.

Energy-based models associate a scalar energy (a compatibility value) with each configuration of the random variables. The energy of a configuration determines how probable it is for the model to be in that configuration. Because energies are uncalibrated (unnormalized and measured in arbitrary units), only relative values of energies carry meaningful information about the probability distribution. Thus, the probability distribution of an energy-based model needs to be normalized using the Gibbs measure **[3]**, which takes the form

\( P(x)=\frac{e^{-Energy(x)}}{Z}\)

The normalizing factor \( Z\) is called the *partition function* or *normalization function* defined by

\( Z=\sum_{x}^{ }e^{-Energy(x)}\)

Because the energy is negated in the exponent, this probability distribution favors low energy: configurations with low energy have high probability. *Learning* in energy-based models corresponds to modifying the energy function to reshape the energy surface, so that the desired patterns in the data lie at the lowest points of the surface and therefore have the highest probability.

**A Simple Example of Energy-Based Models**

Suppose we have a simple energy-based model with only two variables \( a \) and \( b \) (where \( x=\left \{ a, b \right \}\)), and a connection weight \( w\). Here we define the energy function \( Energy(a,b)=-wab\). If \( a \) and \( b \) can only take binary values and \( w\) is 1, the energy function becomes \( Energy(a,b)=-ab\). The table below shows all possible configurations of \( a \) and \( b \), together with \( Energy(a,b)\) and the probability of each configuration after normalization. In other words, this energy-based model defines the joint probability distribution shown below.
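The enumeration behind the table can be reproduced with a short script, assuming binary means \( a,b\in \{0,1\}\):

```python
import math

# Energy(a, b) = -w*a*b with w = 1; P(a, b) = exp(-Energy(a, b)) / Z.
# "Binary" is assumed to mean a, b in {0, 1}.
w = 1
energies = {(a, b): -w * a * b for a in (0, 1) for b in (0, 1)}
Z = sum(math.exp(-e) for e in energies.values())
probs = {cfg: math.exp(-e) / Z for cfg, e in energies.items()}
for (a, b), p in sorted(probs.items()):
    print(f"a={a} b={b}  Energy={energies[(a, b)]:+d}  P={p:.3f}")
```

Only the configuration \( (1,1)\) has negative energy, so it receives the highest probability (about 0.475), while the other three configurations share about 0.175 each.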

We can visualize the probability distribution using the graph below where the size of each circle represents the probability of each configuration.

**Introducing Hidden Variables**

When modelling a set of input features, we can use the input attributes as variables and learn the relationships among them. However, directly modelling only the observed variables gives limited expressive power; hence, we introduce *hidden variables* \( h\). The joint distribution then becomes

\( P(x,h)=\frac{e^{-Energy(x,h)}}{Z}\)

\( Z=\sum_{x}^{ }\sum_{h}^{ }e^{-Energy(x,h)}\)

Since we are only interested in the distribution of the visible variables, we model the *marginal distribution* over the hidden variables, which takes the form

\( P(x)=\sum_{h}^{ }\frac{e^{-Energy(x,h)}}{Z}\)

To simplify this formula, we define the *free energy* as the marginal energy of \( x\) after summing over \( h\):

\( FreeEnergy(x)=-ln\sum_{h}^{ }e^{-Energy(x,h)}\)

Therefore we can rewrite the marginal probability distribution as:

\( P(x)=\frac{e^{-FreeEnergy(x)}}{Z}\)

\( Z=\sum_{x}^{ }e^{-FreeEnergy(x)}\)

**Gradient Learning of Energy-based Models**

We train the model by following the gradient of the log-likelihood, and we use \( \theta\) to denote the parameters of the model.

Hence, the average log-likelihood gradient over a training set is:

\( E_{\hat{P}}\left [ \frac{\partial \left ( -\log P(x) \right )}{\partial \theta} \right ]=E_{\hat{P}}\left [ \frac{\partial FreeEnergy(x)}{\partial \theta} \right ]-E_{P}\left [ \frac{\partial FreeEnergy(x)}{\partial \theta} \right ]\)

where \( \hat{P}\) is the empirical distribution of the training set, and \( P\) is the distribution defined by the model's current parameters. The resulting log-likelihood gradient tells us that, as long as we can compute the derivative of the free energy for the training examples and for samples from the model's own distribution, we can train the model tractably.

**Boltzmann Machines**

A Boltzmann machine is a type of energy-based model; we obtain one by plugging the energy function below into the framework of the previous section:

\( Energy(x,h)=-{b}’x-{c}’h-{h}’Wx-{x}’Ux-{h}’Vh\)

where \( b\) holds the biases of the visible variables, \( c\) the biases of the hidden variables, \( W\) the connection weights between hidden and visible variables, \( U\) the connection weights among the visible variables, and \( V\) the connection weights among the hidden variables. This energy function simply describes a fully connected graph with a weight on each connection.

**Gradient Learning of Boltzmann Machines**

The gradient of the log-likelihood can be written as:

\( \frac{\partial \log P(x)}{\partial \theta}=-\sum_{h}^{ }P(h\vert x)\frac{\partial Energy(x,h)}{\partial \theta}+\sum_{\tilde{x},h}^{ }P(\tilde{x},h)\frac{\partial Energy(\tilde{x},h)}{\partial \theta}\)

Since \( \frac{\partial Energy(x,h)}{\partial \theta}\) is easy to compute (for a weight, it is simply the product of the variables that weight connects), this gradient tells us that, as long as we can compute the probability of the hidden variables given the visible variables, and sample from the joint probability of the hidden and visible variables under the current model, we can learn the model tractably. The question then becomes how to compute these probabilities, that is, how to sample from \( P(h\vert x)\) and \( P(x,h)\).

**Gibbs Sampling for Conditional Probability**

Gibbs sampling of the joint distribution of \( N\) variables \( X_{1}\dots X_{N}\) is done through a sequence of \( N\) sampling sub-steps of the form:

\( X_{i}\sim P(X_{i}|X_{-i}=x_{-i})\)

which means that the sample of one variable comes from its conditional probability given all the other variables. For binary units, the conditional probability of one node given all the others is easy to compute:

\( P(X_{i}=1\vert x_{-i})=\mathrm{sigm}\left ( b_{i}+\sum_{j\neq i}^{ }W_{ij}x_{j} \right )\)

where \( b_{i}\) is the bias of node \( i\) and \( W_{ij}\) the weight between nodes \( i\) and \( j\).

This conditional probability has the same form as the activation function of neural networks. In Gibbs sampling, one sampling step samples all \( N\) variables once. As the number of sampling steps goes to \( \infty\), the sampled distribution converges to \( P(X)\).

**Gibbs Sampling for Boltzmann Machines**

To sample from \( P(h\vert x)\) and \( P(x,h)\) in a Boltzmann machine, we need to run two Gibbs chains. For \( P(h\vert x)\), we *clamp* the visible nodes \( x\) and sample \( h\), which gives the conditional probability. For \( P(x,h)\), we let all the variables run free in order to sample from the distribution of the model itself. Training Boltzmann machines this way is computationally expensive, because we need to run two Gibbs chains, and each of them needs a large number of steps to yield a good estimate of the corresponding probability. Next, we introduce the Restricted Boltzmann Machine and how it speeds up learning.

**Restricted Boltzmann Machines**

A Restricted Boltzmann machine is a Boltzmann machine without interconnections among the hidden nodes or among the visible nodes. The energy function is defined as

\( Energy(x,h)=-{b}’x-{c}’h-{h}’Wx\)

The conditional probability of the hidden variables given the visible variables is:

\( P(h\vert x)=\prod_{i}^{ }P(h_{i}\vert x),\qquad P(h_{i}=1\vert x)=\mathrm{sigm}\left ( c_{i}+W_{i}x \right )\)

where \( W_{i}\) is the \( i\)-th row of \( W\).

This conditional probability shows that the hidden variables are independent of each other given the visible variables, so we can obtain the joint conditional probability directly as a product. It also suggests that each hidden node can be seen as an expert, and that we are using a product of experts to model the joint distribution. By symmetry, the same factorization applies to \( P(x\vert h)\).

**Gibbs Sampling for Restricted Boltzmann Machines**

Gibbs sampling in a general Boltzmann machine is very slow because many sub-steps are needed for a single step of the Gibbs chain. In a Restricted Boltzmann machine, since the hidden variables are independent given the visible variables and vice versa, one step suffices to update all the variables of one layer, so the conditional probability is easy to sample. For the joint probability, we can use a hybrid Monte-Carlo method, an MCMC method involving a number of free-energy gradient computation sub-steps for each step of the Markov chain. For \( k\) Gibbs steps:

\( x_{0}\sim \hat{P}(x)\\h_{0}\sim P(h\vert x_{0})\\x_{1}\sim P(x\vert h_{0})\\h_{1}\sim P(h\vert x_{1})\\ \dots \\ x_{k} \sim P(x|h_{k-1})\)

**Contrastive Divergence**

Compared with Gibbs sampling in a general Boltzmann machine, the sampling scheme above for the Restricted Boltzmann machine is much more efficient. However, running the MCMC chain to convergence is still quite expensive. The idea of k-step Contrastive Divergence is to stop the MCMC chain after \( k\) steps, which saves a great deal of computation. One way to interpret Contrastive Divergence is that, after a few MCMC steps, we already know in which direction the error is heading, so instead of waiting for the chain to converge we can simply stop it and update the network.
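A minimal numerical sketch of CD-k for a binary RBM; the sizes, data, and learning rate are arbitrary illustrations, not tuned values:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

class RBM:
    def __init__(self, n_visible, n_hidden):
        self.W = 0.01 * rng.standard_normal((n_hidden, n_visible))
        self.b = np.zeros(n_visible)   # visible biases
        self.c = np.zeros(n_hidden)    # hidden biases

    def sample_h(self, x):
        p = sigm(self.c + x @ self.W.T)    # P(h_i = 1 | x)
        return p, (rng.random(p.shape) < p).astype(float)

    def sample_x(self, h):
        p = sigm(self.b + h @ self.W)      # P(x_j = 1 | h)
        return p, (rng.random(p.shape) < p).astype(float)

    def cd_step(self, x0, k=1, lr=0.1):
        p_h0, h = self.sample_h(x0)        # positive phase: clamp the data
        xk = x0
        for _ in range(k):                 # run the chain for k steps only
            _, xk = self.sample_x(h)
            _, h = self.sample_h(xk)
        p_hk, _ = self.sample_h(xk)
        # Update: data statistics minus k-step model statistics.
        self.W += lr * (p_h0.T @ x0 - p_hk.T @ xk) / len(x0)
        self.b += lr * (x0 - xk).mean(axis=0)
        self.c += lr * (p_h0 - p_hk).mean(axis=0)

data = (rng.random((64, 6)) < 0.5).astype(float)
rbm = RBM(n_visible=6, n_hidden=4)
for _ in range(100):
    rbm.cd_step(data, k=1)
```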

**References**

[1] Learning Deep Architectures for AI, Yoshua Bengio, 2009, link

[2] A Tutorial on Energy-based Learning, Yann LeCun, etc., 2006, link

[3] Gibbs Measure, Wikipedia, link

**Changes**

`tf.nn.rnn_cell` -> `tf.contrib.rnn`

`tf.nn.seq2seq` -> `tf.contrib.legacy_seq2seq`

**Notations**

\(\mathrm{RSS}\) stands for the Residual Sum of Squares, \(\beta\) denotes the parameters as a column vector, \(X\) is an \(N\times p\) matrix whose rows are the input vectors, \(p\) is the number of entries/features of each vector, and \(y\) denotes the labels as a column vector. That is,

\( X = \begin{pmatrix}
\text{—} & x_{1} & \text{—} \\
\text{—} & x_{2} & \text{—} \\
& \vdots & \\
\text{—} & x_{N} & \text{—}
\end{pmatrix},\quad
\beta = \begin{pmatrix}
\beta_{1}\\ \beta_{2}\\ \vdots\\ \beta_{p}
\end{pmatrix},\quad
y = \begin{pmatrix}
y_{1}\\ y_{2}\\ \vdots\\ y_{N}
\end{pmatrix}.
\)

**Method 1. Vector Projection onto the Column Space**

This is the most *intuitive* way to understand the normal equation. The optimization of linear regression is equivalent to finding the projection of the vector \(y\) onto the column space of \(X\), which is spanned by the columns of \(X\) as the expansion below shows.

\(X\beta=\begin{pmatrix}
\text{—} & x_{1} & \text{—} \\
\text{—} & x_{2} & \text{—} \\
& \vdots & \\
\text{—} & x_{N} & \text{—}
\end{pmatrix}
\begin{pmatrix}
\beta_{1}\\ \beta_{2}\\ \vdots\\ \beta_{p}
\end{pmatrix}=
\begin{pmatrix}
\beta_{1}x_{11}+ & \cdots &+\beta_{p}x_{1p} \\
\beta_{1}x_{21}+ & \cdots &+\beta_{p}x_{2p} \\
\vdots& \vdots & \vdots \\
\beta_{1}x_{N1}+ & \cdots &+\beta_{p}x_{Np}
\end{pmatrix}\\
=\beta_{1}\begin{pmatrix}
x_{11}\\ x_{21}\\ \vdots\\ x_{N1}
\end{pmatrix}
+\cdots+\beta_{p}\begin{pmatrix}
x_{1p}\\ x_{2p}\\ \vdots\\ x_{Np}
\end{pmatrix}\)

As the projection is denoted by \(\widehat{y}=X\beta\), the optimal \(\beta\) is reached when the error vector \(y-X\beta\) is orthogonal to the column space of \(X\), that is,

\(X^{T}(y-X\beta)=0.\tag 1\)

Solving this gives:

\(\beta=\left ( X^{T}X \right )^{-1}X^{T}y.\)

Here \(X^{T}X\) is known as the Gram matrix and \(X^{T}y\) as the moment matrix. Intuitively, the Gram matrix captures the correlations among the features, and the moment matrix captures the contribution of each feature to the regression outcome.
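A quick numerical check of the normal equation against a library solver; the data here is randomly generated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3
X = rng.standard_normal((N, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.01 * rng.standard_normal(N)

# Normal equation: beta = (X^T X)^{-1} X^T y.
beta = np.linalg.inv(X.T @ X) @ X.T @ y
# Library least-squares solver for comparison.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

In practice, prefer `np.linalg.solve(X.T @ X, X.T @ y)` or `lstsq` over forming the explicit inverse, which is less numerically stable.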

**Method 2. Direct Matrix Differentiation**

This is the most direct way: expand the residual sum of squares and differentiate it with respect to \(\beta\).

\(S(\beta)=(y-X\beta)^{T}(y-X\beta)=y^{T}y-\beta^{T}X^{T}y-y^{T}X\beta+\beta^{T}X^{T}X\beta\\=y^{T}y-2\beta^{T}X^{T}y+\beta^{T}X^{T}X\beta.\)

Differentiate \(S(\beta)\) w.r.t. \(\beta\):

\(-2y^{T}X+\beta^{T}\left ( X^{T}X+\left ( X^{T}X \right )^{T} \right )=-2y^{T}X+2\beta^{T}X^{T}X=0,\)

Setting this derivative to zero and solving for \(\beta\) gives:

\(\beta=\left ( X^{T}X \right )^{-1}X^{T}y.\)

**Method 3. Matrix Differentiation with the Chain Rule**

This is the same differentiation, carried out with the chain rule:

\(\frac{\partial S(\beta) }{\partial \beta}=\frac{\partial (y-X\beta)^{T}(y-X\beta) }{\partial (y-X\beta)}\frac{\partial (y-X\beta)}{\partial \beta}=-2(y-X\beta)^{T}X=0,\)

and solving for \(\beta\) gives:

\(\beta=\left ( X^{T}X \right )^{-1}X^{T}y.\)

This method requires an understanding of matrix differentiation of the quadratic form: \(\frac{\partial x^{T}Wx}{\partial x}= x^{T}(W+W^{T}).\)

**Method 4. Without Matrix Differentiation**

We can rewrite \(S(\beta)\) as following:

\(S(\beta)=\left \langle \beta, \beta \right \rangle-2\left \langle \beta, (X^{T}X)^{-1}X^{T}y \right \rangle+\left \langle (X^{T}X)^{-1}X^{T}y, (X^{T}X)^{-1}X^{T}y \right \rangle+C,\)

where \(\langle \cdot ,\cdot \rangle\) is the inner product defined by

\( \langle x,y\rangle =x^{\rm {T}}(\mathbf {X} ^{\rm {T}}\mathbf {X} )y.\)

The idea is to rewrite \(S(\beta)\) in the form \(S(\beta)=(\beta-a)^{2}+b\) with respect to this inner product, so that the minimizer \(\beta=a=(X^{T}X)^{-1}X^{T}y\) can be read off exactly.

**Method 5. Statistical Learning Theory**

An alternative method to derive the normal equation arises from the statistical learning theory. The aim of this task is to minimize the expected prediction error given by:

\(\mathrm{EPE}(\beta)=\int (y-x^{T}\beta)^{2}\mathrm{Pr}(dx, dy),\)

where \(x\) stands for a column vector of random variables, \(y\) denotes the target random variable, and \(\beta\) denotes a column vector of parameters (Note the definitions are different from the notations before).

Differentiating \(\mathrm{EPE}(\beta)\) w.r.t. \(\beta\) gives:

\(\frac{\partial \mathrm{EPE}(\beta)}{\partial \beta}=\int 2(y-x^{T}\beta)(-1)x^{T}\mathrm{Pr}(dx, dy)\).

Before we proceed, let’s check the dimensions to make sure the partial derivative is correct. \(\mathrm{EPE}\) is the expected error, a scalar (\(1\times1\)). \(\beta\) is an \(N\times 1\) column vector. According to the Jacobian convention in vector calculus, the resulting partial derivative should take the form

\(\frac{\partial EPE}{\partial \mathbf{\beta}}= \left (\frac{\partial EPE}{\partial \beta_{1}}, \frac{\partial EPE}{\partial \beta_{2}}, \dots, \frac{\partial EPE}{\partial \beta_{N}} \right )\),

which is a \(1\times N\) vector. Looking back at the right-hand side of the equation above, \(2(y-x^{T}\beta)(-1)\) is a scalar while \(x^{T}\) is a row vector, resulting in the same \(1\times N\) dimension, so the partial derivative above is correct. This derivative mirrors the relationship between the expected error and the way to adjust the parameters so as to reduce that error: \(2(y-x^{T}\beta)(-1)\) is the error incurred by the current parameter configuration \(\beta\), and \(x^{T}\) holds the values of the input attributes, so the derivative equals the error times the scale of each input attribute. Put another way, the contribution of each parameter \(\beta_{i}\) to the error is monotonic in both the error \(2(y-x^{T}\beta)(-1)\) and the scalar \(x_{i}\) multiplying \(\beta_{i}\).

Now, let’s go back to the derivation. Because \(2(y-x^{T}\beta)(-1)\) is \(1\times1\), we can rewrite it with its transpose:

\(\frac{\partial \mathrm{EPE}(\beta)}{\partial \beta}=\int 2(y-x^{T}\beta)^{T}(-1)x^{T}\mathrm{Pr}(dx, dy)\).

Solving \(\frac{\partial \mathrm{EPE}(\beta)}{\partial \beta}=0\) gives:

\( E\left [y^{T}x^{T}-\beta^{T}xx^{T} \right ]=0\\ E\left [\beta^{T}xx^{T} \right ]=E\left [y^{T}x^{T} \right ]\\ E\left [xx^{T}\beta \right ]=E\left [xy \right ]\\ \beta=E\left [xx^{T} \right ]^{-1}E\left [ xy \right ].\)

**References**

[1] Wikipedia: Linear Least Squares

[2] The Elements of Statistical Learning

He who refuses to do arithmetic is doomed to talk nonsense.

-John McCarthy

Examples of inferential statistics include testing hypotheses and deriving estimates, while examples of descriptive statistics include sample size, mean, and standard deviation.

Inferential statistics aims to provide insight into the process that generates the data, which requires a statistical model of how the data are generated/sampled, followed by methods to distil the underlying properties. Descriptive statistics, by contrast, aims to understand the observed data alone.

| | Inferential Statistics | Descriptive Statistics |
|---|---|---|
| goal | infer the underlying distribution from and beyond the observed data | summarize the observed data |
| generative | yes | no |
| requires a statistical model | yes | no |
| examples | testing hypotheses, deriving estimates | sample size, mean, standard deviation |

References:

[1] Wikipedia: statistical inference

[2] Wikipedia: descriptive statistics


It turns out that I had set the PYTHONPATH variable in my ~/.bashrc, so pip installed packages into that specified directory. When I then tried to import a package, Python could not find it in the virtualenv directories. After removing PYTHONPATH from my ~/.bashrc, everything works like a charm.
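To diagnose this kind of problem, it helps to inspect the search path directly:

```shell
# Where is Python actually looking for packages?
python3 -c "import sys; print('\n'.join(sys.path))"
# Inside a clean virtualenv this should print nothing:
echo "$PYTHONPATH"
```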

In this post, I am going to talk about the Beta distribution and some intuitive interpretations behind it.

**An Example**

Suppose we have two coins (A and B), and we run a statistical experiment to identify whether these coins are biased. For coin A, we tossed 5 times and the results are: 1,0,0,0,0 (1 indicates Tail and 0 indicates Head). For coin B, we tossed 10 times and the results are: 1,1,0,0,0,0,0,0,0,0. The observed probability of Tail is identical for the two coins: 0.2. *Is it safe to say that both coins equally favour the Tail?*

The answer is *no*. From the perspective of the *law of large numbers* (LLN),

the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed.

That means that, although the standard deviations of the two distributions are the same, the standard error for A will be larger than that for B, because of A's smaller sample size.

(Please note that the standard error of a sample average measures the rough size of the difference between the population average and the sample average.)

Now we know that the expected outcomes of coin A and coin B are the same, while the confidence in these two estimates differs. Let \( P_{A}\) denote the probability for A to be Tail, and \( P_{B}\) the probability for B to be Tail. We want to know the uncertainty for \( P_{A}\) and \( P_{B}\) to take each value ranging from 0 to 1. In other words, we want *the probability (uncertainty) of a probability*.

Below is the probability (uncertainty) that we want.

(Please find the code that generates the below image here)

In the above graph, the red line corresponds to coin A, the green line to coin B, and the blue line to another coin with 80 Heads and 20 Tails. From the red curve (coin A), we can see that, even though there is one Tail in the five tosses, the probability of Tail peaks around zero: the most probable probability for coin A to be Tail is close to zero. From the green curve (coin B), the peak is close to 0.15, and from the blue curve, close to 0.2. Also, although the expected probabilities of Tail are the same for all three coins (0.2), the shapes of the distributions differ, and the more data we collect, the more the distribution concentrates in a small region. *That is the Beta distribution.*
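The three curves can be reproduced without plotting; the \( (\alpha ,\beta )\) pairs below are assumptions chosen to match the peaks described, Beta(1, 4) for coin A, Beta(2, 8) for coin B, and Beta(20, 80) for the blue coin:

```python
from math import gamma

def beta_pdf(x, a, b):
    # B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b) normalizes the pdf.
    B = gamma(a) * gamma(b) / gamma(a + b)
    return x ** (a - 1) * (1 - x) ** (b - 1) / B

coins = {"A": (1, 4), "B": (2, 8), "blue": (20, 80)}
xs = [i / 1000 for i in range(1, 1000)]     # grid over (0, 1)
modes = {name: max(xs, key=lambda x: beta_pdf(x, a, b))
         for name, (a, b) in coins.items()}
means = {name: a / (a + b) for name, (a, b) in coins.items()}
```

All three means equal 0.2, while the modes sit near 0, 0.125, and 0.19 respectively, matching the peaks of the red, green, and blue curves.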

**Definition**

The pdf of Beta Distribution is:

\( P(x)=\left\{\begin{matrix}
\frac{(1-x)^{\beta -1}x^{\alpha -1}}{B(\alpha ,\beta )} & ,x\in [0,1]\\
0 & ,\text{otherwise}
\end{matrix}\right.\)

where \( B(\alpha ,\beta )\) is a normalizing constant that makes the pdf integrate to 1. It can be evaluated by repeated integration by parts:

\( B(\alpha ,\beta ) \\= \int_{0}^{1}y^{\alpha -1}(1-y)^{\beta -1}dy\\=y^{\alpha -1}(\frac{-(1-y)^{\beta }}{\beta })\bigg\vert_{0}^{1}+\frac{\alpha -1}{\beta }\int_{0}^{1}y^{\alpha -2}(1-y)^{\beta}dy \\= 0+\frac{\alpha -1}{\beta }\int_{0}^{1}y^{\alpha -2}(1-y)^{\beta}dy \\=\frac{\alpha -1}{\beta }B(\alpha -1,\beta +1) =\frac{(\alpha -1)(\alpha -2)\cdots1}{\beta (\beta +1)\cdots(\beta +\alpha -2)}\int_{0}^{1}(1-y)^{\alpha +\beta -2}dy \\=\frac{(\alpha -1)(\alpha -2)\cdots1}{\beta (\beta +1)\cdots(\beta +\alpha -1)} \\=\frac{\Gamma (\alpha )\Gamma (\beta )}{\Gamma (\alpha +\beta )}\)

where \( \Gamma(x)\) is the Gamma Function.

\( \Gamma(x)=(x-1)!\) for positive integers \( x\).

The Beta distribution can express a wide range of shapes for its pdf; the above graph shows a variety of pdfs from the Beta distribution.

**Mean**

The expected value of the Beta distribution is \( \frac{\alpha }{\alpha +\beta }\), which answers the intuitive question of why coin A and coin B have the same expected value:

\( \eta = \int_{0}^{1}xP(x)dx=\int_{0}^{1}x\frac{(1-x)^{\beta -1}x^{\alpha -1}}{B(\alpha ,\beta )}dx=\frac{\alpha }{\alpha +\beta }\)

**Variance**

The variance of a Beta distribution is:

\( var(X)=E[(x-\eta )^2]=\frac{\alpha \beta }{(\alpha +\beta )^2(\alpha +\beta +1)}\)

This answers the earlier observation: even when the expected values are the same, the dispersion of the Beta distribution becomes smaller and smaller as the number of trials grows.

**Conjugate Prior**

One important application of the Beta distribution is that it can be used as a conjugate prior for binomial distributions in Bayesian analysis.

In Bayesian probability theory, if the posterior distributions \( P(h\vert D)\) are in the same family as the prior probability distribution \( P(h)\), the prior and posterior are called **conjugate distributions**, and the prior is called a **conjugate prior**. A conjugate prior is an algebraic convenience, giving a closed-form expression for the posterior; otherwise a difficult numerical integration may be necessary. Further, conjugate priors may give intuition by showing more transparently how a likelihood function updates a prior distribution.

For a binomial distribution where we observed \( \alpha\) successes and \( \beta\) failures, we can use this information as a prior when modelling a further \( s\) successes and \( f\) failures.

The prior is a Beta distribution:

\( P(x)=\frac{(1-x)^{\beta -1}x^{\alpha -1}}{B(\alpha ,\beta )}\)

The likelihood is a Binomial distribution:

\( P(s,f \vert x)=\binom{s+f}{s}x^s(1-x)^f\)

The posterior is another Beta distribution:

\( P(x|s,f)=\frac{P(s,f|x)P(x)}{\int P(s,f|x)P(x)dx}=\frac{x^{s+\alpha -1}(1-x)^{f+\beta -1}}{B(s+\alpha ,f+\beta )}=Beta(s+\alpha ,f+\beta )\)

This posterior distribution can then be used as the prior for further samples, with the hyperparameters simply accumulating each extra piece of information as it arrives.
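The accumulation is literally just adding counts; a tiny sketch (the batch split is an arbitrary example):

```python
def update(alpha, beta, s, f):
    # Beta(alpha, beta) prior + s successes, f failures
    # -> Beta(alpha + s, beta + f) posterior.
    return alpha + s, beta + f

# Ten tosses in one batch, starting from a uniform Beta(1, 1) prior...
a, b = update(1, 1, 2, 8)
# ...give the same posterior as the same tosses in two batches of five.
a2, b2 = update(*update(1, 1, 1, 4), 1, 4)
```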

**The Problem**

**The Solution**

1. Activate Google Cloud Shell

2. Add a rule to the firewall to allow incoming traffic on port 22:

`gcloud compute firewall-rules create allowssh --allow tcp:22 --source-ranges 0.0.0.0/2`

**The first principle is that you must not fool yourself — and you are the easiest person to fool. – Richard Feynman**

It is often helpful to read with questions in mind. This post summarizes a list of questions worth asking while reading a paper. I would like to make this post a living document about how to read a paper, as I read more materials and gain more understanding of scientific research. The content of this post is largely from the references listed at the end.

- What type of paper is this?
- When was it written?
- Which other papers is it related to? (reference, citation)
- Which theoretical bases were used to analyze the problem?
- Do the assumptions appear to be valid?
- Is the logic of the paper clear and justifiable, given the assumptions, or is there a flaw in the reasoning?
- If the authors present data, did they gather the right data to substantiate their argument, and did they appear to gather it in the correct manner?
- Did they interpret the data in a reasonable manner?
- Would other data be more compelling?
- Which parts do you not understand?
- What are the paper’s main contributions?
- Is this paper well written?
- Are results shown with error bars, so that conclusions are statistically significant?
- If the authors attempt to solve a problem, are they solving the right problem?
- Are there simple solutions that the authors do not seem to have considered?
- What are the limitations of the solution?
- What are the good ideas in this paper?
- Do these ideas have other applications or extensions that the authors might not have thought of?
- Can the good ideas be generalized even further?
- What are the major findings of the paper?
- What surprised you or struck you as interesting?
- What questions are still unanswered?
- Are there possible improvements that might make important practical differences?
- If you were going to start doing research from this paper, what would be the next thing you would do?
- Can you summarize the background in five sentences or less?
- Can you summarize the paper in five sentences or less?
- What is the question that authors started with and what is the answer?
- What is the general and specific question the author is trying to answer?
- What is the scientific contribution of the paper?
- What are the authors trying to do to answer the question?
- Do the results answer the specific questions? What do you think they mean?
- Can you draw a diagram for each experiment, showing exactly what the authors did?
- Can you write one or more paragraphs summarizing the results of each experiment, each figure, and each table?
- Are the ideas really novel, or have they appeared before?
- Can you list an outline of the main points of the paper?
- What do you think is the quality of the ideas and its potential impact?
- What do other researchers say about this paper?
- How can I apply this approach in my work?
- How could future studies be improved?

**References**:

[1] How to read a paper, S.Keshav, link

[2] How to read a research paper, link

[3] How to read and understand a scientific paper: a guide for non-scientists, link

[4] How to read a paper, link

[5] Efficient reading of papers in science and technology, link

[6] How to read and review a scientific journal article, link