Gibbs Sampling Derivation for Gene Motif

bigpapi34 · New member · Joined Oct 19, 2016 · Messages: 1
Hello,

I have no idea where to start with this problem. I'm very overwhelmed by all the probability notation. Any help or insights are greatly appreciated. Thank you in advance!

Here is the problem:



7. Gibbs Sampling In-depth

Gibbs sampling is one of a family of approaches, called Markov chain Monte Carlo algorithms, for generating samples from a distribution given only limited information about it, such as only its pdf or, in the case of Gibbs sampling, only the conditionals. Gibbs sampling is particularly useful when the conditionals are much easier to formulate than the full joint. Given only a procedural description of Gibbs sampling (e.g. from problem 4), it may not be obvious that it converges to a meaningful answer. In this problem, we’ll walk through a simplified version of the algorithm in order to understand how and why it works.

We will look for a motif of length (L + 1) in just two sequences, X = x_1 ··· x_{M+L} and Y = y_1 ··· y_{N+L}. Let P(I = i, J = j) be the joint probability that a motif starts at position i in sequence X, i ∈ {1, ..., M}, and at position j in sequence Y, j ∈ {1, ..., N}. If we knew this distribution, then we could find the most likely positions for a motif as argmax_{i,j} P(I = i, J = j). Unfortunately, we do not know P(I = i, J = j).

(a) Design expressions for the conditional distributions P(I = i | J = j) and P(J = j | I = i). These expressions need not be closed form. Conceptually, P(I = i | J = j) answers the question: Given that some motif appears in Y at y_j ··· y_{j+L}, what is the probability that the same motif appears in X at x_i ··· x_{i+L}? (Hint: A simple heuristic would define P(I = i | J = j) to be proportional to the percent identity of x_i ··· x_{i+L} and y_j ··· y_{j+L}.) It is up to you how, or whether, to incorporate a background model. In any case, ensure that for all j, Σ_{i=1}^{M} P(I = i | J = j) = 1, and similarly for all i, Σ_{j=1}^{N} P(J = j | I = i) = 1.
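To make the percent-identity heuristic from the hint concrete, here is a small sketch. The sequences, the function names, and the pseudocount `eps` are all made up for illustration (the pseudocount just guarantees each row is a proper distribution even when no window matches); indexing is 0-based, so start positions run over 0, ..., M-1.

```python
def percent_identity(a, b):
    """Fraction of positions at which two equal-length strings agree."""
    return sum(c1 == c2 for c1, c2 in zip(a, b)) / len(a)

def conditional_matrix(X, Y, L, eps=1e-6):
    """A[j][i] = P(I = i | J = j), proportional to the percent identity of
    the windows X[i:i+L+1] and Y[j:j+L+1].  With len(X) = M + L there are
    M = len(X) - L start positions in X, and similarly N in Y."""
    M, N = len(X) - L, len(Y) - L
    A = []
    for j in range(N):
        w = [percent_identity(X[i:i + L + 1], Y[j:j + L + 1]) + eps
             for i in range(M)]
        s = sum(w)
        A.append([wi / s for wi in w])  # normalize so each row sums to 1
    return A

# Toy data (made up): motif length L + 1 = 3.
X, Y, L = "ACGTACGT", "TTACGTTT", 2
A = conditional_matrix(X, Y, L)
assert all(abs(sum(row) - 1) < 1e-9 for row in A)  # rows are distributions
```

With these toy sequences, "ACG" occurs at Y position 2 and at X positions 0 and 4, so the row A[2] puts its largest mass on those two X positions.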

The Gibbs sampler initializes with an arbitrary position in X, which we'll call I0. We stochastically choose J0 (a position in Y) according to P(J0 = j | I0 = i), one of the conditional distributions you defined. Then, we choose I1 according to P(I1 = i | J0 = j). Then we choose J1, I2, J2, I3, J3, I4, ...
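The alternating procedure above can be sketched in a few lines. This is only illustrative: the 2x2 conditionals are made-up numbers, and note the indexing convention here stores B[i] as the distribution of J given I = i (the transpose of the b_ji layout used in part (b) below).

```python
import random

def gibbs_positions(A, B, i0, steps, seed=0):
    """Alternate J_t ~ P(J | I_t) and I_{t+1} ~ P(I | J_t), starting at i0.
    A[j] is the distribution of I given J = j; B[i] is the distribution
    of J given I = i (lists of lists, 0-based).  Returns the visited I's."""
    rng = random.Random(seed)
    i, visited = i0, [i0]
    for _ in range(steps):
        j = rng.choices(range(len(B[i])), weights=B[i])[0]  # J_t  | I_t
        i = rng.choices(range(len(A[j])), weights=A[j])[0]  # I_{t+1} | J_t
        visited.append(i)
    return visited

# Toy conditionals (made up for illustration):
A = [[0.9, 0.1], [0.2, 0.8]]   # A[j][i] = P(I = i | J = j)
B = [[0.7, 0.3], [0.4, 0.6]]   # B[i][j] = P(J = j | I = i)
path = gibbs_positions(A, B, i0=0, steps=100)
assert len(path) == 101
```

The sequence of I's produced this way is exactly the Markov chain analyzed in parts (b) and (c).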

(b) We can view the conditional distributions you defined in part (a) as transition matrices in this sequence of positions, giving the probability of the next choice based on the last choice. Assume you have computed the transition matrices A and B, where a_ji = P(I = i | J = j) and b_ji = P(J = j | I = i). Give a formula for the transition matrix C, where c_ii′ = P(I_{t+1} = i′ | I_t = i).
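As a sanity check on the matrix bookkeeping, the sketch below combines two random row-stochastic matrices by marginalizing over the intermediate J (the law of total probability), and verifies the result is again row-stochastic. The matrices are arbitrary stand-ins, and note the storage convention here keeps I on the rows of B, i.e. b[i, j] = P(J = j | I = i), the transpose of the b_ji layout in the problem statement.

```python
import numpy as np

def row_stochastic(n, m, rng):
    """Random n x m matrix whose rows each sum to 1 (stand-in conditional)."""
    W = rng.random((n, m))
    return W / W.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, M = 4, 5
A = row_stochastic(N, M, rng)   # a[j, i] = P(I = i | J = j)
B = row_stochastic(M, N, rng)   # b[i, j] = P(J = j | I = i)  (note: transposed layout)

# Marginalize over the intermediate J:
# c[i, i'] = sum_j P(J = j | I_t = i) * P(I_{t+1} = i' | J = j)
C = B @ A
assert C.shape == (M, M)
assert np.allclose(C.sum(axis=1), 1.0)  # C is again row-stochastic
```

Whatever formula you derive for part (b), it should pass this row-sum check: each row of C must be a probability distribution over next positions.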

(c) Now, consider a Markov chain governed by the transition matrix C, which generates the sequence I1, I2, I3, .... The states in this Markov chain correspond to positions in the sequence X. (We could similarly derive a Markov chain for Y.) What is P(I_6878 = i | I_0 = i_0) in terms of C?
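For intuition about long runs of the chain (not the answer to (c), just the n-step mechanics on a toy 3-state C with made-up entries): the n-step transition probabilities are the entries of the matrix power C^n, and for a well-behaved (ergodic) chain the rows of C^n all converge to the same stationary distribution.

```python
import numpy as np

# Toy 3-state row-stochastic transition matrix (values made up).
C = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

# The (i0, i) entry of C^n is the n-step probability P(I_n = i | I_0 = i0).
Cn = np.linalg.matrix_power(C, 50)
assert np.allclose(Cn.sum(axis=1), 1.0)        # still row-stochastic
# Rows agree after many steps: the start state no longer matters.
assert np.allclose(Cn[0], Cn[1], atol=1e-6)
assert np.allclose(Cn[1], Cn[2], atol=1e-6)
```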

We now have an exact solution to this motif finding problem for two sequences. In the case of n sequences, we would have n conditional distributions, each dependent on the motif positions in the other n − 1 sequences.

(d) Describe what the transition matrices from part (b) would be in this case. Would it still be practical to use this approach? Explain your answer.

(e) In the two-sequence case, how could you approximate P (I = i) by simulating the Markov chain? Based on this intuition, outline a strategy for estimating the positions of the motif.
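One way to check the intuition behind (e) on a toy chain (again with made-up numbers): run the chain for many steps and tally how often each state is visited. The visit frequencies approximate the long-run distribution over positions, and the most-visited state is a simulated estimate of argmax_i P(I = i).

```python
import random
from collections import Counter

# Toy 3-state row-stochastic transition matrix (values made up).
C = [[0.5, 0.3, 0.2],
     [0.1, 0.6, 0.3],
     [0.2, 0.2, 0.6]]

rng = random.Random(0)
i, counts, steps = 0, Counter(), 100_000
for _ in range(steps):
    i = rng.choices(range(3), weights=C[i])[0]  # one step of the chain
    counts[i] += 1

freq = [counts[s] / steps for s in range(3)]    # empirical estimate of P(I = i)
best = max(range(3), key=lambda s: freq[s])     # estimated motif position
assert abs(sum(freq) - 1) < 1e-9
```

In the motif problem, `best` plays the role of the estimated motif start position in X; running the analogous chain for Y (or, in the n-sequence case, cycling through all n conditionals) estimates the remaining positions.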


 