Identification of Transcription Factor Cooperativity via Stochastic System Model
Supplementary
materials
We have some methods to avoid overfitting in this paper. First, we use cubic spline method to interpolate the data points while the number of TFs binding to the target gene exceeds 6, which implied the number of parameters to be estimated is 23. Only 5% of target genes have more than 6 TFs. The more TFs binding to a target gene, the more data points we interpolated by the cubic spline method to avoid overfitting. Second, in the method we proposed, we used p-value to determine the cooperativity among TFs, which could reduce uncertainty due to small data to avoid overfitting. Third, since data-shuffling technique is a good way to test the overfitting, we randomly shuffled the microarray expression data or genome-wide location data to prove that there is no overfitting in the method we proposed.
By comparing the results of expression data shuffling and
genome-wide location data shuffling with the result of original result in Table
1, we found that there exists much difference in these results, showing that
there is no overfitting. In the expression data
shuffling case, only 60% of cooperativities of TFs are found, which come from the
correct ChIP-chip data and the uncertainty reduction
by p-value detection. In the location data shuffling, no cooperativities
of TFs are found due to the same significance
threshold
as the original
result, 10-21. The shuffling results shows that the normal
genome-wide location data save the ratio of cooperativities
to be found in the expression data shuffling case, and the cooperativites
we found are more sensitive to the genome-wide location data. The shuffling
results are shown as follows
Expression data shuffling case
|
|
TF1 |
TF2 |
PC |
In original result? |
|
1 |
Swi4 |
Swi6 |
3.24E-153 |
Yes |
|
2 |
Mbp1 |
Swi6 |
3.43E-121 |
Yes |
|
3 |
Mbp1 |
Swi4 |
1.44E-63 |
Yes |
|
4 |
Gat3 |
Yap5 |
2.07E-52 |
Yes |
|
5 |
Fkh2 |
Ndd1 |
7.74E-52 |
Yes |
|
6 |
Fkh1 |
Fkh2 |
1.20E-42 |
Yes |
|
7 |
Pdr1 |
Yap5 |
1.85E-40 |
Yes |
|
8 |
Cin5 |
Yap6 |
6.15E-35 |
Yes |
|
9 |
Stb1 |
Swi6 |
1.44E-34 |
Yes |
|
10 |
Fkh1 |
Swi6 |
3.34E-34 |
Yes |
|
11 |
Fkh2 |
Swi6 |
5.97E-33 |
Yes |
|
12 |
Fkh2 |
Mbp1 |
1.23E-30 |
Yes |
|
13 |
Fkh1 |
Ndd1 |
1.56E-30 |
Yes |
|
14 |
Ste12 |
Swi6 |
1.61E-30 |
Yes |
|
15 |
Hap4 |
Yap5 |
2.04E-30 |
Yes |
|
16 |
Ndd1 |
Swi4 |
2.11E-30 |
Yes |
|
17 |
Msn4 |
Pdr1 |
5.76E-30 |
Yes |
|
18 |
Dig1 |
Ste12 |
3.58E-29 |
Yes |
|
19 |
Ndd1 |
Swi6 |
8.85E-29 |
Yes |
|
20 |
Msn4 |
Yap5 |
1.59E-27 |
Yes |
|
21 |
Skn7 |
Swi6 |
3.84E-27 |
Yes |
|
22 |
Fkh2 |
Swi4 |
4.34E-27 |
Yes |
|
23 |
Mcm1 |
Ndd1 |
1.06E-26 |
Yes |
|
24 |
Gat3 |
Hap4 |
1.37E-26 |
Yes |
|
25 |
Mbp1 |
Ndd1 |
1.37E-25 |
Yes |
|
26 |
Ste12 |
Swi4 |
4.65E-25 |
Yes |
|
27 |
Gat3 |
Pdr1 |
5.38E-25 |
Yes |
|
28 |
Mbp1 |
Stb1 |
3.02E-24 |
Yes |
|
29 |
Mcm1 |
Swi6 |
1.65E-23 |
Yes |
|
30 |
Rap1 |
Yap5 |
4.30E-23 |
Yes |
|
31 |
Skn7 |
Swi4 |
1.34E-22 |
Yes |
|
32 |
Fkh1 |
Mbp1 |
1.53E-22 |
Yes |
|
33 |
Stb1 |
Swi4 |
5.33E-22 |
Yes |
Genome-wide location data shuffling case
|
|
TF1 |
TF2 |
PC |
In original result? |
|
1 |
Gal3 |
Hir3 |
1.00E-12 |
|
|
2 |
Fhl1 |
Tec1 |
1.00E-09 |
|
|
3 |
Opi1 |
Spt2 |
1.67E-09 |
|
|
4 |
Mss11 |
Yap3 |
9.50E-08 |
|
|
5 |
Gln3 |
Rds1 |
1.17E-07 |
|
|
6 |
Ykr046w |
Ynr063w |
6.64E-07 |
|
|
7 |
Ifh1 |
Swi5 |
8.35E-07 |
|
|
8 |
Gal3 |
Stp4 |
9.00E-07 |
|
Note that when
the significance threshold
is the same as the
original result without shuffling, i.e., 10-21, we could not find
any cooperative pairs among TFs when shuffling
genome-wide location data. We just list top 8 pairs of the genome-wide location
data shuffling case.
By the three remedy methods presented above, we showed that there exists no overfitting in this paper and the cooperativities we inferred are reliable.
After constructing the discrete stochastic dynamic model, we use the method of maximum likelihood to estimate the parameters. Equation (4) in the manuscript can be written as the following regression form.

(S1)
where
denotes the
regression vector which can be obtained from the processing above.
is the parameter vector of the gene regulatory network which
is to be estimated.
If we have a lot of microarray data points obtained from the cubic spline method above, it is easy to get values of
for
and
, where
is the number of
expression time points of a target gene, and
is the number of TFs binding to the target gene. The parameter vector
can be estimated
by the following approach. After a matrix notation, equation (S1) at different
time points is of the following form
(S2)
For simplicity, we can further define the notations
,
, and
to represent
equation (S2) as follows
(S3)
In equation (S2), we
assume noises
at different time
points as independent random variables of normal distribution with zero mean
and unknown variance
, i.e., the variance of
is
, where
is an identity
matrix. In this study, a maximum likelihood parameter estimation method will be
employed to estimate
and
from microarray data of regulatory genes and the target gene
(Johansson, 1993). If
is assumed to be
normally distributed with
elements, its
probability density function is of the following form
(S4)
Under equation (S3), we have the likelihood function
(S5)
Equation (S5) can be
considered as a function of parameter
and
. We want to specify
and
to maximize the
likelihood function in (S5). It is practical to take the logarithm of their
likelihood function, and then we have the following log-likelihood function:
(S6)
where
and
are the k-th
elements of
and
, respectively.
Here we expect the
log-likelihood function to have the maximum at
and
. The necessary condition for the maximum likelihood
estimates
and
is as follows
(Johansson, 1993)
(S7)
The estimated parameters
and
are shown below
(S8)
(S9)
where
and
can be obtained
from the microarray data of regulatory genes and the
target gene. After obtaining the estimate
from equation
(S8) via microarray data, the transcriptional
regulatory network of target genes could be expressed by the following dynamic
equation
(S10)
For any
,
, where N denotes
the number of time points in the microarray data. In
this study, N=18. For normal data (without
random permutation), the expression of
is composed of
with the
corresponding time points sequence
. For the randomly permuted
, we randomly permute the time points sequence, which result
in, for example,
, then the corresponding
form the randomly
permuted expression of
.
Pilpel et al. (2001) uncover functional motif combinations in the promoters of S. cerevisae using microarray data. They uncover not only cell cycle-related motifs, but also sporulation and various stress responses. Focusing on cell cycle, they identified only 10 motif pairs. Comparing our results with the cell cycle results from Pilpel et al., there are only 3 TF pairs in common. This may be because they only use microarray data but not use ChIP-chip data to infer combinatorial motifs. The comparison is shown below.