Identification of Transcription Factor Cooperativity via Stochastic System Model

 Supplementary materials

 

Download the PDF file

A. Avoiding overfitting

    We have some methods to avoid overfitting in this paper. First, we use cubic spline method to interpolate the data points while the number of TFs binding to the target gene exceeds 6, which implied the number of parameters to be estimated is 23. Only 5% of target genes have more than 6 TFs. The more TFs binding to a target gene, the more data points we interpolated by the cubic spline method to avoid overfitting. Second, in the method we proposed, we used p-value to determine the cooperativity among TFs, which could reduce uncertainty due to small data to avoid overfitting. Third, since data-shuffling technique is a good way to test the overfitting, we randomly shuffled the microarray expression data or genome-wide location data to prove that there is no overfitting in the method we proposed.

By comparing the results of expression data shuffling and genome-wide location data shuffling with the result of original result in Table 1, we found that there exists much difference in these results, showing that there is no overfitting. In the expression data shuffling case, only 60% of cooperativities of TFs are found, which come from the correct ChIP-chip data and the uncertainty reduction by p-value detection. In the location data shuffling, no cooperativities of TFs are found due to the same significance threshold  as the original result, 10-21. The shuffling results shows that the normal genome-wide location data save the ratio of cooperativities to be found in the expression data shuffling case, and the cooperativites we found are more sensitive to the genome-wide location data. The shuffling results are shown as follows

Expression data shuffling case

 

TF1

TF2

PC

In original result?

1

Swi4

Swi6

3.24E-153

Yes

2

Mbp1

Swi6

3.43E-121

Yes

3

Mbp1

Swi4

1.44E-63

Yes

4

Gat3

Yap5

2.07E-52

Yes

5

Fkh2

Ndd1

7.74E-52

Yes

6

Fkh1

Fkh2

1.20E-42

Yes

7

Pdr1

Yap5

1.85E-40

Yes

8

Cin5

Yap6

6.15E-35

Yes

9

Stb1

Swi6

1.44E-34

Yes

10

Fkh1

Swi6

3.34E-34

Yes

11

Fkh2

Swi6

5.97E-33

Yes

12

Fkh2

Mbp1

1.23E-30

Yes

13

Fkh1

Ndd1

1.56E-30

Yes

14

Ste12

Swi6

1.61E-30

Yes

15

Hap4

Yap5

2.04E-30

Yes

16

Ndd1

Swi4

2.11E-30

Yes

17

Msn4

Pdr1

5.76E-30

Yes

18

Dig1

Ste12

3.58E-29

Yes

19

Ndd1

Swi6

8.85E-29

Yes

20

Msn4

Yap5

1.59E-27

Yes

21

Skn7

Swi6

3.84E-27

Yes

22

Fkh2

Swi4

4.34E-27

Yes

23

Mcm1

Ndd1

1.06E-26

Yes

24

Gat3

Hap4

1.37E-26

Yes

25

Mbp1

Ndd1

1.37E-25

Yes

26

Ste12

Swi4

4.65E-25

Yes

27

Gat3

Pdr1

5.38E-25

Yes

28

Mbp1

Stb1

3.02E-24

Yes

29

Mcm1

Swi6

1.65E-23

Yes

30

Rap1

Yap5

4.30E-23

Yes

31

Skn7

Swi4

1.34E-22

Yes

32

Fkh1

Mbp1

1.53E-22

Yes

33

Stb1

Swi4

5.33E-22

Yes

 

Genome-wide location data shuffling case

 

TF1

TF2

PC

In original result?

1

Gal3

Hir3

1.00E-12

 

2

Fhl1

Tec1

1.00E-09

 

3

Opi1

Spt2

1.67E-09

 

4

Mss11

Yap3

9.50E-08

 

5

Gln3

Rds1

1.17E-07

 

6

Ykr046w

Ynr063w

6.64E-07

 

7

Ifh1

Swi5

8.35E-07

 

8

Gal3

Stp4

9.00E-07

 

Note that when the significance threshold  is the same as the original result without shuffling, i.e., 10-21, we could not find any cooperative pairs among TFs when shuffling genome-wide location data. We just list top 8 pairs of the genome-wide location data shuffling case.

By the three remedy methods presented above, we showed that there exists no overfitting in this paper and the cooperativities we inferred are reliable.

 

B. Identifying gene regulatory networks

After constructing the discrete stochastic dynamic model, we use the method of maximum likelihood to estimate the parameters. Equation (4) in the manuscript can be written as the following regression form.

 

 

 

 

                                                                        (S1)

where  denotes the regression vector which can be obtained from the processing above.  is the parameter vector of the gene regulatory network which is to be estimated.

If we have a lot of microarray data points obtained from the cubic spline method above, it is easy to get values of  for  and , where  is the number of expression time points of a target gene, and  is the number of TFs binding to the target gene. The parameter vector  can be estimated by the following approach. After a matrix notation, equation (S1) at different time points is of the following form

                                                                                           (S2)

 

 

For simplicity, we can further define the notations , , and  to represent equation (S2) as follows

                                                                                                            (S3)

In equation (S2), we assume noises  at different time points as independent random variables of normal distribution with zero mean and unknown variance , i.e., the variance of  is  , where  is an identity matrix. In this study, a maximum likelihood parameter estimation method will be employed to estimate  and  from microarray data of regulatory genes and the target gene (Johansson, 1993). If  is assumed to be normally distributed with  elements, its probability density function is of the following form

                                                                                (S4)

 

 

Under equation (S3), we have the likelihood function

                                                       (S5)

 

 

Equation (S5) can be considered as a function of parameter  and . We want to specify  and  to maximize the likelihood function in (S5). It is practical to take the logarithm of their likelihood function, and then we have the following log-likelihood function:

                                                               (S6)

 

 

where  and  are the k-th elements of  and , respectively.

Here we expect the log-likelihood function to have the maximum at  and . The necessary condition for the maximum likelihood estimates  and  is as follows (Johansson, 1993)

                                                                                          (S7)

 

 

 

The estimated parameters  and  are shown below

                                                                                                    (S8)

                                                            (S9)

 

 

where  and  can be obtained from the microarray data of regulatory genes and the target gene. After obtaining the estimate  from equation (S8) via microarray data, the transcriptional regulatory network of target genes could be expressed by the following dynamic equation

                                                                                     (S10)

 

 

 

C. Random permutation of

    For any , , where N denotes the number of time points in the microarray data. In this study, N=18. For normal data (without random permutation), the expression of  is composed of  with the corresponding time points sequence . For the randomly permuted , we randomly permute the time points sequence, which result in, for example, , then the corresponding  form the randomly permuted expression of .

 

D. Comparison with Pilpel et al.’s results

Pilpel et al. (2001) uncover functional motif combinations in the promoters of S. cerevisae using microarray data. They uncover not only cell cycle-related motifs, but also sporulation and various stress responses. Focusing on cell cycle, they identified only 10 motif pairs. Comparing our results with the cell cycle results from Pilpel et al., there are only 3 TF pairs in common. This may be because they only use microarray data but not use ChIP-chip data to infer combinatorial motifs. The comparison is shown below.