Toric Ideals for

3 downloads 0 Views 913KB Size Report
Mahdy.AL-Mayal ,and Bashar Ahmed Jawad Sharb. Faculty of ... started from two state in the sample space, and ..... presented at Grostat V, New. Oeleans, 4-6 ...
Journal of Kufa for Mathematics and Computer

Toric Ideals for

( )

Vol.2, no.2, Dec,2014,pp 116-120

- Contingency Tables with

Fixed Two Dimensional Marginals When Number Husein Hadi Abas , Yahya .Mahdy.AL-Mayal

is an Even

,and Bashar Ahmed Jawad Sharb

Faculty of Mathematics and Computer Science University of Kufa Najaf - Kufa – Iraq

B.Sturmfels defined the notion of Markov basis for constructing a connected Markov chain for sampling from a conditional distribution over a discrete sample space and proved the fundamental fact that a Markov basis corresponds to a set of binomial generators of a toric ideal[9]. In 2000, M. Dyer, and C. Greenhill, found a Polynomial-time counting and sampling of two-rowed contingency tables[8]. In 2001,A.Dobra showed that the only moves that have to be included in a Markov basis that links all contingency tables having a set of fixed marginals when this set of marginals induces a decomposable graphical models[1]. In 2002, A. Dobra, and S.Sullivant, described a divide-and-conquer algorithm for generating Markov basis of multi-way tables that connects all tables of counts having a fixed set of marginal totals[2]. In 2003, S. Aoki, and A.Takemura.proved

Abstract: In this paper, we find Markov basis and toric ideals for ( ) - contingency tables with fixed two dimensional marginalswhen is an even number greater than or equal and we give an application of ( )

-contingency tables

in genetic algorithm . Key words and phrases:Computational algebraic statistics, Sufficient statistics, Linear transformation, Connected graph, Bipartite graph . 1.Introduction: Since In 1998the publication of P.Diaconis and B.Sturmfels, the new field of computational algebraic statistics has been developing rapidly, and in the same year P.Diaconis and 116

HuseinHadiAbas , Yahya .Mahdy.AL-Mayal

and Bashar Ahmed Jawad Sharb

that there exist a uniqe minimal basis for 3×3×K contingency tables consisting of four types of indispensable moves [13], And in the same year S. Aoki, and A.Takemurapresented a list of indispensable moves of unique minimal Markov basis of 3×4×K and 4×4×K contingency tables with fixed two-dimensional marginals[12], also A. Takemura, and S. Aoki. given some characterizations of minimal Markov basis for connected Markov chain and given a necessary and sufficient condition for uniqueness of minimal Markov basis [3]. In2005, A. Takemura, and S. Aoki. Studied the Markov basis for sampling from discret sample space, which is equipped with som convent metric and they started from two state in the sample space, and they asked whether they can always move closer by an element of a Markov basis and they called a Markov basis distance reducing[4]. In this paper, we givea new algorithm to find the Markov basis and toric ideals

for

( )

Markov basis , and toric ideals that we need in our work. Definition (2.1) : [11] | | Let be a finite set elements, we call an element of a cell and denoted by . is often multi-index . A non-negative integer denotes the frequency of a cell . The set of frequencies is called a contingency table and denoted as , we treat a contingency table as a -dimansional column vector of non-negative integers. Not that a contingency table can also be considered as a function from to defined as .

Remark(2.2): [ 11] The

-norm of

is called

the sample size and denoted as | |



We will denote ℤ

to the set of integer numbers, also ℤ

we denote to the

-

as fixed column vectors

contingency tables with fixed two dimensional marginals.

consisting of integers. A dimensional

2 .Some Basic Concepts:

column ℤ

In this section, we review some basic definitions and notations of contingency tables, moves,

-

vector

as

. Here denotes the transpose of a vector or matrix. 117

Journal of Kufa for Mathematics and Computer

We also define a

matrix ,

with its -row being

given by

Vol.2, no.2, Dec,2014,pp 116-120

This relation is an equivalence relation and partitioned

[

] and if

is a

-

into

is

disjoint

equivalence classes. The set of -

dimensional column vector, we

fibers is simply the set of these

define the set

equivalence classes. Furthermore

ℤ , where

, may be considered as a label

is the

set of natural numbers. In typical

for these equivalence classes.

situations of a statistical theory ,

Definition (2.3):[11]

is sufficient statistic for the

A -dimensional column vector of integers ℤ is called a move if it is in the kernel of ,

nuisance parameter. The set of 's for a given ,

i.e.,

-fibers),is considered

.

for performing similar tests, for

Remark(2.4) :[3]

the case of independence model

For a move , the positive part and the negative part are defined by , respectively, Then and . moreover, and are in the same -fiber, i.e., for . We define the degree of as the sample size of or ( ) and | | | |. denote it by In the following we denote the set of moves (for a given ) by ℤ .

of

two–way

contingency

tables.For example , is the row sums and column sums of is the set of

, and

s with the

same row sums and column sums to . The set of decomposition

-fibers gives a of

.

An

important observation is that fiber depends on given only through its kernel,

. For

different A's with the same kernel , the set of -fibersare the same. In fact , if we define 118

HuseinHadiAbas , Yahya .Mahdy.AL-Mayal

and Bashar Ahmed Jawad Sharb

Definition (2.5): [3]

Theorem (2.8) : [6]

Let be the set of moves and let . We say that accessible from by if there exists a sequance of moves and such that

For a graph the following statements are equivalent: 1. is bipartite . 2. Every cycle in has an even length.



Definition (2.9): [10]



Let ℤ ℤ a linear transformation , ℤ , and bethe set of -fibers, and let , then we define be the graph with vertex set and an edge if and only if

Now, we give some concepts about graph theory that we use later Definition (2.6): [6] A graph is connected if for every pair of distinct vertices where be the set of vertices of the graph the graph has a -path. Otherwise, we say the graph is disconnected.

Definition (2.10):[3] If

is a set such that is connected for all , is a Markov basis for

Definition (2.7) : [6]

then

A graph is bipartite graph if there are meeting the following conditions: 1. 2. 3. and are both null graphs, where and are sub graph of the graph induced by the set of vertices respectively.

Remark (2.11): Throughout this paper the symbol denotes a field of complex numbers , The set is the vector space of tuples of elements in . Henceforth denote indeterminates, that is, polynomial variables. A monomial in the indeterminates is a expression of the form ∏ 119

,

Journal of Kufa for Mathematics and Computer

where are nonnegative integers. We will often use the shorthand

Theorem(2.13) ( Diaconis _ Sturmfels 1998): [3] A

binomials { } is generating set of toric ideal if and only if is a Markov basis for In particular , since every toric ideal has a finite generating set of binomials, Markov basis exist.



to denote this monomial. A polynomial is a linear combination of finitely many monomials ∑

where the and at most finitely many of them are nonzero. Note any polynomial is also a function from to , simply by evaluating the polynomial at a pointof .

or

of

In this section, we describe some of the basic biological facts needed to understand phylogenetic models and then delve into the practical side of the algebraic statistics of these models. The basic genetic information of an organism is (almost always) carried in the form of DNA, a double helix consisting of two complementary B polymers bound together. The four nucleotides that form DNA come in two types: the purines (A and G) and the pyrimidines (C and T). The two strands of the double helix are joined together via the base pairings A to T (via 2 hydrogen bonds) and C to G (via 3 hydrogen bonds). Since each cell typically contains a copy of the DNA of the organism, DNA copying occurs frequently. Several types of errors are possible during the replication of DNA. Single bases can mutate, or large pieces of DNA can separate and become

,

for short. Note that has the structure of a ring because we can add and multiply two polynomials to produce new polynomials, and these addition and multiplication operations are well-behaved with respect to one another . Definition (2.12): [10] ℤ

collection

3. Genomics and phylogenetics

The set of all polynomials in the indeterminates is denoted by either



be a linear transformation, the Toric ideal is the ideal Let

Vol.2, no.2, Dec,2014,pp 116-120

where

121

HuseinHadiAbas , Yahya .Mahdy.AL-Mayal

and Bashar Ahmed Jawad Sharb

reattached, possibly at another position, possibly in the opposite direction. These are just some of the events that occur over the course of evolution.[5],[7].

other columns are zero denoted by

, or it has two columns and the other

columns are zero denoted by like

4. The Main Results In this section, we find Markov

[

basis and toric ideal for ( )

Also , we can write all elements of

- contingency tables with

[

as one dimensional column

is an even number and

greater than or equal . Also, we

such that

construct a new model of genetic algorithm

that

permutes

the ∑

pieces of Nucleotides in aligned DNA sequences by using Markov



……….(2.1).

Remark(4.1):

Theorem (4.2):

be an even number such

The number of elements in

, and let be





{

pervious contingency tables.

that





basis and toric ideals for a

Let

].

vector as follows:

fixed two dimensional marginals when

]

the representative

equal to ( )

elements of the set of contingency tables

Proof:

and

From remark(4.1), there are

such that each , dimension columns

two rows and

, is a matrix of

columns in

either has two

, such that it has two columns

and the

and the other 121

Journal of Kufa for Mathematics and Computer

columns

are

zero,then

number of elements ( ), but

in

Vol.2, no.2, Dec,2014,pp 116-120

the

Given a contingency table

is

,

the entry of the matrix in the column

is an element in for

indexed

by and

all

. Then the number

of elements

is( ). Since

in



row



will be each element in or

,

is either equal to one if

then the numbers of

elements in is

( )

( )

the ∑

( )

a pears in

and it will zero

otherwise. Then

■ A

Now, we will prove all elements in

[

are moves.

]

Remark(4.3): Theorem(4.4): The set

is a set

( )

[

[ ]

]

of moves. Proof : Let

.To prove

.

is [ ]

a move By

remark

From (2.1), we have

(4.3)

if [





if

]





.

We must show that if

122



.

Journal of Kufa for Mathematics and Computer

Vol.2, no.2, Dec,2014,pp 116-120

Then

( )

we can connected all ( )

. [

]

[ ]

.Then

[ ]

- contingency tableswith fixed Therefore, that

.This implies

two dimensional marginals by

is a set of moves (By ( ) edges by applying moves

definition (2.3)).■

from

Remark(4.5): Now,

to

from

we will construct a

one by one and go to

without

( )

connected graph by using the

causing negative cell frequencies

elements of

on the way, and also from

element of

Let

( ) is

connected

to an edge

and

( )

connect

( )

such that

,

and

be an

( )

and

undirected graph as shown in

… ,

,

of this type, by forming

figure (1)

is an edge where

( )



( ) ( )

( )

( )





… …

( )

( )

Figure(1):The graph

( )

,Where the contingency

2

( )

Journal of Kufa for Mathematics and Computer

Vol.2, no.2, Dec,2014,pp 116-120

tables interpreted as vertices and connecting moves are interpreted as edges of a graph,

( )

} and

the graph

Theorem(4.6): The graph

.

( )

, by (definition

(2.6)), is a connected graph .

is a

connected bipartite graph ( up to

Now,

we

prove

graph isomorphism).

the

graph

is a bipartite graph.

Proof :

Let

To prove

is a

( )

( )

be a cycle in .

connected graph. Suppose Let

, ( )

if

. Then

,

since the edge , then

or ( )

since the edge .

( )

, there exists

Continuing in this way , we see that if is odd, then

a path

is even then

and if .

Since

, it implies that

,

is even,and thus the cycle is of ( )

and if

and

even length. By theorem (2.8), then the graph

( )

there

( )

exists

bipartite graph. ■ a

( )

path

( )

Corollary(4.7): The

( )

set

of

moves

in

theorem(4.4)is a Markov basis .

, and that implies there exists a path between every pair of distinct vertices

is a

of 2

Journal of Kufa for Mathematics and Computer

Vol.2, no.2, Dec,2014,pp 116-120

dimension

Proof : Let

and either it has

two columns

be a finite set of

moves. From theorem (4.6) the

and the other columns are zero ,

graph

or

is a connected

graph. By definition (2.10)

it

has

two and

is

columns the

other

columns are zero. Then the toric

a Markov basisfor . ■

ideal is the ideal Now, we will find ideals

that

the toric

corresponding ( )

Markov basis for

, such that

to

.■ Example(4.9):

contingency tables. For Corollary(4.8):

(4.2) there are

Let

be an even number, such

that

, and let be a Markov

Markov

( )

for

-

(4.1)

- contingency

tables in

basis

moves in a

contingency table, and by remark

basis for . Then the toric ideal for

, according to theorem

{[

, , such that

] [

[

]

[

] [

]

] [

]}

.

By corollary (4.8) the toric ideal Proof:

of

From corollary (4.7), Markov

basis for

is a

, and by

theorem (2.9) the set of binomials {

of

} is a generating

toric ideal ,

, and since , is a matrix of 2

- contingency table

HuseinHadiAbas , Yahya .Mahdy.AL-Mayal

and Bashar Ahmed Jawad Sharb

Taxon1: A G C T A A C GGTACGT

Remark(4.10):

Taxon2: CGATCTGAC CCGT T

We can use corollary (4.8) to find the toric ideal ( )

Taxonm: ACGTCA CGTA ACGC

contingency

tables

without Then,

finding the Markov basis.

we

define

a

pattern

to be the sequence

5. ANew Module of Permutation The Pieces of Nucleotides inDNA Sequences

of characters. we look at a single site (column) of our sequence data. In the sequences above, we can look at the first site in the

In this section, we construct

sequences and see the pattern

a new model of genetic algorithm

"AC . . .A". A pattern frequency

of permutation of the pieces of Nucleotides

in aligned

is that

DNA

appears inour set of

sequence data , and we refer to

sequences using an algorithm .

the

Now we describe our model in

number

by where

the following steps.

of

frequencies

is an even number

grater than or equal . Step (1): . . . . . .



. . .

| | ∑

Suppose we have

Step (2): We can input the patterns frequency of above sequences



–contingency

( )

in

tables as follows :

taxons of

DNA sequences, each taxon of

Where | |

length such as

length of sequences (the sample size),



is

the

and

is the frequency of the first pattern.

3

Journal of Kufa for Mathematics and Computer

Vol.2, no.2, Dec,2014,pp 116-120

is the frequency of the second pattern. [

]

Where

is written as

is the frequency of the pattern. [

]

[

]

is the frequency of the pattern. [

is the frequency of the

]

where the columns of the matrix pattern.

index by the elements of the

is the frequency of the

column vector

pattern.

Step(5):

Step (3): Represent the contingency table as a - dimensional column vector of non-

negative

Use remark (4.1), theorem (4.2), and corollary (4.7) to find the Markov basis

integers

( )

.

Step (6):

, where denotes the

transpose of a vector or matrix ,

Use remark (4.5) and theorem

as in remark (2.2). Then is a -

(4.6) to find the connected graph

fiber (i.e

, and ( )

, where (

Step (4): is

- fibers)

tables. matrix

and

From

remark (4.3),

4

- contingency

HuseinHadiAbas , Yahya .Mahdy.AL-Mayal

and Bashar Ahmed Jawad Sharb

frequencies

Step (7):

respectively.

Use corollary (4.8) to find the toric ideals for

( )

Step (2):

-

Now, we input the

contingency tables.

frequency in Step (8): Use the

patterns

of above sequences

-contingency table as

follows : 5 6

( ) elements of the

Markov basis

4

in step(6) to find

the permutation of the pieces of nucleotides

and

in aligned

4

3

Step (3):

DNA

Represent the contingency table

sequences.

as

a

dimensional column vector of

Example (5.1):

non-

negative

Suppose we have the following

integers

, where denotes the

three aligned DNA sequences:

transpose of a vector or matrix ,

Taxon1: AGCTG A ATG GC

as in remark (2.2). Then is a -

Taxon2: AGA TCTATC CA

fiber (i.e

Taxon3: A T TCA GACA AT Step (1): There are three taxons for above DNA ∑

sequences

-

with

| |

and six patterns

AAA, G GT, CAT, T TC, GCA, ATG, A AA, T TC, GCA, GCA, and CAT

with

5

, where

Journal of Kufa for Mathematics and Computer

Vol.2, no.2, Dec,2014,pp 116-120

Step (4): is matrix remark (4.3),

and

From -1 0 1 0

1 -1

2 1 2 5 2 3 1 6 4 4 3 11

Step(5): 3 1 1 5 1 3 2 6 4 4 3 11

Use remark (4.1), theorem (4.2), and corollary (4.7) to find the Markov basis

.

( )

1 2 2 5 3 2 1 6 4 4 3 11 0 0

0 1 -1 0 -1 1

-1 1 1 -1

1 1 3 5 3 3 0 6 4 4 3 11

3 0 2 5 1 4 1 6 4 4 3 11 [

-1 1 0 1 -1 0

]

Where

1 0 -1 0

is written as

[

]

[

-1 1

2 0 3 5 2 4 0 6 4 4 3 11

1 -1 0 -1 1 0

]

Figure (2): The graph , [

]

Step (7):

where the columns of the matrix index by the elements of the column vector Step (6):

By corollary (4.8), the toric ideal of

-

contingency tables

The connected graph with ( )

Step (8):

- contingency

( - fibers)

Then the permutation for the pieces of nucleotides in aligned DNA sequences under

tables

as

vertices

of

it

the set of Markov basis be as follows:

6

HuseinHadiAbas , Yahya .Mahdy.AL-Mayal

and Bashar Ahmed Jawad Sharb

Figure (3): The permutation of the pieces of nucleotides in aligned DNA sequences under the set of Markov basis . Remark(5.2):

ii. We refer to ⓿ in the same

i. We refer to ①, ②, ③, and ④

figure to the hidden in the pattern

in figure (3) to the frequencies of

frequency of

the patterns in DNA sequences.

DNA sequences.

1

HuseinHadiAbas , Yahya .Mahdy.AL-Mayal

and Bashar Ahmed Jawad Sharb

Example(5.3) :

True

Consider finding the connected graph for contingency table. Using the computer system Maple 13, the following commands will do the job:

True

6.Conclusions A Contingency table is a matrix of nonnegative integers with

{

prescribed

}

positive

row

and

column sums.Contingency tables are used in statistics to store data from sample surveys. One of related problems for a survey of contingency tables is how to generate table from the set of all nonnegative

integer tables

with given row and column sums True

? (Fisher's Exact Test, Missing Data Problems).

True A

-contingency tables

can

be

written

as

a

-

dimensional column vector of non-negative

integers,

where

, the contingency table can also be considered as an element in -fibers

, where ,

is a linear transformation, and is 117

-dimensional column

Journal of Kufa for Mathematics and Computer

Vol.2, no.2, Dec,2014,pp 116-120

vector of row and column sums

, where

of the contingency table . Let

are the row sums and

be a finite set, such that and let

are the column sums for each

be the graph with

vertex set

and

- contingency tables. Also,

an

edge if and only if

they used to find a finite set

. If

is connected, then

. Remark(4.5) explains

is a

how to construct the graph

Markov basis for . The question

and theorem(4.6) states

how to find a finite set such that

another problem, if

is found,

then

binomials

the

that

is connected ? is

set

{

of

}

toric ideal

for

generates (i.e.

is a connected graph,

this implies

is a Markov basis

as mentioned inCorollary(4.7). Corollary(4.8) gives the toric

the

ideal

is used

for

using the Markov

basis . All these results are used

to find a toric ideal ) .

in section (5) to construct a new In this Paper, we introduce an

model of genetic algorithm of

algorithm based on the actions of

permutation

subgroups

of the

nucleotides

that

sequences.

and

dihedral group

on

the in

pieces

aligned

of DNA

gives solutions for all previous problems when

References

is an even

[1] A. Dobra, "Markov bases for decomposable graphical models", Paper presented at Grostat V, New Oeleans, 4-6 September 2001, submitted for publication, 2001.

number greater than or equal , and

.Remark(4.1),

theorem(4.2), theorem(4.4), and remark(4.5) are used to generate ( ) contingency tables (

matrices) in

-fibers

[2] A. Dobra, and S.Sullivant, "A divide-and-conquer algorithm for

, 118

HuseinHadiAbas , Yahya .Mahdy.AL-Mayal

and Bashar Ahmed Jawad Sharb

generating Markov bases of multi-way tables", Paper presented at the workshop "Multidimensional tables" statistics, combinatorial optimization, and Groebner bases", University of California at Davis, submitted for publication,15-16 February 2002. [3] A. Takemura, and S. Aoki,"Some characterizations of minimal Markov basis forsampling from discrete conditional distributions", Annals of the Institute of Statistical Mathematics,2003.

[8] M. Dyer, and C. Greenhill, "Polynomial-time counting and sampling of two-rowedcontingency tables", Theoretical Computer Sciences, 246,(2000), pp. 265-278. [9] P. Diaconis, and B. Sturmfels,"Algebraic algorithms for sampling from conditional distributions", The Annals of Statistics, 26, pp. 363397,(1998).

[4] A. Takemura, and S. Aoki,"Distance reducing Markov bases for sampling fromdiscrete space",Bernoulli, to appear, 2005.

decomposition", Mathematical Essays in Honor of Gian-Carlo Rota,

[10] P. Diaconis, D. Eisenbud, B. Sturmfels. "Lattice walks and primary

eds. B. Sagan and R. Stanley, Progress in Mathematics, Vol. 161,

[5]C.Semple, M. Steel,"Phylogenetics", volume 24 of Oxford Lecture

Birkhauser, (1998),pp.173-193.

Seriesin Mathematics and its Applications, Oxford University Press, Oxford,( 2003). [6] G. Agnarsson, R. Greenlaw. "Graph theory , modeling, applications, and algorithms ",United States of America,(2007).

Boston,

[11] S. Aoki, A. Takemura, "The largest group of invariance for Markov bases and toric ideals", J. Symbolic Computation 43 (5) (2008) 342–358.

[7] J.Felsenstein,"Inferring Phylogenies",Sinauer Associates, Sunderland,( 2003). 119

Journal of Kufa for Mathematics and Computer

[12] S. Aoki, and A.Takemura,"The list of indispensable moves of unique minimal Markov basis of 3×4×K and 4×4×K contingency tables with fixed twodimensional marginals", Technical Report METR 03-38, Department of Mathematical Engineering and Information Physics, The University of Tokyo. Submitted for publication,(2003). [13] S. Aoki, and A. Takemura,"Minimal basis for connected Markov chain over 3×3×K contingency tables with fixed twodimensional marginals". Australian and New Zealand Journal of Statistics, 45, pp 229-249,(2003).

121

Vol.2, no.2, Dec,2014,pp 116-120