Journal of Kufa for Mathematics and Computer
Vol.2, no.2, Dec, 2014, pp 116-120

Toric Ideals for ( ) - Contingency Tables with Fixed Two Dimensional Marginals When Number   is an Even

Husein Hadi Abas, Yahya Mahdy AL-Mayal, and Bashar Ahmed Jawad Sharb
Faculty of Mathematics and Computer Science, University of Kufa, Najaf - Kufa - Iraq

Abstract: In this paper, we find Markov basis and toric ideals for ( ) - contingency tables with fixed two dimensional marginals when   is an even number greater than or equal   and we give an application of ( ) - contingency tables in genetic algorithm.

Key words and phrases: Computational algebraic statistics, Sufficient statistics, Linear transformation, Connected graph, Bipartite graph.

1. Introduction:
Since the publication of P. Diaconis and B. Sturmfels in 1998, the new field of computational algebraic statistics has been developing rapidly. In the same year, P. Diaconis and B. Sturmfels defined the notion of Markov basis for constructing a connected Markov chain for sampling from a conditional distribution over a discrete sample space, and proved the fundamental fact that a Markov basis corresponds to a set of binomial generators of a toric ideal [9]. In 2000, M. Dyer and C. Greenhill found a polynomial-time method for counting and sampling two-rowed contingency tables [8]. In 2001, A. Dobra identified the only moves that have to be included in a Markov basis that links all contingency tables having a set of fixed marginals, when this set of marginals induces a decomposable graphical model [1]. In 2002, A. Dobra and S. Sullivant described a divide-and-conquer algorithm for generating a Markov basis of multi-way tables that connects all tables of counts having a fixed set of marginal totals [2]. In 2003, S. Aoki and A. Takemura proved that there exists a unique minimal basis for 3×3×K contingency tables consisting of four types of indispensable moves [13], and in the same year they presented a list of indispensable moves of the unique minimal Markov basis of 3×4×K and 4×4×K contingency tables with fixed two-dimensional marginals [12]. Also, A. Takemura and S. Aoki gave some characterizations of a minimal Markov basis for a connected Markov chain, and gave a necessary and sufficient condition for uniqueness of a minimal Markov basis [3]. In 2005, A. Takemura and S. Aoki studied Markov bases for sampling from a discrete sample space equipped with a convenient metric. Starting from two states in the sample space, they asked whether one can always move closer by an element of a Markov basis, and they called such a Markov basis distance reducing [4]. In this paper, we give a new algorithm to find the Markov basis and toric ideals for ( ) - contingency tables with fixed two dimensional marginals.

2. Some Basic Concepts:
In this section, we review some basic definitions and notations of contingency tables, moves, Markov basis, and toric ideals that we need in our work.

Definition (2.1): [11]
Let $\mathcal{I}$ be a finite set of $p = |\mathcal{I}|$ elements. We call an element of $\mathcal{I}$ a cell and denote it by $i \in \mathcal{I}$; $i$ is often a multi-index $i = i_1 \ldots i_m$. A non-negative integer $x_i$ denotes the frequency of a cell $i$. The set of frequencies is called a contingency table and denoted as $x = \{x_i\}_{i \in \mathcal{I}}$. We treat a contingency table as a $p$-dimensional column vector of non-negative integers. Note that a contingency table can also be considered as a function from $\mathcal{I}$ to $\mathbb{N}$ defined as $i \mapsto x_i$.

Remark (2.2): [11]
The $\ell_1$-norm of $x$ is called the sample size and denoted as
$$|x| = \sum_{i \in \mathcal{I}} x_i .$$
We denote by $\mathbb{Z}$ the set of integer numbers. We also denote by $a_j \in \mathbb{Z}^p$, $j = 1, \ldots, \nu$, fixed column vectors consisting of integers. A $\nu$-dimensional column vector $t = (t_1, \ldots, t_\nu)' \in \mathbb{Z}^\nu$ is defined as $t_j = a_j' x$, $j = 1, \ldots, \nu$. Here $'$ denotes the transpose of a vector or matrix.
We also define a $\nu \times p$ matrix $A$, with its $j$-th row being $a_j'$, given by
$$A = \begin{bmatrix} a_1' \\ \vdots \\ a_\nu' \end{bmatrix},$$
and if $t = Ax$ is a $\nu$-dimensional column vector, we define the set
$$T = \{\, t : t = Ax,\ x \in \mathbb{N}^p \,\} \subset \mathbb{Z}^\nu ,$$
where $\mathbb{N}$ is the set of natural numbers. In typical situations of a statistical theory, $t$ is a sufficient statistic for the nuisance parameter. The set of $x$'s for a given $t$,
$$A^{-1}[t] = \{\, x \in \mathbb{N}^p : Ax = t \,\} \quad (t\text{-fibers}),$$
is considered for performing similar tests. For example, in the case of the independence model of two-way contingency tables, $t$ is the row sums and column sums of $x$, and $A^{-1}[t]$ is the set of $x$'s with the same row sums and column sums as $t$. The set of $t$-fibers gives a decomposition of $\mathbb{N}^p$.

An important observation is that a $t$-fiber depends on the given $A$ only through its kernel, $\ker(A)$. For different $A$'s with the same kernel, the sets of $t$-fibers are the same. In fact, if we define
$$x_1 \sim x_2 \iff x_1 - x_2 \in \ker(A),$$
this relation is an equivalence relation and $\mathbb{N}^p$ is partitioned into disjoint equivalence classes. The set of $t$-fibers is simply the set of these equivalence classes. Furthermore, $t$ may be considered as a label for these equivalence classes.

Definition (2.3): [11]
A $p$-dimensional column vector of integers $z \in \mathbb{Z}^p$ is called a move if it is in the kernel of $A$, i.e., $Az = 0$.

Remark (2.4): [3]
For a move $z$, the positive part $z^+$ and the negative part $z^-$ are defined by $z_i^+ = \max(z_i, 0)$ and $z_i^- = -\min(z_i, 0)$, respectively. Then $z = z^+ - z^-$ and $z^+, z^- \in \mathbb{N}^p$. Moreover, $z^+$ and $z^-$ are in the same $t$-fiber, i.e., $z^+, z^- \in A^{-1}[t]$ for $t = Az^+ = Az^-$. We define the degree of $z$ as the sample size of $z^+$ (or $z^-$) and denote it by $\deg(z) = |z^+| = |z^-|$. In the following we denote the set of moves (for a given $A$) by $M(A) = \mathbb{Z}^p \cap \ker(A)$.
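The notions of sufficient statistic, move, positive part and negative part can be checked numerically on a small case. The following is a minimal sketch for the independence model of a 2×3 table, assuming the table is stored as a flat vector in row-major order; the helper names `design_matrix`, `statistic`, `is_move` and `parts` are illustrative and do not come from the paper.

```python
def design_matrix(rows, cols):
    """Matrix A whose rows compute the row sums, then the column sums,
    of a rows x cols table flattened in row-major order (an assumption)."""
    size = rows * cols
    A = [[1 if i // cols == r else 0 for i in range(size)] for r in range(rows)]
    A += [[1 if i % cols == c else 0 for i in range(size)] for c in range(cols)]
    return A


def statistic(A, x):
    """Compute t = A x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def is_move(A, z):
    """A move is an integer vector in the kernel of A."""
    return all(v == 0 for v in statistic(A, z))


def parts(z):
    """Positive and negative parts of z, so that z = z_plus - z_minus."""
    z_plus = [max(v, 0) for v in z]
    z_minus = [max(-v, 0) for v in z]
    return z_plus, z_minus


A = design_matrix(2, 3)
x = [2, 1, 2, 2, 3, 1]          # the table [[2, 1, 2], [2, 3, 1]]
t = statistic(A, x)             # row sums followed by column sums
z = [1, -1, 0, -1, 1, 0]        # a candidate move
z_plus, z_minus = parts(z)
degree = sum(z_plus)
```

Here `t` collects the margins of `x`, and `z_plus` and `z_minus` have the same image under `A`, which is the statement that they lie in the same fiber.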
Definition (2.5): [3]
Let $\mathcal{B} \subset M(A)$ be a set of moves and let $x, y \in A^{-1}[t]$. We say that $y$ is accessible from $x$ by $\mathcal{B}$ if there exists a sequence of moves $z_1, \ldots, z_k \in \mathcal{B}$ and $\varepsilon_j \in \{-1, +1\}$, $j = 1, \ldots, k$, such that
$$y = x + \sum_{j=1}^{k} \varepsilon_j z_j \quad\text{and}\quad x + \sum_{j=1}^{h} \varepsilon_j z_j \in A^{-1}[t], \quad 1 \le h \le k .$$

Now, we give some concepts about graph theory that we use later.

Definition (2.6): [6]
A graph $G$ is connected if for every pair of distinct vertices $u, v \in V(G)$, where $V(G)$ is the set of vertices of the graph $G$, the graph has a $u,v$-path. Otherwise, we say the graph is disconnected.

Definition (2.7): [6]
A graph $G$ is a bipartite graph if there are $V_1, V_2 \subseteq V(G)$ meeting the following conditions:
1. $V_1 \cup V_2 = V(G)$.
2. $V_1 \cap V_2 = \emptyset$.
3. $G[V_1]$ and $G[V_2]$ are both null graphs, where $G[V_1]$ and $G[V_2]$ are the subgraphs of the graph $G$ induced by the sets of vertices $V_1$ and $V_2$ respectively.

Theorem (2.8): [6]
For a graph $G$ the following statements are equivalent:
1. $G$ is bipartite.
2. Every cycle in $G$ has an even length.

Definition (2.9): [10]
Let $A : \mathbb{Z}^p \to \mathbb{Z}^\nu$ be a linear transformation, $t \in \mathbb{Z}^\nu$, and $A^{-1}[t]$ be the set of $t$-fibers, and let $\mathcal{B} \subset \ker_{\mathbb{Z}}(A)$. Then we define $A^{-1}[t]_{\mathcal{B}}$ to be the graph with vertex set $A^{-1}[t]$ and an edge $x \sim y$ if and only if $x - y \in \pm\mathcal{B}$.

Definition (2.10): [3]
If $\mathcal{B} \subset \ker_{\mathbb{Z}}(A)$ is a set such that $A^{-1}[t]_{\mathcal{B}}$ is connected for all $t$, then $\mathcal{B}$ is a Markov basis for $A$.

Remark (2.11):
Throughout this paper the symbol $K$ denotes the field of complex numbers. The set $K^p$ is the vector space of $p$-tuples of elements in $K$. Henceforth $y_1, \ldots, y_p$ denote indeterminates, that is, polynomial variables. A monomial in the indeterminates is an expression of the form
$$\prod_{i=1}^{p} y_i^{u_i},$$
where $u_1, \ldots, u_p$ are nonnegative integers. We will often use the shorthand
$$y^{u} = \prod_{i=1}^{p} y_i^{u_i}$$
to denote this monomial. A polynomial is a linear combination of finitely many monomials,
$$f = \sum_{u} c_u\, y^{u},$$
where the $c_u \in K$ and at most finitely many of them are nonzero. Note that any polynomial is also a function from $K^p$ to $K$, simply by evaluating the polynomial at a point of $K^p$. The set of all polynomials in the indeterminates is denoted by either $K[y_1, \ldots, y_p]$ or $K[y]$, for short. Note that $K[y]$ has the structure of a ring because we can add and multiply two polynomials to produce new polynomials, and these addition and multiplication operations are well-behaved with respect to one another.

Definition (2.12): [10]
Let $A : \mathbb{Z}^p \to \mathbb{Z}^\nu$ be a linear transformation. The toric ideal $I_A$ is the ideal
$$I_A = \langle\, y^{u} - y^{v} : u, v \in \mathbb{N}^p,\ Au = Av \,\rangle \subseteq K[y],$$
where $y^{u} = \prod_{i=1}^{p} y_i^{u_i}$.

Theorem (2.13) (Diaconis-Sturmfels 1998): [3]
A collection of binomials $\{\, y^{z^+} - y^{z^-} : z \in \mathcal{B} \,\}$ is a generating set of the toric ideal $I_A$ if and only if $\mathcal{B}$ is a Markov basis for $A$. In particular, since every toric ideal has a finite generating set of binomials, Markov bases exist.

3. Genomics and phylogenetics
In this section, we describe some of the basic biological facts needed to understand phylogenetic models and then delve into the practical side of the algebraic statistics of these models. The basic genetic information of an organism is (almost always) carried in the form of DNA, a double helix consisting of two complementary polymers bound together. The four nucleotides that form DNA come in two types: the purines (A and G) and the pyrimidines (C and T). The two strands of the double helix are joined together via the base pairings A to T (via 2 hydrogen bonds) and C to G (via 3 hydrogen bonds). Since each cell typically contains a copy of the DNA of the organism, DNA copying occurs frequently. Several types of errors are possible during the replication of DNA. Single bases can mutate, or large pieces of DNA can separate and become reattached, possibly at another position, possibly in the opposite direction. These are just some of the events that occur over the course of evolution [5], [7].
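Definitions (2.5) to (2.10) and theorem (2.8) of Section 2 can be explored by brute force on a small fiber. The sketch below is only an illustration under stated assumptions: it uses the independence model of a 2×3 table with the three standard basic moves, not the construction of this paper, and the function names are hypothetical. It enumerates a fiber, joins two tables when they differ by plus or minus a move, and tests connectivity and bipartiteness by a breadth-first 2-colouring.

```python
from collections import deque
from itertools import product


def fiber_2x3(row_sums, col_sums):
    """All non-negative 2x3 tables (flattened row-major) with the given margins."""
    tables = []
    for a, b in product(range(col_sums[0] + 1), range(col_sums[1] + 1)):
        c = row_sums[0] - a - b
        if 0 <= c <= col_sums[2]:
            bottom = (col_sums[0] - a, col_sums[1] - b, col_sums[2] - c)
            if sum(bottom) == row_sums[1]:
                tables.append((a, b, c) + bottom)
    return tables


def fiber_graph(vertices, moves):
    """Adjacency sets: x ~ y if and only if x - y is plus or minus a move."""
    vertex_set = set(vertices)
    adjacency = {v: set() for v in vertices}
    for v in vertices:
        for z in moves:
            for sign in (1, -1):
                w = tuple(a + sign * b for a, b in zip(v, z))
                if w in vertex_set:
                    adjacency[v].add(w)
    return adjacency


def connected_and_bipartite(adjacency):
    """BFS 2-colouring from one vertex. Returns (connected, bipartite);
    the bipartite flag only describes the component that was explored."""
    start = next(iter(adjacency))
    colour = {start: 0}
    bipartite = True
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in adjacency[v]:
            if w not in colour:
                colour[w] = 1 - colour[v]
                queue.append(w)
            elif colour[w] == colour[v]:
                bipartite = False  # an odd cycle exists (theorem (2.8))
    return len(colour) == len(adjacency), bipartite


z1 = (1, -1, 0, -1, 1, 0)
z2 = (0, 1, -1, 0, -1, 1)
z3 = (1, 0, -1, -1, 0, 1)

big = fiber_2x3((5, 6), (4, 4, 3))
small = fiber_2x3((1, 1), (1, 0, 1))
```

On the small fiber the two moves `z1` and `z2` leave the graph disconnected, while adding `z3` connects it. This is why definition (2.10) requires connectivity for every fiber rather than for a single one.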
4. The Main Results
In this section, we find Markov basis and toric ideal for ( ) - contingency tables with fixed two dimensional marginals when   is an even number greater than or equal  . Also, we construct a new model of genetic algorithm that permutes the pieces of nucleotides in aligned DNA sequences by using the Markov basis and toric ideals for the previous contingency tables.

Remark (4.1):
Let   be an even number such that  , and let   be the representative elements of the set of contingency tables   such that each  ,  , is a matrix of dimension   that either has two columns   and the other columns are zero, denoted by  , or has two columns   and the other columns are zero, denoted by  , like
[ ].
Also, we can write all elements of   as one dimensional column vector as follows:
[ ]
such that
∑   ∑   ∑   ∑  ,   ∑   ∑   ∑   ∑   ……….(2.1).

Theorem (4.2):
The number of elements in   is equal to ( ).
Proof:
From remark (4.1), there are two rows and   columns in  , such that it has two columns   and the other columns are zero; then the number of elements in   is ( ), but   is an element in   for all  . Then the number of elements in   is ( ). Since each element is in   or  ,  , the number of elements in   is ( ) ( ) ( ). ■

Remark (4.3):
Given a contingency table  , the entry of the matrix   in the column indexed by   and row   will be equal to one if   appears in  , and it will be zero otherwise. Then
A   [ ].

Now, we will prove that all elements in   are moves.

Theorem (4.4):
The set   ( ) is a set of moves.
Proof:
Let  . To prove   is a move, we must show that  . By remark (4.3),
[ ] [ ].
From (2.1), we have ∑   ∑   if  , and ∑   ∑   if  . Then  [ ] [ ]  . Then [ ]  . Therefore,  . This implies that   is a set of moves (by definition (2.3)). ■

Remark (4.5):
Now, we will construct a connected graph by using the elements of  . Let   ( ) be an undirected graph as shown in figure (1), such that  ,   and   …,   is an edge where  . We can connect all ( ) - contingency tables with fixed two dimensional marginals by ( ) edges, by applying moves from   one by one, going from   to   without causing negative cell frequencies on the way; also, each element of   ( ) is connected to an edge of this type.

Figure (1): The graph  , where the contingency tables are interpreted as vertices and connecting moves are interpreted as edges of a graph.

Theorem (4.6):
The graph   is a connected bipartite graph (up to graph isomorphism).
Proof:
First we prove that   is a connected graph. Suppose  ,  . If  , there exists a path  , and if   and  , there exists a path  , and that implies there exists a path between every pair of distinct vertices of  . Hence the graph  , by definition (2.6), is a connected graph.
Now, we prove that the graph   is a bipartite graph. Let   be a cycle in  . Let  . Then  , since the edge  ; then   or  , since the edge  . Continuing in this way, we see that if   is odd, then  , and if   is even then  . Since  , it implies that   is even, and thus the cycle is of even length. By theorem (2.8), the graph   is a bipartite graph. ■

Corollary (4.7):
The set of moves   in theorem (4.4) is a Markov basis.
Proof:
Let   be a finite set of moves. From theorem (4.6) the graph   is a connected graph. By definition (2.10),   is a Markov basis for  . ■

Now, we will find the toric ideals corresponding to the Markov basis for ( ) - contingency tables.

Corollary (4.8):
Let   be an even number, such that  , and let   be a Markov basis for  . Then the toric ideal for  ,  , is such that  .
Proof:
From corollary (4.7),   is a Markov basis for  , and by theorem (2.13) the set of binomials { } is a generating set of the toric ideal  . Since  ,  , is a matrix of dimension   and either it has two columns   and the other columns are zero, or it has two columns   and the other columns are zero, the toric ideal is the ideal  , such that  . ■

Example (4.9):
For a   - contingency table, according to theorem (4.2) there are ( ) moves in a Markov basis for   - contingency tables, and by remark (4.1)
{[ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ]}.
By corollary (4.8), the toric ideal of the   - contingency table  .
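The correspondence between moves and binomial generators used in corollary (4.8) can be made concrete with a small helper. This is a hedged sketch: the indeterminate names `x1, x2, ...` and the function `binomial` are assumptions made for illustration, and the moves shown are the standard basic moves of a 2×3 table rather than the Markov basis constructed in this paper.

```python
def binomial(z, var="x"):
    """Return the binomial y^(z+) - y^(z-) of an integer move z as a string.
    Cells are numbered from 1 in the order in which z is stored."""
    positive = [(i + 1, v) for i, v in enumerate(z) if v > 0]
    negative = [(i + 1, -v) for i, v in enumerate(z) if v < 0]

    def monomial(factors):
        terms = [f"{var}{i}" if e == 1 else f"{var}{i}^{e}" for i, e in factors]
        return "*".join(terms) or "1"

    return f"{monomial(positive)} - {monomial(negative)}"


moves = [
    (1, -1, 0, -1, 1, 0),
    (0, 1, -1, 0, -1, 1),
    (1, 0, -1, -1, 0, 1),
]
generators = [binomial(z) for z in moves]
```

Each generator multiplies the indeterminates of the cells that the move increases and subtracts the product of those it decreases, so a degree-2 move yields a quadratic binomial.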
Remark (4.10):
We can use corollary (4.8) to find the toric ideal of ( ) - contingency tables without finding the Markov basis.

5. A New Model of Permutation of the Pieces of Nucleotides in DNA Sequences
In this section, we construct a new model of genetic algorithm of permutation of the pieces of nucleotides in aligned DNA sequences using an algorithm. Now we describe our model in the following steps.

Step (1):
Suppose we have   taxons of DNA sequences, each taxon of length  , such as

Taxon1: AGCTAACGGTACGT
Taxon2: CGATCTGACCCGTT
. . .
Taxonm: ACGTCACGTAACGC

Then, we define a pattern to be the sequence of characters seen when we look at a single site (column) of our sequence data. In the sequences above, we can look at the first site in the sequences and see the pattern "AC . . . A". A pattern frequency is the number of times that a pattern appears in our set of sequence data, and we refer to the number of frequencies by  , where   is an even number greater than or equal  .

Step (2):
We can input the pattern frequencies of the above sequences in ( ) - contingency tables as follows:
∑   ∑   ∑   | | ∑
where | | ∑   is the length of the sequences (the sample size), and
  is the frequency of the first pattern,
  is the frequency of the second pattern,
  is the frequency of the   pattern,
  is the frequency of the   pattern,
  is the frequency of the   pattern,
  is the frequency of the   pattern.

Step (3):
Represent the contingency table   as a   - dimensional column vector of non-negative integers  , where   denotes the transpose of a vector or matrix, as in remark (2.2). Then   is a   - fiber (i.e.  ), where  .

Step (4):
From remark (4.3),   is a   matrix and
[ ],
where   is written as
[ ] [ ] [ ],
where the columns of the matrix   are indexed by the elements of the column vector  .

Step (5):
Use remark (4.1), theorem (4.2), and corollary (4.7) to find the Markov basis   ( ).

Step (6):
Use remark (4.5) and theorem (4.6) to find the connected graph  , where the ( )   - contingency tables (  - fibers) are its vertices.

Step (7):
Use corollary (4.8) to find the toric ideals for ( ) - contingency tables.

Step (8):
Use the ( ) elements of the Markov basis   in step (6) to find the permutation of the pieces of nucleotides in aligned DNA sequences.

Example (5.1):
Suppose we have the following three aligned DNA sequences:

Taxon1: AGCTGAATGGC
Taxon2: AGATCTATCCA
Taxon3: ATTCAGACAAT

Step (1):
There are three taxons for the above DNA sequences, with | | = 11 sites. The site patterns are AAA, GGT, CAT, TTC, GCA, ATG, AAA, TTC, GCA, GCA, and CAT, so there are six distinct patterns AAA, GGT, CAT, TTC, GCA, and ATG, with frequencies 2, 1, 2, 2, 3, and 1 respectively.

Step (2):
Now, we input the pattern frequencies of the above sequences in a   - contingency table as follows (the last column and the last row give the row sums and column sums):

2  1  2 |  5
2  3  1 |  6
4  4  3 | 11

Step (3):
Represent the contingency table   as a   - dimensional column vector of non-negative integers  , where   denotes the transpose of a vector or matrix, as in remark (2.2). Then   is a   - fiber (i.e.  ), where  .

Step (4):
From remark (4.3),   is a   matrix and
[ ],
where   is written as
[ ] [ ] [ ],
where the columns of the matrix   are indexed by the elements of the column vector  .

Step (5):
Use remark (4.1), theorem (4.2), and corollary (4.7) to find the Markov basis   ( ).

Step (6):
The connected graph  , with ( )   - contingency tables (  - fibers) as its vertices, is shown in figure (2).

Figure (2): The graph  . Its vertices are the following tables, each with row sums 5, 6 and column sums 4, 4, 3, and its edges are labelled by the connecting moves:

2 1 2     3 1 1     1 2 2     1 1 3     3 0 2     2 0 3
2 3 1     1 3 2     3 2 1     3 3 0     1 4 1     2 4 0

Step (7):
By corollary (4.8), the toric ideal of   - contingency tables  .

Step (8):
Then the permutation of the pieces of nucleotides in aligned DNA sequences under the set of Markov basis   is as follows:

Figure (3): The permutation of the pieces of nucleotides in aligned DNA sequences under the set of Markov basis  .

Remark (5.2):
i. In figure (3), ①, ②, ③, and ④ refer to the frequencies of the patterns in the DNA sequences.
ii. In the same figure, ⓿ refers to the hidden in the pattern frequency of the DNA sequences.
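Steps (1) and (2) of example (5.1) can be sketched in a few lines. This is an illustration only: the function names are hypothetical, and filling the two-row table with the patterns in order of first appearance, row by row, is an assumption that happens to agree with the table of the example.

```python
from collections import Counter


def pattern_frequencies(sequences):
    """Count how often each site pattern (column of the alignment) occurs."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("sequences must be aligned (equal length)")
    return Counter("".join(column) for column in zip(*sequences))


def as_two_row_table(frequencies):
    """Place the frequencies in a 2-row table, row by row, in order of
    first appearance of the patterns (an assumed layout)."""
    counts = list(frequencies.values())
    if len(counts) % 2 != 0:
        raise ValueError("the number of distinct patterns must be even")
    half = len(counts) // 2
    return [counts[:half], counts[half:]]


taxa = ["AGCTGAATGGC", "AGATCTATCCA", "ATTCAGACAAT"]
freq = pattern_frequencies(taxa)
table = as_two_row_table(freq)
row_sums = [sum(row) for row in table]
col_sums = [sum(col) for col in zip(*table)]
sample_size = sum(row_sums)
```

The even number of distinct patterns is what allows the frequencies to be arranged in two rows of equal length, and the sample size equals the common length of the aligned sequences.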
Example (5.3):
Consider finding the connected graph   for   contingency table. Using the computer system Maple 13, the following commands will do the job:

True
True
{ }
True
True

6. Conclusions
A contingency table is a matrix of nonnegative integers with prescribed positive row and column sums. Contingency tables are used in statistics to store data from sample surveys. One of the related problems for a survey of contingency tables is how to generate a table from the set of all nonnegative integer tables with given row and column sums (Fisher's Exact Test, Missing Data Problems). A   - contingency table can be written as a   - dimensional column vector of non-negative integers,  , where  . The contingency table can also be considered as an element in   - fibers  , where  ,   is a linear transformation, and   is a   - dimensional column vector of row and column sums of the contingency table  . Let   be a finite set, such that  , and let   be the graph with vertex set   and an edge   if and only if  . If   is connected, then   is a Markov basis for  . The question of how to find a finite set   such that   is connected is another problem. If   is found, then the set of binomials { } generates the toric ideal for   (i.e. the   is used to find a toric ideal).

In this paper, we introduce an algorithm based on the actions of subgroups of the dihedral group   on   that gives solutions for all previous problems when   is an even number greater than or equal  , and  . Remark (4.1), theorem (4.2), theorem (4.4), and remark (4.5) are used to generate ( ) contingency tables (  matrices) in   - fibers  , where   are the row sums and   are the column sums for each   - contingency table. Also, they are used to find a finite set  . Remark (4.5) explains how to construct the graph  , and theorem (4.6) states that   is a connected graph; this implies   is a Markov basis, as mentioned in corollary (4.7). Corollary (4.8) gives the toric ideal for   using the Markov basis  . All these results are used in section (5) to construct a new model of genetic algorithm of permutation of the pieces of nucleotides in aligned DNA sequences.

References
[1] A. Dobra, "Markov bases for decomposable graphical models", Paper presented at Grostat V, New Orleans, 4-6 September 2001, submitted for publication, 2001.
[2] A. Dobra and S. Sullivant, "A divide-and-conquer algorithm for generating Markov bases of multi-way tables", Paper presented at the workshop "Multidimensional tables: statistics, combinatorial optimization, and Groebner bases", University of California at Davis, 15-16 February 2002, submitted for publication.
[3] A. Takemura and S. Aoki, "Some characterizations of minimal Markov basis for sampling from discrete conditional distributions", Annals of the Institute of Statistical Mathematics, 2003.
[4] A. Takemura and S. Aoki, "Distance reducing Markov bases for sampling from discrete space", Bernoulli, to appear, 2005.
[5] C. Semple and M. Steel, "Phylogenetics", volume 24 of Oxford Lecture Series in Mathematics and its Applications, Oxford University Press, Oxford, (2003).
[6] G. Agnarsson and R. Greenlaw, "Graph theory, modeling, applications, and algorithms", United States of America, (2007).
[7] J. Felsenstein, "Inferring Phylogenies", Sinauer Associates, Sunderland, (2003).
[8] M. Dyer and C. Greenhill, "Polynomial-time counting and sampling of two-rowed contingency tables", Theoretical Computer Sciences, 246, (2000), pp. 265-278.
[9] P. Diaconis and B. Sturmfels, "Algebraic algorithms for sampling from conditional distributions", The Annals of Statistics, 26, (1998), pp. 363-397.
[10] P. Diaconis, D. Eisenbud, and B. Sturmfels, "Lattice walks and primary decomposition", Mathematical Essays in Honor of Gian-Carlo Rota, eds. B. Sagan and R. Stanley, Progress in Mathematics, Vol. 161, Birkhauser, Boston, (1998), pp. 173-193.
[11] S. Aoki and A. Takemura, "The largest group of invariance for Markov bases and toric ideals", J. Symbolic Computation 43 (5) (2008) 342-358.
[12] S. Aoki and A. Takemura, "The list of indispensable moves of unique minimal Markov basis of 3×4×K and 4×4×K contingency tables with fixed two-dimensional marginals", Technical Report METR 03-38, Department of Mathematical Engineering and Information Physics, The University of Tokyo, submitted for publication, (2003).
[13] S. Aoki and A. Takemura, "Minimal basis for connected Markov chain over 3×3×K contingency tables with fixed two-dimensional marginals", Australian and New Zealand Journal of Statistics, 45, (2003), pp. 229-249.