
A Content-Constrained Spatial (CCS) Model for Layout Analysis of Mathematical Expressions

Xing Wang, Jyh-Charn Liu*
Department of Computer Science and Engineering, Texas A&M University, College Station, USA
{wxadb, jcliu}@tamu.edu

Abstract: This paper proposes a content-constrained spatial (CCS) model to recover the mathematical layout (M-layout, or MLME) of a mathematical expression (ME) from its font setting layout (F-layout, or FLME). The M-layout can be used for content analysis applications such as ME-based indexing and retrieval of documents. The first step of the two-step process divides a compound ME into blocks based on explicit mathematical structure primitives such as fraction lines, radical signs, and fences. In the second step, subscripts and superscripts within a block are resolved by probabilistic inference of their likelihood based on a global optimization model. The features that capture the relative position between sibling blocks as super/subscripts have two-peak distributions, which calls for a sampling-based non-parametric probability distribution estimation method to resolve their ambiguity. The notion of spatial constraint indicators is proposed to reduce the search space while improving the prediction performance. The proposed scheme is tested on the InftyCDB data set and achieves an F1 score of 0.98.

Keywords: mathematical expression; layout; superscript; subscript; content-constrained spatial modeling

I. INTRODUCTION

The PDF file format is the de facto industry standard for a broad range of publications. The vast volume of science, technology, engineering, and mathematics (STEM) publications produced in the PDF format is a major information source for analyzing their technical essence, much of which is expressed in mathematical expressions (MEs). A PDF file only contains layout specifications of fonts and their attributes; no explicit label is available for MEs. Being able to identify MEs [1] in PDF files and then recover their mathematical structures will enable a wide range of high-level applications such as information retrieval, machine reading, polymorphism modeling, and similarity analysis.

Human readers can readily recognize printed stacked, subscripted, and superscripted symbols even when they appear in mixed fonts and styles. In contrast, the feature distributions derived from the relative positions of ME elements in the font setting layout (F-layout or FLME) of an ME overlap, which greatly challenges computer-based generation of its mathematical layout (M-layout or MLME). An error at one position has a domino effect on the correctness of subsequent analysis steps.

The objective of this paper is the development of novel M-layout models of MEs for mathematical parsing of MEs in the PDF format. That is, given the PDF F-layout of an ME, the mathematical parser aims to produce its MLME, expressed in a taxonomy of ME structures called "MEGroup" (Section II.A) that is based on the mathematical semantics of the ME. An MLME consists of atomic structures; enclosed structures such as fences and radicals; vertical structures such as accents, fractions, and binding variable operators; and horizontal structures that arrange elements on the same baseline ("HOR"), as a superscript ("SUP"), or as a subscript ("SUB"), as marked by the curved arrows in Fig. 1.

* Corresponding author.

Fig. 1. Layout for horizontally arranged MEGroups (arrows labeled HOR, SUP, and SUB).

Fig. 2. ME layout analysis system. Step ❶, vertical/enclosed structure processing: 1. accent processing, 2. radical processing, 3. fraction processing, 4. BindVar processing, 5. SUP&SUB processing, 6. fence processing. Step ❷, CCS horizontal structure processing: layout candidate generation (7. operator/relation/punctuation constraints, 8. script level sameness, horizontal chain generation, recursive sub-configuration generation), followed by layout candidate ranking with P(CPRCs | HR, NVCD).

The general workflow to recover the MLME from an FLME is illustrated in Fig. 2. The first step of the two-step process identifies vertical and enclosed structures through a sequence of pre-processing steps (shown as ❶ in Fig. 2; Section II.B). The second step (shown as ❷ in Fig. 2), which is the focus of this paper, addresses the primary challenge: identifying the left sibling of each element together with their relative spatial relation, i.e., same baseline (HOR), superscript (SUP), or subscript (SUB), based on a novel content-constrained spatial (CCS) model. CCS uses the ME content to eliminate unlikely candidates from the search space that is fed to the pattern analysis and naïve Bayesian decision models, which yields significant performance gains.

In the CCS model, all vertical relationships of an ME have already been resolved, so only the horizontally arranged elements $S = \{s_1, \dots, s_n\}$ remain and their sibling relationships must be determined. Let a horizontal M-layout candidate be denoted as a list of triplets HMLME $= \{T_i\}_{i=1,\dots,n}$, where $T_i = \langle i, p, r \rangle$ states that $s_i$ is the right sibling of $s_p$ with relation $r \in \{HOR, SUB, SUP\}$. Based on writing conventions, a feasible HMLME should satisfy four axioms: A1 (OneSibling), A2 (OneSiblingRel), A3 (VerOverlap), and A4 (NoSkipScript) (Section III.A). To reduce the candidate space for HMLME, we propose three content-based constraints: (C1) the must-HOR constraint and (C2) the exist-HOR constraint, both based on the usage of special math symbols such as punctuation, operators, and relations; and (C3) the script level sameness constraint, which can be reliably identified by applying a high threshold to the probabilistic output of a HOR|SUP|SUB classifier (Section III.B). A recursive procedure generates the list of HMLME candidates that satisfy A1-A4 and C1-C3 (Section III.C).

Among the HMLME candidates, the most likely one is inferred probabilistically from the posterior of the relative positions of neighbors, conditional on the observed bounding boxes of the MEGroups. To avoid the chain effect of a local error, we transform a layout candidate into a set of child-parent relation chains (CPRCs) covering all pairs of MEGroups in $S$. Two features with stable discriminating ability are adopted: the height ratio (HR), which captures relative size, and the normalized vertical center difference (NVCD), which captures vertical shift. The posterior probability distribution of a CPRC conditional on the observed HR and NVCD is inferred recursively as a product or sum distribution. It is estimated with a non-parametric sampling method because the feature distributions have a two-peak shape.

Parameter estimation and evaluation are conducted on the public dataset InftyCDB [2]. We achieve an F1 score of 0.98 on sibling-relation triple identification. Compared with a greedy baseline, our system improves the F1 score by 0.2.

II. MATHEMATICAL STRUCTURE PROCESSING

A. Types of MEGroup

Generation of the MLME from its FLME requires correct grouping of the elements in an ME.
To serve this purpose, different types of ME elements and their groupings are organized into the abstractions illustrated in Fig. 3. An MEObject represents an MESymbol (an alphanumeric, operator, or relation symbol) or an MEPath (a fraction or radical symbol). The bounding boxes of MESymbols are normalized to offset the differences caused by the presence of ascenders and descenders [2]. These two subtypes are the primitives that form an MEGroup, which is refined into four subtypes: MESymbolGroup, consisting of a single symbol, and three compound types, namely the vertical, horizontal, and enclosed ME structures. The vertical type can be MEAccentGroup (AG), MEFractionGroup (FrG), MEBindVarGroup (BG), or MESupSubGroup (SSG). The enclosed type can be MERadicalGroup (RG) or MEFenceGroup (FeG). For simplicity, only paired asymmetric fence symbols are used in this study; symmetric fence symbols are not included. The horizontal type can be MESupGroup (a base and its superscript), MESubGroup (a base and its subscript), or MEHorGroup (a horizontal chain of MEGroups).

Transformation of an FLME into its MLME is a recursive process that identifies MEGroup instances based on their nested relationships. It can be modeled as a 2D search problem in which the vertical axis represents stacking and sub-/superscripting, and the horizontal axis represents the left-to-right expansion of ME terms. We propose a divide-and-conquer technique to solve this recursive grouping problem. The first step identifies non-horizontal structures based on indicating ME elements. Then, the CCS model resolves the relative relations along the horizontal direction. The overall workflow is illustrated in Fig. 4.

Fig. 3. ME primitives and their grouping types.
Primitive ME structures: MEObject with subtypes MESymbol and MEPath (attributes shown: adjustBBox: BBox, tightBBox: BBox, latex_val: String).
MEGroup and MESymbolGroup (members shown: children(): vec, attachedObj(): MEObject, attacherObj(): MEObject, meSymbol: MESymbol).
Vertical ME structures:
  MEAccentGroup (hatSymbol: MESymbol, underMEGroup: MEGroup)
  MEFractionGroup (fractionPath: MEPath, upMEGroup: MEGroup, downMEGroup: MEGroup)
  MEBindVarGroup (operator: MESymbol, upMEGroup: MEGroup, downMEGroup: MEGroup)
  MESupSubGroup (baseMEGroup: MEGroup, supMEGroup: MEGroup, subMEGroup: MEGroup)
Horizontal ME structures:
  MEHorGroup (meGroups: vec)
  MESubGroup (baseMEGroup: MEGroup, subMEGroup: MEGroup)
  MESupGroup (baseMEGroup: MEGroup, supMEGroup: MEGroup)
Enclosed ME structures:
  MERadicalGroup (radicalSymbol: MESymbol, containMEGroup: MEGroup)
  MEFenceGroup (open/closeSymbol: MESymbol, meGroup: MEGroup)
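The grouping types of Fig. 3 map naturally onto a small class hierarchy. The following is a minimal Python sketch, not the authors' implementation: it covers only four of the types, and the shortened field names (`up`, `down`, `base`, `sup`, `groups`) and the helper functions `symbols` and `depth` are assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MEGroup:
    """Base type for every node of the M-layout tree."""

    def children(self) -> List["MEGroup"]:
        return []


@dataclass
class MESymbolGroup(MEGroup):
    latex_val: str = ""              # e.g. "x" or "+"


@dataclass
class MEFractionGroup(MEGroup):
    up: Optional[MEGroup] = None     # numerator (upMEGroup in Fig. 3)
    down: Optional[MEGroup] = None   # denominator (downMEGroup in Fig. 3)

    def children(self) -> List[MEGroup]:
        return [g for g in (self.up, self.down) if g is not None]


@dataclass
class MESupGroup(MEGroup):
    base: Optional[MEGroup] = None   # baseMEGroup in Fig. 3
    sup: Optional[MEGroup] = None    # supMEGroup in Fig. 3

    def children(self) -> List[MEGroup]:
        return [g for g in (self.base, self.sup) if g is not None]


@dataclass
class MEHorGroup(MEGroup):
    groups: List[MEGroup] = field(default_factory=list)

    def children(self) -> List[MEGroup]:
        return list(self.groups)


def symbols(group: MEGroup) -> List[str]:
    """Collect the leaf symbols of a group in depth-first order."""
    if isinstance(group, MESymbolGroup):
        return [group.latex_val]
    return [s for child in group.children() for s in symbols(child)]


def depth(group: MEGroup) -> int:
    """Nesting depth of the hierarchy (a single symbol has depth 1)."""
    return 1 + max((depth(c) for c in group.children()), default=0)


# The expression (x^2 / y) + 1 as a nested MEGroup hierarchy.
x_sq = MESupGroup(base=MESymbolGroup("x"), sup=MESymbolGroup("2"))
frac = MEFractionGroup(up=x_sq, down=MESymbolGroup("y"))
expr = MEHorGroup([frac, MESymbolGroup("+"), MESymbolGroup("1")])
```

The recursive `children()` accessor mirrors the nested relationship described in the text: every compound group exposes its parts, so traversal code does not need to know the concrete type.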

Fig. 4. Processing sequence and intermediate ME structure types: FLME, UnorganizedGroupPath (meGroups: vec, mePaths: vec), ME structure processing 1, 2, 3, 4, MEHS&SGroup (meGroups: vec), ME structure processing 5, 6, MEHS|SGroup (meGroup: vec), CCS layout model, MLME. Processing steps 1 to 6 form the non-horizontal structure identification stage.

B. Non-horizontal ME Structure Analysis

Given an FLME, an MESymbolGroup is created for each symbol and an MEPath for each vector graphic. The MESymbolGroups and MEPaths together form an UnorganizedGroupPath (UGP), which is the input of processing modules 1, 2, 3, and 4. These modules identify the ME structures AG, RG, FrG, and BG, so that afterwards every MEGroup left in a UGP is an MESupGroup, MESubGroup, MEHorGroup, or SSG. The intermediate result produced by modules 1-4 is called an MEHS&SGroup ("HS&SG"), in which one MEGroup may still have both a superscript and a subscript. It is passed to processing modules 5 and 6, which identify ME structures with both superscript and subscript, and FeGs. The horizontally arranged MEGroups that can no longer have both a superscript and a subscript are called an MEHS|SGroup ("HS|SG"), which is passed to the CCS model to generate the MLME. The processing modules are executed in the following sequence.

Accent processing (module 1) is iterative. In each iteration, every accent symbol that has no other accent symbol under it is identified. Under the assumption that the elements topped by the accent symbol overlap vertically, they are extracted as a UGP that becomes part of the AG.

Radical processing (module 2) iteratively finds a radical symbol that is not contained in the scope of a bigger radical and constructs an RG. After step 2, all remaining unprocessed MEPath elements are of the fraction type.

Fraction processing (module 3) iteratively identifies the fraction line with the smallest horizontal span. It then locates the numerator and denominator as UGPs by finding the vertically overlapping MEGroups over and under the fraction line.

Binding variable operator processing (module 4) constructs a BG for each big operator [14], together with the vertically overlapping components over and under it as UGPs.
After these four steps, an entity can still be associated with both a superscript and a subscript. Step 5 resolves these cases, so that all UGPs are transformed into known ME structures or the intermediate "HS|SG". Step 6 uses fence symbols to divide a long structure into smaller units by grouping the elements between paired fence symbols into FeGs for the CCS model based analysis.

Sub-/superscript processing (module 5) runs in iterations. In each iteration, every UGP in the existing MEGroup hierarchy is recursively traversed and processed. Within each UGP, we locate the first MEGroup $s_b$ that has two MEGroups $s_u$ and $s_d$ directly to its upper right and lower right that do not overlap vertically. Each of $s_u$ and $s_d$ is then expanded with the MEGroups on its right that vertically overlap with the current $s_u$ (respectively $s_d$) but not with the other one. Each expansion creates a UGP, and together with the base MEGroup $s_b$ the two UGPs form an SSG. The process terminates when an iteration generates no SSG.

Fig. 5. Math structures produced by each of the 6 processing steps highlighted by their corresponding colors coded in Fig.3.

An example of the MEGroups generated by the six-step processing is highlighted by dash-lined boxes in Fig. 5, where the line color represents the type color coded in Fig. 3.

III. CONTENT-CONSTRAINED SPATIAL MODEL

After the six-step preprocessing, the hierarchical mathematical structure contains only nodes of the simple type, vertical structures, enclosed structures, and "HS|SG". We only need to resolve the sibling and sub-/superscript relationships of the horizontally arranged MEGroups contained in each "HS|SG". In the first part of this section, we explain the horizontal mathematical layout (HMLME), which can describe any possible sequence of MEGroups. Then, three types of content constraints that reduce the candidate space are proposed and integrated into the recursive procedure that enumerates the HMLME candidates. Finally, we present the inference of the most likely layout given the observed bounding box of each MEGroup.

A. Horizontal Mathematical Layout HMLME

The horizontal mathematical layout HMLME describes the relative positions of the horizontally arranged MEGroups $S = \{s_1, s_2, \dots, s_n\}$ under an "HS|SG", sorted by increasing left border. One possible layout for this sequence is a set of $n - 1$ triples, $L = \{\langle i, i_p, r \rangle\}$, where $i$ and $i_p$ are MEGroup indices and $r \in \{HOR, SUP, SUB\}$ is the relative position between $s_i$ and its sibling $s_{i_p}$. The triples need to satisfy four axioms based on writing conventions.

Axiom A1 (OneSibling): Each $s_j$ can only be attached to one $s_i$ on its left, i.e., $\forall \langle i, i_p, r \rangle \in L,\ i_p < i$.

Axiom A2 (OneSiblingRel): $\forall j \in [1, n],\ r \in \{HOR, SUP, SUB\}$, $\nexists\, i \neq i'$ such that $\langle j, i, r \rangle \in L$ and $\langle j, i', r \rangle \in L$.

Axiom A3 (VerOverlap): $s_i$ should vertically overlap with $s_{i_p}$, following the typesetting convention of superscript, subscript, or baseline.

Axiom A4 (NoSkipScript): If $r \in \{SUP, SUB\}$, then there is no other MEGroup horizontally between $s_i$ and $s_{i_p}$, i.e., $i_p = i - 1$.

Fig. 6. Illustration of the four axioms (panels A1 to A4, showing MEGroups i, j, k, l and HOR links).
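The four axioms can be checked mechanically for a candidate layout. The sketch below is one possible reading of A1-A4 in Python, under stated assumptions: each MEGroup is reduced to its vertical extent `(bottom, top)` with the y-axis pointing up, A1 is taken to include uniqueness of the left sibling, and the function name `violated_axioms` is hypothetical.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[int, int, str]   # (child index i, left-sibling index i_p, relation r)
Span = Tuple[float, float]      # (bottom, top) of a bounding box on the y-axis


def v_overlap(a: Span, b: Span) -> bool:
    """True when the two vertical extents share a non-empty interval."""
    return min(a[1], b[1]) > max(a[0], b[0])


def violated_axioms(layout: List[Triple], spans: Dict[int, Span]) -> List[str]:
    """Return the sorted names of the axioms that the layout violates."""
    bad: Set[str] = set()
    siblings: Dict[int, Set[int]] = {}
    by_rel: Dict[Tuple[int, str], Set[int]] = {}
    for i, p, r in layout:
        siblings.setdefault(i, set()).add(p)
        by_rel.setdefault((i, r), set()).add(p)
        if p >= i:                                   # A1: sibling must be on the left
            bad.add("A1")
        if not v_overlap(spans[i], spans[p]):        # A3: vertical overlap required
            bad.add("A3")
        if r in ("SUP", "SUB") and p != i - 1:       # A4: scripts attach to i - 1
            bad.add("A4")
    if any(len(ps) > 1 for ps in siblings.values()):  # A1: only one left sibling
        bad.add("A1")
    if any(len(ps) > 1 for ps in by_rel.values()):    # A2: one sibling per relation
        bad.add("A2")
    return sorted(bad)


# Vertical extents for the three symbols of "1^2 3" (illustrative numbers).
spans = {1: (0.0, 10.0), 2: (6.0, 12.0), 3: (0.0, 10.0)}
valid = [(2, 1, "SUP"), (3, 1, "HOR")]
skipping = [(2, 1, "SUP"), (3, 1, "SUP")]
```

Here `violated_axioms(valid, spans)` is empty, while `skipping` breaks A4 because the third symbol is declared a superscript of the first although the second symbol lies between them.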

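B. Content Constraints over the ME Layout

Content constraints are placements and orderings of symbols derived from mathematical conventions. The symbol values of the following groups are indicators for constraints: unary relations (UR) = {!, ∃, ∀, …}, binary relations (BR), unary operators (UO), binary operators (BO), and punctuation (Punct.). Following the CreateConstraints procedure in Section III.C, three types of constraints are derived. (C1) Must-HOR: the MEGroup immediately to the right of a UO, UR, BO, BR, or Punct. symbol must be in the HOR relation with that symbol. (C2) Exist-HOR: a BO, BR, or Punct. symbol must be in the HOR relation with at least one MEGroup on its left. (C3) Script level sameness: two MEGroups $s_i$ and $s_j$ are placed at the same script level when $P(HOR \mid (s_i, s_j)) > 0.95$.

The three types of constraints are illustrated in Fig. 7. Here, the "+" sign in the middle of "j+1" creates a C1 constraint that the symbol pair "+" and "1" satisfies a horizontal relationship. Since "+" is a binary operator, a C2 constraint is also created: there must exist one MEGroup on its left (shown in the curly bracket) that has a horizontal relationship with "+" and plays the role of the left operand. Further, there are two groups of MEGroups at the same script level: group 1 includes "$\delta$", "$\zeta$", and "$d\hat{\zeta}$", and group 2 includes "$j$" and "1".

Fig. 7. Content-constrained HMLME model (annotations: relation/operator/punctuation constraints; exist one to attach to with horizontal relationship; horizontal relationship; same script level group 1; same script level group 2).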
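As a rough illustration of how the three constraint types could be created and tested against a horizontal chain, the sketch below follows the CreateConstraints and ConstraintSat procedures of Section III.C. Several details are assumptions rather than the authors' choices: only the UR set comes from the text, the other symbol sets are placeholders, the same-baseline classifier is passed in as a hypothetical callable `p_hor`, and constraints that would reference an index outside $[1, n]$ are skipped.

```python
from typing import Callable, List, Set, Tuple

Constraint = Tuple[int, Tuple[int, int], str]   # (i, [j, k], type), 1-based indices

# Only UR is taken from the text; the remaining sets are illustrative assumptions.
UR = {"!", "∃", "∀"}
UO = {"¬"}
BO = {"+", "-", "×"}
BR = {"=", "<", ">"}
PUNCT = {",", ";"}


def create_constraints(symbols: List[str],
                       p_hor: Callable[[int, int], float],
                       threshold: float = 0.95) -> List[Constraint]:
    n = len(symbols)
    out: List[Constraint] = []
    for i, sym in enumerate(symbols, start=1):
        # C1 (must-HOR): the next MEGroup is HOR with an operator/relation/punct.
        if sym in UO | UR | BO | BR | PUNCT and i < n:
            out.append((i + 1, (i, i), "C1"))
        # C2 (exist-HOR): a binary symbol needs a HOR partner somewhere on its left.
        if sym in BO | BR | PUNCT and i > 1:
            out.append((i, (1, i - 1), "C2"))
    # C3 (script level sameness): confident same-baseline pairs.
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            if p_hor(i, j) > threshold:
                out.append((j, (i, i), "C3"))
    return out


def constraint_sat(constraints: List[Constraint], hc: Set[int]) -> bool:
    """Check a horizontal chain hc (a set of indices) against the constraints."""
    for i, (j, k), ct in constraints:
        if ct in ("C1", "C3"):
            if (i in hc) != (j in hc):      # both on the chain or both off it
                return False
        elif ct == "C2":
            if i in hc and not any(j <= m <= k for m in hc):
                return False
    return True


# The script "j+1" preceded by a base symbol; p_hor is a stand-in classifier.
syms = ["x", "j", "+", "1"]
cons = create_constraints(syms, lambda i, j: 0.99 if (i, j) == (2, 4) else 0.0)
```

With these inputs, a chain containing only the base symbol satisfies all constraints (the script "j+1" is then resolved in a sub-range), whereas a chain that contains "1" but not "+" is rejected by C1.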

C. Constrained HMLME Candidate Generation

The HMLME candidates for $S$ are generated based on the four axioms A1-A4 and the constraints C1-C3. First, the constraint set $C$ is created from the content of $S$: $C = \{\langle i, [j, k], ct \rangle\}$, where $i$ is the index of an MEGroup, $[j, k]$ is the range of MEGroups affected, and $ct \in \{C1, C2, C3\}$ is the constraint type. In the procedures below, C1, C2, and C3 are written as MH (must-HOR), EH (exist-HOR), and SLS (script level sameness), respectively.

We enumerate every horizontal chain $hc$ of MEGroups indexed by $[i_1, \dots, i_K]$, where $1 = i_1 < \dots < i_K \le n$, such that $s_{i_k}$ is the left sibling of $s_{i_{k+1}}$ with relation HOR and the constraints $C$ are satisfied. It is assumed that there is no left super/subscript, so $s_1$ is always the first element of the horizontal chain, i.e., $i_1 = 1$. Given an $hc$ that covers part of the layout information, the problem is reduced to a set of sub-problems for the subranges $SR_k = [i_k + 1, i_{k+1} - 1]$, $k \in [1, K)$, with $i_k + 1 \neq i_{k+1}$. The constraints $C'$ for each subrange are tailored from $C$ by keeping each constraint $\langle i, [j, k], ct \rangle$ with $i, j, k \in SR_k$ and shifting its indices with respect to the beginning of the subrange, $i_k + 1$. The HMLME candidates of all subranges compose a product space $LCProdSpace$, and each item of it is merged with $hc$ to create a candidate. All possible HMLMEs are generated by EnumLayoutCand($S$, CreateConstraints($S$)).

DEF EnumLayoutCand(S, C) → {L}:
    For hc from combinatorial enumeration of [1, n]:
        Skip if !ConstraintSat(C, hc)
        LCSpace = []
        For SR_k = [i_k + 1, i_{k+1} − 1] with i_{k+1} − i_k > 1:
            C′ = CreateLocalConstraints(SR_k, C)
            LC = EnumLayoutCand([s_{k′} : k′ ∈ SR_k], C′)
            LCSpace.add({(k, lc) | lc ∈ LC})
        LCProdSpace = LCSpace_1 × … × LCSpace_K
        For LCProd ∈ LCProdSpace:
            Init L with hc
            For k, lc ∈ LCProd:
                Shift lc with index i_k, and add to L
                Add ⟨lc.first, i_k, SUP|SUB⟩ to L
            yield L

DEF CreateConstraints(S) → C:
    For s_i ∈ S:
        If type(s_i) == MESymbolGroup:
            If s_i.symbol ∈ UO ∪ UR:
                Add ⟨i + 1, [i, i], MH⟩ to C
            Elif s_i.symbol ∈ BO ∪ BR ∪ Punct.:
                Add ⟨i + 1, [i, i], MH⟩ to C
                Add ⟨i, [1, i − 1], EH⟩ to C
    For s_i, s_j ∈ S, i < j:
        If P(HOR | (s_i, s_j)) > 0.95:
            Add ⟨j, [i, i], SLS⟩ to C

DEF ConstraintSat(C, hc) → {true, false}:
    For ⟨i, [j, k], ct⟩ ∈ C:
        If ct == SLS || ct == MH:
            If (i ∈ hc XOR j ∈ hc) return false
        If ct == EH:
            If (i ∈ hc && [j, k] ∩ hc = ∅) return false
    Return true

DEF CreateLocalConstraints([i′, j′], C) → C′:
    For ⟨i, [j, k], ct⟩ ∈ C:
        If (i ∈ [i′, j′] && [j, k] ⊆ [i′, j′]):
            Add ⟨i − i′, [j − i′, k − i′], ct⟩ to C′

D. Probabilistic Inference of the Most Likely HMLME

An HMLME is represented by a set of triples $L = \{\langle i, i_p, r \rangle\}$. Inferring the sibling and the associated relative spatial relation from adjacent pairs alone might lead to errors. To facilitate the processing of long-distance constraints, we translate the layout triples into a set of child-parent relation chains (CPRCs) $\Theta_L = \{\theta_{ij}\}$ that covers all pairs of MEGroups, where the CPRC $\theta_{ij}$ for the child $s_i$ and the parent $s_j$ is created as follows. First, for each $s_i$, we have a chain of indices $\{k^i_1 = i, k^i_2, \dots, k^i_{M_i} = 1\}$ with $\langle k^i_l, k^i_{l+1}, r^i_l \rangle \in L$. For the pair $(s_i, s_j)$, the nearest common sibling is at index $k^i_{m_i} = k^j_{m_j} = \max(\{k^i_1, \dots, k^i_{M_i}\} \cap \{k^j_1, \dots, k^j_{M_j}\})$. Then the CPRC for $(s_i, s_j)$ is $\langle i, j, [r^i_1, \dots, r^i_{m_i - 1}, rev(r^j_{m_j - 1}), \dots, rev(r^j_1)] \rangle$, where $rev(r)$ indicates that the relation $r$ is traversed in the reverse direction, from the sibling $i_p$ to $i$, for $\langle i, i_p, r \rangle \in L$. Take "$1^2 3$" for example: the layout triples are $\langle 2, 1, SUP \rangle$ and $\langle 3, 1, HOR \rangle$, and the corresponding CPRCs are $\langle 2, 1, [SUP] \rangle$, $\langle 3, 1, [HOR] \rangle$, and $\langle 3, 2, [HOR, rev(SUP)] \rangle$.

We would like to find $\arg\max_{L \in L(S)} P(\Theta_L \mid O)$, where $O$ is the set of observed feature values and $L(S)$ is the set of all HMLME candidates. Assuming that the CPRCs $\{\theta_{ij}\}$ are conditionally independent given the observation $O$, we have $P(\Theta_L \mid O) = \prod_{\theta_{ij} \in \Theta_L} P(\theta_{ij} \mid O)$. We then apply Bayes' rule to each posterior probability $P(\theta_{ij} \mid O)$ and estimate it through the likelihood and prior of each type of relation chain as $P(O \mid \theta_{ij}) P(\theta_{ij}) / P(O)$. As the number of CPRCs for a given $S$ is fixed, we can safely eliminate $P(O)$. By further assuming that the priors $P(\theta_{ij})$ are equal, we have $P(\Theta_L \mid O) \approx \prod_{\theta_{ij} \in \Theta_L} P(O \mid \theta_{ij})$. To keep the computing cost low, we remove the CPRCs related to operator, relation, and punctuation symbols, because their relations to other entities are already determined by the constraints above and the variation of their upper and lower bounds with respect to the baseline is not captured by the features.
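The translation from layout triples to CPRCs described in Section III.D can be sketched in a few lines of Python. This is an illustrative reading, not the authors' code: the helper names `chain_to_root`, `cprc`, and `all_cprcs` are hypothetical, and the layout is assumed to satisfy axiom A1 so that every chain terminates at $s_1$.

```python
from typing import Dict, List, Tuple

Triple = Tuple[int, int, str]   # (child index, left-sibling index, relation)


def chain_to_root(i: int, parent: Dict[int, Tuple[int, str]]):
    """Follow left-sibling links from s_i until an element without a sibling."""
    idx, rels = [i], []
    while idx[-1] in parent:
        p, r = parent[idx[-1]]
        idx.append(p)
        rels.append(r)
    return idx, rels


def cprc(i: int, j: int, layout: List[Triple]) -> List[str]:
    """Relation chain from child s_i to parent s_j via their nearest common sibling."""
    parent = {c: (p, r) for c, p, r in layout}
    idx_i, rel_i = chain_to_root(i, parent)
    idx_j, rel_j = chain_to_root(j, parent)
    common = max(set(idx_i) & set(idx_j))
    up = rel_i[: idx_i.index(common)]                      # s_i up to the common sibling
    down = [f"rev({r})" for r in reversed(rel_j[: idx_j.index(common)])]
    return up + down                                       # then back down to s_j


def all_cprcs(layout: List[Triple], n: int) -> Dict[Tuple[int, int], List[str]]:
    """CPRCs for every pair (i, j) with j < i."""
    return {(i, j): cprc(i, j, layout) for i in range(2, n + 1) for j in range(1, i)}


# The example "1^2 3" from the text.
example = [(2, 1, "SUP"), (3, 1, "HOR")]
chains = all_cprcs(example, 3)
```

For the example, `chains` contains `[SUP]` for the pair (2, 1), `[HOR]` for (3, 1), and `[HOR, rev(SUP)]` for (3, 2), matching the CPRCs listed above.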

E. Features and Non-parametric Distribution Estimation

Optimal prediction of the HMLME is thus transformed into estimation of the likelihood $P(O \mid \theta_{ij})$. We first present the two features that capture vertical relations, together with the bounding box adjustment for the different types of MEGroups. We then present the estimation of the feature value distribution for a relation chain $\theta_{ij}$, which yields the likelihood $P(O \mid \theta_{ij})$ that the pair of MEGroups $(s_i, s_j)$ satisfies the relation chain $\theta_{ij}$.

As a CPRC $\theta$ consists of HOR|SUP|SUB relations, only vertical features contribute to the probability estimation. This leads to the selection of two features: the height ratio (HR) and the normalized vertical center difference (NVCD). Formally, we denote by $h_i$ the height of MEGroup $s_i$ and by $c_i$ its center position on the y-axis. Then $\phi_{ij} = h_j / h_i$ is the HR between $s_i$ and $s_j$, and $\psi_{ij} = (c_j - c_i) / h_i$ is the NVCD between $s_i$ and $s_j$ with respect to the height of $s_i$. Calculating these two vertical features only requires the top and bottom of each MEGroup, which are adjusted following common practice [2]. Further, the height of an MEGroup with vertically stacked elements is estimated from its central element, to avoid an unnecessary increase of the height and thus an inaccurate feature value. For example, the height of a binding variable operator is based only on the height of the big operator.

Given the feature distributions for relation chains of size one from the training dataset, we next present the recursive process that expresses the feature random variable (r.v.) of a long relation chain as products and sums of the r.v.s of shorter relation chains. A CPRC $\theta_{i_1 i_m}$ corresponds to a series of triples $\langle i_{k+1}, i_k, r_k \rangle$, $k \in [1, m - 1]$. Its HR ($\Phi_{i_1,i_m}$) and NVCD ($\Psi_{i_1,i_m}$) features are transformed into algebraic operations over the r.v.s $\Phi_{i_1,i_{m_s}}$, $\Phi_{i_{m_s},i_m}$, $\Psi_{i_1,i_{m_s}}$, and $\Psi_{i_{m_s},i_m}$, where $m_s = \lfloor m/2 \rfloor$. Given $\Phi_{i_1,i_{m_s}} = H_{i_{m_s}} / H_{i_1}$ and $\Phi_{i_{m_s},i_m} = H_{i_m} / H_{i_{m_s}}$ by definition, we get a product form of the HR:

$$\Phi_{i_1,i_m} = \frac{H_{i_m}}{H_{i_1}} = \frac{H_{i_m}}{H_{i_{m_s}}} \cdot \frac{H_{i_{m_s}}}{H_{i_1}} = \Phi_{i_{m_s},i_m}\,\Phi_{i_1,i_{m_s}}. \tag{1}$$

Based on the definitions $\Psi_{i_1,i_{m_s}} = \frac{C_{i_{m_s}} - C_{i_1}}{H_{i_1}}$, $\Psi_{i_{m_s},i_m} = \frac{C_{i_m} - C_{i_{m_s}}}{H_{i_{m_s}}}$, and $\Psi_{i_1,i_m} = \frac{C_{i_m} - C_{i_1}}{H_{i_1}}$, we have the NVCD feature as

$$\Psi_{i_1,i_m} = \Psi_{i_1,i_{m_s}} + \Psi_{i_{m_s},i_m}\,\Phi_{i_1,i_{m_s}}. \tag{2}$$
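The product and sum forms in equations (1) and (2) can be checked numerically on concrete boxes. The short Python sketch below does this for a chain of three MEGroups; the `Box` type and the numbers are assumptions chosen for illustration, with `hr` and `nvcd` implementing the definitions of $\phi_{ij}$ and $\psi_{ij}$ given above.

```python
import math
from typing import NamedTuple


class Box(NamedTuple):
    height: float   # h_i
    center: float   # c_i, vertical center on the y-axis


def hr(a: Box, b: Box) -> float:
    """Height ratio phi_ab = h_b / h_a."""
    return b.height / a.height


def nvcd(a: Box, b: Box) -> float:
    """Normalized vertical center difference psi_ab = (c_b - c_a) / h_a."""
    return (b.center - a.center) / a.height


# A chain s1 -> s2 -> s3, e.g. a base, its superscript, and a second-level script.
s1, s2, s3 = Box(10.0, 5.0), Box(7.0, 9.0), Box(5.0, 12.0)

# Equation (1): the HR of the long chain is the product of the two short HRs.
hr_direct = hr(s1, s3)
hr_composed = hr(s2, s3) * hr(s1, s2)

# Equation (2): the NVCD of the long chain is a sum, with the second term
# rescaled by the HR of the first link.
nvcd_direct = nvcd(s1, s3)
nvcd_composed = nvcd(s1, s2) + nvcd(s2, s3) * hr(s1, s2)
```

Both composed values agree with the directly computed ones (0.5 for the HR and 0.7 for the NVCD), which is what allows the distribution of a long chain to be built from samples of its shorter parts.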

Fig. 8. Data flow of non-parametric distribution estimation (nodes: ground truth, samples, CDF, sampling, r.v. algebra, PDF; edges numbered 1 to 6).
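The data flow of Fig. 8 can be sketched with the Python standard library alone. The code below is a simplified illustration under several assumptions: sampling uses the inverse of a linearly interpolated empirical CDF, the HR and NVCD marginals are sampled independently of each other, the PDF uses a Gaussian kernel with the rule-of-thumb bandwidth $0.9 A n^{-1/5}$, $A = \min\{\sigma, \text{IQR}/1.34\}$, the fallback for $A = 0$ replaces $A$ by 1e-3, and all function names are hypothetical.

```python
import math
import random
import statistics
from typing import List, Sequence, Tuple


def draw(sorted_vals: Sequence[float], u: float) -> float:
    """Inverse of the linearly interpolated empirical CDF at u in [0, 1)."""
    pos = u * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def sample(values: Sequence[float], k: int, rng: random.Random) -> List[float]:
    """Draw k samples from the empirical distribution of the observed values."""
    s = sorted(values)
    return [draw(s, rng.random()) for _ in range(k)]


def compose(hr1, nvcd1, hr2, nvcd2, k: int, rng: random.Random
            ) -> Tuple[List[float], List[float]]:
    """Samples of HR and NVCD for a chain built from two shorter chains."""
    phi1, psi1 = sample(hr1, k, rng), sample(nvcd1, k, rng)
    phi2, psi2 = sample(hr2, k, rng), sample(nvcd2, k, rng)
    phi = [a * b for a, b in zip(phi1, phi2)]                        # product form
    psi = [p1 + p2 * a for p1, p2, a in zip(psi1, psi2, phi1)]       # sum form
    return phi, psi


def bandwidth(values: Sequence[float]) -> float:
    """Rule-of-thumb bandwidth 0.9 * A * n^(-1/5); needs at least two values."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    a = min(statistics.pstdev(values), (q3 - q1) / 1.34)
    if a == 0:
        a = 1e-3   # assumption: the small constant replaces A
    return 0.9 * a * len(values) ** (-1 / 5)


def kde_pdf(x: float, values: Sequence[float], h: float) -> float:
    """Gaussian kernel density estimate at x."""
    norm = len(values) * h * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - v) / h) ** 2) for v in values) / norm


# Illustrative single-relation observations (not taken from InftyCDB).
hr_sup = [0.60, 0.65, 0.68, 0.70, 0.72]
nvcd_sup = [0.35, 0.40, 0.42, 0.45, 0.50]
phi, psi = compose(hr_sup, nvcd_sup, hr_sup, nvcd_sup, 200, random.Random(0))
density = kde_pdf(0.45, phi, bandwidth(phi))
```

The composed samples `phi` and `psi` play the role of the regenerated samples in the figure: they can be fed back into `sample` to build even longer chains, or into `kde_pdf` to obtain the density used during inference.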

The workflow for non-parametric distribution estimation is shown in Fig. 8. In step 1, the ground truth data provide the feature statistics for pairs of MEGroups in the relations HOR, SUB, SUP, rev(SUB), and rev(SUP). In step 2, these statistics are used to estimate the cumulative distribution function (CDF). In steps 3, 4, and 5, samples of each r.v. are drawn from the CDF and combined through the random variable algebra of equations (1) and (2) to generate samples of the HR and NVCD features for a longer relation chain. The generated samples are then used to estimate the CDF and PDF of the features: the CDF is used recursively for the CDF estimation of even longer chains, and the PDF is used in the probability inference process.

The CDF is generated from the known feature values by computing the fraction of samples whose feature value is less than a given value; interpolation fills the gaps of the CDF. The PDF is generated by kernel density estimation, with the kernel bandwidth adopted from Sheather [3] as $h_{SROT} = 0.9 A n^{-1/5}$, where $A = \min\{\sigma, \text{interquartile range}/1.34\}$. When $A = 0$, a small value of 1e-3 is used.

IV. EXPERIMENT

An overview of the experiment system is illustrated in Fig. 9. InftyCDB [2] is adopted to train and evaluate our CCS model. Our system is evaluated on 11554 MEs. The average number of symbols per ME is 7.6, and the standard deviation of the ME length is 10. Each ME in InftyCDB is represented by a set of MEObjects, each with a name, a tight bounding box, and its parent.

A. Experiment System

Fig. 9. Experiment overview (blocks: InftyCDB, line identification, split based on ME, InftyCDB ME, BBox adjustment, adjusted InftyCDB ME, feature extraction, sampling-based probability generation, UGP (chars & paths), CCS model evaluation).

We recover the MLME of the MEs in InftyCDB based on the parent relationship. A chain of elements in the HOR relation is identified to estimate the adjustment of the bounding box based on the symbol value [2]. Pairs of MEObjects given by the parent relationship, with adjusted bounding boxes, are fed into the feature extraction module to estimate the HR/NVCD feature distributions of the HOR/SUP/SUB relations. For evaluation, the MEObjects of an ME are used to construct a UGP, which is processed through non-horizontal structure identification and then horizontal structure resolution based on the CCS model.

We use the F1 measure over the sibling-relation triples of the MLMEs to evaluate the performance of the CCS model. The true positive count (TP) is the number of identified triples of the M-layout that also occur in the ground truth dataset. Precision (P) and recall (R) are calculated as

$$P = \frac{TP}{\#\ \text{of identified triples}}, \qquad R = \frac{TP}{\#\ \text{of triples in ground truth}},$$

and F1 is

$$F_1 = \frac{2 \cdot P \cdot R}{P + R}.$$
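The triple-level precision, recall, and F1 defined above reduce to set operations. The following is a minimal Python sketch; the function name `prf1` and the convention of returning 0 for empty inputs are assumptions.

```python
from typing import Iterable, Tuple

Triple = Tuple[int, int, str]   # (child index, left-sibling index, relation)


def prf1(predicted: Iterable[Triple], truth: Iterable[Triple]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 over sibling-relation triples."""
    pred, gold = set(predicted), set(truth)
    tp = len(pred & gold)                       # triples found in both sets
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Ground truth for "1^2 3" versus a prediction that attaches "3" to "2".
gold = [(2, 1, "SUP"), (3, 1, "HOR")]
pred = [(2, 1, "SUP"), (3, 2, "HOR")]
scores = prf1(pred, gold)
```

In this example one of the two predicted triples is correct, so precision, recall, and F1 are all 0.5. A triple counts as correct only when the child, the sibling, and the relation all match.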

B. Baseline System

To further assess the effectiveness of the CCS model, we designed a baseline system that only uses local information to infer the M-layout. It proceeds from left to right, with $i$ as the current index:

1) If the MEGroup at $i + 1$ is identified as having relation $r \in$ SUP|SUB|HOR to MEGroup $i$, create the triple $\langle i + 1, i, r \rangle$.

2) If rev(SUP) or rev(SUB) is identified, initialize $i' = i$ and repeatedly update $i'$ to $i''$, where $\langle i', i'', * \rangle$ is an already identified triple, until the relative spatial relation $r$ between $i'$ and $i + 1$ is identified as SUP|SUB|HOR. Then create the triple $\langle i + 1, i', r \rangle$.

A five-class naïve Bayesian classifier built on HR and NVCD is used to predict HOR|SUP|SUB|rev(SUP)|rev(SUB) for any pair of MEGroups.

C. Result

Overall, our system achieves an F1 score of 0.98, which improves on the greedy-search baseline system by 0.2. As shown in Fig. 10, the average F1 score declines as the length of the ME increases, because the more complex script structures of larger MEs are harder to identify correctly.

Fig. 10. Correlation between ME length and average F1

Among the different math structures, accent structures (685 occurrences) have the most impact on the layout analysis, with a slightly reduced F1 score of 0.96. The performance is not less than the average for the other structures, such as fractions, MEGroups with both sup- and subscripts, and binding variable operators.

TABLE I. STATISTICS ON SYMBOLS VS. THE ERROR INSTANCES

Type | Values
Right item of FP | LeftPar 13, plus 25, minus 31, comma 45, equal 52, RightPar 70
Left item of FP | vert 14, zero 16, p 16, f 21, P 22, one 23, overline 25
Right item of FN | plus 25, minus 31, comma 45, equal 52, LeftPar 69
Left item of FN | L 14, tilde 15, RightPar 16, P 18, f 18, overline 50

To gain further insight into the likely causes of false positive (FP) and false negative (FN) cases, we generated statistics on pairs of left/right symbols vs. FP/FN instances in Table I. The majority of the right symbols in both FP and FN cases are of the relation/operator/punctuation types, because they have many possible sibling candidates to attach to and their irregular bounds also lead to errors in non-horizontal structure identification. The left items of the FP and FN cases also share some similar cases. Accent symbols appear to be common in both.

V. RELATED WORK

Two comprehensive surveys on mathematical expression recognition were done by Chan [4] and Zanibbi [5], who also classified the ME layout analysis solutions into threshold-based and model-based approaches. Okamoto [6] used a fixed threshold to search for the sup/sub. Wang [7] proposed a global model, which is the basis of our work. The bounding box normalization problem was studied and improved in [2]. Aly [2], [8] used relative size and relative position features calculated from normalized bounding boxes to predict the relation between a pair of alphanumeric characters as base-base, base-sub, or base-sup. They primarily focused on alphanumerics, which only cover 57% of all characters and 26.5% of all pairs. Ouyang [9] and Zanibbi [10] proposed features in the log-polar space, with PCA adopted for dimension reduction and improved discriminant ability. Further, Zanibbi gave a review of the existing features. A common bottleneck is predicting a subscript as a horizontal relation. Simistira [11] added a new angle feature. Another line of research, not considered by our work, incorporates semantic analysis into the layout analysis and uses 2D PCFGs. Okamoto [12] first used projection profile cutting to produce a hierarchical structure, which is then traversed and rewritten. Raja [13] adopted graph grammar rewriting over the neighbor graph of symbols by minimizing conflicts.

VI. CONCLUSION

In this work, we propose the CCS model to recover the MLME from an FLME. The framework is extensible and could incorporate the identification of other mathematical structures. The high F1 score suggests that the system can be advanced to practical applications, while the CCS model based algorithms can be further enhanced by pruning unlikely branches of relationships between candidates.

REFERENCES

[1] Wang, Xing, and J.C. Liu. "On a Font Setting based Bayesian Model to extract mathematical expression in PDF files." In 14th IAPR International Conference on Document Analysis and Recognition, Nov. 2017.
[2] Aly, Walaa, Seiichi Uchida, and Masakazu Suzuki. "Identifying subscripts and superscripts in mathematical documents." Mathematics in Computer Science 2.2 (2008): 195-209.
[3] Sheather, Simon J. "Density estimation." Statistical Science 19.4 (2004): 588-597.
[4] Chan, Kam-Fai, and Dit-Yan Yeung. "Mathematical expression recognition: a survey." International Journal on Document Analysis and Recognition 3.1 (2000): 3-15.
[5] Blostein, Dorothea, and Richard Zanibbi. "Processing mathematical notation." Handbook of Document Image Processing and Recognition. Springer London, 2014. 679-702.
[6] Twaakyondo, Hashim M., and Masayuki Okamoto. "Structure analysis and recognition of mathematical expressions." Document Analysis and Recognition, 1995., Proceedings of the Third International Conference on. Vol. 1. IEEE, 1995.
[7] Wang, Zi-Xiong, and Claudie Faure. "Structural analysis of handwritten mathematical expressions." Pattern Recognition, 1988., 9th International Conference on. IEEE, 1988.
[8] Aly, Walaa, et al. "Statistical classification of spatial relationships among mathematical symbols." Document Analysis and Recognition, 2009. ICDAR'09. 10th International Conference on. IEEE, 2009.
[9] Ouyang, Ling. A Symbol layout classification for mathematical formula using layout context. Diss. Rochester Institute of Technology, 2009.
[10] Álvaro, Francisco, and Richard Zanibbi. "A shape-based layout descriptor for classifying spatial relationships in handwritten math." Proceedings of the 2013 ACM symposium on Document engineering. ACM, 2013.
[11] Simistira, Fotini, Vassilis Katsouros, and George Carayannis. "Recognition of online handwritten mathematical formulas using probabilistic SVMs and stochastic context free grammars." Pattern Recognition Letters 53 (2015): 85-92.
[12] Okamoto, Masayuki, and Bin Miao. "Recognition of mathematical expressions by using the layout structures of symbols." Proceedings of the First International Conference on Document Analysis and Recognition. Vol. 1. 1991.
[13] Raja, Amar, et al. "Towards a parser for mathematical formula recognition." MKM. Vol. 6. No. 26. 2006.
[14] Oetiker, Tobias, et al. "The not so short introduction to LaTeX 2ε." (2001).