Advanced Quranic Search Engine

Advanced Search/Indexing in Holy-Quran Assem Chelli 2011 / 2012

Ministry of Higher Education and Scientific Research
National Higher School of Computer Science

Magister Thesis
Option: Mobile Distributed Computing (IRM)
Proposal of an Advanced Retrieval System for the Noble Qur’an

Written By:
• Assem CHELLI

Supervised By:
• Pr. Amar BALLA
• Mr. Taha ZERROUKI

2011/2012

... and say : O my Lord ! have compassion on them, as they brought me up (when I was) little. – Al-isra’ 24


Acknowledgment

First of all, I thank Allah, the Almighty, for giving me the strength and patience to write this modest thesis. I gratefully acknowledge Pr. Amar Balla and Mr. Taha Zerrouki for honoring me with their supervision during this year and for guiding me with advice and meaningful criticism. I also thank the jury members for agreeing to evaluate this modest work. Many thanks to the faculty and administration of the National Higher School of Computer Science (ESI), who took care of my training and supervision throughout the study program.

My deepest thanks go to the Arab open source community, which has offered me great support, and especially to the Alfanous team and community, whose valuable contributions helped carry out this work; I hope to be worthy of the confidence they have placed in me. Finally, I express my appreciation to all who contributed their advice and encouragement to the completion of this work, and to my family and friends for their assistance and support.


Abstract

The Noble Quran differs from every other document we know. It is the sacred book of Muslims and contains knowledge about all aspects of life. Only a small part of this information can be extracted manually, which is insufficient given the amount of knowledge the Quran contains. This raises the need for a method to extract that information: at present there is no efficient method, only printed lexicons and tools that perform simple sequential search with regular expressions. This limitation calls for new ways of interacting with the Quran. The goal of this work is to propose a system for advanced search over all the information contained in the Quran, taking into account the morphology of the Arabic language and the properties of the Qur’anic text. The system should be based on modern information retrieval methods in order to achieve good stability and high search speed. It would be very useful for researchers and could be generalized to cover all Arabic content.

Keywords : Indexing/Search, Arabic, Holy Quran, Information retrieval, Search engines.


Résumé

Le Coran est différent de tous les documents que nous connaissons. C’est le livre sacré des musulmans. Il comporte des connaissances sur tous les aspects de la vie. Avec un tel volume d’informations, on ne peut en extraire qu’une infime partie manuellement. Ceci s’avère insuffisant vu la quantité de connaissances que contient le Coran. D’où la nécessité de trouver une méthode pour extraire ces informations. Or il n’existe aucun outil efficace, sauf quelques lexiques imprimés et quelques outils de recherche simple et séquentielle par expressions régulières. En raison de cette limitation, le Coran nous oblige à trouver de nouvelles façons d’interagir. Le but recherché à travers ce travail est de proposer un système avancé de recherche dans l’ensemble des informations contenues dans le Coran, en prenant en considération la morphologie de la langue arabe et les propriétés du texte coranique. Il doit être fondé sur les méthodes modernes de recherche d’information pour obtenir une bonne stabilité et une grande vitesse de recherche. Il serait très utile pour les chercheurs et pourrait être généralisé pour couvrir l’ensemble du contenu en arabe.

Mots clés : Indexation/Recherche, Arabe, Coran, Recherche d’information, Moteurs de recherche.


‫ملخّص‬ ‫القرآن الكريم يختلف عن جميع الوثائق التي نعرفها فهو يحوي المعارف في جميع جوانب الحياة‪ .‬مع‬ ‫هذا الحجم من المعلومات‪ ،‬لا يستطيع المرء استخراج إلا النزر اليسير يدويا وهذا ليس كافيا بالنسبة‬ ‫لحجم المعارف الواردة في القرآن الكريم‪ .‬ومن هنا جاءت الحاجة إلى إيجاد طريقة لاستخراج هذه‬ ‫المعلومات‪ .‬لا توجد أي وسيلة حالية فعالة باستثناء بعض المعاجم المطبوعة وبعض الأدوات التي‬ ‫تعتمد البحث البسيط التسلسلي بالعبارات النمطية‪ .‬وبسبب هذا القيد‪ ،‬يتوجب علينا إيجاد طرق‬ ‫جديدة للتفاعل‪.‬‬ ‫الهدف من خلال هذا العمل هو اقتراح نظام متقدم للبحث في جميع المعلومات الواردة في‬ ‫القرآن الكريم‪ ،‬مع الأخذ بعين الاعتبار مورفولوجيا اللغة العربية وخصائص النص القرآني‪ .‬وينبغي‬ ‫الاستناد إلى الأساليب الحديثة في استرجاع المعلومات من أجل تحصيل استقرار جيد وسرعة عالية‬ ‫في البحث‪ .‬هذا العمل مفيد للباحثين ودارسي القرآن ويمكن أن يع َّمم ليشمل جميع المحتوى باللغة‬ ‫العربية‪.‬‬ ‫كلمات مفتاحية‪ :‬بحث‪/‬فهرسة‪ ،‬العربية‪ ،‬القرآن‪ ،‬استخراج المعلومات‪ ،‬محركات البحث‪.‬‬


Contents

Dedication
Acknowledgment
Table of Contents
List of Figures
List of Tables
List of Abbreviations
Glossary
General Introduction

I State of the Art

1 Search engines
  1.1 Introduction
  1.2 Definitions
    1.2.1 Keyword
    1.2.2 Descriptor
    1.2.3 Document
    1.2.4 Query
    1.2.5 Relevance
  1.3 Search engines
  1.4 Full-text search
  1.5 Crawling
    1.5.1 Crawler Features
      1.5.1.1 Features a crawler must provide
      1.5.1.2 Features a crawler should provide
  1.6 Indexing
    1.6.1 Definition
    1.6.2 Indexing modes
      1.6.2.1 Manual indexing
      1.6.2.2 Automatic indexing
      1.6.2.3 Semi-automatic indexing
    1.6.3 Index types
      1.6.3.1 Document Index
      1.6.3.2 Forward Index
      1.6.3.3 Inverted index
      1.6.3.4 N-gram index
    1.6.4 Index storage
    1.6.5 Index update
      1.6.5.1 Incremental update
      1.6.5.2 Global update
    1.6.6 Indexing phases
      1.6.6.1 Tokenization
      1.6.6.2 Normalization
      1.6.6.3 Elimination of stop-words
      1.6.6.4 Weighting
  1.7 Querying
    1.7.1 Relevance concept
    1.7.2 Similarity Function
    1.7.3 Search process
  1.8 Semantic Approach
  1.9 Conclusion

2 Arabic Language
  2.1 Introduction
  2.2 Orthography
  2.3 Lexicography
    2.3.1 Verbs
      2.3.1.1 Verbs with a simple root (الفعل المجرّد)
      2.3.1.2 Verbs with augmented root (الفعل المزيد)
    2.3.2 Nouns
      2.3.2.1 Primitive nouns (الأسماء الجامدة)
      2.3.2.2 Nouns derived from verbals (الأسماء المشتقّة)
      2.3.2.3 Numbers
      2.3.2.4 Demonstrative pronouns (أسماء الإشارة)
      2.3.2.5 Personal pronouns (الضمائر المنفصلة)
      2.3.2.6 Relative pronouns (أسماء موصولة)
    2.3.3 Function words
  2.4 Morphology
    2.4.1 Flexional Morphology
      2.4.1.1 Flexion of verbs
      2.4.1.2 Flexion of nouns
      2.4.1.3 Flexion of function words
    2.4.2 Derivational morphology
      2.4.2.1 Deverbal noun (المصدر)
      2.4.2.2 Active participle (اسم فاعل)
      2.4.2.3 Passive participle (اسم مفعول)
      2.4.2.4 Nouns of time and place (أسماء الزمان والمكان)
      2.4.2.5 Noun of instrument (اسم الآلة)
      2.4.2.6 The Nomen Vicis (اسم المرة)
      2.4.2.7 The Nomen Speciei (اسم الهيئة)
  2.5 Ambiguity issues
    2.5.1 The absence of vocalization
    2.5.2 Prefixes
    2.5.3 Suffixes
  2.6 The computerization of Arabic language
  2.7 Conclusion

3 The Qur’an
  3.1 Introduction
  3.2 Definitions
  3.3 Qur’an Structure
    3.3.1 Fragmentation into surahs
    3.3.2 Fragmentation into Hizbs
    3.3.3 Fragmentation into Stops (Waqfs)
  3.4 Qur’anic Sciences
    3.4.1 Knowledge of Ayahs revelation places
    3.4.2 Knowledge of Ayahs revelation causes
    3.4.3 Knowledge of Morphology
    3.4.4 Knowledge of Orthography (علم مرسوم الخطّ)
    3.4.5 Grammatical analysis of the Qur’an (إعراب ألفاظ القرآن)
    3.4.6 Science of allegorical ayahs (علم المتشابه)
    3.4.7 The beginnings of surahs (فواتح السّور)
    3.4.8 Knowledge of Qur’anic Parables (الأمثال القرآنية)
    3.4.9 Tafssīr (التفسير)
  3.5 Computerization of the Qur’an
    3.5.1 Advantages of Computerization
  3.6 Qur’an Indexes
    3.6.1 History
      3.6.1.1 Indexing words of the Qur’an
      3.6.1.2 Ma’ājim of Qur’anic words
      3.6.1.3 Specialized Ma’ājim of Qur’anic words
      3.6.1.4 Qur’anic Indexes and Computer
    3.6.2 Classification
      3.6.2.1 By unit
      3.6.2.2 By purpose
    3.6.3 Projects of building indexes
      3.6.3.1 Midād lbayān
      3.6.3.2 Indexes by Taha Zerrouki
      3.6.3.3 Qur’anic Arabic Corpus
      3.6.3.4 Tanzil Project
      3.6.3.5 Boundary-Annotated Qur’an Corpus
      3.6.3.6 Qurany Concepts Tool
  3.7 Qur’an Ontologies
    3.7.1 Qur’anic Concepts Ontology
    3.7.2 The Ontology made by Hadj Henni
  3.8 Qur’anic Search Tools
    3.8.1 Alawfa (الأوفى)
    3.8.2 Al-Monaqeb-Alqurany (المنقب القرآني)
    3.8.3 Quran complex search service
    3.8.4 Quranic Researcher (الباحث القرآني)
    3.8.5 Quranologie (علم القرآن)
    3.8.6 Quranic Corpus Word-by-Word Search
    3.8.7 Tanzil (تنزيل)
    3.8.8 Zekr (ذكر)
  3.9 Conclusion

II Analysis & Conception

4 Classification & Proposition of Qur’anic Search Features
  4.1 Introduction
  4.2 Difficulties of Search in Quran
  4.3 Classification
  4.4 Proposals
    4.4.1 Advanced Query
    4.4.2 Output Improvements
    4.4.3 Suggestion Systems
    4.4.4 Linguistic Aspects
    4.4.5 Qur’anic Options
    4.4.6 Semantic Queries
    4.4.7 Statistical System
  4.5 Discussion
    4.5.1 Survey
      4.5.1.1 Survey Participants Details
      4.5.1.2 Results of survey
  4.6 Conclusion

5 Conception
  5.1 Introduction
  5.2 Previous Work
    5.2.1 Text Processing
    5.2.2 Query Processing
    5.2.3 Suggestions
    5.2.4 Results Processing
    5.2.5 Indexes Importing
  5.3 Full vocalized search engine
  5.4 Othmani script and text processing
    5.4.1 Substitution
      5.4.1.1 Romanizations
      5.4.1.2 Numbers into words
    5.4.2 Tokenization
    5.4.3 Normalization
    5.4.4 Filtering stop-words
    5.4.5 Lemmatization
  5.5 Qur’anic Word Search
    5.5.1 Word properties search
    5.5.2 Semantically Related Words
    5.5.3 Multi-level Derivations
    5.5.4 Specific Derivations
    5.5.5 Fuzzy Search
  5.6 Conclusion

III Implementation

6 Implementation
  6.1 Introduction
  6.2 Why Open Source?
    6.2.1 License: AGPL
    6.2.2 Python
    6.2.3 Whoosh Search API
  6.3 Previous Code Base
  6.4 Our improvements
    6.4.1 A New Centralized JSON Output System
    6.4.2 Many new features
    6.4.3 Resource Importing Manager
    6.4.4 Automating the API building
    6.4.5 A new console interface
    6.4.6 Enhancing the web interface
    6.4.7 Packaging system
    6.4.8 Multiple search units
    6.4.9 Coding Standardization
    6.4.10 Documentation covering
    6.4.11 Open Issues
  6.5 Interfaces
    6.5.1 Application Programming Interface
      6.5.1.1 JSON web service
      6.5.1.2 Console interface
  6.6 Conclusion

General Conclusion

Bibliography

Appendices
  Annex A: Paper Abstracts

List of Figures

1.1 The various components of a web search engine
1.2 Indexing Benefits
1.3 Search process
1.4 Search results returned as a web page: one of the possible ways to expose results
2.1 Ligature of lâm and álif
3.1 Explanatory diagram of fragmentation into Surahs
3.2 Explanatory diagram of fragmentation into Hizbs
3.3 Index using the word as a unit [Arabic Quranic Corpus]
3.4 Index that takes the ayah as a unit [Arabeyes Quran Model]
3.5 Classification indexes by purpose
3.6 The various structures of Qur’an
3.7 Preview of Qurany
3.8 A closer look at Qur’an Concepts Ontology
3.9 Diagram of domain ontology of Qur’anic documents made by Hadj Henni [Hadjhenni2008]
3.10 Preview of Alawfa website: www.alawfa.com
3.11 Preview of Al Monaqeb Alqurany
3.12 Preview of Quran Complex Search page
3.13 Preview of Quranic Researcher (www.quranicresearcher.com)
3.14 Preview of Quranologie (quranologie.com)
3.15 Preview of Qur’anic Arabic Corpus Word By Word search
3.16 Preview of Tanzil
3.17 Preview of Zekr Application
4.1 Pages view in Google.com
4.2 Highlight the keyword وحيدا (alone) in Ayah 11 of al-modather – alfanous.org
4.3 Ayah in full diacritical marks - quran.com
4.4 Query spell correction in Google.com
4.5 Focusing on ”Yaqub” in Ontology of concepts of corpus.quran.com
4.6 Related searches suggestion in Google.com
4.7 Keyboard mapping Arabic to English in Google.com
4.8 Different romanizations for the word خَليفَة
4.9 Different transliterations used in ElixirFM Resolve Online
4.10 Syntactic Coloration of Basmalah – bayt-al-hikma.com
4.11 Google Voice Search on Android
4.12 Annotations shown by Quranic Arabic Corpus website
4.13 Divine names Highlight in Quran Reader iPhone application
4.14 Faceted Thematic Browsing - Qurany Project
4.15 Audience Background
4.16 Audience experience
4.17 Clarity, Usefulness, and Need percentage of each feature
5.1 Basic Prototype
5.2 The behavior of Searcher
5.3 Text processing
5.4 Results processing phases
5.5 Different possible declensions of the word الملك
5.6 Different types of the possible vocalizations of the word من
5.7 Different writing forms of the word بسطة using Othmani script
5.8 Merging Words in Uthmani Script
5.9 General Schema of Uthmani and Standard text processing
5.10 Example of Substitution
5.11 Tokenization of the word فَأَسْقَيْنَٰكُمُوهُ
5.12 Sub-tokens separation schema
5.13 Example of tokenization
5.14 Arabic case markers
5.15 Example of normalization
5.16 Example of stop-word filtering
5.17 Examples of lemmatization
5.18 Two-Steps search behavior
5.19 Semantically related words: Idols in Quran
5.20 Word properties search example: First person, Plural, Masculine
5.21 Searching through an ontology
5.22 Semantically Related Words, Hyponymy of the word نبي (prophet)
5.23 Multi-level Derivation Search example
5.24 Special derivations example, Imperative of قال (to say)
6.1 Screenshot of the Qt desktop interface
6.2 Json results of الكوثر
6.3 Fuzzy search example
6.4 Showing adjacent ayahs
6.5 Showing ayahs in different scripts
6.6 Suggestion example of Vocalizations, Derivations, and Synonyms of قيمة
6.7 Annotations of the keyword قول
6.8 Buckwalter translation example
6.9 Fields table
6.10 Translation-as-unit search, Query: seven
6.11 Word-as-unit json output, Query: قيمة
6.12 Interfaces dependency hierarchy
6.13 API usage sample code
6.14 Preview of the JSON web service
6.15 Preview of the Console interface

List of Tables

1.1 Document Index
1.2 Forward Index
1.3 Inverted Index
1.4 N-gram index
2.1 3 types of Arabic letters: 1 form, 2 forms or 4 forms
2.2 The change of meaning by changing the diacritical marks
2.3 The change of function by changing diacritical marks
2.4 Ambiguities of prefixes
2.5 Ambiguities due to suffixes
3.1 Waqfs types [Web-Islamweb]
3.2 Some numerical miracles of Qur’an [Nawfal1975]
3.3 Index based on word parts
3.4 Index based on sentences – surah: al-fātiha
3.5 Overview of main index – Midād lbayān
3.6 Overview of words index – M. Taha Zerrouki
3.7 Overview of the topics index – Taha Zerrouki
3.8 Overview of the index of synonyms – Taha Zerrouki
3.9 Overview of morphology index – Quranic Arabic corpus
3.10 Example of simple proper Quranic text – Tanzil.info
3.11 Example of Surah index – Tanzil.info
3.12 Sajdah index – Tanzil.info
3.13 Example of rub’ index – Tanzil.info
3.14 Sample of Boundary Annotated Qur’an Corpus
5.1 Partial vocalization
5.2 Some numbers as they mentioned in Quran
6.1 Search request flags
6.2 Pylint Analysis stats
6.3 Implementation State of search features

List of Abbreviations

AGPL: Affero General Public License.
API: Application Programming Interface.
GPL: GNU General Public License.
GUI: Graphical User Interface.
IDF: Inverse Document Frequency.
OWL: Web Ontology Language.
PC: Personal Computer.
POS: Part Of Speech.
RSV: Retrieval Status Value.
TF: Term Frequency.
TF*IDF: Term Frequency - Inverse Document Frequency.
UI: User Interface.

Glossary

Abrogated ayahs Abrogating ayahs Accusative Active Participle Active voice Allegorical ayah Assimilated verb Attaching pronoun Ayah Basmalah Book Science Broken plural Conjugation Declension Declinable Defect Demonstrative pronoun Deverbal noun Diacritical marks Diphthong Diptote Dual form Expansion

‫الآيات_المنسوخة‬ ‫الآيات_الناسخة‬ ‫حالة_النصب‬ ‫اسم_الفاعل‬ ‫صيغة_مبني_للمعلوم‬ ‫الآيات_المتشابهات‬ ِ ‫فعل_م َثال‬ ‫ضمير_م َّتصل‬ ‫آية‬ ‫بسملة‬ ‫علم_الكتاب‬ ‫جمع_تكسير‬ ‫تصريف_الأفعال‬ ‫الإ عراب‬ ‫معرب‬ ‫علّة‬ ‫اسم_إشارة‬ ‫مصدر‬ ‫علامات_التشكيل‬ ‫الإ دغام‬ ‫ممنوع_من_الصرف‬ ‫المثنّى‬ ‫ال ّزيادة‬ xxi

External feminine plural External masculine plural External plural Fiqh First ayahs of surah First person Fusional language Geminated verb General and Particular Genitive Hamzah Hamzated verb Healthy verb Hizb Hollow verb Imperative Imperfective Instrument noun Internal plural Inversion Jussive Juz’ Lam-ALef Last ayahs of surah Laws Lemma Lexicology Makkan Medinan

‫جمع_مؤنث_السالم‬ ‫جمع_مذكر_السالم‬ ‫جمع_سالم‬ ‫الفقه‬ ‫فواتح_السورة‬ ‫المتكلم‬ ‫لغة_إشتقاقية‬ ‫فعل_ ُم َضا َعف‬ ّ‫الخاص_والعام‬ ّ ‫حالة_الجر‬ ‫همزة‬ ‫فعل_ َم ْهموز‬ ‫فعل_صحيح‬ ‫ِح ْزب‬ ‫فعل_أَ ْج َوف‬ ‫الأمر‬ ‫المضارع‬ ‫اسم_الآلة‬ ‫جمع_تكسير‬ ‫القلب‬ ‫حالة_الجزم‬ ‫ُج ْزء‬ ‫لام_ألف‬ ‫خواتيم_السورة‬ ‫الأحكام‬ ‫جذع_الكلمة‬ ‫علم_المعاجم‬ ‫مكّ ّية‬ ‫مدن ّية‬ xxii

Migration of Prophet Morphology Mus’haf Narration of Hadiths Nisf Nomen Speciei Nomen Vicis Nominative Noun of place Noun of time Object Orthography Othmani script Passive Participle Passive voice People of the book Perfective Personal pronoun Plural form Primitive noun Prophet’s Sunnah Prosody Prostration of recitation Qiblah Qur’anic comma Qur’anic Parable Recitation Relative pronoun Revelation Science Rewayate Rewayate of Kaloun

‫الهجرة_النبوية‬ ‫الصرف‬ ‫ال ُمصحف‬ ‫رواية_الأحاديث‬ ‫نِ ْصف‬ ‫اسم_الهيئة‬ ‫اسم_الم َّرة‬ ‫حالة_الرفع‬ ‫اسم_مكان‬ ‫اسم_زمان‬ ‫مفعول‬ ‫علم_مرسوم_الخ ّط‬ ‫الخ ّط_العثماني‬ ‫اسم_المفعول‬ ‫صيغة_مبني_للمجهول‬ ‫أهل_الكتاب‬ ‫الماضي‬ ‫ضمير_منفصل‬ ‫الجمع‬ ‫اسم_جامد‬ ‫الس َّنة_النبوية‬ ُ ‫تقطيع_العروض‬ ‫سجود_التّلاوة‬ ‫ال ِقبلة‬ ‫فاصلة_قرآنية‬ ‫َم َثل_قرآني‬ ‫التلاوة‬ ‫اسم_موصول‬ ‫علم_التنزيل‬ ‫رواية‬ ‫رِواية_قالون‬ xxiii

Rhetoric Root Rubu’ Sajdah Second person Singular form Standard script Subject Superlative noun Surah Surah keys Tafssir The Five Nouns Third person Thumn Translation of Qur’an Triptote Verb with a simple root Verb with augmented root Virtues of surah Waqf Weakened verb

‫البلاغة‬ ‫جذر_الكلمة‬ ‫ُر ُبع‬ ‫سجدة‬ ‫المخاطب‬ ‫المفرد‬ ‫الخط_الإ ملائي‬ ‫فاعل‬ ‫اسم_تفضيل‬ ‫سورة‬ ‫مفاتيح_السورة‬ ّ ‫علم_تفسير_القرآن‬ ‫الأسماء_الخمسة‬ ‫الغائب‬ ‫ثُ ُمن‬ ‫ترجمة_معاني_القرآن‬ ‫منصرف‬ ‫فعل_مج ّرد‬ ‫فعل_مزيد‬ ‫فضائل_السورة‬ ‫َوقف‬ ‫فعل_ن َِاقص‬


General Introduction

Work Context

Qur’an, in Arabic, means the reading or the recitation. Muslim scholars define it as the words of Allah revealed to His Prophet Muhammad, written in the Mus’haf and transmitted by successive generations (التواتر) [Mahssin1973]. The Qur’an is also known by other names such as Al-Furkān, Al-kitāb, Al-dhikr, Al-wahy and Al-rōuh. It is the sacred book of all Muslims and the first reference of Islamic law. More than 14 centuries have passed since its revelation, and Muslims are still studying it, teaching it, writing books about it and, more recently, developing applications for it. The Qur’an is an important source that contains information about all aspects of life: scientific, social, historical, political, etc.

Problematic

Due to the large amount of information held in the Qur’an, it is extremely difficult for general-purpose search engines to extract key information from it. For example, when searching for a book on English grammar, you simply Google it, select a PDF file and download it. That’s all! Search engines (like Google) are generally used on Latin-script text and for searching general document information such as content, title, author, etc. Searching through Qur’anic text, however, is much more complicated: it requires a more in-depth solution, as there is a lot of information that needs to be extracted to fulfill the needs of Qur’an scholars.

Before the invention of the computer, Qur’an scholars used printed lexicons compiled manually. Printed lexicons cannot help much, since many searches waste the time and effort of the searcher, and each lexicon is written to answer one specific, generally simple, kind of query. Nowadays there are many applications dedicated to search needs; most of the applications developed for the Qur’an have a search feature, but only in a simple form: sequential search with regular expressions. Simple search with an exact query does not offer better options and remains insufficient for moving toward, for example, thematic search.

Full-text search is the approach that replaced sequential search and that is used in search engines. Unfortunately, this approach has not yet been applied to the Qur’an. The questions are: why do we need this approach? Why search engines? Do Qur’an search applications really need to be implemented as search engines?

Objectives

Our proposal is to design a retrieval system that fits Qur’an search needs. To achieve this objective, we must first list and classify all the search features that are possible and helpful. We then need to study how to implement each feature and what its requirements are.

Report organization

We organized the report as follows:

First Part: State of the Art

This part contains three chapters:

Chapter 1: Search Engines. To design a powerful search engine, it is essential to understand how search engines work. In this chapter we discuss the different parts of a search engine, namely crawling, indexing and querying, and we define the basic concepts of information retrieval systems. The chapter also contains an introduction to the semantic approach.

Chapter 2: Arabic Language. The objective of this chapter is to present the properties of the Arabic language, its spelling and its morphology, and to introduce some ambiguity issues that arise from the nature of Arabic.

Chapter 3: The Qur’an. This chapter presents an overview of the Qur’an and its sciences. It gives a historical background on the evolution of the Qur’an, the structure of the Mus’haf, and the main problems in computerizing the Qur’an, including the Uthmani script and the authentication of Qur’anic texts.


Second Part: Analysis & Conception

This part contains two chapters:

Chapter 4: Qur’anic search features. The objective of this chapter is to present the possible search features in the Qur’an. It is of great importance in our work, since it defines our objectives and the path of the work. We conducted a survey on the usefulness, need and clarity of each feature in order to validate our choice of features.

Chapter 5: Conception. In this chapter, we start with an overview of our previous work, then propose several improvements to carry out all the feasible search features mentioned in the previous chapter.

Third Part: Implementation

This part contains the different steps of the implementation of our retrieval system. It includes one chapter:

Chapter 6: Implementation. This chapter describes the choice of technologies and development tools, and presents the prototype with a description of its various features.

Finally, we close the report with a conclusion that summarizes our work. We include an appendix that describes the papers published about this work. Currently there are two papers:

• An Arabic paper in NITS 2011, KSA, entitled “An Application Programming Interface for indexing and search in Noble Quran”1 [Chelli2011].
• An English paper in a pre-conference workshop of LREC 2012, Turkey, “LRE-Rel: Language Resource and Evaluation for Religious Texts”. The paper was entitled “Advanced Search in Quran: Classification and Proposition of All Possible Features” [Chelli2012].

1 Arabic title: مكتبة برمجية للفهرسة والبحث في القرآن الكريم

Part I State of the Art


Chapter 1 Search engines

How could the world beat a path to your door when the path was uncharted, uncatalogued, and could be discovered only serendipitously? Paul Gilster, Digital Literacy

1.1 Introduction

Our work falls within the field of Information Retrieval, as it aims to design a search engine. In this chapter we discuss how search engines work by explaining their main components. Crawling is the part that feeds the search engine with the documents it collects. As the amount of information grows larger and larger, efficient search methods become necessary: only indexing can accelerate search in very large systems such as the Web, because it anticipates the search by extracting and organizing keywords in advance. For search results to be satisfactory, the relevance of results with respect to the query must be computed properly; this is done during querying. The query language must also be able to express simple questions as well as complex ones.

The quality of search is directly related to the quality of crawling, indexing and querying; these three operations can be considered the core of a search engine. The objective of this chapter is to define the main concepts of this area: we start by defining crawling, then study indexing, its methods and steps, and finally explain the search process and the notion of relevance.

1.2 Definitions

1.2.1 Keyword

A word or set of words chosen to represent the content of a document and to find it during document search. It can come from the document itself (title, text, abstract, ...) or from a controlled vocabulary [Hensens1998].

1.2.2 Descriptor

A keyword selected from a set of equivalent terms to represent a concept unambiguously. It is usually part of an organized, hierarchical vocabulary of the “thesaurus” type [Hensens1998].

1.2.3 Document

A document can be a text, a piece of text, a web page, an image, a video, etc. We call a document any unit that can be an answer to a user query. Textual documents come in several forms with respect to structure: a document can be a text without any structure (also called full-text), a text with a structured part (partially structured or semi-structured document), or a fully structured text [Amrouche2008].

1.2.4 Query

A query expresses a user’s need for information. Various types of query languages have been proposed to formulate a query. A query can be expressed:

• In natural (or nearly natural) language (e.g. “find all the manufacturing facilities of cars and their addresses”) [Salton1971]
• In a structured format, also called Boolean query language (e.g. “cars and factories and brand”) [Bourne1979]
• As a graphical language, from a GUI [Lelu1992]

1.2.5 Relevance

Relevance is a word that simply means returning the information considered the most useful at the top of a result list. While the definition is simple, getting a program to compute relevance is not a trivial task, mainly because the notion of usefulness is hard for a machine to understand [Bernard2009].

1.3 Search engines

A search engine is software that retrieves resources (web pages, images, videos, files, information, etc.) related to given words. Some websites offer a search engine as their main feature; the website itself is then called a search engine (Google, Yahoo, Bing, ... are search engines) [Nejjari2007].

A search engine is also a web crawling tool made up of “robots” that explore websites periodically and automatically (without human intervention, which is what distinguishes search engines from directories). They follow the links (pages that link to each other) encountered on each page reached. Each identified page is indexed in a database and then becomes accessible to Internet users through keywords [Sanan2008].

Search engines do not apply only to the Web: some engines are software installed on personal computers. These are known as desktop search engines; they search the files stored on the PC. Examples include Exalead Desktop, Google Desktop and Copernic Desktop Search.

In December 2004, Tim Berners-Lee (the inventor of the World Wide Web) talked about a new project, the “Semantic Web”, which is based on processing web information automatically according to its meaning. The reason was that 80% of the Web consists of text intended to be read and understood by humans, while computer programs, web browsers and search engines are unable to understand this content, and so are unable to speed up the search. In less than two years after Berners-Lee’s article, the first foundations of the Semantic Web were formed. They seemed to lead the world toward a new revolution in the Internet and in search engines [Abulhajjaj2009].

At first glance, nothing distinguishes a classic search engine from a semantic one: the same sparse interface, with a text box in the center of the page where users enter their search query. In fact, the difference lies in the search mode. A classic search engine, such as Google, works as follows: its robots browse the pages and index the words, then store these words in a gigantic database. Users search by sending their queries, and a search algorithm retrieves the results and sorts them in a certain order based on their relevance [Mentre2008].

1.4 Full-text search

Full-text search is a technology focused on finding documents matching a set of words. While sounding like a mouthful, full-text search is more common than you might think. You have probably used full-text search today: most web search engines, such as Google and Yahoo!, use full-text search engines at the heart of their service. The differences between them are recipe secrets (and sometimes not so secret), such as the Google PageRank™ algorithm. PageRank™ modifies the importance of a given web page (result) depending on how many web pages point to it and how important each of those pages is. Be careful, though; these so-called web search engines are much more than the core of full-text search: they have a web UI, they crawl the web to find new pages or update existing ones, and so on. They provide business-specific wrapping around the core of a full-text search engine.

Given a set of words (the query), the main goal of full-text search is to provide access to all the documents matching those words. Because sequentially scanning all the documents to find the matching words is very inefficient, a full-text search engine (its core) is split into two main operations: indexing the information into an efficient format, and searching the relevant information from this precomputed index. From the definition, you can clearly see that the notion of word is at the heart of full-text search; this is the atomic piece of information that the engine will manipulate [Bernard2009].

1.5 Crawling

Crawling is the process by which we gather pages from the Web in order to index them and support a search engine. The objective of crawling is to quickly and efficiently gather as many useful web pages as possible, together with the link structure that interconnects them. The web crawler is sometimes referred to as a spider [Manning2009]. This process precedes the indexing phase, as shown in Figure 1.1:


Figure 1.1: The various components of a web search engine

1.5.1 Crawler Features

We list the desiderata for web crawlers in two categories: features that web crawlers must provide, followed by features they should provide [Manning2009].

1.5.1.1 Features a crawler must provide

Robustness: The Web contains servers that create spider traps, which are generators of web pages that mislead crawlers into getting stuck fetching an infinite number of pages in a particular domain. Crawlers must be designed to be resilient to such traps. Not all such traps are malicious; some are the inadvertent side-effect of faulty website development.

Politeness: Web servers have both implicit and explicit policies regulating the rate at which a crawler can visit them. These politeness policies must be respected.

1.5.1.2 Features a crawler should provide

Distributed: The crawler should have the ability to execute in a distributed fashion across multiple machines.

Scalable: The crawler architecture should permit scaling up the crawl rate by adding extra machines and bandwidth.

Performance and efficiency: The crawl system should make efficient use of various system resources including processor, storage and network bandwidth.

Quality: Given that a significant fraction of all web pages are of poor utility for serving user query needs, the crawler should be biased towards fetching “useful” pages first.

Freshness: In many applications, the crawler should operate in continuous mode: it should obtain fresh copies of previously fetched pages. A search engine crawler, for instance, can thus ensure that the search engine’s index contains a fairly current representation of each indexed web page. For such continuous crawling, a crawler should be able to crawl a page with a frequency that approximates the rate of change of that page.

Extensible: Crawlers should be designed to be extensible in many ways, to cope with new data formats, new fetch protocols, and so on. This demands that the crawler architecture be modular.
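To make the two mandatory features more concrete, the following is a minimal sketch of a breadth-first crawler, not a description of any real crawler. It assumes a hypothetical in-memory "web" (the `FAKE_WEB` dictionary and the `fetch_links` function stand in for real HTTP fetches), a per-host page budget as one simple defence against spider traps, and a fixed delay between requests as a crude form of politeness. All names and parameters are illustrative assumptions.

```python
import time
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "web": URL -> outgoing links (stands in for HTTP fetches).
FAKE_WEB = {
    "http://a.example/": ["http://a.example/p1", "http://b.example/"],
    "http://a.example/p1": ["http://a.example/"],
    "http://b.example/": ["http://b.example/trap/1"],
}

def fetch_links(url):
    """Simulated fetch: return the outgoing links of a page."""
    if url.startswith("http://b.example/trap/"):
        # A spider trap: every page links to a "next" page, without end.
        n = int(url.rsplit("/", 1)[1])
        return [f"http://b.example/trap/{n + 1}"]
    return FAKE_WEB.get(url, [])

def crawl(seed, max_pages_per_host=3, delay_seconds=0.0):
    """Breadth-first crawl with a per-host page budget and a fixed delay."""
    queue = deque([seed])
    seen = {seed}
    pages_per_host = {}
    link_graph = {}
    while queue:
        url = queue.popleft()
        host = urlparse(url).netloc
        # Robustness: stop fetching from a host once its budget is spent.
        if pages_per_host.get(host, 0) >= max_pages_per_host:
            continue
        pages_per_host[host] = pages_per_host.get(host, 0) + 1
        time.sleep(delay_seconds)  # Politeness: wait between requests.
        links = fetch_links(url)
        link_graph[url] = links  # Keep the link structure with the pages.
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return link_graph

graph = crawl("http://a.example/")
print(sorted(graph))
```

With the budget of three pages per host, the crawl stops inside the trap after `http://b.example/trap/2` instead of running forever. A production crawler would apply the delay per host and follow each server's stated policies; this sketch only shows where such checks sit in the loop.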

1.6 Indexing

To keep the cost of search acceptable, the document collection must first go through an essential phase, which consists in analyzing each document in the collection to create a set of keywords: this is the indexing phase. These keywords are then used more easily by the system during the subsequent search process. Indexing creates a representation of the documents in the system. Its objective is to find the most important concepts of the document (or query), which form the descriptor of the document [Sauvagnat2005].

1.6.1 Definition

Indexing is the act of describing or classifying a document by index terms or other symbols in order to indicate what the document is about, to summarize its content or to increase its findability. In other words, it is about identifying and describing the subject of documents. Indexes are constructed, separately, on three distinct levels: terms in a document such as a book; objects in a collection such as a library; and documents (such as books and articles) within a field of knowledge.

The process of indexing begins with an analysis of the subject of the document. The indexer must then identify terms which appropriately identify the subject, either by extracting words directly from the document or by assigning words from a controlled vocabulary. The terms in the index are then presented in a systematic order. Indexers must decide how many terms to include and how specific the terms should be. Together this gives the depth of indexing [Lancaster2003].

Indexing is most often used for information retrieval, but it can also be used in other areas such as automatic classification of documents, keyword suggestion, computation of co-occurring terms, automatic summarization, etc. [Abar2009]

Figure 1.2: Indexing Benefits

1.6.2 Indexing modes

1.6.2.1 Manual indexing

Manual indexing is performed by a human expert (a librarian or a specialist in the field) who analyzes the content of the text to identify the terms representing the document. Manual indexing ensures greater relevance in the answers, because it identifies more specific keywords to describe a document. However, it has several drawbacks. One is the problem of the vocabulary used and the dependence on the indexer’s knowledge of the topic: the same document can be indexed in several ways (according to the viewpoint of the person doing the indexing), and the same indexer may, at two different times, choose two distinct terms to represent the same concept. The major drawback of this method is its cost in time; it is therefore not appropriate when the number of documents to be indexed is substantial [Sauvagnat2005, Abar2009, Amrouche2008].

Manual indexing is based on four key points [Chartron1989]:

• reading the entire document for preparation;
• consideration of descriptors, objectives (applications) and user needs;
• permanent complementarity between the terms of manual indexing and the abstract;
• in the absence of an appropriate descriptor, and when the emergence of a new concept is not explicit enough to propose a candidate descriptor, the ability to use a close or generic descriptor.

The idea of using the computer therefore came quickly [Mustafa].

1.6.2.2 Automatic indexing

Automatic indexing is a set of automated processing phases applied to documents. We distinguish: tokenization (automatic extraction of words), elimination of stop words, stemming (lemmatization or reduction to the root), scoring of words and, finally, creation of the index [Sauvagnat2005].

The first approach to automatic indexing, KWIC (Key Word In Context), was introduced by Luhn (1957) [Luhn1957]. The discussion then turned to how to weight index terms. In the early days of information retrieval, statistical methods were based on the frequency of words in the document. Later, this measure was extended to take into account the specificity of a term for the document. To this end, other methods have been exploited, such as 2-Poisson (Nie, 2003) [Gaussier2003][Mustafa]. Automatic indexing systems use several methods of analysis:

1.6.2.2.1 Linguistic analysis

A technology derived from “text mining”, which consists in implementing a simplified model of linguistic theories in computer learning systems. It is part of the artificial intelligence field [Allab2008]. The linguistic method consists of several modules of linguistic analysis: morphological, lexical, syntactic and pragmatic. The fact that some indexing systems use natural language processing techniques demonstrates the relevance of a linguistic approach [Elhachani1997].

1.6.2.2.2 Statistical analysis

The initiator of automatic indexing methods is H. P. Luhn, with his influential article “The automatic creation of literature abstracts”, published in 1958 in the IBM “Journal of Research and Development”. He states: « (...) instead of sampling at random, as a reader normally does when scanning, the new mechanical method selects those among all the sentences of an article that are the most representative of pertinent information ». H. P. Luhn thus opened the door to work on automatic indexing by proximity, also called the statistical method [Luhn1958].

Automatic indexing involves the following steps:

• Extracting words (tokenization): the extraction rules are language-dependent.
• Eliminating stop words: these are words that are very frequent but carry little information. Example: the, a, of, or, etc.
• Stemming: for example, the stem of the word “stemmers” is “stem”.
• Transformation rules: removal of plural endings.
• Truncation: choose an optimal truncation length for words. It is better to truncate suffixes. There is no absolute rule for this.

1.6.2.3 Semi-automatic indexing

The two previous techniques can be combined: a first automatic process extracts the terms of the document, but the final choice remains with the domain expert or librarian, who establishes semantic relations between keywords and chooses the significant terms using a thesaurus or a terminology database, that is, an organized list of descriptors (keywords) obeying specific terminology rules [Abar2009, Sauvagnat2005, Hadjhenni2008].
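The automatic first pass described above (tokenization, stop-word elimination, then suffix truncation) can be sketched as follows. This is only an illustration for English text: the stop-word list, the suffix list and the function names are assumptions made for the example, and the truncation rule is deliberately naive compared with a real stemmer.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "of", "or", "and", "with"}

# Naive suffix list used for truncation (an assumption, not a real stemmer).
SUFFIXES = ("ing", "es", "s")

def tokenize(text):
    """Extract lowercase alphabetic words; real rules are language-dependent."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop very frequent words that carry little information."""
    return [t for t in tokens if t not in STOP_WORDS]

def truncate(token, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            return token[: -len(suffix)]
    return token

def index_terms(text):
    """Chain the three steps to obtain candidate index terms."""
    return [truncate(t) for t in remove_stop_words(tokenize(text))]

print(index_terms("The cats and the dishes"))
```

In a semi-automatic setting, the list returned by `index_terms` would only be a set of candidates, from which the expert keeps the significant terms.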

1.6.3 Index types

The index is the output of the indexing process. There are several types of index, depending on the technique used and the desired function:

1.6.3.1 Document Index

The document index keeps information about each document. It is a fixed-width ISAM (Indexed Sequential Access Method) index, ordered by document ID. The information stored in each entry includes data, a checksum of the document and various statistics. If the document was crawled, the entry also contains a pointer to a variable-width file, called the document information, that contains the URL and title. This design decision was driven by the desire to have a relatively compact data structure and the ability to find a record in a single disk access when queried [Brin1998]. The following table is a simplified illustration of a document index:

Document ID    Text                                Link
Document 1     The cow says moo                    /ex/doc1.txt
Document 2     The cat and the hat                 /ex/doc2.txt
Document 3     The dish ran away with the spoon    /ex/doc3.txt

Table 1.1: Document Index

1.6.3.2 Forward Index

The forward index stores a list of words for each document. The following is a simplified form of a forward index:

Document ID    Words
Document 1     the, cow, says, moo
Document 2     the, cat, and, the, hat
Document 3     the, dish, ran, away, with, the, spoon

Table 1.2: Forward Index

The rationale behind developing a forward index is that, as documents are parsed, it is better to immediately store the words per document. This separation enables asynchronous system processing, which partially circumvents the inverted index update bottleneck. The forward index is sorted to transform it into an inverted index. The forward index is essentially a list of pairs consisting of a document and a word, collated by document. Converting the forward index to an inverted index is only a matter of sorting the pairs by word. In this regard, the inverted index is a word-sorted forward index [Brin1998].

1.6.3.3 Inverted index

Many search engines use an inverted index when evaluating a search query, to quickly retrieve the documents that contain the words of the query and then sort them by relevance. Since the inverted index stores the list of documents containing each word, the search engine can use direct access to find the documents associated with each word of the query and respond quickly. The following table is a simplified illustration of an inverted index:

Word    Documents
the     Document 1, Document 3, Document 4, Document 5
cow     Document 2, Document 3, Document 4
says    Document 5
moo     Document 7

Table 1.3: Inverted Index
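The conversion from a forward index to an inverted index by sorting (word, document) pairs can be sketched as follows. The example reuses the three documents of Table 1.2, so its output differs from Table 1.3, which illustrates another collection. The function names `invert` and `boolean_and` are illustrative assumptions, and the resulting index is a plain presence index without frequencies or positions.

```python
from itertools import groupby

# Forward index from Table 1.2: document -> list of words.
forward_index = {
    "Document 1": ["the", "cow", "says", "moo"],
    "Document 2": ["the", "cat", "and", "the", "hat"],
    "Document 3": ["the", "dish", "ran", "away", "with", "the", "spoon"],
}

def invert(forward):
    """Sort (word, document) pairs by word, then group them per word."""
    pairs = sorted({(word, doc) for doc, words in forward.items() for word in words})
    return {
        word: [doc for _, doc in group]
        for word, group in groupby(pairs, key=lambda pair: pair[0])
    }

def boolean_and(inverted, query_words):
    """Documents containing every query word (presence only, no ranking)."""
    postings = [set(inverted.get(word, [])) for word in query_words]
    return sorted(set.intersection(*postings)) if postings else []

inverted_index = invert(forward_index)
print(inverted_index["the"])
print(boolean_and(inverted_index, ["the", "cat"]))
```

Looking up a word is a single dictionary access, which is the direct access that a sequential scan of the documents cannot offer.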

This index can only tell whether a word exists in a particular document, because it does not store any information about the frequency or the position of the word. It is therefore considered a Boolean index: it determines which documents match a query but does not rank them. In some models, the index includes additional information such as the frequency of each word in each document or the positions of a word in each document. Position information allows the search algorithm to identify adjacent words and thus support phrase search. Frequency can be used to help compute the relevance of documents to the query [Grossman2002, Tang2004].

1.6.3.4 N-gram index

An n-gram is a sequence of n consecutive characters. For any document, the set of all n-grams (n usually takes the value 2 or 3) is obtained by sliding a window of n characters over the body text. The window moves in steps of one character. We then compute the frequencies of the n-grams found. For example [Jalam2002], the French sentence “La nourrice nourrit le nourrisson”1 is represented by:

       n-gram    Frequency
 1     la_       1
 2     a_n       1
 3     _no       3
 4     nou       3
 5     our       3
 6     urr       3
 7     rri       3
 8     ric       1
 9     ice       1
10     ce_       1
11     e_n       2
12     rit       1
…      …         …

Table 1.4: N-gram index
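The sliding-window construction can be reproduced with a short sketch. The function name `char_ngrams` and the choice to lowercase the text are assumptions made for the example; spaces are replaced by "_" as in the table.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Slide a window of n characters over the text, one character at a time."""
    normalized = text.lower().replace(" ", "_")
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]

frequencies = Counter(char_ngrams("La nourrice nourrit le nourrisson"))

# Show the most frequent trigrams with their counts.
for gram, count in frequencies.most_common(5):
    print(gram, count)
```

The five trigrams with frequency 3 (_no, nou, our, urr, rri) all come from the shared beginning of "nourrice", "nourrit" and "nourrisson", which is what makes n-grams useful for spotting common stems.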

One benefit of n-grams is the automatic capture of the most common stems [Grefenstette1995]: in the previous example, techniques based on n-grams find the common root of nourrir, nourri, nourrit, nourrissez, nourriture, etc. Tolerance to spelling mistakes and distortions is also an important property [Sanan2008].

1.6.4 Index storage

The storage of index structures is mainly characterized by the index size and the organization of its elements. Index structures vary widely in the space they use, which is closely related to the organization of data in the index. This organization has a significant impact on search latency: the closer related items are to each other in storage, the lower the search latency; this is called the concept of locality. It is also very important that the index can fit in main memory, which avoids disk accesses and reduces search latency. The ideal index is one that occupies the least space and minimizes search latency [Dahak2006].

1 The character “_” is used instead of spaces, to make reading easier.

1.6.5 Index update

Updating the index means applying changes to it. Changes can be insertions, modifications or deletions. An index can be more or less able to adapt to these changes. This adaptation can take two forms:

1.6.5.1 Incremental update

In an incremental update, the structure of the index is updated by adding the index entries of new documents without modifying existing ones. The number of changes in this case is, however, often limited [Dahak2006].

1.6.5.2 Global update

The other case, and the worst, is when the structure of the entire index must be rebuilt from scratch [Dahak2006].
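The difference between the two forms can be illustrated on a small in-memory inverted index. This is a sketch under simplifying assumptions: documents are identified by integers, the index maps each word to a list of document IDs, and the function names `add_document` and `rebuild` are hypothetical.

```python
def add_document(inverted, doc_id, words):
    """Incremental update: append postings for one new document only."""
    for word in set(words):
        inverted.setdefault(word, []).append(doc_id)

def rebuild(documents):
    """Global update: rebuild the whole index from all documents."""
    inverted = {}
    for doc_id in sorted(documents):
        add_document(inverted, doc_id, documents[doc_id])
    return inverted

documents = {
    1: ["the", "cow", "says", "moo"],
    2: ["the", "cat", "and", "the", "hat"],
}
index = rebuild(documents)

# A new document arrives: only the postings of its own words are touched.
documents[3] = ["the", "dish", "ran", "away", "with", "the", "spoon"]
add_document(index, 3, documents[3])

print(index["the"])
print(index == rebuild(documents))
```

Here the incremental path gives the same index as a full rebuild because the change is a pure insertion. Modifying or deleting a document would require touching existing postings, which in this simple structure means falling back to `rebuild`.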

1.6.6 Indexing phases

The indexing process consists of the following phases:

1.6.6.1 Tokenization

Tokenization is a phase that may seem trivial at first, and yet it provides the basis for the rest of the indexing phases. This phase must therefore be done with the highest quality [Meylan2001]. Some retrieval systems use a list of predefined keywords. This list is designed manually and, in most cases, built for a specific topic. This method makes it possible to control the index size. Whether keywords are extracted automatically or taken from a predefined list determines the type of indexing: document-oriented in the first case and query-oriented in the second [Berrut1997, Dahak2006].


1.6.6.2 Normalization

This processing consists in finding the normalized form of a word (usually the masculine for nouns, the infinitive for verbs, the masculine singular for adjectives, etc.). Thus only normalized forms are stored in the index, which offers a significant saving in size. More importantly, when the same processing is applied to the query, search becomes quicker and more flexible: for example, if a user searches with a verb, documents that contain this verb in any of its conjugated forms will be considered, not just documents containing the word in the form provided by the user. This step is also called “morphological processing of keywords” [Denoyer2004].

This phase can also be enriched with syntactic and semantic processing of keywords. The first identifies and groups sets of words whose meaning depends on their union. For example, the words “White House” do not usually refer to a house that is white, but to the seat of the presidency of the United States. It also removes ambiguities such as homography problems. Semantic processing is intended to distinguish between the different possible meanings of a word (polysemy). For example, it helps differentiate the senses of the French word “pièce”, which can denote a coin or a room in a house. This is an arduous task that is not currently well mastered, and its effect on system performance is not always proven [Dahak2006].

1.6.6.3 Elimination of stop-words

This phase matters because it strongly influences the accuracy of the search. Failing to remove stop words inevitably causes noise. Stop words, which are words of everyday language that carry little semantic information, must be eliminated both at indexing and at querying time (by removing stop words from the query) [Dahak2006].

1.6.6.4 Weighting

This step depends entirely on the information retrieval model used. It defines how important a term is in a given document [Dahak2006]. In general, most term weighting formulas combine two factors: a local weighting factor measuring the local representativity of a term in the document, and a global weighting factor measuring the global representativity of a term with respect to the collection of documents [Amrouche2008]. This leads to two types:


1.6.6.4.1 Local weighting

Local weighting takes into account the local information of the term, which depends only on the document. It is typically a function of the frequency of occurrence of the word in the document, denoted tf (Term Frequency). A term that appears frequently in a document is considered relevant to describe its content [Dahak2006].

1.6.6.4.2 Overall weighting

The overall weight measures the importance of a term within the whole collection. It aims to represent its discriminatory power, in other words its ability to distinguish between documents. A term appearing in few documents is considered more discriminatory and should be favored over a term found in many documents. The calculation of the overall weight is based on the number of documents in which a term appears. One of the most used measures is idf (Inverse Document Frequency), given by the following formula:

$$ \mathrm{idf}_i = \log\left(\frac{N}{n_i}\right) $$

where n_i is the number of documents containing the word i and N is the total number of documents. The value tf × idf gives a good approximation of the importance of a term in the document, particularly in a corpus of documents of similar size [Dahak2006].
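As a worked illustration of the two factors, the sketch below computes tf × idf on the three small documents used earlier in this chapter. It assumes raw counts for tf and the natural logarithm for idf; returning 0.0 for a term that appears in no document is a convention chosen for the example, and the function names are hypothetical.

```python
import math
from collections import Counter

documents = {
    "Document 1": ["the", "cow", "says", "moo"],
    "Document 2": ["the", "cat", "and", "the", "hat"],
    "Document 3": ["the", "dish", "ran", "away", "with", "the", "spoon"],
}

def idf(term, docs):
    """idf = log(N / n_i); 0.0 by convention if the term occurs nowhere."""
    n_i = sum(1 for words in docs.values() if term in words)
    return math.log(len(docs) / n_i) if n_i else 0.0

def tf_idf(term, doc_id, docs):
    """Local weight (raw term frequency) times global weight (idf)."""
    tf = Counter(docs[doc_id])[term]
    return tf * idf(term, docs)

print(tf_idf("the", "Document 2", documents))
print(round(tf_idf("cat", "Document 2", documents), 3))
```

The word "the" appears in all three documents, so its idf is log(3/3) = 0 and its weight vanishes even though it occurs twice in Document 2. The word "cat" appears in a single document, so it receives the weight 1 × log(3), about 1.099.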

1.7 Querying

Querying is the phase of interaction between the system and the user. The user expresses an information need through a query language that the system is responsible for interpreting. This interpretation is done according to the query model and is designed to understand the user’s needs and express them in a formalism similar to the one used when indexing documents. This process yields an internal query. Following this query interpretation phase, a matching model computes the correspondence between the internal query and each document in the index. This computation, carried out by the matching function, traditionally results in an ordered list of documents. At this level, the comparison between the concepts of the document and those of the query should be semantic rather than a strict equality test. The comparison between query and document rarely leads to strict equivalences, but rather to partial ones: the document answers only part of the query.

The first document in the list returned by the system is the one the system considers most relevant, that is, the one that best suits the query, again according to the system. The last document is the one the system considers least relevant. This notion of relevance is based on the proximity between the needs expressed by the user and the results provided by the system [Dahak2006].

1.7.1

Relevance concept

Relevance is a central concept of querying because all evaluations are built around it. It is also the most poorly understood concept, despite numerous studies such as the one in [Denos1997]. Let us see some definitions of relevance. Relevance is:
• The correspondence between a document and a query, a measure of the informativeness of the document with respect to the query;
• A degree of relationship (overlap, relativity, ...) between the document and the query;
• A degree of surprise that comes with a document that is relevant to the needs of the user;
• A measure of the usefulness of the document to the user.

Even in these definitions, the concepts used (informativeness, relativity, surprise, ...) remain very vague because users have very different needs and very different criteria for judging whether a document is relevant. The notion of relevance is therefore used to cover a very wide range of criteria and relations [Dahak2006].

1.7.2

Similarity Function

Comparing a document with a query is equivalent to calculating a score that is assumed to represent the relevance of the document with respect to the query. This value is calculated from a similarity function or probability denoted rsv(q,d) (retrieval status value), where q is a query and d is a document, and whose formula depends entirely on the information retrieval model used. This measure takes into account the weights of terms in documents, determined by statistical analysis and probability. The matching function is very closely related to the operations of indexing and weighting of query terms and documents in the corpus. In general, the document-query matching and the indexing model are used to characterize and identify an information retrieval model. The similarity function is then used to order the documents returned to the user. The quality of this ordering is paramount. In fact, users are generally satisfied with examining the first documents (the top 10 or 20). If the documents sought are not present in this slice, the user will consider the ordering bad with respect to his query [Sauvagnat2005, Dahak2006].
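Since the formula of rsv(q,d) depends on the retrieval model, the sketch below picks one common choice for illustration: the cosine similarity of the vector space model over term weights. The function names, the dictionary representation of weights and the toy corpus are assumptions of this example, not part of the system described in this thesis.

```python
import math

def cosine_rsv(query_weights, doc_weights):
    """rsv(q, d) as the cosine of the angle between two weighted term vectors."""
    dot = sum(w * doc_weights.get(term, 0.0) for term, w in query_weights.items())
    norm_q = math.sqrt(sum(w * w for w in query_weights.values()))
    norm_d = math.sqrt(sum(w * w for w in doc_weights.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_q * norm_d)

def rank(query_weights, corpus):
    """Return (doc_id, score) pairs ordered from most to least relevant."""
    scored = [(doc_id, cosine_rsv(query_weights, weights))
              for doc_id, weights in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical term weights (for instance tf * idf values) per document.
corpus = {
    "d1": {"mercy": 1.0, "light": 1.0},
    "d2": {"mercy": 2.0},
    "d3": {"guidance": 1.0},
}
query = {"mercy": 1.0}

for doc_id, score in rank(query, corpus):
    print(doc_id, round(score, 3))
```

The ordered list produced by `rank` corresponds to the result list discussed above: the first entry is the document the system considers most relevant, the last one the least relevant.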


1.7.3

Search process

Search takes a user query and returns the list of matching results sorted by relevance. Like indexing, searching is a multiphase process, as shown in Figure 1.3.

Figure 1.3: Search process

The first operation is about building the query. Depending on the full-text search tool, the way to express a query is either:
1. String based: a text-based query language. Depending on the focus, such a language can be as simple as handling words and as complex as having Boolean operators, approximation operators, field restriction, and much more.
2. Programmatic API based: for advanced and tightly controlled queries, a programmatic API is very convenient. It gives the developer a flexible way to express complex queries and to decide how to expose the query flexibility to users.

Some tools focus on the string-based query, some on the programmatic API, and some on both.

The second operation, let us call it analyzing, is responsible for taking sentences or lists of words and applying the same operations that were performed at indexing time (chunking into words, stems, or phonetic descriptions). This is critical because the result of this operation is the common language that indexing and searching use to talk to each other, and it is the one stored in the index. If the same set of operations is not applied, the search will not find the indexed words.

Based on the common language between indexing and searching, the third operation (finding documents) reads the index and retrieves the index information associated with each matching word. For each word, the index can store the list of matching documents, the frequency, the word positions in a document, and so on. The implicit deal here is that the document itself is not loaded, and that is one of the reasons why full-text search is efficient: the document does not have to be loaded to know whether it matches or not.

The next operation (filtering and ordering) processes the information retrieved from the index and builds the list of documents (or, more precisely, handlers to documents). From the information available (matching documents per word, word frequency, and word position), the search engine is able to exclude documents from the matching list. More importantly, it is able to compute a score for each document. The higher its score, the higher a document will be in the result list. Let us have a look at some factors influencing its value:
• In a query involving multiple words, the closer they are in a document, the higher the rank.
• In a query involving multiple words, the more of them are found in a single document, the higher the rank.
• The higher the frequency of a matching word in a document, the higher the rank.

• The less approximate a word, the higher the rank.

Depending on how the query is expressed and how the product computes the score, these rules may or may not apply. This list is here to give a feeling of what may affect the score, and therefore the relevance of a document. Once the ordered list of documents is ready, the full-text search engine exposes the results to the user, either through a programmatic API or through a web page. Figure 1.4 shows a result page from the Google search engine [Bernard2009].
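The operations above (analyzing, finding documents in the index, filtering and ordering) can be sketched with a tiny in-memory inverted index. This is only an illustrative sketch: the analysis step (lowercasing and splitting on whitespace), the scoring weights and all names are assumptions of the example and do not reproduce the scoring of any particular search engine.

```python
from collections import defaultdict

def analyze(text):
    """Analysis shared by indexing and searching: lowercase, split on whitespace."""
    return text.lower().split()

def build_index(documents):
    """Inverted index: term -> {doc_id: [positions of the term in the document]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(analyze(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index

def search(query, index):
    """Return matching doc ids, best score first, without loading any document."""
    scores = defaultdict(float)
    for term in dict.fromkeys(analyze(query)):   # distinct query words, in order
        for doc_id, positions in index.get(term, {}).items():
            scores[doc_id] += 1.0                   # one more query word found
            scores[doc_id] += 0.1 * len(positions)  # small bonus for frequency
    # Documents with no matching word never enter `scores`: they are filtered out.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

documents = {
    1: "The book of light",
    2: "Light upon light",
    3: "A book of mercy and light",
}
index = build_index(documents)
print(search("light mercy", index))
```

Because `analyze` is applied both when indexing and when searching, the query word "Mercy" would still find the indexed term "mercy"; skipping the analysis on one side is exactly the failure described above.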


Figure 1.4: Search results returned as a web page: one of the possible ways to expose results

1.8

Semantic Approach

Semantic search seeks to improve search accuracy by understanding searcher intent and the contextual meaning of terms as they appear in the searchable dataspace, whether on the Web or within a closed system, to generate more relevant results. Semantic search systems consider various points, including context of search, location, intent, variation of words, synonyms, generalized and specialized queries, concept matching and natural language queries, to provide relevant search results [Web-Techulator]. Major web search engines like Google and Bing incorporate some elements of semantic search.

Rather than using ranking algorithms such as Google's PageRank to predict relevancy, semantic search uses semantics, or the science of meaning in language, to produce highly relevant search results. In most cases, the goal is to deliver the information queried by a user rather than have the user sort through a list of loosely related keyword results. However, Google itself has subsequently also announced its own Semantic Search project [Web-WSJ].

Other authors primarily regard semantic search as a set of techniques for retrieving knowledge from richly structured data sources like ontologies. Such technologies enable the formal articulation of domain knowledge at a high level of expressiveness and could enable the user to specify his intent in more detail at query time [Web-ESWC2012].

Semantic search does not just mean contextual search or search based on the intent of the question. It includes several other factors as well. A smart search engine would consider several factors to provide the most relevant and useful search results,


including [Web-Techulator]:
• Current trend: If the presidential election has just finished in the country and someone is searching for 'Who is the new president', the semantic search system should be able to understand the query and give relevant results based on the current trend and news.
• Location of search: If a person is searching for 'what is the temperature', the semantic search engine should be able to provide results based on the current location of the search. If the person is searching from California, search results should include the current temperature in California.
• Intent of the search: Semantic search engines should be able to give appropriate search results based on the intent of the search and not based on the specific words used in the search query.
• Variations of words in Semantic Search: Semantic search should consider tenses, plural, singular, etc. and provide relevant search results for all semantic variations of the words, for example words like dog, dogs, dog's, etc.
• Synonyms and Semantic Search: A semantic search engine should be able to understand synonyms and give more or less the same search results for any synonym of the word users search for. For example, try searching for "biggest mountain" and "highest mountain". You would get pretty much the same results since both of them mean the same in this particular query, even though "biggest" and "highest" could mean different things in different cases.
• Generalized and Specialized queries: A semantic search engine should be able to relate generalized and specialized queries and provide appropriate and relevant results. For example, consider an article on general health topics and another article specifically on Diabetes. If someone searches for health information, both articles could match even though the article on Diabetes does not talk specifically about "health".
• Concept matching: This is a subset of context matching in semantic search. Semantic search should understand the broad concept of the query and return relevant results. For example, a query on "Traffic problems in New Jersey" could return relevant results including the topics "narrow roads", "non functioning traffic lights", "lack of roadside assistance", etc., because from a broad conceptual point of view, all of these lead to traffic problems.


• Natural language queries: Not everyone is tech savvy and not many people know what to search for to get relevant search results. Most users simply type in queries in natural language. For example, if someone wants to find the current time in Arizona, USA, they would search for 'What time is it in Arizona'. Most search engines would simply show results from the websites and articles that talk about time and Arizona. However, smart search engines that use Semantic Search would actually show the current time in Arizona, USA. Try it yourself at Google search.
• Change of meaning based on the group of words: By combining different words, the true meaning of a search term could change. Consider the following search terms:
◦ New egg health products
◦ New egg health benefits
If you search for both of the above terms in Google, you will get completely different results. Instead of just picking the results based on the words, Google Search looks at the query as a whole term and then combines it with common user search patterns. The first term returns search results primarily on the popular online shopping website NewEgg.com and shows results of health products from that site and similar sites. The second term shows search results for the health benefits of eggs.

Semantic Search is a big challenge for search engines and none of them is perfect. Most search engines have improved significantly in the last few years. Search engines like Bing and Google provide significantly relevant search results incorporating some degree of semantic search. There are many other specialized search engines (like Hakia) which offer purely semantic search results, but they lack many other qualities of normal search engines.
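Two of the factors listed above, variations of words and synonyms, can be approximated by a simple query expansion step. The sketch below is only a toy illustration: the `VARIANTS` and `SYNONYMS` tables are hypothetical hand-written dictionaries, whereas a real semantic search engine would rely on morphological analysis and much richer lexical resources.

```python
# Hypothetical lookup tables, written by hand for this example only.
VARIANTS = {"dogs": "dog", "dog's": "dog", "mountains": "mountain"}
SYNONYMS = {"biggest": {"highest"}, "highest": {"biggest"}}

def expand_query(query):
    """Map each query word to the set of terms that should match it."""
    expanded = []
    for word in query.lower().split():
        base = VARIANTS.get(word, word)             # normalize word variations
        alternatives = {base} | SYNONYMS.get(base, set())
        expanded.append(alternatives)
    return expanded

print(expand_query("biggest mountain"))
print(expand_query("highest mountains"))
```

With these tables, both example queries expand to the same sets of terms, so a keyword engine run on the expanded query would return the same documents for both.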

1.9

Conclusion

In this chapter, the study focused on the working mechanism of search engines and information retrieval systems, with an emphasis on indexing because of its importance. Indeed, it is the most important step in the search process, as it allows the extraction and processing of keywords. The search phase not only provides the interaction between users and the system, but also calculates the degree of match between the query and the documents in order to provide the most relevant results.

Chapter 2 Arabic Language

أَنا البَحرُ في أَحشائِهِ الدُرُّ كامِنٌ *** فَهَل سَأَلوا الغَوّاصَ عَن صَدَفاتي
حافظ ابراهيم، قصيدة عن اللغة العربية

2.1

Introduction

Arabic (العربية) is a name applied to the descendants of the classical Arabic language of the sixth century AD, the language of the Qur'an, the Islamic holy book. Arabic is a Central Semitic language, closely related to the modern Hebrew and Aramaic languages¹. In this chapter, we will talk about the distinctive orthography and morphology of the Arabic language. We will also talk about some ambiguities that may appear in Arabic due to the absence of vocalization.

2.2

Orthography

The Arabic script is one of the most widely used scripts in the world. It dominates in the Arab countries, of course, but it also holds a special place for all Muslims because it is the script used to write the Qur'an [Jabri]. It is written from right to left, like other Semitic languages. Its alphabet has twenty-nine² consonant letters, three of which (ا و ي) are also considered vowels. Optionally, one of the three diacritical marks (ـَ ـُ ـِ) can be placed after certain characters to resolve the ambiguity in pronunciation and/or direction when it arises. In a fully vocalized Arabic text, the lack of a diacritic can be regarded as a sokōune (ـْ, silence). In some cases, a doubled letter may be replaced by a single letter with a tashdeed (ـّ, reinforcement) placed above it [AlKharashi1999].

¹ Modern Aramaic languages are varieties of Aramaic that are spoken vernaculars in the medieval to modern era, evolving out of Middle Aramaic dialects around AD 1200.
² The scholars of the Arabic language consider the Hamzah a letter.

In addition, it is important to note that the notion of uppercase and lowercase letters does not exist: the Arabic writing is called unicameral. Arabic is also a semi-cursive language: most letters are attached to each other, and their shapes differ depending on whether they are preceded and/or followed by other letters or are isolated. Only six of them do not attach to the following letter: ا د ذ ر ز و, and one letter does not attach at all: ء [Mesfar2008].

Letter      Spellings
Hamzah      ء
Wâw         و ـو
Ayn         ـع ـعـ عـ ع

Table 2.1: 3 types of Arabic letters: 1 form, 2 forms or 4 forms

2.3

Lexicography

Traditional Arabic grammar distinguishes only three classes of words: nouns, verbs and particles.

2.3.1

Verbs

A verb is an entity expressing a time-dependent sense. Most Arabic verbs are formed on three radical consonants, as is the case of the verb كَتَبَ (kataba – write), and occasionally on four consonants, as is the case of the verb دَحْرَجَ (daḥrağa – roll along). These roots may form several patterns as a result of one or more morphological transformations (e.g., repetition of a consonant, lengthening of a vowel, expansion of a morpheme, etc.); in this case we speak of roots with an augmented pattern. Several linguistic studies have been conducted on the verbal system in Arabic, see [Larcher2003]. In this section, it is necessary to introduce a classification of verbs according to their radicals:

2.3.1.1 Verbs with a simple root (الفعل المجرّد):

A verb with a simple root has a base of three consonants called radical consonants. These verbs are associated with the verbal pattern فَعَلَ (fa’ala). When none of the root consonants of the verb is a long vowel, it is called healthy. These radicals may undergo alterations due to causes of defect (عِلَّة); we mention:
• The presence of أ (á – hamzä), ي (y – yâ’) or و (w – wâw) among the radical consonants. Depending on its position, we distinguish different types of verbs:
◦ If one of the root consonants is أ (á – hamzä), independently of its position => Hamzated verb (فعل مَهْموز);
◦ The first radical consonant is a و (w) or ي (y) => Assimilated verb (فعل مِثَال);
◦ The second radical consonant is a و (w) or ي (y) => Hollow verb (فعل أَجْوَف);
◦ The third radical consonant is a و (w) or ي (y) => Weakened verb (فعل نَاقِص);
• The presence of two identical consonants in the second and third position of the root => Geminated verb (فعل مُضَاعَف).

2.3.1.2 Verbs with augmented root (الفعل المزيد):

The patterns of verbs with augmented root are formed from simple roots by a set of morphological operations that give a specific meaning to the resulting verbs. We mention:
• فَعَّلَ (fa”ala)
• فَاعَلَ (faā’ala)
• أَفْعَلَ (áf’ala)
• تَفَعَّلَ (tafa”ala)
• تَفَاعَلَ (tafaā’ala)
• اِفْتَعَلَ (ìfta’ala)
• اِنْفَعَلَ (ìnfa’ala)
• اِسْتَفْعَلَ (ìstaf’ala)
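The classification of simple (triliteral) roots given in 2.3.1.1 can be sketched as a small function. This is a simplified illustration under several assumptions: the root is passed as exactly three unvocalized letters, the hamzah may appear in any of its written forms, a root can receive several labels, and the label "healthy" is used here only when none of the listed defects applies. The function name and the English labels are chosen for the example.

```python
# Written forms of the hamzah: ء أ إ ؤ ئ
HAMZA_FORMS = set("\u0621\u0623\u0625\u0624\u0626")
# The weak letters wâw and yâ'
WEAK_LETTERS = set("\u0648\u064a")

def classify_root(root):
    """Return the list of defect labels of a triliteral root, or ['healthy']."""
    if len(root) != 3:
        raise ValueError("expected a root of exactly three letters")
    first, second, third = root
    labels = []
    if any(letter in HAMZA_FORMS for letter in root):
        labels.append("hamzated")      # hamzah in any position
    if first in WEAK_LETTERS:
        labels.append("assimilated")   # weak first radical
    if second in WEAK_LETTERS:
        labels.append("hollow")        # weak second radical
    if third in WEAK_LETTERS:
        labels.append("weakened")      # weak third radical
    if second == third:
        labels.append("geminated")     # identical second and third radicals
    return labels or ["healthy"]

for root in ["كتب", "وعد", "قول", "رمي", "مدد"]:
    print(root, classify_root(root))
```

A real morphological analyzer would also have to recover the root from a surface form first, which is the hard part and is not attempted here.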

2.3.2

Nouns

The morphological system of Arabic nouns contains several subcategories:

2.3.2.1 Primitive nouns (الأسماء الجامدة):

The primitive nouns are nouns that cannot be attached to a verbal root. They form the fundamental glossary of the concrete language, e.g.: كُرْسِيّ (kursiyy – chair), رَأْس (raás – head), كَبْش (kabš – sheep), etc. In this category we also include nouns composed of two letters such as: أب (áab – father), دم (dam – blood), فم (fam – mouth), أخ (áh – brother), etc.

2.3.2.2 Nouns derived from verbals (الأسماء المشتقّة):

These are the nouns that can be derived from a verbal root. The number and nature of these forms vary depending on the status of the verb to which they relate. As nouns, they can receive marks of case, gender and indeterminacy.

2.3.2.3 Numbers:

This category of nouns is made up of simple numerals representing units: from صفر (ṣifr – zero, 0) to تسعة (tisʿat – nine, 9); the tens: عشرة (ʿašarat – ten, 10), عشرون (ʿišruwn – twenty, 20) … and تسعون (tisʿuwn – ninety, 90); the hundreds, etc.; as well as compound numerals such as the cardinals from أَحَد عَشَر (áaḥada ʿašara – eleven, 11) to تسعة عَشَر (tisʿat ʿašara – nineteen, 19).

In their classification, the Arab grammarians have grouped adjectives with nouns, as they take almost all the same morphological forms and may, for example, be definite or indefinite and inflect according to case, number and gender.

2.3.2.4 Demonstrative pronouns (أسماء الإشارة):

Demonstrative pronouns represent a subcategory of nouns expressing an idea of demonstration. They can indicate that the object represented is found either in the text, or in space or time, as defined by the situation of utterance. There are two subsets: near-deictic (e.g.: هَذا (hadaā – this), هَؤُلاء (haẃulaā’ – these)) and far-deictic (e.g.: ذلِكَ (dalika – that), أُولائِكَ (ùuwlaāýika – those), etc.). Demonstratives are derivable only to dual.

2.3.2.5 Relative pronouns (أسماء موصولة):

Relative pronouns relate to the noun or personal pronoun that precedes them, which we call the antecedent. The relatives agree with their antecedents but are derivable only to dual (like demonstratives). Among the relative pronouns, we mention: الَّذي (al-ladiy – that, masculine, singular), اللَتَيْنِ (al-latayni – those, feminine, dual), اللذينَ (those, masculine, plural), etc.

2.3.2.6 Personal pronouns (الضمائر المنفصلة):

Personal pronouns are intended to identify three types of grammatical persons:
• First person, i.e., the speaker (المتكلم), the one who is talking: أَنَا (ánaā – I) or نَحْنُ (naḥnu – we);
• Second person, i.e., the listener (المخاطب), the one being talked to: أَنْتَ (áanta – you, masculine, singular), أَنْتِ (áanti – you, feminine, singular), أَنْتُمَا (áantumaā – you, dual), أَنْتُمْ (áantum – you, masculine, plural), أَنْتُنَّ (áantunna – you, feminine, plural);
• Third person, i.e., the absent (الغائب), the one being talked about: هُوَ (huwa – he), هِيَ (hiya – she), هُمَا (humaā – they, dual), هُمْ (hum – they, masculine), هُنَّ (hunna – they, feminine).

2.3.3

Function words

The function words are used to locate entities, facts or objects in relation to time or place. They also play a key role in the coherence and sequencing of a text. For example, we have particles that designate a time:
• بعد (baʿda – after)
• قبل (qabla – before)
• منذ (mundu – since)
or a place, like:
• حيث (ḥaytu – where)

According to their semantic meaning and their function in the sentence, they can play an important role in the interpretation of a sentence, expressing an introduction, explanation, consequence, etc. [Kadri1992]. Function words include various categories; we mention:
• Prepositions: فِي (fiy – in) or عَلَى (ʿalaą – on);
• Conjunctions: ثُمَّ (tumma – then);
• Adverbs: أَبَدًا (abadã – never) or بِشكْل عَادِي (bišaklĩ ʿaādiyyĩ – normally, in the normal way);
• Quantifiers: كُلّ (kulla – all) or بَعْض (baʿḍa – some);
• Etc.

The function words are divided into subgroups: those that are variable (quantifiers) and those that are invariable (adverbs, prepositions, etc.).

2.4

Morphology

There are several categories of fusional languages, and Arabic belongs precisely to the category of languages with intro-flexion: in this category of languages, the consonants indicate the meaning and the vowels mark the inflection of the word. This system is found especially in the Semitic languages (e.g., Arabic, Hebrew) [Choueiter2006]. Morphologically, the Arabic language is very rich and is based on the structure of patterns and roots. Most Arabic words are generated from a finite set of roots (about 7000 roots) transformed using one or more patterns (about 400-500). Theoretically, a single Arabic root can generate hundreds of words (nouns, verbs, ...). An Arabic word can exist in about a hundred forms in a normal text through the addition of certain suffixes and prefixes (most of which correspond to stop-words in English) [AlKharashi1999].
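The combination of roots and patterns described above can be illustrated by a very small sketch. It assumes a triliteral root written without diacritics and a pattern written with the conventional placeholder letters ف, ع and ل for the first, second and third radicals; every other letter of the pattern is copied as is. The function name is an assumption of this example, and the sketch ignores the adjustments required by weak or geminated roots as well as patterns whose added letters coincide with a placeholder.

```python
# Conventional placeholders for the three radicals of the pattern فعل
PLACEHOLDERS = {"ف": 0, "ع": 1, "ل": 2}

def apply_pattern(root, pattern):
    """Instantiate an unvocalized pattern with the letters of a triliteral root."""
    if len(root) != 3:
        raise ValueError("expected a root of exactly three letters")
    return "".join(
        root[PLACEHOLDERS[letter]] if letter in PLACEHOLDERS else letter
        for letter in pattern
    )

root = "كتب"  # k-t-b, the root related to writing
for pattern in ["فعل", "فاعل", "مفعول", "استفعل"]:
    print(pattern, "->", apply_pattern(root, pattern))
```

With a few hundred patterns and a few thousand roots, the same mechanism accounts for the very large vocabulary mentioned above, which is why a search engine for Arabic benefits from indexing roots or stems rather than surface forms only.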

2.4.1

Flexional Morphology

For the inflection of verbs and nouns, Arabic uses indications of aspect, mood, tense, person, gender, number and case, which are generally suffixes and prefixes [Gaudefroy1975]. Generally, these flexional marks can distinguish [?]:
• Mode of verbs: e.g., for the verb ذهَبَ (dahaba – to go), forms in the Perfective (الماضي) can be identified using their suffixes, as in ذهَبْتُ (dahabtu – I went), or by their prefixes in the Imperfective (المضارع), as in أَذهَبُ (áadhabu – I go);
• Function of nouns: using suffixes, as in رَجُلانِ (rağulaáni – two men) in the Nominative (حالة الرفع) or رَجُلَيْنِ (rağulayni – two men) in the Accusative (حالة النصب) or Genitive (حالة الجر). [?]

2.4.1.1 Flexion of verbs

Also called conjugation, it describes the variation in the forms of verbs according to circumstances. Generally, conjugation includes a number of values, which are:
• Aspect: The aspect is a grammatical feature associated, in most cases, with verbs in order to indicate which state the verb expresses, considered from the perspective of its development (beginning, progress, completion, overall evolution, etc.), regardless of when it occurs;
• Mood: Mood indicates how the action expressed by the verb is conceived and presented. The action can be doubted, or affirmed as actual or eventual. Moods combine with the semantics of verbs and thereby create aspects;
• Tense: Tense is a grammatical feature that locates a fact (which may be a state or an action) on the time axis of the utterance relative to three markers: past, present and future. The temporal indications are often accompanied by aspectual indications that are more or less related.

These three key values are closely related; they can describe two basic forms of the verb in Arabic:
• Perfective (الماضي): it indicates that the progress of the action expressed by the verb is finished, which means the past. It is characterized by adding suffixes of person, gender, number and mood to the verb's stem. For example, for the feminine plural of the verb كَتَبَ (kataba – to write), we add the suffix نَ to get the form كَتَبْنَ (katabna – they wrote, feminine), and for the masculine plural, we add the suffix وا to get the form كَتَبُوا (katabuwá – they wrote, masculine);
• Imperfective (المضارع): it indicates an unfinished progress, which may imply the present. It is characterized by adding a prefix and one or more infixes, such as a letter duplication or a vowel substitution. For example, for the verb مَدَّ (madda – to give), we can get أَمُدُّ (áamuddu – I give) or يَمْدُدْنَ (yamdudna – they give, feminine). It includes two types of modal inflections:
◦ The indicative, or actual mode, where the speaker states the actual character (realized, to be achieved, in progress, etc.) of the action or state expressed by the verb;
◦ The subjunctive, or potential mode, in which the speaker merely states the possible or virtual nature of the action or state expressed by the verb.

• Imperative (الأمر): it expresses an order, command, exhortation, etc. It exists only in the 2nd person, in singular, dual and plural.

2.4.1.2 Flexion of nouns

In Arabic, the declension (إعراب) of nouns involves three cases: Nominative (مَرْفُوع), Accusative (منْصُوب) and Genitive (مجرور). Except for some special cases, the nouns are declinable (معربة) and appear in one of these three cases according to their functions in the sentence. In terms of spelling, the case is represented only by an auxiliary graphic sign at the end of nominal forms. The nominal system of Arabic admits different systems depending on the nature of the variation of the form (triptote, diptote, etc.) and its number (singular, dual or plural). We can distinguish:

2.4.1.2.1 Declension of singular nouns:

Basic declension of triptotes (منصرف): This is the most frequent case. It takes the vowel ضَمَّة (ḍammat – u) as a sign of the nominative, the vowel فَتْحَة (fatḥat – a) in the accusative and the vowel كَسْرَة (kasrat – i) in the genitive. When the noun is undefined, the tanwîn is marked respectively by the three diacritics: ـٌ (ũ – un), ـً (ã – an) and ـٍ (ĩ – in). In the indefinite accusative, except in the case of nouns ending in ة (ta) or in اء (ā’), an álif ا (ā) strengthens the tanwîn ـً (an): for example, in the indefinite accusative the noun كِتَاب (kitaāb – book) becomes كِتَابًا (kitaābãā – book, accusative, indefinite) and the noun جَزِيرَة (ğaziyrat – island) becomes جَزِيرَةً (ğaziyratã – island, accusative, indefinite).

Declension of diptotes (الممنوع من الصرف): The nouns that are diptotes, when grammatically undefined, do not accept tanwîn and take the same mark in the accusative and the genitive, which is the فَتْحَة (fatḥat – a). By contrast, when they are defined, they follow the declension of triptotes. This is the case of feminine nouns that end with اء (ā’) such as صحْرَاء (saḥraā’ – desert), masculine adjectives of colors with the pattern أَفْعل (afʿal) such as أحْمَر (áḥmar – red), and those which are feminine with the pattern فَعْلَاء (faʿlaā’) such as بَيْضَاء (bayḍaā’ – white, feminine).

Declension of the Five Nouns (الأسماء الخمسة): The five nouns are:
• Three nouns: أبو (áabuw – father), أخو (áhuw – brother) and حمو (ḥamuw – father-in-law);
• A variant of فَم (fam – mouth): فَا, فو and فِي;
• The noun ذو (duw – possessor of).

These are bi-literal nouns that lengthen their final vowel when they are defined by a complement.

Declension of deverbals with defective roots:

Some active participles and deverbal nouns of verbs with a defective root, such as the active participle مَاضٍ (maādĩ – past) and the deverbal noun تَخَلٍّ (tahallĩ – abandon), only take the mark of case in the accusative: the last letter of the root ي (y) is replaced by the tanwîn (in) in the indefinite nominative and genitive. As for the passive participles that end in ى or ا, such as مُعْطًى (muʿṭãą – given), they lose their case inflection; a tanwîn distinguishes the indefinite noun from the definite one. At this point, it is important to note that this rule of declension is often relaxed. Indeed, in standard texts the form of the verbal noun قَاضٍ (qaādĩ – jurist) is generally altered to قَاضِي (qaādiy – jurist) by adding the glide ي (y – yâ’) at the end of the initial form.

2.4.1.2.2 Declension of dual nouns:

There is in Arabic a dual form called المُثنَّى (al-mutannaą – dual) to describe two things or two persons. It takes place between the singular (referring to one thing or one person) and the plural (three or more things or persons). This is a declension with two alternatives, where the mark of the nominative is Alef ا and that of the accusative and genitive is Ya ي. To form the dual of a noun, indefinite or definite by article, we add the suffix انِ (aāni) in the nominative and يْنِ (ayni) in the accusative and genitive. For example, the dual form of the noun سَيَّارَة (sayyaārat – a car) takes the form سَيَّارَتانِ (sayyaārataāni – two cars, nominative) or سَيَّارَتَيْنِ (sayyaāratayni – two cars, accusative and genitive). In some cases, especially for words whose root is defective or that end with a ى (ą), a و (w) or a ء (‘), the ending of the noun changes before the suffix of the dual. In this case, the form مَقْهَى (maqhaą – coffee shop) has the dual form مَقْهَيَانِ (maqhayaāni – two coffee shops).

2.4.1.2.3 Declension of plural nouns:

There are two major categories of plural form in Arabic:

External or regular plurals (الجموع السالمة): External plurals are formed by adding a suffix to the singular without changing the structure of the word. We distinguish:
• External masculine plural (جمع مذكر السالم): For this plural we add the two letters ون (uwna) or ين (iyna) depending on the position of the word in the sentence (subject or object), e.g.: مسلم (muslim – Muslim) becomes مسلمون (muslimuwna – Muslims, nominative) or مسلمين (muslimiyna – Muslims, accusative or genitive);
• External feminine plural (جمع مؤنث السالم): Similarly, we add to this plural the two letters ات (āt), e.g.: سَيَّارَة (sayyaārat – a car) becomes سَيَّارَات (sayyaāraāt – cars).

Internal or broken plurals (جموع التكسير): The internal plurals are also called broken plurals because of the changes and infixes they require over the singular form, unlike what happens with regular plurals (masculine and feminine). Broken plural forms are numerous and generally unpredictable; they follow a variety of complex rules and depend on the noun. For example: the noun كَاتِب (kaātib – a writer) turns into the two plural forms كُتَّاب (kuttaāb – writers) or كَتَبَة (katabat – writers). Note also that the Arab grammarians have made distinctions between plurals based on number. For example: the noun شَهْر (šahr – month) has two plural forms: أَشْهُر (ášhur – less than 12 months) and شُهُور (šuhuwr – more than 12 months). Only the external plurals follow a clear declension. The internal plurals are related to the declensions of the singular (declensions of triptotes and diptotes).

2.4.1.3 Flexion of function words

When it comes to the inflection of particles, we distinguish two categories:
• Invariables: their forms are constant and do not accept any variation, e.g.: عَلَى (ʿalaą – on), مُنْذ (mundu – since), etc.
• Variables: they follow the system of declension in three cases according to their function in the sentence. For example, the quantifier كلّ (kull – all) may accept the three final case vowels to mark the nominative, accusative or genitive according to its function in the sentence.

2.4.2 Derivational morphology

Each verb has a set of associated deverbal forms with which it maintains morphological, syntactic and semantic relations. The number and nature of these forms vary depending on the status of the verb. We cite some deverbal forms[?]:

2.4.2.1 Deverbal noun (المصدر):
The deverbal noun is an abstract noun formed on the same root as a verb. It expresses the same semantic content as the verb but involves no notion of tense, aspect, mood, person, or voice. Semantically, it expresses an action, a state or a process within the meaning of the verb. Every verb, whatever its type, has a verbal noun and sometimes more than one. Augmented verbs generally have only one. By contrast, simple verbs can have up to five verbal nouns. For example, the verb وَدَّ (wadda – to like) has five different verbal nouns that express « affection » with some semantic nuance: وَدٌّ (waddu), وُدٌّ (wuddu), وِدَادٌ (widaādu), وِدَادَةٌ (widaādat) and مَوَدَّة (mawaddat);

2.4.2.2 Active participle (اسم فاعل):
The active participle is a noun associated with any action verb (transitive or intransitive). It denotes the agent who performs the action and is therefore called the agent noun of the verb. For example, verbs with a simple root such as ضَرَبَ (ḍaraba – to hit) follow the pattern فَاعل (faāʿil) to generate the active participle ضَارِب (ḍaārib – hitter);

2.4.2.3 Passive participle (اسم مفعول):
The passive participle is a noun associated with any transitive action verb. It denotes the patient undergoing the action or the result of that action, and is therefore called the patient noun of the verb. For example, the simple-root verb ضَرَبَ (ḍaraba – to hit) follows the pattern مَفعول (mafʿuwl) to generate the passive participle مَضْرُوب (maḍruwb – struck);

2.4.2.4 Nouns of time and place (أسماء الزمان والمكان):
The noun of place is a deverbal noun that designates the place where the action occurs, such as the noun مَدْرَسَة (madrasat – school, a place to study), which may be generated from the verb دَرَسَ (darasa – to study). When the semantics of the verb lends itself rather to a temporal interpretation, the noun is called a noun of time, such as the noun مَغْرِب (maġrib – sunset) attached to the verb غَرَب (ġaraba – to set);

2.4.2.5 Noun of instrument (اسم الآلة):
The instrument noun is a noun that refers to the instrument used to perform the action expressed by the verb. For example, the noun مُوَزِّع (muwazziʿ – dispenser) can be attached to the verb وَزَّع (wazzaʿa – to dispense);


2.4.2.6 The Nomen Vicis (اسم المرة):
The Nomen Vicis is a noun that designates a single occurrence of the action expressed by the verb. It is usually produced from a singular form of the verbal noun. For example, the noun ضَرْبَة (ḍarbat – a hit) may attach to the verb ضَرَبَ (ḍaraba – to hit).

2.4.2.7 The Nomen Speciei (اسم الهيئة):
The Nomen Speciei is a noun used to indicate how the action is done. For augmented verbs, it is not distinct from the nomen vicis. An example is the sentence جَلَسَت جِلْسَة الأَمِيرَة (ğalasat ğilsata al-amirat – she sat like a princess).

2.5 Ambiguity issues

2.5.1 The absence of vocalization

The majority of Arabic texts are written without diacritical marks. This causes some confusion in meaning, and only the context of the text can lift the ambiguity. Therefore, software based only on word spelling cannot resolve it.

For example, if the text contains the word ملك, how can we determine its meaning: is it المَلَك (angel), المُلك (kingdom) or المَلِك (king)? [Albawwab2009]

The following table shows some words whose meaning changes with the diacritical marks:

Unvocalized word | 1st meaning | 2nd meaning | 3rd meaning
القدر | القِدْر (pot) | القَدْر (dignity) | القَدَر (fate)
يحرم | يُحَرِّم (forbids) | يُحْرِم (enters ihram) | يَحْرُم (deprives of)
البر | البِر (charity) | البَرّ (land) | البُر (wheat)
الحجر | الحِجْر (lap) | الحَجَر (stone) | الحَجْر (interdiction)
العرض | العَرْض (width) | العِرْض (honor) | العَرَض (symptom)
أم | أَم (or) | أُم (mother) | أَمَّ (to lead in prayer)
الملك | المَلِك (king) | المَلَك (angel) | المُلْك (reign)
من | مِن (from) | مَن (who) | مَنّ (remind someone of a favor)
… | … | … | …

Table 2.2: The change of meaning by changing the diacritical marks
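The collision described above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `strip_diacritics` is hypothetical, and it assumes that removing every Unicode combining mark (category `Mn`) is an acceptable stand-in for "unvocalized" writing.

```python
import unicodedata

def strip_diacritics(word: str) -> str:
    """Drop all combining marks, which includes the Arabic harakat,
    shadda and sukun (U+064B..U+0652)."""
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")

# Three vocalized readings of the same skeleton, written with escapes.
MALIK = "\u0645\u064E\u0644\u0650\u0643"  # malik - king
MALAK = "\u0645\u064E\u0644\u064E\u0643"  # malak - angel
MULK = "\u0645\u064F\u0644\u0652\u0643"   # mulk  - reign
SKELETON = "\u0645\u0644\u0643"           # unvocalized m-l-k

readings = {"king": MALIK, "angel": MALAK, "reign": MULK}

# All three distinct vocalized words collapse to one unvocalized form.
collapsed = {strip_diacritics(w) for w in readings.values()}
```

After stripping, `collapsed` contains a single string, so a spelling-only matcher has no information left to tell the three readings apart.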

In addition to confusion about the meaning of a word due to the absence of diacritical marks, there is further confusion about the function of a word, which again only the context can eliminate. For example, for the word أخذ, software based only on word spelling cannot distinguish between the active form أَخَذَ (took) and the passive form أُخِذَ (was taken). [Albawwab2009]

The following table shows some words whose function changes with the diacritical marks:

Unvocalized word | 1st function | 2nd function
ملك | مَلِكٌ (Noun) | مَلَكَ (Verb)
من | مَن (Relative pronoun) | مِنْ (Preposition)
عدوان | عُدْوَان (Singular form) | عَدُوَّان (Dual form)
فرض | فَرَضَ (Perfective verb) | فَرْضٌ (Deverbal noun)
عن | عَنَّ (Perfective verb) | عَنْ (Preposition)
أدر | أَدْر (Jussive Imperfective) | أَدِرْ (Imperative)
أحسن | أَحْسَنَ (Perfective verb) | أَحْسَن (Superlative noun)
علم | عَلِمَ (Simple verb) | عَلَّمَ (Augmented verb)
غلبت | غُلِبَتْ (Passive voice) | غَلَبَتْ (Active voice)
... | ... | ...

Table 2.3: The change of function by changing diacritical marks

2.5.2 Prefixes

This is another type of confusion: a word can be read in several ways. For example, in the word وعد, the letter wâw (واو) may belong to the word, as in وَعَدَ from the verb وَعَدَ_يَعِدُ (promise), or else not belong to it, as in وَعَدَّ from the combination وَ + عَدَّ (and + count), from the verb عَدَّ_يَعُدُّ (count).

Thus, software based only on the spelling of a word cannot distinguish between these two cases. This type of confusion does not exist in the English language; for instance, the translation of the word والكتاب into English is a phrase of 3 independent words, ”and the book”. [Albawwab2009]

The following table provides additional examples of words that can be read in two ways:

Word | 1st way | 2nd way
بطاقة | بِطَاقَة (noun) | بِطَاقَة = بِـ + طَاقَة (preposition + noun)
وهم | وَهْمٌ (verbal noun) | وَهُمْ = وَ + هُمْ (conjunction + attached pronoun)
فرمت | فَرَمَتْ (from فَرَمَ يَفْرِمُ) | فَرَمَتْ = فَـ + رَمَتْ (from رَمَى يَرْمِي)
لسعت | لَسَعَتْ (from لَسَعَ يَلْسَعُ) | لَسَعَتْ = لَـ + سَعَتْ (from سَعَى يَسْعَى)
أبعث | أَبْعَثُ (imperfective verb) | أَبَعَثَ = أَ + بَعَثَ (interrogative hamza + perfective verb)
كرب | كَرِبَ (perfective verb) | كَرَبّ = كَ + رَبّ (particle of comparison + noun)
الحق | اِلْحَقْ (imperative verb) | الْحَقّ = الْ + حَقّ (definite article + noun)

Table 2.4: Ambiguities of prefixes
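The two readings of وعد can be enumerated mechanically. The sketch below is illustrative only: the proclitic list and the two-entry stem lexicon are hypothetical toy data, and the function handles a single prefix, so stacked prefixes such as those in والكتاب are out of its scope.

```python
from typing import List, Tuple

# Hypothetical toy data: a few one-unit proclitics and a two-stem lexicon.
PROCLITICS = [
    "\u0648",        # wa  (and)
    "\u0641",        # fa  (then)
    "\u0628",        # bi  (with)
    "\u0644",        # li  (for)
    "\u0643",        # ka  (like)
    "\u0627\u0644",  # al  (definite article)
]
STEMS = {
    "\u0648\u0639\u062F",  # w-3-d, skeleton of "to promise"
    "\u0639\u062F",        # 3-d, skeleton of "to count"
}

def segmentations(word: str) -> List[Tuple[str, str]]:
    """Return every (prefix, stem) reading licensed by the toy lexicon."""
    readings = []
    if word in STEMS:
        readings.append(("", word))  # the first letter belongs to the word
    for prefix in PROCLITICS:
        rest = word[len(prefix):]
        if word.startswith(prefix) and rest in STEMS:
            readings.append((prefix, rest))  # the first letter is a prefix
    return readings

WAAD = "\u0648\u0639\u062F"  # the unvocalized word discussed in the text
result = segmentations(WAAD)
```

With this toy lexicon, `result` holds two readings: the whole word as a stem, and wâw plus the stem meaning "count". Choosing between them still requires context, which is the point made in this section.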

2.5.3 Suffixes

The suffixes are the attached pronouns (الضمائر المتصلة) attached to words. The confusion here is similar to that of prefixes: a word can be read in two ways, depending on whether the final letters belong to the word or not. For example, in the word وله, the letter ha’ (هاء) may:
• belong to the word, as in وَلِه from the verb وَلِهَ_يَوْلَهُ (admire);
• not belong to the word, as in وَلِّهِ from the combination وَلِّ + ـهِ (crown + him), from the verb وَلَّى_يُوَلِّي (crown), or as in وَ + لَ + هُ (and + he + has).

As with the previous confusion, this type does not exist in the English language; for instance, the translation of أَجَبْنَاهُ into English is a phrase of 3 separate words, ”we answered him”. [Albawwab2009]

The following table provides additional examples:

Word | 1st way | 2nd way
برك | بَرَكَ (perfective verb) | بَرَّكَ = بَرَّ + كَ (perfective verb + attached pronoun)
بكم | بُكْمٌ (plural of أبكم) | بِكُمْ = بِـ + كُمْ (preposition + attached pronoun)
تلفن | تَلْفَنَ (quadriliteral verb) | تَلِفْنَ = تَلِفْ + نَ (perfective verb + feminine plural nūn)
نبت | نَبَتَ (perfective verb) | نَبَتْ = نَبَ + تْ (perfective verb from نبا ينبو + quiescent feminine tā’)
برهن | بَرْهَنَ (quadriliteral verb) | بِرِّهِنَّ = بِرِّ + هِنَّ (noun + attached pronoun)
لي | لَيٌّ (verbal noun of لَوَى يَلْوِي) | لِي = لِـ + ي (preposition + attached pronoun)
همهم | هَمْهَمَ (quadriliteral verb) | هَمُّهُمْ = هَمُّ + هُمْ (noun + attached pronoun)

Table 2.5: Ambiguities due to suffixes

2.6 The computerization of the Arabic language

Computer software was designed at first to serve the Latin-script languages, including English. These techniques were then adapted to accommodate the other world languages, including Arabic. Writing conventions for certain languages were influenced by that, because of the pressure of technological change. However, the Arabic script has persisted against the effects of this wave, despite growing calls to change it under the pretext of simplification, modernization, or convenience.

The output of Arabic letters on screens and printers has gone through several stages. It started with a single fixed shape for each letter, with the letters of a word separated by spaces. It then accommodated the change of a letter’s shape according to its position in the word (initial, medial, final or isolated) through contextual analysis. The appearance of letters was then improved by refining their glyphs and varying the display width according to the letter. Finally came the possibility of marking diacritics, and ligatures composed of two or more characters:

Figure 2.1: Ligature of lâm and álif.

All this was achieved through the development of professional font design programs. Sophisticated font formats, such as TrueType and OpenType, preserve the specifications of characters regardless of size, which gives a unique and natural shape. Each letter of the Arabic language is represented by one code at the level of the computer, and this code corresponds to a single key on the keyboard. The displayed form of the letter may look different depending on its position in the word. Therefore, the shape of the character as typed need not match what is stored within the computer or what is printed. Lam-Alef has no place in the table of the Arabic alphabet, but it appears as a single letter in print. It can be entered directly, or by pressing Lam then Alef on the keyboard.
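The gap between displayed shape and stored letter can be observed in Unicode itself, which keeps legacy "presentation form" code points for positional shapes and for the Lam-Alef ligature. The sketch below is an illustration under one assumption: that NFKC compatibility normalization is an acceptable way to map such display forms back to the stored base letters. The helper name `to_stored_letters` is hypothetical.

```python
import unicodedata

LAM, ALEF, BEH = "\u0644", "\u0627", "\u0628"
LAM_ALEF_LIGATURE = "\uFEFB"  # presentation form: isolated lam-alef ligature
BEH_INITIAL_FORM = "\uFE91"   # presentation form: beh at the start of a word

def to_stored_letters(text: str) -> str:
    """Map positional shapes and ligatures back to base letters (NFKC)."""
    return unicodedata.normalize("NFKC", text)

# One displayed glyph, two stored letters.
ligature_letters = to_stored_letters(LAM_ALEF_LIGATURE)
# A positional shape maps back to the single base letter.
beh_letter = to_stored_letters(BEH_INITIAL_FORM)
```

A search engine that indexes text containing presentation forms would typically apply a normalization of this kind before matching, so that a query typed as Lam then Alef still finds the ligature.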

2.7 Conclusion

In this chapter we explained the orthographic and morphological properties of the Arabic language in general, and presented some ambiguities that may arise in automatic processing. In the next chapter, we present how search engines work and list known general, Arabic and Qur’anic search engines.

Chapter 3 The Qur’an

* ‫اب أُ ْح ِك َم ْت آ َيا ُت ُه ثُ َّم فُ ِّص َل ْت ِم ْن لَ ُد ْن َح ِكي ٍم َخ ِبي ٍر‬ ٌ ‫* الر ِك َت‬ 1 ‫ الآية‬، ‫سورة هود‬

3.1 Introduction

The Qur’an is the sacred book of Muslims and the basis of several sciences, such as Tafssir, translation of the Qur’an, recitation, etc. The information contained in the Qur’anic text is the source of these sciences, hence the need for a powerful search system that offers many features specific to the Qur’an. Our aim is to build such a system. In this chapter, we first clarify the definition of the Qur’an, then briefly review the main steps of its evolution through time and how it was written, edited and computerized. We then study the mus’haf and the sciences of the Qur’an. Finally, we look at the benefits and challenges of computerizing the Qur’an, in particular the Othmani script and the verification and authentication of the mus’haf.

3.2 Definitions

Qur’an: The word « Qur’an » or « Koran », in the Arabic language, means recitation or reading. Muslim scholars define it as the words of Allah revealed to His Prophet Muhammad, written in the mus’haf and transmitted by successive generations (التّواتر). The Qur’an is also known by other names such as Al-Furkān, Al-kitāb, Al-dhikr, Al-wahy and Al-rōuh. It is the sacred book of all Muslims and the first reference of Islamic law. [Mahssin1973]

Mus’haf: The mus’haf is the book that contains the entire Qur’an. The first mus’haf was gathered by the Caliph Abu-Bakr. The mus’haf is written in the Othmani script (الخطّ العثماني), which is somewhat different from the standard script (الخط الإملائي). A translation of the Qur’an in no way constitutes a mus’haf, because the Qur’an was revealed in Arabic and is unique in both its form and its meaning. [Hadjhenni2008]

Qur’anic Documents: A Qur’anic document is a document that contains either the entire contents of the Qur’an (mus’haf) or a coherent extract (successive ayahs) of the Qur’an in Arabic. These extracts can be used as quotations or examples, and are often written in standard Arabic script. [Hadjhenni2008]

Qur’anic Lexicon (Mu’jam): A Mu’jam is a reference work that contains entries (subjects or words) sorted in some order, together with their translations or explanations. A Mu’jam may be general or specialized. A Qur’anic Mu’jam is any Mu’jam whose contents are based on the Qur’an.

Qur’anic Index: An index resembles a Mu’jam in that it contains sorted words, but it does not explain them. Instead, it indicates the position of all their occurrences so that they can be accessed easily. A Qur’anic index gives the position in the Qur’an, by surah and ayah, of a specific term. [Khmissi2004]

3.3 Qur’an Structure

The Qur’an consists of 114 surahs, and the surahs are divided into ayahs. In the mus’haf, the number of each ayah is attached to its text. This partitioning has been the same since the first mus’haf, because the order of surahs and ayahs was specified by the Prophet. However, there are other fragmentations of the mus’haf adopted by scholars of the Qur’an in order to facilitate learning and recitation. The marks of these fragmentations are not part of the Qur’anic text, and they are written in the margins. [Zerrouki2005]

The structure of the Qur’an consists of the following analytical levels [BenJammaa]:
• Primary structure: surah, ayah, word and letter;
• Special locations: first ayahs of a surah (فواتح السورة), last ayahs of a surah (خواتيم السورة), Qur’anic comma (فاصلة قرآنية), sajdah (سجدة), waqf (وقف);
• Other structures: page, Juz’ (جزء), Hizb (حزب), Nisf (نصف), Rubu’ (ربع), Thumn (ثمن);
• Scripture: Sawamit, Harakat, Hamza, diacritics, signs of distinction between similar letters, phonetic signs;
• Incorporeal structure: word, keyword, phrase, purpose unit;
• Revelation: order, place, timing, cause, context, etc.

Here are some examples of different levels of fragmentation:

3.3.1 Fragmentation into surahs

This is the original fragmentation of the Qur’an. Each surah has a name, a basmalah (except Al-Tawbah, which has none) and a list of ayahs separated by a circle that contains the number of the ayah within its surah. Surahs are characterized by: a name, the number of ayahs, the type of revelation (Makkan مكّيّة or Medinan مدنيّة), the order in the mus’haf and the order according to revelation. [Zerrouki2005]

Figure 3.1: Explanatory diagram of fragmentation into Surahs
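The attributes that characterize a surah map naturally onto a small record type. The following is a hedged sketch, not the data model of any existing system: the class and field names are hypothetical, and the sample ayah count assumes the Hafs numbering, which may differ from other rewayates.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Revelation(Enum):
    MAKKAN = "Makkan"
    MEDINAN = "Medinan"

@dataclass(frozen=True)
class Surah:
    name: str
    ayah_count: int
    revelation: Revelation
    mushaf_order: int                       # position in the mus'haf, 1..114
    revelation_order: Optional[int] = None  # left unset when not recorded
    has_basmalah: bool = True

    def __post_init__(self) -> None:
        # The Qur'an has 114 surahs, so the order must fall in that range.
        if not 1 <= self.mushaf_order <= 114:
            raise ValueError("mushaf_order must be between 1 and 114")

# Illustrative instance; 129 ayahs assumes the Hafs count.
al_tawbah = Surah("Al-Tawbah", 129, Revelation.MEDINAN, 9, has_basmalah=False)
```

Making the record immutable (`frozen=True`) reflects the fact that this partitioning is fixed; only annotations layered on top of it would change.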

3.3.2 Fragmentation into Hizbs

The Qur’an is fragmented into 30 approximately equal parts, each called a Juz’; this spreads the reading of the Qur’an over a month. Each Juz’ is fragmented into two parts, each called a Hizb. In turn, half of a Hizb is called a Nisf, half of a Nisf is called a Rubu’, and finally half of a Rubu’ is called a Thumn. [Zerrouki2005]

Figure 3.2: Explanatory diagram of fragmentation into Hizbs
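Since each level halves the previous one, the number of fragments at every level follows from the 30 Juz’ by repeated doubling. The short sketch below works this out; the function name is hypothetical and the level names are taken from the description above.

```python
JUZ_COUNT = 30
# Each level below is obtained by splitting the previous level in two.
DIVISIONS = ["Hizb", "Nisf", "Rubu'", "Thumn"]

def fragment_counts(juz_count: int = JUZ_COUNT) -> dict:
    """Return the number of fragments in the whole Qur'an at each level."""
    counts = {"Juz'": juz_count}
    total = juz_count
    for name in DIVISIONS:
        total *= 2
        counts[name] = total
    return counts

counts = fragment_counts()
```

Under this halving scheme the totals are 60 Hizbs, 120 Nisfs, 240 Rubu’s and 480 Thumns, so a Thumn is one sixteenth of a Juz’.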

3.3.3 Fragmentation into Stops (Waqfs)

Waqf marks indicate when to pause during the recitation of the Qur’an. The marks differ in order to distinguish the type of waqf, which can be allowed, preferred, prohibited, etc. Note that the marking of waqfs differs between rewayates (الروايات). The table below gives a reference of waqfs based on the rewayate of Kaloun. [Bourne1979, Zerrouki2005]

Symbol | Type [Arabic] | Type [English]
م | Waqf Lazim | Obligatory Stop
قلى | Waqf Awla | Preferable Stop
∴∴ | Ta‘anuq Waqf | Embrace Stop
ج | Waqf Ja”iz | Permissible Stop
لا | La Waqf | No Stop
صلى | Wasl Awla | Preferable Continuation

Table 3.1: Waqfs types [Web-Islamweb]


3.4 Qur’anic Sciences

These sciences take the Qur’an as a subject of study in order to illustrate and explore its secrets. Some books refer to them under different names, such as Revelation Science (علم التنزيل) and Book Science (علم الكتاب). We can classify them into two categories [bSbNaserEttayar2008]:

1. Sciences originating from the Qur’an: they cannot be used to study anything other than the Qur’an, such as knowledge of Makkan and Medinan ayahs, the causes of revelation, etc.

2. Sciences shared between the Qur’an and other fields: they arise from the fact that the Qur’an can be seen in two different ways:
(a) Seen as a text written in Arabic, it can be studied by Arab linguists in terms of Declension (الإعراب), Morphology (الصرف), Rhetoric (البلاغة), Lexicology (علم المعاجم), etc. It should be noted here that the Qur’an was the primary reason for the emergence of the sciences of the Arabic language, since the Arabs did not take care of their own language until the arrival of Islam. Note also that this kind of science is covered more widely in books on the Arabic language than in books on the sciences of the Qur’an, because the latter took only what they needed.
(b) Seen as a legislative text, a characteristic it shares with the Sunnah, it has produced a number of sciences such as Fiqh (الفقه), Abrogating and Abrogated ayahs (النّاسخ والمنسوخ), General and Particular (الخاصّ والعامّ), etc.

The Qur’anic sciences are too vast for us to browse all their branches, so we rely on those expected to help us in our work:

3.4.1 Knowledge of Ayahs revelation places

There are many definitions that distinguish between what is Makkan and what is Medinan, but the best known is as follows:
• Makkan is all that was revealed before the migration of the Prophet (الهجرة النبوية), even if it was revealed in Medina, and Medinan is all that was revealed after the migration, even if it was revealed in Makka.
• 82 surahs are Makkan and 20 are Medinan, while scholars disagree on the remaining 12. Makkan and Medinan ayahs have characteristics that distinguish them from each other:
• The characteristics of Makkan surahs: any surah that contains:
◦ A prostration of recitation (سجود التّلاوة);
◦ The word ”Nay!” (kallā كلّا); there are 33 occurrences of this word in the Qur’an;
◦ The phrase ”O people” (يَا أَيُّهَا النَّاسُ);
◦ Stories of the Prophets and past communities, except in the case of Surah the Cow (البقرة);
◦ The stories of Adam and Satan, except, once again, Surah the Cow (البقرة);
◦ The detached alphabetical letters such as Alif-Lam-Mim (ا ل م) and Ha-Mim (ح م), except Surah the Cow (البقرة) and al-’Imrān (آل عمران).
• The characteristics of Medinan surahs: any surah that contains:
◦ An obligation;
◦ A legal punishment;
◦ Ayahs talking about hypocrisy;
◦ Dialogue with Jews and Christians (People of the Book, أهل الكتاب);
◦ « O believers … » [يَا أَيُّهَا الَّذِينَ آمَنُوا];
• The importance of knowing the Makkan and the Medinan:
◦ Knowledge of Abrogating and Abrogated ayahs (النّاسخ والمنسوخ).
◦ Mastering the history of Islamic legislation and juridical purpose by sorting the branches with their foundations, using principles of reflection and psychology to deliver legal consultations reliably. This helps the acceptance of preaching and Islamic laws by peoples newly converted to Islam.
◦ The explanation of the Qur’an and understanding of its teachings.
◦ The discovery of various Qur’anic styles and their use in Islamic preaching.
◦ Knowledge of the Prophet’s biography through the Qur’anic verses.

3.4.2 Knowledge of Ayahs revelation causes:

The importance of this science is evident in ayahs concerning specific moments of the Prophet’s life, e.g. [وَلِلَّهِ الْمَشْرِقُ وَالْمَغْرِبُ فَأَيْنَمَا تُوَلُّوا فَثَمَّ وَجْهُ اللَّهِ]. If we try to understand it literally, one might conclude that facing the Qiblah is not required, which is obviously wrong. So we can grasp the meaning of the ayah only if we know the cause of revelation [Zarkashī1957].

3.4.3 Knowledge of Morphology:

This science considers the structure of words. It is divided into two parts:
1. Putting the word in various forms such as the deverbal noun (مصدر), active participle (اسم فاعل), passive participle (اسم مفعول), etc.
2. Changing a word to express a particular meaning, such as expansion (الزّيادة), inversion (القلب), assimilation (الإدغام), etc.

Knowing the morphology is essential for understanding the meaning of the ayahs, because some ambiguous words can be understood only by examining their morphology. For example, in [وَأَمَّا الْقَاسِطُونَ فَكَانُوا لِجَهَنَّمَ حَطَباً] and [وَأَقْسِطُوا إِنَّ اللَّهَ يُحِبُّ الْمُقْسِطِينَ], one can see how morphology changes the meaning from the equitable (المقسطون) to the inequitable (القاسطون). [Zarkashī1957]

3.4.4 Knowledge of Orthography (علم مرسوم الخطّ):

When the companions wrote the Qur’an at the time of Othman, they diverged on the writing of the word التابوت (coffin) in the mus’haf. Zaid ibn Thabit (زيد بن ثابت) said التابوه, but Quraysh1 (قريش) said التابوت. So they asked the opinion of Othman, who said to write it التابوت because the Qur’an was revealed in the dialect (language) of Quraysh.

In fact, Qur’anic writing is fixed. Abdullah ibn Droustouīh (عبد الله ابن درستويه) said that in the Arabic language there are two scripts that cannot be judged by the usual rules: the script of the mus’haf and the script of prosody (تقطيع العروض). It is therefore not permitted to write the mus’haf in any script other than Othmani, because the particularities of this script are intentional. We give some examples below [Zarkashī1957]:
• Addition of Alif (أَلِف): such as the Alif in [كَأَمْثَالِ الْلُؤْلُؤِا], but not in [كَأَنَّهُمْ لُؤْلُؤٌ];
• Addition of Wāw (واو): such as [سَأُورِيكُمْ آيَاتِي];
• Addition of Yā’ (ياء): as in [وَالسَّمَاءَ بَنَيْنَاهَا بِأَيْيدٍ];
• Removal of Alif: such as [بِسْمِ اللَّهِ], but not in [بِاسْمِ رَبِّكَ];
• Removal of Wāw: such as [وَيَمْحُ اللَّهُ الْبَاطِلَ], but not in [يَمْحُوا اللَّهُ مَا يَشَاء];
• Removal of Yā’: such as [فَلا تَسْأَلْنِ مَا لَيْسَ لَكَ بِهِ عِلْمٌ], but not in [فَلا تَسْأَلْنِي عَنْ شَيْءٍ حَتَّى أُحْدِثَ لَكَ مِنْهُ ذِكْراً];
• Assimilation (الإدغام): such as [فَإِلَّمْ يَسْتَجِيبُوا لَكُمْ] and [فَإِنْ لَمْ يَسْتَجِيبُوا لَكَ];
• Close letters: such as Sīn and Sâd in [وَزَادَكُمْ فِي الْخَلْقِ بَصْطَةً] and [وَزَادَهُ بَسْطَةً فِي الْعِلْمِ وَالْجِسْمِ].

1 Quraysh tribe: the dominant tribe of Mecca at the advent of Islam.

3.4.5 Grammatical analysis of the Qur’an (إعراب ألفاظ القرآن):

The Prophet said: « Analyze the Qur’an and seek its oddities »2. Analyzing here means knowing the meaning of words. Some scholars have even grammatically analyzed the entire Qur’an word by word. [Zarkashī1957]

3.4.6 Science of allegorical ayahs (علم المتشابه):

Allegorical ayahs are ayahs that express the same meaning in different forms, with words repeated several times, as in the following examples [Zarkashī1957]:
• Reversing: like [وَالنَّصَارَى وَالصَّابِئِينَ] and [وَالصَّابِئِينَ وَالنَّصَارَى];
• Increases and decreases: like [فَأْتُوا بِسُورَةٍ مِّنْ مِثْلِهِ] and [فَأْتُوا بِسُورَةٍ مِثْلِهِ];
• Definition and in-definition: like [وَيَقْتُلُونَ النَّبِيِّينَ بِغَيْرِ الْحَقِّ] and [وَيَقْتُلُونَ النَّبِيِّينَ بِغَيْرِ حَقٍّ];
• Singular and plural: like [لَنْ تَمَسَّنَا النَّارُ إِلَّا أَيَّاماً مَعْدُودَاتٍ] and [لَنْ تَمَسَّنَا النَّارُ إِلَّا أَيَّاماً مَعْدُودَةً];
• Substitution of prepositions: like [وَمَا أُنْزِلَ إِلَيْنَا] and [وَمَا أُنْزِلَ عَلَيْنَا];
• Substitution of words: like [مَا أَلْفَيْنَا عَلَيْهِ آبَاءَنَا] and [مَا وَجَدْنَا عَلَيْهِ آبَاءَنَا];
• Literal repetition: such as [إِنَّ فِي ذَلِكَ لَآيَةً], which is repeated 20 times in the Qur’an.

3.4.7 The beginnings of surahs (فواتح السّور):

The Qur’an contains 114 chapters, each of which begins with one of the following ten types [Zarkashī1957]:
• Appreciation of Allah: [الْحَمْدُ لِلَّهِ];
• Alphabetical letters: [كهيعص];
• Call: [يَا أَيُّهَا الَّذِينَ آمَنُوا];
• Oath: [وَالنَّازِعَاتِ];
• Narrative sentence: [أَتَى أَمْرُ اللَّهِ];
• Condition: [إِذَا جَاءَ نَصْرُ اللَّهِ];
• Command: [اقْرَأْ بِاسْمِ رَبِّكَ];
• Interrogation: [عَمَّ يَتَسَاءَلُون];
• Prayer: [وَيْلٌ لِلْمُطَفِّفِينَ];
• Justification (only one time): [لإِيلافِ قُرَيْش].

2 « أعربوا القرآن والتمسوا غرائبه » – narrated by al-Bayhaqi and al-Hakim from Abu Hurayra

3.4.8 Knowledge of Qur’anic Parables (الأمثال القرآنية):

This science is important because several ayahs are illuminated through parables such as [كَمَثَلِ الْعَنْكَبُوتِ اتَّخَذَتْ بَيْتاً] and [مَثَلُهُمْ كَمَثَلِ الَّذِي اسْتَوْقَدَ نَاراً]. This science is stated in the Qur’an in: [وَتِلْكَ الْأَمْثَالُ نَضْرِبُهَا لِلنَّاسِ وَمَا يَعْقِلُهَا إِلَّا الْعَالِمُونَ] Spider, 48. [Zarkashī1957]

3.4.9 Tafssīr (التفسير):

Tafssīr is explaining the meanings of the Qur’an and extracting its laws (الأحكام). This requires knowledge of the causes of revelation, the grammatical analysis of the Qur’an and other Qur’anic sciences.

3.5 Computerization of the Qur’an

The Arabic language has been used in computer programs since the sixties. The software ’Salsabil’ was the first to handle the Qur’an, without diacritics, because software at that time was developed at a low level. Later, the high-resolution screens of Mac OS allowed Arabic diacritics and fonts to be introduced to computers. Today, Qur’anic software development has become extremely important among all areas of development, owing to several factors, including:
• The great interest of Muslims in the Qur’an;
• Support for the Arabic language in current operating systems;
• The inclusion of the symbols of the Qur’an in the Unicode standard;
• The appearance of fonts that can be used to write in Othmani script;
• The move of some programmers to developing under open source licences;
• The cumulative effect of projects working on the Qur’an.

3.5.1 Advantages of Computerization

The computerization of the Qur’an leads to a set of advantages:

Unite the different Qur’anic sciences: Improving accessibility and data access time is a key objective of computerization in general. In particular, computerization greatly facilitates learning, recitation, search, Tafssir, translation, and teaching of the Qur’an. Current Qur’anic applications make it possible to move between all these sciences with simple clicks, saving time and effort compared with searching in books. These programs do not replace books and traditional ways of learning, but complement them.

Verify the numerical miracles of the Qur’an: One type of Qur’anic miracle is based on statistics computed over the Qur’an, for example:

Word I | Frequency | Word II
الدّنيا | 115 | الآخرة
الملائكة | 68 | الشّياطين
الصّالحات | 167 | السّيئات
الشهر | 12 / 365 | اليوم

Table 3.2: Some numerical miracles of Qur’an[Nawfal1975]

The computerization of the Qur’an can help verify whether these miracles are true or false, and discover other clear statistics. We note here that many modern attempts to prove numerical miracles in the Qur’an have met with strong opposition from Muslim scholars, especially in the case of the impostor Rachād Khalīfa, who claimed to be a prophet based on his observations of numerical phenomena such as these. [Khedar]
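Checking a frequency claim by computer is a counting problem, but the result depends on how word forms are normalized before counting. The sketch below illustrates this with a toy token list. The tokens are hypothetical placeholders rather than Qur’anic text, and removing combining marks is only one of several possible normalization choices.

```python
import unicodedata
from collections import Counter

def normalize(token: str) -> str:
    """Drop combining marks so vocalized and bare spellings match."""
    return "".join(ch for ch in token if unicodedata.category(ch) != "Mn")

def frequencies(tokens):
    """Count tokens after normalization."""
    return Counter(normalize(t) for t in tokens)

YAWM_VOCALIZED = "\u064A\u064E\u0648\u0652\u0645"  # "day", with diacritics
YAWM = "\u064A\u0648\u0645"                        # "day", unvocalized
SHAHR = "\u0634\u0647\u0631"                       # "month", unvocalized

# Hypothetical toy token list, for illustration only.
tokens = [YAWM_VOCALIZED, YAWM, SHAHR, YAWM_VOCALIZED]

exact = Counter(tokens)             # counts raw spellings
normalized = frequencies(tokens)    # counts normalized spellings
```

Here the exact count of the bare spelling is 1 while the normalized count is 3. Any published frequency is therefore only reproducible if the counting rules (diacritics, affixes, derived forms) are stated alongside it.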

3.6 Qur’an Indexes

Over the centuries, Muslims have been keenly interested in the Qur’an through Tafssir, study, preservation, recitation, and calligraphy. Qur’anic science was the subject of great attention. Therefore, indexes were created for the vocabulary of the Qur’an, to search for similar words and the differences between them, and to find their roots and dissimilarities. In the modern era, ”the indexed Mu’jam of the Holy Qur’an words” by Mohammad Fouad Abd El baki filled a critical gap in the speed of access to ayahs of the Qur’an using any word. This Mu’jam indexes the nouns and verbs in the Qur’an according to their roots.

Today, numerous voices call for the computerization of such indexes for use in the broad fields of computing, including search, and several projects have been launched for this purpose. In this section, we mention some of those projects.

3.6.1

History

Many Muslim scholars are interested in indexing of the Qur’an by developing methods and renewing the manner and the purpose. They put up indexes of Qur’anic terms and their positions in the Qur’an, then they built lexicons to explain the meanings of these words after recalling the ayahs in which they appear. Some scholars have built specialized lexicons, each containing the vocabulary of a single topic. The evolution of the indexes of the Qur’an was done in four main steps[Khmissi2004]: 3.6.1.1 Indexing words of the Qur’an Development of indexes of Qur’anic words is a necessity, because they enable researchers to easily access the words of the Qur’an. Muslims and Orientalists have focused their research in the first third of the XXth century on this axis. Among the major works in this field we mention: • « nujoum al-furkān fī atrāf al-coran » (‫ )نجوم الفرقان في أطراف القرآن‬by the German orientalist Leberecht Flugel ; • « mu’jam of Qur’an ayahs » ( ‫القرآن‬

‫ )معجم آيات‬by Houssine Nassār ;

• « the indexed Mu’jam of the Holy Qur’an words » (‫القرآن‬

‫ )الكريم‬by Mohammed Fouad Abd El-bāki .

‫المعجم المفهرس لألفاظ‬

The index of Mohammed Fouad is the most important. He benefited from the experience of Flugel, translated his book nujoum al-furkān fī atrāf al-coran, and perfected it by removing word affixes and arranging the words in alphabetical order. His mu’jam, « the indexed Mu’jam of the Holy Qur’an words », was published in Egypt in 1945, and the ayah numbers mentioned in this book correspond to the official mus’haf. It was then published in several Arab and Muslim countries, including in Istanbul in 1982. This book has been highly successful; however, it does not contain pronouns and particles [Khmissi2004].

3.6.1.2 Ma’ājim of Qur’anic words

The indexes of the Qur’an make it possible to find the position of words, but the researcher must then look up the Tafssir in the appropriate books himself, hence the need to produce ma’ājim that give the words together with their explanation, to make the search more flexible. Several collaborative works were launched in this direction. In 1941, Dr. Mohamed Hussein Heikal, member of the Academy of Arabic Language in Cairo, proposed to produce a Mu’jam of the Qur’anic terms. In 1949, a committee began developing this Mu’jam, modeled on that of Mohammed Fouad Abd El-bāki [Khmissi2004].

3.6.1.3 Specialized Ma’ājim of Qur’anic words

The ma’ājim mentioned above are general, so it is difficult to find the ayahs containing words that belong to a specific field. This led to the creation of specialized ma’ājim such as the ”Mu’jam of vocabulary of plants in the Qur’an” by the leader in this field, Fawzi Al-Mokhtar na’āl, who then completed other specialized ma’ājim for humans, animals, ethics, trade, etc. [Khmissi2004]

3.6.1.4 Qur’anic Indexes and Computer

When the computer entered people’s daily lives, it revolutionized the information world. Muslim scholars took advantage of its services and produced an excellent index of the Qur’an. The speed and accuracy of the computer allow it to quickly give the location and the number of occurrences of any word in the Qur’an; it only needs to be fed with the indexes and ma’ājim.
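To illustrate how a computer can return the location and number of occurrences of a word once it has been fed the text, here is a minimal sketch of a word-position index. The function name `build_word_index` and the sample data are our own assumptions for illustration; they are not taken from any of the projects described in this chapter.

```python
from collections import defaultdict

def build_word_index(ayahs):
    """Map each word to the list of (surah, ayah, position) where it occurs."""
    index = defaultdict(list)
    for (surah, ayah), text in ayahs.items():
        for position, word in enumerate(text.split(), start=1):
            index[word].append((surah, ayah, position))
    return index

# Unvocalized text of the first three ayahs of al-Fatiha, used as sample data.
SAMPLE_AYAHS = {
    (1, 1): "بسم الله الرحمن الرحيم",
    (1, 2): "الحمد لله رب العالمين",
    (1, 3): "الرحمن الرحيم",
}

index = build_word_index(SAMPLE_AYAHS)
occurrences = index["الرحمن"]
# Number of occurrences, followed by their locations.
print(len(occurrences), occurrences)
```

A lookup is then a single dictionary access, which is why such an index answers location and frequency questions much faster than a sequential scan of the text.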

3.6.2

Classification

Existing indexes differ considerably from each other. We can classify them using several criteria, including the subject or purpose of each index and the unit on which the index is based. Following these criteria, we present the most significant indexes, whether existing or envisaged:

3.6.2.1 By unit

The indexes of the Quran are based on several units (see figure):

3.6.2.1.1 Word

These indexes can take two aspects. The first is grammatical: it consists in segmenting the words according to their fundamental structure, indicating the affixes and origins such as root and morpheme. The second aspect is intended to facilitate the reading of the mus’haf, through an index of words whose spelling differs between the standard script and the Othmani script, and an index of words that are spelled the same way but pronounced differently. Among these indexes there are:
1. Indexes of all words, linking each word with all its parts (prefixes + stem + suffixes) as well as the root, the morpheme, the verb in the infinitive form or the singular form in the case of plural words, plus various inflectional information such as the type of word and its inflectional function. These indexes serve natural language processing.
2. An index of words whose writing differs between Othmani script and standard script, linking each index entry with the ayahs and the words concerned in the pages of the mus’haf. The aim of this index is to facilitate reading the Qur’an for beginners and those who do not know the rules for writing words in Othmani script.
3. An index of words that are spelled similarly but whose pronunciation varies depending on the context, distinguishing between the pronunciations of these words and linking them with the pages of the mus’haf. [Mayaman]
The following is an example of an index that takes the word as a unit; this index will be detailed afterwards:

Figure 3.3: Index using the word as a unit [Arabic Quranic Corpus]

3.6.2.1.2 Ayah

This index takes the ayah as the indexing unit in order to classify ayahs by topic, examine their similarity and coincidence, develop a Mu’jam based on a morphological analyzer of Qur’anic roots, and index the beginnings and ends of ayahs (رؤوس الآيات), etc. Here are some example indexes:
1. An index of Qur’anic ayahs and their locations in the mus’haf.
2. A complete and comprehensive tree of topics, consisting of several levels, containing the topics covered by the Qur’an. This index connects each primary or secondary topic with the ayahs that can be placed under it. It is considered an excellent aid for preachers, scholars and students of Islamic knowledge to access all ayahs of a topic or subtopic in the Qur’anic text. Note that an ayah can be linked to several subtopics.
3. An index of identical or similar ayahs in the Qur’an, to serve memorizers and researchers of the Qur’an, where identical and similar ayahs are identified and linked into textual groups. Each group represents a list of ayahs identical or similar to an ayah of a surah (if any), accessible by choosing any ayah on that list.
4. An index of Qur’anic ayahs arranged according to the letters of their first words, their ends and their last two words, to facilitate both access to any ayah and memorization of the Qur’an.
5. An index of abrogating and abrogated ayahs in the Qur’an: it lists the abrogating or abrogated ayahs within the surahs of the Qur’an, linking the ayahs with their locations in the pages of the Mus’haf. [Mayaman]

Figure 3.4: Index that takes the ayah as a unit [Arabeyes Quran Model]

3.6.2.1.3 Surah

The basic purpose of these indexes is to give a general idea of the surahs, with sufficient information to learn what needs to be learned: basic data and virtues of surahs (فضائل السور). The most important of these indexes are:
1. A list of surahs and their locations in the mus’haf.
2. An index of basic data of surahs, including their names, type (Makkan or Medinan), order in the mus’haf, order of revelation, and previous and following surah (in order of mus’haf or revelation).
3. An index of virtues of the surahs in the Prophet’s Sunnah (السنة النبوية), based on the narration of Hadiths (رواية الأحاديث).
4. An index of Surah keys (مفاتيح السورة), which aims to define the Surah in terms of its names, the reason for that naming, the cause of revelation (if any), etc., with an overview of the Surah and the major topics it contains.
5. An index of topics for each surah, such that each topic includes several ayahs [Mayaman]

3.6.2.1.4 Other units

In addition to the aforementioned indexes based on word, surah and ayah, there are indexes based on other units, including:
1. The index based on the grammatical parts of words. It provides information on the type of each part and its grammatical characteristics: whether invariant or inflected, its inflexion, its inflectional mark and so on.

Order_Aff | Id_word | Diacritic | Affix
--- | --- | --- | ---
1 | 2597 | فتحة | فـ
2 | 2597 | فتحة | ـسـ
3 | 2597 | سكون | يكفي
4 | 2597 | فتحة | ك
5 | 2597 | سكون أو ضم عند التقاء الساكنين | هم

Table 3.3: Index based on word parts

2. The index of syllables or phonetic units. It provides information on pronunciation, which is very useful to the science of recitation. No such index exists yet.
3. The index of Qur’anic sentences (see Table 3.4), which shows how to split the surahs into sentences, since the ayah does not necessarily represent a sentence.

Num | Sentence
--- | ---
1 | بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ @
2 | الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ @ الرَّحْمَنِ الرَّحِيمِ @ مَالِكِ يَوْمِ الدِّينِ @
3 | إِيَّاكَ نَعْبُدُ
4 | وَإِيَّاكَ نَسْتَعِينُ @
5 | اهْدِنَا الصِّرَاطَ الْمُسْتَقِيمَ @ صِرَاطَ الَّذِينَ أَنْعَمْتَ عَلَيْهِمْ غَيْرِ الْمَغْضُوبِ عَلَيْهِمْ وَلَا الضَّالِّينَ @

Table 3.4: Index based on sentences – surah: al-fātiha

3.6.2.2 By purpose

The science of indexing has approached the Qur’an in several respects, from the indexing of words and the indexing of topics to the other indexes that we cite below:

Figure 3.5: Classification indexes by purpose

3.6.2.2.1 Syntactic indexes

These are the indexes that handle words in terms of the functionalities of the Arabic language, showing the affixes of the word, its root, the verb if it is derived from a verb, its singular form if it is plural, etc. The goal is to enable search by words and to replace stemmers by performing the majority of their functions. These indexes are built with or without word repetition. Among the best books that have indexed the Qur’an are the Mu’jam of Pr. Muhammad Fuad Abd El-baqi, the book « nujoum al-furkān fī atrāf al-coran » and the book « murchid al-haïrān ilā ayāte al-coran ». The index of words by Taha Zerrouki, the project ‘Midād lbayān’ and the Arabic Qur’anic Corpus by Kais Dukes can be considered in this category; we explain them later in this chapter.

3.6.2.2.2 Semantic indexes

These are the indexes that follow a semantic approach in order to link the words of the Qur’an with their meanings. The first databases of significations can be considered to be those based on the roots, since in terms of meaning the roots are closely interconnected. For example, the roots «جيء», «أتى», «قدم», «حضر», «وصل», «ولي» and «دبر», as well as «دخل», «ولج», «قبل», «ورد» and «وصل». Many books in the Arab heritage mention groups of words like these. There are also modern dictionaries that provide meanings and the corresponding foreign words. [Mālik1991, Nasser1989]

3.6.2.2.3 Thematic indexes

These are the indexes that divide the Qur’an into units on the same topic. The division takes the form of a pyramid of topics. Among the books that have indexed Qur’anic topics is the book « zad al-moâllifīn min kitābi rabbi al‘ālamin » by Abd Allah Mohammed Al-daruīche. Among the indexes of topics that have been computerized are the index of topics made by Taha Zerrouki and the project Qurany by Noorhan Abbas, both of which we cover in this chapter.

3.6.2.2.4 Structural indexes

These are the indexes that focus on the relations between the different divisions of the mus’haf, such as the division into ayahs, surahs ..., or thumns, rubu’ ..., or even special divisions such as sentences (see figure). These indexes are based on one of the units of the division.

Figure 3.6: The various structures of Qur’an

3.6.2.2.5 Statistical indexes

These are the indexes primarily intended to collect statistics on the various units of the Qur’an, from letters and syllables up to the Qur’an as a whole. Possible statistics include the frequency, the average frequency and the dominant component. These indexes are based on the element that serves as the basis of the statistics. An example of such a statistic is the number of characters in an ayah or a surah.
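As an illustration of such statistics, the sketch below counts the letters and words of a text, computes letter frequencies and reports the dominant letter. It is only a sketch under our own assumptions: the helper names are hypothetical, and we choose to exclude spaces and combining marks (vocalization) from the letter count by testing the Unicode category `Mn`.

```python
import unicodedata
from collections import Counter

def letters_only(text):
    """Keep base letters; drop whitespace and combining marks (diacritics)."""
    return [ch for ch in text
            if not ch.isspace() and unicodedata.category(ch) != "Mn"]

def letter_statistics(text):
    """Return letter count, word count, letter frequencies and dominant letter."""
    letters = letters_only(text)
    counts = Counter(letters)
    dominant, _ = counts.most_common(1)[0]
    return {
        "letters": len(letters),
        "words": len(text.split()),
        "frequency": counts,
        "dominant": dominant,
    }

stats = letter_statistics("بسم الله الرحمن الرحيم")
print(stats["letters"], stats["words"], stats["dominant"])
```

Whether spaces or diacritics count as characters is a design decision that changes every figure, so a statistical index should state its counting rule explicitly.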


3.6.3

Projects of building indexes

There are many Qur’anic indexes that vary in purpose: syntactic, thematic, etc. The problem is that most of them are printed and intended only for human reading; only a small part has been released in a computerized storage format such as XML, Excel or a database. There are many plans to computerize the printed indexes, but they are still under development. Among these projects we cite:

3.6.3.1 Midād lbayān

This project, started by Mohamed Zaki Khadher, aims to build a Qur’anic database that is as complete as possible and describes all the details of the Qur’an, from the smallest units, the syllables of the word, up to words, phrases, sentences, ayahs and surahs. This database contains writing, pronunciation, morphology, grammar, and meaning [Khedar]. Midād lbayān is considered a great achievement: its indexes contain a lot of information about the Qur’anic words and their characteristics, with the prospect of adding further indexes to study the sentences of the Qur’an, index topics, etc. The authors of the project confirm that the database is intended for free use, but they do not give details on the license used. They provide only a program for browsing the database manually, without references. [Web-Midād lba] We address one of the earlier versions of the database, for lack of access to the current version. Its main index contains:
1. A sequential code that serves as a key
2. Lemma (جذع الكلمة) and Root (جذر الكلمة)
3. The word in standard script (vocalized) and in Othmani script
4. List of prefixes (4 maximum)
5. List of postfixes (3 maximum)
6. Number of Ayah and Surah

Seq | Word | Stem | Root | Othmani | Prefix1 | Surah | Ayah
--- | --- | --- | --- | --- | --- | --- | ---
1 | # | # | # | # | | 1 | 0
2 | @ | @ | @ | @ | | 1 | 1
3 | بِسْم | اسْم | سمى | بِسْمِ | بِ | 1 | 1
4 | اللَّه | اللَّه | ءله | ٱللَّهِ | | 1 | 1
5 | الرَّحْمَن | رَحْمَن | رحم | ٱلرَّحْمَٰنِ | ال | 1 | 1
6 | الرَّحِيمِ | رَحِيم | رحم | ٱلرَّحِيمِ | ال | 1 | 1

(The columns Prefix2 to Prefix4 and Postfix1 to Postfix3 are empty for these rows.)

Table 3.5: Overview of main index – Midād lbayān

3.6.3.2 Indexes by Taha Zerrouki

This is an individual effort by Taha Zerrouki within a project about the Quran and its sciences. What distinguishes his work is its diversity: one index for words, one for topics and another for synonyms. They are not based on a clear unit (unlike Midād lbayān and the Qur’anic Arabic Corpus, which are based on the words of the Quran and consider all occurrences), which increases the complexity and the probability of failure when joining them with other indexes. Of the three indexes, the index of words contains:
1. Original word as a key
2. The word in Othmani script
3. Root
4. Deverbal noun
5. Type
6. Dual and Plural forms
7. Feminine form
8. Proper noun (true/false)

Othmani

‫البسط‬

Proper

‫فيبسطه‬ ‫يبسطوا‬ ‫باسقت‬

Dual form

‫مبسوط‬ ‫باسق‬

Feminine

Plural form

‫مبسوطتان‬

Type

‫مثنى‬ ‫باسقات‬

‫جمع‬

‫اسم‬ ‫اسم‬ ‫فعل‬ ‫فعل‬ ‫اسم‬

Deverbal noun

‫بسط‬ ‫مبسوطة‬ ‫بسط‬ ‫بسط‬ ‫باسقة‬

Root

‫بسط‬ ‫بسط‬ ‫بسط‬ ‫بسط‬ ‫بسق‬

Original

‫البسط‬ ‫مبسوطتان‬ ‫فيبسطه‬ ‫يبسطوا‬ ‫باسقات‬

Table 3.6: Overview of words index – M.Taha Zerrouki

The second index is the index of topics. It contains:
1. Topic
2. Subtopic
3. Section
4. Surah number
5. Ayahs interval (the first ayah to the last ayah)

Topic | Subtopic | Section | Surah | First ayah | Last ayah
--- | --- | --- | --- | --- | ---
أسماء الله تعالى وصفاته | صفات الله تعالى | مشيئته تعالى | 16 | 93 |
أسماء الله تعالى وصفاته | صفات الله تعالى | مشيئته تعالى | 17 | 54 |
أسماء الله تعالى وصفاته | صفات الله تعالى | عدله تعالى | 2 | 272 |
أسماء الله تعالى وصفاته | صفات الله تعالى | عدله تعالى | 2 | 281 |

Table 3.7: Overview of the topics index – Taha Zerrouki

The third index is the index of synonyms. It contains:
1. Type
2. Synonyms list
3. Note

Word | Type | Note | Synonyms
--- | --- | --- | ---
إِبِل | اسم | جمع | عِير
أَبِق | فعل | ألف | فَر، هَرَب، نَاص
أَب | اسم | | وَالِد
أَتَى | فعل | | أَقْبَل، حَضَر، جَاء
آثَر | فعل | | فَضَّل، اصْطفى، اخْتَار

Table 3.8: Overview of the index of synonyms – Taha Zerrouki

3.6.3.3 Qur’anic Arabic Corpus

This is a linguistic resource that provides annotated Arabic grammar, syntax and morphology for each word in the Qur’an, made by Kais Dukes in 2009 at the School of Computer Science, University of Leeds. This index is available on the Internet, together with other simplified indexes, licensed under the GPL (GNU General Public License), but it imposes additional restrictions when downloading, including the prohibition of modification. The Qur’anic corpus provides valuable grammatical information on the words of the Qur’an, on which the index is based. It formulates this information in a particular way that makes extraction difficult and complex. [Web-Corpus] We address the version available for download, which is in XML format and contains the following information:
1. Numbers of Surah, Ayah and Word as a key
2. The word in Othmani script (Unicode)

3. A coded expression that describes the word morphologically and grammatically [Dukes2010]:
(a) Prefix features: Al+ (determiner al), bi+ (preposition bi), ka+ (preposition ka), ta+ (preposition ta), sa+ (future particle sa), ya+ (vocative particle ya), ha+ (vocative particle ha)
(b) letter alif as a prefixed particle: A:INTG+ (interrogative alif), A:EQ+ (equalization alif)
(c) letter waw as a prefixed particle: wa+ (conjunction waw), w:P+ (preposition waw, used as a particle of oath)
(d) letter fa as a prefixed particle: f:CONJ+ (conjunction fa), f:REM+ (resumption fa), f:CAUS+ (cause fa)
(e) letter lam as a prefixed particle: l:P+ (preposition lam), l:EMPH+ (emphasis lam), l:PRP+ (purpose lam), l:IMPV+ (imperative lam)
(f) root: ROOT: (uses Buckwalter transliteration)
(g) lemma: LEM: (uses Buckwalter transliteration)
(h) special: SP: (used if the word belongs to a special group such as (كان وأخواتها). Certain words in the corpus are tagged this way where this is relevant for syntactic function and not easily determined by lemma or part-of-speech; for example, the particle ma (ما) in a negative sense can behave like the verb laysa (ليس) and place a predicate into the accusative case)
(i) person: 1 (first person), 2 (second person), 3 (third person)
(j) gender: M (masculine), F (feminine)
(k) number: S (singular), D (dual), P (plural)
(l) aspect: PERF (perfect), IMPF (imperfect), IMPV (imperative)
(m) mood: IND (indicative), SUBJ (subjunctive), JUS (jussive), ENG (energetic)
(n) voice: ACT (active), PASS (passive)
(o) verb form: I to XII
(p) derivation: ACT PCPL (active participle), PASS PCPL (passive participle), VN (verbal noun)
(q) state: DEF (definite), INDEF (indefinite)
(r) case: NOM (nominative), ACC (accusative), GEN (genitive)
(s) suffix features: PRON: (attached pronoun, compound feature with person, gender and number), +VOC (vocative suffix for Allahumma)

chapter | verse | word | token | morphology
--- | --- | --- | --- | ---
1 | 1 | 1 | بِسْمِ | bi+ POS:N LEM:{som ROOT:smw M GEN
1 | 1 | 2 | ٱللَّهِ | POS:PN LEM:{ll~ah GEN
1 | 1 | 3 | ٱلرَّحْمَٰنِ | Al+ POS:ADJ LEM:r~aHoma`n ROOT:rHm MS GEN
1 | 1 | 4 | ٱلرَّحِيمِ | Al+ POS:ADJ LEM:r~aHiym ROOT:rHm MS GEN

Table 3.9: Overview of morphology index – Quranic Arabic corpus
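Since the coded expressions are the main obstacle to extracting information from this corpus, a small sketch may help show how one could be decoded. The function `parse_morphology` below is hypothetical and written only from the conventions listed above: tokens ending in `+` are treated as prefixes, tokens starting with `+` as suffixes, `KEY:value` tokens as keyed features, and everything else as plain flags. It does not resolve ambiguities such as ACT (voice) versus ACT PCPL (derivation).

```python
def parse_morphology(expression):
    """Split a coded morphology expression into its kinds of tokens."""
    parsed = {"prefixes": [], "suffixes": [], "features": {}, "flags": []}
    for token in expression.split():
        if token.endswith("+"):          # e.g. bi+, Al+, f:CONJ+
            parsed["prefixes"].append(token)
        elif token.startswith("+"):      # e.g. +VOC
            parsed["suffixes"].append(token)
        elif ":" in token:               # e.g. POS:N, ROOT:smw
            key, _, value = token.partition(":")
            parsed["features"][key] = value
        else:                            # e.g. M, MS, GEN
            parsed["flags"].append(token)
    return parsed

# First row of Table 3.9.
first_word = parse_morphology("bi+ POS:N LEM:{som ROOT:smw M GEN")
print(first_word["features"]["ROOT"], first_word["prefixes"])
```

The prefix test comes before the colon test on purpose, so that compound prefixes such as `f:CONJ+` are not mistaken for keyed features.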

3.6.3.4 Tanzil Project

Tanzil is a project designed to provide a highly accurate and well-checked electronic copy of the Qur’anic text. It was developed by Hamid Zarrabi-Zadeh and reviewed by a verification team. Tanzil provides several texts of the Qur’an in two major categories, as outlined below:
1. Othmani script
(a) Othmani : Quranic text in Othmani script similar to « Mushaf Al-Madina ».
(b) Othmani minimal : Quranic text in Othmani script with the minimum of diacritical marks and symbols.
2. Standard script
(a) Simple : the Qur’anic text in Standard script, recommended for search and display.
(b) Simple Enhanced : simple text with a better demonstration of ’ikhfā and idghām (Tajweed rules applied when the ’noon as-sakina’ or ’tanwīn’ is followed by certain letters), simplified for reading.
(c) Simple minimal : simple text with a minimum number of diacritical marks and symbols, suitable for integration in other texts.
(d) Simple clean : simple text without diacritical signs or symbols, suitable for easy search.

Tanzil also provides some indexes, such as:
• Surah index ;
• Sajdah index ;
• structural indexes based on :
◦ Rub’ (we can define hizbs and nisfs based on this index);
◦ Juz’ ;
◦ Manzil ;
◦ Ruku’ ;
◦ Page.

Tanzil is free to use and can be downloaded from its public website. It enjoys considerable credibility, since a group of researchers takes care of the checking [Web-Tanzil].

surah index | surah name | ayah index | ayah text | bismillah
--- | --- | --- | --- | ---
1 | الفاتحة | 1 | بسم الله الرحمن الرحيم |
1 | الفاتحة | 2 | الحمد لله رب العالمين |
1 | الفاتحة | 3 | الرحمن الرحيم |
1 | الفاتحة | 4 | مالك يوم الدين |
1 | الفاتحة | 5 | إياك نعبد وإياك نستعين |
1 | الفاتحة | 6 | اهدنا الصراط المستقيم |
1 | الفاتحة | 7 | صراط الذين أنعمت عليهم غير المغضوب عليهم ولا الضالين |
2 | البقرة | 1 | الم | بسم الله الرحمن الرحيم
... | ... | ... | ... | ...

Table 3.10: Example of simple proper Quranic text – Tanzil.info

index | ayas | start | name | tname | ename | type | order | rukus
--- | --- | --- | --- | --- | --- | --- | --- | ---
1 | 7 | 0 | الفاتحة | Al-Faatiha | The Opening | Makkan | 5 | 1
2 | 286 | 7 | البقرة | Al-Baqara | The Cow | Medinan | 87 | 40
3 | 200 | 293 | آل عمران | Aal-i-Imraan | The Family of Imraan | Medinan | 89 | 20
4 | 176 | 493 | النساء | An-Nisaa | The Women | Medinan | 92 | 24
… | … | … | … | … | … | … | … | …

Table 3.11: Example of Surah index – Tanzil.info
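In the rows shown above, the start column equals the running total of the ayas column of the preceding surahs (0, 7, 7 + 286 = 293, 293 + 200 = 493). The sketch below reproduces that relation and uses it to map a 0-based global ayah position back to a (surah, ayah) pair. The function names are our own, and only the four surahs of the sample are covered.

```python
from bisect import bisect_right
from itertools import accumulate

# "ayas" column of Table 3.11 (first four surahs only).
AYAS = [7, 286, 200, 176]

def start_offsets(ayas):
    """Offset of each surah's first ayah in a flat, 0-based list of ayahs."""
    return [0] + list(accumulate(ayas))[:-1]

def locate(global_index, starts):
    """Return the 1-based (surah, ayah) for a 0-based global ayah index."""
    surah = bisect_right(starts, global_index)  # count of starts <= index
    return surah, global_index - starts[surah - 1] + 1

starts = start_offsets(AYAS)
print(starts)
print(locate(292, starts), locate(293, starts))
```

Storing the offset in the index trades a little redundancy for a binary-search lookup instead of a scan over the surah lengths.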


index | surah | ayah | type
--- | --- | --- | ---
1 | 7 | 206 | recommended
2 | 13 | 15 | recommended
3 | 16 | 50 | recommended
4 | 17 | 109 | recommended
5 | 19 | 58 | recommended
6 | 22 | 18 | recommended
7 | 22 | 77 | recommended
8 | 25 | 60 | recommended
9 | 27 | 26 | recommended
10 | 32 | 15 | obligatory
11 | 38 | 24 | recommended
12 | 41 | 38 | obligatory
13 | 53 | 62 | obligatory
14 | 84 | 21 | recommended

Table 3.12: Sajdah index – Tanzil.info

index | surah | ayah
--- | --- | ---
1 | 1 | 1
2 | 2 | 26
3 | 2 | 44
4 | 2 | 60
5 | 2 | 75
... | ... | ...

Table 3.13: Example of rub’ index – Tanzil.info

3.6.3.5 Boundary-Annotated Qur’an Corpus

This corpus was made at Leeds university; its version 1.0 is reported in [Brierley2012]. It is based on a coarse-grained boundary annotation scheme for Arabic [Brierley2011] derived from Tajwīd stops and starts mark-up in a reputable edition of the Qur’an, and in a widely-used recitation style: ḥafṣ bin ‘āṣim [Sharaf2004]. The current Boundary-Annotated Qur’an dataset contains 77430 words and 8230 sentences, where each word is tagged with prosodic and syntactic information at two coarse-grained levels. This corpus contains the following information:
1. Chapter number
2. Verse number
3. Sentence number
4. Word number
5. Word in Othmani Script
6. Word in Standard Script
7. POS (Part of speech)
8. Symbol of break
9. Tri. Set
10. Binary Set
11. Whether the break is Terminal or not
12. Word-by-word translation to English

Chapter | Verse | Sentence | Word | Othmani | Standard | POS | POS | Symbol | Tri. Set | Bin. Set | Terminal | Translation
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
1 | 1 | 1 | 1 | بِسْمِ | بِسْمِ | N | NOUN | - | - | non-break | - | in-(the)-name
1 | 1 | 1 | 2 | اللَّهِ | ٱللَّهِ | N | NOUN | - | - | non-break | - | (of)-allah
1 | 1 | 1 | 3 | الرَّحْمَنِ | ٱلرَّحْمَٰنِ | N | NOMINAL | - | - | non-break | - | the-most-gracious
1 | 1 | 1 | 4 | الرَّحِيمِ | ٱلرَّحِيمِ | N | NOMINAL | ۞ | || | break | terminal | the-most-merciful
1 | 2 | 1 | 1 | الْحَمْدُ | ٱلْحَمْدُ | N | NOUN | - | - | non-break | - | all-praises-and-thanks
1 | 2 | 1 | 2 | لِلَّهِ | لِلَّهِ | N | NOUN | - | - | non-break | - | (be)-to-allah
1 | 2 | 1 | 3 | رَبِّ | رَبِّ | N | NOUN | - | - | non-break | - | the-lord
1 | 2 | 1 | 4 | الْعَالَمِينَ | ٱلْعَٰلَمِينَ | N | NOUN | ۞ | || | break | terminal | of-the-universe
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...

Table 3.14: Sample of Boundary Annotated Qur’an Corpus
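To show how the binary break labels can be used, the following sketch groups a sequence of words into segments that end at each word labelled break. The function `split_on_breaks` is hypothetical, and the rows are simply the Translation and Bin. Set columns of the sample above; the corpus may distribute its own segmentation differently.

```python
def split_on_breaks(rows):
    """Group (word, binary_label) rows into segments ending at each 'break'."""
    segments, current = [], []
    for word, label in rows:
        current.append(word)
        if label == "break":
            segments.append(current)
            current = []
    if current:  # trailing words without a closing break
        segments.append(current)
    return segments

# Translation and Bin. Set columns of Table 3.14.
ROWS = [
    ("in-(the)-name", "non-break"),
    ("(of)-allah", "non-break"),
    ("the-most-gracious", "non-break"),
    ("the-most-merciful", "break"),
    ("all-praises-and-thanks", "non-break"),
    ("(be)-to-allah", "non-break"),
    ("the-lord", "non-break"),
    ("of-the-universe", "break"),
]

segments = split_on_breaks(ROWS)
print(len(segments), [len(segment) for segment in segments])
```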

3.6.3.6 Qurany Concepts Tool

Qurany Explorer is a comprehensive tool that covers all the themes and concepts mentioned in the Quran. It was made by Mrs. Noorhan Abbas of Leeds university. It is imported from ’Mushaf Al Tajweed’ (مصحف التجويد). The ’Mushaf Al Tajweed’ contains a comprehensive hierarchical index or ontology of nearly 1200 concepts in the Quran. It can be used to identify a precise concept and find the verses which allude to this concept, with higher precision. This index is built as a Python dictionary structure, but it is not available in its raw format; it is available only for on-line browsing, hosted on googleapps with the ID quranytopics (http://quranytopics.appspot.com/) [Web-Qurany]. Qurany topics are named in Arabic and English and distributed over a 4-level hierarchy of topics. The first level contains 15 topics, including:
1. Pillars of Islam (أركان الإسلام)
2. The Call for Allah (الدعوة إلى الله)
3. The Holy Quran (القرآن الكريم)
4. Jihad (الجهاد)
5. Action (Work) (العمل)
6. Man and The Moral Relations (الإنسان والعلاقات الأخلاقية)
7. Man and The Social Relations (الإنسان والعلاقات الاجتماعية)
8. Organizing Financial Relationships (تنظيم العلاقات المالية)
9. Trade, Agriculture, Industry and Hunting (التجارة و الزراعة و الصناعة و الصيد)
10. Judicial Relationships (العلاقات القضائية)
11. General and Political Relationships (العلاقات السياسية والعامة)
12. Science and Art (العلوم والفنون)
13. Religions (الديانات)
14. The Stories and the History (القصص والتاريخ)

Figure 3.7: Preview of Qurany
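Because the raw Qurany index is not published, its exact layout is unknown to us. Purely as an assumption, the sketch below shows what a 4-level topic hierarchy stored as nested Python dictionaries could look like, and how the verses under a topic could be collected. The subtopic names are placeholders and the verse references are invented sample data, not Qurany content.

```python
# Hypothetical layout: three nested dict levels, with a list of
# (surah, ayah) references at the fourth level.
TOPICS = {
    "Pillars of Islam": {
        "placeholder subtopic": {
            "placeholder concept": {
                "placeholder leaf": [(2, 43), (2, 110)],  # invented sample
            },
        },
    },
    "The Holy Quran": {},
}

def iter_leaves(node, path=()):
    """Yield (path, verses) for every verse list found under a nested dict."""
    for name, child in node.items():
        if isinstance(child, dict):
            yield from iter_leaves(child, path + (name,))
        else:
            yield path + (name,), child

def verses_under(tree, topic):
    """Collect the verses of every leaf whose path contains the topic."""
    verses = []
    for path, refs in iter_leaves(tree):
        if topic in path:
            verses.extend(refs)
    return verses

print(verses_under(TOPICS, "Pillars of Islam"))
```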

3.7

Qur’an Ontologies

An ontology formally represents knowledge as a set of concepts within a domain, and the relationships between those concepts. It can be used to reason about the entities within that domain, and may be used to describe the domain. In theory, an ontology is a “formal, explicit specification of a shared conceptualization” [Gruber1993]. An ontology provides a shared vocabulary and taxonomy, which models a domain with the definition of objects and/or concepts, and their properties and relations [Arvidsso2008].

Recently, much research has appeared on ontologies for Qur’anic concepts and related information. We cite two of the most important existing Qur’anic ontologies:

3.7.1

Qur’anic Concepts Ontology

The Qur’anic Ontology is also a project of Leeds university, done by Kais Dukes. It uses knowledge representation to define the key concepts in the Qur’an, and shows the relationships between these concepts using predicate logic. The fundamental concepts in the ontology are based on the knowledge contained in traditional sources of Qur’anic analysis, including the hadīth of the prophet Muhammad and the Tafssir (Qur’anic exegesis) of ibn kathīr. Named entities in verses, such as the names of historic people and places mentioned in the Qur’an, are linked to concepts in the ontology as part of named entity tagging. As well as listing the major concepts in the Qur’an, the ontology also defines a set of semantic relations between these concepts. The most important relation is the set membership relation ”instance”, in which one concept is defined to be an instance or individual member of another group. For example, the relation ”Satan is a jinn” in the ontology represents the knowledge contained in the Qur’an that the individual known as Satan belongs to the set of sentient creations named the jinn. Other concepts in the ontology are grouped into logical categories, according to the properties that they share. For example, the Sun, Earth and Moon are classified under ”Astronomical Body”:

Figure 3.8: A closer look at Qur’an Concepts Ontology
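The ”instance” relation described above can be sketched as a simple mapping from an individual to the group it belongs to. This is only an illustration built from the two examples given in the text; it is not the data model of the actual ontology, and for brevity it treats ”classified under” as the same relation as ”instance”.

```python
# Individual -> group, taken from the examples in the text.
INSTANCE_OF = {
    "Satan": "Jinn",
    "Sun": "Astronomical Body",
    "Earth": "Astronomical Body",
    "Moon": "Astronomical Body",
}

def is_instance(individual, concept):
    """True if the individual is recorded as a member of the concept."""
    return INSTANCE_OF.get(individual) == concept

def members(concept):
    """All individuals recorded under a concept, in alphabetical order."""
    return sorted(name for name, group in INSTANCE_OF.items()
                  if group == concept)

print(is_instance("Satan", "Jinn"), members("Astronomical Body"))
```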

3.7.2

The Ontology made by Hadj Henni

There are several possible ontologies for the contents of the Qur’an. Hadj Henni undertook the construction of a domain ontology of Qur’anic documents by mapping the XML schema of the standard electronic model of the Mus’haf to an ontology description language. This choice is justified by the advantages of this very convenient language. He first performed this mapping automatically, then manually. He found that it is satisfactory and better than the manual mapping. The ontology obtained by the manual mapping is enriched to form the first domain ontology of Qur’anic documents. This ontology meets the needs of standard, modular, and reusable semantics. It contains the domain concepts of Qur’anic documents and the semantic constraints, as a basis for building other ontologies related to the contents of the Qur’an [Hadjhenni2008].

Figure 3.9: Diagram of domain ontology of Qur’anic documents made by Hadj Henni[Hadjhenni2008]

3.8

Qur’anic Search Tools

Here we list some software and websites that offer search in the Quran as one of their main functions. For each one we give its name and a short description covering type and licensing, author, year of creation, features, drawbacks and development status.

3.8.1

Alawfa (‫)الأوفى‬

Alawfa is an Arabic website that offers search in the Quran; it used to offer search in Hadith and Islamic contents as well, before these were dropped recently. It was first made by the masssoft company and published in 2003 (checked with the Wayback Machine). The development of Alawfa stalled many times over its lifetime. It was last revived around January 2012, when its developers added new features and focused only on the Quran. Currently, Alawfa offers only exact word search, partially improved by auto-completion of user keywords and by search within a surah. It searches the non-vocalized Standard Qur’anic text, but displays the vocalized standard text and the Uthmani text as images. Each result includes recitation, Tafssir, translation to many languages, topics classification and a link to locate similar ayahs. Alawfa calculates some statistics, such as the number of words and letters within an Ayah and a Surah. However, those statistics seem badly calculated, since the space is counted as a letter, which is wrong. It offers highlighting, but limited to exact matches. Alawfa is closed source and has been actively developed recently.

Figure 3.10: Preview of Alawfa website: www.alawfa.com

3.8.2

Al-Monaqeb-Alqurany (المنقب القرآني)

Al-Monaqeb-Alqurany is a search-in-Quran web page included in the Arabic website www.holyquran.ne It was first published in 2004 (checked with the Wayback Machine) and has not been actively developed in recent years. It offers a panel of advanced search options that contains many useful options:
• The choice to search with exact words, parts of words or roots.
• An English transliteration. However, this is a self-made, non-standard transliteration. For example, the word سبحان is written as seen-ba-hha-alif-noon.
• Ignore/Consider Alef spelling errors
• Limit the search to an interval of ayahs and surahs
The highlighting is limited to exact matches. Each result includes an English translation and a Tafssir. English is the only translation available, and all available Tafassir are from a Shi’ite source. Al-Monaqeb is closed source and its development is currently inactive.

Figure 3.11: Preview of Al Monaqeb Alqurany
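An option such as ignoring Alef spelling errors is typically implemented by normalizing both the query and the text before comparing them. The sketch below is our own illustration of that idea, not the code of Al-Monaqeb (which is closed source): it strips combining marks and, optionally, maps the hamza, madda and wasla forms of alef to a bare alef.

```python
import unicodedata

# Alef variants mapped to bare alef (U+0627).
ALEF_VARIANTS = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above
    "\u0625": "\u0627",  # alef with hamza below
    "\u0622": "\u0627",  # alef with madda
    "\u0671": "\u0627",  # alef wasla
})

def normalize(text, ignore_alef=True):
    """Remove vocalization and optionally unify alef spellings."""
    text = unicodedata.normalize("NFC", text)
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    return text.translate(ALEF_VARIANTS) if ignore_alef else text

def search(ayahs, keyword, ignore_alef=True):
    """Return the references of ayahs containing the keyword as a word."""
    key = normalize(keyword, ignore_alef)
    return [ref for ref, text in ayahs.items()
            if key in normalize(text, ignore_alef).split()]

AYAHS = {
    (1, 5): "إياك نعبد وإياك نستعين",
    (1, 6): "اهدنا الصراط المستقيم",
}

# The keyword is typed with a bare alef instead of alef with hamza below.
print(search(AYAHS, "اياك"), search(AYAHS, "اياك", ignore_alef=False))
```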

3.8.3

Quran complex search service

This is a search service provided by the official website of the King Fahd Complex for the Printing of the Holy Quran. The website is multilingual. The service was added to the website around 2004. Quran Complex provides only exact word search, matching either all words or some words. Unlike the search tools listed previously, it has many Qur’anic resources to search in, which is a strength. Those resources are:
• Makkan and Medinan
• Quran Grammar Analysis
• Tafassir
• Translations
• Abrogating and abrogated Ayahs
• Gharib al-Quran
• Revelation causes
The problem is that the searches in those resources run completely independently, whereas combining them would be more helpful, for example to show more information in the results of a Quran search, because the current view is minimal: Surah name, Ayah number, Ayah text. The project is closed source and under slow development.

Figure 3.12: Preview of Quran Complex Search page

3.8.4

Quranic Researcher (الباحث القرآني)

Quranic Researcher is a new website that mainly searches in the Quran and Sunnah. It was created in 2011 by the Al-GLWAN company. It is very similar to Al-Monaqeb-Alqurany: it has all the same options, presented in a similar manner. There is a strong possibility that they are based on the same code base. Unfortunately, we could not verify this, since both are closed source. Quranic Researcher has extra features such as printing, saving and sharing options for each result. The project is closed source, free to use but not ad-free. It is under active development, but it no longer seems to focus on the Quran as its main search target.

Figure 3.13: Preview of Quranic Researcher (www.quranicresearcher.com)

3.8.5

Quranologie (‫القرآن‬

‫)علم‬

Quranologie is a website offers the search in Quran as the main function. It was created on 2011 by Abadou Fares as a personal effort. Compared to the previous list search tools, Quranologie has a faster fetching process. It searches only with part of words and that raises some problems when the user want to search a short word that appears broadly as a part of words such as ‫من‬. Quranologie has the auto completion feature and offers a possibility to search within a specific surah. It provides the frequency of query matches as well as the query keywords frequencies one by one. Each result includes revelation place, recitation, new queries links to get to translations, a new query link to get to Tafssir and statistics (words/letters number). A problem with Quranologie is that his query system is very ambiguous because it accepted natural questions but in a very restricted way. That makes the user be confused how to form his questions. Just to notice, the questions in Quranologie has no relation to semantic approach; they are just an human-easy-read search query. The project is closed source and it is still under an active development.


Figure 3.14: Preview of Quranologie (quranologie.com)

3.8.6

Quranic Corpus Word-by-Word Search

We have already presented the Quranic Arabic Corpus as a Qur’an index (see the section Qur’an Indexes, subsection Projects of building indexes). The Quranic Arabic Corpus has a web interface that demonstrates its features. This interface includes a search service, which is what we discuss here. It was created in 2009 (date checked with the Wayback Machine) by Kais Dukes as part of his PhD at the University of Leeds. It focuses on morphological features and, unlike the search tools listed previously, targets Qur’anic words as the unit of search. It supports searching with Arabic words, English words and Buckwalter transliteration. The user can choose whether to search for a word, a substring or an exact phrase. Moreover, the user may search using roots, stems and lemmas, and filter words by their Part-Of-Speech (POS) and/or their pattern. The view of results is minimal: surah id, ayah id, word id, word translation, ayah text. The highlighting is perfect; it works well with vocalization and with English word search. The word-by-word search also has a semantic feature: retrieving semantically related words based on a concept ontology. The interface is closed source, but the Quranic Corpus data is open under the GPL license. The project is still active.

Figure 3.15: Preview of Qur’anic Arabic Corpus Word By Word search

3.8.7

Tanzil (‫)تنزيل‬

Tanzil is a Qur’anic project aimed at providing a highly verified, precise Quran text. It has a web application for browsing the Quran and provides search as a secondary functionality. It was founded in 2007 by Hamid Zarrabi-Zadeh. We discuss only the features related to search. Tanzil offers browsing the Quran by Surah, Juz and Page. As in most search tools, the user can search using an exact word or a part of a word. The user can also partially vocalize keywords, use logical relations (And, Or, Not) and wildcards (? for one letter, * for many letters). Tanzil automatically matches similar letters. For example, the search word نعمت matches both نعمت and نعمة.

Results are shown in a minimal view: Ayah text, Ayah id, Surah name. The highlighting is ideal. Clicking a result leads to a view that has the same form as the Mus-haf, where the position of the ayah in the Mus-haf is obvious. Tanzil is closed source and still under active development.


Figure 3.16: Preview of Tanzil

3.8.8

Zekr (‫)ذكر‬

Zekr is an open source Quranic desktop application. It is an open platform Quran study tool for browsing and researching the Quran. According to the project website[Web-Zekr], Zekr is a Quran-based project, planned to be a universal, open source and cross-platform application that performs most of the usual references to the Quran. The first release was in January 2005, and it was developed by Mohsen Saboorian. It currently has releases for Windows, Linux and Mac, and many contributors work at the different levels of development. Zekr is considered the original source of the Quran translations that are widely used in Qur’anic applications. Zekr has two approaches to search:
• Basic Search: uses Tanzil’s search method, which creates a regular expression from the user’s input query (see Tanzil 3.8.7).
• Advanced Search: uses the Lucene full-text search library. It supports all the sorts of queries Lucene has: wildcards, range, boosting, fuzzy queries, Boolean operators, grouping, etc.

Users can limit the search to a scope of ayahs and can search in translations. Zekr provides many sorting options: relevance, natural order (Mus-haf), revelation order and aya length. Each result contains the translation and recitation in addition to the minimal information: Ayah text, Ayah id and Surah name. The highlighting looks ideal.

Zekr is open source, licensed under GPLv2. It is still under active development. Its main programming language is Java.

Figure 3.17: Preview of Zekr Application

3.9

Conclusion

We concluded in this chapter that the richness of the information contained in the Qur’an leads us to ask how to extract the maximum of that information; the answer might be a sound, adequate and extensible search engine. The limitations of computerized Qur’anic resources can make the implementation of our work more difficult, but we tried to keep our work as independent as possible from Qur’anic resources, especially those whose state is ambiguous. In the next chapter, we present the characteristics of the language of the Qur’an: classical Arabic.


Part II Analysis & Conception


Chapter 4 Classification & Proposition of Qur’anic Search Features

Computers are useless. They can only give you answers. – Pablo Picasso

4.1

Introduction

Due to the limitations of search in the Quran, the need for a more practical query method is evident. Our proposal is to design an information retrieval system that fits the specific needs of the Quran. To realize this objective, we should first list and classify all the search features that are possible and helpful; this chapter is written to address that point. It contains a list of all the search features that we have collected and a classification based on the nature of each feature. The chapter also contains the results of an online survey that we conducted to assess the usability and usefulness of each feature.

4.2

Difficulties of Search in Quran

To clarify the problem addressed by this thesis, we describe the challenges that face search in the Quran:
• First, as a general search need;
• Second, as an Arabic search challenge;

• Third, with the Quran as a special source of information.

We start with the first point. Search in the Quran has, in theory, the same challenges as search in any other document. Document search has passed through different phases in its evolution. At the beginning, search was sequential and based on an exact keyword, before regular expressions were introduced. Full-text search was invented to avoid the limitations of sequential search on huge documents. Full-text search introduces new mechanisms for text analysis, including tokenization, normalization and stemming. Gathering statistics is now a part of the search process; it helps to improve the ordering of results and the suggestions. After the rise of the semantic web, search is heading towards a semantic approach that aims to improve search accuracy by understanding the searcher’s intent and the contextual meaning of terms as they appear in the searchable dataspace, in order to generate more relevant results. To improve the user experience, search engines try to improve the way results are shown: sorting them by their relevance in the documents, offering more sorting criteria, highlighting the keywords, pagination, filtering and expanding. Moreover, they improve query input by introducing different input methods (e.g. vocal input) and suggesting related keywords. Until now, most of these features have not been implemented for the Quran, and many of them need to be customized to fit the properties of Arabic, as explained in the next point.

Secondly, the language of the Quran is considered to be classical Arabic[?]. Arabic is a specific language because of its morphology and orthography, and this must be taken into consideration in the text analysis phases: for instance, letter shaping (especially the Hamza -ء-), vocalization, the different levels of stemming and the types of derivation. It must also be taken into consideration in the search features. For example, regular expressions handle Arabic text badly, since they do not distinguish vocalization diacritics from letters. The absence of vocalization causes some ambiguities [Albawwab2009] in understanding words:
• The word الملك could be understood as المَلِك (the king), المَلَك (the angel) or المُلْك (the kingdom).
• The word وعد could be understood as وَعَدَ (*he* promised) or وَ+عَدَّ (and + *he* counted).
• The word وله could be understood as وَ+لَ+هُ (and + he + has), وَلَهَ (admire) or وَلِّ+ه (crown him).
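The interaction between diacritics and pattern matching can be illustrated with a minimal Python sketch. It is not part of the proposed system: the helper name `strip_diacritics` and the naive pattern are assumptions made for the illustration, and only the standard library is used. The sketch relies on the fact that Arabic vocalization marks are separate combining code points.

```python
import re
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop combining marks (fatha, damma, kasra, sukun, shadda, ...)."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))


# The three vocalized readings of the same skeleton, as explicit code points.
AL_MALIK = "\u0627\u0644\u0645\u064e\u0644\u0650\u0643"  # the king
AL_MALAK = "\u0627\u0644\u0645\u064e\u0644\u064e\u0643"  # the angel
AL_MULK = "\u0627\u0644\u0645\u064f\u0644\u0652\u0643"   # the kingdom
SKELETON = "\u0627\u0644\u0645\u0644\u0643"              # unvocalized form

# Naive pattern: alif, lam, any ONE character, lam, kaf.
NAIVE = re.compile("\u0627\u0644.\u0644\u0643")

for word in (AL_MALIK, AL_MALAK, AL_MULK):
    # Fails on vocalized text: every diacritic counts as a character.
    print(NAIVE.fullmatch(word) is not None)                    # False
    # Succeeds once the diacritics are stripped.
    print(NAIVE.fullmatch(strip_diacritics(word)) is not None)  # True
```

All three readings collapse to the same unvocalized skeleton, which is exactly the ambiguity described above: stripping diacritics makes matching easy but loses the distinction between the meanings.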

Trying to resolve these problems as a generic Arabic problem is really hard, since Arabic lacks the linguistic resources needed to build strict lexical analyzers. By contrast, the Quran has a limited number of words, which means that it is possible to write morphological indexes manually and use them in place of lexical analyzers.

Finally, we explain the specific challenges that arise from the particular characteristics of the Quran. El-Mus-haf, the book of the Quran, is written in the Uthmani script. This script is full of recitation marks and spells some words differently from the standard way. For example, the word بسطة is spelled بصطة in Uthmani. The Uthmani script requires its specifications to be considered in the text analysis phases: normalization and stemming. The Quran is structured in many analytic levels[BenJammaa]:
• Main structure: Sura, Aya, word, letter;
• Special emplacements: Sura’s start, Sura’s end, Aya’s end, Sajdah, Waqf, Facilah;
• Mushaf structure: page, thumn, rubu’, nisf, hizb, juz’;
• Quran writing: sawāmit, harakāt, hamza, diacritics, signs of distinction between similar letters, phonetic signs;
• Incorporeal structure: word, keyword, expression, objective unit;
• Revelation: order, place, calendar, cause, context, etc.

Users may need to search, filter results or group them based on one of those structures. There are also many sciences related to the Quran, called Quranic Sciences: Tafsir, Translation, Recitation, Similitude, Abrogation, etc. Next, we propose an initial classification of the search features, inspired by the problem points above.

4.3

Classification

To make the listing of search features easier, we classified them into several classes based on their objectives.
1. Advanced Query: this class contains the modifications to the simple query that give the user the ability to formulate a query precisely. For example: phrase search, logical relations, jokers.
2. Output Improvement: these features improve the results before showing them to users. The results must pass through many phases: scoring, sorting, pagination, highlighting, etc.
3. Suggestion Systems: this class contains all the options that offer suggestions to help users correct or extend the results by improving their queries. For example, suggesting corrections for misspelled keywords or suggesting related words.
4. Linguistic Aspects: this covers all the features related to linguistic aspects, such as stemming, selection and filtering of stop words, and normalization.
5. Quranic Options: these are related to the properties of the book and the information it includes. As mentioned in the problem statement, the book of the Quran (al-mushaf) is written in Uthmani script, full of diacritical symbols, and is structured in many ways.
6. Semantic Queries: the semantic approach allows users to pose their queries in natural language and implicitly get more relevant results.
7. Statistical System: this class covers all the statistical needs of users. Statistics are helpful in classifying words and ayahs, and can also be used to verify the validity of the facts designated as numerical miracles. An example of such statistics is finding the most frequent word.

This is an initial classification; we have to improve it to make good use of all possible search features.

4.4

Proposals

In this section, we list all possible search features based on the classification mentioned above. Each of these features expresses a search need: general, related to Arabic or related to the Quran. We have collected the basic ideas from:
• Classic & semantic search engines: Google;
• Arabic search engines: Taya it;
• Quranic search tools: Zekr application, al-monaqeb alqurany;
• Indexing/search programming libraries: Whoosh, Lucene;
• Quranic paper lexicons: The indexed mu’jam of words of the Holy Quran (المعجم المفهرس لألفاظ القرآن الكريم) by Mohammed Fouad Abd El-bāki.

We have adapted those ideas to fit the context of Arabic and the Quran. Many features are totally new; we propose them to fulfill a search need or to resolve a specific problem. In addition to simple search, these are our proposals:

4.4.1

Advanced Query

1. Fielded search: uses a field name in the query to search in a specific field. Helpful for searching more varied information such as surah names, topics, page numbers, etc. e.g. surah_name = الفاتحة, to limit the search to Al-Fatiha.

2. Logical relations: used to force the existence or absence of a keyword. The best-known relations are AND for conjunction, OR for disjunction and NOT for exclusion. The relations can be grouped using parentheses. e.g. (الصلاة AND_NOT الزكاة) AND sura_name=البقرة, to find the ayahs in Al-baqara that contain the word الصلاة (Prayer), excluding those that contain the word الزكاة (Alms).

3. Phrase search: a type of search that allows users to search for documents containing an exact sentence or phrase. e.g. “الحمد لله”, to match the words if and only if they are adjacent, in the same form and in the same order (this eliminates “لله الحمد”).

4. Interval search: used to search within an interval of values in a numerical field. Helpful for fields like Ayah ID, Page, Hizb and statistical fields. e.g. Ayah_number:[1 to 5], to retrieve the first five ayahs of each surah.

5. Full Regex: regular expressions provide a concise and flexible means to “match” (specify and recognize) strings of text, such as particular characters, words or patterns of characters. One issue with regular expressions is that they are very complicated for a normal user to manipulate. e.g. م[نا], to retrieve either من or ما, because the square brackets mean selecting one and only one of the letters inside them.

6. Wildcards (Jokers): used to search for a set of words that share some letters. It is a simpler way to avoid the complexity of regular expressions. This feature can be used to search for a part of a word. In Latin script, two jokers are widely used: ? replaces one letter and * replaces an undefined number of letters. These jokers are inefficient in Arabic because of the vocalization symbols, which are not letters, and the Hamza (ء) letter, which has different forms (different Unicode code points). e.g. ب؟طة, to generate بسطة, بصطة; *نبي*, to generate نبي, الأنبياء, النبيين, نبيّنه, etc.

7. Boosting keywords weight: used to boost up or down the relevance factor of any keyword in order to affect the sorting of results. e.g. سميع عليم^0.5 بصير^2, boosting up بصير (all seeing) and boosting down عليم (all knowing), so that “سميع بصير” (all hearing, all seeing) is preferred in sorting over “سميع عليم” (all hearing, all knowing).
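The wildcard feature and its two Arabic-specific obstacles (diacritics and Hamza forms) can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of any of the tools cited above: the helpers `normalize`, `joker_to_regex` and `joker_search` are hypothetical names, and folding every Hamza carrier onto a bare form is one possible normalization choice among others.

```python
import re
import unicodedata

# Assumption: hamza carriers are folded onto a bare form before matching.
HAMZA_FOLD = str.maketrans({"أ": "ا", "إ": "ا", "آ": "ا", "ؤ": "ء", "ئ": "ء"})


def normalize(text: str) -> str:
    """Strip vocalization marks, then fold hamza variants."""
    bare = "".join(ch for ch in text if not unicodedata.combining(ch))
    return bare.translate(HAMZA_FOLD)


def joker_to_regex(pattern: str):
    """Translate a joker pattern into a compiled regular expression."""
    parts = []
    for ch in normalize(pattern):
        if ch in ("?", "\u061f"):   # Latin or Arabic question mark: one letter
            parts.append(".")
        elif ch == "*":             # any number of letters
            parts.append(".*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def joker_search(pattern: str, words):
    """Return the words whose normalized form matches the joker pattern."""
    regex = joker_to_regex(pattern)
    return [w for w in words if regex.fullmatch(normalize(w))]


WORDS = ["بسطة", "بصطة", "نبي", "الأنبياء", "النبيين", "نبي\u0651نه", "الحمد"]
print(joker_search("ب؟طة", WORDS))   # the two spellings of the first example
print(joker_search("*نبي*", WORDS))  # the four words sharing the letters of the second
```

Matching is done on normalized copies while the original words are returned, so the vocalized or Uthmani spelling is preserved in the results.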

4.4.2

Output Improvements

1. Pagination: dividing the results into pages, e.g. 10, 20 or 50 results per page. This simplifies the display of results and reduces the transfer time.

Figure 4.1: Pages view in Google.com

2. Scoring: assigns to each result a calculated score that represents the relevance of that result to the search query. This score is used for sorting and/or filtering.

3. Sorting: sorting the results helps users to browse them in any order they want. Results can be sorted by various criteria, such as relevance to the query, natural order (Mus-haf), revelation order, numerical order of numeric fields, and alphabetical (ألفبائي) or Abjad (أبجدي) order of alphabetic fields.

4. Keywords Highlight: used to emphasize the searched keywords in the Ayah by adding styling tags, in order to make them more noticeable than other words. e.g. the keyword الحمد wrapped in styling tags (reading left to right).

Figure 4.2: Highlight the keyword وحيدا (alone) in Ayah 11 of al-modather – alfanous.org

5. Real time output: used to avoid user waiting time by showing results as soon as they are retrieved. This is useful for long-running queries.

6. Results grouping: offers a better exploration of results, because the ayah is not always the best display unit for users; sometimes grouping ayahs by surah, topic or another criterion is more useful. The criteria that can be used for grouping are: by surahs, by topics, by Tafssir dependency, by revelation events, by allegorical ayahs or by Qur’anic parables.

7. Uthmani script with full diacritical marks: the Quran text is written in the mushaf in Uthmani script with full diacritical marks, which are helpful for better reading and recitation of the Quran. Ignoring those marks raises ambiguities in recitation; for example, boundary (stop) marks clarify where the reciter cannot stop, may stop or must stop. e.g. [۞ لَّقَدْ كَانَ فِى يُوسُفَ وَإِخْوَتِهِۦٓ ءَايَٰتٌ لِّلسَّآئِلِينَ]

Figure 4.3: Ayah in full diacritical marks - quran.com
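The output phases listed in this subsection (scoring, sorting, pagination, highlighting) can be chained as in the minimal sketch below. The scoring formula (keyword occurrences divided by ayah length) and the `<b>` tags are assumptions chosen for the illustration; a real engine would use a more elaborate relevance model. The sample data is the unvocalized text of the first three ayahs of Al-Fatiha.

```python
# (surah, ayah, unvocalized text) for the first three ayahs of Al-Fatiha.
AYAHS = [
    (1, 1, "بسم الله الرحمن الرحيم"),
    (1, 2, "الحمد لله رب العالمين"),
    (1, 3, "الرحمن الرحيم"),
]


def score(text, keyword):
    """Toy relevance: share of the ayah's words that equal the keyword."""
    words = text.split()
    return words.count(keyword) / len(words)


def search(keyword, sort_by="relevance"):
    """Return (surah, ayah, text, score) tuples for ayahs containing keyword."""
    hits = [(s, a, t, score(t, keyword))
            for s, a, t in AYAHS if keyword in t.split()]
    if sort_by == "relevance":
        hits.sort(key=lambda h: -h[3])
    else:  # natural (Mus-haf) order
        hits.sort(key=lambda h: (h[0], h[1]))
    return hits


def paginate(hits, page, per_page=10):
    """Return the slice of hits shown on the given 1-based page."""
    start = (page - 1) * per_page
    return hits[start:start + per_page]


def highlight(text, keyword, open_tag="<b>", close_tag="</b>"):
    """Wrap each occurrence of the keyword in styling tags."""
    return " ".join(open_tag + w + close_tag if w == keyword else w
                    for w in text.split())


keyword = "الرحمن"
for surah, ayah, text, relevance in paginate(search(keyword), page=1):
    print(surah, ayah, round(relevance, 2), highlight(text, keyword))
```

With this toy score, the shorter ayah (1:3) ranks above 1:1 by relevance, while the natural order returns them in Mus-haf order.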

4.4.3

Suggestion Systems

1. Vocalized spell correction: offers alternatives for keywords that are misspelled or that appear in a different form in the Quran. e.g. suggest for أَبراهَام (Abraham): إبْرَاهِيم (Ibrahim), as mentioned in the Qur’an.

Figure 4.4: Query spell correction in Google.com

2. Semantically related keywords (ontology-based): used as hints to help users find more keywords that may interest them, generally based on domain ontologies. e.g. suggest for يعقوب (Yaqub - Jacob): إسرائيل (Israel, another name of Yaqub), يوسف (Yusuf, a son of Yaqub), الأسباط (Asbat, sons of Yaqub), نبي (Prophet, what Yaqub is).


Figure 4.5: Focusing on ”Yaqub” in Ontology of concepts of corpus.quran.com

3. Different vocalization: suggests all the possible vocalizations, sorted by the frequency of each one. Helpful to refine ambiguous queries and make them more accurate. e.g. suggest for الملك (Almlk): المَلِك (the king, Almalik), المُلْك (the kingdom, Almulk), المَلَك (the angel, Almalak).

4. Collocated words: provides a completion based on collocation statistics. Also helpful to refine queries and make them more accurate. e.g. suggest for سميع (all hearing): سميع عليم (all hearing, all knowing), سميع بصير (all hearing, all seeing).


Figure 4.6: Related searches suggestion in Google.com

5. Keyboard mapping: replaces each Latin letter with the Arabic letter that shares the same key on the keyboard. The user’s keyboard layout must be detected to perform a perfect mapping. This feature helps when the user forgets to switch the keyboard language. It is used in the Google search engine, but in the inverse direction: Arabic to English. e.g. suggest for fsl: بسم (f → ب, s → س, l → م).

Figure 4.7: Keyboard mapping Arabic to English in Google.com
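The keyboard mapping of item 5 amounts to a character-by-character lookup. The sketch below is a hedged illustration: the table is deliberately partial (home row only), it assumes a QWERTY keyboard paired with the common Arabic (101) layout, and the names `LATIN_TO_ARABIC` and `map_keyboard` are hypothetical. Other layouts would need other tables, which is why layout detection matters.

```python
# Partial table (home row only) for QWERTY -> Arabic (101) layout.
# Assumption: this pairing of layouts; a real system must detect the layout.
LATIN_TO_ARABIC = {
    "a": "ش", "s": "س", "d": "ي", "f": "ب", "g": "ل",
    "h": "ا", "j": "ت", "k": "ن", "l": "م",
}


def map_keyboard(text: str) -> str:
    """Replace each known Latin key with the Arabic letter on the same key."""
    return "".join(LATIN_TO_ARABIC.get(ch, ch) for ch in text.lower())


print(map_keyboard("fsl"))  # بسم
```

Characters without an entry in the table are left unchanged, so a query that mixes mapped and unmapped keys degrades gracefully instead of failing.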

6. Different significations: used to limit a keyword to only one of its meanings. Helpful to make the query more precise. e.g. suggest for رب: 1st meaning (god - إله), 2nd meaning (master - سيد).

4.4.4

Linguistic Aspects

1. Romanization: also called English transliteration. It is needed mainly by non-Arabic speakers, especially those who are not familiar with Arabic letters and keyboards. There are many transliterations from Arabic to English; the best known are the international standard ISO233, used in scientific papers; Buckwalter transliteration, used in NLP; Arabtex, used in LaTeX; and Arabizi, used informally in forums, chat discussions, etc. e.g. خَليفَة (Caliph1) could be written as ẖalīfaẗ (ISO233), xalyfap (Buckwalter), _halyfaT (Arabtex), khalifa or khalifah or khaleefa or 5alifa or 7’alifa (Arabizi).

Figure 4.8: Different romanizations for the word خَليفَة

Figure 4.9: Different transliterations used in ElixirFM Resolve Online
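Of the schemes above, Buckwalter transliteration is the simplest to sketch because it is a one-to-one character mapping. The following Python illustration builds the table from the contiguous Unicode ranges of the Arabic block; the function names are assumptions, and the table covers only the core letters and diacritics, not the extended Qur’anic marks.

```python
def _build_table():
    """Map Arabic code points to their Buckwalter ASCII symbols."""
    table = {}
    # U+0621..U+063A: hamza forms, alif, ba ... ghain.
    for offset, latin in enumerate("'|>&<}AbptvjHxd*rzs$SDTZEg"):
        table[0x0621 + offset] = latin
    table[0x0640] = "_"  # tatweel
    # U+0641..U+0652: fa ... ya, then tanwin, short vowels, shadda, sukun.
    for offset, latin in enumerate("fqklmnhwYyFNKaui~o"):
        table[0x0641 + offset] = latin
    return table


BUCKWALTER = _build_table()


def to_buckwalter(text: str) -> str:
    """Transliterate Arabic text; unknown characters pass through unchanged."""
    return text.translate(BUCKWALTER)


# The vocalized word from the example above, as explicit code points.
print(to_buckwalter("\u062e\u064e\u0644\u064a\u0641\u064e\u0629"))  # xalyfap
```

Because the mapping is one-to-one, the reverse direction (Buckwalter to Arabic) can be obtained by inverting the same table, which is what lets non-Arabic keyboards be used for input.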

2. Syntactic Coloration: coloring the parts of words based on Part-Of-Speech is helpful for linguists when browsing the ayahs.

Figure 4.10: Syntactic Coloration of Basmalah – bayt-al-hikma.com

1 Also: Successor, Ruler

3. Vocal Search: this feature targets blind people, to help them search easily. It is also helpful on smartphones, where keyboards are difficult to use. However, voice recognition of Arabic is not easy at all.

Figure 4.11: Google Voice Search on Android

4. Partial vocalization search: gives the user the opportunity to specify just some diacritics. Helpful, for example, to avoid the last diacritic, because it varies with the position of the word in the phrase. e.g. مَلك to find مَلِك, مَلَك … and ignore مُلْك.

5. Multi-level derivation: helpful for locating all or some of the derivation forms of a word. Arabic word derivations can be divided into the following four levels: exact word (فأسقيناكموه - “so We give it to you to drink of”), lemma (أسقينا - We give drink), stem (أسقى - give drink), root (سقي). e.g. (word: أسقينا, level: lemma) to find فَأَسْقَيْنَاكُمُوهُ, لَأَسْقَيْنَاهُمْ, وَأَسْقَيْنَاكُمْ.

6. Specific derivations: a specialization of the previous feature. Since Arabic is fully inflectional, it has many derivation operations. e.g. conjugation in the past tense of قال (to say), to find قالت (*she* said), قال (*he* said), قالوا (*they* said, masculine), قلن (*they* said, feminine), etc.

7. Word properties embedded query: offers a smart way to handle word families by filtering on a set of properties such as root, lemma, type, part-of-speech, verb mood, verb form, gender, person, number, voice, etc. e.g. {root: ملك, type: noun, number: singular}, to retrieve all the words of the root ملك family that are nouns in the singular form.

8. Fuzzy string search: helpful especially for letters that are often misspelled, like Hamza (ء). The shape of the Hamza is hard to write, since it depends on the vocalization of both the Hamza and the letter preceding it. e.g. مءصدة may replace مؤصدة (ignore the Hamza form), ضحي may replace ضحى, جنه may replace جنة.

9. Word linguistic annotations: attaching linguistic annotations to each word of the results (ayahs) helps linguistics students to get information about each word easily, word by word. The annotation contains information such as root, lemma, mood, aspect, gender, number, voice, case and so on.

• Surah: الانشقاق, Ayah: 1 [إِذَا السَّمَاءُ انْشَقَّتْ]
Translation: When the sky is split asunder,
◦ إِذَا (idhā, when): time adverb;
◦ السَّمَاءُ (l-samāu, the sky): nominative feminine noun;
◦ انشقت (inshaqqat, is split asunder): 3rd person feminine singular (form VII) perfect verb.

Figure 4.12: Annotations shown by Quranic Arabic Corpus website

10. Linguistic examples search: helps Arabic language scholars to easily find linguistic examples that appear in the Quran text, for example rhetorical deletion (الحذف البلاغي) and grammatical shift (الالتفات).

11. Uthmani writing way: offers the user the ability to write words either as they appear in the Uthmani Mus’haf or as they are written in the standard way. e.g. بصطة replaces بسطة, نعمت replaces نعمة.

4.4.5

Qur’anic Options

1. Structural options: the Quran is divided into parts (Juz), each part into Hizbs, each Hizb into halves, and so on down to Ayahs. There are also other structures of the Qur’an, such as pages. This feature is useful for limiting the search interval. e.g.1 page:1, to limit the search to the first page. e.g.2 Hizb:60, to limit the search to the last hizb. e.g.3 Juz:عمّ, to limit the search to the last Juz (part).

2. Recitation marks retrieving: helpful for Tajweed scholars who seek examples. e.g.1 Saktah: YES, to search for ayahs that contain a saktah mark. e.g.2 Sajdah_type: Obligatory, to search for all ayahs where there is an obligatory sajdah.

3. Divine Names Highlight: a similar feature is used in many mus’hafs nowadays; it aims to show Allah’s name clearly. This includes the different writings of the word “Allah” and may include any pronoun or word that surely refers to Allah.

Figure 4.13: Divine names Highlight in Quran Reader iPhone application


4. Translation embedded query: helps users to search using word translations in other languages instead of the native Arabic text. e.g. { text: mercy, lang: english, author: shekir }, to search for the word mercy in the English translation of the Quran made by shekir (author).

5. Repetitions and Allegorical ayahs (التكرار والمتشابهات): enables Qur’an scholars to retrieve similar ayahs, whether they are identical or roughly similar, and whether similar in writing or in meaning.

• Repetitions of {الرحمن, 13} [فَبِأَيِّ آلَاءِ رَبِّكُمَا تُكَذِّبَانِ]
There are 31 exact similitudes in surah الرحمن, ayah numbers: 13, 16, 18, 21, 23, 25, 28, 30, 32, 34, 36, 38, 40, 42, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77.

6. Abrogators and Abrogated ayahs search (الناسخ والمنسوخ): knowing abrogations helps in understanding Islamic laws, and it is helpful to search for abrogation cases, indicating the proponents and opponents of each case.

• Abrogators of {النساء, 43}: [يَا أَيُّهَا الَّذِينَ آمَنُوا لَا تَقْرَبُوا الصَّلَاةَ وَأَنْتُمْ سُكَارَى حَتَّى تَعْلَمُوا مَا تَقُولُونَ...]
Translation: O you who believe! do not go near prayer when you are Intoxicated until you know (well) what you say, ...
◦ Abrogator #1: {المائدة, 90} [يَا أَيُّهَا الَّذِينَ آمَنُوا إِنَّمَا الْخَمْرُ وَالْمَيْسِرُ وَالْأَنْصَابُ وَالْأَزْلَامُ رِجْسٌ مِنْ عَمَلِ الشَّيْطَانِ فَاجْتَنِبُوهُ لَعَلَّكُمْ تُفْلِحُونَ]
Translation: O you who believe! intoxicants and games of chance and (sacrificing to) stones set up and (dividing by) arrows are only an uncleanness, the Shaitan’s work; shun it therefore that you may be successful.
Proponents: النحاس، مكي بن أبي طالب، ابن الجوزي، مصطفى زيد
Opponents: السيوطي، الدهلوي، الزرقاني

7. Qur’anic Parables (الأمثال القرآنية): helps Quran scholars to look for parables in the Quran.

• Parable-in-Sura: البقرة
Surah: البقرة, Ayah: 17 [مَثَلُهُمْ كَمَثَلِ الَّذِي اسْتَوْقَدَ نَارًا فَلَمَّا أَضَاءَتْ مَا حَوْلَهُ ذَهَبَ اللَّهُ بِنُورِهِمْ وَتَرَكَهُمْ فِي ظُلُمَاتٍ لَا يُبْصِرُونَ]
Translation: Their parable is like the parable of one who kindled a fire, but when it had illumined all around him, Allah took away their light, and left them in utter darkness– they do not see.
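As an illustration of the repetition feature (item 5 above), exact repetitions can be found by grouping ayahs on their text. The sketch below is a minimal example and not the proposed implementation: `find_repetitions` is a hypothetical helper, only exact textual identity is handled (roughly similar ayahs would need a similarity measure), and the sample is limited to the unvocalized text of ayahs 13 to 16 of surah الرحمن.

```python
from collections import defaultdict

REFRAIN = "فبأي آلاء ربكما تكذبان"

# (surah, ayah, unvocalized text) for Ar-Rahman 13-16.
AYAHS = [
    (55, 13, REFRAIN),
    (55, 14, "خلق الإنسان من صلصال كالفخار"),
    (55, 15, "وخلق الجان من مارج من نار"),
    (55, 16, REFRAIN),
]


def find_repetitions(ayahs):
    """Group ayah references by identical text; keep groups of two or more."""
    groups = defaultdict(list)
    for surah, ayah, text in ayahs:
        key = " ".join(text.split())  # collapse whitespace differences
        groups[key].append((surah, ayah))
    return {text: refs for text, refs in groups.items() if len(refs) > 1}


for text, refs in find_repetitions(AYAHS).items():
    print(text, refs)
```

Run over the whole surah, the same grouping would be expected to return the 31 references listed above for this refrain.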

4.4.6

Semantic Queries

1. Semantically related words: extends a keyword to a set of keywords that are semantically related. There are several kinds of semantic relations: Synonymy, Antonymy, Hypernymy, Hyponymy, Meronymy, Holonymy, Troponymy. This helps to achieve the other semantic features and to answer some user questions.
• Synonym(جنة) to search for جنة, نعيم, فردوس…
• Antonym(جنة) to search for جحيم, سعير, جهنم, سقر…
• Hyponym(جنة) to search for عدن, فردوس…

2. Faceted Thematic Search: offers browsing of the verses by themes structured hierarchically on many levels.

Figure 4.14: Faceted Thematic Browsing - Qurany Project

3. Question Answering (QA): offers the option of forming the search query as a natural Arabic question. The most used Arabic question words are: هل؟ (is it?), من؟ (who?), ما؟ (what?), أين؟ (where?), متى؟ (when?), لم؟ (why?), أيّ؟ (which?), لمن؟ (whose?), كيف؟ (how?), كم؟ (how much/often?).

• Who are the prophets? (من هم الأنبياء؟)
Surah: النساء, Ayah: 163 [إِنَّا أَوْحَيْنَا إِلَيْكَ كَمَا أَوْحَيْنَا إِلَى نُوحٍ وَالنَّبِيِّينَ مِنْ بَعْدِهِ وَأَوْحَيْنَا إِلَى إِبْرَاهِيمَ وَإِسْمَاعِيلَ وَإِسْحَاقَ وَيَعْقُوبَ وَالْأَسْبَاطِ وَعِيسَى وَأَيُّوبَ وَيُونُسَ وَهَارُونَ وَسُلَيْمَانَ وَآتَيْنَا دَاوُودَ زَبُورًا]
Translation: Surely We have revealed to you as We revealed to Nuh, and the prophets after him, and We revealed to Ibrahim and Ismail and Ishaq and Yaqoub and the tribes, and Isa and Ayub and Yunus and Haroun and Sulaiman and We gave to Dawood

• What is Al-hottamat? (ما هي الحطمة؟)
Surah: الهمزة, Ayah: 6 [نَارُ اللَّهِ الْمُوقَدَةُ]
Translation: It is the fire kindled by Allah

• Where was Rome defeated? (أين غلبت/هزمت الروم؟)
Surah: الروم, Ayah: 3 [فِي أَدْنَى الْأَرْضِ وَهُمْ مِنْ بَعْدِ غَلَبِهِمْ سَيَغْلِبُونَ]
Translation: In a near land, and they, after being vanquished, shall overcome,

• How long did the People of the Cave stay? (كم مكث أصحاب الكهف؟)
Surah: الكهف, Ayah: 25 [وَلَبِثُوا فِي كَهْفِهِمْ ثَلَاثَ مِائَةٍ سِنِينَ وَازْدَادُوا تِسْعًا]
Translation: And they remained in their cave three hundred years and (some) add (another) nine.

• When is the Day of Resurrection? (متى يوم القيامة؟)
Surah: الأحزاب, Ayah: 63 [يَسْأَلُكَ النَّاسُ عَنِ السَّاعَةِ قُلْ إِنَّمَا عِلْمُهَا عِنْدَ اللَّهِ وَمَا يُدْرِيكَ لَعَلَّ السَّاعَةَ تَكُونُ قَرِيبًا]
Translation: Men ask you about the hour; say: The knowledge of it is only with Allah, and what will make you comprehend that the hour may be nigh.

• How is the embryo formed? (كيف يتشكّل الجنين؟)
Surah: المؤمنون, Ayah: 14 [ثُمَّ خَلَقْنَا النُّطْفَةَ عَلَقَةً فَخَلَقْنَا الْعَلَقَةَ مُضْغَةً فَخَلَقْنَا الْمُضْغَةَ عِظَامًا فَكَسَوْنَا الْعِظَامَ لَحْمًا ثُمَّ أَنْشَأْنَاهُ خَلْقًا آخَرَ فَتَبَارَكَ اللَّهُ أَحْسَنُ الْخَالِقِينَ]
Translation: Then We made the seed a clot, then We made the clot a lump of flesh, then We made (in) the lump of flesh bones, then We clothed the bones with flesh, then We caused it to grow into another creation, so blessed be Allah, the best of the creators.

4. Automatic vocalization: the absence of diacritics leads to the ambiguities that we mentioned in the problem statement. This feature helps to overcome these ambiguities by trying to resolve them before executing the search process. e.g. the phrase “رسول من الله” becomes “رَسُول مِنَ اللهِ” after auto-vocalization.

5. Entity Extraction: helps to retrieve persons, dates, times, distances, numbers and places, even if they appear in a different format. This is important for achieving the Question Answering feature.
• Extract ثلاث مائة سنين وازدادوا تسعا (three hundred years and exceeded by nine) as (Time/number, 309)
• Extract ببكة (at Bakkah) as (place, Mekka / مكة)
• Extract كلمح البصر (as the twinkling of an eye) as (time unit, ??)
• Extract مثقال ذرة (atom’s weight) as (Size unit, ??)
• Extract يا أيها النبي (O Prophet!) as (person, Mohammad)

6. Proper nouns search (co-reference resolution): many proper nouns are mentioned explicitly in the Qur’an, but some are only referred to implicitly. This feature searches for proper nouns however they are mentioned, explicitly or implicitly.

• Benyamin (بنيامين)?
Surah: يوسف, Ayah: 8 [إِذْ قَالُوا لَيُوسُفُ وَأَخُوهُ أَحَبُّ إِلَى أَبِينَا مِنَّا وَنَحْنُ عُصْبَةٌ إِنَّ أَبَانَا لَفِي ضَلَالٍ مُبِينٍ]
Translation: When they said: Certainly Yusuf and his brother are dearer to our father than we, though we are a (stronger) company; most surely our father is in manifest error

• Abu-Bakr (أبو بكر/الصديق)?
Surah: التوبة, Ayah: 40 [...ثَانِيَ اثْنَيْنِ إِذْ هُمَا فِي الْغَارِ إِذْ يَقُولُ لِصَاحِبِهِ لَا تَحْزَنْ إِنَّ اللَّهَ مَعَنَا...]
Translation: ... he being the second of the two, when they were both in the cave, when he said to his companion: Grieve not, surely Allah is with us ...
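The first semantic feature, extending a keyword through semantic relations, can be sketched as a lookup followed by a disjunctive query. The relation table below is a toy built only from the examples given in item 1; a real system would load such relations from a domain ontology. The names `RELATIONS`, `expand` and `to_or_query` are hypothetical, and the OR syntax simply mirrors the logical relations proposed earlier.

```python
JANNAH = "جنة"

# Toy relation table taken from the examples above (not a real ontology).
RELATIONS = {
    (JANNAH, "synonym"): [JANNAH, "نعيم", "فردوس"],
    (JANNAH, "antonym"): ["جحيم", "سعير", "جهنم", "سقر"],
    (JANNAH, "hyponym"): ["عدن", "فردوس"],
}


def expand(keyword, relation):
    """Return the related keywords, or the keyword itself if none are known."""
    return RELATIONS.get((keyword, relation), [keyword])


def to_or_query(keywords):
    """Join the expanded keywords into a disjunctive query string."""
    return " OR ".join(keywords)


print(to_or_query(expand(JANNAH, "synonym")))
print(to_or_query(expand(JANNAH, "antonym")))
```

Falling back to the original keyword when no relation is known keeps the expansion safe to apply to every query term.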

4.4.7 Statistical System

1. Unvocalized word frequency: this feature gathers the frequency of each word per ayah, per surah and per hizb.
• How many occurrences of the word "الله" are there in Surah "المجادلة"?
• What are the ten most frequently cited words in the whole Qur’an?

2. Vocalized word frequency: the same as the previous feature but taking diacritics into account. This makes the difference, for instance, between مِنْ (from, particle), مَنْ (who, noun) and مَنَّ (reproach, verb).

3. Root/Stem/Lemma frequency: the same as the previous feature but using the Root, Stem or Lemma as the unit of statistics instead of the vocalized word.
• How many times are the word "Sea/بحر" and its derivations mentioned in the whole Qur’an? The word "Seas/بحار" will also be counted.

4. Other Qur’anic unit frequencies: statistics can also be gathered on units other than words, such as letters, ayahs, recitation marks, etc.
• How many letters are there in Surah "طه"?
• What is the longest ayah?
• How many Sajdah marks are there in the whole Qur’an?
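The word-frequency features listed above can be illustrated with a small counter. This is only a minimal sketch, assuming the ayahs are available as (surah, ayah number, text) tuples; the names `strip_diacritics` and `word_frequencies` and the input layout are hypothetical and not taken from the described system.

```python
from collections import Counter

# Arabic tanween, short vowels, shadda and sukun (U+064B..U+0652).
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}

def strip_diacritics(word):
    """Return the word without vocalization marks."""
    return "".join(ch for ch in word if ch not in DIACRITICS)

def word_frequencies(ayahs, vocalized=False):
    """Count words over the whole text and per surah.

    ayahs: iterable of (surah, ayah_number, text) tuples (assumed layout).
    vocalized: if False, diacritics are ignored when counting.
    """
    total = Counter()
    per_surah = {}
    for surah, _number, text in ayahs:
        for word in text.split():
            key = word if vocalized else strip_diacritics(word)
            total[key] += 1
            per_surah.setdefault(surah, Counter())[key] += 1
    return total, per_surah
```

With such counters, `total.most_common(10)` would answer the "ten most frequent words" question, and `per_surah[name][word]` the per-surah one.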

4.5 Discussion

Those proposed search features may or may not be useful or needed. That is why we need a way to validate the usefulness of those features and the need for them. The best validation in our case is to gather the opinions of ordinary users, Quran scholars, Arabic morphology experts, Natural Language Processing / Information Retrieval researchers, and maybe philosophers who work on comparing religious scriptures. We have to mix the audience targeted by the survey to get high-quality feedback about the usefulness, clarity and importance of each search feature. However, real experience will give better feedback, so we will also gather user feedback about each implemented feature. For this, we’ll carry out a second validation step based on user feedback in the Alpha and Beta test phases, which will be available after the implementation of our work as a desktop application or a "Google-like" website.

4.5.1 Survey

We made the survey in English and Arabic to cover a wider audience. We distributed the questions over several phases:
1. Basic Information: here we ask for basic information about the survey taker, such as country, age, gender, religion and native language.
2. Experience: here we ask about the background of the survey taker, i.e. how deep their knowledge is along four axes: Quran, Arabic, Linguistics, Computing.
3. Search features review: here we explain each search feature to the survey taker so that they can review it and evaluate the clarity, usefulness and importance of each one.
4. Feedback about survey: here we ask about the survey itself, such as its clarity, its quality and the difficulties that the survey taker may have had while taking it.
We used Google Docs forms to build our survey because, among the free survey makers on the Internet, it offers simple but efficient surveying features.

4.5.1.1 Survey Participants Details
We ran the survey for 45 days and gathered about 37 participants. The following pie charts describe the variation in the background of the participants, including Age, Gender, Country, Language and Religion.


Figure 4.15: Audience Background

We gathered information about the experience of the participants along four axes: Quran, Arabic, Linguistics, Computing. The following chart describes this:


Figure 4.16: Audience experience

4.5.1.2 Results of survey
The survey gave us very helpful results. The following figure describes the percentage of clarity, usefulness, and need of each search feature listed in 4.4.


Figure 4.17: Clarity, Usefulness, and Need percentage of each feature

4.6 Conclusion

In this chapter, we have listed the search features in Quran that are helpful. To facilitate the listing, we classified those features depending on the nature of the problem. That list will help us to design a detailed retrieval system that fits the needs of the Quran. We conducted a survey about the usefulness, usability and clarity of each feature and got very helpful feedback. Each feature has a different level of complexity: it could be implemented easily, or it may lead to a vast problem that needs a deeper study. That is what we’ll explain in the next chapter.


Chapter 5 Conception

Special cases aren’t special enough to break the rules. Although practicality beats purity. Tim Peters, The Zen of Python

5.1 Introduction

After listing all possible search features in the previous chapter, we will first go through our previous work, explaining all that we have already done. After that, we’ll discuss many improvements that lead to a better search experience. We’ve gathered those improvements under these points:
• Maintain a full vocalization search engine;
• Text processing based on Uthmani script;
• Quranic Word Search;
• An accurate statistics gathering system;
• More adequate suggestion system;
• The road-map toward semantic search.
Finally, we’ll also mention in this chapter all the Qur’anic resources that we think are necessary and the features depending on each resource.

5.2 Previous Work

We started working on the idea of search in Quran in the Engineer degree graduation project entitled "Development of a search and indexing engine for Qur’anic documents"1 [Dahmani2010]. We’ll go through what has already been done, since we consider it the base on which to continue. To reach our objective, we relied on the general behavior of retrieval systems as described in the figure:

Figure 5.1: Basic Prototype

We skipped the crawling phase because, unlike other search engines, we only needed to search a limited static resource, which is the Quran text. We considered the Ayah as the key unit of the index; each Ayah is identified by its Surah name and its number in the Surah. We adopted a Qur’anic text written in standard script for the basic search. The basic schema that we used in the document index contains: Document ID, Ayah ID, Ayah text, Surah name. The Searcher (in the previous figure) is the element that performs the search operation. It gets user queries, looks them up in the inverted index of Ayahs to get the document IDs of matched Ayahs, and then uses those document IDs to retrieve all the information of matched Ayahs, such as the Ayah text and its Surah name. The whole information is sent to the interfaces as results. The behavior of the Searcher is described in this figure:

1 Original title in French: Développement d’un moteur de recherche et d’indexation pour les documents coraniques

Figure 5.2: The behavior of Searcher
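The Searcher behavior described above (look up keywords in the inverted index, then fetch the stored fields of matched Ayahs) can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual implementation: documents are plain dictionaries following the schema above, tokens are obtained by whitespace splitting, and the function names `build_index` and `search` are hypothetical.

```python
from collections import defaultdict

def build_index(documents):
    """Build an inverted index: token -> set of document IDs.

    documents: dict doc_id -> {"ayah_id": ..., "text": ..., "surah": ...}
    """
    inverted = defaultdict(set)
    for doc_id, doc in documents.items():
        for token in doc["text"].split():
            inverted[token].add(doc_id)
    return inverted

def search(query, inverted, documents):
    """Return the stored fields of every Ayah containing all query keywords."""
    keywords = query.split()
    if not keywords:
        return []
    # Intersect the posting sets of all keywords (implicit AND).
    matches = set.intersection(*(inverted.get(k, set()) for k in keywords))
    # Use the matched document IDs to retrieve the stored information.
    return [documents[doc_id] for doc_id in sorted(matches)]
```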

We had proposed a partial vocalization comparison between indexed and queried keywords. The new comparison compares the diacritics if they exist in both compared words, but it skips them if at least one of the compared words carries no diacritics. We enriched our prototype in many key steps, which we summarize here:

5.2.1 Text Processing

Text processing is generally based on two main phases. The first is tokenization (extracting tokens). The second is the processing of those tokens, which includes normalization of characters, filtering of stop words and, last, stemming. We based our work on the same phases but customized them.

Figure 5.3: Text processing


For tokenization, we applied space tokenization to split the text on each space or tab encountered. We also noted that we can separate the pronouns from nouns, but that needs further study for different cases. One of the special cases is لهن (for them, feminine), where the pronoun forms the base of the word. For normalization, we include in it the fixing of the shaping problem of letters and of lam-alef. We avoid removing the diacritics in this phase because we need them in further features. The same applies to Hamza forms. For stemming, there are many different definitions; we considered stemming to be the operation of detaching affixes without bringing the word back to its root, which may be called "light stemming". The reasons for that choice and more details are given in [Dahmani2010]. For stop-word filtering, we discussed how to consider a Qur’anic word as a stop word and concluded that we had to choose, from the list of the most frequent words in the Qur’an, the particles and the pronouns, taking into consideration their diacritical marks to avoid ambiguities.

5.2.2 Query Processing

Advanced Query offers a set of search options; it is used by many search engines such as Google. The options that already existed are: fields, partial word, wild-card search (Jokers), logical relations (And, Or, Not), phrase, interval, and boosting. We took advantage of the advanced query in the implementation of some new linguistic features such as:
• Synonyms & Antonyms: a simple replacement of a word by the set of its synonyms or antonyms.
• Search by Derivations: a linguistic option that aims to extract the derivations of a given word based on a defined derivation level. It works in 3 steps:
1. Detect the derivation level of the given word: is it a Root, Stem, Lemma or Complete Word?
2. Get the origin of the word at the required derivation level.
3. Fetch all the derivations of that origin.
We implemented the previous actions using specific indexes that contain all the different derivation levels of all Qur’anic words. A drawback of using indexes is that they can’t fetch the origins of words not mentioned in Quran. This drawback can be fixed by using morphological analyzers for the first step.


• Considering/Ignoring Spelling faults: with this feature, the search engine ignores most of the spelling faults that are frequent in Arabic. We chose the following operations to be performed on the query words:
◦ Replace Tâ’ marbûta with Hâ’.
◦ Replace Alef maqsûra with Yâ’.
◦ Replace all forms of Hamzä with the independent Hamzä (ء).
Then, the search engine processes all indexed words in the same manner to compare them with the query words. The list of all words that compare successfully composes the new query.
• Search by the triple root-type-pattern: the general behavior of this property is that the user enters a triple of word properties, and the search engine fetches from the index all the words that correspond to those properties.
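The spelling-fault handling described above can be sketched with a character translation table. This is a minimal sketch: the exact set of letters treated as "forms of Hamza" (here آ أ ؤ إ ئ) is an assumption, and the names `relax_spelling` and `expand_query` are hypothetical.

```python
# Letters assumed to count as hamza forms: alef madda, alef with hamza
# above/below, waw with hamza, ya with hamza.
HAMZA_FORMS = "\u0622\u0623\u0624\u0625\u0626"

SPELLING_MAP = {ord(ch): "\u0621" for ch in HAMZA_FORMS}  # -> independent hamza
SPELLING_MAP[ord("\u0629")] = "\u0647"  # ta marbuta -> ha
SPELLING_MAP[ord("\u0649")] = "\u064a"  # alef maqsura -> ya

def relax_spelling(word):
    """Apply the three replacement operations to a word."""
    return word.translate(SPELLING_MAP)

def expand_query(query_words, indexed_words):
    """Return the indexed words that match a query word after relaxation.

    The returned list is what would compose the new query.
    """
    relaxed = {relax_spelling(w) for w in query_words}
    return sorted(w for w in indexed_words if relax_spelling(w) in relaxed)
```

Because both sides go through the same `relax_spelling`, a query typed with Hâ’ still reaches an indexed word written with Tâ’ marbûta.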

5.2.3 Suggestions

A suggestion module, broadly, takes the query keywords and offers other words as suggestions. We mentioned two suggestion features:
• Different Spellings: to offer spelling correction, we used n-gram dictionaries of Qur’anic words.
• Related Keywords: for this, we suggested using an ontology that includes all the words of the Qur’an with their semantic relations, where suggestions would be chosen in terms of dependency.

5.2.4 Results Processing

We passed the results through several processing phases to improve their quality and prepare them to be sent to the interfaces. Those phases were:

Figure 5.4: Results processing phases

• Scoring: search in the Qur’an is no different in how the scoring must be done, so we decided to use simple TF*IDF or one of its derived algorithms.
• Sorting: we suggested these criteria to sort the results: score, natural order of the mus-haf, revelation order of ayahs, alphabetic order of text fields, and numerical order of numeric fields.
• Filtering: the objective of filtering is to eliminate unneeded results based on a criterion such as: by surah.
• Extending: here, extra information is attached to each ayah in the results, such as Tafssir, translation into specified languages, recitation audio file, etc.
• Paginating: in this phase, the results are divided into a stream of pages to be sent to the interfaces page by page, because the user generally looks only at the first results.
• Highlighting: for highlighting, we did not rely on the simple method of exact word matching. We adopted a method that applies the same text processing to the text-for-view as to the text-for-search.
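The scoring, sorting and paginating phases above can be illustrated with a plain TF*IDF ranker. This is a sketch under stated assumptions: it uses the textbook formula tf × log(N/df) rather than any particular derived algorithm, each document is a list of tokens, and the names `tf_idf`, `rank` and `paginate` are hypothetical.

```python
import math

def tf_idf(term, doc_tokens, all_docs):
    """Textbook TF*IDF of a term in one document."""
    if not doc_tokens:
        return 0.0
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for tokens in all_docs if term in tokens)
    idf = math.log(len(all_docs) / df) if df else 0.0
    return tf * idf

def rank(query_terms, docs):
    """Return the IDs of matching documents sorted by descending score.

    docs: dict doc_id -> list of tokens. Ties are broken by document ID,
    which stands in here for the natural order of the mus-haf.
    """
    all_docs = list(docs.values())
    scores = {
        doc_id: sum(tf_idf(t, tokens, all_docs) for t in query_terms)
        for doc_id, tokens in docs.items()
    }
    hits = [d for d in scores if scores[d] > 0]
    return sorted(hits, key=lambda d: (-scores[d], d))

def paginate(results, page, per_page=10):
    """Return one page of results; pages are numbered from 1."""
    start = (page - 1) * per_page
    return results[start:start + per_page]
```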

5.2.5 Indexes Importing

To accomplish the features specific to Quran (such as search in Uthmani script, structural options, statistical options, thematic search and so on), other information should be fetched from extra indexes that contain:
• Ayah texts: include the different writings of the ayah’s text, such as Uthmani or Standard scripts.
• Structural Information: includes all the structures made for the mus-haf such as Ahzab, Manazil, Surahs and so on.
• Surah Information: includes all that is attributed to the surah as an index key unit, such as the surah’s names, revelation place and time, and so on.
• Statistical Information: includes all the statistics of surahs and ayahs, such as word frequency.
• Classification on Topics: includes a hierarchical categorization of ayahs by topic.

5.3 Full vocalized search engine

An obvious fact is that the majority of Arabic texts are written unvocalized (without diacritical marks). This causes some confusion in the meaning, and only the context of the text can lift the ambiguity. Therefore, we had to consider the vocalization in all indexing and search phases in order to achieve a Quranic retrieval system that overcomes these ambiguities. One of the main barriers that prevent the consideration of vocalization in Arabic is the lack of vocalized Arabic texts among the total Arabic content. However, since the Quran is already fully vocalized, that barrier is lifted. Another barrier is comparing vocalized, partially vocalized, and unvocalized texts, because ordinary comparison does not distinguish between letters and diacritics and fails to discover the similarity between words. For example, with ordinary comparison these words are considered different: الحمد, الحَمد, and الحَمْدُ. That is why we attempted to replace it with a partial vocalization comparison, which works as follows:

1st word | 2nd word | Comparing result | Reason
الملْك | المُلكُ | True | no conflicting diacritics
المَلك | المُلكُ | False | conflict of diacritics

Table 5.1: Partial vocalization
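The partial vocalization comparison summarized in Table 5.1 can be sketched as below. This is one possible reading of the rule, not the thesis implementation: each letter is grouped with the marks that follow it, and two words match when their letters are equal and no letter carries different marks in both words. The names `split_marks` and `partial_equal` are hypothetical, and treating a shadda as just another mark is a simplifying assumption.

```python
# Arabic tanween, short vowels, shadda and sukun (U+064B..U+0652).
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}

def split_marks(word):
    """Return a list of (letter, set_of_marks) pairs for a word."""
    units = []
    for ch in word:
        if ch in DIACRITICS:
            if units:  # a mark with no preceding letter is dropped
                units[-1][1].add(ch)
        else:
            units.append((ch, set()))
    return units

def partial_equal(a, b):
    """True if letters match and no letter has conflicting diacritics."""
    ua, ub = split_marks(a), split_marks(b)
    if len(ua) != len(ub):
        return False
    for (la, da), (lb, db) in zip(ua, ub):
        if la != lb:
            return False
        # Marks are compared only where both words are vocalized.
        if da and db and da != db:
            return False
    return True
```

An unvocalized query therefore matches every vocalization of the same letters, while a vocalized query only rejects words whose marks actually disagree with it.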

We intended to use this method in the comparison between search query keywords and indexed keywords, but this is not sufficient for two reasons:
1. It takes effect only in searching and not in all the operations needed in indexing, statistics calculation, detecting suggestions, detecting synonyms, detecting different vocalizations and any operation that may appear later.
2. Performance will be affected, since this method is not optimized compared to the ordinary one.
To fix those two issues, we have to improve the comparison method in the lowest possible layers:
• Regular expression engine: general regular expressions have no character classes for Arabic letters or Arabic diacritics, while they provide special classes for Latin letters, digits, and symbols. Introducing customized Arabic classes into the regex engine improves the performance of the partial vocalization comparison and makes it globally available.
• Basic string: in most programming languages, strings are objects and have their own comparison method. Replacing that method with our partial comparison operation takes effect on all declared instances and all derived objects.
One problem of changing the regex engine or the basic string directly in place is the possibility of diverging from the main line of the programming language development. That should be taken into consideration because, if it happens, the forked branch will be stuck.


An alternative way is prototyping instead of replacing in place. This way, a lot of conflicts between the main development branch and the forked branch can be avoided. To further improve the vocalized search, we have to distinguish between original vowels and declension case markers, because words like الْمُلْكُ (Kingdom, nominative), الْمُلْكِ (Kingdom, genitive) and الْمُلْكَ (Kingdom, accusative) are considered as one word with one meaning, which is Kingdom. This will give the best matches in search and will not oblige the user to repeat the search operation with each one of the different possible declensions.

Figure 5.5: Different possible declensions of the word الملك

Ignoring vocalization has a direct influence on filtering stop words, because filtering out a word like the particle مِنْ leads to filtering the relative noun مَنْ and the verb مَنَّ as well. Using vocalized stop words is highly recommended to avoid ambiguities and to restrict the filtered occurrences. We’ll mention how we gather the stop words in 5.4.4.

Figure 5.6: Different types of the possible vocalizations of the word من

Another problem that emerges from ignoring the vocalization is the inaccuracy of statistics. For example, the frequency of the unvocalized word الملك will be the total of all these words: الْمُلْكُ (Kingdom, nominative), الْمُلْكِ (Kingdom, genitive), الْمُلْكَ (Kingdom, accusative), الْمَلِكُ (King, nominative), and الْمَلِكِ (King, genitive). The appropriate way to gather statistics is to use the vocalized words while ignoring the haraka of the declension case (حالة الإعراب) for nouns, so for the previous example we’ll have two units: المُلْك (Kingdom, undefined case) and المَلِك (King, undefined case). We’ll talk more about gathering statistics in ??.
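The statistics rule above (count vocalized nouns while ignoring the case-ending haraka) can be sketched as follows. This is a rough approximation under stated assumptions: the case marker is taken to be any trailing short vowel or tanween, which does not cover endings followed by a letter (such as tanween with alef), and the input is assumed to contain declinable nouns only. The names `strip_case_ending` and `vocalized_noun_frequency` are hypothetical.

```python
from collections import Counter

# Tanween (U+064B..U+064D) and short vowels fatha, damma, kasra (U+064E..U+0650).
CASE_MARKS = {chr(c) for c in range(0x064B, 0x0651)}

def strip_case_ending(word):
    """Remove trailing case marks, keeping all internal vocalization."""
    while word and word[-1] in CASE_MARKS:
        word = word[:-1]
    return word

def vocalized_noun_frequency(nouns):
    """Count vocalized nouns, merging their declension cases."""
    return Counter(strip_case_ending(w) for w in nouns)
```

Applied to the five forms listed above, this yields exactly two units, one for Kingdom and one for King.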

In order to build a fully vocalized search engine, the linguistic and Quranic resources that we use should be fully vocalized too. Those resources could be:
• a text: Quranic ayahs.
• a list of names: sura names.
• an ontology: Quranic concepts.
• a thesaurus: synonyms, antonyms.
• a word mapping: Uthmani to Standard writing.
• annotations: Quranic word annotations.
If any necessary resource is unvocalized or partially vocalized, we can exploit it using the partial vocalization comparison method and proceed to vocalize it. The old method of generating suggestions that we proposed in the previous work did not consider vocalization because it may raise some issues such as:
• A larger number of n-grams.


• It is meaningless to have n-grams that start with a vocalization mark.
The right behavior is to distinguish between letters and diacritics, to make each gram contain the letter together with its diacritic, and then to change the similarity function to use the partial vocalization comparison. This keeps the number of n-grams small and avoids meaningless ones. In conclusion, we mention the benefits gained by relying on a fully vocalized environment:
• Lift the ambiguities caused by ignoring vocalization.
• Make search results, suggestions, and statistics more accurate.
• Refine meaning detection (a first step in the semantic approach).
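The vocalization-aware n-grams discussed in this section can be sketched as follows: each gram unit is a letter together with its diacritics, and the similarity function compares units partially. This is a minimal illustration; the overlap ratio used in `similarity` is an assumed measure chosen for brevity, and all function names are hypothetical.

```python
# Arabic tanween, short vowels, shadda and sukun (U+064B..U+0652).
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}

def units(word):
    """Group each letter with the diacritics that follow it."""
    out = []
    for ch in word:
        if ch in DIACRITICS:
            if out:  # a mark with no preceding letter is dropped
                out[-1] += ch
        else:
            out.append(ch)
    return out

def ngrams(word, n=2):
    """N-grams over letter+diacritic units, so no gram starts with a mark."""
    u = units(word)
    return [tuple(u[i:i + n]) for i in range(len(u) - n + 1)]

def unit_match(a, b):
    """Same letter, and marks equal unless one side is unvocalized."""
    return a[0] == b[0] and (len(a) == 1 or len(b) == 1 or a[1:] == b[1:])

def gram_match(g1, g2):
    return all(unit_match(a, b) for a, b in zip(g1, g2))

def similarity(w1, w2, n=2):
    """Share of grams of w1 that partially match some gram of w2."""
    g1, g2 = ngrams(w1, n), ngrams(w2, n)
    if not g1 or not g2:
        return 0.0
    shared = sum(1 for g in g1 if any(gram_match(g, h) for h in g2))
    return shared / max(len(g1), len(g2))
```

A fully vocalized two-letter word yields a single bigram here, whereas character-level bigrams over the same string would yield three.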

5.4 Othmani script and text processing

Quran is written principally in Othmani script, which has many differences compared to the standard script usually used in different media. In the previous work, we adopted the Quranic text written in standard script, but due to the difficulties caused by its differences from Othmani script, we have to consider both scripts for indexing and search. Among those difficulties, we mention:
1. Searching with an Othmani writing form of a word such as بصطة, knowing that it is also written in the form بسطة in another Ayah2. The retrieval system can’t distinguish between the two occurrences based only on the standard script.

Figure 5.7: Different writing forms of the word بسطة using Othmani script

2 بسطة occurs as بصطة in (الأعراف 69) and as بسطة in (البقرة 247)


2. Calculating statistics, knowing that the number of letters differs between the two scripts in a large number of ayahs. This difference is due to the addition and removal of some letters in the Othmani script (for more details see 3.4.4).
3. Matching the Word-By-Word structure of some Quranic linguistic resources. Using the Quranic word as the basic unit for linguistic resources lacks precision in some cases, such as when two occurrences of a word have different interpretations. To resolve that, many Quranic linguistic resources (such as the Quranic Arabic Corpus, see 3.6.3.3) have adopted the word occurrence as the basic unit, a.k.a. Word-By-Word Quran browsing. As a result, such resources have to use the words as they occur in Quran (consequently in Othmani script), because the separation of some words differs between Othmani script and standard script, as in the word يَأَسَفَى in Yusuf 84, which is separated into two words, يَا + أَسَفَى, in the standard script.

Figure 5.8: Merging Words in Uthmani Script

As we’ve said, to resolve those difficulties we also consider the Othmani text for text processing along with the standard text. In addition, we propose many improvements to the text processing phases in order to achieve many of the search features that we proposed in the previous chapter (see 4.4).


Figure 5.9: General Schema of Uthmani and Standard text processing.

We list in the following our improvements per phase:

5.4.1 Substitution

This is a new phase that we propose to add before tokenization. Its objective is to identify a list of pre-defined patterns and replace them in preparation for tokenization. Generally this phase is for processing the search query, because it is not needed for indexing the Quranic text. We don’t include this phase in the normalization phase (after tokenization) because those substitutions are essential for correct tokenization, especially since the tokenization phase is not a simple space-based separation and is configured to use a word-by-word corpus. We propose these cases of substitution:

5.4.1.1 Romanizations
Among the search features that we’ve described in the previous chapter, there is the option of searching using romanized Arabic text (see 1). Romanized texts are in Latin letters and our system is defined for Arabic letters. That’s why we have to replace each letter with its appropriate counterpart. There are many romanization systems. To make it possible to use more than one, we should specify a guessing policy to detect which system is used in a given string. There are many criteria that could be used in guessing; we list some of them in the following:
1. Nature of used characters: each romanization system uses a set of letters, numbers, or symbols to represent the Arabic vowels and letters (including hamza forms). The differences between those sets are noticeable. For example, the ISO233 norm usually uses dotted letters such as ẗ for ة; in contrast, Buckwalter uses symbols such as & for ؤ. Those differences could be used to guess the romanization system. For example, if we consider the three romanization systems ISO233, Buckwalter, and Arabtex, guessing the romanized word ẖalīfaẗ leads surely to the ISO233 system because of the specific characters ẖ, ī, ẗ.
2. Arabic valid words: by interpreting the romanized word using the different romanization systems, we get many corresponding Arabic words. We can check whether each word is valid in Arabic and eliminate the invalid ones. For example, guessing the romanized word xalyfap leads surely to the Buckwalter system because it generates the only valid Arabic word: خَليفَة.
3. Word existence in Quran: by checking whether the interpreted words arising from a romanized word exist in the Quran, we can detect the potential ones.
4. Predefined priorities: the last criterion is the predefined priorities. If the guessing system fails to limit the choices to only one romanization system, it should pick the first one based on predefined priorities.

5.4.1.2 Numbers into words
In Quran, numbers are written in words, not as we write them in the current numeral systems (Arabic numerals3, or Hindi numerals4, aka Eastern Arabic numerals). We need to translate those numerals into the literal form to make them searchable in the Quran. In the following table, we list some numbers as they are mentioned in Quran:

3 Arabic numerals: (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
4 Hindi numerals: (٠ - ١ - ٢ - ٣ - ٤ - ٥ - ٦ - ٧ - ٨ - ٩)



N° | Ayah ID | Number as mentioned | Appropriate numeral
1 | (الإخلاص 1) | أَحَدٌ | 1
2 | (المائدة 106) | اثْنَانِ | 2
3 | (مريم 10) | ثَلَاثَ | 3
4 | (الأنفال 65) | عِشْرُونَ | 20
5 | (البقرة 261) | مِائَةُ | 100
6 | (الأنفال 65) | مِائَتَيْنِ | 200
7 | (البقرة 96) | أَلْفَ | 1000
8 | (المعارج 4) | خَمْسِينَ_أَلْفَ | 50000
9 | (يوسف 4) | أَحَدَ_عَشَرَ | 11
11 | (الكهف 25) | ثَلَاثَ_مِائَةٍ_سِنِينَ_وَازْدَادُوا_تِسْعًا | 309
12 | (العنكبوت 14) | أَلْفَ_سَنَةٍ_إِلَّا_خَمْسِينَ_عَامًا | 950

Table 5.2: Some numbers as they mentioned in Quran

To be close to the Quranic way of writing numbers, we should respect many properties. We write each digit in its class: ones, tens, hundreds, thousands, tens of thousands, hundreds of thousands, and it is better not to write the conjunction waw (واو العطف). Zero has to be ignored because it is never written: we don’t say صفر رجل (zero men) but we say لا رجل (no men). One is never mentioned as واحد but as أحد. Some numbers accept gender, such as إثنان (2, masculine) and إثنتان (2, feminine). Other numbers take the opposite gender of the counted noun (المعدود), such as سبع سماوات (7 heavens, feminine) and سبعة أبحر (7 seas, masculine). A hundred, مئة, has a special writing in Quran, which is مِائَةُ with an additional Alef.

In example N°11 of the table, the number 309 is split across the words but is still searchable. In contrast, example N°12 shows an indirect mention of the number 950. That makes it hard to retrieve without understanding the ayah semantically.
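The first step of turning a numeral into words, splitting it into digit classes and ignoring zeros, can be sketched as follows. This is only a partial illustration: mapping each (digit, class) pair to its Quranic word form, with gender agreement and special spellings, is deliberately left out. The function name `digit_classes` is hypothetical; the sketch relies on Python's `int()` accepting Eastern Arabic digits as well as Western ones.

```python
CLASSES = ["ones", "tens", "hundreds", "thousands",
           "tens of thousands", "hundreds of thousands"]

def digit_classes(numeral):
    """Split a numeral into (digit, class) pairs, skipping zeros.

    numeral: a string of Western (0-9) or Eastern Arabic digits.
    """
    value = int(numeral)  # int() parses any Unicode decimal digits
    if not 0 <= value < 10 ** len(CLASSES):
        raise ValueError("numeral out of the supported range")
    parts = []
    for position, ch in enumerate(reversed(str(value))):
        if ch != "0":  # zero is never written in words
            parts.append((int(ch), CLASSES[position]))
    return parts[::-1]  # highest class first
```

For instance, the numeral 309 from the table decomposes into a hundreds digit and a ones digit, with the zero dropped.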

Figure 5.10: Example of Substitution
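The first and last guessing criteria of 5.4.1.1 (nature of used characters, then predefined priorities) can be sketched as below, together with a Buckwalter-to-Arabic substitution. This is a simplified sketch under stated assumptions: the character test only separates accented ISO233 input from ASCII input, no validity or Quran-existence check is performed, and the names `guess_system` and `buckwalter_to_arabic` as well as the default priority order are hypothetical.

```python
# Standard Buckwalter transliteration table (ASCII symbol -> Arabic character).
BUCKWALTER = {
    "'": "\u0621", "|": "\u0622", ">": "\u0623", "&": "\u0624", "<": "\u0625",
    "}": "\u0626", "A": "\u0627", "b": "\u0628", "p": "\u0629", "t": "\u062a",
    "v": "\u062b", "j": "\u062c", "H": "\u062d", "x": "\u062e", "d": "\u062f",
    "*": "\u0630", "r": "\u0631", "z": "\u0632", "s": "\u0633", "$": "\u0634",
    "S": "\u0635", "D": "\u0636", "T": "\u0637", "Z": "\u0638", "E": "\u0639",
    "g": "\u063a", "_": "\u0640", "f": "\u0641", "q": "\u0642", "k": "\u0643",
    "l": "\u0644", "m": "\u0645", "n": "\u0646", "h": "\u0647", "w": "\u0648",
    "Y": "\u0649", "y": "\u064a", "F": "\u064b", "N": "\u064c", "K": "\u064d",
    "a": "\u064e", "u": "\u064f", "i": "\u0650", "~": "\u0651", "o": "\u0652",
}

def guess_system(text, priorities=("buckwalter", "arabtex", "iso233")):
    """Guess the romanization system from the characters used."""
    if any(ord(ch) > 127 for ch in text):
        candidates = {"iso233"}  # accented letters such as ī or ẗ
    else:
        # ASCII-only input cannot be excluded for these two by this test alone.
        candidates = {"arabtex", "iso233"}
        if all(ch in BUCKWALTER or ch.isspace() for ch in text):
            candidates.add("buckwalter")
    # Predefined priorities break any remaining tie.
    return next(name for name in priorities if name in candidates)

def buckwalter_to_arabic(text):
    """Replace each Buckwalter symbol with its Arabic counterpart."""
    return "".join(BUCKWALTER.get(ch, ch) for ch in text)
```

A fuller guesser would also apply the second and third criteria, interpreting the word under each candidate system and keeping only interpretations that are valid Arabic words or that occur in the Quran.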

5.4.2 Tokenization:

The function of tokenization is to split a running text into tokens, so that they can be fed into the next phases for further processing. We keep the space-based tokenization from our previous work as a first step, since it is the simplest choice and works well. After performing the space-based tokenization, we could also separate the words (tokens) into their parts (sub-tokens), so that we have separate pronouns and clitics in case we need special processing for them. The problem with this is that it introduces a lot of flaws. Taking as an example the word فَأَسْقَيْنَاكُمُوهُ 5 (meaning6: so We give it to you to drink of), the separated sub-tokens would be فَ + أَسْقَيْنَا + كُمُ + هُ (ignoring و), and the flaws will be:

Figure 5.11: Tokenization of the word فَأَسْقَيْنَٰكُمُوهُ

• Tokenization is an important phase since it is the basis for the following phases. Thus, it should have very high precision, while there is currently a lack of accurate Arabic analyzers.
• Some affixes change the form of the word, so we have to bring it back to its original form after separation. In our example, و is an additional letter.
• A lot of clitics and pronouns, such as فَ, should be considered as stop words because they have a high frequency and less need of retrieval.

5 mentioned in (الحجر 22)
6 English Translation of Mohammad Habib Shakir

A related solution was proposed by Mohammed A. Attia [Attia]. He started from the observation that an Arabic word can comprise up to four independent tokens, so he considers that morphological knowledge needs to be incorporated into the tokenizer. He described a rule-based tokenizer that handles tokenization as a full-rounded process with two stages: a preprocessing stage (white space normalizer) and a post-processing stage (token filter). He described the implementation of 3 tokenization models:

1. Tokenization Combined with Morphological Analysis: this is the most linguistically motivated and the most common form of implementation for Arabic tokenization [Habash05arabictokeni]. However, it violates the design concept of modularity, which requires systems to have separate modules for undertaking separate tasks. For example: the word (ولیشكر, and to thank) will be و+conj@ل+comp@شكر+verb+pres+sg@.

2. Tokenization Guesser (tokenization is separated from morphological analysis): the tokenizer here only detects and demarcates clitic boundaries, with two additional components: a clitics guesser (integrated with the tokenizer) and a clitics transducer (integrated with the morphological transducer) to recover the information on what may constitute a clitic. The advantages are robustness, as it is able to deal with any words whether they are known to the morphological transducer or not, and respect for the concept of modularity, as it separates the process of tokenization from morphological analysis. The disadvantage is that the morphological analyzer and the syntactic parser have to deal with increased tokenization ambiguities. For example: the word (وللرجل, and to the man) will be either و@ل@ال@رجل@, و@ل@الرجل@, و@للرجل@, or وللرجل@, whereas the tagging of the clitics is و+conj, ل+prep, ال+art+def.

3. Tokenization Dependent on the Morphological Analyzer: here the word stem is not guessed, but taken from a list of actual words. A possible word in the tokenizer in this model is any word found in the morphological transducer. The morphological transducer here is the same as in the first model but with one difference: the output does not include any morphological features, only token boundaries between clitics and stems. One advantage of this implementation is that the tool becomes more deterministic and more manageable in debugging. Its lack of robustness makes it mostly inapplicable, as no single morphological transducer can claim to comprise all the words in a language. For example, the word (وللرجل, and to the man) will be و@ل@ال@رجل@.

Those models proposed by Attia are designed for Arabic in general, and because we target the Quran, we value precision. The tokenization guesser leads to an increased number of ambiguities, but if we use a hand-made word-by-word corpus we can limit those ambiguities, especially for the words that exist in the Quran. For words not found in the corpus, we may use any general Arabic tokenization guesser.


Figure 5.12: Sub-tokens separation schema
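The sub-token separation policy described above (corpus lookup first, general guesser otherwise) can be sketched as follows. This is an illustrative sketch only: the one-entry `CORPUS_SEGMENTS` dictionary stands in for a hand-made word-by-word corpus, the default fallback simply keeps the word whole, and the names `subtokens` and `tokenize` are hypothetical.

```python
# Hypothetical hand-made word-by-word corpus: word -> list of sub-tokens.
# Single illustrative entry, the word "and to the man" segmented as
# conjunction + preposition + article + stem.
CORPUS_SEGMENTS = {
    "\u0648\u0644\u0644\u0631\u062c\u0644":
        ["\u0648", "\u0644", "\u0627\u0644", "\u0631\u062c\u0644"],
}

def subtokens(word, corpus, fallback=lambda w: [w]):
    """Segment a word using the corpus, or the fallback guesser if unknown."""
    segments = corpus.get(word)
    if segments is not None:
        return list(segments)
    return fallback(word)

def tokenize(text, corpus, fallback=lambda w: [w]):
    """Space-based tokenization followed by sub-token separation."""
    return [subtokens(word, corpus, fallback) for word in text.split()]
```

Any general Arabic tokenization guesser could be plugged in through the `fallback` parameter for words that do not exist in the corpus.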

Tokenization causes a loss of token information such as function and POS. This loss affects the semantic understanding of keywords. For example, take the two ayahs:
1. [وَالشَّمْسِ وَضُحَاهَا] (الشمس 1)
2. [وَهُوَ الَّذِي خَلَقَ اللَّيْلَ وَالنَّهَارَ وَالشَّمْسَ وَالْقَمَرَ كُلٌّ فِي فَلَكٍ يَسْبَحُونَ] (الأنبياء 33)

When we tokenize the previous two ayahs, we lose the function of the first waw (و), which in the first ayah is for oath and in the second ayah is for coordination. To recover this kind of loss, we introduce tags that hold the important information. We add the tags in this phase (tokenization) because it is the only phase that reads the word in the semantic and syntactic context of its phrase. Consequently, it knows the part-of-speech information. Another reason is that we need them in the next phases: Normalization and Stemming. Among the tags that we think are important, there are:
1. Word declension: whether the word is declinable (معربة) or not. We need this in the normalization phase because we intend to strip the declension vowels (حركات الإعراب) to unite the different occurrences of a word.
2. Flexion information: whether the word is derivative or not. We need this in the Stemming phase.

The English language has a similar problem in the tokenization of compound nouns such as the word "homework" or the word "unLadyLike", but not with the same complication as in Arabic. It is possible to delay affix processing until the stemming phase, but there will be a difference:
• with stemming, we lose affixes so we can’t specify the exact word search;
• with tokenization, we preserve tokens and we can reproduce the whole word.

Figure 5.13: Example of tokenization

5.4.3 Normalization:

Since we accept the Uthmani script besides the standard script, we have to specify the normalization for both. We have to strip all recitation marks such as the Waqf symbol ∴∴ (see the different Waqf symbols in 3.3.3).

After stripping all extra symbols, we should normalize the Uthmani text into standard text. This operation is necessary to unify the processing in the next phases; there is no benefit in continuing along two separate lines. Converging on the standard script makes the stemming clearer and avoids special handling of some words with Uthmani-specific spellings. For example, the word -example-. To achieve that, we need a complete mapping of Quranic words between both scripts. To minimize the size of the mapping, we can exclude the words that are written in the same manner. The mapping should use the Uthmani word as its unit, not the standard word, because some words are mapped differently from one occurrence to another. An example is the word بسطة, which occurs as بصطة in (الأعراف 69) and as بسطة in (البقرة 247).

It is usually recommended to strip vowels when normalizing Arabic texts, which means ignoring vowels during indexing and search. However, this negatively affects the accuracy of search, since the vowels are an important key to detect the meaning of a word and distinguish it from other meanings. An example is the word الملك, which is mentioned with different meanings in the Quran: الْمُلْكُ (Kingdom), الْمَلِكُ (King) and الْمَلَكُ (Angel). Each of these words, however, is mentioned in different declension cases: Nominative, Accusative and Genitive. The declension case has no effect on the meaning. Our idea is to keep the vowels except the declension case ending vowel. This will unify all declension cases of a word. This can be done by passing tags with the tokens for declension information such as:
• Declinable (مُعرَب) or not?
• Declension case (حالة الإعراب): Nominative, Accusative, and Genitive.
• Declension case markers (علامات الإعراب): Fatha, Damma, Kessra, Double-Fatha (with the additional Alef), Double-Kessra, Double-Damma.

Figure 5.14: Arabic case markers

Figure 5.15: Example of normalization
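The normalization rule described above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the function name `normalize_token`, the boolean `declinable` tag and the chosen range of recitation marks are illustrative and are not taken from the implemented system. The Double-Fatha with its additional Alef is deliberately not handled here.

```python
import re

# Short vowels and tanwin that can mark the declension case at the end of a word.
CASE_MARKS = "\u064B\u064C\u064D\u064E\u064F\u0650"
# Small high Quranic annotation signs U+06D6..U+06DC, which include pause (Waqf) marks.
RECITATION_MARKS = re.compile("[\u06D6-\u06DC]")

def normalize_token(word, declinable):
    """Strip recitation marks, then the case-ending vowel of a declinable word."""
    word = RECITATION_MARKS.sub("", word)
    if declinable and word and word[-1] in CASE_MARKS:
        word = word[:-1]          # drop only the final declension vowel
    return word

# "al-malik" (King) without its case ending, written with escapes for clarity.
MALIK = "\u0627\u0644\u0652\u0645\u064E\u0644\u0650\u0643"
# Nominative, accusative and genitive forms collapse to the same indexed form.
forms = {normalize_token(MALIK + ending, True) for ending in "\u064F\u064E\u0650"}
```

The inner vowels are kept, so the forms for King and Angel remain distinct after normalization, which is the point of stripping only the case ending.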

5.4.4 Filtering stop-words:

In the previous work, we discussed the feasibility of filtering stop-words for search in the Quran (see 5.2.1). We proposed selecting the stop-words from the list of the most frequent words in the Qur’an, the particles and the pronouns, taking their diacritical marks into consideration.

We made an improvement in the tokenization that affects our previous criteria for choosing the list of stop-words. This improvement is the separation of pronouns and clitics from the lemma of the word. Clitics such as فَـ (fa) generally have a high frequency and no independent meaning of their own. Therefore, clitics are potential stop-words that should be ignored for a better search experience. Ignoring them should be optional, so that linguists can still look for these keywords.

Figure 5.16: Example of stop-word filtering

5.4.5 Lemmatization:

The Arabic word has several levels of derivation: the word, the word without affixes (lemma), the stem, and the root. We proposed stripping the affixes in tokenization (see 5.4.2). In stemming, we bring the word back to either its stem or its root. Roots are the basic unit in Arabic. There are about 10000 roots in Arabic (9273 roots in the lexicon of Lisan Al-Arab), of which about 1500 occur in the Quran. When paired with patterns, a root can generate more than 1000 variant words. The generated words can have similar, independent or opposite meanings. By contrast, a stem usually generates a small set of words that have a similar meaning. Thus, we recommend using the stem as the reference form for this phase. We can use the morphology tags from the tokenization phase as input for the stemmer. Some of these tags are helpful. We include the stem value of each word, if available, as a tag, and use it to replace the word.

Figure 5.17: Examples of lemmatization

5.5 Qur’anic Word Search

Previously, we considered the ayah as the search unit. That means it was the unit of our document index, or what we can call the document. The ayah as the document is still the best choice. However, to support many linguistic features (mentioned in 4.4) such as:
• Suggestion systems :: Different vocalizations;
• Linguistic Aspects :: Multi-level derivation;
• Linguistic Aspects :: Specific-derivations;
• Linguistic Aspects :: Word properties embedded query;
• Linguistic Aspects :: Word linguistic annotations;
we need to consider a different search unit: the Quranic word (اللفظ القرآني).

The purpose of this is to obtain a quick, efficient and stable method to retrieve specific Quranic words. To achieve that, we need a corpus of Quranic words enriched with the linguistic properties of each word. This corpus could be based on either the word form or the word occurrence. The second choice is the most accurate, because a word can change its properties from one occurrence to another. There are about 17 thousand words with a total of about 76 thousand occurrences (we calculated these statistics based on the Quran copy downloaded from Tanzil.org). We propose that the following information be included in the schema of the word document index:

• Identifiers: a global identifier, and a secondary identifier based on the order in the ayah added to the ayah identifier and the surah identifier;
• Different forms: Uthmani vocalized word (the main form), Standard vocalized word, Standard unvocalized word;
• Transliterations: ISO233, Buckwalter, Arabtex;
• Translations: English, other languages;
• Different levels of stemming: Lemma, Stem, Root;
• Other properties: Part Of Speech, type, state, case, mood, voice, number, gender, person.

Then we build the inverted indexes from the document indexes, as explained in the following figure: [+figure of indexing the word document index into inverted indexes]

We should pass the word fields through the same text analysis phases that we described in 5.4. We should also preserve a duplicate field holding the original word without text analysis, in order to satisfy exact search needs.

We use the word search to improve the ayah search by introducing a 2-steps search strategy. The first step retrieves the best set of keywords for the user query by searching in the word-as-a-unit index. The second step retrieves the corresponding ayahs using the keyword set resulting from the first step. The following figure illustrates the 2-steps search strategy:

Figure 5.18: Two-Steps search behavior
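The 2-steps strategy can be illustrated with two small in-memory indexes. The sketch below is a toy model and not the implemented system: the records in `WORDS` are hypothetical transliterated examples, and the names `build_indexes` and `two_step_search` are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical word-level records: (word form, root, ayah identifier).
WORDS = [
    ("qAla", "qwl", "2:30"),
    ("qul", "qwl", "112:1"),
    ("yaquwlu", "qwl", "2:8"),
    ("kataba", "ktb", "6:12"),
]

def build_indexes(words):
    by_root = defaultdict(set)   # word-as-a-unit index: root -> word forms
    by_word = defaultdict(set)   # ayah index: word form -> ayah identifiers
    for form, root, ayah in words:
        by_root[root].add(form)
        by_word[form].add(ayah)
    return by_root, by_word

def two_step_search(root, by_root, by_word):
    # Step 1: retrieve the keyword set from the word index.
    keywords = set(by_root.get(root, set()))
    # Step 2: retrieve the ayahs that contain any of those keywords.
    ayahs = set()
    for keyword in keywords:
        ayahs |= by_word.get(keyword, set())
    return keywords, ayahs

BY_ROOT, BY_WORD = build_indexes(WORDS)
keywords, ayahs = two_step_search("qwl", BY_ROOT, BY_WORD)
```

Keeping the two indexes separate means the first step can be swapped (by property, by ontology relation, by derivation level) without touching the ayah retrieval in the second step.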

The 2-steps search must also perform a second operation which is to inquire a word ontology to retrieve semantically related words. Semantically related words includes


Figure 5.19: Semantically related words : Idols in Quran

As a conclusion, the 2-steps search strategy helps in retrieving Quranic words classified by their origins, properties, translations, transliterations and semantic relations. The classification is flexible and fast thanks to inverted indexes and ontologies. In the following, we list some potentially useful classifications for the root قول (qwl):

1. Synonyms: نطق, كلام, شهادة.

2. Imperative tense: قل, قولوا, قولي.

3. Passive form: يقال, قيل.

4. Past tense: قال, قالا, قلنا, قالت...etc.

5. Noun, Plural: الأقاويل.

Building on the 2-steps strategy, we propose the following new operations to be implemented in the query parser:

5.5.1 Word properties search

This operation replaces the search by the triple root-type-pattern in the previous work (see 5.2.2). Its objective is to allow users to locate ayahs based on a set of keywords chosen by linguistic properties such as Part Of Speech, type, state, case, mood, voice, number, gender, person. All we need for the implementation of this operation is the word document index with the ability of fielded search. A fielded search is an advanced query feature that lets the user select the document fields to which the query should be limited, and then use the required keywords within these fields. We propose the query syntax {{ PROPERTY_FIELDi:PROPERTY_VALUEi, ... }} to search for the set of words that have the value PROPERTY_VALUEi for the property PROPERTY_FIELDi.

Figure 5.20: Word properties search example : First person, Plural, Masculine
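A possible way to parse the proposed {{ FIELD:VALUE, ... }} syntax and apply it to word records is sketched below. The records in `WORD_INDEX`, the lower-casing of fields and values, and the function names are assumptions for illustration; a real implementation would delegate the filtering to the fielded search of the word document index.

```python
import re

# Hypothetical word records with a few linguistic properties (transliterated).
WORD_INDEX = [
    {"word": "qulnA", "person": "first", "number": "plural", "gender": "masculine"},
    {"word": "qAla", "person": "third", "number": "singular", "gender": "masculine"},
    {"word": "qAlat", "person": "third", "number": "singular", "gender": "feminine"},
]

def parse_properties(query):
    """Turn '{{ person:first, number:plural }}' into a field -> value mapping."""
    match = re.fullmatch(r"\s*\{\{(.*)\}\}\s*", query)
    if match is None:
        raise ValueError("not a word-properties query: %r" % query)
    properties = {}
    for part in match.group(1).split(","):
        if part.strip():
            field, _, value = part.partition(":")
            properties[field.strip().lower()] = value.strip().lower()
    return properties

def words_with(properties, index):
    """Return the words whose records match every requested property."""
    return [record["word"] for record in index
            if all(record.get(f) == v for f, v in properties.items())]

query = "{{ person:first, number:plural, gender:masculine }}"
matching = words_with(parse_properties(query), WORD_INDEX)
```

The resulting word set would then feed the second step of the 2-steps strategy to locate the ayahs.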

5.5.2 Semantically Related Words

This operation replaces the synonyms and antonyms operation in the previous work (see 5.2.2). Its objective is to offer the words related to a keyword entered by the user. The user can specify which semantic relation to query: Synonymy, Antonymy, Hypernymy, Hyponymy, Meronymy, Holonymy, Troponymy. This operation requires an ontology that defines the relations between the different Quranic words. It is performed first by querying the ontology for related words, and second by using those words as keywords to retrieve the corresponding ayahs.

Figure 5.21: Searching through an ontology

We propose the query syntax RELATION~WORD to search for the words related to the word WORD by the semantic relation RELATION, or ~WORD to search using an undefined relation.
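The RELATION~WORD syntax could be handled as sketched below. The tiny `ONTOLOGY` dictionary is hypothetical toy data standing in for a real Quranic ontology, and treating an empty relation as "any relation" is our reading of the "undefined relation" case.

```python
# Hypothetical ontology: (relation, word) -> set of related words.
ONTOLOGY = {
    ("hyponymy", "prophet"): {"Musa", "Ibrahim", "Nuh"},
    ("synonymy", "say"): {"speak"},
    ("antonymy", "say"): {"be silent"},
}

def parse_relation_query(query):
    """Split 'RELATION~WORD' (or '~WORD') into (relation or None, word)."""
    relation, separator, word = query.partition("~")
    if not separator or not word.strip():
        raise ValueError("expected RELATION~WORD or ~WORD")
    return (relation.strip().lower() or None), word.strip()

def related_words(query, ontology):
    """First step of the operation: query the ontology for related words."""
    relation, word = parse_relation_query(query)
    found = set()
    for (rel, source), targets in ontology.items():
        if source == word and relation in (None, rel):
            found |= targets
    return found

hyponyms = related_words("hyponymy~prophet", ONTOLOGY)
anything = related_words("~say", ONTOLOGY)
```

The returned set plays the role of the keyword set that the second step uses to retrieve the corresponding ayahs.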

Figure 5.22: Semantically Related Words, Hyponymy of the word نبي (prophet)

5.5.3 Multi-level Derivations

This operation replaces the derivation search operation in the previous work (see 5.2.2). The objective of this operation is to get the set of words that share the same origin, such as the stem or the root. The user has to specify the word and a level of derivation. The operation recovers the origin of the word at the specified derivation level and retrieves the whole set of words that share this origin. Since we can adopt either the lemma or the stem as the form in which words are saved in the indexing phase (after text analysis), the adopted form is the reference level for our operation. If we adopt the stem, the root is the only level available. If we adopt the lemma, there are two levels: the stem and the root. This operation requires the origins of each Qur’anic word to be available. We propose the query syntax ORIGIN_LEVEL>WORD to search for the set of words with the same origin as the word WORD at the level ORIGIN_LEVEL. The syntax could be simplified into >WORD, >>WORD or >>>WORD depending on the origin level.

Figure 5.23: Multi-level Derivation Search example
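The simplified syntax can be sketched as follows. The mapping of one, two and three '>' signs to lemma, stem and root is an assumption made for this example (the text leaves the exact level names open), and `ORIGINS` is hypothetical transliterated toy data.

```python
# Assumed order of origin levels, selected by the number of leading '>' signs.
LEVELS = ("lemma", "stem", "root")

# Hypothetical origins of a few word forms.
ORIGINS = {
    "malik":  {"lemma": "malik", "stem": "malik", "root": "mlk"},
    "maliki": {"lemma": "malik", "stem": "malik", "root": "mlk"},
    "mulk":   {"lemma": "mulk",  "stem": "mulk",  "root": "mlk"},
    "kitAb":  {"lemma": "kitAb", "stem": "kitAb", "root": "ktb"},
}

def parse_derivation_query(query):
    """Split '>>WORD' into (level name, word)."""
    query = query.strip()
    depth = len(query) - len(query.lstrip(">"))
    word = query[depth:].strip()
    if not 1 <= depth <= len(LEVELS) or not word:
        raise ValueError("expected >WORD, >>WORD or >>>WORD")
    return LEVELS[depth - 1], word

def same_origin(query, origins):
    """Return all words sharing the origin of WORD at the requested level."""
    level, word = parse_derivation_query(query)
    if word not in origins:
        return set()
    target = origins[word][level]
    return {w for w, levels in origins.items() if levels[level] == target}

by_lemma = same_origin(">malik", ORIGINS)
by_root = same_origin(">>>malik", ORIGINS)
```

The deeper the level, the larger and semantically looser the returned set, which matches the remark that roots generate far more variants than stems.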

5.5.4 Specific Derivations

This operation is quite similar to the previous one in its objective. However, it finds the words that result from applying a specific derivation operation to the word given by the user. Among the possible derivations, we mention:
• Conjugation of verbs in different tenses: Perfect, Imperfect, Imperative.
• Conjugation of verbs with different pronouns:
  ◦ Person: First person, Second person, Third person;
  ◦ Number: Singular, Dual, Plural;
  ◦ Gender: Masculine, Feminine.
• Conjugation of verbs in different voices: Active, Passive.
• Genders and Plurals of a noun: Masculine singular, Feminine singular, Masculine Dual (مثنى), Feminine Dual (مثنى), Masculine Plural, Feminine Plural, Broken Plural, Plural of Plural.
• Deverbals of a root: Active participle (اسم فاعل), Passive participle (اسم مفعول), Nouns of time and place (أسماء الزمان والمكان), Noun of instrument (اسم الآلة), The Nomen Vicis (اسم المرة), The Nomen Speciei (اسم الهيئة).
• Other derivations: the forms of exaggeration, the Comparative and Superlative Noun.

The user should enter the keyword and specify which derivation is sought. We generate the set of derived words either by fetching them from the word index or by using linguistic tools such as verb conjugators. In the second case, the generated set can be filtered in a second step by intersecting it with the set of Quranic words. The resulting set is then used to locate the corresponding ayahs. We propose the query syntax OPERATION(WORD) to search for the set of words generated from the word WORD by applying the operation OPERATION.
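The generate-then-filter variant of this operation can be sketched as below. The `imperative` function is only a stand-in for a real verb conjugator, `QURANIC_WORDS` is a hypothetical transliterated subset, and the operation name used in the query is an assumption.

```python
import re

# Hypothetical subset of the Quranic vocabulary (transliterated).
QURANIC_WORDS = {"qul", "quwluwA", "quwliy", "qAla", "qiyla"}

def imperative(word):
    """Stand-in for a verb conjugator: candidate imperative forms of a verb."""
    table = {"qAla": {"qul", "quwlA", "quwluwA", "quwliy", "qulna"}}
    return table.get(word, set())

# Registry of available derivation operations.
OPERATIONS = {"imperative": imperative}

def specific_derivation(query):
    """Evaluate 'OPERATION(WORD)' and keep only forms attested in the Quran."""
    match = re.fullmatch(r"\s*(\w+)\((.+)\)\s*", query)
    if match is None:
        raise ValueError("expected OPERATION(WORD)")
    operation, word = match.group(1).lower(), match.group(2).strip()
    if operation not in OPERATIONS:
        raise ValueError("unknown operation: " + operation)
    candidates = OPERATIONS[operation](word)   # step 1: generate
    return candidates & QURANIC_WORDS          # step 2: intersect

attested = specific_derivation("imperative(qAla)")
```

The intersection step discards generated forms that never occur in the text, so the ayah lookup only receives useful keywords.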

Figure 5.24: Special derivations example, Imperative of قال (to say)

5.5.5 Fuzzy Search

This operation replaces the Considering/Ignoring spell errors operation in the previous work (see 5.2.2). Its objective is to search using the set of words that are nearly similar to the input word in writing or pronunciation. It is usually useful for guessing the right spelling of a misspelled word. There are many methods to implement fuzzy search: some are designed for search against previously unknown text, such as the Levenshtein distance method, and others are not, such as ngrams and spell-checker methods. Since we are working on a previously known text, both types are feasible. Nevertheless, methods such as Levenshtein distance and ngrams do not handle Arabic vowels (حركات) well. They treat the vowels the same as letters, which leads to a significant decrease in efficiency. To overcome this weakness, those methods should consider any letter followed by a vowel as one unit. In Arabic, the vocalized letter is considered as one letter, whilst in the computer it is still stored as two characters. On the other hand, Lam-alef (لا), which is actually two letters, is still considered as one character in many computer writing systems. Any fuzzy search algorithm should consider some specific similarities in Arabic:

• The similarity between the different forms of each letter: Hamza, Ta’, and Alef. For example, مءصدة (Hamza on the line) and مؤصدة (Hamza on waw).
• An unvocalized letter is very similar to a vocalized one, whatever the vowel is. For example, الحمد (unvocalized) and الحَمْدُ (vocalized).
• A vowel with Tanwin is roughly similar to the same vowel without Tanwin. For example, عشرِ (kessra) and عشرٍ (tanwin kessra).
• A letter with Shedda is very similar to the same letter doubled. For example, يضلّه (Shedda on Lam) and يضلله (doubled Lam).
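One simple way to respect these similarities is to reduce both words to a comparison key before measuring the edit distance. The sketch below is an illustration under assumptions, not the implemented algorithm: it drops vowels and tanwin, expands Shedda into a doubled letter, and folds the hamza carriers into a bare hamza (a coarse choice that also affects the alef-carried forms). The names `fuzzy_key` and `levenshtein` are ours.

```python
import re

SHADDA = "\u0651"
# Tanwin, short vowels and sukun (U+064B..U+0650, U+0652); Shedda is handled apart.
VOWELS = re.compile("[\u064B-\u0650\u0652]")
# Hamza on alef, on waw, under alef and on ya are all folded to the bare hamza.
HAMZA_FORMS = str.maketrans({"\u0623": "\u0621", "\u0624": "\u0621",
                             "\u0625": "\u0621", "\u0626": "\u0621"})

def fuzzy_key(word):
    """Reduce a word to a form in which the listed similarities become equalities."""
    word = VOWELS.sub("", word)                    # vocalized ~ unvocalized
    word = re.sub("(.)" + SHADDA, r"\1\1", word)   # Shedda ~ doubled letter
    return word.translate(HAMZA_FORMS)             # hamza forms ~ each other

def levenshtein(a, b):
    """Classic edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]

def fuzzy_distance(a, b):
    return levenshtein(fuzzy_key(a), fuzzy_key(b))
```

Because the key is computed once per word, it can be stored in the word index, so that only the query word needs to be reduced at search time.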

Fuzzy search algorithms could be strengthened by coupling them with a phonetic algorithm. A phonetic algorithm maps two different words with similar pronunciation to the same code, which allows comparison and indexing of word sets based on phonetic similarity. It is usually used for proper names where there is no unified spelling of the word. One of the first such algorithms was Soundex, invented in the 1910s by Robert Russell. Its working principle is based on partitioning the consonants into numbered groups, whose numbers are then combined into the resulting code. Generally, the fuzzy search should be implemented to work automatically, either always or only when there is a lack of results. Another option is to show the possibilities as suggestions. Here are some examples of fuzzy search:

1. Mis-order of letters: زنبجيل for زنجبيل.

2. Phonetic similarity: هرم for إرم.

3. Spelling similarity: الضحي for الضحى.

5.6 Conclusion

Building on our previous work, we proposed fixes for what was already done, as well as new improvements. The improvements aim at a fully vocalized search engine customized to offer most of the features mentioned in the previous chapter. We proposed a customized text processing that fits the properties of the Quranic text. We introduced searching with the word as a unit in order to achieve a set of search features that require the manipulation of word sets. In the next chapter, we’ll talk about implementation details.

Part III Implementation


Chapter 6 Implementation

"Talk is cheap. Show me the code." (Linus Torvalds)

6.1 Introduction

Our aim in the implementation is to offer an open application programming interface (API). That API must be extensible enough to achieve most of the search features we’ve discussed in the previous chapter. We’ll talk about the technical details of our implementation. We also explain why we decided to open the source, and what advantages drove us to adopt such an approach.

6.2 Why Open Source?

One of the fundamental differences between open source software and proprietary software is that the source code of open source software must be made freely available with the software. Anyone should be able to download the source code, view it, and alter it as they see fit. A number of advantages led us to open source; the following points examine the most important of these [Web-Oss-watch].
• Collaborative bug-fixing: As a large number of users can access and change the code, bugs tend to be more visible and more rapidly corrected. One of the slogans of the open source movement is that ‘given enough eyeballs, all bugs are shallow’.
• Fast security vulnerabilities detection: Access to source code makes it easier and faster to detect security flaws in software, whether you are looking to fix them or exploit them. This would seem to suggest that open source code is less secure than closed source. However, this view is not universally supported. Given the number of programmers who can access and edit open source code, compared with the few that are entitled to access closed source code, it should not come as a surprise that flaws in open source software tend to be fixed more rapidly, before serious damage can be done.
• Customization: Open source applications may be customized by anyone with the requisite skill. Thus, open source software can be readily adapted to meet specific needs. For businesses or educational institutions, the ability to customize source code may enable improvements to the ‘best practice’ provided by default installations, therefore improving efficiency and possibly providing a competitive advantage.
• Translation & Localization: With access to the source code it is easy to translate the language of the software interface.
• Development discontinuation: With open source software, the danger of discontinuation is greatly reduced. As the source code is not ‘owned’ in the same way that proprietary source code is, it may be picked up and developed by anyone with an interest in a product’s survival.
• Learning from examples: Open source code provides an excellent resource from which to learn, and open source projects provide a practical environment in which to test skills. Just watching the development process can provide an education in itself.
• Being part of a community: By adopting open source software you become part of a community of users and developers who have an interest in working together to support each other and improve the software. The extent to which you engage with this community is up to you, but you may obtain the intangible benefits of goodwill if you do.
• Cost: Many open source programs can be obtained at no cost or at a very low cost. This is often an important issue for individuals and in many cases this has been the main reason for an individual adopting a particular open source solution over a closed source alternative.

6.2.1 License: AGPL

For licensing our API, we chose one of the GNU public licenses, the GNU Affero General Public License (AGPL). The GNU Affero General Public License is a modified version of the ordinary GNU GPL version 3. It has one added requirement: if you run the program on a server and let other users communicate with it there, your server must also allow them to download the source code corresponding to the program that it’s running. If what’s running there is your modified version of the program, the server’s users must get the source code as you modified it. The purpose of the GNU Affero GPL is to prevent a problem that affects developers of free programs that are often used on servers. Since our API belongs to this type of software, we chose this license.

6.2.2 Python

Python is a remarkably powerful dynamic programming language that is used in a wide variety of application domains. Python is often compared to Tcl, Perl, Ruby, Scheme or Java. Some of its key distinguishing features include:
• Python is powerful and fast: Fans of Python use the phrase "batteries included" to describe the standard library, which covers everything from asynchronous processing to zip files. The language itself is a flexible powerhouse that can handle practically any problem domain. It’s possible in Python to build a web server in three lines of code. Python lets users write the code they need, quickly.
• Python plays well with others: Python can integrate with COM, .NET, and CORBA objects.
  ◦ For Java libraries, developers can use Jython, an implementation of Python for the Java Virtual Machine.
  ◦ For .NET, developers can use IronPython, Microsoft’s new implementation of Python for .NET, or Python for .NET.
• Python runs everywhere: Python is available for all major operating systems: Windows, Linux/Unix, OS/2, Mac, Amiga, among others. There are even versions that run on .NET, the Java virtual machine, and Nokia Series 60 cell phones. The same source code will run unchanged across all implementations.
• Python is friendly and easy to learn: Python comes with complete documentation, both integrated into the language and as separate web pages. Online tutorials target both the seasoned programmer and the newcomer. All are designed to make developers productive quickly. The availability of first-rate books completes the learning package.
• Python is Open: The Python implementation is under an open source license that makes it freely usable and distributable, even for commercial use. The Python license is administered by the Python Software Foundation [Web-Python].

6.2.3 Whoosh Search API

Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python. Programmers can use it to easily add search functionality to their applications and websites. Every part of how Whoosh works can be extended or replaced to meet your needs exactly. Some of Whoosh’s features include:
• Pythonic API.
• Pure-Python. No compilation or binary packages needed, no mysterious crashes.
• Fielded indexing and search.
• Fast indexing and retrieval, faster than any other pure-Python search solution I know of. See Benchmarks.
• Pluggable scoring algorithm (including BM25F), text analysis, storage, posting format, etc.
• Powerful query language.
• Pure Python spell-checker.

Whoosh might be useful in the following circumstances:
• Anywhere a pure-Python solution is desirable to avoid having to build/compile native libraries (or force users to build/compile them).
• As a research platform (at least for programmers that find Python easier to read and work with than Java ;)
• When an easy-to-use Pythonic interface is more important to you than raw speed.
• If your application can make good use of one deeply integrated search/lookup solution you can rely on just being there rather than having two different search solutions (a simple/slow/homegrown one integrated, an indexed/fast/external binary dependency one as an option).

Whoosh was created and is maintained by Matt Chaput. It was originally created for use in the online help system of Side Effects Software’s 3D animation software Houdini.

6.3 Previous Code Base

We built on the code base that we had developed in [Dahmani2010]. The code base was licensed under the GPL open license, as a result of using the Whoosh library licensed under the same license. It contained an Application Programming Interface that performs the basic search operations. It returned the results in a raw HTML format and could be used only from Python; using it from another programming language required writing a complicated wrapper. There was a basic resource manager that indexed the Quran information stored in an intermediate database. The importing of new resources into this intermediate database was a missing piece. Two interfaces were developed for the API:

1. A GUI desktop interface, which demonstrates the API features well. It was hastily coded in Python and Qt.

Figure 6.1: Screenshot of the Qt desktop interface

2. A basic CGI-HTML interface, which demonstrates the API features in a simpler way. It was also hastily coded and did not respect the Model View Control pattern.

The API contained many implemented search features, including:

1. Search by exact word: the simple search, e.g. فأسقيناكموه.

2. Search by phrase: for searching a whole phrase rather than independent words, e.g. "رسول الله".

3. Logical relations:
(a) Conjunction: for searching only the ayahs that contain two terms or more, e.g. الصلاة + الزكاة.
(b) Disjunction (default): for searching all the ayahs that contain one of two terms or more, e.g. الصلاة | الزكاة.
(c) Exception: for eliminating a term from the search results, e.g. الصلاة - الزكاة. You can understand it as "Ayahs that contain الصلاة but don’t contain الزكاة".

4. Wildcards (Jokers): for searching all words that share many letters, we have:
(a) Asterisk: replaces zero or many undefined letters, e.g. *نبي*.
(b) Interrogation mark: replaces one undefined letter, e.g. نعم؟.

5. Fielded search: to search in more information of the Quran, not only the ayahs’ text. We cite here the most significant fields for users:
(a) aya_id or رقم_الآية (Aya local ID): the number of the ayah inside its sura; use it for example to search all first ayahs (aya_id:1).
(b) sura_id or رقم_السورة (Sura ID): use it with aya_id to specify an exact ayah; for example, the first ayah of surat An-Nas will be: aya_id:1 + sura_id:114.
(c) subject or موضوع (Topics): this field contains all topic information; it is helpful for searching for a topic, e.g. subject:الشيطان.

6. Intervals: this is helpful in statistics or positions, for example search the divine name only in the first surahs: aya_id:[1 to 5]

7. Partial vocalization: to consider the given diacritics and ignore the others, e.g. aya_:'مَن'.

8. Word Properties: to search using the root and type of words; the type could be اسم (noun), فعل (verb) or أداة (particle), e.g. {قول،اسم}.

9. Derivations
(a) light (using lemma): to search all the words having the same lemma as the given word, e.g. >ملك.
(b) heavy (using root): to search all the words having the same root as the given word, e.g. >>ملك.
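As an illustration of the wildcard (joker) feature listed above, the following sketch translates a wildcard pattern into a regular expression and applies it to a vocabulary. It is not the code of the previous code base; the function names and the sample vocabulary are assumptions, and each wildcard position matches one character, so a vocalized letter would count as two.

```python
import re

def wildcard_to_regex(pattern):
    """Translate '*' and '?' (Latin or Arabic question mark) into a regex."""
    parts = []
    for char in pattern:
        if char == "*":
            parts.append(".*")             # zero or many undefined letters
        elif char in "?\u061F":
            parts.append(".")              # exactly one undefined letter
        else:
            parts.append(re.escape(char))  # literal letter
    return re.compile("".join(parts))

def match_words(pattern, vocabulary):
    """Return the vocabulary words that match the whole wildcard pattern."""
    regex = wildcard_to_regex(pattern)
    return [word for word in vocabulary if regex.fullmatch(word)]

# Hypothetical unvocalized vocabulary.
VOCABULARY = ["نبي", "النبي", "نبيا", "نعمة", "نعمت", "كتاب"]
around_nabi = match_words("*نبي*", VOCABULARY)
one_letter = match_words("نعم\u061F", VOCABULARY)
```

The expanded word list can then be combined as a disjunction to retrieve the matching ayahs.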

6.4 Our improvements

As a first step, we changed the license to the GNU Affero General Public License, which aims to protect libraries that are usually used on servers (for more details, see 6.2.1). The code base has had 981 commits representing 15,243 lines of code. It is mostly written in Python, with well-commented source code. It is a young but established code base that took an estimated 4 years of effort according to the COCOMO model (statistical analysis from the Ohloh website). These are the main milestones that we went through:

6.4.1 A New Centralized JSON Output System:

First of all, we upgraded the output system of the API to offer the results in the JSON data format instead of raw HTML. Then we centralized the output system to gain several advantages:
1. Easy implementation of more data structures;
2. The use of the same output system in the different interfaces: Console Interface, CGI Interface, Desktop Interfaces, Web Interfaces;
3. Making changes and updates in only one place instead of many.

Secondly, we proposed an extended structure for the results. This structure is extensible: any new information can be included easily without affecting the old structure and without breaking backward compatibility.

Figure 6.2: JSON results of الكوثر

Thirdly, we refined the list of request flags to allow the requester to customize the request and control which information is retrieved, in order to gain performance and minimize both the size of the results and the retrieval time. A concept of predefined configurations for requests is introduced, so the requester can choose a minimal, normal, full, linguistic, or statistical view without specifying each flag value.

#    Flag               Description                                 Default value  Accepted values
1    action             action to perform                           "search"       search | suggest | show
2    unit               search unit                                 "aya"          aya | word | translation
3    platform           platform used by requester                  "undefined"    undefined | wp7 | s60 | android | ios | linux | windows
4    domain             web domain of requester if applicable       "undefined"    *
5a   query              query attached to action                    ""             *
5b   query              query attached to action                    ""             all | translations | chapters | flags | fields ... etc.
6    highlight          highlight method                            "css"          css | html | genshi | bold | bbcode
7    script             script of aya text                          "standard"     standard | uthmani
8    vocalized          enable vocalization of aya text             "True"         True | False
9    recitation         recitation id                               "1"            1 to 30
10   translation        translation id                              "none"         *
11   romanization       type of romanization                        "none"         none | buckwalter | iso | arabtex
12a  view               pre-defined configuration for view          "custom"       minimal | normal | full | statistic | linguistic | custom
12b  view               pre-defined configuration for view          "custom"       minimal | normal | full | custom
13   prev_aya           enable previous aya retrieving              "False"        True | False
14   next_aya           enable next aya retrieving                  "False"        True | False
15   sura_info          enable sura information retrieving          "True"         True | False
16   sura_stat_info     enable sura stats retrieving                "False"        True | False
17   word_info          enable word information retrieving          "True"         True | False
18   aya_position_info  enable aya position information retrieving  "True"         True | False
19   aya_theme_info     enable aya theme information retrieving     "True"         True | False
20   aya_stat_info      enable aya stat information retrieving      "True"         True | False
21   aya_sajda_info     enable aya sajda information retrieving     "True"         True | False
22   annotation_word    enable query terms annotations retrieving   "False"        True | False
23   annotation_aya     enable aya words annotations retrieving     "False"        True | False
24   sortedby           sorting order of results                    "score"        total | score | mushaf | tanzil | subject
25   offset             starting offset of results                  "1"            1 to 6236
26   range              range of results                            "10"           1 to 25
27   page               page number [override offset]               "1"            1 to 6236
28   perpage            results per page [override range]           "10"           1 to 25
29   fuzzy              fuzzy search [experimental]                 "False"        True | False
30   aya                enable retrieving of aya text               "True"         True | False

Related action (non-empty cells in row order; empty cells are not marked, so the row alignment is not preserved): search, suggest; search, suggest; show; then search for each remaining row that has an entry.

Related unit (non-empty cells in row order, same caveat): aya, word; aya, word; aya; aya, trans; aya, word; aya, word; translation; aya; aya; aya; aya; aya, word; aya; aya; aya; aya; aya, word; aya; aya; aya; word, trans.

Table 6.1: Search request flags
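The interplay between default flag values, predefined views and explicit flags can be sketched as follows. The defaults are copied from Table 6.1 for a subset of flags (with native Python types instead of quoted strings), while the contents of the `VIEWS` presets and the function name `build_request` are hypothetical, since the text names the views without listing what each one enables.

```python
# Defaults for a subset of the flags in Table 6.1.
DEFAULT_FLAGS = {
    "action": "search", "unit": "aya", "query": "", "highlight": "css",
    "script": "standard", "vocalized": True, "view": "custom",
    "prev_aya": False, "next_aya": False, "sura_info": True,
    "word_info": True, "sortedby": "score", "page": 1, "perpage": 10,
    "fuzzy": False, "aya": True,
}

# Hypothetical presets: which flags each predefined view switches.
VIEWS = {
    "minimal": {"sura_info": False, "word_info": False, "vocalized": False},
    "full": {"prev_aya": True, "next_aya": True},
}

def build_request(**flags):
    """Merge defaults, then the chosen view preset, then explicit flags."""
    unknown = set(flags) - set(DEFAULT_FLAGS)
    if unknown:
        raise ValueError("unknown flags: " + ", ".join(sorted(unknown)))
    request = dict(DEFAULT_FLAGS)
    request.update(VIEWS.get(flags.get("view", "custom"), {}))  # preset first
    request.update(flags)                                       # explicit flags win
    return request

request = build_request(query="qwl", view="minimal", perpage=25)
```

Applying the preset before the explicit flags lets a requester start from a view and still override single flags.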

We also centralized the loading of the necessary data files. Data files are now loaded automatically from a default path unless this path is redefined. We included the data files in the API package to maintain portability. We also exported them into the JSON data format as a standardization step. By data files, we mean files such as:
1. Configuration files: Available recitations list, Available translations list, Usage hints, Usage statistics, meta information.
2. Linguistic resources: derivation list, vocalization list, synonym list, standard-into-uthmani mapping, stop word list, word properties.
3. Quranic indexes: Ayah document index, Translation document index, Word document index.

A feature for calculating usage statistics is implemented. The statistics are saved in a JSON format and include the usage of each flag passed through the search request.

We offered a set of error codes for failed search queries, which helps in clarifying the nature of the failure. This set is extensible: new error codes can be added easily. Currently we define these errors:
1. Error -1: "fail, reason unknown";
2. Error 0: "success";
3. Error 1: "no action is chosen or action undefined";
4. Error 2: "super jokers are not permitted".

We made the API meta data available for request in the output system. Meta data includes:
1. API meta information including authorship, license, description, version, and release;
2. Error messages;
3. Possible flags, in addition to their default values, possible value intervals, and help messages, see table 6.1;
4. Available search fields (Arabic and English names), see table ;
5. Available translations and recitations;
6. Surah and topic values lists;
7. Global search usage hints.
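The error codes above could be attached to every JSON response as in the sketch below. The error messages are the ones listed in the text; the envelope layout (the keys "error", "code", "msg" and "search") and the function name `make_output` are assumptions made for this example.

```python
import json

# Error codes and messages as listed in the text.
ERRORS = {
    -1: "fail, reason unknown",
    0: "success",
    1: "no action is chosen or action undefined",
    2: "super jokers are not permitted",
}

def make_output(code, payload=None):
    """Wrap a result payload and an error code into one JSON document."""
    if code not in ERRORS:
        code = -1                      # unknown failures fall back to -1
    envelope = {
        "error": {"code": code, "msg": ERRORS[code]},
        "search": payload or {},
    }
    return json.dumps(envelope, ensure_ascii=False, sort_keys=True)

ok = make_output(0, {"ayas": {"1": {"aya": {"id": 1}}}})
rejected = make_output(2)
```

Since every interface reads the same envelope, a new error code only needs to be added to the `ERRORS` table.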

6.4.2 Many new features

We implemented many new features, which are available through the new JSON output system. One of them is fuzzy search. We made a basic implementation that automatically extends the search to the potential synonyms of the word and to its different derivations.

Figure 6.3: Fuzzy search example
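The expansion performed by the fuzzy search can be pictured as follows. This is a minimal sketch in Python 3 under stated assumptions: the two dictionaries hold toy entries standing in for the synonym and derivation lists of the data files, the function names are hypothetical, and the "OR" syntax of the resulting query string is assumed for illustration.

```python
# Toy linguistic resources (illustrative entries only); the real lists are
# loaded from the synonym and derivation data files.
SYNONYMS = {"قول": ["كلام"]}
DERIVATIONS = {"قول": ["قال", "يقول", "قالوا"]}


def expand_fuzzy(keyword, synonyms=SYNONYMS, derivations=DERIVATIONS):
    """Return the keyword followed by its derivations and synonyms, without duplicates."""
    expanded = [keyword]
    for candidate in derivations.get(keyword, []) + synonyms.get(keyword, []):
        if candidate not in expanded:
            expanded.append(candidate)
    return expanded


def to_or_query(terms):
    """Join the expanded terms into one disjunctive query string (assumed syntax)."""
    return " OR ".join(terms)
```

A keyword that appears in neither list is returned unchanged, so the fuzzy search degrades to an exact search.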

We added the possibility of retrieving the previous and next ayahs of each ayah of the results. Showing neighbor ayahs helps the user to know exactly the ayah position in its own surah.


Figure 6.4: Showing adjacent ayahs

For the ayah text, we offered the ability to choose between Vocalized and Unvocalized text, and between Standard and Uthmani text.

Figure 6.5: Showing ayahs in different scripts

We implemented four suggestion operations: Close spellings, Different vocalizations, Different derivations, Synonyms.

Figure 6.6: Suggestion example of vocalizations, derivations, and synonyms of قول

We also include the linguistic annotations of each query keyword in the API output. We took those annotations from the Quranic Arabic Corpus.

Figure 6.7: Annotations of the keyword قيمة

We implemented the possibility of searching with the Buckwalter transliteration and of showing the resulting keywords in the same transliteration.

Figure 6.8: Buckwalter transliteration example

6.4.3 Resource Importing Manager

An important part was missing in the old implementation of the resource importing manager: importing the data from their original sources. We added this missing part and improved the overall behavior of the resource manager. The improved version performs the following steps:

1. Downloading the original resources: we download the resources rather than including them in the project in order to avoid the restrictions on their redistribution. Another reason is to always use the latest versions of the resources. A third reason is to avoid bundling large optional resources, so that the developer or the user chooses what to download. The resources to be downloaded are:
(a) Tanzil Mus-haf XML file.
(b) Tanzil Quranic translations packaged as the Zekr model.
(c) EveryAyah recitations list.

2. Parsing and importing the data into our intermediate database: each resource has its own structure, so we need to parse each resource and then import its data into our intermediate database. The role of the intermediate database is to reorganize the data in a way that makes the indexing easier. We wrote two libraries:
(a) PyCorpus: a library to parse and read the data of the Quranic Arabic Corpus project.
(b) PyZekrModels: a library to read the Quranic translations packaged as the Zekr model.

3. Indexing the database: first, we transform the database into document indexes. Then, we generate the inverted indexes from the corresponding document indexes. A table called fields in the database controls which information is indexed and how it is indexed. This is a sample of the fields table:

Figure 6.9: Fields table

4. Updating auto-generated data files: some data files need to be generated immediately after the indexing. An example is the list of indexed translations.
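The relation between a document index and its inverted index (step 3 above) can be illustrated with plain dictionaries. This is a simplified sketch in Python 3 and not the project's code: the project relies on the Whoosh library for indexing, whereas here the inverted index is built by hand, and the column names "name" and "indexed" of the fields table are assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(document_index, fields_table):
    """Build {field: {term: [document ids]}} for the fields marked as indexed.

    document_index maps a document id to a {field name: text} dictionary;
    fields_table is a list of rows with the assumed columns "name" and "indexed".
    """
    indexed = [row["name"] for row in fields_table if row["indexed"]]
    postings = {name: defaultdict(set) for name in indexed}
    for doc_id, document in document_index.items():
        for name in indexed:
            # Whitespace tokenization only; real analysis would also normalize and stem.
            for term in document.get(name, "").split():
                postings[name][term].add(doc_id)
    return {
        name: {term: sorted(ids) for term, ids in terms.items()}
        for name, terms in postings.items()
    }
```

Because the list of indexed fields is read from the table, changing what gets indexed only requires editing the table, which is the role given to the fields table above.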

6.4.4 Automating the API building

The building of the resources used to be manual and undocumented. We created a Makefile that now automates the build, and we commented it thoroughly to make it clear and understandable. The Makefile performs the following tasks:

1. Automatically downloading new versions of the Quranic and linguistic resources used;
2. Generating and regenerating the intermediate data files;
3. Automatically building the document indexes as well as the inverted indexes;
4. Automatically installing and packaging the API and the related extensions and interfaces.

6.4.5 A new console interface

We introduced a new console interface intended for non-Python desktop applications. It can also be run as a background service (daemon). For details see 6.5.1.2.

6.4.6 Enhancing the web interface

We improved the HTML web interface in two steps. The first step turned it into an HTML/jQuery web interface that renders the search results on the client side. The second step turned it into a Django web interface that renders the search results on the server side and offers many extra features. For more details see ?? .

6.4.7 Packaging system

We implemented a packaging system to package the API for different operating systems. The API is pure Python, which makes it portable across platforms. However, many data resources have to be generated and included in the package. We distribute the API as:

1. Source tarball: an archive of source files created with the Unix tar utility. Source-code distributions have been packaged as tarballs since the mid 1980s.
2. Binary tarball: an archive that contains the built and compiled code.
3. Python egg package: a logical structure embodying the release of a specific version of a Python project, comprising its code, resources, and metadata.
4. Debian deb package: Debian packages are standard Unix archives that include two tar archives, optionally compressed: one holds the control information and the other contains the program data.
5. Red Hat rpm package: the acronym RPM stands for Red Hat Package Manager, which is a package management system. It also refers to the file format used by the packages of this system.

6.4.8 Multiple search units

The Application Programming Interface was initially designed to search only in the ayahs. We have now introduced more search units. To introduce a unit, we have to gather the appropriate data resources, define the search fields, define the search request flags, and finally define the structure of the results. We introduced two units:

1. Translations: enables the search in Quranic translations in different languages. The information included is: identifier, text, language, author.

Figure 6.10: Translation-as-unit search, Query: seven

2. Words: enables the search for Quranic words and their linguistic annotations, including: word identifiers, origins (lemma, stem, root), POS, form, gender, person, number, voice, aspect, state.

Figure 6.11: Word-as-unit JSON output, Query: قيمة

6.4.9 Coding Standardization

We ran the Pylint code analysis on our code base. Pylint is a source code bug and quality checker for the Python programming language. It is highly configurable and can be customized as needed. Its features include:

• Checking the length of each line of code;
• Checking that variable names are well-formed according to the coding standard;
• Checking that declared interfaces are truly implemented, and so on.

We fixed a large number of the messages reported by Pylint, in all categories: conventions, refactors, warnings, and errors. As a consequence, the source code was extensively reorganized.

type          current number   previous number   difference
convention    1059             9217              -8242
refactor      81               310               -229
warning       1549             12331             -10782
error         79               412               -333

Table 6.2: Pylint Analysis stats

6.4.10 Documentation coverage

We wrote readme files for the different parts of the project. The contents of a readme file typically include one or more of the following:

1. Configuration instructions
2. Installation instructions
3. Operating instructions
4. A file manifest (list of files included)
5. Copyright and licensing information
6. Contact information for the distributor or programmer
7. Known bugs
8. Troubleshooting
9. Credits and acknowledgments
10. A changelog (usually for programmers)
11. A news section (usually for users)

The expression "readme file" is also sometimes used generically for files that are not named "readme" but serve the same purpose. The source code distributions of many free software packages, especially those following the Gnits Standards, usually include a standard set of readme files:

README              General information
AUTHORS             Credits
THANKS              Acknowledgments
ChangeLog           A detailed changelog, intended for programmers
NEWS                A basic changelog, intended for users
INSTALL             Installation instructions
COPYING / LICENSE   Copyright and licensing information
BUGS                Known bugs and instructions on reporting new ones

Other files commonly distributed with software include a FAQ and a TODO file listing possible future changes.

6.4.11 Open Issues

We made many improvements, yet there are still a lot of things to be done. The main milestones ahead are:

1. Enriching the linguistic resources: the resources currently used are poor compared to what we really need. To get this done, we should:
• Convert the binary database to a text format, to make it possible to log changes and to benefit from revision control systems such as Git.
• Integrate the Qurany project to enrich the current faceted thematic search.
• Integrate the boundary annotations to enable the retrieval of boundaries in the Quran.
• Propose a standard format for new linguistic and Quranic resources.

2. Completing the implementation of the features: we listed a lot of search features in a previous chapter (see 4.4), but we did not implement the whole list. The next table gives the implementation state of each feature:

Classes (row groups of the table, from top to bottom): Advanced Query, Output Improvements, Suggestion Systems, Linguistic Aspects, Qur’anic Options, Semantic Queries, Statistical System.

Feature                                       Implementation State
Fielded search                                Yes
Logical relations                             Yes
Phrase search                                 Yes
Interval search                               Yes
Full Regex                                    No
Wildcards (Jokers)                            Partially
Boosting keywords weight                      Yes
Pagination                                    Yes
Scoring                                       Yes
Sorting                                       Yes
Keywords Highlight                            Yes
Real time output                              No
Results grouping                              No
Uthmani script with full diacritical marks    Yes
Vocalized spell correction                    Partially
Semantically related keywords                 Partially
Different vocalization                        Yes
Collocated words                              No
Keyboard mapping                              No
Different significations                      No
Romanization                                  Partially
Syntactic Coloration                          No
Vocal Search                                  No
Partial vocalization search                   Partially
Multi-level derivation                        Yes
Specific-derivations                          No
Word properties embedded query                Partially
Fuzzy string search                           Partially
Word linguistic annotations                   Partially
Linguistic examples search                    No
Uthmani writing way                           No
Structural options                            Yes
Recitation marks retrieving                   No (Sajda only)
Divine Names Highlight                        No
Translation search                            Yes
Repetitions and Allegorical ayahs             No
Abrogators and Abrogated ayahs search         No
Qur’anic Parables                             No
Semantically related words                    Partially
Faceted Thematic Search                       Partially
Questions Answering (QA)                      No
Automatic vocalization                        No
Entity Extraction                             No
Co-reference resolution                       No
Unvocalized word frequency                    Yes
Vocalized word frequency                      Yes
Root/Stem/Lemma frequency                     No
Another Qur’anic units frequency              No

Table 6.3: Implementation State of search features

3. Implement modularity in the Query Parser: this is important to make the parser extensible and to fix the problem of mixing the different operations performed during parsing.

4. Restrict the anonymous requests to the API: restricting requests protects the API from flooding, whether intended or not. This can be done by:
• Limiting the maximum number of simultaneous requests, globally and per IP.
• Implementing an identification system that works with remote clients.

5. Move to the latest version of the Whoosh library: Whoosh is approaching version 3.X in its stable release, while we are still using an older version, 0.3. Moving to the latest version is highly recommended in order to benefit from the improvements made. However, it will not be an easy operation, since our API is intertwined with the older version, especially in the Query Parser.

6. Move to Python 3.X: Python 2 is disappearing and sooner or later it will be fully replaced. Many tools offer scripts that convert code from Python 2 to Python 3 automatically, but a large part of the work often has to be done manually.

7. Cover the project with documentation: documentation is important. It is expensive, but it encourages the community to get involved in the project. This can be done by:
• Enriching the readme files;
• Enriching the code with appropriate comments;
• Creating a usage How-To and strengthening it with many demos;
• Writing the man page for the console interface.

8. Optimize code and performance: continue fixing the Pylint code analysis warnings, and use Profile to check the performance of each search feature in order to improve it.
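The limit on simultaneous requests proposed in the list above (under "Restrict the anonymous requests to the API") could take the following form. This is a minimal sketch in Python 3 of one possible approach, a thread-safe counter per client IP plus a global counter; the class name, method names and parameters are hypothetical, since this feature is still an open issue.

```python
import threading
from collections import Counter


class RequestLimiter:
    """Cap the number of simultaneous requests, globally and per client IP."""

    def __init__(self, max_global, max_per_ip):
        self._max_global = max_global
        self._max_per_ip = max_per_ip
        self._active = Counter()   # running requests per IP
        self._total = 0            # running requests overall
        self._lock = threading.Lock()

    def try_acquire(self, ip):
        """Reserve a slot for this IP; return False if either limit is reached."""
        with self._lock:
            if self._total >= self._max_global or self._active[ip] >= self._max_per_ip:
                return False
            self._active[ip] += 1
            self._total += 1
            return True

    def release(self, ip):
        """Free the slot once the request has been answered."""
        with self._lock:
            if self._active[ip] > 0:
                self._active[ip] -= 1
                self._total -= 1
                if self._active[ip] == 0:
                    del self._active[ip]
```

A web service would call try_acquire before running a search, answer with an error when it returns False, and call release in a finally clause so that failed searches also free their slot.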

6.5 Interfaces

The project works as a library offering many interfaces. The main one is the Application Programming Interface, or API. It is the intermediary between the library and the other interfaces. Two low-level interfaces work directly with the API:

1. Console interface, intended for testing and for use by third-party non-Python desktop interfaces.
2. JSON web service, intended for use by web interfaces, smart phone apps, and social network apps.

Figure 6.12: Interfaces dependency hierarchy

6.5.1 Application Programming Interface

An application programming interface (API) is a protocol intended to be used as an interface by software components to communicate with each other. An API is a library that may include specifications for routines, data structures, object classes, and variables. The strong points of our API are:

1. Free/libre and open source: anyone can use it and anyone can contribute to it, so it benefits from community involvement.
2. A Python API: it allows anyone to independently create a web interface, a desktop interface, Android/iPhone/Windows Phone interfaces, Facebook/Twitter/G+ applications, and so on.
3. A solid base: the search process is faster and more stable than in other websites and applications.
4. A lot of features: the current API has a large number of features and is prepared to accept more.

The next figure shows sample Python code that uses the API:

Figure 6.13: API usage sample code

6.5.1.1 JSON web service

To enable the use of our API over the web, we made a web service that wraps the input and output of the API. The request arguments are passed in the URL, and the output is generated and shown in JSON format. This service can be used by web interfaces, smart phone apps, social network apps, and browser add-ons.

Figure 6.14: Preview of the JSON web service
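The request and response cycle of such a web service can be sketched as follows in Python 3. Everything specific here is an assumption made for illustration: the base URL, the flag names "action" and "query" (the authoritative list of flags is Table 6.1), and the presence of an "error" object with "code" and "msg" keys in the JSON output. No network call is made; the sketch only builds the URL, reads the flags back as the service would, and checks a response.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit


def build_request_url(base_url, **flags):
    """Encode the request flags as URL arguments (sorted for reproducibility)."""
    return base_url + "?" + urlencode(sorted(flags.items()))


def parse_request_flags(url):
    """Service side: read the flags back from the URL, keeping one value per flag."""
    return {name: values[0] for name, values in parse_qs(urlsplit(url).query).items()}


def check_response(raw_json):
    """Client side: decode the JSON output and fail on a non-zero error code."""
    response = json.loads(raw_json)
    # Assumed layout: an "error" object; a missing one is treated as error -1.
    error = response.get("error", {"code": -1, "msg": "fail, reason unknown"})
    if error["code"] != 0:
        raise RuntimeError("request failed (%d): %s" % (error["code"], error["msg"]))
    return response
```

For example, build_request_url("http://localhost:8000/json", action="search", query="qwl") yields a URL whose arguments parse_request_flags turns back into the same two flags.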

6.5.1.2 Console interface

As a test interface, we made a console interface that works on the command line. This interface can also be used as a wrapper to build desktop interfaces developed in a programming language other than Python. The request is passed as in-line arguments on the command line, and the output is generated and shown in JSON format. High-activity desktop interfaces running on a Linux-like platform can run this interface as a daemon service in the background.

Figure 6.15: Preview of the Console interface
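The behavior described above (flags given as in-line arguments, result printed as JSON) could be obtained with Python's standard argparse and json modules. The sketch below is written for Python 3 and is only an illustration: the program name, the option names --action and --query, and the echoed result are assumptions, and the call into the actual search API is replaced by a placeholder.

```python
import argparse
import json


def run_console(argv):
    """Parse in-line arguments and return the JSON text that would be printed."""
    parser = argparse.ArgumentParser(prog="quran-console")  # hypothetical name
    parser.add_argument("--action", choices=["search", "suggest", "show"])
    parser.add_argument("--query", default="")
    args = parser.parse_args(argv)

    if args.action is None:
        # Mirrors error code 1 of the API.
        result = {"error": {"code": 1, "msg": "no action is chosen or action undefined"}}
    else:
        # Placeholder: a real interface would pass the flags to the API here.
        result = {"action": args.action, "query": args.query}
    return json.dumps(result, ensure_ascii=False, sort_keys=True)


if __name__ == "__main__":
    import sys

    print(run_console(sys.argv[1:]))
```

A desktop application written in another language can then start this script as a subprocess and parse its standard output as JSON.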

6.6 Conclusion

This chapter explained why we chose the open source approach. We presented the code base that we started from, and then the improvements that we made. We accomplished a lot, but there are still more improvements to be made and many issues to be resolved. We described them in the subsection on open issues and leave them as perspectives.

General Conclusion

Our proposal is to design a retrieval system that fits the search needs of the Qur’an. To reach this objective, we first had to list and classify all the search features that are possible and helpful. Then we needed a global design of a retrieval system as a foundation for achieving each search feature.

We started with three chapters on the state of the art. The first is about search engines and how they work. The second is about the Arabic language and its properties. The third is about the Quran and its properties, including its indexes and many search tools.

To clarify our proposal, we listed the helpful search features in the Quran. To make the listing easier, we classified those features according to the nature of the problem. That list will help us to build a detailed retrieval system that fits the needs of the Quran. We conducted a survey about the usefulness, usability and clarity of each feature and received very helpful feedback. Each feature has a different level of complexity: it may be easy to implement, or it may lead to a vast problem that needs a deeper study. A preliminary version of this list of search features was published as a paper in an LREC 2012 pre-conference workshop, "LRE-Rel: Language Resource and Evaluation for Religious Texts". The paper was entitled "Advanced Search in Quran: Classification and Proposition of All Possible Features" [Chelli2012].

For the design, we based our work on a previous one and explained what had already been done. Then we proposed many fixes and improvements to it.

The first improvement is a fully vocalized search engine customized to offer most of the features mentioned in the chapter. Full vocalization is a very important improvement for overcoming the ambiguities and the lack of accuracy that arise from ignoring the vowels (حركات).

The second improvement is a customized text processing that fits the properties of the Quranic text. We introduced the use of the Uthmani script side by side with the Standard script. We added a pre-tokenization phase that substitutes some patterns of the query with their appropriate correspondents. This is useful for implementing search features such as romanizations and numbers-as-words search. We suggested a change in the tokenization phase: adding a second operation right after the space-based tokenization. The role of this second operation is to separate the resulting tokens (the words) into sub-tokens (word parts) so that each part of the word receives its own processing. For normalization, we discussed the processing of the Uthmani text. We also discussed some changes to the way stop words are filtered and to the way lemmatization (stemming) is done.

As a third improvement, we considered the word as a new search unit in addition to the ayah, which remains the main unit. The objective is to achieve a set of search features that require the manipulation of word sets. The word-as-a-unit search helps directly or indirectly in many features:

• Word properties search
• Semantically-related words
• Multi-level derivations
• Specific derivations
• Fuzzy Search

We implemented many of the search features that we had listed. However, there are still more improvements to be made and many issues to be resolved. We leave them as perspectives:

• Achieving an accurate statistics gathering system;
• Implementing a more adequate suggestion system;
• Clearing the way toward a semantic search engine;
• Continuing the full design of all the search features cited in 4.4;
• Completing the implementation of all the open issues cited in 6.4.11.


Bibliography

[Jabri] Youssef Jabri. The arabi system - tex writes in arabic and farsi. Thesis, Ecole Nationale des Sciences Appliquées, Oujda, Morocco.

[Attia] Mohammed A. Attia. Arabic tokenization system.

[Mayaman] Sulaiman Ben Abdellah Mayaman. توظيف التقنيات الحاسوبية لإعداد فهارس هامة ومبتكرة لخدمة القرآن الكريم وعلومه. مجمع الملك فهد لطباعة المصحف الشريف.

[Khedar] Muhammed Zaki Muhammed Khedar. عرض أولي لمشروع مداد البيان في خدمة القرآن الكريم.

[BenJammaa] Muhammed BenJammaa. منهجية تعاونية لإنجاز موسوعة إلكترونية شاملة للقرآن الكريم وعلومه. مجمع الملك فهد لطباعة المصحف الشريف.

[Mustafa] El Hadi Widad Mustafa. Indexation humaine et indexation automatisee : la place du terme et de son environnement.

[Zarkashī1957] Zarkashī, M.B. al-Burhān fī ʿulūm al-Qurʾān (البرهان في علوم القرآن). دار احياء الكتب العربية, 1957.

[Luhn1957] Hans Peter Luhn. A Statistical Approach to Mechanized Encoding and Searching of Literary Information. IBM, International Business Machines Corp., 1957.

[Luhn1958] Hans Peter Luhn. The Automatic Creation of Literature Abstracts (auto-abstracts). IBM Research Center, International Business Machines Corporation, 1958.

[Salton1971] Salton, G. The SMART retrieval system: experiments in automatic document processing. Prentice-Hall series in automatic computation. Prentice-Hall, 1971.

[Mahssin1973] Muhammed Salem Mahssin. تاريخ القرآن الكريم. دار الأصفهاني للطباعة بجدة, 1973.

[Nawfal1975] Abderrezzaq Nawfal. الإعجاز العددي للقرآن الكريم. 1975.

[Gaudefroy1975] R. Blachère and M. Gaudefroy-Demombynes. Grammaire de l’arabe classique (morphologie et syntaxe). 1975.

[Bourne1979] C. Bourne and B. Anderson. In Lockheed Information Systems. 1979.

[Nasser1989] Kaddem Adel Nasser. قاموس المفردات المتضادة انجليزي عربي. 1989.

[Chartron1989] Ghislaine Chartron and Sylvie Dalbin and Marie-Gaelle Monteil and Monique Verillon. Indexation manuelle et indexation automatique. depasser les oppositions. 1989.

[Mālik1991] Mālik, M.A.A.I. and ʿAwwād, M.Ḥ. الألفاظ المختلفة في المعاني المؤتلفة. 1991.

[Lelu1992] A. Lelu and C. Francois. Information retrieval based on neural unsupervised extraction of thematic fuzzy clusters. 1992.

[Kadri1992] Y. Kadri and A. Benyamina. Système d’analyse syntaxico-semantique du langage arabe. Thesis, université d’Oran, Essénia, 1992.

[Gruber1993] Gruber, Thomas R. A translation approach to portable ontology specifications. Knowl. Acquis., 5(2):199–220, June 1993.

[Grefenstette1995] G. Grefenstette. Comparing two language identification schemes. 1995.

[Berrut1997] C. Berrut. Indexation des donnees multimedia, utilisation dans le cadre d’un systeme de recherche d’informations. Doctorate thesis, Université Joseph Fourier, 1997.

[Elhachani1997] Mabrouka Elhachani. L’indexation automatique. Thesis, Ecole Nationale Superieure des Sciences de l’Information et des Bibliotheques (ENSSIB), 1997.

[Denos1997] N. Denos. Modelisation de la pertinence en recherche d’information : modèle conceptuel, formalisation et application. Thesis, Université Joseph Fourier, 1997.

[Brin1998] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1–7):107–117, 1998. Proceedings of the Seventh International World Wide Web Conference.

[Hensens1998] Hanka Hensens and Orstom Montpellier. Traitement des fonds de Laboratoires : Indexation Principes. 1998.

[AlKharashi1999] Ibrahim AlKharashi. A web search engine for indexing, searching and publishing arabic bibliographic databases. 1999.

[Meylan2001] Ed Meylan. Introduction theorique a la gestion de donnees textuelles. Haute Ecole Specialisee de Suisse Occidentale, 2001.

[Grossman2002] Grossman and Frieder and Goharian. IR Basics of Inverted Index. 2002.

[Jalam2002] Radwan Jalam and Chauchat Jean Hugues. Pourquoi les n-grammes permettent de classer des textes? recherche de mots-clefs pertinents a l’aide des n-grammes caracteristiques. 2002.

[Gaussier2003] Gaussier, E. Assistance intelligente à la recherche d’informations. Traité des sciences et techniques de l’information. Hermes Science Publications, 2003.

[Lancaster2003] Lancaster, F.W. Indexing and abstracting in theory and practice. Facet, 2003.

[Larcher2003] P. Larcher. Le système verbal de l’arabe classique. Publications de l’Université de Provence, Collection Didactilangue, 2003.

[Denoyer2004] L. Denoyer. Apprentissage et inference statistique dans les bases de documents structures : Application aux corpus de documents textuels. Doctorate thesis, Université de Paris 6, France, 2004.

[Khmissi2004] Ahmed Hassan Khmissi. حركة التأليف المعجمي في مفردات القرآن. مجلة التراث العربي دمشق, 2004.

[Sharaf2004] Jamal Ad-Deen Muhammad Sharaf. مصحف الصحابة في القراءات العشر المتواترة من طريق الشاطبية والدرة. Dar As-Sahaba lil-Turath, 2004.

[Tang2004] Chunqiang Tang and Sandhya Dwarkadas. Hybrid global local indexing for efficient peer to peer information retrieval. 2004.

[Sauvagnat2005] Karen Sauvagnat. Modele flexible pour la Recherche d’Information dans des corpus de documents semi-structures. Doctorate thesis, Universite Paul Sabatier, Toulouse, France, 2005.

[Zerrouki2005] Taha Zerrouki. Un modèle de mushaf électronique. Magister thesis, ESI, 2005.

[Dahak2006] Fouad Dahak. Indexation des documents semi-structures : Proposition d’une approche basee sur le fichier inverse et le trie. Thesis, Institut National de Formation en Informatique (I.N.I), Algiers, 2006.

[Choueiter2006] Choueiter, G. and Povey, D. and Chen, S.F. and Zweig, G. Morpheme-based language modeling for arabic lvcsr. In Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on, volume 1, page I, May 2006.

[Nejjari2007] Youssef Nejjari. Recherche documentaire sur le web. 2007.

[Mesfar2008] Slim Mesfar. Analyse morpho-syntaxique automatique et reconnaissance des entites nommees en arabe standard. Thesis, Université de Franche-Comté, 2008.

[Hadjhenni2008] Mohamed Hadjhenni. Approche ontologique pour la modélisation sémantique, l’indexation et l’interrogation des documents coraniques. Magister thesis, ESI, 2008.

[bSbNaserEttayar2008] Musaid bin Sulaiman bin Naser Et-tayar. المحرر في علوم القرآن الكريم. مركز الدراسات والعلوم القرآنية بمعهد الإمام الشاطبي, 2008.

[Allab2008] Kamel Allab and Abdelkrim Doufene. Classification automatique de documents xml. 2008.

[Sanan2008] Majed Sanan. Etude des methodes de la recherche d’information et de l’indexation sur les documents electroniques : cas de la langue arabe. 2008.

[Mentre2008] Marc Mentre. Les moteurs de recherche semantique : un pas dans le web 3.0. 2008.

[Arvidsso2008] Fredrik Arvidsson and Annika Flycht-Eriksson. Ontologies. 2008.

[Amrouche2008] Karima Amrouche. Passage à l’echelle en recherche d’information : Methode d’elagage pour la reduction de l’espace de recherche. Thesis, Institut National de Formation en Informatique (I.N.I), Algiers, 2008.

[Albawwab2009] Merouane Albawwab. محركات البحث في النصوص العربية وصفحات الأنترنت. مجمع اللغة العربية دمشق, 2009.

[Bernard2009] Bernard, E. and Griffin, J. Hibernate Search in Action. In Action Series. Manning Publications, 2009.

[Manning2009] Christopher D. Manning and Prabhakar Raghavan and Hinrich Schütze. An Introduction to Information Retrieval. Cambridge University Press, 2009.

[Abar2009] Ali Abar and Mounir Boughanem. La capitalisation du profil des experts du ceneap sur la base de la classification des etudes. Thesis, Institut National de Formation en Informatique (I.N.I), Algiers, 2009.

[Abulhajjaj2009] M.B. Abulhajjaj. Semantic Web - the next revolution. 2009.

[Dahmani2010] Merouane Dahmani and Assem Chelli and Amar Balla and Taha Zerrouki. Développement d’un moteur de recherche et d’indexation pour les documents coraniques. Engineer thesis, Ecole Nationale Supérieure d’Informatique (ESI), Algiers, 2010.

[Dukes2010] Kais Dukes and Nizar Habash. Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Valletta, Malta, May 2010. European Language Resources Association (ELRA).

[Brierley2011] Claire Brierley. Prosody Resources and Symbolic Prosodic Features for Automated Phrase Break Prediction. Doctorate thesis, The University of Leeds, School of Computing, September 2011.

[Chelli2011] Assem Chelli and Merouane Dahmani and Amar Balla and Taha Zerrouki. مكتبة برمجية للفهرسة والبحث في القرآن الكريم. 2011.

[Chelli2012] Assem Chelli and Amar Balla and Taha Zerrouki. Advanced search in quran: Classification and proposition of all possible features. 2012.

[Brierley2012] Claire Brierley and Majdi Sawalha and Eric Atwell. Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey, May 2012. European Language Resources Association (ELRA).

[Habash05arabictokeni]

[Web-Corpus] Quranic arabic corpus. http://corpus.quran.com. 2009.

[Web-ESWC2012] Domain specific data retrieval on the semantic web. http://www.springerlink.com/content/k6022745421q1l5h/. 2012.

[Web-Islamweb] Punctuation symbols for waqf. http://www.islamweb.net/. 2011.

[Web-Midād lba] مداد البيان مشروع قاعدة بيانات قرآنية. http://mbayan.net/. 2011.

[Web-Oss-watch] Oss watch, benefits of open source code. http://www.osswatch.ac.uk/resources/whoneedssource.xml. 2011.

[Web-Python] Python official website - about page. http://www.python.org/about/. 2012.

[Web-Qurany] Qurany concepts tool. http://quranytopics.appspot.com/. 2011.

[Web-Tanzil] Quranic arabic corpus. http://www.tanzil.info. 2011.

[Web-Techulator] What is semantic search? http://www.techulator.com/resources/5933what-semantic-search.aspx. 2012.

[Web-WSJ] Google gives search a refresh. http://online.wsj.com/article/sb10001424052702304459804577281842851136 2012.

[Web-Zekr] Zekr project official website. http://zekr.org. 2012.

Appendices

Annex: Paper Abstracts

We have published two papers about this work in two conferences:

• An Arabic paper in NITS 2011 KSA [Chelli2011].
◦ Title: An Application Programming Interface for indexing and search in Noble Quran (Arabic title: مكتبة برمجية للفهرسة والبحث في القرآن الكريم).
◦ Abstract: Quranic search engine API is a project that offers simple and advanced search services over all the information that the Holy Qur’an contains. It is based on the modern approach of information retrieval to obtain good stability and high-speed search. We want to implement some additional features like highlighting, suggestions, scoring, etc. Since the Quranic search engine is an Arabic search engine, the API has to offer Arabic language processing such as stemming and the elimination of ambiguities. The API is meant to be the base that developers can use to build different types of interfaces on different systems, e.g. desktop GUI, web-based UI, etc.
◦ Keywords: Indexing, Search, Quran, Search-engine.
◦ Authors: Assem Chelli, Merouane Dahmani, Amar Balla, Taha Zerrouki.

• An English paper in a pre-conference workshop of LREC 2012, Turkey, "LRE-Rel: Language Resource and Evaluation for Religious Texts" [Chelli2012].
◦ Title: Advanced Search in Quran: Classification and Proposition of All Possible Features.
◦ Abstract: This paper contains a listing of all the search features in the Quran that we have collected and a classification depending on the nature of each feature. It is the first step in designing an information retrieval system that fits the specific needs of the Quran.
◦ Keywords: Information Retrieval, Quran, Search Features.
◦ Authors: Assem Chelli, Amar Balla, Taha Zerrouki.