Towards Structural Logistic Regression ... - Semantic Scholar

1 downloads 0 Views 279KB Size Report
Towards Structural Logistic Regression: Combining Relational and Statistical Learning. Alexandrin Popescul. University of Pennsylvania. Lyle H. Ungar.
University of Pennsylvania

ScholarlyCommons Departmental Papers (CIS)

Department of Computer & Information Science

July 2002

Towards Structural Logistic Regression: Combining Relational and Statistical Learning Alexandrin Popescul University of Pennsylvania

Lyle H. Ungar University of Pennsylvania, [email protected]

Steve Lawrence NEC Research Institute

David M. Pennock NEC Research Institute

Follow this and additional works at: http://repository.upenn.edu/cis_papers Recommended Citation Alexandrin Popescul, Lyle H. Ungar, Steve Lawrence, and David M. Pennock, "Towards Structural Logistic Regression: Combining Relational and Statistical Learning", . July 2002.

Presented at the 1st Workshop on Multi-Relational Data Mining (MRDM 2002). This paper is posted at ScholarlyCommons. http://repository.upenn.edu/cis_papers/134 For more information, please contact [email protected].

Towards Structural Logistic Regression: Combining Relational and Statistical Learning Abstract

Inductive logic programming (ILP) techniques are useful for analyzing data in multi-table relational databases. Learned rules can potentially discover relationships that are not obvious in "flattened" data. Statistical learners, on the other hand, are generally not constructed to search relational data; they expect to be presented with a single table containing a set of feature candidates. However, statistical learners often yield more accurate models than the logical forms of ILP, and can better handle certain types of data, such as counts. We propose a new approach which integrates structure navigation from ILP with regression modeling. Our approach propositionalizes the first-order rules at each step of ILP's relational structure search, generating features for potential inclusion in a regression model. Ideally, feature generation by ILP and feature selection by stepwise regression should be integrated into a single loop. Preliminary results for scientific literature classification are presented using a relational form of the data extracted by ResearchIndex (formerly CiteSeer). We use FOIL and logistic regression as our ILP and statistical components (decoupled at this stage). Word counts and citation-based features learned with FOIL are modeled together by logistic regression. The combination often significantly improves performance when high precision classification is desired. Comments

Presented at the 1st Workshop on Multi-Relational Data Mining (MRDM 2002).

This conference paper is available at ScholarlyCommons: http://repository.upenn.edu/cis_papers/134

In Multi-Relational Data Mining Workshop at KDD-2002.

    ! "#%$'&( & ) ! *,+ - /.10 #* #*2$'&#3! *4* 5 3# 6 !74&(* 8* 9(:A@CBEDGFH@IJAKL;8MGN!O?:QPSRCT"UV:H;WXCY(@?Z[>\D!PSR ]V^ ;#_`;T">ba7DG;#@CN#;8cdRC>A@CBfeg>b_VFhBji,XkI;#@?@?JENmlEc nporq[sutbvGwyxzs|{~}8€"‚vmq`q[xƒ}A„utS…#q[s†… ‡ vmˆV…!wz{y‰vmqd{ 8€)Š8‰ˆ[‹A{yvGwŒ…8q[/ŽqA€h#wy‰…#{ysuSqpA‘msuvmq`‘mv ’m“A”#“\•\–S—!¥§˜A™L¦š~˜S¨ ›dŠªœ\8©ŒžAŸ!vmxz 8v«˜b…#›Awy#‘­œA¬p¡LŽ~¢zq[—bxƒ¡S{y–Ls|{y¢£‹`˜b{y“Av •8›S›"¢ƒ•S¤S˜ ’!¤8“\•8›b›A”b—«®"šz™b!¯džd•8›`—#•\Ÿ# Sžb•\–8•d8ž\—!°)¢£›A±‚¢Q›A•\—L¢z—b”m² ³…#qV´¶…8µm„|}A·mÂm¸!su¹Vq[º8¼·S»`…#Ž~{­qV…Í`su‹`qđG{y‰3sutb‹`v „|{y„usuSÅQ{­¼8…#s†‘Æ[„uˆ`vwywySvm¼#„†w­…#…8{y‰suSq[‰…8su„Eq[`¼…#½¾{­Ž…#¿LÆV)…8À6xzvG{yxmvmÇ\‘­¿k¬[v«q`…!sHÁdwyq[‹`v«vm3x…!wywy‹`v„uvm‹`x)xz‘mvG…8€h‹`q„VˆE€h88w Å {yvmqd{ys†…8„u„|} `suxz‘m!tbvGwwyvm„†…#{ysu8q[xz¬[suˆ`xŒ{y¬[…#{Œ…#wyv7q`8{8ÆAt\suS‹`xs†qÉÈ~ÊV…#{z{yvmq`v«`ˍ`…#{­…\Ç d{­…#{ysuxƒ{ysu‘«…8„‚„uv«…!wyq[vGwyxmÌ?8q {y¬[v(#{y¬[vGwr¬V…#qVCÌ[…#wyv(¼8vmq[vGw­…#„u„u}pq[8{r‘m8q[xƒ{zwy‹[‘Í{yv«/{y xzv«…!wy‘ͬwyvG„H…!{ysuSqV…#„ […!{­…AÎk{y¬[vG}§vGÏAˆEvm‘G{({ypÆEvˆAwyvmxzvmqd{yv«jÐs|{y¬ …pxzsuq`¼S„uv{­…8Æ[„uv ‘m8qd{­…8suq[suq[¼Ñ…pxzvÍ{g#€ €hv«…!{y‹`wyv‘«…#qV`s†`…#{yvmxmÇLÒ«ÐvmtSvGw«ÌLxƒ{­…!{ysuxƒ{ys†‘m…8„„uv«…!wyq[vGwyx8€HÅ {yvmq§}As†vG„H§‰8wyv3…8‘m‘m‹Aw­…#{yv‰\Avm„ux{y¬V…#qj{y¬[vg„uS¼Ssu‘«…#„€h#wy‰x78€Žƒ¿L‚Ìk…8q[j‘«…#q ÆEvG{z{yvÍw¶¬V…#qV`„uv)‘mvGwz{­…8suq{}AˆEvmxL#€E`…#{­…AÌ#xz‹`‘ͬg…8x‚‘GS‹[qd{yxmÇ!Óv"ˆ`wy8ˆESxzv …Œq[vGÐ …8ˆ`Å ˆAwyb…8‘­¬fЬ`s†‘­¬suqd{yvm¼8w­…!{yvmxxƒ{zwy‹[‘G{y‹AwyvqV…«t\s†¼S…#{ysuSq€HwyS‰ÔŽƒ¿kÐs|{y¬fwyvm¼#wyvmxzxzsuSq ‰\Avm„us†q`¼`ÇÕ‹`wą8ˆ`ˆ`wyb…#‘ͬɈ`wy8ˆESxzs|{ysuSqV…#„us†ÂGvmx{y¬`v/Ö`wyxƒ{zÅ~#w­`vGwgwy‹[„uvmx…!{v«…8‘­¬ xƒ{yvmˆ§8€Žƒ¿k× xÃwyvm„†…#{ysu8qV…8„"xƒ{zwy‹[‘Í{y‹`wyv3xzvm…#wy‘­¬kÌC¼Svmq`vGw­…#{ysuq`¼/€hvm…#{y‹`wyvGxÀh8w(ˆE#{yvmq`Å {ys†…8„s†q`‘m„u‹[xzsuSq suq … wyvG¼8wyvmxzxzsuSq؉\Avm„¾Ç‚ŽƒAv«…8„u„|}dÌC€hv«…!{y‹`wyv¼Svmq`vGw­…#{ysuSq Æd}jŽƒ¿k …#qV €hvm…#{y‹`wyvgxzvm„uvm‘G{ysuSq§Æd} xƒ{yvGˆ\Ðsuxzvwyvm¼8wyvmxzxzsu8qjxz¬`S‹[„†pÆEvsuq\{yvG¼8w­…#{yvmÑsuqd{y… xzsuq[¼8„uv3„udSˆCÇL¶wyvm„usu‰suqV…#wz} wyvmxz‹[„|{yx€\K?K?DGJAKCDÍFh> ^ ;AX!ì?J`D6;!=?>Aê KC:H;`R#a ç ;8@arJ`DGB3N!J`O?@ ^ M">ADÍ;OCMÍ;8B3èQJ`D"BEJEN!OCê ;8@ ^ N!:h>AMGM­FHíkN8> ^ F\:b_[F ^ ;DÍ;8: ^ FA:?M ^ DÍOkN ^ OCDÍ;7>\@kB/Z`;#@?;8DG> ^ ;ÃKLJ ^ ;8@ ^ Fh>\:ABEFb_d>\F ^­^ ;#@?;SB?ú,J`@?;!ò ^ >Aé?:H; DG;#KCDÍ;SM­;8@ ^ > ^ FHJ`@6X

û ü „uvGÏ`…8qV\wysuqp‚SˆEvGxz‘m‹[„kvG{r…8„¾Ç (Y @EèQJ`D ^ O?@C> ^ ;#: ^ FA:):h>\@?Z`OC>\Z`;/ê>AlA;8M ^ ç ;#ê ^ JVJ FH@C>`N#N#O?DG> ^ ;ÄF\@VU§B?JAê>\FA@þ>\KCK?DÍJ[>AN ç a ç ;#DG;ôíCDmM ^ òƒJADmBE;#DfDÍOC:H;SMjéL;#F ^ :AM ^ J`@CN!;bö6J`D)>`M)N#JAO?@ ^ MÃõ ^ ç ;@VO?êéL;#DJ\èCFbUEMÃ> ^ Dm>\F\êpK?:\@ MG> ^ FhM­èQU ^ ç ;ÄíkDGM ^ òƒJADmBE;#DDGO?:H;bö«X óz@ ^ ç FhMKC>AKk;8D8R)ar;ÑM ^ >AD ^^ Jô>AB?B?DÍ;SMÍM ^ ç ;§FhMÍMÍO?;SMJ\è7N!J`êé?FA@CBJAè(:HJ`ZAFhM ^ F ^ ;8:HU`R)>\@kB,FH@4N!J`êé?F ^ FHJ`@6R"éVUª: ^j^ ç ; ^ >AMÍl4J\èN!:h>AMGMÍF†èQUVF,èQDGJA ê 7;!ò M­;S>\DmN ç óz@CBE;#=‚R (>A@JA@?:R !37!?A2 0 % 4>!&8743: ;?%'>.\@VUØaJADmB?M#X¶9:HêpJ`M ^ >\:b_A;pM­J`êp; DG;#:A@ ^ aJADmB?M#R¶MÍJ DÍ;SN#>\:b_A;>A@[UjN!F ^ > ^ FHJ`@CM(OkM­;#èQO?:6èQJ`DN!:h>AMGM­FHíkN8> ^ Fpùyñk> ^ úÃ> ^Í^ DGFHé?O ^ ;!ò _b>A:HOC;7DÍ;8K?DÍ;SM­;8@ ^ > ^ FHJ`@JAèkaJADmB/N#JAO?@ ^ M8X ý ; ç >b_A;Ã;!=EKL;#DGFHêp;8@ ^ ;8B a7F ^ ç ékJ ^ ç ì góyT>A@CB :HJ`ZAFhM ^ F ^ FHJ`@6X ³Œì góyTjOCMÍ;8M)éC>ANmlVZ`DÍJ`O?@CBÄDG;#:h> ^ F ^ FHJ`@ªM ^ DGOCN ^ O?DG;AX"YgM­F ^ FAMÍ;8BªèQ;S> ^ O?DÍ;SMZ`;#@?;8DG> ^ ;SBªé[U*ì góyTøF ^ O?DG; a>`MÃ;#@?J`O?Z ç @VO?êp;#DGF\: ^ FHJ`@jFAFH@6X Ÿ ž˜Ÿ š »$«{¥Ÿ ¡3«9$¥¯‘¥$«9§&«9Ÿ š6¡ š C¨¸Ÿ «9§"«Ÿ š6¡ §?¦9>œ ‘ž˜$¦ ] JAêp;J\è ^ ç ;ÑN!F ^ >dò ^ F`M­;SBDGO?:b_A;Ø>ª@C> ^ O?Dm>\:é?FHéC:HFAMÍF\@CN#;rJAèLBEJEN#O?êp;#@ ^ 9éL;#:(BEJEN!O?êp;8@ ^ ðôN!F ^ ;SBÄéVU ^ ç ;#ê ékJ ^ ç X æÃç ;3èQJA: ^ FHJ`@ET 7AÛ-)/Ü35ÞÝÀßà0"7!&2?: )Àá-,ÀÜ35 ,”0"7!&2?: â."%?0 è$ã>ç>æ>é>è)}áF5 å

eJEN!OCê ;8@ ^ 9 F\@óyT"IëB?JVN#O?êp;#@ ^ F†èF ^ F ç FHZ ç :HUfN!F ^ ;SBfB?JVN#O?êp;#@ ^ F§"« ¥>¦ ;3êpJVB?;#:‚K?DGJAKLJ`MÍF ^ FHJ`@C>\: ^ F`M­;SBÑìgóyTØèQ;8> ^ O?DG;8M>\@CB§arJ`DGBEòƒN#JAO?@

^ Ma7F ^ ç :HJ`ZAFhM ^ FAN#N#O?Dm>AN!U aÃ>AMFHêpK?DGJd_A;SBéVUÑ>\@f>b_A;#Dm>\Z`;(J\è ¼?X º ¼$½§X æÃç ;ÄFHêpK?DGJd_A;8êp;#@ ^ F\Z`;AX æÃç ;g>b_`;#Dm>\ZA;ÃN!:h>AMGMÍF†íkN8> ^ FHJ`@;#DGDÍJ`D DÍ;SBEOCN ^ FHJ`@ÑFhM ÖEX ½§X[ì?JADGê>\:LM ^ > ^ FhM ^ FhN#>\:LM­F\@CN#; ^ ;8M ^ F ^/^ ç ;§F\ZA;pK?DG;8N#FA@CN!;ÄZ`DÍ;S> ^ : ^ F`M­;SBèQ;S> ^ OCDÍ;SM(>\DG;gDG;#:h> ^ FH_`; ^ J3arJ`DGB?M8X æÃç ; ^ a;#: ^ >AMÍ; ^ M"DG;#KCDÍ;SM­;8@ ^ N!J`ê ê/O?@?F ^ FA@?FS> ^ FHJ`@6X ] JAêp;>\DG;3_A;#DGUÑa;#: ^ ç ;#D>ÄN#JA:\l`;#D N!J`@?@?;8N ^ ;SB/N#JAêpêO?@CF ^ FH;SM#R\;`X ZkX6ù­lV@?Jda7: ^ O?DG;8MÄ>\DG;êpJ`DÍ; OCMÍ;!èQO?: F`M­; ^ M8R ;8MÍKk;SN!Fh>\:\:\FgZ`JVJVB  JAé"X ý ;;!=EKk;SN ^ J`O?DèQO?: ^ ;8B >\K?K?DGJ`>`N ç ^ J éL;>Aé?:H; ^ Jêp>AlA;ÄéL; ^­^ ;#DN ç JAFhN!;8MDG;#Z[>\DmBEFH@CZ ^ ç ;FH@kN!: ^ FH_`; ê ;8DÍF ^ MÃJAèaJADmB?MÃ>\@kBfN#F ^ > ^ FHJ`@CMÃFH@ BEF ±‚;#DG;#@ ^Ã^ >AMÍlEM#X 

Bayesian Networks

60

Precision

50

60

Precision

70

80

80

90

100

Association Rules

40

Words Citations Words+Citations

30

40

Words Citations Words+Citations

20

40

60

80

100

Recall

20

40

60

100

Decision Trees

20

Words Citations Words+Citations

40

40

Words Citations Words+Citations

60

60

Precision

80

80

100

100

Data Mining

Precision

80

Recall

.

40

60

80

100

Recall

0

20

40

60

80

100

80

100

Recall

.

Graphical Models

60

Precision

60 50

Words Citations Words+Citations

40

40

Words Citations Words+Citations

30

Precision

70

80

80

90

100

Digital Libraries

20

40

k»hV» 6wyvG‘msuxzs†8q`Å£©Œvm‘«…8„u„½Q`…#{­…#xzvG{yxÒ~GŘXbÀ

60

80

Recall

2

:DN

100

0

20

40

60 Recall

Ûk!Ð…#w­AxÐd{zwy‹[‘G{y‹Aw­…8„C¿k8¼Ssuxƒ{ysu‘©Œvm¼#wyvmxzxzsuSq

Inductive Logic Programming

Precision

60 50

60

60

80

80

100

100

Information Extraction

Words Citations Words+Citations

40

Words Citations Words+Citations

40

Precision

|

70

80

90

100

Recall

0

20

40

60

80

100

Recall

.

Knowledge Discovery

Precision

60

Words Citations Words+Citations

40

60

Words Citations Words+Citations

20

40

Precision

80

80

100

Machine Translation

20

40

60

80

100

Recall

20

40

60

80

100

80

100

Recall

.

Natural Language Processing

60

Precision

60 50

Words Citations Words+Citations

40

40

Words Citations Words+Citations

30

Precision

70

80

80

90

100

Maximum Entropy

20

40

60 Recall

2

L» ?» 6wyvm‘Gs†xzsu8q`Å£©rvG‘«…8„u„½Q`…#{­…8xzvÍ{yxv‚«Å}~ û À 80

:DN 1

100

20

40

60 Recall

ü „uvGÏ`…8qV\wysuqp‚SˆEvGxz‘m‹[„kvG{r…8„¾Ç Ž ‡ ŠÀK)"ß ‡"â 1á&à M 7;#K?DG;8MÍ;#@ ^ FH@?ZªF ^ O?DG;8M ^ J éL; OCMÍ;8B4F\DG@?FH@CZ,FAM,ùÍK?DÍJ`KkJ[M­F ^ F\: ^ F\êp;#D; ^ >\:X  ¼  K?DGJd_VFDÍ;#ò _[F\@CB§F ^ MÃ>\K?K?: ^ FHJ`@CM8X æÃç ;gDGFHZ ç ^ ò ç >\@CB§M­FhBE;8MrJ\è)DÍ;8: ^ FA: DÍO?:\@ ék;OkM­;SB >`MÄé?F\DGUèQ;8> ^ O?DG;8MÄF\@VUØêpJEBE;#:AK?K?DGJAK?DGF ^ ;/èQJAD ^ ç F ^ FHJ`@6XkIŒDÍJ`KkJ[M­F ^ F\: ^ F\@kB ÄF ^ ;8N ç @?F ^ O?DG;8M8X`IOCDÍ;7K?DGJAKLJ`MÍF ^ FA:HF 8> ^ FbUEM >\K?KC:HFhN#>Aé?:H;7a ç ;#@ ^ ç ;gB?> ^ >ÄFhM@?J ^ ar;8:H:¶MÍO?F ^ >\é?:AMÍ;8B:ADÍ@CFH@?ZkR[>`MŒF`M­;gJ\è"aJADmBÑN!J`O?@ ^ MrFH@BEJEN!OCê ;8@ ^ N!:h>AMGM­FHíkN#> ^ Fék;#èQJADG;gN!J`@CM ^ DGOCN ^ FH@?Z >êpJEBE;#:¶>A:\:X Õ  >`Nml[@CJda7:H;SBEZA; ^ ç F\@kB K?DGJAKLJ`MÍ;JA@?; M­J`:HO ^ F\DG@?FH@CZ ðÃ>bUA;SM­Fh>\@j@?; ^ aJADGlEM#X óyT"I >A:HZ`JADGF ^ ç êM ç >b_A;§ék;8;#@>AK?K?: ^ Fb_A;8@Ñ; ^ >A:~X Ó R {Ó  K?DÍJ`KkJ[M­;Ä> ^ ;SN ç @?Fh÷`OC; N#>\:A@CBDG;#:h> ^ F\:A:ADÍ@?ò FH@?ZªFAFH@6XŒóz@ ùGM ^ > ^ FhM ^ FhN#>\:7KCDÍ;SBEF ^ ;fFAMÍ;8B N!:h>AMGM­FHíkN#> ^ F\FbU`;8MÑ>ADÍ; FAM@C;#a K?DÍ;SBEFhN#> ^ ;8M8"X gOCDÃ>AK?K?DGJ`>`N ç BE F ±‚;#DmMFH@ ^ ç ;ùGBEFHDG;8N ^ FAM/>ØM ^ > ^ FA: ^ ;SN ç @?Fh÷[O?;§>AM> êpJVB?;#:\DG;M­OCK?K?:A@ óyT"IòéC>`M­;SBDÍ;8: ^ FA:"M ^ DGOCN ^ O?DÍ; M­;S>\DmN ç Xù ][^ > ^ FhM ^ F\:K?DÍ;SBEFhN#> ^ ; FH@V_A;8@ ^ F\@CB6RAK?DG;8MÍ;#DG_A;SM ^ ç ;(óyT)I,N!J`ê KLJA@C;#@ ^ >AM ^ ç ;N#;#@ ^ Dm>\:Cê JEBVò ;#:\@CB N#>A:H:hMgM ^ > ^ FhM ^ F\:êpJEBE;#:AMN#JAO?:hBÉDÍ;8: ^ F