University of Pennsylvania
ScholarlyCommons Departmental Papers (CIS)
Department of Computer & Information Science
July 2002
Towards Structural Logistic Regression: Combining Relational and Statistical Learning Alexandrin Popescul University of Pennsylvania
Lyle H. Ungar University of Pennsylvania,
[email protected]
Steve Lawrence NEC Research Institute
David M. Pennock NEC Research Institute
Follow this and additional works at: http://repository.upenn.edu/cis_papers Recommended Citation Alexandrin Popescul, Lyle H. Ungar, Steve Lawrence, and David M. Pennock, "Towards Structural Logistic Regression: Combining Relational and Statistical Learning", . July 2002.
Presented at the 1st Workshop on Multi-Relational Data Mining (MRDM 2002). This paper is posted at ScholarlyCommons. http://repository.upenn.edu/cis_papers/134 For more information, please contact
[email protected].
Towards Structural Logistic Regression: Combining Relational and Statistical Learning Abstract
Inductive logic programming (ILP) techniques are useful for analyzing data in multi-table relational databases. Learned rules can potentially discover relationships that are not obvious in "flattened" data. Statistical learners, on the other hand, are generally not constructed to search relational data; they expect to be presented with a single table containing a set of feature candidates. However, statistical learners often yield more accurate models than the logical forms of ILP, and can better handle certain types of data, such as counts. We propose a new approach which integrates structure navigation from ILP with regression modeling. Our approach propositionalizes the first-order rules at each step of ILP's relational structure search, generating features for potential inclusion in a regression model. Ideally, feature generation by ILP and feature selection by stepwise regression should be integrated into a single loop. Preliminary results for scientific literature classification are presented using a relational form of the data extracted by ResearchIndex (formerly CiteSeer). We use FOIL and logistic regression as our ILP and statistical components (decoupled at this stage). Word counts and citation-based features learned with FOIL are modeled together by logistic regression. The combination often significantly improves performance when high precision classification is desired. Comments
Presented at the 1st Workshop on Multi-Relational Data Mining (MRDM 2002).
This conference paper is available at ScholarlyCommons: http://repository.upenn.edu/cis_papers/134
In Multi-Relational Data Mining Workshop at KDD-2002.
!"#%$'&( &)! *,+ - /.10 #* #*2$'! *4* 5 3#6 !74&(* 8* 9(:A@CBEDGFH@IJAKL;8MGN!O?:QPSRCT"UV:H;WXCY(@?Z[>\D!PSR ]V^ ;#_`;T">ba7DG;#@CN#;8cdRC>A@CBfeg>b_VFhBji,XkI;#@?@?JENmlEc nporq[sutbvGwyxzs|{~}8"vmq`q[x}AutS
#q[s
vmV
!wz{yvmqd{ 8)8[A{yvGw
8q[/qAh#wy
#{ysuSqpAmsuvmq`mv mA#\\S!¥§AL¦~S¨ dª\8©A!vmxz 8v«b
#Awy#A¬p¡L~¢zq[bx¡S{yLs|{y¢£`b{yAv 8S"¢S¤S !¤8\8bAb«®"zb!¯dd8`#\# Sb\8d8\!°)¢£A±¢QA\L¢zbm² ³
#qV´¶
8µm|}A·mÂm¸!su¹Vq[º8¼·S»`
#~{qV
Ã`su`qÄG{y3sutb`v |{yusuSÅQ{¼8
#sÆ[u`vwywySvm¼#w
#
8{ysuSq[
8suEq[`¼
#½¾{
#¿LÆV)
8À6xzvG{yxmvmÇ\¿k¬[v«q`
!sHÁdwyq[`v«vm3x
!wywy`vuvm`x)xzmvG
8h`qVEh88w Å {yvmqd{ys
8u|} `suxzm!tbvGwwyvm
#{ysu8q[xz¬[su`x{y¬[
#{
#wyv7q`8{8ÆAt\suS`xsqÉÈ~ÊV
#{z{yvmq`v«`Ë`
#{
\Ç d{
#{ysux{ysu«
8uv«
!wyq[vGwyxmÌ?8q {y¬[v(#{y¬[vGwr¬V
#qVCÌ[
#wyv(¼8vmq[vGw
#uu}pq[8{rm8q[x{zwy[Í{yv«/{y xzv«
!wyͬwyvGH
!{ysuSqV
# [
!{
AÎk{y¬[vG}§vGÏAEvmG{({ypÆEvAwyvmxzvmqd{yv«jÐs|{y¬
pxzsuq`¼Suv{
8Æ[uv m8qd{
8suq[suq[¼Ñ
pxzvÍ{g# hv«
!{y`wyv«
#qV`s`
#{yvmxmÇLÒ«ÐvmtSvGw«ÌLx{
!{ysux{ysm
8uv«
!wyq[vGwyx8HÅ {yvmq§}AsvGH§8wyv3
8mmAw
#{yv\Avmux{y¬V
#qj{y¬[vguS¼Ssu«
#h#wyx78¿LÌk
8q[j«
#q ÆEvG{z{yvÍw¶¬V
#qV`uv)mvGwz{
8suq{}AEvmxL#E`
#{
AÌ#xz`ͬg
8xGS[qd{yxmÇ!Óv"`wy8ESxzv
q[vGÐ
8`Å Awyb
8¬fЬ`s¬suqd{yvm¼8w
!{yvmxx{zwy[G{yAwyvqV
«t\s¼S
#{ysuSqHwySÔ¿kÐs|{y¬fwyvm¼#wyvmxzxzsuSq \Avmusq`¼`ÇÕ`wÄ
8``wyb
#ͬÉ`wy8ESxzs|{ysuSqV
#usÂGvmx{y¬`v/Ö`wyx{zÅ~#w`vGwgwy[uvmx
!{v«
8¬ x{yvm§8¿k× xÃwyvm
#{ysu8qV
8"x{zwy[Í{y`wyv3xzvm
#wy¬kÌC¼Svmq`vGw
#{ysuq`¼/hvm
#{y`wyvGxÃh8w(E#{yvmq`Å {ys
8sq`mu[xzsuSq suq
wyvG¼8wyvmxzxzsuSqØ\Avm¾ÇAv«
8u|}dÌChv«
!{y`wyv¼Svmq`vGw
#{ysuSq Æd}j¿k
#qV hvm
#{y`wyvgxzvmuvmG{ysuSq§Æd} x{yvG\Ðsuxzvwyvm¼8wyvmxzxzsu8qjxz¬`S[pÆEvsuq\{yvG¼8w
#{yvmÑsuqd{y
xzsuq[¼8uv3udSCÇL¶wyvmususuqV
#wz} wyvmxz[|{yx\K?K?DGJAKCDÍFh> ^ ;AX!ì?J`D6;!=?>Aê KC:H;`R#a ç ;8@arJ`DGB3N!J`O?@ ^ M">ADÍ;OCMÍ;8B3èQJ`D"BEJEN!OCê ;8@ ^ N!:h>AMGMFHíkN8> ^ F\:b_[F ^ ;DÍ;8: ^ FA:?M ^ DÍOkN ^ OCDÍ;7>\@kB/Z`;#@?;8DG> ^ ;ÃKLJ ^ ;8@ ^ Fh>\:ABEFb_d>\F ^^ ;#@?;SB?ú,J`@?;!ò ^ >Aé?:H; DG;#KCDÍ;SM;8@ ^ > ^ FHJ`@6X
û ü uvGÏ`
8qV\wysuqpSEvGxzm[kvG{r
8¾Ç (Y @EèQJ`D ^ O?@C> ^ ;#: ^ FA:):h>\@?Z`OC>\Z`;/ê>AlA;8M ^ ç ;#ê ^ JVJ FH@C>`N#N#O?DG> ^ ;ÄF\@VU§B?JAê>\FA@þ>\KCK?DÍJ[>AN ç a ç ;#DG;ôíCDmM ^ òJADmBE;#DfDÍOC:H;SMjéL;#F ^ :AM ^ J`@CN!;bö6J`D)>`M)N#JAO?@ ^ MÃõ ^ ç ;@VO?êéL;#DJ\èCFbUEMÃ> ^ Dm>\F\êpK?:\@ MG> ^ FhMèQU ^ ç ;ÄíkDGM ^ òJADmBE;#DDGO?:H;bö«X óz@ ^ ç FhMKC>AKk;8D8R)ar;ÑM ^ >AD ^^ Jô>AB?B?DÍ;SMÍM ^ ç ;§FhMÍMÍO?;SMJ\è7N!J`êé?FA@CBJAè(:HJ`ZAFhM ^ F ^ ;8:HU`R)>\@kB,FH@4N!J`êé?F ^ FHJ`@6R"éVUª: ^j^ ç ; ^ >AMÍl4J\èN!:h>AMGMÍFèQUVF,èQDGJA ê 7;!ò M;S>\DmN ç óz@CBE;#=R (>A@JA@?:R !37!?A2 0 % 4>!&8743: ;?%'>.\@VUØaJADmB?M#X¶9:HêpJ`M ^ >\:b_A;pMJ`êp; DG;#:A@ ^ aJADmB?M#R¶MÍJ DÍ;SN#>\:b_A;>A@[UjN!F ^ > ^ FHJ`@CM(OkM;#èQO?:6èQJ`DN!:h>AMGMFHíkN8> ^ Fpùyñk> ^ úÃ> ^Í^ DGFHé?O ^ ;!ò _b>A:HOC;7DÍ;8K?DÍ;SM;8@ ^ > ^ FHJ`@JAèkaJADmB/N#JAO?@ ^ M8X ý ; ç >b_A;Ã;!=EKL;#DGFHêp;8@ ^ ;8B a7F ^ ç ékJ ^ ç ì góyT>A@CB :HJ`ZAFhM ^ F ^ FHJ`@6X ³ì góyTjOCMÍ;8M)éC>ANmlVZ`DÍJ`O?@CBÄDG;#:h> ^ F ^ FHJ`@ªM ^ DGOCN ^ O?DG;AX"YgMF ^ FAMÍ;8BªèQ;S> ^ O?DÍ;SMZ`;#@?;8DG> ^ ;SBªé[U*ì góyTøF ^ O?DG; a>`MÃ;#@?J`O?Z ç @VO?êp;#DGF\: ^ FHJ`@jFAFH@6X »$«{¥ ¡3«9$¥¯¥$«9§&«9 6¡ C¨¸ «9§"« 6¡ §?¦9> $¦ ] JAêp;J\è ^ ç ;ÑN!F ^ >dò ^ F`M;SBDGO?:b_A;Ø>ª@C> ^ O?Dm>\:é?FHéC:HFAMÍF\@CN#;rJAèLBEJEN#O?êp;#@ ^ 9éL;#:(BEJEN!O?êp;8@ ^ ðôN!F ^ ;SBÄéVU ^ ç ;#ê ékJ ^ ç X æÃç ;3èQJA: ^ FHJ`@ET 7AÛ-)/Ü35ÞÝÀßà0"7!&2?: )Àá-,ÀÜ35 ,0"7!&2?: â."%?0 è$ã>ç>æ>é>è)}áF5 å
eJEN!OCê ;8@ ^ 9 F\@óyT"IëB?JVN#O?êp;#@ ^ FèF ^ F ç FHZ ç :HUfN!F ^ ;SBfB?JVN#O?êp;#@ ^ F§"« ¥>¦ ;3êpJVB?;#:K?DGJAKLJ`MÍF ^ FHJ`@C>\: ^ F`M;SBÑìgóyTØèQ;8> ^ O?DG;8M>\@CB§arJ`DGBEòN#JAO?@
^ Ma7F ^ ç :HJ`ZAFhM ^ FAN#N#O?Dm>AN!U aÃ>AMFHêpK?DGJd_A;SBéVUÑ>\@f>b_A;#Dm>\Z`;(J\è ¼?X º ¼$½§X æÃç ;ÄFHêpK?DGJd_A;8êp;#@ ^ F\Z`;AX æÃç ;g>b_`;#Dm>\ZA;ÃN!:h>AMGMÍFíkN8> ^ FHJ`@;#DGDÍJ`D DÍ;SBEOCN ^ FHJ`@ÑFhM ÖEX ½§X[ì?JADGê>\:LM ^ > ^ FhM ^ FhN#>\:LMF\@CN#; ^ ;8M ^ F ^/^ ç ;§F\ZA;pK?DG;8N#FA@CN!;ÄZ`DÍ;S> ^ : ^ F`M;SBèQ;S> ^ OCDÍ;SM(>\DG;gDG;#:h> ^ FH_`; ^ J3arJ`DGB?M8X æÃç ; ^ a;#: ^ >AMÍ; ^ M"DG;#KCDÍ;SM;8@ ^ N!J`ê ê/O?@?F ^ FA@?FS> ^ FHJ`@6X ] JAêp;>\DG;3_A;#DGUÑa;#: ^ ç ;#D>ÄN#JA:\l`;#D N!J`@?@?;8N ^ ;SB/N#JAêpêO?@CF ^ FH;SM#R\;`X ZkX6ùlV@?Jda7: ^ O?DG;8MÄ>\DG;êpJ`DÍ; OCMÍ;!èQO?: F`M; ^ M8R ;8MÍKk;SN!Fh>\:\:\FgZ`JVJVB JAé"X ý ;;!=EKk;SN ^ J`O?DèQO?: ^ ;8B >\K?K?DGJ`>`N ç ^ J éL;>Aé?:H; ^ Jêp>AlA;ÄéL; ^^ ;#DN ç JAFhN!;8MDG;#Z[>\DmBEFH@CZ ^ ç ;FH@kN!: ^ FH_`; ê ;8DÍF ^ MÃJAèaJADmB?MÃ>\@kBfN#F ^ > ^ FHJ`@CMÃFH@ BEF ±;#DG;#@ ^Ã^ >AMÍlEM#X
Bayesian Networks
60
Precision
50
60
Precision
70
80
80
90
100
Association Rules
40
Words Citations Words+Citations
30
40
Words Citations Words+Citations
20
40
60
80
100
Recall
20
40
60
100
Decision Trees
20
Words Citations Words+Citations
40
40
Words Citations Words+Citations
60
60
Precision
80
80
100
100
Data Mining
Precision
80
Recall
.
40
60
80
100
Recall
0
20
40
60
80
100
80
100
Recall
.
Graphical Models
60
Precision
60 50
Words Citations Words+Citations
40
40
Words Citations Words+Citations
30
Precision
70
80
80
90
100
Digital Libraries
20
40
k»hV» 6wyvGmsuxzs8q`Å£©vm«
8u½Q`
#{
#xzvG{yxÒ~GÅXbÀ
60
80
Recall
2
:DN
100
0
20
40
60 Recall
Ûk!Ð
#wAxÃd{zwy[G{yAw
8C¿k8¼Ssux{ysu©vm¼#wyvmxzxzsuSq
Inductive Logic Programming
Precision
60 50
60
60
80
80
100
100
Information Extraction
Words Citations Words+Citations
40
Words Citations Words+Citations
40
Precision
|
70
80
90
100
Recall
0
20
40
60
80
100
Recall
.
Knowledge Discovery
Precision
60
Words Citations Words+Citations
40
60
Words Citations Words+Citations
20
40
Precision
80
80
100
Machine Translation
20
40
60
80
100
Recall
20
40
60
80
100
80
100
Recall
.
Natural Language Processing
60
Precision
60 50
Words Citations Words+Citations
40
40
Words Citations Words+Citations
30
Precision
70
80
80
90
100
Maximum Entropy
20
40
60 Recall
2
L» ?» 6wyvmGsxzsu8q`Å£©rvG«
8u½Q`
#{
8xzvÍ{yxv«Å}~ û À 80
:DN 1
100
20
40
60 Recall
ü uvGÏ`
8qV\wysuqpSEvGxzm[kvG{r
8¾Ç ÀK)"ß "â 1á&à M 7;#K?DG;8MÍ;#@ ^ FH@?ZªF ^ O?DG;8M ^ J éL; OCMÍ;8B4F\DG@?FH@CZ,FAM,ùÍK?DÍJ`KkJ[MF ^ F\: ^ F\êp;#D; ^ >\:X ¼ K?DGJd_VFDÍ;#ò _[F\@CB§F ^ MÃ>\K?K?: ^ FHJ`@CM8X æÃç ;gDGFHZ ç ^ ò ç >\@CB§MFhBE;8MrJ\è)DÍ;8: ^ FA: DÍO?:\@ ék;OkM;SB >`MÄé?F\DGUèQ;8> ^ O?DG;8MÄF\@VUØêpJEBE;#:AK?K?DGJAK?DGF ^ ;/èQJAD ^ ç F ^ FHJ`@6XkIDÍJ`KkJ[MF ^ F\: ^ F\@kB ÄF ^ ;8N ç @?F ^ O?DG;8M8X`IOCDÍ;7K?DGJAKLJ`MÍF ^ FA:HF 8> ^ FbUEM >\K?KC:HFhN#>Aé?:H;7a ç ;#@ ^ ç ;gB?> ^ >ÄFhM@?J ^ ar;8:H:¶MÍO?F ^ >\é?:AMÍ;8B:ADÍ@CFH@?ZkR[>`MF`M;gJ\è"aJADmBÑN!J`O?@ ^ MrFH@BEJEN!OCê ;8@ ^ N!:h>AMGMFHíkN#> ^ Fék;#èQJADG;gN!J`@CM ^ DGOCN ^ FH@?Z >êpJEBE;#:¶>A:\:X Õ >`Nml[@CJda7:H;SBEZA; ^ ç F\@kB K?DGJAKLJ`MÍ;JA@?; MJ`:HO ^ F\DG@?FH@CZ ðÃ>bUA;SMFh>\@j@?; ^ aJADGlEM#X óyT"I >A:HZ`JADGF ^ ç êM ç >b_A;§ék;8;#@>AK?K?: ^ Fb_A;8@Ñ; ^ >A:~X Ó R {Ó K?DÍJ`KkJ[M;Ä> ^ ;SN ç @?Fh÷`OC; N#>\:A@CBDG;#:h> ^ F\:A:ADÍ@?ò FH@?ZªFAFH@6Xóz@ ùGM ^ > ^ FhM ^ FhN#>\:7KCDÍ;SBEF ^ ;fFAMÍ;8B N!:h>AMGMFHíkN#> ^ F\FbU`;8MÑ>ADÍ; FAM@C;#a K?DÍ;SBEFhN#> ^ ;8M8"X gOCDÃ>AK?K?DGJ`>`N ç BE F ±;#DmMFH@ ^ ç ;ùGBEFHDG;8N ^ FAM/>ØM ^ > ^ FA: ^ ;SN ç @?Fh÷[O?;§>AM> êpJVB?;#:\DG;MOCK?K?:A@ óyT"IòéC>`M;SBDÍ;8: ^ FA:"M ^ DGOCN ^ O?DÍ; M;S>\DmN ç Xù ][^ > ^ FhM ^ F\:K?DÍ;SBEFhN#> ^ ; FH@V_A;8@ ^ F\@CB6RAK?DG;8MÍ;#DG_A;SM ^ ç ;(óyT)I,N!J`ê KLJA@C;#@ ^ >AM ^ ç ;N#;#@ ^ Dm>\:Cê JEBVò ;#:\@CB N#>A:H:hMgM ^ > ^ FhM ^ F\:êpJEBE;#:AMN#JAO?:hBÉDÍ;8: ^ F