Detecting stochastic dominance for poset-valued

0 downloads 0 Views 3MB Size Report
conservative bounds given by the application of Vapnik-Chervonenkis theory, which ...... at first glance, the measure is insensitive to the distance between X and Y . ..... extent the conceptually rigorous statement of stochastic dominance can be ...
Georg Schollmeyer, Christoph Jansen, Thomas Augustin

Detecting stochastic dominance for poset-valued random variables as an example of linear programming on closure systems

Technical Report Number 209, 2017 Department of Statistics University of Munich http://www.stat.uni-muenchen.de

Detecting stochastic dominance for poset-valued random variables as an example of linear programming on closure systems

Georg Schollmeyer

Christoph Jansen

Thomas Augustin

Abstract In this paper we develop a linear programming method for detecting stochastic dominance for random variables with values in a partially ordered set (poset) based on the upset-characterization of stochastic dominance. The proposed detectionprocedure is based on a descriptively interpretable statistic, namely the maximal probability-dierence of an upset. We show how our method is related to the general task of maximizing a linear function on a closure system. Since closure systems are describable via its valid formal implications, we can use here ingredients of formal concept analysis. We also address the question of inference via resampling and via conservative bounds given by the application of Vapnik-Chervonenkis theory, which also allows for an adequate pruning of the envisaged closure system that allows for the regularization of the test statistic (by paying a price of less conceptual rigor). We illustrate the developed methods by applying them to a variety of data examples, concretely to multivariate inequality analysis, item impact and dierential item functioning in item response theory and to the analysis of distributional dierences in spatial statistics. The power of regularization is illustrated with a data example in the context of cognitive diagnosis models. Keywords: stochastic dominance, multivariate stochastic order, linear programming, closure system, formal concept analysis, formal implication, VapnikChervonenkis theory, regularization.

1

Introduction

Stochastic (rst order) dominance plays an important role in a huge variety of disciplines like for example welfare economics (cf., e.g., [Arndt et al., 2012, 2015]), decision theory (cf., e.g., [Levy, 2015]), portfolio analysis (cf., e.g., [Kuosmanen, 2004]), nonparametric item response theory (IRT, cf., e.g., [Scheiblechner, 2007]), medicine (cf., e.g., [Leshno and Levy, 2004]), toxicology (cf., e.g., [Davidov and Peddada, 2013]) or psychology (cf., e.g., [Levy and Levy, 2002]) to cite just a few. Most treatments of stochastic dominance are devoted to the univariate case with emphasis also on higher order stochastic dominance

1

d or to the classical multivariate case where one has R -valued random variables with the d d natural ordering ≤= {(x, y) ∈ R × R | ∀i ∈ {1, . . . , d} : xi ≤ yi }. In this paper we treat the general case of random variables that have values in a partially ordered set

1

(poset)

V = (V, ≤).

Detecting stochastic dominance in this general case is especially interesting in the

context of multivariate inequality or poverty analysis (cf., [Alkire et al., 2015]) in the situation where one has more dimensions of inequality that are additionally possibly only of a partial ordinal scale of measurement. One thinkable dimension with an only partially ordered scale of measurement is the dimension education,

because dierent highest

educational achievements may be incomparable due to dierent specics of dierent courses of education. In this paper, the example of multivariate inequality analysis will serve as a prototypic example of multivariate stochastic dominance analysis. In contrast to the simple univariate case, for random variables with values in a partially ordered set the notion of stochastic dominance cannot be simply described with the

2

X : Ω −→ V and Y : Ω −→ V (V, ≤), one says that X is (weakly) stochastically 0 0 by X ≤SD Y , if there exist random variables X and Y on a d d (Ω0 , F 0 , P 0 ) with X = X 0 , Y = Y 0 and P 0 (X 0 ≤ Y 0 ) = 1. The

distribution function, anymore . For two random variables with values in a partially ordered set smaller than

Y,

denoted

further probability space

property of stochastic dominance can be characterized by three essentially equivalent, more constructive statements: The random variable variables i) ii)

Y

X

is stochastically smaller than the random

3

if one of the three following conditions is satised :

P (X ∈ A) ≤ P (Y ∈ A) E(u ◦ X) ≤ E(u ◦ Y ) u : V −→ R

for every (measurable) upset

A⊆V

5

fX from the density fY 0 values v ≤ v .

iii) It is possible to obtain the density mass from values

v

4

for every bounded non-decreasing Borel-measurable

to smaller

function

by transporting probability

In this paper we will deal with the problem of detecting stochastic dominance between two random variables

X

and

Y

for which one has observed an i.i.d. sample

1 This includes especially the multivariate case of

{1, . . . , d} : xi ≤ yi

Rd

where the natural order

is used. Note also that every nite poset

(V, ≤)

(x1 , . . . , xnx )

x ≤ y

⇐⇒

∀i ∈

can mathematically be represented

as a multivariate case where the dimension equals the order dimension of 1941, Trotter, 2001].

(V, ≤),

cf. [Dushnik and Miller,

2 If one would rely on the distribution function in the multivariate case, then one would get another

order, the so-called lower orthant or upper orthant order, cf., e.g., [Müller and Stoyan, 2002].

3 The equivalence between (ii) and (i) was shown by Lehmann [1955] and independently proved by Le-

vhari et al. [1975]. The equivalence between (iii) and (i) is a consequence of Strassen's Theorem ([Strassen, 1965]), see Kamae et al. [1977].

4 Here, we have to assume that

partially ordered polish space.

(V, ≤)

can be equipped with an appropriate topology that makes it a

5 This statement is of course only equivalent if the densities

2

fX

and

fY

actually exist.

of the unknown random variable random variable

Y.

not exactly know the true law of stochastic dominance between

X

and

Y

X

and an i.i.d.

sample

(y1 , . . . , yny ) of the unknown X ≤SD Y , but one does

Actually, one would be interested in detecting

X

X

Y . So, here we will deal with detecting empirical Y , denoted by X ≤SD ˆ Y , where the true laws of

and

and

are replaced by the corresponding empirical laws.

The problem of statistical

inference that is concerned with the question of how stochastic dominance w.r.t. empirical laws can be translated to stochastic dominance w.r.t. be discussed in this paper.

the

the true laws will also

The typical situation in this paper will be the analysis of

dierences between two subpopulations of some population. The typical subpopulations analyzed in this paper will be subpopulations of male and female persons. Here, we think of the random variable the male, and

Y

X

as the outcome of a random sample from the subpopulation of

as a random sample from the subpopulation of the female persons. Note

that in the formal denition of stochastic dominance one compares random variables on the same probability space ensure that

X

and

Y

(Ω, F, P ).

In our case of comparing subpopulations we can

are random variables on the same underlying probability space by

thinking of jointly sampling from the male and the female subpopulation. Note that the notion of stochastic dominance does not rely on the possible dependencies between and

Y,

because all terms involved in the characterizing properties

dominance only rely on the marginal distribution of

X

and

Y.

i) − iii)

X

of stochastic

Note further that the

denition of stochastic dominance could thus be simply extended to random variables living on dierent probability spaces.

Thus, also for the replacement of the true laws

by empirical laws, dierent sample sizes for the male and the female samples would not introduce any problem, here. For detecting stochastic dominance in the above sense, we will make substantial use of the upset-characterization

i).

The characterization via a mass transfer can also be

used to check for stochastic dominance, see, e.g., Mosler and Scarsini [1991] (for empirical applications see, e.g., [Arndt et al., 2012, 2013]), while an alternative approach would be to make use of a network ow formulation of the problem, as outlined in Preston [1974] or Hansel and Troallic [1978] and then check for dominance via computation of the maximum ow. The main reason for putting emphasis on the upset approach is that with this approach we could not only check for stochastic dominance, but we will also get additionally some well-interpretable statistic for free, upon which we can also base an attempt to do inference. Beyond this, the family of all upsets of a given poset is a well

6

understood closure system

and a natural question is then, how the linear programming

approach outlined here, can be generalized to the case of arbitrary closure systems. The paper is structured as follows: In Section 2 we briey introduce basic mathematical concepts of partially ordered sets, complete lattices and formal concept analysis needed in the paper. Section 3 develops and analyses a linear program for detecting rst order sto-

6 A closure system intersections.

S

is a family of subsets of a space

3



that contains



and is closed under arbitrary

chastic dominance for random variables with values in a poset. In Section 4 we generalize the linear programming approach to optimization on closure systems. Statistical inference for the developed methods, especially the application of Vapnik-Chervonenkis theory, possible regularization and characterizations of the Vapnik-Chervonenkis dimension of selected closure systems (as well as concretely computing the Vapnik-Chervonenkis dimension) are treated in Section 5. Examples of application, ranging from inequality analysis based on stochastic dominance to a geometrical generalization of the Kolmogorov-Smirnov test for spatial statistics are given in Section 6, while Section 7 concludes.

2

Mathematical preliminaries

In this section, we very briey introduce elementary basics of partially ordered sets and of formal concept analysis. A far more detailed introduction to partially ordered sets can be found in Davey and Priestley [2002], which also gives a short introduction to formal concept analysis.

An introduction into formal concept analysis can be found in Ganter

and Wille [2012]. The concepts of formal concept analysis are actually only needed for the optimization problems on general closure systems indicated in Section 4, the reader only interested in the problem of detecting rst order stochastic dominance can skip Section 2.2.

2.1 Ordered sets and lattices Denition 1

. A partially ordered set (poset) V = (V, ≤) is a

(posets and lattices)

pair of a set V and a binary relation ≤ on V that is reexive transitive and antisymmetric. A poset (V, ≤) is called linearly ordered, if every two elements x, y of V are comparable (meaning that x ≤ y or y ≤ x). For two dierent elements x, y of a poset V we say that y is an upper neighbor of x (or that x is a lower neighbor of y), and denote this by x l y, if x ≤ y and if there is no further element z ∈ V (dierent from x and y ) with x ≤ z ≤ y . A lattice L = (L, ≤) is a poset such that every set {x, y} of two elements x, y ∈ L has a least upper bound and a greatest lower bound. A lattice is called complete, if every arbitrary set M has a least upper bound and a greatest lower bound. TheWleast upper bound of a set M is called supremum or join of M and it is denoted with M . TheVgreatest lower bound of a set M is called inmum or meet of M and it is denoted with M . An element x of a complete lattice (L, ≤) is called join-irreducible if for arbitrary subsets W B ⊆ L from x = B it follows x = b for some b ∈ B . The set of all join-irreducible elements of a poset V is denoted with J (V).

Denition 2

(upset and downset, principal ideal and principal lter). Let (V, ≤) be a poset. A set V ⊆ M is called an upset (or lter) if we have ∀x, y ∈ V : x ≤ y & x ∈ M =⇒ y ∈ M . A subset M ⊆ V is called downset (or ideal) if ∀x, y ∈ V : x ≤ y & y ∈ M =⇒ x ∈ M . The set of all upsets of a poset (V, ≤) is denoted with U((V, ≤)) and the set of all downsets is denoted with D((V, ≤)). An upset of the form ↑ x := {y ∈ V | y ≥ x}

4

with x ∈ V is called a principal lter. A downset of the form ↓ x := {y ∈ V | y ≤ x} with x ∈ V is called a principal ideal.

Remark 1. The complement of an upset is a downset and the complement of a downset is an upset.

Denition 3 (chain, antichain and width). Let (V, ≤) be a poset. A set M ⊆ V is called

a chain if every two arbitrary elements x and y of M are comparable (meaning that x ≤ y or y ≤ x). A subset M of a poset (V, ≤) is called an antichain if every two arbitrary dierent elements x and y of M are incomparable (meaning that neither x ≤ y nor y ≤ x). The width of a nite poset (V, ≤) is the maximal cardinality of an antichain of (V, ≤).

Remark 2. For every upset M the set min M of all minimal elements of M is an antichain. Furthermore, every nite upset M can be characterized by its minimal elements as M =↑ min M := {x ∈ V | ∃y ∈ min M : y ≤ x}. Analogous statements are valid for downsets.

Denition 4 (order dimension). The order dimension of a poset (V, ≤) is the smallest

number k such that there exist k linearly ordered sets (V, L1 ), . . . , (V, Lk ) that represent (V, ≤) via ≤=

k T

i=1

Li .

2.2 Formal concept analysis Formal concept analysis (FCA) is an applied mathematical theory rooted in an attempt

concept.

to mathematically formalize the notion of a

In its origins initially motivated by

some philosophical attempt to restructure lattice theory (cf., [Wille, 1982]) it nowadays also has very broad applications in computer science, for example in data mining, text mining, machine learning or knowledge management, to name just a few. Concretely, in formal concept analysis one starts with a so-called

K = (G, M, I)

formal context

I ⊆ G × M is a binary relation between the objects and the attributes with the interpretation (g, m) ∈ I i object g has attribute m. If (g, m) ∈ I we also use inx notation and write gIm. In where

G

is a set of objects,

M

is a set of attributes and

the context of statistical data analysis, the objects are often the data points, for example the persons that participated in a survey.The attributes are the observed values of the interesting variables, for example the answer means that person

g

answered the question

yes

m

or

with

no to the posed questions and gIm yes. (If the answers to the questions

in a survey are not binary, then one can transform them into binary attributes with the

formal concept of the context K is a pair called extent, and a set B ⊆ M of attributes, called

method of conceptual scaling, see below.) A

(A, B)

of a set

A⊆G

of objects,

intent, with the following properties: 1. Every object

g∈A

has every attribute

g ∈ G\A (∀m ∈ B : gIm) =⇒ g ∈ A).

2. There is no further object

m∈B

(i.e.:

∀g ∈ A∀m ∈ B : gIm).

that has also all attributes of

5

B

(i.e.:

∀g ∈ G :

m ∈ M \A ∀m ∈ M : (∀g ∈ A : gIm) =⇒ m ∈ B ).

3. There is no further attribute

that is also shared by all objects

g∈A

(i.e.

Conceptually, the concept extent describes, which objects belong to the formal concept and the intent describes, which attributes characterize the concept. The property of being a formal concept can be characterized with the following operators

Φ : 2M −→ 2G : B 7→ {g ∈ G | ∀m ∈ B : gIm}

Ψ : 2G −→ 2M : A 7→ {m ∈ M | ∀g ∈ A : gIm} as

(A, B)

is a formal concept

This can be verbalized as: The pair

⇐⇒ Ψ(A) = B & Φ(B) = A.

(A, B) is a A and A

formal concept i

B

is exactly the set of

all common attributes of the objects of attributes of

B .

is exactly the set of all objects having all 0 In the sequel, we will abbreviate both Ψ and Φ with . (Which of the

two operators is meant will be always clear from the context.) Furthermore, for singleton 0 0 0 0 sets {g} ⊆ G or {m} ⊆ M we abbreviate {g} by g and {m} by m . On the set of all formal concepts we can dene a subconcept relation as

(A, B) ≤ (C, D) ⇐⇒ A ⊆ C & B ⊇ D. (Actually, for formal concepts the equivalence

(A, B)

is a subconcept of

(C, D)

A ⊆ C ⇐⇒ B ⊇ D

then it is a more specic concept containing less objects

that have more attributes in common.

The set of all formal concepts of a context

together with the subconcept relation is called the with

B(K).

holds.) If the concept

K

concept lattice of K and it is denoted

The concept lattice is in fact a complete lattice.

The set of the concept

B1 (K) and the set of all concept intents is denoted with B2 (K). The family of sets B1 (K) is a closure system on G and the family B2 (K) is a closures system on M : A (set-theoretic) closure system S on a Ω space Ω is a family S ⊆ 2 of subsets of Ω that contains the space Ω and is closed under arbitrary intersections. If a family F of subsets of a space Ω is not a closure system, one T can compute its closure cl(F) := {S | S ⊇ F & S is a closure system on Ω} that is the smallest closure system containing all sets of F . extents of all formal concepts of

is denoted with

Ω can be described by all valid formal implications of S : A formal implication is a pair (Y, Z) of subsets of Ω, also denoted by Y −→ Z . We say that an implication Y −→ Z is valid in a family S of subsets of Ω (which needs not to be a closure system) if every set of S that contains all elements of Y also contains all elements of Z . In this case we also say that the family S respects the implication Y −→ Z . Similarly, we say that a subset of Ω respects an implication Y −→ Z if it either is not a superset of Y or if it is a superset of Z . The rst component of a formal implication Every closure system

S

B(K)

on

6

premise or the antecedent and the second component is also called conclusion or the consequent of the formal implication. A formal implication is called simple if its premise is a singleton.

is also called the

A closure system

S

the set

I(S)

closure system

S

can be characterized by formal implications as follows: Dene for

S

can be rediscovered from

all formal implications of

S

S . Then, given the set I(S), the I(S) as the set of all subsets of Ω that respect I(S) of all valid implications of a closure system

of all formal implications that are valid in

is usually very large.

I(S).

so-called implication base of that a further set

∀M ⊆ Ω :

J

The set

To eciently describe a closure system, it suces to look at a

I(S):

Given an arbitrary set

of implications is a

I

of formal implications, we say

base of I, if we have

M respects all implications of I ⇐⇒ M respects all implications of J

and if furthermore

J

is minimal w.r.t. this property, i.e. for every other subset

(1)

J0 ( J

the equivalence (1) is not valid anymore. In the sequel, we will mainly deal with formal implications of the closure system of the concept intents of a given formal context

K.

Such implications are sometimes also called attribute-implications to indicate that one is speaking about implications between attributes and not between objects of a context. Here, we will always use the short term implications and will also say that an implication is valid in a context

K

instead of saying that an implication is valid in the closure system

of all concept intents of

K.

In the context of statistical data analysis one often has data that are not binary but for example categorical with more than two possible values.

To analyze such data with

methods of formal concept analysis one can use the technique of (cf.

conceptual scaling

[Ganter and Wille, 2012, p.36-45]) to t the categorical data into a binary setting:

{1, . . . , K}

K attributes  = 1 , . . . , = K  and say that an object g has attribute  = i if the value of K equals i. In a similar way, for an ordinal variable with possible values {1 < 2 < . . . < K} we can introduce the attributes  ≤ 1, ≤ 2,. . .,  ≤ K  and say that object g has attribute  ≤ i if the value of object g is lower than or equal to i. One can also additionally introduce the attributes  ≥ 1, . . .,  ≥ K . This concrete way of conceptually scaling For a categorical variable with the possible values in

an ordinal variable is called

interordinal scaling

one can introduce the

and will be used in one example of

application given in Section 6.2.

3

Detecting stochastic dominance

We now turn to the development of a technique for detecting stochastic dominance for poset-valued random variables based on linear programming and the upset-characterization of stochastic dominance.

7

3.1 Characterizing stochastic dominance via linear programming (V, ≤) = ({v1 , . . . , vk }, ≤) be a nite poset7 , let x = (x1 , . . . , xnx ) be an i.i.d. sample x of a random variable X and let y = (y1 , . . . , yny ) be an i.i.d. sample of Y . Let w = x x x (w1 , . . . , wk ) where wi denotes the number of observed samples of X with value vi , divided y by nx . Analogously, let w = (w1y , . . . , wky ) where wiy denotes the number of samples of Y with value vi , divided by ny . With U((V, ≤)) we denote the set of all upsets of (V, ≤). We identify an upset M ∈ (V, ≤) with its characteristic vector m ∈ {0, 1}k via mi = 1 ⇐⇒ vi ∈ M . Additionally, we also identify the relation ≤ with the relation {(i, j) | i, j ∈ {1, . . . , k}, vi ≤ vj }, the same for the covering relation l. To the samples x ˆ of the true law P via Pˆ (X = vi ) = wx and and y we associate the empirical analogue P i y ˆ P (Y = vi ) = wi . To check if X ≤SD ˆ Y we have to check Let

∀M ∈ U((V, ≤)) :Pˆ (X ∈ M ) ≤ Pˆ (Y ∈ M ).

Obviously,

Pˆ (X ∈ M ) = hwx , mi

characterizable as

and

Pˆ (X ∈ M ) = hwx , mi,

(2) so (2) is equivalently

∀M ∈ U((V, ≤)) : Pˆ (X ∈ M ) ≤ Pˆ (Y ∈ M ) ⇐⇒ ∀M ∈ U((V, ≤)) : hwx , mi ≤ hwy , mi ⇐⇒ ∀M ∈ U((V, ≤)) : hwx , mi − hwy , mi ≤ 0 ⇐⇒ ∀M ∈ U((V, ≤)) : hwx − wy , mi ≤ 0 ⇐⇒ sup hwx − wy , mi ≤ 0. M ∈U ((V,≤))

This means that the problem is characterizable as a linear program over the family

S := U((V, ≤)) of subsets of V . To solve this program we can look at the concrete structure of the family S . The family S consists of all upsets of (V, ≤), i.e., of all sets M satisfying ∀i, j ∈ {1, . . . , k} : vi ∈ M & vi ≤ vj =⇒ vj ∈ M which is equivalent to

∀i, j ∈ {1, . . . , k} : vi ≤ vj =⇒ mj ≥ mi and this set of inequalities can be easily implemented in a linear program: We have

X ≤SD ˆ Y

if and only if the linear binary program

hwx − wy , mi −→ max w.r.t. m ∈ {0, 1}k ∀(i, j) ∈≤: mj ≥ mi

(3)

7 This is actually no restriction because we are in the rst place interested in detecting stochastic dominance for samples that are always nite.

8

has a maximal value of zero. because for

M =∅

we have

(Note that the maximal value of (3) is always at least

hwx − wy , (0, . . . , 0)i = 0.)

0,

If one analyzes this above binary

program further (see the last paragraph of Section 4.1 at page 21), one sees that it is not necessary to take the

mi

as binary variables, one can relax the integrality conditions and

solve instead the far more simple classical linear program

hwx − wy , mi −→ max w.r.t.

(4)

m ∈ [0, 1]k ∀(i, j) ∈≤: mj ≥ mi which could be further simplied to

hwx − wy , mi −→ max w.r.t.

(5)

m ∈ [0, 1]k ∀(i, j) ∈ l : mj ≥ mi . In the sequel we will denote the maximal value of (5) with

D+

and the optimal value − one would get if one would replace maximization by minimization in (5) with D .

3.2 Some analysis of the linear programming approach for detecting stochastic dominance The obtained linear program for checking dominance involves k decision variables and |l|+k inequalities, where |l| can be shown to be bounded by b k2 c·d k2 e, which indicates that the linear program is practically manageable for real data sets. One interesting question in this context is how the feasible set of the linear program looks like in special situations and what for example the simplex-algorithm would actually do. In applied situations, the poset (V, ≤) is often of the form V = Rd or {0, . . . , K}d and x ≤ y ⇐⇒ ∀i ∈ {1, . . . , d} : xi ≤ yi .

For checking stochastic dominance only the actually observed helps in reducing the eective size of the poset of

V

V

x∈V

are of interest.This

but at the same times makes the structure

only implicitly given. Thus, a general analysis seems to be dicult and we therefore

restrict the analysis in Section 3.2.1 to some simple examples.

3.2.1 Some examples In this section we exemplarily discuss some examples for posets (V, ≤) of the form V = {0, . . . , K}d and x ≤ y ⇐⇒ ∀i ∈ {1, . . . , d} : xi ≤ yi . We start with the simplest example

9

where

d=1

which corresponds to a linearly ordered set

the linear program (5) would translate to

V = {0 < 1 < . . . < K}.

Then

hwx − wy , mi −→ max w.r.t. m ∈ [0, 1]k    1 −1 0 0 ... 0 m1 0  0 1 −1 0 . . . 0   m2   0          ..  ≤  ..  . .   .   .  . 0 0 0 . . . 1 −1 mk 0 | {z } 



=:A

l

m

In this case the extreme points of the feasible set are simply the vectors of the form = (0, . . . , 0 , 1, 1, . . .), where l ∈ {0, . . . , k}. For l, l0 ∈ {1, . . . , k − 1} it is easy to

|{z} l-th entry

l l0 see that every two dierent extreme points m and m are adjacent because A has full l rank and for m the inequality constraint associated to the l − th row of A is strict where l l0 the other inequalities are actually equalities and to switch from m to m one simply has 0 to switch the l − th variable from basis to non-basis and the l − th variable form non-basic 0 to basic. A similar argumentation shows that also for arbitrary l, l ∈ {0, . . . , k} every two dierent extreme points are adjacent which means that applying the simplex algorithm would in this case exactly mean that one scans every extreme point, i.e. every upset, so the simplex algorithm is not better than a naive inspection of every upset. the case of a linearly ordered set the number of upsets is computational point of view. Now we come to the more dicult cases of

d > 1.

|V |

However in

and thus no problem from a

In these situations the feasible set

of the linear program appears to be not so easily describable, there seems to be no simple

8

rule that says which extreme points are adjacent. Table 1 gives lower and upper bounds d for the size u of the closure system of all upsets of {0, . . . , K} for dierent values of K and

d.

One can see that for high enough

K

or

d

the closure system is very big and explicitly

checking all upsets is clearly not applicable. Compared to this, in Table 2 one can see the

8 The upper bounds were computed with the help of the Sauer-Shelah lemma ([Sauer, 1972, Shelah,

1972]).

The

Sauer-Shelah

lemma

is

also

closely

which we use in Section 5.2, see also Bottou [2013] or

vapnik-symposium-2011.pdf

upset

↑ B.

multisets of rank

V

Vapnik-Chervonenkis

theory

l ∈ {1, . . . , K}

the set

Al := {x ∈ {0, 1, . . . , K}d |

is an antichain and thus for every non-empty set

K P

B⊆

d P

xi = l} of all i=1 Al we get a dierent

(2|Al | − 1) + 2, where the last +2 comes from noting that also the empty set and l=1 are upsets, and the cardinality |Al | can be computed recursively.

Thus,

the whole set

l

to

for the curious history of the Sauer-Shelah lemma. The lower bounds were

obtained be noting that for every

K -bounded

related

http://leon.bottou.org/_media/papers/

u≥

10

number of iterations a dual simplex algorithm did take to get a solution. (For the objective function we simply took a standard normally distributed sample.) This indicates that with the linear programming approach the problem is still solvable for larger values of

lower bound K=1 upper bound lower bound K=2 upper bound lower bound K=3 upper bound lower bound K=4 upper bound lower bound K=5 upper bound lower bound K=6 upper bound lower bound K=7 upper bound lower bound K=8 upper bound Table 1:

1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9

2 5 10 15 129 37 2516 83 68405 177 2391495 367 102022809 749 5130659560 1515 296881218693

3 16 9.2e+01 2.7e+02 1.3+06 1.0e+04 4.2e+12 1.1e+06 1.6e+22 3.4e+08 2.1e+34 2.9e+11 7.0e+49 7.1e+14 1.0e+68 4.9e+18 6.9e+89

Upper and lower bounds for the size

{1, . . . , K}d

for dierent values of

K

and

d

4 95 1.5e+04 6.6e+05 2.1e+18 2.0e+13 8.6e+49 4.1e+25 4.7e+106 9.2e+43 5.5e+196 3.5e+69 3.6e+103 1.6e+147

5 2110 1.1e+08 2.3e+15 1.4e+53 9.1e+46 4.6e+187 4.9e+114 1.3e+235

u

6 1.1e+06 3.4e+16 2.8e+42 1.5e+154 4.0e+174

7 6.9e+10 5.1e+31 2.0e+118

K

and

d.

8 1.2e+21 1.5e+64

of the closure system of all upsets of

d.

d K

1

2

3

4

5

6

7

1

0

0

7

18

18

92

239

2

4

3

19

156

796

3861

23002

3

3

78

208

1901

4456

24628

27271

4

17

86

626

3518

23002

24173

24923

5

12

200

2380

10987

6

29

353

2023

23002

7

60

396

4959

8

87

572

7698

Table 2: Number of iterations for solving the linear program via dual simplex for detecting d stochastic dominance for V = {0, . . . , K} for dierent values of K and d. The objective

function was a standard normally distributed random sample.

3.2.2 Duality In this section, we analyze the dual linear program of program (4) for detecting rst order stochastic dominance. The most interesting inside will be that this dual program can be interpreted as a special kind of mass transportation problem. In order to determine the dual program of program (4), rst note that the second class of constraints of problem (4) can equivalently be rewritten as

∀i, j ∈ {1, . . . , k} : mi ≥ Iij · mj 11

(6)

where

Iij := 1< ((vj , vi )) and < denotes the strict part of the i ∈ {1, . . . , k}, the matrix M (i) ∈ Rk×k via   if p = q Iip (i) Mpq = −1 if q = i ∧ q 6= p   0 else

partial order

for each

≤.

By dening

(7)

one then can reformulate the linear programming problem (4) by the equivalent linear programming problem



hwx − wy , mi −→ max w.r.t. m1 , . . . , mk ≥ 0

(8)

 Ek M (1)     ..  · m ≤ (1, . . . , 1, 0, . . . , 0)T =: b | {z } | {z }  .  k−times k2 −times (k) M Ek denotes the k -dimensional unit matrix. (x1 , . . . , xk , z11 , . . . , z1k , . . . , zk1 , . . . , zkk ) and let b be where

Dene

wxy := wx − wy

and

z :=

dened as in the constraints of the

above linear program (8). Then the dual linear program of (8) is given by:

k X l=1

xl = hb, zi −→ min

(9)

w.r.t. 2

Ek M (1)T

z ∈ Rk+k   ≥0 w1xy    . . . M (k)T · z ≥  ...  wkxy

In order to investigate what duality theory can teach us about our original problem, we

12

rewrite the program (9) as:

k X l=1

∀i ∈ {1, . . . , k} : xi −

X

zis +

s∈{1,...,k}\{i}

X

xl −→ min w.r.t. z ∈ R+ k+k

s∈{1,...,k}\{i}

(10)

2

Isi · zsi ≥ wixy

zis with Iis = 0, for nding an optimal solution one can always set zis to zero, because such zis are not present in the objective function and do occur separated only in the i − th inequality constraint with a negative sign. Thus, the program can be simplied For variables

to

∀i ∈ {1, . . . , k} : xi −

X

s∈{1,...,k}\{i}

Iis · zis +

X

s∈{1,...,k}\{i}

Isi · zsi ≥ wixy ,

which again can be simplied to

∀i ∈ {1, . . . , k} : xi −

X

zis +

{s:vs 0 chosen small enough (for example ε = 1/2 min{|mi − ml | |

relaxed program with some relaxed program, since for

i ∈ {1, . . . , n}, mi 6= ml }) feasible vectors

m + xε

and

it can be represented as the convex combination of the two

m − xε

where

xε ∈ Rk

is dened as

( ε xεi = 0

if

mi = ml

else

.

4.2 The case of closure systems eciently described by a generating formal context Closure systems also naturally arise in the theory of formal concept analysis: the family of all formal concept extents (as well as the family of all concept intents) of a concept lattice is a closure system. Furthermore, every arbitrary closure system can be represented

14 .

as a closure system of extents (or intents) of an appropriately chosen formal context

In statistical applications, it appears natural to take as objects the observed data points, for example persons in a social survey. As attributes one can take the values of dierent variables of interest, for example the answers of the persons to dierent questions.

(If

the questions are yes-no questions, then they can be incorporated directly, otherwise one can apply the method conceptual scaling to get binary data, cf. Section 2.2.) The formal concept lattice then gives valuable qualitative information about dierent subgroups of persons that supplied response patterns that belong to the same formal concept and thus share specic attributes.

If one is interested in dierences between dierent subgroups

(e.g., male and female participants) w.r.t. the answers to the questions, one could look at every formal concept and analyze the dierences between the subpopulations that belong to the given concept. Often, the concept lattice is very large and it becomes dicult to look at every formal concept.

Then, one can look for example only on that concepts,

for which the dierence between the proportions of persons belonging to this concept in each subgroup is maximal or minimal. This is exactly the problem of maximizing a linear function on the closure system of concept extents.

If the whole concept lattice can be

computed explicitly, then one can simply explicitly compute for every extent the dierence in proportions between both subpopulations.

However, in many situations the concept

lattice is so big that it is very hard to compute all extents/intents explicitly to perform |G| |M | the optimization. (In the worst case, a formal context can have min(2 , 2 ) associated formal concepts.) In this situation one can use the fact that a pair

14 For a closure system exactly the sets of sets of

S.

S.

S ⊆ 2V

K := (V, S, ∈), then the formal extents K := (S, V, 3), the formal intents are exactly

take the formal context

Analogously, for the dual context

21

(A, B) with A ⊆ G and are the

B⊆M

is a formal concept i

A = B 0 & B = A0 or equivalently i

∀g ∈ G, m ∈ M : 1A (g) = min 1m0 (g) & 1B (m) = min 1g0 (m). m∈B

(12)

g∈A

Characterization (12) can be used to describe the property of being a formal concept with the help of the following characterizing inequalities:

∀g ∈ G, m ∈ B : 1A (g) ≤ 1m0 (g) ∀g ∈ A, m ∈ M : 1B (m) ≤ 1g0 (m) X ∀g ∈ A, m ∈ B : 1A (g) ≥ 1m0 (g) − |B| + 1 &

(13) (14) (15)

m∈B

1B (m) ≥

X g∈A

1g0 (m) − |A| + 1.

Equations (13) and (21) capture the fact that

min 1g0 (m), g∈A

respectively.

1B (m) ≥ min 1g0 (m), g∈G

ject

g

(16)

1A (g) ≤ min 1m0 (g)

and

m∈B

Equations (15) and (16) say that

1B (m) ≤

1A (g) ≥ min 1m0 (g) m∈B

and

respectively, which is equivalent to the condition that if an ob-

has all attributes of

is shared by all objects of

B then it has to be in the extent A and that if an attribute m A, then it should be in the intent B . The characterization via

inequality constraints can be used to optimize a linear function of the indicator function

G = {g1 , . . . , gm } be m×n the set of objects, M = {m1 , . . . , mn } the set of attributes and let A ∈ {0, 1} be a matrix describing the incidence I with the interpretation Aij = 1 ⇐⇒ object number i has attribute number j . A formal concept can then be described by a binary vector z = (z1 , . . . , zm , zm+1 , . . . , zm+n ) ∈ {0, 1}m+n , where the rst m entries describe the extent via zi = 1 i object i belongs to the extent and the last n entries describe the intent as zj+m = 1 i attribute j belongs to the intent. The characterizing constraints (13) - (16)

of the extents (or the intents, or both) with a binary program: Let

would then translate to the conditions

∀(i, j)

s.t.

Aij = 0 :

∀i ∈ {1, . . . , m} : ∀j ∈ {1, . . . , n} :

zi ≤ 1 − zj+m & zj+m ≤ 1 − zi X X zi ≥ zk+m − zk+m + 1

(17)

zj+m ≥

(19)

k:Aik =1

X

k:Akj =1

(18)

k=1,...,n

zk −

X

zk + 1.

k=1,...,m (20)

have to be satised. This could be simplied to the condition

22

∀(i, j)

s.t.

zi ≤ 1 − zj+m X zk+m ≥ 1 − zi

Aij = 0 :

∀i ∈ {1, . . . , m} :

(21) (22)

k:Aik =0

X

∀j ∈ {1, . . . , n} :

k:Akj =0

zk ≥ 1 − zj+m ,

(23)

which has the following intuitive interpretation: For every 1. if

0-entry

in the

i-th

Aij = 0 and if object gi

row and the

j -th

column of the matrix

A

we have:

belongs to the extent, then necessarily attribute

mj

cannot

belong to the intent and vice versa. 2. If object

gi

does not belong to the extent, then there exists at least one attribute

of the intent, that the object 3. Dualy, if attribute object

gk

mj

gi

mk

does not have.

does not belong to the intent, then there exists at least one

of the extent, that has not attribute

mj .

Thus, we can compute the maximum

max

hwext , 1A i + hwint , 1B i

(A,B)∈B(K)

of an arbitrary linear objective function

(w1ext , . . . , wnext , w1int , . . . , wnint )

of both the extents

and the intents by solving the binary program

ext h(w1ext , . . . , wm , w1int , . . . , wnint ), (z1 , . . . , zm , zm+1 , . . . , zm+n )i −→ max w.r.t. ∀(i, j) s.t. Aij = 0 : zi ≤ 1 − zj+m X ∀i ∈ {1, . . . , m} : zk+m ≥ 1 − zi

(24)

k:Aik =0

∀j ∈ {1, . . . , n} :

All in all, we would thus have to solve a

|{(i, j | Aij = 0)}| + m + n constraints.

binary

X

k:Akj =0

zk ≥ 1 − zj+m

program with

m+n

variables and

This problem can become cumbersome if the formal

context is big enough, especially because one cannot simply drop the integrality-constraints. However, in practical applications, often only the number of objects is large and the number of items is medium-sized. If one further analyzes the binary program, then one observes that the inequalities concerning the objects and the constraints concerning the attributes

23

are separated in the sense that if one relaxes only the integrality constraints of the variables

z1 , . . . , zm

describing the extent, then the optimum of the associated relaxed mixed binary

program is still (also) attained at a binary solution and thus one can relax the integrality constraints of the extent.

The reason is that for xed and binary

(zm+1 , . . . , zm+n )

the

inequalities (21) and (22) are either redundant or reduce to equality constraints of the

zi = 0 or zi = 1 and inequality (23) is either redundant or demands that a sum of zk 's associated with the extent is greater or equal to 1. If one of the zk 's is not binary, than at least one other zk0 has to be greater than zero. This means that for an appropriately 15 ε > 0 the vectors (z , . . . , z +ε, . . . , z 0 −ε, . . . , z chosen 1 k k m+n ) and (z1 , . . . , zk −ε, . . . , zk0 + ε, . . . , zm+n ) are still feasible with respect to the relaxed feasible set and this shows that nonform

integer points are no extreme-points of the restricted polytope where the binary variables describing the intent are xed. Thus, the optimal value for the relaxed program is always also attained at a binary solution.

5

Statistical inference

We now treat the question of inference. Coming back to the example of detecting stochastic dominance, we were able to detect stochastic dominance in a sample. The natural question of inference is now:

What can we reasonably infer about stochastic dominance in the

population we sampled from? From a substance matter perspective, one would supposedly be interested for example in the hypotheses

H0 : X H1 : X

is not stochastically dominated by is stochastically dominated by

Y

vs

Y.

However, a reasonable consistent classical statistical test of this pair of hypotheses is not reachable since already in the univariate case where the distribution function characterizes

X ≤SD Y , in every arbitrarily X SD Y˜ . To circumvent this switching the roles of H0 and H1

stochastic dominance, we have the problem that for every small

16 neighbourhood

of

Y

we can nd some



with

problem, one can modify the hypotheses, for example by

(for consistent statistical tests of this kind in the univariate case, see, e.g., [Barrett and + Donald, 2003]). Here, we go a slightly dierent way. Since the value of D characterizes X ≤SD Y via X ≤SD Y i D+ = 0 and D− characterizes Y ≤SD X via Y ≤SD X i D− = 0 and furthermore X SD Y (where X SD Y means X SD Y & Y SD X ) i D+ > 0 & D− < 0) we can simply test, if D+ and D− are signicantly dierent from zero. + − (In the case of for example D is signicantly positive and D is not signicantly negative,

15 The choice of constraints,

ε

ε

depends on all inequalities that involve

zk

and

zk0

but since there are only nite many

can in fact be chosen small enough and still greater than zero.

16 This is meant w.r.t. e.g., the Kolmogorov-Smirnov distance. Note also, that in our situation, we have

not much freedom of choice of other distances that induce other neighborhood concepts, since we can only make use of the partially ordered scale of measurement of

24

X

and

Y.

X ≤SD Y but at least Y SD X and the possibility X SD Y A with P (X ∈ A) ≥ P (Y ∈ A) where the dierence P (X ∈ A) − P (Y ∈ A) is only slightly positive.) In the sequel we will put focus on D+ and also do + − not explicitly correct for multiple testing if considering both D and D simultaneously. we cannot directly conclude

is only possible due to upsets

Actually, conceptually, here we do not take the inference problem as the primitive and do not rigorously test a beforehand exactly stated hypothesis by doing a statistical test that provides us with a descriptively interpretable test statistic as a by-product. Instead, we see it a little bit the other way around:

In the rst place, we would like to get a

good, conceptually rigorous descriptive insight into the data by not relying on traditional

17

approaches based on somehow arbitrarily chosen location measures data by one number and then comparing the obtained numbers.

summarizing the

Instead, by relying on

stochastic dominance, we in a sense somehow look simultaneously at all reasonable location measures and if we know measure

18

X ≤SD Y ,

then we also know that every reasonable location

would give a lower (or equal) number to

X

than to

Y.

This is a conceptually

much more reliable statement than simply comparing numbers (of course with the drawback of being less decisive). Only in a second step we think in statistical terms about to which extent the conceptually rigorous statement of stochastic dominance can be translated from the sample to the population.

5.1 Permutation-based tests Now, let us come to the purely statistical aspects of inference for detecting stochastic dominance. (All considerations are similarly valid for linear optimization on general closure systems.) variables

In the simple univariate case of real-valued, continuously distributed random

X, Y , for the two-sample case under H0 : FX = FY , the distribution of the D+ (and also D− and D := max{D+ , −D− }) is independent of the true

test statistic law

FX ,

has known asymptotics and can be furthermore computed exactly for identical

sample sizes (see, e.g., [Pratt and Gibbons, 2012, Chapter 7]). Opposed to this, in the + general multivariate situation, the statistic D is not distribution free, anymore: Firstly, + the distribution of D depends on the concrete structure of the poset (V, ≤): If the relation



is very sparse, then the set of all upsets is very large and one would generally expect + that D would typically have higher values than for the case of a very dense relation ≤. Secondly, also the interplay between the structure of

also of relevance: For example in a very large poset

(V, ≤) and (V, ≤) with

the unknown true law is a very sparse relation



it could be still the case that the most probability mass is living on a much smaller subset

W ⊆V

on which the restricted relation

≤ ∩ W ×W

is actually very dense. This suggests D+ seems to be only partially

that a rigorous analytic treatment of the distribution of

17 Note the non-classical scale of measurement we are dealing with, here. 18 One of the few location measures that does not respect rst order stochastic dominance is the mode. But note that the mode appears most naturally if we have a categorical or an interval scale of measurement, the mode seems to give no valuable information if we want to analyze inequality which is a genuinely ordinal concept.

25

possible

19 .

Thus, a natural alternative is to apply a two sample observation-randomization

test (permutation test, see, e.g., [Pratt and Gibbons, 2012, Chaper 6]), here. The procedure + for evaluating the distribution of D under H0 : PX = PY , which is the least favorable case + ˜0 : Dtrue := sup PX (A) − PY (A) = 0 ( ⇐⇒ X ≤SD Y )) is straightforward: of H A∈U ((V,≤))

x = (x1 , . . . , xnx ) of size nx for subpopulation X (y1 , . . . , yny ) of size y of subpopulation Y be given.

1. Let a sample

2. Compute the statistic

D+

3. Take the pooled sample

DI+

y =

for the actually observed data.

z = (x1 , . . . , xnx , y1 , . . . , yny ).

4. Take all index sets I ⊆ {1, . . . , nx + ny } of size DI+ that would be obtained for a virtual sample y˜ = (zi )i∈{1,...,nx +ny }\I for subpopulation Y . 5. Order all

and a sample

nx and compute the test statistic x˜ = (zi )i∈I for population X and

in increasing order

H0 if the test statistic D+ for dγ · |I|e-th value of the increasingly

6. Reject

the actually observed data is greater than the + ordered values DI , where γ is the envisaged

condence level.

In step 4 one has to compute the test statistic for a very huge number of resamples, thus one usually does not compute the test statistic for all resamples but only for a smaller number of randomly chosen resamples. In the context of linear programming on closure systems, the computation of the test statistic for one resample could be already computational demanding for very complex data sets, so the application of observation-randomization tests has some limitations, here.

5.2 Conservative bounds via Vapnik-Chervonenkis theory Beyond applying resampling schemes for inference there is the further possibility to apply Vapnik-Chervonenkis theory (see, e.g., [Vapnik and Kotz, 1982]) to obtain conservative bounds for the test statistic:

In Vapnik-Chervonenkis theory, among other things, one

analyzes the distribution of

sup |Pn (A) − P (A)|

A∈S or

sup |Pn (A) − Pn0 (A)|,

A∈S

19 Actually, there exists some literature on the asymptotic distribution of the optimal value of a random linear program (e.g., [Babbar, 1955, Sengupta et al., 1963, Prèkopa, 1966]). However, this literature seems to be not applicable in our situation, because in our case, under the null hypothesis, the random objective function is symmetrically distributed around the zero vector, such that the assumption of a unique optimal basis for the asymptotic linear program (cf. [Prèkopa, 1966, Theorem 5]) is not satised.

26

is an unknown probability law and Pn is the empirical law associated with an n (and Pn0 is the empirical law associated to a further independently drawn sample of the same size n). Here, the family S can be any arbitrary family of where

P

i.i.d.-sample of size

subsets of a given space

Ω.

In our situation, the family

S

is the underlying closure system

of interest. The Vapnik-Chervonenkis inequalities (cf., [Vapnik and Kotz, 1982, p.170-172]) then state that



 2 P sup |Pn (A) − P (A)| > ε ≤ 6 mS (2n) e−nε /4 A∈S   2 0 P sup |Pn (A) − Pn (A)| > ε ≤ 3 mS (2n) e−nε .

and

(25)

(26)

A∈S

These inequalities

20

can be used to get conservative critical values for a one sample and

a two sample test. (In the sequel, we will put focus on the two sample situation.) The crucial quantity involved in the right hand sides of these inequalities is the so-called

function

mS (k) :=

max

A⊆Ω,|A|=k

∆S (A),

growth

where

∆S (A) :=|{S ∩ A | S ∈ S}|

describes the cardinality of the projection of the family S on the set A. Obviously, if S S is nite, then the growth function m (k) is always lower than or equal to the cardinality of

S

and thus

|S|

can be used to get a bound for the left hand sides of (25) and (26).

Vess of S ⊆ 2Vess 0 Ω0 system S ⊆ 2

Actually, in our setting, we will use as the underlying space always the subset actually observed values of the basic set

V

and an associated closure system

all on

Ω := Vess . (Note that the projection of a closure on ⊆ Ω0 via S 0 |Ω = {S ∩ Ω | S ∈ S 0 } is again a closure system on Ω.) S Thus, with A = Vess we have ∆S (A) = |S| and m (2n) = |S|, such that the bound |S| is a S sharp bound for m (2n). If the family S is explicitly given, we could thus work with the computable bound |S|. The far more interesting situation appears if the family S is very the restricted space Ω0 onto a subset Ω

large and only implicitly given. Then there is another important bound (see [Vapnik and Kotz, 1982, p.167]) on the growth function that is related to the

dimension (V.C.-dimension) of the family21 S : mS (k) ≤ 1.5

Vapnik-Chervonenkis

k V C−1 , (V C − 1)!

20 There is a bunch of similar Vapnik-Chervonenkis type inequalities that could be of help here, see, e.g., the summary given in Table 1 of [Vayatis and Azencott, 1999, p.4].

21 Originally, Vapnik-Chervonenkis theory was mainly developed to be able to deal with innite families

S . Here, we have nite families S , and if we would know |S| then, in our setting, we would better bound mS (k) by |S| instead of using the Vapnik-Chervonenkis dimension since in our setting of S ⊆ 2Vess , this dimension essentially only provides an upper bound for the cardinality of |S|. If the family S is very

large and is only implicitly given, then the V.C.-dimension can still provide a good computable bound for

mS (k). wa have

Note further that sometimes also other bounds for

mS (2n) ≤ 2|Ω| . 27

mS (2n)

can be useful, for example for nite



where

VC

is the Vapnik-Chervonenkis dimension of the family

the cardinality of the largest possible subset

A

a set

can be

projection of

S

∆S (A) = 2|A| .

shattered

on

A

by

S

A

S,

that can be shattered by

that is dened as

S:

One says that

(or alternatively that A is shatterable w.r.t. S ) if the A, i.e. 2A = {S ∩ A | S ∈ S} or equivalently

contains all subsets of

In many cases, the V.C.-dimension cannot be computed explicitly.

However, in our context it shows up that we can compute the V.C.-dimension either with the help of binary programs or with a sharp characterization of the V.C.-dimension. Of course, the Vapnik-Chervonenkis inequalities provide only very conservative bounds for inference. law

P .)

(Note that the right hand sights of (25) and (26) do not depend on the true

If one is able to perform an observation randomization test, then one should do it

instead of dealing with the conservative Vapnik-Chervonenkis inequalities. However, the Vapnik-Chervonenkis inequalities give us some guidance for dealing with situations where the closure system is so big that one would expect that the distribution of the test statistic is behaving too ugly to allow for a sensible statistical test with enough power.

In such

a situation, we can use Vapnik-Chervonenkis theory to appropriately reduce the closure

22 .

system to hope for making the tail distribution of the test statistic more well-behaved

If one appropriately reduces the cardinality of the closure system, then one could hope for a test statistic that has a better power for the detection of a systematic deviation from

H0 .

23

This possibly increased power would then come along with a smaller and thus

less ne-grained closure system

S

that is then not so sensitive to very specic alternatives.

Note that the V.C.-inequality is essentially based on the eective size of is explicitly given, one can simply drop some sets of

S

to make

S

S.

Thus, if

S

smaller. However, in

our situation, we often have a closure system that is implicitly given and only nicely describable because it is a closure system. A simple removal of some sets of

S S 0,

S

is thus not

possible because sets of

are not explicitly given and an arbitrary removal of some sets

could lead to a family

that is not a closure system, and thus not easily describable,

anymore.

The beauty of V.C.-theory in this situation lies in the fact that if we can

A of maximal cardinality, then A make S A to tame S eciently. Actually,

compute the V.C.-dimension by supplying a shatterable set we also have a straightforward possibility to tame

S:

Since big shatterable sets

very big, we can drop some or all elements of such sets one would not completely remove

A

(or a subset of

A)

for the whole data analysis, but

only for the construction of the closure system under which the nal data analysis takes place.

Before explaining how this is exactly meant in dierent situations and how we

ensure that the tamed system is still a closure system (cf. Section 5.3), we will now rstly characterize shatterable sets and the V.C.-dimension for dierent closure systems in the next section.

22 Of course, V.C.-theory gives us only bounds on the tail behavior of the test statistic and no direct insight into the actual behavior of the tails, so a sharpening of the bound does not necessarily mean that the actual tail behavior will be getting better if we reduce the V.C.-dimension of the closure system.

23 Of course, one cannot hope for a more powerful statistic w.r.t. every thinkable deviation from

only for a better power for detecting deviations that are not too complex w.r.t. V.C.-dimension.

28

H0

but

5.2.1 The Vapnik-Chervonenkis dimension of several special closure systems This section is actually very technical and not really necessary to understand the basic ideas in the following sections. The reader more interested in the basic concepts can thus skip this section and Section 5.2.2. For the reader interested in Vapnik-Chervonenkis theory and the reader interested in a detailed understanding, we would like to recommend the following sections, because, though looking a bit technical, there are no deep or cumbersome ideas involved in the following theorems. Contrarily, the relation between Vapnik-Chervonenkis theory and formal concept analysis seems to be very natural. Maybe somehow surprising, there seems to be not too much research that directly connects formal concept analysis and Vapnik-Chervonenkis theory, the only works in this direction, the authors are aware of, are the papers Anthony et al. [1990a,b], Albano and Chornomaz [2017], Chornomaz [2015], Albano [2017a,b], Makhalova and Kuznetsov [2017].

Denition & Proposition 1. Let S ⊆ 2Ω be a closure system on Ω. Let furthermore I(S)

be the set of all formal implications the closure system S respects. A set M ⊆ Ω is called implication-free if there is no formal implication (A, B) ∈ I where A and B are disjoint non-empty subsets of M . A set M is shatterable w.r.t. S if and only if it is implicationfree and thus the Vapnik-Chervonenkis dimension of S is the maximal cardinality of an implication-free set M ⊆ Ω.

Denition 5 (Vapnik-Chervonenkis principal dimension (VCPI/VCPF)). Let (V, ≤) be a partially ordered sett. The Vapnik-Chervonenkis principal ideal dimension (VCPI) is the Vapnik-Chervonenkis dimension of the family

pi((V, ≤)) := {↓ x | x ∈ V } = {{y | y ≤ x} | x ∈ V }}

of all principal ideals of (V, ≤). If (V, ≤) is a complete lattice, then pi((V, ≤)) is a closure system. In this case we also say that a set M ⊆ V is join-shatterable if it is shatterable w.r.t. the family pi((V, ≤)). Analogously, the Vapnik-Chervonenkis principal lter dimension (VCPF) is the Vapnik-Chervonenkis dimension of the family pf((V, ≤)) := {↑ x | x ∈ V } = {{y | y ≥ x} | x ∈ V }}

of all principal lters of (V, ≤). If (V, ≤) is a complete lattice, then pf((V, ≤)) is a closure system. In this case we also say that a set M ⊆ V is meet-shatterable if it is shatterable w.r.t. the family pf((V, ≤)).

Theorem 1

(Motivating the notions join-shatterable and meet-shatterable ). A subset M of a complete lattice (V, ≤) is join-shatterable if and and only if we have for every x ∈ M :

x

_

M \{x}.

^

M \{x}.

(27)

Analogously, a subset M of a complete lattice (V, ≤) is meet-shatterable if and and only if we have for every x ∈ M : x

29

(28)

Proof. We only prove the first statement; the second statement can be proved analogously.

if: Let B ⊆ M. Take A := ↓(⋁B) ∈ pi((V, ≤)). Then A ∩ M ⊇ B since ∀b ∈ B: b ≤ ⋁B. Additionally, for every x ∈ M\B we have B ⊆ M\{x} and thus x ∉ A, since if x ∈ A, because of A ⊆ ↓(⋁ M\{x}) we would get x ≤ ⋁ M\{x}, which would be a contradiction to (27). Thus A ∩ M = B, and because B was an arbitrary subset of M, we can conclude that M is shatterable.

only if: Let x ≤ ⋁ M\{x} for some x ∈ M. Then every a ∈ V with ∀y ∈ M\{x}: y ≤ a is an upper bound of M\{x} and thus a ≥ ⋁ M\{x} ≥ x. But this means that every set A = ↓a ∈ pi((V, ≤)) that contains all elements of M\{x} necessarily also contains x, so the trace M\{x} cannot be realized on M, which shows that in fact M is not shatterable.
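To make the criterion of Theorem 1 concrete, the following small Python sketch (an illustration of our own) checks join-shatterability in the powerset lattice of {1, 2, 3}, where the order is set inclusion and the join is the union:

def join(sets):
    # Join in the powerset lattice: the union of the given sets.
    out = frozenset()
    for s in sets:
        out = out | s
    return out

def is_join_shatterable(M):
    # Criterion of Theorem 1: no x in M lies below the join of the others.
    return all(not x <= join([y for y in M if y != x]) for x in M)

# The singletons {1}, {2}, {3} form a join-shatterable set: none of them
# is contained in the union of the other two.
singletons = [frozenset({i}) for i in (1, 2, 3)]
print(is_join_shatterable(singletons))                        # True
# Adding the join {1, 2, 3} destroys join-shatterability.
print(is_join_shatterable(singletons + [join(singletons)]))  # False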

Theorem 2. For every finite join-shatterable set M of a finite^24 complete lattice (V, ≤) there exists another join-shatterable set J_M of join-irreducible elements of (V, ≤) that has the same cardinality as M. This means that for determining the Vapnik-Chervonenkis principal ideal dimension it is enough to look at join-shatterable sets of join-irreducible elements.

Proof. Let M ⊆ V be a finite shatterable set. If all elements of M are join-irreducible then we are done. If there exists an x ∈ M that is not join-irreducible we can find a join-irreducible element z such that the set M̃ := M\{x} ∪ {z} is still join-shatterable. Since M is assumed to be finite, we can replace step by step every join-reducible element of M by a join-irreducible element and thus obtain a shatterable set of join-irreducible elements with the same cardinality: So let x ∈ M\J(V). Then x = ⋁B for some set B ⊆ J(V). Furthermore, we have z ≰ ⋁ M\{x} for at least one z ∈ B, because otherwise we would have ⋁ M\{x} ≥ ⋁B = x, which is in contradiction with the assumption that the set M is join-shatterable. Now, take M̃ := M\{x} ∪ {z}. Then M̃ is join-shatterable. To see this, observe that M and M̃ only differ in the elements x and z and z ≤ x. Thus we have z ≰ ⋁ M̃\{z} because z ≰ ⋁ M\{x} = ⋁ M̃\{z}, and for every other y ∈ M̃ we have y ≰ ⋁ M̃\{y} because otherwise we would have y ≤ ⋁ M̃\{y} ≤ ⋁ M\{y}, which is in contradiction with M being join-shatterable.

Theorem 3. The Vapnik-Chervonenkis principal ideal dimension VCPI (and also the Vapnik-Chervonenkis principal filter dimension VCPF) of a poset (V, ≤) is bounded by its order dimension^25 odim((V, ≤)).

Proof. Let d := odim((V, ≤)) and let L1, ..., Ld be d linear orders representing ≤ via x ≤ y ⟺ ∀i ∈ {1, ..., d}: x Li y. We show that every set M of more than d elements of V is not join-shatterable: Take M with |M| > d and take for every i ∈ {1, ..., d} that element xi ∈ M that is the greatest element of M w.r.t. the linear order Li. Then every principal ideal ↓a that contains all the xi necessarily also contains every further element y ∈ M\{x1, ..., xd} ≠ ∅, because for every i ∈ {1, ..., d} we have y Li xi Li a.

24 The finiteness assumption can be dropped if one only assumes that every element x ∈ V can be written as a supremum of join-irreducible elements of V. This is for example the case if there are no infinite descending chains in V.

25 Remember that the order dimension of a poset (V, ≤) is the smallest number d of linear orders L1, ..., Ld ⊆ V × V such that the relation ≤ can be represented as the intersection of these linear orders via x ≤ y ⟺ ∀i ∈ {1, ..., d}: x Li y.

Theorem 4. Let V = (V, ≤) be a finite complete lattice. If V is distributive^26, which can be characterized by saying that the condition

∀B ⊆ J(V) ∀x ∈ J(V): x ≤ ⋁B ⟹ x ≤ z for some z ∈ B   (29)

is fulfilled, then the Vapnik-Chervonenkis principal dimension of V is exactly the width of J(V). Because of Birkhoff's theorem we have V ≅ D(J(V)) and the width of J(V) is exactly the order dimension of V, so in this case we have VCPI(V) = odim(V).

Proof. Because of Theorem 2 we only have to look at the set J(V) of the join-irreducible elements of V. Let d denote the width of J(V). It is clear that a join-shatterable set M ⊆ J(V) necessarily is an antichain. Thus VCPI is lower than or equal to d. To see that VCPI = d, take an antichain A of size d. Then this antichain is obviously shatterable, because for all x ∈ A we have x ≰ ⋁ A\{x}: if x ≤ ⋁ A\{x}, because of (29) we would have x ≤ z for some z in A\{x}, but this would be in contradiction with A being an antichain.

Definition & Proposition 2 (Vapnik-Chervonenkis upset dimension: simply the width). Let V = (V, ≤) be a poset and U(V) be the set of all upsets of V. Then the Vapnik-Chervonenkis dimension of U(V) is called the Vapnik-Chervonenkis upset dimension. The Vapnik-Chervonenkis upset dimension is identical to the width of V, because the shatterable sets are exactly the implication-free sets, which are in this case the antichains of V. Analogously, the Vapnik-Chervonenkis dimension of the system of all downsets is also equal to the width.

Definition 6 (Vapnik-Chervonenkis formal context dimension (VCC)). Let K := (G, M, I) be a formal context. Let

S := B1((G, M, I)) = {A ⊆ G | (A, B) ∈ B((G, M, I)) for some B ⊆ M}

be the closure system of all concept extents. The Vapnik-Chervonenkis formal context dimension (VCC) is defined as the Vapnik-Chervonenkis dimension of S.

Theorem 5 (cf. also [Albano and Chornomaz, 2017, Albano, 2017a,b]). Let K := (G, M, I) be a formal context and let S := B1((G, M, I)). Then a set {g1, ..., gl} ⊆ G of objects is shatterable w.r.t. S if and only if there exists a set {m1, ..., ml} ⊆ M of attributes such that

∀i, j ∈ {1, ..., l}: (gi, mj) ∈ I ⟺ i ≠ j.   (30)

26 A lattice L is called distributive if we have x ∧ (y ∨ z) = (x ∧ y) ∨ (x ∧ z) for arbitrary x, y, z ∈ L.

Proof. if: Let A ⊆ {g1, ..., gl}. Take the formal concept (A″, A′). Then A″ contains all gi ∈ A, and for all j with gj ∉ A we have, because of mj ∈ A′, that gj ∉ A″, and thus A″ ∩ {g1, ..., gl} = A, which shows that {g1, ..., gl} is shatterable w.r.t. S.

only if: If {g1, ..., gl} is shatterable, then for every gi there exists a formal concept (Ai, Bi) such that gi ∉ Ai and ∀j ∈ {1, ..., l}\{i}: gj ∈ Ai. But this means that for every i ∈ {1, ..., l} there exists an attribute mi such that (gi, mi) ∉ I and ∀j ∈ {1, ..., l}\{i}: (gj, mi) ∈ I.

Corollary 1. The Vapnik-Chervonenkis formal context dimension of a context (G, M, I) is

equal to the Vapnik-Chervonenkis formal context dimension of the dual context (M, G, I ∂ ), where I ∂ = {(m, g) | g ∈ G, m ∈ M, gIm}.

5.2.2 Computation of the Vapnik-Chervonenkis dimension

In this section we shortly propose some methods to actually compute the Vapnik-Chervonenkis dimension for different closure systems.

Computing the Vapnik-Chervonenkis dimension if the closure system is given via formal implications

If the closure system S is given by all valid formal implications, then computing the V.C.-dimension can be done by searching for an implication-free set A of maximal cardinality. To do this, one can solve the following binary program:

∑_{i=1}^{k} m_i → max   (31)

w.r.t.

∀(Y, Z) ∈ I(S): ∑_{i: v_i ∈ Y} m_i + (1/|Z|) ∑_{i: v_i ∈ Z} m_i ≤ |Y|   (33)

m = (m_1, ..., m_k) ∈ {0, 1}^k   (34)

Here, condition (33) codifies the demand that for a valid implication Y → Z a shatterable (implication-free) set A necessarily cannot contain any element of Z if it contains all elements of Y. Instead of the whole set of implications in (33) one can also use only those valid implications Y → Z where Y is minimal (in the sense that Ỹ → Z is not valid anymore for every Ỹ ⊊ Y) and Z is maximal (in the sense that Y → Z̃ is not valid anymore for every Z̃ ⊋ Z). This set of implications is referred to as the generic base in formal concept analysis (cf., e.g., [Bastide et al., 2000], where also an algorithm for extracting the generic base is given). Note that in (33) one cannot use an arbitrary implication base: For example the implication base

I := {{v1} → {v2}, {v2, v3} → {v4}}

induces the further implication {v1, v3} → {v4} and thus the set A = {v1, v3, v4} is not shatterable, but the anti-implications

I := {{v1} → {¬v2}, {v2, v3} → {¬v4}}

obtained by demanding (33) only for the implication base I would not exclude the set A, although it is not shatterable.

If one wants to compute for example the Vapnik-Chervonenkis principal ideal dimension VCPI of an explicitly given complete lattice L = (L, ≤), one can firstly construct the formal context K := (L, L, ≥). Then the closure system of all intents of this context is exactly the set of all principal ideals of (L, ≤), and one can compute the generic base of all implications that are valid in this closure system. Finally, one can build and solve the binary program (31). Actually, due to Theorem 2 it suffices to look only at the reduced context where the join-reducible elements of L are removed.^27
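A minimal sketch of the binary program (31) in Python, here using the PuLP modeling library (any other ILP solver would do); the toy ground set and the implication list, which mirrors the example base I from above together with its induced implication, are purely illustrative:

import pulp

V = ["v1", "v2", "v3", "v4"]
# Valid implications (Y, Z) of the closure system; here the two implications
# of the example base together with the induced implication {v1, v3} -> {v4}.
implications = [({"v1"}, {"v2"}), ({"v2", "v3"}, {"v4"}), ({"v1", "v3"}, {"v4"})]

prob = pulp.LpProblem("max_implication_free_set", pulp.LpMaximize)
m = {v: pulp.LpVariable("m_" + v, cat="Binary") for v in V}

# Objective (31): maximize the cardinality of the selected set.
prob += pulp.lpSum(m.values())

# Constraint (33): if all of Y is selected, no element of Z may be selected.
for Y, Z in implications:
    prob += (pulp.lpSum(m[v] for v in Y)
             + (1.0 / len(Z)) * pulp.lpSum(m[v] for v in Z) <= len(Y))

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([v for v in V if m[v].value() == 1])  # a maximal implication-free set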

Computing the Vapnik-Chervonenkis upset dimension: Computing the width

If one wants to compute the Vapnik-Chervonenkis upset dimension, in principle one can use the binary program (31), but since the Vapnik-Chervonenkis upset dimension is simply the width, one can also use other, more efficient algorithms to compute the width. One possibility is to reformulate the problem of computing the width of a poset as a matching problem in a bipartite graph: Define the bipartite graph G = (V × {1}, V × {2}, E) where the set of vertices is the disjoint union of V and V, the two parts of G are essentially two copies of the poset V, and an edge e = ((v, 1), (w, 2)) is in E iff v < w. Now one can compute a maximum matching in G. The maximum matching then corresponds to a minimum size chain partitioning of V where two elements v and w with v < w are in the same partition iff the edge ((v, 1), (w, 2)) is in the maximum matching. The number of partitions is |V| − m where m is the size of the maximum matching. This means that we have found a minimal chain partitioning of V of size |V| − m, which is due to Dilworth's theorem identical to the maximal cardinality of an antichain, i.e., the width. To actually compute the maximum matching one can use e.g. the algorithm of Hopcroft and Karp (Hopcroft and Karp [1971]), which would have time complexity O(|V|^{5/2}) in our situation.
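A small Python sketch of this reduction (an illustration of our own), using the Hopcroft-Karp implementation shipped with the networkx library; the example poset is the divisibility order on {1, ..., 6}:

import networkx as nx

def width(elements, leq):
    # Width of a finite poset via Dilworth's theorem and bipartite matching:
    # two copies of the poset are connected by an edge (v,1)-(w,2) iff v < w;
    # a maximum matching of size s yields a minimum chain partition of size
    # |V| - s, which equals the width.
    elements = list(elements)
    G = nx.Graph()
    left = [(v, 1) for v in elements]
    G.add_nodes_from(left, bipartite=0)
    G.add_nodes_from([(v, 2) for v in elements], bipartite=1)
    for v in elements:
        for w in elements:
            if v != w and leq(v, w):
                G.add_edge((v, 1), (w, 2))
    matching = nx.bipartite.hopcroft_karp_matching(G, top_nodes=left)
    return len(elements) - len(matching) // 2

# Divisibility order on {1,...,6}: the width is 3 (e.g. the antichain {4, 5, 6}).
print(width(range(1, 7), lambda v, w: w % v == 0))  # 3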

27 Note that for an explicitly given poset (V, ≤) that is not a complete lattice, the family pi((V, ≤)) of all principal ideals is generally not a closure system, but one can look at the closure system that is generated by all principal ideals of (V, ≤) (of course, without the need of explicitly computing it). Then, to make the computation more efficient, one can similarly remove all reducible attributes (and also all reducible objects) from the context K = (V, V, ≥), where an attribute a is called reducible if the formal concept ({a}′, {a}″) is meet-reducible and an object o is called reducible if the formal concept ({o}″, {o}′) is join-reducible.

Computing the Vapnik-Chervonenkis formal context dimension VCC

To compute the Vapnik-Chervonenkis formal context dimension VCC one can simply make use of Theorem 5. One can equivalently express condition (30) of Theorem 5 by saying that a set A = {g1, ..., gl} of objects is shatterable w.r.t. B1(K) if and only if there exists a set B = {m1, ..., ml} such that for every object g ∈ A there exists exactly one attribute m ∈ B with (g, m) ∉ I and if furthermore, for every attribute m ∈ B, there exists also exactly one object g ∈ A with (g, m) ∉ I. These two conditions can also be incorporated via inequality constraints. Thus, we can compute the Vapnik-Chervonenkis formal context dimension of a context K by jointly analyzing pairs (A, B) of an object set A and an intent set B satisfying (30). With the notation of Section 4.2 we have to solve the binary program

z_1 + ... + z_m → max

w.r.t.

∀i ∈ {1, ..., m}: (n − 1) · z_i + ∑_{j: A_ij = 0} z_{j+m} ≤ n   (35)

∀i ∈ {1, ..., m}: −z_i + ∑_{j: A_ij = 0} z_{j+m} ≥ 0   (36)

∀j ∈ {1, ..., n}: (m − 1) · z_{j+m} + ∑_{i: A_ij = 0} z_i ≤ m   (37)

∀j ∈ {1, ..., n}: −z_{j+m} + ∑_{i: A_ij = 0} z_i ≥ 0   (38)

z = (z_1, ..., z_{m+n}) ∈ {0, 1}^{m+n}.

Here, the constraints (35) and (36) are redundant if z_i is zero, i.e., if object g_i does not belong to the envisaged shatterable set A. If object g_i is in the envisaged shatterable set A, then (35) demands exactly that there is maximally one attribute m_j in the associated attribute set B with (g_i, m_j) ∉ I, and constraint (36) further demands that there is also at least one such attribute. The constraints (37) and (38) analogously codify the dual statement where the roles of objects and attributes are exchanged. Here, unfortunately, one generally cannot drop any integrality constraint, so the computation of the V.C. formal context dimension is generally very hard.
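For small contexts one can also determine the VCC by directly searching for the pattern of condition (30) of Theorem 5, without any integer programming. The following Python sketch (a brute-force illustration of ours, feasible only for small contexts; the toy incidence matrix is arbitrary) does exactly this:

from itertools import combinations, permutations

def vcc(A):
    # V.C. formal context dimension of a context given by a 0/1 incidence
    # matrix A (rows = objects, columns = attributes), via Theorem 5:
    # {g_1,...,g_l} is shatterable iff there are attributes {m_1,...,m_l}
    # with (g_i, m_j) in I  <=>  i != j.
    m, n = len(A), len(A[0])
    best = 0
    for l in range(1, min(m, n) + 1):
        for objs in combinations(range(m), l):
            # try to assign to every object g_i "its" attribute m_i
            for attrs in permutations(range(n), l):
                if all(A[objs[i]][attrs[j]] == (0 if i == j else 1)
                       for i in range(l) for j in range(l)):
                    best = l
                    break
            else:
                continue
            break
    return best

# Toy context: the complement of the identity matrix is shattered completely.
A = [[0, 1, 1],
     [1, 0, 1],
     [1, 1, 0]]
print(vcc(A))  # 3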

5.3 Taming the monster: pruning closure systems via Vapnik-Chervonenkis theory

The last section showed how to compute the V.C.-dimension for several closure systems and how to identify shatterable sets of maximal cardinality. The ability to identify such big shatterable sets supplies us with a simple possibility of effectively taming the closure system by removing such big shatterable sets to get a test statistic that is less crude in the sense that one gets better bounds in (25) and (26) due to a lower V.C.-dimension.

Concretely, for e.g. the closure system U((V, ≤)) of upsets, every upset M ∈ U((V, ≤)) can be characterized by the set min(M) of all minimal elements of M via M = ↑min(M).^28 To tame U((V, ≤)) one can compute an antichain A of maximal cardinality and then remove this antichain A (or a subset Ã of A) from U((V, ≤)) by considering not all upsets U((V, ≤)) = {↑B | B ⊆ V} but only the family S′ = {↑B | B ⊆ V\A} of all upsets that are generated by V\A (or V\Ã). This family is generally not a closure system anymore, but one can simply take not the family S′ but the closure system S̃ that is generated by S′ via S̃ := cl(S′) = ∩{F | F closure system on V, F ⊇ S′}.^29

For taming the Vapnik-Chervonenkis formal context dimension of a given formal context K one can similarly look for objects involved in a shatterable set of maximal cardinality and then take the closure system of the concept extents of the formal concept lattice generated by the modified context where the objects involved in a shatterable set of maximal cardinality are removed.

Generally, two issues arise here: Firstly, for a closure system S of V.C.-dimension VC one usually has more than one shatterable set of size VC. To effectively tame the closure system one therefore has to remove the first found shatterable set of size VC and then one has to look at further shatterable sets of size VC and remove them, too. In this situation, it could be the case that the result of the taming procedure depends upon which shatterable set of maximal cardinality was removed first. To avoid this problem, one can alternatively look jointly at all shatterable sets of maximal cardinality and remove them all. However, this could have the effect that in one step a huge number of sets is removed, such that the V.C.-dimension becomes too small already in one step. Furthermore, if one decides to remove only subsets of shatterable sets, then it is not straightforward which subsets exactly to remove, and also here the choice of the removed subsets could possibly have an impact on which set would be a shatterable set of maximal cardinality in the next step. Since the ability to remove not only whole shatterable sets but also subsets would be very helpful for taming in a very flexible way, this could be seen as a problem. Secondly, the taming of the closure system is only a statistical regularization procedure that cares only for the purely statistical aspects. Thus, it is desirable to analyze the taming also with respect to its conceptual behavior, in the sense that one should care for how flexible the tamed closure system is w.r.t. which sets are in the closure system, and how fine-grained the tamed closure system thus is from a purely descriptive/conceptual point of view. This is clearly a matter of the concrete application. For the closure system of upsets and the closure system of concept extents we will now give concrete proposals for taming that are in our view also acceptable from a conceptual point of view in the situations of the application examples given later in Sections 6.1 and 6.3.

28 As shown in Section 5.2.1, for the closure system of all upsets of a poset (V, ≤), the shatterable sets are exactly the antichains of (V, ≤).

29 For the computation of the test statistic on the tamed closure system S̃ one does not need to compute S̃ explicitly, see Section 5.3.3.

5.3.1 Taming upsets in the context of inequality analysis

The closure system of upsets played a crucial role in the context of stochastic dominance. One field of application of stochastic dominance is multivariate poverty or inequality analysis. In this context one does not start with a poset V, instead one has some (often totally ordered) dimensions of poverty/inequality. In our example of application given in Section 6.1 we have basically the 3 dimensions Income, Education and Health. The poset (V, ≤) is then given by the three-dimensional attributes of all persons in the survey, equipped with the coordinate-wise order (i.e. person x is poorer than or as equally poor as person y iff she is poorer than or as equally poor as y w.r.t. every dimension). Then, the concept of an upset codifies a multivariate poverty line in the sense that an upset M would be a reasonable concretion of the term non-poor by saying that every person in the set M could be termed non-poor and every person in the complement of M could be termed poor. The statement of stochastic dominance X ≤SD Y, where X describes one subpopulation and Y another subpopulation, would then mean ∀M ∈ U((V, ≤)): P(X ∈ M) ≤ P(Y ∈ M), which can be simply translated to the statement: However the term non-poor is actually reasonably concretized, in every case the proportion of the non-poor persons in the subpopulation corresponding to X is always lower than or equal to the proportion in the subpopulation related to Y.

Now, how can we reasonably tame the closure system of upsets in this context? Since the closure system of upsets is getting very big already for small posets V, a taming by explicitly removing upsets seems hopeless, but one can use the fact that every upset M is generated by its minimal elements via M = ↑min(M) and look at antichains instead of upsets. One way to tame the closure system of all upsets, i.e., the closure system of all reasonable concretions of the term non-poor, in a conceptually reasonable way could be to exclude some very skew concretions of the term poor: One can try to remove upsets generated by antichains consisting of very unbalanced elements, i.e. attributes that are very low in one dimension and at the same time very high in another dimension. To do so, one has to concretize here what low and high means.^30 One possibility would be to firstly standardize every dimension to be U[0, 1] distributed. Concretely, if X ∈ R^{n×p} is the matrix containing the n attributes of dimension p, define for j = 1, ..., p the univariate empirical distribution function F^j according to the distribution of the j-th dimension in the sample^31 and define Z ∈ R^{n×p} via Z_ij = F^j(X_ij). Then Z is a transformation of X where every dimension Z_•j has values ranging from 1/n to 1, allowing for some kind of relative comparability of the transformed attributes, with the simple interpretation that if Z_ij = l/n the person i is the (n − l + 1)-th poorest person in the sample w.r.t. dimension j.

To concretize the notion of an imbalanced multivariate poverty line one can firstly define a transformed multivariate attribute Z_i• = (Z_i1, ..., Z_ip) as balanced if max{Z_i1, ..., Z_ip} − min{Z_i1, ..., Z_ip} ≤ δ for a fixed threshold δ. Then, one can define a poverty line as balanced if it is generated by an antichain that consists only of balanced attributes. If one sets the threshold δ globally to one fixed value, then it can happen that the V.C.-dimension varies drastically from region to region, in the sense that e.g. for regions of medium transformed Z values there are big sets of balanced elements building an antichain, whereas for extreme Z values there are only small-sized antichains of balanced elements. For the statistical side of the taming procedure this could lead to a very brute taming of regions of extreme Z values without globally reducing the V.C.-dimension very much. Of course, in the proof of the Vapnik-Chervonenkis inequality one essentially deals with the cardinality of the closure system, and this is actually sized down by the procedure, so the statistical taming would actually still be achieved, but only if one is taming very strongly, which means that one would reduce far more sets in regions of extreme Z-values/low width than in regions of medium Z-values/high width, where the density of upsets is already very high. (Note that every antichain of size k induces 2^k upsets.) One can avoid this seemingly bad effect with the following localization method^32: First, fix some arbitrary envisaged V.C.-dimension h0. For given α ∈ [0, 1] and ε > 0 define an ε-stratum around the center α as the set Mε(α) = {v_i | ∀j ∈ {1, ..., p}: |Z_ij − α| ≤ ε}. Then, for fixed α choose ε(α) such that the V.C.-dimension of Mε(α)(α) is lower than or equal to h0 and such that ε(α) is maximal w.r.t. this property. Then collect in a set T(h0) := ∪{Mε(α)(α) | α ∈ [0, 1]} all strata Mε(α)(α). The closure system Sh0 = cl(Fh0) generated by the family of sets Fh0 = {↑B | B ⊆ T(h0)} can then serve as a tamed subsystem of S. Note that the V.C.-dimension of Sh0 need not be h0, it can be higher, because elements of different strata can build an antichain of size bigger than h0. A further important point is that with this taming procedure we have introduced some asymmetry: In the case of the full closure system S of upsets it played no role that we looked at upsets and not at downsets: If we had dealt with downsets to model the poor persons instead of modeling the non-poor persons via upsets, we would still have got the same results. The reason for this is simply that the complements of upsets are downsets and vice versa. In contrast to this, the complements of specially selected upsets generated by antichains of some subset T(h0) of V are not necessarily downsets generated by the antichains of T(h0). Thus, for practical applications, one should analyze the results of both the tamed upset and the downset approach, which we will do in the example of application given in Section 6.1.

30 If one has any external substance matter insight into how some decrease in one dimension can reasonably be compensated for by an increase in another dimension, one should try to reflect this substance matter insight in the taming procedure. Of course, the herein proposed taming procedure has to be understood as a general purpose procedure that could be substantially improved by modifications based on substance matter considerations.

31 One can use here the complementary distribution function F^j(x) = |{i | X_ij ≥ x}|/n or the usual distribution function F^j(x) = |{i | X_ij ≤ x}|/n, which would lead to identical results. We use here the complementary distribution function because it fits more to the notion of upsets.

32 If a taming with a global threshold δ appears more reasonable from a conceptual point of view in a concrete situation of application, then a global taming may still be a better choice. However, in the example of application given in Section 6.1 we see no direct conceptual advantages of a global taming.
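To make the standardization and the localization method concrete, the following Python sketch (all names, the toy data and the grids are our own illustrative choices, not part of the procedure itself) computes the complementary-ECDF transform Z of footnote 31, the balancedness check, and the collection of strata T(h0); the width of a stratum is here computed by brute force, which is adequate only for small strata:

import numpy as np
from itertools import combinations

def complementary_ecdf_transform(X):
    # Z_ij = F^j(X_ij) with F^j(x) = |{i : X_ij >= x}| / n (footnote 31),
    # computed column-wise; each column of Z takes values in {1/n, ..., 1}.
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    Z = np.empty_like(X)
    for j in range(X.shape[1]):
        Z[:, j] = np.array([(X[:, j] >= x).sum() for x in X[:, j]]) / n
    return Z

def is_balanced(z, delta):
    # A transformed attribute is balanced if its coordinates differ by at most delta.
    return z.max() - z.min() <= delta

def coord_leq(a, b):
    return all(x <= y for x, y in zip(a, b))

def brute_width(points):
    # Size of a largest antichain w.r.t. the coordinate-wise order (brute force).
    pts = [tuple(p) for p in points]
    for k in range(len(pts), 0, -1):
        for cand in combinations(pts, k):
            if all(not coord_leq(a, b) and not coord_leq(b, a)
                   for a, b in combinations(cand, 2)):
                return k
    return 0

def strata(Z, h0, centers=np.linspace(0, 1, 21), eps_grid=np.linspace(0, 1, 101)):
    # Collect T(h0): for every center alpha take the largest eps such that
    # the stratum M_eps(alpha) has width (= V.C. upset dimension) at most h0.
    T = set()
    for alpha in centers:
        best = []
        for eps in eps_grid:
            M = [tuple(z) for z in Z if np.all(np.abs(z - alpha) <= eps)]
            if brute_width(M) <= h0:
                best = M
            else:
                break
        T.update(best)
    return T

Z = complementary_ecdf_transform([[1200, 3, 2], [800, 7, 4], [3000, 5, 1]])
print([is_balanced(z, 0.5) for z in Z])
print(sorted(strata(Z, h0=1)))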

5.3.2 Taming formal contexts in the context of cognitive diagnosis models and knowledge space theory

For taming the closure system of all concept extents of a given formal context K = (G, M, I) with V.C.-dimension VC, in principle one can search for all shatterable object sets of size VC (either step by step or in one whole step, see the remarks above) and exclude the objects of these sets from the context K to obtain a reduced context K̃ = (G̃, M, I ∩ G̃ × M), which then has a V.C.-dimension lower than VC (if this reduced V.C.-dimension is still too high, one can repeat the taming process until the resulting V.C.-dimension is low enough). For the actual data analysis one can then firstly take the closure system B2(K̃) of the intents of the reduced context K̃ and secondly define the reduced closure system S̃ := {{g ∈ G | ∀m ∈ B: gIm} | B ∈ B2(K̃)} generated by all intents of the reduced context K̃, but w.r.t. the objects of the full original context K. In Section 5.3.3 we will show how to do this in computational terms. Another possibility would be to remove not objects but attributes. In practical applications, often the objects represent data points and the attributes represent the multidimensional values of the data points, so in classical situations one usually has many more objects than attributes. In these situations it appears more natural to remove objects, because if one removed attributes, then one would remove these attributes for the whole big set of all objects. Compared to this, if one removes objects, then one removes only the specific concept intents generated by these objects (and also intents that are jointly generated by removed objects and non-removed objects). If one removes objects in the above described way, then one reduces the V.C.-dimension of the closure system under which the final analysis will be done. However, from a conceptual/descriptive/substance matter point of view, one does not know if one has removed sets that are actually interesting/important, or if one did not remove uninteresting/unimportant sets.

In some situations one can tame a context in a more guided manner: One interesting example where one has some kind of substance matter guidance for taming is the case of cognitive diagnosis models (CDM), which are some kind of non-parametric item response models. Note that cognitive diagnosis models are very closely related to the theory of knowledge spaces ([Doignon and Falmagne, 2012], see [Heller et al., 2015]), which is itself closely related to formal concept analysis (see [Rusch and Wille, 1996]). In cognitive diagnosis models one has a set G of persons which respond to a set M = {1, ..., |M|} of cognitive tasks, for example math tasks like fraction addition or fraction subtraction (for one well known fraction-subtraction data set see [Tatsuoka, 1984]). In contrast to more classical item response theory (IRT), in cognitive diagnosis modeling one is not mainly interested in measuring the abilities of persons and the difficulties of items; instead one is interested in the cognitive processes that generated the observed response patterns. Here, one demand is to give persons not only one or more numbers that measure their ability, but to give a more qualitative feedback about which concrete skills the persons possess and which skills they do not possess. To do so, one develops (either theory driven or data driven or, in the best case, driven by a theory that was rigorously empirically tested and withstood the tests) a so-called Q-matrix that specifies for every item what kind of skills are in principle necessary to solve this item. Concretely, for a set of K relevant skills, a Q-matrix is a |M| × K matrix of zeros and ones where an entry Q_ij = 1 means that the skill j is needed to solve item i. In the simplest case one assumes that a person is expected to solve item i if she possesses all skills that are needed to solve item i, so a lack of one skill cannot be compensated by other skills. (This is the DINA model, cf. [Haertel, 1989, Junker and Sijtsma, 2001], but there are also other, compensatory variants like the DINO model, cf. [Junker and Sijtsma, 2001].) Furthermore, one assumes the possibility of slipping an item one is principally prepared to solve and of luckily guessing the right answer of an item one is not prepared to solve. If for the moment we ignore the possibility of slipping and guessing, then the Q-matrix induces some structure of the idealized item response patterns that are possible if the probabilities of guessing and slipping are zero. For example, if for solving one item i one needs all skills that one also needs for solving item i′ plus some more, then response patterns with a 1 at the i-th entry and a 0 at the i′-th entry are only possible due to a lucky guess of item i or a slipping of item i′. This fact can be expressed by saying that the formal implication {i} → {i′} is valid in the closure system SQ := B2({1, ..., K}, {1, ..., |M|}, 1 − Q^T)^33 of all possible idealized response patterns. To see that the closure system B2({1, ..., K}, {1, ..., |M|}, 1 − Q^T) of the intents of the context KQ := ({1, ..., K}, {1, ..., |M|}, 1 − Q^T) is exactly the space of all possible idealized response patterns, note that the intents are generated as {A′ | A ⊆ {1, ..., K}}, where a set A can be understood as the set of skills an imaginary person does not possess. Then A′ is the set of all items i with ∀j ∈ A: (1 − Q^T)_ji = 1, i.e. the set of all items i where all skills the person does not possess are actually not needed to solve the item i. Thus, the intent A′ is in fact the set of all items a person not possessing exactly all skills of A would actually be able to solve, and all intents are exactly all observable idealized response patterns. A valid formal implication Y → Z of KQ could be interpreted in this situation as "all skills that are not necessary for solving any item from Y are also not necessary for solving items from Z", or alternatively as "every imaginary person who possesses all skills for solving all items from Y also possesses all necessary skills for solving all items from Z".

Now, one can incorporate some or all valid implications of the idealized response pattern space SQ to reduce the original closure system by looking only at concept intents of the original context K = (G, M, I) (where gIm ⟺ person g has solved item m) that respect all or some of the valid implications of the formal context KQ = ({1, ..., K}, {1, ..., |M|}, 1 − Q^T) representing the idealized response pattern space. If one enforces that all valid implications of the idealized response pattern space should also be valid in the tamed closure system for the final analysis, then the V.C.-dimension would decrease, but maybe unnecessarily too much.

33 By abuse of notation, we identify the matrix 1 − Q^T with the relation {(i, j) | (1 − Q^T)_{i,j} = 1}.
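To illustrate the construction of SQ, the following Python sketch (with a small, purely illustrative Q-matrix of our own) enumerates the idealized response patterns by running over all skill profiles under the DINA assumption of zero slipping and guessing probabilities:

from itertools import combinations

def idealized_response_patterns(Q):
    # All idealized response patterns induced by a |M| x K Q-matrix:
    # a skill profile S solves item i iff S contains every skill j with Q[i][j] = 1.
    n_items, K = len(Q), len(Q[0])
    patterns = set()
    for r in range(K + 1):
        for possessed in combinations(range(K), r):
            S = set(possessed)
            pattern = tuple(int(all(j in S for j in range(K) if Q[i][j] == 1))
                            for i in range(n_items))
            patterns.add(pattern)
    return patterns

# Item 0 needs skill 0; item 1 needs skills 0 and 1; item 2 needs skill 1.
Q = [[1, 0],
     [1, 1],
     [0, 1]]
for p in sorted(idealized_response_patterns(Q)):
    print(p)
# The pattern (0, 1, 0) does not occur: solving item 1 requires all skills
# needed for items 0 and 2, so the implication {item 1} -> {item 0, item 2} is valid.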

To enforce only a subset of implications one has to reasonably decide which implications to include and which implications not to include. This can be done based on theoretical substance matter considerations about which implications are expected to be more clearly valid from a cognition theoretic perspective and which implications are maybe more questionable because they are due to a less rigorous but more schematic specification of the involved skills. To do so, one can substantially make use of the technique of attribute exploration (see [Ganter and Wille, 2012, p.85]) known from formal concept analysis: Given the formal context KQ, an algorithm like the next closure algorithm (see [Ganter and Wille, 2012, p.66-68]) can compute all formal concepts and also the so-called stem base (see [Ganter and Wille, 2012, p.83]) of all valid implications of this context. In attribute exploration, at every step of the computation of a new implication, the user is asked in an interactive way if the currently computed implication is actually true. Then the user can say that the implication is actually true, or provide the algorithm with an object with specific attributes that actually contradict the formal implication. Then the algorithm would include this counterexample into the context and proceed, but not by computing all implications from the modified context anew, but by knowing that all implications computed before the counterexample was given are still valid in the modified context.

Another possibility of selecting implications to include for taming is to do it data driven. One can look for example at all valid implications Y → Z of SQ that are respected by at least a certain proportion C of objects from the original context K, in the sense that at least a proportion C of persons who solved all items of Y did also solve all items from Z. Formally, this can be described as enforcing all rules Y → Z that have a so-called confidence^34 conf(Y → Z) of at least C, where conf(Y → Z) := supp(Y ∪ Z)/supp(Y) and supp(A) := |A′|, and the operator ′ is meant w.r.t. the original context K. Here, the issue arises that if for example the rules Y → Z1 and Y → Z2 have a confidence above the threshold C, then they would be included, and furthermore the rule Y → Z1 ∪ Z2 would implicitly be also valid in the tamed closure system, but this rule does not necessarily have a confidence of C. One can deal with this issue in different ways. One way of taming would be to enforce a set I of implications that is deductively closed (this means that if an implication follows from some implications from I, then it should already be in the set I), that only consists of implications with confidence above C, and that is furthermore maximal w.r.t. these properties. Such maximal sets are generally not unique. Furthermore, if one has the idea that response patterns that violate implications of the idealized response pattern space are due to random guessing or slipping, then, if the slipping/guessing for different items is independent, for valid idealized implications Y → Z1 and Y → Z2 with confidences C1 and C2 one would expect a confidence of the implication Y → Z1 ∪ Z2 that is generally lower than min{C1, C2}. Thus, choosing the same threshold for implications with differently sized consequents seems to be not natural.

34 This term is used in the field of association rule mining, cf., e.g., [Agrawal et al., 1993, Piatetsky-Shapiro, 1991], which is also related to formal concept analysis, cf., e.g., [Lakhal and Stumme, 2005].
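A minimal Python sketch (toy data of ours) of the two quantities supp and conf for a binary persons × items response matrix:

import numpy as np

def supp(R, items):
    # Number of persons solving all items in the given set.
    return int(np.all(R[:, list(items)] == 1, axis=1).sum())

def conf(R, Y, Z):
    # Confidence of the implication Y -> Z: among the persons solving all
    # items of Y, the proportion that also solves all items of Z.
    return supp(R, list(Y) + list(Z)) / supp(R, Y)

# rows = persons, columns = items
R = np.array([[1, 1, 0],
              [1, 1, 1],
              [1, 0, 1],
              [0, 0, 1]])
print(conf(R, Y=[0], Z=[1]))  # 2/3: two of the three solvers of item 0 solved item 1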

Another way to proceed is to simply take the set of all implications with confidence above a threshold C and accept that with this set also implications from its deductive closure that may have a confidence smaller than C are implicitly included for taming the closure system. If one takes this route, one is faced with computing all implications with confidence above C. For an implication Y → {z1, z2} with confidence above C it is necessary that also the implications Y → {z1} and Y → {z2} have confidence above C, so it suffices to look only at implications with a singleton consequent. (The implication Y → {z1, z2} is included for taming if and only if both Y → {z1} and Y → {z2} have confidence above C, because this is necessary, and if both Y → {z1} and Y → {z2} have confidence above C they are both included, and thus also Y → {z1, z2} has to be included since it follows from the included implications Y → {z1} and Y → {z2}.) Instead of computing all implications with a singleton consequent one can also compute in a first step only those implications that have a minimal antecedent. Then one could exclude those implications Y → {z} with confidence lower than C and recompute the generic base. To do so, one can directly work with the formal context KQ := ({1, ..., K}, {1, ..., |M|}, 1 − Q^T). One can compute the generic base and split every implication Y → {z1, ..., zl} into implications Y → {z1}, ..., Y → {zl} with singleton consequents. Then, for every such implication Y → {z} with confidence lower than C one can exclude it by adding the counterexample Y″\{z} as a further item pattern to the context KQ. (Here the operation ″ is meant w.r.t. the context KQ.) Then one can compute again and again the generic base of the enlarged context until no rule has confidence lower than C anymore. A computationally more elegant way would be not to recompute the whole rule base of the enlarged context anew: In the spirit of attribute exploration (see [Ganter and Wille, 2012, p.85]), one can smartly exclude implications with confidence lower than C directly during the generation of the rules. However, since one then does not work with the generic base but with the stem base, the result would be different and would furthermore depend on the concrete order in which the computed implications were presented to the user.

5.3.3 How to compute the test statistics for the tamed closure systems

In this section we shortly indicate how one can compute the test statistic for a tamed closure system. We start with the example of the closure system of upsets. In Section 5.3.1 we came up with the tamed family of sets Fh0 = {↑B | B ⊆ T(h0)} that generates the closure system Sh0 = cl(Fh0). Of course, it would be intractable to explicitly compute the closure system Sh0 generated by Fh0 because it is simply too big. Fortunately, the explicit computation of Sh0 is not needed: The closure system Sh0 simply consists of all possible intersections of sets of the generating family of sets Fh0. For ease of presentation, assume that (V, ≤) itself is already a complete lattice (otherwise, simply take its Dedekind-MacNeille completion, cf., e.g., [Ganter and Wille, 2012, p.48]).^35 For a finite family (↑Ai)_{i ∈ {1,...,n}} of upsets from Fh0, the intersection ↑A1 ∩ ... ∩ ↑An can be written as

↑A1 ∩ ... ∩ ↑An = ∪ {↑a1 ∩ ... ∩ ↑an | ∀i ∈ {1, ..., n}: ai ∈ Ai} = ∪ {↑⋁{a1, ..., an} | ∀i ∈ {1, ..., n}: ai ∈ Ai}.

Thus, Sh0 can be written as Sh0 = {↑B | B ⊆ T̄(h0)} where T̄(h0) = {⋁A | A ⊆ T(h0)}. Since ∪_{i∈I} ↑Bi = ↑(∪_{i∈I} Bi) for arbitrary families (Bi)_{i∈I}, the closure system Sh0 is closed under arbitrary unions, and thus, because of Birkhoff's theorem, the valid implications of Sh0 are simple implications. Thus, we can firstly calculate all simple implications, or a basis thereof, and implement them in a linear program: For example, one can compute for every x ∈ V the set ↓x ∩ T(h0) of all elements of T(h0) that are below x. Then one can take from the set M of all upper bounds of ↓x ∩ T(h0) the minimal elements min M. Finally, for every y ∈ min M one simply has to implement the associated implication {x} → {y} as an inequality constraint in the linear program.

35 The finiteness is actually not needed, it only makes the presentation more simple here.
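The described extraction of the simple implications {x} → {y} can be sketched in Python as follows (the example poset, encoded by a function leq, and all helper names are our own illustrative choices):

def simple_implications(V, leq, T):
    # For every x in V compute the minimal upper bounds of the set of
    # elements of T below x; each such bound y yields an implication {x} -> {y}.
    implications = []
    for x in V:
        below = [t for t in T if leq(t, x)]
        # upper bounds of (down-set of x intersected with T)
        ub = [v for v in V if all(leq(t, v) for t in below)]
        minimal = [u for u in ub if not any(leq(w, u) and w != u for w in ub)]
        for y in minimal:
            if y != x:
                implications.append(({x}, {y}))
    return implications

# Example: V = divisors of 12 ordered by divisibility, T = {2, 3}.
V = [1, 2, 3, 4, 6, 12]
leq = lambda a, b: b % a == 0
for imp in simple_implications(V, leq, T=[2, 3]):
    print(imp)  # ({4}, {2}) and ({12}, {6})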

For the case of non-guided taming of a closure system that is given by a generating formal context, remember that the taming was simply done by removing objects from the context (but only for the generation of the closure system of the intents, and not for the whole analysis). Let I denote the set of indices of the objects that were excluded for the generation of the closure system. To compute the statistic for the tamed context, one only has to modify the program (24) to the following program:

⟨(w_1^ext, ..., w_m^ext, w_1^int, ..., w_n^int), (z_1, ..., z_m, z_{m+1}, ..., z_{m+n})⟩ → max

w.r.t.

∀(i, j) s.t. A_ij = 0: z_i ≤ 1 − z_{j+m}

∀i ∈ {1, ..., m}: ∑_{k: A_ik = 0} z_{k+m} ≥ 1 − z_i

∀j ∈ {1, ..., n}: ∑_{k ∉ I: A_kj = 0} z_k ≥ 1 − z_{j+m}   (39)

Here, the only difference is that in the last set of inequalities one does not sum over every object index k, but only over those indices that were not excluded for the generation of the closure system. To see the validity of this modification, simply note that the three verbalizations directly above the linear program (24) still exactly characterize the situation, with the only modification of point 3, which has to be modified to: Dually, if attribute mj does not belong to the intent, then there exists at least one object gk that was not excluded for the generation of the closure system of intents, and that belongs to the extent, but does not have attribute mj.

For the guided taming the computation of the test statistic is straightforward. Since the formal implications one additionally imposes are computed explicitly, one can modify the binary program described in Section 4.2 by additionally implementing the further imposed implications as inequality constraints as described in Section 4.1.

6 Examples of application

In this section we apply the developed methods to different data sets. On the one hand, the applications should not be taken at face value as serious substance matter applications. On the other hand, they should also not be misunderstood as pure toy examples. The aim of the following examples of application is to show that the developed methods are in fact applicable to real-world data sets and that these methods are in principle very flexible and can also deal with different kinds of data deficiency. The big part that is missing to make the examples serious substance matter studies is the fact that at many stages of the analysis some substance matter considerations would have to be made to make the analysis more decisive. However, since the authors are clearly no experts in the substance matter fields the applications are related to, they would like to refrain from making such substance matter decisions if possible, or to make the actually needed substance matter decisions only for purposes of illustration. In the following examples, especially in our main example of Section 6.1, we analyze the data sets in a more generic way of proceeding and, if appropriate, shortly indicate at which steps and in which way one could make a more refined data analysis, which of course would be dependent on some substance matter decisions.

6.1 Upsets: Relational inequality analysis

We start with our main example of multivariate inequality analysis using data from the German General Social Survey (ALLBUS) of the year 2014 (GESIS - Leibniz-Institut für Sozialwissenschaften [2015]). In this survey, altogether 3471 persons participated. Here, we analyze systematic multivariate differences between the group of male and female participants w.r.t. the variables Income, Education and Health. The question about Health was asked in a split ballot design to test for a possible impact of different response scales on the result. The participants were asked both in split A and split B about how they would describe their health status in general. The participants of split A got the 5 different answer categories Sehr gut (very good), Gut (good), Zufriedenstellend (satisfactory), Weniger gut (suboptimal) and Schlecht (bad), whereas the participants of split B got the additional category Ausgezeichnet (excellent). (The English categories in brackets are our own translation.) For reasons of simplicity, we used here only the participants of split B and did a complete case analysis. Of course, one could also use both splits for the analysis: If one has some reason to assume that both response scales adequately operationalize the same construct, one can do a joint analysis of both splits by matching the two scales to each other based on their respective empirical distribution functions. This is actually possible because the splitting was random and thus the measured construct has the same distribution in every split.^36

36 Note that due to measurement error, which can be different within the two splits, the actual measurements can differ in their distribution. But if the measurement errors are independent of the measured construct and of each other, this will only produce some smearing of the measurements, which can lead to cases where stochastic dominance w.r.t. the underlying construct is actually present, but is not present anymore for the measurements. A transition of non-stochastic dominance w.r.t. the construct into stochastic dominance w.r.t. the measurements cannot happen.

For the joint analysis of the three variables Income, Education and Health, the complete case analysis consisted of altogether 1515 participants (706 female and 809 male), corresponding to a non-response rate of 12.2%. The variable Income contributed most to the non-response rate (the non-response rate for Income was 11.8%). Here, income was asked for in a two-step procedure: First with an open question and then, for participants who refused to answer the open question, a categorized question with 23 answer categories ranging from "no income" to "more than 7500 Euro" was added. This two-step procedure was done to reduce the non-response rate. Here, for simplicity, we use the combined answers to the open and the list query, where for participants who answered only the list query simply the mid-points of the interval representing the categorized answer were used as a surrogate for the true income.^37 Note that for our analysis we only need the ordinal structure of the variable Income, and furthermore we can actually deal also with a partially ordered structure of the dimension income. Thus, here one can also use more cautious approaches where one says for example that an income that is actually only categorically observed as [a, b) is only lower than or equal to another observed income (no matter if precisely observed as [c, c] or imprecisely observed as [c, d)) iff b ≤ c. Another possibility would be to say that a categorically observed income [a, b) is comparable to itself (i.e. [a, b) ≤ [a, b)), but not to a precisely observed value c ∈ [a, b). The stochastic dominance approach is thus very flexible to deal with certain kinds of non-response/interval-valued observations. Here, we do simply work with the combined values where interval-valued observed incomes are replaced by the corresponding interval mid-points.

The variable Education is the classification of the level of education according to the International Standard Classification of Education (ISCED) 2011 (see [UNESCO Institute for Statistics (UIS), 2012]) implemented for Germany. On the highest stage, this classification differentiates between 9 different main levels of education:

Level 0: Less than primary education
Level 1: Primary education
Level 2: Lower secondary education
Level 3: Upper secondary education
Level 4: Post-secondary non-tertiary education
Level 5: Short-cycle tertiary education
Level 6: Bachelor's or equivalent level
Level 7: Master's or equivalent level
Level 8: Doctoral or equivalent level

We treat here the variable Education as of totally ordered scale of measurement. In the sample, only the levels from 1 to 8 were observed. Note that also for this dimension the methodology of stochastic dominance would be able to deal with an only partially ordered scale: The ISCED 2011 could also be implemented in a more cautious way: For example, instead of only comparing the highest educational achievements, one could alternatively look at the whole educational paths and say that a person A is more poor than another person B w.r.t. the dimension Education only if both persons followed the same educational path but person A stopped earlier with a lower highest educational achievement than person B. This partial ordering of the dimension Education would lead to a less decisive analysis, but it has the potential to reveal how much a more classical analysis would depend on the choice of a totally ordered scale for the dimension education.

37 For the answer category "below 200 Euro" a value of 150 Euro and for the category "more than 7500 Euro" a value of 8750 Euro was assigned.
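As a small illustration of the first of the cautious income comparison rules described above, the following Python sketch (names ours) encodes the partial order in which an interval-valued observation [a, b) lies below another observation iff its upper endpoint does not exceed the other's lower endpoint:

def interval_leq(x, y):
    # Cautious comparison of possibly interval-valued incomes: [a, b) is below
    # another observation iff b <= c, where c is the other's lower endpoint;
    # precise values c are treated as the degenerate interval [c, c].
    (ax, bx), (ay, by) = x, y
    return (ax, bx) == (ay, by) or bx <= ay

print(interval_leq((200.0, 400.0), (400.0, 600.0)))  # True
print(interval_leq((200.0, 400.0), (300.0, 300.0)))  # False: the intervals overlap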

We begin with a marginal analysis of all 3 variables. Figure 3 shows the lower cumulative distribution function for every variable for both the male and the female group. One can see that the female group is almost dominated by the male group for the variables Income, Education and Health. With regard to Income, the extent of dominance is the highest: 66.4% of the women earn not more than 1300 Euro, but only 31.9% of the men earn not more than 1300 Euro, which is a difference of 34.5 percentage points. Only for the very high income of 12000 Euro is there a small deviation from dominance, in the sense that 99.9% of the men earn not more than 12000 Euro, where this is the case for only 99.8% of the women. For the variable Health there is only a deviation from dominance w.r.t. the percentage of women reporting a health status bad: Only 2.2% of the women report a health status bad, which is about 0.7 percentage points lower than the amount of 2.9% for the men. The variable Education shows strict dominance.

Now, let us come to the joint analysis. For the statistics

D+ = max_{M ∈ U((V,≤))} ⟨w_x − w_y, m⟩
D− = min_{M ∈ U((V,≤))} ⟨w_x − w_y, m⟩,

where X describes the subpopulation of male and Y describes the subpopulation of female persons, we obtain

D+ ≈ 36.48%
D− ≈ −1.21%,

which indicates an almost strict dominance for the joint distribution of the variables Income, Education and Health, where the small deviation from dominance is, with D− ≈ −1.21%, not much higher than the largest deviation of −0.7% for the variable Health in the marginal analysis. The maximal value of 36.48% is about 2 percentage points higher than the largest maximal value of 34.5% for the variable Income in the marginal analysis. Beyond the purely quantitative analysis one can also look at which upsets the maximum and the minimum of the test statistic are attained. The maximum of the statistic is attained at an upset U generated by the antichain A (via U = ↑A) containing 9 elements, depicted in Table 3. The minimal test statistic is attained at an upset generated by an antichain of size 4, described in Table 4.
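For a very small poset one can compute D+ and D− by brute force directly from their definitions; the following Python sketch (with toy samples of ours) enumerates all upsets explicitly, whereas the analyses reported here of course rely on the linear programming formulation:

from itertools import chain, combinations

def upsets(V, leq):
    # All upsets of a finite poset (exponential; only for tiny V).
    subsets = chain.from_iterable(combinations(V, r) for r in range(len(V) + 1))
    return [set(M) for M in subsets
            if all(w in M for v in M for w in V if leq(v, w))]

def D_plus_minus(x_sample, y_sample, V, leq):
    diffs = []
    for M in upsets(V, leq):
        px = sum(v in M for v in x_sample) / len(x_sample)
        py = sum(v in M for v in y_sample) / len(y_sample)
        diffs.append(px - py)
    return max(diffs), min(diffs)

# Toy example on {1, 2, 3, 4} with the usual linear order.
V = [1, 2, 3, 4]
leq = lambda a, b: a <= b
x = [2, 3, 4, 4]   # first toy sample
y = [1, 2, 2, 3]   # second toy sample
print(D_plus_minus(x, y, V, leq))  # (0.5, 0.0)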

[Figure 3 about here: three panels with the empirical cumulative distribution functions of Income, Education (ISCED 2011) and Health.]

Figure 3: Empirical cumulative distribution function for all 3 considered variables for the male group (black) and the female group (grey).

Based on a resampling scheme with 10000 replications, the test statistic D+ appears as highly significantly above 0, whereas D− is really only non-significantly different from zero: The maximal observed value of D+ in the resample is 17.48% and the minimal value of D− observed in the resample is −1.63%, which is very close to the −1.21% for the actually analyzed data set; actually, the value of −1.21% is closer to zero for the actual data set than the closest value of the resample. The poset generated by the actually observed data and the coordinate-wise ordering has a Vapnik-Chervonenkis dimension (width) of 33. For such a V.C.-dimension and an n of around 700,^38 the V.C.-inequality (26) is too loose.

38 The data set included 706 female and 809 male participants. Note that actually the sampling weights for east and west would have to be taken into account, too; here, however, we only want to get a rough idea about how sharp the V.C.-inequality is in our situation.

     Income (Euro)   Education (ISCED 2011)                 Health (self-reported)   difference   above
1    400 (0.93)      Upper secondary education (0.9)        excellent (0.08)         0.02         0.06
2    650 (0.84)      Lower secondary education (0.99)       excellent (0.08)         0.02         0.06
3    1080 (0.64)     Master's or equivalent level (0.22)    very good (0.32)         0.02         0.07
4    1100 (0.64)     Master's or equivalent level (0.22)    good (0.68)              0.06         0.14
5    1260 (0.55)     Master's or equivalent level (0.22)    satisfactory (0.89)      0.08         0.17
6    1300 (0.55)     Upper secondary education (0.9)        satisfactory (0.89)      0.3          0.49
7    1400 (0.51)     Primary education (1)                  good (0.68)              0.25         0.37
8    1400 (0.51)     Upper secondary education (0.9)        bad (1)                  0.33         0.49
9    1450 (0.48)     Lower secondary education (0.99)       satisfactory (0.89)      0.32         0.45

Table 3: The antichain A = {A1, ..., A9} that generates the upset U = ↑A where the maximum of the test statistic is attained. In brackets the marginal upper quantiles that correspond to the values are given, e.g. the 0.93 behind the 400 in the first row of the first column means that ca. 93% of the persons in the population earn at least 400 Euro. The column difference displays for every row i the difference between the proportion of male and the proportion of female persons that are above element Ai. The column above shows the proportion of all persons that are above Ai.

     Income (Euro)   Education (ISCED 2011)                 Health (self-reported)   difference   above
1    100 (1)         Master's or equivalent level (0.22)    excellent (0.08)         -0.0019      0.02
2    130 (0.99)      Upper secondary education (0.9)        suboptimal (0.97)        0.0359       0.87
3    600 (0.86)      Lower secondary education (0.99)       very good (0.32)         0.0759       0.27
4    2900 (0.12)     Master's or equivalent level (0.22)    bad (1)                  0.0797       0.07

Table 4: The antichain A = {A1, ..., A4} that generates the upset U = ↑A where the minimum of the test statistic is attained. In brackets the marginal upper quantiles that correspond to the values are given, e.g. the 1 behind the 100 in the first row of the first column means that ca. 100% of the persons in the population earn at least 100 Euro. The column difference displays for every row i the difference between the proportion of male and the proportion of female persons that are above element Ai. The column above shows the proportion of all persons that are above Ai.

For a value of the test statistic of about 36% one would have to have chosen a V.C.-dimension of about 8 to make the conservative V.C.-inequality lead to a significant result.

Since we were able to compute a large enough resample, we actually do not need to rely on the V.C.-inequality. However, for the purpose of illustration, we can tame the closure system of upsets to get an insight into how this affects the behavior of the test statistic for the actually observed data and the distribution of the test statistic under H0. Figure 4 shows the value of the test statistic D+ for the actually observed data, as well as the distribution of D+ under H0,^39 for different V.C.-dimensions ranging from 4 to 39. Note that the original V.C.-dimension was 33, which is maybe surprising, but the V.C.-dimension of 39 for the biggest tamed closure system is due to the fact that by taming the closure system one gets in a first step only a family of sets that is generally no closure system, and one has to enlarge this family in a second step to be a closure system to make the analysis computationally feasible. One can see that, as expected, with increasing V.C.-dimension both the value of the test statistic for the actually observed data and the expectation of the test statistic under H0 increase. The standard deviation of the test statistic also has an increasing trend for increasing V.C.-dimensions. If one standardizes the test statistic D+ by subtracting its mean and dividing the result by its standard deviation, one sees that the shape of the distribution of D+ is approximately independent of the V.C.-dimension. This fact could possibly be used to get rules of thumb for situations where the computation of large resamples is computationally intractable. However, the approximate independence of the shape of the distribution of D+ from the V.C.-dimension may only be present in our special situation and thus may be misleading for getting a rule of thumb for the general case.

39 Here, we computed a resample of size 1000 to get a rough insight into the distribution of D+ under H0.

[Figure 4 about here, three panels: the actually observed test statistic D+ for different V.C.-dimensions; the empirical distribution of D+ under H0 for V.C.-dimensions from 4 to 33; and the empirical distribution of the standardized D+.]

Figure 4: The value of the test statistic D+ for the actually observed data, as well as the distributions of the test statistic D+ and the standardized test statistic (D+ − D̄+)/sd(D+) under H0 for different V.C.-dimensions. One can see that the shape of the distribution of D+ is nearly independent of the V.C.-dimension.

Expectation of D+ for different V.C. − dimensions

0.021 0.020

SD(D+)

0.04

0.018

0.019

0.05

0.06

E(D+)

0.07

0.022

0.08

0.023

Standard deviation of D+ for different V.C. − dimensions

5

10

15

20

25

30

35

40

5

10

15

VC

20

25

30

35

40

VC

Figure 5: The expectation and the standard deviation of

D+

under

H0

for dierent V.C.-

dimensions. Now, we would like to illustrate a little bit, how the taming behaves w.r.t. conceptual terms. Therefore, we analyze for a tamed closure system of V.C.-dimension 7 the tamed + − upsets and downsets, where the maximum D and the minimum D is attained. For the upset-approach, the maximal statistic for the tamed closure system is attained at an antichain of size

7

summarized in Table 5.

Income (Euro)

Education (ISCED 2011)

Health (self-reported)

dierence

above

1

1020 (0.65)

Short-cycle tertiary education (0.37)

excellent (0.08)

0.01

0.02

2

1050 (0.65)

Short-cycle tertiary education (0.37)

very good (0.32)

0.04

0.12

3

1050 (0.65)

Master's or equivalent level (0.22)

good (0.68)

0.06

0.14

4

1063 (0.65)

Short-cycle tertiary education (0.37)

good (0.68)

0.11

0.23

5

1200 (0.6)

Upper secondary education (0.9)

good (0.68)

0.23

0.42

6

1248 (0.56)

Upper secondary education (0.9)

satisfactory (0.89)

0.3

0.5

7

1300 (0.55)

Upper secondary education (0.9)

suboptimal (0.97)

0.32

0.52

Table 5: The antichain

A = {A1 , ..., A7 } that generates that upset U =↑ A where the max-

imum of the test statistic is attained for the tamed closure system with a V.C. dimension of

7.

One can see that the maximal dierence between the transformed Z -values in brackets 0.65 − 0.08 = 0.57 attained for the rst element A1 , where the Z -value of 0.65 for an income of 1020 Euro is the largest, and a Z -value of 0.08 for a health status excellent is is

the smallest value. Compared to this, in the non-tamed case, the skewest element of the antichain generating the upset where the maximal value of the test statistic is attained is the element

education

A2

Z -value of 0.99 for an education status Lower secondary Z -value of 0.08 for a health status excellent. The dierence

with a maximal

and a minimal

0.99 − 0.08 = 0.91

is clearly greater than for the tamed situation showing that we actually

managed to reduce the skewness of elements generating the closure system for the tamed analysis.

50

The value of the tamed test statistic is with value of

36.48%,

32.90%

not much smaller than the initial

still signicantly dierent from zero.

The minimal value is

attained at the antichain consisting of only one element depicted in Table 6

1

−0.045%

Income (Euro)

Education (ISCED 2011)

Health (self-reported)

dierence

above

3500 (0.08)

Master's or equivalent level (0.22)

excellent (0.08)

-5e-04

0.003

Table 6: The antichain

A = {A1 }

that generates that upset

U =↑ A

where the minimum

of the test statistic is attained for the tamed closure system with a V.C. dimension of

7.

The minimal value is not signicantly dierent from zero. Table 7 and Table 8 nally show the results one would obtain if one would do the tamed analysis by looking at downsets instead of upsets: Obviously, the role of the maximal and the minimal value of the test statistic will interchange: The maximal value of the test statistic of between the proportion of the

poor

male and the

attained if one concretizes the term

poor

poor

0.15%

means that the dierence

female persons is maximally

with the downset

D =↓ A

antichain given in Table 7. The minimal value of the test statistic is

0.15%

generated by the

−28.82%

attained

for the downset generated by the antichain given in Table 8. Note that for the downsetanalysis we used for the construction of the

Z -values

not the complementary distribution

function, but the usual distribution function, because this ts better to the notion of a downset. This only has an impact on the interpretation of the numbers given in brackets. For example the

0.06

beyond the income value of

360

Euro in Table 7 means now, that

below 360 Euro. Additionally, the last column, denoted below, now gives the proportion of persons below the corresponding element of the antichain and the column dierence gives the dierence of the proportions below the corresponding

6%

of the population have income

element.

1

Income (Euro)

Education (ISCED 2011)

Health (self-reported)

dierence

below

360 (0.06)

Primary education (0.01)

bad (0.03)

0.0015

0.0008

Table 7: The antichain

0.0015 of 7.

A = {A1 } that generates that downset D =↓ A where the maximum

of the test statistic is attained for the tamed closure system with a V.C. dimension

6.2 Concept extents: Gender dierences and dierential item functioning in an item response dataset In this section, we shortly analyze an IRT-dataset w.r.t. gender dierences and

tial item functioning (DIF, [Osterlind and Everson, 2009]). from the general knowledge quiz

Studentenpisa 51

Dieren-

The data set is a subsample

conducted online by the German weekly

1 2 3 4

Income (Euro) 1400 (0.51) 1450 (0.52) 1460 (0.52) 1474 (0.53)

Education (ISCED 2011) Bachelor's or equivalent level (0.78) Short-cycle tertiary education (0.75) Post secondary non-tertiary education (0.63) Upper secondary education (0.55)

Table 8: The antichain

−0.288 dimension of 7.

minimum

A = {A1 , ..., A4 }

Health (self-reported) good (0.68) very good (0.92) good (0.68) good (0.68)

that generates that downset

dierence -0.21 -0.28 -0.20 -0.16

D =↓ A

below 0.33 0.41 0.3 0.28 where the

of the test statistic is attained for the tamed closure system with a V.C.

news magazine SPIEGEL ([SPIEGEL Online, 2009], see also Trepte and Verbeet [2010] for a broad analysis and discussion of the original data set.) The data contain the answers of

1075 university students from Bavaria to 45 multiple choice items concerning the 5 dierent topics politics, history, economy, culture and natural sciences. For every topic, 9 questions were posed, for example question 1 of the politics topic was: Who determines the rules of action in German politics according to the constitution?. The data set was analyzed in a number of papers, for example in Strobl et al. [2015], Tutz and Schauberger [2015], Tutz and Berger [2016], mostly from an IRT point of view. All mentioned papers identied systematic dierences between the subgroups of male and female students in the sense of the presence of dierential item functioning. Dierential item functioning is present if the distribution of the item response patterns in two subgroups with identical latent abilities are dierent. Here, one cannot assume that the subgroups of male and female students that actually participated in the online quiz have the same latent abilities, because for example self selection processes can be present. To analyze the presence of dierential item functioning one has to rstly somehow match persons of the two subgroups with similar abilities. One classical non-parametric procedure is the test of Mantel Haenszel (see [Holland et al., 1988].), where one takes the item scores (i.e., the number of solved items) as a matching criterion

40 .

One straties the populations into parts with the same item score and then 2 compares the subpopulations in every stratum. The nal test statistic is then a χ -type statistic cumulating over all strata. The Mantel Haenszel procedure is an item-wise test, one tests for every item separately, if DIF is present for this item. For the construction of the matching score one usually does not take the whole set of items, instead one ignores items that showed DIF in a rst preliminary analysis that was based on the whole set of items

41 .

This process is called purication and there are dierent variants of purication,

see, e.g., [Osterlind and Everson, 2009, p.16]. We can use the linear programming approach on formal contexts to develop a joint DIF test based on the item scores as a matching criterion. Firstly, we have to care for the dierent distributions of the abilities in the dierent subgroups. Here, we do not make a conditional analysis since conditioning would make all classes with the same item score relatively small such that a

45-dimensional

multivariate

40 Note that this will only work if the score values are a sucient statistic for the abilities, which is for example the case for the Rasch model. For a discussion of deviations from this assumption in the context of the classical Mantel Haenszel procedure, see, e.g., [Zwick, 1990]

41 The actually tested item should always be included for matching to make the Mantel Haenszel proce-

dure valid under the null hypothesis of no dierential item functioning, see [Holland et al., 1988, p.16].

52

analysis in every stratum would expectedly have very low power.

Instead, we re-weight

both subgroups such that the ability distributions in the male and the female group are approximately the same and then we analyze the joint distribution of item patterns and abilities (measured via the item scores). Concretely, we do the following:

K0 = (G, M, I) be the formal context where G = {g1 , . . . , g1075 } is the set of persons, M = {m1 , . . . , m45 } is the set of items and gIm i person g solved item m.

1. Let

2. Separately for the male and the female group we estimate the density of the distribution of the item scores

s, denoted with fˆmale

and

fˆf emale , respectively.

The estimation

is done here with a kernel density estimator. 3. Then we inversely re-weight the sample by giving a weight

( fˆmale (si ) Wi := fˆf emale (si )

if the if the

ith ith

person is female person is male.

After this, the re-weighted distribution of the scores in the male and female group are approximately the same. 4. Then we analyze the joint distribution of response patterns and the score values in both subgroups.

To do this, we use the exibility of formal context analysis and

simply conceptually scale the score values with an interordinal scale. Concretely, for every score value

K0

s

we add an attribute  ≤

s

and an attribute  ≥

s

to the original

g has attribute  ≤ s if g has a score value s and person g has attribute  ≥ s if g has a score value greater than or equal to s. Afterwards, we analyze the enlarged context K1 by looking at the closure system B1 (K1 ) and computing context

with the interpretation person

lower than or equal to

max/minM ∈B (K ) h(w 1 1 where

wx

x

are the original weights for the male and

persons. ( Concretely, in the sample there were 1 if the ith person is male wix = 658 0 if the ith person is female and

wiy =

( 0

− wy ) · W, 1M i,

1 417

if the ith person is male if the ith person is female

wy

are the weights for the female

658 male and 417 female persons, thus

.

5. In a last step we apply a purication procedure by basing the item scores for matching only on items that are not in the concept intents for which the maximal and minimal test statistic was obtained in a rst run. We repeat the purication procedure until not further items are excluded.

53

Before showing the actual results, we rstly compute the test statistics for the context

22

K0

and has about

D+

and

D−

without re-weighting the data. The context has a V.C.-dimension of

8.900.000.000

formal concepts (this is an estimate based on random

sampling of arbitrary item sets and checking if they are a concept intent) and is thus very hardly describable explicitly and we will use the binary program described in Section 4.2 to compute the test statistics

42 .

The maximal value of the test statistic is

0.335

attained

at a formal concept containing the questions F6: Who is this? - (Picture of Horst Seehofer.) F26: Which internet company took over the media group Time Warner? - AOL. This means that the dierence in the proportions of male and female persons who answered at least questions

F6

and

F 26

correctly is the greatest observed dierence between

proportions of male and female persons that answered at least all items of some set of items correctly. Concretely,

53.6%

of the male and

20.1%

of the female persons answered

these both questions rightly. The minimal value of the test statistic is a formal concept containing the questions

−0.169

attained at

F40: What is also termed Trisomy 21? - Down syndrome. F43: Which kind of bird is this? - Blackbird. Here,

59.6%

of the male and

76.5%

of the female persons got both questions right. Both 1000 the value max{D+ , −D− } had a

dierences are signicant, for a resample of size range from

0.05

to

0.14

and a standard deviation of

0.014.).

Figure 6 gives a rough idea

about how a (non-guided) taming of the closure system by removing big shatterable sets + of objects from the context aects the distribution of the test statistic D under H0 . (Note, that this is only a very small simulation where we resampled only

100

times for

every value of the V.C.-dimension.) The initial context has a V.C.-dimension of

22.

One

can see, that by reducing the V.C.-dimension, the mean and the standard deviation of the test statistic does not change very much for small reductions of the V.C.-dimension. Only a very strong taming to a V.C.-dimension below + mean of D under H0 .

8

seems to have an eect in reducing the

Now we come to the actual DIF-analysis: Figure 7 shows the distribution of the item scores for the male and female persons. The distributions are very dierent and thus we have to correct for this dierence by re-weighting the data.

However, generally, every

attempt to account for such a kind of dierence should be taken with some grain of salt, because initially we would like to account for dierences in the abilities, but the abilities are only latent traits that cannot be observed and thus have to be estimated, in our situation

42 To get a rough idea of computational complexity: The MIP solver needed ca.

100

Gurobi

(see [Gu et al., 2012])

seconds to compute the statistic using one core on a 2.60 Ghz CPU (Intel(R) Xenon(R)

CPU E5-2650 v2 @2.60 Ghz, 64GB RAM).

54



● ●







● ●



0.10







statistic D+





● ●

0.05

● ●

0.00

2

3

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

V.C.−dimension Figure 6: Distribution of the tamed statistic

from the item scores.

D+

for dierent V.C.-dimensions.

In the unlucky case, an attempt for accounting for dierences in

abilities can make the analysis still more misleading if the items that suer from DIF cannot be detected accurately enough and thus the item scores are invalidated as a surrogate for the abilities.

The joint analysis of the re-weighted sample leads in the rst step to a

maximal value of the statistic of

0.234

attained at the intent of persons who answered the

question F26: Which internet company took over the media group Time Warner? - AOL. correctly and had a score value between the rst step was

−0.333

17

and

37.

The minimum of the test statistic in

attained at an intent containing the

5

questions

F12: Which form of government is associated with the French King Louis XIV? - Absolutism.

55

1.0

Empirical distribution functions of observed item scores



male female









0.8

● ●



● ●



● ● ●

● ●









0.6











Fn(x)





● ●



0.4

● ● ● ●

● ●

0.2

● ●

● ●

0.0







● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

10

20

30

40

x

Figure 7: Empirical distribution of the item scores for the female and the male group in the subsample of the general knowledge quiz Studentenpisa ([SPIEGEL Online, 2009]).

F33: What is the name of the bestselling novel by Daniel Kehlmann? - Die Vermessung der Welt (Measuring The World)." F35: In which city is this building located? - Paris." F40: What is also termed Trisomy 21? - Down syndrome." F43: Which kind of bird is this? - Blackbird." and score values between

16

and

35.

After excluding questions F26, F12, F33 F35, F40

and F43 for matching, in a second step, additionally the two questions F34: Which city is the setting for the novel 'Buddenbrooks' ? - Lübeck." and F36: Which one of the following operas is not by Mozart? - Aida." were excluded for matching. In the third step, the procedure stopped with a maximal + nal statistic D of 0.241 attained for the intent containing F 26 and (modied) score − values between 16 and 29. The minimal value D was −0.290 attained for the intent containing questions and

30.

F 12, F 33, F 35, F 40 and F 43 and (modied) score values between 12

Thus, altogether, questions F12, F26 , F33, F34, F35, F36, F40 and F43 showed

DIF. (The result was statistically signicant in the sense that a bootstrapp sample of

56

10

sample yielded a distribution of the absolute value of the test statistic with mean standard deviation

0.21 and

0.01.)

6.3 Guided taming of concept extents in cognitive diagnosis models In this section, we would like to illustrate a little bit, how one can tame a formal context in a more guided way in the context of cognitive diagnosis models. For illustration, we use a subsample of the the year

2007.

Trends in International Mathematics and Science Study

(TIMSS) of

This study is an international assessment of the mathematics and science

knowledge of students, that was rstly conducted in

1995 and has been administered every

four years thereafter by the International Association for the Evaluation of Educational Achievement (IEA). It analyses math- and science knowledge of 4th and 8th grade students.

CDM43 ,

consisting of 698 Austrian 25 math questions (dataset data.timss07.G4.lee). Since not all students answered all 25 questions, we restrict here the analysis to that 344 students that answered all questions. The 25 questions were the same as that used in Lee et al. [2011]. The package also provides the Q-matrix and the description of the skills We use here a subsample provided in the R-package students (4th grade) answering a set of

used in Lee et al. [2011]. We will use this small subsample to illustrate the guided taming procedure by comparing it to the non-guided taming procedure described in section 5.3.2.

K0 = ({g1 , . . . , g344 }, {m1 , . . . , m25 }, I) has a Vapnik-Chervonenkis dimension of 14 and consists of 255712 formal concepts. Because of the small cardinality of the concept lattice, we can explicitly compute the closure system B1 (K0 ) of all extents The formal context

and thus we will analyze the taming process not w.r.t. the V.C.-dimension, but w.r.t. the cardinalities of the tamed closure systems

˜ . B1 (K)

The data set contains also information

about gender, so we will analyze dierences w.r.t. gender. Figure 8 shows the value of the test statistic for the actually observed data in dependence on the cardinality of the tamed closure system for both the guided taming and the non-guided taming. One can see that, as expected, the statistic increases with increasing cardinality of the closure system. The general pictures for the non-guided and the guided taming are very similar. For the guided taming, the smallest closure system that is obtained by enforcing all valid implications of the idealized response pattern space, has a size of smallest possible closure system of size

2,

127,

which is much higher than the

obtainable by the strongest possible non-guided

taming. Figure 9 shows the p-value one would obtain if one would do a statistical test. (Here, we did resampling with

1000

resamples to compute the p-values.)

One can see,

that for comparable sizes of the closure system, the guided taming procedure generally has lower p-values.

One could speculate here, that the guided taming tends to exclude

mainly sets that are statistically not so important in the sense that they play no crucial role w.r.t. dierences between male and female participants. If one assumes that in the actually observed data set there are clear dierences between male and female participants,

43 See Robitzsch et al. [2016] for an introduction to the package

57

CDM.

then it seemingly appears here, that the guided taming leads to a smart reduction of the size of the closure system that actually reduces the variability of the statistic under without reducing the test statistic under

H1 ,

H0

too much.

0.15

0.10

statistic

taming guided non−guided

0.05

0.00 2

4

8

16

32

64

128

256

512

1024

2048

4096

8192

16384

cardinality

Figure 8: Value of the tamed test statistic for dierent cardinalities of the tamed closure system, both for the non-guided and the guided taming.

p.value

0.6

taming 0.4

guided non−guided

0.2

2

4

8

16

32

64

128

256

512

1024

2048

4096

8192

16384

cardinality

Figure 9: Obtained p-values of the test statistic for the actually observed data for dierent sizes of the tamed closure system, both for the non-guided and the guided procedure.

58

6.4 Convex sets: A geometrical generalization of the KolmogorovSmirnov test Finally, we want to illustrate that for small data sizes the generalization of the KolmogorovSmirnov test for analyzing spatial dierences between subpopulations in spatial statistics as indicated in Section 4 is also practically applicable. We use here the data set

quercusvm

which is a subsample of a larger data set analyzed in Laskurain [2008] and available in the R Package

ecespa

([de la Cruz Rot, 2008]). This data set consists of

100

data points

representing the locations of alive and dead oak trees (Quercus robur) in a secondary wood in Urkiola Natural Park (Basque country, north of Spain). The data are depicted in Figure 10.

We can now compute that convex sets where the maximal and the minimal

Figure 10:

Locations of altogether

100

alive and dead oak tress (Quercus robur) in a

secondary wood in Urkiola Natural Park (Basque country, north of Spain). dierences in proportion of alive and dead oak trees is attained.

59

Figure 11 shows the

results.

The blue convex set is the set, where the dierence is maximal (37%): In the

Figure 11: Dierence in proportions of alive and dead oak trees. Blue: maximal dierence of

39%

(more alive than dead trees). Red: minimal dierence of

alive trees).

blue convex area we have a proportion of

61%

−37%

(more dead than

alive, but only a proportion of

24%

dead

trees. The red convex set is the set, where the dierence is minimal (−39%): In the red

15% alive and 54% dead trees. Based on 1000 resamples, one gets p-value of 0.83, so the dierences or not statistically signicant. Note

convex area there are an approximate

that also the Cramér von Mises type test proposed by Syrjala [1996] and also a classical generalization of the Kolmogorov-Smirnov test, where one only looks at rectangular areas are both non-signicant. (The p-values of the tests, computed with the function from the R Package

ecespa

are approximately

0.84

and

0.67,

syrjala

respectively.) Compared to

Syrjalas test, the test based on convex sets has the advantage that it is somehow better

60

interpretable because one can actually see, in which areas the dierences in proportion are maximal or minimal. A further modication of the test is also possible: Since the convex sets are described by formal implications and one explicitly models these implications by imposing corresponding inequality constraints in the binary program, one has the exibility to impose not all, but only some implications.

One natural way to select implications to include would be to

include only implications where the data points of the premise and the conclusion are not too far away from each other. This would lead to some kind of a localization method and the associated closure system would get larger, which means more exibility in detecting nonconvex distributional features but a generally higher V.C.-dimension. We shortly illustrate this modication by imposing only implications where the distances between the points of the premises and the conclusions is not greater than

40m.

Figure 12 shows the non-convex

sets where the maximal (blue) and the minimal (red) dierences in the proportions of alive and dead oak trees is attained.

Figure 12: Non-convex sets where the maximal (blue) and the minimal (red) dierence in proportions between alive and dead oak trees is attained for the modied method where only implications, where the distance between the points of the premises and the conclusions in not greater than

40m,

and red points have a radius of

are used. The dashed circles around the highlighted blue

40m

to get an impression about which implications were

actually included.

For the modied version we got a maximal value of

58%

and a minimal value of

for the dierence of the proportions with an approximate p-value of

61

0.17.

50%

While the computations for this data set could be done quickly enough, for larger data sets, the method quickly becomes intractable. containing

327

The data set analyzed in Syrjala [1996]

spatial measuring points was already very hard to analyze, to compute the

test statistic, the mixed integer solver Gurobi (Gu, Rothberg, and Bixby [2012]) took a few days to solve the binary program. Similarly to the analysis in Syrjala [1996], the result w.r.t. dierences between male and female cods was not statistically signicant. To assess the statistical signicance of our test statistic, we did not need to do resampling, which would actually be very time demanding. Instead we could rely on the fact that for a value of

0.06

that we observed for our test statistic, still the more classical Kolmogorov-Smirnov

type test statistic that only looks at rectangular areas would not be statistically signicant. However, for dealing with the computational issue, one can use the technique of attribute exploration for formal contexts:

K := (G, M, I)

where

measuring points and

One can rstly look at the formal context

G is the set of all rectangular areas, M is the set of all spatial gIm i measuring point m lies in the rectangular area g . The re-

sulting closure system of all concept intents is then the set of all sets of measuring points lying in some rectangular area, which is a smaller closure system than the system of all convex sets of measuring points and in which thus more formal implications are valid. Note that despite this, a base of all implications of this smaller closure system can be given as

{{p, q} 7→ [p, q] | p, q ∈ M, [p, q] ) {p, q}}, [p, q] := {r ∈ M | r1 ∈ [min{p1 , q1 }, max{p1 , q1 }] & r2 ∈ [min{p2 , q2 }, max{p2 , q2 }]}. 2 Compared to the base for general convex sets, this base has only O(n ) implications and where

is thus far more easy to handle.

Now, during the computation of all valid implications of

K,

in the spirit of attribute

exploration, one can check for every currently generated implication, if it is also approximately true with some condence

c

all half-spaces generated by two points of

g.

in the half-space

˜ = (G, ˜ M, I) ˜ , where G ˜ K ˜ means that measuring g Im

in the context

M

and

is the set of point

m

lies

The intents of this context are exactly all convex areas of measuring

points. If the currently generated implication is also true in the larger closure system of all convex areas, then one would treat it as valid, otherwise one would provide a convex half-space

˜ g∈G

as a counterexample.

With this procedure one would generate a closure system that is larger than the system of all rectangular areas and smaller than the closure system of all convex areas and the condence level

c

regulates the size of the resulting closure system and the size of the

implication base. Thus, with this modication, we have some scalable method for spatial statistics. (Of course, with the drawback that now the result of the method is dependent on the choice of the coordinate system.)

62

7

Conclusion

In this paper we analyzed the problem of detecting stochastic dominance as a prototypical example of optimizing a linear function on a closure system. Compared to the general case, for stochastic dominance, the integrality constraints of the underlying binary program could be dropped which helped in making the problem more tractable. For general closure systems the binary programs are more dicult to solve, but we managed to solve them in our concrete cases of application. Note that we did not explicitly incorporate knowledge about the underlying closure system into the mixed integer solver we used. It seems that one can make the computations far more ecient by using for example knowledge about valid formal implications of the underlying closure system. If one knows that the certain formal implications are valid, then one can possibly use this knowledge to explicitly prune the search space in the branch and cut algorithm of the mixed integer solver. The solved binary programs and the associated test statistics treated in this paper could be understood as some Kolmogorov-Smirnov type generalizations. This motivates the question if also other generalizations like weighted Kolmogorov-Smirnov type or AndersonDarling type tests are computational tractable. Actually, it seems to be not too dicult to compute such variants of a test statistic: Firstly, one can impose one additional constraint into the underlying program that demands that the sets one is optimizing over contain at least (or at most, or exactly) an amount

c

of overall probability mass. Secondly, one can

do the constrained optimization for every possible amount optimal values for dierent

where

ψ

c

c

and can then aggregate the

for example to

sup

sup

c

m: hwx +wy ,mi≥c

hwx − wy , mi·ψ(c),

is some appropriately chosen weighting function.

All in all, it seems that the optimization of linear functions on closure systems has a broad range of possible applications and thus deserves further research.

References R. Agrawal, T. Imieli«ski, and A. Swami. Mining association rules between sets of items

Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, SIGMOD '93, pages 207216. ACM, 1993.

in large databases. In

A. Albano.

The implication logic of (n,k)-extremal lattices.

In K. Bertet, D. Borch-

Formal Concept Analysis: 14th International Conference, ICFCA 2017, Rennes, France, June 13-16, 2017, Proceedings, pages 3955. mann, P. Cellier, and S. Ferré, editors, Springer, 2017a.

Polynomial growth of concept lattices, canonical bases and generators: extremal set theory in formal concept analysis. PhD thesis, Technische Universität Dresden,

A. Albano.

63

2017b. URL

http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-226980.

acces-

sed 19.08.2017. A. Albano and B. Chornomaz. Why concept lattices are large: extremal theory for generators, concepts, and VC-dimension.

International Journal of General Systems,

pages

118, 2017. S. Alkire, J. Foster, S. Seth, J. Roche, and M. Santos.

rement and Analysis.

Multidimensional Poverty Measu-

Oxford University Press, 2015.

M. Anthony, N. Biggs, and J. Shawe-Taylor.

The learnability of formal concepts.

Proceedings of the Third Annual Workshop on Computational Learning Theory,

In

COLT

'90, pages 246257. Morgan Kaufmann Publishers Inc., 1990a. M. Anthony, N. Biggs, and J. Shawe-Taylor.

sis.

Learnability and Formal Concept Analy-

University of London. Royal Holloway and Bedford New College. Department of

Computer Science, 1990b. C. Arndt, R. Distante, M. A. Hussain, L. P. Østerdal, P. L. Huong, and M. Ibraimo. Ordinal welfare comparisons with multiple discrete indicators: A rst order dominance approach and application to child poverty.

World Development, 40(11):22902301, 2012.

C. Arndt, M. A. Hussain, V. Salvucci, F. Tarp, and L. P. Østerdal. Advancing small area estimation. Technical report, WIDER Working Paper, 2013. URL

net/10419/80908.

http://hdl.handle.

accessed 28.08.2017.

C. Arndt, N. Siersbæk, and L. P. Østerdal. Multidimensional rst-order dominance comparisons of population wellbeing. Technical report, WIDER Working Paper, 2015. URL

http://hdl.handle.net/10419/129471. M. M. Babbar. Distributions of solutions of a set of linear equations (with an application to linear programming).

Journal of the American Statistical Association, 50(271):854869,

1955.

Econometrica,

G. F. Barrett and S. G. Donald. Consistent tests for stochastic dominance. 71(1):71104, 2003. Y. Bastide, N. Pasquier, R. Taouil, G. Stumme, and L. Lakhal.

Mining minimal non-

redundant association rules using frequent closed itemsets. In J. Lloyd, V. Dahl, U. Fuhrbach, M. Kerber, K.-K. Lau, C. Palamidessi, and L. M. Pereira, editors,

Logic-CL 2000, pages 972986. Springer, 2000. G. Birkho. Rings of sets.

Computational

Duke Mathematical Journal, 3(3):443454, 1937.

S0012-7094-37-00334-X.

64

doi: 10.1215/

L. Bottou.

In hindsight: Doklady akademii nauk SSSR, 181(4), 1968.

Z. Luo, and V. Vovk, editors,

In B. Schölkopf,

Empirical Inference: Festschrift in Honor of Vladimir N.

Vapnik, pages 35. Springer, 2013.

B. Chornomaz. Counting extremal lattices. working paper or preprint, 2015. URL

//hal.archives-ouvertes.fr/hal-01175633. B. A. Davey and H. A. Priestley.

https:

Introduction to Lattices and Order. Cambridge University

Press, 2002. O. Davidov and S. Peddada. Testing for the multivariate stochastic order among ordered experimental groups with application to doseresponse studies.

Biometrics,

69(4):982

990, 2013. M. de la Cruz Rot. Metodos para analizar datos puntuales. In F. T. Maestre, A. Escudero,

Introduccion al Analisis Espacial de Datos en Ecologia y Ciencias Ambientales: Metodos y Aplicaciones., chapter 3, pages 76127. Asociacion Espanola de and A. Bonet, editors,

Ecologia Terrestre, Universidad Rey Juan Carlos and Caja de Ahorros del Mediterraneo, 2008. URL

https://cran.r-project.org/web/packages/ecespa/index.html.

J.-P. Doignon and J.-C. Falmagne.

Knowledge Spaces.

B. Dushnik and E. W. Miller. Partially ordered sets.

Springer, 2012.

American Journal of Mathematics,

63(3):600610, 1941. B. Ganter and R. Wille.

Formal concept analysis: mathematical foundations.

Springer,

2012. GESIS - Leibniz - Institut fur Sozialwissenschaften.

Allbus compact (2015):

Allge-

https://www.gesis. org/allbus/inhalte-suche/studienprofile-1980-bis-2016/2014/. GESIS Datemeine Bevölkerungsumfrage der Sozialwissenschaften, 2015. URL narchiv, Köln. ZA5241 Datenle Version 1.1.0. Z. Gu, E. Rothberg, and R. Bixby.

http://www.gurobi.com.

Gurobi optimizer reference manual.

2012.

URL

accessed 19.08.2017.

E. H. Haertel. Using restricted latent class models to map the skill structure of achievement items.

Journal of Educational Measurement, 26(4):301321, 1989.

G. Hansel and J. Troallic. Mesures marginales et théorème de Ford-Fulkerson.

Theory and Related Fields, 43(3):245251, 1978.

Probability

J. Heller, L. Stefanutti, P. Anselmi, and E. Robusto. On the link between cognitive diagnostic models and knowledge space theory.

Psychometrika, 80(4):9951019, Dec 2015.

P. W. Holland, D. T. Thayer, H. Wainer, and H. Braun. Dierential item performance and the Mantel-Haenszel procedure.

Test validity, pages 129145, 1988. 65

J. E. Hopcroft and R. M. Karp. A n5/2 algorithm for maximum matchings in bipartite. In

Switching and Automata Theory, 1971., 12th Annual Symposium on, pages 122125.

IEEE, 1971. B. W. Junker and K. Sijtsma. Cognitive assessment models with few assumptions, and connections with nonparametric item response theory.

Applied Psychological Measurement,

25(3):258272, 2001. T. Kamae, U. Krengel, and G. L. O'Brien. spaces.

Stochastic inequalities on partially ordered

The Annals of Probability, 5(6):899912, 1977.

T. Kuosmanen. Ecient diversication according to stochastic dominance criteria.

gement Science, 50(10):13901406, 2004.

Mana-

L. Lakhal and G. Stumme. Ecient mining of association rules based on formal concept analysis.

Formal concept analysis, 3626:180195, 2005.

Dinámica espacio-temporal de un bosque secundario en el Parque Natural de Urkiola (Bizkaia). PhD thesis, Universidad del País Vasco/Euskal Herriko Unibert-

N. Laskurain.

sitatea, 2008. Y.-S. Lee, Y. S. Park, and D. Taylan. A cognitive diagnostic modeling of attribute mastery in massachusetts, minnesota, and the us national sample using the timss 2007.

International Journal of Testing, 11(2):144177, 2011. E. Lehmann. Ordered families of distributions.

Ann Math Stat, 26:399419, 1955.

M. Leshno and H. Levy. Stochastic dominance and medical decision making.

Management Science, 7(3):207215, 2004.

D. Levhari, J. Paroush, and B. Peleg.

Health Care

Eciency analysis for multivariate distributions.

The Review of Economic Studies, 42(1):8791, 1975.

H. Levy.

Stochastic dominance: Investment decision making under uncertainty.

Springer,

2015. H. Levy and M. Levy. Experimental test of the prospect theory value function: A stochastic dominance approach.

Organizational Behavior and Human Decision Processes,

89(2):

1058  1081, 2002. T. Makhalova and S. O. Kuznetsov.

On overtting of classiers making a lattice.

In

Formal Concept Analysis: 14th International Conference, ICFCA 2017, pages 184197. Springer, 2017. K. Bertet, D. Borchmann, P. Cellier, and S. Ferré, editors,

K. Mosler and M. Scarsini. M. Scarsini, editors,

Some Theory of Stochastic Dominance.

In K. Mosler and

Stochastic orders and decision under risk, pages 203212. Institute

of Mathematical Statistics, 1991.

66

A. Müller and D. Stoyan.

Comparison Methods for Stochastic Models and Risks.

Wiley,

2002. S. J. Osterlind and H. T. Everson. G. Piatetsky-Shapiro.

Dierential Item Functioning.

Sage Publications, 2009.

Discovery, analysis and presentation of strong rules.

Discovery in Databases, pages 229248, 1991.

J. W. Pratt and J. D. Gibbons.

Concepts of Nonparametric Theory.

Knowledge

Springer, 2012.

A. Prèkopa. On the probability distribution of the optimum of a random linear program.

SIAM Journal on Control, 4(1):211222, 1966.

C. Preston. A generalization of the FKG inequalities.

Physics, 36(3):233241, 1974.

T. M. Range and L. P. Østerdal.

Communications in Mathematical

Checking bivariate rst order dominance.

Technical

report, Discussion Papers on Business and Economics, 2013. A. Robitzsch, T. Kiefer, A. C. George, and A. Uenlue. 2016. URL

CDM: Cognitive Diagnosis Modeling,

http://CRAN.R-project.org/package=CDM.

A. Rusch and R. Wille.

R package version 4.8-0.

Knowledge spaces and formal concept analysis.

In H.-H. Bock

Data Analysis and Information Systems: Statistical and Conceptual Approaches Proceedings of the 19th Annual Conference of the Gesellschaft für Klassikation e.V. University of Basel, pages 427436. Springer, 1996.

and W. Polasek, editors,

N. Sauer. On the density of families of sets.

Journal of Combinatorial Theory, Series A,

13(1):145147, 1972. H. Scheiblechner. A unied nonparametric IRT model for d-dimensional psychological test data (d-ISOP).

Psychometrika, 72(1):43, 2007.

J. K. Sengupta, G. Tintner, and B. Morrison. Stochastic linear programming with applications to economic models.

Economica, 30(119):262276, 1963.

S. Shelah. A combinatorial problem; stability and order for models and theories in innitary languages.

Pacic Journal of Mathematics, 41(1):247261, 1972.

SPIEGEL Online.

Studentenpisa - Alle fragen, alle Antworten, 2009.

www.spiegel.de/unispiegel/studium/0,1518,620101,00.html.

URL

http://

In German. acces-

sed 18.08.2017. V. Strassen. The existence of probability measures with given marginals.

Mathematical Statistics, 36(2):423439, 1965.

The Annals of

C. Strobl, J. Kopf, and A. Zeileis. Rasch trees: A new method for detecting dierential item functioning in the rasch model.

Psychometrika, 80(2):289316, 2015. 67

S. E. Syrjala.

A statistical test for a dierence between the spatial distributions of two

populations.

Ecology, 77(1):7580, 1996.

F. Tarp and L. P. Østerdal. Multivariate discrete rst order stochastic dominance. Discussion Papers 07-23, University of Copenhagen. Department of Economics, 2007. URL

http://EconPapers.repec.org/RePEc:kud:kuiedp:0723. K. K. Tatsuoka.

Analysis of errors in fraction addition and subtraction problems.

National

Institute of Education, Washington, D.C., 1984.

Allgemeinbildung in Deutschland: Erkenntnisse aus dem SPIEGEL-Studentenpisa-Test. VS Verlag für Sozialwissenschaften, 2010.

S. Trepte and M. Verbeet.

W. T. Trotter.

Combinatorics and Partially Ordered Sets: Dimension Theory,

volume 6.

JHU Press, 2001. G. Tutz and M. Berger. Item-focussed trees for the identication of items in dierential item functioning.

Psychometrika, 81(3):727750, 2016.

G. Tutz and G. Schauberger. A penalty approach to dierential item functioning in Rasch models.

Psychometrika, 80(1):2143, 2015.

UNESCO Institute for Statistics (UIS).

ISCED 2011.

International Standard Classication of Education:

UIS, Montreal, Quebec, 2012.

V. N. Vapnik and S. Kotz.

Estimation of Dependences Based on Empirical Data. Springer,

1982. N. Vayatis and R. Azencott. Distribution-dependent vapnik-chervonenkis bounds. In P. Fischer and H. U. Simon, editors,

European Conference on Computational Learning Theory,

pages 230240. Springer, 1999. R. Wille. Restructuring lattice theory: an approach based on hierarchies of concepts. In

Ordered Sets, pages 445470. Springer, 1982.

R. Zwick. When do item response function and Mantel-Haenszel denitions of dierential item functioning coincide?

Journal of Educational Statistics, 15(3):185197, 1990.

68