Inductive Typology: Methodological Appendix

Variable selection

Our first step in extracting types or clusters of network style was to determine which variables mattered for sorting individual networks. We canvassed many descriptors in the NCCS, not only those explicitly designed to measure nominated ties, but others that would also help characterize different types of social networks. We included key aspects of personal networks that scholars have found to be important, including: degree of support, geographical distance to alters, homophily, opportunities for meeting alters ("contexts" in Fischer et al., 1977; "foci" in Feld, 1981), social styles, and tie strength or closeness. We considered many candidates, but after removing redundant measures and combining some that formed logical packages, we arrived at the 43 measures listed in Table A1.

The category "degree of support" includes measures of how many kin and nonkin respondents listed for various kinds of exchanges, such as confiding or borrowing money. "Distance" captures how many immediate family members live nearby and the percentages of kin and nonkin who live within five minutes or more than one hour away. "Homophily" variables include the percentage of nonkin who were similar on activities, age, ethnicity, gender, religion, and work. "Opportunities for meeting alters" include employment status (e.g., school, working, retired, home-maker), frequency of seeing those with whom respondents shared leisure activities, frequency of attending church, social events with coworkers, involvement in outside organizations, and having young children (whom we see as potential bridges to other adults and as restraints on social activity). "Social styles" describes how respondents interacted with alters: whether they socialized at home or at restaurants and events outside the home, had a favorite "hangout," and whether they frequently talked about personal matters.
We also included a measure of how many others were present during the interview (the vast majority of interviews were conducted at respondents' homes) as an indication of how many people were generally around the respondent.

"Tie strength" can be measured in many ways. We included the percentage of kin and nonkin whom the respondents explicitly labeled as "close," both exchange and role multiplexity, the average number of years the respondent had known nonkin adjusted for age (diff_timeknown_m), the average frequency with which the respondent saw kin and nonkin, and density, defined as the extent to which alters (in a detailed subsample of all alters) knew one another, as perceived by the respondents. (More details are available in the methodological appendices to Fischer, 1982.)

These 43 variables served as inputs for creating the typology. In many clustering procedures it is typical to first standardize variables. Random Forests (RFs) do not require standardization because they can find cut points in variables regardless of their distribution. Moreover, different ranges for different variables do not affect the results. Standardizing variables would therefore only reduce the information available to RFs.

In Table A1, the letters appended to the variable names indicate the variable type:
m: mean (computed)
c: categorical
i: indicator
o: ordinal
n: count (computed)

p: percent (computed)

Explanation of selected variables:

1. hasPartner_i combines people who had a spouse, fiancé, or other steady long-term partner.

2-11. kin or nonkin preceding emerloan, hobbies, judgment, personal, and social, followed by the letter n, indicates the number of kin or nonkin whom the respondent named as being available for these types of support or companionship.

emerloan text: "If you needed to get a large sum of money together, what would you do--would you ask someone you know to lend it to you; go to a bank, savings and loan, or credit union; or do something else? [If ask someone, follow up:] Who would that be?" "What about in an emergency situation--is there anyone (else) you could probably ask to lend you some or all of the money? [If yes:] Who would that be?"

hobbies text: "Sometimes people get together with others to talk about hobbies or spare-time interests they have in common. Do you ever do this? [If yes:] Who do you usually do this with?"

judgment text: "Often people rely on the judgment of someone they know in making important decisions about their lives--for example, decisions about their family or their work. Is there anyone whose opinion you consider seriously in making important decisions? [If yes:] Whose opinion do you consider? PROBE: Is there anyone else?"

personal text: "When you are concerned about a personal matter--for example, about someone you are close to or something you are worried about--how often do you talk about it with someone--usually, sometimes, or hardly ever? [Unless says "never":] When you do talk with someone about personal matters, who do you talk with? PROBE: Anyone else?"[1]

social text: "Which, if any, of these have you done in the last three months?
- Had someone to your home for lunch or dinner?
- Went to someone's home for lunch or dinner?
- Someone came by your home to visit?
- Went over to someone's home for a visit?
- Went out with someone (e.g., a restaurant, bar, movie, park)?
- Met someone you know outside your home (e.g., a restaurant, bar, park, club)?
- [R volunteers other activity]
- None
[If yes on any:] May I have the first names of the people you do these things with?"

14. immed_fam_nearby_o is an ordinal variable that captures the number of individuals in the respondent's immediate family who live within an hour. The question is: "Do you have any immediate family--such as parents, children, brothers or sisters, or in-laws--living in this area, that is, within about an hour's drive of here? [If yes:] Counting adults only, about how many of your (and your SPOUSE'S) relatives live in this area--just one or two, 3 to 6, or more than that? I mean individuals, not couples."

15-18. kin or nonkin preceding lives_near or lives_far, followed by p, indicates the percentage of kin or nonkin living either within five minutes or more than one hour away.

19-24. hom preceding (leisure) activities, age, ethnicity, gender, religion, and work indicates the percentage of nonkin whom the respondent identified as being similar on these characteristics. Age homophily was simply measured as the mean age difference between each respondent and a subsample of his or her nonkin alters.

25. Employment status (employ_status_c) is a 9-category variable that includes full-time, part-time, home-maker, retired, student, looking for work, and unable to work.

29. orgs_involved_n is the total number of civic, religious, school, professional, and other organizations the respondent said he or she participated in.

30. young_kids_n is the number of children age 11 or younger living at the respondent's home.

32. social_at_home_n is the sum of four indicators: had someone home for a meal, went to a home for a meal, someone visited home, went to a home to visit (0-4).

33. social_hangout_n is the count of alters whom the respondent reported seeing at a hangout spot, such as a bar or café. social hangout text: "Some people have a particular place they know they can go to and find their friends when they want to--it might be a park, club, coffee shop, a restaurant, or some other kind of place. Do you have any place where you and your friends tend to see each other? [If yes:] Which of the people on this list do you usually see there?"

34. social_outside_n is the sum of two indicators: went out with someone, and met someone outside the home in the last three months (0-2).

35. others_present_n is the interviewer's count of the others present during the interview.

36. Average years known (avg_years_known_m) was computed by first fitting a spline to capture the relationship between age and the average number of years respondents had known a subsample of nonkin alters. We then subtracted the expected number of years of having known nonkin (given age) from each respondent's average number of years of knowing nonkin. Thus, high scorers tended to report unusually enduring nonkin ties given their own ages.

37. Exchange multiplexity (exchange_multiplexity_m) is the average number of types of exchanges per alter--operationally, the number of different name-eliciting questions alters appeared in divided by the total number of alters. There were 8 types of exchanges (look after house, talk about work, help around house, been sociable with, talk about hobbies, talk about personal problems, rely on judgment, emergency loan).

[1] Historians of network analysis may note that this item was the precursor to the GSS "important matters" question.
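The count and multiplexity measures described above can be illustrated with a small sketch. This is hypothetical data and our own naming, not NCCS data (the original analysis was done in R); it only restates the arithmetic of items 2-11 and 37:

```python
# Each alter is tagged with the name-eliciting questions he or she appeared in
# (toy example; real respondents could name many more alters and questions).
alters = {
    "Pat":   {"social", "hobbies"},
    "Chris": {"personal", "judgment", "emerloan"},
    "Lee":   {"social"},
}

# A *_social_n-style count: number of alters named on the "social" question.
social_n = sum("social" in qs for qs in alters.values())

# Exchange multiplexity (item 37): total question appearances / number of alters.
exchange_multiplexity = sum(len(qs) for qs in alters.values()) / len(alters)

print(social_n, exchange_multiplexity)  # prints: 2 2.0
```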


40. pct_kin_close_p and 41. pct_nonkin_close_p are the percentages of kin and of nonkin whom the respondent said he or she "feels close to."

42. role_multiplexity_m is the average number of role relations per alter in the respondent's network. There were 7 possible roles (relative, friend, acquaintance, co-worker, co-organization member, neighbor, other relation).

43. subsamp_dens_p is the subsample density, which is the number of actual ties among a subsample of the respondent's alters divided by the number of potential ties.[2]

Table A1: Initial Descriptors and Summary Statistics

 #   Descriptor                  N      Mean   St.Dev.  Min    Pctl(25)  Median  Pctl(75)  Max
DEGREE OF SUPPORT
 1   hasPartner_i                1,050  0.7    0.5      0      0         1       1         1
 2   kin_emerloan_n              1,050  1      1.1      0      0         1       2         4
 3   kin_hobbies_n               1,050  0.3    0.8      0      0         0       0         8
 4   kin_judgment_n              1,050  0.9    1        0      0         1       1         7
 5   kin_personal_n              1,050  1.2    1.2      0      0         1       2         8
 6   kin_social_n                1,050  1.6    2.1      0      0         1       2         10
 7   nonkin_emerloan_n           1,050  0.5    0.9      0      0         0       1         4
 8   nonkin_hobbies_n            1,050  2      2.3      0      0         1       4         8
 9   nonkin_judgment_n           1,050  0.6    1.1      0      0         0       1         7
10   nonkin_personal_n           1,050  1.3    1.5      0      0         1       2         8
11   nonkin_social_n             1,050  4.8    2.9      0      2         5       7         10
12   total_kin_n (ln)            1,050  1.8    0.8      0      1.4       1.9     2.3       3.8
13   total_nonkin_n (ln)         1,050  2.1    0.8      0      1.8       2.3     2.6       3.8
DISTANCE TO KIN AND NONKIN
14   immed_fam_nearby_o          1,047  1.4    1.1      0      0         1       2         3
15   kin_lives_far_p             1,050  0.4    0.4      0      0         0.4     0.7       1
16   kin_lives_near_p            1,050  0.2    0.2      0      0         0       0.2       1
17   nonkin_lives_far_p          1,050  0.2    0.2      0      0         0.1     0.3       1
18   nonkin_lives_near_p         1,050  0.4    0.3      0      0.2       0.4     0.6       1
HOMOPHILY
19   hom_activities_p            1,050  0.2    0.2      0      0         0       0.3       1
20   hom_age_m                   978    6.9    6.9      0      2         4.7     9.6       51
21   hom_ethnicity_p             1,050  0.1    0.2      0      0         0       0         1
22   hom_gender_p                1,028  0.7    0.2      0      0.5       0.7     0.8       1
23   hom_religion_p              1,050  0.2    0.3      0      0         0       0.3       1
24   hom_work_p                  1,050  0.2    0.2      0      0         0.1     0.3       1
OPPORTUNITIES FOR MEETING ALTERS
25   employ_status_c             1,050  3      2.6      1      1         1       5         9
26   freq_activityers_o          1,048  0.6    1        0      0         0       1         3
27   freq_church_o               1,046  1.1    1.7      0      0         0       3         4
28   freq_coworkers_o            1,050  1.5    2.1      0      0         0       3         6
29   orgs_involved_n             1,050  1.8    2.2      0      0         1       3         23
30   young_kids_n                1,050  0.4    0.8      0      0         0       1         5
SOCIAL STYLES
31   freq_personal_o             1,050  1.8    0.9      1      1         2       2         4
32   social_at_home_n            1,050  3.3    1.1      0      3         4       4         4
33   social_hangout_n            1,050  1.3    3        0      0         0       1         22
34   social_outside_n            1,050  1.5    0.7      0      1         2       2         2
35   others_present_o            1,050  1.2    1.2      0      0         1       2         3
TIE STRENGTH
36   avg_years_known_m           979    0      6.5      -14.9  -3.2      -1      2.3       52.6
37   exchange_multiplexity_m     1,034  0.8    0.4      0      0.5       0.7     1         3.7
38   freq_sees_kin_m             842    3.6    1.7      0      2         4       5         6
39   freq_sees_nonkin_m          998    4.4    1.3      0      3.7       4.5     5.3       6
40   pct_kin_close_p             1,050  0.6    0.3      0      0.3       0.6     1         1
41   pct_nonkin_close_p          1,050  0.2    0.2      0      0         0.2     0.3       1
42   role_multiplexity_m         1,050  1.4    0.3      0.9    1.2       1.4     1.6       3
43   subsample_density_p         1,033  0.5    0.3      0      0.2       0.4     0.7       1

[2] See Fischer (1982) for the procedure used to select the subsample of names.
Variables used in initial clustering procedure. Variables in bold were used in the final results.

Variable recoding, cleaning and missing value imputation

We changed some of the variable values in the dataset that were initially coded as NA to 0 for substantive reasons:
- For individuals who had zero nonkin and an NA value on hom_ethnicity_p, hom_religion_p, hom_work_p, and hom_activities_p, we changed the homophily variable to 0.
- 22 of the 230 individuals missing values on freq_sees_kin_m had zero kin, so we recoded the frequency from NA to 0, interpreted as "never sees kin." Similarly, 22 of the 72 individuals missing values on freq_sees_nonkin_m had zero nonkin, so we recoded the frequency to 0. We recoded 22 missing values of pct_kin_close_p and pct_nonkin_close_p to 0 by the same logic.
- Replacing values on measures of proximity to kin and nonkin relied on wider interpretive steps. We claim that having zero kin is similar (for the purposes of grouping individuals) to having zero kin nearby and all of one's kin far away. Thus, for the 22 individuals with zero kin, we assigned values of 0 on pct_kin_near_p and 1 on pct_kin_far_p. We did the same for the 22 individuals with zero nonkin on pct_nonkin_near_p and pct_nonkin_far_p.
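The zero-kin recodes described above amount to a conditional fill of NA values. A minimal sketch of the logic in Python/pandas (toy rows; column names are patterned on the text and are our own labels, and the authors' actual recoding was presumably done in R):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataset: three respondents, one with zero kin.
df = pd.DataFrame({
    "total_kin_n":     [3, 0, 5],
    "freq_sees_kin_m": [4.0, np.nan, np.nan],
    "pct_kin_near_p":  [0.2, np.nan, 0.5],
    "pct_kin_far_p":   [0.5, np.nan, 0.1],
})

no_kin = df["total_kin_n"].eq(0)

# Zero kin -> "never sees kin", no kin nearby, all kin far away.
df.loc[no_kin, "freq_sees_kin_m"] = 0
df.loc[no_kin, "pct_kin_near_p"] = 0
df.loc[no_kin, "pct_kin_far_p"] = 1

# Respondent 2 has kin but a missing frequency, so the NA is left alone;
# such cases are handled by the imputation step described next.
```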

As shown in the N column in the table above, some variables were still missing data after these recodings. The variables with more than 20 NAs, in descending order, were: freq_sees_kin_m (208), adj_age_sim_m (71), freq_sees_nonkin (52), and hom_gender_p (22). We imputed the missing values of these variables by matching observations missing values with observations having non-missing values on as many other variables as possible. We also tried imputation based on simply assigning the median. Imputation turned out to matter little, as none of these variables ended up being used in the composite variables from which we derive our typology.

Random Forests for Clustering

In order to cluster the observations we needed a measure of their dissimilarities. We explain why we used Random Forests (RFs) to estimate these dissimilarities in the main text; here we provide more detail on RFs and the parameters we used.[3] In general terms, Random Forests works as follows in the case of unsupervised learning (i.e., for clustering):

1. Create a simulated dataset based on the variable distributions in the real dataset.
2. Try to classify real versus simulated observations. (The point is to set up a classification exercise in which similar real observations end up in the same terminal nodes, thereby providing information about their similarity.)
3. Each tree in the forest gets a random subset of observations to classify.
4. At each branch in the tree, the algorithm randomly samples k variables from all the variables.
5. It chooses among these the variable (and variable value) that best separates the real observations from the simulated observations.
6. It then splits the data based on this variable value.
7. This branching process repeats until the minimum node size is reached (e.g., 2 observations in a node).
8. The prediction error (i.e., informativeness) of each tree is calculated by using the tree's fitted values to predict the membership of the observations that were not in the random subset (from Step 3).
9. As a simple example of how the similarity between two observations is calculated (the process varies depending on the specific implementation): for each tree in which the two observations appear together in the bottommost nodes, the pair gets a 1; for each tree in which they do not end up together, the pair gets a 0. These values are summed over all the trees in the forest and then divided by the total number of trees. Observations that appear together in the end nodes of 200 out of 1,000 trees would thus have a similarity score of .2.
10. This results in a similarity score from 0 to 1 for every pair of observations (stored as an n by n matrix).
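The appendix generated its proximities with SPM and R's randomForest. Purely as an illustration of steps 1-10, the same idea can be sketched with scikit-learn on toy data (all data and names here are ours, not the authors'; Breiman's original formulation also restricts the proximity count to out-of-bag cases, which this simplified version skips):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(111)

# Toy stand-in for the survey data: rows are respondents, columns descriptors.
X_real = rng.normal(size=(100, 5))
X_real[:50] += 3  # two latent groups, so the proximities have structure

# Step 1: simulate a dataset by sampling each column independently,
# preserving marginals but destroying correlations among variables.
X_fake = np.column_stack([rng.choice(col, size=len(col)) for col in X_real.T])

X = np.vstack([X_real, X_fake])
y = np.array([1] * len(X_real) + [0] * len(X_fake))  # real vs. simulated

# Steps 2-8: grow a forest that classifies real vs. simulated.
rf = RandomForestClassifier(n_estimators=500, random_state=111).fit(X, y)

# Steps 9-10: proximity = share of trees in which two real observations
# land in the same terminal node.
leaves = rf.apply(X_real)                        # (n_real, n_trees) leaf ids
same_leaf = leaves[:, None, :] == leaves[None, :, :]
proximity = same_leaf.mean(axis=2)               # n x n similarity matrix
dissimilarity = 1.0 - proximity                  # used later for clustering
```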

[3] There are many good overviews of Random Forests available. See Cutler, Cutler, and Stevens (2012) for a general overview; Breiman (2001) for a technical introduction; and Liaw and Wiener (2002) for an introduction using the R language. Breiman's website is also a useful resource: https://www.stat.berkeley.edu/~breiman/RandomForests/

Because clustering algorithms usually make use of dissimilarity matrices, we converted the similarity matrix into a dissimilarity matrix by subtracting it from 1. We used the lower half of the resulting dissimilarity matrix for clustering (function as.dist() in R). In the next section, we describe the application of RF to the NCCS data.

Step 1: Unsupervised Random Forest and Clustering using 43 Variables

Random Forests implementation: At the time of writing, the only widely-used implementation of RFs in R capable of unsupervised learning is the randomForest package. However, when we compared the clustering results from the resulting proximity matrices to those from Salford Predictive Modeler 7 (SPM), we found that the SPM results exhibited less variation due to random seeds and were more substantively informative. As a result, we used SPM to perform unsupervised learning (creating the similarity matrices) and randomForest in R to perform classification (because we found it easier to automate comparisons of classification performance in R).

To create (dis)similarity matrices in SPM, we chose parameters that maximized the stability of the resulting clusters across different runs. In other words, our assumption was that the less susceptible the results were to random variation, the more information the RF algorithm had succeeded in extracting from the data. The stability of the clustering results was measured using normalized mutual information (NMI) and the corrected Rand index.

When performing unsupervised learning on the 43 initial variables, we followed Breiman's (2003) suggestion to try the following numbers of predictors: the square root of the number of variables, half the square root, and double the square root. We did this and converged on 12 as the best number of variables to sample at each split for learning on the 43 initial variables.
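The two computations just described can be sketched briefly, with Python standing in for R's as.dist(): converting a (hypothetical) proximity matrix to a condensed dissimilarity vector, and generating Breiman-style candidate values for the number of variables sampled per split:

```python
import numpy as np
from scipy.spatial.distance import squareform

# Hypothetical 4-observation proximity (similarity) matrix from a forest.
similarity = np.array([
    [1.0, 0.8, 0.2, 0.1],
    [0.8, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.7],
    [0.1, 0.2, 0.7, 1.0],
])

# Dissimilarity = 1 - similarity; keep only one triangle in condensed form,
# which is the role as.dist() plays in R.
dissimilarity = 1.0 - similarity
condensed = squareform(dissimilarity, checks=False)

# Breiman's (2003) rule of thumb for variables sampled per split:
# the (floored) square root of p, half of it, and double it, for p = 43.
p = 43
base = int(np.sqrt(p))
candidates = sorted({max(1, base // 2), base, base * 2})
print(candidates)  # [3, 6, 12]
```

Doubling the floored square root of 43 gives 12, the value the appendix converged on.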
Although we experimented with different sizes of terminal nodes when using random forests for unsupervised learning on all 43 variables, we did not see large differences in the stability of the results (as measured by the similarity of different clustering partitions with the R package clue). We ended up using the default value, which is 2. While intuition is not always a reliable guide in these exercises, we reasoned that smaller terminal nodes would emphasize extreme similarities among observations, and our goal in the first clustering step was to obtain smaller, very well-defined clusters.

We also output Gini variable importance scores from SPM, which measure each variable's contribution to the ability of trees in the forest to correctly classify observations as real or simulated. The scores are relative, so we simply present the ranking of the top ten variables in Table A2 below.

Table A2: Ten Most Important Variables in Creating the Similarity Matrix in SPM 7
1. TOTAL_NONKIN_N
2. KIN_LIVES_FAR_P
3. FREQ_SEES_KIN_M
4. NONKIN_SOCIAL_N
5. EMPLOY_STATUS_C
6. HOM_WORK_P
7. NONKIN_LIVES_FAR_P
8. KIN_LIVES_NEAR_P
9. FREQ_ACTIVITYERS_O
10. IMMED_FAM_NEARBY_O

Step 2: Initial Clustering and Choosing the Number of Clusters

To perform clustering, we used the agnes function in the R package cluster. Agnes is an implementation of agglomerative clustering. We found the results of agnes to be substantively better than those of diana (divisive clustering) and pam ("partitioning around medoids," a modification of k-means clustering). Within agnes, we used method="gaverage" (generalized average) to perform the clustering. We also tried "ward" and "average," but the resulting clusters were less distinctive in a substantive sense.

To choose the number of clusters, we began by creating partitions of 10 to 50 clusters. We then used randomForest within R to classify observations into their respective clusters, looking for the number of clusters that maximized the ability of randomForest to predict which observations belonged in which clusters. The underlying idea was that a number of clusters producing more homogeneous groupings of observations would lead to better prediction performance by randomForest.

We chose parameters for randomForest that maximized the accuracy with which the package predicted the cluster membership of observations. The key parameter in this case is the number of variables to randomly sample at each tree split. We used the tuneRF() function in the randomForest package to select the optimal number of variables to try at each split (it chose among 3, 4, 6, 9, 12, 18, and 27 variables, though the optimal number never went above 9). We allowed the trees to grow to their maximum height for classification (no floor on terminal node size).
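The cluster-number search in Step 2 can be mimicked in a few lines. This sketch substitutes SciPy and scikit-learn for agnes and randomForest (toy data and our own names, not the NCCS pipeline), scoring each candidate partition by the out-of-bag accuracy of a forest asked to predict cluster membership:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(111)

# Toy data with three latent groups standing in for the network descriptors.
X = np.vstack([rng.normal(c, 0.5, size=(60, 6)) for c in (0, 3, 6)])

# Agglomerative clustering (average linkage as a stand-in for agnes "gaverage").
tree = linkage(X, method="average")

# For each candidate partition, ask a random forest to predict cluster
# membership; more homogeneous partitions should be easier to predict.
accuracies = {}
for k in range(2, 9):
    labels = fcluster(tree, t=k, criterion="maxclust")
    rf = RandomForestClassifier(n_estimators=300, oob_score=True,
                                random_state=111).fit(X, labels)
    accuracies[k] = rf.oob_score_
```

Partitions near the true number of groups (3 here) should be among the best predicted; as the text notes, the authors additionally weighed cluster size and interpretability rather than taking the raw maximum.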


Figure A1: Random Forests Prediction Accuracy by Number of Clusters

In choosing the optimal number of clusters we tried to balance a few criteria. First, we wanted to retain as many observations as possible in these clusters, as this subset would be used for combining variables into composite variables. Second, we know that a few large clusters tend to be less distinctive than many smaller clusters; to obtain richer results, we wanted smaller clusters with more complex differences that would require more variables to characterize. Finally, we wanted the machine-predictable clusters to be sociologically interpretable, and so we examined the results at a few local maxima in Figure A1. In sum, we tried to balance keeping in as many observations as possible, finding small and distinctive clusters, and the substantive meaningfulness of the resulting clusters' characteristics.

We settled upon a partition of 35 clusters, 14 of which had error rates below 25%. These preferred 14 clusters comprised 579 observations. The means of the z-scored variables associated with each cluster can be found in Table A3. At the bottom of the table are the number of observations in each cluster and its prediction error. The shaded bars extending to the right and left of each cell help highlight key values by cluster and variable. A few examples of how to read the table follow it.


Table A3: Means of z-scored variables for 14 clusters with prediction error less than 25%

Example descriptions of initial clusters:

We describe the first three clusters in Table A3 here to illustrate how we read the table.

Cluster 1 can be characterized by an active lifestyle. While maintaining above-average contact with nearby family (freq_sees_kin_m, imm_fam_near_o, kin_lives_far_p), members often socialized with people known from hobbies (freq_activityers), and many of the nonkin they knew came from these contexts (nonkin_hobbies_n, hom_activities_p). They also socialized with coworkers (freq_coworkers_o).

Cluster 3's members had above-average contact with both kin and nonkin (freq_sees_kin_m, freq_sees_nonkin_m), but their kin networks were relatively larger, more social, and more supportive (e.g., total_kin_n, kin_social, kin_emerloan), perhaps in part because people in this cluster had younger children (young_kids_n). The nonkin they did know socially seemed to be "just friends," as they did not share the same work, religion, or activities (the hom_ variables).

Cluster 4's members lived far from kin, saw them infrequently, and their kin and nonkin likely did not know each other (subsamp_dens_p). They had a moderate number of nonkin overall (total_nonkin_n) and for socializing (nonkin_social_n), but these ties seem somewhat superficial, as members could not count on these individuals for much support (nonkin_: emerloan, judgment, personal) and seem not to have known them long (diffTimeKnown).


Step 3: Grouping Variables into "Dimensions"

The next step in our process was to draw on these results to combine as many variables as possible into meaningful dimensions of personal networks. These dimensions would be used to place nearly all the observations (not just the 579 that could be classified in the previous step) into network types. As mentioned in the main text, we clustered variables using the R package clustOfVar, which accommodates both categorical and continuous variables (Chavent et al., 2012). In clustOfVar, we used the hclust() function, which clusters variables hierarchically. clustOfVar uses a bootstrap replication approach (the stability() function) to evaluate the stability of information loss at each split in the hierarchy. We chose the optimal number of clusters of variables using both substantive interpretation of the variable groupings and an adjusted Rand criterion (Figure A2). After reviewing the results at a few local maxima, 14 clusters provided the most sensible results. Because half of the clusters of variables proposed by clustOfVar had only one variable or combined correlated but sociologically unrelated variables, we kept only seven of the 14 proposed groups.[4] We modified them a bit for greater conceptual focus (described below). Twenty-one of the 43 original variables combined to form these seven distinctive dimensions, to wit:

[4] The exact output from clustOfVar for 14 clusters of variables is as follows (squared loadings to the right of each variable):

Cluster 1: freq_personal_o 1
Cluster 2: orgs_involved_n 0.23; total_nonkin_n 0.78; nonkin_social_n 0.70; nonkin_hobbies_n 0.37; social_at_home_n 0.47; social_outside_n 0.61
Cluster 3: total_kin_n 0.56; kin_social_n 0.46; kin_hobbies_n 0.25; kin_personal_n 0.49; kin_judgment_n 0.44; kin_emerloan_n 0.29
Cluster 4: nonkin_personal_n 0.56; nonkin_judgment_n 0.63; nonkin_emerloan_n 0.42; exchange_mult_m 0.43
Cluster 5: pct_kin_close_p 0.6; pct_nonkin_close_p 0.6
Cluster 6: subsamp_dens_p 0.29; freq_sees_kin_m 0.75; kin_lives_near_p 0.50; kin_lives_far_p 0.74; immed_fam_nearby_o 0.66
Cluster 7: role_mult_m 0.66; social_hangout_n 0.66
Cluster 8: hom_religion_p 0.68; hom_ethnicity_p 0.31; freq_church_o 0.61
Cluster 9: hom_work_p 0.77; freq_coworkers_o 0.77
Cluster 10: hom_activities_p 0.77; freq_activityers_o 0.77
Cluster 11: hom_gender_p 1
Cluster 12: hom_age_m 0.62; diffTimeKnown 0.62
Cluster 13: freq_sees_nonkin_m 0.63; nonkin_lives_near_p 0.58; nonkin_lives_far_p 0.71
Cluster 14: others_present_o 0.48; young_kids_n 0.53; hasPartner_i 0.44
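The groupings in the footnote come from clustOfVar-style variable clustering. As a rough, hypothetical analogue (not the package's actual homogeneity criterion, which also handles categorical variables), one can hierarchically cluster numeric variables using one minus the squared correlation as the distance:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(42)

# Toy data: six variables in two latent blocks, standing in for the 43 descriptors.
n = 500
f1, f2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([f1 + 0.3 * rng.normal(size=n) for _ in range(3)] +
                    [f2 + 0.3 * rng.normal(size=n) for _ in range(3)])

# Distance between variables: 1 - r^2, a rough stand-in for the squared
# loadings reported in the footnote.
corr = np.corrcoef(X, rowvar=False)
dist = 1.0 - corr ** 2
condensed = dist[np.triu_indices(6, k=1)]

groups = fcluster(linkage(condensed, method="average"), t=2, criterion="maxclust")
# Variables 0-2 and 3-5 should land in separate groups.
```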

Figure A2: Number of Variable Clusters by Mean Adjusted Rand Criterion

Nonkin Interaction: number of nonkin for socializing + ln(total number of nonkin) Nonkin Support: number of nonkin with whom ego discusses personal matters + number of nonkin ego relies upon for judgment + number of nonkin ego could ask for an emergency loan. We dropped exchange multiplexity from the nonkin support dimension because it merely proxies for the fact that the respondent has nonkin whom he or she can rely upon. Kin Involvement/Support: ln(total number of kin) + number of kin with whom ego socializes + number of kin with whom ego discusses hobbies + number of kin with whom ego discusses personal matters + number of kin ego relies upon for judgment + number of kin ego could ask for an emergency loan. Kin Proximity: number of immediate family nearby + percentage of kin living within five minutes + (1 - percentage of kin farther than one hour). We dropped frequency with which ego sees kin from this group of variables because we wanted a measure of whether kin were physically nearby, rather than mixing availability and interaction. We dropped the network density measure from this variable cluster because it is a correlate of being in the same place for a long time (perhaps remaining near kin), but does not specifically measure whether the respondent has kin in the vicinity. Activity-based: percentage of nonkin alters who do same leisure activities as ego + frequency ego sees alters who share activity + number of nonkin with whom ego discusses hobbies. We moved the variable regarding the number of nonkin with whom the respondent discusses hobbies to this category (rather than the nonkin interaction) in order to keep each composite variable 13

Inductive Typology: Methodological Appendix conceptually distinct. Religion-based: percentage of nonkin who share ego’s religion + the frequency ego attends church. We dropped homophily of ethnicity from this group because it is distinct from the concept of interest. Work-based: percentage of nonkin who do same work as ego + the frequency ego sees coworkers socially. We dropped employment status from the “work-based” cluster because we decided that it would partition people who were, in purely personal network terms, quite similar. Additional notes on the composite variables: Unsupervised clustering can often greatly benefit from applying substantive understanding to raw results (Armstrong, 1967; Grimmer and King, 2010). We dropped variable dimensions that only contained one variable. This included the frequency that the respondent discussed personal matters and gender homophily. We also dropped variable dimensions that were substantively difficult to interpret or were not informative by themselves. The percentage of kin considered close and the percentage of nonkin considered close were grouped together by the clustering procedure. This appeared, to us, to reflect more of a response style rather than any meaningful aspect of a respondent’s personal network. Role multiplexity and the number of friends seen at social hangouts were also clustered together. We dropped these two because the relationship seems to be an artifact of other characteristics of respondents and their networks. We also dropped the cluster that combined age homophily and the measure of the time that the respondent had known alters versus the expected time the respondent has known alters given his or her age. This group did not seem to be very informative as a composite variable; it likely reflects the fact that older people who have maintained friendships with peers have higher age homophily than those who lost touch with peers. 
We also dropped the variable cluster capturing whether others were present during the interview, whether the respondent had children, and whether the respondent had a spouse. While spouses or children being present during the interview explains this cluster, we did not think it was sociologically informative. We also dropped the cluster pertaining to distance from nonkin. The reasoning was that nonkin are "replaceable" in the sense that people who are at least moderately sociable will find new nonkin alters when they move to a new area. The small number of people who had many nonkin living far away tended not to interact with them much (it was 1977 and phone calls were expensive). Thus, information about the location of nonkin was to some extent redundant with information about the overall existence of nonkin interaction alters and nonkin support. These choices, we recognize, entail applying our sociological knowledge to the results, and other investigators might make other decisions. Fortunately, the data are in the public domain and available for alternative analysis.

Step 4: Unsupervised RF and Clustering using 7 composite variables

At this stage, our goal was to use the composite variables to place as many observations as possible into a manageable number of interpretable clusters. Using the 7 composite variables, we ran RF again to create dissimilarity measures and cluster observations into between 2 and 20 clusters. Using SPM 7, we found that the optimal number of variables to sample at each split was 5 (though the results varied little from 4 to 7). In setting the minimum node size, we experimented with many values from 2 to 210. We found that a terminal node size of 4 maximized the stability of the resulting clusters (30 was a close contender). We then used randomForest to try to predict the membership of observations within clusters in order to select the best number of clusters for our typology. In this case, we found 6 predictors to be best (though the results differed little using between 2 and 7 predictors). We set the number of trees for both SPM and randomForest to 10,000 because there is little risk of overfitting when using random forests (Cutler et al., 2012). We observed small stability improvements from 2,000 to 10,000 trees. While SPM supports up to 20,000 trees and randomForest supports up to 250,000 trees, we saw little benefit in ratcheting all the way up to these values. We used set.seed(111) in R to ensure that comparisons of different parameter values were not affected by chance, and RF SEED = 12667 in SPM 7. Of course, when comparing stability across runs due to the randomness of random forests, we used a set of fixed seeds.

A note on clustering stability: To see how stable our results were, we created 30 sets of clustering results based on 30 different random seeds (fixed for the purposes of comparison). In the early step of clustering based on all 43 initial variables, we obtained a normalized mutual information (NMI) score of .73 (on a 0-to-1 scale). In the later step using the 7 composite variables, we obtained an NMI of .86. From a purely information-theoretic view, 86% of the information in the clusters is preserved regardless of randomization, which we take to be good robustness. We relied on only 30 sets of results because generating proximity matrices in SPM 7 and then formatting and importing them into R for conversion and clustering had to be done manually.
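To make the NMI agreement score concrete, the following is a minimal sketch; our actual computation used cl_agreement() from the R package CLUE, so this Python version, with two invented labelings, is illustrative only. The sqrt(H(A)·H(B)) normalization is an assumption here (a common convention; CLUE's "NMI" method may normalize differently).

```python
from collections import Counter
from math import log, sqrt

def nmi(labels_a, labels_b):
    """Normalized mutual information between two cluster labelings of the
    same observations; 1.0 means identical partitions (up to relabeling)."""
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    cab = Counter(zip(labels_a, labels_b))
    # Mutual information I(A;B)
    mi = sum((nij / n) * log((nij * n) / (ca[a] * cb[b]))
             for (a, b), nij in cab.items())
    # Entropies H(A) and H(B)
    ha = -sum((c / n) * log(c / n) for c in ca.values())
    hb = -sum((c / n) * log(c / n) for c in cb.values())
    return mi / sqrt(ha * hb) if ha > 0 and hb > 0 else 1.0

# Two runs that find the same partition under different labels score 1.0:
run1 = [0, 0, 1, 1, 2, 2]
run2 = [2, 2, 0, 0, 1, 1]
print(round(nmi(run1, run2), 3))  # → 1.0
```

An NMI of .86 across our 30 seeded runs thus means the partitions share 86% of their information despite the randomness internal to the forests.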
We calculated stability using the R package CLUE [function cl_agreement(), method = "NMI"].

Step 5: Choosing the Number of Clusters for the Typology

We used RFs again to predict the membership of observations in clusters. We sought the number of clusters that placed the greatest number of observations in highly predictable clusters (prediction error less than .2 or .3; Figure A3). While the peak of the .2 error line at 18 clusters is only marginally higher than that at 12 clusters, the 6 additional clusters that were split off are all small and very noisy, suggesting that the remaining observations in the larger clusters are thereby more coherent. In our results, we highlight the 11 of the 18 clusters that had good prediction accuracy. (In the body of the main text, we focus on the 7 that each contained more than 5% of the observations and describe the remaining 4 in footnotes.) Thus, 94% of all 1,050 respondents were successfully classified into these 11 clusters, and 81% were successfully classified into the 7 larger clusters. Put another way, 94% of respondents fell within 11 clusters whose members could be predicted with greater than 80% accuracy.
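The selection rule in this step can be sketched as follows. The cluster sizes and per-cluster error rates below are invented for illustration (the real values came from the randomForest runs); the sketch simply tallies, for each candidate solution, how many observations sit in clusters whose prediction error clears the threshold — the quantity plotted in Figure A3.

```python
# Hypothetical (size, RF prediction error) pairs for two candidate
# numbers of clusters; real values came from the randomForest runs.
solutions = {
    12: [(180, 0.12), (150, 0.15), (140, 0.18), (580, 0.35)],
    18: [(180, 0.10), (150, 0.14), (140, 0.17), (120, 0.19),
         (110, 0.16), (350, 0.40)],
}

def obs_in_predictable_clusters(clusters, max_error):
    """Count observations in clusters whose RF prediction error
    falls below the threshold."""
    return sum(size for size, err in clusters if err < max_error)

for k, clusters in solutions.items():
    print(k, obs_in_predictable_clusters(clusters, max_error=0.2))
```

With these invented numbers, the 18-cluster solution places more observations in predictable clusters than the 12-cluster one, mirroring the comparison made in the text.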


Figure A3: Random Forest Prediction Accuracy by Number of Clusters

[Figure omitted: a line chart plotting, for each number of clusters from 2 to 20 (x-axis), the number of observations (y-axis, roughly 600 to 1,050) in three series: observations in clusters with less than 0.3 prediction error, observations in clusters with less than 0.2 prediction error, and the total correctly classified.]