Supporting Information (SI Appendix) Overview - PNAS

Overview. In this study, we generated the genome sequence of a rare ‘super-white’ white-throated sparrow homozygous for a rearrangement on chromosome two (ZAL2m/ZAL2m) (1). Prior to our work, Tuttle et al. (2) published the genome sequence of a tan individual (homozygous for ZAL2). Because both individuals are homozygous, we could use the two data sets together to confidently identify ZAL2m-specific substitutions. To investigate genetic divergence between the rearranged regions of ZAL2 and ZAL2m, we first classified ZAL2-linked tan scaffolds using multiple lines of evidence (including, for example, mapping tan scaffolds to the zebra finch chromosome that is homologous to ZAL2/2m). This step improved the previously available list of ZAL2-linked scaffolds by adding some previously unidentified ZAL2-linked scaffolds and removing an erroneous assignment (see below, ‘Identification of scaffolds on the second chromosome’). We then mapped our newly generated sequence reads from the super-white bird to the tan scaffolds. To confidently identify substitutions that distinguish the ZAL2 and ZAL2m chromosomes, we used the Genome Analysis Toolkit (GATK) to call variants in the available genome sequence data and in RNA-seq data published by Zinzow-Kramer et al. (3). Additionally, we investigated morph-biased and allele-specific expression patterns using these RNA-seq data from multiple tan and white birds, and identified differentially expressed genes using DESeq2 (4).

Sequencing. We sequenced a super-white bird homozygous for the rearrangement (ZAL2m/ZAL2m) (1). High molecular weight genomic DNA was extracted from liver and sequenced on a HiSeq2500 at the Roy Carver Genome Center of the University of Illinois. Approximately 240 million 150-bp reads were generated, which are available in the SRA database (SRA accession number: SRR4191732).

Identification of scaffolds on the second chromosome.
Genomic scaffolds from a tan bird were recently published by Tuttle et al. (2). These data are available in NCBI (GCF_000385455.1). To confidently identify scaffolds that originate from the second chromosome of the white-throated sparrow, we mapped those scaffolds onto the zebra finch genome using LASTZ 1.03.73 (5) (parameters: --step=20 --chain --gfextend --gapped --traceback=2000M --ydrop=300400 --identity=85 --matchcount=1000), with additional cutoffs of >30% coverage for scaffolds longer than 10 kb and >80% coverage for scaffolds shorter than 10 kb. These cutoffs were selected, on the basis of the distribution of coverage % across scaffolds (Fig. S8), to exclude scaffolds that were assigned to ZAL2 only because of partial mapping to repetitive sequences (6).
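The length-dependent coverage cutoff described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function names, the representation of alignments as half-open intervals on the scaffold, the merging of overlapping alignments, and the treatment of scaffolds of exactly 10 kb as ‘short’ are all assumptions made for illustration.

```python
from typing import Iterable, Tuple


def merged_length(intervals: Iterable[Tuple[int, int]]) -> int:
    """Total number of bases covered by possibly overlapping half-open intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Close the previous merged block before opening a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def passes_coverage_cutoff(scaffold_length: int,
                           aligned_intervals: Iterable[Tuple[int, int]],
                           length_threshold: int = 10_000,
                           long_min: float = 0.30,
                           short_min: float = 0.80) -> bool:
    """Apply the >30% (long scaffolds) or >80% (short scaffolds) coverage cutoff.

    Assumption: scaffolds of exactly `length_threshold` bp are treated as short.
    """
    coverage = merged_length(aligned_intervals) / scaffold_length
    cutoff = long_min if scaffold_length > length_threshold else short_min
    return coverage > cutoff
```

For example, under these assumptions a 100-kb scaffold with 40 kb aligned would pass, whereas a 5-kb scaffold with 3 kb aligned (60%) would not.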

We identified ZAL2 scaffolds using homology to the corresponding zebra finch chromosome (commonly referred to as TGU3 due to its homology to chicken chromosome 3 (7)), previous fluorescent in situ hybridization (FISH) studies (2, 8, 9), and homology with two other passerine birds with chromosome-level assemblies, collared flycatcher (Ficedula albicollis) and great tit (Parus major) (10, 11). Following these procedures, we identified 56 scaffolds on the ZAL2 chromosome (Table S1). Compared with the results of Tuttle et al. (2), our results included 19 additional ZAL2 scaffolds (corresponding to ~25 Mb) that were previously unrecognized as linked to this chromosome. In addition, we excluded a ~45-Mb scaffold (NW_005081536.1) that had been denoted as residing on a non-rearranged portion of ZAL2 (2). Regions homologous to this scaffold were found on a different chromosome in zebra finch, collared flycatcher and great tit, making it unlikely that it has moved to ZAL2 in Z. albicollis given the well-conserved chromosomal homology in birds (7, 11, 12). FISH studies also previously showed that regions outside of the rearrangement on chromosome two are only ~10 Mb in length (8, 9).

Genetic divergence between ZAL2 and ZAL2m from genome sequences and RNA-seq data. We then called variants (SNPs and indels) that distinguish ZAL2 and ZAL2m sequences by following the Genome Analysis Toolkit (GATK) best practices for variant calling in genome sequencing data (13-15). First, super-white reads from whole-genome sequencing were aligned to the reference genome from a tan bird (2) using BWA 0.7.12 (16). GATK 3.4 was used to call variants (the variant call data are available at Figshare DOI: http://dx.doi.org/10.6084/m9.figshare.5716039), and variants with quality lower than 30 or read depth less than 5 were excluded. We then used a sliding window approach to calculate short indel frequencies and dXY with Jukes-Cantor correction (17).
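As an illustration of the Jukes-Cantor correction applied to windowed divergence, the following Python sketch computes d = -(3/4) ln(1 - 4p/3) per window, where p is the proportion of differing sites. It is a simplified sketch rather than the procedure used here: the function names are hypothetical, windows are assumed to be non-overlapping and 10 kb long (as in Fig. S1), positions are assumed to be 0-based, and every site in a window is assumed to be callable.

```python
import math
from typing import List, Sequence, Tuple


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor distance for an observed proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def windowed_dxy(diff_positions: Sequence[int],
                 scaffold_length: int,
                 window: int = 10_000) -> List[Tuple[int, int, float]]:
    """Return (start, end, corrected divergence) for non-overlapping windows.

    Assumption: all sites in each window are callable, so the denominator
    is simply the window length.
    """
    n_windows = math.ceil(scaffold_length / window)
    counts = [0] * n_windows
    for pos in diff_positions:
        counts[pos // window] += 1
    results = []
    for i, count in enumerate(counts):
        start = i * window
        end = min(start + window, scaffold_length)
        results.append((start, end, jukes_cantor(count / (end - start))))
    return results
```

In practice, sites that fail the quality and depth filters would need to be excluded from the denominator, which this sketch does not attempt.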
We also called variants from available transcriptome data from the hypothalamus and nucleus taeniae of nine tan and 11 white individuals (3) following the GATK best practices for variant calling in RNA-seq data (13-15) (the variant call data are available at Figshare DOI: http://dx.doi.org/10.6084/m9.figshare.5715037). A variant was considered putatively fixed in the sampled individuals if: 1) the variant was biallelic; 2) tan individuals were homozygous for the reference allele (AA); 3) white individuals (ZAL2/ZAL2m) were heterozygous (Aa); and 4) the super-white (ZAL2m/ZAL2m) individual was homozygous for the alternative allele (aa). Coordinates for fixed differences can be accessed through the following Figshare DOI: http://dx.doi.org/10.6084/m9.figshare.5715079.
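The four conditions for a putatively fixed difference can be expressed as a short Python predicate. This is a sketch under stated assumptions, not the code used in this study: the function name and the genotype encoding (tuples of allele indices, 0 for the reference allele and 1 for the alternative allele) are hypothetical, and samples with missing genotypes are assumed to have been removed beforehand.

```python
from typing import Sequence, Tuple

Genotype = Tuple[int, int]  # allele indices: 0 = reference, 1 = alternative


def is_putatively_fixed(alleles: Sequence[str],
                        tan_genotypes: Sequence[Genotype],
                        white_genotypes: Sequence[Genotype],
                        superwhite_genotype: Genotype) -> bool:
    """Check the four criteria for a putatively fixed ZAL2/ZAL2m difference."""
    if len(alleles) != 2:  # 1) biallelic
        return False
    if not tan_genotypes or not white_genotypes:
        return False  # assumption: require at least one sample per morph
    tan_hom_ref = all(tuple(sorted(g)) == (0, 0) for g in tan_genotypes)    # 2) AA
    white_het = all(tuple(sorted(g)) == (0, 1) for g in white_genotypes)    # 3) Aa
    superwhite_hom_alt = tuple(sorted(superwhite_genotype)) == (1, 1)       # 4) aa
    return tan_hom_ref and white_het and superwhite_hom_alt
```

Sorting each genotype makes the check independent of allele order, so (1, 0) and (0, 1) are both treated as heterozygous.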

Scaffolds inside versus outside the rearrangement. Scaffolds residing inside versus outside the rearrangement were identified on the basis of distinctive patterns of genetic divergence. Specifically, the distributions of dXY and FST were bimodal, and values were categorized as ‘high’ or ‘low’ dXY or FST (Fig. S9). Scaffolds with high dXY, high FST and putatively fixed differences were designated as ‘confidently inside’. If only one of the two criteria was satisfied, the scaffolds were designated as ‘likely inside’. Scaffolds were defined as ‘confidently outside’ if they had low dXY, low FST, and no putatively fixed differences, and one of the following two conditions was satisfied: either they were already shown to be outside the rearrangement (2, 8, 9), or they shared homology with the homologous chromosome of two other passerine birds (F. albicollis and P. major) (10, 11). Scaffolds that exhibited low dXY, low FST and an absence of putatively fixed differences, with no extra supporting evidence, were designated as ‘likely outside’. We used only ‘confidently inside’ and ‘confidently outside’ scaffolds for the calculations shown in Fig. 1 C-E, but including ‘likely inside’ and ‘likely outside’ scaffolds did not change the patterns (Fig. S1).

Analyses of protein-coding sequences. For each ZAL2-linked gene, we extracted the longest transcript and constructed the ZAL2m version with the putatively fixed differences. Additionally, we downloaded all available genome annotations for 13 other avian species in the order Passeriformes (the same order as the white-throated sparrow) from NCBI, and the tree for all 14 species was inferred from several avian phylogeny studies (18-21) (Fig. S6). Each gene was aligned by MAFFT v7.245 (22), low-quality alignment parts were trimmed by trimAl v1.4 (23), and the codon alignment was constructed by PAL2NAL v14 (24). We obtained codon alignments for a total of 800 genes.
We calculated dN and dS for the ZAL2 and ZAL2m branches with a free-ratio model using codeml from the PAML 4.8 package (25). The Hon-New package (26), which uses an amino acid classification based on charge and polarity, was used to estimate rates of radical amino acid substitution (dR) and conservative amino acid substitution (dC). To test for positive selection, a branch-site model was run with codeml in PAML and compared with the null model following the simulation approach used by Nielsen et al. (27). Briefly, 10,000 random DNA sequence alignments with the same substitution parameters used in the null model were generated using Evolver (25). The resulting distribution of likelihood-ratio test (LRT) statistics served as the null distribution, from which the P value of the observed LRT statistic was obtained. Since incomplete lineage sorting may result in discordance between the gene trees and the species tree and therefore may prevent the identification of positively selected genes (PSGs), we constructed maximum-likelihood gene trees in MEGA7 (28) for the candidate PSGs from the previous step and reran PAML using the gene trees.
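The simulation-based P value described above can be sketched in a few lines of Python. The exact formula is not stated in this appendix; the sketch assumes the P value is the plain proportion of simulated LRT statistics at least as large as the observed one, which is consistent with the Psim values of 0 listed in Table S3 (a pseudocount-corrected estimator could never return 0). The function name is hypothetical.

```python
from typing import Sequence


def empirical_p_value(observed_lrt: float, simulated_lrts: Sequence[float]) -> float:
    """Proportion of simulated null LRT statistics >= the observed statistic.

    Assumption: no +1 pseudocount is applied, so the result can be exactly 0
    when the observed statistic exceeds every simulated value.
    """
    if not simulated_lrts:
        raise ValueError("at least one simulated LRT statistic is required")
    n_as_extreme = sum(1 for s in simulated_lrts if s >= observed_lrt)
    return n_as_extreme / len(simulated_lrts)
```

With 10,000 simulated alignments, the smallest nonzero P value obtainable under this assumption is 0.0001.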

Positively selected functional categories were identified by first assigning genes to categories according to the PANTHER Classification System version 10 (29). For each category with more than 10 genes, the difference between the cumulative distribution of P values for genes in the category and that for genes not in the category was tested using a one-tailed Mann-Whitney U (MWU) test (30, 31).

Gene expression. We examined genes with morph-biased (TS≠WS) expression and allele-specific (ZAL2≠ZAL2m) expression patterns. To account for potential mapping bias towards the reference (ZAL2/ZAL2) genome caused by mismatches between ZAL2 and ZAL2m, we N-masked (32) the putatively fixed differences in the reference. We additionally checked for residual bias by aligning whole-genome sequences from three white birds described by Tuttle et al. (2) to this N-masked genome. ZAL2 and ZAL2m alleles should have roughly equal coverage per site if mapping bias has been eliminated. Indeed, for all three white birds, we did not observe significant coverage bias towards the ZAL2 allele (see ‘Examining mapping bias in the N-masked reference genome’ and Fig. S7). We mapped the aforementioned RNA-seq data from nine tan and 10 white individuals (3) to the N-masked genome with STAR 2.4.1d under the 2-pass mode (33). Only uniquely mapped reads were retained for further differential expression analysis. SNPsplit 0.3.3 (32) was run to assign reads to ZAL2 or ZAL2m for the white samples and to filter out reads without fixed differences in the tan samples. Read counts per gene at the morph and allele level were calculated by htseq-count 0.9.1 (34) with ‘-s no -m intersection-nonempty’. To detect morph-biased expression, we calculated size factors, normalized libraries with these factors, and then identified differential expression with ‘design = ~ morph’ in DESeq2 1.12.3 (4).
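For readers unfamiliar with size factors, the following Python sketch re-implements the median-of-ratios estimator that DESeq2 uses by default. It is an illustrative re-implementation only (the analysis here was run in DESeq2 itself); the function name and the input layout (one list of per-sample raw counts per gene) are assumptions.

```python
import math
from statistics import median
from typing import List, Sequence


def size_factors(counts: Sequence[Sequence[int]]) -> List[float]:
    """Median-of-ratios size factors.

    `counts` holds one row per gene, each row listing raw counts per sample.
    Genes with a zero count in any sample are skipped, because their
    geometric mean across samples is zero.
    """
    n_samples = len(counts[0])
    log_ratios: List[List[float]] = [[] for _ in range(n_samples)]
    for gene in counts:
        if any(c <= 0 for c in gene):
            continue
        log_geo_mean = sum(math.log(c) for c in gene) / n_samples
        for j, c in enumerate(gene):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # statistics.median raises StatisticsError if no gene is usable.
    return [math.exp(median(r)) for r in log_ratios]


def normalize(counts: Sequence[Sequence[int]], factors: Sequence[float]) -> List[List[float]]:
    """Divide each sample's counts by that sample's size factor."""
    return [[c / f for c, f in zip(gene, factors)] for gene in counts]
```

Reusing the morph-level factors for the allele-level counts, as described in the next paragraph of the text, would correspond to passing the same `factors` to `normalize` for both count tables.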
To detect allele-specific expression, we normalized libraries with the size factors generated in the previous step, and identified differential expression with ‘design = ~ allele + sample’ in DESeq2. Only genes with average expression levels (‘baseMean’ in the DESeq2 output) higher than 5 at the morph level were retained for later analysis (809 genes for the hypothalamus and 806 for nucleus taeniae). All differential expression data can be accessed through Figshare DOI: http://dx.doi.org/10.6084/m9.figshare.5715064.

De novo assembly of the super-white genome, whole-genome alignment, and detection of gene deletion. Paired-end sequences from the super-white bird were first trimmed by PRINSEQ 0.20.4 (35) and assembled by ABySS 1.5.2 (36). The final assembly has a contig N50 of 26,601 bp and a scaffold N50 of 32,876 bp. The total assembly size is 1.01 Gb, which is close to the estimated genome size of the white-throated sparrow (2). This Whole Genome Shotgun project has been deposited at DDBJ/ENA/GenBank under the accession PKOH00000000. The version described in this paper is version PKOH01000000. We aligned the newly generated super-white assembly to the tan reference genome using LASTZ with the aforementioned parameters. We found no evidence of exon deletions or of large (>50 bp) indels.

Annotation of repetitive sequences. We used both de novo and homology-based approaches to annotate repetitive sequences in the tan and super-white assemblies. First, both genomes were annotated with RepeatModeler 1.0.8 (37). The generated de novo library was merged with the avian RepeatMasker library (20150807 version) (38). Then, RepeatMasker 4.0.6 was used to annotate both genomes on the basis of homology with repeats in the merged library (parameters: -xsmall -s -nolow -norna -nocut). According to these analyses, the identified repeat content was highly similar on ZAL2 and ZAL2m, 7.11% and 7.38% respectively (Table S5). However, since the super-white genome assembly is more fragmented than the reference, the repeat content might be underestimated on ZAL2m.

Examining mapping bias in the N-masked reference genome. We examined potential residual mapping bias by aligning whole-genome sequences of three white birds (Sample IDs: 10_083, 10_092 and 10_093) published by Tuttle et al. (2) to the N-masked genome using HISAT2 2.1.0 (39) (parameters: --no-spliced-alignment --sp 1000,1000). Reads were assigned to the ZAL2 or ZAL2m chromosome by SNPsplit 0.3.3 (32) on the basis of putatively fixed differences, and bedtools genomecov (40) was used to count per-base coverage on ZAL2 and ZAL2m, respectively. Across putatively fixed differences, we observed roughly equal per-base coverage between ZAL2 and ZAL2m (Fig. S7), suggesting that the SNP N-masking approach largely eliminated mapping bias.
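The idea behind N-masking can be shown with a minimal Python sketch: every reference base at a putatively fixed difference is replaced by N, so that reads carrying either allele mismatch the reference equally. This is a conceptual sketch only and not the tool used here; the function name and the use of 0-based positions are assumptions.

```python
from typing import Iterable


def n_mask(sequence: str, positions: Iterable[int]) -> str:
    """Return `sequence` with the bases at the given 0-based positions set to 'N'."""
    bases = list(sequence)
    for pos in positions:
        if not 0 <= pos < len(bases):
            raise IndexError(f"position {pos} is outside the sequence")
        bases[pos] = "N"
    return "".join(bases)


# Example: mask two hypothetical fixed differences in a short sequence.
masked = n_mask("ACGTACGT", [1, 4])
```

Because neither allele matches an N, a read from ZAL2 and a read from ZAL2m incur the same penalty at the masked site, which is why roughly equal per-base coverage is expected once bias is removed.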

SI Figures

[Fig. S1 image: panels (A) dXY, (B) FST, and (C) indel frequency, each comparing the Inside, Outside, and Genome categories.]

Fig. S1. A similar divergence pattern is found using a less conservative approach. Fig. 1 of the main text shows the patterns of divergence that were revealed using only scaffolds that are ‘confidently inside’ and ‘confidently outside’ the ZAL2m rearrangement. This figure shows that adding all ‘likely inside’ and ‘likely outside’ scaffolds does not change the pattern. (A) Pairwise nucleotide divergences (dXY), (B) degrees of population differentiation (FST), and (C) indel frequencies, all measured in 10-kb non-overlapping windows, were significantly higher in scaffolds within the rearrangement than in those outside the rearrangement (Mann–Whitney U test, ***:P < 0.001; NS: not significant).

Fig. S2. Alternative scenario for dosage compensation in Z. albicollis. Fig. 3 in the main paper shows a potential scenario for dosage compensation when dosage imbalance is caused by down-regulation of ZAL2m alleles. When dosage imbalance is instead caused by up-regulation of ZAL2m alleles, we propose the following alternative scenario. A) Initially, expression dosage (shown as black waves) is similar between ZAL2 and ZAL2m and between tan and white birds. B) Expression of the ZAL2m allele may increase due to mis-regulation. Consequently, heterozygous (white) individuals should show increased expression. C) Dosage may be re-balanced, to be similar in the two morphs, via under-expression of the ZAL2 (non-degenerated) allele in white birds. Consequently, expression of the ZAL2 allele should be greater in tan than white birds. To test this prediction, we examined the ratio of ZAL2 in tan to ZAL2 in white birds. D-E) Levels of compensation (tan-ZAL2/white-ZAL2) were significantly elevated for tan≈white genes compared with tanZAL2m), those with relatively similar expression in tan and white birds (tan≈white) exhibit significantly higher expression than that of the background genes. The definitions of ZAL2>ZAL2m and tan≈white genes are based on A-B) P values and C-D) FDR-corrected Q values from DESeq2 (Table 1). The Y-axis represents ‘baseMean’ from the DESeq2 output, which is essentially the mean of normalized counts of all samples. Mann-Whitney U test, ***:P < 0.001.

Fig. S4. Potentially dosage-compensated ZAL2m>ZAL2 genes are more highly expressed than background genes in nucleus taeniae. In the nucleus taeniae, but not in the hypothalamus, the average expression (irrespective of morph) of genes with higher ZAL2m than ZAL2 expression (ZAL2m>ZAL2) and relatively similar expression in tan and white birds (tan≈white) is significantly higher than that of the background genes (i.e., genes that do not exhibit allele-specific or morph-biased expression patterns). The definitions of ZAL2m>ZAL2 and tan≈white genes are based on A-B) P values and C-D) FDR-corrected Q values from DESeq2 (Table 1). The Y-axis represents ‘baseMean’ from the DESeq2 output which is essentially the mean of normalized counts of all samples. Mann-Whitney U test, *:P < 0.05; **:P < 0.01; NS: not significant.

Fig. S5. Evidence for dosage compensation is robust in the set of genes defined by FDR-corrected Q values. A-B) For ZAL2>ZAL2m genes, the ratio of white-ZAL2 expression to tan-ZAL2 expression is significantly elevated for tan≈white genes compared with tan>white genes. C-D) For ZAL2m>ZAL2 genes, the ratio of tan-ZAL2 expression to white-ZAL2 expression is significantly elevated for tan≈white genes compared with tan 0.05).

Fig. S8. Distribution of percent coverage (Coverage %) for scaffolds in the tan reference genome. Coverage % was calculated as the length of the region that could be mapped to the TGU3 chromosome, divided by the total length of that scaffold. The cutoff for coverage % by our criteria is shown as the red dashed line. Only scaffolds longer than 10 kb are included.

Fig. S9. Bimodal patterns of divergence for pairwise nucleotide divergence and degrees of population differentiation. The graphs show the distribution of average pairwise nucleotide divergence (dXY) between ZAL2 and ZAL2m chromosomes (left panel) and degrees of population differentiation (FST) between white and tan samples over ZAL2 scaffolds (right panel). Cutoffs to distinguish between ‘Low’ and ‘High’ (as well as ‘Intermediate’ in the case of FST) of the two measures were determined by the clear divisions in the distributions. Thus, low dXY is in the range of [0.00080, 0.00588], and high dXY corresponds to [0.00968, 0.01431]. Similarly, low FST is in the range of [0.00448, 0.01420], intermediate FST is [0.11821, 0.22853], and high FST is [0.28586, 0.45355].

SI Tables

Table S1. Designation of scaffolds inside or outside the inversion.

Scaffold        Length (bp)  dXY      FST      Presence of fixed differences  Inversion       Extra evidence for being outside the inversion
NW_005081548.1  11927161     0.01125  0.40361  yes                            Inside
NW_005081553.1  10178608     0.01111  0.21094  yes                            Likely Inside
NW_005081561.1  7587459      0.01111  0.39278  yes                            Inside
NW_005081569.1  6428186      0.01231  0.34407  yes                            Inside
NW_005081574.1  5690812      0.01261  0.40039  yes                            Inside
NW_005081577.1  5965230      0.01153  0.33666  yes                            Inside
NW_005081582.1  4964275      0.01265  0.36901  yes                            Inside
NW_005081589.1  4458466      0.01243  0.40993  yes                            Inside
NW_005081591.1  4515391      0.0118   0.38121  yes                            Inside
NW_005081596.1  4339766      0.01182  0.40298  yes                            Inside
NW_005081602.1  4246152      0.01356  0.35859  yes                            Inside
NW_005081609.1  3616420      0.01415  0.37739  yes                            Inside
NW_005081611.1  3581226      0.01233  0.41416  yes                            Inside
NW_005081615.1  3504361      0.00968  0.42848  yes                            Inside
NW_005081620.1  3189763      0.01130  0.39955  yes                            Inside
NW_005081621.1  3168318      0.01273  0.31734  yes                            Inside
NW_005081632.1  2520776      0.01176  0.38496  yes                            Inside
NW_005081635.1  2503139      0.00222  0.00639  no                             Outside         Supported by (2, 8, 9)
NW_005081642.1  2153052      0.01251  0.33943  yes                            Inside
NW_005081653.1  2022704      0.01198  0.40088  yes                            Inside
NW_005081654.1  1930009      0.01194  0.28586  yes                            Likely Inside
NW_005081662.1  1786786      0.01343  0.42324  yes                            Inside
NW_005081697.1  1189320      0.01268  0.36119  yes                            Inside
NW_005081699.1  1172919      0.01334  0.33628  yes                            Inside
NW_005081708.1  1082959      0.00080  0.01420  no                             Outside         Supported by (2, 9)
NW_005081720.1  1134561      0.00381  0.00616  no                             Outside         Supported by homology of chicken, zebra finch, great tit and flycatcher
NW_005081729.1  913800       0.01168  0.37611  yes                            Inside
NW_005081742.1  825557       0.01062  0.44194  yes                            Inside
NW_005081746.1  758667       0.01175  0.45355  yes                            Inside
NW_005081754.1  810001       0.01220  NA       no                             Likely Inside
NW_005081771.1  578351       0.01431  NA       no                             Likely Inside
NW_005081821.1  411424       0.00473  0.34085  yes                            Likely Inside
NW_005081827.1  355097       0.01366  NA       no                             Likely Inside
NW_005081831.1  438198       0.00327  0.00448  no                             Likely Outside
NW_005081832.1  349547       0.00588  0.30376  yes                            Likely Inside
NW_005081844.1  292358       0.01314  0.22853  yes                            Likely Inside
NW_005081876.1  266586       0.01127  0.22601  yes                            Likely Inside
NW_005081883.1  224620       0.00583  0.11821  yes                            Likely Inside
NW_005081958.1  224289       0.00358  0.03357  no                             Likely Outside
NW_005082055.1  43098        0.00527  NA       no                             Unknown
NW_005082170.1  28300        0.00573  NA       no                             Unknown
NW_005082187.1  36427        0.00344  0.21837  yes                            Likely Inside
NW_005082848.1  7948         NA       NA       no                             Unknown
NW_005082865.1  18815        0.00090  NA       no                             Unknown
NW_005083054.1  6122         NA       NA       no                             Unknown
NW_005083097.1  5868         NA       NA       no                             Unknown
NW_005083174.1  5400         NA       NA       no                             Unknown
NW_005083671.1  3948         NA       NA       no                             Unknown
NW_005083866.1  5132         NA       NA       no                             Unknown
NW_005083989.1  2649         NA       NA       no                             Unknown
NW_005084012.1  2613         NA       0.14742  no                             Unknown
NW_005084703.1  1690         NA       NA       no                             Unknown
NW_005085200.1  1515         NA       NA       no                             Unknown
NW_005085751.1  1335         NA       NA       no                             Unknown
NW_005085851.1  1254         NA       NA       no                             Unknown
NW_005086488.1  1178         NA       NA       no                             Unknown

Table S2. Genes with disrupted ORFs (open reading frames).

Protein accession  Gene symbol   Mutation type  Protein length              Mutation position  Mutation
XP_005483072.1     CENPF         Stop gain      2937                        1934               Q/*
XP_005483141.1     PREPL         Stop lost      729                         729                */E
XP_005484012.1     CASP8AP2      Stop gain      2002                        652                Q/*
XP_005485183.2     KIAA1919      Stop gain      569                         565                W/*
XP_005485996.1     PGM3          Stop lost      543                         543                */S
XP_005486070.1     HMGN3         Start lost     96                          1                  M/I
XP_005486910.1     ACYP2         Stop gain      99 (incomplete annotation)  16                 R/*
XP_005487649.1     GGPS1         Start lost     299                         1                  M/T
XP_005488567.1     SYNDIG1       Start lost     270                         1                  M/T
XP_005489463.1     PROKR1        Start lost     395                         1                  M/I
XP_005489744.1     HSP90AB1      Stop gain      737                         2                  Y/*
XP_005489744.1     HSP90AB1      Start lost     737                         1                  M/I
XP_005493909.1     LOC102067625  Stop gain      1283                        1028               W/*
XP_014119811.1     LOC102071579  Stop gain      511                         250                R/*
XP_014120380.1     EYS           Stop gain      1379                        106                K/*
XP_014121156.1     TRAF3IP2      Stop gain      576                         17                 W/*
XP_014121156.1     TRAF3IP2      Start lost     576                         1                  M/V
XP_014121760.1     LOC102064740  Stop gain      378                         21                 Q/*
XP_014121813.1     CEP162        Stop gain      1419                        469                R/*
XP_014122104.1     LOC106629365  Stop lost      1078                        1078               */W
XP_014122157.1     PARK2         Stop gain      515                         4                  Q/*
XP_014122326.1     LOC106629373  Stop lost      266                         266                */Q
XP_014122352.1     CFAP61        Stop gain      1162                        1089               Q/*
XP_014122355.1     LOC106629377  Stop gain      145                         70                 W/*
XP_014122639.1     LOC106629392  Start lost     308                         1                  M/T
XP_014122653.1     PQLC3         Stop lost      166                         166                */W
XP_014123260.1     MYB           Start lost     795                         1                  M/V
XP_014123293.1     FNDC1         Stop gain      1696                        11                 W/*
XP_014124168.1     LOC102065471  Start lost     548                         1                  M/V
XP_014127383.1     LOC106629801  Stop gain      362                         160                R/*

Table S3. Genes with signatures of positive selection that are detected using a branch-site model.

Foreground branches: ZAL2 and ZAL2m.

Gene     LRT statistic  Psim value (Qsim value)^1  PK2a+PK2b (%)^2  dN/dS  Positively selected sites (BEB probability^3)
DIEXF    19.159         0 (0)                      0.415            >10    311S (0.876), 336G (0.841), 314R (0.834)
LGALSL   3.424          0 (0)                      6.768            2.118  -
SLC35F3  7.994          0.0006 (0.16)              0.335            >10    26N (0.979)
AKAP12   14.699         0 (0)                      0.222            >10    1912P (0.883)
DISC1    11.121         0 (0)                      0.252            >10    469L (0.959)

All genes with FDR-adjusted Q < 0.2 are shown. Positions of positively selected sites are based on the longest transcripts.

^1 P and FDR-adjusted Q values were calculated using a simulation method (SI Appendix) with 10,000 replications.

^2 Proportion of K2a and K2b sites; specifically, K2a sites are those under positive selection (dN/dS ≥ 1) on the foreground branch and under purifying selection (dN/dS < 1) on background branches, and K2b sites are those under positive selection (dN/dS ≥ 1) on the foreground branch and under neutral evolution (dN/dS = 1) on background branches.

^3 Probability of a site being positively selected was estimated by the Bayes Empirical Bayes (BEB) method (41). Only sites with BEB probability > 0.8 are shown.

Table S4. Candidate functional categories (biological process and molecular function) under positive selection. The cumulative distribution of P values of each functional category was compared with that of genomic background using Mann-Whitney U tests. Categories with FDR-adjusted Q < 0.1 are shown. For testing signatures of positive selection, see Materials and Methods and SI Appendix.

Branch  Category type       Functional category                                                # of assigned genes  PMWU           FDR-adjusted QMWU
ZAL2    Biological Process  rRNA processing (GO:0006364)                                       14                   1.48 x 10^-5   0.0924
ZAL2    Biological Process  response to drug (GO:0042493)                                      13                   0.0039         0.0681
ZAL2    Biological Process  cilium assembly (GO:0042384)                                       13                   0.0039         0.0681
ZAL2m   Biological Process  Wnt signaling pathway (GO:0016055)                                 10                   3.18 x 10^-11  2.80 x 10^-9
ZAL2m   Biological Process  mitotic anaphase (GO:0000090)                                      11                   0.0016         0.0468
ZAL2m   Biological Process  positive regulation of neuron projection development (GO:0010976)  10                   0.0009         0.0381
ZAL2m   Molecular Function  ATPase activity (GO:0016887)                                       12                   0.0027         0.0511
ZAL2m   Molecular Function  protein domain specific binding (GO:0019904)                       12                   0.0026         0.0511
ZAL2m   Molecular Function  protein C-terminus binding (GO:0008022)                            11                   0.0017         0.0511
ZAL2m   Molecular Function  receptor binding (GO:0005102)                                      15                   0.0078         0.0879
ZAL2m   Molecular Function  protein complex binding (GO:0032403)                               13                   0.0037         0.0524

Table S5. RepeatMasker (38) annotation of interspersed repeat content on ZAL2 and ZAL2m.

Repeat class                Subclass      Sequence% (ZAL2)  Sequence% (ZAL2m)
SINEs                       Total         0.08              0.08
SINEs                       ALUs          0                 0
SINEs                       MIRs          0.04              0.04
LINEs                       Total         3.77              4.00
LINEs                       LINE1         0                 0
LINEs                       LINE2         0.04              0.05
LINEs                       L3/CR1        3.72              3.93
LTR elements                Total         2.23              2.28
LTR elements                ERVL          1.28              1.49
LTR elements                ERVL-MaLRs    0                 0
LTR elements                ERVL-classI   0.17              0.22
LTR elements                ERVL-classII  0.76              0.56
DNA elements                Total         0.37              0.37
Unclassified                Total         0.66              0.65
Total interspersed repeats  Total         7.11              7.38

SI References

1. Horton BM, et al. (2013) Behavioral characterization of a white-throated sparrow homozygous for the ZAL2m chromosomal rearrangement. Behavior genetics 43(1):60-70.
2. Tuttle EM, et al. (2016) Divergence and functional degradation of a sex chromosome-like supergene. Curr Biol 26(3):344-350.
3. Zinzow-Kramer WM, et al. (2015) Genes located in a chromosomal inversion are correlated with territorial song in white-throated sparrows. Genes Brain Behav 14(8):641-654.
4. Love MI, Huber W, & Anders S (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15(12).
5. Harris RS (2007) Improved pairwise alignment of genomic DNA. Doctoral thesis (The Pennsylvania State University).
6. Zhou Q, et al. (2014) Complex evolutionary trajectories of sex chromosomes across bird taxa. Science 346(6215):1246338.
7. Warren WC, et al. (2010) The genome of a songbird. Nature 464(7289):757-762.
8. Thomas JW, et al. (2008) The chromosomal polymorphism linked to variation in social behavior in the white-throated sparrow (Zonotrichia albicollis) is a complex rearrangement and suppressor of recombination. Genetics 179(3):1455-1468.
9. Davis JK, et al. (2011) Haplotype-based genomic sequencing of a chromosomal polymorphism in the white-throated sparrow (Zonotrichia albicollis). J Hered 102(4):380-390.
10. Ellegren H, et al. (2012) The genomic landscape of species divergence in Ficedula flycatchers. Nature 491(7426):756-760.
11. Laine VN, et al. (2016) Evolutionary signals of selection on cognition from the great tit genome and methylome. Nat Commun 7:10474.
12. Shetty S, Griffin DK, & Graves JA (1999) Comparative painting reveals strong chromosome homology over 80 million years of bird evolution. Chromosome Res 7(4):289-295.
13. McKenna A, et al. (2010) The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20(9):1297-1303.
14. Van der Auwera GA, et al. (2013) From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr Protoc Bioinformatics 43:11.10.1-11.10.33.
15. DePristo MA, et al. (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43(5):491-498.
16. Li H & Durbin R (2010) Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26(5):589-595.
17. Jukes TH & Cantor CR (1969) Evolution of protein molecules. Mammalian Protein Metabolism, ed Munro HN (Academic Press), pp 21-132.
18. Prum RO, et al. (2015) A comprehensive phylogeny of birds (Aves) using targeted next-generation DNA sequencing. Nature 526(7574):569-U247.
19. Jetz W, Thomas GH, Joy JB, Hartmann K, & Mooers AO (2012) The global diversity of birds in space and time. Nature 491(7424):444-448.
20. Jetz W, et al. (2014) Global distribution and conservation of evolutionary distinctness in birds. Curr Biol 24(9):919-930.
21. Jarvis ED, et al. (2014) Whole-genome analyses resolve early branches in the tree of life of modern birds. Science 346(6215):1320-1331.
22. Katoh K & Standley DM (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol 30(4):772-780.
23. Capella-Gutierrez S, Silla-Martinez JM, & Gabaldon T (2009) trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25(15):1972-1973.
24. Suyama M, Torrents D, & Bork P (2006) PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res 34(Web Server issue):W609-612.
25. Yang ZH (2007) PAML 4: phylogenetic analysis by maximum likelihood. Molecular Biology and Evolution 24(8):1586-1591.
26. Zhang J (2000) Rates of conservative and radical nonsynonymous nucleotide substitutions in mammalian nuclear genes. Journal of Molecular Evolution 50(1):56-68.
27. Nielsen R, et al. (2005) A scan for positively selected genes in the genomes of humans and chimpanzees. PLoS Biol 3(6):e170.
28. Kumar S, Stecher G, & Tamura K (2016) MEGA7: Molecular Evolutionary Genetics Analysis Version 7.0 for Bigger Datasets. Mol Biol Evol 33(7):1870-1874.
29. Mi H, Poudel S, Muruganujan A, Casagrande JT, & Thomas PD (2016) PANTHER version 10: expanded protein families and functions, and analysis tools. Nucleic Acids Res 44(D1):D336-342.
30. Clark AG, et al. (2003) Inferring nonneutral evolution from human-chimp-mouse orthologous gene trios. Science 302:1960-1963.
31. Haygood R, Fedrigo O, Hanson B, Yokoyama KD, & Wray GA (2007) Promoter regions of many neural- and nutrition-related genes have experienced positive selection during human evolution. Nat Genet 39(9):1140-1144.
32. Krueger F & Andrews SR (2016) SNPsplit: allele-specific splitting of alignments between genomes with known SNP genotypes. F1000Res 5:1479.
33. Dobin A, et al. (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29(1):15-21.
34. Anders S, Pyl PT, & Huber W (2015) HTSeq — a Python framework to work with high-throughput sequencing data. Bioinformatics 31(2):166-169.
35. Schmieder R & Edwards R (2011) Quality control and preprocessing of metagenomic datasets. Bioinformatics 27(6):863-864.
36. Simpson JT, et al. (2009) ABySS: a parallel assembler for short read sequence data. Genome Res 19(6):1117-1123.
37. Smit AFA & Hubley R (2008-2015) RepeatModeler Open-1.0.
38. Smit AFA, Hubley R, & Green P (2013-2015) RepeatMasker Open-4.0.
39. Kim D, Langmead B, & Salzberg SL (2015) HISAT: a fast spliced aligner with low memory requirements. Nat Methods 12(4):357-360.
40. Quinlan AR & Hall IM (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26(6):841-842.
41. Yang ZH, Wong WSW, & Nielsen R (2005) Bayes empirical Bayes inference of amino acid sites under positive selection. Molecular Biology and Evolution 22(4):1107-1118.