ExtRaINSIGHT measures the fractional reduction in the incidence of rare variants in a target set of sites relative to nearby sites that are putatively free from (direct) natural selection. In this way, it is analogous to classical strategies for measuring selection in protein-coding genes22,23,24, as well as to newer methods that compare target sets of noncoding elements with suitable background sequences21,25,26,27. The focus on rare variants (here, variants with minor allele frequencies of < 0.1%), however, enables the method to focus in particular on point mutations of large selective effect.
The main challenge in this approach stems from the high sensitivity of relative rates of rare variants to variation in mutation rate. To address this problem, we follow refs. 12,15 in building a mutational model that accounts for both sequence context and regional variation in mutation rate. In our case, we condition the rate of each type of nucleotide substitution on the identity of the three flanking nucleotides on each side. In addition, following our earlier work20,21, we use a local control for overall mutation rate based on nearby sites identified as likely to be neutrally evolving. We also consider G+C content, sequencing coverage, and CpG islands as covariates (see Methods). With this strategy, we are able to predict with high accuracy the probability that a rare variant will occur at each site (Supplementary Fig. 1). Notably, this mutation model is also predictive of de novo variants from ref. 28 (Supplementary Fig. 3), which should be even less influenced by selection than the rare variants in gnomAD.
In the absence of natural selection, we assume a Bernoulli sampling model for the presence (probability Pi) or absence (probability 1 − Pi) of a rare variant at each site i, where Pi reflects the local sequence context and overall rate of mutation. We ignore sites at which common variants occur (similar to refs. 12,15). We then assume that natural selection has the effect of imposing a fractional reduction on the rate at which rare variants occur. To a first approximation, we maximize the following likelihood function,
$$\mathcal{L}(\lambda_s; \mathbb{Y}, \mathbb{P}) = P(\mathbb{Y}; \lambda_s, \mathbb{P}) = \prod_{i} \left[(1-\lambda_s) P_i\right]^{Y_i} \left[1-(1-\lambda_s) P_i\right]^{1-Y_i} \tag{1}$$
where Yi is an indicator variable for the presence of a rare variant at position i in the sample, λs is a scale factor capturing a depletion of rare genetic variation, \(\mathbb{Y}=\{Y_i\}\), \(\mathbb{P}=\{P_i\}\), and the product excludes sites having common variants. By maximizing this function we can obtain a maximum-likelihood estimate (MLE) of λs conditional on pre-estimated values Pi. (In practice, we use a slightly more complicated likelihood function that distinguishes among the possible alternative alleles at each site; see “Methods” for complete details.) Assuming the Pi values are pre-estimated, an approximate, unbiased MLE for λs and an estimator for its variance can be obtained in closed form (see “Methods”). Importantly, this variance has almost no sensitivity to variance in the pre-estimated Pi values in the regime of interest (see Supplementary Fig. 4), making the model highly robust to uncertainty in mutation rate estimates provided they are unbiased.
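As an illustration of Eq. (1), the following minimal Python sketch evaluates the log-likelihood and maximizes it numerically over λs. The function names, the golden-section search, the search bounds, and the toy inputs are assumptions made for this example and are not the authors' implementation. The shortcut `approx_lambda` (one minus the ratio of observed to expected rare variants) is a natural approximation when the Pi are small; the closed-form estimator actually used is described in “Methods” and may differ.

```python
import math

def log_likelihood(lam, y, p):
    """Log of Eq. (1): Bernoulli sites whose rare-variant rate is (1 - lam) * P_i."""
    total = 0.0
    for yi, pi in zip(y, p):
        q = (1.0 - lam) * pi
        total += math.log(q) if yi else math.log1p(-q)
    return total

def mle_lambda(y, p, tol=1e-8):
    """Maximize the (concave) log-likelihood by golden-section search.

    The lower bound of -5 is an arbitrary choice for this sketch; it is
    tightened if needed so that (1 - lam) * P_i stays below 1.
    """
    lo = max(-5.0, 1.0 - 1.0 / max(p) + 1e-9)
    hi = 1.0 - 1e-12
    g = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        c = hi - g * (hi - lo)
        d = lo + g * (hi - lo)
        if log_likelihood(c, y, p) > log_likelihood(d, y, p):
            hi = d  # maximum lies in [lo, d]
        else:
            lo = c  # maximum lies in [c, hi]
    return (lo + hi) / 2.0

def approx_lambda(y, p):
    """Assumed shortcut: 1 - (observed rare variants) / (expected rare variants)."""
    return 1.0 - sum(y) / sum(p)

if __name__ == "__main__":
    # Toy data: 1000 sites with P_i = 0.1, of which 80 carry a rare variant.
    p = [0.1] * 1000
    y = [1] * 80 + [0] * 920
    print(mle_lambda(y, p), approx_lambda(y, p))
```

When every site shares the same Pi, as in the toy data, the numerical maximum and the shortcut coincide; with heterogeneous Pi they can differ slightly.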
When λs falls between 0 and 1 it can be interpreted as a measure of the prevalence of ultraselection. In this case, λs can be thought of as the fraction of sites intolerant to heterozygous mutations, although in practice, some sites may be more, and some sites less, intolerant. Notice, however, that λs can also take values < 0 if rare variants occur at a higher-than-expected rate in the target set of sites. As we discuss below, we do observe a systematic tendency for λs to take negative values in particular classes of sites, likely reflecting the difficulty of precisely specifying the mutational model at these sites. Across most of the genome, however, estimates of λs fall between 0 and 1 and show general qualitative agreement with other measures of purifying selection.
Notably, in the case of strong selection against heterozygotes and mutation-selection balance (as detailed by refs. 11,17), a relatively simple relationship can be established between λs and the site-specific selection coefficient against heterozygous mutations, shet (see Eq. (12) in “Methods” and Supplementary Fig. 5). To test this relationship, following ref. 18, we simulated data sets under a realistic human demographic model with various values of shet and estimated λs from each one. We found that this approach led to highly accurate estimates of the true value down to about shet = 0.03, and somewhat elevated but acceptable estimates down to about shet = 0.02 (Supplementary Fig. 6), which corresponds to λs ≈ 0.45 with our data set. As it turns out, most of our estimates from real data do not exceed this threshold but when they do, we use this approach to estimate shet. Importantly, it is only these approximate estimates of shet, not λs itself, that depend on the assumption of mutation-selection balance.
We applied ExtRaINSIGHT to 19,955 protein-coding genes from GENCODE v. 38 29 as well as to a variety of proximal coding-associated sequences, including \(5^{\prime}\) and \(3^{\prime}\) untranslated regions (UTRs), promoters, and splice sites (Fig. 1). For comparison, we applied INSIGHT to the same sets of elements. As expected, we obtained considerably higher estimates of λs at 0-fold degenerate (0d) sites in coding sequences, at which each possible mutation results in an amino-acid change (λs = 0.22), than at 4-fold degenerate (4d) sites, at which every mutation is synonymous (λs = −0.008). The corresponding INSIGHT-based estimates of ρ were 0.80 and 0.39, respectively. Together, we can interpret these estimates as indicating that 22% of 0d sites are ultraselected, meaning that any mutation at these sites would be strongly deleterious, and another 80 − 22 = 58% are under weaker purifying selection—although the ExtRaINSIGHT and INSIGHT estimates are not precisely comparable in all respects (see “Discussion”). By contrast, at 4d sites, ultraselection is estimated to be completely absent, but 39% of 4d sites experience weak purifying selection (see ref. 9 for an estimate of 26% for synonymous sites). Overall, about 15% of coding sites (CDS) experience ultraselection (λs = 0.15) and another 47% experience weaker selection (ρ = 0.62).
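The arithmetic behind these percentages can be written out explicitly. The sketch below uses a hypothetical helper, `partition_selection`, which treats λs as the ultraselected fraction and ρ − λs as the fraction under weaker selection. Reading a negative λs as zero mirrors how the 4d estimate is interpreted in the text, but it is an assumption of this example rather than a stated rule of the method.

```python
def partition_selection(lam_s, rho):
    """Split the selected fraction rho into ultraselected and weaker parts."""
    ultra = max(lam_s, 0.0)         # negative estimates are read as "no ultraselection"
    weaker = max(rho - ultra, 0.0)  # remainder of rho attributed to weaker selection
    return ultra, weaker

# (lambda_s, rho) pairs quoted in the text for three site classes
estimates = {"0d": (0.22, 0.80), "4d": (-0.008, 0.39), "CDS": (0.15, 0.62)}

if __name__ == "__main__":
    for name, (lam_s, rho) in estimates.items():
        ultra, weaker = partition_selection(lam_s, rho)
        print(f"{name}: ultraselected {ultra:.0%}, weaker selection {weaker:.0%}")
```

Applied to the quoted estimates, this reproduces the 58% (0d), 39% (4d), and 47% (CDS) figures for weaker selection.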
Fig. 1: Measures of purifying selection at coding and coding-proximal genomic elements. (A) Estimates for various annotation types are shown for both ExtRaINSIGHT (λs; teal) and INSIGHT (ρ; orange). (B) Similar estimates are shown for protein-coding genes by deciles of the loss-of-function observed/expected upper bound fraction (LOEUF) measure13. Results are shown for 80,950 isoforms of 19,677 genes. Notice that lower LOEUF scores are associated with stronger depletions of LoF variants, so λs and ρ tend to decrease as LOEUF increases. Error bars are centered at the MLE and indicate one standard error in each direction (see “Methods”).
Among coding-related sites, the strongest selection, by far, occurred in splice sites (see also ref. 30), where almost half of sites were subject to ultraselection (λs = 0.45; corresponding to shet ≈ 0.02), with another 43% subject to weaker selection (ρ = 0.88). By contrast, \(3^{\prime}\) UTRs showed little evidence of ultraselection (λs = 0.028) despite considerable evidence of weaker selection (ρ = 0.24). Interestingly, we observed a persistent tendency for negative estimates of λs at regions near the \(5^{\prime}\) ends of genes, at both \(5^{\prime}\) UTRs and promoter regions, despite non-negligible estimates of ρ (0.22 and 0.13, respectively). As we discuss in a later section, these estimates appear to be a consequence of unusual mutational patterns in these regions that are difficult to accommodate using even our regional and neighbor-dependent mutation model.
To see whether ExtRaINSIGHT was capable of distinguishing among protein-coding sequences experiencing different levels of selection against heterozygous loss-of-function (LoF) variants, we compared it with the recently introduced “loss-of-function observed/expected upper bound fraction” (LOEUF) measure13. LOEUF is similarly based on rare variants but differs from ExtRaINSIGHT in that it is computed separately for each gene by pooling together all mutations predicted to result in loss-of-function of that gene (including nonsense mutations, mutations that disrupt splice sites, and frameshift mutations). In contrast to λs and ρ, lower LOEUF scores are associated with stronger depletions of LoF variants and increased constraint, and higher LOEUF scores are associated with weaker depletions and reduced constraint. To compare the two measures, we partitioned 80,950 different isoforms of 19,677 genes into deciles by LOEUF score and ran ExtRaINSIGHT separately on the pooled coding sites corresponding to each decile. Again, we computed ρ values using INSIGHT together with the λs values. We found that both ρ and λs decreased monotonically with LOEUF decile, with λs ranging from 0.28 for the genes having the lowest LOEUF scores to 0.008 for the genes having the highest LOEUF scores, and ρ similarly ranging from 0.77 to 0.43 (Fig. 1). These results suggest that in the 10% of genes under the weakest selection against heterozygous LoF mutations, only 0.8% of sites are subject to ultraselection, but over 40% still experience weaker purifying selection; whereas in the 10% of genes under the strongest selection against LoF mutations, almost 30% of sites are under ultraselection and another ~ 40% are under weaker purifying selection.
Finally, we considered an alternative grouping of genes by biological pathway, using the top-level annotation from the Reactome pathway database31 (Fig. 2). Again, we ran both ExtRaINSIGHT and INSIGHT on each group of genes and observed similar trends in the two measures, with λs ranging from 10% to 27%, and ρ ranging from 61% to 75%. We found genes annotated as belonging to the “Neuronal System” to be experiencing the most ultraselection (λs = 0.27), consistent with other recent findings9. Genes annotated as being involved in “Reproduction” showed the least ultraselection (λs = 0.10). Notably, the estimates of λs exhibited considerably greater variation, as a fraction of the mean, than did estimates of ρ. The ratio λs/ρ—which can be interpreted as the fraction of selected sites experiencing ultraselection—was also highest for “Neuronal System” genes (at 0.36) and lowest for “Reproduction” genes (at 0.18). An analysis of genes exhibiting tissue-specific expression produced similar results, with several brain tissues exhibiting the most ultraselection and vagina exhibiting the least (Supplementary Fig. 7).
Fig. 2: Measures of purifying selection in protein-coding genes by biological pathway. Genes were assigned coarse-grained functional categories using the top-level annotation from the Reactome pathway database31. An estimate for each category is shown for both ExtRaINSIGHT (λs; teal) and INSIGHT (ρ; orange). Error bars are centered at the MLE and indicate one standard error in each direction (see “Methods”). Total number of genes: n = 19,256 (ranging from 125 to 2707 per category).
We carried out a similar analysis on noncoding sequences, including a variety of noncoding RNAs, transcription factor binding sites (TFBS) supported by chromatin-immunoprecipitation-and-sequencing (ChIP-seq) data (from ref. 21), and unannotated intronic and intergenic regions. Among these sequences, we observed the strongest signature of ultraselection in microRNAs (miRNAs), particularly in evolutionarily “old” miRNAs broadly shared across mammals (designated as “conserved” by TargetScan; see “Methods”), where we estimated λs = 0.34 (Fig. 3). We found that the seed regions of these miRNAs had even slightly higher values of λs = 0.39. Interestingly, however, the prevalence of ultraselection was greatly reduced at evolutionarily “new” miRNAs that are not shared across mammals (“nonconserved” in TargetScan), where we estimated only λs = 0.031.
Fig. 3: Measures of purifying selection at annotated noncoding elements and in genomic intervals near protein-coding genes. (A) Estimates for both ExtRaINSIGHT (λs; teal) and INSIGHT (ρ; orange) at noncoding elements (x-axis). (B) Estimates of λs in windows upstream of the transcription start site (TSS) and downstream of the polyadenylation site (PAS) (x-axis). The \(5^{\prime}\) and \(3^{\prime}\) UTRs are also shown, as are fourfold degenerate (4d) coding sites (CDS). (C) Estimates of λs for the extended promoter region (2 kb upstream of the TSS) within transcription factor binding sites (TFBS) annotated in the Ensembl Regulatory Build44 and in the immediate flanking sequences (10 bp on each side). The difference in (C) is highly statistically significant by a two-sided likelihood ratio test based on the ExtRaINSIGHT likelihood model (\(p = 2.8 \times 10^{-13}\)). Error bars are centered at the MLE and indicate one standard error in each direction (see “Methods”). Numbers of elements: (A) old miRNA: n = 7537; UCNE: n = 1,415,142; HAR: n = 674,492; young miRNA: n = 6285; other intron: n = 971,109,276; other intergenic: n = 1,255,478,347; lncRNA: n = 453,200,392; miRNA: n = 140,681; snoRNA: n = 49,837; snRNA: n = 155,304. (B) n = 58,496. (C) n = 1,120,839.
Other types of noncoding RNAs also showed little indication of ultraselection: our estimates for long noncoding RNAs (lncRNAs), small nuclear RNAs (snRNAs), and small nucleolar RNAs (snoRNAs) were all close to zero or negative. In an attempt to identify regions within these RNAs that might be subject to stronger selection, we intersected them with conserved elements identified by phastCons25. However, we found that even these putatively conserved portions of noncoding RNAs exhibited at most λs ≈ 0.05 (in lncRNAs).
When we analyzed a pooled set of all ~ 2M TFBSs from ref. 21, we obtained a negative estimate of λs = −0.08, even though the same elements yielded a nonnegligible estimate of ρ = 0.23. We therefore examined only the binding sites of the 10 TFs whose binding sites showed the largest ρ estimates (ρ = 0.61 overall; see “Methods”), but even for this putatively conserved set, we obtained an estimate of only λs = 0.03. Thus, of the noncoding RNAs and TFBSs we considered, only “old” miRNAs appear to experience high levels of ultraselection.
We also evaluated ultraconserved noncoding elements (UCNEs)32 and noncoding human accelerated regions (HARs)33,34,35—two types of elements that have been widely studied for their unusual patterns of cross-species conservation, and have been shown to function in various ways, including as enhancers36,37 and noncoding-RNA transcription units33. Interestingly, despite their extreme levels of cross-species conservation, UCNEs show only modest levels of ultraselection, with λs = 0.09. This observation suggests that what is unusual about these elements is not the strength of selection acting on them (which is considerably weaker than that at protein-coding sequences or “old” miRNAs), but instead the uniformity of selection acting at each nucleotide (see “Discussion”). Notably, HARs display only slightly lower levels of ultraselection than UCNEs (λs = 0.04) and levels comparable to those of conserved sequences in introns. Thus, despite their rapid evolutionary change during the past 5–7 million years, HARs now appear to contain many nucleotides that are under strong purifying selection in human populations.
To account genome-wide for the incidence of strongly deleterious mutations, we ran ExtRaINSIGHT on a collection of mutually exclusive and exhaustive annotations. For this analysis, we considered CDSs, UTRs, splice sites, lncRNAs, introns, and intergenic regions, but excluded smaller classes of noncoding RNAs, which make negligible genome-wide contributions (Table 1). As above, we intersected the lncRNA, intron, and intergenic classes with phastCons elements, and separately considered the conserved and nonconserved partitions of each class. For each category, we multiplied our estimate of λs by the number of sites in the category to estimate category-specific expected numbers of sites subject to ultraselection. To account for potential misspecification of the mutational model, we conservatively subtracted from the category-specific estimates of λs the estimate for nonconserved intronic regions (0.009). Thus, by construction, the expected number of ultraselected sites in these and similar regions (including nonconserved intergenic and lncRNA sites) was zero.
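The per-category accounting described above can be sketched as follows. The helper name and the site counts are hypothetical placeholders chosen for illustration (the actual counts are in Table 1). The λs values and the 0.009 baseline are taken from the text, and flooring at zero reflects the statement that nonconserved regions contribute no ultraselected sites by construction.

```python
BASELINE = 0.009  # lambda_s for nonconserved intronic regions, from the text

def ultraselected_sites(lam_s, n_sites, baseline=BASELINE):
    """Expected ultraselected sites after conservative baseline subtraction."""
    return max(lam_s - baseline, 0.0) * n_sites

# lambda_s values from the text; the site counts are HYPOTHETICAL placeholders.
categories = {
    "CDS": (0.15, 1_000_000),
    "splice sites": (0.45, 1_000_000),
    "3' UTR": (0.028, 1_000_000),
    "nonconserved intergenic": (0.003, 1_000_000),
}

if __name__ == "__main__":
    for name, (lam_s, n_sites) in categories.items():
        print(f"{name}: {ultraselected_sites(lam_s, n_sites):,.0f} expected sites")
```

Summing these category-level counts and dividing by the total number of sites considered gives the genome-wide ultraselected fraction.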
Table 1 Ultraselection across the human genome (based on ExtRaINSIGHT).
Overall, we estimated that 0.374% ± 0.002% of the human genome is ultraselected, with 44% of ultraselected sites falling in CDSs, 13% in conserved introns, 11% in conserved intergenic regions, 12% in conserved lncRNAs, 5% in \(3^{\prime}\) UTRs and 3% in splice sites. Notably, ultraselected sites are overrepresented 37-fold in CDSs, but CDSs still account for less than half of ultraselected sites. Splice sites are overrepresented 121-fold but make a minor overall contribution owing to their small number.
Our assumption is that any point mutation at these ultraselected sites will be strongly deleterious, and simulations indicate that the detected sites are indeed subject to extreme purifying selection (see “Discussion”). Thus, if we multiply the expected numbers of sites by twice (allowing for heterozygous mutations) the estimated per-generation, per-nucleotide mutation rate (here assumed to be \(1.2 \times 10^{-8}\); ref. 38), we obtain expected numbers of de novo strongly deleterious mutations per potential zygote (“potential” because some mutations will act prior to fertilization). By this method, we estimate 0.258 ± 0.001 strongly deleterious mutations per potential zygote. By construction, these strongly deleterious mutations occur in the same category-specific proportions as the ultraselected sites (44% from CDS, 23% from introns, etc.). Thus, we expect about 0.11 strongly deleterious coding mutations per potential zygote and about another 0.15 such mutations at various noncoding sites.
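The conversion from ultraselected sites to mutations per potential zygote is a single multiplication, sketched below. The function names are hypothetical. The site count is not quoted in this section, so the example back-calculates it from the reported 0.258 mutations; that derived figure is not a value taken from Table 1.

```python
MU = 1.2e-8  # per-generation, per-nucleotide mutation rate assumed in the text

def deleterious_mutations_per_zygote(n_ultraselected_sites, mu=MU):
    """Expected de novo strongly deleterious mutations: 2 * mu * sites."""
    return 2.0 * mu * n_ultraselected_sites

def implied_sites(mutations_per_zygote, mu=MU):
    """Invert the relationship to recover the implied number of sites."""
    return mutations_per_zygote / (2.0 * mu)

if __name__ == "__main__":
    sites = implied_sites(0.258)  # roughly 10.75 million sites
    print(f"implied ultraselected sites: {sites:,.0f}")
    print(f"mutations per zygote: {deleterious_mutations_per_zygote(sites):.3f}")
    print(f"coding share (44%): {0.44 * 0.258:.2f}")
```

Under these inputs the coding share comes to about 0.11 mutations per potential zygote, in line with the figure quoted above.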
If we carry out a less conservative version of these calculations, by subtracting the λs estimate for nonconserved intergenic regions (0.003) rather than the one for intronic regions, we estimate 0.732% ± 0.004% of the genome to be ultraselected, with 23% falling in CDSs (Supplementary Table 1). The expected number of strongly deleterious mutations per potential zygote increases to 0.505 ± 0.003, of which 0.12 fall in CDSs. Taking these calculations together, we estimate a range of 0.26–0.51 strongly deleterious mutations per potential zygote, implying a high genetic burden but one that appears to be roughly compatible with other lines of evidence (see “Discussion”).
We performed a parallel analysis using INSIGHT, to estimate the numbers and distribution of more weakly deleterious mutations (Table 2). In this case, we estimate that 3.2% of sites are under selection and the expected number of de novo deleterious mutations per fertilization is 2.21. The fraction of deleterious mutations from CDS is 22%, with most of the remainder coming from introns and intergenic regions. lncRNAs and \(3^{\prime}\) UTRs also make significant contributions. Taking the ExtRaINSIGHT and INSIGHT estimates together, we estimate that each potential fertilization event is associated with 0.26–0.51 new strongly deleterious mutations and an additional 1.70–1.95 new mutations that are more weakly deleterious. One way to interpret these numbers is that, conditional on a threshold level of fitness (i.e., the existence of no strongly deleterious mutations), each person contains an expected ~2 new mutations that are sufficiently deleterious that they would tend to be eliminated from the population on the time-scale of human-chimpanzee divergence (as measured by INSIGHT), at least if humans continued to experience historical levels of purifying selection. That person’s genetic load would derive from both these new mutations and similar weakly deleterious mutations passed down from his or her ancestors.
Table 2 Weaker selection across the human genome (based on INSIGHT).
As noted above, we observed a consistent tendency to estimate negative values of λs at the \(5^{\prime}\) ends of genes, including in \(5^{\prime}\) UTRs and core promoters (Fig. 1), as well as at TFBSs and some noncoding RNAs from across the genome (Fig. 3). In an attempt to bound the genomic regions near protein-coding genes that give rise to these negative estimates, we applied ExtRaINSIGHT in a series of windows near the \(5^{\prime}\) and \(3^{\prime}\) ends of genes, pooling data from all ~ 20,000 genes (Fig. 3b). We found that the effect was most pronounced in the \(5^{\prime}\) UTR, where we estimated λs = −0.16 (see Fig. 1) and in the 250bp immediately upstream of the TSS (λs = −0.13). As we looked farther upstream, it diminished fairly rapidly, with λs = −0.05 in the (−500, −250) window and λs = −0.02 in the (−1000, −500) window. By the (−2000, −1000) window, the estimates had returned to slightly positive values. We did not observe negative estimates near the \(3^{\prime}\) ends of genes, and the estimate for 4d sites within the CDS was only slightly negative. Therefore, the tendency to estimate λs < 0 near genes appears to be limited to the \(5^{\prime}\) UTR and the ~1 kb region upstream of the TSS.
We hypothesized that, despite being well-calibrated across the majority of the genome (Supplementary Fig. 1), our mutation model is misspecified in promoter regions, perhaps owing to correlations of mutation rates with features such as chromatin accessibility or hypomethylation. We therefore adapted our model to consider the predicted state from an application of the 25-state ChromHMM model39,40 to Roadmap Epigenomics data41 as a categorical covariate and refitted it to the data, trying ChromHMM predictions for several cell types. However, we found that this approach did not eliminate the tendency for negative estimates of λs, perhaps because the available epigenomic data has too coarse a resolution or is not well matched by cell type.
Having observed negative estimates of λs also at TFBSs outside of promoter regions, however, we wondered if the effect could be driven, at least in part, by TF binding itself, which has been shown to be mutagenic in melanoma42,43. In an attempt to isolate the effects of TF binding, we applied ExtRaINSIGHT separately to predicted TFBS in extended promoter regions, using predictions from the Ensembl Regulatory Build44, and to the immediate flanking 10 bp on either side of these predictions, excluding flanking sequences that themselves included TFBSs. Interestingly, we found that estimates of λs were significantly more negative in the TFBSs than in the immediate flanking sites (Fig. 3c; \(p = 2.8 \times 10^{-13}\), likelihood ratio test), suggesting a possible influence from the mutagenic effects of TF binding (see “Discussion”). In the end, we were not able to eliminate this apparent problem with our mutation model, but its effects appear to be generally quite local to TSSs and TFBSs and therefore are likely to have a limited impact on our genome-wide analyses.
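A likelihood ratio test of this kind can be sketched with the simplified likelihood of Eq. (1): fit one shared λs to both sets of sites (null), fit a separate λs to each set (alternative), and compare twice the log-likelihood difference with a chi-square distribution with one degree of freedom. This is a minimal sketch under those assumptions. The function names, the golden-section optimizer, and the toy data are illustrative, and the test reported in the paper is based on the full model described in “Methods”.

```python
import math

def log_likelihood(lam, y, p):
    """Log of Eq. (1) for indicators y and pre-estimated probabilities p."""
    total = 0.0
    for yi, pi in zip(y, p):
        q = (1.0 - lam) * pi
        total += math.log(q) if yi else math.log1p(-q)
    return total

def max_log_likelihood(y, p, tol=1e-8):
    """Return (lambda_hat, maximized log-likelihood) via golden-section search."""
    lo = max(-5.0, 1.0 - 1.0 / max(p) + 1e-9)  # arbitrary lower bound for the sketch
    hi = 1.0 - 1e-12
    g = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        c = hi - g * (hi - lo)
        d = lo + g * (hi - lo)
        if log_likelihood(c, y, p) > log_likelihood(d, y, p):
            hi = d
        else:
            lo = c
    lam = (lo + hi) / 2.0
    return lam, log_likelihood(lam, y, p)

def lrt_two_sets(y_a, p_a, y_b, p_b):
    """Two-sided LRT of 'same lambda_s in both sets' vs 'different lambda_s'.

    Inputs are plain Python lists so that the two sets can be concatenated.
    """
    lam_a, ll_a = max_log_likelihood(y_a, p_a)
    lam_b, ll_b = max_log_likelihood(y_b, p_b)
    _, ll_0 = max_log_likelihood(y_a + y_b, p_a + p_b)
    stat = max(2.0 * (ll_a + ll_b - ll_0), 0.0)
    p_value = math.erfc(math.sqrt(stat / 2.0))  # chi-square tail, 1 degree of freedom
    return lam_a, lam_b, stat, p_value

if __name__ == "__main__":
    # Toy data: two sets of 2000 sites with P_i = 0.1 but different variant counts.
    p = [0.1] * 2000
    y_a = [1] * 200 + [0] * 1800
    y_b = [1] * 100 + [0] * 1900
    print(lrt_two_sets(y_a, p, y_b, p))
```

The chi-square tail probability uses the identity that, for one degree of freedom, the survival function equals `erfc(sqrt(x / 2))`, which avoids any dependency beyond the standard library.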
Some genes might not mind a bit of extra pressure when it comes to evolution.
A Swiss team led by Andreas Wagner of the University of Zurich has demonstrated the evolution of a gene from yellow to green fluorescence in Escherichia coli, a common bacterium that lives in the gut. Strong selective pressure caused the gene to evolve more quickly, because the gene came to encode a robust protein that helped it adapt efficiently.
This could be one of the first experimentally demonstrated examples of selection helping a gene to be better at evolving, instead of crippling it. This is very hard to observe because of how long evolution usually takes.
“To our knowledge, this is the first experimental proof that selection can drive the ability to adapt in a Darwinian sense and increase evolvability,” says Wagner. “There are still people out there who question whether evolution is real. But we don’t just look at fossils where we have a historical record. We observe evolution in the laboratory.”
The findings are described in a paper in the journal Science.
Strong selective pressure occurs when only a few individuals in a population can reproduce, usually because the environment is harsh, and a very specific set of genes is needed to survive to adulthood. However, this can often mean that the genome doesn’t get a chance to collect useful mutations, and proteins can become weak.
However, the researchers found the opposite was the case in their E. coli experiment, where strong selection led to more robust proteins that were less likely to be damaged by harmful mutations.
“This discovery was a real surprise to me because it showed that selection for fitness didn’t conflict with selection for robustness, which contrasts with previous work,” says co-author Jia Zheng, also from the University of Zurich.
“While most mutations that proteins encounter harm their stability or ability to fold correctly, the robustness-improving mutations actually mitigate such deleterious effects. Robust proteins have a higher chance to function and thus evolve new traits.”
The team took a gene from a jellyfish that glows fluorescent yellow in certain light and put it into the E. coli to observe whether the resulting protein evolved to be a new colour. The use of a foreign gene meant they could observe the change in proteins without there being any influence from interacting genes elsewhere in the genome.
They then watched how the gene in E. coli evolved over four generations. To simulate natural selection, they chose greener glowing E. coli and subjected the gene to genetic mutation, as would happen in nature. They found that most of the E. coli in the final generation glowed green, despite the extra mutations, when the green protein was strongly selected for.
The team compared two experiments. The first tested strong selective pressure by only choosing the top 0.01% of green glowing E. coli each generation, and the second tested weak selective pressure by choosing any cells that glowed green to carry on.
Interestingly, the genes that experienced strong selection to be green also evolved robust proteins that kept their shape and function well. This might be because the selective pressure did not allow for detrimental mutations to really become established in the population, so proteins ended up being stronger and/or resisting the effects of mutations.
“It shows that natural selection can play a crucial and active role in creating standing variation that is both beneficial and enhances evolvability—for example, by increasing robustness to deleterious mutations. This contrasts with some theoretical and experimental work, in which first-order selection for fitness conflicts with second-order selection for robustness,” the researchers say in their paper.
This experiment was done under laboratory conditions and might not reflect all instances of natural selection in the wild, especially as it tested a single gene instead of a whole genome. Nevertheless, this is an exciting demonstration of selective pressure because of how difficult it is to replicate evolution.