Main

Tobacco smoking is one of the greatest hazards to human health, accounting for over 200 million disability-adjusted life years and 7 million deaths each year globally1. The currently available first-line smoking-cessation drugs (varenicline and bupropion) were introduced more than 2 decades ago, even before the Human Genome Project was completed and the genomic revolution started2,3,4. Despite their proven efficacy and wide usage5, smoking remains a global health hazard, warranting advancements in smoking-related drug-discovery efforts that make use of recent innovations in therapeutic design and delivery6.

Large-scale rare variant association studies have the potential to advance drug discovery7,8,9,10. Drug designs inspired by naturally occurring genetic variants that protect humans against diseases have been successful in the past, for example, inhibitors of the enzyme PCSK9 for the treatment of hypercholesterolemia11,12,13. Smoking behavior is strongly influenced by genetics, with twin-based heritability estimates ranging between 45% (for smoking initiation) and 75% (for nicotine dependence)14. Genetic variants across the entire minor allele-frequency (MAF) spectrum (common (MAF > 1%), low-frequency (MAF, 0.1–1%) and rare (MAF < 0.1%) variants) contribute to this high heritability15. However, human genetic studies of smoking behavior have thus far focused mainly on common and low-frequency variants (that can be imputed with at least moderate accuracy)16,17,18,19. Such genome-wide association studies (GWASs) were successful in identifying genomic regions associated with smoking. In contrast to GWASs, only a very few rare variant studies of smoking exist to date15,20. Although such studies have demonstrated that rare variants contribute substantially to smoking heritability, very few genes have been confidently linked to smoking based on rare variant associations15,20.

Unlike common variant associations, rare coding variant associations often pinpoint causal genes21, inform effect direction21,22, guide follow-up experiments23 and provide an estimate of the therapeutic efficacy11,24 and safety25 of targeting a gene or its product. Even for known drug targets, discovering human genetic evidence is valuable, as it can improve our understanding of the drug mechanisms and help develop new therapeutic modalities to treat diseases26. Hence, with the goal of discovering drug targets for smoking, we undertook a large-scale exome-wide association study (ExWAS) of smoking behavior involving up to 749,459 individuals. We studied the associations of rare coding variants in the human genome, captured via exome sequencing, with six major smoking phenotypes and a range of secondary phenotypes including smoking-related diseases. We also selectively explored the rare variant associations at the known GWAS loci and conducted ancestry-specific and cross-ancestry GWAS meta-analyses for the six smoking phenotypes to validate known loci and identify new loci. Finally, we studied the combined influences of both common and rare variants on smoking behavior.

Results

Exome-wide significant associations

The overall study design is shown in Fig. 1. We performed ExWAS meta-analyses for six primary phenotypes (ever smoker, heavy smoker, former smoker, nicotine dependence, cigarettes smoked per day (cig per day) and age started smoking) in sample sizes ranging from 112,670 (cig per day) to 749,459 (ever smoker). The study cohorts and phenotype definitions are described in the Methods, and the cohort-specific sample sizes and participant demographics are summarized in Supplementary Tables 1 and 2, respectively. We focused on coding variants of two functional categories: missense variants and predicted loss-of-function (pLOF) variants (frameshift, splice donor, splice acceptor, stop lost, stop gain and start lost) with MAF < 0.01. In addition to variant-level associations, we also studied gene-level associations, using burden tests in which either pLOF variants only or pLOF and likely deleterious missense variants (that is, predicted to be deleterious by five different algorithms) in a gene are aggregated to create burden masks (or variant sets), which are then tested for association with the phenotypes (Methods)21. The burden masks were created using variants at five MAF thresholds (<0.01, <0.001, <0.0001, <0.00001 and singletons) (Supplementary Table 3). Altogether, we performed 8,417,987 association tests across six smoking phenotypes. Applying a false detection rate (FDR) of 1% (corresponding P value = 4.5 × 10−8), we identified 35 significant associations implicating three genes: ASXL1, DNMT3A and CHRNB2 (Fig. 2, Supplementary Fig. 1 and Supplementary Table 4). Although these results were based on analyses in which individuals of all ancestries were pooled together, we found that the results were highly similar to those from a cross-ancestry meta-analysis or a meta-analysis involving only individuals of European ancestry, suggesting that the results were not influenced by population stratification (Supplementary Fig. 2).

Fig. 1: Study design.
figure 1

The flow chart summarizes the overall study design in terms of cohorts, phenotypes and types of genetic analyses performed. ICD, International Classification of Diseases.

Fig. 2: Discovery of rare variants associated with smoking phenotypes.
figure 2

Quantile–quantile (QQ) plot of the rare variant associations (both variant and burden associations) with six smoking phenotypes (ever smoker, heavy smoker, former smoker, nicotine dependence, cig per day and age started smoking). The dashed line corresponds to the exome-wide significant threshold, 4.5 × 10−8, determined based on a 1% FDR correction applied across all the associations (n tests = 8,417,987).

Associations of rare variants in CHRNB2

The primary phenotype that discovered the CHRNB2 association was heavy smoker, where cases were individuals who smoked at least ten cigarettes per day either currently or formerly (n = 110,494), and controls were individuals who have never smoked in their lifetime (n = 374,842). The strongest association was observed for pLOF-plus-missense burden (an aggregate of pLOF and likely deleterious missense variants in CHRNB2 with MAF < 0.001), for which the odds of being a heavy smoker were significantly lower in carriers than in non-carriers (OR = 0.65; CI = 0.56–0.76; P = 1.9 × 10−8). The rare variant burden association was independent of any nearby common variant associations with P < 0.01 (Supplementary Fig. 3 and Methods), and the effect estimates were consistently in the protective direction across the three cohorts that contributed to the meta-analysis (Fig. 3). The protective association of CHRNB2 pLOF-plus-missense burden with heavy smoking was observed irrespective of how we defined heavy smoking (Supplementary Fig. 4). Furthermore, the protective association was also seen for the ever smoker phenotype (where individuals who ever smoked regularly in their lifetime were defined as cases, n = 345,805) but was less significant than for the heavy smoker phenotype, despite a relatively larger sample size, highlighting the importance of phenotype specificity in gene discovery (Extended Data Fig. 1). However, when considering pLOF-only burden (an aggregate of pLOF variants in CHRNB2 with MAF < 0.001), which provides the strongest evidence on the direction of the association, the association reached at least a nominal level of significance (P < 0.05) only for the ever smoker phenotype but not for the heavy smoker phenotype, likely because the ever smoker phenotype captured more pLOF carriers (281 carriers) than the heavy smoker phenotype (174 carriers), suggesting that a larger sample size at the expense of phenotype specificity is also valuable, particularly at the rarer end of the allele-frequency spectrum.

Fig. 3: Forest plots of the top burden–trait associations of the significant genes.
figure 3

Cohort-level and meta-analysis summary statistics of the most significant burden–trait associations for each of the three exome-wide significant genes are summarized using forest plots. The ORs and 95% CIs are plotted. The columns ‘case counts’ and ‘control counts’ show the case and control sample sizes, respectively, broken down to the number of carriers of the homozygous reference, heterozygous and homozygous alternative genotypes. For burden definitions, refer to Supplementary Table 2. ALL, all ancestries; AAF, alternative allele frequency (combined frequency of all the variants aggregated in the burden mask).

We next studied the association of CHRNB2 pLOF-plus-missense burden with a range of secondary smoking phenotypes, mainly derived from UK Biobank (UKB)27 participants’ responses to a lifestyle questionnaire related to smoking (Methods). The overall association pattern was in line with our main finding that rare pLOF and likely deleterious missense variants in CHRNB2 in aggregate confer protection against smoking addiction (Extended Data Fig. 2 and Supplementary Table 5). We also studied the burden associations with a curated list of binary and quantitative health phenotypes related to smoking and observed suggestive associations, all in the protective direction, for example, emphysema (OR = 0.45; CI = 0.28–0.71; P = 6.9 × 10−4), chronic obstructive pulmonary disease (COPD; OR = 0.80; CI = 0.62–1.03; P = 0.08) and family history of lung cancer (OR = 0.84; CI = 0.69–1.01; P = 0.06) (Extended Data Fig. 2).

No individual pLOF or missense variants in CHRNB2 surpassed the study-wide significance threshold, suggesting that our sample sizes were still underpowered to capture single-variant associations. Using a leave-one-variant-out (LOVO) burden analysis28 (Methods), we identified a missense variant (rs202079239, Arg460Gly) that contributed the most to the pLOF-plus-missense burden association in the UKB (Fig. 4a and Supplementary Table 6). Importantly, even after excluding Arg460Gly, the burden association was still nominally significant with a protective OR (OR = 0.71; CI = 0.57–0.88; P = 0.001), suggesting that other variants in the burden mask also contributed to the association (Supplementary Table 6). Additionally, the Arg460Gly variant independently showed a moderately significant protective association with the heavy smoker phenotype (OR = 0.56; CI = 0.43–0.72; P = 1.1 × 10−5). We found that this variant has drifted to a higher frequency in Finns (gnomAD29 MAF = 0.0018) compared to non-Finnish Europeans (gnomAD MAF = 0.00038; Fig. 4b). Statistical power increases with MAF; hence we expected that the protective association of Arg460Gly with smoking or related phenotypes might be detectable in FinnGen30, a population-based cohort in Finland, despite its sample size being smaller than that of the UKB. A selective exploration of Arg460Gly with smoking, substance use and smoking-related lung disease phenotypes in the publicly available data from the FinnGen research project (freeze version 7) revealed significant enrichment for protective associations (hypergeometric test for enrichment, P = 0.03; Fig. 4c,d and Supplementary Table 7). At least two phenotypes showed nominally significant (P < 0.05) protective associations: substance-use disorder (excluding alcohol) (OR = 0.39; CI = 0.21–0.73; P = 0.003) and COPD (OR = 0.69; CI = 0.49–0.96; P = 0.03). Therefore, by exploiting the natural phenomenon of genetic drift in an isolated population30, we were able to validate the protective association of CHRNB2 with smoking-related phenotypes in an independent cohort.

Fig. 4: A Finnish-enriched missense variant contributes most to the CHRNB2 burden association.
figure 4

a, Results from LOVO analysis (Methods) of the CHRNB2 pLOF-plus-missense burden (AAF < 0.001) in the UKB. The LOVO P values are plotted against the variant positions. The dashed blue line corresponds to the P value of the full burden association. The dashed gray line corresponds to P = 0.05. b, MAFs of Arg460Gly (rs202079239) in different populations in the gnomAD database. EUR, European ancestry. c, Volcano plot showing the PheWAS associations of Arg460Gly with smoking-related phenotypes in the FinnGen database. The dashed line corresponds to P = 0.05. d, ORs and 95% CIs of selected phenotype associations of Arg460Gly in the FinnGen database are displayed. Excl., excluding, SUD, substance use disorder.

Associations of common variants near CHRNB2

Common variant associations by themselves often do not pinpoint the causal gene(s); when they do, they mostly bring limited insights into the druggability of the gene. However, when interpreted along with rare coding variant associations, they can offer valuable insights. To this end, we searched for any known common variant GWAS signals near CHRNB2 that were reported previously for smoking-related traits. Liu et al.16 have reported a GWAS association with cig per day near CHRNB2 where the fine-mapped 95% credible set contained a single variant, rs2072659, located within the 3′ untranslated region (UTR) of CHRNB2. This variant showed significant (P < 0.05) associations in our dataset with multiple smoking phenotypes including heavy smoker (OR = 0.96; CI = 0.94–0.98; P = 5.3 × 10−6), all in the protective direction (Fig. 5a). In a phenome-wide association study (PheWAS) of this variant across 7,469 phenotypes in two of the large cohorts (UKB and Geisinger Health System (GHS)), the strongest association was with smoking (Fig. 5b). In addition, seven of the top ten associations were with smoking-related phenotypes, all in the protective direction.

Fig. 5: Association of a common 3′ UTR variant with smoking.
figure 5

a, Forest plots of associations of the CHRNB2 3′ UTR variant (rs2072659) with the major smoking phenotypes based on cross-ancestry meta-analyses (Methods); either ORs (if binary traits) or β estimates (in s.d. units) and their 95% CIs are plotted. b, QQ plot of the PheWAS associations of rs2072659 in the UKB and the GHS cohorts.

Associations of clonal hematopoiesis of indeterminate potential mutations in ASXL1 and DNMT3A

Among the three exome-wide significant genes, ASXL1 and DNMT3A showed the strongest associations with most of the smoking phenotypes (Figs. 2 and 3, Extended Data Figs. 3 and 4 and Supplementary Tables 4 and 5). However, both ASXL1 and DNMT3A are known to accumulate somatic mutations in circulating blood cells with increasing age in the general population, a phenomenon described as clonal hematopoiesis of indeterminate potential (CHIP)31. When the DNA source for exome sequencing is peripheral blood, standard exome variant-calling workflows capture CHIP mutations along with germline variants32,33. We have previously reported a comprehensive genetic analysis of CHIP, in which we systematically called somatic variants in participants of the UKB and the GHS cohorts and studied their germline associations33. It is well known that smoking is strongly associated with CHIP34,35, and the association of ASXL1 CHIP mutations with smoking in the UKB has been previously reported35. Hence, we were not surprised to learn that the ASXL1 and DNMT3A associations were driven by CHIP mutations, which we confirmed through burden analyses based on burden masks with and without CHIP mutations and association analyses of the variant allele fraction (VAF) of CHIP mutations with smoking phenotypes (Fig. 6, Extended Data Fig. 5 and Supplementary Note). As was previously proposed, the association of CHIP mutations with smoking phenotypes suggests that smoking offers a clonal advantage to certain CHIP mutations, although the underlying mechanisms have yet to be understood. Also, our findings echo the caution previously raised by many in relation to using exome-sequencing data based on blood samples to establish genetic diagnoses for Mendelian diseases in adults36,37 (Supplementary Note).

Fig. 6: Association of ASXL1 and DNMT3A CHIP mutations with smoking.
figure 6

We constructed pLOF-only and pLOF-plus-missense burden masks at five allele frequency thresholds using all variants (wCHIP) and excluding CHIP variants (woCHIP) in the UKB and the GHS cohorts and tested their associations with the six major smoking phenotypes using REGENIE (Methods). The burden association P values are plotted, and the summary statistics including sample sizes are provided in Supplementary Table 8. The dashed line corresponds to the significance threshold after adjusting for multiple testing (1% FDR correction).

Association of rare variants at known GWAS loci

Two of the strongest genetic risk loci for smoking that were identified early in the GWAS timeline were locus 15q25.1, containing three nicotine acetylcholine receptor (nAChR) genes (CHRNA5, CHRNA3 and CHRNB4)38,39, and locus 19q13.2, containing a cluster of cytochrome P450 enzyme-coding genes (CYP2A, CYP2B and CYP2F subfamilies); both strongly influence the number of cigarettes smoked per day40,41. Although none of the genes were significant at the exome-wide level in our analysis, given their strong biological links to smoking, we explored these loci for evidence of any subthreshold rare variant associations. At the cytochrome P450 locus, we found little evidence for rare variant associations beyond the known common variant signals (Extended Data Fig. 6a and Supplementary Table 11). However, we observed nominal rare variant gene burden associations with cig per day at locus 15q25.1, implicating all three nAChRs (CHRNA5, CHRNA3 and CHRNB4) with effect sizes larger than those observed for common variants (Extended Data Fig. 6b). Notably, the largest effect size was observed for the CHRNB4 pLOF-only rare variant burden, where the 13 pLOF carriers smoked on average ~6.8 cigarettes per day more than non-carriers (β = 0.68 s.d.; CI = 0.17–1.18; P = 0.008; Extended Data Fig. 6c). This effect size is approximately three to four times larger than the largest effect sizes observed for CHRNA5 (β = 0.23; CI = 0.05–0.40; P = 0.01) and CHRNA3 (β = 0.16; CI = 0.02–0.31; P = 0.03) pLOF-only rare variant burden and ~7.5 times larger than that for rs16969968 (approximately one cigarette more; β = 0.09; CI = 0.09–0.10; P = 3.8 × 10−125), a well-characterized common risk variant at this locus (Supplementary Table 11). Power calculations based on observed effect sizes suggest that these associations will likely emerge as significant at the genome-wide level when the sample size for ExWAS of the cig per day phenotype reaches between 300,000 and 500,000 (Extended Data Fig. 7).

Previous exome studies have shown that rare variant associations are enriched near GWAS loci for many human diseases and traits21,42. Hence, we analyzed the burden associations, focusing only on genes mapped to GWAS loci19 (Methods). We observed no significant rare variant burden associations other than the association of CHRNB2 pLOF-plus-missense burden with the heavy smoker phenotype (Extended Data Fig. 8). The results suggested that our current sample sizes are underpowered to capture the convergence between common and rare variant associations at the known smoking GWAS loci.

Cross-ancestry and ancestry-specific GWAS

We first performed GWAS for the six primary smoking phenotypes in individuals of European ancestries and used these results to analyze SNP-based heritability (SNP-h2) and genetic correlations using a European ancestry-based linkage disequilibrium (LD) reference panel43. Our SNP-h2 estimates were comparable to previously reported estimates16 (Supplementary Fig. 5a and Supplementary Table 12). Also, our GWAS results showed strong genetic correlations with the previous GWAS results16 (Supplementary Fig. 5b and Supplementary Table 13), which suggests high reproducibility of the polygenic signals of the studied smoking phenotypes. Also, we observed moderate-to-large genetic correlations across our six phenotypes, suggesting shared genetic architecture across the phenotypes (Supplementary Fig. 5c and Supplementary Table 14).

Next, we performed cross-ancestry GWAS meta-analyses for the six primary smoking phenotypes. Across all the phenotypes, in total, we identified 328 LD-independent loci, of which a majority (94%) are known. This was expected, given that a GWAS with a much larger sample size has been published before16 (Supplementary Fig. 6a–f and Supplementary Table 16). Among the new loci, an X chromosome locus that we identified for nicotine dependence deserves special mention, as it implicates a nicotinic receptor-related gene. This locus, Xq22.1, harbors TMEM35A (the closest gene to the index variant), also referred to as NACHO (new acetylcholine receptor chaperone); this gene encodes a molecular chaperone protein that is involved in the assembly of α7, α6β2 and α6β2β3 nAChRs44. Mice lacking Tmem35a develop hyperalgesia44, and we observed that the index variant at this locus is also associated with increased intake of oxycodone, an analgesic medication, in the UKB (OR = 1.58; P = 0.0001; data from https://www.opentargets.org)45, suggesting that this locus might influence both smoking and pain phenotypes in humans.

After European ancestries, the second largest proportion (19%) of our study participants were of admixed American ancestries (AMR), mostly from the Mexico City Prospective Study (MCPS) cohort46. Published GWASs of smoking behavior in AMR ancestries are sparse47. In the AMR-specific GWAS, we identified 25 independent loci across the six phenotypes, of which 15 are known and 10 are new (Supplementary Table 16). The known loci include some of the strongest GWAS loci identified in European-specific GWAS: CHRNA5 (ref. 39), CHRNA4 (ref. 48), DBH41, CYP2A6 (refs. 40,41) and NCAM1 (ref. 49) (Supplementary Table 16). In AMR ancestries, we also identified an X chromosome locus that has been previously linked to smoking in those of European ancestries18. Notably, at this locus (with GPR101 in the vicinity), we identified a genome-wide significant association with the heavy smoker phenotype in the AMR-specific GWAS (rs1190734; ORAMR = 0.83 (0.79–0.88); PAMR = 1.2 × 10−11) but only a nominal association with the heavy smoker phenotype in the European-specific GWAS (OREUR = 0.98 (0.97–0.99); PEUR = 0.001). However, the same variant showed genome-wide significant association with the cig per day phenotype in European-specific GWAS (βEUR = −0.02; PEUR = 7.6 × 10−16), corroborating the GWAS signal at this locus reported previously for the cig per day phenotype18. Whether this locus is associated with the cig per day phenotype in AMR ancestry with a larger effect size than that in European ancestry is not clear, as we did not have this phenotype in the MCPS cohort at the time of this analysis. Nevertheless, the findings overall suggest that the GPR101 locus influences smoking behavior in both European and AMR ancestries. Regarding the ten new loci identified in the AMR ancestries, as expected, many (seven loci) harbored variants that are relatively more common in AMR ancestries than in European ancestries, thereby offering higher statistical power for discovery; for example, at 10q21.1, an intergenic locus, we identified a genome-wide significant association with the heavy smoker phenotype where the index variant is observed in ~10% of admixed Americans but only in ~0.05% of Europeans; at 8p22 (closest gene, C8orf48), we identified a genome-wide significant association with the ever smoker phenotype, where the index variant is observed in ~30% of admixed Americans but only in ~7% of Europeans (Supplementary Table 16).

Interplay between common and rare variants

Large-scale sequencing projects provide increased power to detect additive effects between common and rare variants for many diseases and traits. For example, we have previously demonstrated an additive effect between GPR75 obesity-protective rare variants and polygenic score (PGS) for obesity based on common variants10. We performed a similar analysis to test whether an additive effect is also evident for CHRNB2 rare variants and smoking PGS. We calculated smoking PGS for UKB participants of European ancestries based on a GWAS of the ever smoker phenotype performed in an independent sample (a meta-analysis of GWAS and Sequencing Consortium of Alcohol and Nicotine use (GSCAN) GWAS19 results excluding 23andMe and the UKB with the GWAS results of the GHS50, one of our largest European cohorts). First, we studied the associations of CHRNB2 pLOF-plus-missense burden and smoking PGS with heavy smoking within a single regression model that included an interaction term between the burden mask and the PGS (Methods). Both burden mask (OR = 0.66; 95% CI = 0.56–0.79; P = 3.4 × 10−6) and the PGS (β = 0.33; standard error (SE) = 0.004; P = 1 × 10−300) were associated with heavy smoking without a statistically significant interaction (P = 0.71). The results suggest that rare variants and the PGS influence the risk of heavy smoking independently. Second, to demonstrate the additive effect, we binned UKB individuals into quintiles based on their smoking PGS and quantified the prevalence of heavy smokers in CHRNB2 pLOF-plus-missense burden mask carriers (the burden mask that showed the strongest association with the heavy smoker phenotype) and non-carriers. The prevalence of heavy smokers increased in both carriers and non-carriers from lower to higher PGS quintiles (Fig. 7 and Supplementary Table 17). Importantly, within each of the quintiles, the prevalence of heavy smokers was lower in CHRNB2 rare variant carriers than in non-carriers, demonstrating an additive effect between PGS and rare variants. The additivity implies that the smoking PGS modifies the penetrance of CHRNB2 rare variants and vice versa, that is, the protective effect of CHRNB2 rare variant burden is attenuated in individuals with higher PGS compared to in individuals with lower PGS, and the risk effect of increased PGS is attenuated in rare variant carriers compared to in non-carriers.

Fig. 7: Additive effects between CHRNB2 rare variants and smoking PGS.
figure 7

Prevalence estimates of heavy smokers among CHRNB2 rare variant carriers and non-carriers within each of the five PGS quintiles in the UKB are plotted. Standard errors of the prevalence estimates, displayed as error bars, were calculated using the formula √(pq/n), where n is the number of individuals in each group, p is the prevalence of heavy smoker in the group and q is 1 − p. The PGS was based on a GWAS meta-analysis of the ever smoker phenotype (Methods). CHRNB2 rare variants are those that were aggregated into the CHRNB2 pLOF-plus-missense (AAF < 0.001) burden mask. Statistical differences in the prevalence between carriers and non-carriers were tested using a logistic regression analysis within each quintile; ORs, 95% CIs and P values are shown.

Discussion

GWASs of smoking behavior16,17,18,19 based on common variants have made tremendous progress in the field, with the recent GWAS involving more than 3 million individuals19. Such studies have substantially improved our understanding of the polygenic architecture of smoking phenotypes and have highlighted genes and pathways including nAChRs, genes involved in nicotine metabolism and dopaminergic and glutamatergic signaling16. However, to date, few studies based on whole-exome- or whole-genome-sequencing data have been reported15,20, and they involved sample sizes insufficient to capture associations at variant- and gene-level resolutions. Hence, our understanding of the contributions of rare variants to smoking behavior has been minimal thus far. In the present study, we performed a large-scale rare variant analysis in sample sizes that had enough power to identify associations of a rare variant or an aggregate of rare variants with an OR of 2.5 and above (or 0.4 and below) when there are at least 100 carriers (Extended Data Fig. 9). The fact that our analysis revealed only one germline association indicates that there are no ‘low-hanging fruits’ for smoking in the rare variant space other than CHRNB2. However, we acknowledge that this interpretation applies only to European populations, and we cannot exclude the possibility that rare variants exist that are more frequent in other ancestries and might be discovered in the future in similar or even smaller sample sizes than ours. Nevertheless, we note that 25% of our samples represent non-European ancestries, with the largest proportion (19%) representing admixed Americans46. However, the sample sizes, when broken down into individual ancestry groups, are still smaller than what would be necessary to make rare variant discoveries.

The major finding from our analysis is that individuals with rare pLOF and likely deleterious missense variants in CHRNB2 are at decreased odds of smoking heavily. Although the top association was observed for the gene burden that combined both pLOF and missense variants, the concordant protective effect sizes observed for the pLOF-only burden strengthened our interpretation that what we observe is a loss-of-function association. This knowledge is crucial as it informs therapeutic hypotheses for drug design. Moreover, we identified a single deleterious missense variant that drifted to a higher frequency in the Finnish population, which gave us an opportunity to validate the protective associations in the FinnGen study30. The finding highlights the value of isolated populations to inform drug target discovery51.

Another important finding is the convergence of rare and common variant findings of CHRNB2. We highlight a common 3′ UTR variant, reported in previous GWASs16,19, that shows protective associations with multiple smoking phenotypes, suggesting that this variant likely decreases CHRNB2 expression. Importantly, the OR of the common variant association with the heavy smoker phenotype was 0.96 as opposed to 0.65 for the pLOF-plus-missense rare variant burden. The pattern suggests a dose–response relationship between the gene and the phenotype in which varying levels of gene perturbations result in proportional effects on the phenotype. We particularly highlight the fact that this variant, although discovered in the earlier GWAS16, did not receive attention, as it was buried underneath the hundreds of GWAS associations, reflecting an important limitation of interpreting common variant findings. However, when interpreted in the light of rare variant findings, the common variant association stood out as highly valuable, exemplifying the combined value of GWAS and ExWAS in drug target discovery. Such observations will become frequent in the future with the rapidly growing population-scale ExWAS of human diseases and traits52.

CHRNB2 codes for the β2 subunit of the α4β2 nAChR, which is the predominant nicotinic receptor expressed in the human brain53. The role of α4β2 nAChR in mediating nicotine effects has been well characterized by decades of animal studies54,55, thanks to the pioneering work of Picciotto and colleagues who demonstrated in 1995 that deletion of the gene encoding β2 in mice abolished nicotine-mediated effects on avoidance learning and reinforcement behavior56,57. However, we describe human genetic evidence supporting the hypothesis that loss of CHRNB2 protects against nicotine addiction. Importantly, the protein encoded by CHRNB2 can be viewed as a known drug target as it is a component of the α4β2 nAChR, which, being the major nicotine receptor in the brain, has been the target of most nAChR partial agonists and antagonists developed thus far, including cytisine (an α4β2 partial agonist58) and varenicline (an α4β2 partial agonist and antagonist3). Varenicline is the current drug of choice to aid smoking cessation and was developed in 1997 by Pfizer based on the molecular structure of cytisine2,3. In addition to α4β2, varenicline binds to various other nAChRs in the brain including α7, α3β4 and α6β2 (ref. 59). Given the established role of α4β2 in mediating rewarding and reinforcement actions of nicotine, it is believed that the α4β2-antagonistic action of varenicline helps with smoking cessation3. Our finding aligns with this hypothesis, emphasizing that human genetics is useful not only to discover new drugs but also to better understand the mechanism of action of old drugs that have been in use for decades, and such knowledge can pave the way for better drug designs with greater efficacy and limited adverse effects.

Limitations of our study include small sample sizes for finer quantitative phenotypes such as cig per day, which have limited our power to capture rare variant associations of genes mediating aversive effects of nicotine (for example, CHRNA5) and those related to nicotine metabolism (for example, CYP2A6)39,40. As is often the case, individuals of non-European ancestries were under-represented in our study cohorts, which has limited the generalizability of the findings to all ancestries60,61. However, we involved a substantial number of individuals of AMR ancestries, who belong to one of the most under-represented populations in human genetic studies, a step in the right direction46. With growing awareness of the importance of diversity in human genetic studies, the representation of non-European ancestries is expected to improve in future studies60,61. Finally, we have focused only on the coding regions of the genome captured via whole-exome sequencing, and therefore we may have missed rare variants with large effects on smoking behavior residing in noncoding regulatory regions. With the recent increase in large-scale whole-genome-sequencing efforts, rare large-effect regulatory variants influencing human diseases and traits are being discovered, and such discoveries may have the potential to lead to drug targets62. However, the question of whether whole-genome sequencing is a more cost-effective investment than whole-exome sequencing for drug target discovery has yet to be answered.

To conclude, we have performed a large-scale ExWAS of smoking behavior and identified a protective association between rare coding variants in CHRNB2 and smoking. The results align with the findings from published knockout animal models and the mechanism of action of varenicline that is currently in use to aid smoking cessation and will support future therapeutic developments to treat smoking addiction.

Methods

Participating cohorts

UK Biobank

The UKB is an open-access, large population cohort of 500,000 individuals established in the United Kingdom27,63. The participants were, in general, community-dwelling middle-aged to old-aged volunteers who were recruited between 2006 and 2010 through invitations sent by mail63. The age of the participants ranged between 40 and 69 years at the time of recruitment. A deep set of phenotypes has been collected from the participants prospectively, including physical, biochemical and multimodal imaging measures, disease history based on electronic health records (EHRs) and a wide range of environmental measures obtained via touchscreen and web-based questionnaires. The smoking phenotypes that we studied in this project were based on the information collected through lifestyle and environment touchscreen questionnaires (data field category 100058). The health-related phenotypes that we studied including the history of lung and vascular diseases are based on ICD-10 codes from the EHRs or self-reported or a combination of both.

Geisinger Health System

The GHS participants come from Geisinger’s MyCode Community Health Initiative, which was established in 2007 to create a biorepository for research projects investigating the molecular and genetic bases of health and disease50,64. The participants were patients enrolled in the health care system who consented to participate in the MyCode initiative and gave access to their EHRs. The smoking phenotypes that we studied were based on the clinical history of smoking available in the EHR. Finer details on the smoking behavior such as the number of cigarettes smoked per day, age started smoking, etc. were available for a subset of patients through spirometry questionnaires available in the EHR.

Mexico City Prospective Study

The MCPS is a large prospective cohort of 150,000 individuals recruited between 1998 and 2004 with a major aim to investigate the known and new risk factors for mortality in individuals of Mexican descent46,65. The participants were residents of the Coyoacan and Iztapalapa districts of Mexico City. Phenotype data including information on smoking behavior were collected through house-to-house visits through interviewer-administered questionnaires.

Sinai

The Sinai participants were from the BioMe Biobank Program of the Charles Bronfman Institute for Personalized Medicine at the Mount Sinai Medical Center established in 2007 (ref. 66). The BioMe participants are patients enrolled in the Mount Sinai health system who consented to participate in the BioMe initiative and gave access to their EHRs. The smoking phenotypes that we studied were derived from the EHR.

Ethical approval and informed consent

All study participants have provided informed consent, and all participating cohorts have received ethical approval from their respective institutional review board. The UKB project has received ethical approval from the Northwest Centre for Research Ethics Committee (11/NW/0382)21,27. The work described here has been approved by the UKB (application no. 26041)21. The GHS project has received ethical approval from the Geisinger Health System Institutional Review Board under project no. 2006-0258 (refs. 50,64). The MCPS has received ethical approval from the Mexican Ministry of Health, the Mexican National Council for Science and Technology, the UNAM and the University of Oxford46,65. The BioMe biobank has received ethical approval from the institutional review board at the Icahn School of Medicine at Mount Sinai66.

Phenotype definitions

We defined six phenotypes for the primary analysis: (1) ever smoker: cases were those who ever smoked regularly (including both former and current smokers), and controls were those who never smoked in their lifetime; (2) heavy smoker: cases were those who smoked ten or more cigarettes per day (including both former and current smokers), and controls were those who never smoked in their lifetime; (3) former smoker: cases were those who smoked in the past but not at the present, and controls were current smokers; (4) nicotine dependence: cases were those who had an ICD-10 F17 diagnosis in the EHR, and controls were those who did not have an ICD-10 F17 diagnosis; (5) cig per day: number of cigarettes smoked per day in both current and former smokers; (6) age started smoking: age when the person first started smoking.

In addition to the six primary phenotypes, we also studied a set of secondary smoking phenotypes primarily derived from the smoking lifestyle questionnaire data in the UKB (data field category 100058). We also studied a selected list of disease phenotypes related to smoking, namely lung cancer (ICD-10 C34), COPD (ICD-10 J44), emphysema (ICD-10 J43), chronic bronchitis (ICD-10 J42), peripheral arterial disease (ICD-10 I73), coronary artery disease (ICD-10 I25) and myocardial infarction (ICD-10 I21).

Exome sequencing and variant calling

The exomes of individuals from all participating cohorts were sequenced at the RGC. Exome-sequencing and variant-calling workflows followed for each of the participating cohorts are described in detail elsewhere10,21,46,64,67. Briefly, the DNA source for exome sequencing in all the cohorts was peripheral blood. The DNA samples were first enzymatically fragmented into 200-bp DNA libraries, to which 10-bp barcodes were added to facilitate multiplexed operations. Exome regions containing DNA fragments were captured overnight using a modified version of the xGen probe from Integrated DNA Technologies. The captured fragments were then amplified by PCR and sequenced in a multiplexed manner using 75-bp paired-end reads on the Illumina NovaSeq 6000 platform. On average, 20× coverage was achieved for more than 90% of the target sequences in 99% of the samples.

Sequenced reads were mapped to the hg38 reference genome using BWA-MEM to create BAM files. Duplicated reads were marked for exclusion using the Picard tool. Next, variant calling was performed at individual sample levels using the WeCall variant caller to create per-sample gVCF files to enable a sample-level filter. Data from samples with low sequence coverage (<85% of the targeted bases achieving >20× coverage), excess heterozygosity, disagreement between genetic and reported sex, disagreement between exome and array genotype calls and genetic duplicates were removed. The remaining high-quality gVCF files were merged into a single project-level VCF (pVCF) file using the GLnexus joint genotyping tool. A further variant-level filter was applied to the multi-sample pVCF file. SNVs with read depth <7 and indels with read depth <10 were removed. Also, variants without either at least a single homozygous genotype or a single heterozygous genotype with allele balance ratio ≥0.15 (≥0.20 if indel) were removed. The quality-controlled pVCF files were then converted to analysis-ready PGEN format using PLINK version 2.

Variant annotation

Variants called from exome-sequencing data were annotated using the SnpEff tool68. Each variant was assigned the most severe consequence across all the protein-coding transcripts for which start and end positions were defined according to Ensembl release 85. Variants with any of the following annotations: stop gain, start lost, splice donor, splice acceptor, stop lost and frameshift corresponding to the non-ancestral allele were annotated as pLOF variants. Missense deleteriousness was predicted using five different algorithms, namely SIFT69, PolyPhen-2 HDIV and PolyPhen-2 HVAR70, LRT71 and MutationTaster72, and missense variants that were predicted to be deleterious by all five algorithms were annotated as ‘likely deleterious’ variants.

Genotyping and imputation

Genotyping was performed using DNA genotyping arrays that varied from cohort to cohort and are reported in detail in cohort-specific publications27,46,64. Briefly, UKB participants were genotyped using the Applied Biosystems UK BiLEVE Axiom Array or the Applied Biosystems UKB Axiom Array; GHS participants were genotyped using either the Illumina Infinium OmniExpressExome or the Global Screening Array; and MCPS and Sinai participants were genotyped using the Global Screening Array. Standard quality-control procedures were followed to retain only high-quality genotyped variants, which were then used for imputing common variants using the TOPMed LD reference panel73. For all cohorts, imputation was performed in the TOPMed Imputation Server by uploading the quality-controlled genotypes in randomized batches. Following imputation, we retained only variants with MAF > 0.01 and imputation INFO score > 0.8 for the analysis reported in the current study. After all quality control, the final number of common variants included in the cross-ancestry meta-analyses ranged from ~6.7 million for the ever smoker phenotype to ~14 million variants for the cig-per-day phenotype (the final number of variants decreased as expected with increases in the number of cohorts included in the meta-analyses). Appropriate variables for the genotyping arrays and the imputation batches were used as covariates in all analyses of imputed variants.

Genetic ancestry inference

Genetic ancestries of the individuals from all participating cohorts were quantified using a set of common variants that were genotyped directly using the genotyping arrays21. We first computed principal components (PCs) in HapMap3 individuals using the publicly available genotype reference panel74; only high-confidence variants (MAF > 0.10, genotype missingness < 5% and Hardy–Weinberg equilibrium test P > 1 × 10−5) that were common between our dataset and HapMap3 were used for PC calculations. PCs were first computed in the HapMap3 samples on which the rest of the samples were projected. Individuals were assigned to one of five ancestral groups, namely, Europeans, Africans, AMR, East Asians and South Asians, if their likelihood of belonging to a particular ancestry was >0.3; the likelihood estimate was calculated using a kernel density estimator trained on the HapMap3 PCs21.

Genetic association analysis

Genetic association analyses were performed within each of the cohorts separately using REGENIE software28, and the results were then meta-analyzed together using an inverse-variance-weighted approach using METAL software75. REGENIE uses a two-step whole-genome regression framework that controls for population stratification and sample relatedness in a cost-effective and computationally efficient manner. Briefly, in step 1, REGENIE computes trait-prediction values (also called local PGS) using a sparse set of genotypes, which are typically the array genotypes. In step 2, REGENIE computes the variant associations with phenotypes using either logistic or linear regression, where the trait-prediction values computed in step 1 are included as covariates along with other covariates, namely the first 20 genetic PCs computed using common variants, the first 20 genetic PCs computed using rare variants, age, age squared, sex, an interaction term between age and sex and genotyping batches. Specifically, for binary traits with imbalanced case–control ratios, REGENIE uses a fast Firth regression, which has been shown to perform better than saddlepoint-approximation correction used in the logistic mixed-model approach implemented in software such as SAIGE76. For burden analysis, REGENIE first creates a pseudo-genotype, described as a burden mask, by collapsing a set of variants (see Supplementary Table 2 for the different burden definitions used) into a single categorical variable and then treats this burden mask in the same manner as a variant genotype to compute association statistics.

For the top burden associations, we performed a sensitivity analysis called LOVO implemented in REGENIE. To perform LOVO, REGENIE creates a series of burden masks iteratively for a given set of variants, where, during each iteration, one variant is left out of the burden mask. The created burden masks are then tested for association with the phenotype of interest. Variants that contribute substantially to the burden association will cause a large drop in the statistical significance when left out. Therefore, such an approach can isolate variants that are mainly driving the association and can help evaluate whether a burden association is driven by multiple variants or only a single variant; this is important, as, in the latter, the inferred effect direction cannot be attributed to all variants that were included in the burden mask.

For the top burden associations, we also tested whether the associations were driven by any nearby common variant signals. For this, we iteratively included the most significant common variant observed within 1 Mb on either side of the gene start as a covariate in the REGENIE regression analysis until no nearby common variants with P < 0.01 were observed. The burden results from the conditional analysis in each of the cohorts were then meta-analyzed together.

FinnGen analysis

We downloaded the associations of variant rs202079239 with 3,095 disease endpoints in the FinnGen database using their web browser (https://r7.finngen.fi/variant/1-154575801-C-G)30. Through a string search, we extracted associations related to smoking, substance abuse, addiction, COPD and other lung diseases. To test for enrichment of protective associations (OR < 1) in the extracted phenotypes, we did a hypergeometric test using the ‘phyper’ function implemented in the R base package by passing the following values: q = 36 (number of associations with OR < 1 among the smoking-related phenotypes), m = 2,018 (number of associations with OR < 1 among all phenotypes), n = 1,077 (number of associations with OR > 1 among all phenotypes) and k = 47 (total number of smoking-related phenotypes extracted).

Association of rare variant burden at known GWAS loci

The most recent GWAS by Saunders et al. has identified 1,647 loci associated with one or more smoking traits, and, furthermore, the authors have mapped a set of ‘high-priority genes’ through statistical fine mapping19. Leveraging these results, we analyzed rare variant burden associations with our six primary smoking phenotypes focused on two gene sets: high-priority genes (n genes = 788) and a broader list of genes that are located close to any of the 1,647 GWAS loci reported by Saunders et al. (n genes = 1,177)19. Similar to our primary analysis, we studied pLOF-only and pLOF-plus-deleterious missense variant burden at five allele frequency thresholds for each of the genes. We applied an FDR of 1% to correct the P values for multiple testing.

CHIP mutation analysis

We identified CHIP mutations in the exome-sequencing data of UKB and GHS participants using a somatic mutation-calling pipeline, which we have described in detail in a previous publication focused on CHIP33. Briefly, we used the somatic mutation caller Mutect2, which uses variant mapping and allele-frequency measures to call somatic mutations against a background of germline variants and sequencing errors. CHIP mutation calls were then refined using exome data of a set of reference individuals without somatic mutations (sampled from the lower tail of the age distribution). This was followed by a series of quality-control filtering to identify a final set of highly confident CHIP mutations. In the current work, we studied only the CHIP mutations identified in the eight most recurrent CHIP genes (DNMT3A, TET2, ASXL1, PPM1D, TP53, JAK2, SRSF2 and SF3B1)33.

To test whether the ExWAS associations of ASXL1 and DNMT3A are driven by CHIP mutations, we constructed gene burden masks that excluded CHIP mutations and performed burden association tests using REGENIE and compared with the results based on burden masks that included all rare variants. Furthermore, we constructed burden masks for all eight recurrent CHIP genes using only the CHIP mutations and performed burden analysis using REGENIE. We also tested the associations of VAF of the CHIP mutations with the six smoking phenotypes in a merged genetic dataset of CHIP mutation carriers in the UKB (n = 28,348) and the GHS (n = 11,063) cohorts. We aggregated the VAF estimates for CHIP mutations within each (and across all) of the eight genes and tested their associations with smoking phenotypes through regression analysis adjusted for age, sex, the first ten genetic PCs and a dummy variable for the cohort of origin.

Identification of independent known and new GWAS loci

To define approximate LD-independent GWAS signals, we used conditional and joint analysis (COJO) implemented in the GCTA software77. For the LD reference, we used individual-level genotype data of 10,000 randomly sampled unrelated individuals of either European ancestry (for cross-ancestry and European-specific GWAS) or AMR ancestry (for AMR-specific GWAS). The standard errors of the GWAS summary statistics were adjusted for the LD score regression intercept (LD score regression analysis) before GCTA-COJO analysis. We defined GWAS loci as ‘known’ if the index variant in the loci was in LD (R2 > 0.1) with genome-wide significant variants reported previously16. LD calculations were carried out using PLINK version 2 (ref. 78). Our list of known GWAS loci came primarily from Liu et al.16. However, before declaring a variant as ‘new’, we also manually queried the variants in the GWAS Catalog to ensure that the variants were not in LD with variants reported in other smoking GWAS publications.

LD score regression analysis

We calculated SNP-h2, that is, the proportion of phenotypic variance explained by the common variants, using LD score regression software43. We used a European LD reference panel built in house using a random set of 10,000 unrelated European individuals from the UKB following the instructions provided by the authors of the LD score regression software. Genetic correlations were also computed using LD score regression software using the European LD reference panel. We used LD score regression also to quantify the population stratification that is known to inflate GWAS association statistics43. We computed LD score intercepts for all GWAS runs including the cross-ancestry and AMR-specific GWAS and then compared the values to the corresponding genomic control (GC) λ values. A GC λ > 1 but an intercept = 1 suggests that the observed inflation in the test statistics is fully due to polygenicity. For phenotypes such as smoking that are substantially influenced by environmental factors, it is common to have intercept values slightly above 1 (but still lower than GC λ), indicating that there is inflation in test statistics due to factors other than polygenicity, for example, population stratification, cryptic relatedness, etc.43. To remove such inflation, we applied a correction factor79 to the test statistics to constrain the LD score intercept close to 1. We scaled the standard errors of the variant associations by a factor of the square root of the LD score intercept. This is a better alternative to GC correction (commonly practiced in large-scale consortium GWAS), as GC correction tends to overcorrect the statistics, removing true polygenic signals79. The LD score statistics before and after intercept correction are reported in Supplementary Table 15. We used the European LD reference panel even for cross-ancestry as well as AMR-specific GWAS, as there are no well-established guidelines on how to handle cross-ancestry or admixed ancestry-based GWAS results. We acknowledge that this has likely biased the results toward variants that are shared between European and other ancestries.

Polygenic score analysis

We calculated smoking PGS for the UKB participants using SNP weights based on a GWAS of the ever smoker phenotype conducted in an independent sample. We obtained the summary statistics of the most recent GWAS of the ever smoker phenotype from the GSCAN consortium based on an analysis of all the participating GSCAN cohorts except the UKB and 23andMe19. To improve the statistical power of the PGS, we meta-analyzed the GSCAN results with the GWAS results of the GHS cohort, which together yielded a total sample size of 482,096 individuals. We then refined the SNP effect sizes in the GWAS summary statistics using PRS-CS software80, which uses a Bayesian approach to calculate SNP posterior effect sizes under continuous shrinkage priors based on an external LD reference panel. The refined SNP weights are then used to compute PGS using PLINK version 2 software78.

We performed two types of analysis. First, we studied the associations of PGS and CHRNB2 pLOF-plus-missense burden with heavy smoking using logistic regression analysis, in which the heavy smoker phenotype was coded as the dependent variable (that is, outcome), and PGS, burden mask, an interaction term between PGS and burden mask and relevant covariates (the same as the ones used in the GWAS) were coded as independent variables (regression formula: heavy smoker ≈ PGS + burden mask + PGS × burden mask + covariate1 + …covariaten). Second, we binned UKB individuals into quintiles (five equally sized groups) based on their smoking PGS. Individuals within each quintile were further divided into carriers and non-carriers of CHRNB2 pLOF or likely deleterious missense variants at MAF < 0.001. The prevalence of heavy smokers was then compared between carriers and non-carriers within each quintile; the standard error was calculated using the formula √(pq/n), where n is the number of individuals in the group, p is the prevalence of heavy smokers in the group and q is 1 − p. We also tested the statistical difference in the prevalence of heavy smokers between carriers and non-carriers of rare variant burden using logistic regression analysis adjusted for relevant covariates (the same as the ones used in the GWAS). The OR, 95% CI and the P value for each quintile are reported in Fig. 7.

Power calculations

All power calculations were carried out in R using the package ‘genpwr’ available from CRAN81. In all cases, we computed effect sizes (β values) using the function ‘genpwr.calc’ with the following input parameters: power = 0.80, calc = ‘es’, model = ‘logistic’ for binary phenotypes and ‘linear’ for quantitative phenotypes, α = ‘5 × 10−8’ for GWAS and ‘4.5 × 10−8’ for ExWAS, MAF = values ranging from 0 to 0.5, True.model = ‘additive’ and Test.model = ‘additive’, n = total sample size, case_rate = n cases(n total)−1 (for binary phenotypes) and sd_y = 1 (for quantitative phenotypes).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.