A pangenome reference of wild and cultivated rice

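We selected a total of 149 rice accessions on the basis of the phylogenetic relationships of 1,529 reported accessions19. These accessions are preserved at the China National Rice Research Institute in Hangzhou, China, and the National Institute of Genetics in Mishima, Japan. For sampling, all accessions were planted in Lingshui county, Hainan.

We extracted genomic DNA from fresh leaves that had been flash-frozen immediately in liquid nitrogen and stored at −80 °C. HiFi sequencing was performed on the Pacific Biosciences Sequel II platform for 133 accessions, and 16 accessions were sequenced on the PromethION platform, following their respective standard protocols. After removal of low-quality reads, the average read-length N50 was 16 kb for the HiFi-sequenced accessions and 26 kb for the ONT-sequenced accessions. The sequencing yielded 6.8–17.4 Gb of HiFi reads for the 133 rice accessions and 28.4–55.3 Gb of nanopore reads for the 16 rice accessions. Leaves of 142 rice accessions were also used to extract genomic DNA and construct Illumina libraries with an insert size of 350 bp, which were then sequenced on the HiSeq 4000 platform (Illumina). Details of the geographical origins and sequencing coverage of these accessions are provided in Supplementary Tables 1 and 2.

We selected leaves from 30 representative accessions for Hi-C sequencing. Hi-C libraries were constructed from these samples following the standard protocol and were later sequenced on the NovaSeq platform (Illumina).

Tissues were collected from various plant parts, including leaves during vegetative growth from 149 accessions, roots from 146 accessions, a further tissue (3–5 cm) from 137 accessions and seedlings (approximately 15 days old) from 122 accessions. RNA was then extracted using TRIzol (Invitrogen). RNA sequencing (RNA-seq) libraries with an insert size of 350 bp were prepared according to the provided instructions and sequenced on the HiSeq 4000 platform (Illumina).

We assembled the genomes of 133 accessions using hifiasm61 (v.0.16.0) with default parameters. This process generated both p-contigs and a-contigs. For seven accessions sequenced with nanopore technology, we used NECAT62 (v.20200803) with default parameters for assembly. The remaining nine accessions were assembled using NextDenovo63 (v.2.1). To improve the consistency and single-base accuracy of the nanopore genomes, we performed one round of polishing with Racon64 (v.1.0.0) using nanopore reads, followed by two rounds of polishing using Illumina reads (v.1.0.2). To exclude organellar genomes, we aligned the chloroplast (NC_001320.1) and mitochondrial (NC_011033.1 and NC_001751.1) reference sequences of IRGSP-1.0 to the assembly of each accession using MUMmer66 (v.4.0.0beta2). Contigs showing more than 50% coverage and a length of less than 500 kb were specifically targeted and removed. The remaining contigs, attributed to the nuclear genome, were then evaluated and anchored onto pseudo-chromosomes using ALLMAPS67.

To achieve chromosome-level genome assemblies, the resulting Hi-C reads were aligned to the contigs using chromap68 (v.0.2.6). These alignments were then converted to BED or BAM format with SAMtools69 (v.1.20) and processed using YaHS70 (v.1.1). Subsequently, juicer_tools (v.1.19.02) generated Hi-C contact maps for visualization, followed by manual refinement using Juicebox Assembly Tools71 (v.1.9.8).

To assess the quality of the genome assemblies, we used several indices. First, we evaluated gene completeness using BUSCO72 (v.5.2.2) with the embryophyta_odb10 database, and repeat completeness using the LTR Assembly Index (LAI)25 of LTR_retriever (v.2.9.0) with the parameter ‘-maxlenltr 7000’. In addition, the overall assembly quality value (QV) was measured using Inspector26 (v.1.0.1), a reference-free assembly evaluator. To complement our evaluation, the number of mismatches between the Nipponbare genome assembled in this study and the reference genomes IRGSP-1.0 and T2T-NIP was assessed using QUAST73 (v.5.0.1).

For genome-level synteny analysis between the Nipponbare (IRGSP-1.0) genome and four representative genomes, O. sativa (Gla4), O. rufipogon (W1943), O. longistaminata (Ol3101) and O. meridionalis (OM1952), we used the NUCmer program of MUMmer66 (v.4.0.0beta2). The identified syntenic blocks were filtered using the delta-filter program with the ‘-i 85 -l 5000 -o 85’ parameters and then visualized using the mummerplot program.

We performed a comparative genomic analysis between the chromosomes constructed using Hi-C technology and the pseudo-chromosomes obtained with the ALLMAPS method. We first aligned the genomes using the NUCmer program of MUMmer66. The identified syntenic blocks were filtered using the delta-filter program with the ‘-m’ parameter. We then compared the alignments between the two chromosome-level assemblies and identified syntenic blocks and structural rearrangements using SyRI74 (v.1.4).

Three different strategies, namely ab initio, homology-based and transcriptome-based prediction, were integrated to generate predicted gene models. For ab initio prediction, four programs were used: Fgenesh+75 (v.3.1.1), SNAP76 (v.2006-07-28), GeneMark-ES77 (v.4.68_lic) and AUGUSTUS78 (v.3.3.2). For homology-based prediction, homologous protein sequences from the rice genome sequence database (v.7.0, http://rice.plantbiology.su.edu) were aligned using the program cited as ref. 79 (v.1.7.1.1.1). RNA-seq reads from four different tissues of each accession were mapped to the respective assembled genome using HISAT2 (ref. 80) (v.2.0.5) and then assembled into transcripts with StringTie81 (v.2.0). We performed de novo and genome-guided RNA-seq assemblies using Trinity82 (v.2.12.0) and subsequently aligned them with PASA83 (v.2.0.1). By assigning appropriate weights to each prediction method, we combined all predicted gene structures into consensus gene models using EVidenceModeler84 (v.1.1.1). The protein-coding annotations were evaluated using BUSCO72 (v.5.2.2).

For functional annotation of genes, InterProScan85 (v.5.56-89.0) was used to predict potential protein domains. To calculate gene expression levels, low-quality RNA-seq reads were first removed using fastp86 (v.0.23.0) with the ‘-l 30’ parameter. The remaining reads were then mapped with Salmon87 (v.1.6.0) in mapping-based mode, with the parameters ‘-l A --validateMappings --gcBias’, against an index built from the annotated transcripts with the genome concatenated to their end as a decoy sequence. Finally, gene expression levels were quantified by counting the number of reads mapped to each transcript and calculating transcripts per million (TPM) values.

We developed a pipeline to identify allelic genes between the p-contigs and a-contigs of the HiFi genome assemblies (Supplementary Fig. 31). First, allelic gene pairs were identified as reciprocal hits using GeneTribe88 (v.1.2.0) with default parameters. Subsequently, the full-length sequences of genes from the a-contigs were searched against the p-contigs using BLASTN (v.2.9.0+). Genes with identity and coverage values both equal to 100% were classified as homozygous alleles, and all other alleles were classified as heterozygous alleles. In addition, genes that were present in the a-contigs but missing in the p-contigs (MIP genes) were identified under two strict conditions: (1) no hit was found when gene sequences from the a-contigs were aligned to the p-contigs using BLASTN (v.2.9.0+); and (2) no hit was found when CDSs from the a-contigs were aligned to the p-contigs using GMAP89 (v.2021-05-27). Other, unclassified genes on the a-contigs were defined as ‘others’.

The identification of differentially expressed alleles was based on three criteria that had to be satisfied simultaneously.

The RGAugury90 pipeline was used to predict resistance gene analogues (RGAs) in 133 wild rice accessions (comprising 129 O. rufipogon, 3 O. longistaminata and 1 O. meridionalis) and 129 cultivated rice accessions (Supplementary Table 8). On the basis of combinations of these RGA domains and motifs, RGA candidates were assigned to four major families: RLKs, NBSs, TM-CC proteins and RLPs. Given that RGAs tend to cluster in the genome, tandem duplications were identified from their positions on the chromosomes using the ‘jcvi.compara.catalog’ module of the MCscan91 (Python version) pipeline with default parameters. RGAs at each locus were classified as singletons, pairs or clusters if the locus contained one, two or more than two RGAs, respectively.

The collinearity of each accession relative to all other accessions was constructed using the ‘ortholog’ tool of the ‘jcvi.compara.catalog’ module in the MCscan (Python version) pipeline. Tandem genes were integrated using the ‘mcscan’ tool of the ‘jcvi.compara.synteny’ module with the ‘--mergetandem’ parameter. Then, all collinear blocks of each accession against all other accessions were joined into a matrix using the ‘join’ tool of the ‘jcvi.formats.base’ module. Finally, a comprehensive RGA matrix was created by merging, sorting and deduplicating all collinear matrices using custom scripts.

The CentO satellite repeat sequences in rice were aligned to the nuclear genome using ClustalW93 (v.2.1). The resulting alignment file was converted to Stockholm format using an online sequence conversion tool (http://sequenceconversion.bugaco.com/converter/converter/biology/sepences/) and then used as input to build an HMM file with the ‘hmmbuild’ function in HMMER94 (v.3.1b2). Next, a homology search was performed with nhmmer to identify CentO repeats on all chromosomes, with the E-value threshold set to 1 × 10⁻⁵. To select a subset of CentO repeats for the subsequent similarity comparison among genomes using MAFFT95 (v.7.490), we extracted one CentO repeat unit at every fiftieth interval on each chromosome. The subsequent phylogenetic analysis was performed using IQ-TREE96 (v.1.6.12) with 1,000 bootstrap replicates.

To identify telomere sequences on each chromosome, the seven-base telomeric repeat 5′-CCCTAAA-3′ and its reverse complement were searched for directly using a custom script.

TEs in the 149 chromosome-level genomes were identified with the EDTA97 (v.2.1.0) pipeline, using a manually curated rice TE library (rice6.9.5.liban) and the annotated CDSs of each genome. The individual non-redundant TE libraries generated for each genome were then combined with the curated TE library (rice6.9.5.liban) using panEDTA37, yielding a comprehensive pangenome TE library. This pangenome TE library was then used to re-annotate genome-wide TEs in the 149 assemblies from our study, as well as in 28 rice assemblies from a previously published pangenome of 33 cultivated rice accessions16 (excluding Oryza barthii, Oryza glaberrima, aus, basmati and WSSM), using RepeatMasker (http://repeatmasker.org) (v.4.1.2). The insertion time of each intact Gypsy and Copia element was estimated using LTR_retriever98 (v.2.9.0) with default parameters.

Following a previous study37, the dynamics of LTR families were determined by comparing the family size and the ratio of solo to intact LTRs in each family between two groups. Families showing significant differences in both size and solo-to-intact LTR ratio were classified as ‘removal families’. Families that differed significantly only in family size were classified as ‘amplification families’. Conversely, families with a significant change in the solo-to-intact LTR ratio but no corresponding shift in family size were classified as ‘balanced families’. Families showing no significant difference in either dimension were termed ‘drifting families’. To classify the dynamics of the Or-IIIa-larger Gypsy families, we first identified solo Gypsy elements using the ‘solo_finder.pl’ script in the LTR_retriever package and obtained the family information of intact Gypsy elements from the final annotation results of each genome. The solo-to-intact ratio was then calculated by dividing the number of solo elements by the number of intact elements in each Gypsy family. Finally, we applied Student's t-test to compare the family size and the solo-to-intact ratio between the Or-IIIa and japonica groups, with P < 0.01 as the cut-off for significance.

We mapped the sequences of identified insertions and deletions to the comprehensive pangenome TE library using BLASTN (v.2.12.0+).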
If both the identity and the coverage reached 80%, the PAV was defined as a TE insertion polymorphism (TIP)99. To identify the genes adjacent to the Or-IIIa-larger Gypsy families, we mapped the gene sequences against the TIP sequences containing the Or-IIIa-larger Gypsy families in each Or-IIIa genome. Genes with an identity of at least 95% and a coverage of at least 50% were classified as adjacent to these families. Gene Ontology (GO) enrichment analysis of these genes was performed in the R package clusterProfiler (v.4.6.2), with P ≤ 0.05 as the threshold for significance.

We adopted three strategies to detect PAVs (large insertions and deletions) in the 133 HiFi genomes. (1) We mapped HiFi reads to the IRGSP-1.0 reference genome using pbmm2 (https://github.com/PacificBiosciences/pbmm2/) (v.1.4.0) with the ‘--preset CCS’ parameters. Then, pbsv (https://github.com/PacificBiosciences/pbsv) (v.2.6.2) was used for variant calling of each accession with the parameters ‘--min-sv-length 30 --max-ins-length 100K --max-dup-length 100K’. (2) We mapped HiFi reads to the IRGSP-1.0 reference genome using minimap2 (ref. 100) (v.2.21-r1071) with the ‘-x map-hifi’ parameters. CuteSV101 (v.1.0.11) was then used for variant calling of each accession with the parameters ‘--min_support 3 --min_size 30 --max_size 100000 --max_cluster_bias_INS 1000 --diff_ratio_merging_INS 0.9 --max_cluster_bias_DEL 1000 --diff_ratio_merging_DEL 0.5’. (3) We mapped assembled contigs to the IRGSP-1.0 reference genome using minimap2 (ref. 100) (v.2.22-r1071) with the parameters ‘-x asm5 --cs -r 2k’. For variant calling, SVIM-asm102 (v.1.0.7) was operated in ‘haploid’ mode, with the parameters ‘--min_sv_size 30 --max_sv_size 100000’. To obtain high-confidence variations, we merged all insertions within a 50-bp range and deletions within a range of 50% of their length using SURVIVOR103 (v.1.0.6).
We reported only those variants that were corroborated by at least two of the calling methods and for which there was a consensus on the variant type. To detect PAVs in 16 nanopore genomes, we tailored our approach by using the second and third strategies of the above method, with some modifications to suit the characteristics of nanopore data: (1) the minimap2 (ref. 100) (v.2.21-r1071) parameters for mapping nanopore reads were adjusted to ‘-x map-ont’; (2) the parameters for CuteSV101 (v.1.0.11) were modified to increase the minimum support threshold to 10; and (3) the minimum supporting caller number in SURVIVOR103 (v.1.0.6) was adjusted to one.

To detect translocations and inversions, we first aligned each pseudo-chromosome to the reference genome across 149 genomes and then used the SyRI74 (v.1.4) pipeline for variation calling. Per the classifications provided by SyRI, INV variants were categorized as inversions, in comparison with the Nipponbare reference. Both TRANS and INVTR variants were categorized as translocations. To detect small indels, we extracted INS and DEL variants consisting of fewer than 30 bp.

We implemented both read-mapping-based and assembly-based approaches to identify SNPs using Nipponbare as the reference genome. For read mapping, we called SNPs through Longshot104 (v.0.4.1), using the alignment results of reads by minimap2 (ref. 100) during the process of PAV calling. Parameters were set to ‘-c 3 -D 3:10:50’ for HiFi genomes and ‘-c 10 -D 3:10:50’ for nanopore genomes. For assembly-based calling, we first aligned each contig to the reference genome using the ‘nucmer’ program and refined the alignments to one-to-one matches using the ‘delta-filter’ program. SNPs were then identified using the ‘show-snps’ program with the ‘-C -I’ parameters, all from the MUMmer package66 (v.4.0.0beta2). To minimize false positives, we only considered SNPs detected by both methodologies.
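The final filtering step above, keeping only SNPs found by both the read-mapping and the assembly-based approach, amounts to a set intersection. The following is a minimal sketch of that idea, not the authors' script: the `Snp` record, the `consensus_snps` helper and the toy coordinates are hypothetical, and it assumes both call sets have already been parsed into (chromosome, position, reference allele, alternative allele) records on the same reference.

```python
from typing import Iterable, List, NamedTuple


class Snp(NamedTuple):
    """Hypothetical minimal SNP record; real call sets carry more fields."""
    chrom: str
    pos: int   # 1-based position on the reference
    ref: str
    alt: str


def consensus_snps(read_based: Iterable[Snp], assembly_based: Iterable[Snp]) -> List[Snp]:
    """Return SNPs reported by both call sets, sorted by chromosome and position.

    A call only counts as shared when chromosome, position and both alleles agree.
    """
    shared = set(read_based) & set(assembly_based)
    return sorted(shared)


# Toy data (made up for illustration only).
read_calls = [
    Snp("Chr1", 1050, "A", "G"),
    Snp("Chr1", 2200, "C", "T"),
    Snp("Chr2", 77, "G", "A"),
]
asm_calls = [
    Snp("Chr1", 1050, "A", "G"),
    Snp("Chr1", 2200, "C", "A"),  # same site, different alternative allele
    Snp("Chr3", 5, "T", "C"),
]

if __name__ == "__main__":
    for snp in consensus_snps(read_calls, asm_calls):
        print(snp.chrom, snp.pos, snp.ref, snp.alt)
```

Matching on the alleles as well as the position is a design choice of this sketch: a site where the two methods disagree on the alternative allele is treated as unconfirmed.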
For a population of 280 accessions, including 132 long-read sequencing genomes sourced from published studies (Supplementary Table 16), SNP calling was performed by mapping the assembled contigs using MUMmer66 alone. We merged the resulting SNP datasets for each sample using a custom Perl script. To compile a high-confidence SNP dataset, we used the ‘VariantFiltration’ function in the Genome Analysis Toolkit105 (v.4.1.4.0) with the ‘--cluster-window-size 10 --cluster-size 3’ parameters. This dataset served as the basis for further evolutionary analysis. Finally, we annotated and predicted the effects of our identified SNPs using SnpEff106 (v.55.0). The same method was also applied to the population of 510 samples (Supplementary Table 17).

We performed all-versus-all CDS alignment in the pangenomes for 16 O. sativa accessions, 129 O. rufipogon accessions and a combined set of 145 wild–cultivated rice accessions (16 O. sativa and 129 O. rufipogon) using BLASTN (v.2.2.18). If a gene was aligned with at least 95% identity and at least 50% coverage, it was considered present in the corresponding genome. On the basis of their frequency, we classified genes into the following four categories: core (those present in all individuals), soft-core (those present in more than 90% of samples but not all), dispensable (those present in more than one but less than 90% of samples) and private (present in only one accession). To achieve a balanced comparison, we incorporated 113 non-redundant cultivated rice genomes from 3 previously published pangenomic datasets to match the size of the wild rice population (Supplementary Table 10). Gene annotation was performed uniformly across all samples using a consistent methodology. In the dataset comprising 129 O. rufipogon and 129 O. sativa accessions, genes present exclusively in wild rice and absent in all cultivated rice were defined as wild-rice-specific genes.
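The frequency-based categories and the wild-rice-specific definition above can be expressed as two small functions. This is an illustrative sketch only: the function names, the presence table and the accession labels are hypothetical, and a gene present in exactly 90% of samples (a case the text leaves open) is assigned to the dispensable class here as an assumption.

```python
from typing import Dict, List, Set


def classify_gene(n_present: int, n_total: int) -> str:
    """Assign a pangenome category from the number of accessions carrying a gene."""
    if not 1 <= n_present <= n_total:
        raise ValueError("n_present must be between 1 and n_total")
    if n_present == n_total:
        return "core"
    if n_present == 1:
        return "private"
    # Integer arithmetic avoids floating-point issues at the 90% boundary.
    if 10 * n_present > 9 * n_total:
        return "soft-core"
    # Assumption: exactly 90% falls here, since the text does not say.
    return "dispensable"


def wild_specific_genes(
    presence: Dict[str, Set[str]], wild: Set[str], cultivated: Set[str]
) -> List[str]:
    """Genes found in at least one wild accession and in no cultivated accession."""
    return sorted(
        gene
        for gene, accessions in presence.items()
        if accessions & wild and not accessions & cultivated
    )


# Toy presence table (hypothetical gene and accession names).
wild = {"W1", "W2"}
cultivated = {"C1", "C2"}
presence = {
    "geneA": {"W1", "W2", "C1", "C2"},
    "geneB": {"W1", "W2"},
    "geneC": {"C1"},
}

if __name__ == "__main__":
    total = len(wild | cultivated)
    for gene, accessions in sorted(presence.items()):
        print(gene, classify_gene(len(accessions), total))
    print("wild-specific:", wild_specific_genes(presence, wild, cultivated))
```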
To construct three distinct pangenome graphs for our study, we applied the Minigraph-Cactus pipeline107 to the assembled genomes of 16 O. sativa, 129 O. rufipogon and a combined set of 145 accessions. The first step involved using minigraph108 (v.0.19-r551) to develop a primary pangenome graph, capturing the SVs within the input assemblies. Subsequently, these assemblies were remapped onto the primary graph using minigraph108. The mapping results were then used as the input for Cactus109 (v.2.2.1), which facilitated the generation of the final graphs. We defined the graph size as the total length of all nodes, and nodes that were not included in the reference genome (non-ref) were defined as novel sequences. To call variants for 142 accessions from our study (the remaining 6 samples lacked next-generation data) and 407 newly sequenced samples (33 O. sativa and 374 O. rufipogon or Oryza nivara) from another study24 (Supplementary Table 14), the Illumina short paired-end reads from each accession were mapped against the graph-based cultivated–wild pangenome using vg giraffe34 (v.1.43.0). The variations were then called using DeepVariant110 (v.1.6.1) with the NGS model, and all individual variants were merged using GLnexus111 (v.1.4.1-0-g68e25e5).

Chloroplast genomes are highly conserved across different species and are frequently used to construct phylogenetic trees, which can be instrumental in studying species classification and understanding evolutionary relationships112. To assemble chloroplast genomes of wild rice in our study, the HiFi reads were first aligned to the reference chloroplast genome of Nipponbare (GenBank ID: GU592207.1) using minimap2 (ref. 100) (v.2.21-r1071) with the ‘-x map-hifi’ parameters. Chloroplast-derived reads with higher than 70% coverage were then extracted using a custom Perl script. The final assembly of the chloroplast genome was then performed using hifiasm61 (v.0.16.0) with default parameters.
Locally collinear blocks among 72 assembled chloroplast genomes, along with published chloroplast genomes of O. barthii (KF359904.1), Oryza glumipatula (NC_027461.1), O. longistaminata (NC_027462.1), O. meridionalis (OV049999.1), O. rufipogon (NC_017835.1) and Nipponbare, were identified for constructing multi-sequence alignments using HomBlocks113. The phylogenetic tree was then constructed using IQ-TREE96 (v.2.2.0.3) with 1,000 bootstraps. On the basis of the results, we re-identified three accessions of O. longistaminata and one of O. meridionalis.

We performed an all-versus-all comparison of the amino acid sequences of protein-coding genes using DIAMOND114 (v.2.0.15). These genes were from 149 genome assemblies and 31 assemblies (excluding the same species NIP and WSSM) from a cultivated rice pangenome16. The alignment results were then input into OrthoFinder115 (v.2.5.4) to find orthogroups and orthologues. Using 844 identified single-copy orthologues, we constructed a gene-based maximum-likelihood phylogenetic tree using IQ-TREE96 (v.2.2.0.3) with 1,000 bootstraps.

To determine the phylogenetic relationships of three populations, including wild rice and cultivated rice (Supplementary Tables 14, 16 and 17), we first converted the SNP VCF files into tfam format using PLINK116 (v.1.90b6.9 64-bit). After this, a kinship matrix was generated using EMMAX117 (v.beta-07Mar2010) with the ‘-v -h -s -d 10’ parameters. The neighbour-joining phylogenetic tree was then constructed using the PHYLIP package (https://phylipweb.github.io/phylip/) (v.3.66). For visualizing the resulting phylogenetic trees, the interactive tool iTOL118 was used.

We first identified core SNP subsets of three populations, including wild rice and cultivated rice (Supplementary Tables 14, 16 and 17), each exhibiting a minor allele frequency of more than 0.05 and a missing rate of less than 0.8, using VCFtools119 (v.0.1.16).
Further refinement was done using PLINK116 (v.1.07) to exclude SNPs with substantial LD (r² ≥ 0.5) in each sliding window (windows of 100 SNPs with steps of 10 SNPs). Archetypal analysis120 of these SNP sets was performed with the parameters ‘--tolerance 0.0001 --max_iter 400’.

The nucleotide diversity (π) of each group and the fixation index (FST) between different groups were both estimated using VCFtools119 (v.0.1.16) with a window size of 100 kb and a step size of 10 kb. The genome-wide LD decay pattern for each group was calculated using PopLDdecay121 (v.3.42) and plotted using the Plot_MultiPop.pl script in the PopLDdecay package with parameters ‘-bin1 500 -bin2 7000 -break 5000’. DST was calculated using PLINK116 (v.1.07) with the ‘--genome’ and ‘--genome-full’ options. Heat plots of 1 − DST matrices were made with the ggplot2 package in R (v.4.1.3).

We used MSMC2 (ref. 122) (v.2.1.4) to infer the population separation history. Our analysis began with the preparation of a negative mask file for the coding region of IRGSP-1.0 (MSU7.0) and a mappability mask file using seqbility (http://lh3lh3.users.sourceforge.net/snpable.shtml) (v.20091110) and makeMappabilityMask.py. The phased SNP sites with uniquely mapped reads and mean coverage depths greater than threefold were acquired using Longshot104 (v.0.4.1) and the high-quality regions of each genome were acquired using the filtered results of show-snps from MUMmer66 (v.4.0.0beta2). The MSMC2 input files were constructed by merging VCF and mask files using the ‘generate_multihetsep.py’ script. Because O. rufipogon naturally uses both cross-pollination and self-pollination, we followed an established approach of constructing pseudodiploids, which has been widely used in similar studies of inbreeding species such as Caenorhabditis123, Arabidopsis thaliana124, soybean125 and African wild rice126,127. We randomly selected four samples from each population and treated each sample as a single haplotype.
We then paired chromosomes from haplotypes within the same population to construct pseudodiploids. The population split inference focused on 2 individuals (4 haplotypes) per group, calculating median population split times based on 50 random combinations for each comparative analysis. A mutation rate of 8.09 × 10⁻⁹ per site per generation128 and a generation time of one year were applied to estimate demographic history.

We used TreeMix129 (v.1.13) to infer population admixture graphs for major groups of O. rufipogon (Or-IIIa, Or-Ia and Or-Ib) and O. sativa (japonica, basmati, indica, intro-indica and aus) from East Asia and South Asia. Oryza officinalis (CC genome), O. longistaminata and O. meridionalis were set as outgroups to construct the phylogenetic tree. We systematically varied the number of migrations from 0 to 10, performing 10 iterations. For each migration event, TreeMix was executed by randomly sampling approximately 80% of the SNP loci using a random seed, applying the ‘-global -k 500’ parameters for global allele frequency estimation. The optimal number of migration edges (m = 4) was determined using the R package OptM130 (v.0.1.6).

To detect potential admixture events of the form (target; source 1, source 2), we performed an F3-admixture test using the qp3Pop program in ADMIXTOOLS131 (v.7.0.2). Under the null hypothesis that the target population is not a mixture of populations related to source 1 and source 2, the expected F3 statistic would yield a non-negative mean. A negative mean of the F3 statistic, on the other hand, would suggest admixture in the target population, with genetic contributions from groups related to source 1 and source 2. A z-score below −3 was considered indicative of significant admixture in the target population.
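To make the logic of the F3 test concrete, the sketch below computes a naive F3 estimate as the mean over SNPs of (c − a)(c − b), where c, a and b are allele frequencies in the target and the two source populations, and derives a z-score from a block jackknife over contiguous SNP blocks. It is a simplified illustration and not a reimplementation of qp3Pop: it omits the sample-size corrections and normalization that dedicated tools apply, and the function names, block count and toy frequencies are assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple


def f3_per_site(
    target: Sequence[float], src1: Sequence[float], src2: Sequence[float]
) -> List[float]:
    """Per-SNP contribution (c - a) * (c - b) from population allele frequencies."""
    return [(c - a) * (c - b) for c, a, b in zip(target, src1, src2)]


def f3_with_z(
    target: Sequence[float],
    src1: Sequence[float],
    src2: Sequence[float],
    n_blocks: int = 4,
) -> Tuple[float, float]:
    """Return (F3 estimate, block-jackknife z-score) using contiguous SNP blocks."""
    values = f3_per_site(target, src1, src2)
    n = len(values)
    if n_blocks < 2 or n < n_blocks:
        raise ValueError("need at least two blocks and one SNP per block")
    total = sum(values)
    estimate = total / n
    # Block boundaries: floor division keeps every block non-empty when n >= n_blocks.
    bounds = [i * n // n_blocks for i in range(n_blocks + 1)]
    leave_one_out = []
    for i in range(n_blocks):
        block = values[bounds[i]:bounds[i + 1]]
        leave_one_out.append((total - sum(block)) / (n - len(block)))
    mean_loo = sum(leave_one_out) / n_blocks
    variance = (n_blocks - 1) / n_blocks * sum((x - mean_loo) ** 2 for x in leave_one_out)
    std_err = math.sqrt(variance)
    z = estimate / std_err if std_err > 0 else float("nan")
    return estimate, z


# Toy frequencies (made up): the target sits between the two sources at every SNP.
target_freq = [0.5, 0.5, 0.5, 0.5]
source1_freq = [0.1, 0.2, 0.1, 0.2]
source2_freq = [0.9, 0.8, 0.9, 0.8]

if __name__ == "__main__":
    f3, z = f3_with_z(target_freq, source1_freq, source2_freq, n_blocks=4)
    verdict = "admixture signal" if z < -3 else "no significant signal"
    print(f"F3 = {f3:.4f}, z = {z:.2f}: {verdict}")
```

In practice, jackknife blocks are defined by genomic distance rather than by a fixed SNP count, so that linked sites fall in the same block.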
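Using a four-taxon model (((P1, P2), P3), PO), we calculated the D-statistic to perform the ABBA–BABA test, using the script calculate_abba_baba.r (https://github.com/palc/tutorials-1/tree/master/analysis_of_introgression_with_snp_data/src). With O. longistaminata designated as the outgroup, our analysis revealed a significantly positive D-statistic (P < 0.01), suggesting introgression between P3 and P2. To examine the segments introgressed from japonica into indica and aus in more detail, we computed the fd statistic across the genome in 100-kb sliding windows with a step size of 10 kb, using the script ABBABABAwindows.py from the genomics_general toolkit (https://github.com/simonhmartin/genomics_general). The minimum number of SNPs per window was set to 250, and the minimum proportion of samples genotyped per site was set to 0.4. Values of fd < 0 were converted to zero, and values of fd > 1 were converted to one. Finally, to evaluate the consistency between indica and aus of the regions introgressed from japonica, we categorized the putative introgressed segments in the top 10, 30 and 50 100-kb windows.

To quantify the genetic similarity between two groups, we performed a comprehensive analysis of all possible pairwise combinations. We focused on identifying ‘identical windows’, defined as windows with a similarity of more than 99.99%. The similarity index of each 10-kb window was calculated from numDiff, the number of differing SNPs, and numNaN, the number of sites with missing data in the window. This approach was also used to predict potential introgressed fragments between two taxa. For this purpose, we selected one representative accession from each group and plotted the count of differing SNPs (termed pairwise differentiation) in 10-kb non-overlapping windows within the specified regions.

To detect selective sweeps associated with artificial selection during domestication, we calculated πwild/πcultivated and FST using VCFtools119 (v.0.1.16) with a 100-kb sliding window and a 10-kb step. We then used BEDTools132 (v.2.30.0) with the ‘-d 30000’ parameter to merge overlapping regions that were within the top 5% of both values. Notably, given the extensive gene flow observed in indica, our analysis was restricted to Or-IIIa and Or-Ib as representatives of the wild rice groups. The cultivated rice groups comprised indica, japonica, aus and basmati. The construction of the SNP-based phylogenetic tree and the archetypal analysis within the identified selective sweeps followed the same methods as described above for the whole genome. To identify domPAVs, we performed a two-sided Fisher's exact test comparing wild and cultivated rice133, considering PAVs with a false discovery rate (FDR)-adjusted P < 0.05 as significant.

To trace the origin of the domestication regions in rice, we first identified 566,513 differentiating SNPs between 31 Or-IIIa accessions and 19 Or-Ib accessions, showing an allele frequency of greater than 0.8. We then assessed the major allele, with a frequency of 90% or higher, at these SNP sites in the Asian cultivated rice groups and Or-Ia. If the major allele matched that of Or-IIIa, the SNP was classified as originating from Or-IIIa; otherwise, it was classified as originating from Or-Ib. The genome-wide distribution was visualized using RectChr (https://github.com/bgi-shenzhen/rectchr) (v.1.36).

To construct phylogenetic trees of domestication genes, we selected 11 representative genes known for their roles in the early stages of rice domestication, as described in the published literature. Using GMAP89 (v.2021-05-27), we extracted the gene-region sequences, or the gene regions together with their upstream and downstream regions, from 280 rice accessions. These sequences were then aligned using MAFFT95 (v.7.490) with the ‘--maxiterate 1000’ parameter to optimize alignment accuracy. For phylogenetic tree construction, we performed tree inference using IQ-TREE96 (v.2.2.0.3) with 1,000 bootstrap replicates, with O. longistaminata, O. meridionalis and O. glaberrima set as outgroups. For haplotype analysis, after manually trimming the aligned sequences, we filtered out intron sequences without candidate functional sites or QTNs, then performed the haplotype analysis and visualized the haplotype networks using the R package geneHapR134 (v.1.1.9). For clarity, our analysis retained haplotypes with a frequency greater than two, as well as haplotypes closely related to the major haplotypes.

We identified SNP sites with highly differentiated alleles between 90 indica varieties and 36 japonica varieties, requiring allele frequencies higher than 0.9 in both groups. In total, we identified 855,121 indica–japonica differentiated SNPs. To determine the ancestral origins of these SNPs, we analysed the major allele states (frequency ≥ 0.6) in their respective ancestral groups and classified them into six categories (Supplementary Table 20). We applied the same criteria to identify 13,853 indica–japonica differentiated PAVs between 26 indica varieties and 13 japonica varieties.

Geographical records of all wild rice accessions in the study were obtained from the collection of field samples. The approximate latitude and longitude of their distribution ranges, used for spatial mapping, are provided in Supplementary Table 2. The distribution map was generated using the open-source Python tool GeoPandas135 (v.0.14.4) (BSD-3-Clause licence), with layers derived from a public mapping dataset (https://www.naturalarearthdata.com/).

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.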

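This article was contributed by the author [lejiaoyi] and does not represent the position of 言希号. If you reproduce it, please cite the source: https://lejiaoyi.cn/zlan/202506-792.html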