Integrating BioNano optical maps with the PacBio assembly
High-molecular-weight DNA was isolated and labelled from leaf tissue of three-week-old quinoa plants according to standard BioNano protocols, using the single-stranded nicking endonuclease Nt.BspQI. Labelled DNA was imaged automatically using the BioNano Irys system and de novo assembled into consensus physical maps using the BioNano IrysView analysis software. The final de novo assembly used only single molecules with a minimum length of 150 kb and at least eight labels per molecule. PacBio-BioNano hybrid scaffolds were identified using IrysView’s hybrid scaffold alignment subprogram.
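The single-molecule filter described above can be sketched as follows. This is only an illustration of the two stated thresholds (150 kb and eight labels); the `Molecule` record and function names are hypothetical and do not correspond to the IrysView software or its file formats.

```python
from dataclasses import dataclass, field

MIN_LENGTH_BP = 150_000  # minimum molecule length given in the text (150 kb)
MIN_LABELS = 8           # minimum number of labels per molecule given in the text


@dataclass
class Molecule:
    """Hypothetical record for one imaged molecule (not an IrysView type)."""
    mol_id: str
    length_bp: int
    label_positions: list = field(default_factory=list)  # label sites, in bp


def passes_filter(mol, min_length=MIN_LENGTH_BP, min_labels=MIN_LABELS):
    """True if the molecule is long enough and carries enough labels."""
    return mol.length_bp >= min_length and len(mol.label_positions) >= min_labels


def filter_molecules(molecules):
    """Keep only molecules that satisfy both thresholds."""
    return [m for m in molecules if passes_filter(m)]
```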
Chicago library preparation and sequencing
Using the same DNA prepared for PacBio sequencing, a Chicago library was prepared as described previously10. The library was sequenced on an Illumina HiSeq 2500.
Scaffolding the PacBio and BioNano assemblies with HiRise
Chicago sequence data (in FASTQ format) was used to scaffold the PacBio-BioNano hybrid assembly using HiRise, a software pipeline designed specifically for using Chicago data to assemble genomes10. Chicago library sequences were aligned to the draft input assembly using a modified SNAP read mapper (http://snap.cs.berkeley.edu). The separations of Chicago read pairs mapped within draft scaffolds were analysed by HiRise to produce a likelihood model, and the resulting likelihood model was used to identify putative mis-joins and score prospective joins.
Kurmi × 0654 population mapping and genetic marker analysis
A population was developed by crossing Kurmi (green, sweet) and 0654 (red, bitter). Homozygous high- and low-saponin F2 lines were identified by planting 12 F3 seeds derived from each F2 line, harvesting F4 seed from these F3 plants, and then performing foam tests on the F4 seed. Phenotyping was validated using gas chromatography/mass spectrometry (GC/MS). RNA was extracted from inflorescences containing a mixture of flowers and seeds at various stages of development from the parents and 45 individual F3 progeny. RNA extraction and Illumina sequencing were performed as described above. Sequencing reads from all lines were trimmed using Trimmomatic and mapped to the reference assembly using TopHat44, and SNPs were called using SAMtools mpileup (v1.1)45.
For linkage mapping, markers were assigned to linkage groups on the basis of the grouping by JoinMap v4.1. The order of the markers was determined using the maximum likelihood algorithm of JoinMap; with this order used as both the start order and the fixed order, regression mapping in JoinMap was then used to determine the cM distances.
Genes differentially expressed between bitter and sweet lines and between green and red lines were identified using default parameters of the Cuffdiff function of the Cufflinks program46.
Atlas × Carina Red population mapping and genetic marker analysis
A second mapping population was developed by crossing Atlas (sweet) and Carina Red (bitter). Bitter and sweet F2 lines were identified by performing foam and taste tests on the F3 seed. DNA sequencing was performed with DNA from the parents and 94 sweet F2 lines, as described above, and sequencing reads were mapped to the reference assembly using BWA. SNPs were called in the parents and in a merged file containing all combined F2 lines.
Genotype calls were generated for the 94 F2 genotypes by summing up read counts over a sliding window of 500 variants, at all variant positions for which the parents were homozygous and polymorphic. Over each 500-variant stretch, reads carrying the Atlas allele and reads carrying the Carina Red allele were summed separately. Markers were assigned to linkage groups using JoinMap, with regression mapping used to obtain the genetic maps per linkage group.
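The windowed read-count summation can be sketched as below for one F2 line. The input lists are assumed to hold per-variant read counts at positions where the parents are homozygous and polymorphic. The rule for turning the two sums into a genotype call is not given in the text, so `call_genotype` and its `hom_cutoff` value are purely illustrative assumptions, as are all function names.

```python
from itertools import accumulate


def window_allele_sums(atlas_counts, carina_counts, window=500, step=1):
    """Sum Atlas-allele and Carina Red-allele reads over sliding windows.

    Returns a list of (start_index, atlas_sum, carina_sum) tuples.
    """
    if len(atlas_counts) != len(carina_counts):
        raise ValueError("count lists must have equal length")
    # Prefix sums let each window total be computed in constant time.
    a_pref = [0] + list(accumulate(atlas_counts))
    c_pref = [0] + list(accumulate(carina_counts))
    sums = []
    for start in range(0, len(atlas_counts) - window + 1, step):
        end = start + window
        sums.append((start,
                     a_pref[end] - a_pref[start],
                     c_pref[end] - c_pref[start]))
    return sums


def call_genotype(atlas_sum, carina_sum, hom_cutoff=0.9):
    """Assumed calling rule (not from the text): 'A' or 'B' when one parental
    allele dominates the window, 'H' otherwise, '-' when no reads are present."""
    total = atlas_sum + carina_sum
    if total == 0:
        return "-"
    atlas_fraction = atlas_sum / total
    if atlas_fraction >= hom_cutoff:
        return "A"
    if atlas_fraction <= 1 - hom_cutoff:
        return "B"
    return "H"
```

With `window=500` this follows the 500-variant stretch described above; a smaller window is convenient for checking the logic on toy data.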
Integrated linkage map
The Kurmi × 0654 and Atlas × Carina Red maps were integrated with the previously published quinoa linkage map13, with the Kurmi × 0654 map being used as the reference for the positions of anchor markers and scaling. We selected markers from the same scaffold that were in the same 10,000-bp bin in the assembly. The anchor markers on the alternative map received the position of the Kurmi × 0654 map anchor marker in the integrated map. This process was repeated with anchor markers at the 100,000-bp bin level. The assumption is that at the 100,000-bp bin level recombination should essentially be zero. On this level, a regression of cM position on both maps yielded R2 values >0.85 and often >0.9, so the regression line can easily be used for interpolating the positions of the alternative map towards the corresponding position on the Kurmi × 0654 map. All Kurmi × 0654 markers went into the integrated map on their original position.
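The anchoring and interpolation step can be sketched as follows: markers from the two maps that share a scaffold and bin are paired, a least-squares line is fitted to their cM positions, and that line is used to project remaining markers of the alternative map onto the Kurmi × 0654 scale. The marker tuples and function names are hypothetical, and ordinary least squares is assumed here because the text does not specify the regression implementation.

```python
def bin_key(scaffold, position_bp, bin_size=100_000):
    """Identify the assembly bin that a marker falls into."""
    return (scaffold, position_bp // bin_size)


def find_anchor_pairs(ref_markers, alt_markers, bin_size=100_000):
    """Pair markers from both maps that share a scaffold bin.

    Each marker is (scaffold, position_bp, cM). Returns (alt_cM, ref_cM) pairs.
    """
    ref_by_bin = {}
    for scaffold, pos, cm in ref_markers:
        # Keep the first reference marker seen in each bin.
        ref_by_bin.setdefault(bin_key(scaffold, pos, bin_size), cm)
    pairs = []
    for scaffold, pos, cm in alt_markers:
        key = bin_key(scaffold, pos, bin_size)
        if key in ref_by_bin:
            pairs.append((cm, ref_by_bin[key]))
    return pairs


def fit_line(pairs):
    """Least-squares fit of ref_cM on alt_cM; returns (slope, intercept, r2)."""
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two anchor pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("anchor positions must vary on both maps")
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, (sxy * sxy) / (sxx * syy)


def project(alt_cm, slope, intercept):
    """Interpolate an alternative-map position onto the reference map scale."""
    return slope * alt_cm + intercept
```

Setting `bin_size=10_000` gives the first, stricter anchoring pass described above; the R2 value returned by `fit_line` corresponds to the per-group check mentioned in the text.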
Pseudomolecules were assembled by concatenating scaffolds based on their order and orientation as determined from the integrated linkage map. An AGP (‘A Golden Path’) file was made that describes the positions of the scaffold-based assembly in coordinates of the pseudomolecule assembly, with 100 ‘N’s inserted between consecutive scaffolds. Based on these coordinates, custom scripts were used to generate the pseudomolecule assembly and to recoordinate the annotation file.
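A minimal sketch of the pseudomolecule construction is given below: scaffolds are concatenated in map order and orientation, 100 Ns are placed between consecutive scaffolds, and matching tab-separated AGP lines are emitted. This is not the authors' custom script. The gap fields written on the gap lines (`N`, `scaffold`, `yes`, `map`) are assumptions chosen to give a complete nine-column AGP record, since the text does not state which values were used.

```python
GAP_LENGTH = 100  # number of 'N's between consecutive scaffolds, per the text

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a DNA string (A, C, G, T and N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_pseudomolecule(name, ordered_scaffolds, sequences, gap_length=GAP_LENGTH):
    """Concatenate scaffolds and describe the result as AGP lines.

    ordered_scaffolds: list of (scaffold_id, orientation), orientation '+' or '-'.
    sequences: dict mapping scaffold_id to its sequence.
    Returns (pseudomolecule_sequence, list_of_agp_lines).
    """
    parts, agp = [], []
    cursor = 1        # AGP coordinates are 1-based and inclusive
    part_number = 0
    for index, (scaffold_id, orientation) in enumerate(ordered_scaffolds):
        if index > 0:
            part_number += 1
            gap_end = cursor + gap_length - 1
            # Gap type and linkage columns are assumed values, see lead-in.
            agp.append("\t".join(map(str, [
                name, cursor, gap_end, part_number,
                "N", gap_length, "scaffold", "yes", "map"])))
            parts.append("N" * gap_length)
            cursor = gap_end + 1
        seq = sequences[scaffold_id]
        part_number += 1
        end = cursor + len(seq) - 1
        agp.append("\t".join(map(str, [
            name, cursor, end, part_number,
            "W", scaffold_id, 1, len(seq), orientation])))
        parts.append(seq if orientation == "+" else reverse_complement(seq))
        cursor = end + 1
    return "".join(parts), agp
```

The object start and end columns of each component line give the offsets needed to shift scaffold-based annotation coordinates into pseudomolecule coordinates.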
Sequencing and assembly of C. pallidicaule and C. suecicum
DNA was extracted from C. pallidicaule (PI 478407) and C. suecicum (BYU 1480) and was sent to the Beijing Genomics Institute (BGI, Hong Kong), where one 180-bp PE library and two mate-pair libraries with insert sizes of 3 and 6 kb were prepared and sequenced on the Illumina HiSeq platform to obtain 2 × 100-bp reads for each library. The generated reads were trimmed using the quality-based trimming tool Sickle (https://github.com/najoshi/sickle). The trimmed reads were then assembled using the ALLPATHS-LG assembler47, and GapCloser v1.1248 was used to resolve N spacers and gap lengths produced by the ALLPATHS-LG assembler.
Repeat families found in the genome assemblies of quinoa, C. pallidicaule and C. suecicum (see Supplementary Information 3) were first independently identified de novo and classified using the software package RepeatModeler49. RepeatMasker50 was used to discover and identify repeats within the respective genomes.
AUGUSTUS51 was used for ab initio gene prediction, using model training based on coding sequences from Amaranthus hypochondriacus, Beta vulgaris, Spinacia oleracea and Arabidopsis thaliana. RNA-seq and isoform sequencing reads generated from RNA of different tissues were mapped onto the reference genome using Bowtie 2 (ref. 52) and GMAP53, respectively. Hints with locations of potential intron–exon boundaries were generated from the alignment files with the software package BAM2hints in the MAKER package54. MAKER with AUGUSTUS (intron–exon boundary hints provided from RNA-seq and isoform sequencing) was then used to predict genes in the repeat-masked reference genome. To help guide the prediction process, peptide sequences from B. vulgaris and the original quinoa full-length transcript (provided as EST evidence) were used by MAKER during the prediction. Genes were characterized for their putative function by performing a BLAST search of the peptide sequences against the UniProt database. PFAM domains and InterProScan ID were added to the gene models using the scripts provided in the MAKER package.
The following quinoa accessions were chosen for DNA re-sequencing: 0654, Ollague, Real, Pasankalla (BYU 1202), Kurmi, CICA-17, Regalona (BYU 947), Salcedo INIA, G-205-95DK, Cherry Vanilla (BYU 1439), Chucapaca, Ku-2, PI 634921 (Ames 22157), Atlas and Carina Red. The following accessions of C. berlandieri were sequenced: var. boscianum (BYU 937), var. macrocalycium (BYU 803), var. zschackei (BYU 1314), var. sinuatum (BYU 14108), and subsp. nuttaliae (‘Huauzontle’). Two accessions of C. hircinum (BYU 566 and BYU 1101) were also sequenced. All sequencing was performed with an Illumina HiSeq 2000 machine, using either 125-bp (Atlas and Carina Red) or 100-bp (all other accessions) paired-end libraries. Reads were trimmed using Trimmomatic and mapped to the reference assembly using BWA (v0.7.10)55. Read alignments were manipulated with SAMtools, and the mpileup function of SAMtools was used to call SNPs.
Identification of orthologous genes
Orthologous and paralogous gene clusters were identified using OrthoMCL28. Recommended settings were used for all-against-all BLASTP comparisons (Blast+ v2.3.056) and OrthoMCL analyses. Custom Perl scripts were used to process OrthoMCL outputs for visualization with InteractiVenn57.
Using OrthoMCL, orthologous gene sets containing two copies in quinoa and one copy each in C. pallidicaule, C. suecicum, and B. vulgaris were identified. In total, 7,433 gene sets were chosen, and their amino acid sequences were aligned individually for each set using MAFFT58. The 7,433 alignments were converted into PHYLIP format files by the seqret command in the EMBOSS package59. Individual gene trees were then constructed using the maximum likelihood method using proml in PHYLIP60.
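The selection of gene sets by copy number can be sketched as below. The sketch assumes the usual OrthoMCL groups layout, in which each line has the form `group_id: taxon|gene taxon|gene ...`. The four taxon codes are hypothetical placeholders, and the filter assumes that only these four taxa are present in a selected group.

```python
from collections import Counter

# Hypothetical taxon codes: quinoa, C. pallidicaule, C. suecicum, B. vulgaris.
REQUIRED_COPIES = {"cqu": 2, "cpa": 1, "csu": 1, "bvu": 1}


def parse_group_line(line):
    """Split 'group_id: taxon|gene taxon|gene ...' into (group_id, members)."""
    group_id, members = line.split(":", 1)
    return group_id.strip(), members.split()


def has_required_copies(members, required=REQUIRED_COPIES):
    """True if the per-taxon copy numbers match the required pattern exactly."""
    counts = Counter(member.split("|", 1)[0] for member in members)
    return dict(counts) == required


def select_groups(lines, required=REQUIRED_COPIES):
    """Return {group_id: members} for groups with the required copy numbers."""
    selected = {}
    for line in lines:
        if not line.strip():
            continue
        group_id, members = parse_group_line(line)
        if has_required_copies(members, required):
            selected[group_id] = members
    return selected
```

Each selected group would then be written out as its own FASTA file for alignment and tree building.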
In addition, the genomic variants of all 25 sequenced taxa (Supplementary Data 5) relative to the reference sequence were called based on the mapped Illumina reads in 25 BAM files using SAMtools. To call variants in the reference genome (PI 614886), Illumina sequencing reads were mapped to the reference assembly. Variants were then filtered using VCFtools61 and SAMtools, and the qualified SNPs were combined into a single VCF file which was used as an input into SNPhylo62 to construct the phylogenetic relationship using maximum likelihood and 1,000 bootstrap iterations.
To identify FT homologues, the protein sequence from the A. thaliana flowering time gene FT was used as a BLAST query. Filtering for hits with an E value <1e−3 and with RNA-seq evidence resulted in the identification of four quinoa proteins. One quinoa protein (AUR62013052) appeared to be composed of two tandem repeats, which were separated for the purposes of phylogenetic analysis. For the construction of the phylogenetic tree, protein sequences from the resulting five quinoa FT homologues were aligned using Clustal Omega63 along with two B. vulgaris (gene models: BvFT1-miuf.t1, BvFT2-eewx.t1) and one A. thaliana (AT1G65480.1) homologue. Phylogenetic analysis was performed with MEGA64 (v6.06). The JTT model was selected as the best fitting model. The initial phylogenetic tree was estimated using the neighbour joining method (bootstrap value = 50, Gaps/Missing Data Treatment = Partial Deletion, Cutoff 95%), and the final tree was estimated using the maximum likelihood method with a bootstrap value of 1,000 replicates. The syntenic relationships between the coding sequences of the chromosomal regions surrounding these FT genes were visualized using the CoGE65 GEvo tool and the Multi-Genome Synteny Viewer66.
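The hit-filtering step at the start of the FT homologue search can be sketched as follows, assuming BLAST results in the standard 12-column tabular layout (subject identifier in column 2, E value in column 11). The E-value cutoff is supplied by the caller, and `expressed_ids`, a set of gene identifiers with RNA-seq support, is a hypothetical input.

```python
def filter_blast_hits(tabular_lines, max_evalue, expressed_ids):
    """Return subject IDs whose hits pass the E-value cutoff and that have
    RNA-seq evidence, in order of first appearance and without duplicates."""
    kept = []
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject_id = fields[1]
        evalue = float(fields[10])
        if (evalue < max_evalue
                and subject_id in expressed_ids
                and subject_id not in kept):
            kept.append(subject_id)
    return kept
```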
The alignment of bHLH domains was performed with Clustal Omega63, using sequences from Mertens et al.39. The phylogeny was inferred using the maximum likelihood method based on the JTT matrix-based model67. Initial trees for the heuristic search were obtained automatically by applying Neighbour-Join and BioNJ algorithms to a matrix of pairwise distances estimated using a JTT model, and then selecting the topology with superior log likelihood value. All positions containing gaps and missing data were eliminated.
Distinguishing and analysing the quinoa sub-genomes
Trimmed PE Illumina sequencing reads that were used for the de novo assembly of C. suecicum and C. pallidicaule were mapped onto the reference quinoa genome using the default settings of BWA. For every base in the quinoa genome, the depth coverage of properly paired reads from the C. suecicum and C. pallidicaule mapping was calculated using the program GenomeCoverage in the BEDtools package68. A custom Perl script was used to calculate the percentage of each scaffold with more than 5× coverage from both diploids. Scaffolds were assigned to the A or B sub-genome if >65% of the bases were covered by reads from one diploid and <25% of the bases were covered by reads from the other diploid. The relationship between the quinoa sub-genomes and the diploid species C. pallidicaule and C. suecicum was presented in a circle proportional to their sizes using Circos69. Orthologous regions in the three species were identified using BLASTN searches of the quinoa genome against each diploid genome individually. Single top BLASTN hits longer than 8 kb were selected and presented as links between the quinoa genome assembly (arranged in chromosomes, see Supplementary Information 7.3) and the two diploid genome assemblies on the Circos plot (Fig. 2a).
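The scaffold assignment rule can be sketched as below. It is written in Python for illustration and is not the custom Perl script mentioned in the text. It assumes per-base depth lists for each diploid mapping, reads the lower threshold as 25% of bases, and assumes that C. pallidicaule coverage indicates the A sub-genome and C. suecicum coverage the B sub-genome.

```python
MIN_DEPTH = 5        # bases count as covered when depth exceeds 5x
HIGH_PERCENT = 65.0  # minimum coverage by the matching diploid
LOW_PERCENT = 25.0   # maximum coverage by the other diploid (assumed reading)


def percent_covered(depths, min_depth=MIN_DEPTH):
    """Percentage of bases in a scaffold with depth greater than min_depth."""
    if not depths:
        return 0.0
    covered = sum(1 for depth in depths if depth > min_depth)
    return 100.0 * covered / len(depths)


def assign_subgenome(pct_pallidicaule, pct_suecicum,
                     high=HIGH_PERCENT, low=LOW_PERCENT):
    """Assign a scaffold to 'A', 'B' or 'unassigned' from the two percentages."""
    if pct_pallidicaule > high and pct_suecicum < low:
        return "A"
    if pct_suecicum > high and pct_pallidicaule < low:
        return "B"
    return "unassigned"
```

Scaffolds covered well by both diploids, or by neither, remain unassigned under this rule.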
Sub-genome synteny was analysed by plotting the positions of homoeologous A- and B-sub-genome pairs within the context of the 18 chromosomes using Circos. Synteny between the sub-genomes and B. vulgaris was assessed by first creating B. vulgaris pseudomolecules by concatenating scaffolds that were known to be ordered and oriented within each of the nine chromosomes. Syntenic regions between these B. vulgaris chromosomes and those of quinoa were then identified using the recommended settings of the CoGe SynMap tool70 and visualized using MCScanX71 and VGSC72. For the purposes of visualization, quinoa chromosomes CqB05, CqA08, CqB11, CqA15 and CqB16 were inverted.
Quinoa seeds were embedded in a 2% carboxymethylcellulose solution and frozen above liquid nitrogen. Sections of 50 μm thickness were obtained using a Reichert-Jung Frigocut 2800N, modified to use a Feather C35 blade holder and blades at −20 °C using a modified Kawamoto method73. A 2,5-dihydroxybenzoic acid (Sigma-Aldrich) matrix (40 mg ml−1 in 70% methanol) was applied using a HTX TM-Sprayer (HTX Technologies LLC) with attached LC20-AD HPLC pump (Shimadzu Scientific Instruments). Sections were vacuum dried in a desiccator before analysis. The optical image was generated using an Epson 4400 Flatbed Scanner at 4,800 d.p.i. For mass spectrometric analyses, a Bruker SolariX XR with a 7 T magnet was used. Images were generated using Bruker Compass FlexImaging 4.1. Data were normalized to the TIC, and brightness optimization was employed to enhance visualization of the distribution of selected compounds. Individual spectra were recalibrated using Bruker Compass DataAnalysis 4.4 to internally lock masses of known DHB clusters: C14H9O6 = 273.039364 and C21H13O9 = 409.055408 m/z. Accurate mass measurements for individual saponins and identified compounds were run using continuous accumulation of selected ions (CASI) with mass windows of 50–100 m/z and a 4-megaword data set, generating a transient of 2.93 s and providing a mass resolving power of approximately 390,000 at 400 m/z. Lipids were putatively assigned by searching the LipidMaps database74 (http://www.lipidmaps.org), and lipid class was confirmed by collision-induced dissociation using a 10 m/z window centred around the monoisotopic peak with a collision energy of 15–20 V.
Quinoa flowers were marked at anthesis, and seeds were sampled at 12, 16, 20 and 24 days after anthesis. A pool of five seeds from each time point was analysed using GC/MS.
Quantification of saponins was performed indirectly by quantifying oleanolic acid (OA) derived from the hydrolysis of saponins extracted from quinoa seeds. The derivatized solution was analysed using a single quadrupole GC/MS system (Agilent 7890 GC/5975C MSD) equipped with an EI source at an ionisation energy of 70 eV. Chromatographic separation was performed using a DB-5MS fused silica capillary column (30 m × 0.25 mm I.D., 0.25 μm film thickness; Agilent J&W Scientific), chemically bonded with a 5% phenyl 95% methylpolysiloxane cross-linked stationary phase. Helium was used as the carrier gas with a constant flow rate of 1.0 ml min−1. The quantification of OA in each sample was performed using a standard curve based on standards of OA.
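Quantification against a standard curve can be sketched as follows. A linear relationship between OA concentration and peak area is assumed for illustration, because the text does not state the form of the calibration; the function names and units are hypothetical.

```python
def fit_standard_curve(concentrations, peak_areas):
    """Least-squares line peak_area = slope * concentration + intercept.

    Returns (slope, intercept). A linear calibration is an assumption.
    """
    n = len(concentrations)
    if n < 2 or n != len(peak_areas):
        raise ValueError("need at least two matched standards")
    mean_c = sum(concentrations) / n
    mean_a = sum(peak_areas) / n
    sxx = sum((c - mean_c) ** 2 for c in concentrations)
    if sxx == 0:
        raise ValueError("standards must span more than one concentration")
    sxy = sum((c - mean_c) * (a - mean_a)
              for c, a in zip(concentrations, peak_areas))
    slope = sxy / sxx
    return slope, mean_a - slope * mean_c


def quantify(peak_area, slope, intercept):
    """Convert a sample peak area to a concentration using the fitted line."""
    if slope == 0:
        raise ValueError("calibration slope is zero")
    return (peak_area - intercept) / slope
```

Sample values are only meaningful within the concentration range spanned by the standards.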
Specific, individual saponins were identified in quinoa using a preparation of 20 mg of seeds performed according to a modified protocol from Giavalisco et al.75. Samples were measured with a Waters ACQUITY Reversed Phase Ultra Performance Liquid Chromatography (RP-UPLC) coupled to a Thermo-Fisher Exactive mass spectrometer, which consists of an electrospray ionisation source and an Orbitrap mass analyser. A C18 column was used for the hydrophilic measurements. Chromatograms were recorded in full-scan MS mode (mass range, 100–1,500). Extraction of the LC/MS data was accomplished with the software REFINER MS 7.5 (GeneData).
SwissModel76 was used to produce homology models for the bHLH region of AUR62017204, AUR62017206 and AUR62010677. RaptorX77 was used for prediction of secondary structure and disorder. QUARK78 was used for ab initio modelling of the C-terminal domain, and the DALI server79 was used for 3D homology searches of this region. Models were manually inspected and evaluated using the PyMOL program (http://pymol.org).
Data availability statement
The genome assemblies and sequence data for C. quinoa, C. pallidicaule and C. suecicum were deposited at NCBI under BioProject codes PRJNA306026, PRJNA326220 and PRJNA326219, respectively. Additional accession numbers for deposited data can be found in Supplementary Data 9. The quinoa genome can also be accessed at http://www.cbrc.kaust.edu.sa/chenopodiumdb/ and on the Phytozome database (http://www.phytozome.net/).