Notes on the scaffolds of the Anopheles gambiae PEST whole genome shotgun assembly

The WGS assembly for Anopheles gambiae strain PEST was prepared by Celera. It is described in a Science paper and deposited in GenBank in the form of scaffolds. Accessions AAAB01000001-AAAB01008987 constitute the original genome assembly. VectorBase is now responsible for the assembled genome sequence and its annotation.

The Science paper noted that some scaffolds have regions with anomalies that suggest possible mis-assembly. These are thought to arise because some regions of the PEST strain genome are polymorphic, and the assembly algorithm sometimes produced two versions of such regions instead of collapsing them into one (even after fine tuning to minimise this problem; see the supporting text in the paper). The two versions may sometimes appear next to each other in a single scaffold as artefactual duplications, in either tandem or inverted orientation. However, for any single example, it is difficult to be sure whether the duplication is artefactual or real.

The current assembly, AgamP3, designates some entire scaffolds as probable alternative assemblies of other, chromosomally-placed scaffolds, and also designates some adjacent ends of two chromosomally-placed scaffolds as probable alternative assemblies. However, there has not yet been a systematic attempt to identify all scaffolds with assembly problems. Below is an informal list of some scaffolds where a problem may exist. Please add extra examples if you find them.

All or part of a small number of BAC clones (from the same PEST strain) have been independently sequenced. These sequences represent regions of the genome that are also covered by the WGS assembly. The AgamP3 assembly uses the WGS scaffold sequence for these regions, rather than the BAC sequence. Note that the BAC sequence is likely to be of higher quality and is not subject to the problems mentioned above arising from polymorphism in PEST. The table also shows the scaffold regions for which BAC sequence is available.
More detail and references follow the table below:

Scaffold acc. | Type of anomaly or other feature | Location within scaffold | Noted by
AAAB01008794 | spurious base in assembly cf traces and GPRGR19 gene model | G at position 23958 | Martin Hammond (info from Hugh Robertson)
AAAB01008980 | small tandem duplication ABab | A= 8473666->8482606 =B a= 8482607->8491701 =b | Martin Hammond (info from Judy Willis)
             | small tandem duplication including part of AST2 gene | ~49487000-49592000 | Martin Hammond (info from Veenstra/Noriega/Topalis)
AAAB01008882 | FISH places on 2L at 21B & start has possible overlap with start of 8900, but one or other of these overlap segments has significant rearrangement or misassembly - left on UNKN in AgamP3 | 1-46000 | Maria Sharakhova (FISH) & Martin Hammond (exonerate analysis)
AAAB01008816 | large duplications | 1462359 - 1591191 | Martin Hammond (exonerate analysis)
AAAB01008835 | 3 regions with large duplication(s) | 108080-176428, 922859-1070413, 1204367-1588676 | Martin Hammond (exonerate analysis)
AAAB01008851 | large duplication | 1266323 - 1339297 | Martin Hammond (exonerate analysis)
             | probable artifactual duplication (of part of 663-698 kbp) including build4 CPR genes AGAP003391/CPR114, AGAP003392 & AGAP003393/CPR119; these genes were omitted from build5 | <741422 - >768466 | Martin Hammond (taken from Cornman et al PMID:18205929)
AAAB01008859 | large inverted duplication | 3909697 - 4202803 | Martin Hammond (exonerate analysis)
AAAB01008888 | large duplication | 2518998 - 2683317 | Martin Hammond (exonerate analysis)
AAAB01008964 | large inverted duplication | 10075534 - 10252272 | Martin Hammond (exonerate analysis)
AAAB01008980 | several large duplications within these regions, but could be real | 3119087 - 4136090, 4741445 - 5211037 | Martin Hammond (exonerate analysis)
AAAB01008982 | large duplication | 106616 - 203282 | Martin Hammond (exonerate analysis)
AAAB01008984 | several large duplications within these regions, but could be real | 5150269 - 5256602, 6403573 - 6536009, 10654343 - 11174667 | Martin Hammond (exonerate analysis)
AAAB01008986 | 3 regions with large duplications | 5568075 - 5630149, 10788287 - 10863802, 11847256 - 11933924 | Martin Hammond (exonerate analysis)
AAAB01008987 | several large duplications within this region, but could be real | 6070650 - 6470973 | Martin Hammond (exonerate analysis)
AAAB01008182 | on UNKN in AgamP3 but probably alt assembly of 3L at ~10.988-10.995 Mb | 1-8598 (all) | Martin Hammond (Blast)
AAAB01008147 | on UNKN in AgamP3 but probably alt assembly of 3L at ~11.253-11.259 Mb | 1-6981 (all) | Martin Hammond (Blast)
AAAB01008987 | Region also covered by BAC sequence | 9514457-10041878 | Reference: Thomasova et al (2002)
AAAB01008987 | Region also covered by BAC sequence | 7731212-8247333 | Reference: Eiglmeier et al (2005) - Contig D1
AAAB01008987 | Region also covered by BAC sequence | 7211833-7342228 | Reference: Eiglmeier et al (2005) - Contig D2
AAAB01008966 | Region also covered by BAC 10E23 sequence | 826662-937795 | Reference: Louis et al (unpublished)
AAAB01008890 | Region also covered by BAC 09O07 sequence; assembly problems suggested at and beyond the end of this region | 12508491-12618521 | Reference: Louis et al (unpublished)
AAAB01008400 | on UNKN in AgamP3 but probably alt assembly of 2L around 9254000 bp | all | Bob MacCallum (Affymetrix probe matches two very similar genes one of which is on UNKN.)
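
To check whether a position of interest falls inside one of the regions listed above, the table can be encoded as data and queried. The following is a minimal Python sketch, not a VectorBase tool: the `FLAGGED` dictionary holds only a few rows copied from the table, the names `FLAGGED` and `flags_at` are illustrative, and treating the coordinates as inclusive scaffold positions is an assumption.

```python
# A few rows copied from the table above: accession -> list of (start, end, note).
# Coordinates are scaffold positions, treated here as inclusive (an assumption).
FLAGGED = {
    "AAAB01008816": [(1462359, 1591191, "large duplications")],
    "AAAB01008835": [
        (108080, 176428, "large duplication(s)"),
        (922859, 1070413, "large duplication(s)"),
        (1204367, 1588676, "large duplication(s)"),
    ],
    "AAAB01008987": [
        (6070650, 6470973, "several large duplications, but could be real"),
        (7211833, 7342228, "also covered by BAC sequence (contig D2)"),
        (7731212, 8247333, "also covered by BAC sequence (contig D1)"),
        (9514457, 10041878, "also covered by BAC sequence (Thomasova et al)"),
    ],
}


def flags_at(accession, position):
    """Return the notes for every listed region that contains the position."""
    return [
        note
        for start, end, note in FLAGGED.get(accession, [])
        if start <= position <= end
    ]


print(flags_at("AAAB01008835", 1000000))  # ['large duplication(s)']
print(flags_at("AAAB01008987", 5000000))  # []
```

An empty result only means the position is outside the rows entered here; as noted above, the list itself is informal and incomplete.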

Details of sequenced BAC clones

Thomasova et al. (2002) Comparative genomic analysis in the region of a major Plasmodium-refractoriness locus of Anopheles gambiae. PNAS 99:8179-8184

Warning: The paper and its GenBank sequence entries refer to these clones without leading zeroes: 30E5, 4F11, 11N17, 25F12, 22J3, 08N20

BAC end sequence mapping in VectorBase, and BAC end sequences in GenBank, refer to the same clones with leading zeroes: 30E05, 04F11, 11N17, 25F12, 22J03, 08N20

08N20 overlaps with parts of 11N17, 25F12, and 22J03 but appears to represent a different 'haplotype' (see paper).
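
The two naming forms above differ only in zero-padding, so names taken from the paper can be converted before searching VectorBase or the GenBank end sequences. The Python sketch below assumes that a clone name is a plate number, a single row letter and a column number, which is inferred from the examples on this page; `pad_clone_name` is an illustrative name.

```python
import re

# Plate number, row letter, column number, e.g. "30E5" or "04F11".
# This pattern is inferred from the clone names listed above (an assumption).
_CLONE = re.compile(r"^(\d{1,2})([A-Z])(\d{1,2})$")


def pad_clone_name(name):
    """Return the clone name with plate and column zero-padded to two digits."""
    match = _CLONE.match(name.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised clone name: {name!r}")
    plate, row, column = match.groups()
    return f"{int(plate):02d}{row}{int(column):02d}"


paper_names = ["30E5", "4F11", "11N17", "25F12", "22J3", "08N20"]
print([pad_clone_name(n) for n in paper_names])
# ['30E05', '04F11', '11N17', '25F12', '22J03', '08N20']
```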

Overall span is approx: 2R:6180720-6708141

Equivalent in scaffold coordinates is approx: AAAB01008987: 9514457-10041878
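
The two spans above have the same length, and the start of one pairs with the end of the other, which is what would be expected if scaffold AAAB01008987 lies in reverse orientation on 2R. Under that assumption, a scaffold position and its 2R position always add up to the same value (6180720 + 10041878 = 16222598). The Python sketch below uses this inferred constant; it is derived only from the approximate coordinates on this page, not from the AgamP3 assembly files, and the function names are illustrative.

```python
# Inferred from the approximate spans listed above (an assumption, not an
# official AgamP3 value): scaffold position + 2R position = PAIR_SUM.
PAIR_SUM = 6180720 + 10041878  # 16222598


def scaffold_to_2r(scaffold_pos):
    """Convert an AAAB01008987 position to 2R, assuming reverse orientation."""
    return PAIR_SUM - scaffold_pos


def scaffold_span_to_2r(start, end):
    """Convert a scaffold span; the endpoints swap under reverse orientation."""
    return scaffold_to_2r(end), scaffold_to_2r(start)


print(scaffold_span_to_2r(9514457, 10041878))  # (6180720, 6708141)
```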

Clone Sequences End sequences
30E05 AJ439353 AL155801,AL155802
04F11 AJ438610 AL141748,AL141749
11N17 AJ439060 AL146160,AL146161
25F12 AJ439061 AL152963,AL152964
22J03 AJ439398 AL151409,AL151410
08N20 AJ441131 AL144497,AL144498

Eiglmeier et al. (2005) Comparative analysis of BAC and whole genome shotgun sequences from an Anopheles gambiae region related to Plasmodium encapsulation. Insect Biochem. Mol. Biol. 35: 799-814

Contig D1 (447895 bp)
GenBank CR954256

Made up of portions of the following BAC clones: 30N20, 10K02, 13E09, 31J16, 01G10
Span in AgamP3 assembly chromosome coordinates is approx: 2R: 7975265-8491386
Equivalent scaffold coords are approx: AAAB01008987: 7731212-8247333
The paper notes that the D1 contig has a high level of differences from the WGS assembly between contig positions 285,897 and 405,387. This corresponds to 2R: 8261162-8380652, which is therefore presumably a polymorphic region in PEST.
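
The 2R coordinates quoted above are obtained by adding the contig positions to the approximate start of the D1 span on 2R (7975265). A short Python sketch of that arithmetic follows; the names are illustrative, and it simply reproduces the addition used on this page, with no adjustment for 0-based or 1-based numbering.

```python
D1_START_ON_2R = 7975265  # approximate start of contig D1 on 2R, from this page


def d1_to_2r(contig_pos):
    """Map a position in contig D1 to 2R by adding the start of the span."""
    return D1_START_ON_2R + contig_pos


print(d1_to_2r(285897), d1_to_2r(405387))  # 8261162 8380652
```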

Contig D2 (137277 bp) is from the single clone 32F02.
GenBank CR954257

Span in AgamP3 assembly chromosome coordinates is approx: 2R: 8880370-9010791
Equivalent scaffold coords are approx: AAAB01008987: 7211833-7342228
Contig D2 also has a high level of differences from the WGS assembly.

GenBank accessions for BAC end sequences

Clone End sequence (1) End sequence (2)
30N20 AL156118 AL156119
10K02 AL145501 AL145502
13E09 AL146910 AL146911
31J16 AL156537 AL156538
01G10 AL140066 AL140067
32F02 AL611813 AL156942

C. Louis et al (unpublished)

Please contact VectorBase for additional sequence information if needed.

Clone End sequences
10E23 maps to 3L:24797513-24908646 (AAAB01008966:826662-937795)
09O07 maps to 3R:48653610-48763640 (AAAB01008890:12508491-12618521)
However, the end of 09O07 includes a repeat element that is missing from the assembly at this point, and segments that also hit the assembly 40kb further down, suggesting some kind of assembly problem in this region.