Why are there protein sequences with Xs and gene sequences with Ns?


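A protein sequence may contain Xs, and a gene sequence may contain Ns, because the gene model spans a gap between two contigs. The gap is represented in the assembly as a run of Ns, and any codon containing Ns is translated as X (unknown amino acid).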
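The link between Ns and Xs can be illustrated with a minimal Python sketch. It assumes the standard genetic code and the simplest possible rule, namely that any codon containing a non-ACGT character becomes X; real translation tools may resolve some partially ambiguous codons. The function name translate is illustrative only and is not taken from any annotation pipeline.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate a coding sequence; codons with unknown bases give 'X'."""
    cds = cds.upper()
    protein = []
    # Walk the sequence codon by codon, ignoring a trailing partial codon.
    for i in range(0, len(cds) - len(cds) % 3, 3):
        protein.append(CODON_TABLE.get(cds[i:i + 3], "X"))
    return "".join(protein)


# Made-up sequence: a start codon, six gap bases, then a stop codon.
print(translate("ATGNNNNNNTAA"))
```

A long gap inside a coding exon therefore shows up in the protein as a stretch of Xs roughly one third as long as the run of Ns.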

Sometimes the gene has been extended across a gap to reach a STOP codon, because the annotation algorithm tries to find the nearest stop codon. In other cases BLAST hits span or overlap the gap, so a gene prediction that covers two contigs is likely to be correct. We cannot really create an intron to skip the gap, and the BLAST hits may clearly show that the gene is continuous across it.
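The reasoning about BLAST hits and gaps can be sketched in Python as follows. This is only an illustration: the coordinates are made up, intervals are assumed to be half-open (start, end) positions on the genomic sequence, and the function name classify_hits is hypothetical.

```python
def classify_hits(hits, gap):
    """Split hits into those spanning the gap and those ending before it.

    hits: list of (start, end) intervals; gap: a single (start, end) interval.
    All intervals are half-open and on the same coordinate system.
    """
    gap_start, gap_end = gap
    # A hit spans the gap if it starts before the gap and ends after it.
    spanning = [h for h in hits if h[0] < gap_start and h[1] > gap_end]
    # A hit ends before the gap if it lies entirely upstream of it.
    before = [h for h in hits if h[1] <= gap_start]
    return spanning, before


# Made-up example: one hit stops before the gap, one continues past it.
spanning, before = classify_hits([(100, 480), (120, 900)], gap=(500, 600))
print("spanning:", spanning)
print("ending before gap:", before)
```

If at least one good hit spans the gap, keeping the gene model across both contigs is supported. If every hit ends before the gap, the extension across it is not supported by similarity evidence.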

The following is an example from the Culex quinquefasciatus gene CPIJ008116:

The last exon spans a gap. It contains a long stretch of Ns followed by 15 bases that end with a stop codon. All the BLAST hits stop at the beginning of the gap, so the gene should end before the gap; the automatic pipeline probably extended it to find the nearest STOP codon. In this case it would have been better to end the gene earlier and tag it as incomplete. It is also likely that a small exon is missing in the middle.
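A pattern like this one (a gap followed by a short tail that ends in a stop codon) can be flagged with a short Python sketch. The thresholds min_len and max_tail and the function names are assumptions chosen for illustration, and the sequence used below is synthetic, not the actual CPIJ008116 sequence.

```python
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}


def gap_runs(seq, min_len=10):
    """Return (start, end) of each run of at least min_len Ns (end exclusive)."""
    return [m.span() for m in re.finditer(r"N{%d,}" % min_len, seq.upper())]


def short_tail_after_gap(cds, max_tail=30):
    """Return the tail length if the CDS ends with a stop codon within
    max_tail bases after its last gap; otherwise return None."""
    cds = cds.upper()
    runs = gap_runs(cds)
    if not runs:
        return None
    tail = cds[runs[-1][1]:]  # bases after the last run of Ns
    if len(tail) <= max_tail and tail[-3:] in STOP_CODONS:
        return len(tail)
    return None


# Synthetic CDS: 6 bases, a 30-base gap, then 15 bases ending in TAA.
cds = "ATGGCC" + "N" * 30 + "GCTGCTGCTGCTTAA"
print(gap_runs(cds), short_tail_after_gap(cds))
```

A gene flagged this way is a candidate for manual review: if no BLAST hit supports the tail, ending the model before the gap and marking it incomplete is probably the safer choice.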

A manual annotator may BLAST both the gene (nucleotide sequence) and the protein (amino acid sequence) and try to annotate beyond the Xs and Ns based on orthologs in Anopheles gambiae and Aedes aegypti. If the BLAST hits are convincing, it is worth spanning the gap.