GenBank from the NCBI is an amazing and invaluable resource for DNA sequences, and combined with the search tool BLAST it makes identifying your unknown gene sequence really easy.
The trouble is that many of the organism identifications attached to GenBank sequences are dubious, outdated, or just plain wrong. Identifying a sequence as the wrong species is bad science whatever your purpose, and the consequences are especially serious for regulatory agencies. The problem is also self-compounding, because other sequence submitters use these incorrect identifications to name their own sequences.
Ideally you would use a carefully curated “known good” list of sequences. The very best sequences to use are those derived from “type material”. The nomenclatural type is the specimen that was used to originally describe a species, so by definition it gives the best sequence to use for an accurate identification. A great website for bacteria that uses the 16S rDNA region is EzTaxon. I use this frequently, but it has the inherent limitation of the lack of 16S variability in some bacterial species, which means it is not always easy to get an accurate identification.
For fungi, using the universal barcode ITS rDNA region, a good website is www.fungalbarcoding.org, which also searches some sequences that have not yet been submitted to GenBank. GenBank has recognised the problem of poor-quality identifications and, for fungi, has a curated list of type sequences, described in the publication Finding needles in haystacks (disclaimer: I am a co-author).
Still, nothing beats the vast scope of GenBank, particularly if you want to use a gene other than ITS or 16S. There are several ways to limit the scope of your BLAST search to just good sequences. One way I recommend is to tick the box “Exclude Uncultured/environmental sample sequences” under the exclude option. These sequences are of no value in getting an identification and just clutter up the results. Ticking “Exclude Models (XM/XP)” will make no difference either way, as these are automatically annotated genes from a few NCBI genomes (human, mouse, rat, honey bee, chicken, chimpanzee). You should also try ticking “Sequences from type material” under the “limit to” option. I find it very useful to view the Distance tree of results (select 'show all' under Collapse Mode), rather than just relying on the ranking given in the results page.
In addition to this you can use the very powerful Entrez Query option to limit the results further. These also work really well if you are using another service to query the GenBank database and not going through the website (e.g. using Geneious). For example try these:
Entrez Query | Result |
---|---|
sequence from type[filter] | Only retrieves sequences from type cultures or specimens (works the same as the “Sequences from type material” option in the web interface) |
src specimen voucher[prop] OR src culture collection[prop] | Only retrieves sequences that are associated with a specimen in a herbarium, fungarium, or culture collection |
collection icmp[prop] | Only retrieves sequences from cultures in the ICMP culture collection |
NOT(environmental samples[organism] OR metagenomes[organism]) | Filters out sequences from environmental samples or metagenomes, which typically have poor identifications |
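
If you script your searches rather than ticking boxes, the filters in the table can be combined into a single Entrez Query string. Here is a minimal Python sketch of one way to do that, together with the form fields you might send to the NCBI BLAST URL API. The function names are my own, made up for illustration. The field names (CMD, PROGRAM, DATABASE, QUERY, ENTREZ_QUERY) and the "nt" database name follow the BLAST URL API as I understand it, so treat them as assumptions and check the current NCBI documentation. The sequence is a placeholder and nothing is actually submitted.

```python
from urllib.parse import urlencode

# Filters taken from the table above (the second one is used with NOT).
TYPE_MATERIAL = "sequence from type[filter]"
ENVIRONMENTAL = "environmental samples[organism] OR metagenomes[organism]"


def build_entrez_query(include=(), exclude=()):
    """Join filters to require with AND, then append each filter to drop with NOT.

    Every filter is wrapped in parentheses so that an OR inside one filter
    cannot change the meaning of its neighbours.
    """
    query = " AND ".join(f"({term})" for term in include)
    for term in exclude:
        query = f"{query} NOT ({term})".strip()
    return query


def blast_put_params(sequence, entrez_query, program="blastn", database="nt"):
    """Form fields for a BLAST search request (assumed field names, not sent here)."""
    return {
        "CMD": "Put",
        "PROGRAM": program,
        "DATABASE": database,
        "QUERY": sequence,
        "ENTREZ_QUERY": entrez_query,
    }


query = build_entrez_query(include=[TYPE_MATERIAL], exclude=[ENVIRONMENTAL])
params = blast_put_params("ACGTACGTACGT", query)  # placeholder sequence, not real data
body = urlencode(params)  # URL-encoded form body, ready to POST
print(query)
```

Wrapping each filter in parentheses is a design choice: it keeps a filter such as the specimen voucher / culture collection one intact when it is combined with others. Submitting the request and polling for results is left out of this sketch.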
These Entrez Queries can also be used to find sequences in the GenBank database without using BLAST. For example, collection icmp[prop] OR icmp[title] AND fungi[orgn] AND 2014/01/01[PDAT] : 2014/12/31[PDAT] finds all ICMP fungal cultures deposited in GenBank in the year 2014.
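
The same kind of query can be run from a script through the NCBI E-utilities. The sketch below only builds an esearch URL for the ICMP example and does not contact NCBI. The helper name esearch_url is hypothetical, and I am assuming the nucleotide database name "nuccore" and JSON output are what you want; adjust them to suit.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# The ICMP example query from the text, unchanged.
ICMP_2014 = (
    "collection icmp[prop] OR icmp[title] AND fungi[orgn] "
    "AND 2014/01/01[PDAT] : 2014/12/31[PDAT]"
)


def esearch_url(term, db="nuccore", retmax=20):
    """Build an E-utilities esearch URL for an Entrez query (no request is made)."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = esearch_url(ICMP_2014)
print(url)
```

Opening the printed URL in a browser, or fetching it with urllib.request, should return the matching record identifiers. If you send many requests, follow NCBI's usage guidelines for the E-utilities.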
I hope this helps you use GenBank more effectively.
Citation
Weir, B.S. (2014) How to get good fungal and bacterial identifications from GenBank sequences, NZ Rhizobia, 29 October 2014. https://www.rhizobia.co.nz/ids-using-genbank