Goodbye-Genbank

A Python package for Biopython that gives feature annotations from GenBank records a new and better life

View the Project on GitHub biosustain/goodbye-genbank

About Goodbye, GenBank

Goodbye, GenBank converts SeqFeature sequence annotations from NCBI GenBank records to a common, simplified format. GenBank feature annotations have a feature key and reasonably well-defined qualifiers, but non-standard and discontinued feature types and qualifiers are in common use, and the feature key is often something someone made up rather than a valid GenBank feature key. Even when a valid GenBank feature key is used, it is often incomplete and useless without additional details in the qualifiers.

This package converts most feature keys to appropriate Sequence Ontology terms used by GFF3 and SBOL. Non-standard qualifiers are repaired or removed.
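As a rough illustration of the key translation described above, the sketch below maps a handful of GenBank feature keys to Sequence Ontology terms. The table `SO_TERMS`, the function `feature_key_to_so_term`, and the `sequence_feature` fallback for unknown keys are assumptions made for this example; they are not taken from gbgb, whose own mapping covers far more keys and may treat unknown keys differently.

```python
# Illustrative subset of a GenBank-key -> Sequence Ontology mapping.
# These names are hypothetical and are not gbgb's internal tables.
SO_TERMS = {
    "-10_signal": "minus_10_signal",
    "-35_signal": "minus_35_signal",
    "RBS": "ribosome_entry_site",
    "promoter": "promoter",
    "terminator": "terminator",
    "CDS": "CDS",
}

# Assumption: keys that cannot be mapped fall back to the generic SO term.
FALLBACK_TERM = "sequence_feature"


def feature_key_to_so_term(key):
    """Return the SO term for a GenBank feature key, or the generic fallback."""
    return SO_TERMS.get(key, FALLBACK_TERM)


print(feature_key_to_so_term("-10_signal"))
print(feature_key_to_so_term("my_made_up_key"))
```

A lookup table keeps the translation explicit and easy to audit, which matters when many of the incoming keys were invented by the tools that wrote the file.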

Goodbye, GenBank is intended for those who wish to clean up their GenBank files and then transition to a different format. The philosophy of this project is to salvage what is salvageable and to discard what is not. GenBank feature types are translated to Sequence Ontology terms; qualifiers are converted into a reduced set that contains only the parts that are not broken. Qualifiers are also converted to their correct type: int for integers, list only for qualifiers that can appear multiple times, bool for flags.
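The type conversion described above can be sketched in plain Python. Biopython typically stores every qualifier value as a list of strings, so the sketch starts from that shape. The three classification sets and the function `normalize_qualifier_types` are hypothetical and chosen only for illustration; which qualifiers gbgb actually treats as integers, flags, or lists may differ.

```python
# Hypothetical classification of qualifiers; not gbgb's real tables.
INT_QUALIFIERS = {"codon_start", "transl_table"}
FLAG_QUALIFIERS = {"pseudo", "ribosomal_slippage"}
MULTI_QUALIFIERS = {"db_xref"}


def normalize_qualifier_types(qualifiers):
    """Return a new dict with each qualifier coerced to a natural Python type."""
    result = {}
    for name, values in qualifiers.items():
        # Accept both list-of-strings input and already-scalar input.
        if not isinstance(values, list):
            values = [values]
        if name in FLAG_QUALIFIERS:
            # Flags carry no value; their presence is the information.
            result[name] = True
        elif name in INT_QUALIFIERS:
            try:
                result[name] = int(values[0])
            except (ValueError, TypeError, IndexError):
                # Not salvageable as an integer, so it is discarded.
                continue
        elif name in MULTI_QUALIFIERS:
            result[name] = list(values)
        elif values:
            # Single-valued qualifier: keep only the first value.
            result[name] = values[0]
    return result
```

In this sketch a `codon_start` of `['1']` becomes the integer `1`, a `pseudo` entry becomes `True`, and a `transl_table` that cannot be parsed as an integer is dropped rather than kept in a broken state.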

Moreover, different options are available to configure what is kept and what is thrown away.

Installation

pip install gbgb

Example

>>> feature
SeqFeature(FeatureLocation(ExactPosition(2931), ExactPosition(2936), strand=1), type='-10_signal')
>>> feature.qualifiers
{'ApEinfo_fwdcolor': ['pink'],
 'ApEinfo_graphicformat': ['arrow_data {{0 1 2 0 0 -1} {} 0} width 5 offset 0'],
 'ApEinfo_revcolor': ['pink'],
 'label': ['RNAII Promoter (-10 signal)']}
>>>
>>> from gbgb import convert_feature
>>> feature = convert_feature(feature)
>>> feature
SeqFeature(FeatureLocation(ExactPosition(2931), ExactPosition(2936), strand=1), type='minus_10_signal')
>>> feature.qualifiers
{'note': 'RNAII Promoter (-10 signal)'}
>>>
>>> from gbgb import genbank_feature_key
>>> genbank_feature_key('minus_10_signal')
'regulatory'

Documentation

gbgb

gbgb.maps.qualifiers

The following are transformation functions. They take two identical qualifiers objects, before and after, and update after to reflect the transformed qualifiers. All of these functions return after.
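A minimal sketch of the before/after convention follows. The function names `drop_ape_qualifiers`, `label_to_note`, and `apply_transformations` are hypothetical and do not come from gbgb.maps.qualifiers; the rules they implement are guessed from the example earlier on this page, where the `ApEinfo_*` qualifiers disappear and `label` becomes `note`.

```python
import copy


def drop_ape_qualifiers(before, after):
    """Remove editor-specific ApEinfo_* qualifiers (hypothetical rule)."""
    for name in before:
        if name.startswith("ApEinfo_"):
            after.pop(name, None)
    return after


def label_to_note(before, after):
    """Move a non-standard 'label' qualifier into 'note' (hypothetical rule)."""
    if "label" in before:
        value = before["label"]
        text = value[0] if isinstance(value, list) else value
        after.pop("label", None)
        # Do not overwrite a note that is already present.
        after.setdefault("note", text)
    return after


def apply_transformations(qualifiers, transformations):
    """Run each transformation on the same 'before' and a shared 'after'."""
    before = copy.deepcopy(qualifiers)
    after = copy.deepcopy(qualifiers)  # starts out identical to 'before'
    for transform in transformations:
        after = transform(before, after)
    return after
```

Because every function reads from an untouched `before` and writes only to `after`, the order in which transformations run does not change what each one sees as its input.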

The transformation function has one additional argument:

gbgb.maps.features

gbgb.utils

Design considerations

For the most part, Goodbye, GenBank attempts to be idempotent, i.e. features and their types/keys and qualifiers can be safely transformed any number of times with the same settings. The apparent mismatch between the conversion to Sequence Ontology feature terms and valid/fixed GenBank qualifiers is to simplify downstream processing. It is up to the users which qualifiers they wish to keep, but at least the choices they are given are reasonable.
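Idempotence here means that converting an already converted value changes nothing. The sketch below checks that property for a toy key conversion; `KEY_TO_SO`, `to_so_term`, and `is_idempotent` are hypothetical helpers written for this illustration and are not part of gbgb.

```python
# Toy mapping: unknown keys, including SO terms themselves, pass through unchanged.
KEY_TO_SO = {
    "-10_signal": "minus_10_signal",
    "-35_signal": "minus_35_signal",
}


def to_so_term(key):
    """Convert a GenBank key to an SO term; already converted keys are kept."""
    return KEY_TO_SO.get(key, key)


def is_idempotent(transform, value):
    """True if applying transform twice gives the same result as applying it once."""
    once = transform(value)
    return transform(once) == once


print(is_idempotent(to_so_term, "-10_signal"))
```

The same kind of check could be pointed at a real conversion function, provided the objects it returns can be compared for equality.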

Authors and Contributors

Goodbye, GenBank is written and maintained by Lars Schöning (@lyschoening).

Support or Contact

If you have any questions or suggestions or if you have found a unique new specimen of GenBank files that you would like to convert, please open an issue.