A Python package for Biopython that gives feature annotations from GenBank records a new and better life
Goodbye, GenBank converts SeqFeature sequence annotations from NCBI GenBank records to a common, simplified format. GenBank feature annotations have a feature key and reasonably well-defined qualifiers, but non-standard and discontinued feature types and qualifiers are in common use, and the feature key is often something someone made up rather than a valid GenBank feature key. Even when a valid GenBank feature key is used, it is often incomplete and useless without additional details in the qualifiers.
This package converts most feature keys to appropriate Sequence Ontology terms used by GFF3 and SBOL. Non-standard qualifiers are repaired or removed.
Goodbye, GenBank is intended for those who wish to clean up their GenBank files and then transition to a different format. The philosophy of this project is to salvage what is salvageable and to discard what is not. GenBank feature types are translated to Sequence Ontology terms; qualifiers are converted into a reduced set that contains only the parts that are not broken. Qualifiers are also converted to their correct type: int for integers, list only for qualifiers that can appear multiple times, bool for flags.
Moreover, different options are available to configure what is kept and what is thrown away.
pip install gbgb
>>> feature
SeqFeature(FeatureLocation(ExactPosition(2931), ExactPosition(2936), strand=1), type='-10_signal')
>>> feature.qualifiers
{'ApEinfo_fwdcolor': ['pink'],
'ApEinfo_graphicformat': ['arrow_data {{0 1 2 0 0 -1} {} 0} width 5 offset 0'],
'ApEinfo_revcolor': ['pink'],
'label': ['RNAII Promoter (-10 signal)']}
>>>
>>> from gbgb import convert_feature
>>> feature = convert_feature(feature)
>>> feature
SeqFeature(FeatureLocation(ExactPosition(2931), ExactPosition(2936), strand=1), type='minus_10_signal')
>>> feature.qualifiers
{'note': 'RNAII Promoter (-10 signal)'}
>>>
>>> from gbgb import genbank_feature_key
>>> genbank_feature_key('minus_10_signal')
'regulatory'
gbgb
convert_feature_type(feature: Bio.SeqFeature) -> str
Finds a Sequence Ontology term for a GenBank feature.
This function requires a SeqFeature as opposed to just a GenBank feature key, since the type of a GenBank feature is not always fully described by its feature key. For example, a regulatory GenBank feature could have a /regulatory_class="promoter" qualifier.
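To illustrate why the qualifiers matter for the lookup, here is a minimal sketch that works on a plain feature key and qualifier dictionary. It is not the package's implementation: so_term_sketch and the abbreviated tables are hypothetical stand-ins for the real maps in gbgb.maps.features, which may contain different entries.

```python
# Hypothetical, abbreviated lookup tables for illustration only; the real
# tables live in gbgb.maps.features and may differ.
REGULATORY_CLASS_SO_TERMS = {
    "promoter": "promoter",
    "minus_10_signal": "minus_10_signal",
    "terminator": "terminator",
}
REGULATORY_DEFAULT_SO_TERM = "regulatory_region"
FEATURE_KEY_SO_TERMS = {"CDS": "CDS", "gene": "gene"}
DEFAULT_SO_TERM = "sequence_feature"


def so_term_sketch(feature_key, qualifiers):
    """Pick a Sequence Ontology term from a feature key plus its qualifiers."""
    if feature_key == "regulatory":
        classes = qualifiers.get("regulatory_class", [])
        if isinstance(classes, str):
            classes = [classes]
        # The first recognized /regulatory_class value refines the term.
        for regulatory_class in classes:
            if regulatory_class in REGULATORY_CLASS_SO_TERMS:
                return REGULATORY_CLASS_SO_TERMS[regulatory_class]
        # No usable class: fall back to the generic regulatory term.
        return REGULATORY_DEFAULT_SO_TERM
    # Other keys map directly; unknown keys get the most generic term.
    return FEATURE_KEY_SO_TERMS.get(feature_key, DEFAULT_SO_TERM)


print(so_term_sketch("regulatory", {"regulatory_class": ["promoter"]}))
print(so_term_sketch("regulatory", {}))
print(so_term_sketch("made_up_key", {}))
```

The feature key alone ("regulatory") would only yield the generic term; the qualifier is what makes the more specific term possible.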
Returns a Sequence Ontology term for the type of this feature
convert_feature(feature: Bio.SeqFeature, qualifiers=gbgb.maps.qualifiers.DEFAULT_QUALIFIER_TRANSFORMERS) -> Bio.SeqFeature
Returns a Bio.SeqFeature with valid GenBank qualifiers and a feature type that is a Sequence Ontology term.
unconvert_feature(feature: Bio.SeqFeature) -> Bio.SeqFeature
Restores a feature to its sad former shape.
To "clean up" a GenBank feature while maintaining its format, use unconvert_feature(convert_feature(feature)).
Returns a SeqFeature that has a GenBank feature key and whose qualifiers are each a list.
genbank_feature_key(sequence_feature_term: str, idempotent=True) -> str
Returns a GenBank feature key for the provided Sequence Ontology term
gbgb.maps.qualifiers
The following are transformation functions. Each takes two initially identical qualifier dictionaries, before and after, and updates after to reflect the transformed qualifiers. All of these functions return after.
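The before/after contract can be sketched with plain dictionaries. The functions below are hypothetical illustrations, not the package's own code, and apply_transformers_sketch assumes that before stays an untouched snapshot of the input while after is threaded through each transformer; how the package chains its transformers internally is not documented here.

```python
def rename_label_to_note_sketch(before, after):
    """Move any /label values into /note (illustrative only)."""
    labels = before.get("label")
    if labels:
        after.pop("label", None)
        # Build a new list so the caller's input lists are never mutated.
        after["note"] = list(after.get("note", [])) + list(labels)
    return after


def drop_ape_qualifiers_sketch(before, after):
    """Remove qualifiers in the 'ApEinfo_' namespace (illustrative only)."""
    for name in before:
        if name.startswith("ApEinfo_"):
            after.pop(name, None)
    return after


def apply_transformers_sketch(qualifiers, transformers):
    """Run each transformer with the same `before` and an evolving `after`."""
    before = dict(qualifiers)  # assumption: read-only snapshot of the input
    after = dict(qualifiers)
    for transformer in transformers:
        after = transformer(before, after)
    return after


cleaned = apply_transformers_sketch(
    {"label": ["RNAII Promoter (-10 signal)"], "ApEinfo_fwdcolor": ["pink"]},
    (rename_label_to_note_sketch, drop_ape_qualifiers_sketch),
)
print(cleaned)
```

This sketch keeps every value as a list; in the package, reducing single-valued qualifiers to a plain value is a separate formatting step.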
remove_protein_id_and_add_to_xrefs(before: dict, after: dict) -> dict
Protein IDs are codes such as "AAF19666.1", which come from "International collaborators" and should all be on GenBank. This transformation function removes the /protein_id="" qualifier and adds a /db_xref="" qualifier instead.
Removes any malformed protein IDs as they are useless.
rename_label_to_note(before: dict, after: dict) -> dict
The /label="" qualifier was discontinued in 2010, but is still used frequently. This function renames it to /note="". See GenBank Release 180.
GENBANK_SINGLE_QUALIFIERS: [str]
format_single_qualifiers(before: dict, after: dict) -> dict
GENBANK_INTEGER_QUALIFIERS: [str]
format_integer_qualifiers(before: dict, after: dict) -> dict
GENBANK_FLAG_QUALIFIERS: [str]
format_flag_qualifiers(before: dict, after: dict) -> dict
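The three formatting transformers correspond to the type conversion described in the introduction: int for integers, bool for flags, and list only for qualifiers that may repeat. A rough sketch of that idea follows, folded into one function for brevity. The qualifier name sets are small hypothetical examples, not the contents of the GENBANK_* lists, and dropping unparsable integers is an assumption.

```python
# Hypothetical example sets; the package's GENBANK_* lists are longer
# and may classify qualifiers differently.
INTEGER_QUALIFIERS = {"codon_start", "transl_table"}
FLAG_QUALIFIERS = {"pseudo", "ribosomal_slippage"}
REPEATABLE_QUALIFIERS = {"db_xref"}


def format_qualifiers_sketch(qualifiers):
    """Coerce list-valued qualifiers to int, bool, list or a single value."""
    formatted = {}
    for name, value in qualifiers.items():
        values = list(value) if isinstance(value, (list, tuple)) else [value]
        if name in FLAG_QUALIFIERS:
            formatted[name] = True  # presence of a flag is what matters
        elif name in REPEATABLE_QUALIFIERS:
            formatted[name] = values
        elif not values:
            continue  # nothing to keep
        elif name in INTEGER_QUALIFIERS:
            try:
                formatted[name] = int(values[0])
            except (TypeError, ValueError):
                continue  # assumption: unparsable integers are discarded
        else:
            formatted[name] = values[0]  # single-valued: keep the first
    return formatted


raw = {
    "codon_start": ["1"],
    "pseudo": [""],
    "db_xref": ["taxon:562"],
    "product": ["beta-galactosidase"],
}
print(format_qualifiers_sketch(raw))
```

Applying the function to its own output leaves it unchanged, which is the kind of idempotence the package aims for.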
APE_QUALIFIER_NAMESPACE = 'ApEinfo_'
remove_ape_a_plasmid_editor_qualifiers(before: dict, after: dict) -> dict
GENE_RELATED_FEATURES: [str]
GENBANK_QUALIFIER_FEATURE_KEYS: {str: [str]}
GENBANK_QUALIFIERS: [str]
remove_unrecognized_qualifiers(before: dict, after: dict) -> dict
The following transformation function takes one additional argument:
remove_qualifiers_inappropriate_for_feature(before: dict, after: dict, genbank_feature_key: str) -> dict
Removes all qualifiers that are not valid for a certain genbank feature key.
DEFAULT_QUALIFIER_TRANSFORMERS: [callable]
DEFAULT_QUALIFIER_TRANSFORMERS = (
format_single_qualifiers,
format_integer_qualifiers,
format_flag_qualifiers,
remove_protein_id_and_add_to_xrefs,
rename_label_to_note,
remove_unrecognized_qualifiers,
)
gbgb.maps.features
DEFAULT_SO_TERM = 'sequence_feature'
GENBANK_DISCONTINUED_FEATURE_KEY_SO_TERMS: {str: str}
GENBANK_FEATURE_KEY_SO_TERMS: {str: str}
SO_TERM_GENBANK_FEATURE_KEYS: {str: str}
GENBANK_FEATURE_KEYS: [str]
UNAMBIGUOUS_INVALID_KEY_SO_TERMS: {str: str}
GENBANK_REGULATORY_DEFAULT_SO_TERM = 'regulatory_region'
GENBANK_REGULATORY_CLASS_SO_TERMS: {str: str}
GENBANK_NC_RNA_DEFAULT_SO_TERM = 'ncRNA_gene'
GENBANK_NC_RNA_CLASS_SO_TERMS: {str: str}
GENBANK_MOBILE_ELEMENT_DEFAULT_SO_TERM = 'mobile_genetic_element'
GENBANK_MOBILE_ELEMENT_TYPE_SO_TERMS: {str: str}
GENBANK_REPEAT_REGION_DEFAULT_SO_TERM = 'repeat_region'
GENBANK_REPEAT_TYPE_SO_TERMS: {str: str}
GENBANK_PSEUDOGENE_DEFAULT_SO_TERM = 'pseudogene'
GENBANK_PSEUDOGENE_TYPE_SO_TERMS: {str: str}
DEFAULT_GENBANK_FEATURE_KEY = 'misc_feature'
gbgb.utils
single(qualifiers: dict, name: str, on_multiple_ignore: bool = False)
Converts qualifiers[name] from a list-like object to a single item. If on_multiple_ignore is True, skips qualifiers that are defined multiple times; otherwise, returns the first value. If qualifiers[name] is not list-like, returns it.
Returns a single value for the qualifier; None if the qualifier has no value.
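The behaviour described above can be approximated as follows. This is a sketch rather than the package's code: single_sketch is a hypothetical name, and it assumes that "skipping" a qualifier with multiple values means returning None.

```python
def single_sketch(qualifiers, name, on_multiple_ignore=False):
    """Reduce qualifiers[name] to one value, or None if there is none."""
    value = qualifiers.get(name)
    if isinstance(value, (list, tuple)):
        if not value:
            return None  # empty list: the qualifier has no value
        if len(value) > 1 and on_multiple_ignore:
            return None  # assumption: "skip" is modelled as None
        return value[0]
    # Not list-like (including a missing qualifier): return it unchanged.
    return value


print(single_sketch({"note": ["first", "second"]}, "note"))
print(single_sketch({"note": ["first", "second"]}, "note", on_multiple_ignore=True))
print(single_sketch({"note": "already single"}, "note"))
```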
as_set(qualifiers) -> set
as_list(qualifiers) -> list
For the most part, Goodbye, GenBank attempts to be idempotent, i.e. features and their types/keys and qualifiers can be safely transformed any number of times with the same settings. The apparent mismatch between the conversion to Sequence Ontology feature terms and valid/fixed GenBank qualifiers is to simplify downstream processing. It is up to the users which qualifiers they wish to keep, but at least the choices they are given are reasonable.
Goodbye, GenBank is written and maintained by Lars Schöning (@lyschoening).
If you have any questions or suggestions or if you have found a unique new specimen of GenBank files that you would like to convert, please open an issue.