presto.Annotation
Annotation functions
- presto.Annotation.addHeader(header, fields, values, delimiter=('|', '=', ','))
Adds fields and values to a sequence header
- Parameters:
header – an annotation dictionary returned by parseAnnotation.
fields – the list of fields to add or append to.
values – the list of annotation values to add for each field.
delimiter – a tuple of delimiters for (fields, values, value lists).
- Returns:
modified header dictionary.
- Return type:
- presto.Annotation.annotationConsensus(seq_iter, field, delimiter=('|', '=', ','))
Calculate a consensus annotation for a set of sequences
- Parameters:
seq_iter – an iterator or list of SeqRecord objects
field – the annotation field to take a consensus of
delimiter – a tuple of delimiters for (annotations, field/values, value lists)
- Returns:
Dictionary with keys:
set – a list of unique annotation values.
count – the annotation counts.
cons – the consensus annotation.
freq – the majority annotation frequency.
- Return type:
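The relationship between the four returned keys can be illustrated with a minimal sketch. It works on plain description strings rather than SeqRecord objects, the helper name `consensus_sketch` and the sample headers are hypothetical, and it is not pRESTO's implementation; details such as value ordering and tie-breaking are assumptions.

```python
from collections import Counter

def consensus_sketch(descriptions, field, delimiter=('|', '=', ',')):
    """Tally one annotation field across 'ID|FIELD=value' description strings."""
    values = []
    for desc in descriptions:
        # Skip the leading ID; the remaining chunks look like FIELD=value
        for chunk in desc.split(delimiter[0])[1:]:
            key, _, value = chunk.partition(delimiter[1])
            if key == field:
                values.extend(value.split(delimiter[2]))
    counts = Counter(values)
    cons, top = counts.most_common(1)[0]
    unique = sorted(counts)  # sorted order is an assumption of this sketch
    return {'set': unique,
            'count': [counts[v] for v in unique],
            'cons': cons,
            'freq': top / len(values)}

result = consensus_sketch(
    ['SEQ1|PRIMER=IGHM', 'SEQ2|PRIMER=IGHM', 'SEQ3|PRIMER=IGHG'], 'PRIMER')
print(result)
```

Here `cons` is the most common value and `freq` is its share of all observed values.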
- presto.Annotation.collapseAnnotation(ann_dict, action, fields=None, delimiter=('|', '=', ','))
Collapses multiple annotations into new single annotations for each field
- Parameters:
ann_dict – dictionary of field/value pairs
action – collapse action to take; one of {min, max, sum, first, last, set, cat}
fields – subset of ann_dict to collapse; if None, collapse all but the ID field
delimiter – Tuple of delimiters for (fields, values, value lists)
- Returns:
Modified field dictionary
- Return type:
OrderedDict
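The collapse actions can be sketched as follows. This is an illustration under stated assumptions, not pRESTO's code: the helper name `collapse_sketch` is hypothetical, the ID field is assumed to be stored under the key `'ID'`, min/max/sum are assumed to treat values as numbers, and set/cat are assumed to mean "unique values" and "string concatenation".

```python
from collections import OrderedDict

def collapse_sketch(ann_dict, action, fields=None, delimiter=('|', '=', ',')):
    """Collapse the value list of each selected field (illustrative only)."""
    numeric = {'min': min, 'max': max, 'sum': sum}
    if fields is None:
        fields = [k for k in ann_dict if k != 'ID']
    out = OrderedDict(ann_dict)  # work on a copy, leave the input untouched
    for field in fields:
        values = str(ann_dict[field]).split(delimiter[2])
        if action in numeric:
            out[field] = numeric[action](float(v) for v in values)
        elif action == 'first':
            out[field] = values[0]
        elif action == 'last':
            out[field] = values[-1]
        elif action == 'set':
            # dict.fromkeys drops duplicates while keeping first-seen order
            out[field] = delimiter[2].join(dict.fromkeys(values))
        elif action == 'cat':
            out[field] = ''.join(values)
        else:
            raise ValueError('unknown action: %s' % action)
    return out

header = OrderedDict([('ID', 'SEQ1'), ('COUNT', '3,5,2'), ('PRIMER', 'A,B,A')])
print(collapse_sketch(header, 'sum', fields=['COUNT']))
print(collapse_sketch(header, 'set', fields=['PRIMER']))
```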
- presto.Annotation.collapseHeader(header, fields, actions, delimiter=('|', '=', ','))
Collapses a sequence header
- Parameters:
header – an annotation dictionary returned by parseAnnotation.
fields – the list of fields to collapse.
actions – the list of collapse actions to take; one of (max, min, sum, first, last, set, cat) for each field.
delimiter – a tuple of delimiters for (fields, values, value lists).
- Returns:
modified header dictionary.
- Return type:
- presto.Annotation.convert454Header(desc)
Parses 454 headers into the pRESTO format
- Parameters:
desc (str) – a sequence description string.
- Returns:
a dictionary of header field and value pairs.
- Return type:
Examples
New style 454 header:
@<accession> <length=##>
@GXGJ56Z01AE06X length=222
Old style 454 header:
@<rank_x_y> <length=##> <uaccno=accession>
@000034_0199_0169 length=437 uaccno=GNDG01201ARRCR
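The two 454 layouts can be told apart with a pair of regular expressions. This is a minimal sketch, not the parser pRESTO uses; the function name `match_454` is hypothetical and the group names simply mirror the placeholders in the templates above, so they may differ from the field names in the returned dictionary.

```python
import re

# Group names follow the template placeholders, not pRESTO's output fields
NEW_454 = re.compile(r'^@?(?P<accession>\S+)\s+length=(?P<length>\d+)$')
OLD_454 = re.compile(
    r'^@?(?P<rank_x_y>\S+)\s+length=(?P<length>\d+)\s+uaccno=(?P<uaccno>\S+)$')

def match_454(desc):
    """Return the named groups for whichever 454 pattern matches, else None."""
    for pattern in (OLD_454, NEW_454):
        match = pattern.match(desc)
        if match:
            return match.groupdict()
    return None

print(match_454('@GXGJ56Z01AE06X length=222'))
print(match_454('@000034_0199_0169 length=437 uaccno=GNDG01201ARRCR'))
```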
- presto.Annotation.convertGenbankHeader(desc, delimiter=('|', '=', ','))
Converts GenBank and RefSeq headers into the pRESTO format
- Parameters:
- Returns:
a dictionary of header field and value pairs.
- Return type:
Examples
New style GenBank header:
<accession>.<version> <description>
>CM000663.2 Homo sapiens chromosome 1, GRCh38 reference primary assembly
Old style GenBank header:
gi|<GI record number>|<dbsrc>|<accession>.<version>|<description>
>gi|568336023|gb|CM000663.2| Homo sapiens chromosome 1, GRCh38 reference primary assembly
- presto.Annotation.convertGenericHeader(desc, delimiter=('|', '=', ','))
Converts any header to the pRESTO format
- presto.Annotation.convertIMGTHeader(desc, simple=False)
Converts germline headers from IMGT/GENE-DB into the pRESTO format
- Parameters:
- Returns:
a dictionary of header field and value pairs.
- Return type:
Examples
IMGT header:
>X60503|IGHV1-18*02|Homo sapiens|F|V-REGION|142..417|276 nt|1| | | | |276+24=300|partial in 3'| |
Header contains 15 fields separated by | (http://imgt.org/genedb):
IMGT/LIGM-DB accession number(s).
Gene and allele name.
Species.
Functionality.
Exon(s), region name(s), or extracted label(s).
Start and end positions in the IMGT/LIGM-DB accession number(s).
Number of nucleotides in the IMGT/LIGM-DB accession number(s).
Codon start, or ‘NR’ (not relevant) for non coding labels and out-of-frame pseudogenes.
Number of nucleotides added in 5' compared to the corresponding label extracted from IMGT/LIGM-DB.
Number of nucleotides added or removed in 3' compared to the corresponding label extracted from IMGT/LIGM-DB.
Number of added, deleted, and/or substituted nucleotides to correct sequencing errors, or ‘not corrected’ if non corrected sequencing errors.
Number of amino acids (AA). This field indicates that the sequence is in amino acids.
Number of characters in the sequence. Nucleotides (or AA) plus IMGT gaps.
Partial (if it is).
Reverse complementary (if it is).
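Because the 15 fields are positional, the example header can be split on `|` and paired with labels. The sketch below is only an illustration: the label names in `IMGT_FIELDS` and the helper `split_imgt` are hypothetical and are not the keys that convertIMGTHeader returns.

```python
# Hypothetical labels, one per positional IMGT field described above
IMGT_FIELDS = [
    'accession', 'allele', 'species', 'functionality', 'region',
    'positions', 'nt_length', 'codon_start', 'nt_added_5prime',
    'nt_changed_3prime', 'nt_corrected', 'aa_length', 'char_length',
    'partial', 'reverse_complement',
]

def split_imgt(desc):
    """Map the pipe-separated IMGT fields onto the labels above."""
    parts = [p.strip() for p in desc.lstrip('>').split('|')]
    # A trailing '|' leaves one empty element after the 15th field; drop it
    return dict(zip(IMGT_FIELDS, parts[:len(IMGT_FIELDS)]))

example = (">X60503|IGHV1-18*02|Homo sapiens|F|V-REGION|142..417|276 nt|1"
           "| | | | |276+24=300|partial in 3'| |")
fields = split_imgt(example)
print(fields['allele'], fields['species'], fields['partial'])
```

Fields that IMGT leaves blank, such as the amino acid count for a nucleotide sequence, come back as empty strings in this sketch.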
- presto.Annotation.convertIlluminaHeader(desc)
Converts Illumina headers into the pRESTO format
- Parameters:
desc (str) – a sequence description string.
- Returns:
a dictionary of header field and value pairs.
- Return type:
Examples
New style Illumina header:
@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read number>:<is filtered>:<control number>:<index sequence>
@MISEQ:132:000000000-A2F3U:1:1101:14340:1555 2:N:0:ATCACG
Old style Illumina header:
@<instrument>:<flowcell lane>:<tile>:<x-pos>:<y-pos>#<index sequence>/<read number>
@HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
@MS6_33112:1:1101:18371:1066/1
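The two Illumina layouts above can be recognized with regular expressions built directly from the templates. This is a sketch rather than pRESTO's own parser; `match_illumina` and the group names are hypothetical, and the index is treated as optional in the old style because the second old-style example has none.

```python
import re

NEW_ILLUMINA = re.compile(
    r'^@?(?P<instrument>[^:]+):(?P<run>\d+):(?P<flowcell>[^:]+):(?P<lane>\d+):'
    r'(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)\s+'
    r'(?P<read>\d+):(?P<filtered>[YN]):(?P<control>\d+):(?P<index>\S+)$')
OLD_ILLUMINA = re.compile(
    r'^@?(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)'
    r'(?:#(?P<index>[^/]+))?/(?P<read>\d+)$')

def match_illumina(desc):
    """Return (style, fields) for the first pattern that matches."""
    for style, pattern in (('new', NEW_ILLUMINA), ('old', OLD_ILLUMINA)):
        match = pattern.match(desc)
        if match:
            return style, match.groupdict()
    return None, {}

for desc in ('@MISEQ:132:000000000-A2F3U:1:1101:14340:1555 2:N:0:ATCACG',
             '@HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1',
             '@MS6_33112:1:1101:18371:1066/1'):
    print(match_illumina(desc))
```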
- presto.Annotation.convertMIGECHeader(desc)
Parses headers from the MIGEC tool into the pRESTO format
- Parameters:
desc (str) – a sequence description string.
- Returns:
a dictionary of header field and value pairs.
- Return type:
Examples
MIGEC header:
@MIG UMI:<UMI sequence>:<consensus read count>
@MIG UMI:TCGGCCAACAAA:8
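As a rough illustration of the MIGEC layout, the UMI and consensus read count can be pulled out with a single pattern. The group names are hypothetical, and restricting the UMI to the characters A, C, G, T, and N is an assumption of this sketch.

```python
import re

# Pattern derived from the template above; not pRESTO's own expression
MIGEC = re.compile(r'^@?MIG\s+UMI:(?P<umi>[ACGTN]+):(?P<count>\d+)$')

match = MIGEC.match('@MIG UMI:TCGGCCAACAAA:8')
migec_fields = match.groupdict() if match else {}
print(migec_fields)
```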
- presto.Annotation.convertSRAHeader(desc)
Parses NCBI SRA or EMBL-EBI ENA headers into the pRESTO format
- Parameters:
desc (str) – a sequence description string.
- Returns:
a dictionary of header field and value pairs.
- Return type:
Examples
Header from fastq-dump --split-files:
@<accession>.<spot> <original sequence description> <length=#>
@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
@SRR1383326.1 1 length=250
Header from fastq-dump --split-files -I:
@<accession>.<spot>.<read number> <original sequence description> <length=#>
@SRR1383326.1.1 1 length=250
Header from ENA:
@<accession>.<spot> <original sequence description>
@ERR220397.1 HKSQ1MM01DXT2W/3
@ERR346596.1 BS-DSFCONTROL04:4:000000000-A3F0Y:1:1101:12758:1640/1
@ERR346596.1 BS-DSFCONTROL04:4:000000000-A3F0Y:1:1101:12758:1640/2
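The three SRA/ENA variants differ only in the optional read number and the optional trailing length, so one pattern with optional groups can cover them. This is a sketch under assumptions: the accession is assumed to be uppercase letters followed by digits, and the group names are hypothetical rather than pRESTO's output fields.

```python
import re

SRA = re.compile(
    r'^@?(?P<accession>[A-Z]+\d+)\.(?P<spot>\d+)(?:\.(?P<read>\d+))?'
    r'\s+(?P<description>.*?)(?:\s+length=(?P<length>\d+))?$')

def match_sra(desc):
    """Return the named groups for an SRA/ENA header, or None if it does not match."""
    match = SRA.match(desc)
    return match.groupdict() if match else None

for desc in ('@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36',
             '@SRR1383326.1.1 1 length=250',
             '@ERR220397.1 HKSQ1MM01DXT2W/3'):
    print(match_sra(desc))
```

Groups that are absent from a given header, such as `read` without `-I` or `length` for ENA, are returned as `None`.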
- presto.Annotation.copyHeader(header, fields, names, actions=None, delimiter=('|', '=', ','))
Copies fields in a sequence header
- Parameters:
header – an annotation dictionary returned by parseAnnotation.
fields – a list of the field names to copy.
names – a list of the new field names.
actions – the list of collapse actions to take after the copy; one of (max, min, sum, first, last, set, cat) for each field.
delimiter – a tuple of delimiters for (fields, values, value lists).
- Returns:
modified header dictionary.
- Return type:
- presto.Annotation.deleteHeader(header, fields, delimiter=('|', '=', ','))
Deletes fields from a sequence header
- Parameters:
header – an annotation dictionary returned by parseAnnotation.
fields – the list of fields to delete.
delimiter – a tuple of delimiters for (fields, values, value lists).
- Returns:
modified header dictionary
- Return type:
- presto.Annotation.expandHeader(header, fields, separator=',', delimiter=('|', '=', ','))
Splits an annotation value into separate fields in a sequence header
- Parameters:
header – an annotation dictionary returned by parseAnnotation.
fields – the field to split.
separator – the delimiter to split the values by.
delimiter – a tuple of delimiters for (fields, values, value lists).
- Returns:
modified header dictionary.
- Return type:
- presto.Annotation.flattenAnnotation(ann_dict, delimiter=('|', '=', ','))
Converts annotations from a dictionary to a FASTA/FASTQ sequence description
- Parameters:
ann_dict – Dictionary of field/value pairs
delimiter – Tuple of delimiters for (fields, values, value lists)
- Returns:
Formatted sequence description string
- Return type:
- presto.Annotation.getAnnotationValues(seq_iter, field, unique=False, delimiter=('|', '=', ','))
Gets the annotation values for a field across a set of sequences
- Parameters:
seq_iter – Iterator or list of SeqRecord objects
field – Annotation field to retrieve values for
unique – If True return a list of only the unique values; if False return a list of all values
delimiter – Tuple of delimiters for (fields, values, value lists)
- Returns:
List of values for the field
- Return type:
- presto.Annotation.getCoordKey(header, coord_type='presto', delimiter=('|', '=', ','))
Return the coordinate identifier for a sequence description
- Parameters:
- Returns:
Coordinate identifier as a string.
- Return type:
- presto.Annotation.mergeAnnotation(ann_dict_1, ann_dict_2, prepend=False, delimiter=('|', '=', ','))
Merges non-ID field annotations from one field dictionary into another
- Parameters:
ann_dict_1 – Dictionary of field/value pairs to append to
ann_dict_2 – Dictionary of field/value pairs to merge with ann_dict_1
prepend – If True then add ann_dict_2 values to the front of any ann_dict_1 values that are already present, rather than the default behavior of appending ann_dict_2 values.
delimiter – Tuple of delimiters for (fields, values, value lists)
- Returns:
Modified ann_dict_1 dictionary of field/value pairs
- Return type:
OrderedDict
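The effect of prepend can be shown with a small sketch. It is not pRESTO's implementation: `merge_sketch` is a hypothetical name, the ID field is assumed to live under the key `'ID'`, and the sketch returns a copy rather than modifying ann_dict_1.

```python
from collections import OrderedDict

def merge_sketch(ann_dict_1, ann_dict_2, prepend=False, delimiter=('|', '=', ',')):
    """Merge non-ID fields of ann_dict_2 into a copy of ann_dict_1 (illustrative)."""
    merged = OrderedDict(ann_dict_1)
    for field, value in ann_dict_2.items():
        if field == 'ID':
            continue
        if field in merged:
            # Existing field: join old and new values with the value-list delimiter
            pair = (value, merged[field]) if prepend else (merged[field], value)
            merged[field] = delimiter[2].join(str(v) for v in pair)
        else:
            merged[field] = value
    return merged

a = OrderedDict([('ID', 'SEQ1'), ('PRIMER', 'A')])
b = OrderedDict([('ID', 'SEQ2'), ('PRIMER', 'B'), ('COUNT', '2')])
print(merge_sketch(a, b))                # PRIMER becomes 'A,B'
print(merge_sketch(a, b, prepend=True))  # PRIMER becomes 'B,A'
```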
- presto.Annotation.mergeHeader(header, fields, name, action=None, delete=False, delimiter=('|', '=', ','))
Merges fields in a sequence header
- Parameters:
header – an annotation dictionary returned by parseAnnotation.
fields – a list of the field names to merge.
name – the name of the new field.
delete – if True delete the merged fields.
action – the collapse action to take after the merge; one of (max, min, sum, first, last, set, cat).
delimiter – a tuple of delimiters for (fields, values, value lists)
- Returns:
modified header dictionary.
- Return type:
- presto.Annotation.parseAnnotation(record, fields=None, delimiter=('|', '=', ','))
Extracts annotations from a FASTA/FASTQ sequence description
- Parameters:
record – Description string to extract annotations from
fields – List of fields to subset the return dictionary to; if None return all fields
delimiter – a tuple of delimiters for (fields, values, value lists)
- Returns:
An OrderedDict of field/value pairs
- Return type:
OrderedDict
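The role of the three delimiters is easiest to see in a round trip between a description string and a dictionary. The sketch below is not pRESTO's code: `parse_sketch` and `flatten_sketch` are hypothetical stand-ins for parseAnnotation and flattenAnnotation, and storing the leading identifier under the key `'ID'` is an assumption.

```python
from collections import OrderedDict

def parse_sketch(record, fields=None, delimiter=('|', '=', ',')):
    """Split 'ID|FIELD=value|...' into an ordered field/value mapping."""
    chunks = record.split(delimiter[0])
    ann = OrderedDict([('ID', chunks[0])])
    for chunk in chunks[1:]:
        key, _, value = chunk.partition(delimiter[1])
        ann[key] = value
    if fields is not None:
        ann = OrderedDict((k, v) for k, v in ann.items() if k in fields)
    return ann

def flatten_sketch(ann_dict, delimiter=('|', '=', ',')):
    """Rebuild the description string from the mapping."""
    pairs = [k + delimiter[1] + str(v) for k, v in ann_dict.items() if k != 'ID']
    return delimiter[0].join([ann_dict['ID']] + pairs)

desc = 'SEQ1|PRIMER=IGHM,IGHG|COUNT=3'
ann = parse_sketch(desc)
print(ann)
print(flatten_sketch(ann) == desc)
```

Value lists such as `IGHM,IGHG` are kept as a single delimited string in this sketch.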
- presto.Annotation.parseLog(record)
Parses a pRESTO log record
- Parameters:
record (str) – a string of lines representing a log record including newline characters.
- Returns:
a dictionary of the field and value pairs parsed from the log record.
- Return type:
- presto.Annotation.renameAnnotation(ann_dict, old_field, new_field, delimiter=('|', '=', ','))
Renames an annotation and merges annotations if the new name already exists
- Parameters:
ann_dict – Dictionary of field/value pairs
old_field – Old field name
new_field – New field name
delimiter – Tuple of delimiters for (fields, values, value lists)
- Returns:
Modified field dictionary
- Return type:
OrderedDict
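The rename-then-merge behavior can be sketched as below. The helper `rename_sketch` is hypothetical and not pRESTO's implementation; in particular, joining the colliding values with the value-list delimiter in dictionary order is an assumption.

```python
from collections import OrderedDict

def rename_sketch(ann_dict, old_field, new_field, delimiter=('|', '=', ',')):
    """Rename old_field to new_field, joining values if new_field already exists."""
    out = OrderedDict()
    for field, value in ann_dict.items():
        target = new_field if field == old_field else field
        if target in out:
            # Name collision: keep both values as a delimited list
            out[target] = delimiter[2].join([str(out[target]), str(value)])
        else:
            out[target] = value
    return out

header = OrderedDict([('ID', 'SEQ1'), ('PRIMER', 'A'), ('CREGION', 'B')])
print(rename_sketch(header, 'CREGION', 'ISOTYPE'))  # plain rename
print(rename_sketch(header, 'CREGION', 'PRIMER'))   # rename with merge
```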
- presto.Annotation.renameHeader(header, fields, names, actions=None, delimiter=('|', '=', ','))
Renames fields in a sequence header
- Parameters:
header – an annotation dictionary returned by parseAnnotation.
fields – a list of the current field names.
names – a list of the new field names.
actions – the list of collapse actions to take after the rename; one of (max, min, sum, first, last, set, cat) for each field.
delimiter – a tuple of delimiters for (fields, values, value lists).
- Returns:
modified header dictionary.
- Return type: