Very long headers in the FASTA are not parsed correctly #87

@CorinYeatsCGPS

I'm not sure what the length limit is, but I have a few FASTAs with headers longer than 100 characters, which seems to cause Kleborate to fall over during the MLST stage. When I replaced the original headers with shortened versions, the FASTA was processed correctly. Simply putting a long run of digits in the header was enough to trigger the issue. It may also be worth noting that in the FASTA that triggered this issue, the first 300 characters of each record's header were identical, so the headers couldn't simply be truncated.

strain  species N50     ST      virulence_score resistance_score        num_resistance_classes  num_resistance_genes
Traceback (most recent call last):
  File "/usr/local/bin/kleborate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kleborate/__main__.py", line 154, in main
    module_results = modules[module].get_results(unzipped_assembly, minimap2_index, args, results)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kleborate/modules/klebsiella_pneumo_complex__mlst/klebsiella_pneumo_complex__mlst.py", line 73, in get_results
    st, _, alleles = mlst(assembly, minimap2_index, profiles, alleles, genes, None,
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kleborate/shared/mlst.py", line 44, in mlst
    hits_per_gene = {g: align_query_to_ref(allele_paths[g], assembly_path,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kleborate/shared/mlst.py", line 44, in <dictcomp>
    hits_per_gene = {g: align_query_to_ref(allele_paths[g], assembly_path,
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kleborate/shared/alignment.py", line 134, in align_query_to_ref
    alignments = [Alignment(x, query_seqs=query_seqs, ref_seqs=ref_seqs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kleborate/shared/alignment.py", line 134, in <listcomp>
    alignments = [Alignment(x, query_seqs=query_seqs, ref_seqs=ref_seqs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kleborate/shared/alignment.py", line 51, in __init__
    self.set_sequences(query_seqs, ref_seqs)
  File "/usr/local/lib/python3.11/site-packages/kleborate/shared/alignment.py", line 88, in set_sequences
    self.ref_seq = ref_seqs[self.ref_name][self.ref_start:self.ref_end]
                   ~~~~~~~~^^^^^^^^^^^^^^^
KeyError: '22222222222222222222222222222222222222222222222222222222222'
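
For anyone hitting the same error, here is a minimal sketch of the header-shortening workaround described above. It is not part of Kleborate: the function name `shorten_fasta_headers` and the `contig_N` naming scheme are assumptions made for illustration. Because the long headers here share their first 300 characters, the sketch assigns fresh sequential IDs rather than truncating, and it returns a mapping so the original headers can be recovered afterwards.

```python
def shorten_fasta_headers(lines, prefix="contig"):
    """Replace every FASTA header with a short, unique ID.

    `lines` is any iterable of FASTA lines (a list or an open file handle).
    Returns (new_lines, mapping), where mapping maps each short ID back to
    the original header text (without the leading '>').
    The `prefix` naming scheme is hypothetical, chosen for illustration.
    """
    new_lines = []
    mapping = {}
    count = 0
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Header line: swap in a sequential ID instead of truncating,
            # since truncated headers could collide with each other.
            count += 1
            short_id = f"{prefix}_{count}"
            mapping[short_id] = line[1:]
            new_lines.append(f">{short_id}")
        else:
            # Sequence line: pass through unchanged.
            new_lines.append(line)
    return new_lines, mapping


# Two records whose headers share a 300-character prefix, as in this report.
records = [
    ">" + "2" * 300 + " sample A",
    "ACGT",
    ">" + "2" * 300 + " sample B",
    "GGCC",
]
renamed, mapping = shorten_fasta_headers(records)
print(renamed)  # ['>contig_1', 'ACGT', '>contig_2', 'GGCC']
```

Writing `renamed` to a new file (one entry per line) gives a FASTA with short headers, and `mapping` can be used to restore the original names in the results.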
