On the impact of the pangenome and annotation discrepancies while building protein sequence databases for bacteria proteogenomics

In proteomics, peptide information within mass spectrometry (MS) data from a specific organism sample is routinely matched against a protein sequence database that best represents that organism. However, if the species/strain in the sample is unknown or genetically poorly characterized, it becomes cha...

Bibliographic details
Main authors:	Machado, Karla C. T.; Fortuin, Suereta; Tomazella, Gisele Guicardi; Fonseca, Andre F.; Warren, Robin Mark; Wiker, Harald G.; Souza, Sandro José de; Souza, Gustavo Antonio de
Format:	article
Language:	English
Published in:
Subjects:	databases; proteomics; proteogenomics; mass spectrometry; pangenome
Item address:	https://repositorio.ufrn.br/jspui/handle/123456789/27235

id	ri-123456789-27235
record_format	dspace
spelling	ri-123456789-27235 2021-07-09T22:39:11Z 2019-07-08T15:56:42Z 2019-07-08T15:56:42Z 2019-06-20 article https://repositorio.ufrn.br/jspui/handle/123456789/27235 10.3389/fmicb.2019.01410 en application/pdf
institution	Repositório Institucional
collection	RI - UFRN
language	English
topic	databases; proteomics; proteogenomics; mass spectrometry; pangenome
description	In proteomics, peptide information within mass spectrometry (MS) data from a specific organism sample is routinely matched against a protein sequence database that best represents that organism. However, if the species/strain in the sample is unknown or genetically poorly characterized, it becomes challenging to determine a database that can represent the sample. Building customized protein sequence databases that merge multiple strains of a given species has become a strategy to overcome such restrictions. However, as more genetic information becomes publicly available and genetic features such as the existence of pan- and core genes within a species are revealed, we questioned how efficiently such merging strategies report relevant information. To test this, we constructed databases containing conserved and unique sequences for 10 different species. Features that are relevant for probabilistic protein identification by proteomics were then monitored. As expected, the increase in database complexity correlates with pangenomic complexity. However, Mycobacterium tuberculosis and Bordetella pertussis generated very complex databases despite having low pangenomic complexity. We further tested database performance using MS data from eight clinical strains of M. tuberculosis and from two published datasets from Staphylococcus aureus. We show that by using an approach where database size is controlled by removing repeated identical tryptic sequences across strains/species, computational time can be reduced drastically as database complexity increases.
format	article
author	Machado, Karla C. T.; Fortuin, Suereta; Tomazella, Gisele Guicardi; Fonseca, Andre F.; Warren, Robin Mark; Wiker, Harald G.; Souza, Sandro José de; Souza, Gustavo Antonio de
author_sort	Machado, Karla C. T.
title	On the impact of the pangenome and annotation discrepancies while building protein sequence databases for bacteria proteogenomics
publishDate	2019
url	https://repositorio.ufrn.br/jspui/handle/123456789/27235
_version_	1773965844619984896
