16
This is Luca Cappelletti's DBGI daily open-notebook.
Today is 2023.08.16
Doing
Working on fourth module of DBGI pipeline.
Output of the fourth module
Compound identification
A TSV document called compound_identification.tsv
.
The column called molecular formula contains the best hit for the molecular structure of the particular sample obtained from Sirius. The Sirius system uses a large database from PubChem, so we could find contaminants compounds such as pesticides.
NOTE: we can lookup the information about the molecule using the inchkey on PubChem, and by using the available infodump we can mine several additional information about the molecule. Analogously, we can ingest CHEBI for the same goals.
Zip files
The several ZIP files produces by the fourth module are produced by Sirius, they are annotations for each of the samples and feature within each sample. These ZIP documents can generally be ignored for the purpose of the DBGI pipeline.
Canopus compound summary
A TSV document called canopus_compound_summary.tsv
.
Canopus is less rigid, and can identify compounds that are not present in the PubChem database. The columns referring to canopus are called NPC#pathway, NPC#pathwat probability, NPC@#superclass, NPC#superclass probability, NPC#class and NPC#class probability.
Working on Fifth module, meta analysis
The fifth module of the pipeline retrieves the WikiData IDs associated
Note on possibility of siamese network or similarity tree
We can reasonably create a siamese network to compute an embedding that captures all the cosine similarities of the specters computed in the MGFs. Also, we can reasonably directly build a similarity tree.