-
Couldn't load subscription status.
- Fork 4
New chemistry module API #2
base: main
Are you sure you want to change the base?
Conversation
* introduce new traits abstracting amino acid definition & retrieval * add new types stolen from rustyms: Element, Composition (replacing Formula), Glycan types, CustomError
| } | ||
|
|
||
| impl FromStr for Element { | ||
| type Err = anyhow::Error; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't it more common for a library to use thiserror to avoid requiring its users to use a specific error abstraction library?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a fair point and I was also concerned by this.
I think we could dedicate a PR to define accurate and consistent error types.
| // FIXME: how to implement serde::{Deserialize, Serialize} for Arc<[u8]>??? | ||
| #[derive(Clone, Debug, PartialEq, PartialOrd, Serialize, Deserialize)] | ||
| pub struct LinearPeptide { | ||
| pub sequence: Arc<[u8]>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why inject the Arc into the peptide rather than wrap the whole peptide in an Arc when needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I have to confess I tried to replicate what was done in the Sage repo.
Maybe @lazear could give a rationale about this specific design.
| // Additional definitions to define an atom-based composition (mainly used for mass calculation) | ||
| use crate::chemistry::atom::AtomIsotopicVariant; | ||
|
|
||
| #[derive(Clone, Debug, PartialEq, PartialOrd, Serialize, Deserialize)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The distinction between ElementalComposition and AtomicComposition seems to be to avoid a reference to a specific isotopic abundance, which would in turn inject either a lifetime and/or a heavy data structure into every type that carried a composition, correct?
This seems to require all code that might deal with isotopes must manage conversion to AtomicComposition separately. The pattern I followed in chemical-elements has the tradeoff of needing lifetimes everywhere, but which in the simple case of a single shared global table that can be made 'static without too much labor.
What kind of API are you envisioning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The distinction between ElementalComposition and AtomicComposition seems to be to avoid a reference to a specific isotopic abundance,
Yes as well as isotopic masses
This seems to require all code that might deal with isotopes must manage conversion to AtomicComposition separately. The pattern I followed in chemical-elements has the tradeoff of needing lifetimes everywhere, but which in the simple case of a single shared global table that can be made 'static without too much labor.
Actually I adopted a similar design with my PhD in https://github.com/pinaraltiner/proteomics-rs/, which I forked in the the previous PR:
https://github.com/pinaraltiner/proteomics-rs/blob/57b57e01265afba22a58d2fa1f77a1ea7ed32939/src/chemistry/table.rs#L40
Then I found that rustyms definition of Elements could be used a nice way to of type safety for Elements identity.
So the rationale is to provide an API where identity (Element) and definitions/ properties (Atom) are implemented using different types.
This indeed require some bridges to go from Elements to Atoms.
However atoms can also be manipulated directly, and when atomic/isotopic properties are not needed, Elements can also be used independently.
Do you think it is too complex?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are contexts where isotopic information isn't needed, where I agree that it'd be too much complexity to carry along all the isotopic metadata too, but this is too devoid of use-cases for me to venture an opinion of whether the complexity is worth it. The best I can say is try and see.
There is a little leaking of abstraction though because ElementalComposition also has a mass-related attribute. Is that because ElementalComposition is also pulling double-duty as a "configuration file" representation of something that would be converted into an AtomicComposition at run time? If so, wouldn't it be plausible for something to be specified with a heavy atom (e.g. an isotopically labeled modification or AA)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes indeed, ElementalComposition adds isotope information (but accurate mass is still unspecified at this level), and this is easing conversion to AtomicComposition at runtime.
The main concern I think is how to define the monoisotopic mass of an isotopically labeled AA or peptide, since per definition it is not based on the natural isotope of elements.
One possible way to represent that is to specify the isotope index at the elemental composition level, converted then to the right isotopic variant at the atomic composition level.
Another possible way is to define a custom AtomTable where the property of the first isotope is changed. The second way is a bit hacky, but greatly simplify calculations.
The current API support both approaches. Mixing the two would lead to wrong results and should be avoided.
Not sure if it answers your concern though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would use the first approach when parsing molecular formula, and I would use the second one for mass calculation of peptide sequences. In the two cases (if we consider a given molecule) it should lead to the same AtomicComposition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And actually, regarding the second approach, it is even simpler to generate a modified AminoAcidTable. For instance 15N labeling can be implemented by calculating new AA monoisotopic masses based on 15N instead of 14N. Then the new AtomTable can be used for peptide level computation instead of the legacy ones (standard or proteinogenic).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For rustyms I used a structure which does the first technique, also at the peptide level, because there are quite some modification in PSI-MOD and Unimod which define conversions of single carbons from natural distribution to a specific isotope. The second approach seems nice when you do not have these kinds of modifications and have a uniform isotope uptake. Altough I do not think the speed of this would be very different from the way in which I did it, have the normal atomic distribution and change all normal distributed isotopes to the fixed isotope afterwards.
|
|
||
| /// A monosaccharide with all its complexity | ||
| #[derive(Debug, Clone, PartialEq, Eq, Hash, Serialize, Deserialize)] | ||
| pub struct MonoSaccharide { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's spelled "Monosaccharide"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I'll fix that
| #[derive(Debug, Clone, PartialEq, Eq, Hash, Serialize, Deserialize)] | ||
| pub struct MonoSaccharide { | ||
| pub(super) base_sugar: BaseSugar, | ||
| pub(super) substituents: Vec<GlycanSubstituent>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The concept of substituents used here treats ring backbone modifications (e.g. deoxy, ketone groups, double bonds), and sidechain substitutions (amines, acetylation, sulfation) identically, which is insufficient to support glycomics encodings found in the wild other than IUPAC three letter codes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As this comes from rusyms, I'll let @douweschulte reply to that one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have used this to read in IUPAC short glycan names in rustyms. For the ones that I actually analysed manually it gave the correct elemental composition when everything worked out in the end, I cannot give guarantee that the parsing is fully correct for all short names though. But I would be very open for suggestions, as I do not have a strong background in glycan chemistry.
| ("Fo", GlycanSubstituent::Formyl), | ||
| ("Gc", GlycanSubstituent::Glycolyl), | ||
| ("Gln2Ac", GlycanSubstituent::AcetylGlutaminyl), | ||
| ("5Glu2Me", GlycanSubstituent::MethylGlutamyl), // TODO: Does this number has to be there? (Not found in the glycosmos data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The numbers refer to which backbone carbon the substituent is attached
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For @douweschulte as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes they do. So it makes more sense to remove them where not needed. Here they only define how to read in modifications and are not saved in any way.
| ("Lt", GlycanSubstituent::Lactyl), | ||
| ("Lac", GlycanSubstituent::Lac), | ||
| ("Me", GlycanSubstituent::Methyl), | ||
| ("CMe", GlycanSubstituent::Methyl), // unsure about the difference with Me |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The C prefix here means that the substituent doesn't replace a hydroxyl group, but instead replaces the other hydrogen on the carbon it is attached to
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For @douweschulte as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Sounds like I need an additional level of complexity in there, to be able to nicely separate the replacing group and the added group.
| @@ -0,0 +1,1594 @@ | |||
| #![allow(dead_code)] | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than spend too much time implementing a full glycan representation to just support ProForma, I recommend supporting just enough to connect the ProForma-recognized symbols in the monosaccharides CV to compositions and masses, and how to traverse GNO for masses and compositions if you want that level of detail (available in the RDF datastore).
We specifically didn't push more information glycan definition into ProForma because it's way too much complexity for something few users will need, and those who do want that may well want the full linkage stereochemistry model that is overkill for a lot of MS-based workflows. It's on the order of 3x more complexity in the object model, and that's before you get into the graph data structures required to manipulate them. I wrote it once in https://github.com/mobiusklein/glypy, and it's much worse in Rust when I started translating the concepts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually my goal was mainly to fully support unimod PTM parsing in a first intent. But I have not really deeply investigated what was done precisely there.
I would vote for minimal glycan support there and extended one in a ProForma dedicated feature (living under its dedicated module).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would agree. I only added this level of detail because some colleagues wanted fragmentation spectra with internal fragments of glycans. To support that I opted to use a csv from glycosmos which provided the tree structure of the glycans in IUPAC short names to connect back to the GNO names. For unimod PTM and ProForma glycans you need way less complexity.
| } | ||
|
|
||
| // Kind of dangerous because it implicitly means that u8 <-> amino acid (it might not be desired) | ||
| impl HasMass for u8 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest to implement this on the AA indeed. My general goal for any library I write is that it should not be able to crash at all, any error should be indicated by the return type. Any end user then can decide how they want to handle the error cases. And the conversion from u8 to AA is essentially nothing except for a simple check, which I would suggest to do as soon as possible and store the AA as an AA instead of u8 where you know it has to be an AA.
|
|
||
| #[derive(Clone, Debug, PartialEq, PartialOrd, Serialize, Deserialize)] | ||
| pub struct AminoAcidDefinition { | ||
| pub code1: u8, // FIXME: shall we use AA enum type instead? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
| pub code1: u8, // FIXME: shall we use AA enum type instead? | ||
| pub code3: String, | ||
| pub name: String, | ||
| pub formula: Option<String>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would use the formula type you later define here
| } | ||
|
|
||
| /// Any item that has both a name and a symbol | ||
| // FIXME: shall we return Result<&Cow<str>>? this would avoid unwrapping in some implems |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is needed in some cases you could also think about a TryHasNameAndSymbol which provides the same behaviour but uses Result<...> as output. This could be automatically implemented for anything that implements the base version. That makes it a bit easier to write and makes you trait bounds more descriptive. But only do that when actually needed. For now I suggest to not do this.
|
|
||
| /// Any item having a measurable mass (monoisotopic and average) | ||
| /// | ||
| // FIXME: shall we return Result<f64>? this would avoid unwrapping in some implems |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I would suggest to create a TryHasMass and auto implement this for anything implementing implementing HasMass. Also I was thinking that maybe having a HasMultipleMasses can be useful, as B and Z can not implement HasMass but can implement HasMultipleMasses.
Another point, I would vote against having the average mass being optional, if it is needed to be optional the implementor has to implement TryHasMass.
| pub mono_mass: f64, | ||
| pub average_mass: f64, | ||
| pub position_constraint: PtmLocation, // any N-term, any C-term, protein N-term, protein C-term | ||
| pub residue_constraint: u8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these constraints imply an and relation? Because there are quite some modification that have more complex rules. Example: Unimod:Oxidation has RNDCQEHILKMFPSTWYVU@Anywhere, G@AnyCTerm as rules.
| code1: b'A', | ||
| code3: "Ala".to_string(), | ||
| name: "Alanine".to_string(), | ||
| formula: Some("C(3) H(5) O N".to_string()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here using the molecular_formula! macro you copied from rustyms makes sense
| #[derive(Clone, Copy, Debug, Eq, Hash, Ord, PartialEq, PartialOrd, Serialize, Deserialize)] | ||
| #[repr(u8)] | ||
| pub enum AA { | ||
| // The 20 standard AAs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You also have it defined somewhere else in this PR, maybe makes sense to just remove it.
| /// An error | ||
| #[derive(Serialize, Deserialize, PartialEq, Clone, Eq)] | ||
| // TODO: rename ParsingError | ||
| pub struct CustomError { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you took over my full error module, then unifying the errors is something we definitely should talk about. TO leave some initial thoughts here: I have never used thiserror or any related crate, I only ever used this error type as used here (in multiple languages redefined by myself). My reason for doing that is because I wanted to provide the fanciest error messages I could rivalling those from the rust compiler itself.
| Self::c_p2 => SeriesDir::NTerminal, | ||
| Self::c_H2O => SeriesDir::NTerminal, | ||
| Self::c_NH3 => SeriesDir::NTerminal, | ||
| Self::d => SeriesDir::Unspecified, // FIXME |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
d/v are n terminal, w is c terminal
This PR is a continuation of the previous one (PR #1), and it tries to address unresolved issues in this previous PR, and also can been seen as a solution proposal to the concerns mentioned in this discussion:
https://github.com/orgs/rusteomics/discussions/10
The changes introduced by this PR are experimental and mainly serve as a reference for future discussions & improvements.
I added some TODOs & FIXMEs that may require particular attention from reviewers.
As a summary, this PR:
Amino acids are abstracted by the IsAminoAcid trait while a sequence of amino acids is abstracted by IsAminoAcidSeq.
Amino acids definitions can be retrieve by a proper implementation of the trait AminoAcidFactory<T: IsAminoAcid>.
Some other comments regarding Element and molecular formula, and how it differs from Rustyms:
There is a slight difference between these types within the present mzcore proposal, and I tried to match the usual nomenclature.
Element can be seen an atom identity while Atom represent the atomic definition of an Element, including some observed properties (number/mass of isotopes and their relative abundance).
Elements are represented as static entities while Atoms are data structure that can be instantiated in different ways (from different data sources).
I tried to differentiate them using the following definitions:
So the main difference is that molecular formula give absolute number while elemental/atomic composition can be absolute or relative, as the numbering is expressed by floating numbers.
Concretely, molecular formula are meant to be parsed or hard-coded, but are under the hood represented by an elemental composition type (which can be seen as a "super" representation of molecular formula).
Thus in Rusteomics a molecular formula is not a type per se, but I kept the
molecular_formula!()macro, which I find handy.