New chemistry module API #2

david-bouyssie · 2023-11-10T12:33:54Z

This PR is a continuation of the previous one (PR #1). It tries to address the issues left unresolved there, and can also be seen as a proposed solution to the concerns raised in this discussion:
https://github.com/orgs/rusteomics/discussions/10

The changes introduced by this PR are experimental and mainly serve as a reference for future discussions & improvements.
I added some TODOs & FIXMEs that may require particular attention from reviewers.

As a summary, this PR:

adds new types stolen from the great rustyms library: Element, Composition (replacing Formula), Glycan types, CustomError
introduces new traits abstracting amino acid definition & retrieval
replaces the Peptide struct with LinearPeptide and introduces a trait to abstract a sequence of amino acids

Amino acids are abstracted by the IsAminoAcid trait while a sequence of amino acids is abstracted by IsAminoAcidSeq.
Amino acid definitions can be retrieved through an implementation of the trait AminoAcidFactory<T: IsAminoAcid>.
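
To make the relationship between the three traits concrete, here is a minimal sketch. Only the trait names come from this PR; every method name, the `SimpleAminoAcid`, `SimpleAminoAcidTable` and `ToyPeptide` types, and the `residue_mass_sum` helper are hypothetical and chosen for illustration, so the real signatures may well differ.

```rust
use std::collections::HashMap;

/// Hypothetical minimal view of an amino acid: one-letter code and residue mass.
pub trait IsAminoAcid {
    fn code1(&self) -> u8;
    fn mono_mass(&self) -> f64;
}

/// Hypothetical abstraction over anything exposing a sequence of residues.
pub trait IsAminoAcidSeq {
    fn residues(&self) -> &[u8];
}

/// Retrieval of amino acid definitions (method name is an assumption).
pub trait AminoAcidFactory<T: IsAminoAcid> {
    fn aa_from_code1(&self, code1: u8) -> Option<&T>;
}

#[derive(Clone, Debug, PartialEq)]
pub struct SimpleAminoAcid {
    pub code1: u8,
    pub mono_mass: f64,
}

impl IsAminoAcid for SimpleAminoAcid {
    fn code1(&self) -> u8 {
        self.code1
    }
    fn mono_mass(&self) -> f64 {
        self.mono_mass
    }
}

/// A table-backed factory; other data sources could implement the same trait.
pub struct SimpleAminoAcidTable {
    by_code1: HashMap<u8, SimpleAminoAcid>,
}

impl SimpleAminoAcidTable {
    pub fn new(aas: Vec<SimpleAminoAcid>) -> Self {
        Self { by_code1: aas.into_iter().map(|aa| (aa.code1, aa)).collect() }
    }
}

impl AminoAcidFactory<SimpleAminoAcid> for SimpleAminoAcidTable {
    fn aa_from_code1(&self, code1: u8) -> Option<&SimpleAminoAcid> {
        self.by_code1.get(&code1)
    }
}

/// Stand-in for a peptide type (not the LinearPeptide of this PR).
pub struct ToyPeptide {
    pub sequence: Vec<u8>,
}

impl IsAminoAcidSeq for ToyPeptide {
    fn residues(&self) -> &[u8] {
        &self.sequence
    }
}

/// Sums residue masses; returns None if any residue is unknown to the factory.
pub fn residue_mass_sum<T, F, S>(seq: &S, factory: &F) -> Option<f64>
where
    T: IsAminoAcid,
    F: AminoAcidFactory<T>,
    S: IsAminoAcidSeq,
{
    seq.residues()
        .iter()
        .map(|&code| factory.aa_from_code1(code).map(|aa| aa.mono_mass()))
        .sum()
}
```

Code written against `IsAminoAcidSeq` and `AminoAcidFactory` in this way does not need to know which concrete peptide or table type it is given.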

Some other comments regarding Element and molecular formula, and how they differ from rustyms:

Element & Atom
There is a slight difference between these types within the present mzcore proposal, and I tried to match the usual nomenclature.
Element can be seen as an atom identity, while Atom represents the atomic definition of an Element, including some observed properties (number and mass of isotopes and their relative abundance).
Elements are represented as static entities, while Atoms are data structures that can be instantiated in different ways (from different data sources).
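
A rough sketch of this split, under stated assumptions: the field names, the `Isotope` struct and the `nitrogen_from_reference_values` constructor are hypothetical, the isotope values are rounded reference figures, and taking the most abundant isotope as the monoisotopic one is my simplification, not necessarily what the PR does.

```rust
/// Identity only: a closed set of static values (small subset shown).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum Element {
    H,
    C,
    N,
    O,
}

/// Observed properties of one isotope (hypothetical layout).
#[derive(Clone, Debug, PartialEq)]
pub struct Isotope {
    pub mass_number: u16,
    pub mass: f64,
    pub abundance: f64,
}

/// Definition of an element as data, which can come from any source.
#[derive(Clone, Debug, PartialEq)]
pub struct Atom {
    pub element: Element,
    pub isotopes: Vec<Isotope>,
}

impl Atom {
    /// Mass of the most abundant isotope (assumption made for this sketch).
    pub fn monoisotopic_mass(&self) -> Option<f64> {
        self.isotopes
            .iter()
            .max_by(|a, b| a.abundance.total_cmp(&b.abundance))
            .map(|iso| iso.mass)
    }
}

/// One possible data source: hard-coded, rounded reference values.
pub fn nitrogen_from_reference_values() -> Atom {
    Atom {
        element: Element::N,
        isotopes: vec![
            Isotope { mass_number: 14, mass: 14.003074, abundance: 0.99636 },
            Isotope { mass_number: 15, mass: 15.000109, abundance: 0.00364 },
        ],
    }
}
```

The `Element` value is cheap to copy and compare, while the `Atom` carries the heavier, source-dependent data.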
Molecular formula, elemental composition & atomic composition
I tried to differentiate them using the following definitions:

molecular formula: the identity and number of chemical elements making up a molecule
elemental composition (also named chemical or percent composition): the identity and absolute or relative quantity of the chemical elements making up a molecule
atomic composition: similar to the elemental composition but with defined atomic properties (isotopic mass and abundance)

So the main difference is that a molecular formula gives absolute counts, while an elemental/atomic composition can be absolute or relative, since quantities are expressed as floating-point numbers.
Concretely, molecular formulas are meant to be parsed or hard-coded, but under the hood they are represented by an elemental composition type (which can be seen as a "super" representation of a molecular formula).
Thus in Rusteomics a molecular formula is not a type per se, but I kept the molecular_formula!() macro, which I find handy.
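
As an illustration of a formula being stored as a composition with floating-point quantities, here is a minimal sketch. The `ElementalComposition` name comes from this PR, but its fields and the `from_formula`, `count` and `to_relative` methods are assumptions of mine, and "relative" is interpreted here as a fraction of the atom count.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum Element {
    H,
    C,
    N,
    O,
}

/// Quantities are f64 so the same type can hold absolute or relative values.
#[derive(Clone, Debug, PartialEq, Default)]
pub struct ElementalComposition {
    counts: BTreeMap<Element, f64>,
}

impl ElementalComposition {
    /// A molecular formula is just integer counts fed into the composition.
    pub fn from_formula(terms: &[(Element, u32)]) -> Self {
        let mut counts = BTreeMap::new();
        for &(element, n) in terms {
            *counts.entry(element).or_insert(0.0) += f64::from(n);
        }
        Self { counts }
    }

    /// Quantity stored for an element, 0.0 if absent.
    pub fn count(&self, element: Element) -> f64 {
        self.counts.get(&element).copied().unwrap_or(0.0)
    }

    /// Relative composition: each quantity divided by the total.
    pub fn to_relative(&self) -> Self {
        let total: f64 = self.counts.values().sum();
        if total == 0.0 {
            return self.clone();
        }
        Self {
            counts: self.counts.iter().map(|(&el, &n)| (el, n / total)).collect(),
        }
    }
}
```

For water, `from_formula(&[(Element::H, 2), (Element::O, 1)])` stores absolute counts, and `to_relative()` turns them into fractions of two thirds and one third.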

* introduce new traits abstracting amino acid definition & retrieval * add new types stolen from rustyms: Element, Composition (replacing Formula), Glycan types, CustomError

mobiusklein · 2023-11-13T12:45:41Z

mzcore-rs/src/chemistry/element.rs

+}
+
+impl FromStr for Element {
+    type Err = anyhow::Error;


Isn't it more common for a library to use thiserror to avoid requiring its users to use a specific error abstraction library?

That's a fair point, and I was also concerned about this.
I think we could dedicate a PR to define accurate and consistent error types.
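
For reference in that follow-up PR, a minimal sketch of what a library-owned error type for this `FromStr` impl could look like. `ParseElementError` is a hypothetical name, and the `Display` and `Error` impls are written by hand here so the example needs only std; thiserror would derive equivalent impls.

```rust
use std::fmt;
use std::str::FromStr;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Element {
    H,
    C,
    N,
    O,
}

/// Library-owned error type (hypothetical name), independent of anyhow.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct ParseElementError {
    pub symbol: String,
}

impl fmt::Display for ParseElementError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "unknown element symbol: {:?}", self.symbol)
    }
}

impl std::error::Error for ParseElementError {}

impl FromStr for Element {
    type Err = ParseElementError;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "H" => Ok(Element::H),
            "C" => Ok(Element::C),
            "N" => Ok(Element::N),
            "O" => Ok(Element::O),
            _ => Err(ParseElementError { symbol: s.to_string() }),
        }
    }
}
```

Because the type implements `std::error::Error`, callers who do use anyhow can still propagate it with `?`, while other callers can match on it directly.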

mobiusklein · 2023-11-13T13:07:48Z

mzcore-rs/src/chemistry/peptide.rs

+// FIXME: how to implement serde::{Deserialize, Serialize} for Arc<[u8]>???
+#[derive(Clone, Debug, PartialEq, PartialOrd, Serialize, Deserialize)]
+pub struct LinearPeptide {
+    pub sequence: Arc<[u8]>,


Why inject the Arc into the peptide rather than wrap the whole peptide in an Arc when needed?

Good point. I have to confess I tried to replicate what was done in the Sage repo.
Maybe @lazear could give a rationale about this specific design.
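
To compare the two layouts being discussed, here is a std-only sketch. The `missed_cleavages` field and both struct names are hypothetical stand-ins, not the actual fields of LinearPeptide.

```rust
use std::sync::Arc;

/// Layout A (as in the diff): only the sequence bytes are shared;
/// cloning the struct copies the other fields and bumps one refcount.
#[derive(Clone, Debug, PartialEq)]
pub struct SharedSeqPeptide {
    pub sequence: Arc<[u8]>,
    pub missed_cleavages: u8,
}

/// Layout B (the reviewer's suggestion): plain owned data;
/// callers wrap the whole value in Arc only when they need sharing.
#[derive(Clone, Debug, PartialEq)]
pub struct OwnedPeptide {
    pub sequence: Box<[u8]>,
    pub missed_cleavages: u8,
}

/// True if both peptides point at the same sequence allocation.
pub fn shares_sequence(a: &SharedSeqPeptide, b: &SharedSeqPeptide) -> bool {
    Arc::ptr_eq(&a.sequence, &b.sequence)
}

/// Derives a variant that reuses the sequence but changes another field,
/// which is the case where layout A avoids copying the bytes.
pub fn with_missed_cleavages(p: &SharedSeqPeptide, n: u8) -> SharedSeqPeptide {
    SharedSeqPeptide { sequence: Arc::clone(&p.sequence), missed_cleavages: n }
}
```

Layout A pays off mainly when many peptide values differ in other fields but share the same bytes; layout B keeps the type simpler and leaves the sharing decision to the caller. On the serde FIXME: serde documents an optional `rc` Cargo feature that enables impls for `Rc` and `Arc`, which may be worth checking for the `Arc<[u8]>` field.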

mobiusklein · 2023-11-13T13:17:02Z

mzcore-rs/src/chemistry/composition.rs

+// Additional definitions to define an atom-based composition (mainly used for mass calculation)
+use crate::chemistry::atom::AtomIsotopicVariant;
+
+#[derive(Clone, Debug, PartialEq, PartialOrd, Serialize, Deserialize)]


The distinction between ElementalComposition and AtomicComposition seems to be to avoid a reference to a specific isotopic abundance, which would in turn inject either a lifetime and/or a heavy data structure into every type that carried a composition, correct?

This seems to require that all code that might deal with isotopes manage the conversion to AtomicComposition separately. The pattern I followed in chemical-elements has the tradeoff of needing lifetimes everywhere, though in the simple case of a single shared global table these can be made 'static without too much labor.

What kind of API are you envisioning?

The distinction between ElementalComposition and AtomicComposition seems to be to avoid a reference to a specific isotopic abundance,

Yes, as well as isotopic masses.

This seems to require that all code that might deal with isotopes manage the conversion to AtomicComposition separately. The pattern I followed in chemical-elements has the tradeoff of needing lifetimes everywhere, though in the simple case of a single shared global table these can be made 'static without too much labor.

Actually I adopted a similar design with my PhD in https://github.com/pinaraltiner/proteomics-rs/, which I forked in the previous PR:
https://github.com/pinaraltiner/proteomics-rs/blob/57b57e01265afba22a58d2fa1f77a1ea7ed32939/src/chemistry/table.rs#L40

Then I found that the rustyms definition of Element could be used as a nice way to get type safety for element identity.
So the rationale is to provide an API where identity (Element) and definitions/properties (Atom) are implemented using different types.
This indeed requires some bridges to go from Elements to Atoms.
However atoms can also be manipulated directly, and when atomic/isotopic properties are not needed, Elements can also be used independently.
Do you think it is too complex?

There are contexts where isotopic information isn't needed, where I agree that it'd be too much complexity to carry along all the isotopic metadata too, but this is too devoid of use-cases for me to venture an opinion of whether the complexity is worth it. The best I can say is try and see.

There is a little leaking of abstraction though because ElementalComposition also has a mass-related attribute. Is that because ElementalComposition is also pulling double-duty as a "configuration file" representation of something that would be converted into an AtomicComposition at run time? If so, wouldn't it be plausible for something to be specified with a heavy atom (e.g. an isotopically labeled modification or AA)?

Yes indeed, ElementalComposition adds isotope information (but the accurate mass is still unspecified at this level), which eases conversion to AtomicComposition at runtime.
The main concern, I think, is how to define the monoisotopic mass of an isotopically labeled AA or peptide, since by definition it is not based on the natural isotopes of the elements.
One possible way to represent that is to specify the isotope index at the elemental composition level, which is then converted to the right isotopic variant at the atomic composition level.
Another possible way is to define a custom AtomTable where the properties of the first isotope are changed. The second way is a bit hacky, but greatly simplifies calculations.
The current API supports both approaches. Mixing the two would lead to wrong results and should be avoided.
Not sure if it answers your concern though.

I would use the first approach when parsing molecular formulas, and the second one for mass calculation of peptide sequences. In both cases (for a given molecule) they should lead to the same AtomicComposition.

And actually, regarding the second approach, it is even simpler to generate a modified AminoAcidTable. For instance 15N labeling can be implemented by calculating new AA monoisotopic masses based on 15N instead of 14N. Then the new AtomTable can be used for peptide level computation instead of the legacy ones (standard or proteinogenic).
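
A small sketch of the two approaches side by side, using 15N labeling of a glycine residue. The `AtomTable` name comes from the discussion, but representing it as a map from element to a list of isotope masses, the `Term` struct, and all function names are assumptions made for this example; masses are rounded reference values.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum Element {
    H,
    C,
    N,
    O,
}

/// Isotope masses per element; index 0 is the isotope used by default.
pub type AtomTable = BTreeMap<Element, Vec<f64>>;

pub fn natural_table() -> AtomTable {
    BTreeMap::from([
        (Element::H, vec![1.007825, 2.014102]),
        (Element::C, vec![12.0, 13.003355]),
        (Element::N, vec![14.003074, 15.000109]),
        (Element::O, vec![15.994915, 16.999132, 17.999160]),
    ])
}

/// Approach 2: a custom table where 15N takes the place of the first isotope.
pub fn n15_table() -> AtomTable {
    let mut table = natural_table();
    if let Some(isotopes) = table.get_mut(&Element::N) {
        isotopes.swap(0, 1);
    }
    table
}

/// Approach 1: each composition term names the isotope index explicitly.
pub struct Term {
    pub element: Element,
    pub isotope_index: usize,
    pub count: f64,
}

/// Glycine residue C2H3NO; `n_isotope` selects the nitrogen isotope index.
pub fn glycine_residue(n_isotope: usize) -> Vec<Term> {
    vec![
        Term { element: Element::C, isotope_index: 0, count: 2.0 },
        Term { element: Element::H, isotope_index: 0, count: 3.0 },
        Term { element: Element::N, isotope_index: n_isotope, count: 1.0 },
        Term { element: Element::O, isotope_index: 0, count: 1.0 },
    ]
}

/// Sum of isotope mass times count; None if an element or index is missing.
pub fn mass(terms: &[Term], table: &AtomTable) -> Option<f64> {
    terms
        .iter()
        .map(|t| table.get(&t.element)?.get(t.isotope_index).map(|m| m * t.count))
        .sum()
}
```

Here `mass(&glycine_residue(1), &natural_table())` (isotope index in the composition) and `mass(&glycine_residue(0), &n15_table())` (modified table) give the same value. Combining index 1 with the modified table would silently pick 14N again, which is the kind of mixing that should be avoided.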

For rustyms I used a structure that implements the first technique, also at the peptide level, because quite a few modifications in PSI-MOD and Unimod define conversions of single carbons from the natural distribution to a specific isotope. The second approach seems nice when you do not have these kinds of modifications and have a uniform isotope uptake. Although I do not think its speed would be very different from the way I did it: keep the normal atomic distribution and change all normally distributed isotopes to the fixed isotope afterwards.

mobiusklein · 2023-11-13T13:18:52Z