Sklong Official Paper Is Out! 0.1.0 is now out there 🙀

@simonprovost simonprovost released this 02 Aug 00:09

Hi folks!

We are pleased to announce that Scikit-Longitudinal is now available in its first beta release under the tag 0.1.0 🎉

Note

Why 0.1.0 now? With the publication of our paper in the Journal of Open Source Software (JOSS) today, we're marking a major step forward! This transitions us from alpha versions to beta, reflecting the library's maturity after rigorous open peer review and enhancements. Bear with us on the version jump—we've incorporated key updates since 0.0.7, including two minor releases (0.0.8 and 0.0.9) for pre-approval tweaks.

❝ In a nutshell, what's Scikit-Longitudinal?

📽️ Scikit-Longitudinal (also abbreviated as Sklong) is built on @scikit-learn (by the @scikit-learn team and contributors; thanks to @GaelVaroquaux for years of inspiration around the Sklearn ecosystem, now with @probabl-ai) and @Scikit-Tree (by the @neurodata team, including @adam2392 and others). It also draws inspiration from longitudinal research by top-notch researchers such as Dr. Caio Ribeiro (@caioedurib), Dr. Tossapol Pomsuwan (@mastervii), Dr. Sergey Ovchinnik (@SergeyOvchinnik), and Dr. Fernando Otero (@febo), among others. But really, what does it do? 👇👇👇

💡 Scikit-Longitudinal is an open-source Python package that improves machine learning for longitudinal data classification while integrating seamlessly with the Scikit-learn environment. Longitudinal data, which consists of repeated measurements of variables at different time intervals (known as waves), is widely used in sectors such as health and social sciences. Unlike ordinary tabular datasets, longitudinal data has temporal linkages that require specialised processing. In practice, people frequently either rely naively on the last wave alone (not wrong as such, but it discards the past) or flatten all waves in an incorrect manner.

With Sklong, we address this all-too-common problematic workflow with a novel set of tools, including:

  • Data Preparation: Utilities like LongitudinalDataset for loading and structuring data, defining temporal feature groups, and more.
  • Data Transformation: Methods to handle the temporal aspect, either by flattening the data into a static representation (e.g., MerWavTimeMinus or SepWav) for standard ML or preserving the temporal structure (e.g., MerWavTimePlus) for use in longitudinal-aware steps.
  • Preprocessing: Longitudinal-data-aware feature selection, such as CFSPerGroup, leveraging temporal information.
  • Estimators: Specialised algorithm-adaptation-based classifiers like LexicoRandomForestClassifier, LexicoDeepForestClassifier, and NestedTreesClassifier, which exploit the temporal structure to potentially enhance performance.
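
To make the idea of temporal feature groups more concrete, here is a minimal pure-Python sketch. It is not Sklong's API or implementation: the column names, the `features_group` layout (one list of column indices per longitudinal attribute, oldest wave first), and both helper functions are assumptions made purely for illustration. It contrasts the naive "last wave only" baseline with a simple aggregation-based flattening.

```python
from statistics import mean

# Toy longitudinal table: one row per individual, wave-stamped columns plus
# one static feature. All names and values here are made up for illustration.
columns = ["smoke_w1", "smoke_w2", "smoke_w3", "bmi_w1", "bmi_w2", "bmi_w3", "age"]
rows = [
    [0, 1, 1, 22.0, 23.5, 25.0, 61],
    [1, 1, 0, 30.0, 29.0, 28.0, 58],
]

# Assumed representation of temporal feature groups: one list of column
# indices per longitudinal attribute, ordered from oldest to most recent wave.
features_group = [[0, 1, 2], [3, 4, 5]]
non_longitudinal = [6]


def last_wave_only(row):
    """Naive baseline: keep only the most recent wave of each group."""
    static = [row[i] for i in non_longitudinal]
    return [row[group[-1]] for group in features_group] + static


def aggregate_waves(row, func=mean):
    """Flatten each group into a single static value via an aggregation function."""
    static = [row[i] for i in non_longitudinal]
    return [func([row[i] for i in group]) for group in features_group] + static


last = [last_wave_only(r) for r in rows]   # history of each attribute is dropped
flat = [aggregate_waves(r) for r in rows]  # history is summarised, order is lost
```

Both outputs are ordinary static tables that any standard classifier could consume. Neither keeps the ordering of the waves, which is the information that longitudinal-aware steps are meant to exploit.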

In total, the library implements (as of 0.1.0) 1 data preparation method, 4 data transformation methods, 1 preprocessing method, and 6 estimators, 2 of which (LexicoRandomForestClassifier and NestedTreesClassifier) are standalone methods published in the literature. Sklong emphasises highly-typed, Pythonic code with substantial test coverage (over 88%) and comprehensive documentation (over 72%).

Not Enough? ❞

🗞️ The scientific paper is available at (Published by @openjournals) : https://joss.theoj.org/papers/10.21105/joss.08481

As well as that, more is coming; explore our GitHub issues, read through our README, and check our documentation! We've also added a podcast explanation in the docs for a quick audio overview.

Open-Source Contribution, More Than Welcome! ❞

We hope to provide motivation for you to contribute your own estimators, preprocessors, data transformation techniques and more! If we could have 1% of what @scikit-learn did 10 years ago (back in France 🇫🇷) for the machine learning community globally, it'd be just insane!

As a result, please share your suggestions! Without external input, how can we ensure we're advancing longitudinal ML workflows? 👀 New primitives are welcome from external contributors without problems—simply open an issue to discuss.

For full transparency, the JOSS peer review process is publicly available at openjournals/joss-reviews#8481. Reviewers @TahiriNadia and @blengerich provided invaluable feedback, leading to significant refinements in documentation, examples, and overall user-friendliness. A huge thanks to them!

At the moment, we have an ongoing external contribution by a French team, led by the great @MathiasValla; more specifically, they are going to contribute a new longitudinal-data-aware classifier, dubbed Time-penalised Trees (TpT). Read the theoretical aspect at: https://link.springer.com/article/10.1007/s10472-024-09950-w.

❝ I guess it's now time for the tech-ish changelog!

🫵
https://pypi.org/project/scikit-longitudinal/

[v0.1.0] - 2025-08-02 - JOSS Paper Publication and Beta Transition

Added

  • JOSS paper integration: Added links, badges, and references to the published paper (DOI: 10.21105/joss.08481) – core commit for release.
  • Tutorials section in documentation – #63.
  • Troubleshooting installation section – commit on Jun 27.
  • Podcast explanation of Sklong in docs – commit on Jul 30.
  • Glightbox dependency for MkDocs – #63.
  • Document-dates plugin for docs – #55.
  • Blurry tabs styling for docs – #55.

Enhanced

  • Documentation overhaul: New-style home page, improved wordings, import examples, and docstrings – #55, #63.
  • Emphasized version compatibility in docs – commit on Jun 27.
  • Adapted temporal dependency links – #63.
  • Updated GitHub stars count in docs – commit on Jul 30.
  • Updated development requirements (e.g., for RDT) and uv.lock – commits on Jun 26 and Jul 3.
  • Improved PyPI README and links – commits on Jun 27 and Jul 22.

Resolved

  • Fixed setup.py and PyPI links – commit 12 hours ago.
  • Addressed JOSS pre-approval tweaks in minor releases 0.0.8 and 0.0.9 – commits on Jun 27 and Jul 22.

Note: This changelog covers advancements since 0.0.7. For prior details, see the expanded history below.

Previously in v0.0.7 and earlier

Version 0.0.7 - 2025-01-15 - Migration to uv and Major Enhancements

New Features

  • Migration to uv: Successfully transitioned from PDM to uv for package management, enhancing workflow efficiency and build reliability. Enhanced documentation to assist users with installation and setup using uv.
  • Refactored Visualisations and Tutorials: Updated tutorials and visualisations to align with the migration to uv. Improved the Quick Start Guide by providing clearer instructions and optimising the layout to enhance user experience.
  • Enhanced Estimators and Pipelines:
    • Refactored Lexico Gradient Boosting for full compliance with Scikit-Learn, eliminating the previous dependency on StarBoost.
    • Improved preprocessing pipelines for ARFF file management using the powerful liac-arff library.
  • CI/CD Updates: Implemented enhancements to the continuous integration and deployment pipeline, resulting in more efficient builds and improved compatibility with GitHub Actions.

Enhanced

  • Documentation: Fixed typos and inconsistencies in installation guides and tutorials to enhance clarity and improve user experience.
  • Usage Examples: Enhanced examples through various methodologies to effectively illustrate the application of longitudinal machine learning techniques.
  • Compliance and Maintainability: Implemented key enhancements to boost compliance and maintainability of tools and documentation.

Resolved

  • Resolved minor issues in installation and documentation setups, improving usability and reliability.

🫵
https://pypi.org/project/scikit-longitudinal/0.0.7/

[v0.0.4] - 2024-07-04 - First Public Release and Major Enhancements

Added

  • Documentation: Comprehensive new documentation with Material for MkDocs. This includes a detailed tutorial on understanding vectors of waves in longitudinal datasets, a contribution guide, an FAQ section, and complete API references for all estimators, preprocessors, data preparations, and the pipeline manager.
  • Docker Installation: Added new Docker installation process.
  • Windows Support: Windows is now supported via Docker.
  • New Classifiers/Regressors: Introduced Lexico Deep Forest, Lexico Gradient Boosting, and Lexico Decision Tree Regressor.
  • PyPI Availability: Scikit-Longitudinal is now available on PyPI.
  • Continuous Integration: Integrated unit testing, documentation, and PyPI publishing within the CI pipeline.

Improved

  • PDM Setup and Installation: Enhanced setup and installation processes using PDM.
  • Testing Coverage: Improved testing coverage, ensuring that nearly 90% of the library is tested.
  • Scikit-Lexicographical-Trees: Extracted the lexicographical scikit-learn tree node splitting function into its own repository and published it to PyPI as Scikit-Lexicographical-Trees. This is now leveraged by our lexico-based estimators.
  • .env Management: Improved management of environment variables.
  • Lexicographical Enhancements: Integrated lexicographical enhancements of the waves vector within the variant of scikit-learn, scikit-lexicographical-trees, improving memory and time efficiency by handling algorithmic temporality directly in C++.

To-Do

  • Docstrings Alignment: Ensure that docstrings in the codebase align with the official documentation to avoid confusion.
  • Native Windows Compatibility: Achieve Windows compatibility without relying on Docker (requires access to a Windows machine).
  • Future Enhancements: Ongoing improvements and new features as they are identified.
  • Documentation examples: Add examples to the documentation to help users understand how to use the library with Jupyter notebooks.

[v0.0.3] - 2023-10-31 - Usability, Maintainability, and Compliance Enhancements

Added

  • Features Group Missing Waves Handling: Introduced mechanisms for gracefully handling missing waves in features groups.
  • Readiness Descriptions: New readiness indicators provide detailed descriptions of temporal data management across the library.
  • Auto-Sklong Compliance: The library is now compliant with Auto-Sklong standards.
  • Package Management Transition: Switched from Poetry to PDM for improved package and dependency management.
  • Docker Support: Linux-based Docker environment setup for streamlined installation and deployment.
  • Platform Testing: Library is tested on both Mac and Linux, with Windows support nearing completion.
  • Documentation: Comprehensive version 0.0.1 of the documentation is available on GitHub Pages.
  • Pipeline Manager: Refactored the pipeline into a more maintainable and flexible pipeline manager.
  • CFS Classes Refactoring: Separated CFS and CFS Per Group algorithms into distinct classes for better management.

Removed

  • Irrelevant Scripts: Removed scripts related to visualizations not core to the library's functionality.
  • Experiments Branch: Moved all experiment-related code to a dedicated Experiments branch.

[v0.0.2] - 2023-05-17 - Enhanced Longitudinal Analysis and Parallelization Features

Added

  • Implementation and validation of three algorithms: CFS Per Group, Nested Trees, and LexicoRF.
  • Parallelization enhancements where possible.
  • Longitudinal dataset handler for access to non-longitudinal features, longitudinal features group, etc.
  • Longitudinal pipeline for longitudinal-based algorithms that pass features group onto each step of the pipeline.
  • Comprehensive documentation and extensive test coverage (>95% of the codebase).
  • Git hooks and other tools for long-term project use.
  • An improved version of the CFS per Group algorithm (version two) based on the paper's concept level.
  • Updated README file.

[v0.0.1] - 2023-03-27 - Initial Release

Added

  • Initial setup of the Poetry Python project with robust type-checking.
  • Integration of linting tools: pylint, flake8, pre-commit, black, and isort.
  • Correlation-based Feature Selection (CFS) algorithm with improved typing and testing.
  • CFS per Group for Longitudinal Data: Python implementation with parallelism for better performance.
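
For readers unfamiliar with CFS, the sketch below computes the merit heuristic commonly used in Correlation-based Feature Selection: a subset of k features scores k·r_cf / sqrt(k + k(k−1)·r_ff), where r_cf is the mean feature-class correlation and r_ff the mean feature-feature correlation. This is a generic illustration, not Sklong's implementation; the function name `cfs_merit` and its inputs (precomputed correlations) are assumptions.

```python
from math import sqrt


def cfs_merit(feature_class_corrs, feature_feature_corrs):
    """Merit of a feature subset under the usual CFS heuristic.

    feature_class_corrs: one correlation per feature with the class label.
    feature_feature_corrs: one correlation per unordered pair of features.
    Absolute values are used so that sign does not matter.
    """
    k = len(feature_class_corrs)
    if k == 0:
        return 0.0
    r_cf = sum(abs(c) for c in feature_class_corrs) / k
    # With a single feature there are no pairs, so redundancy is zero.
    if feature_feature_corrs:
        r_ff = sum(abs(c) for c in feature_feature_corrs) / len(feature_feature_corrs)
    else:
        r_ff = 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


# Two features that both predict the class but are also correlated with
# each other: the redundancy term lowers the merit of the pair.
pair_merit = cfs_merit([0.8, 0.6], [0.5])
single_merit = cfs_merit([0.8], [])
```

A "per group" variant would presumably apply this kind of scoring within each temporal feature group before combining the selections; the exact procedure is described in the library's documentation rather than here.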

Extras

Could be of interest to: @sudehashrafi, @blakeandreou, @rushikeshburle, @MEDomics-UdeS, @Mvalliere, @gsi-upm, @JTFouquier, @bunu, @dado93, and more!