Improve Data Preprocessing Speed by Switching from Pandas to Polars #4

@achrefbenammar404

Description

The current ModelPreprocessor class relies on the Pandas library for data manipulation and preprocessing. While Pandas is effective, most of its operations run on a single thread, so it can be slow with large datasets. Switching to Polars, a DataFrame library built for multi-threaded execution, could significantly improve preprocessing speed, especially for computationally intensive tasks like feature transformation and data type conversion.

Proposed Solution

  1. Replace Pandas with Polars in the ModelPreprocessor class.

  2. Update the following methods to use Polars syntax for efficient parallel processing:

    • feature_selection
    • convert_data_types
    • transform_categories
    • create_log1p_features
    • preprocess
  3. Benchmark the Performance:

    • Compare the preprocessing time between Pandas and Polars to confirm performance improvements.
    • Document any notable speedups or changes in memory usage.
  4. Test Compatibility:

    • Ensure compatibility with other parts of the pipeline, especially CatBoost: CatBoostClassifier may not accept Polars DataFrames directly, so they may need converting to a format it supports (for example, a Pandas DataFrame or a NumPy array).

Updated Code Example

Replace Pandas functions with equivalent Polars functions in ModelPreprocessor. Below is a partial example:

import polars as pl

class ModelPreprocessor:
    def feature_selection(self, df: pl.DataFrame) -> pl.DataFrame:
        # Keep only the selected features, if any were configured
        if hasattr(self, 'selected_features'):
            return df.select(self.selected_features)
        return df

    def convert_data_types(self, df: pl.DataFrame) -> pl.DataFrame:
        # Use with_columns (plural): with_column was deprecated and later removed.
        # Passing all casts in one call lets Polars run them in parallel.
        # Note: casting to Categorical expects string columns.
        return df.with_columns(
            [pl.col(c).cast(pl.Categorical) for c in self.categorical_features]
            + [pl.col(c).cast(pl.Float32) for c in self.numerical_features]
        )
    # Continue refactoring other methods similarly...

Tasks

  • Refactor ModelPreprocessor class to use Polars instead of Pandas.
  • Update all methods to use Polars syntax for data manipulation.
  • Test compatibility with CatBoostClassifier and make adjustments as needed.
  • Run benchmarks to compare preprocessing speed with Pandas and document results.
  • Update the documentation to reflect the change from Pandas to Polars.

Expected Outcome

  • Faster data preprocessing for improved efficiency in prediction workflows.
  • Reduced memory usage, especially with large datasets.
  • Cleaner, more concise code for data manipulation tasks.

Additional Notes

  • Polars does not cover every Pandas feature (for example, it has no row index), so some operations may need workarounds or a fallback to Pandas.
  • Ensure that any Polars-specific dependencies are added to the requirements file and documented in the README.

Labels: enhancement
