Description
The current `ModelPreprocessor` class relies on the Pandas library for data manipulation and preprocessing. While Pandas is effective, it can be slow with large datasets. Switching to Polars, a faster DataFrame library optimized for parallel processing, could significantly improve preprocessing speed, especially for computationally intensive tasks like feature transformation and data type conversion.
Proposed Solution
- Replace Pandas with Polars in the `ModelPreprocessor` class.
- Update the following methods to use Polars syntax for efficient parallel processing:
  - `feature_selection`
  - `convert_data_types`
  - `transform_categories`
  - `create_log1p_features`
  - `preprocess`
- Benchmark the performance:
  - Compare the preprocessing time between Pandas and Polars to confirm performance improvements.
  - Document any notable speedups or changes in memory usage.
- Test compatibility:
  - Ensure compatibility with other parts of the pipeline, especially `CatBoost`, which may require converting Polars DataFrames to a format that `CatBoostClassifier` accepts.
Updated Code Example
Replace Pandas functions with equivalent Polars functions in `ModelPreprocessor`. Below is a partial example:
```python
import polars as pl


class ModelPreprocessor:
    def feature_selection(self, df: pl.DataFrame) -> pl.DataFrame:
        # Keep only the selected features when they have been defined
        if hasattr(self, "selected_features"):
            df = df.select(self.selected_features)
        return df

    def convert_data_types(self, df: pl.DataFrame) -> pl.DataFrame:
        # Convert categorical columns
        # (with_columns replaces the older with_column method)
        for column in self.categorical_features:
            df = df.with_columns(pl.col(column).cast(pl.Categorical))
        # Convert numerical columns
        for column in self.numerical_features:
            df = df.with_columns(pl.col(column).cast(pl.Float32))
        return df

    # Continue refactoring other methods similarly...
```
Tasks
- Refactor the `ModelPreprocessor` class to use Polars instead of Pandas.
- Update all methods to use Polars syntax for data manipulation.
- Test compatibility with `CatBoostClassifier` and make adjustments as needed.
- Run benchmarks to compare preprocessing speed with Pandas and document results.
- Update the documentation to reflect the change from Pandas to Polars.
Expected Outcome
- Faster data preprocessing for improved efficiency in prediction workflows.
- Reduced memory usage, especially with large datasets.
- Cleaner, more concise code for data manipulation tasks.
Additional Notes
- Polars does not replicate every Pandas feature (for example, it has no row index), so some operations may need to be rewritten or temporarily fall back to Pandas.
- Ensure that any Polars-specific dependencies are added to the requirements file and documented in the README.