In e-commerce, determining the optimal price point for products is crucial for marketplace success and customer satisfaction. Your challenge is to develop an ML solution that analyzes product details and predict the price of the product. The relationship between product attributes and pricing is complex - with factors like brand, specifications, product quantity directly influence pricing. Your task is to build a model that can analyze these product details holistically and suggest an optimal price.
The dataset consists of the following columns:
- sample_id: A unique identifier for the input sample
- catalog_content: Text field containing title, product description and an Item Pack Quantity(IPQ) concatenated.
- image_link: Public URL where the product image is available for download.
Example link - https://m.media-amazon.com/images/I/71XfHPR36-L.jpg
To download images use download_imagesfunction fromsrc/utils.py. See sample code insrc/test.ipynb.
- price: Price of the product (Target variable - only available in training data)
- Training Dataset: 75k products with complete product details and prices
- Test Set: 75k products for final evaluation
The output file should be a CSV with 2 columns:
- sample_id: The unique identifier of the data sample. Note the ID should match the test record sample_id.
- price: A float value representing the predicted price of the product.
Note: Make sure to output a prediction for all sample IDs. If you have less/more number of output samples in the output file as compared to test.csv, your output won't be evaluated.
Source files
- src/utils.py: Contains helper functions for downloading images from the image_link. You may need to retry a few times to download all images due to possible throttling issues.
- sample_code.py: Sample dummy code that can generate an output file in the given format. Usage of this file is optional.
Dataset files
- dataset/train.csv: Training file with labels (price).
- dataset/test.csv: Test file without output labels (price). Generate predictions using your model/solution on this file's data and format the output file to match sample_test_out.csv
- dataset/sample_test.csv: Sample test input file.
- dataset/sample_test_out.csv: Sample outputs for sample_test.csv. The output for test.csv must be formatted in the exact same way. Note: The predictions in the file might not be correct
- 
You will be provided with a sample output file. Format your output to match the sample output file exactly. 
- 
Predicted prices must be positive float values. 
- 
Final model should be a MIT/Apache 2.0 License model and up to 8 Billion parameters. 
Submissions are evaluated using Symmetric Mean Absolute Percentage Error (SMAPE): A statistical measure that expresses the relative difference between predicted and actual values as a percentage, while treating positive and negative errors equally.
Formula:
SMAPE = (1/n) * Σ |predicted_price - actual_price| / ((|actual_price| + |predicted_price|)/2)
Example: If actual price = $100 and predicted price = $120
SMAPE = |100-120| / ((|100| + |120|)/2) * 100% = 18.18%
Note: SMAPE is bounded between 0% and 200%. Lower values indicate better performance.
- Public Leaderboard: During the challenge, rankings will be based on 25K samples from the test set to provide real-time feedback on your model's performance.
- Final Rankings: The final decision will be based on performance on the complete 75K test set along with provided documentation of the proposed approach by the teams.
- 
Upload a test_out.csvfile in the Portal with the exact same formatting assample_test_out.csv
- 
All participating teams must also provide a 1-page document describing: - Methodology used
- Model architecture/algorithms selected
- Feature engineering techniques applied
- Any other relevant information about the approach Note: A sample template for this documentation is provided in Documentation_template.md
 
Participants are STRICTLY NOT ALLOWED to obtain prices from the internet, external databases, or any sources outside the provided dataset. This includes but is not limited to:
- Web scraping product prices from e-commerce websites
- Using APIs to fetch current market prices
- Manual price lookup from online sources
- Using any external pricing databases or services
Enforcement:
- All submitted approaches, methodologies, and code pipelines will be thoroughly reviewed and verified
- Any evidence of external price lookup or data augmentation from internet sources will result in immediate disqualification
Fair Play: This challenge is designed to test your machine learning and data science skills using only the provided training data. External price lookup defeats the purpose of the challenge.
- Consider both textual features (catalog_content) and visual features (product images)
- Explore feature engineering techniques for text and image data
- Consider ensemble methods combining different model types
- Pay attention to outliers and data preprocessing