This project showcases a complete cloud-native analytics workflow that transforms raw Yelp review data into structured insights using Snowflake, Python UDF, and SQL. It simulates a real-world business scenario where stakeholders need to understand customer sentiment, business performance, and reviewer behavior at scale. Link to downlaod the dataset: https://business.yelp.com/data/resources/open-dataset/
Enable data-driven decisions by:
- Classifying customer sentiment from unstructured review text
- Modeling review volume, star ratings, and business trends
- Identifying top-performing businesses and potential churn risks
- Data Source: Yelp JSON (Reviews + Businesses)
- Cloud Storage: AWS S3
- Data Warehouse: Snowflake (VARIANT, SQL, Python UDF)
- Languages: SQL, Python (TextBlob)
- Python script to split 5GB raw JSON review file into smaller chunks
- Ingest raw Yelp review and business data from S3 into Snowflake
- Parse and flatten JSON data using VARIANT column type
- Apply sentiment classification using Python UDF (TextBlob)
- Answer business-critical questions using SQL queries
- Use the Python script to split
yelp_academic_dataset_review.json
into smaller files - Upload the files to an AWS S3 bucket
- Run Snowflake
COPY INTO
commands to load the JSON files intoVARIANT
columns - Use SQL to flatten the data into structured tables
- Create and apply the Python UDF to generate sentiment classification
- Run the analytical SQL queries to derive business insights
- Business count by category
- Top 10 users reviewing restaurants
- Most reviewed business categories
- Top 3 recent reviews per business
- Month with highest review volume
- 5-star review % per business
- Top 5 reviewed businesses per city
- Avg rating for businesses with ≥100 reviews
- Top 10 reviewers + reviewed businesses
- Businesses with highest number of positive sentiment reviews
_All queries are available in Sql_analytic_queries.pdf
- Identified most-trusted business categories by 5★ share
- Highlighted cities and businesses with strong or weak sentiment
- Detected high-volume reviewers and churn-risk businesses
- Demonstrated cloud-scale ETL and Python-SQL integration in a real-world scenario