Project Overview
This project demonstrates cloud data warehouse architecture by analyzing Yelp restaurant review data in Snowflake. The pipeline includes data ingestion, transformation, sentiment analysis, and query optimization across a multi-phase SQL architecture.
Architecture
Data Pipeline Phases
SQL files are organized in dependency order to build analytics incrementally:
| Phase | Files | Purpose |
|---|---|---|
| 1 | 3 files | Base tables (parsed reviews, locations, review data) |
| 2 | 2 files | Individual review analysis (POS/NEG tags, review tags) |
| 3 | 2 files | Aggregated tag analysis (common tags, restaurant tags) |
| 4 | 4 files | Advanced analytics (sentiment over time, AI summaries, clusters) |
| 5 | 3 files | Query tools (location search, tag printing, queries) |
Python Automation
The `execute_in_order.py` script automates the SQL pipeline:
- Connects to Snowflake via the Python connector
- Executes SQL files in dependency order
- Provides an interactive menu for phase selection
- Handles errors with retry logic
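The ordering and statement-splitting logic can be sketched as below. This is a minimal sketch, not the actual script: the `phaseN_` filename prefix and the naive semicolon split are illustrative assumptions.

```python
import re
from pathlib import Path

def order_sql_files(paths):
    """Sort SQL file paths by a hypothetical 'phaseN_' filename prefix.

    Files without a recognizable prefix sort last, alphabetically.
    """
    def phase_key(p):
        m = re.match(r"phase(\d+)_", Path(p).name)
        return (int(m.group(1)) if m else float("inf"), Path(p).name)
    return sorted(paths, key=phase_key)

def split_statements(sql_text):
    """Naive split of a SQL script into statements on ';'.

    Good enough for simple DDL/DML files; does not handle semicolons
    inside string literals or comments.
    """
    return [s.strip() for s in sql_text.split(";") if s.strip()]
```

Each returned statement would then be passed to the connector's `cursor.execute()` in turn, since the basic API executes one statement at a time.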
Key SQL Transformations
JSON Data Parsing
Raw Yelp data arrives as JSON files. Snowflake's VARIANT type and semi-structured data functions parse nested JSON into relational tables:
```sql
-- Parse nested JSON reviews from the VARIANT column
SELECT
    raw_data:business_id::STRING AS business_id,
    raw_data:review_id::STRING   AS review_id,
    raw_data:stars::NUMBER       AS rating,
    raw_data:text::STRING        AS review_text,
    raw_data:date::TIMESTAMP     AS review_date
FROM RESTAURANTDATATABLE;
```
Sentiment Analysis
Sentiment is scored with custom SQL using keyword matching and weighted scoring. Positive and negative indicators are extracted from review text and aggregated over time.
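The weighted keyword approach amounts to the following, shown here in Python for clarity (the lexicon words and weights are illustrative, not the project's actual tag lists):

```python
# Hypothetical weighted sentiment lexicon; the real POS/NEG tag
# lists live in the Phase 2 SQL files.
POSITIVE = {"delicious": 2, "amazing": 2, "great": 1, "friendly": 1}
NEGATIVE = {"terrible": -2, "rude": -2, "slow": -1, "cold": -1}

def score_review(text):
    """Sum the weights of matched keywords; sign gives POS/NEG."""
    lexicon = {**POSITIVE, **NEGATIVE}
    words = (w.strip(".,!?") for w in text.lower().split())
    return sum(lexicon.get(w, 0) for w in words)
```

A review like "The food was delicious but service was slow." nets +2 for "delicious" and -1 for "slow", for an overall positive score of 1.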
Review Clustering
Groups similar reviews using text analysis and categorization algorithms to identify common themes and customer concerns.
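One simple categorization scheme, sketched below with made-up theme keyword sets (the actual SQL cluster definitions may differ):

```python
# Illustrative theme keyword sets for bucketing reviews.
THEMES = {
    "service": {"staff", "waiter", "server", "service"},
    "food":    {"food", "taste", "dish", "menu", "flavor"},
    "wait":    {"wait", "line", "slow", "queue"},
}

def categorize(text):
    """Return the set of themes whose keywords appear in the review."""
    words = set(text.lower().replace(",", " ").replace(".", " ").split())
    hits = {theme for theme, kws in THEMES.items() if words & kws}
    return hits or {"other"}
```

Counting theme hits across a restaurant's reviews then surfaces its most common customer concerns.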
Competitor Analysis
Geographic proximity queries using Snowflake's geospatial functions to find competing restaurants and compare ratings.
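In Snowflake this kind of query uses functions like `ST_DISTANCE` on `GEOGRAPHY` points, which return great-circle distance in meters. A local Python equivalent of that distance calculation, for illustration:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points,
    analogous to ST_DISTANCE over GEOGRAPHY points."""
    r = 6371000.0  # mean Earth radius in meters
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * r * asin(sqrt(a))
```

Filtering businesses to within, say, 1 km of a target restaurant and comparing average star ratings yields the competitor comparison.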
Data Processing Workflow
1. Split large JSON files into manageable chunks
2. Upload to Snowflake stage via Snowsight UI
3. Load into VARIANT column table
4. Parse JSON into structured base tables
5. Run Phase 1-3 SQL for core transformations
6. Run Phase 4 for advanced analytics
7. Query results via Phase 5 query tools
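Step 1 above can be sketched as follows, assuming the Yelp dump is newline-delimited JSON; the chunk-file naming is illustrative:

```python
from pathlib import Path

def split_ndjson(src, out_dir, lines_per_chunk=50000):
    """Split a newline-delimited JSON file into smaller chunk files
    sized for staging and loading."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk, idx, written = [], 0, []
    with open(src, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                chunk.append(line)
            if len(chunk) >= lines_per_chunk:
                path = out_dir / f"chunk_{idx:04d}.json"
                path.write_text("".join(chunk), encoding="utf-8")
                written.append(path)
                chunk, idx = [], idx + 1
    if chunk:  # flush any remaining lines
        path = out_dir / f"chunk_{idx:04d}.json"
        path.write_text("".join(chunk), encoding="utf-8")
        written.append(path)
    return written
```

Keeping each chunk comfortably under Snowflake's recommended staged-file sizes makes the `COPY INTO` loads parallelize well.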
Performance Optimizations
Clustering
Tables are clustered by commonly filtered columns (business_id, review_date) to improve query performance:
```sql
ALTER TABLE RESTAURANTDATATABLE
CLUSTER BY (business_id, review_date);
```
Warehouse Sizing
Dynamic warehouse scaling from Small to Large based on workload:
- Small: Development and light queries (~$2-4/hour)
- Medium: Standard analytics workloads
- Large: Heavy aggregations and ETL (~$8/hour)
Auto-Suspend
Warehouses configured to auto-suspend after 5 minutes of inactivity to control costs during development.
Technology Stack
| Technology | Purpose |
|---|---|
| Snowflake | Cloud data warehouse |
| Snowsight UI | Web-based SQL interface |
| Python | Automation scripts |
| snowflake-connector-python | Database connectivity |
| Streamlit | Dashboard application |
What I Learned
This project taught me cloud data warehouse fundamentals: semi-structured data handling, query optimization through clustering, cost management via warehouse sizing, and building automated ETL pipelines. Snowflake's separation of storage and compute enables flexible scaling that traditional databases can't match.
Use Cases
- Restaurant performance benchmarking
- Customer sentiment tracking over time
- Competitive analysis by location
- Trend identification in dining preferences