Project Overview

This project demonstrates cloud data warehouse architecture by analyzing Yelp restaurant review data in Snowflake. The pipeline includes data ingestion, transformation, sentiment analysis, and query optimization across a multi-phase SQL architecture.

Architecture

Data Pipeline Phases

SQL files are organized in dependency order to build analytics incrementally:

PhaseFilesPurpose
13 filesBase tables (parsed reviews, locations, review data)
22 filesIndividual review analysis (POS/NEG tags, review tags)
32 filesAggregated tag analysis (common tags, restaurant tags)
44 filesAdvanced analytics (sentiment over time, AI summaries, clusters)
53 filesQuery tools (location search, tag printing, queries)

Python Automation

The execute_in_order.py script automates the SQL pipeline:

Key SQL Transformations

JSON Data Parsing

Raw Yelp data arrives as JSON files. Snowflake's VARIANT type and semi-structured data functions parse nested JSON into relational tables:

-- Parse nested JSON reviews
SELECT 
  raw_data:business_id::STRING as business_id,
  raw_data:review_id::STRING as review_id,
  raw_data:stars::NUMBER as rating,
  raw_data:text::STRING as review_text,
  raw_data:date::TIMESTAMP as review_date
FROM RESTAURANTDATATABLE;

Sentiment Analysis

Custom SQL-based sentiment scoring using keyword matching and weighted scoring algorithms. Positive and negative indicators are extracted and aggregated over time.

Review Clustering

Groups similar reviews using text analysis and categorization algorithms to identify common themes and customer concerns.

Competitor Analysis

Geographic proximity queries using Snowflake's geospatial functions to find competing restaurants and compare ratings.

Data Processing Workflow

1. Split large JSON files into manageable chunks
2. Upload to Snowflake stage via Snowsight UI
3. Load into VARIANT column table
4. Parse JSON into structured base tables
5. Run Phase 1-3 SQL for core transformations
6. Run Phase 4 for advanced analytics
7. Query results via Phase 5 query tools

Performance Optimizations

Clustering

Tables are clustered by commonly filtered columns (business_id, review_date) to improve query performance:

ALTER TABLE RESTAURANTDATATABLE 
CLUSTER BY (business_id, review_date);

Warehouse Sizing

Dynamic warehouse scaling from Small to Large based on workload:

Auto-Suspend

Warehouses configured to auto-suspend after 5 minutes of inactivity to control costs during development.

Technology Stack

TechnologyPurpose
SnowflakeCloud data warehouse
Snowsight UIWeb-based SQL interface
PythonAutomation scripts
snowflake-connector-pythonDatabase connectivity
StreamlitDashboard application

What I Learned

This project taught me cloud data warehouse fundamentals: semi-structured data handling, query optimization through clustering, cost management via warehouse sizing, and building automated ETL pipelines. Snowflake's separation of storage and compute enables flexible scaling that traditional databases can't match.

Use Cases