Project Overview
This project demonstrates cloud data warehouse architecture by analyzing Yelp restaurant review data in Snowflake. The pipeline includes data ingestion, transformation, sentiment analysis, and query optimization across a multi-phase SQL architecture.
Architecture
Data Pipeline Phases
SQL files are organized in dependency order to build analytics incrementally:
| Phase | Files | Purpose |
|---|---|---|
| 1 | 3 files | Base tables (parsed reviews, locations, review data) |
| 2 | 2 files | Individual review analysis (POS/NEG tags, review tags) |
| 3 | 2 files | Aggregated tag analysis (common tags, restaurant tags) |
| 4 | 4 files | Advanced analytics (sentiment over time, AI summaries, clusters) |
| 5 | 3 files | Query tools (location search, tag printing, queries) |
Python Automation
The `execute_in_order.py` script automates the SQL pipeline:
- Connects to Snowflake via the Python connector
- Executes SQL files in dependency order
- Provides an interactive menu for phase selection
- Handles errors with retry logic
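The ordering and statement-splitting logic can be sketched as below. This is a minimal sketch, not the actual script: the `phaseN_` filename prefix and the naive semicolon split are illustrative assumptions.

```python
import re
from pathlib import Path

def order_sql_files(paths):
    """Sort SQL file paths by a hypothetical 'phaseN_' filename prefix.

    Files without a recognizable prefix sort last, alphabetically.
    """
    def phase_key(p):
        m = re.match(r"phase(\d+)_", Path(p).name)
        return (int(m.group(1)) if m else float("inf"), Path(p).name)
    return sorted(paths, key=phase_key)

def split_statements(sql_text):
    """Naive split of a SQL script into statements on ';'.

    Good enough for simple DDL/DML files; does not handle semicolons
    inside string literals or comments.
    """
    return [s.strip() for s in sql_text.split(";") if s.strip()]
```

Each returned statement would then be passed to the connector's `cursor.execute()` in turn, since the basic API executes one statement at a time.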
Key SQL Transformations
JSON Data Parsing
Raw Yelp data arrives as JSON files. Snowflake's VARIANT type and semi-structured data functions parse nested JSON into relational tables:
```sql
-- Parse nested JSON reviews from the VARIANT column
SELECT
    raw_data:business_id::STRING AS business_id,
    raw_data:review_id::STRING   AS review_id,
    raw_data:stars::NUMBER       AS rating,
    raw_data:text::STRING        AS review_text,
    raw_data:date::TIMESTAMP     AS review_date
FROM RESTAURANTDATATABLE;
```
Sentiment Analysis
Sentiment is scored with custom SQL using keyword matching and weighted scoring. Positive and negative indicators are extracted from review text and aggregated over time.
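The weighted keyword approach amounts to the following, shown here in Python for clarity (the lexicon words and weights are illustrative, not the project's actual tag lists):

```python
# Hypothetical weighted sentiment lexicon; the real POS/NEG tag
# lists live in the Phase 2 SQL files.
POSITIVE = {"delicious": 2, "amazing": 2, "great": 1, "friendly": 1}
NEGATIVE = {"terrible": -2, "rude": -2, "slow": -1, "cold": -1}

def score_review(text):
    """Sum the weights of matched keywords; sign gives POS/NEG."""
    lexicon = {**POSITIVE, **NEGATIVE}
    words = (w.strip(".,!?") for w in text.lower().split())
    return sum(lexicon.get(w, 0) for w in words)
```

A review like "The food was delicious but service was slow." nets +2 for "delicious" and -1 for "slow", for an overall positive score of 1.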
Review Clustering
Groups similar reviews using text analysis and categorization algorithms to identify common themes and customer concerns.
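One simple categorization scheme, sketched below with made-up theme keyword sets (the actual SQL cluster definitions may differ):

```python
# Illustrative theme keyword sets for bucketing reviews.
THEMES = {
    "service": {"staff", "waiter", "server", "service"},
    "food":    {"food", "taste", "dish", "menu", "flavor"},
    "wait":    {"wait", "line", "slow", "queue"},
}

def categorize(text):
    """Return the set of themes whose keywords appear in the review."""
    words = set(text.lower().replace(",", " ").replace(".", " ").split())
    hits = {theme for theme, kws in THEMES.items() if words & kws}
    return hits or {"other"}
```

Counting theme hits across a restaurant's reviews then surfaces its most common customer concerns.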
Competitor Analysis
Geographic proximity queries using Snowflake's geospatial functions to find competing restaurants and compare ratings.
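In Snowflake this kind of query uses functions like `ST_DISTANCE` on `GEOGRAPHY` points, which return great-circle distance in meters. A local Python equivalent of that distance calculation, for illustration:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points,
    analogous to ST_DISTANCE over GEOGRAPHY points."""
    r = 6371000.0  # mean Earth radius in meters
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * r * asin(sqrt(a))
```

Filtering businesses to within, say, 1 km of a target restaurant and comparing average star ratings yields the competitor comparison.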
Data Processing Workflow
1. Split large JSON files into manageable chunks
2. Upload to Snowflake stage via Snowsight UI
3. Load into VARIANT column table
4. Parse JSON into structured base tables
5. Run Phase 1-3 SQL for core transformations
6. Run Phase 4 for advanced analytics
7. Query results via Phase 5 query tools
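Step 1 above can be sketched as follows, assuming the Yelp dump is newline-delimited JSON; the chunk-file naming is illustrative:

```python
from pathlib import Path

def split_ndjson(src, out_dir, lines_per_chunk=50000):
    """Split a newline-delimited JSON file into smaller chunk files
    sized for staging and loading."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk, idx, written = [], 0, []
    with open(src, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                chunk.append(line)
            if len(chunk) >= lines_per_chunk:
                path = out_dir / f"chunk_{idx:04d}.json"
                path.write_text("".join(chunk), encoding="utf-8")
                written.append(path)
                chunk, idx = [], idx + 1
    if chunk:  # flush any remaining lines
        path = out_dir / f"chunk_{idx:04d}.json"
        path.write_text("".join(chunk), encoding="utf-8")
        written.append(path)
    return written
```

Keeping each chunk comfortably under Snowflake's recommended staged-file sizes makes the `COPY INTO` loads parallelize well.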
Performance Optimizations
Clustering
Tables are clustered by commonly filtered columns (business_id, review_date) to improve query performance:
```sql
ALTER TABLE RESTAURANTDATATABLE
CLUSTER BY (business_id, review_date);
```
Warehouse Sizing
Dynamic warehouse scaling from Small to Large based on workload:
- Small: Development and light queries (~$2-4/hour)
- Medium: Standard analytics workloads
- Large: Heavy aggregations and ETL (~$8/hour)
Auto-Suspend
Warehouses configured to auto-suspend after 5 minutes of inactivity to control costs during development.
Technology Stack
| Technology | Purpose |
|---|---|
| Snowflake | Cloud data warehouse |
| Snowsight UI | Web-based SQL interface |
| Python | Automation scripts |
| snowflake-connector-python | Database connectivity |
| Streamlit | Dashboard application |
What I Learned
This project taught me cloud data warehouse fundamentals: semi-structured data handling, query optimization through clustering, cost management via warehouse sizing, and building automated ETL pipelines. Snowflake's separation of storage and compute enables flexible scaling that traditional databases can't match.
Use Cases
- Restaurant performance benchmarking
- Customer sentiment tracking over time
- Competitive analysis by location
- Trend identification in dining preferences