Projects

Selected work from my data science coursework and personal projects.

The Sound of Genre: Visualizing Spotify's Audio DNA

Using R and ggplot2 to compare 114,000 Spotify tracks across 125 genres — finding the audio fingerprints that make each genre sound distinct.

Overview

For my second portfolio project I wanted to work in a different language and lean into a different strength: visualization. R's ggplot2 is widely considered the gold standard for statistical graphics, and Spotify's audio feature data is rich enough to tell several stories at once. Using a public Kaggle dataset of 114,000 tracks tagged with audio metrics (danceability, energy, valence, acousticness, and more), I produced three charts that together describe how genre, mood, and popularity show up in the underlying data.

Tools: R (tidyverse, ggplot2), RStudio.

Finding 1: Every genre has a distinct audio fingerprint

Bar chart comparing average danceability, energy, valence, and acousticness across six music genres: classical, country, EDM, hip-hop, jazz, and metal

Six genres, four audio features, one chart. Classical lives in a near-pure-acoustic, low-energy bubble. Metal is the energy outlier. Hip-hop leads on danceability. Country and jazz sit closer to the middle, blending features in different proportions. The chart makes audible characteristics visible — anyone reading it can match the bars to their mental model of each genre.

Finding 2: The Mood Map separates genres by energy and emotion

Scatter plot showing 500 sample tracks per genre plotted by energy (x-axis) and valence (y-axis), revealing distinct clusters per genre across the four mood quadrants

Plotting energy against valence (musical positivity) reveals four mood quadrants. Classical clusters in the calm/melancholy bottom-left. Metal lives on the right side (high energy), spanning both happy and angry valences. EDM tilts toward upbeat. Hip-hop spreads across the entire valence range — the most emotionally diverse genre in the sample. Every dot is a real track; every cluster is a genre's emotional signature.

Finding 3: Hits favor energy and movement, not happiness

Density plot comparing audio feature distributions of popular tracks (popularity 70 or above) versus obscure tracks (under 30) across four features: danceability, energy, valence, and acousticness

Comparing popular tracks (popularity 70+) against obscure ones (under 30) reveals real but modest patterns. Hits skew toward higher energy and slightly higher danceability. Acoustic tracks are rare among hits — pure acoustic music doesn't dominate Spotify charts. The biggest surprise: valence (positivity) shows almost no difference between popular and obscure tracks. People listen to all moods at all popularity levels. The idea that "happy songs sell" turns out to be wrong, at least in this data.

The Code

The first half of the work is data preparation. With the tidyverse pipe operator (|>), the transformations read like English — take the data, filter to selected genres, group, summarize, then reshape:

chart1_data <- df |>
  filter(track_genre %in% selected_genres) |>
  group_by(track_genre) |>
  summarise(
    Danceability = mean(danceability),
    Energy = mean(energy),
    Valence = mean(valence),
    Acousticness = mean(acousticness)
  ) |>
  pivot_longer(
    cols = -track_genre,
    names_to = "feature",
    values_to = "value"
  )

Then ggplot2 builds the chart layer by layer. Each + adds one piece — the geometry, the color scale, the labels, the theme. It's a more declarative approach than matplotlib's procedural style:

ggplot(chart1_data, aes(x = track_genre, y = value, fill = feature)) +
  geom_col(position = "dodge", width = 0.8) +
  scale_fill_manual(values = c(
    "Danceability" = "#C8A15C",
    "Energy"       = "#E31837",
    "Valence"      = "#FFFFFF",
    "Acousticness" = "#888888"
  )) +
  labs(
    title = "Genre Fingerprints: How 6 Genres Sound, by the Numbers",
    subtitle = "Average audio feature values across 1,000 tracks per genre",
    y = "Average value (0-1 scale)"
  ) +
  theme_minimal(base_size = 12)

Takeaways

This project leaned into R's two biggest strengths: expressive data manipulation with the tidyverse and polished visualizations with ggplot2. Compared to the equivalent matplotlib work in my cinema project, ggplot2 produces noticeably more refined charts with less code — and faceting (splitting one chart into multiple panels) is a single line in R rather than a manual loop. Each tool has its strengths; doing this second project in R let me feel them firsthand.

Code available on GitHub.

100 Years of Cinema

Analyzing 1,000 top-rated films from a public IMDb dataset to surface trends in runtime, recency bias, and directorial consistency.

Overview

Using a public Kaggle dataset of IMDb's top 1,000 films, I built a small end-to-end data pipeline: cleaning the raw CSV with Python, loading it into a SQLite database, writing SQL to surface insights, and visualizing the findings with matplotlib. Three findings stood out.

Tools: Python (pandas, sqlite, matplotlib), SQL

Finding 1: Movies have gotten dramatically longer

Bar chart showing average movie 
        runtime increasing from 86 minutes in the 1920s to 128 minutes in the 2010s

Average runtime climbed from 86 minutes in the 1920s to 128 minutes in the 2010s — a 50% increase over a century. The biggest jump came between the silent era (1920s) and the studio era (1930s–1950s), once sound and longer-form storytelling became the norm. Modern epics and franchise films have kept runtimes near the two-hour mark.

Finding 2: "Best of" lists have a strong recency bias

Bar chart showing the number of 
        films in IMDb's top 1000 by decade, with massive spikes in the 1990s, 2000s, 
        and 2010s

Just 11 films from the 1920s made the top 1,000 — versus 242 from the 2010s. Yet the average rating barely shifts across decades (8.13 in the 1920s vs. 7.92 in the 2010s). The takeaway: "best of all time" lists reflect who's voting now as much as objective film quality. (Note: 2020s data only includes films through 2020, hence the small bar.)

Finding 3: Christopher Nolan is modern cinema's most consistent director

Horizontal bar chart of the top 10 
        directors by average IMDb rating, with Christopher Nolan at the top with an 
        8.46 average across 8 films

Of directors with at least three films in the top 1,000, Christopher Nolan leads with an 8.46 average across eight films — well above the dataset's overall average of 7.95. Behind him: Peter Jackson and Francis Ford Coppola tied at 8.40. Notable for sheer volume of acclaimed work: Akira Kurosawa with 10 films in the list.

The Code

The pipeline runs in three steps. First, cleaning the raw data - IMDb's Runtime, Released_Year, and Gross columns all come in as text and need to be coerced into proper numeric types:

import pandas as pd
import sqlite3

df = pd.read_csv('imdb_top_1000.csv')

# Strip " min" suffix and convert to integer
df['Runtime'] = df['Runtime'].str.replace(' min', '').astype(int)

# Coerce year to numeric (blanks become NaN)
df['Released_Year'] = pd.to_numeric(df['Released_Year'], errors='coerce')

# Strip commas from gross figures and coerce to numeric
df['Gross'] = df['Gross'].str.replace(',', '')
df['Gross'] = pd.to_numeric(df['Gross'], errors='coerce')

# Save to SQLite for querying
conn = sqlite3.connect('movies.db')
df.to_sql('movies', conn, if_exists='replace', index=False)
conn.close()

Then the SQL - bucketing films into decades and aggregating ratings, runtime, and counts:

SELECT (CAST(Released_Year AS INTEGER) / 10) * 10 AS decade,
      COUNT(*) AS movie_count,
      ROUND(AVG(IMDB_Rating), 2) AS avg_rating,
      ROUND(AVG(Runtime), 0) AS avg_runtime_min
FROM movies
WHERE Released_Year IS NOT NULL
GROUP BY decade
ORDER BY decade;

And for the top directors finding, a HAVING clause filters out directors with fewer than three films, so the list reflects consistency across a body of work rather than one-off masterpieces:

SELECT Director,
      COUNT(*) AS movie_count,
      ROUND(AVG(IMDB_Rating), 2) AS avg_rating
FROM movies
GROUP BY Director
HAVING COUNT(*) >= 3
ORDER BY avg_rating DESC
LIMIT 10;

Takeaways

This was a small project, but it touched every stage of a real data workflow: ingestion, cleaning, storage, querying, visualization, and communication. The most valuable lesson wasn't technical - it was that data verification matters before analysis matters. Spotting the recency bias before drawing conclusions about "decline in cinema quality" is the difference between a real insight and a misleading one.

Code and data available on GitHub.

More Coming Soon