Selected work from my data science coursework and personal projects.
Using R and ggplot2 to compare 114,000 Spotify tracks across 125 genres — finding the audio fingerprints that make each genre sound distinct.
For my second portfolio project I wanted to work in a different language
and lean into a different strength: visualization. R's ggplot2 is
widely considered the gold standard for statistical graphics, and Spotify's
audio feature data is rich enough to tell several stories at once. Using a public
Kaggle dataset of 114,000 tracks tagged with audio metrics (danceability, energy,
valence, acousticness, and more), I produced three charts that together describe
how genre, mood, and popularity show up in the underlying data.
Tools: R (tidyverse, ggplot2), RStudio.
Six genres, four audio features, one chart. Classical lives in a near-pure-acoustic, low-energy bubble. Metal is the energy outlier. Hip-hop leads on danceability. Country and jazz sit closer to the middle, blending features in different proportions. The chart makes audible characteristics visible — anyone reading it can match the bars to their mental model of each genre.
Plotting energy against valence (musical positivity) reveals four mood quadrants. Classical clusters in the calm/melancholy bottom-left. Metal lives on the right side (high energy), spanning both happy and angry valences. EDM tilts toward upbeat. Hip-hop spreads across the entire valence range — the most emotionally diverse genre in the sample. Every dot is a real track; every cluster is a genre's emotional signature.
Comparing popular tracks (popularity 70+) against obscure ones (under 30) reveals real but modest patterns. Hits skew toward higher energy and slightly higher danceability. Acoustic tracks are rare among hits — pure acoustic music doesn't dominate Spotify charts. The biggest surprise: valence (positivity) shows almost no difference between popular and obscure tracks. People listen to all moods at all popularity levels. The idea that "happy songs sell" turns out to be wrong, at least in this data.
The first half of the work is data preparation. With the tidyverse pipe
operator (|>), the transformations read like English — take the data,
filter to selected genres, group, summarize, then reshape:
chart1_data <- df |>
filter(track_genre %in% selected_genres) |>
group_by(track_genre) |>
summarise(
Danceability = mean(danceability),
Energy = mean(energy),
Valence = mean(valence),
Acousticness = mean(acousticness)
) |>
pivot_longer(
cols = -track_genre,
names_to = "feature",
values_to = "value"
)
Then ggplot2 builds the chart layer by layer. Each + adds one piece — the geometry, the color scale, the labels, the theme. It's a more declarative approach than matplotlib's procedural style:
ggplot(chart1_data, aes(x = track_genre, y = value, fill = feature)) +
geom_col(position = "dodge", width = 0.8) +
scale_fill_manual(values = c(
"Danceability" = "#C8A15C",
"Energy" = "#E31837",
"Valence" = "#FFFFFF",
"Acousticness" = "#888888"
)) +
labs(
title = "Genre Fingerprints: How 6 Genres Sound, by the Numbers",
subtitle = "Average audio feature values across 1,000 tracks per genre",
y = "Average value (0-1 scale)"
) +
theme_minimal(base_size = 12)
This project leaned into R's two biggest strengths: expressive data manipulation with the tidyverse and polished visualizations with ggplot2. Compared to the equivalent matplotlib work in my cinema project, ggplot2 produces noticeably more refined charts with less code — and faceting (splitting one chart into multiple panels) is a single line in R rather than a manual loop. Each tool has its strengths; doing this second project in R let me feel them firsthand.
Code available on GitHub.
Analyzing 1,000 top-rated films from a public IMDb dataset to surface trends in runtime, recency bias, and directorial consistency.
Using a public Kaggle dataset of IMDb's top 1,000 films, I built a small end-to-end data pipeline: cleaning the raw CSV with Python, loading it into a SQLite database, writing SQL to surface insights, and visualizing the findings with matplotlib. Three findings stood out.
Tools: Python (pandas, sqlite, matplotlib), SQL
Average runtime climbed from 86 minutes in the 1920s to 128 minutes in the 2010s — a 50% increase over a century. The biggest jump came between the silent era (1920s) and the studio era (1930s–1950s), once sound and longer-form storytelling became the norm. Modern epics and franchise films have kept runtimes near the two-hour mark.
Just 11 films from the 1920s made the top 1,000 — versus 242 from the 2010s. Yet the average rating barely shifts across decades (8.13 in the 1920s vs. 7.92 in the 2010s). The takeaway: "best of all time" lists reflect who's voting now as much as objective film quality. (Note: 2020s data only includes films through 2020, hence the small bar.)
Of directors with at least three films in the top 1,000, Christopher Nolan leads with an 8.46 average across eight films — well above the dataset's overall average of 7.95. Behind him: Peter Jackson and Francis Ford Coppola tied at 8.40. Notable for sheer volume of acclaimed work: Akira Kurosawa with 10 films in the list.
The pipeline runs in three steps. First, cleaning the raw data - IMDb's
Runtime, Released_Year, and Gross
columns all come in as text and need to be coerced into proper numeric
types:
import pandas as pd
import sqlite3
df = pd.read_csv('imdb_top_1000.csv')
# Strip " min" suffix and convert to integer
df['Runtime'] = df['Runtime'].str.replace(' min', '').astype(int)
# Coerce year to numeric (blanks become NaN)
df['Released_Year'] = pd.to_numeric(df['Released_Year'], errors='coerce')
# Strip commas from gross figures and coerce to numeric
df['Gross'] = df['Gross'].str.replace(',', '')
df['Gross'] = pd.to_numeric(df['Gross'], errors='coerce')
# Save to SQLite for querying
conn = sqlite3.connect('movies.db')
df.to_sql('movies', conn, if_exists='replace', index=False)
conn.close()
Then the SQL - bucketing films into decades and aggregating ratings, runtime, and counts:
SELECT (CAST(Released_Year AS INTEGER) / 10) * 10 AS decade,
COUNT(*) AS movie_count,
ROUND(AVG(IMDB_Rating), 2) AS avg_rating,
ROUND(AVG(Runtime), 0) AS avg_runtime_min
FROM movies
WHERE Released_Year IS NOT NULL
GROUP BY decade
ORDER BY decade;
And for the top directors finding, a HAVING clause filters out
directors with fewer than three films, so the list reflects consistency
across a body of work rather than one-off masterpieces:
SELECT Director,
COUNT(*) AS movie_count,
ROUND(AVG(IMDB_Rating), 2) AS avg_rating
FROM movies
GROUP BY Director
HAVING COUNT(*) >= 3
ORDER BY avg_rating DESC
LIMIT 10;
This was a small project, but it touched every stage of a real data workflow: ingestion, cleaning, storage, querying, visualization, and communication. The most valuable lesson wasn't technical - it was that data verification matters before analysis matters. Spotting the recency bias before drawing conclusions about "decline in cinema quality" is the difference between a real insight and a misleading one.
Code and data available on GitHub.