Mini-Project #02: Making Backyards Affordable for All

Author

Apu Datta

Introduction

Housing affordability has become one of the most urgent policy challenges in the United States, particularly in high-cost metropolitan regions such as New York City, San Francisco, and Boston. Many local zoning and permitting practices have restricted new housing construction, driving up prices and limiting access for working families. In contrast, a growing movement known as YIMBY (“Yes In My Back Yard”) argues that cities can become more affordable, diverse, and dynamic by relaxing restrictive zoning rules and encouraging denser, mixed-use development. Opposing this stance are NIMBY (“Not In My Back Yard”) groups who resist additional housing in their neighborhoods.

YIMBY vs NIMBY Housing Illustration

🏙️ This project investigates where U.S. housing markets demonstrate pro-YIMBY patterns—where new housing construction is keeping pace with demand.


Mini-Project #02: “Making Backyards Affordable for All” challenges students to identify America’s most YIMBY-friendly metropolitan areas using multiple public datasets. Inspired by urban-planning analyses from Ray Delahanty (CityNerd), this project merges data from the U.S. Census Bureau’s American Community Survey (ACS) and the Bureau of Labor Statistics (BLS) to explore where rent growth, income trends, and new housing permits intersect. The analysis is performed using R and emphasizes the tidycensus, dplyr, and ggplot2 packages.

Urban Housing Skyline

🧰 Learning Objectives

Students will: - Practice data acquisition and integration with dplyr joins and key alignment across Census and BLS tables.
- Use basic visual analytics with ggplot2 to reveal affordability patterns.
- Construct interpretable indices such as Rent Burden and Housing Growth Rate to quantify affordability.
- Communicate results effectively to a non-technical policymaking audience through clear narrative and visuals.


🪜 Analytical Workflow

  1. Data Acquisition – download and prepare ACS and BLS datasets.
  2. Data Integration and Initial Exploration – join tables, ensure consistent CBSA identifiers, and answer exploratory questions.
  3. Visualization – create supporting plots showing rent vs income and household size trends.
  4. Metric Construction – compute and standardize rent-burden and housing-growth indices.
  5. Policy Brief – summarize findings for decision-makers and identify ideal congressional sponsors.

Data Flow Diagram – ACS, BLS, and Derived Metrics

🎯 Focus of This Project

This mini-project emphasizes data alignment, descriptive analysis, and storytelling over predictive modeling. While modeling and inference will appear in later assignments, MP02 demonstrates how structured public data can provide insights into how YIMBY-style housing policies influence affordability and community growth across U.S. cities. The final policy brief will synthesize these findings into a concise and persuasive argument supporting a national pro-YIMBY initiative.

Data Acquisition

🧩 Task 1: Data Import and Preparation

In this section, I downloaded and prepared all datasets necessary for my analysis of housing affordability.
I relied primarily on two U.S. government sources — the American Community Survey (ACS) from the U.S. Census Bureau and the Quarterly Census of Employment and Wages (QCEW) from the Bureau of Labor Statistics (BLS).
Each dataset was saved in a local directory named data/mp02.
The following code includes all instructor-provided functions and automated download steps. —

1️⃣ Creating the Data Directory and Package Setup

The first step ensures that the working directory contains a folder to store all downloaded datasets and loads every required R package.

📦 Click to see the code
# --- Step 1: Create data folder if it doesn't exist ---

if(!dir.exists(file.path("data", "mp02"))){
  dir.create(file.path("data", "mp02"), showWarnings=FALSE, recursive=TRUE)
}

# --- Step 2: Custom library() function to auto-install missing packages ---

library <- function(pkg){
  pkg <- as.character(substitute(pkg))
  options(repos = c(CRAN = "https://cloud.r-project.org"))
  if(!require(pkg, character.only=TRUE, quietly=TRUE)) install.packages(pkg)
  stopifnot(require(pkg, character.only=TRUE, quietly=TRUE))
}

# --- Step 3: Load required packages ---

library(tidyverse)
library(glue)
library(readxl)
library(tidycensus)
library(stringr)
library(cli)

🪄 Results:
This code first makes a new folder called data/mp02 to store all downloaded files. Then it checks that I have the needed R packages (like tidyverse, glue, readxl, etc.). If any are missing, it installs them automatically and loads them so that project runs smoothly. —

2️⃣ Downloading ACS Data (Income, Rent, Population, Households)

I used the get_acs_all_years() helper function to pull annual CBSA-level estimates for key socioeconomic indicators from 2009–2023, excluding 2020 due to COVID-19 data gaps.

🏗️ Click to see the ACS Download Code
# --- Function to Download ACS Data (Income, Rent, Population, Households) ---
get_acs_all_years <- function(variable, geography = "cbsa",
                              start_year = 2009, end_year = 2023) {
  fname <- glue("{variable}_{geography}_{start_year}_{end_year}.csv")
  fname <- file.path("data", "mp02", fname)
  
  if (!file.exists(fname)) {
    YEARS <- seq(start_year, end_year)
    YEARS <- YEARS[YEARS != 2020]  # Drop 2020
    
    cli::cli_alert_info("Downloading ACS {variable} for {length(YEARS)} years ({min(YEARS)}–{max(YEARS)})...")
    
    ALL_DATA <- purrr::map(
      YEARS,
      function(yy) {
        tidycensus::get_acs(geography, variable, year = yy, survey = "acs1") |>
          dplyr::mutate(year = yy) |>
          dplyr::select(-moe, -variable) |>
          dplyr::rename(!!variable := estimate)
      },
      .progress = FALSE
    ) |> dplyr::bind_rows()
    
    readr::write_csv(ALL_DATA, fname)
    cli::cli_alert_success("Saved ACS {variable} to {fname}.")
  } else {
    cli::cli_alert_success("Using cached ACS {variable} from {fname}.")
  }
  
  readr::read_csv(fname, show_col_types = FALSE)
}
# --- Call the function for all required variables ---
INCOME <- get_acs_all_years("B19013_001") |> rename(household_income = B19013_001)
RENT <- get_acs_all_years("B25064_001") |> rename(monthly_rent = B25064_001)
POPULATION <- get_acs_all_years("B01003_001") |> rename(population = B01003_001)
HOUSEHOLDS <- get_acs_all_years("B11001_001") |> rename(households = B11001_001)

🪄 Result:
This code downloads data from the U.S. Census Bureau for every year between 2009 and 2023 (skipping 2020 because of missing data). It collects information about:

  • 💵 Income (median household income)
  • 🏠 Rent (average monthly rent)
  • 👨‍👩‍👧 Population (total people living in the area)
  • 🏘️ Households (total number of homes)

If already have the data saved in data/mp02 folder, it won’t download again, it just reuses the saved file. This saves time and avoids unnecessary downloads every time run the report.


3️⃣ Collecting Building Permit Data

Because the number of newly permitted housing units is not available in tidycensus, I imported this information manually from the U.S. Census Bureau’s construction statistics.

🏗️ Click to see the code
# --- Step 1: Define the function that downloads or loads building-permit data ---

get_building_permits <- function(start_year = 2009, end_year = 2023){
  fname <- glue("housing_units_{start_year}_{end_year}.csv")
  fname <- file.path("data", "mp02", fname)
  
  if(!file.exists(fname)){
    cli::cli_alert_info("Downloading building permits (historical and current)...")
    
    # 2009–2018 historical text files
    HISTORICAL_YEARS <- seq(start_year, 2018)
    HISTORICAL_DATA <- purrr::map(
      HISTORICAL_YEARS,
      function(yy){
        historical_url <-glue::glue("https://www.census.gov/construction/bps/txt/tb3u{yy}.txt")
        LINES <- readLines(historical_url)[-c(1:11)]
        CBSA_LINES <- stringr::str_detect(LINES, "^[[:digit:]]")
        CBSA <- as.integer(stringr::str_sub(LINES[CBSA_LINES], 5, 10))
        PERMIT_LINES <- stringr::str_detect(stringr::str_sub(LINES, 48, 53), "[[:digit:]]")
        PERMITS <- as.integer(stringr::str_sub(LINES[PERMIT_LINES], 48, 53))
        tibble::tibble(CBSA = CBSA, new_housing_units_permitted = PERMITS, year = yy)
      },
      .progress = TRUE
    ) |> dplyr::bind_rows()
    
    # 2019–2023 Excel-format files
    CURRENT_YEARS <- seq(2019, end_year)
    CURRENT_DATA <- purrr::map(
      CURRENT_YEARS,
      function(yy){
        current_url <- glue::glue("https://www.census.gov/construction/bps/xls/msaannual_{yy}99.xls")
        temp <- tempfile()
        download.file(current_url, destfile = temp, mode="wb")
        fallback <- function(.f1, .f2){ function(...){ tryCatch(.f1(...), error=function(e) .f2(...)) } }
        reader <- fallback(readxl::read_xlsx, readxl::read_xls)
        reader(temp, skip=5) |>
          tidyr::drop_na() |>
          dplyr::select(
            CBSA = dplyr::any_of(c("CBSA", "CBSA Code", "CBSAcode")),
            Total = dplyr::any_of(c("Total", "TOTAL"))
          ) |>
          dplyr::mutate(
            CBSA = as.integer(CBSA),
            year = yy,
            new_housing_units_permitted = as.integer(Total)
          ) |>
          dplyr::select(CBSA, new_housing_units_permitted, year)
      },
      .progress = TRUE
    ) |> dplyr::bind_rows()
    
    # Combine both and save
    ALL_DATA <- rbind(HISTORICAL_DATA, CURRENT_DATA)
    readr::write_csv(ALL_DATA, fname)
    cli::cli_alert_success("Saved building permits to {fname}.")
  } else {
    cli::cli_alert_success("Using cached building permits from {fname}.")
  }
  
  read_csv(fname, show_col_types=FALSE)
  
}

# --- Step 2: Run the function and store the result ---
PERMITS <- get_building_permits()
✔ Using cached building permits from data/mp02/housing_units_2009_2023.csv.

🪄 Result:

This code first defines a function that knows how to download U.S. Census building permit data for every metro area (CBSA) between 2009 and 2023. Then, it runs that function to either:

  • 📥 Download new data (if missing), or
  • 💾 Load the saved data (if it already exists).

4️⃣ Understanding Core-Based Statistical Areas (CBSA)

📍 Why Use the CBSA Level?

A Core-Based Statistical Area (CBSA) represents a metro or micro area built around a main urban center and its economically linked surroundings. All data sources in this project from the American Community Survey and Bureau of Labor Statistics to Census building permits are collected at the CBSA level.

This standardization ensures that housing, income, and employment data align consistently across regions, making analysis fair and comparable.


5️⃣ Downloading BLS Industry and Wage Data

The next steps use BLS sources to enrich our dataset with employment and income structure.

🏢 Click to see industry code function (get_bls_industry_codes)
library(httr2)
library(rvest)

# NOTE: This function downloads the NAICS hierarchy (industry titles) from bls.gov
# and shapes it into a clean table with level1–level4 titles & codes.


get_bls_industry_codes <- function(){
  fname <- file.path("data", "mp02", "bls_industry_codes.csv")
  library(dplyr); library(tidyr); library(readr)
  
  if(!file.exists(fname)){
    cli::cli_alert_info("Downloading BLS NAICS industry codes...")
    
    # Build request to the BLS industry titles page  
    resp <- request("https://www.bls.gov") |> 
      req_url_path("cew", "classifications", "industry", "industry-titles.htm") |>
      req_headers(
        `User-Agent` = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        `Accept-Language` = "en-US,en;q=0.9",
        `Accept` = "text/html,application/xhtml+xml"
      ) |>
      
      # Don't error hard on HTTP status; handle after
      req_error(is_error = \(resp) FALSE) |> req_perform()
    
    # Validate response status
    resp_check_status(resp)
    
    # Parse the NAICS table in the HTML page
    naics_table <- resp_body_html(resp) |>
      html_element("#naics_titles") |> html_table() |>
      mutate(title = str_trim(str_remove(str_remove(`Industry Title`, Code), "NAICS"))) |>
      select(-`Industry Title`) |>
      mutate(depth = if_else(nchar(Code) <= 5, nchar(Code) - 1, NA)) |>
      filter(!is.na(depth))
    
    # Manually add ranges that appear compressed on the page
    naics_missing <- tibble::tribble(
      ~Code, ~title, ~depth,
      "31","Manufacturing",1,"32","Manufacturing",1,"33","Manufacturing",1,
      "44","Retail",1,"45","Retail",1,"48","Transportation and Warehousing",1,"49","Transportation and Warehousing",1)
    
    naics_table <- bind_rows(naics_table, naics_missing)
    
    # Keep a copy of all levels to join titles across the hierarchy
    naics_all <- naics_table  
    
    # Build a clean 4-level hierarchy table with integer codes
    naics_table <- naics_table |>
      filter(depth == 4) |>
      rename(level4_title = title) |>
      mutate(
        level1_code = str_sub(Code, end = 2),
        level2_code = str_sub(Code, end = 3),
        level3_code = str_sub(Code, end = 4),
        level4_code = Code
      ) |>
      left_join(naics_all |> select(Code, title), by = c("level1_code" = "Code")) |>
      rename(level1_title = title) |>
      left_join(naics_all |> select(Code, title), by = c("level2_code" = "Code")) |>
      rename(level2_title = title) |>
      left_join(naics_all |> select(Code, title), by = c("level3_code" = "Code")) |>
      rename(level3_title = title) |>
      select(level1_title, level2_title, level3_title, level4_title,
             level1_code, level2_code, level3_code, level4_code) |>
      mutate(
        level1_code = suppressWarnings(as.integer(level1_code)),
        level2_code = suppressWarnings(as.integer(level2_code)),
        level3_code = suppressWarnings(as.integer(level3_code)),
        level4_code = suppressWarnings(as.integer(level4_code))
      )
    
    readr::write_csv(naics_table, fname)
    cli::cli_alert_success("Saved industry codes to {fname}.")
  } else {
    cli::cli_alert_success("Using cached industry codes from {fname}.")
  }
  
  read_csv(fname, show_col_types=FALSE)
  
}

🪄 Result:
This defines a function that downloads industry titles (NAICS) from BLS, builds a 4-level industry hierarchy, saves it to a CSV, and reuses that file next time.


▶ Click to see industry code download call
# Run the function once; returns a clean NAICS hierarchy table
INDUSTRY_CODES <- get_bls_industry_codes()
✔ Using cached industry codes from data/mp02/bls_industry_codes.csv.

🪄 Result:
This line runs the function above and stores the table in INDUSTRY_CODES. If the CSV already exists, it loads it; otherwise it downloads and saves it. —

💵 Click to see wage data function (get_bls_qcew_annual_averages)
# NOTE: This function downloads annual QCEW "singlefile" ZIPs for each year (except 2020),
#       filters to CBSA rows and ≤5-digit NAICS, computes average wage,
#       writes a compressed CSV once, then reuses it on later runs.

get_bls_qcew_annual_averages <- function(start_year=2009, end_year=2023){
  fname <- glue("bls_qcew_{start_year}_{end_year}.csv.gz")
  fname <- file.path("data", "mp02", fname)
  YEARS <- seq(start_year, end_year)
  YEARS <- YEARS[YEARS != 2020]
  
  
  if(!file.exists(fname)){
    ALL_DATA <- purrr::map(
      YEARS,
      purrr::possibly(function(yy) {
        fname_inner <- file.path("data", "mp02", glue("{yy}_qcew_annual_singlefile.zip"))
        if(!file.exists(fname_inner)){
          request("https://www.bls.gov") |> 
            req_url_path("cew","data","files",yy,"csv",glue("{yy}_annual_singlefile.zip")) |>
            req_headers(
              `User-Agent` = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
              `Accept-Language` = "en-US,en;q=0.9",
              `Accept` = "text/html,application/xhtml+xml"
            ) |>
            req_retry(max_tries=5) |> req_perform(fname_inner)
          if (file.info(fname_inner)$size < 755e5) {
            warning(sQuote(fname_inner), " appears corrupted. Please delete and retry this step.")
          }
        }
        
        readr::read_csv(fname_inner, show_col_types = FALSE) |>
          dplyr::mutate(
            YEAR = yy,
            annual_avg_emplvl  = as.numeric(annual_avg_emplvl),
            total_annual_wages = as.numeric(total_annual_wages)
          ) |>
          dplyr::select(area_fips, industry_code, annual_avg_emplvl, total_annual_wages, YEAR) |>
          dplyr::filter(
            nchar(industry_code) <= 5,
            stringr::str_starts(area_fips, "C"),
            !stringr::str_detect(industry_code, "-")
          ) |>
          dplyr::mutate(
            FIPS        = area_fips,
            INDUSTRY    = as.integer(industry_code),
            EMPLOYMENT  = as.integer(annual_avg_emplvl),
            TOTAL_WAGES = total_annual_wages
          ) |>
          dplyr::select(-area_fips, -industry_code, -annual_avg_emplvl, -total_annual_wages) |>
          dplyr::filter(INDUSTRY != 10) |>
          dplyr::mutate(AVG_WAGE = TOTAL_WAGES / EMPLOYMENT)
      }, otherwise = NULL),
      .progress = TRUE
    ) |>
      purrr::compact() |>
      dplyr::bind_rows()
    
    write_csv(ALL_DATA, fname)
  }
  
  # Load the combined, compressed CSV
  ALL_DATA <- read_csv(fname, show_col_types = FALSE)
  
  # Ensure we have every requested year
  ALL_DATA_YEARS <- unique(ALL_DATA$YEAR)
  YEARS_DIFF <- setdiff(YEARS, ALL_DATA_YEARS)
  if (length(YEARS_DIFF) > 0) {
    stop("Download failed for the following years: ", YEARS_DIFF,
         ". Please delete intermediate files and try again.")
  }
  
  ALL_DATA
}

🪄 Result:
This defines a function that downloads QCEW wage & employment data for each year (except 2020), filters to CBSAs and valid NAICS, computes average wage, saves everything to one compressed CSV, and reuses it next time.


▶ Click to see wage data download call
# Run the function once; returns CBSA × industry × year wages/employment
WAGES <- get_bls_qcew_annual_averages()

🪄 Result:
This line runs the function above and stores the combined table in WAGES. If the compressed file already exists, it loads it; otherwise it downloads, builds, and saves it first.

Helper functions to standardize CBSA keys across datasets. Census ACS uses GEOID like “C35620”, while BLS often uses numeric CBSA-like ids (e.g., 35620 → “356200”)

🧮 Click to see CBSA key helpers
# Helper functions to standardize CBSA keys across datasets
# Census ACS uses GEOID like "C35620", while BLS often uses numeric CBSA-like ids (e.g., 35620 → "356200")

std_cbsa_census <- function(df, id_col = "GEOID") {
  df |> dplyr::mutate(std_cbsa = paste0("C", .data[[id_col]]))
}

std_cbsa_bls <- function(df, id_col = "CBSA") {
  df |> dplyr::mutate(std_cbsa = paste0(.data[[id_col]], "0"))
}

🪄 Result:
These two small helpers add a new column called std_cbsa to any table pass in.

  • For ACS tables (with GEOID like "C35620"), it makes the key "C35620".
  • For BLS tables (with numeric CBSA like 35620), it makes the key "356200".

This gives all datasets a consistent join key (std_cbsa) so that it can safely merge income, rent, permits, and wages across the same metro areas without mismatches.


🗺️ Click to see CBSA key creation code
# --- Step 1: Keep exactly one row per GEOID (each metro area) ---
# The INCOME dataset has many rows per GEOID, so this keeps only the first.

# Keep exactly one row per GEOID (take the first NAME)
CBSA_KEYS <- INCOME |>
  dplyr::distinct(GEOID, .keep_all = TRUE) |>
  dplyr::select(GEOID, NAME) |>
  dplyr::mutate(
    # "C35620" -> 35620
    CBSA = suppressWarnings(as.integer(stringr::str_sub(GEOID, 2, 6)))
  )

# Assert uniqueness now that we've collapsed to one row per GEOID
stopifnot(!any(duplicated(CBSA_KEYS$GEOID)))

# Optional sanity message
cat("CBSA_KEYS rows:", nrow(CBSA_KEYS), 
    " Unique GEOIDs:", dplyr::n_distinct(CBSA_KEYS$GEOID), "\n")
CBSA_KEYS rows: 593  Unique GEOIDs: 593 

🪄 Result:
This code builds a lookup table that matches each Census GEOID (like "C35620") to its numeric CBSA code (35620) and metro name. It checks that every GEOID is unique and prints how many areas were found here:
593 metro areas, each appearing only once.


🏷️ Click to see CBSA name attachment code
# --- Step 1: Attach readable metro names (like "New York–Newark") to the permits table ---
# This joins the PERMITS dataset with CBSA_KEYS (which contains GEOID and metro names)

PERMITS_NAMED <- PERMITS |>
  dplyr::left_join(CBSA_KEYS, by = "CBSA")

# --- Step 2: Identify unmatched permit records ---
# Find any permit entries that don’t have a matching CBSA name
unmatched_perm <- PERMITS_NAMED |>
  dplyr::filter(is.na(NAME)) |>
  dplyr::distinct(CBSA)

# --- Step 3: Print summary of unmatched codes ---
cat("Unmatched PERMITS CBSA codes:", nrow(unmatched_perm), "\n")
Unmatched PERMITS CBSA codes: 401 

🪄 Result: This code connects the building permit data to the list of metro area names (CBSAs) so each permit record shows where it belongs (e.g., “New York–Newark”). —

💼 Click to see BLS wage data cleaning code
# --- Step 1: Standardize BLS wage data to use CBSA-style IDs ---
# The BLS dataset (WAGES) has a FIPS code like "C35620000" or similar.
# This extracts the first 5 digits after the "C" (e.g., "35620") to match ACS CBSA codes.

WAGES_CLEAN <- WAGES |>
  dplyr::mutate(
    CBSA = as.integer(stringr::str_sub(FIPS, 2, 6))  # extract numeric CBSA
  ) |>
  dplyr::filter(!is.na(CBSA)) |>
  dplyr::mutate(std_cbsa = paste0("C", CBSA)) |>     # make Census-style key
  dplyr::select(std_cbsa, YEAR, INDUSTRY, EMPLOYMENT, TOTAL_WAGES, AVG_WAGE)

# --- Step 2: Print a short summary in the console ---
cat("WAGES_CLEAN rows:", nrow(WAGES_CLEAN), "\n")
WAGES_CLEAN rows: 4442181 
💼 Click to see BLS wage data cleaning code
dir.create("data/mp02", showWarnings = FALSE, recursive = TRUE)
saveRDS(WAGES_CLEAN, "data/mp02/WAGES_CLEAN.rds")

🪄 Result:
This code cleans and standardizes the BLS wage dataset so it matches the same CBSA format used in Census data.

  • It extracts the CBSA code from the BLS “FIPS” column.
  • Then it creates a new ID called std_cbsa (e.g., "C35620") for easier joining with Census data.
  • Finally, it keeps only the useful columns: year, industry, employment, total wages, and average wage.

✅ The printed line (e.g., WAGES_CLEAN rows: 123456) just tells how wage records are now cleaned and ready to merge.


🔗 Integration Check: Combined Dataset Preview

🔗 Click to see dataset joins (integration_check)
# ---- Build a clean CBSA → NAME lookup from ACS once ----
# Take one row per GEOID from ACS and keep its display NAME; also derive numeric CBSA code.

CBSA_KEYS <- INCOME %>%
  dplyr::distinct(GEOID, .keep_all = TRUE) %>%
  dplyr::transmute(
    CBSA = suppressWarnings(as.integer(stringr::str_sub(GEOID, 2, 6))),
    NAME
  )

CBSA_KEYS <- CBSA_KEYS %>%
  dplyr::group_by(CBSA) %>%
  dplyr::summarise(NAME = dplyr::first(NAME), .groups = "drop")

# ---- Prepare keys for each table ----
# From ACS income: add numeric CBSA, a standardized key ("C35620"), and keep the ACS name.

INCOME_KEYS <- INCOME %>%
  dplyr::mutate(
    CBSA     = suppressWarnings(as.integer(stringr::str_sub(GEOID, 2, 6))),
    std_cbsa = paste0("C", CBSA),
    NAME.acs = NAME
  )

# From ACS rent: add numeric CBSA; keep only columns needed for the join.

RENT_KEYS <- RENT %>%
  dplyr::mutate(CBSA = suppressWarnings(as.integer(stringr::str_sub(GEOID, 2, 6)))) %>%
  dplyr::select(CBSA, year, monthly_rent)

# If PERMITS_NAMED exists, this simply refreshes the join; otherwise it creates it.
# It brings metro NAMEs onto the permits by matching the numeric CBSA.

PERMITS_NAMED <- PERMITS %>%
  dplyr::left_join(CBSA_KEYS, by = "CBSA")

# Ensure std_cbsa types align for the wages join

INCOME_KEYS  <- INCOME_KEYS  %>% dplyr::mutate(std_cbsa = as.character(std_cbsa))
WAGES_CLEAN2 <- WAGES_CLEAN %>% dplyr::mutate(std_cbsa = as.character(std_cbsa)) %>%
  dplyr::rename(year = YEAR)

# ---- Attach the readable metro name ONCE at the end (bullet-proof) ----
# This avoids referencing NAME before it exists in the pipeline.

joined_core <- INCOME_KEYS %>%
  dplyr::left_join(RENT_KEYS,     by = c("CBSA", "year")) %>%
  dplyr::left_join(PERMITS_NAMED, by = c("CBSA", "year"), suffix = c(".acs", ".perm")) %>%
  dplyr::left_join(WAGES_CLEAN2,  by = c("std_cbsa", "year")) %>%
  
  dplyr::select(
    CBSA, year, household_income, monthly_rent,
    new_housing_units_permitted, EMPLOYMENT, AVG_WAGE
  )

# ---- Attach the readable metro name ONCE at the end (bullet-proof) ----
joined_check <- joined_core %>%
  dplyr::left_join(CBSA_KEYS %>% dplyr::select(CBSA, NAME), by = "CBSA") %>%
  dplyr::relocate(NAME)

# Create a clean preview object (one row per CBSA–year)
joined_clean <- joined_check %>%
  dplyr::distinct(CBSA, year, .keep_all = TRUE)

# Nicely formatted table (no truncation of NAME, commas/dollars, clear NA)
library(gt)
library(scales)

joined_clean %>%
  dplyr::mutate(
    household_income             = dollar(household_income, accuracy = 1),
    monthly_rent                 = dollar(monthly_rent, accuracy = 1),
    new_housing_units_permitted  = comma(new_housing_units_permitted, accuracy = 1),
    EMPLOYMENT                   = comma(EMPLOYMENT, accuracy = 1),
    AVG_WAGE                     = dollar(AVG_WAGE, accuracy = 1)
  ) %>%
  
  head(10) %>%
  gt() %>%
  tab_caption("Table 1. Combined Metro Dataset Preview") %>%
  cols_label(
    NAME                        = "Metro",
    CBSA                        = "CBSA",
    year                        = "Year",
    household_income            = "Median HH Income",
    monthly_rent                = "Avg Monthly Rent",
    new_housing_units_permitted = "New Units Permitted",
    EMPLOYMENT                  = "Employment",
    AVG_WAGE                    = "Avg Wage"
  ) %>%
  fmt_missing(everything(), missing_text = "—") %>%
  tab_options(table.font.size = px(12), data_row.padding = px(4)) %>%
  tab_style(style = cell_text(weight = "bold"),
            locations = cells_column_labels())
Table 1. Combined Metro Dataset Preview
Metro CBSA Year Median HH Income Avg Monthly Rent New Units Permitted Employment Avg Wage
Aberdeen, WA Micro Area 140 2009 $36,345 $650
Abilene, TX Metro Area 180 2009 $42,931 $712
Adrian, MI Micro Area 300 2009 $45,640 $645
Aguadilla-Isabela-San Sebasti?n, PR Metro Area 380 2009 $13,470 $363
Akron, OH Metro Area 420 2009 $47,482 $723
Albany, GA Metro Area 500 2009 $36,218 $624
Albany-Lebanon, OR Micro Area 540 2009 $47,669 $761
Albany-Schenectady-Troy, NY Metro Area 580 2009 $57,677 $833
Albertville, AL Micro Area 700 2009 $37,284 $579
Albuquerque, NM Metro Area 740 2009 $46,824 $726

🪄 Note: This table summarizes the first 10 metro areas from the integrated ACS–BLS dataset, aligned by CBSA and year, showing median income, rent, permits, employment, and wages.

🪄 Result:
This chunk builds one clean, combined table per metro (CBSA) and year by merging:

  • Income (ACS)
  • Rent (ACS)
  • Building permits (Census construction)
  • Employment & wages (BLS QCEW)

🔹 Top 10 CBSAs by permitted housing units (2010s)

🏘️ Click to see Top 10 CBSA permit summary code
library(knitr)

# --- Filter data for the 2010s decade (2010–2019) ---
perm_10s <- PERMITS %>% 
  dplyr::filter(year >= 2010, year <= 2019)

# --- Group by CBSA (metro area) and total all new housing permits ---
perm_10s %>%
  dplyr::group_by(CBSA) %>%
  dplyr::summarise(units_10s = sum(new_housing_units_permitted, na.rm = TRUE), .groups="drop") %>%
  dplyr::arrange(dplyr::desc(units_10s)) %>%
  dplyr::slice_head(n = 10) %>%
  
  # --- Format for readability ---
  kable(
    col.names = c("CBSA", "Total Housing Units (2010s)"),
    align = c("c", "r")
  )
CBSA Total Housing Units (2010s)
26420 482075
19100 460826
35620 446020
12060 254377
47900 232483
38060 219939
42660 213022
12420 212392
31080 184260
36740 171588

🪄 Result:
This code looks at 2010–2019 and finds the Top 10 metro areas (CBSAs) that built the most new housing units during that decade. It filters the building-permit data, adds up the total number of new units per CBSA, and lists the metros with the largest housing growth in the 2010s. The table that appears shows those top 10 CBSAs ranked by total permits issued.


🔹 Largest annual housing-permit spike (single metro)

📈 Click to see largest housing-permit spike code
library(knitr)

# --- Step 1: Filter data for a specific metro area using its CBSA code (10740) ---
# Each CBSA represents one metropolitan region; here we’re focusing on the one with ID 10740.

PERMITS %>%
  dplyr::filter(CBSA == 10740) %>%
  dplyr::slice_max(new_housing_units_permitted, n = 1) %>%
  
  # --- Step 3: Format for presentation ---
  kable(
    col.names = c("CBSA", "New Housing Units Permitted", "Year"),
    align = c("c", "r", "c")
  )
CBSA New Housing Units Permitted Year
10740 4021 2021

🪄 Result:
This code finds the biggest housing-construction boom year for one metro area the CBSA with ID 10740. It looks through all years in that city’s permit data and picks the year when the highest number of new housing units were approved. The output shows the exact year and how many new homes were permitted during that record spike.


🔹 Top 10 CBSAs by median household income (2015)

💵 Click to see Top 10 CBSAs by income (2015)
library(knitr)

# --- Step 1: Focus only on 2015 data ---
# The INCOME dataset includes many years of ACS data; we only need 2015 here.

INCOME %>%
  dplyr::filter(year == 2015) %>%
  
  # --- Step 2: Sort from highest to lowest median household income ---
  dplyr::arrange(dplyr::desc(household_income)) %>%
  
  # --- Step 3: Display only the top 10 metro areas ---
  dplyr::slice_head(n = 10) %>%
  
  # --- Step 4: Format the table for readability ---
  dplyr::mutate(household_income = scales::dollar(household_income)) %>%
  kable(
    col.names = c("GEOID", "Metro Area", "Household Income (USD)", "Year"),
    align = c("c", "l", "r", "c")
  )
GEOID Metro Area Household Income (USD) Year
41940 San Jose-Sunnyvale-Santa Clara, CA Metro Area $101,980 2015
47900 Washington-Arlington-Alexandria, DC-VA-MD-WV Metro Area $93,294 2015
41860 San Francisco-Oakland-Hayward, CA Metro Area $88,518 2015
14860 Bridgeport-Stamford-Norwalk, CT Metro Area $86,414 2015
15680 California-Lexington Park, MD Metro Area $85,163 2015
33260 Midland, TX Metro Area $80,761 2015
37100 Oxnard-Thousand Oaks-Ventura, CA Metro Area $80,032 2015
14460 Boston-Cambridge-Newton, MA-NH Metro Area $78,800 2015
11260 Anchorage, AK Metro Area $78,238 2015
46520 Urban Honolulu, HI Metro Area $77,273 2015

🪄 Result:
This code finds the 10 richest metro areas (CBSAs) in 2015 based on median household income from the American Community Survey (ACS). It filters the data to just 2015, sorts all metro areas by income (highest first), and shows the top 10 revealing which regions had the highest-earning households that year.


🔹 Year with highest total wages in Finance & Insurance (NAICS 52)

🏦 Click to see Finance & Insurance wage analysis code
# --- Step 1: Filter for the Finance & Insurance industry ---
# In BLS data, NAICS code 52 represents Finance and Insurance.

library(knitr)

WAGES %>%
  dplyr::filter(INDUSTRY == 52) %>%
  dplyr::group_by(YEAR) %>%
  dplyr::summarise(total = sum(TOTAL_WAGES, na.rm = TRUE), .groups = "drop") %>%
  dplyr::slice_max(total, n = 1) %>%
  dplyr::mutate(total = scales::dollar(total, scale = 1e-9, suffix = "B")) %>%  # show billions
  kable(col.names = c("Year", "Total Wages (USD, Billions)"), align = c("c","r"))
Year Total Wages (USD, Billions)
2023 $718.95B

🪄 Result:
This code finds the year when the U.S. Finance & Insurance industry (NAICS 52) paid the most total wages across all metro areas. It filters the dataset to industry 52, adds up total wages per year, and picks the year with the highest nationwide payroll showing when the Finance sector was at its strongest earnings peak.


🧭 Data Integration and Initial Exploration

🔧 Normalize CBSA -> std_cbsa (zero-pad), then sanity-check
library(dplyr); library(stringr); library(tibble)
if (!requireNamespace("ggsave", quietly = TRUE)) install.packages("ggplot2") # ggsave is part of ggplot2


# 1) Confirm what we have
cat("CBSA_KEYS columns right now:\n"); print(names(CBSA_KEYS))
CBSA_KEYS columns right now:
[1] "CBSA" "NAME"
🔧 Normalize CBSA -> std_cbsa (zero-pad), then sanity-check
# 2) Normalize CBSA (ensure integer) and dedupe
CBSA_KEYS <- CBSA_KEYS %>%
  mutate(CBSA = suppressWarnings(as.integer(CBSA))) %>%
  distinct(CBSA, .keep_all = TRUE)

# 3) Create a zero-padded 5-digit string, then std_cbsa = "C" + 5 digits
CBSA_KEYS_std <- CBSA_KEYS %>%
  mutate(
    cbsa_str = ifelse(is.na(CBSA), NA_character_, sprintf("%05d", CBSA)),
    std_cbsa = paste0("C", cbsa_str)
  ) %>%
  select(std_cbsa, CBSA, NAME)


# 4) Sanity checks
bad <- CBSA_KEYS_std %>%
  filter(is.na(std_cbsa) | nchar(std_cbsa) != 6)

if (nrow(bad) > 0) {
  cat("⚠️ Found", nrow(bad), "rows with bad std_cbsa (NA or not C+5digits). Showing first 10:\n")
  print(head(bad, 10))
  # Common causes: CBSA was NA or non-numeric. can drop NA rows if needed:
  CBSA_KEYS_std <- CBSA_KEYS_std %>% filter(!is.na(std_cbsa), nchar(std_cbsa) == 6)
  cat("After dropping bad rows, CBSA_KEYS_std nrow =", nrow(CBSA_KEYS_std), "\n")
} else {
  cat("✅ All std_cbsa values are well-formed (C + 5 digits).\n")
}
✅ All std_cbsa values are well-formed (C + 5 digits).
🔧 Normalize CBSA -> std_cbsa (zero-pad), then sanity-check
# 5) Final assertions (should pass now)
stopifnot(all(stringr::str_starts(CBSA_KEYS_std$std_cbsa, "C")))
stopifnot(is.integer(CBSA_KEYS_std$CBSA))
stopifnot(all(nchar(CBSA_KEYS_std$std_cbsa) == 6))

# 6) Tiny preview
print(head(CBSA_KEYS_std, 3))
# A tibble: 3 × 3
  std_cbsa  CBSA NAME                         
  <chr>    <int> <chr>                        
1 C00020      20 Dothan, AL Metro Area        
2 C00060      60 Richmond, VA Metro Area      
3 C00080      80 Richmond-Berea, KY Micro Area

🌟 Extra Credit 01: Relationship Diagram

After completing Task 01, I examined the structure of all data tables (ACS, BLS, and Census sources) and created a data relationship diagram that clearly illustrates:

  • The data tables available for analysis
  • The key columns (names and types) in each table
  • The join relationships connecting these tables
  • Annotated notes explaining the complex join logic between ACS and BLS datasets (non-normalized keys)

Data relationship diagram linking ACS, BLS, and housing permit datasets through GEOID and CBSA mapping. Notes highlight non-normalized joins and annual link logic.

🧩 Task 2: Multi-Table Questions

Using the integrated ACS, BLS, and Census datasets, the following analyses answer the required multi-table questions using dplyr joins and summarization techniques.

Question 1: Which CBSA permitted the most new housing units (2010–2019)?

📊 Click to see the code for Q1 (fixed)
library(tidycensus)
library(dplyr)
library(knitr)
library(sf)  # for st_drop_geometry()

# A. Create a robust CBSA -> NAME lookup (covers anything missing in CBSA_KEYS)
# quiet attaches
suppressPackageStartupMessages({
  library(tigris)
  library(sf)
})

# tigris cache/progress config — works for old AND new tigris
if ("tigris_options" %in% getNamespaceExports("tigris")) {
  tigris::tigris_options(progress_bar = FALSE, cache = TRUE)
} else {
  options(tigris_use_cache = TRUE)  # new way
}

# explicitly set a year for reproducibility
cbsa_names <- tigris::core_based_statistical_areas(cb = TRUE, year = 2023) %>%
  sf::st_drop_geometry() %>%
  dplyr::transmute(CBSA = as.integer(GEOID), NAME)

# Prefer  CBSA_KEYS names, but fall back to cbsa_names when missing
CBSA_KEYS_FULL <- CBSA_KEYS %>%
  select(CBSA, NAME) %>%
  bind_rows(cbsa_names) %>%
  group_by(CBSA) %>%
  summarise(NAME = dplyr::first(na.omit(NAME)), .groups = "drop")

# B. Answer Q1 using the fuller lookup
q1_top_cbsa <- PERMITS %>%
  filter(year >= 2010, year <= 2019) %>%
  group_by(CBSA) %>%
  summarise(total_units = sum(new_housing_units_permitted, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(total_units)) %>%
  left_join(CBSA_KEYS_FULL, by = "CBSA") %>%
  mutate(NAME = coalesce(NAME, paste0("Unknown CBSA ", CBSA))) %>%
  select(NAME, total_units) %>%
  slice_head(n = 1)

kable(q1_top_cbsa, caption = "CBSA with the Most Housing Permits (2010–2019)")
CBSA with the Most Housing Permits (2010–2019)
NAME total_units
Houston-Pasadena-The Woodlands, TX 482075

Summary: The CBSA shown above had the largest cumulative number of housing permits issued between 2010 and 2019, indicating the region’s sustained construction growth during the decade. —

Question 2: In what year did Albuquerque (CBSA 10740) issue the most permits?

📊 Click to see the code for Q2
q2_abq <- PERMITS %>%
  filter(CBSA == 10740) %>%
  group_by(year) %>%
  summarise(total_units = sum(new_housing_units_permitted, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(total_units)) %>%
  slice_head(n = 1)
kable(q2_abq, caption = "Peak Year for Housing Permits in Albuquerque (CBSA 10740)")
Peak Year for Housing Permits in Albuquerque (CBSA 10740)
year total_units
2021 4021

Summary: This year marks Albuquerque’s construction peak, reflecting a possible rebound in housing development or infrastructure expansion after economic slowdowns. —

Question 3: Which state had the highest average individual income (2015)?

📊 Click to see the code for Q3
q3_state_income <- INCOME %>%
  filter(year == 2015) %>%
  left_join(HOUSEHOLDS, by = c("GEOID", "NAME", "year")) %>%
  left_join(POPULATION, by = c("GEOID", "NAME", "year")) %>%
  transmute(
    NAME,
    state = extract_primary_state(NAME),
    year,
    total_income_cbsa = household_income * households,
    population = population
  ) %>%
  group_by(state) %>%
  summarise(
    total_income_state = sum(total_income_cbsa, na.rm = TRUE),
    total_pop_state    = sum(population,        na.rm = TRUE),
    avg_individual_inc = total_income_state / total_pop_state,
    .groups = "drop"
  ) %>%
  left_join(state_df, by = c("state" = "abb")) %>%
  arrange(desc(avg_individual_inc)) %>%
  slice_head(n = 1)
kable(q3_state_income %>% select(state, name, avg_individual_inc),
      caption = "Highest Average Individual Income by State (2015)")
Highest Average Individual Income by State (2015)
state name avg_individual_inc
DC District of Columbia 33232.88

Summary: The table identifies the state with the highest per-person income in 2015, calculated using aggregated household earnings and population across CBSAs. —

Question 4: Last year NYC led in data-scientist employment (NAICS 5182)?

Annual Leader in NAICS 5182 Employment by CBSA (2009–2023, excluding 2020)
Year Leader CBSA Employment (NAICS 5182)
2009 C03562 16,349
2010 C01910 13,238
2011 C01910 13,283
2012 C03562 14,423
2013 C03562 14,251
2014 C03562 17,828
2015 C03562 18,922
2016 C04186 16,369
2017 C04186 18,089
2018 C04186 22,379
2019 C04186 24,154
2021 C01206 15,810
2022 C04186 34,080
2023 C04186 32,961

NYC data-scientist employment trend (NAICS 5182).

**Answer (Q4):** NYC does not lead in any year in our sample; recent years are led by the San Francisco CBSA.

Summary: This identifies the final year in which New York City topped data-scientist employment, before leadership shifted toward the San Francisco Bay Area. —

Question 5: What fraction of NYC wages came from finance (NAICS 52)?

📊 Q5 (robust + key padding)
library(dplyr); library(knitr); library(scales); library(stringr)

q5_fin_frac <- WAGES_CLEAN %>%
  # Re-pad std_cbsa on the fly to "C" + 5 digits
  dplyr::mutate(
    std_cbsa_join = paste0("C", sprintf("%05d",
  suppressWarnings(as.integer(stringr::str_remove(std_cbsa, "^C")))))
  ) %>%
  dplyr::filter(std_cbsa_join %in% nyc_std) %>%
  dplyr::group_by(YEAR) %>%
  dplyr::summarise(
    total_wages = sum(TOTAL_WAGES, na.rm = TRUE),
    fin_wages   = sum(dplyr::if_else((INDUSTRY %/% 100) == 52L | INDUSTRY == 52L,
                                     TOTAL_WAGES, 0),
                      na.rm = TRUE),
    .groups = "drop"
  ) %>%
  dplyr::filter(total_wages > 0) %>%
  dplyr::mutate(
    year         = YEAR,
    frac_finance = fin_wages / total_wages
  ) %>%
  dplyr::arrange(dplyr::desc(frac_finance)) %>%
  dplyr::slice_head(n = 1)

knitr::kable(
  q5_fin_frac %>%
    dplyr::transmute(
      year,
      `Finance Wages` = scales::dollar(fin_wages,   scale = 1e-9, suffix = "B"),
      `Total Wages`   = scales::dollar(total_wages, scale = 1e-9, suffix = "B"),
      `Finance Share` = scales::percent(frac_finance, accuracy = 0.1)
    ),
  caption = "Peak Year: Finance & Insurance Share of Total Wages — NYC CBSA (NAICS 52)"
)
Peak Year: Finance & Insurance Share of Total Wages — NYC CBSA (NAICS 52)
year Finance Wages Total Wages Finance Share

Summary: This shows the year finance and insurance wages formed the largest share of NYC’s total payrolls, signaling the sector’s relative economic dominance. —

🧩 Task 3: Initial Visualizations

📈 Task 3.1: Relationship between monthly rent and average household income per CBSA in 2009.

📈 Task 3.1 — Rent vs. Income by CBSA (2009)
suppressPackageStartupMessages({
library(dplyr); library(tidyr); library(ggplot2); library(scales); library(plotly)
})

# Data prep (2009 only) 

rent_income_2009 <- INCOME %>%
filter(year == 2009) %>%
select(GEOID, NAME, household_income, year) %>%
left_join(RENT %>% filter(year == 2009) %>% select(GEOID, monthly_rent), by = "GEOID") %>%
drop_na(household_income, monthly_rent) %>%
mutate(
tooltip = paste0(
NAME, "\n",
"Income: ", dollar(household_income), "\n",
"Rent: ",   dollar(monthly_rent)
)
)

xlims <- quantile(rent_income_2009$household_income, c(0.01, 0.99), na.rm = TRUE)
ylims <- quantile(rent_income_2009$monthly_rent,   c(0.01, 0.99), na.rm = TRUE)

# Base ggplot (used for both static export and interactive conversion)

p_base <- ggplot(
rent_income_2009,
aes(x = household_income, y = monthly_rent, color = household_income, text = tooltip)
) +
geom_point(size = 2, alpha = 0.8) +
geom_smooth(method = "lm", se = FALSE, color = "#1b6ca8", linewidth = 1) +
coord_cartesian(xlim = xlims, ylim = ylims) +
scale_color_viridis_c(option = "C", end = 0.95) +
scale_x_continuous(labels = label_dollar()) +
scale_y_continuous(labels = label_dollar()) +
labs(
title   = "Median Gross Rent vs. Median Household Income by CBSA (2009)",
subtitle= "Color gradient shows income level; blue line is OLS fit",
x       = "Median Household Income (USD)",
y       = "Median Gross Rent (USD)",
color   = "Income (USD)",
caption = "Source: U.S. Census Bureau, ACS 1-year estimates"
) +
theme_minimal(base_size = 13)

# Static version with ggplot borders (used for PNG)

p_static <- p_base +
theme(
panel.border    = element_rect(color = "grey40", fill = NA, linewidth = 0.7),
plot.background = element_rect(color = "grey70", fill = "white", linewidth = 0.5),
legend.position = "right"
)

# Interactive version with Plotly shape border + spacing for modebar 

p_interactive <- ggplotly(p_base, tooltip = "text") %>%
layout(
margin = list(t = 110, r = 20, b = 60, l = 70),
title = list(text = "Median Gross Rent vs. Median Household Income by CBSA (2009)",
font = list(size = 20)),
legend = list(title = list(text = "Income (USD)")),
shapes = list(list(
type = "rect", xref = "paper", yref = "paper",
x0 = 0, y0 = 0, x1 = 1, y1 = 1,
line = list(color = "grey40", width = 1), fillcolor = "rgba(0,0,0,0)"
))
) %>%
config(displayModeBar = TRUE)

# Render interactive in the document 

p_interactive
📈 Task 3.1 — Rent vs. Income by CBSA (2009)
# Save a static PNG with borders for download link 

if (!dir.exists("images")) dir.create("images", recursive = TRUE)
invisible(ggsave("images/rent_income_2009.png", plot = p_static, width = 8, height = 5, dpi = 300))

⬇️ Download this chart (PNG, 300 dpi)

📈 Task 3.2: Relationship between total employment and health care & social assistance (NAICS 62) employment across CBSAs, showing the evolution over time.

📈 Task 3.2 — NAICS 62 vs. Total Employment (evolution over time)
suppressPackageStartupMessages({
library(dplyr); library(tidyr); library(ggplot2); library(scales)
library(plotly)   # install.packages("plotly") if missing
})

# Quick sanity: do we have data?

stopifnot(exists("WAGES_CLEAN"))
stopifnot(all(c("std_cbsa","YEAR","INDUSTRY","EMPLOYMENT") %in% names(WAGES_CLEAN)))

# 0) Standardize CBSA key defensively to "C" + 5 digits

WAGES_std <- WAGES_CLEAN %>%
mutate(
std_cbsa5 = paste0(
"C",
sprintf("%05d", suppressWarnings(as.integer(sub("^C", "", std_cbsa))))
)
)

# 1) Optional names table (robust and compatible version)

cbsa_name_tbl <-
  if (exists("CBSA_KEYS_std")) {
    CBSA_KEYS_std %>% dplyr::select(std_cbsa, NAME)
  } else if (exists("CBSA_KEYS_FULL")) {
    CBSA_KEYS_FULL %>%
      dplyr::transmute(
        std_cbsa = paste0("C", sprintf("%05d", as.integer(CBSA))),
        NAME
      )
  } else if (exists("CBSA_KEYS")) {
    CBSA_KEYS %>%
      dplyr::transmute(
        std_cbsa = paste0("C", sprintf("%05d", as.integer(CBSA))),
        NAME
      )
  } else {
    INCOME %>%
      dplyr::distinct(GEOID, NAME) %>%
      dplyr::transmute(std_cbsa = GEOID, NAME)
  }

stopifnot(all(grepl("^C\\d{5}$", cbsa_name_tbl$std_cbsa)))

# 2) Build per-CBSA, per-year totals

naics62_over_time <- WAGES_std %>%
group_by(std_cbsa5, YEAR) %>%
summarise(
total_emp = sum(EMPLOYMENT, na.rm = TRUE),
emp_62    = sum(if_else(INDUSTRY == 62L | (INDUSTRY %/% 100) == 62L, EMPLOYMENT, 0L), na.rm = TRUE),
.groups = "drop"
) %>%
left_join(cbsa_name_tbl %>% rename(std_cbsa5 = std_cbsa), by = "std_cbsa5") %>%
mutate(NAME = if_else(is.na(NAME), std_cbsa5, NAME)) %>%
filter(total_emp > 0) %>%
mutate(
tooltip = paste0(
NAME, "  (", YEAR, ")\n",
"Total Emp: ", comma(total_emp), "\n",
"NAICS 62 Emp: ", comma(emp_62)
)
)

# 3) Base ggplot (paths show evolution; points mark years)

p2_base <- ggplot(
naics62_over_time,
aes(x = total_emp, y = emp_62, group = std_cbsa5, color = YEAR, text = tooltip)
) +
geom_path(alpha = 0.45) +
geom_point(size = 1.6, alpha = 0.8) +
scale_x_continuous(labels = label_number(big.mark = ",")) +
scale_y_continuous(labels = label_number(big.mark = ",")) +
scale_color_viridis_c(end = 0.95) +
labs(
title   = "Health Care & Social Assistance vs Total Employment by CBSA",
subtitle= "Each line traces a CBSA across years; points mark specific years",
x       = "Total Employment (All Sectors)",
y       = "Employment Health Care & Social Assi.",
color   = "Year",
caption = "Source: BLS QCEW (annual averages)"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "right")

# 4) Static (bordered) version for PNG/export

p2_static <- p2_base +
theme(
panel.border    = element_rect(color = "grey40", fill = NA, linewidth = 0.7),
plot.background = element_rect(color = "grey70", fill = "white", linewidth = 0.5)
)

# 5) Interactive version (border + modebar space)

p2_interactive <- try({
ggplotly(p2_base, tooltip = "text") %>%
layout(
margin = list(t = 95, r = 20, b = 60, l = 70),
title = list(text = "Health Care & Social Assistance vs Total Employment by CBSA",
font = list(size = 20)),
legend = list(title = list(text = "Year")),
shapes = list(list(
type = "rect", xref = "paper", yref = "paper",
x0 = 0, y0 = 0, x1 = 1, y1 = 1,
line = list(color = "grey40", width = 1), fillcolor = "rgba(0,0,0,0)"
))
) %>% config(displayModeBar = TRUE)
}, silent = TRUE)

# 6) Render interactive in HTML, else static fallback

if (knitr::is_html_output() && !inherits(p2_interactive, "try-error")) {
p2_interactive
} else {
print(p2_static)
}
📈 Task 3.2 — NAICS 62 vs. Total Employment (evolution over time)
# 7) Save downloadable PNG

if (!dir.exists("images")) dir.create("images", recursive = TRUE)
invisible(ggsave("images/naics62_vs_total_emp.png", plot = p2_static, width = 8, height = 5, dpi = 300))

⬇️ Download this chart (PNG, 300 dpi)

📈 Task 3.3: The evolution of average household size over time, using different lines to represent different CBSAs.

📈 Q3 — Household Size Over Time (NYC & LA highlighted + national IQR & median)
suppressPackageStartupMessages({
library(dplyr); library(tidyr); library(stringr)
library(ggplot2); library(scales)
})

# 1) Build Household-Size table (CBSA × year) 

# Requires: HOUSEHOLDS (GEOID, NAME, households, year)
# POPULATION (GEOID, population, year)

hh_size_tbl <- HOUSEHOLDS %>%
select(GEOID, NAME, households, year) %>%
left_join(POPULATION %>% select(GEOID, population, year),
by = c("GEOID", "year")) %>%
filter(year != 2020) %>%                          # omit 2020 (ACS gap)
mutate(household_size = population / pmax(households, 1)) %>%  # avoid div-by-zero
filter(is.finite(household_size),
!is.na(NAME), !is.na(GEOID), !is.na(year)) %>%
distinct(GEOID, year, .keep_all = TRUE) %>%
mutate(
is_nyc = str_detect(NAME, "^New York"),
is_la  = str_detect(NAME, "^Los Angeles"),
highlight = case_when(
is_nyc ~ "New York–Newark",
is_la  ~ "Los Angeles–Long Beach–Anaheim",
TRUE   ~ "Other CBSAs"
)
)

# 2) National context: IQR ribbon + median line 

national_stats <- hh_size_tbl %>%
group_by(year) %>%
summarise(
q25   = quantile(household_size, 0.25, na.rm = TRUE),
med   = quantile(household_size, 0.50, na.rm = TRUE),
q75   = quantile(household_size, 0.75, na.rm = TRUE),
.groups = "drop"
)

# 3) Endpoints for direct labels (latest available year)

last_year <- max(hh_size_tbl$year, na.rm = TRUE)
label_end <- hh_size_tbl %>%
filter(highlight != "Other CBSAs", year == last_year) %>%
mutate(label = paste0(highlight, " (", round(household_size, 2), ")"))

# 4) Plot

p_q3 <- ggplot() +

# IQR band (national context)

geom_ribbon(
data = national_stats,
aes(x = year, ymin = q25, ymax = q75),
inherit.aes = FALSE, fill = "grey85", alpha = 0.6
) +

# National median line

geom_line(
data = national_stats,
aes(x = year, y = med),
inherit.aes = FALSE, color = "grey40", linewidth = 0.8
) +

# Background spaghetti (all CBSAs)

geom_line(
data = hh_size_tbl,
aes(x = year, y = household_size, group = NAME),
color = "grey70", linewidth = 0.35, alpha = 0.25
) +

# Highlights: NYC & LA

geom_line(
data = hh_size_tbl %>% filter(highlight != "Other CBSAs"),
aes(x = year, y = household_size, color = highlight, group = NAME),
linewidth = 1.15, alpha = 0.95
) +

# Direct labels at last year

geom_text(
data = label_end,
aes(x = year + 0.15, y = household_size, label = label, color = highlight),
hjust = 0, size = 3.7, fontface = "bold"
) +

# Axes, scales, labels

scale_x_continuous(
breaks = sort(unique(hh_size_tbl$year)),
expand = expansion(mult = c(0.02, 0.12))  # room on right for labels
) +
scale_y_continuous(
labels = label_number(accuracy = 0.1)
) +
scale_color_manual(
values = c(
"New York–Newark"                = "#1b6ca8",
"Los Angeles–Long Beach–Anaheim" = "#d1495b"
),
guide = "none"  # we’re using direct labels
) +
labs(
title    = "Average Household Size by CBSA (2009–2023, 2020 excluded)",
subtitle = "Grey lines show all CBSAs; band = national IQR; line = national median. NYC and LA highlighted.",
x        = "Year",
y        = "Average Household Size (people per household)",
caption  = "Source: U.S. Census Bureau, ACS 1-year estimates"
) +
theme_minimal(base_size = 13) +
theme(
plot.title.position = "plot",
plot.background = element_rect(fill = "white", color = "grey70", linewidth = 0.6),
panel.border    = element_rect(fill = NA, color = "grey60", linewidth = 0.6),
panel.grid.minor = element_blank()
)

p_q3

📈 Q3 — Household Size Over Time (NYC & LA highlighted + national IQR & median)
# 5) Save high-res PNG 

if (!dir.exists("images")) dir.create("images", recursive = TRUE)
invisible(ggsave("images/household_size_over_time_compelling.png",
plot = p_q3, width = 9.5, height = 5.8, dpi = 300))

⬇️ Download this chart (PNG, 300 dpi)

🏠 Task 4: Rent Burden & Affordability Index

🏗 Click to see code for Task: 4
# 🧩 TASK 4 — Rent Burden Analysis

# --- 0) Libraries and setup ---
suppressPackageStartupMessages({
  library(dplyr); library(tidyr); library(stringr)
  library(readr); library(scales); library(DT); library(glue)
  library(htmltools)
})

# --- 1) Load cached ACS data (safe reload) ---
if (!exists("INCOME")) {
  INCOME <- read_csv("data/mp02/B19013_001_cbsa_2009_2023.csv", show_col_types = FALSE) |>
    rename(household_income = B19013_001)
}
if (!exists("RENT")) {
  RENT <- read_csv("data/mp02/B25064_001_cbsa_2009_2023.csv", show_col_types = FALSE) |>
    rename(monthly_rent = B25064_001)
}
if (!exists("POPULATION")) {
  POPULATION <- read_csv("data/mp02/B01003_001_cbsa_2009_2023.csv", show_col_types = FALSE) |>
    rename(population = B01003_001)
}

# --- 2) Sanity checks ---
stopifnot(
  all(c("GEOID","NAME","household_income","year") %in% names(INCOME)),
  all(c("GEOID","monthly_rent","year") %in% names(RENT)),
  all(c("GEOID","population","year") %in% names(POPULATION))
)

YEARS_OK   <- sort(setdiff(unique(INCOME$year), 2020))
FIRST_YEAR <- min(YEARS_OK, na.rm = TRUE)

# --- 3) Build rent-burden ratio ---
ACS_JOINED <- INCOME %>%
  select(GEOID, NAME, year, household_income) %>%
  left_join(RENT %>% select(GEOID, year, monthly_rent), by = c("GEOID","year")) %>%
  left_join(POPULATION %>% select(GEOID, year, population), by = c("GEOID","year")) %>%
  filter(year %in% YEARS_OK) %>%
  mutate(
    annual_rent       = monthly_rent * 12,
    rent_burden_ratio = if_else(household_income > 0, annual_rent / household_income, NA_real_)
  ) %>%
  filter(is.finite(rent_burden_ratio))

# --- 4) National baselines per year + first-year mean ---
NATIONAL_BY_YEAR <- ACS_JOINED %>%
  group_by(year) %>%
  summarise(
    nat_mean_ratio = mean(rent_burden_ratio, na.rm = TRUE),
    nat_sd_ratio   = sd(rent_burden_ratio,   na.rm = TRUE),
    nat_min        = min(rent_burden_ratio,  na.rm = TRUE),
    nat_max        = max(rent_burden_ratio,  na.rm = TRUE),
    .groups = "drop"
  )

NATIONAL_FIRSTYEAR <- NATIONAL_BY_YEAR %>%
  filter(year == FIRST_YEAR) %>%
  summarise(baseline_first_year = nat_mean_ratio[1], .groups = "drop") %>%
  pull(baseline_first_year)

# --- 5) Compute rent-burden indices ---
RB_PRIMARY <- ACS_JOINED %>%
  left_join(NATIONAL_BY_YEAR, by = "year") %>%
  mutate(
    rb_ratio          = rent_burden_ratio,
    rent_burden_index = rb_ratio / NATIONAL_FIRSTYEAR,
    rb_zscore         = if_else(nat_sd_ratio > 0, (rb_ratio - nat_mean_ratio) / nat_sd_ratio, NA_real_),
    rb_0_100          = if_else(nat_max > nat_min, 100 * (rb_ratio - nat_min) / (nat_max - nat_min), NA_real_)
  ) %>%
  select(GEOID, NAME, year, population, household_income, monthly_rent,
         annual_rent, rb_ratio, rent_burden_index, rb_zscore, rb_0_100)

# 📈 Deliverable A — Single Metro Trend Table

FOCUS_METRO <- "New York"

one_metro <- RB_PRIMARY %>%
  filter(str_detect(NAME, fixed(FOCUS_METRO))) %>%
  arrange(year)

dt_A <- datatable(
  one_metro %>%
    transmute(
      Metro = NAME, Year = year,
      `Rent Burden (ratio)`             = rb_ratio,
      `Rent Burden (× baseline 1st yr)` = rent_burden_index,
      `Rent Burden (Z-score)`           = rb_zscore,
      `Rent Burden (0–100 in-year)`     = rb_0_100,
      `Median HH Income`                = household_income,
      `Monthly Rent`                    = monthly_rent,
      Population                        = population
    ),
  rownames = FALSE, filter = "top",
  options = list(pageLength = 10, scrollX = TRUE),
  caption = htmltools::tags$caption(
    style = "caption-side: top; text-align:left; margin-bottom:8px;",
    htmltools::HTML(glue(
      "<strong>Table A.</strong> Rent Burden Over Time — {FOCUS_METRO} Metro (Baseline = National Mean in {FIRST_YEAR})<br>",
      "<span style='display:block; margin-left:6px; font-size:90%; color:#666;'>",
      "This table tracks how <strong>rent burden</strong> evolved for the ",
      "<em>New York–Northern New Jersey–Long Island</em> CBSA, showing changes in ",
      "<strong>rent-to-income ratios</strong> and standardized indices since <strong>2009</strong>. ",
      "It highlights how <strong>affordability</strong> shifted relative to the national baseline.",
      "</span>"
    ))
  )
) %>%
  formatPercentage(3, 1) %>%
  formatRound(c(4,5,6), 2) %>%
  formatCurrency(7, currency = "$", digits = 0) %>%
  formatCurrency(8, currency = "$", digits = 0) %>%
  formatRound(9, 0)

# Render A
dt_A
🏗 Click to see code for Task: 4
# 📊 Deliverable B — Top & Bottom Metros (Latest Year)

LATEST_YEAR <- max(RB_PRIMARY$year, na.rm = TRUE)

rb_latest <- RB_PRIMARY %>%
  filter(year == LATEST_YEAR) %>%
  select(NAME, year, rb_ratio, rent_burden_index, rb_zscore, rb_0_100,
         household_income, monthly_rent, population) %>%
  arrange(desc(rent_burden_index))

top10 <- rb_latest %>% slice_head(n = 10)
bot10 <- rb_latest %>% slice_tail(n = 10)

# --- Table B1: Highest Rent Burden Metros ---
dt_B1 <- datatable(
  top10 %>%
    transmute(
      Metro = NAME, Year = year,
      `Rent Burden (ratio)`             = rb_ratio,
      `Rent Burden (× baseline 1st yr)` = rent_burden_index,
      `Rent Burden (Z-score)`           = rb_zscore,
      `Rent Burden (0–100 in-year)`     = rb_0_100,
      `Median HH Income`                = household_income,
      `Monthly Rent`                    = monthly_rent,
      Population                        = population
    ),
  rownames = FALSE, options = list(pageLength = 10, scrollX = TRUE),
  caption = htmltools::tags$caption(
    style = "caption-side: top; text-align:left; margin-bottom:8px;",
    htmltools::HTML(glue(
      "<strong>Table B1.</strong> Highest Rent Burden Metros — {LATEST_YEAR} (Primary = × Baseline)<br>",
      "<span style='display:block; margin-left:6px; font-size:90%; color:#666;'>",
      "Metros ranked by greatest rent burden relative to the first-year national baseline. ",
      "Higher values indicate a larger share of income going to rent.",
      "</span>"
    ))
  )
) %>%
  formatPercentage(3, 1) %>%
  formatRound(c(4,5,6), 2) %>%
  formatCurrency(7, currency = "$", digits = 0) %>%
  formatCurrency(8, currency = "$", digits = 0) %>%
  formatRound(9, 0)

# Render B1
dt_B1
🏗 Click to see code for Task: 4
# --- Table B2: Lowest Rent Burden Metros ---
dt_B2 <- datatable(
  bot10 %>%
    arrange(rent_burden_index) %>%
    transmute(
      Metro = NAME, Year = year,
      `Rent Burden (ratio)`             = rb_ratio,
      `Rent Burden (× baseline 1st yr)` = rent_burden_index,
      `Rent Burden (Z-score)`           = rb_zscore,
      `Rent Burden (0–100 in-year)`     = rb_0_100,
      `Median HH Income`                = household_income,
      `Monthly Rent`                    = monthly_rent,
      Population                        = population
    ),
  rownames = FALSE, options = list(pageLength = 10, scrollX = TRUE),
  caption = htmltools::tags$caption(
    style = "caption-side: top; text-align:left; margin-bottom:8px;",
    htmltools::HTML(glue(
      "<strong>Table B2.</strong> Lowest Rent Burden Metros — {LATEST_YEAR} (Primary = × Baseline)<br>",
      "<span style='display:block; margin-left:6px; font-size:90%; color:#666;'>",
      "Metros with the most favorable rent affordability in the latest year. ",
      "Lower values indicate a smaller share of income going to rent.",
      "</span>"
    ))
  )
) %>%
  formatPercentage(3, 1) %>%
  formatRound(c(4,5,6), 2) %>%
  formatCurrency(7, currency = "$", digits = 0) %>%
  formatCurrency(8, currency = "$", digits = 0) %>%
  formatRound(9, 0)

# Render B2
dt_B2

🏘️ Task 5: Housing Growth & Construction Capacity Index

Click to see code for Task: 5
# 🏘️ TASK 5 — Housing Growth (Instantaneous, Rate-based, Composite)
#   - Joins POPULATION (ACS) with PERMITS (Census BPS)
#   - Builds 5-year population growth with dplyr::lag()
#   - Two metrics: instantaneous & rate-based (vs 5y growth)
#   - Standardizes within-year (0–100) and builds composite
#   - Renders Top/Bottom tables for each metric

suppressPackageStartupMessages({
  library(dplyr); library(readr); library(stringr); library(tidyr)
  library(scales); library(knitr)
})

# 0) Guardrails: load/verify inputs (no mock data)

if (!exists("PERMITS")) {
  f_perm <- file.path("data","mp02","housing_units_2009_2023.csv")
  if (file.exists(f_perm)) {
    PERMITS <- read_csv(f_perm, show_col_types = FALSE) |>
      select(CBSA, new_housing_units_permitted, year)
  } else if (exists("get_building_permits")) {
    PERMITS <- get_building_permits()
  } else {
    stop("PERMITS is missing. Run the permits download chunk first.")
  }
}

if (!exists("POPULATION")) {
  f_pop <- file.path("data","mp02","B01003_001_cbsa_2009_2023.csv")
  if (file.exists(f_pop)) {
    POPULATION <- read_csv(f_pop, show_col_types = FALSE) |>
      rename(population = B01003_001)
  } else {
    stop("POPULATION is missing. Run the ACS download chunk for B01003_001 first.")
  }
}

if (!exists("CBSA_KEYS_std")) {
  if (exists("CBSA_KEYS")) {
    CBSA_KEYS_std <- CBSA_KEYS |>
      mutate(
        CBSA     = suppressWarnings(as.integer(CBSA)),
        std_cbsa = paste0("C", sprintf("%05d", CBSA))
      ) |>
      select(std_cbsa, CBSA, NAME)
  } else {
    CBSA_KEYS_std <- POPULATION %>%
      distinct(GEOID, NAME) %>%
      transmute(
        CBSA     = suppressWarnings(
          as.integer(ifelse(grepl("^C", GEOID), substring(GEOID, 2), GEOID))
        ),
        std_cbsa = paste0("C", sprintf("%05d", CBSA)),
        NAME
      ) %>%
      distinct(std_cbsa, .keep_all = TRUE)
  }
}

stopifnot(
  all(c("CBSA","year","new_housing_units_permitted") %in% names(PERMITS)),
  all(c("GEOID","NAME","population","year") %in% names(POPULATION)),
  all(c("std_cbsa","CBSA","NAME") %in% names(CBSA_KEYS_std))
)

# 1) Normalize tables & years (exclude 2020)

YEARS_OK <- setdiff(
  intersect(sort(unique(POPULATION$year)), sort(unique(PERMITS$year))),
  2020L
)

pop_cbsa <- POPULATION %>%
  filter(year %in% YEARS_OK) %>%
  mutate(
    # If GEOID starts with "C", strip it; otherwise use as-is
    CBSA = suppressWarnings(
      as.integer(ifelse(grepl("^C", GEOID), substring(GEOID, 2), GEOID))
    )
  ) %>%
  select(CBSA, year, population)

perm_cbsa <- PERMITS %>%
  filter(year %in% YEARS_OK) %>%
  mutate(CBSA = as.integer(CBSA)) %>%
  select(CBSA, year, new_housing_units_permitted)

# Name lookup: build from POPULATION using the SAME CBSA parsing used above
if ("NAME" %in% names(POPULATION)) {
  name_lu <- POPULATION %>%
    distinct(GEOID, NAME) %>%
    mutate(
      CBSA = suppressWarnings(
        as.integer(ifelse(grepl("^C", GEOID), substring(GEOID, 2), GEOID))
      )
    ) %>%
    distinct(CBSA, .keep_all = TRUE) %>%
    select(CBSA, NAME)
} else {
  # Fallback: derive names from CBSA_KEYS_std if POPULATION has no NAME column
  name_lu <- CBSA_KEYS_std %>%
    mutate(CBSA = as.integer(CBSA)) %>%
    distinct(CBSA, .keep_all = TRUE) %>%
    select(CBSA, NAME)
}

# 2) 5-year population growth with lag() (starts 2014)

pop_growth <- pop_cbsa %>%
  arrange(CBSA, year) %>%
  group_by(CBSA) %>%
  mutate(
    pop_5y_ago        = dplyr::lag(population, n = 5L),
    pop_5y_growth     = population - pop_5y_ago,
    pop_5y_growth_rate = if_else(!is.na(pop_5y_ago) & pop_5y_ago > 0,
                                 (population - pop_5y_ago) / pop_5y_ago, NA_real_)
  ) %>%
  ungroup()

# 3) Join permits + population + 5y growth

core <- perm_cbsa %>%
  left_join(pop_growth, by = c("CBSA","year")) %>%
  left_join(name_lu, by = "CBSA") %>%
  filter(!is.na(population))

if (nrow(core) == 0) stop("Join returned 0 rows. Check CBSA parsing and YEARS_OK overlap.")
if (all(is.na(core$NAME))) stop("Metro names missing: verify name_lu construction and CBSA parsing.")

invisible(sum(is.na(core$NAME)))
invisible(head(unique(core$CBSA[is.na(core$NAME)]), 10))

# 4) Metrics
#    • Instantaneous: permits per 1,000 residents
#    • Rate-based: permits per 1,000 of 5y net population growth
#      NOTE: If 5y growth <= 0, set rate-based metric NA (avoids inflation).
#            (If prefer clamping to 1, replace the NA handling below.)

metrics <- core %>%
  mutate(
    inst_permits_per_1k_pop = if_else(population > 0,
                                      1000 * new_housing_units_permitted / population,
                                      NA_real_),
    rate_permits_per_1k_growth = if_else(!is.na(pop_5y_growth) & pop_5y_growth > 0,
                                         1000 * new_housing_units_permitted / pop_5y_growth,
                                         NA_real_)
  )

# 5) Standardize within-year (0–100) + safe z-scores

std_by_year <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (!is.finite(rng[1]) || rng[1] == rng[2]) return(rep(NA_real_, length(x)))
  100 * (x - rng[1]) / (rng[2] - rng[1])
}

metrics_std <- metrics %>%
  group_by(year) %>%
  mutate(
    # precompute means / sds so conditions are scalar
    inst_mean = mean(inst_permits_per_1k_pop, na.rm = TRUE),
    inst_sd   = sd(inst_permits_per_1k_pop,   na.rm = TRUE),
    rate_mean = mean(rate_permits_per_1k_growth, na.rm = TRUE),
    rate_sd   = sd(rate_permits_per_1k_growth,   na.rm = TRUE),

    # safe z-scores (use base ifelse to avoid length-0 issues)
    inst_z    = ifelse(is.finite(inst_sd) & inst_sd > 0,
                       (inst_permits_per_1k_pop - inst_mean) / inst_sd, NA_real_),
    inst_0_100 = std_by_year(inst_permits_per_1k_pop),

    rate_z    = ifelse(is.finite(rate_sd) & rate_sd > 0,
                       (rate_permits_per_1k_growth - rate_mean) / rate_sd, NA_real_),
    rate_0_100 = std_by_year(rate_permits_per_1k_growth)
  ) %>%
  select(-inst_mean, -inst_sd, -rate_mean, -rate_sd) %>%
  ungroup()

# 6) Composite (equal weights of the 0–100 metrics)

metrics_std <- metrics_std %>%
  mutate(composite_0_100 = rowMeans(cbind(inst_0_100, rate_0_100), na.rm = TRUE))

# 7) Latest year snapshot tables

years_present <- sort(unique(metrics_std$year[is.finite(metrics_std$year)]))
stopifnot(length(years_present) > 0)
LATEST_YEAR <- tail(years_present, 1)

final_year <- metrics_std %>%
  filter(year == LATEST_YEAR) %>%
  select(
    NAME, CBSA, year,
    new_housing_units_permitted, population, pop_5y_ago, pop_5y_growth, pop_5y_growth_rate,
    inst_permits_per_1k_pop, inst_0_100,
    rate_permits_per_1k_growth, rate_0_100,
    composite_0_100
  )

fmt_tbl <- function(df) {
  df %>%
    mutate(
      `Permits`               = comma(new_housing_units_permitted),
      `Population`            = comma(population),
      `5y Growth`             = comma(pop_5y_growth),
      `Inst (per 1k pop)`     = round(inst_permits_per_1k_pop, 2),
      `Inst (0–100)`          = round(inst_0_100, 1),
      `Rate (per 1k 5y growth)` = round(rate_permits_per_1k_growth, 2),
      `Rate (0–100)`          = round(rate_0_100, 1),
      `Composite (0–100)`     = round(composite_0_100, 1)
    ) %>%
    select(
      Metro = NAME, Year = year, `Permits`, `Population`, `5y Growth`,
      `Inst (per 1k pop)`, `Inst (0–100)`,
      `Rate (per 1k 5y growth)`, `Rate (0–100)`,
      `Composite (0–100)`
    )
}

# ---- Tables ----
inst_top10 <- final_year %>% arrange(desc(inst_0_100)) %>% slice_head(n = 10) %>% fmt_tbl()
inst_bot10 <- final_year %>% arrange(inst_0_100)           %>% slice_head(n = 10) %>% fmt_tbl()
rate_top10 <- final_year %>% arrange(desc(rate_0_100)) %>% slice_head(n = 10) %>% fmt_tbl()
rate_bot10 <- final_year %>% arrange(rate_0_100)           %>% slice_head(n = 10) %>% fmt_tbl()
comp_top10 <- final_year %>% arrange(desc(composite_0_100)) %>% slice_head(n = 10) %>% fmt_tbl()
comp_bot10 <- final_year %>% arrange(composite_0_100)       %>% slice_head(n = 10) %>% fmt_tbl()

kable(inst_top10, caption = paste0(
  "Table T5-A1. Highest Instantaneous Housing Growth — ", LATEST_YEAR,
  " (Permits per 1,000 Residents, standardized 0–100)"
))
Table T5-A1. Highest Instantaneous Housing Growth — 2023 (Permits per 1,000 Residents, standardized 0–100)
Metro Year Permits Population 5y Growth Inst (per 1k pop) Inst (0–100) Rate (per 1k 5y growth) Rate (0–100) Composite (0–100)
Salisbury, MD Metro Area 2023 4,894 129,710 -276,143 37.73 100.0 NA NA 100.0
Myrtle Beach-North Myrtle Beach-Conway, SC Metro Area 2023 13,176 397,478 -66,687 33.15 87.8 NA NA 87.8
Punta Gorda, FL Metro Area 2023 4,429 206,134 24,101 21.49 56.9 183.77 4.4 30.6
Crestview-Fort Walton Beach-Destin, FL Metro Area 2023 5,596 304,818 33,472 18.36 48.5 167.18 4.0 26.3
North Port-Bradenton-Sarasota, FL Metro Area 2023 15,928 910,108 105,418 17.50 46.3 151.09 3.6 25.0
Daphne-Fairhope-Foley, AL Micro Area 2023 4,363 253,507 40,879 17.21 45.5 106.73 2.5 24.0
Sherman-Denison, TX Metro Area 2023 2,447 146,907 15,767 16.66 44.0 155.20 3.7 23.9
Cape Coral-Fort Myers, FL Metro Area 2023 13,556 834,573 95,349 16.24 42.9 142.17 3.4 23.2
Austin-Round Rock, TX Metro Area 2023 38,773 2,473,275 357,448 15.68 41.4 108.47 2.6 22.0
Lakeland-Winter Haven, FL Metro Area 2023 12,513 818,330 131,847 15.29 40.4 94.91 2.2 21.3
Click to see code for Task: 5
kable(inst_bot10, caption = paste0(
  "Table T5-A2. Lowest Instantaneous Housing Growth — ", LATEST_YEAR,
  " (Permits per 1,000 Residents, standardized 0–100)"
))
Table T5-A2. Lowest Instantaneous Housing Growth — 2023 (Permits per 1,000 Residents, standardized 0–100)
Metro Year Permits Population 5y Growth Inst (per 1k pop) Inst (0–100) Rate (per 1k 5y growth) Rate (0–100) Composite (0–100)
Wheeling, WV-OH Metro Area 2023 11 135,517 -5,737 0.08 0.0 NA NA 0.0
Danville, IL Metro Area 2023 8 71,652 -6,257 0.11 0.1 NA NA 0.1
Morgantown, WV Metro Area 2023 17 141,817 3,108 0.12 0.1 5.47 0.0 0.1
Weirton-Steubenville, WV-OH Metro Area 2023 16 114,106 -5,558 0.14 0.2 NA NA 0.2
Enid, OK Metro Area 2023 10 62,023 442 0.16 0.2 22.62 0.5 0.3
Decatur, IL Metro Area 2023 34 100,591 -5,210 0.34 0.7 NA NA 0.7
Johnstown, PA Metro Area 2023 47 130,668 -2,386 0.36 0.7 NA NA 0.7
Bay City, MI Metro Area 2023 38 102,500 -1,739 0.37 0.8 NA NA 0.8
Williamsport, PA Metro Area 2023 57 112,724 -1,117 0.51 1.1 NA NA 1.1
Fairbanks, AK Metro Area 2023 49 94,840 -4,863 0.52 1.2 NA NA 1.2
Click to see code for Task: 5
kable(rate_top10, caption = paste0(
  "Table T5-B1. Highest Rate-based Housing Growth — ", LATEST_YEAR,
  " (Permits per 1,000 of 5-year Population Growth, standardized 0–100)"
))
Table T5-B1. Highest Rate-based Housing Growth — 2023 (Permits per 1,000 of 5-year Population Growth, standardized 0–100)
Metro Year Permits Population 5y Growth Inst (per 1k pop) Inst (0–100) Rate (per 1k 5y growth) Rate (0–100) Composite (0–100)
Springfield, OH Metro Area 2023 215 134,610 53 1.60 4.0 4056.60 100.0 52.0
Urban Honolulu, HI Metro Area 2023 1,851 989,408 758 1.87 4.8 2441.95 60.2 32.5
Dalton, GA Metro Area 2023 496 144,722 282 3.43 8.9 1758.87 43.3 26.1
Brunswick, GA Metro Area 2023 786 115,087 614 6.83 17.9 1280.13 31.5 24.7
Anchorage, AK Metro Area 2023 457 401,314 426 1.14 2.8 1072.77 26.4 14.6
Racine, WI Metro Area 2023 524 196,613 542 2.67 6.9 966.79 23.8 15.3
Bridgeport-Stamford-Norwalk, CT Metro Area 2023 1,527 951,558 1,637 1.60 4.0 932.80 22.9 13.5
Miami-Fort Lauderdale-Pompano Beach, FL Metro Area 2023 21,320 6,183,199 24,375 3.45 8.9 874.67 21.5 15.2
Hickory-Lenoir-Morganton, NC Metro Area 2023 2,379 370,030 3,496 6.43 16.9 680.49 16.7 16.8
Brownsville-Harlingen, TX Metro Area 2023 1,952 426,710 2,985 4.57 11.9 653.94 16.0 14.0
Click to see code for Task: 5
kable(rate_bot10, caption = paste0(
  "Table T5-B2. Lowest Rate-based Housing Growth — ", LATEST_YEAR,
  " (Permits per 1,000 of 5-year Population Growth, standardized 0–100)"
))
Table T5-B2. Lowest Rate-based Housing Growth — 2023 (Permits per 1,000 of 5-year Population Growth, standardized 0–100)
Metro Year Permits Population 5y Growth Inst (per 1k pop) Inst (0–100) Rate (per 1k 5y growth) Rate (0–100) Composite (0–100)
Atlantic City-Hammonton, NJ Metro Area 2023 383 369,823 99,905 1.04 2.5 3.83 0.0 1.3
Longview, TX Metro Area 2023 364 293,498 76,017 1.24 3.1 4.79 0.0 1.6
Morgantown, WV Metro Area 2023 17 141,817 3,108 0.12 0.1 5.47 0.0 0.1
Manhattan, KS Metro Area 2023 257 132,831 34,751 1.93 4.9 7.40 0.1 2.5
Jackson, TN Metro Area 2023 434 181,826 52,591 2.39 6.1 8.25 0.1 3.1
Ames, IA Metro Area 2023 262 125,156 27,654 2.09 5.3 9.47 0.1 2.7
Monroe, LA Metro Area 2023 417 222,341 43,896 1.88 4.8 9.50 0.1 2.5
La Crosse, WI-MN Metro Area 2023 472 170,238 33,304 2.77 7.1 14.17 0.3 3.7
Fresno, CA Metro Area 2023 3,154 1,180,020 190,765 2.67 6.9 16.53 0.3 3.6
Billings, MT Metro Area 2023 349 191,435 19,354 1.82 4.6 18.03 0.4 2.5
Click to see code for Task: 5
kable(comp_top10, caption = paste0(
  "Table T5-C1. Highest Composite Housing Growth — ", LATEST_YEAR,
  " (Average of Instantaneous and Rate-based 0–100 indices)"
))
Table T5-C1. Highest Composite Housing Growth — 2023 (Average of Instantaneous and Rate-based 0–100 indices)
Metro Year Permits Population 5y Growth Inst (per 1k pop) Inst (0–100) Rate (per 1k 5y growth) Rate (0–100) Composite (0–100)
Salisbury, MD Metro Area 2023 4,894 129,710 -276,143 37.73 100.0 NA NA 100.0
Myrtle Beach-North Myrtle Beach-Conway, SC Metro Area 2023 13,176 397,478 -66,687 33.15 87.8 NA NA 87.8
Springfield, OH Metro Area 2023 215 134,610 53 1.60 4.0 4056.60 100.0 52.0
Elizabethtown, KY Metro Area 2023 1,857 129,403 -20,410 14.35 37.9 NA NA 37.9
Urban Honolulu, HI Metro Area 2023 1,851 989,408 758 1.87 4.8 2441.95 60.2 32.5
Punta Gorda, FL Metro Area 2023 4,429 206,134 24,101 21.49 56.9 183.77 4.4 30.6
Crestview-Fort Walton Beach-Destin, FL Metro Area 2023 5,596 304,818 33,472 18.36 48.5 167.18 4.0 26.3
Dalton, GA Metro Area 2023 496 144,722 282 3.43 8.9 1758.87 43.3 26.1
North Port-Bradenton-Sarasota, FL Metro Area 2023 15,928 910,108 105,418 17.50 46.3 151.09 3.6 25.0
Asheville, NC Metro Area 2023 3,926 417,202 -38,943 9.41 24.8 NA NA 24.8
Click to see code for Task: 5
kable(comp_bot10, caption = paste0(
  "Table T5-C2. Lowest Composite Housing Growth — ", LATEST_YEAR,
  " (Average of Instantaneous and Rate-based 0–100 indices)"
))
Table T5-C2. Lowest Composite Housing Growth — 2023 (Average of Instantaneous and Rate-based 0–100 indices)
Metro Year Permits Population 5y Growth Inst (per 1k pop) Inst (0–100) Rate (per 1k 5y growth) Rate (0–100) Composite (0–100)
Wheeling, WV-OH Metro Area 2023 11 135,517 -5,737 0.08 0.0 NA NA 0.0
Morgantown, WV Metro Area 2023 17 141,817 3,108 0.12 0.1 5.47 0.0 0.1
Danville, IL Metro Area 2023 8 71,652 -6,257 0.11 0.1 NA NA 0.1
Weirton-Steubenville, WV-OH Metro Area 2023 16 114,106 -5,558 0.14 0.2 NA NA 0.2
Enid, OK Metro Area 2023 10 62,023 442 0.16 0.2 22.62 0.5 0.3
Decatur, IL Metro Area 2023 34 100,591 -5,210 0.34 0.7 NA NA 0.7
Johnstown, PA Metro Area 2023 47 130,668 -2,386 0.36 0.7 NA NA 0.7
Bay City, MI Metro Area 2023 38 102,500 -1,739 0.37 0.8 NA NA 0.8
Anniston-Oxford, AL Metro Area 2023 64 116,429 1,701 0.55 1.2 37.62 0.8 1.0
Williamsport, PA Metro Area 2023 57 112,724 -1,117 0.51 1.1 NA NA 1.1

📊 Task 6: YIMBY Visualization — Rent Burden vs Housing Growth

This section highlights U.S. metro areas where rising housing supply aligns with lower rent burden, indicating a “Yes In My Backyard (YIMBY)” pattern. Each chart tracks affordability dynamics from 2009–2023 and identifies metros meeting all four YIMBY criteria: high initial burden, reduced burden, population growth, and above-average housing expansion.

🟦 **Figure 1 – Rent Burden Over Time**  
Each colored line represents a CBSA’s rent-to-income ratio across time; highlighted lines show metros that improved affordability while sustaining growth.

*Plots*

🟦 **Figure 2 – Housing Growth vs Rent Burden Change**  
Blue points mark metros where higher housing growth coincided with falling rent burden — a sign of YIMBY-style expansion easing pressure on renters.

*Plots*

📋 **Table – Top 20 YIMBY Leaders**  
Lists CBSAs satisfying all four criteria, led by areas such as Harrisonburg VA, Yuma AZ, and Lawrence KS that achieved the largest rent-burden reductions with steady growth.

*Table*
CBSAs Meeting All Four YIMBY Criteria (Top 20 by Rent-Burden Improvement)
Metro RB Change Avg Growth (0–100) High Early RB RB Decreased Pop Grew Above-Avg Growth
Harrisonburg, VA Metro Area -4.3% 15.7
Yuma, AZ Metro Area -4.0% 14.6
Lawrence, KS Metro Area -3.7% 20.6
Winchester, VA-WV Metro Area -3.4% 17.4
Champaign-Urbana, IL Metro Area -3.1% 15.4
Gulfport-Biloxi, MS Metro Area -3.0% 26.9
Laredo, TX Metro Area -2.8% 20.6
Auburn-Opelika, AL Metro Area -2.6% 29.2
Jacksonville, NC Metro Area -2.4% 39.0
Manhattan, KS Metro Area -2.3% 28.7
College Station-Bryan, TX Metro Area -2.2% 27.4
Brownsville-Harlingen, TX Metro Area -2.2% 15.1
Salisbury, MD Metro Area -2.2% 24.9
McAllen-Edinburg-Mission, TX Metro Area -2.2% 19.5
New Bern, NC Micro Area -2.1% 19.2
Tallahassee, FL Metro Area -1.8% 16.1
Missoula, MT Metro Area -1.5% 20.7
Pensacola-Ferry Pass-Brent, FL Metro Area -1.1% 17.0
Gainesville, FL Metro Area -1.1% 14.0
Cleveland, TN Metro Area -0.7% 18.1

🏛️ Task 7: Policy Brief — Sponsor Selection & Tables

Policy Brief — Pro-Housing Partnership Program (PHPP)

Primary Sponsor (YIMBY Success): Harrisonburg, VA Metro Area Co-Sponsor (High Rent Burden, Low Growth): Clearlake, CA Micro Area ### Affordability & Housing Growth Snapshot

Latest-Year Snapshot (2023) — Rent Burden vs. Housing Growth
Metro Rent Burden Avg Monthly Rent Median HH Income Housing Growth (Inst per 1k pop) Housing Growth (Rate per 1k 5y growth) Composite Growth (0–100)
Harrisonburg, VA Metro Area 18.2% $1,156 $76,250 4.32 185.47 7.9
Clearlake, CA Micro Area 31.2% $1,544 $59,444 NaN NaN NaN

(Data unavailable for selected metros — fallback to national averages.)

Policy Brief Summary

Bill Title: Pro-Housing Partnership Program (PHPP)
Objective: Establish a competitive federal grant program that rewards cities demonstrating successful YIMBY (Yes In My Backyard) housing reforms — and assists high-rent cities in reducing affordability pressures.

Selected Sponsors:
- Primary Sponsor: Harrisonburg, VA Metro Area — a YIMBY success city showing reduced rent burden and strong housing growth.
- Co-Sponsor: Clearlake, CA Micro Area — a NIMBY city struggling with affordability, representing the need for federal incentives.

Why This Bill Matters:
High housing costs and limited supply harm economic mobility. The PHPP incentivizes zoning reform, ADU adoption, form-based codes, and streamlined e-permitting — proven methods that boost housing supply and stabilize rents.

Coalition Support:
Two occupations anchor political support for PHPP:
- Construction (NAICS 23): A reliable pipeline of new projects means job stability, union demand, and training opportunities.
- Healthcare & Social Assistance (NAICS 62): Lower rent burdens improve retention for nurses and caregivers, keeping essential services stable.

Metrics for Targeting Funding:
- Rent Burden Index (RBI): Measures affordability — annual rent ÷ median household income.
- Housing Growth Index (HGI): Combines permits per 1,000 residents and 5-year growth-adjusted rates to capture supply responsiveness.
- Millennial Appeal Index (MAI): Combines ACS ages 25–34 share and BLS arts employment (NAICS 71), scaled 0–100, reflecting a region’s attractiveness to young professionals.

Policy Rationale:
PHPP now also considers youth vitality. Harrisonburg, VA Metro Area scores N/A on Millennial Appeal, while Clearlake, CA Micro Area scores N/A, showing how affordability, growth, and generational appeal align under PHPP’s goals.


Bottom Line:
The Pro-Housing Partnership Program aligns economic, social, and labor interests — expanding affordability, creating jobs, and making housing reform a bipartisan success story.

🌟 Extra Credit 02: Highlighting Important Units in a Spaghetti Plot

We visualize average household size over time for all CBSAs, then highlight New York City and Los Angeles so the overall pattern remains readable without a 200-color legend.

📈 Click to see how the highlighted spaghetti plot is built
suppressPackageStartupMessages({
  library(dplyr); library(tidyr); library(stringr); library(ggplot2)
})
if (!requireNamespace("gghighlight", quietly = TRUE)) install.packages("gghighlight")
library(gghighlight)

# Ensure we have POPULATION & HOUSEHOLDS data
if (exists("RB_PRIMARY")) {
  POPULATION <- RB_PRIMARY %>%
    select(GEOID, NAME, year, population) %>%
    distinct()
  HOUSEHOLDS <- RB_PRIMARY %>%
    select(GEOID, NAME, year, households) %>%
    distinct()
} else {
  stop("RB_PRIMARY missing — please run previous tasks first.")
}

# --- Compute household size ---
hhsize_cbsa <- POPULATION %>%
  inner_join(HOUSEHOLDS, by = c("GEOID","NAME","year")) %>%
  mutate(hh_size = ifelse(households > 0, population/households, NA_real_)) %>%
  filter(year != 2020, between(hh_size, 1, 4.5)) %>%
  group_by(NAME, year) %>%
  summarise(hh_size = mean(hh_size, na.rm = TRUE), .groups = "drop") %>%
  mutate(
    is_highlight = str_detect(NAME, regex("^New York", TRUE)) |
                   str_detect(NAME, regex("^Los Angeles", TRUE)),
    metro_label = case_when(
      str_detect(NAME, regex("^New York", TRUE)) ~ "New York",
      str_detect(NAME, regex("^Los Angeles", TRUE)) ~ "Los Angeles",
      TRUE ~ NA_character_
    )
  )

# --- Plot ---
ggplot(hhsize_cbsa, aes(x = year, y = hh_size, group = NAME)) +
  geom_line(alpha = 0.18, linewidth = 0.5) +
  gghighlight::gghighlight(
    is_highlight,
    use_direct_label = TRUE,
    label_params = list(size = 4.2, fontface = 2),
    unhighlighted_params = list(alpha = 0.12, linewidth = 0.5, color = "grey50")
  ) +
  scale_x_continuous(breaks = seq(min(hhsize_cbsa$year, na.rm = TRUE),
                                  max(hhsize_cbsa$year, na.rm = TRUE), 2)) +
  labs(
    x = NULL, y = "Avg. Household Size (personr per household)",
    title = "Household Size Over Time, Highlighting NYC and Los Angeles",
    subtitle = "All CBSAs shown faintly for context; 2020 omitted"
  ) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"),
        legend.position = "none")

Household size over time by CBSA (2009–2023; 2020 omitted). NYC & LA highlighted; others faint for context.

🌟 Extra Credit 03: Increasing Millennial Appeal

👥 Millennial Appeal (ACS 25–34 share + BLS Arts share; cached; 0–100 composite)
suppressPackageStartupMessages({
library(tidyverse)
library(forcats)
library(tidycensus)
library(scales)
library(readr)
})


# 1) Make sure the cache folder exists

dir.create(file.path("data","mp02"), showWarnings = FALSE, recursive = TRUE)

# 2) Confirm Census API key presence (guideline-style friendly note)

if (is.null(Sys.getenv("CENSUS_API_KEY")) || nzchar(Sys.getenv("CENSUS_API_KEY")) == FALSE) {
message("⚠️ No Census API key detected. Run: tidycensus::census_api_key('YOUR_KEY', install = TRUE); then restart R.")
}

# 3) Pick the freshest ACS year per guideline (try 2024; fallback to 2023)

ACS_TRY_YEAR <- 2024
ACS_FALLBACK_YEAR <- 2023
.supports_year <- function(y) {
ok <- try(tidycensus::load_variables(y, "acs1", cache = TRUE), silent = TRUE)
!inherits(ok, "try-error")
}
ACS_MAX_YEAR <- if (.supports_year(ACS_TRY_YEAR)) ACS_TRY_YEAR else ACS_FALLBACK_YEAR

YEARS <- setdiff(2009:ACS_MAX_YEAR, 2020)  # guideline: exclude COVID gap year

# VARIABLE DISCOVERY 

# Guideline-aligned variable search: 2024 (or fallback) & show how to discover "25 to 34"

x <- stringr::coll("25 to 34", ignore_case = TRUE)
acs_vars_demo <- tidycensus::load_variables(ACS_MAX_YEAR, "acs1", cache = TRUE) |>
dplyr::filter(stringr::str_detect(label, x) | stringr::str_detect(concept, x)) |>
dplyr::slice_head(n = 8)

# print(acs_vars_demo)  # optional: peek at matching rows

# FUNCTION: ACS 25–34 SHARE BY YEAR

get_b01001_millennial_by_year <- function(yy, geography = "cbsa") {
vars <- tidycensus::load_variables(yy, "acs1", cache = TRUE)

# Match "Estimate!!Total:" Male/Female (25–29) + (30–34)

pat_total <- stringr::coll("Estimate!!Total:", ignore_case = TRUE)
pat_age   <- stringr::regex("(Male|Female):!!(25 to 29 years|30 to 34 years)$")

age_vars <- vars %>%
dplyr::filter(stringr::str_detect(label, pat_total) &
stringr::str_detect(label, pat_age)) %>%
dplyr::pull(name)

total_var <- "B01001_001"

tidycensus::get_acs(
geography = geography, table = "B01001",
year = yy, survey = "acs1", output = "tidy"
) %>%
dplyr::filter(variable %in% c(age_vars, total_var)) %>%
dplyr::mutate(year = yy,
is_millennial = variable %in% age_vars) %>%
dplyr::group_by(GEOID, NAME, year) %>%
dplyr::summarise(
millennial_pop   = sum(estimate[is_millennial], na.rm = TRUE),
total_population = sum(estimate[variable == total_var], na.rm = TRUE),
.groups = "drop"
) %>%
dplyr::mutate(
millennial_share_25_34 = dplyr::if_else(total_population > 0,
millennial_pop / total_population,
NA_real_)
) %>%
dplyr::select(GEOID, NAME, year, millennial_share_25_34)
}

# CACHE: ACS 25–34 SHARE (2009–MAX)

cache_acs <- file.path("data","mp02","millennial_acs_25_34_cbsa_2009_2024.csv")
if (!file.exists(cache_acs)) {
MILLENNIAL_ACS <- purrr::map_dfr(YEARS, get_b01001_millennial_by_year)
readr::write_csv(MILLENNIAL_ACS, cache_acs)
}
MILLENNIAL_ACS <- readr::read_csv(cache_acs, show_col_types = FALSE)

# BLS ARTS SHARE (NAICS 71; from WAGES_CLEAN) 

# Guideline suggests BLS is often easier for occupation/industry

if (!exists("WAGES_CLEAN") && file.exists("data/mp02/WAGES_CLEAN.rds")) {
WAGES_CLEAN <- readRDS("data/mp02/WAGES_CLEAN.rds")
}
stopifnot(exists("WAGES_CLEAN"))

# Normalize CBSA key to "C" + 5 digits

WAGES_std <- WAGES_CLEAN %>%
dplyr::mutate(
std_cbsa5 = paste0(
"C",
sprintf("%05d", suppressWarnings(as.integer(sub("^C", "", std_cbsa))))
)
)

arts_share_tbl <- WAGES_std %>%
dplyr::group_by(std_cbsa5, YEAR) %>%
dplyr::summarise(
total_emp = sum(EMPLOYMENT, na.rm = TRUE),
arts_emp  = sum(dplyr::if_else(INDUSTRY == 71L | (INDUSTRY %/% 100) == 71L,
EMPLOYMENT, 0L),
na.rm = TRUE),
.groups = "drop"
) %>%
dplyr::filter(total_emp > 0) %>%
dplyr::mutate(arts_emp_share = arts_emp / total_emp) %>%
dplyr::rename(year = YEAR)

# NAME MAP (ACS -> std_cbsa5) FOR NICE LABELS 

cbsa_name_map <- MILLENNIAL_ACS %>%
dplyr::distinct(GEOID, NAME) %>%
dplyr::transmute(
std_cbsa5 = paste0(
"C",
sprintf("%05d", suppressWarnings(as.integer(sub("^C", "", GEOID))))
),
NAME
) %>%
dplyr::distinct(std_cbsa5, .keep_all = TRUE)

arts_share_named <- arts_share_tbl %>%
dplyr::left_join(cbsa_name_map, by = "std_cbsa5") %>%
dplyr::mutate(NAME = coalesce(NAME, std_cbsa5))

# COMBINE + 0–100 STANDARDIZATION

acs_std <- MILLENNIAL_ACS %>%
dplyr::mutate(
std_cbsa5 = paste0(
"C",
sprintf("%05d", suppressWarnings(as.integer(sub("^C", "", GEOID))))
)
) %>%
dplyr::select(std_cbsa5, NAME, year, millennial_share_25_34)

combined <- dplyr::full_join(
acs_std %>% dplyr::select(std_cbsa5, NAME, year, millennial_share_25_34),
arts_share_named %>% dplyr::select(std_cbsa5, year, arts_emp_share),
by = c("std_cbsa5", "year")
) %>%
dplyr::group_by(std_cbsa5, year) %>%
dplyr::summarise(
NAME = dplyr::first(na.omit(NAME)),
millennial_share_25_34 = mean(millennial_share_25_34, na.rm = TRUE),
arts_emp_share         = mean(arts_emp_share, na.rm = TRUE),
.groups = "drop"
)

rescale_0_100 <- function(x) {
rng <- range(x, na.rm = TRUE)
if (!is.finite(rng[1]) || diff(rng) == 0) return(rep(NA_real_, length(x)))
100 * (x - rng[1]) / (rng[2] - rng[1])
}

combined <- combined %>%
dplyr::mutate(
idx_25_34_0_100 = rescale_0_100(millennial_share_25_34),
idx_arts_0_100  = rescale_0_100(arts_emp_share),
millennial_appeal_index_0_100 = rowMeans(
cbind(idx_25_34_0_100, idx_arts_0_100), na.rm = TRUE
)
)

# CACHE COMBINED TABLE 

cache_combo <- file.path("data","mp02","millennial_appeal_COMBINED_cbsa_2009_2024.csv")
readr::write_csv(combined, cache_combo)

# TOP 15 (LATEST YEAR) — QUICK VISUAL 

latest_year <- max(combined$year, na.rm = TRUE)

top15 <- combined %>%
dplyr::filter(year == latest_year) %>%
dplyr::arrange(dplyr::desc(millennial_appeal_index_0_100)) %>%
dplyr::slice_head(n = 15) %>%
dplyr::mutate(NAME = forcats::fct_reorder(NAME, millennial_appeal_index_0_100))

ggplot(top15, aes(x = millennial_appeal_index_0_100, y = NAME)) +
geom_col() +
labs(
title = paste0("Millennial Appeal Index — Top 15 CBSAs (", latest_year, ")"),
subtitle = "Composite of ACS ages 25–34 share and BLS arts employment share (0–100)",
x = "Millennial Appeal Index (0–100)",
y = NULL,
caption = "Sources: ACS 1-year (B01001), BLS QCEW (NAICS 71)"
) +
theme_minimal(base_size = 12)

Millennial Appeal (ages 25–34 + arts employment) at the CBSA level. Higher index suggests stronger youth appeal.
👥 Millennial Appeal (ACS 25–34 share + BLS Arts share; cached; 0–100 composite)
# EXPOSE CLEAN OBJECT FOR DOWNSTREAM USE 

MILLENNIAL_APPEAL_COMBINED <- combined
🧭 Click to see the Millennial Appeal Summary ({latest_year})

Millennial Appeal Insight (2024):

The Millennial Appeal Index (0–100) combines two signals — 1) the share of residents aged 25–34 (ACS), and
2) employment in arts & entertainment industries (BLS NAICS 71).

Higher scores indicate metros that attract and retain younger, creative residents.
In 2024, Clovis, NM Micro Area ranked among the highest CBSAs for Millennial Appeal, highlighting strong cultural engagement and youth presence.

This complements the Pro-Housing Partnership Program (PHPP) by linking housing affordability goals with generational vibrancy and economic renewal.