AI Data Preprocessing Report

1. Project Overview

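Purpose: Understand the data structure needed to predict customer churn (churned) for a music subscription service, and verify the preprocessing and merge results in order to build an analysis dataset that is ready for modeling.
Data sources
- Customer data: Kaggle Streaming Subscription Churn Model
- Regional data: Kaggle US Census Demographic Data
Data composition
- Raw customer data (music_df): 125,000 rows, 20 columns
- Raw Census data (census_df): 74,001 rows, 37 columns
- Final analysis data (final_df → model_df): 125,000 rows, 20 columns
Notes
- In the notebook, final_df is created after preprocessing and then copied to model_df for EDA.
- The final data consists of 5 categorical columns and 15 numeric columns (including the target).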
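
The row, column, and type counts above can be checked programmatically. The following is a minimal sketch, not the notebook's actual code: the helper name summarize_columns and the toy frame are hypothetical, and it assumes that every non-numeric column counts as categorical.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> dict:
    """Return row/column counts and the categorical vs. numeric split."""
    numeric_cols = df.select_dtypes(include="number").columns.tolist()
    # Assumption: any column that is not numeric is treated as categorical.
    categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "categorical": categorical_cols,
        "numeric": numeric_cols,
    }


# Toy stand-in for model_df; the values are illustrative only.
toy_model_df = pd.DataFrame({
    "location": ["Alabama", "Alaska"],
    "subscription_type": ["Free", "Premium"],
    "age": [25, 40],
    "weekly_hours": [10.5, 3.2],
    "churned": [0, 1],
})

summary = summarize_columns(toy_model_df)
print(summary)
```

Run against the real model_df, the summary should report 125,000 rows, 20 columns, 5 categorical columns, and 15 numeric columns if the data matches the composition listed above.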

1.1. Data Table Relationships

The project data consists of two source tables (music_df, census_df) and two derived tables (state_stats, model_df).

ERD structure

erDiagram
    music_df {
        int customer_id PK
        int age
        string location FK
        string subscription_type
        string payment_plan
        int num_subscription_pauses
        string payment_method
        string customer_service_inquiries
        int signup_date
        numeric weekly_hours
        numeric average_session_length
        numeric song_skip_rate
        int weekly_songs_played
        int weekly_unique_songs
        int num_favorite_artists
        int num_platform_friends
        int num_playlists_created
        int num_shared_playlists
        int notifications_clicked
        int churned
    }

    census_df {
        int CensusTract PK
        string State FK
        int TotalPop
        numeric Income
    }

    state_stats {
        string State PK
        int State_TotalPop
        numeric State_AvgIncome
    }

    model_df {
        int age
        string location FK
        string subscription_type
        string payment_plan
        int num_subscription_pauses
        string payment_method
        string customer_service_inquiries
        numeric weekly_hours
        numeric average_session_length
        numeric song_skip_rate
        int weekly_songs_played
        int weekly_unique_songs
        int num_favorite_artists
        int num_platform_friends
        int num_playlists_created
        int num_shared_playlists
        int notifications_clicked
        int churned
        numeric State_AvgIncome
        int tenure_days
    }

    census_df }o--|| state_stats : "group by State"
    music_df }o--|| state_stats : "join on location = State"
    music_df ||--o{ model_df : "base table"
    state_stats ||--o{ model_df : "adds regional income"
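
The "group by State" and "join on location = State" relationships in the diagram can be sketched in pandas as follows. This is an illustration on toy data rather than the notebook's code. In particular, computing State_AvgIncome as the unweighted mean of tract-level Income is an assumption, and the derivation of tenure_days is left out.

```python
import pandas as pd

# Toy stand-ins for the real tables; the values are illustrative only.
census_df = pd.DataFrame({
    "CensusTract": [1001, 1002, 2001],
    "State": ["Alabama", "Alabama", "Alaska"],
    "TotalPop": [1000, 3000, 2000],
    "Income": [40000.0, 60000.0, 70000.0],
})
music_df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "location": ["Alabama", "Alaska", "Alabama"],
    "churned": [0, 1, 0],
})

# Many census tracts collapse to one row per state.
# Assumption: State_AvgIncome is the unweighted mean of tract Income.
state_stats = census_df.groupby("State", as_index=False).agg(
    State_TotalPop=("TotalPop", "sum"),
    State_AvgIncome=("Income", "mean"),
)

# Many customers map to one state row. validate raises an error if
# state_stats ever contains duplicate State keys.
merged = music_df.merge(
    state_stats[["State", "State_AvgIncome"]],
    how="left",
    left_on="location",
    right_on="State",
    validate="many_to_one",
).drop(columns="State")

print(merged)
```

A left join keeps every customer row, so the merged frame has the same number of rows as music_df. Customers whose location has no match in state_stats would receive a missing State_AvgIncome, which is worth checking after the merge.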

Relationship descriptions

1) `music_df`

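User-level source data for the music subscription service.

It contains each customer's age, region, subscription type, payment method, listening behavior, playlist activity, and churn status.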

2) `census_df`

Population and income data at the US Census tract level.

Each State has multiple rows, and the table serves as the source for state-level population and income statistics.

3) `state_stats`

A derived table that aggregates census_df by State.

State_TotalPop: total population per state