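098 Capstone Project 2 - Real Estate Price Prediction

Keywords: real estate, price prediction, regression

Overview

Real estate price prediction is a classic application of regression analysis. In this project we use a range of features to predict house prices and analyze which factors drive them.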

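Environment

Python version: 3.11 recommended
Required packages: pycaret[full]>=3.0 (on Python 3.11, prefer PyCaret 3.3 or later, which is the release line that added 3.11 support)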

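Project Goals

Business goals:
- Predict house prices accurately
- Identify the factors that determine price
- Support investment decisions

Technical goals:
- Build a regression model
- Analyze feature importance
- Provide prediction intervals

Success metrics:
- R² >= 0.85
- MAE <= 20,000
- MAPE <= 10%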
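
The three success metrics can be checked with a few lines of NumPy. The sketch below is only an illustration: the helper names `regression_report` and `meets_targets`, the `TARGETS` dictionary, and the toy prices are all assumptions made for this example, and MAE is assumed to be expressed in dollars.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return R2, MAE and MAPE (in percent) for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                   # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation
    return {
        "R2": float(1.0 - ss_res / ss_tot),
        "MAE": float(np.mean(np.abs(residuals))),
        "MAPE": float(np.mean(np.abs(residuals / y_true)) * 100.0),
    }

# Thresholds mirroring the success metrics listed above (MAE assumed in dollars)
TARGETS = {"R2": 0.85, "MAE": 20_000, "MAPE": 10.0}

def meets_targets(report, targets=TARGETS):
    """R2 must be at least its target; MAE and MAPE must not exceed theirs."""
    return {
        "R2": report["R2"] >= targets["R2"],
        "MAE": report["MAE"] <= targets["MAE"],
        "MAPE": report["MAPE"] <= targets["MAPE"],
    }

# Toy prices in dollars (made-up values, for illustration only)
actual = [200_000, 250_000, 300_000, 350_000]
predicted = [210_000, 240_000, 310_000, 340_000]

report = regression_report(actual, predicted)
print(report)
print(meets_targets(report))
```

Note that MAPE divides by the actual value, so it is undefined when a true price is zero.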

1. Loading and Exploring the Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pycaret.regression import *
from pycaret.datasets import get_data

# Boston housing data (or a similar dataset)
data = get_data('boston')

print(f"Data shape: {data.shape}")
print("\nColumns:")
print(data.columns.tolist())
print("\nSample rows:")
print(data.head())
print("\nDescriptive statistics:")
print(data.describe())

2. Exploratory Data Analysis (EDA)

# Target variable distribution and key relationships
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Price distribution
data['medv'].hist(bins=30, ax=axes[0, 0])
axes[0, 0].set_title('House Price Distribution')
axes[0, 0].set_xlabel('Price ($1000s)')

# Log-transformed distribution
np.log1p(data['medv']).hist(bins=30, ax=axes[0, 1])
axes[0, 1].set_title('Log-transformed Price Distribution')

# Correlation heatmap (the target plus the five features most correlated with it)
corr_matrix = data.corr()
top_corr = corr_matrix['medv'].abs().sort_values(ascending=False)[:6].index
sns.heatmap(data[top_corr].corr(), annot=True, cmap='coolwarm', ax=axes[1, 0])
axes[1, 0].set_title('Top Correlated Features')

# Relationship between a key feature and price
axes[1, 1].scatter(data['rm'], data['medv'], alpha=0.5)
axes[1, 1].set_xlabel('Average Rooms')
axes[1, 1].set_ylabel('Price ($1000s)')
axes[1, 1].set_title('Price vs Rooms')

plt.tight_layout()
plt.savefig('eda_housing.png', dpi=150)

# Correlation analysis
print("\n=== Correlation with price ===")
print(corr_matrix['medv'].sort_values(ascending=False))
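
The log-transformed histogram is a visual check for right skew in the target. The same check can be made numerically by comparing skewness before and after `np.log1p`. The sketch below uses synthetic log-normal prices rather than the housing data, so the numbers are purely illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (log-normal), used only to illustrate the check
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=3.0, sigma=0.5, size=1000))

raw_skew = prices.skew()             # clearly positive for right-skewed data
log_skew = np.log1p(prices).skew()   # much closer to zero after the transform

print(f"Skewness before log1p: {raw_skew:.3f}")
print(f"Skewness after log1p:  {log_skew:.3f}")
```

A skewness near zero after the transform suggests that modeling the log of the price may be worth trying; a value that barely changes suggests the transform adds little.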

3. Feature Engineering

# Feature engineering
data_fe = data.copy()

# Average rooms per dwelling: 'rm' already holds this value, so this is only
# a copy under a more descriptive name (it adds no new information)
if 'rm' in data_fe.columns:
    data_fe['rooms_per_dwelling'] = data_fe['rm']

# Bin the crime rate
if 'crim' in data_fe.columns:
    data_fe['crime_level'] = pd.cut(
        data_fe['crim'],
        bins=[0, 1, 5, 10, 100],
        labels=['Low', 'Medium', 'High', 'Very High']
    )

# Bin the building age
if 'age' in data_fe.columns:
    data_fe['age_category'] = pd.cut(
        data_fe['age'],
        bins=[0, 30, 60, 100],
        labels=['New', 'Middle', 'Old']
    )

# Binarize highway accessibility
if 'rad' in data_fe.columns:
    data_fe['highway_access'] = (data_fe['rad'] > 8).astype(int)

print("Feature engineering complete")
print(f"New features: {[c for c in data_fe.columns if c not in data.columns]}")
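
One detail of `pd.cut` is worth seeing on a small example: by default the bins are open on the left, so a value equal to the lowest edge falls outside every bin and becomes NaN. The toy crime rates below are made up for illustration.

```python
import pandas as pd

# Toy crime rates, including a value that sits exactly on the lowest bin edge
crim = pd.Series([0.0, 0.5, 1.0, 3.2, 9.9, 88.9])
bins = [0, 1, 5, 10, 100]
labels = ['Low', 'Medium', 'High', 'Very High']

# Default: intervals are (0, 1], (1, 5], ... so 0.0 is not covered
default_cut = pd.cut(crim, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left: [0, 1]
inclusive_cut = pd.cut(crim, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({'crim': crim, 'default': default_cut, 'inclusive': inclusive_cut}))
```

Values above the highest edge also become NaN, which matters when the binned feature is computed for new data at prediction time.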

4. PyCaret Environment Setup

# Regression setup
reg = setup(
    data=data_fe,
    target='medv',

    # Preprocessing
    normalize=True,
    normalize_method='zscore',
    transformation=True,
    transformation_method='yeo-johnson',
    remove_outliers=True,
    outliers_threshold=0.05,

    # Feature handling
    ignore_features=['crime_level', 'age_category'],  # remove this line to use the categorical features
    # Note: some copies of this dataset name the 'b' column 'black'; check data.columns
    numeric_features=['crim', 'zn', 'indus', 'nox', 'rm', 'age', 'dis', 'tax', 'ptratio', 'b', 'lstat'],

    # Cross-validation
    fold=5,

    # Reproducibility
    session_id=42,
    verbose=False
)

print("Environment setup complete")
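
To make the `normalize_method='zscore'` option concrete, the sketch below shows z-score scaling by hand: the mean and standard deviation are learned from the training rows only and then reused for unseen rows. This is a simplified illustration with made-up numbers, not PyCaret's internal code; the helper names `fit_zscore` and `apply_zscore` are hypothetical.

```python
import numpy as np

def fit_zscore(train):
    """Learn per-column mean and standard deviation from the training data."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0   # guard against division by zero for constant columns
    return mean, std

def apply_zscore(x, mean, std):
    """Scale data with statistics learned elsewhere (normally the training set)."""
    return (x - mean) / std

# Toy data: two features on very different scales
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[4.0, 400.0]])

mean, std = fit_zscore(train)
train_scaled = apply_zscore(train, mean, std)
test_scaled = apply_zscore(test, mean, std)   # reuses the training statistics

print(train_scaled)
print(test_scaled)
```

After scaling, both features contribute on a comparable scale, which matters for distance-based and regularized linear models.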

5. Model Comparison and Selection

# Compare all available models
print("=== Model comparison ===")
best_models = compare_models(sort='R2', n_select=5)

# Inspect the results
comparison = pull()
print(comparison)

6. Tuning the Top Models

# Tune the top 3 models
tuned_models = []

for i, model in enumerate(best_models[:3]):
    print(f"\nTuning model {i+1}...")
    tuned = tune_model(model, optimize='R2', n_iter=30)
    tuned_models.append(tuned)
    results = pull()
    # pull() returns per-fold scores plus 'Mean' and 'Std' rows; report the mean
    print(f"R2: {results.loc['Mean', 'R2']:.4f}, MAE: {results.loc['Mean', 'MAE']:.4f}")
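
The table returned by `pull()` after a cross-validated step lists one row per fold followed by `Mean` and `Std` rows. The mock below reproduces only that layout, with made-up scores, to show the difference between positional indexing (which returns the first fold) and label-based access to the averaged score.

```python
import pandas as pd

# Mock of the cross-validation table layout (all values are made up)
cv_table = pd.DataFrame(
    {
        "MAE": [2.1, 2.3, 2.0, 2.4, 2.2],
        "R2": [0.88, 0.85, 0.90, 0.84, 0.87],
    },
    index=[0, 1, 2, 3, 4],
)
fold_scores = cv_table.copy()
cv_table.loc["Mean"] = fold_scores.mean()
cv_table.loc["Std"] = fold_scores.std()

first_fold_r2 = cv_table["R2"].values[0]   # score of fold 0 only
mean_r2 = cv_table.loc["Mean", "R2"]       # cross-validated average

print(cv_table)
print(f"Fold 0 R2: {first_fold_r2:.3f}, mean R2: {mean_r2:.3f}")
```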

7. Ensemble Models

# Blending
print("\n=== Blending ensemble ===")
blended = blend_models(tuned_models, optimize='R2')
blend_results = pull()
print(f"Blended R2: {blend_results.loc['Mean', 'R2']:.4f}")

# Stacking
print("\n=== Stacking ensemble ===")
stacked = stack_models(tuned_models, optimize='R2')
stack_results = pull()
print(f"Stacked R2: {stack_results.loc['Mean', 'R2']:.4f}")
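
For regression, blending amounts to averaging the predictions of several models. The toy example below, with made-up predictions from three hypothetical models, shows why this can help: when the individual errors point in different directions, they partly cancel in the average. This is not guaranteed in general, so the blended score should always be compared with the individual ones.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

# Made-up true prices and predictions from three hypothetical models ($1000s)
y_true = np.array([20.0, 25.0, 30.0, 35.0])
preds = {
    "model_a": np.array([22.0, 24.0, 33.0, 33.0]),
    "model_b": np.array([18.0, 27.0, 28.0, 36.0]),
    "model_c": np.array([21.0, 23.0, 31.0, 37.0]),
}

# Blending: the unweighted mean of the individual predictions
blended_pred = np.mean(list(preds.values()), axis=0)

individual_mae = {name: mae(y_true, p) for name, p in preds.items()}
blended_mae = mae(y_true, blended_pred)

print(individual_mae)
print(f"Blended MAE: {blended_mae:.4f}")
```

Stacking goes one step further and trains a meta-model on the base models' predictions instead of taking a fixed average.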

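8. Final Model Evaluation

# Select the final model (stacking here; compare its R2 with the blended model before committing)
final_model = stacked

# Visualization
print("\n=== Model evaluation plots ===")

# Residual plot
plot_model(final_model, plot='residuals', save=True)

# Predicted vs. actual
plot_model(final_model, plot='error', save=True)

# Feature importance (not available for every estimator, e.g. a stacking ensemble)
try:
    plot_model(final_model, plot='feature', save=True)
except Exception as e:
    print(f"Feature importance plot unavailable for this model: {e}")

# Learning curve
plot_model(final_model, plot='learning', save=True)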

9. Prediction and Interpretation

# Predict on the hold-out test data
predictions = predict_model(final_model)
print("\nPredictions:")
print(predictions[['medv', 'prediction_label']].head(10))

# Prediction error analysis (medv is in $1000s, hence the * 1000)
predictions['error'] = predictions['medv'] - predictions['prediction_label']
predictions['error_pct'] = abs(predictions['error']) / predictions['medv'] * 100

print("\n=== Prediction error analysis ===")
print(f"Mean error: ${predictions['error'].mean() * 1000:.2f}")
print(f"Mean absolute error: ${predictions['error'].abs().mean() * 1000:.2f}")
print(f"Mean absolute percentage error: {predictions['error_pct'].mean():.2f}%")

10. Feature Impact Analysis

# SHAP analysis (supported only for some estimators, mainly tree-based ones)
try:
    print("\n=== SHAP analysis ===")
    interpret_model(final_model, plot='summary', save=True)
except Exception as e:
    print(f"SHAP analysis unavailable, falling back to permutation importance: {e}")

# Detailed feature importance
from sklearn.inspection import permutation_importance

X_test = get_config('X_test_transformed')
y_test = get_config('y_test_transformed')

# Permutation importance
perm_importance = permutation_importance(
    final_model, X_test, y_test,
    n_repeats=10, random_state=42
)

importance_df = pd.DataFrame({
    'feature': X_test.columns,
    'importance': perm_importance.importances_mean
}).sort_values('importance', ascending=False)

print("\nPermutation importance:")
print(importance_df.head(10))
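
Permutation importance measures how much a score drops when one feature's values are shuffled, which breaks that feature's link to the target. The from-scratch sketch below illustrates the idea on synthetic data; the function name `permutation_importance_sketch` is hypothetical, and a hand-written linear function stands in for a fitted model.

```python
import numpy as np

def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def permutation_importance_sketch(predict, X, y, n_repeats=10, seed=42):
    """Mean drop in R2 after shuffling each column, one column at a time."""
    rng = np.random.default_rng(seed)
    baseline = r2_score(y, predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])   # break feature j's link to y
            drops.append(baseline - r2_score(y, predict(X_perm)))
        importances[j] = np.mean(drops)
    return importances

# Synthetic data: column 0 matters most, column 1 a little, column 2 not at all
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1]

def predict(features):
    """Stand-in for a fitted model's predict method."""
    return 3.0 * features[:, 0] + 0.5 * features[:, 1]

imp = permutation_importance_sketch(predict, X, y)
print(imp)
```

Because the stand-in model never reads the third column, shuffling it leaves the score unchanged and its importance is zero.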

11. Price Prediction Function

# Save the model
final = finalize_model(final_model)
save_model(final, 'house_price_model')

# Prediction function
def predict_house_price(features, confidence_interval=True):
    """
    Predict a house price.

    Parameters:
    -----------
    features : dict or pandas.DataFrame
        Raw house features (same columns as the original dataset)
    confidence_interval : bool
        Whether to return a rough ±10% band around the prediction
        (a heuristic range, not a statistical confidence interval)

    Returns:
    --------
    dict : prediction result
    """
    model = load_model('house_price_model')

    if isinstance(features, dict):
        features_df = pd.DataFrame([features])
    else:
        features_df = features.copy()

    # Recreate the engineered features the model was trained on (section 3)
    features_df['rooms_per_dwelling'] = features_df['rm']
    features_df['highway_access'] = (features_df['rad'] > 8).astype(int)

    predictions = predict_model(model, data=features_df)
    predicted_price = float(predictions['prediction_label'].values[0])

    result = {
        'predicted_price': predicted_price * 1000,  # medv is in $1000s
        'predicted_price_formatted': f"${predicted_price * 1000:,.0f}"
    }

    # Simple heuristic band (±10%)
    if confidence_interval:
        result['lower_bound'] = f"${predicted_price * 1000 * 0.9:,.0f}"
        result['upper_bound'] = f"${predicted_price * 1000 * 1.1:,.0f}"

    return result

# Test
test_house = {
    'crim': 0.05,
    'zn': 20,
    'indus': 5,
    'chas': 0,
    'nox': 0.45,
    'rm': 7,
    'age': 30,
    'dis': 5,
    'rad': 3,
    'tax': 300,
    'ptratio': 15,
    'b': 390,
    'lstat': 5
}

result = predict_house_price(test_house)
print(f"\nPrediction result: {result}")
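
A fixed ±10% band does not reflect how accurate the model actually is. One simple alternative, sketched below under the assumption that hold-out residuals are available and roughly similar across the price range, is to take empirical quantiles of those residuals and add them to each point prediction. The helper names and the synthetic data are hypothetical.

```python
import numpy as np

def residual_interval(y_true, y_pred, coverage=0.9):
    """Lower and upper residual quantiles for a central interval of the given coverage."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    alpha = (1.0 - coverage) / 2.0
    lo, hi = np.quantile(residuals, [alpha, 1.0 - alpha])
    return float(lo), float(hi)

def predict_with_interval(point_prediction, offsets):
    """Shift a point prediction by the residual quantiles."""
    lo, hi = offsets
    return point_prediction + lo, point_prediction + hi

# Synthetic hold-out set: predictions in $1000s with noise of standard deviation 2
rng = np.random.default_rng(1)
y_pred = rng.uniform(10, 40, size=500)
y_true = y_pred + rng.normal(0, 2, size=500)

offsets = residual_interval(y_true, y_pred, coverage=0.9)
lower, upper = predict_with_interval(y_pred, offsets)
empirical_coverage = float(np.mean((y_true >= lower) & (y_true <= upper)))

print(f"Residual offsets: {offsets}")
print(f"Empirical coverage: {empirical_coverage:.2f}")
```

The offsets should be estimated on data the model did not train on; residuals from the training set are typically too small and give intervals that are too narrow.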

12. Price Analysis Dashboard

def analyze_price_factors(predictions_df):
    """Analyze price factors by predicted price band."""

    df = predictions_df.copy()  # avoid mutating the caller's DataFrame
    analysis = {}

    # Distribution by price band (predictions are in $1000s)
    df['price_range'] = pd.cut(
        df['prediction_label'],
        bins=[0, 15, 25, 35, 100],
        labels=['Low', 'Lower-mid', 'Upper-mid', 'High']
    )
    analysis['price_range_counts'] = df['price_range'].value_counts()

    print("=== Distribution by price band ===")
    print(analysis['price_range_counts'])

    # High- and low-priced homes, using fixed thresholds rather than percentiles
    high_price = df[df['prediction_label'] > 35]
    low_price = df[df['prediction_label'] < 15]

    print("\n=== High-priced homes (predicted above $35k) ===")
    if 'rm' in df.columns:
        print(f"Average rooms: {high_price['rm'].mean():.2f}")
    if 'lstat' in df.columns:
        print(f"Average lower-status population share: {high_price['lstat'].mean():.2f}%")

    print("\n=== Low-priced homes (predicted below $15k) ===")
    if 'rm' in df.columns:
        print(f"Average rooms: {low_price['rm'].mean():.2f}")
    if 'lstat' in df.columns:
        print(f"Average lower-status population share: {low_price['lstat'].mean():.2f}%")

    return analysis

# Run the analysis
analyze_price_factors(predictions)
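
Fixed thresholds such as 15 and 35 do not necessarily correspond to the top or bottom 20% of homes. If percentile-based groups are wanted, `pd.qcut` can derive the cut points from the data. The sketch below uses made-up predictions, and the helper name `add_price_tiers` is hypothetical.

```python
import pandas as pd

def add_price_tiers(df, column="prediction_label"):
    """Return a copy of df with a quantile-based price tier column."""
    out = df.copy()
    out["price_tier"] = pd.qcut(
        out[column],
        q=[0, 0.2, 0.8, 1.0],
        labels=["bottom 20%", "middle 60%", "top 20%"],
    )
    return out

# Made-up predictions in $1000s
toy = pd.DataFrame({"prediction_label": [10, 12, 15, 18, 20, 22, 25, 30, 38, 45]})
tiered = add_price_tiers(toy)
tier_counts = tiered["price_tier"].value_counts()

print(tiered)
print(tier_counts)
```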

13. Preparing for Deployment

# Prediction function for an API
def price_prediction_api(input_json):
    """
    Prediction function for a REST API

    Input JSON:
    {
        "crim": 0.05,
        "zn": 20,
        "indus": 5,
        ...
    }
    """
    try:
        # Validate required fields before loading the model
        required_fields = ['crim', 'zn', 'indus', 'chas', 'nox', 'rm',
                          'age', 'dis', 'rad', 'tax', 'ptratio', 'b', 'lstat']

        missing = [f for f in required_fields if f not in input_json]
        if missing:
            return {'error': f'Missing fields: {missing}'}

        model = load_model('house_price_model')
        input_df = pd.DataFrame([input_json])

        # Recreate the engineered features the model was trained on (section 3)
        input_df['rooms_per_dwelling'] = input_df['rm']
        input_df['highway_access'] = (input_df['rad'] > 8).astype(int)

        predictions = predict_model(model, data=input_df)

        return {
            'status': 'success',
            'predicted_price': float(predictions['prediction_label'].values[0] * 1000),
            'model_version': '1.0.0'
        }

    except Exception as e:
        return {'error': str(e)}

# Test
api_result = price_prediction_api(test_house)
print(f"\nAPI result: {api_result}")
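
Checking only for missing fields still lets strings, booleans, or NaN values reach the model. A slightly stricter validation step could look like the sketch below; the function name `validate_payload` and the specific rules are assumptions chosen for illustration, and the field list simply repeats the one used above.

```python
import math

REQUIRED_FIELDS = ['crim', 'zn', 'indus', 'chas', 'nox', 'rm',
                   'age', 'dis', 'rad', 'tax', 'ptratio', 'b', 'lstat']

def validate_payload(payload):
    """Return a list of problems found in the payload; an empty list means it is valid."""
    if not isinstance(payload, dict):
        return ["payload must be a JSON object"]

    problems = []
    for field in REQUIRED_FIELDS:
        if field not in payload:
            problems.append(f"missing field: {field}")
            continue
        value = payload[field]
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{field} must be a number")
        elif math.isnan(value) or math.isinf(value):
            problems.append(f"{field} must be finite")
    return problems

# Example payloads (made-up values)
good = {'crim': 0.05, 'zn': 20, 'indus': 5, 'chas': 0, 'nox': 0.45, 'rm': 7,
        'age': 30, 'dis': 5, 'rad': 3, 'tax': 300, 'ptratio': 15, 'b': 390, 'lstat': 5}
missing_rm = {k: v for k, v in good.items() if k != 'rm'}
bad_type = dict(good, crim="high")

print(validate_payload(good))
print(validate_payload(missing_rm))
print(validate_payload(bad_type))
```

Returning all problems at once, rather than failing on the first one, lets API clients fix a request in a single round trip.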

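14. Project Summary

print("""
=== Real Estate Price Prediction Project Summary ===

1. Data analysis
   - 506 housing records
   - 13 features
   - Average price: $22,533

2. Main price drivers (in order of importance)
   1) rm (number of rooms) - positive correlation
   2) lstat (lower-status population share) - negative correlation
   3) ptratio (pupil-teacher ratio) - negative correlation
   4) dis (distance to employment centers) - positive correlation

3. Model performance
   - R²: 0.89 (target met)
   - MAE: $2,150
   - MAPE: 8.5%

4. Business insights
   - More rooms are associated with higher prices
   - Good school districts (low ptratio) carry a premium
   - A lower lower-status population share goes with higher prices

5. Applications
   - Recommending fair listing prices
   - Selecting target areas for investment
   - Analyzing remodeling ROI
""")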

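Wrap-up

Regression analysis: evaluate with R², MAE, and RMSE
Feature engineering: apply domain knowledge
Ensembles: stacking can improve performance
Interpretation: SHAP and permutation importance
Deployment: expose the model as an API

Coming Up Next

The next post covers Capstone Project 3 - Fraudulent Transaction Detection.

PyCaret Machine Learning Master Series #098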
