Big Data Analysis Engineer (빅데이터분석기사) Exam, Type 2: ③ Feature Engineering (Scaling/Encoding)

Loading the data and preprocessing (the same preprocessing used in part ②)

# Load the data
import pandas as pd
X_train = pd.read_csv("X_train.csv")
y_train = pd.read_csv("y_train.csv")
X_test = pd.read_csv("X_test.csv")

# Preprocessing

# X_train data: fill missing values with the mode, a placeholder, the mean, or the median
X_train['workclass'] = X_train['workclass'].fillna(X_train['workclass'].mode()[0])
X_train['native.country'] = X_train['native.country'].fillna(X_train['native.country'].mode()[0])
X_train['occupation'] = X_train['occupation'].fillna("X")
X_train['age'] = X_train['age'].fillna(int(X_train['age'].mean()))
X_train['hours.per.week'] = X_train['hours.per.week'].fillna(X_train['hours.per.week'].median())

# X_test data: age and hours.per.week are filled with statistics computed on X_train
X_test['workclass'] = X_test['workclass'].fillna(X_test['workclass'].mode()[0])
X_test['native.country'] = X_test['native.country'].fillna(X_test['native.country'].mode()[0])
X_test['occupation'] = X_test['occupation'].fillna("X")
X_test['age'] = X_test['age'].fillna(int(X_train['age'].mean()))
X_test['hours.per.week'] = X_test['hours.per.week'].fillna(X_train['hours.per.week'].median())
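
The idea of computing a fill value on the training data and reusing it for the test data can be sketched on a toy frame. The values below are made up for illustration; only the column names are taken from the data above, and the variable names `train`, `test`, `workclass_mode`, and `age_mean` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for X_train / X_test
train = pd.DataFrame({
    "workclass": ["Private", "Private", None, "State-gov"],
    "age": [20.0, 30.0, np.nan, 41.0],
})
test = pd.DataFrame({
    "workclass": [None, "State-gov"],
    "age": [np.nan, 50.0],
})

# Fill values are computed once, on the training data only
workclass_mode = train["workclass"].mode()[0]  # most frequent value
age_mean = int(train["age"].mean())            # mean truncated to an integer

# The same fill values are applied to both frames
for df in (train, test):
    df["workclass"] = df["workclass"].fillna(workclass_mode)
    df["age"] = df["age"].fillna(age_mean)

print(train.isna().sum().sum(), test.isna().sum().sum())  # remaining missing values
print(test)
```

Reusing the training statistics keeps the two sets on the same footing, since the test set is treated as unseen data.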

Separating numeric and categorical data

# Check the data types (X_train.info())

😊 Numeric data: X_train.select_dtypes(exclude='object').copy()

😊 Categorical data: X_train.select_dtypes(include='object').copy()

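# Split the data into numeric columns and categorical columns

# Always call .copy() when taking a subset; otherwise pandas may show a warning when the subset is modified later

# Keep everything except the object (categorical) columns
n_train = X_train.select_dtypes(exclude='object').copy()
n_test = X_test.select_dtypes(exclude='object').copy()

# Keep only the object (categorical) columns
c_train = X_train.select_dtypes(include='object').copy()
c_test = X_test.select_dtypes(include='object').copy()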
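
A small sketch of the split on a toy frame may help. The data is made up, and as an assumption it uses the `'number'` selector instead of `'object'`, so the result does not depend on how the installed pandas version types text columns.

```python
import pandas as pd

# Hypothetical toy frame: two numeric columns and two text columns
df = pd.DataFrame({
    "age": [39, 50],
    "hours.per.week": [40, 13],
    "workclass": ["State-gov", "Private"],
    "sex": ["Male", "Female"],
})

# 'number' selects all numeric dtypes; excluding it leaves the text columns
num_df = df.select_dtypes(include="number").copy()
cat_df = df.select_dtypes(exclude="number").copy()

# Because of .copy(), editing num_df does not touch the original frame
num_df["age"] = num_df["age"] * 2

print(list(num_df.columns))
print(list(cat_df.columns))
print(df["age"].tolist())  # original values are unchanged
```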


# Wrapped in a function so that fresh data can be reloaded each time
# Note: this reloads the raw CSV files, so the missing-value fills above are not applied
def get_nc_data():
    X_train = pd.read_csv("X_train.csv")
    X_test = pd.read_csv("X_test.csv")
    y_train = pd.read_csv("y_train.csv")

    n_train = X_train.select_dtypes(exclude='object').copy()
    n_test = X_test.select_dtypes(exclude='object').copy()
    c_train = X_train.select_dtypes(include='object').copy()
    c_test = X_test.select_dtypes(include='object').copy()
    return n_train, n_test, c_train, c_test

n_train, n_test, c_train, c_test = get_nc_data()  # reload the data

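[Numeric] Scaling (Min-Max, standardization, robust, log transform)

Tree-based models are largely insensitive to the scale of their inputs.
Models such as linear regression and logistic regression are affected by input scaling.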

1) Min-Max

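- Rescales every value into the range 0 to 1, based on the minimum and maximum of the training data

- from sklearn.preprocessing import MinMaxScaler

- Training data: fit_transform

- Test data: transform (transforms only, without fitting)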

# Column names to scale
cols = ['age', 'fnlwgt', 'education.num', 'capital.gain', 'capital.loss', 'hours.per.week']

# Min-max scaling with MinMaxScaler (training values end up between 0 and 1)
n_train, n_test, c_train, c_test = get_nc_data()  # reload the data

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# display() cannot be used in the exam environment; use print() there
display(n_train.head(2))
# Training data: fit and transform
n_train[cols] = scaler.fit_transform(n_train[cols])
# Test data: transform only, without fitting
n_test[cols] = scaler.transform(n_test[cols])
display(n_train.head(2))
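
To make the fit/transform split concrete, here is a minimal sketch on made-up arrays. It checks MinMaxScaler against the formula (x - min) / (max - min), and shows that a test value outside the training range lands outside 0 to 1 because the scaler only knows the training minimum and maximum.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-column data
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[15.0], [40.0]])  # 40 is larger than anything in train

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=10, max=30
test_scaled = scaler.transform(test)        # reuses the training min/max

# The same result computed by hand from the training min and max
manual = (train - train.min(axis=0)) / (train.max(axis=0) - train.min(axis=0))

print(train_scaled.ravel())  # values 0.0, 0.5, 1.0
print(test_scaled.ravel())   # values 0.25, 1.5
```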

2) Standardization (StandardScaler)

- from sklearn.preprocessing import StandardScaler

# Standardization with StandardScaler (z-score: rescales each column to mean 0 and standard deviation 1; the shape of the distribution itself does not change)
n_train, n_test, c_train, c_test = get_nc_data()  # reload the data

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
display(n_train.head(2))
n_train[cols] = scaler.fit_transform(n_train[cols])
n_test[cols] = scaler.transform(n_test[cols])
display(n_train.head(2))
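
As a quick check of what StandardScaler computes, the sketch below compares it with the z-score formula (x - mean) / std on made-up numbers. Note that the scaler uses the population standard deviation, which matches NumPy's default `ddof=0`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-column data
train = np.array([[1.0], [2.0], [3.0], [6.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(train)

# z-score by hand; np.std uses ddof=0 by default, like StandardScaler
manual = (train - train.mean(axis=0)) / train.std(axis=0)

print(scaled.ravel())
print(scaled.mean(), scaled.std())  # approximately 0 and 1
```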

3) Robust scaling

- from sklearn.preprocessing import RobustScaler

- Its advantage is that it minimizes the influence of outliers

# Robust scaling: uses the median and the quartiles, which minimizes the influence of outliers

# Min-max scaling and standardization do not give clean results when outliers are present
n_train, n_test, c_train, c_test = get_nc_data()  # reload the data

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
n_train[cols] = scaler.fit_transform(n_train[cols])
n_test[cols] = scaler.transform(n_test[cols])
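
The effect of an outlier can be seen on a made-up column with one extreme value. With default settings RobustScaler computes (x - median) / IQR, so the ordinary values keep a usable spread, whereas min-max scaling squeezes them close to 0.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Hypothetical data: four ordinary values and one outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

minmax = MinMaxScaler().fit_transform(x).ravel()
robust = RobustScaler().fit_transform(x).ravel()

# Median is 3, quartiles are 2 and 4, so IQR is 2 and robust = (x - 3) / 2
print(minmax)  # the four ordinary values are all below 0.05
print(robust)  # values -1.0, -0.5, 0.0, 0.5, 48.5
```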

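4) Log transform

# Compare values before and after the log transform (useful when values are bunched up on one side; plots cannot be used in the exam)
import numpy as np
print(X_train['fnlwgt'][:3])
print(np.log1p(X_train['fnlwgt'])[:3])

# To undo np.log1p, use np.expm1; np.exp would return x + 1, so the values would not come back exactly
np.expm1(np.log1p(X_train['fnlwgt']))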
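
Since np.log1p(x) computes log(1 + x), its inverse is np.expm1 rather than np.exp. A short sketch on made-up values shows the difference: np.exp leaves every value shifted by 1, while np.expm1 recovers the input up to floating-point error.

```python
import numpy as np

# Hypothetical right-skewed values
x = np.array([0.0, 9.0, 99.0, 99999.0])

logged = np.log1p(x)          # log(1 + x), safe for x = 0
restored = np.expm1(logged)   # exp(y) - 1, the inverse of log1p
shifted = np.exp(logged)      # equals x + 1, not x

print(logged)
print(restored)
print(shifted)
```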

[Categorical] Encoding

Label encoding: apple 0, pear 1, watermelon 2
One-hot encoding: apple 100, pear 010, watermelon 001; the number of columns grows with the number of categories
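
The fruit example can be reproduced directly. This is a minimal sketch using LabelEncoder and pd.get_dummies on a made-up series; LabelEncoder assigns codes in sorted order of the class names, which happens to give apple 0, pear 1, watermelon 2 here.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column
fruit = pd.Series(["apple", "pear", "watermelon", "pear"], name="fruit")

# Label encoding: one integer per category, in sorted order of the names
le = LabelEncoder()
labels = le.fit_transform(fruit)

# One-hot encoding: one 0/1 column per category
onehot = pd.get_dummies(fruit, dtype=int)

print(list(le.classes_), labels)
print(onehot)
```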

Finding the categorical columns

# Names of the object columns
cols = ['workclass', 'education', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'native.country']

1) Label encoding

- Encoding uses a loop, because LabelEncoder handles one column at a time

# Label encoding
n_train, n_test, c_train, c_test = get_nc_data()  # reload the data
from sklearn.preprocessing import LabelEncoder

for col in cols:
    le = LabelEncoder()  # a fresh encoder for each column
    c_train[col] = le.fit_transform(c_train[col])
    c_test[col] = le.transform(c_test[col])
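
One thing to watch with this loop: transform raises a ValueError if the test column contains a label that never appeared in the training column. The sketch below reproduces that on made-up data and shows one common workaround, fitting on the combined train and test values. The workaround is an assumption for illustration, not something taken from the notes above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical columns; "Never-worked" appears only in the test data
train_col = pd.Series(["Private", "State-gov", "Private"])
test_col = pd.Series(["Private", "Never-worked"])

le = LabelEncoder()
le.fit(train_col)

try:
    le.transform(test_col)
    failed = False
except ValueError as err:
    failed = True
    print("transform failed:", err)

# Workaround (assumption): fit on every label seen in either set
le_all = LabelEncoder()
le_all.fit(pd.concat([train_col, test_col]))
train_enc = le_all.transform(train_col)
test_enc = le_all.transform(test_col)

print(list(le_all.classes_), train_enc, test_enc)
```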

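2) One-hot encoding

- Can be done directly in pandas

# One-hot encoding
n_train, n_test, c_train, c_test = get_nc_data()  # reload the data

# Note: encoding train and test separately can produce different columns if a category appears in only one of them
c_train = pd.get_dummies(c_train[cols])
c_test = pd.get_dummies(c_test[cols])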
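
When get_dummies is applied to train and test separately, the resulting columns can differ if a category is missing from one side. The sketch below shows the mismatch on made-up data and one possible fix with reindex, which is an assumption for illustration and not part of the notes above. The frames are toy stand-ins for `c_train` and `c_test`.

```python
import pandas as pd

# Hypothetical data; "Other" appears only in the training set
c_train_toy = pd.DataFrame({"race": ["White", "Black", "Other"]})
c_test_toy = pd.DataFrame({"race": ["White", "Black"]})

train_oh = pd.get_dummies(c_train_toy)
test_oh = pd.get_dummies(c_test_toy)
print(train_oh.shape, test_oh.shape)  # column counts differ: 3 vs 2

# Align test to the training columns, filling missing columns with 0
test_aligned = test_oh.reindex(columns=train_oh.columns, fill_value=0)
print(list(test_aligned.columns))
```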

Combining the data

- concat stacks vertically by default, so pass axis=1 to join side by side

# Recombine the split data
n_train, n_test, c_train, c_test = get_nc_data()  # reload the data

# concat stacks rows by default, so axis=1 is needed to append columns to the right
X_train = pd.concat([n_train, c_train], axis=1)
X_test = pd.concat([n_test, c_test], axis=1)
# The number of columns (the second value in shape) must match between train and test
print(X_train.shape, X_test.shape)
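
The difference between the two axes is easiest to see from the resulting shapes. A minimal sketch on made-up frames (the names `n_part` and `c_part` are hypothetical):

```python
import pandas as pd

# Hypothetical numeric part and encoded categorical part of the same 3 rows
n_part = pd.DataFrame({"age": [39, 50, 38]})
c_part = pd.DataFrame({"sex": [1, 1, 0]})

stacked = pd.concat([n_part, c_part])               # default axis=0: rows are stacked
side_by_side = pd.concat([n_part, c_part], axis=1)  # axis=1: columns are joined

print(stacked.shape)       # (6, 2), with NaN where a column does not apply
print(side_by_side.shape)  # (3, 2)
```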

Combining the data: example

# Reload the data
import pandas as pd

X_train = pd.read_csv("X_train.csv")
X_test = pd.read_csv("X_test.csv")
y_train = pd.read_csv("y_train.csv")

# Combine train and test, encode, then split again

# Categorical columns
cols = list(X_train.columns[X_train.dtypes == object])
print(X_train.shape, X_test.shape)

# Stack train and test vertically
all_df = pd.concat([X_train, X_test])
# One-hot encode (only the encoded categorical columns are kept in the result)
all_df = pd.get_dummies(all_df[cols])

# Split back into train and test
line = int(X_train.shape[0])
X_train = all_df.iloc[:line, :].copy()
X_test = all_df.iloc[line:, :].copy()

print(X_train.shape, X_test.shape)
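
A variant of the same idea that keeps the numeric columns alongside the encoded ones is sketched below on made-up data. It passes the categorical column names through the `columns` argument of pd.get_dummies; the frame and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical data; "Other" appears only in the training set
X_tr = pd.DataFrame({"age": [39, 50, 38], "race": ["White", "Black", "Other"]})
X_te = pd.DataFrame({"age": [28, 45], "race": ["White", "Black"]})

cat_cols = ["race"]

# Stack, encode only the listed columns, and leave the rest untouched
all_df = pd.concat([X_tr, X_te])
all_df = pd.get_dummies(all_df, columns=cat_cols)

# Split back at the original number of training rows
n = len(X_tr)
X_tr_enc = all_df.iloc[:n].copy()
X_te_enc = all_df.iloc[n:].copy()

print(X_tr_enc.shape, X_te_enc.shape)
print(list(all_df.columns))
```

Because both sets are encoded together, they end up with identical columns even though one category is missing from the test rows.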

⭐Full summary

# Split the data
n_train, n_test, c_train, c_test = get_nc_data()  # reload the data

# Numeric columns: min-max scaling
cols = ['age', 'fnlwgt', 'education.num', 'capital.gain', 'capital.loss', 'hours.per.week']
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
n_train[cols] = scaler.fit_transform(n_train[cols])
n_test[cols] = scaler.transform(n_test[cols])

# Categorical columns: label encoding
cols = ['workclass', 'education', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'native.country']
from sklearn.preprocessing import LabelEncoder
# le = LabelEncoder()  # declared twice in the lecture; harmless
for col in cols:
    le = LabelEncoder()
    c_train[col] = le.fit_transform(c_train[col])
    c_test[col] = le.transform(c_test[col])

# Recombine the split data
X_train = pd.concat([n_train, c_train], axis=1)
X_test = pd.concat([n_test, c_test], axis=1)
print(X_train.shape, X_test.shape)
X_train.head()
