[AI] Chapter 03 - Data and Preprocessing

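Chapters to Study

Chapter	Topic	Brief Description
1	What Is Machine Learning?	Definition of machine learning, how it works, and its relationship to AI
2	Types of Machine Learning	Differences between supervised, unsupervised, and reinforcement learning
3	Data and Preprocessing	Why data matters and how to clean it up
4	Features and Labels	The concepts of input and output, feature extraction
5	Training and Prediction	What model training and prediction (predict) mean
6	Performance Evaluation	Accuracy, precision, recall, F1 score, and so on
7	Understanding Key Algorithms	Introduction to regression, classification, clustering, and other algorithms
8	Overfitting and Generalization	Problems caused by training too much or too little
9	Practice and Projects	Simple hands-on examples, building a model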

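Why Data Preprocessing Matters

There is a saying in machine learning that without good data there is no good model! No matter how powerful the algorithm, feeding it dirty or malformed data makes the model produce garbage results!

What Is Preprocessing?

It refers to all of the cleaning and refinement work done before model training to raise the quality of the data!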

Common Preprocessing Tasks

1. Handling Missing Values

Null, NaN, empty strings, and the like -> these must be removed or replaced!

# Replace missing values with 0 (fillna returns a new DataFrame, so assign the result)
df = df.fillna(0)
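
Filling with 0 is only one option. The sketch below contrasts removing rows with replacing values, using a small made-up DataFrame (the column names `age` and `city` are assumptions for illustration, not from any real dataset). Empty strings are converted to NaN first, since pandas does not count them as missing by default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are made up for illustration.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],
    "city": ["Seoul", "", "Busan", None],
})

# Treat empty strings as missing so that pandas can detect them.
df = df.replace("", np.nan)

# Option 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Option 2: fill numeric gaps with the column mean and text gaps with a placeholder.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())
filled["city"] = filled["city"].fillna("unknown")

print(df.isna().sum())  # number of missing values per column
print(dropped)
print(filled)
```

Dropping rows is simple but throws data away, while filling keeps every row at the cost of inventing values, so which one fits depends on how much data is missing and why.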

2. Outlier Removal (Outlier Detection)

Remove extreme values and values that were recorded incorrectly because of errors

# Keep only the rows whose salary lies within 1.5 * IQR of the quartiles
q1 = df["salary"].quantile(0.25)
q3 = df["salary"].quantile(0.75)
iqr = q3 - q1
filtered = df[(df["salary"] > q1 - 1.5*iqr) & (df["salary"] < q3 + 1.5*iqr)]
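
To see the IQR rule in action, here is a minimal self-contained sketch. The salary values are invented for illustration, with one obviously wrong entry (50000) mixed in; the 1.5 multiplier is the conventional choice, not a fixed requirement.

```python
import pandas as pd

# Hypothetical salaries; 50000 stands in for a value recorded by mistake.
df = pd.DataFrame({"salary": [3000, 3200, 3100, 2900, 3300, 3050, 50000]})

q1 = df["salary"].quantile(0.25)
q3 = df["salary"].quantile(0.75)
iqr = q3 - q1

# Anything outside [lower, upper] is flagged as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (df["salary"] < lower) | (df["salary"] > upper)

# Keep only the rows that were not flagged.
filtered = df[~is_outlier]

print(f"bounds: {lower} to {upper}")
print(df[is_outlier])
print(filtered)
```

Printing the flagged rows before dropping them is a useful habit, because an extreme value is sometimes a real observation rather than an error.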

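3. Normalization/Standardization (Scaling)

Some models are sensitive to the magnitude of the numbers, so this step brings the values onto a comparable range

Normalization: rescale all values to the range 0 to 1
Standardization: rescale to mean 0 and standard deviation 1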

from sklearn.preprocessing import StandardScaler
# Standardize each column to mean 0 and standard deviation 1
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
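
The two kinds of scaling can be compared side by side. This is a minimal sketch on a made-up array whose two columns have very different ranges; it uses scikit-learn's `MinMaxScaler` for normalization and `StandardScaler` for standardization, and also writes out the underlying formulas by hand for reference.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical data: the second column is 100 times larger than the first.
data = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])

# Normalization: (x - min) / (max - min), computed per column.
normalized = MinMaxScaler().fit_transform(data)

# Standardization: (x - mean) / std, computed per column.
standardized = StandardScaler().fit_transform(data)

# The same results written out by hand.
manual_norm = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))
manual_std = (data - data.mean(axis=0)) / data.std(axis=0)

print(normalized)
print(standardized)
```

After either transformation the two columns end up on the same scale, so neither one dominates just because its raw numbers are bigger.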

4. Encoding

Most machine learning algorithms cannot work on raw text directly. Strings have to be converted to numbers!

Label encoding: category -> number (e.g., male=0, female=1)
One-hot encoding: category -> binary vector (e.g., [0,1,0])
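
Both encodings can be tried with pandas alone. The sketch below uses a made-up `color` column (an assumption for illustration); `cat.codes` gives a label encoding and `pd.get_dummies` gives a one-hot encoding.

```python
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Label encoding: each category becomes an integer code.
# pandas assigns codes in sorted category order: blue=0, green=1, red=2.
df["color_label"] = df["color"].astype("category").cat.codes

# One-hot encoding: one 0/1 column per category, exactly one 1 per row.
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)

print(df)
print(one_hot)
```

Label encoding is compact but implies an order (blue < green < red) that may not exist, which is why one-hot encoding is often the safer choice for unordered categories.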

5. Feature Selection & Dimensionality Reduction

Too many columns (features) can slow a model down or cause overfitting.

Remove unnecessary columns
Among highly correlated variables, keep only one
Reduce dimensionality with techniques such as PCA
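
The three ideas above can be sketched roughly as follows. The data is randomly generated and every column name is hypothetical; the 0.95 correlation threshold is an arbitrary choice for this example, not a standard value.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical data: height_in duplicates height_cm in another unit,
# and row_id is just an identifier with no predictive meaning.
rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, size=200)
df = pd.DataFrame({
    "height_cm": height_cm,
    "height_in": height_cm / 2.54,
    "weight_kg": rng.normal(65, 12, size=200),
    "row_id": np.arange(200),
})

# 1) Remove a column that is known to be unnecessary.
features = df.drop(columns=["row_id"])

# 2) Among highly correlated pairs, keep only the first column of each pair.
corr = features.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = features.drop(columns=to_drop)

# 3) Reduce the remaining features to a single component with PCA.
pca = PCA(n_components=1)
components = pca.fit_transform(reduced)

print("dropped:", to_drop)
print("explained variance ratio:", pca.explained_variance_ratio_)
```

PCA is sensitive to the scale of each column, so in practice the features are usually standardized before it is applied; that step is left out here to keep the sketch short.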

Splitting the Data: Training vs. Test

Training data (Train Set): used to train the model

Test data (Test Set): used to evaluate performance

The data is usually split 80:20 or 70:30 with the train_test_split() function.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
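
Here is a slightly fuller sketch of the split on made-up arrays. `random_state` makes the split reproducible and `stratify=y` keeps the class ratio the same in both sets; both are optional parameters of `train_test_split`. The scaling step at the end is an added illustration of how preprocessing is commonly combined with the split, assuming the scaler should learn its statistics from the training data only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 100 samples, 2 features, balanced binary labels.
X = np.arange(200, dtype=float).reshape(100, 2)
y = np.array([0, 1] * 50)

# 80:20 split, reproducible and with the same class ratio in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training set only, then reuse it on the test set,
# so that no information from the test data leaks into preprocessing.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train.shape, X_test.shape)
print("test labels per class:", np.bincount(y_test))
```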

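** This is just a space where I jot down what I study on my own, day by day.

It is put together from things I found by googling here and there and from lectures I attended.

Images have their sources indicated beneath them.

If anything is a problem, please let me know and I will remove that part. **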
