작성자: 정서연 Email: [email protected] Phone: 010-4244-5669
<aside> <img src="/icons/info-alternate_lightgray.svg" alt="/icons/info-alternate_lightgray.svg" width="40px" /> Information
</aside>
데이터 불러오기 및 헤드 확인
df_train = pd.read_csv('C:/Users/jsy56/titanic/input/train.csv') df_test = pd.read_csv('C:/Users/jsy56/titanic/input/test.csv') df_train.head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN |
| 변수(feature, variable) | 정의 | 설명 | 타입 |
|---|---|---|---|
| survival | 생존여부 | target label 임. 1, 0 으로 표현됨 | integer |
| Pclass | 티켓의 클래스 | 1 = 1st, 2 = 2nd, 3 = 3rd 클래스로 나뉘며 categorical feature | integer |
| sex | 성별 | male, female 로 구분되며 binary | string |
| Age | 나이 | continuous | integer |
| sibSp | 함께 탑승한 형제와 배우자의 수 | quantitative | integer |
| parch | 함께 탑승한 부모, 아이의 수 | quantitative | integer |
| ticket | 티켓 번호 | alphabat + integer | string |
| fare | 탑승료 | continuous | float |
| cabin | 객실 번호 | alphabat + integer | string |
| embared | 탑승 항구 | C = Cherbourg, Q = Queenstown, S = Southampton | string |
학습 데이터의 기술통계값
df_train.describe()
| PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | |
|---|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
테스트 데이터의 기술통계값
df_test.describe()
| PassengerId | Pclass | Age | SibSp | Parch | Fare | |
|---|---|---|---|---|---|---|
| count | 418.000000 | 418.000000 | 332.000000 | 418.000000 | 418.000000 | 417.000000 |
| mean | 1100.500000 | 2.265550 | 30.272590 | 0.447368 | 0.392344 | 35.627188 |
| std | 120.810458 | 0.841838 | 14.181209 | 0.896760 | 0.981429 | 55.907576 |
| min | 892.000000 | 1.000000 | 0.170000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 996.250000 | 1.000000 | 21.000000 | 0.000000 | 0.000000 | 7.895800 |
| 50% | 1100.500000 | 3.000000 | 27.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 1204.750000 | 3.000000 | 39.000000 | 1.000000 | 0.000000 | 31.500000 |
| max | 1309.000000 | 3.000000 | 76.000000 | 8.000000 | 9.000000 | 512.329200 |
<aside> 💡 NULL 값, 이상치, 오류값 확인 필요
</aside>
for colin df_train.columns:
msg_tr = 'column:{:>10}\\t Percent of NaN value:{:.2f}%'.format(col, 100 * (df_train[col].isnull().sum() / df_train[col].shape[0]))
print(msg_tr)
column: PassengerId Percent of NaN value: 0.00% column: Survived Percent of NaN value: 0.00% column: Pclass Percent of NaN value: 0.00% column: Name Percent of NaN value: 0.00% column: Sex Percent of NaN value: 0.00% column: Age Percent of NaN value: 19.87% column: SibSp Percent of NaN value: 0.00% column: Parch Percent of NaN value: 0.00% column: Ticket Percent of NaN value: 0.00% column: Fare Percent of NaN value: 0.00% column: Cabin Percent of NaN value: 77.10% column: Embarked Percent of NaN value: 0.22%