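Python notes

I'm studying data analysis with Python.

I'm working through Python and pandas,

and it's all so unfamiliar and confusing that it's killing me. T_T lol

If I hadn't studied for the ADsP first, I think I'd have had an even harder time getting a feel for it..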

------------------------------------------------------

Raising my Kaggle Titanic score

1. Pclass, Sex, Fare: 0.784
2. + Embarked encoding: 0.789
3. + AgeType: 0.784
4. + max_depth 5 -> 8: 0.784
5. Removed AgeType, added Single, max_depth 8 -> 6: 0.775
6. max_depth 6 -> 8: 0.78468
7. max_depth 8 -> 10: 0.79425
8. max_depth 10 -> 12: 0.79425
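
The scores above came from leaderboard submissions, one per change. A local cross-validation loop is one way to compare `max_depth` values without submitting each time. The sketch below is only an illustration: it assumes scikit-learn's `cross_val_score`, and it runs on a small synthetic table (not the Titanic data) with a made-up label rule, so the numbers it prints say nothing about the real competition. The helper name `score_depths` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the Titanic set) so the sketch runs on its own.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=200),
    "Sex_encode": rng.integers(0, 2, size=200),
    "Fare": rng.uniform(5, 100, size=200),
})
# Made-up label rule, used only to have something to fit.
y_demo = ((X_demo["Sex_encode"] == 1) | (X_demo["Pclass"] == 1)).astype(int)


def score_depths(X, y, depths, folds=5):
    """Return {max_depth: mean cross-validated accuracy}."""
    scores = {}
    for depth in depths:
        model = DecisionTreeClassifier(max_depth=depth, random_state=0)
        fold_scores = cross_val_score(model, X, y, cv=folds, scoring="accuracy")
        scores[depth] = fold_scores.mean()
    return scores


depth_scores = score_depths(X_demo, y_demo, depths=[5, 6, 8, 10, 12])
for depth, acc in depth_scores.items():
    print(f"max_depth={depth}: {acc:.3f}")
```

With the real data, `X_demo` and `y_demo` would be swapped for the feature table and the `Survived` column. Local accuracy and the leaderboard score can still disagree, so this is a rough guide rather than a replacement for submitting.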

------------------------------------------------------

Python code

Comments will come later.. T_T lol

------------------------------------------------------

import pandas as pd

# Load the training data, using PassengerId as the index
train = pd.read_csv("train.csv", index_col="PassengerId")
train.head()
print(train.shape)

# Load the test data the same way
test = pd.read_csv("test.csv", index_col="PassengerId")
test.head()
print(test.shape)

test["Fare"]

## First things first: preprocessing

# Encoding sex

# map() encodes both values in one pass (male -> 0, female -> 1)
train["Sex_encode"] = train["Sex"].map({"male": 0, "female": 1})
print(train.shape)
train[["Sex", "Sex_encode"]].head()

test["Sex_encode"] = test["Sex"].map({"male": 0, "female": 1})
print(test.shape)
test[["Sex", "Sex_encode"]].head()

test.head()

# Fill in missing Fare

# Look for rows where Fare is missing
train[train["Fare"].isna()]

test[test["Fare"].isna()]

# Fill the missing Fare in test with 0
test["Fare"] = test["Fare"].fillna(0)

test.head()
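
Filling with 0 works, but a fare of 0 is far from a typical ticket price. One common alternative, shown here only as a sketch and not what the code above does, is to fill with the median fare computed from the training data. The two small frames are made-up stand-ins so the snippet runs on its own.

```python
import pandas as pd

# Tiny made-up frames standing in for train/test (values are illustrative only).
train_demo = pd.DataFrame({"Fare": [7.25, 71.28, 8.05, 13.0]})
test_demo = pd.DataFrame({"Fare": [7.83, None, 9.69]})

# Compute the statistic from the training data, then apply it to test.
fare_median = train_demo["Fare"].median()
test_demo["Fare"] = test_demo["Fare"].fillna(fare_median)

print(fare_median)
print(test_demo)
```

Whether this helps the score is something to check by experiment; a decision tree may be fairly insensitive to the choice when only a handful of values are missing.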

## Encode Embarked

# One Hot Encoding
# true == 1, false == 0
# C == [1,0,0]
# S == [0,1,0]
# Q == [0,0,1]

train["Embarked_C"] = train["Embarked"] == "C"
train["Embarked_S"] = train["Embarked"] == "S"
train["Embarked_Q"] = train["Embarked"] == "Q"
train
print(train.shape)
train[["Embarked", "Embarked_C", "Embarked_S", "Embarked_Q"]]

test["Embarked_C"] = test["Embarked"] == "C"
test["Embarked_S"] = test["Embarked"] == "S"
test["Embarked_Q"] = test["Embarked"] == "Q"
# test
# print(test.shape)
# test[["Embarked", "Embarked_C", "Embarked_S", "Embarked_Q"]]

test
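
The three comparisons above build the one-hot columns by hand. pandas also has `pd.get_dummies`, which does the same thing in one call. The sketch below is an alternative, not the method used in this post. It assumes a made-up frame in which no passenger boarded at Q, to show why the result is reindexed to a fixed column list: without that step, train and test could end up with different columns.

```python
import pandas as pd

# Made-up frame: no "Q" passengers and one missing value.
demo = pd.DataFrame({"Embarked": ["C", "S", "S", None]})

# get_dummies creates one column per value it actually sees.
dummies = pd.get_dummies(demo["Embarked"], prefix="Embarked")

# Force the same columns in the same order for every frame;
# ports that never appear become all-False columns.
expected_cols = ["Embarked_C", "Embarked_S", "Embarked_Q"]
dummies = dummies.reindex(columns=expected_cols, fill_value=False)

demo = pd.concat([demo, dummies], axis=1)
print(demo)
```

A missing `Embarked` value ends up with all three flags off, which matches what the `==` comparisons above produce.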

# Age

# Select passengers younger than 15 and put "Young" in a new AgeType column.
train.loc[train["Age"] < 15, "AgeType"] = "Young"

# Likewise, passengers aged 15 or older but younger than 30 get "Medium".
train.loc[(train["Age"] >= 15) & (train["Age"] < 30), "AgeType"] = "Medium"

# Likewise, passengers aged 30 or older get "Old".
train.loc[train["Age"] >= 30, "AgeType"] = "Old"

# One-hot encode AgeType (rows with a missing Age get False in all three)
train["Age_Young"] = train["AgeType"] == "Young"
train["Age_Medium"] = train["AgeType"] == "Medium"
train["Age_Old"] = train["AgeType"] == "Old"

train.head()

# train[train["Age"].isna()]

train["Age"] = train["Age"].fillna(0)

train.head()

train[train["Age"].isna()]

test[test["Age"].isna()]

# Select passengers younger than 15 and put "Young" in a new AgeType column.
test.loc[test["Age"] < 15, "AgeType"] = "Young"

# Likewise, passengers aged 15 or older but younger than 30 get "Medium".
test.loc[(test["Age"] >= 15) & (test["Age"] < 30), "AgeType"] = "Medium"

# Likewise, passengers aged 30 or older get "Old".
test.loc[test["Age"] >= 30, "AgeType"] = "Old"

test["Age_Young"] = test["AgeType"] == "Young"
test["Age_Medium"] = test["AgeType"] == "Medium"
test["Age_Old"] = test["AgeType"] == "Old"

# Fill missing Age only after binning, in the same order as train.
# Filling with 0 first would label every missing Age as "Young" in test only.
test["Age"] = test["Age"].fillna(0)

test.head()
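
The three `.loc` assignments can also be written as a single `pd.cut` call. This is a sketch of an equivalent binning, assuming the same cut points (15 and 30) and a made-up `Age` column; it is not the code this post actually ran. `right=False` makes each bin include its lower edge, so 15 falls into "Medium" and 30 into "Old", as in the comparisons above.

```python
import pandas as pd

# Made-up ages, including the two boundary values and one missing value.
demo = pd.DataFrame({"Age": [4, 15, 29.5, 30, 62, None]})

# Bins are [0, 15), [15, 30), [30, inf) because right=False.
demo["AgeType"] = pd.cut(
    demo["Age"],
    bins=[0, 15, 30, float("inf")],
    labels=["Young", "Medium", "Old"],
    right=False,
)

# A missing Age stays missing in AgeType, so all three flags are False.
demo["Age_Young"] = demo["AgeType"] == "Young"
demo["Age_Medium"] = demo["AgeType"] == "Medium"
demo["Age_Old"] = demo["AgeType"] == "Old"

print(demo)
```

Running the same call on both train and test keeps the two tables binned identically.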

# Single, SibSp, Parch

# Single is True when a passenger has no siblings/spouse and no parents/children aboard
train["Single"] = (train["SibSp"] == 0) & (train["Parch"] == 0)
train.head()
train[["Single", "SibSp", "Parch"]]

test["Single"] = (test["SibSp"] == 0) & (test["Parch"] == 0)
test.head()
test[["Single", "SibSp", "Parch"]]

# train

# Features = Pclass, Sex, Fare, Embarked, Single
# Label = Survived

# Age features currently left out: "Age_Young", "Age_Medium", "Age_Old"
feature_names = ["Pclass", "Sex_encode", "Fare", "Embarked_C", "Embarked_S", "Embarked_Q", "Single"]
feature_names

X_train = train[feature_names]
print(X_train.shape)
X_train.head()

X_test = test[feature_names]
print(X_test.shape)
X_test.head()

label_name = "Survived"
label_name

y_train = train[label_name]
print(y_train.shape)
y_train.head()

# Use a decision tree

#import sklearn
#sklearn.tree.DecisionTreeClassifier

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(max_depth=12)
model

#model

#fit (train) -> predict (test)

model.fit(X_train, y_train)

prediction = model.predict(X_test)
prediction
print(prediction.shape)
prediction[0:5]

# Load the sample submission and overwrite its Survived column with the predictions
submit = pd.read_csv("gender_submission.csv", index_col="PassengerId")
submit

submit["Survived"] = prediction

submit.to_csv("decision_tree.csv")
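
The submission above reuses `gender_submission.csv` as a template. As a sketch of another way to do it, the same two-column file can be built directly from the test index, assuming the test table was loaded with `PassengerId` as its index as it was earlier. The frame and predictions below are made-up placeholders, not real model output.

```python
import pandas as pd

# Made-up stand-ins for the test table and the model's predictions.
test_demo = pd.DataFrame(
    {"Pclass": [3, 1, 2]},
    index=pd.Index([892, 893, 894], name="PassengerId"),
)
prediction_demo = [0, 1, 0]

# Reusing the test index keeps each prediction aligned with its passenger.
submit_demo = pd.DataFrame({"Survived": prediction_demo}, index=test_demo.index)

# to_csv() with no path returns the CSV text, handy for a quick look.
csv_text = submit_demo.to_csv()
print(csv_text)
```

Passing a filename to `to_csv` instead writes the file to disk, as in the code above.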
