
Python notes

by 메두 드 펭 2020. 4. 16.

I'm studying data analysis with Python.

 

I'm learning Python and pandas,

 

and it is all so new, confusing, and unfamiliar that it's killing me. T_T lol

 

If I hadn't studied for the ADsP, I think I really would have been even more lost..

 

 

------------------------------------------------------

 

Raising my Kaggle Titanic score

 

1. Pclass, Sex, Fare: 0.784
2. + Embarked encoding: 0.789
3. + AgeType: 0.784
4. + depth 5 -> 8: 0.784
5. Removed AgeType, added Single, depth 8 -> 6: 0.775
6. depth 6 -> 8: 0.78468
7. depth 8 -> 10: 0.79425
8. depth 10 -> 12: 0.79425
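
The depth changes above were each scored by submitting to Kaggle. As a rough alternative, depths could also be compared locally with cross-validation before submitting. The sketch below is only an illustration: `score_depths` is a hypothetical helper, the data is a synthetic stand-in from scikit-learn (not the Titanic set), and `random_state` is added just to make the demo repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def score_depths(X, y, depths, cv=5, seed=0):
    """Return {max_depth: mean cross-validated accuracy} (hypothetical helper)."""
    scores = {}
    for depth in depths:
        model = DecisionTreeClassifier(max_depth=depth, random_state=seed)
        # cross_val_score fits a fresh copy of the model on each fold
        fold_scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        scores[depth] = fold_scores.mean()
    return scores


# Synthetic stand-in data (NOT the Titanic set), only so the sketch runs on its own.
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=0)

depth_scores = score_depths(X_demo, y_demo, depths=[5, 6, 8, 10, 12])
for depth, accuracy in depth_scores.items():
    print(f"max_depth={depth}: {accuracy:.3f}")
```

With the real data, `X_train` and `y_train` would take the place of the demo arrays. Local scores will not match the leaderboard exactly, but they can show when going deeper has stopped helping.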

 

------------------------------------------------------

 

 

Python code

 

Comments will come later.. T_T lol

 

 

------------------------------------------------------

 

import pandas as pd

train = pd.read_csv("train.csv", index_col="PassengerId")
train.head()
print(train.shape)

test = pd.read_csv("test.csv", index_col="PassengerId")
test.head()
print(test.shape)

test["Fare"]

## First step: preprocessing

# Encode Sex as a number: male -> 0, female -> 1

train["Sex_encode"] = train["Sex"].map({"male": 0, "female": 1})
print(train.shape)
train[["Sex", "Sex_encode"]].head()

test["Sex_encode"] = test["Sex"].map({"male": 0, "female": 1})
print(test.shape)
test[["Sex", "Sex_encode"]].head()

test.head()

# Fill in missing Fare values

# Show the rows where Fare is missing
train[train["Fare"].isna()]

test[test["Fare"].isna()]

# Replace the missing Fare with 0
test["Fare"] = test["Fare"].fillna(0)


test.head()
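
Filling the missing Fare with 0 works, but 0 is far from a typical fare. One common alternative is to fill with the median fare from the training data. This is only a sketch on tiny made-up frames (`train_demo` and `test_demo` are hypothetical stand-ins, not the real files):

```python
import numpy as np
import pandas as pd

# Tiny made-up frames standing in for train/test.
train_demo = pd.DataFrame({"Fare": [7.25, 71.28, 8.05, 53.10]})
test_demo = pd.DataFrame({"Fare": [7.83, np.nan, 9.69]})

# Take the statistic from the training data only, then reuse it for test.
fare_median = train_demo["Fare"].median()
test_demo["Fare"] = test_demo["Fare"].fillna(fare_median)

print(fare_median)
print(test_demo)
```

Whether this beats filling with 0 for a decision tree is something to check against the score, not something this sketch proves.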

## Encode Embarked


# One-hot encoding
# True == 1, False == 0
# C == [1, 0, 0]
# S == [0, 1, 0]
# Q == [0, 0, 1]


train["Embarked_C"] = train["Embarked"] == "C"
train["Embarked_S"] = train["Embarked"] == "S"
train["Embarked_Q"] = train["Embarked"] == "Q"
train
print(train.shape)
train[["Embarked", "Embarked_C", "Embarked_S", "Embarked_Q"]]


test["Embarked_C"] = test["Embarked"] == "C"
test["Embarked_S"] = test["Embarked"] == "S"
test["Embarked_Q"] = test["Embarked"] == "Q"
# test
# print(test.shape)
# test[["Embarked", "Embarked_C", "Embarked_S", "Embarked_Q"]]

test
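
The three comparisons above can also be written with `pd.get_dummies`. The catch is that a category missing from one frame would produce fewer columns there, so the sketch below reindexes to a fixed column list. `encode_embarked` is a hypothetical helper and the frames are made up for the demo:

```python
import pandas as pd

# Made-up frames; the test one deliberately has no "C" row.
train_demo = pd.DataFrame({"Embarked": ["S", "C", "Q", None]})
test_demo = pd.DataFrame({"Embarked": ["Q", "S", "S"]})


def encode_embarked(df, ports=("C", "S", "Q")):
    """One-hot encode Embarked into the same columns for every frame."""
    dummies = pd.get_dummies(df["Embarked"], prefix="Embarked")
    columns = [f"Embarked_{port}" for port in ports]
    # Add any column that is absent in this frame, filled with False.
    dummies = dummies.reindex(columns=columns, fill_value=False).astype(bool)
    return df.join(dummies)


train_enc = encode_embarked(train_demo)
test_enc = encode_embarked(test_demo)
print(train_enc)
print(test_enc)
```

A row with a missing Embarked ends up False in all three columns, which matches what the `==` comparisons do.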

# Age



# Select passengers younger than 15 and put "Young" into a new AgeType column.
train.loc[train["Age"] < 15, "AgeType"] = "Young"

# Likewise, passengers aged 15 or older but younger than 30 get "Medium".
train.loc[(train["Age"] >= 15) & (train["Age"] < 30), "AgeType"] = "Medium"

# Likewise, passengers aged 30 or older get "Old".
train.loc[train["Age"] >= 30, "AgeType"] = "Old"



train["Age_Young"] = train["AgeType"] == "Young"
train["Age_Medium"] = train["AgeType"] == "Medium"
train["Age_Old"] = train["AgeType"] == "Old"

train.head()

# train[train["Age"].isna()]

train["Age"] = train["Age"].fillna(0)



train.head()

# Check: no rows with a missing Age should remain in train
train[train["Age"].isna()]

test[test["Age"].isna()]

# Bin Age before filling missing values, in the same order as for train, so that
# passengers with a missing Age get no AgeType instead of falling into "Young".

# Select passengers younger than 15 and put "Young" into a new AgeType column.
test.loc[test["Age"] < 15, "AgeType"] = "Young"

# Likewise, passengers aged 15 or older but younger than 30 get "Medium".
test.loc[(test["Age"] >= 15) & (test["Age"] < 30), "AgeType"] = "Medium"

# Likewise, passengers aged 30 or older get "Old".
test.loc[test["Age"] >= 30, "AgeType"] = "Old"

test["Age_Young"] = test["AgeType"] == "Young"
test["Age_Medium"] = test["AgeType"] == "Medium"
test["Age_Old"] = test["AgeType"] == "Old"

test["Age"] = test["Age"].fillna(0)

test.head()
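
The three `.loc` assignments for AgeType can be collapsed into one `pd.cut` call. A small sketch on a made-up `ages` series, assuming the same boundaries (under 15, 15 to under 30, 30 and over):

```python
import numpy as np
import pandas as pd

# Made-up ages, including the boundary values and one missing value.
ages = pd.Series([5, 15, 29.9, 30, np.nan], name="Age")

# right=False makes each bin include its left edge and exclude its right edge:
# [-inf, 15) -> Young, [15, 30) -> Medium, [30, inf) -> Old
age_type = pd.cut(
    ages,
    bins=[-np.inf, 15, 30, np.inf],
    right=False,
    labels=["Young", "Medium", "Old"],
)
print(age_type.tolist())
```

A missing Age stays missing in the result, so it only lands in a group if it is filled before the binning.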

# Single: True when the passenger has no siblings/spouse (SibSp) and no parents/children (Parch) aboard

train["Single"] = (train["SibSp"] == 0) & (train["Parch"] == 0)
train.head()
train[["Single","SibSp","Parch"]]

test["Single"] = (test["SibSp"] == 0) & (test["Parch"] == 0)
test.head()
test[["Single","SibSp","Parch"]]





# Train

# Features = Pclass, Sex, Fare, Embarked, Single
# Label = Survived

# Currently left out: "Age_Young", "Age_Medium", "Age_Old"
feature_names = ["Pclass", "Sex_encode", "Fare", "Embarked_C", "Embarked_S", "Embarked_Q", "Single"]
feature_names

X_train = train[feature_names]
print(X_train.shape)
X_train.head()

X_test = test[feature_names]
print(X_test.shape)
X_test.head()

label_name = "Survived"
label_name

y_train = train[label_name]
print(y_train.shape)
y_train.head()



# Use a decision tree

# import sklearn
# sklearn.tree.DecisionTreeClassifier

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(max_depth=12)
model

# model

# fit (train) -> predict (test)

model.fit(X_train, y_train)

prediction = model.predict(X_test)
prediction
print(prediction.shape)
prediction[0:5]

submit = pd.read_csv("gender_submission.csv", index_col="PassengerId")
submit


submit["Survived"] = prediction

submit.to_csv("decision_tree.csv")
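
The submission above borrows the layout of gender_submission.csv. Since test is already indexed by PassengerId, the same file could also be built directly from the predictions. A minimal sketch with made-up stand-ins (`test_demo` and `prediction_demo` are hypothetical, not real model output):

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the test frame and the model's predictions.
test_demo = pd.DataFrame(
    {"Pclass": [3, 1, 2]},
    index=pd.Index([892, 893, 894], name="PassengerId"),
)
prediction_demo = np.array([0, 1, 0])

# Reuse the test index so every prediction is tied to its PassengerId.
submit_demo = pd.DataFrame({"Survived": prediction_demo}, index=test_demo.index)

# Sanity check before writing: one prediction per test passenger.
assert len(submit_demo) == len(test_demo)

# With no path given, to_csv returns the CSV text instead of writing a file.
csv_text = submit_demo.to_csv()
print(csv_text)
```

Passing a file name to `to_csv` would write the same content to disk.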