😄 NLP - Sentiment Analysis of Korean Movie Reviews

@winuss · May 24, 2020 · 15 min read

Let's do some text analysis on Korean data. The dataset below is used in many different ways for learning Korean-language analysis. Here we'll use Konlpy for the Korean processing and build a model with TensorFlow Keras.

๋ฐ์ดํ„ฐ์…‹ : Naver sentiment movie corpus (๋‹ค์šด๋กœ๋“œ ๋งํฌ : https://github.com/e9t/nsmc/)

It seems to be popular enough that people commonly refer to it just by the abbreviation NSMC.

๋ฐ์ดํ„ฐ ์„ค๋ช…

200,000 movie reviews in total, sampled at 100 reviews per movie (train: 150k, test: 50k).

Ratings run from 1 to 10; neutral ratings (5-8) were excluded, and the rest were classified as follows:

  • Negative: ratings 1-4
  • Positive: ratings 9-10

Columns: id, document, label

  • id: review ID
  • document: review text
  • label: sentiment label (0: negative, 1: positive)

๊ฐ ํŒŒ์ผ์— ๋Œ€ํ•œ ๋ฆฌ๋ทฐ ๊ฐฏ์ˆ˜

  • ratings.txt: all 200k reviews
  • ratings_test.txt: 50k reviews
  • ratings_train.txt: 150k reviews

Every review is 140 characters or less, and the two sentiment classes are sampled in equal numbers (i.e., random guessing yields 50% accuracy).

  • 100k negative reviews
  • 100k positive reviews
  • Neutral reviews are excluded

๋ฐ์ดํ„ฐ ์ค€๋น„

Let's read the downloaded data with pandas. The fields are tab-separated, so the delimiter must be set to \t.

import pandas as pd

train_df = pd.read_csv("data_naver_movie/ratings_train.txt", sep="\t")
test_df = pd.read_csv("data_naver_movie/ratings_test.txt", sep="\t")
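If you want to sanity-check the tab-separated parsing without the downloaded files, here's a minimal self-contained sketch using two inline rows in the same id/document/label layout (the rows mirror the corpus format, not the real files on disk):

```python
import io
import pandas as pd

# two inline rows laid out like ratings_train.txt (tab-separated, with header)
sample = "id\tdocument\tlabel\n1\t아 더빙.. 진짜 짜증나네요 목소리\t0\n2\t굳 ㅋ\t1\n"
df = pd.read_csv(io.StringIO(sample), sep="\t")
print(df.shape)          # → (2, 3)
print(list(df.columns))  # → ['id', 'document', 'label']
```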

๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ

๋ฐ์ดํ„ฐ๋ฅผ ํ•™์Šต ์‹œํ‚ค๊ธฐ ์œ„ํ•ด ์ „์ฒ˜๋ฆฌ๋ฅผ ์ง„ํ–‰ํ•ด์•ผ ํ•˜๋Š”๋ฐ, Konlpy๋ฅผ ์ด์šฉํ•ด ํ˜•ํƒœ์†Œ ๋ถ„์„ ๋ฐ ํ’ˆ์‚ฌ ํƒœ๊น…์„ ํ•˜๋„๋ก ํ•˜์ž.

For English, processing based only on raw word frequencies is usually good enough, but in Korean, unlike English, spacing alone is a poor guide to meaning, and reviews often contain misspellings and irregular spacing. For accurate classification it is therefore better to use Konlpy.

Konlpy provides several classes that perform morphological analysis and POS tagging, using spacing algorithms and normalization to cope reasonably well even with misspelled sentences.

from konlpy.tag import Okt
okt = Okt()
okt.pos('흔들리는 꽃들 속에서 네 샴푸향이 느껴진거야')
[('흔들리는', 'Verb'),
 ('꽃', 'Noun'),
 ('들', 'Suffix'),
 ('속', 'Noun'),
 ('에서', 'Josa'),
 ('네', 'Noun'),
 ('샴푸', 'Noun'),
 ('향', 'Noun'),
 ('이', 'Josa'),
 ('느껴진거야', 'Verb')]

Feeding in a simple test sentence shows how the text gets split into morphemes and tagged in this way.

Let's wrap this in a tokenizer function for reuse.

def tokenize(doc):
    #join each morpheme with its POS tag, e.g. '꽃/Noun'
    return ['/'.join(t) for t in okt.pos(doc, norm=True, stem=True)]

norm=True normalizes the text, and stem=True reduces words to their stem form.

If a review is null, the function above will raise an error, so check for null values first and replace them with empty strings!

train_df.isnull().any() #document contains null values
train_df['document'] = train_df['document'].fillna('') #replace nulls with ''

test_df.isnull().any()
test_df['document'] = test_df['document'].fillna('') #replace nulls with ''

Now let's analyze the training and test data and store the results.

#the tokenize step can take a long time...
train_docs = [(tokenize(row[1]), row[2]) for row in train_df.values]
test_docs = [(tokenize(row[1]), row[2]) for row in test_df.values]

Once the analysis finishes, the data will have been transformed into the following form.

print(train_docs[0])
print(test_docs[0])
(['아/Exclamation', '더빙/Noun', '../Punctuation', '진짜/Noun', '짜증나다/Adjective', '목소리/Noun'], 0)
(['굳다/Adjective', 'ㅋ/KoreanParticle'], 1)

Let's look at the number of tokens produced from the 150k training reviews.

tokens = [t for d in train_docs for t in d[0]]
print("Number of tokens:", len(tokens))
Number of tokens: 2159921

Now let's continue preprocessing this data with nltk. The nltk Text class provides a variety of functions for conveniently exploring a document.

Here we'll use the vocab().most_common method to fetch the most frequently used tokens.

import nltk
text = nltk.Text(tokens, name='NSMC')

#number of tokens
print(len(text.tokens))

#number of unique tokens
print(len(set(text.tokens)))

#top 10 tokens by frequency
print(text.vocab().most_common(10))
2159921
49894
[('./Punctuation', 67778), ('영화/Noun', 50818), ('하다/Verb', 41209), ('이/Josa', 38540), ('보다/Verb', 38538), ('의/Josa', 30188), ('../Punctuation', 29055), ('가/Josa', 26627), ('에/Josa', 26468), ('을/Josa', 23118)]

๋ฐ์ดํ„ฐ ํƒ์ƒ‰

Let's visualize the most frequent tokens as a graph with matplotlib.

%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import font_manager, rc
# a Korean font must be configured for the Hangul tick labels to render correctly
plt.figure(figsize=(20,10))
text.plot(50) #plot the 50 most frequent tokens

[plot: frequency of the top 50 tokens]

๋ชจ๋ธ์„ ๋งŒ๋“ค๊ธฐ ์œ„ํ•ด ๋ฐฑํ„ฐํ™”๋ฅผ ํ•ด์•ผ ํ•˜๋Š”๋ฐ, ์ž์ฃผ ์‚ฌ์šฉ๋˜๋Š” ํ† ํฐ 10000๊ฐœ๋ฅผ ์‚ฌ์šฉํ•ด ๋ฐ์ดํ„ฐ๋ฅผ ๋ฐฑํ„ฐํ™” ํ•˜์ž.(์› ํ•ซ ์ธ์ฝ”๋”ฉ ๋Œ€์‹  CountVectorization์„ ์‚ฌ์šฉ)

This builds word tokens from the document collection and counts each word's occurrences, producing BOW (bag-of-words) encoded vectors.
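The counting idea itself can be sketched in plain Python with a toy vocabulary (illustrative only; the actual code below builds the vocabulary from the corpus):

```python
# toy vocabulary of selected tokens and one tokenized document
selected = ['영화/Noun', '보다/Verb', '재밌다/Adjective']
doc = ['영화/Noun', '진짜/Noun', '재밌다/Adjective', '영화/Noun']

# bag-of-words vector: one count per vocabulary word, in vocabulary order
vector = [doc.count(word) for word in selected]
print(vector)  # → [2, 0, 1]
```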

Since this takes quite a while, let's try it with just 100 tokens...

FREQUENCY_COUNT = 100 #try 10000 if you have the time~
selected_words = [f[0] for f in text.vocab().most_common(FREQUENCY_COUNT)]

Because this step takes a long time on a dataset this large, it's a good idea to save the tagged results to a JSON file once tagging is done, so the work doesn't have to be repeated.
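A minimal sketch of that caching idea, assuming train_docs is the list of (tokens, label) pairs built above (the file name is arbitrary):

```python
import json
import os

def save_docs(docs, path):
    # persist tokenized (tokens, label) pairs so tagging isn't repeated
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(docs, f, ensure_ascii=False)

def load_docs(path):
    with open(path, encoding='utf-8') as f:
        # JSON stores tuples as lists, so rebuild the (tokens, label) pairs
        return [(tokens, label) for tokens, label in json.load(f)]

# reuse the cache when it exists instead of re-tokenizing
if os.path.exists('train_docs.json'):
    train_docs = load_docs('train_docs.json')
```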

For each document, we need to know how many of the selected top words it contains.

#count how many of the top selected words appear in a document
def term_frequency(doc):
    return [doc.count(word) for word in selected_words]
#word counts for each document
x_train = [term_frequency(d) for d,_ in train_docs]
x_test = [term_frequency(d) for d,_ in test_docs]
#labels (1 or 0)
y_train = [c for _,c in train_docs]
y_test = [c for _,c in test_docs]

This gives us x data holding the word-frequency vectors and y data holding the classification labels, neatly organized.

Now convert the data to float and the preprocessing is done!

import numpy as np

x_train = np.asarray(x_train).astype('float32')
x_test = np.asarray(x_test).astype('float32')

y_train = np.asarray(y_train).astype('float32')
y_test = np.asarray(y_test).astype('float32')

๋ฐ์ดํ„ฐ ๋ชจ๋ธ๋ง

Let's build a model with TensorFlow Keras.

๋ ˆ์ด์–ด ๊ตฌ์„ฑ์€ ๋‘๊ฐœ์˜ Danse์ธต์€ 64๊ฐœ์˜ ์œ ๋‹›์„ ๊ฐ€์ง€๊ณ  ํ™œ์„ฑํ•จ์ˆ˜๋Š” relu๋ฅผ ์‚ฌ์šฉํ•˜๊ณ , ๋งˆ์ง€๋ง‰์ธต์€ sigmoid ํ™œ์„ฑํ™” ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•ด ๊ธ์ • ๋ฆฌ๋ทฐ์ผ ํ™•๋ฅ ์„ ์ถœ๋ ฅํ•  ๊ฒƒ์ด๋‹ค.

import tensorflow as tf

#๋ ˆ์ด์–ด ๊ตฌ์„ฑ
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(FREQUENCY_COUNT,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

Use binary_crossentropy as the loss function and the RMSprop optimizer for gradient descent.

#configure the training process
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.001),
    loss=tf.keras.losses.binary_crossentropy,
    metrics=[tf.keras.metrics.binary_accuracy]
    )

Train with a batch size of 512 for 10 epochs.

์ž, ์ด์ œ ํ•™์Šต์„ ์‹œ์ผœ ๋ชจ๋ธ์„ ๋งŒ๋“ค์–ด ๋ณด์ž! ๋จผ๊ฐ€ ์žˆ์–ด ๋ณด์ด๋Š” ์ง„ํ–‰๋ฅ  ์ƒํƒœ๋ฅผ ๋ณผ ์ˆ˜ ์žˆ๋‹ค.:)

#train on the training data
model.fit(x_train, y_train, epochs=10, batch_size=512)
Epoch 1/10
150000/150000 [==============================] - 17s 115us/sample - loss: 0.5611 - binary_accuracy: 0.6948
Epoch 2/10
150000/150000 [==============================] - 12s 83us/sample - loss: 0.5313 - binary_accuracy: 0.7125
Epoch 3/10
150000/150000 [==============================] - 13s 86us/sample - loss: 0.5236 - binary_accuracy: 0.7170
Epoch 4/10
150000/150000 [==============================] - 13s 89us/sample - loss: 0.5179 - binary_accuracy: 0.7219
Epoch 5/10
150000/150000 [==============================] - 13s 87us/sample - loss: 0.5132 - binary_accuracy: 0.7253
Epoch 6/10
150000/150000 [==============================] - 13s 87us/sample - loss: 0.5094 - binary_accuracy: 0.7285
Epoch 7/10
150000/150000 [==============================] - 12s 79us/sample - loss: 0.5064 - binary_accuracy: 0.7306
Epoch 8/10
150000/150000 [==============================] - 13s 85us/sample - loss: 0.5037 - binary_accuracy: 0.7321
Epoch 9/10
150000/150000 [==============================] - 13s 85us/sample - loss: 0.5015 - binary_accuracy: 0.7337
Epoch 10/10
150000/150000 [==============================] - 14s 91us/sample - loss: 0.4995 - binary_accuracy: 0.7362

<tensorflow.python.keras.callbacks.History at 0x1967af62ac8>

๋ชจ๋ธ ํ‰๊ฐ€

Now that training on the training data is done, let's evaluate the model on the test data.

results = model.evaluate(x_test, y_test)
50000/50000 [==============================] - 12s 234us/sample - loss: 0.5198 - binary_accuracy: 0.7184
#loss: 0.52, acc: 0.72
results
[0.5197769028568268, 0.71842]

Because we only used 100 tokens here, accuracy came out on the low side at about 71%. With 10,000 tokens instead of 100, you could likely see around 85% accuracy.

ํŒ์œผ๋กœ ํž˜๋“ค๊ฒŒ ๋งŒ๋“  ๋ชจ๋ธ์„ ์•„๋ž˜์™€ ๊ฐ™์ด ์ €์žฅํ•ด๋‘๊ณ  ๋‚˜์ค‘์— ๋กœ๋“œํ•ด์„œ ์‚ฌ์šฉํ• ์ˆ˜ ์žˆ์œผ๋‹ˆ ๊ผญ ์•Œ์•„๋‘์ž.

#๋ชจ๋ธ์„ ์ €์žฅํ•ด๋‘˜์ˆ˜๋„ ์žˆ๋‹ค.
model.save('movie_review_model.h5')

# ๋ชจ๋ธ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ
#from keras.models import load_model
#model = load_model('movie_review_model.h5')

Predicting results

Now let's write a function that takes a review string and predicts the result directly.

๋ฐ์ดํ„ฐ ํ˜•ํƒœ๋ฅผ ๋งž์ถ”๊ธฐ ์œ„ํ•ด np.expand_dims ๋งค์„œ๋“œ๋ฅผ ์ด์šฉํ•ด array์˜ ์ถ•์„ ํ™•์žฅ ์‹œ์ผœ์ฃผ์–ด์•ผ ํ•œ๋‹ค.

If the final probability is 0.5 or higher we'll predict positive, otherwise negative.

Let's do a rough test first...

review = "아주 재미 있어요"
token = tokenize(review)
token
['아주/Noun', '재미/Noun', '있다/Adjective']
tfq = term_frequency(token) #avoid the name 'tf', which would shadow the tensorflow module
data = np.expand_dims(np.asarray(tfq).astype('float32'), axis=0)
float(model.predict(data))
0.9102853536605835

Let's wrap the tested logic in a function.

def predict_review(review):
    token = tokenize(review)
    tfq = term_frequency(token)
    data = np.expand_dims(np.asarray(tfq).astype('float32'), axis=0)
    score = float(model.predict(data))
    if score > 0.5:
        print(f"{review} ==> positive ({round(score*100)}%)")
    else:
        print(f"{review} ==> negative ({round((1-score)*100)}%)")
predict_review("재미 정말 없어요")
재미 정말 없어요 ==> negative (93%)

We can now judge, to a reasonable degree, whether a review is positive or negative from the review text alone.

So far we've done sentiment analysis on movie review data; if you collect data containing user opinions on products, games, food, and so on, the same approach can be put to use in many different places.

@winuss
Hello :) Developer notes!