Introduction to the Deep Learning Framework Keras, with a Hands-On Example
 
 2019-7-9 
 
Editor's note:

This article comes from cnblogs. It covers the Dense (fully connected) layer, the Embedding layer, the LSTM layer, data preprocessing (text preprocessing and sequence preprocessing), data loading, data cleaning, and building a network with Keras.

Keras is a high-level neural network API written in Python that can run on top of TensorFlow, CNTK, or Theano. Keras is developed with a focus on enabling fast experimentation: being able to go from idea to result with the least possible delay is key to doing good research.

This article uses the Kaggle project "IMDB movie review sentiment analysis" as an example to show how to build a neural network with Keras and apply it to a real problem. A basic understanding of neural networks is assumed.

ÎÄÕ·ÖΪÁ½¸ö²¿·Ö:

KerasÖеÄһЩ»ù±¾¸ÅÄî.ApiÓ÷¨.ÎÒ»á¸ø³öһЩ¼òµ¥µÄʹÓÃÑùÀý,»òÊǸø³öÏà¹ØÖªÊ¶Á´½Ó.

A hands-on IMDB review sentiment analysis, using only the concepts covered in the first part.

Model

Dense (fully connected) layer

keras.layers.core.Dense(units, activation=None, use_bias=True, kernel_initializer='glorot_uniform', bias_initializer='zeros', kernel_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, bias_constraint=None)

 

# as first layer in a sequential model:
model = Sequential()
model.add(Dense(32, input_shape=(16,)))
# now the model will take as input arrays of shape (*, 16)
# and output arrays of shape (*, 32)
# after the first layer, you don't need to specify
# the size of the input anymore:
model.add(Dense(32))

Embedding layer

keras.layers.embeddings.Embedding(input_dim, output_dim, embeddings_initializer='uniform', embeddings_regularizer=None, activity_regularizer=None, embeddings_constraint=None, mask_zero=False, input_length=None)


This is essentially word-to-vector: the layer produces a word-vector representation of the text.

input_dim: the size of the vocabulary, i.e. the number of distinct words.

output_dim: the dimensionality of the vector each word is mapped to.

input_length: the number of words in each sentence.

For example, the line below means: we feed in an M×50 matrix in which there are 200 distinct words, and we want to map each word to a 32-dimensional vector. The layer returns a tensor of shape (M, 50, 32).

Each sentence has 50 words, each word becomes a 32-dimensional vector, and there are M sentences, so e.shape = (M, 50, 32).

e = Embedding(200, 32, input_length=50)
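
As a minimal sketch added here for illustration (M = 10 and the random integer input are made-up values, not from the article), the shape claim above can be checked like this:

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

M = 10                                           # number of sentences (hypothetical)
model = Sequential()
model.add(Embedding(200, 32, input_length=50))   # vocabulary of 200 words, 32-dim vectors, 50 words per sentence
x = np.random.randint(200, size=(M, 50))         # M sentences, each a row of 50 word indices
print(model.predict(x).shape)                    # (10, 50, 32)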

LSTM layer

LSTM is a special kind of recurrent neural network.

Simply put, the networks we have discussed so far, including CNNs, map input to output in one pass and do not take sequence relationships into account. But the meaning of a word depends on its context. In the Chinese sentence "我用着小米手机,吃着小米粥" ("I'm using a Xiaomi phone while eating millet porridge"), the two occurrences of "小米" clearly do not mean the same thing, so semantic analysis has to consider context; recurrent neural networks (RNNs) are built for exactly this. Or take "This movie is of high quality, but I don't like it": the sentence contains both a positive and a negative statement, and an LSTM, which takes context into account, can recognize that what follows "but" is the point the speaker really wants to make.

keras.layers.recurrent.LSTM(units, activation='tanh', recurrent_activation='hard_sigmoid', use_bias=True, kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', bias_initializer='zeros', unit_forget_bias=True, kernel_regularizer=None, recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, recurrent_constraint=None, bias_constraint=None, dropout=0.0, recurrent_dropout=0.0)
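
A minimal sketch, added for illustration, of how an LSTM layer is typically stacked on top of an Embedding layer (all sizes here are assumptions, not values from the article):

from keras.models import Sequential
from keras.layers import Embedding, LSTM

model = Sequential()
model.add(Embedding(200, 32, input_length=50))   # output shape (batch, 50, 32)
model.add(LSTM(16))                              # only the last hidden state: (batch, 16)
# with return_sequences=True the LSTM would instead output (batch, 50, 16)
model.summary()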

Pooling layers

keras.layers.pooling.GlobalMaxPooling1D()  # global max pooling over the time dimension; see https://stackoverflow.com/questions/43728235/what-is-the-difference-between-keras-maxpooling1d-and-globalmaxpooling1d-functi

input: a 3D tensor of shape (samples, steps, features)

output: a 2D tensor of shape (samples, features)

keras.layers.pooling.MaxPooling1D(pool_size=2, strides=None, padding='valid')

keras.layers.pooling.MaxPooling2D(pool_size=(2, 2), strides=None, padding='valid', data_format=None)

keras.layers.pooling.MaxPooling3D(pool_size=(2, 2, 2), strides=None, padding='valid', data_format=None)

....
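
A small sketch, added for illustration with made-up shapes, contrasting MaxPooling1D and GlobalMaxPooling1D on the same input:

import numpy as np
from keras.models import Sequential
from keras.layers import MaxPooling1D, GlobalMaxPooling1D

x = np.random.rand(4, 10, 8)                               # (samples, steps, features)

m1 = Sequential([MaxPooling1D(pool_size=2, input_shape=(10, 8))])
print(m1.predict(x).shape)                                 # (4, 5, 8): pools over time, keeps a shorter time axis

m2 = Sequential([GlobalMaxPooling1D(input_shape=(10, 8))])
print(m2.predict(x).shape)                                 # (4, 8): one maximum per feature over the whole sequence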

Data preprocessing

Text preprocessing

keras.preprocessing.text.text_to_word_sequence(text, filters=base_filter(), lower=True, split=" ")

keras.preprocessing.text.one_hot(text, n, filters=base_filter(), lower=True, split=" ")
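
A quick sketch, added here for illustration (the sentence is made up), of what these two helpers return; note that one_hot hashes each word into [1, n), so the indices are not guaranteed to be unique:

from keras.preprocessing.text import text_to_word_sequence, one_hot

sentence = "I love that girl"
print(text_to_word_sequence(sentence))   # ['i', 'love', 'that', 'girl']
print(one_hot(sentence, n=50))           # e.g. [4, 21, 9, 33] -- hashed indices, output may vary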

keras.preprocessing.text.Tokenizer(num_words=None, filters=base_filter(), lower=True, split=" ")

Tokenizer is a class for vectorizing text, or for converting text into sequences (lists of each word's index in the dictionary, counting from 1).

num_words: None or an integer, the maximum number of words to process. If set to an integer, the tokenizer is restricted to the num_words most common words in the dataset.

No matter what num_words is, the dictionary built by fit_on_texts is the same and every word gets an index; only the result of texts_to_sequences differs.

Only the indices of the (num_words - 1) most frequent words are used to represent a sentence.

Note that the converted X_t therefore changes with num_words: only the num_words - 1 most frequent dictionary words are kept, so particularly rare words in a sentence are filtered out. For example, for a sentence "x y z", if y and z are not among the top num_words - 1 most frequent words in the dictionary, the sentence's final vector form is just [x_index_in_dic] (see the small sketch after the example below).

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

t1 = "i love that girl"
t2 = 'i hate u'
texts = [t1, t2]
tokenizer = Tokenizer(num_words=None)
tokenizer.fit_on_texts(texts)  # build the dictionary: each word gets an index
print(tokenizer.word_counts)   # OrderedDict([('i', 2), ('love', 1), ('that', 1), ('girl', 1), ('hate', 1), ('u', 1)])
print(tokenizer.word_index)    # {'i': 1, 'love': 2, 'that': 3, 'girl': 4, 'hate': 5, 'u': 6}
print(tokenizer.word_docs)     # {'i': 2, 'love': 1, 'that': 1, 'girl': 1, 'u': 1, 'hate': 1}
print(tokenizer.index_docs)    # {1: 2, 2: 1, 3: 1, 4: 1, 6: 1, 5: 1}
tokennized_texts = tokenizer.texts_to_sequences(texts)
print(tokennized_texts)        # [[1, 2, 3, 4], [1, 5, 6]] each word is represented by its index

X_t = pad_sequences(tokennized_texts, maxlen=None)  # convert to a 2D array (matrix): every text gets maxlen words, padded positions are 0
print(X_t)  # [[1 2 3 4] [0 1 5 6]]
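
As mentioned above, here is a small sketch (added for illustration, reusing the two sentences from the example) of how a small num_words filters rare words out of the converted sequences:

tokenizer3 = Tokenizer(num_words=3)            # keep only the (3 - 1) = 2 most frequent words
tokenizer3.fit_on_texts(texts)
print(tokenizer3.texts_to_sequences(texts))    # [[1, 2], [1]] -- only 'i' and 'love' survive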

Sequence preprocessing

keras.preprocessing.sequence.pad_sequences(sequences, maxlen=None, dtype='int32', padding='pre', truncating='pre', value=0.)

Returns a rank-2 tensor (a 2D array).
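
A brief sketch with made-up values showing the default 'pre' padding and truncating behavior:

from keras.preprocessing.sequence import pad_sequences

seqs = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
print(pad_sequences(seqs, maxlen=4))
# [[ 0  1  2  3]
#  [ 0  0  4  5]
#  [ 7  8  9 10]]  shorter sequences are padded on the left, longer ones truncated from the left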

keras.preprocessing.sequence.skipgrams(sequence, vocabulary_size, window_size=4, negative_samples=1., shuffle=True, categorical=False, sampling_table=None)
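
A hedged sketch (the input sequence is made up) of what skipgrams returns: (target, context) index pairs with a label of 1 for true pairs and 0 for negatively sampled ones:

from keras.preprocessing.sequence import skipgrams

couples, labels = skipgrams([1, 2, 3, 4], vocabulary_size=5, window_size=1)
print(couples[:3], labels[:3])   # e.g. [[2, 3], [3, 1], [2, 5]] [1, 1, 0] -- output is shuffled and random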

keras.preprocessing.sequence.make_sampling_table(size, sampling_factor=1e-5)

Keras in practice: IMDB movie review sentiment analysis

The dataset

labeledTrainData.tsv / imdb_master.csv: movie review training data, already labeled as positive/negative reviews.

testData.tsv: test set; we need to predict whether each review is positive or negative.

Main steps

Read the data

Clean the data: mainly removing stop words, HTML tags, and punctuation

Build the model

Embedding layer: converts words to vectors

LSTM

Pooling layer: extracts the most important features

Fully connected layer: classification

Data loading

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

df_train = pd.read_csv("./dataset/word2vec-nlp-tutorial/labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
df_train1 = pd.read_csv("./dataset/imdb-review-dataset/imdb_master.csv", encoding="latin-1")
df_train1 = df_train1.drop(["type", 'file'], axis=1)
df_train1.rename(columns={'label': 'sentiment',
                          'Unnamed: 0': 'id',
                          'review': 'review'},
                 inplace=True)
df_train1 = df_train1[df_train1.sentiment != 'unsup']                      # drop unlabeled rows
df_train1['sentiment'] = df_train1['sentiment'].map({'pos': 1, 'neg': 0})  # map labels to 1/0
new_train = pd.concat([df_train, df_train1])
# load the test set used below (the path is assumed to mirror the train file)
df_test = pd.read_csv("./dataset/word2vec-nlp-tutorial/testData.tsv", header=0, delimiter="\t", quoting=3)

Data cleaning

Use bs4 (BeautifulSoup) to strip the HTML

Keep only words (letters)

Remove stop words

import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

def review_to_words(raw_review):
    review_text = BeautifulSoup(raw_review, 'lxml').get_text()    # strip HTML tags
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)          # keep letters only
    words = letters_only.lower().split()
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if not w in stops]       # remove stop words
    return " ".join(meaningful_words)

new_train['review'] = new_train['review'].apply(review_to_words)
df_test["review"] = df_test["review"].apply(review_to_words)

Building the network with Keras

Converting the text to a matrix

- Tokenizer is fitted on a list of sentences to build the dictionary; each word is then replaced by its index in the dictionary, giving a numeric matrix.

- pad_sequences pads with zeros so that every row of the matrix has the same length, i.e. every sentence has the same number of words.

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

list_classes = ["sentiment"]
y = new_train[list_classes].values
print(y.shape)
list_sentences_train = new_train["review"]
list_sentences_test = df_test["review"]

max_features = 6000
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test)

print(len(tokenizer.word_index))

totalNumWords = [len(one_comment) for one_comment in list_tokenized_train]
print(max(totalNumWords), sum(totalNumWords) / len(totalNumWords))

maxlen = 400
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)
X_te = pad_sequences(list_tokenized_test, maxlen=maxlen)

 

 

Ä£Ð͹¹½¨

´ÊתÏòÁ¿

from keras.models import Model
from keras.layers import Input, Embedding, LSTM, GlobalMaxPool1D, Dropout, Dense

inp = Input(shape=(maxlen, ))
print(inp.shape)   # (?, 400) -- 400 words per sentence
embed_size = 128   # each word becomes a 128-dimensional vector
x = Embedding(max_features, embed_size)(inp)
print(x.shape)     # (?, 400, 128)

LSTM with 60 units

GlobalMaxPool1D: effectively keeps only the strongest output of each unit over the whole sequence

Dropout: discards part of the outputs, adding regularization to prevent overfitting

Dense: fully connected layer

When compiling the model, specify the loss function, the optimizer, and the evaluation metric.

x = LSTM(60, return_sequences=True, name='lstm_layer')(x)
print(x.shape)
x = GlobalMaxPool1D()(x)
print(x.shape)
x = Dropout(0.1)(x)
print(x.shape)
x = Dense(50, activation="relu")(x)
print(x.shape)
x = Dropout(0.1)(x)
print(x.shape)
x = Dense(1, activation="sigmoid")(x)
print(x.shape)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

 

Ä£ÐÍѵÁ·

batch_size = 32
epochs = 2
print(X_t.shape,y.shape)
model.fit(X_t,y, batch_size=batch_size, epochs=epochs, validation_split=0.2)

Making predictions with the model

prediction = model.predict(X_te)
y_pred = (prediction > 0.5)
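
To turn these boolean predictions into a Kaggle-style submission file, here is a hedged sketch (assuming df_test keeps the 'id' column from testData.tsv; the output file name is arbitrary):

submission = pd.DataFrame({"id": df_test["id"], "sentiment": y_pred.astype(int).ravel()})
submission.to_csv("submission.csv", index=False)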

 

 

   