Editor's note:
This article first gives an overview of machine learning and then covers feature engineering: feature extraction, feature preprocessing, feature selection, and feature dimensionality reduction.
The article comes from CSDN and was edited and recommended by Linda of 火龙果软件.
1. Machine Learning Overview
1.1 What is machine learning
(1) Background

(2) Definition
Machine learning automatically analyzes data to discover patterns (a model) and uses those patterns to make predictions on unseen data.
(3) Explanation


1.2 Why machine learning
(1) Freeing up labor:
Intelligent customer service: works around the clock, 24 hours a day, without tiring
(2) Quantitative investment: reduces reliance on human traders
Solving specialized problems:
(3) Healthcare:
Helping doctors with computer-assisted diagnosis and treatment
(4) Providing convenience:
Alibaba ET City Brain: the smart city
1.3 Application scenarios of machine learning
Machine learning has a great many application scenarios; it has penetrated virtually every industry, with use cases in healthcare, aviation, education, logistics, e-commerce, and more.
(1) Mining and prediction:
Scenarios: store sales forecasting, quantitative investment, ad recommendation, enterprise customer segmentation, SQL-statement security detection and classification, etc.

(2) Images:
Scenarios: street traffic-sign detection, product image recognition, etc.
(3) Natural language processing:
Scenarios: text classification, sentiment analysis, chatbots, text detection, etc.
What matters most right now is mastering some machine learning algorithms and techniques, then picking a business domain to start solving problems in.
1.4 Learning frameworks and materials
(1) Learning frameworks

2. Feature Engineering
Learning objectives:
Understand the importance of feature engineering in machine learning
Apply sklearn to perform feature preprocessing
Apply sklearn to perform feature extraction
Apply sklearn to perform feature selection
Apply PCA to perform feature dimensionality reduction
Explain the difference between supervised and unsupervised learning
Explain the characteristics of classification and regression in supervised learning
Explain the two data types a machine learning target value can take
Describe the development workflow of machine learning (data mining)
2.1 Introduction to feature engineering
Small learning goals:
(1) Describe the structure of a machine learning training dataset
(2) Understand the importance of feature engineering in machine learning
(3) Know the categories of feature engineering
Learn with questions in mind: how do we extract patterns from historical data, and what format does that historical data take?
Feature engineering knowledge map:

2.1.1 Composition of a dataset
(1) Available datasets

(2) Composition of a dataset
Structure: feature values + target values (illustrated by the short sketch below)
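As an illustrative sketch (not from the original article), scikit-learn's built-in iris dataset shows this "feature values + target values" structure:

from sklearn.datasets import load_iris

# Load a small built-in dataset and inspect its structure
iris = load_iris()
print(iris.data.shape)     # feature values: (150 samples, 4 features)
print(iris.target.shape)   # target values: (150,)
print(iris.feature_names)  # names of the 4 features
print(iris.target_names)   # names of the 3 classes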


2.1.2 Why we need feature engineering (Feature Engineering)
Andrew Ng (吴恩达), one of the leading figures in machine learning, put it this way: "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering."
Note: a saying widely quoted in industry is that data and features determine the upper limit of machine learning, while models and algorithms merely approach that limit.
(1) Significance and role of feature engineering
Significance: it directly affects how well machine learning works
Role: filtering, processing, and selecting suitable features
(2) A simple example to build intuition:
To cook a good meal you need not only excellent cooking skills but also quality ingredients;
we go to the market and buy the ingredients the chef has chosen;
then we wash them, trim away the waste, and cut them up;
finally we sort them into containers — that whole preparation process corresponds to feature engineering in machine learning.
How much value those good ingredients end up delivering depends on what the chef does with them; the cooking itself is the algorithm-training process.
Through repeated practice the chef adjusts the method, the heat, the seasoning and so on, and finally distills everything into a recipe — that recipe is the model.
With the model in hand, many more people can cook a good meal. How nice!
(3) What feature engineering includes:
Feature extraction
Feature preprocessing
Feature selection
Feature dimensionality reduction
2.1.3 Tools needed for feature engineering
(1) Introduction to Scikit-learn

A machine learning toolkit for the Python language
Scikit-learn includes implementations of many well-known machine learning algorithms
Scikit-learn is well documented, easy to get started with, and offers a rich API
(2) Installation
conda install Scikit-learn==0.20  or
pip3 install Scikit-learn==0.20
After installation, you can check whether it succeeded with the command below.
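For example, a quick way to verify the install (a sketch, not from the original) is to print the installed version:

import sklearn

# Should print 0.20.x if the installation above succeeded
print(sklearn.__version__)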
Note: installing scikit-learn requires NumPy, pandas, and other libraries.
(3) What Scikit-learn contains
sklearn interfaces (the import sketch below maps these to submodules):
Classification, clustering, regression
Feature engineering
Model selection and tuning
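A minimal sketch — my own grouping, not from the original — of the scikit-learn submodules this article relies on:

# Submodules of scikit-learn used throughout this article
from sklearn import feature_extraction   # feature extraction (dict/text vectorizers)
from sklearn import preprocessing        # feature preprocessing (MinMaxScaler, StandardScaler)
from sklearn import feature_selection    # feature selection (VarianceThreshold)
from sklearn import decomposition        # dimensionality reduction (PCA)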
2.2 Feature extraction
Learning objectives
Apply DictVectorizer to turn categorical features into numerical, discretized features
Apply CountVectorizer to turn text features into numerical features
Apply TfidfVectorizer to turn text features into numerical features
Explain the difference between the two text feature extraction approaches
So what is feature extraction?


2.2.1 Feature extraction
(1) Feature extraction means converting arbitrary data (such as text or images) into numerical features that can be used for machine learning.
Note: turning data into feature values helps the computer understand the data better
Examples of feature extraction:
Dictionary feature extraction (feature discretization)
Text feature extraction
Image feature extraction (covered later with deep learning)
(2) Feature extraction API
sklearn.feature_extraction
2.2.2 Dictionary feature extraction
(1) Purpose: turn dictionary data into numerical feature values
sklearn.feature_extraction.DictVectorizer(sparse=True, …)
DictVectorizer.fit_transform(X)  X: a dict or an iterable of dicts; return value: a sparse matrix
DictVectorizer.inverse_transform(X)  X: an array or a sparse matrix; return value: the data in its pre-transform format
DictVectorizer.get_feature_names()  returns the list of feature (category) names
(2) Application:
We extract features from the following data:
[{'city': '北京','temperature':100},
 {'city': '上海','temperature':60},
 {'city': '深圳','temperature':30}]
(3) Workflow:
Instantiate the DictVectorizer class
Call the fit_transform method to feed in the data and transform it (mind the return format)

from sklearn.feature_extraction import DictVectorizer

def dict_vec():
    # Instantiate DictVectorizer (sparse=False returns a dense ndarray)
    transfer = DictVectorizer(sparse=False)
    # Three samples of feature data in dictionary form
    data = transfer.fit_transform([{'city': '北京', 'temperature': 100},
                                   {'city': '上海', 'temperature': 60},
                                   {'city': '深圳', 'temperature': 30}])
    # Print the feature names and the extracted feature matrix
    print(transfer.get_feature_names())
    print(data)
    return None

dict_vec()
Output:

['city=上海', 'city=北京', 'city=深圳', 'temperature']
[[  0.   1.   0. 100.]
 [  1.   0.   0.  60.]
 [  0.   0.   1.  30.]]
By default DictVectorizer returns a sparse matrix, to save memory.
Notice the result when the sparse=False argument is not passed:

['city=上海', 'city=北京', 'city=深圳', 'temperature']
  (0, 1)    1.0
  (0, 3)    100.0
  (1, 0)    1.0
  (1, 3)    60.0
  (2, 2)    1.0
  (2, 3)    30.0

This sparse representation is not what we want to look at here, so we add the argument to get the desired result.
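Alternatively — a small sketch, not from the original — you can keep the memory-friendly sparse default and densify it only when needed:

from sklearn.feature_extraction import DictVectorizer

# sparse=True (the default) returns a scipy sparse matrix
sparse_data = DictVectorizer().fit_transform([{'city': '北京', 'temperature': 100},
                                              {'city': '上海', 'temperature': 60},
                                              {'city': '深圳', 'temperature': 30}])
print(sparse_data.toarray())  # same dense matrix as with sparse=False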
When we discretized data with pandas we achieved a similar effect. The professional name for this data-processing trick is "one-hot" encoding; we explained earlier why this kind of discretization is needed — it is one way of analyzing data. For example (a pandas sketch follows):
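A minimal pandas sketch of the same one-hot idea (the DataFrame here is made up for illustration and is not from the original article):

import pandas as pd

# A small categorical column
df = pd.DataFrame({'city': ['北京', '上海', '深圳']})
# pd.get_dummies one-hot encodes the categorical column
print(pd.get_dummies(df, columns=['city']))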
(4)×ܽá:¶ÔÓÚÌØÕ÷µ±ÖдæÔÚÀà±ðÐÅÏ¢µÄÎÒÃǶ¼»á×öone-hot±àÂë´¦Àí
2.2.3Îı¾ÌØÕ÷ÌáÈ¡
(1)×÷Ó㺶ÔÎı¾Êý¾Ý½øÐÐÌØÕ÷Öµ»¯
sklearn.feature_extraction.text.
CountVectorizer(stop_words=[])·µ»Ø´ÊƵ¾ØÕó stop_words=[]
ÔÚÁбíÖÐÖ¸¶¨²»ÐèÒªµÄ´Ê½øÐйýÂË
CountVectorizer.fit_transform(X)
X:Îı¾»òÕß°üº¬Îı¾×Ö·û´®µÄ¿Éµü´ú¶ÔÏó ·µ»ØÖµ£º·µ»Øsparse¾ØÕó
CountVectorizer.inverse_transform(X)
X:arrayÊý×é»òÕßsparse¾ØÕó ·µ»ØÖµ:ת»»Ö®Ç°Êý¾Ý¸ñ
CountVectorizer.get_feature_names()
·µ»ØÖµ:µ¥´ÊÁбí
CountVectorizer:
µ¥´ÊÁбí:½«ËùÓÐÎÄÕµĵ¥´Êͳ¼Æµ½Ò»¸öÁÐ±íµ±ÖÐ(ÖØ¸´µÄ´ÎÖ»µ±×öÒ»´Î),ĬÈÏ»á¹ýÂ˵ôµ¥¸ö×Öĸ
¶ÔÓÚµ¥¸ö×Öĸ,¶ÔÎÄÕÂÖ÷ÌâûÓÐÓ°Ïì,µ¥´Ê¿ÉÒÔÓÐÓ°Ïì.
sklearn.feature_extraction.text.TfidfVectorizer
(2)Ó¦ÓÃ:
ÎÒÃǶÔÒÔÏÂÊý¾Ý½øÐÐÌØÕ÷ÌáÈ¡:
["life
is short,i like python",
"life is too long,i dislike python"]
|
(3) Workflow:
Instantiate the CountVectorizer class
Call the fit_transform method to feed in the data and transform it (mind the return format; use toarray() to convert the sparse matrix into an array)

from sklearn.feature_extraction.text import CountVectorizer

def countvec():
    # Instantiate CountVectorizer
    count = CountVectorizer()
    data = count.fit_transform(["life is is short,i like python",
                                "life is too long,i dislike python"])
    # Print the word list and the term-frequency matrix
    print(count.get_feature_names())
    print(data.toarray())
    return None

countvec()
Output:

['dislike', 'is', 'life', 'like', 'long', 'python', 'short', 'too']
[[0 2 1 1 0 1 1 0]
 [1 1 1 0 1 1 0 1]]
Question: what happens if we replace the data with Chinese?

data = count.fit_transform(["人生 苦短 我 喜欢 python",
                            "生活 太长了,我不 喜欢 python"])

The result we end up with is:

['python', '人生', '喜欢', '太长了', '我不', '生活', '苦短']
[[1 1 1 0 0 0 1]
 [1 0 1 1 1 1 0]]

Note: CountVectorizer cannot segment Chinese text on its own, and single Chinese characters are dropped.
Why do we get this result? On closer inspection, English words happen to be separated by spaces, which already amounts to tokenization — so for Chinese we have to perform word segmentation ourselves.
For Chinese word segmentation we use jieba.
(4) Word segmentation with jieba
jieba.cut() returns a generator of the segmented words
The jieba library needs to be installed first
Case study:

from sklearn.feature_extraction.text import CountVectorizer
import jieba

def cutword():
    '''
    Segment the three sentences into words
    :return:
    '''
    # Run the three sentences through jieba.cut
    content1 = jieba.cut("今天很残酷，明天更残酷，后天很美好，但绝对大部分是死在明天晚上，所以每个人不要放弃今天。")
    content2 = jieba.cut("我们看到的从很远星系来的光是在几百万年之前发出的，这样当我们看到宇宙时，我们是在看它的过去。")
    content3 = jieba.cut("如果只用一种方式了解某样事物，你就不会真正了解它。了解事物真正含义的秘密取决于如何将其与我们所了解的事物相联系。")
    # Join each generator into a space-separated string
    c1 = ' '.join(list(content1))
    c2 = ' '.join(list(content2))
    c3 = ' '.join(list(content3))
    return c1, c2, c3

def chvec():
    # Instantiate CountVectorizer, filtering out the listed stop words
    count = CountVectorizer(stop_words=['不要', '我们', '所以'])
    # Segment the sentences, then extract features
    c1, c2, c3 = cutword()
    data = count.fit_transform([c1, c2, c3])
    print(count.get_feature_names())
    print(data.toarray())
    return None

chvec()
Output:

Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
Loading model cost 0.604 seconds.
Prefix dict has been built succesfully.
['一种', '不会', '之前', '了解', '事物', '今天', '光是在', '几百万年', '发出', '取决于', '只用', '后天', '含义', '大部分', '如何', '如果', '宇宙', '放弃', '方式', '明天', '星系', '晚上', '某样', '残酷', '每个', '看到', '真正', '秘密', '绝对', '美好', '联系', '过去', '这样']
[[0 0 0 0 0 2 0 0 0 0 0 1 0 1 0 0 0 1 0 2 0 1 0 2 1 0 0 0 1 1 0 0 0]
 [0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 2 0 0 0 0 0 1 1]
 [1 1 0 4 3 0 0 0 0 1 1 0 1 0 1 1 0 0 1 0 0 0 1 0 0 0 2 1 0 0 1 0 0]]

How should we handle the situation where a word or phrase appears with a high count across many different documents?
2.2.4 Tf-idf text feature extraction
The main idea of TF-IDF: if a word or phrase appears with high probability in one document but rarely appears in other documents, it is considered to have good discriminating power between categories and is therefore well suited for classification.
Purpose of TF-IDF: to evaluate how important a term is to one document within a collection of documents or a corpus.
(1) Formula
Term frequency (tf) is the frequency with which a given word appears in the document in question.
Inverse document frequency (idf) is a measure of how much general importance a word carries. The idf of a particular word is obtained by dividing the total number of documents by the number of documents that contain the word, and then taking the base-10 logarithm of that quotient.
tfidf(w, d) = tf(w, d) × idf(w),  where idf(w) = lg(total number of documents / number of documents containing w)
The final value can be read as a measure of importance.
Note: suppose a document contains 100 words in total and the word "非常" appears 5 times; the term frequency of "非常" in that document is then 5/100 = 0.05. The inverse document frequency is computed from the total number of documents in the collection divided by the number of documents in which "非常" appears. So if "非常" appears in 10,000 documents and the collection contains 10,000,000 documents, its inverse document frequency is lg(10,000,000 / 10,000) = 3. The tf-idf score of "非常" for this document is therefore 0.05 × 3 = 0.15.
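A quick sketch (not from the original) that simply re-checks the arithmetic in the note above:

import math

tf = 5 / 100                           # "非常" appears 5 times in a 100-word document
idf = math.log10(10_000_000 / 10_000)  # 10,000 of the 10,000,000 documents contain the word
print(tf * idf)                        # ≈ 0.15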
(2) Case study

from sklearn.feature_extraction.text import TfidfVectorizer
import jieba

def cutword():
    '''
    Segment the three sentences into words
    :return:
    '''
    # Run the three sentences through jieba.cut
    content1 = jieba.cut("今天很残酷，明天更残酷，后天很美好，但绝对大部分是死在明天晚上，所以每个人不要放弃今天。")
    content2 = jieba.cut("我们看到的从很远星系来的光是在几百万年之前发出的，这样当我们看到宇宙时，我们是在看它的过去。")
    content3 = jieba.cut("如果只用一种方式了解某样事物，你就不会真正了解它。了解事物真正含义的秘密取决于如何将其与我们所了解的事物相联系。")
    # Join each generator into a space-separated string
    c1 = ' '.join(list(content1))
    c2 = ' '.join(list(content2))
    c3 = ' '.join(list(content3))
    return c1, c2, c3

def tfidfvec():
    # Instantiate TfidfVectorizer
    tfidf = TfidfVectorizer()
    # Segment the three sentences
    c1, c2, c3 = cutword()
    # Extract tf-idf features from the three documents
    data = tfidf.fit_transform([c1, c2, c3])
    print(tfidf.get_feature_names())
    print(data.toarray())
    return None

tfidfvec()
(3) Why Tf-idf matters
It is a common early-stage data-processing step when classification algorithms are used for document classification.
The category of a document is judged from how important the different words in it are.
2.3 Feature preprocessing
Learning objectives:
Understand the characteristics of numerical data and categorical data
Apply MinMaxScaler to normalize feature data
Apply StandardScaler to standardize feature data
2.3.1 What is feature preprocessing

# scikit-learn's own description
provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators.
In other words: the process of converting features, via transformation functions, into a representation that better suits the algorithm or model.
This can be understood through the figure above.
(1) What it includes
Dimensionless scaling of numerical data:
Normalization
Standardization
(2) Feature preprocessing API
sklearn.preprocessing
Why do we normalize/standardize?
When features differ greatly in units or magnitude, or when one feature's variance is several orders of magnitude larger than the others', that feature can easily dominate the target result and prevent some algorithms from learning anything from the other features.
Dating-profile data (used in the computation below)

We therefore need dimensionless scaling methods that bring data of different scales onto the same scale.
2.3.2 Normalization
(1) Definition
Transform the original data so that it is mapped into a range (by default [0, 1]).
(2) Formula
X' = (x - min) / (max - min),   X'' = X' * (mx - mi) + mi
Applied column by column: max and min are the maximum and minimum of the column, and (mi, mx) are the bounds of feature_range (default (0, 1)).
So how do we make sense of this process? Let's walk through an example.

(3) API
sklearn.preprocessing.MinMaxScaler(feature_range=(0,1), …)
MinMaxScaler.fit_transform(X)
X: data in numpy array format, [n_samples, n_features]
Return value: the transformed array, with the same shape
(4) Computation on data
We run the computation on the following data:
milage,Liters,Consumtime,target
40920,8.326976,0.953952,3
14488,7.153469,1.673904,2
26052,1.441871,0.805124,1
75136,13.147394,0.428964,1
Analysis:
1. Instantiate MinMaxScaler
2. Transform with fit_transform
Note: only the first three columns (the feature values) are used; the last column is the target value.
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

def minmaxscalar():
    """
    Normalize the dating-profile data
    :return:
    """
    # Read the data and select the features to process
    dating = pd.read_csv("./data/dating.txt")
    data = dating[['milage', 'Liters', 'Consumtime']]
    # Instantiate MinMaxScaler and call fit_transform
    # mm = MinMaxScaler(feature_range=(2, 3))  # would map the data into [2, 3]
    mm = MinMaxScaler()  # default feature_range=(0, 1)
    data = mm.fit_transform(data)
    print(data)
    print(data.shape)
    return None

minmaxscalar()
Output:

[[0.44832535 0.39805139 0.56233353]
 [0.15873259 0.34195467 0.98724416]
 [0.28542943 0.06892523 0.47449629]
 ...
 [0.29115949 0.50910294 0.51079493]
 [0.52711097 0.43665451 0.4290048 ]
 [0.47940793 0.3768091  0.78571804]]
(1000, 3)
ÎÊÌ⣺Èç¹ûÊý¾ÝÖÐÒì³£µã½Ï¶à£¬»áÓÐʲôӰÏ죿

(5)¹éÒ»»¯×ܽá
×¢Òâ×î´óÖµ×îСֵÊDZ仯µÄ£¬ÁíÍ⣬×î´óÖµÓë×îСֵ·Ç³£ÈÝÒ×ÊÜÒì³£µãÓ°Ï죬ËùÒÔ¹éÒ»»¯·½·¨Â³°ôÐԽϲֻÊʺϴ«Í³¾«È·Ð¡Êý¾Ý³¡¾°¡£
Ôõô°ì?
2.3.3 ±ê×¼»¯
(1)¶¨Òå
ͨ¹ý¶ÔÔʼÊý¾Ý½øÐб任°ÑÊý¾Ý±ä»»µ½¾ùֵΪ0,±ê×¼²îΪ1·¶Î§ÄÚ
(2)¹«Ê½

×÷ÓÃÓÚÿһÁУ¬meanΪƽ¾ùÖµ£¬¦ÒΪ±ê×¼²î
ËùÒԻص½¸Õ²ÅÒì³£µãµÄµØ·½£¬ÎÒÃÇÔÙÀ´¿´¿´±ê×¼»¯.

(1)¶ÔÓÚ¹éÒ»»¯À´Ëµ£ºÈç¹û³öÏÖÒì³£µã£¬Ó°ÏìÁË×î´óÖµºÍ×îСֵ£¬ÄÇô½á¹ûÏÔÈ»»á·¢Éú¸Ä±ä
(2)¶ÔÓÚ±ê×¼»¯À´Ëµ£ºÈç¹û³öÏÖÒì³£µã£¬ÓÉÓÚ¾ßÓÐÒ»¶¨Êý¾ÝÁ¿£¬ÉÙÁ¿µÄÒì³£µã¶ÔÓÚÆ½¾ùÖµµÄÓ°Ïì²¢²»´ó£¬´Ó¶ø·½²î¸Ä±ä½ÏС¡£
(3)API
sklearn.preprocessing.StandardScaler( )
´¦ÀíÖ®ºóÿÁÐÀ´ËµËùÓÐÊý¾Ý¶¼¾Û¼¯ÔÚ¾ùÖµ0¸½½ü;±ê×¼²î²îΪ1
StandardScaler.fit_transform(X)
X:numpy array¸ñʽµÄÊý¾Ý[n_samples,n_features]
·µ»ØÖµ£º×ª»»ºóµÄÐÎ×´ÏàͬµÄarray
(4) Computation on data

from sklearn.preprocessing import StandardScaler
import pandas as pd

def stdscalar():
    """
    Standardize the dating-profile data
    :return:
    """
    # Read the data and select the features to process
    dating = pd.read_csv("./data/dating.txt")
    data = dating[['milage', 'Liters', 'Consumtime']]
    # Instantiate StandardScaler and call fit_transform
    std = StandardScaler()
    data = std.fit_transform(data)
    print(data)
    return None

stdscalar()
Output:

[[ 0.33193158  0.41660188  0.24523407]
 [-0.87247784  0.13992897  1.69385734]
 [-0.34554872 -1.20667094 -0.05422437]
 ...
 [-0.32171752  0.96431572  0.06952649]
 [ 0.65959911  0.60699509 -0.20931587]
 [ 0.46120328  0.31183342  1.00680598]]
(5)×ܽá
±ê×¼»¯·½·¨,ÔÚÒÑÓÐÑù±¾×ã¹»¶àµÄÇé¿öϱȽÏÎȶ¨£¬ÊʺÏÏÖ´úàÐÔÓ´óÊý¾Ý³¡¾°¡£
´¦Àíºó,ÿÁÐÊý¾Ý¾ùֵΪ0
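A small sketch (not from the original) to verify that claim on the array returned by the code above:

# 'data' is the array returned by StandardScaler().fit_transform(...) in the snippet above
print(data.mean(axis=0))  # each column's mean is approximately 0
print(data.std(axis=0))   # each column's standard deviation is approximately 1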
2.4 Feature selection
Learning objectives:
Know the three kinds of feature selection: embedded, filter, and wrapper methods
Apply VarianceThreshold to remove low-variance features
Understand the characteristics and calculation of the correlation coefficient
Apply correlation coefficients to perform feature selection
Before introducing feature selection, let's first introduce the concept of dimensionality reduction.
2.4.1 Dimensionality reduction
Dimensionality reduction is the process of reducing the number of random variables (features), under certain constraints, to obtain a set of "uncorrelated" principal variables.

2.4.2 The two approaches to dimensionality reduction
(1) Feature selection
(2) Principal component analysis (which can be understood as a form of feature extraction)
2.4.3 What is feature selection
(1) Definition
The data contains redundant or irrelevant variables (also called features, attributes, indicators, and so on); feature selection aims to find the main features among the original ones.

(2) Methods
Filter, embedded, and wrapper methods; the following sections cover the filter methods (low-variance filtering and the correlation coefficient).
(3) Module
sklearn.feature_selection
2.4.4 Filter methods
2.4.4.1 Low-variance feature filtering
Remove features whose variance is low, using the magnitude of the variance to decide whether a feature is worth keeping.
(Variance expresses how spread out the data is)
Small feature variance: most samples have roughly the same value for this feature
Large feature variance: the samples' values for this feature differ a lot
(1) API
sklearn.feature_selection.VarianceThreshold(threshold=0.0)
Removes all low-variance features
VarianceThreshold.fit_transform(X)
X: data in numpy array format, [n_samples, n_features]
Return value: features whose variance in the training set is below threshold are removed. The default keeps all features with non-zero variance, i.e. it removes features that have the same value in every sample. The larger threshold is, the more features are removed.
(2) Computation on data
We run the computation on the following data:
[[0, 2, 0, 3],
 [0, 1, 4, 3],
 [0, 1, 1, 3]]
(3) Analysis
1. Initialize VarianceThreshold and specify the variance threshold
2. Call fit_transform

from sklearn.feature_selection import VarianceThreshold

def varthreshold():
    """
    Remove all low-variance features
    :return:
    """
    var = VarianceThreshold(threshold=0.0)  # remove features whose variance is 0
    data = var.fit_transform([[0, 2, 0, 3],
                              [0, 1, 4, 3],
                              [0, 1, 1, 3]])
    print(data)
    return None

varthreshold()

Output:

[[2 0]
 [1 4]
 [1 1]]

The first and fourth columns are constant across the three samples (variance 0), so they are removed.
2.4.4.2 Correlation coefficient
Pearson correlation coefficient (Pearson Correlation Coefficient)
A statistic that reflects how closely two variables are related
(1) Pearson correlation coefficient (no need to memorize it)
r = (n·Σxy − Σx·Σy) / ( sqrt(n·Σx² − (Σx)²) · sqrt(n·Σy² − (Σy)²) )
For example, suppose we compute the correlation between annual advertising spend and average monthly sales.
How is the correlation coefficient between them calculated? Plug both series into the formula above: the numerator is the covariance of the two variables and the denominator is the product of their standard deviations.
The final result is a coefficient close to +1, so we conclude that advertising spend and average monthly sales are highly positively correlated.
(2) Characteristics
The coefficient lies between -1 and +1; values near +1 or -1 indicate a strong positive or negative correlation, and values near 0 indicate little linear correlation.
(3) API
from scipy.stats import pearsonr
# x: (N,) array_like
# y: (N,) array_like
Returns: (Pearson's correlation coefficient, p-value)
# the first element of the returned tuple is the Pearson correlation coefficient
(4) Numerical analysis

import pandas as pd
from scipy.stats import pearsonr

data = pd.read_csv('./data/factor_returns.csv')
factor = ['pe_ratio', 'pb_ratio', 'market_cap',
          'return_on_asset_net_profit', 'du_return_on_equity', 'ev',
          'earnings_per_share', 'revenue', 'total_expense']
# Pearson correlation for every pair of indicators
pairs = [(factor[i], factor[j + 1], pearsonr(data[factor[i]], data[factor[j + 1]])[0])
         for i in range(len(factor))
         for j in range(i, len(factor) - 1)]
for pair in pairs:
    print("The correlation between indicator {} and indicator {} is {}".format(*pair))
2.5 Feature dimensionality reduction
Objectives:
Apply PCA to perform feature dimensionality reduction
Application:
Principal component analysis between users and item categories
2.5.1 What is principal component analysis (PCA)
Definition: the process of converting high-dimensional data into low-dimensional data; along the way some of the original data may be discarded and new variables created.
Purpose: data dimensionality compression — reduce the dimensionality (complexity) of the original data as much as possible while losing only a small amount of information.
Application: within regression analysis or cluster analysis.
How can we get a better feel for this process? Let's look at an example:

(1) Worked example
Suppose we are given the following 5 points:
(-1, -2)
(-1,  0)
( 0,  0)
( 2,  1)
( 0,  1)

Requirement: reduce this two-dimensional data to one dimension while losing as little information as possible.

How is this computed? We find a suitable line and obtain the PCA result through a matrix operation (a small sklearn sketch follows below):
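A minimal sketch (not part of the original worked example) that lets sklearn perform that matrix computation on the 5 points:

from sklearn.decomposition import PCA

points = [[-1, -2], [-1, 0], [0, 0], [2, 1], [0, 1]]
# Project the 2-D points onto their single main direction (one component)
pca = PCA(n_components=1)
print(pca.fit_transform(points))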
(2) API
sklearn.decomposition.PCA(n_components=None)
Decomposes the data into a lower-dimensional space
n_components:
a decimal: the fraction of information (variance) to retain
an integer: the number of features to reduce to
PCA.fit_transform(X)  X: data in numpy array format, [n_samples, n_features]
Return value: the array transformed to the specified number of dimensions
(3) Numerical analysis
Let's start with a simple example:
reduce the dimensionality of the following matrix of 3 samples and 4 features:
[[2, 8, 4, 5],
 [6, 3, 0, 8],
 [5, 4, 9, 1]]

from sklearn.decomposition import PCA

def pca():
    """
    Dimensionality reduction with principal component analysis
    :return:
    """
    pca = PCA(n_components=0.7)  # retain 70% of the information
    data = pca.fit_transform([[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]])
    print(data)
    return None

pca()

Output:

[[ 1.22879107e-15  3.82970843e+00]
 [ 5.74456265e+00 -1.91485422e+00]
 [-5.74456265e+00 -1.91485422e+00]]
(4) Application scenarios
When the number of features is very large (hundreds of features): PCA compresses away the correlated, redundant information
Creating new variables (new features): for example, in stock data, the two highly correlated indicators revenue and total_expense can be compressed into a single new indicator (feature), as sketched below
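A rough sketch of that idea, assuming the factor_returns.csv file used in the correlation example above:

from sklearn.decomposition import PCA
import pandas as pd

data = pd.read_csv('./data/factor_returns.csv')
# Compress the two highly correlated columns into a single principal component
new_feature = PCA(n_components=1).fit_transform(data[['revenue', 'total_expense']])
print(new_feature[:5])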
2.5.2 Case study: reducing the dimensionality of users' preferences for item categories
Kaggle project (Instacart):

(1) The data:
products.csv: product information
Fields: product_id, product_name, aisle_id, department_id
order_products__prior.csv: order-product information
Fields: order_id, product_id, add_to_cart_order, reordered
orders.csv: users' order information
Fields: order_id, user_id, eval_set, order_number, …
aisles.csv: the specific item category (aisle) each product belongs to
Fields: aisle_id, aisle
(2) Requirement

(3) Analysis
Merge the tables so that user_id and aisle end up in one table
Build a cross-tabulation (crosstab)
Apply dimensionality reduction
(4) Complete code

from sklearn.decomposition import PCA
import pandas as pd

def pca():
    """
    Dimensionality reduction with principal component analysis
    :return:
    """
    # Load the four tables
    prior = pd.read_csv("./data/instacart/order_products__prior.csv")
    products = pd.read_csv("./data/instacart/products.csv")
    orders = pd.read_csv("./data/instacart/orders.csv")
    aisles = pd.read_csv("./data/instacart/aisles.csv")
    # Merge the four tables into one
    # on specifies the key the two tables share (inner join)
    mt = pd.merge(prior, products, on='product_id')
    mt1 = pd.merge(mt, orders, on='order_id')
    mt2 = pd.merge(mt1, aisles, on='aisle_id')
    # Cross-tabulation: pd.crosstab counts how often each user bought from each aisle
    user_aisle_cross = pd.crosstab(mt2['user_id'], mt2['aisle'])
    # PCA, retaining 95% of the information
    pc = PCA(n_components=0.95)
    data = pc.fit_transform(user_aisle_cross)
    print(data)

pca()

Output of the mt2 table:

Output of the crosstab:

Output after dimensionality reduction: