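Editor's recommendation:
This article comes from 云社区 (a cloud community). It explains 7 methods for cleaning data with Python and covers the problems that come up most often in data cleaning. We hope it helps with your study.

Introduction: Data cleaning is an essential step in data analysis. During analysis, a lot of data will not meet the requirements of the analysis, for example duplicated, erroneous, missing, or abnormal data.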

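01 Handling duplicate values
Duplicate data can be produced both when data is entered and when data sets are merged. Deleting the duplicates directly is the main way of handling them. pandas provides the methods duplicated and drop_duplicates for viewing and removing duplicate data. Take the following data as an example:
>import numpy as np
>import pandas as pd
>sample = pd.DataFrame({'id':[1,1,1,3,4,5],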
'name':['Bob','Bob','Mark','Miki','Sully','Rose'],
'score':[99,99,87,77,77,np.nan],
'group':[1,1,1,2,1,2],})
>sample
group id name score
0 1 1 Bob 99.0
1 1 1 Bob 99.0
2 1 1 Mark 87.0
3 2 3 Miki 77.0
4 1 4 Sully 77.0
5 2 5 Rose NaN
Duplicate data is found with the duplicated method. As shown below, it can be used to view the duplicated rows.
>sample[sample.duplicated()]
group id name score
1 1 1 Bob 99.0
When duplicates need to be removed, use the drop_duplicates method:
>sample.drop_duplicates()
group id name score
0 1 1 Bob 99.0
2 1 1 Mark 87.0
3 2 3 Miki 77.0
4 1 4 Sully 77.0
5 2 5 Rose NaN
drop_duplicates can also deduplicate on a given column. For example, to keep only the first record for each value of id and drop the rest:
>sample.drop_duplicates('id')
group id name score
0 1 1 Bob 99.0
3 2 3 Miki 77.0
4 1 4 Sully 77.0
5 2 5 Rose NaN
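As a further illustration, duplicated and drop_duplicates both accept subset and keep arguments, which control which columns are compared and which copy survives. The sketch below rebuilds the sample table from above; the variable names dup_mask, last_kept and none_kept are illustrative only.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({'id': [1, 1, 1, 3, 4, 5],
                       'name': ['Bob', 'Bob', 'Mark', 'Miki', 'Sully', 'Rose'],
                       'score': [99, 99, 87, 77, 77, np.nan],
                       'group': [1, 1, 1, 2, 1, 2]})

# Flag every row whose id has already appeared (the first occurrence is not flagged).
dup_mask = sample.duplicated(subset='id')

# Keep the last row for each id instead of the first.
last_kept = sample.drop_duplicates(subset='id', keep='last')

# keep=False drops every row whose id occurs more than once.
none_kept = sample.drop_duplicates(subset='id', keep=False)
```

Which copy to keep is a business decision; keep='first' is the default and matches the output shown above.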
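02 Handling missing values
Missing values are a common problem in data cleaning. They are usually represented by NA, and certain principles should be followed when handling them.
First, handle missing values based on an understanding of the business: work out whether the values are missing deliberately or at random, then fill them in using business experience. In general, when fewer than 20% of values are missing, a continuous variable can be filled with the mean or the median. A categorical variable does not need filling, since the missing values can simply be treated as a category of their own; alternatively it can be filled with the mode.
When between 20% and 80% of values are missing, the filling methods are the same as above. In addition, an indicator dummy variable can be generated for each variable with missing values and used in later modeling. When more than 80% of values are missing, generate an indicator dummy variable for each variable with missing values and use it in later modeling, without using the original variable.
The figure below shows median imputation and the generation of a missing-value indicator variable.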

▲ Figure 5-8: Example of missing-value imputation
pandas provides the fillna method for replacing missing values. It works much like the replace method covered earlier. For example, take the following data:
> sample
group id name score
0 1.0 1.0 Bob 99.0
1 1.0 1.0 Bob NaN
2 NaN 1.0 Mark 87.0
3 2.0 3.0 Miki 77.0
4 1.0 4.0 Sully 77.0
5 NaN NaN NaN NaN
Viewing and filling missing values step by step:
1. Viewing missing values
Before analyzing data, you usually need to know how much of it is missing. In Python, a lambda function can be built for this. In the lambda below, sum(col.isnull()) is the number of missing values in the current column and col.size is the total number of rows in that column:
>sample.apply(lambda col:sum(col.isnull())/col.size)
group 0.333333
id 0.166667
name 0.166667
score 0.333333
dtype: float64
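The same ratios can also be obtained without a lambda, because the mean of a boolean column is the share of True values. A minimal sketch, rebuilding the sample table with missing values shown above (the name missing_ratio is illustrative):

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({'group': [1, 1, np.nan, 2, 1, np.nan],
                       'id': [1, 1, 1, 3, 4, np.nan],
                       'name': ['Bob', 'Bob', 'Mark', 'Miki', 'Sully', np.nan],
                       'score': [99, np.nan, 87, 77, 77, np.nan]})

# isnull() gives a True/False table; the column-wise mean is the missing ratio.
missing_ratio = sample.isnull().mean()
print(missing_ratio)
```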
2. Filling with a specified value
pandas DataFrames provide the fillna method for filling missing values. For example, fill the missing values in the score column of sample with the mean:
>sample.score.fillna(sample.score.mean())
0 99.0
1 85.0
2 87.0
3 77.0
4 77.0
5 85.0
Name: score, dtype: float64
Quantiles such as the median can of course also be used for filling:
>sample.score.fillna(sample.score.median())
0 99.0
1 82.0
2 87.0
3 77.0
4 77.0
5 82.0
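Name: score, dtype: float64
3. Missing-value indicator variables
The isnull method can be called directly on a pandas object to produce a missing-value indicator variable. For example, to produce the indicator for the score variable: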
>sample.score.isnull()
0 False
1 True
2 False
3 False
4 False
5 True
Name: score, dtype: bool
To convert it to a numeric 0/1 indicator, use the apply method; int converts each value in the column to the int type.
>sample.score.isnull().apply(int)
0 0
1 1
2 0
3 0
4 0
5 1
Name: score, dtype: int64
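The rules of thumb given at the start of this section (below 20%, 20% to 80%, above 80%) could be combined into one helper, sketched below for a continuous column. The function name impute_by_ratio, its parameters and the _missing suffix are hypothetical, not part of pandas, and the treatment of the exact boundary values is an assumption.

```python
import numpy as np
import pandas as pd

def impute_by_ratio(df, col, low=0.2, high=0.8):
    """Hypothetical helper: treat a continuous column according to its missing ratio."""
    out = df.copy()
    ratio = out[col].isnull().mean()
    if ratio >= low:
        # From 20% upward, add a 0/1 indicator of where the value was missing.
        out[col + '_missing'] = out[col].isnull().astype(int)
    if ratio > high:
        # Above 80%, keep only the indicator and drop the original column.
        out = out.drop(columns=col)
    else:
        # Otherwise fill the gaps with the median of the observed values.
        out[col] = out[col].fillna(out[col].median())
    return out

sample = pd.DataFrame({'score': [99, np.nan, 87, 77, 77, np.nan]})
result = impute_by_ratio(sample, 'score')
print(result)
```

Here a third of the scores are missing, so the column is filled with the median (82.0) and a score_missing indicator is added. astype(int) is used as an equivalent of apply(int).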
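03 Handling noisy values
A noisy value is one or a few values in the data that differ greatly from the others. It is also called an abnormal value or outlier.
For most models, noisy values severely distort the results and make the conclusions unrealistic or biased, as in Figure 5-9. All noisy values need to be cleaned out during preprocessing. There are many ways of handling them: for a single variable, the common methods are capping and binning; for multiple variables, the method is clustering. Each is described in detail below: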

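▲ Figure 5-9: Example of noisy values (abnormal values, outliers): age data, with circles marking the noisy values
1. Capping
Capping replaces records of a continuous variable that lie outside three standard deviations above or below the mean with the value at three standard deviations above or below the mean (Figure 5-10).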

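▲ Figure 5-10: Example of handling noisy values by capping
In Python, capping can be done with a user-defined function. As shown below, the parameter x is a pd.Series column and quantile gives the capping range. By default, values below the 1st percentile are replaced with the 1st percentile and values above the 99th percentile are replaced with the 99th percentile. Note that this version caps at percentiles rather than at the mean plus or minus three standard deviations:
>def cap(x,quantile=[0.01,0.99]):
    """Handle outliers by capping.
    Args:
        x: pd.Series column, a continuous variable
        quantile: lower and upper quantiles used for capping
    """
    # Compute the lower and upper quantile values
    Q01,Q99=x.quantile(quantile).values.tolist()
    # Replace outliers with the corresponding quantile value
    if Q01 > x.min():
        x = x.copy()
        x.loc[x<Q01] = Q01
    if Q99 < x.max():
        x = x.copy()
        x.loc[x>Q99] = Q99
    return x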
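The function above caps at percentiles, whereas the definition given earlier uses the mean plus or minus three standard deviations. A minimal sketch of that variant follows, assuming Series.clip is an acceptable way to do the replacement. The name cap_sigma, the parameter n_sigma and the test values are illustrative.

```python
import numpy as np
import pandas as pd

def cap_sigma(x, n_sigma=3):
    """Cap a continuous pd.Series at mean +/- n_sigma standard deviations."""
    lower = x.mean() - n_sigma * x.std()
    upper = x.mean() + n_sigma * x.std()
    # clip returns a new Series; values outside [lower, upper] are set to the bound.
    return x.clip(lower=lower, upper=upper)

# 50 ordinary values plus one extreme value.
values = pd.Series([10.0] * 25 + [12.0] * 25 + [500.0])
capped = cap_sigma(values)
print(capped.max())
```

Because the extreme value itself inflates the mean and the standard deviation, this variant can be less strict than percentile capping on small samples.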
Now generate a set of random numbers that follow a normal distribution. sample.hist draws a histogram; more plotting methods are covered in the next chapter:
>sample = pd.DataFrame({'normal':np.random.randn(1000)})
>sample.hist(bins=50)

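▲ Figure 5-11: Histogram of the variable before noise handling
To apply capping to every column of a pandas DataFrame, it can be written as follows. Comparing the histograms shows how the frequency of extreme values changes after capping.
>new = sample.apply(cap,quantile=[0.01,0.99])
>new.hist(bins=50)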

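▲ Figure 5-12: Histogram of the variable after noise handling
2. Binning
Binning smooths the values of ordered data by looking at their "neighbors". The ordered values are distributed into a number of buckets, or bins.
Binning comes in two forms. Equal-depth binning: every bin holds the same number of samples. Equal-width binning: every bin covers a value range of the same width. A histogram in fact first applies equal-width binning to the data and then counts and plots the frequencies.
For example, suppose the sorted prices are: 4, 8, 15, 21, 21, 24, 25, 28, 34
Split into equal-depth bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Split into equal-width bins:
Bin 1: 4, 8
Bin 2: 15, 21, 21, 24
Bin 3: 25, 28, 34
Binning places abnormal values inside a bin, so the raw values do not enter the model directly during modeling. This is how it achieves the goal of handling outliers.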
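The price example above can be reproduced with pandas. The sketch below assumes pd.qcut for equal-depth and pd.cut for equal-width binning, with labels=False so that each value is mapped to its bin number (0, 1, 2). The variable names are illustrative.

```python
import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Equal-depth: three bins holding the same number of values.
equal_depth = pd.qcut(prices, q=3, labels=False)

# Equal-width: three bins covering value ranges of the same length.
equal_width = pd.cut(prices, bins=3, labels=False)

# List the prices that fall into each bin.
print(prices.groupby(equal_depth).apply(list))
print(prices.groupby(equal_width).apply(list))
```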
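The cut function in pandas implements binning. The following shows how to use it.
Equal-width binning: cut performs equal-width binning directly. It needs two arguments, the column to bin and the number of bins. As shown below, the normal column of sample holds 10 random numbers drawn from the standard normal distribution: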
>sample =pd.DataFrame({'normal':np.random.randn(10)})
>sample
normal
0 0.065108
1 -0.597031
2 0.635432
3 -0.491930
4 -1.894007
5 1.623684
6 1.723711
7 -0.225949
8 -0.213685
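9 -0.309789
Now split the data into 5 bins. The result is divided into 5 parts by width. For the lower bound, cut automatically picks a value slightly below the column minimum, the column maximum is the upper bound, and the range is split into five equal parts. The result is a column of Categorical type, similar to factor in R, which represents a categorical variable.
In addition, if the data contains missing values, they remain missing after binning. The binning is done as follows: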
>pd.cut(sample.normal,5)
0 (-0.447, 0.277]
1 (-1.17, -0.447]
2 (0.277, 1.0]
3 (-1.17, -0.447]
4 (-1.898, -1.17]
5 (1.0, 1.724]
6 (1.0, 1.724]
7 (-0.447, 0.277]
8 (-0.447, 0.277]
9 (-0.447, 0.277]
Name: normal, dtype: category
Categories (5, interval[float64]): [(-1.898, -1.17] < (-1.17, -0.447] < (-0.447, 0.277] <
(0.277, 1.0] < (1.0, 1.724]]
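The output above contains no missing values, so the statement that missing values stay missing after binning can be checked with a small sketch. The series with_gap is made up for illustration.

```python
import numpy as np
import pandas as pd

with_gap = pd.Series([0.1, np.nan, 0.5, 0.9])

# The missing entry is not assigned to any bin; it stays NaN in the result.
binned = pd.cut(with_gap, 2)
print(binned)
```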
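The labels parameter can also be used here to name the levels produced by binning. As shown below, the interval values are then replaced by the label values. (The outputs from here on appear to come from a sample whose normal column holds the integers 0 to 9, not the random values printed above.)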
> pd.cut(sample.normal,bins=5,labels=[1,2,3,4,5])
0 1
1 1
2 2
3 2
4 3
5 3
6 4
7 4
8 5
9 5
Name: normal, dtype: category
Categories (5, int64): [1 < 2 < 3 < 4 < 5]
Labels can be strings as well as numbers. As shown below, the data is split into two equal-width bins labeled 'bad' and 'good':
>pd.cut(sample.normal,bins=2,labels=['bad','good'])
0 bad
1 bad
2 bad
3 bad
4 bad
5 good
6 good
7 good
8 good
9 good
Name: normal, dtype: category
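Categories (2, object): [bad < good]
Equal-depth binning: in equal-depth binning the bins may differ in width, but their frequencies are almost equal, so the quantiles of the data can be used as bin edges. Still using the sample data, now split it into 2 equal-depth bins. First find the quantiles that delimit the 2 bins: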
>sample.normal.quantile([0,0.5,1])
0.0 0.0
0.5 4.5
1.0 9.0
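Name: normal, dtype: float64
Pass the quantiles as bin edges in the bins parameter to complete the binning, as shown below. include_lowest=True means that the first interval is closed on the left, so it includes the minimum of the data:
>pd.cut(sample.normal,bins=sample.normal.quantile([0,0.5,1]),
        include_lowest=True)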
0 [0, 4.5]
1 [0, 4.5]
2 [0, 4.5]
3 [0, 4.5]
4 [0, 4.5]
5 (4.5, 9]
6 (4.5, 9]
7 (4.5, 9]
8 (4.5, 9]
9 (4.5, 9]
Name: normal, dtype: category
Categories (2, object): [[0, 4.5] < (4.5, 9]]
The labels parameter can also be added to specify labels, as shown below:
>pd.cut(sample.normal,bins=sample.normal.quantile([0,0.5,1]),
        include_lowest=True,labels=['bad','good'])
0 bad
1 bad
2 bad
3 bad
4 bad
5 good
6 good
7 good
8 good
9 good
Name: normal, dtype: category
Categories (2, object): [bad < good]
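pandas also offers qcut, which computes the quantile edges and does the binning in one call. A minimal sketch, assuming a sample whose normal column holds the values 0 through 9 as in the outputs above (the name depth_bins is illustrative):

```python
import pandas as pd

sample = pd.DataFrame({'normal': range(10)})

# q=2 asks for two bins with (nearly) the same number of rows in each.
depth_bins = pd.qcut(sample.normal, q=2, labels=['bad', 'good'])
print(depth_bins.value_counts())
```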
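3. Multivariate outlier handling - clustering
Fast clustering groups data objects into multiple clusters. Objects in the same cluster are highly similar, while objects in different clusters differ considerably. Cluster analysis can uncover isolated points and thereby detect noisy data, because noise is by nature an isolated point.
This case considers two variables, income and age. The scatter plot is shown in Figure 5-13, where A and B are outliers: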

▲ Figure 5-13: Example of multivariate outliers
The steps for handling outliers with clustering are as follows:
Input: data set S (N records, attribute set D: {age, income}). One record is one data point, and the value of each attribute on a record is one data cell. Data set S has N×D data cells, some of which are noisy.
Output: the isolated data points, as shown in the figure. We regard isolated point A as noisy data; its noisy attribute is clearly income, and A can be removed by capping the income variable.
Data point B is also noisy, but it is hard to tell which attribute holds the error. In this case only a multivariate method can be used.