Editor's note:
This article comes from CSDN. It explains how to preprocess data with Python pandas by handling missing data, detecting and filtering outliers, and removing duplicate data. We hope it helps with your study.
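Missing data
Missing data is common in most data analysis applications. pandas uses the floating-point value NaN to represent missing data in both floating-point and non-floating-point arrays. It is simply a marker that is easy to detect.
from pandas import Series

# A Series of strings with no missing values
string_data = Series(['abcd', 'efgh', 'ijkl', 'mnop'])
print(string_data)
print("...........\n")
# isnull() returns a Boolean Series; every entry is False here
print(string_data.isnull())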

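Python's built-in None value is also treated as NA:
from pandas import Series

string_data = Series(['abcd', 'efgh', 'ijkl', 'mnop'])
print(string_data)
print("...........\n")
# Assigning None marks the first element as missing
string_data[0] = None
# The first entry of the mask is now True
print(string_data.isnull())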
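
To complement the string example above, here is a minimal sketch of how None behaves in a numeric Series compared with a string Series. The variable names are illustrative and not taken from the original article.

```python
import numpy as np
import pandas as pd

# In a numeric Series, None is converted to NaN and the dtype becomes float64.
numbers = pd.Series([1, None, 3])

# In a Series of strings, None still counts as a missing value.
words = pd.Series(['abcd', None, 'ijkl'])

numeric_mask = numbers.isnull()
word_mask = words.isnull()

print(numbers.dtype)
print(numeric_mask.tolist())
print(word_mask.tolist())
```

In both cases isnull() flags only the second element, so the same detection code works regardless of the element type.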

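There are four methods for handling NA: dropna, fillna, isnull, and notnull.
isnull / notnull: this pair of methods is applied element-wise to an object and returns a Boolean array, which is typically used for Boolean indexing.
dropna: for a Series, dropna returns a Series containing only the non-null data and their index values.
The difficulty lies in how a DataFrame is handled, because dropping anything means discarding at least one whole row (or column). As elsewhere, this is controlled through extra parameters: dropna(axis=0, how='any', thresh=None). The how parameter accepts 'any' or 'all'; with 'all', a row (or column) is dropped only when every element in it is NA. thresh is an integer; for example, with thresh=3 a row is kept only if it contains at least three non-NA values.
fillna: in fillna(value=None, method=None, axis=0), value can be a dictionary as well as a scalar, which makes it possible to fill different columns with different values.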
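
The dropna options described above can be compared side by side on a small frame. This is only a sketch: the frame and its column names 'a', 'b' and 'c' are hypothetical and chosen so that each option gives a different result.

```python
import numpy as np
import pandas as pd

# Hypothetical 4x3 frame; the column names are illustrative only.
df = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, 4.0],
    'b': [2.0, 5.0, np.nan, 6.0],
    'c': [3.0, np.nan, np.nan, np.nan],
})

# how='any' (the default): drop every row that contains at least one NA.
any_rows = df.dropna(how='any')

# how='all': drop only the rows in which every value is NA.
all_rows = df.dropna(how='all')

# thresh=2: keep only the rows with at least two non-NA values.
thresh_rows = df.dropna(thresh=2)

# axis=1 applies the rule to columns: keep columns with at least three non-NA values.
thresh_cols = df.dropna(axis=1, thresh=3)

# notnull() builds a Boolean mask that can be used for Boolean indexing.
b_present = df[df['b'].notnull()]

print(any_rows.index.tolist())
print(all_rows.index.tolist())
print(thresh_rows.index.tolist())
print(thresh_cols.columns.tolist())
print(b_present.index.tolist())
```

Note that thresh counts the values that are present, not the values that are missing.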
Filtering data:
For a Series, dropna returns a Series containing only the non-null data and their index values:
from pandas import Series
from numpy import nan as NA

data = Series([1, NA, 3.5, NA, 7])
# Only the entries at index 0, 2 and 4 remain
print(data.dropna())

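Another issue in filtering DataFrame rows concerns time series data. If you only want to keep the rows that have enough observed values, the thresh parameter does this:
import numpy as np
from pandas import DataFrame
from numpy import nan as NA

data = DataFrame(np.random.randn(7, 3))
# .ix has been removed from pandas; .loc slices by label and includes the end label
data.loc[:4, 1] = NA   # rows 0 to 4 of column 1
data.loc[:2, 2] = NA   # rows 0 to 2 of column 2
print(data)
print("...........")
# Keep only the rows that have at least two non-NA values
print(data.dropna(thresh=2))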

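If you would rather fill the "holes" in some way than filter out the missing data, fillna is the main function to use.
Calling fillna with a constant replaces the missing values with that constant:
import numpy as np
from pandas import DataFrame
from numpy import nan as NA

data = DataFrame(np.random.randn(7, 3))
# .ix has been removed from pandas; .loc is the label-based replacement
data.loc[:4, 1] = NA
data.loc[:2, 2] = NA
print(data)
print("...........")
# Every NA is replaced with 0
print(data.fillna(0))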

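Calling fillna with a dictionary fills different columns with different values:
import numpy as np
from pandas import DataFrame
from numpy import nan as NA

data = DataFrame(np.random.randn(7, 3))
# .ix has been removed from pandas; .loc is the label-based replacement
data.loc[:4, 1] = NA
data.loc[:2, 2] = NA
print(data)
print("...........")
# Keys are column labels: NAs in column 1 become 111, NAs in column 2 become 222
print(data.fillna({1: 111, 2: 222}))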

fillna can do many other things as well; for example, you can pass in the mean or median of a Series:
from pandas import Series
from numpy import nan as NA

data = Series([1.0, NA, 3.5, NA, 7])
print(data)
print("...........\n")
# mean() skips NA values, so the gaps are filled with the mean of 1.0, 3.5 and 7
print(data.fillna(data.mean()))
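
As a companion to the mean example, the sketch below fills the same Series with its median and also shows forward and backward filling. It uses the dedicated ffill() and bfill() methods rather than the method argument of fillna, on the assumption that recent pandas releases prefer them; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series([1.0, np.nan, 3.5, np.nan, 7.0])

# Fill with the median of the non-missing values (1.0, 3.5 and 7.0).
median_filled = data.fillna(data.median())

# Forward fill: each gap takes the last valid value that precedes it.
forward_filled = data.ffill()

# Backward fill: each gap takes the next valid value that follows it.
backward_filled = data.bfill()

print(median_filled.tolist())
print(forward_filled.tolist())
print(backward_filled.tolist())
```

Forward and backward filling tend to suit ordered data, while a mean or median is a reasonable default when the order of the rows carries no meaning.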

Detecting and filtering outliers
Filtering or transforming outliers is largely a matter of array operations. Consider a (1000, 4) array of standard normally distributed data:
import numpy as np
from pandas import DataFrame

data = DataFrame(np.random.randn(1000, 4))
print(data.describe())
print("\n....values in one column whose absolute value exceeds 3...\n")
col = data[3]
print(col[np.abs(col) > 3])
print("\n....all rows containing a value whose absolute value exceeds 3...\n")
# Index the whole frame (not col) so that complete rows are returned
print(data[(np.abs(data) > 3).any(axis=1)])
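
Because the random data above changes on every run, a small fixed frame makes the same selection easier to check by hand. This is a sketch with hypothetical values; it also shows one possible transformation, capping values to the interval [-3, 3] with clip, which is an illustrative choice rather than something the article prescribes.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame with two obvious outliers (5.0 and -4.0).
data = pd.DataFrame({
    0: [0.5, 5.0, -1.2, 0.3],
    1: [-0.7, 0.1, 2.9, -4.0],
})

# Boolean frame marking every value whose absolute value exceeds 3.
mask = data.abs() > 3

# Rows that contain at least one such value.
outlier_rows = data[mask.any(axis=1)]

# One way to transform rather than drop: cap all values to [-3, 3].
capped = data.clip(lower=-3, upper=3)

print(outlier_rows.index.tolist())
print(capped)
```

Selecting with the mask keeps the original rows intact, whereas clip returns a new frame and leaves data unchanged.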

Removing duplicate data
The DataFrame method duplicated returns a Boolean Series indicating whether each row is a duplicate.
import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4, 'k2': [1, 1, 2, 2, 3, 3, 4]})
print(data)
print("........\n")
# True marks a row that repeats an earlier row in every column
print(data.duplicated())

A related method is drop_duplicates, which returns a DataFrame with the duplicate rows removed:
import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4, 'k2': [1, 1, 2, 2, 3, 3, 4]})
print(data)
print("........\n")
# Rows that repeat an earlier row are removed
print(data.drop_duplicates())

Both methods consider all columns by default, but you can also specify a subset of columns to check for duplicates. Suppose there is one more column of values and you only want to filter duplicates based on the k1 column:
import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4, 'k2': [1, 1, 2, 2, 3, 3, 4]})
data['v1'] = range(7)
print(data)
print("........\n")
# Only k1 is compared, so one row per distinct k1 value remains
print(data.drop_duplicates(['k1']))

duplicated and drop_duplicates keep the first occurrence of each value combination by default. Passing keep='last' (take_last=True in old pandas versions) keeps the last one instead:
import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4, 'k2': [1, 1, 2, 2, 3, 3, 4]})
data['v1'] = range(7)
print(data)
print("........\n")
# take_last has been removed from pandas; keep='last' is its replacement
print(data.drop_duplicates(['k1', 'k2'], keep='last'))
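
The effect of the keep argument is easiest to see by comparing its settings on the same frame. The sketch below reuses the data from the example above; the result variable names are illustrative.

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [1, 1, 2, 2, 3, 3, 4]})
data['v1'] = range(7)

# keep='first' (the default): retain the first row of each (k1, k2) combination.
first_kept = data.drop_duplicates(['k1', 'k2'], keep='first')

# keep='last': retain the last row of each combination.
last_kept = data.drop_duplicates(['k1', 'k2'], keep='last')

# keep=False: drop every row whose combination occurs more than once.
unique_only = data.drop_duplicates(['k1', 'k2'], keep=False)

# duplicated() accepts the same column subset and marks repeats of an earlier row.
dup_mask = data.duplicated(['k1', 'k2'])

print(first_kept.index.tolist())
print(last_kept.index.tolist())
print(unique_only.index.tolist())
print(dup_mask.tolist())
```

The choice matters whenever another column, such as v1 here, differs between the duplicated rows, because it decides which of those values survives.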
