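Editor's note:
This article comes from the WeChat official account Python Data Science (Python数据科学). It explains how to drop columns from a DataFrame, how to change a DataFrame's index, and how to clean data, from individual fields up to an entire dataset.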
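Data scientists spend a large amount of their time cleaning datasets and getting them into a form they can work with. In fact, many data scientists say that the initial steps of obtaining and cleaning data make up 80% of the job.
So if you are already in this field, or planning to enter it, being able to deal with messy data matters, whether that means missing values, inconsistent formatting, malformed records, or nonsensical outliers.
In this tutorial, we'll use Python's pandas and NumPy packages to clean data.
We'll cover the following:
Dropping unnecessary columns in a DataFrame
Changing the index of a DataFrame
Using .str accessor methods to clean columns
Using the DataFrame.applymap() function to clean the entire dataset, element-wise
Renaming columns to a more recognizable set of labels
Skipping unnecessary rows in a CSV file
Here are the datasets we'll be using:
BL-Flickr-Images-Book.csv - a CSV file containing information about books from the British Library
university_towns.txt - a text file containing the names of college towns in every US state
olympics.csv - a CSV file summarizing the participation of all countries in the Summer and Winter Olympics
You can download the datasets from Real Python's GitHub repository to follow the examples below.
Note: Using Jupyter Notebooks is recommended for following along.
This tutorial assumes a basic understanding of the pandas and NumPy libraries, including pandas' workhorse Series and DataFrame objects, the common methods that can be applied to them, and familiarity with NumPy's NaN values.
Let's import the required modules and get started.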
>>> import pandas as pd
>>> import numpy as np
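Dropping Columns in a DataFrame
Often, you'll find that not all the categories of data in a dataset are useful to you. For example, you might have a dataset of student information (name, grade, standard, parents' names, address) but want to focus on analyzing student grades.
In this case, the address or parents' names aren't important to you. Retaining these unneeded fields takes up space and can slow down runtime.
pandas provides a handy way of removing unwanted columns or rows from a DataFrame with the drop() function. Let's look at a simple example where we drop a number of columns from a DataFrame.
First, let's create a DataFrame out of the CSV file BL-Flickr-Images-Book.csv. In the example below, we pass a relative path to pd.read_csv, meaning that all of the datasets are in a folder named Datasets in our current working directory:
>>> df = pd.read_csv('Datasets/BL-Flickr-Images-Book.csv')
>>> df.head()
When we look at the first five entries using the head() method, we can see that a handful of columns provide ancillary information that would be helpful to the library but isn't very descriptive of the books themselves: Edition Statement, Corporate Author, Corporate Contributors, Former owner, Engraver, Issuance type and Shelfmarks.
We can drop these columns in the following way: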
>>> to_drop = ['Edition Statement',
...            'Corporate Author',
...            'Corporate Contributors',
...            'Former owner',
...            'Engraver',
...            'Contributors',
...            'Issuance type',
...            'Shelfmarks']
>>> df.drop(to_drop, inplace=True, axis=1)
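Above, we defined a list that contains the names of all the columns we want to drop. Next, we call the drop() function on our object, passing in the inplace parameter as True and the axis parameter as 1. This tells pandas that we want the changes to be made directly in our object and that it should look for the values to be dropped in the columns of the object.
When we inspect the DataFrame again, we'll see that the unwanted columns have been removed.

Alternatively, we could remove the columns by passing them to the columns parameter directly, instead of separately specifying the labels to be removed and the axis where pandas should look for them. Use this form in place of the call above, not after it: once the columns are gone, dropping them a second time raises a KeyError.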
>>> df.drop(columns=to_drop, inplace=True)
This syntax is more intuitive and readable. What we're trying to do here is directly apparent.
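To make the equivalence of the two call styles concrete, here is a minimal sketch on a toy DataFrame. The column names are taken from the tutorial, but the rows and the variable names by_axis and by_columns are made up for illustration. It omits inplace=True so that both forms can be compared on the same input.

```python
import pandas as pd

# Toy frame: column names come from the tutorial, the values are placeholders.
df = pd.DataFrame({
    "Identifier": [206, 216],
    "Edition Statement": [None, None],
    "Place of Publication": ["London", "London"],
    "Engraver": [None, None],
})

to_drop = ["Edition Statement", "Engraver"]

# Labels plus the axis to search (1 means columns).
by_axis = df.drop(to_drop, axis=1)

# The columns keyword states the axis implicitly.
by_columns = df.drop(columns=to_drop)

print(list(by_columns.columns))
print(by_axis.equals(by_columns))
```

Because neither call uses inplace=True, df itself keeps all four columns and each call returns a new DataFrame.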
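Changing the Index of a DataFrame
A pandas Index extends the functionality of NumPy arrays to allow for more versatile slicing and labeling. In many cases, it is helpful to use a uniquely valued identifying field of the data as its index.
For example, in the dataset used in the previous section, we can expect that when a librarian searches for a record, they may enter the unique identifier of a book (the values in the Identifier column):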
>>> df['Identifier'].is_unique
True
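Let's replace the existing index with this column using set_index:
>>> df = df.set_index('Identifier')
Technical detail: Unlike primary keys in SQL, a pandas Index doesn't make any guarantee of being unique, although many indexing and merging operations run faster when it is.
We can access each record in a straightforward way with loc[]. Although loc[] may not have the most intuitive name, it allows us to do label-based indexing, which means selecting a row or record by its label without regard to its position:
>>> df.loc[206]
In other words, 206 is the first label of the index. To access it by position, we could use df.iloc[0], which does position-based indexing.
Previously, our index was a RangeIndex: integers starting from 0, analogous to Python's built-in range. By passing a column name to set_index, we changed the index to the values in Identifier.
You may have noticed that we reassigned the variable to the object returned by the method with df = df.set_index(...). This is because, by default, the method returns a modified copy of our object and does not make the changes directly to the object. We can avoid the reassignment by setting the inplace parameter instead: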
>>> df.set_index('Identifier', inplace=True)
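The difference between label-based and position-based access, and the fact that set_index returns a copy by default, can be checked on a small stand-in frame. This is only a sketch: the identifiers and places are the first three shown in the tutorial's outputs, and the variable names indexed, by_label and by_position are hypothetical.

```python
import pandas as pd

# Stand-in for the real dataset, reduced to two columns.
df = pd.DataFrame({
    "Identifier": [206, 216, 218],
    "Place of Publication": ["London", "London", "London"],
})

# Without inplace=True, set_index leaves df alone and returns a new object.
indexed = df.set_index("Identifier")

by_label = indexed.loc[206]      # label-based: the row whose index label is 206
by_position = indexed.iloc[0]    # position-based: the first row, whatever its label

print(indexed.index.is_unique)
print(by_label.equals(by_position))
```

Here both lookups return the same row because 206 happens to be the first label. After sorting or filtering, loc[206] would still find that record while iloc[0] would return whichever row came first.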
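Tidying up Fields in the Data
So far, we have removed unnecessary columns and changed the index of our DataFrame to something more sensible. In this section, we will clean specific columns and get them to a uniform format, to get a better understanding of the dataset and enforce consistency. In particular, we will be cleaning Date of Publication and Place of Publication.
Upon inspection, all of the data types are currently the object dtype, which is roughly analogous to str in native Python.
It encapsulates any field that can't be neatly fit as numerical or categorical data. This makes sense, since we're working with data that is initially a bunch of messy strings: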
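>>> df.get_dtype_counts()  # removed in pandas 1.0; df.dtypes.value_counts() is the replacement
object    6
One field where it makes sense to enforce a numeric value is the date of publication, so that we can do calculations on it later: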
>>> df.loc[1905:, 'Date of Publication'].head(10)
Identifier
1905           1888
1929    1839, 38-54
2836        [1897?]
2854           1865
2956        1860-63
2957           1873
3017           1866
3131           1899
4598           1814
4884           1820
Name: Date of Publication, dtype: object
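A particular book can have only one date of publication. Therefore, we need to do the following:
Remove the extra dates in square brackets, wherever present: 1879 [1878]
Convert date ranges to their start date, wherever present: 1860-63; 1839, 38-54
Completely remove the dates we are not certain about and replace them with NumPy's NaN: [1897?]
Convert the string nan to NumPy's NaN value
Synthesizing these patterns, we can use a single regular expression to extract the publication year:
>>> regex = r'^(\d{4})'
The regular expression above finds any four digits at the beginning of a string, which suffices for our case.
\d represents any digit, and {4} repeats this rule four times. The ^ character matches the start of a string, and the parentheses denote a capturing group, which signals to pandas that we want to extract that part of the regex.
Let's see what happens when we run this regex across our dataset: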
>>> extr = df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)
>>> extr.head()
Identifier
206    1879
216    1868
218    1869
472    1851
480    1857
Name: Date of Publication, dtype: object
Technically, this column still has object dtype, but we can easily get its numerical version with pd.to_numeric:
>>> df['Date of Publication'] = pd.to_numeric(extr)
>>> df['Date of Publication'].dtype
dtype('float64')
This results in about one in every ten values being missing, which is a small price to pay for now being able to do computations on the remaining valid values:
>>> df['Date of Publication'].isnull().sum() / len(df)
0.11717147339205986
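The whole extraction step can be tried without the dataset. The sketch below reuses five of the raw date strings printed earlier in this section. The variable names raw, years, numeric and missing_share are chosen here for illustration only.

```python
import pandas as pd

# Five raw values copied from the output of df.loc[1905:, 'Date of Publication'].
raw = pd.Series(
    ["1888", "1839, 38-54", "[1897?]", "1865", "1860-63"],
    index=[1905, 1929, 2836, 2854, 2956],
    name="Date of Publication",
)

# Keep the four digits at the start of each string; rows with no match become NaN.
years = raw.str.extract(r"^(\d{4})", expand=False)

# Convert the digit strings to numbers; NaN forces a floating-point result.
numeric = pd.to_numeric(years)

# Share of rows that could not be parsed (1 out of 5 in this sample).
missing_share = numeric.isnull().sum() / len(numeric)
print(numeric)
print(missing_share)
```

The uncertain date [1897?] starts with a bracket rather than a digit, so the anchored pattern does not match it and it ends up as NaN. The ranges keep only their start year.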
Combining str Methods with NumPy to Clean Columns
Above, you may have noticed the use of df['Date of Publication'].str. This attribute is a way to access speedy string operations in pandas that largely mimic operations on native Python strings or compiled regular expressions, such as .split(), .replace(), and .capitalize().
To clean the Place of Publication field, we can combine pandas str methods with NumPy's np.where function.
It has the following syntax:
>>> np.where(condition, then, else)
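Here, condition is either an array-like object or a Boolean mask. then is the value to be used if condition evaluates to True, and else is the value to be used otherwise. (The names then and else are placeholders only: else is a reserved word in Python, so real code passes these values positionally.)
It can also be nested, allowing us to compute values based on multiple conditions: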
>>> np.where(condition1, x1,
...          np.where(condition2, x2,
...                   np.where(condition3, x3, ...)))
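As a small runnable illustration of the nested form, the sketch below bins a few numbers into three labels. The array values and the labels low, mid and high are arbitrary and not part of the tutorial's data.

```python
import numpy as np

values = np.array([3, 12, 25, 40])

# The first condition that holds decides the label; the innermost value is the fallback.
labels = np.where(values < 10, "low",
                  np.where(values < 30, "mid", "high"))

print(labels)  # ['low' 'mid' 'mid' 'high']
```

Each np.where call evaluates its condition over the whole array at once, so no explicit Python loop is needed.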
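We'll be making use of these two functions to clean Place of Publication, since this column holds string objects. The contents of the column can be inspected with df['Place of Publication'].head(10).
For some rows, the place of publication is surrounded by other unnecessary information. Looking at more values shows that this is the case for only some of the rows.
Two particular entries show a second inconsistency: both books were published in the same place, but one has hyphens in the name of the place while the other does not.
To clean this column in one sweep, we can use str.contains() to get a Boolean mask.
We clean the column as follows: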
>>> pub = df['Place of Publication']
>>> london = pub.str.contains('London')
>>> london[:5]
Identifier
206    True
216    True
218    True
472    True
480    True
Name: Place of Publication, dtype: bool
>>> oxford = pub.str.contains('Oxford')
We combine them with np.where:
>>> df['Place of Publication'] = np.where(london, 'London',
...                                       np.where(oxford, 'Oxford',
...                                                pub.str.replace('-', ' ')))
>>> df['Place of Publication'].head()
Identifier
206    London
216    London
218    London
472    London
480    London
Name: Place of Publication, dtype: object
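Here, the np.where function is called in a nested structure, with condition being a Series of Booleans obtained with str.contains(). The contains() method works similarly to the built-in in keyword, which tests whether an item occurs in an iterable (or a substring in a string).
The replacement to be used is a string representing our desired place of publication. We also replace hyphens with a space using str.replace() and reassign the result to the column in our DataFrame.
Although there is more dirty data in this dataset, we will discuss only these two columns for now.
A look at the first five entries, for example with df.head(), shows that they are a lot crisper than when we started out.
Note: At this point, Place of Publication would be a good candidate for conversion to a Categorical dtype, because we can encode the fairly small unique set of cities with integers. (The memory usage of a Categorical is proportional to the number of categories plus the length of the data.)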
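The cleaning step and the suggested Categorical conversion can be sketched together on a handful of strings. The sample values below are invented to resemble the kinds of entries described above (extra text around the city, hyphenated versus unhyphenated spelling); they are not copied from the dataset, and the names cleaned and as_category are hypothetical.

```python
import numpy as np
import pandas as pd

# Invented sample values, shaped like the messy entries described in the text.
pub = pd.Series([
    "London",
    "London; Virtue & Yorston",
    "pp. 40. G. Bryan & Co: Oxford, 1898",
    "Newcastle-upon-Tyne",
    "Newcastle upon Tyne",
])

london = pub.str.contains("London")
oxford = pub.str.contains("Oxford")

# London wins first, then Oxford; everything else only has hyphens replaced.
cleaned = pd.Series(
    np.where(london, "London",
             np.where(oxford, "Oxford",
                      pub.str.replace("-", " ", regex=False))),
    index=pub.index,
)

# Each distinct city is stored once; the rows hold small integer codes.
as_category = cleaned.astype("category")
print(cleaned.tolist())
print(list(as_category.cat.categories))
```

Passing regex=False is a choice made for this sketch, so that the hyphen is treated as a literal character regardless of the pandas version's default.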
Cleaning the Entire Dataset Using the applymap Function
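In certain situations, you will see that the "dirt" is not localized to one column but is more spread out.
There are some instances where it would be helpful to apply a customized function to each cell or element of a DataFrame. pandas' applymap() method is similar to the built-in map() function and simply applies a function to all the elements in a DataFrame. (In pandas 2.1 and later, applymap() is deprecated in favor of the equivalent DataFrame.map().)
Let's look at an example. We will create a DataFrame out of the "university_towns.txt" file: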
$ head Datasets/university_towns.txt
Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Livingston (University of West Alabama)[2]
Montevallo (University of Montevallo)[2]
Troy (Troy University)[2]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Tuskegee (Tuskegee University)[5]
Alaska[edit]
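We see that we have periodic state names followed by the university towns in that state: StateA TownA1 TownA2 StateB TownB1 TownB2.... If we look at the way state names are written in the file, we'll see that all of them have the "[edit]" substring in them.
We can take advantage of this pattern by creating a list of (state, city) tuples and wrapping that list in a DataFrame: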
>>> university_towns = []
>>> with open('Datasets/university_towns.txt') as file:
...     for line in file:
...         if '[edit]' in line:
...             # Remember this `state` until the next is found
...             state = line
...         else:
...             # Otherwise, we have a city; keep `state` as last-seen
...             university_towns.append((state, line))
>>> university_towns[:5]
[('Alabama[edit]\n', 'Auburn (Auburn University)[1]\n'),
 ('Alabama[edit]\n', 'Florence (University of North Alabama)\n'),
 ('Alabama[edit]\n', 'Jacksonville (Jacksonville State University)[2]\n'),
 ('Alabama[edit]\n', 'Livingston (University of West Alabama)[2]\n'),
 ('Alabama[edit]\n', 'Montevallo (University of Montevallo)[2]\n')]
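We can wrap this list in a DataFrame and set the columns as "State" and "RegionName". pandas will take each element in the list and set State to the left value and RegionName to the right value:
>>> towns_df = pd.DataFrame(university_towns, columns=['State', 'RegionName'])
The resulting DataFrame can be inspected with towns_df.head().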
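The parse-and-wrap step can be reproduced without the file by feeding the loop a few of the lines shown in the head output above. This is a sketch only: io.StringIO stands in for the opened file, and the sample is cut down to four lines.

```python
import io

import pandas as pd

# The first lines of the file as shown above, held in memory instead of on disk.
sample = io.StringIO(
    "Alabama[edit]\n"
    "Auburn (Auburn University)[1]\n"
    "Florence (University of North Alabama)\n"
    "Alaska[edit]\n"
)

university_towns = []
state = None
for line in sample:
    if "[edit]" in line:
        # A state header: remember it for the town lines that follow.
        state = line
    else:
        # A town line: pair it with the most recently seen state.
        university_towns.append((state, line))

# Each tuple becomes one row: left value to State, right value to RegionName.
towns_df = pd.DataFrame(university_towns, columns=["State", "RegionName"])
print(towns_df.shape)
```

The trailing Alaska[edit] header produces no row in this sample because no town line follows it. The raw strings still carry the bracketed markers and newline characters, which is what the cleaning step below removes.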

While we could have cleaned these strings in the for loop above, pandas makes it easy. We only need the state name and the town name and can remove everything else. While we could use pandas' .str methods again here, we could also use applymap() to map a Python callable to each element of the DataFrame.
We have been using the term element, but what exactly do we mean by it? Picture a small "toy" DataFrame whose cells hold values such as 'Mock', 'Dataset', 'Python' and 'Pandas'.
In this example, each cell ('Mock', 'Dataset', 'Python', 'Pandas', etc.) is an element. Therefore, applymap() will apply a function to each of these independently. For our data, that function is named get_citystate.
pandas' applymap() only takes one parameter, which is the function (callable) that should be applied to each element:
>>> towns_df = towns_df.applymap(get_citystate)
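First, we define a function that takes an element from the DataFrame as its parameter. Inside the function, checks are performed to determine whether there's a ( or [ in the element.
Depending on the check, the function returns the corresponding value. Finally, the applymap() function is called on our object. Now the DataFrame is much neater.

The applymap() method took each element from the DataFrame, passed it to the function, and replaced the original value with the returned one. It's that simple!
Technical detail: While it is a convenient and versatile method, .applymap can have significant runtime for larger datasets, because it maps a Python callable to each individual element. In some cases, it can be more efficient to do vectorized operations that utilize Cython or NumPy.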
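The body of get_citystate does not appear in the text, so the version below is a reconstruction from the description (check for a parenthesis or a bracket and cut the string there) and should be read as an assumption. To stay runnable on both older and newer pandas releases, the sketch applies the function with apply plus Series.map rather than calling applymap() directly.

```python
import pandas as pd


def get_citystate(item):
    # Assumed implementation: keep only the text before " (" or "[".
    if " (" in item:
        return item[:item.find(" (")]
    elif "[" in item:
        return item[:item.find("[")]
    else:
        return item


# Two raw rows in the shape produced by the parsing loop above.
towns_df = pd.DataFrame(
    [("Alabama[edit]\n", "Auburn (Auburn University)[1]\n"),
     ("Alabama[edit]\n", "Florence (University of North Alabama)\n")],
    columns=["State", "RegionName"],
)

# Element-wise application: every cell of every column goes through the function.
cleaned = towns_df.apply(lambda column: column.map(get_citystate))
print(cleaned.values.tolist())  # [['Alabama', 'Auburn'], ['Alabama', 'Florence']]
```

Cutting at the first marker also discards the trailing newline, since it always comes after the bracket or parenthesis in these rows.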
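Renaming Columns and Skipping Rows
Often, the datasets you'll work with will have either column names that are not easy to understand, or unimportant information in the first few or last rows, such as definitions of the terms in the dataset, or footnotes.
In that case, we'd want to rename columns and skip certain rows so that we are left with only the necessary information under correct and sensible labels.
To demonstrate how we can go about doing this, let's first take a glance at the initial five rows of the "olympics.csv" dataset: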
$ head -n 5 Datasets/olympics.csv
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
,? Summer,01 !,02 !,03 !,Total,? Winter,01 !,02 !,03 !,Total,? Games,01 !,02 !,03 !,Combined total
Afghanistan (AFG),13,0,0,2,2,0,0,0,0,0,13,0,0,2,2
Algeria (ALG),12,5,2,8,15,3,0,0,0,0,15,5,2,8,15
Argentina (ARG),23,18,24,28,70,18,0,0,0,0,41,18,24,28,70
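Now, we'll read it into a pandas DataFrame:
>>> olympics_df = pd.read_csv('Datasets/olympics.csv')
>>> olympics_df.head()
This is messy indeed! The columns are the string form of integers indexed at 0. The row which should have been our header (i.e. the one to be used to set the column names) is at olympics_df.iloc[0]. This happened because our CSV file starts with 0, 1, 2, …, 15.
Also, if we were to go to the source of this dataset, we'd see that NaN above should really be something like "Country", ? Summer is supposed to represent "Summer Games", 01 ! should be "Gold", and so on.
Therefore, we need to do two things:
Skip one row and set the header as the first (0-indexed) row
Rename the columns
We can skip rows and set the header while reading the CSV file by passing some parameters to the read_csv() function.
This function takes a lot of optional parameters, but in this case we only need one (header) to remove the 0th row:
>>> olympics_df = pd.read_csv('Datasets/olympics.csv', header=1)
>>> olympics_df.head()
We now have the correct row set as the header and all unnecessary rows removed. Take note of how pandas has changed the name of the column containing the name of the countries from NaN to Unnamed: 0.
To rename the columns, we will make use of a DataFrame's rename() method, which allows you to relabel an axis based on a mapping (in this case, a dict).
Let's start by defining a dictionary named new_names that maps current column names (as keys) to more usable ones (the dictionary's values).

We call the rename() function on our object: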
>>> olympics_df.rename(columns=new_names, inplace=True)
Setting inplace to True specifies that our changes be made directly to the object. A quick olympics_df.head() confirms whether the new labels are in place.
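The header and rename steps can be tried end to end on the first lines of the file as printed above. In this sketch io.StringIO stands in for the CSV file, and the contents of new_names are an assumption: the text does not show the dictionary, so the target labels here are only one plausible choice. The suffixed keys such as '01 !.1' rely on pandas appending .1, .2 and so on to repeated header names when reading.

```python
import io

import pandas as pd

# First lines of olympics.csv as shown above, held in memory.
raw = io.StringIO(
    "0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15\n"
    ",? Summer,01 !,02 !,03 !,Total,? Winter,01 !,02 !,03 !,Total,"
    "? Games,01 !,02 !,03 !,Combined total\n"
    "Afghanistan (AFG),13,0,0,2,2,0,0,0,0,0,13,0,0,2,2\n"
    "Algeria (ALG),12,5,2,8,15,3,0,0,0,0,15,5,2,8,15\n"
)

# header=1 uses the second line as column names and drops the line before it.
olympics_df = pd.read_csv(raw, header=1)

# Hypothetical mapping from the raw labels to friendlier ones.
new_names = {
    "Unnamed: 0": "Country",
    "? Summer": "Summer Games",
    "01 !": "Gold",
    "02 !": "Silver",
    "03 !": "Bronze",
    "? Winter": "Winter Games",
    "01 !.1": "Gold.1",
    "02 !.1": "Silver.1",
    "03 !.1": "Bronze.1",
    "? Games": "Games",
    "01 !.2": "Gold.2",
    "02 !.2": "Silver.2",
    "03 !.2": "Bronze.2",
}

# Returns a renamed copy; columns missing from the mapping keep their names.
olympics_df = olympics_df.rename(columns=new_names)
print(list(olympics_df.columns)[:5])
```

Reassigning the result is used here instead of inplace=True. Both approaches leave olympics_df with the new labels.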

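Python Data Cleaning: Recap
In this tutorial, you learned how to drop unnecessary information from a dataset using the drop() function, as well as how to set an index for your dataset so that items in it can be referenced easily.
Moreover, you learned how to clean object fields with the .str accessor and how to clean the entire dataset using the applymap() method. Lastly, we explored how to skip rows in a CSV file and rename columns using the rename() method.
Knowing about data cleaning is very important, because it is a big part of data science. You now have a basic understanding of how pandas and NumPy can be leveraged to clean datasets.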