Editor's note:
From cnblogs. Covers data import and export, extracting and filtering the data you need, descriptive statistics, data processing, and more.

Preface: an introduction to the Python libraries used for data analysis
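1. NumPy:
NumPy is the foundational package for scientific computing in Python. It provides, among other things:
(1) ndarray, a fast and efficient multidimensional array object
(2) Functions for element-wise computation on arrays and for mathematical operations directly between arrays
(3) Tools for reading and writing array-based datasets on disk
(4) Linear algebra operations, Fourier transforms, and random number generation
(5) Tools for integrating C, C++, and Fortran code into Python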
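2. pandas
pandas provides a large set of data structures and functions for working with structured data quickly and conveniently. It combines NumPy's high-performance array computing with the flexible data manipulation of spreadsheets and relational databases (such as SQL). Its sophisticated indexing makes it easier to reshape, slice and dice, aggregate, and select subsets of data.
For users in the financial industry, pandas offers many high-performance time series features and tools suited to financial data.
DataFrame is a pandas object: a column-oriented, two-dimensional table that carries both row labels and column labels.
P.S. A passage quoted from the web that shows how powerful DataFrame is:
Excel 2007 and later versions support at most 1,048,576 rows and 16,384 columns; beyond that, Excel pops up a dialog saying "This text contains multiple lines of text and cannot be placed in a single worksheet". Handling tens of millions of rows is easy for pandas, and as we will see later it is also more expressive than SQL: it can do many complex operations with less code.
That is a lot of praise; to really get a feel for it, you have to write some code yourself.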
3. matplotlib
matplotlib is the most popular Python library for plotting data charts.
4. SciPy
SciPy is a collection of packages that address a range of standard problem domains in scientific computing.
5. statsmodels: https://github.com/statsmodels/statsmodels
6. scikit-learn: http://scikit-learn.org/stable/
I. Data import and export
(1) Reading CSV files
1. Reading a local file
import pandas as pd
df = pd.read_csv('E:\\tips.csv')  # use the path where your own data file is saved
# (P.S. when writing a path in Python, use either / or \\ as the separator)
# Output:
total_bill tip sex smoker day time size
16.99 1.01 Female No Sun Dinner 2
10.34 1.66 Male No Sun Dinner 3
21.01 3.50 Male No Sun Dinner 3
23.68 3.31 Male No Sun Dinner 2
24.59 3.61 Female No Sun Dinner 4
25.29 4.71 Male No Sun Dinner 4
.. ... ... ... ... ... ... ...
27.18 2.00 Female Yes Sat Dinner 2
22.67 2.00 Male Yes Sat Dinner 2
17.82 1.75 Male No Sat Dinner 2
18.78 3.00 Female No Thur Dinner 2
[244 rows x 7 columns]
2. Reading from the network
import pandas as pd
data_url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv"  # the URL to read from
df = pd.read_csv(data_url)
# Output: same as above
3. read_csv in detail
Purpose: Read CSV (comma-separated) file into DataFrame
(The signature below is the one from the pandas version in use when this was written; newer releases have removed or renamed several of these parameters.)
read_csv(filepath_or_buffer, sep=',', dialect=None, compression='infer',
         doublequote=True, escapechar=None, quotechar='"', quoting=0,
         skipinitialspace=False, lineterminator=None, header='infer',
         index_col=None, names=None, prefix=None, skiprows=None,
         skipfooter=None, skip_footer=0, na_values=None, true_values=None,
         false_values=None, delimiter=None, converters=None, dtype=None,
         usecols=None, engine=None, delim_whitespace=False, as_recarray=False,
         na_filter=True, compact_ints=False, use_unsigned=False,
         low_memory=True, buffer_lines=None, warn_bad_lines=True,
         error_bad_lines=True, keep_default_na=True, thousands=None,
         comment=None, decimal='.', parse_dates=False, keep_date_col=False,
         dayfirst=False, date_parser=None, memory_map=False,
         float_precision=None, nrows=None, iterator=False, chunksize=None,
         verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True,
         tupleize_cols=False, infer_datetime_format=False,
         skip_blank_lines=True)
Parameter details:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
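Most of the parameters in that long signature are rarely needed. As a rough sketch of a few that come up often (sep, usecols, na_values, nrows), the snippet below reads a small inline text instead of a file; the data and the name df_small are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical inline data standing in for a CSV file on disk.
csv_text = """total_bill;tip;sex;size
16.99;1.01;Female;2
10.34;unknown;Male;3
21.01;3.50;Male;3
23.68;3.31;Male;2
"""

df_small = pd.read_csv(
    io.StringIO(csv_text),
    sep=';',                                # field separator (the default is ',')
    usecols=['total_bill', 'tip', 'size'],  # keep only these columns
    na_values=['unknown'],                  # extra strings to treat as missing
    nrows=3,                                # read only the first 3 data rows
)
print(df_small)
```

The same keyword arguments apply when the first argument is a file path or a URL.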
(2) Reading MySQL data
Assume the database is installed locally, the user name is myusername, the password is mypassword, and we want to read data from the mydb database.
import pandas as pd
import MySQLdb
mysql_cn = MySQLdb.connect(host='localhost', port=3306, user='myusername',
                           passwd='mypassword', db='mydb')
df = pd.read_sql('select * from test;', con=mysql_cn)
mysql_cn.close()
The code above reads every row of the test table into df, which is a DataFrame.
P.S. MySQL tutorial: http://www.runoob.com/mysql/mysql-tutorial.html
(3) Reading Excel files
Reading Excel files also requires the xlrd module; pip install xlrd is enough.
df = pd.read_excel('E:\\tips.xls')
(4) Exporting data to a CSV file
df.to_csv('E:\\demo.csv', encoding='utf-8', index=False)
# index=False drops the row labels on export; if the data contains Chinese text, encoding is usually set to 'utf-8'
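To see what index=False actually changes, here is a small sketch that writes to a string instead of a file (to_csv returns the text when no path is given). The frame df_demo and its contents are made up for illustration.

```python
import io
import pandas as pd

df_demo = pd.DataFrame({'name': ['Zhang', 'Li'], 'tip': [1.5, 2.0]})

with_index = df_demo.to_csv()                # default: row labels become the first column
without_index = df_demo.to_csv(index=False)  # row labels dropped

print(with_index)
print(without_index)

# Reading the index-free text back gives the original columns only.
restored = pd.read_csv(io.StringIO(without_index))
```

Without index=False, reading the file back produces an extra unnamed column holding the old row labels.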
(5) Reading and writing SQL databases
import pandas as pd
import sqlite3
con = sqlite3.connect('...')
sql = '...'
df = pd.read_sql(sql, con)
# help text
help(sqlite3.connect)
# Output
Help on built-in function connect in module _sqlite3:
connect(...)
connect(database[, timeout, isolation_level, detect_types, factory])
Opens a connection to the SQLite database file *database*. You can use
":memory:" to open a database connection to a database that resides in
RAM instead of on disk.
#############
help(pd.read_sql)
# Output
Help on function read_sql in module pandas.io.sql:
read_sql(sql, con, index_col=None, coerce_float=True, params=None,
         parse_dates=None, columns=None, chunksize=None)
Read SQL query or database table into a DataFrame.
P.S. I pasted the database code straight from the web and have not tested whether it works; I am posting it here for now.
I am still feeling my way around databases, so feel free to share your own study notes and tips.
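Since the snippet above leaves the database path and the query as '...', here is a small self-contained sketch of the same round trip using an in-memory SQLite database. The table name tips_demo and the sample rows are assumptions made up for the example.

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(':memory:')  # in-memory database, nothing is written to disk

tips_small = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01],
    'tip': [1.01, 1.66, 3.50],
    'day': ['Sun', 'Sun', 'Sat'],
})
# Write the frame to a table (hypothetical name), without the row labels.
tips_small.to_sql('tips_demo', con, index=False)

# Read back only the rows matching a parameterised condition.
result = pd.read_sql('SELECT day, tip FROM tips_demo WHERE tip > ?', con, params=(1.5,))
con.close()
print(result)
```

For databases other than SQLite, pandas generally expects an SQLAlchemy connection rather than a raw driver connection.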
II. Extracting and filtering the data you need
(1) Extracting and viewing data (using the tips.csv data; source: https://github.com/mwaskom/seaborn-data)
print(df.head())  # print the first five rows
# Output
total_bill tip sex smoker day time size
16.99 1.01 Female No Sun Dinner 2
10.34 1.66 Male No Sun Dinner 3
21.01 3.50 Male No Sun Dinner 3
23.68 3.31 Male No Sun Dinner 2
24.59 3.61 Female No Sun Dinner 4
print(df.tail())  # print the last five rows
# Output
total_bill tip sex smoker day time size
29.03 5.92 Male No Sat Dinner 3
27.18 2.00 Female Yes Sat Dinner 2
22.67 2.00 Male Yes Sat Dinner 2
17.82 1.75 Male No Sat Dinner 2
18.78 3.00 Female No Thur Dinner 2
print(df.columns)  # print the column names
# Output
Index([u'total_bill', u'tip', u'sex', u'smoker',
u'day', u'time', u'size'], dtype='object')
print(df.index)  # print the row labels
# Output
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
...
234, 235, 236, 237, 238, 239, 240, 241, 242, 243],
dtype='int64', length=244)
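print(df.iloc[10:21, 0:3])  # print rows 10 through 20, first three columns
# (the original used df.ix[10:20, 0:3]; .ix has since been removed from pandas, and .iloc excludes the end position, hence 21)
# Output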
total_bill tip sex
10.27 1.71 Male
35.26 5.00 Female
15.42 1.57 Male
18.43 3.00 Male
14.83 3.02 Female
21.58 3.92 Male
10.33 1.67 Female
16.29 3.71 Male
16.97 3.50 Female
20.65 3.35 Male
17.92 4.08 Male
# Extract non-contiguous rows and columns: here, the rows at positions 1, 3, 5 and the columns at positions 2, 4 (counting from 0)
df.iloc[[1,3,5],[2,4]]
# Output
sex day
Male Sun
Male Sun
Male Sun
# Extract one specific value: here, the row at position 3 and the column at position 2 (counting from 0)
df.iat[3,2]
# Output
'Male'
print(df.drop(df.columns[[0, 1]], axis=1))  # drop the first two columns
print(df.drop(df.index[[0, 1]], axis=0))  # drop the first two rows
# Output omitted to save space.
print(df.shape)  # print the dimensions
# Output
(244, 7)
df.iloc[3]  # select the row at position 3
# Output 1
total_bill 23.68
tip 3.31
sex Male
smoker No
day Sun
time Dinner
size 2
Name: 3, dtype: object
df.iloc[2:4]  # select the rows at positions 2 and 3
# Output 2
total_bill tip sex smoker day time size
21.01 3.50 Male No Sun Dinner 3
23.68 3.31 Male No Sun Dinner 2
df.iloc[0,1]  # select the element at row 0, column 1
# Output 3
1.01
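The selections above all go by position. As a rough sketch of how position-based (.iloc) and label-based (.loc, .at) selection differ, the frame below deliberately uses row labels that do not match the positions; all names and values are made up for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'total_bill': [16.99, 10.34, 21.01, 23.68],
     'tip': [1.01, 1.66, 3.50, 3.31],
     'sex': ['Female', 'Male', 'Male', 'Male']},
    index=[10, 11, 12, 13],  # labels deliberately differ from positions 0..3
)

by_position = df_demo.iloc[1:3, 0:2]                  # positions 1 and 2; end excluded
by_label = df_demo.loc[11:12, ['total_bill', 'tip']]  # labels 11 to 12; end included
single = df_demo.at[12, 'sex']                        # one value, looked up by label

print(by_position)
print(by_label)
print(single)
```

Both slices pick the same two rows here; the difference is that a .loc slice includes its end label while an .iloc slice stops before its end position.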
(2) Filtering the data you need (using the tips.csv data; source: https://github.com/mwaskom/seaborn-data)
# Example: suppose we want the rows where the tip is greater than $8
df[df.tip>8]
# Output
total_bill tip sex smoker day time size
50.81 10 Male Yes Sat Dinner 3
48.33 9 Male No Sat Dinner 4
# Filter conditions can also be combined with "or" and "and", for example:
# 1
df[(df.tip>7)|(df.total_bill>50)]  # rows where the tip is greater than $7 or the total bill is greater than $50
# Output
total_bill tip sex smoker day time size
39.42 7.58 Male No Sat Dinner 4
50.81 10.00 Male Yes Sat Dinner 3
48.33 9.00 Male No Sat Dinner 4
# 2
df[(df.tip>7)&(df.total_bill>50)]  # rows where the tip is greater than $7 and the total bill is greater than $50
# Output
total_bill tip sex smoker day time size
50.81 10 Male Yes Sat Dinner 3
# Continuing from above:
# suppose that, after filtering, we only care about day and time
df[['day','time']][(df.tip>7)|(df.total_bill>50)]
# Output
day time
Sat Dinner
Sat Dinner
Sat Dinner
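The row filter and the column selection can also be written as a single .loc call, which avoids the chained indexing used above. A minimal sketch on made-up rows (df_demo and the mask names are illustrative only):

```python
import pandas as pd

df_demo = pd.DataFrame({
    'total_bill': [39.42, 50.81, 48.33, 16.99],
    'tip': [7.58, 10.00, 9.00, 1.01],
    'day': ['Sat', 'Sat', 'Sat', 'Sun'],
    'time': ['Dinner', 'Dinner', 'Dinner', 'Dinner'],
})

# Each comparison yields a boolean Series; | is "or", & is "and".
# The parentheses are required because | and & bind tighter than >.
mask_or = (df_demo['tip'] > 7) | (df_demo['total_bill'] > 50)
mask_and = (df_demo['tip'] > 7) & (df_demo['total_bill'] > 50)

either = df_demo.loc[mask_or, ['day', 'time']]  # filter rows and pick columns in one step
both = df_demo.loc[mask_and]

print(either)
print(both)
```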
III. Descriptive statistics (using the tips.csv data; source: https://github.com/mwaskom/seaborn-data)
print(df.describe())  # descriptive statistics
# Output (the measures are all fairly simple, so I won't explain them)
total_bill tip size
count 244.000000 244.000000 244.000000
mean 19.785943 2.998279 2.569672
std 8.902412 1.383638 0.951100
min 3.070000 1.000000 1.000000
25% 13.347500 2.000000 2.000000
50% 17.795000 2.900000 2.000000
75% 24.127500 3.562500 3.000000
max 50.810000 10.000000 6.000000
IV. Data processing
(1) Transposing data (using the tips.csv data; source: https://github.com/mwaskom/seaborn-data)
print(df.T)
# Output
                 0      1      2      3      4      5      6      7 \
total_bill   16.99  10.34  21.01  23.68  24.59  25.29   8.77  26.88
tip           1.01   1.66    3.5   3.31   3.61   4.71      2   3.12
sex         Female   Male   Male   Male Female   Male   Male   Male
smoker          No     No     No     No     No     No     No     No
day            Sun    Sun    Sun    Sun    Sun    Sun    Sun    Sun
time        Dinner Dinner Dinner Dinner Dinner Dinner Dinner Dinner
size             2      3      3      2      4      4      2      4
                 8      9 ...    234    235    236    237    238 \
total_bill   15.04  14.78 ...  15.53  10.07   12.6  32.83  35.83
tip           1.96   3.23 ...      3   1.25      1   1.17   4.67
sex           Male   Male ...   Male   Male   Male   Male Female
smoker          No     No ...    Yes     No    Yes    Yes     No
day            Sun    Sun ...    Sat    Sat    Sat    Sat    Sat
time        Dinner Dinner ... Dinner Dinner Dinner Dinner Dinner
size             2      2 ...      2      2      2      2      3
               239    240    241    242    243
total_bill   29.03  27.18  22.67  17.82  18.78
tip           5.92      2      2   1.75      3
sex           Male Female   Male   Male Female
smoker          No    Yes    Yes     No     No
day            Sat    Sat    Sat    Sat   Thur
time        Dinner Dinner Dinner Dinner Dinner
size             3      2      2      2      2
[7 rows x 244 columns]
(2) Sorting data (using the tips.csv data; source: https://github.com/mwaskom/seaborn-data)
df.sort_values(by='tip')  # sort by the tip column in ascending order
# Output (abridged to save space)
total_bill tip sex smoker day time size
3.07 1.00 Female Yes Sat Dinner 1
12.60 1.00 Male Yes Sat Dinner 2
5.75 1.00 Female Yes Fri Dinner 2
7.25 1.00 Female No Sat Dinner 1
16.99 1.01 Female No Sun Dinner 2
.. ... ... ... ... ... ... ...
28.17 6.50 Female Yes Sat Dinner 3
34.30 6.70 Male No Thur Lunch 6
48.27 6.73 Male No Sat Dinner 4
39.42 7.58 Male No Sat Dinner 4
48.33 9.00 Male No Sat Dinner 4
50.81 10.00 Male Yes Sat Dinner 3
[244 rows x 7 columns]
(3) Handling missing values
1. Filling missing values (data from Chapter 2 of "Python for Data Analysis": usagov_bitly_data2012-03-16-1331923249.txt; ask me if you need a copy)
import json  # Python has many built-in and third-party modules that can turn a JSON string into a Python dict
import pandas as pd
import numpy as np
from pandas import DataFrame
# fill in your own path; the raw string keeps the backslashes literal
path = r'F:\PycharmProjects\pydata-book-master\ch02\usagov_bitly_data2012-03-16-1331923249.txt'
records = [json.loads(line) for line in open(path)]
frame = DataFrame(records)
frame['tz']
# Output (partly trimmed to save space)
America/New_York
America/Denver
America/New_York
America/Sao_Paulo
America/New_York
America/New_York
Europe/Warsaw
America/Los_Angeles
America/New_York
America/New_York
NaN
...
Name: tz, dtype: object
The output above shows that the data contains unknown or missing values, so let's deal with them next.
print(frame['tz'].fillna(1111111111111))  # replace missing values with a number
# Output (partly trimmed to save space)
America/New_York
America/Denver
America/New_York
America/Sao_Paulo
America/New_York
America/New_York
Europe/Warsaw
America/Los_Angeles
America/New_York
America/New_York
1111111111111
Name: tz, dtype: object
print(frame['tz'].fillna('YuJie2333333333333'))  # replace missing values with a string
# Output (partly trimmed to save space)
America/New_York
America/Denver
America/New_York
America/Sao_Paulo
America/New_York
America/New_York
Europe/Warsaw
America/Los_Angeles
America/New_York
America/New_York
YuJie2333333333333
Name: tz, dtype: object
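In addition:
print(frame['tz'].ffill())  # replace each missing value with the previous value (the older spelling is fillna(method='pad'))
print(frame['tz'].bfill())  # replace each missing value with the next value (the older spelling is fillna(method='bfill'))
2. Dropping missing values (same data as above)
print(frame['tz'].dropna(axis=0))  # drop the missing entries (rows) of the tz column
print(frame.dropna(axis=1))  # drop columns that contain missing values; axis=1 only exists on the whole DataFrame, not on a single column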
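The filling and dropping calls above are easier to compare on a tiny frame where the result can be checked by eye. The following is only a sketch; frame_demo and its two columns are made up and merely imitate the shape of the bitly data.

```python
import numpy as np
import pandas as pd

frame_demo = pd.DataFrame({
    'tz': ['America/New_York', np.nan, 'Europe/Warsaw', np.nan],
    'c': ['US', 'US', 'PL', 'US'],
})

filled_const = frame_demo['tz'].fillna('Unknown')  # fixed replacement value
filled_prev = frame_demo['tz'].ffill()             # carry the previous value forward
filled_next = frame_demo['tz'].bfill()             # pull the next value backward

rows_kept = frame_demo.dropna(axis=0)  # drop rows that contain a missing value
cols_kept = frame_demo.dropna(axis=1)  # drop columns that contain a missing value

print(filled_prev)
print(filled_next)
print(rows_kept)
print(cols_kept)
```

Note that bfill leaves the last entry missing, because there is no later value to copy from.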
3. Filling missing values by interpolation
Since I have no data at hand, here is a small aside: creating a random DataFrame.
import pandas as pd
import numpy as np
# create a 6x4 DataFrame; randn generates the random numbers
czf_data = pd.DataFrame(np.random.randn(6,4), columns=list('ABCD'))
czf_data
# Output
A B C D
0.355690 1.165004 0.810392 -0.818982
0.496757 -0.490954 -0.407960 -0.493502
-0.202123 -0.842278 -0.948464 0.223771
0.969445 1.357910 -0.479598 -1.199428
0.125290 0.943056 -0.082404 -0.363640
-1.762905 -1.471447 0.351570 -1.546152
There we go, the data is ready. Next we overwrite some values with nulls to create a DataFrame that contains missing values.
# set the row at position 2 (the third row) to missing values
czf_data.iloc[2, :] = np.nan
czf_data
# Output
A B C D
0.355690 1.165004 0.810392 -0.818982
0.496757 -0.490954 -0.407960 -0.493502
NaN NaN NaN NaN
0.969445 1.357910 -0.479598 -1.199428
0.125290 0.943056 -0.082404 -0.363640
-1.762905 -1.471447 0.351570 -1.546152
# now the gaps can be filled by interpolation
print(czf_data.interpolate())
# Output
A B C D
0.355690 1.165004 0.810392 -0.818982
0.496757 -0.490954 -0.407960 -0.493502
0.733101 0.433478 -0.443779 -0.846465
0.969445 1.357910 -0.479598 -1.199428
0.125290 0.943056 -0.082404 -0.363640
-1.762905 -1.471447 0.351570 -1.546152
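In the output above, each filled value is the midpoint of its neighbours, for example (0.496757 + 0.969445) / 2 = 0.733101 in column A, because interpolate() is linear by default. A small sketch with round numbers (the frame demo is made up) makes this easier to verify:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'A': [0.0, 2.0, np.nan, 6.0],     # one gap between 2 and 6
    'B': [1.0, np.nan, np.nan, 4.0],  # two consecutive gaps between 1 and 4
})

filled = demo.interpolate()  # default method is 'linear'
print(filled)
```

A single gap becomes the midpoint of its neighbours, and a run of gaps is filled in equal steps between the surrounding values.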
(4) Grouping data (using the tips.csv data; source: https://github.com/mwaskom/seaborn-data)
group = df.groupby('day')  # group by the day column
# 1
print(group.first())  # print the first row of each group
# Output
total_bill tip sex smoker time size
day
Fri 28.97 3.00 Male Yes Dinner 2
Sat 20.65 3.35 Male No Dinner 3
Sun 16.99 1.01 Female No Dinner 2
Thur 27.20 4.00 Male No Lunch 4
# 2
print(group.last())  # print the last row of each group
# Output
total_bill tip sex smoker time size
day
Fri 10.09 2.00 Female Yes Lunch 2
Sat 17.82 1.75 Male No Dinner 2
Sun 15.69 1.50 Male Yes Dinner 2
Thur 18.78 3.00 Female No Dinner 2
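Besides first() and last(), a grouped object is most often used to compute summaries per group. A minimal sketch on made-up rows (df_demo is illustrative, not the real tips data):

```python
import pandas as pd

df_demo = pd.DataFrame({
    'day': ['Sun', 'Sun', 'Sat', 'Sat', 'Sat'],
    'tip': [1.0, 3.0, 2.0, 4.0, 6.0],
})

group_demo = df_demo.groupby('day')
first_rows = group_demo.first()                            # first row of each group
summary = group_demo['tip'].agg(['count', 'mean', 'max'])  # several statistics at once

print(first_rows)
print(summary)
```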
(5) Replacing values
import pandas as pd
import numpy as np
# first create a Series (a blessing when you have no data at hand)
Series = pd.Series([0,1,2,3,4,5])
Series
# Output
0
1
2
3
4
5
dtype: int64
# replace a single value, e.g. replace 0 with 10000000000000
print(Series.replace(0,10000000000000))
# Output
10000000000000
1
2
3
4
5
dtype: int64
# replacing a list of values with another list works the same way
print(Series.replace([0,1,2,3,4,5],[11111,222222,3333333,44444,55555,666666]))
# Output
11111
222222
3333333
44444
55555
666666
dtype: int64
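V. Statistical analysis
(1) t-tests
1. Independent-samples t-test
The two-independent-samples t-test uses sample data to infer whether the means of the two independent populations the samples come from differ significantly. It assumes that the two populations are independent of each other and normally distributed.
At first I could not find suitable data, so I copied a worked SPSS example for the independent-samples t-test from the web. Bear with it for now; I will give a better example when I find one.
The data is as follows; I saved it locally in xlsx format:
group data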
1 34
1 37
1 28
1 36
1 30
2 43
2 45
2 47
2 49
2 39
import pandas as pd
from scipy.stats import ttest_ind
IS_t_test = pd.read_excel('E:\\IS_t_test.xlsx')
Group1 = IS_t_test[IS_t_test['group']==1]['data']
Group2 = IS_t_test[IS_t_test['group']==2]['data']
print(ttest_ind(Group1,Group2))
# Output
(-4.7515451390104353, 0.0014423819408438474)
The first element of the result is the t value and the second is the p-value.
By default ttest_ind assumes that the two groups have equal variances; to drop that assumption, set equal_var=False.
print(ttest_ind(Group1,Group2,equal_var=True))
print(ttest_ind(Group1,Group2,equal_var=False))
# Output
(-4.7515451390104353, 0.0014423819408438474)
(-4.7515451390104353, 0.0014425608643614844)
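To see where the t value comes from, the equal-variance statistic can be reproduced by hand with NumPy. This is only a sketch of the textbook pooled-variance formula applied to the ten values listed above; the p-value additionally needs the t distribution, which scipy supplies.

```python
import numpy as np

group1 = np.array([34, 37, 28, 36, 30], dtype=float)
group2 = np.array([43, 45, 47, 49, 39], dtype=float)
n1, n2 = len(group1), len(group2)

# Pooled variance: the two sample variances (ddof=1) weighted by their degrees of freedom.
pooled_var = ((n1 - 1) * group1.var(ddof=1) + (n2 - 1) * group2.var(ddof=1)) / (n1 + n2 - 2)

# Standard error of the difference between the two means.
std_err = np.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))

t_stat = (group1.mean() - group2.mean()) / std_err
dof = n1 + n2 - 2  # degrees of freedom of the test

print(t_stat, dof)
```

The result agrees with the first element printed by ttest_ind above.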
2. Paired-samples t-test
Again I could not find data, so for now let's pretend the independent samples above are paired samples and reuse the same data.
import pandas as pd
from scipy.stats import ttest_rel
IS_t_test = pd.read_excel('E:\\IS_t_test.xlsx')
Group1 = IS_t_test[IS_t_test['group']==1]['data']
Group2 = IS_t_test[IS_t_test['group']==2]['data']
print(ttest_rel(Group1,Group2))
# Output
(-5.6873679190073361, 0.00471961872448184)
Likewise, the first element of the result is the t value and the second is the p-value.
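The paired test only looks at the within-pair differences. A short NumPy sketch of the textbook formula, pairing the values in the order they are listed above (the names first, second and diff are made up for the example):

```python
import numpy as np

first = np.array([34, 37, 28, 36, 30], dtype=float)
second = np.array([43, 45, 47, 49, 39], dtype=float)

diff = first - second  # one difference per pair
n = len(diff)

# t = mean difference divided by its standard error.
t_stat = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))
dof = n - 1

print(t_stat, dof)
```

The value matches the first element printed by ttest_rel; the test has 4 degrees of freedom instead of the 8 of the independent-samples version.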
(2) Analysis of variance
1. One-way ANOVA
This again reuses the t-test data.
import pandas as pd
from scipy import stats
IS_t_test = pd.read_excel('E:\\IS_t_test.xlsx')
Group1 = IS_t_test[IS_t_test['group']==1]['data']
Group2 = IS_t_test[IS_t_test['group']==2]['data']
args = [Group1, Group2]  # the samples to compare, one entry per group
w,p = stats.levene(*args)
# Levene test for homogeneity of variance. levene(*args, **kwds): Perform Levene
# test for equal variances. If p<0.05, the variances are unequal
print(w,p)
# run the analysis of variance
f,p = stats.f_oneway(*args)
print(f,p)
# Output
(0.019607843137254936, 0.89209916055865535)
22.5771812081 0.00144238194084
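The F value is the ratio of the between-group mean square to the within-group mean square. Here is a sketch of that calculation with NumPy on the same ten values; it follows the textbook definition and is not how scipy is implemented internally.

```python
import numpy as np

groups = [np.array([34, 37, 28, 36, 30], dtype=float),
          np.array([43, 45, 47, 49, 39], dtype=float)]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Between-group sum of squares: how far each group mean lies from the grand mean.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of the values around their own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

f_stat = (ss_between / df_between) / (ss_within / df_within)
print(f_stat)
```

With only two groups, F is the square of the equal-variance t value, which is why the p-value here equals the one from the independent-samples t-test.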
2. Multi-factor ANOVA
The data is a multi-factor ANOVA example that I found on the web; it studies the effect of block and nutrient on body weight. I turned it into an Excel file, so ask me if you need it. Multi-factor ANOVA needs the statsmodels module; if it is not installed on your machine, pip install it.
# import the data
import pandas as pd
MANOVA=pd.read_excel('E:\\MANOVA.xlsx')
MANOVA
# Output (the middle part is removed to save space)
id nutrient weight
1 1 50.1
2 1 47.8
3 1 53.1
4 1 63.5
5 1 71.2
6 1 41.4
.......................
6 3 38.5
7 3 51.2
8 3 46.2
# multi-factor ANOVA
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
formula = 'weight~C(id)+C(nutrient)+C(id):C(nutrient)'
anova_results = anova_lm(ols(formula, MANOVA).fit())
print(anova_results)
# Output
df sum_sq mean_sq F PR(>F)
C(id) 7 2.373613e+03 339.087619 0 NaN
C(nutrient) 2 1.456133e+02 72.806667 0 NaN
C(id):C(nutrient) 14 3.391667e+02 24.226190 0 NaN
Residual 0 8.077936e-27 inf NaN NaN
Maybe the data was a poor choice: the p-values are all null. The table hints at why: Residual has 0 degrees of freedom, so with a single observation per id and nutrient combination, the model with the interaction term leaves nothing to estimate the error from and no F test can be formed. I will redo the multi-factor ANOVA once I find better data.
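3. Repeated-measures ANOVA (one factor) ******** to be completed
A repeated-measures design measures the same dependent variable repeatedly. The repeated measurements may be taken under the same condition or under different conditions.
The code is the same as for multi-factor ANOVA; only the reasoning differs. But I still have not found suitable data for multi-factor ANOVA, so I will skip this for now.
4. Mixed-design ANOVA ******** to be completed
######### If you are good at statistics, please teach me.
(3) Chi-square test
The chi-square test measures how far the observed values of a sample deviate from the theoretically expected values. That deviation determines the size of the chi-square value: the larger the value, the worse the fit; the smaller the value, the smaller the deviation and the better the fit. If the two sets of values are exactly equal, the chi-square value is 0, meaning the theoretical values fit perfectly. (from Baidu Baike)
1. One-way chi-square test
The data comes from the web: expected and observed counts of male and female students who wear makeup.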
import numpy as np
from scipy import stats
from scipy.stats import chisquare
observed = np.array([15,95])  # observed: of 110 students wearing makeup, 95 are female and 15 are male
expected = np.array([55,55])  # expected: of 110 students wearing makeup, 55 are female and 55 are male
chisquare(observed,expected)
# Output
(58.18181818181818, 2.389775628860044e-14)
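The statistic itself is simple enough to check by hand: sum (observed - expected)^2 / expected over the categories. A short NumPy sketch with the same counts:

```python
import numpy as np

observed = np.array([15, 95], dtype=float)
expected = np.array([55, 55], dtype=float)

# Pearson chi-square statistic: squared deviations scaled by the expected counts.
chi2 = ((observed - expected) ** 2 / expected).sum()
dof = len(observed) - 1  # degrees of freedom for a one-way test

print(chi2, dof)
```

Each category contributes 40^2 / 55, so the total is 3200 / 55, the first value in the scipy output above.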
2. Multi-factor chi-square test ***** still studying this; I will fill in this part once I have learned it
(4) Counting (using the tips.csv data)
# Example: count the rows by sex
count = df['sex'].value_counts()
print(count)
# Output
Male 157
Female 87
Name: sex, dtype: int64
(5) Regression analysis ***** to be learned: data fitting, generalized linear regression, and so on
VI. Visualization
My first thought was: if Excel can do this, why make it so complicated? Excel really is general and convenient, but it probably cannot cope with very large amounts of data. Learning to plot in Python is not a bad idea either, and honestly, some of the results look really good.
(1) Seaborn
I got into data visualization by learning Seaborn, a Python visualization library built on matplotlib. Starting with matplotlib directly can be hard going, since it has many parameters that are difficult to understand, but you get there with time. Another key point: the figures seaborn draws look great.
# basic imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
# the tips data is really nice, so it serves as the example here
tips = sns.load_dataset('tips')  # load the tips dataset over the network
1. The lmplot function
lmplot(x, y, data, hue=None, col=None, row=None, palette=None, col_wrap=None,
       size=5, aspect=1, markers='o', sharex=True, sharey=True, hue_order=None,
       col_order=None, row_order=None, legend=True, legend_out=True,
       x_estimator=None, x_bins=None, x_ci='ci', scatter=True, fit_reg=True,
       ci=95, n_boot=1000, units=None, order=1, logistic=False, lowess=False,
       robust=False, logx=False, x_partial=None, y_partial=None, truncate=False,
       x_jitter=None, y_jitter=None, scatter_kws=None, line_kws=None)
Purpose: Plot data and regression model fits across a FacetGrid.
The examples below explain lmplot's parameters one at a time.
Example 1. Regression plot of total bill against tip
This uses lmplot(x, y, data, scatter_kws).
x, y and data are self-explanatory, so I won't go into them. The official description of scatter_kws and line_kws is:
{scatter,line}_kws : dictionaries
Additional keyword arguments to pass to plt.scatter and plt.plot.
scatter refers to the points and line to the line. In other words, a dictionary sets the properties of the points and of the line. In this example the points are slate gray and the line is Indian red, so points and line are told apart by color, which reads well. You can swap in whatever plot properties you like.
sns.lmplot(x="total_bill", y="tip", data=tips,
           scatter_kws={"marker": ".", "color": "slategray"},
           line_kws={"linewidth": 1, "color": "indianred"}).savefig('picture2')
Also: colors can be given as RGB codes. See this site for a reference table and mix your own colors:
http://www.114la.com/other/rgb.htm
marker also comes in many styles, as follows:
. Point marker
, Pixel marker
o Circle marker
v Triangle down marker
^ Triangle up marker
< Triangle left marker
> Triangle right marker
1 Tripod down marker
2 Tripod up marker
3 Tripod left marker
4 Tripod right marker
s Square marker
p Pentagon marker
* Star marker
h Hexagon marker
H Rotated hexagon marker
D Diamond marker
d Thin diamond marker
| Vertical line (vline symbol) marker
_ Horizontal line (hline symbol) marker
+ Plus marker
x Cross (x) marker
sns.lmplot(x="total_bill", y="tip", data=tips,
           scatter_kws={"marker": ".", "color": "#FF7F00"},
           line_kws={"linewidth": 1, "color": "#BF3EFF"}).savefig('s1')
P.S. My attempt to change the marker property did not take effect and I don't know why; answers welcome.
 Àý×Ó2.ÓòÍÈËÊý(size)ºÍС·Ñ(tip)µÄ¹ØÏµÍ¼
¹Ù·½½âÊÍ£º
x_estimator : callable that maps vector -> scalar,
optional
Apply this function to each unique value of x and
plot the resulting estimate. This is useful when x
is a discrete variable. If x_ci is not None, this
estimate will be bootstrapped and a confidence interval
will be drawn.
´ó¸Å½âÊ;ÍÊÇ£º¶ÔÓµÓÐÏàͬxˮƽµÄyÖµ½øÐÐÓ³Éä
plt.figure()
sns.lmplot ('size', 'tip', tips, x_estimator =
np .mean ). savefig('picture3') |
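What x_estimator=np.mean computes for the points can be imitated with a pandas groupby. This is only a sketch of the idea on made-up numbers (tips_demo and estimator are illustrative names), not seaborn's actual code path, which also bootstraps a confidence interval.

```python
import numpy as np
import pandas as pd

tips_demo = pd.DataFrame({
    'size': [2, 2, 3, 3, 3, 4],
    'tip': [1.0, 3.0, 2.0, 3.0, 4.0, 5.0],
})

estimator = np.mean  # any callable that maps a vector to a scalar

# One estimate per distinct x value: here, the mean tip at each party size.
estimates = tips_demo.groupby('size')['tip'].apply(estimator)
print(estimates)
```

The plot then shows one point per party size instead of every individual tip.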

{x,y}_jitter : floats, optional
Add uniform random noise of this size to either
the x or y variables. The noise is added to a copy
of the data after fitting the regression, and only
influences the look of the scatterplot. This can be
helpful when plotting variables that take discrete
values.
jitterÊǸöºÜÓÐÒâ˼µÄ²ÎÊý, ÌØ±ðÊÇ´¦Àí°ÐÊý¾ÝµÄoverlapping¹ýÓÚÑÏÖØµÄÇé¿öʱ,
ͨ¹ýÔö¼ÓÒ»¶¨³Ì¶ÈµÄÔëÉù(noise)ʵÏÖÊý¾ÝµÄÇø¸ô»¯, ÕâÑùÔʼÊý¾ÝÊÇÈô¸É µã´Ø ±ä³ÉһϵÁÐÃܼ¯ÁÚ½üµÄµãȺ.
ÁíÍâ, ÓеÄÈ˻ᾳ£½« rug Óë jitter ½áºÏʹÓÃ. ÕâÒÀÈ˰É.¶ÔÓÚºáÖáÈ¡ÀëɢˮƽµÄʱºò,
ÓÃx_jitter ¿ÉÒÔÈÃÊý¾Ýµã·¢ÉúˮƽµÄÈŶ¯.µ«ÈŶ¯µÄ·ù¶È²»Ò˹ý´ó¡£
sns.lmplot('size',
'tip', tips, x_jitter= .15). savefig ('picture4') |
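The effect of x_jitter=.15 can be pictured as adding a small uniform offset to each x value before drawing. The sketch below assumes the noise is drawn from the interval [-0.15, 0.15], which is my reading of the description quoted above rather than something checked against seaborn's source; the sample values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

sizes = np.array([2, 2, 2, 3, 3, 4], dtype=float)  # discrete x values that would overlap
x_jitter = 0.15

# Uniform noise in [-x_jitter, x_jitter], one draw per point.
noise = rng.uniform(-x_jitter, x_jitter, size=sizes.shape)
jittered = sizes + noise  # positions used for drawing; `sizes` itself is unchanged

print(jittered)
```

Because the offsets stay well below the spacing between levels (1 here), points from different party sizes cannot be confused with each other.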

seaborn»¹¿ÉÒÔ×ö³öxkcd·ç¸ñµÄͼƬ£¬»¹Í¦ÓÐÒâ˼µÄ
with plt.xkcd():
sns.color_ palette('husl', 8)
sns.set_ context('paper')
sns.lmplot (x='total_bill', y='tip', data= tips,
ci= 65).savefig ('picture1') |

with plt.xkcd():
sns.lmplot('total_ bill', 'tip', data =tips, hue=
'day ')
plt.xlabel('hue = day')
plt.savefig('picture5') |

with plt.xkcd():
sns.lmplot ('total_bill', 'tip', data=tips, hue=
'smoker')
plt.xlabel('hue = smoker')
plt.savefig('picture6') |

sns.set_style('dark')
sns.set_context('talk')
sns.lmplot(x='size', y='total_bill', data=tips, order=2)
plt.title('# poly order = 2')
plt.savefig('picture7')
plt.figure()
sns.lmplot(x='size', y='total_bill', data=tips, order=3)
plt.title('# poly order = 3')
plt.savefig('picture8')
sns.jointplot(x="total_bill", y="tip", data=tips).savefig('picture9')
(2) matplotlib ******** to be completed
VII. Miscellaneous
(1) Calling R
To call R functions directly from Python, just download and install the rpy2 module.
Detailed steps: http://www.geome.cn/posts/python-%E9%80%9A%E8%BF%87rpy2%E8%B0%83%E7%94%A8-r%E8%AF%AD%E8%A8%80/