Editor's note:
This article comes from e-learn. It introduces regularization terms and the two forms they take, ridge regression and Lasso regression, and compares the two.

Main text
Adding a regularization term means adding a penalty to the loss function. There are two kinds of regularization term: the L1 term and the L2 term. A regression model with the L2 term is called ridge regression, and one with the L1 term is called Lasso regression.
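1. Ridge regression
Quoting the Baidu Baike definition:
Ridge regression (also known as Tikhonov regularization) is a biased-estimation regression method designed for analyzing collinear data. It is essentially an improved least squares estimator: by giving up the unbiasedness of ordinary least squares, and at the cost of losing some information and reducing precision, it obtains regression coefficients that are more realistic and more reliable. It fits ill-conditioned data better than ordinary least squares.
As the definition shows, ridge regression is a modified least squares method with a biased estimator: it adds a regularization term, also called a penalty term (the squared L2 norm), to the loss function. The ridge regression loss function can then be written as

$$J(\beta) = \sum_{i=1}^{m} \Big( y_i - \beta_0 - \sum_{j=1}^{n} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{n} \beta_j^2$$

where m is the number of samples, n is the number of features, and $\lambda$ is the penalty parameter ($\lambda > 0$). The penalty term is added mainly to keep the model parameters from becoming too large. As $\lambda$ tends to infinity, the corresponding $\beta_j$ tend to 0, where $\beta_j$ is the change in the dependent variable when one independent variable changes by one unit (with all other independent variables held fixed).
At that point, collinearity among the independent variables has almost no effect on the dependent variable, so ridge regression can effectively mitigate multicollinearity among the independent variables and also helps prevent overfitting.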
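The shrinkage effect of the penalty can be checked numerically. The following is a minimal sketch, not the article's code: it assumes synthetic data with two nearly collinear features, and `ridge_closed_form` is a hypothetical helper that solves the ridge problem in closed form while leaving the intercept unpenalized.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimize sum((y - b0 - X b)^2) + lam * sum(b^2); the intercept b0 is not penalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean              # centering takes the intercept out of the problem
    n_features = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(n_features), Xc.T @ yc)
    intercept = y_mean - x_mean @ beta
    return beta, intercept

# Synthetic data (an assumption for illustration): x2 is almost a copy of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.01 * rng.normal(size=200)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=200)

norms = []
for lam in [0.0, 1.0, 100.0, 1e6]:
    beta, _ = ridge_closed_form(X, y, lam)
    norms.append(float(np.linalg.norm(beta)))
    print(lam, beta)
```

With lam = 0 this is ordinary least squares, and the near-collinearity tends to make the two coefficients unstable. As lam grows, the length of the coefficient vector can only decrease, and for very large lam it is close to 0.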
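2. Lasso regression
The ridge regularization term sums the squares of the $\beta_j$. If we can square them, we can equally take absolute values, and the L1 norm in Lasso regression does exactly that, so its loss function can be written as

$$J(\beta) = \sum_{i=1}^{m} \Big( y_i - \beta_0 - \sum_{j=1}^{n} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{n} |\beta_j|$$

With only two independent variables, the L1 norm corresponds in the plane to a diamond (a square rotated 45°) whose vertices all lie on the coordinate axes, so one of the regression coefficients is 0 at each vertex. The concentric ellipses (contour lines of the squared-error loss) are more likely to touch such a shape at a vertex, and the point of contact is the optimal solution that minimizes the loss.
In other words, Lasso can produce regression coefficients that are exactly 0. For the L2 norm the shape is a circle, which has no corners, so the point of contact generally does not lie on a coordinate axis
and the coefficients are generally not exactly 0. The same reasoning applies with more independent variables.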
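The geometric argument has a simple algebraic counterpart in one special case. As a sketch under a strong assumption (an orthonormal design, which the bike data does not satisfy), minimizing the squared error plus lam * |beta| gives the least squares coefficient soft-thresholded by lam / 2, while the squared penalty lam * beta^2 only rescales it by 1 / (1 + lam). The names `soft_threshold` and `ridge_shrink` and the sample coefficients are illustrative only.

```python
import numpy as np

def soft_threshold(z, t):
    """Move z toward 0 by t and clip at 0: the L1 solution for an orthonormal design."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ridge_shrink(z, lam):
    """The L2 solution for an orthonormal design: a pure rescaling, never exactly 0."""
    return z / (1.0 + lam)

ols = np.array([3.0, -0.4, 0.2, -2.0])   # hypothetical least squares coefficients
lam = 1.0

lasso = soft_threshold(ols, lam / 2.0)   # small coefficients become exactly 0
ridge = ridge_shrink(ols, lam)           # every coefficient shrinks but stays non-zero

print('lasso:', lasso)
print('ridge:', ridge)
```

The two small coefficients fall inside the threshold and are set to 0 by the L1 rule, while the L2 rule keeps all four.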
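3. Ridge regression versus Lasso regression
Similarities:
1. Ridge regression and Lasso regression are both linear regression models with a regularization term added; in essence both are linear regression models.
2. Both can mitigate multicollinearity to some extent and help avoid overfitting.
3. In both, the regression coefficients depend on the regularization parameter, and the relationship between the coefficients and the regularization parameter can be plotted.
The plot can then be used to select variables and the regularization parameter.
Differences:
1. Ridge regression coefficients are generally all non-zero, while Lasso regression sets some coefficients exactly to 0.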
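4. Practical case study
1. Data source and background
Data source: https://www.kaggle.com/c/bike-sharing-demand/data
The data consists of a training set and a test set. The training set contains 10886 samples and 12 fields. The task is to use the bike rental records in the training set to predict bike rental demand in Washington, D.C., USA.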
2. Data overview
1. Read the data
import pandas as pd
df = pd.read_csv(r'D:\Data\bike.csv')
pd.set_option('display.max_rows', 4)
df

The output shows that the data has 10886 rows and 12 columns. Every column except the first is displayed as a number.
The exact types still need to be checked, and the meaning of each column is explained in the next step.
2. View overall data information
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
datetime      10886 non-null object     # date and time
season        10886 non-null int64      # season: 1 = spring, 2 = summer, 3 = fall, 4 = winter
holiday       10886 non-null int64      # holiday: 1 = yes, 0 = no
workingday    10886 non-null int64      # working day: 1 = yes, 0 = no
weather       10886 non-null int64      # weather: 1: clear, few clouds, partly cloudy; 2: mist + cloudy, mist + broken clouds, mist + few clouds, mist; 3: light snow, light rain + thunderstorm + scattered clouds, light rain + scattered clouds; 4: heavy rain + ice pellets + thunderstorm + mist, snow + fog
temp          10886 non-null float64    # temperature
atemp         10886 non-null float64    # "feels like" temperature
humidity      10886 non-null int64      # relative humidity
windspeed     10886 non-null float64    # wind speed
casual        10886 non-null int64      # rentals by unregistered users
registered    10886 non-null int64      # rentals by registered users
count         10886 non-null int64      # total rentals by all users
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.6+ KB
Apart from datetime, which is a string, all columns are numeric, and there are no missing values.
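3. Descriptive statistics
Temperature, "feels like" temperature, relative humidity and wind speed are all approximately symmetric, while the unregistered-user,
registered-user and total rental counts are all right-skewed.

4. Skewness and kurtosis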
for i in range(5, 12):
    name = df.columns[i]
    print('{0} skewness: {1}, kurtosis: {2}'.format(name, df[name].skew(), df[name].kurt()))
temp skewness: 0.003690844422472008, kurtosis: -0.9145302637630794
atemp skewness: -0.10255951346908665, kurtosis: -0.8500756471754651
humidity skewness: -0.08633518364548581, kurtosis: -0.7598175375208864
windspeed skewness: 0.5887665265853944, kurtosis: 0.6301328693364932
casual skewness: 2.4957483979812567, kurtosis: 7.551629305632764
registered skewness: 1.5248045868182296, kurtosis: 2.6260809999210672
count skewness: 1.2420662117180776, kurtosis: 1.3000929518398334
temp, atemp and humidity have low skewness, windspeed has moderate skewness, and casual, registered and count are highly skewed.
temp, atemp and humidity are flat-topped (platykurtic), while windspeed, casual, registered and count are peaked (leptokurtic).
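The labels above can be reproduced from the printed numbers. This is a sketch under an assumption: it uses the common rule of thumb (absolute skewness below 0.5 is low, 0.5 to 1 is moderate, above 1 is high) and reads the kurtosis values as excess kurtosis, so positive means peaked. The article does not state its thresholds, and `describe_shape` is a hypothetical helper.

```python
def describe_shape(skew, excess_kurt):
    """Label a distribution from its skewness and excess kurtosis (rule-of-thumb thresholds)."""
    magnitude = abs(skew)
    if magnitude < 0.5:
        level = 'low'
    elif magnitude <= 1.0:
        level = 'moderate'
    else:
        level = 'high'
    peak = 'leptokurtic' if excess_kurt > 0 else 'platykurtic'
    return level, peak

# (skewness, kurtosis) pairs copied from the output above, rounded.
reported = {
    'temp': (0.00369, -0.91453),
    'atemp': (-0.10256, -0.85008),
    'humidity': (-0.08634, -0.75982),
    'windspeed': (0.58877, 0.63013),
    'casual': (2.49575, 7.55163),
    'registered': (1.52480, 2.62608),
    'count': (1.24207, 1.30009),
}

for name, (skew, kurt) in reported.items():
    print(name, describe_shape(skew, kurt))
```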
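3. Data preprocessing
There are no missing values to handle, so we check for duplicates next.
1. Check for duplicates
print('Before deduplication: ', df.shape)
print('After deduplication: ', df.drop_duplicates().shape)
Before deduplication:  (10886, 12)
After deduplication:  (10886, 12)
There are no duplicates, so we look at outliers next.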
2. Outliers
We use box plots to inspect outliers.
import seaborn as sns
import matplotlib.pyplot as plt
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 6))
# Draw the box plots
sns.boxplot(x='windspeed', data=df, ax=axes[0][0])
sns.boxplot(x='casual', data=df, ax=axes[0][1])
sns.boxplot(x='registered', data=df, ax=axes[1][0])
sns.boxplot(x='count', data=df, ax=axes[1][1])
plt.show()
Rental counts depend on the hour of day (rush hours, for example), so we leave the outliers untouched for now.
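For reference, the points a box plot draws beyond its whiskers follow the 1.5 × IQR rule by default. The sketch below applies that rule to a small made-up sample; `iqr_outliers` is a hypothetical helper, and the numbers are not taken from the bike data.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (values < lower) | (values > upper)
    return mask, (float(lower), float(upper))

sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 40]   # made-up data with one extreme value
mask, bounds = iqr_outliers(sample)
print('bounds:', bounds)
print('flagged:', [v for v, is_out in zip(sample, mask) if is_out])
```

A rule like this is purely statistical. It cannot tell a data error from a genuine rush-hour peak, which is the reason given above for not removing the flagged points.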

3. Data processing
# Convert the format and extract the hour, day of week, and month
df['datetime'] = pd.to_datetime(df['datetime'])
df['hour'] = df.datetime.dt.hour
df['week'] = df.datetime.dt.dayofweek
df['month'] = df.datetime.dt.month
df['year_month'] = df.datetime.dt.strftime('%Y-%m')
df['date'] = df.datetime.dt.date
# Drop datetime
df.drop('datetime', axis=1, inplace=True)
df
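To make the derived columns concrete, here is a sketch using only the standard library on a single timestamp. The timestamp string and its format are assumptions for illustration; Python's `weekday()` uses the same convention as the pandas `dayofweek` accessor (Monday is 0, Sunday is 6).

```python
from datetime import datetime

# A sample timestamp in an assumed 'YYYY-MM-DD HH:MM:SS' format.
ts = datetime.strptime('2012-01-09 18:00:00', '%Y-%m-%d %H:%M:%S')

features = {
    'hour': ts.hour,                       # hour of day, 0 to 23
    'week': ts.weekday(),                  # day of week, Monday = 0
    'month': ts.month,                     # month, 1 to 12
    'year_month': ts.strftime('%Y-%m'),    # string key used for monthly grouping
    'date': ts.date(),                     # calendar date used for daily grouping
}
print(features)
```

Note that `year_month` is a string and `date` is a date object. Both are convenient for grouping and plotting but are not numeric features.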

4. Feature analysis
1) Date and total rentals
import matplotlib
# Set a font that can render Chinese labels (only needed if the labels contain Chinese text)
font = {'family': 'SimHei'}
matplotlib.rc('font', **font)
# Compute the median rentals per day and per month
group_date = df.groupby('date')['count'].median()
group_month = df.groupby('year_month')['count'].median()
group_month.index = pd.to_datetime(group_month.index)
plt.figure(figsize=(16, 5))
plt.plot(group_date.index, group_date.values, '-', color='b', label='Daily median rentals', alpha=0.8)
plt.plot(group_month.index, group_month.values, '-o', color='orange', label='Monthly median rentals')
plt.legend()
plt.show()
Rentals in 2012 increased compared with 2011, and the fluctuation pattern is similar in both years.

2) Ô·ݺÍ×Ü×âÁÞÊýÁ¿
import seaborn
as sns
plt.figure(figsize=(10, 4))
sns.boxplot(x='month', y='count', data=df)
plt.show() |
ÓëÉÏͼµÄ²¨¶¯·ù¶È»ù±¾Ò»ÖÂ, ÁíÍâÿ¸öÔ¾ùÓв»Í¬³Ì¶ÈµÄÀëȺֵ.

3) Season and total rentals
plt.figure(figsize=(8, 4))
sns.boxplot(x='season', y='count', data=df)
plt.show()
By median, fall has the most rentals and spring the fewest; spring also has more outliers.

4) Day of week and rentals
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(12, 8))
sns.boxplot(x='week', y='casual', data=df, ax=axes[0])
sns.boxplot(x='week', y='registered', data=df, ax=axes[1])
sns.boxplot(x='week', y='count', data=df, ax=axes[2])
plt.show()
By median, unregistered users rent more on Saturday and Sunday, while registered users rent more on weekdays.
The total is correspondingly higher on weekdays, and the weekday totals also have more outliers (0 is Monday, 6 is Sunday).

5) Holidays, working days and total rentals
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(9, 7))
sns.boxplot(x='holiday', y='casual', data=df, ax=axes[0][0])
sns.boxplot(x='holiday', y='registered', data=df, ax=axes[1][0])
sns.boxplot(x='holiday', y='count', data=df, ax=axes[2][0])
sns.boxplot(x='workingday', y='casual', data=df, ax=axes[0][1])
sns.boxplot(x='workingday', y='registered', data=df, ax=axes[1][1])
sns.boxplot(x='workingday', y='count', data=df, ax=axes[2][1])
plt.show()
Unregistered users: more rentals on holidays, fewer on working days.
Registered users: fewer rentals on holidays, more on working days.
Overall, rentals are lower on holidays and higher on working days. A first guess is that most unregistered users rent bikes for outings on non-working days,
while most registered users rent them on working days to commute to work or school.

6) Hour of day and total rentals
# First subplot: holidays
plt.figure(1, figsize=(14, 8))
plt.subplot(221)
hour_casual = df[df.holiday==1].groupby('hour')['casual'].median()
hour_registered = df[df.holiday==1].groupby('hour')['registered'].median()
hour_count = df[df.holiday==1].groupby('hour')['count'].median()
plt.plot(hour_casual.index, hour_casual.values, '-', color='r', label='Unregistered users')
plt.plot(hour_registered.index, hour_registered.values, '-', color='g', label='Registered users')
plt.plot(hour_count.index, hour_count.values, '-o', color='c', label='All users')
plt.legend()
plt.xticks(hour_casual.index)
plt.title('Rentals by unregistered and registered users on holidays')
# Second subplot: working days
plt.subplot(222)
hour_casual = df[df.workingday==1].groupby('hour')['casual'].median()
hour_registered = df[df.workingday==1].groupby('hour')['registered'].median()
hour_count = df[df.workingday==1].groupby('hour')['count'].median()
plt.plot(hour_casual.index, hour_casual.values, '-', color='r', label='Unregistered users')
plt.plot(hour_registered.index, hour_registered.values, '-', color='g', label='Registered users')
plt.plot(hour_count.index, hour_count.values, '-o', color='c', label='All users')
plt.legend()
plt.title('Rentals by unregistered and registered users on working days')
plt.xticks(hour_casual.index)
# Third subplot: all days
plt.subplot(212)
hour_casual = df.groupby('hour')['casual'].median()
hour_registered = df.groupby('hour')['registered'].median()
hour_count = df.groupby('hour')['count'].median()
plt.plot(hour_casual.index, hour_casual.values, '-', color='r', label='Unregistered users')
plt.plot(hour_registered.index, hour_registered.values, '-', color='g', label='Registered users')
plt.plot(hour_count.index, hour_count.values, '-o', color='c', label='All users')
plt.legend()
plt.title('Rentals by unregistered and registered users')
plt.xticks(hour_casual.index)
plt.show()
On holidays, unregistered and registered users follow similar trends, but unregistered users peak at 14:00 while registered users peak at 17:00.
On working days, registered users show a double peak, at 8:00 and 17:00, which matches the rush hours for commuting to work or school.
For registered users, 17:00 is a peak on both holidays and working days, which suggests that some users may not actually be off work on holidays.

7) Weather and total rentals
fig, ax = plt.subplots(3, 1, figsize=(12, 6))
sns.boxplot(x='weather', y='casual', hue='workingday', data=df, ax=ax[0])
sns.boxplot(x='weather', y='registered', hue='workingday', data=df, ax=ax[1])
sns.boxplot(x='weather', y='count', hue='workingday', data=df, ax=ax[2])
By median, unregistered and registered users show the same pattern: on both working and non-working days, rentals fall as the weather gets worse.
In particular, in heavy rain or snow (category 4) there are no rentals at all on non-working days.

The plot shows only one data point for heavy rain or snow, so let's look at the raw data.
Only 18:00 on January 9, 2012 was recorded as heavy rain or snow, which suggests the weather changed suddenly.
Some users may have rented a bike because they had not checked the forecast, though there could be other reasons.

Also, notice that January is labeled as spring. Let's look at how the seasons are assigned.
sns.boxplot(x='season', y='month', data=df)
Months 1 to 3 are spring, 4 to 6 are summer, 7 to 9 are fall...

Season boundaries usually depend on latitude. This dataset is used to predict rentals in Washington, D.C.,
and the United States lies at roughly the same latitudes as China, so we regroup the months with the rule: 3 to 5 spring, 6 to 8 summer, and so on.
import numpy as np
df['group_season'] = np.where((df.month <= 5) & (df.month >= 3), 1,
                     np.where((df.month <= 8) & (df.month >= 6), 2,
                     np.where((df.month <= 11) & (df.month >= 9), 3, 4)))
fig, ax = plt.subplots(2, 1, figsize=(12, 6))
# Box plots of temperature by season, before and after regrouping
sns.boxplot(x='season', y='temp', data=df, ax=ax[0])
sns.boxplot(x='group_season', y='temp', data=df, ax=ax[1])
The first plot is before the adjustment: by median, spring is the coldest and fall the warmest.
The second plot is after the adjustment: by median, winter is the coldest and summer the warmest.

The second plot is clearly more sensible, so we drop the original season column.
df.drop('season', axis=1, inplace=True)
df.shape
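The regrouping rule can also be written as a small lookup, which is easy to check month by month. This is an alternative sketch rather than the article's implementation; `meteorological_season` is a hypothetical helper that uses the same codes as `group_season` (1 spring, 2 summer, 3 fall, 4 winter).

```python
def meteorological_season(month):
    """Map a month number (1 to 12) to 1 = spring, 2 = summer, 3 = fall, 4 = winter."""
    if not 1 <= month <= 12:
        raise ValueError('month must be between 1 and 12')
    # month % 12 sends December to 0, so Dec/Jan/Feb share block 0, Mar to May block 1, and so on.
    block = (month % 12) // 3
    return {0: 4, 1: 1, 2: 2, 3: 3}[block]

mapping = {month: meteorological_season(month) for month in range(1, 13)}
print(mapping)
```

On a DataFrame, a function like this could be applied with `df['month'].map(meteorological_season)`; the nested `np.where` above gives the same assignment.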
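8) Other variables and total rentals
Here I use seaborn's pairplot to draw, in a single figure, how the four remaining continuous variables (temperature,
"feels like" temperature, relative humidity and wind speed) relate to the unregistered and registered user counts.
sns.pairplot(df[['temp', 'atemp', 'humidity', 'windspeed', 'casual', 'registered', 'count']])
To give an overview, I shrank the figure, shown below. From top to bottom the vertical axis is temperature,
"feels like" temperature, relative humidity, wind speed, unregistered users, registered users and all users; the horizontal axis follows the same order from left to right.

The figure shows that temperature and "feels like" temperature are each positively correlated, to some degree, with unregistered, registered and all users,
while relative humidity and wind speed are negatively correlated with them to some degree. Other pairs of variables are also correlated to varying degrees.
In addition, the fourth column (wind speed) has a clear gap in the middle of its scatter plots, so we pull out that column to take a closer look.
0 0.0000
1 0.0000
2 0.0000
...
10883 15.0013
10884 6.0032
10885 8.9981
Name: windspeed, Length: 10886, dtype: float64
A wind speed of 0 is clearly unreasonable, so we treat it as a missing value. Here I chose backward fill.
df.loc[df.windspeed == 0, 'windspeed'] = np.nan
# Backward fill; df.bfill() is equivalent to the older fillna(method='bfill')
df.bfill(inplace=True)
df.windspeed.isnull().sum()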
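What backward fill does to the zero readings can be seen on a short list. This is a pure-Python sketch with made-up values; `backfill_zeros` is a hypothetical helper that treats 0 as missing and copies the next non-zero value backward.

```python
def backfill_zeros(values):
    """Replace each 0 with the next non-zero value that follows it (None if there is none)."""
    filled = list(values)
    next_valid = None
    # Walk from the end so the most recent non-zero value seen is the one that follows each 0.
    for i in range(len(filled) - 1, -1, -1):
        if filled[i] == 0:
            filled[i] = next_valid
        else:
            next_valid = filled[i]
    return filled

readings = [0, 0, 6.0, 0, 9.0, 0]   # made-up wind speeds
print(backfill_zeros(readings))
```

Because the series printed above starts with zeros, a forward fill would leave those first rows empty, whereas a backward fill can fill them. Zeros at the very end have nothing after them and stay missing.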
9) Correlation matrix
Several variables are not normally distributed, so we apply a log transform to them.
# Log transform: log(x + 1)
for col in ['windspeed', 'casual', 'registered', 'count']:
    df[col] = np.log1p(df[col])
sns.pairplot(df[['windspeed', 'casual', 'registered', 'count']])

After the log transform, the registered-user and all-user rental counts are still far from normal,
so we use the Spearman correlation coefficient when computing correlations.
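Spearman correlation is Pearson correlation computed on ranks, which is why it does not rely on normality or linearity. The sketch below shows this on synthetic data; the helpers `rank`, `pearson` and `spearman` are hypothetical, and `rank` assumes there are no tied values.

```python
import numpy as np

def rank(values):
    """Ranks 1..n of the values (assumes no ties)."""
    order = np.argsort(values)
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks

def pearson(a, b):
    return float(np.corrcoef(a, b)[0, 1])

def spearman(a, b):
    """Pearson correlation of the ranks."""
    return pearson(rank(a), rank(b))

# A monotonic but strongly non-linear relationship.
x = np.arange(1.0, 11.0)
y = np.exp(x)

print('pearson :', pearson(x, y))
print('spearman:', spearman(x, y))
```

Because y always increases with x, the ranks agree perfectly and the Spearman coefficient is 1, while the Pearson coefficient is noticeably lower.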
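# Keep only numeric columns so that the helper columns year_month and date are excluded
correlation = df.select_dtypes(include='number').corr(method='spearman')
plt.figure(figsize=(12, 8))
# Draw the heat map
sns.heatmap(correlation, linewidths=0.2, vmax=1, vmin=-1, linecolor='w',
            annot=True, annot_kws={'size': 8}, square=True)
All pairs are correlated to some degree. In particular, temp and atemp are highly correlated,
as are count and registered; both coefficients reach 0.99.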

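5. Regression models
Ridge regression and Lasso regression are linear regression with a regularization term added. We build both models below.
5.1 Ridge regression
1. Split the dataset
from sklearn.model_selection import train_test_split
# The total count is the sum of the unregistered and registered counts, so drop those two columns.
# year_month and date are non-numeric helper columns created for the plots; dropping them
# leaves 11 features, which matches the 11 coefficients printed below.
df.drop(['casual', 'registered', 'year_month', 'date'], axis=1, inplace=True)
X = df.drop(['count'], axis=1)
y = df['count']
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)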
2. Model training
from sklearn.linear_model import Ridge
# alpha is the regularization parameter; start with 1.
rd = Ridge(alpha=1)
rd.fit(X_train, y_train)
print(rd.coef_)
print(rd.intercept_)
[ 0.00770067 -0.00034301 0.0039196 0.00818243 0.03635549 -0.01558927
0.09080788 0.0971406 0.02791812 0.06114358 -0.00099811]
2.6840271343740754
As discussed earlier, the regularization parameter has a large effect on the result, so the next step is to choose it with a ridge trace plot.
# Set up the parameter grid and train a model for each value
alphas = 10**np.linspace(-5, 10, 500)
betas = []
for alpha in alphas:
    rd = Ridge(alpha=alpha)
    rd.fit(X_train, y_train)
    betas.append(rd.coef_)
# Draw the ridge trace
plt.figure(figsize=(8, 6))
plt.plot(alphas, betas)
# Use a logarithmic x-axis to make the plot easier to read
plt.xscale('log')
# Add grid lines
plt.grid(True)
# Fit the axes tightly to the data
plt.axis('tight')
plt.title(r'Ridge trace: regularization parameter $\alpha$ vs. regression coefficients $\beta$')
plt.xlabel(r'$\alpha$')
plt.ylabel(r'$\beta$')
plt.show()

The plot shows that the ridge traces of all variables level off when alpha is about 10^7, so the ridge trace method suggests alpha = 10^7.
Because this is judged by eye, it is not necessarily the best value, so we use another approach: ridge regression with cross-validation.
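The idea behind choosing alpha by cross-validation can be sketched in plain NumPy. This is a simplified illustration on synthetic data, not a description of how scikit-learn implements it: the helpers `fit_ridge`, `r2` and `cv_r2`, the shuffled 5-fold split and the alpha grid are all assumptions.

```python
import numpy as np

def fit_ridge(X, y, alpha):
    """Closed-form ridge fit with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    beta = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return beta, y_mean - x_mean @ beta

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def cv_r2(X, y, alpha, k=5, seed=0):
    """Mean validation R^2 of ridge regression over k shuffled folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    scores = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)              # every row that is not in the held-out fold
        beta, intercept = fit_ridge(X[train], y[train], alpha)
        scores.append(r2(y[fold], X[fold] @ beta + intercept))
    return float(np.mean(scores))

# Synthetic regression problem (an assumption for illustration).
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=1.0, size=120)

alphas = 10.0 ** np.linspace(-3, 3, 13)
scores = [cv_r2(X, y, a) for a in alphas]
best_alpha = float(alphas[int(np.argmax(scores))])
print('best alpha:', best_alpha, 'mean validation R^2:', max(scores))
```

Each candidate alpha is scored only on data it was not fitted on, so the selected value reflects out-of-sample fit rather than how flat the coefficient traces look.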
from sklearn.linear_model import RidgeCV
from sklearn import metrics
rd_cv = RidgeCV(alphas=alphas, cv=10, scoring='r2')
rd_cv.fit(X_train, y_train)
rd_cv.alpha_
The best regularization parameter selected is 805.03. We then train the model with this value.
rd = Ridge(alpha=805.0291812295973)  # , fit_intercept=False
rd.fit(X_train, y_train)
print(rd.coef_)
print(rd.intercept_)
[ 0.00074612 -0.00382265 0.00532093 0.01100823 0.03375475 -0.01582157
0.0584206 0.09708992 0.02639369 0.0604242 -0.00116086]
2.7977274604845856
4. Model prediction
from sklearn import metrics
from math import sqrt
# Predict on the training data and the test data
y_train_pred = rd.predict(X_train)
y_test_pred = rd.predict(X_test)
# Compute the root mean squared error and the goodness of fit (R^2) for each
y_train_rmse = sqrt(metrics.mean_squared_error(y_train, y_train_pred))
y_train_score = rd.score(X_train, y_train)
y_test_rmse = sqrt(metrics.mean_squared_error(y_test, y_test_pred))
y_test_score = rd.score(X_test, y_test)
print('Training set RMSE: {0}, score: {1}'.format(y_train_rmse, y_train_score))
print('Test set RMSE: {0}, score: {1}'.format(y_test_rmse, y_test_score))
Training set RMSE: 1.0348076524200298, score: 0.46691272323469246
Test set RMSE: 1.0508046977499312, score: 0.45801571689420706
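For readers who want to see what the two metrics compute, here is a pure-Python sketch on four made-up values. `rmse` and `r2` are hypothetical helpers; `r2` follows the usual coefficient of determination, 1 - SS_res / SS_tot.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error."""
    n = len(y_true)
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]   # made-up targets
y_pred = [1.0, 2.0, 3.0, 6.0]   # made-up predictions with one large miss
print(rmse(y_true, y_pred), r2(y_true, y_pred))
```

Since count was log-transformed before the split, the RMSE values reported above are in log(count + 1) units, not in numbers of bikes.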
5.2 Lasso regression
1. Model training
from sklearn.linear_model import Lasso
alphas = 10**np.linspace(-5, 10, 500)
betas = []
for alpha in alphas:
    Las = Lasso(alpha=alpha)
    Las.fit(X_train, y_train)
    betas.append(Las.coef_)
plt.figure(figsize=(8, 6))
plt.plot(alphas, betas)
plt.xscale('log')
plt.grid(True)
plt.axis('tight')
plt.title(r'Lasso path: regularization parameter $\alpha$ vs. regression coefficients $\beta$')
plt.xlabel(r'$\alpha$')
plt.ylabel(r'$\beta$')
plt.show()
The Lasso curves show that all variables level off at around 10.

As before, we use cross-validation to select the best regularization parameter for Lasso regression.
from sklearn.linear_model import LassoCV
from sklearn import metrics
Las_cv = LassoCV(alphas=alphas, cv=10)
Las_cv.fit(X_train, y_train)
Las_cv.alpha_
We retrain the model with the selected value.
Las = Lasso(alpha=0.005074705239490466)  # , fit_intercept=False
Las.fit(X_train, y_train)
print(Las.coef_)
print(Las.intercept_)
[ 0. -0. 0. 0.01001827 0.03467474 -0.01570339
0.06202352 0.09721864 0.02632133 0.06032038 -0. ]
2.7808303982442952
Compared with ridge regression, some of the coefficients here are exactly 0: Lasso has dropped the four independent variables holiday,
workingday, weather and group_season.
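The dropped variables can be read off by pairing the printed coefficients with the feature names. In this sketch the coefficient values are copied from the output above, but the column order in `feature_names` is an assumption inferred from the order in which the columns appear and are created in the article; in practice `X_train.columns` would be the reliable source.

```python
# Assumed column order of X (inferred, not printed in the article).
feature_names = ['holiday', 'workingday', 'weather', 'temp', 'atemp', 'humidity',
                 'windspeed', 'hour', 'week', 'month', 'group_season']

# Lasso coefficients copied from the printed output.
lasso_coef = [0.0, -0.0, 0.0, 0.01001827, 0.03467474, -0.01570339,
              0.06202352, 0.09721864, 0.02632133, 0.06032038, -0.0]

dropped = [name for name, coef in zip(feature_names, lasso_coef) if coef == 0]
kept = [name for name, coef in zip(feature_names, lasso_coef) if coef != 0]

print('dropped:', dropped)
print('kept   :', kept)
```

Under that assumed order, the four zero positions line up with the four variables named in the text.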
# Predict on the training and test sets with Lasso, and compute the RMSE and goodness of fit (R^2)
y_train_pred = Las.predict(X_train)
y_test_pred = Las.predict(X_test)
y_train_rmse = sqrt(metrics.mean_squared_error(y_train, y_train_pred))
y_train_score = Las.score(X_train, y_train)
y_test_rmse = sqrt(metrics.mean_squared_error(y_test, y_test_pred))
y_test_score = Las.score(X_test, y_test)
print('Training set RMSE: {0}, score: {1}'.format(y_train_rmse, y_train_score))
print('Test set RMSE: {0}, score: {1}'.format(y_test_rmse, y_test_score))
Training set RMSE: 1.0347988070045209, score: 0.4669218367318746
Test set RMSE: 1.050818996520012, score: 0.45800096674816204
Finally, we fit an ordinary linear regression so that the three models can be compared.
from sklearn.linear_model import LinearRegression
# Train the linear regression model
LR = LinearRegression()
LR.fit(X_train, y_train)
print(LR.coef_)
print(LR.intercept_)
# Predict on the training and test sets, and compute the RMSE and goodness of fit (R^2)
y_train_pred = LR.predict(X_train)
y_test_pred = LR.predict(X_test)
y_train_rmse = sqrt(metrics.mean_squared_error(y_train, y_train_pred))
y_train_score = LR.score(X_train, y_train)
y_test_rmse = sqrt(metrics.mean_squared_error(y_test, y_test_pred))
y_test_score = LR.score(X_test, y_test)
print('Training set RMSE: {0}, score: {1}'.format(y_train_rmse, y_train_score))
print('Test set RMSE: {0}, score: {1}'.format(y_test_rmse, y_test_score))
[ 0.00775915 -0.00032048 0.00391537 0.00817703 0.03636054 -0.01558878
0.09087069 0.09714058 0.02792397 0.06114454 -0.00099731]
2.6837869701964014
Training set RMSE: 1.0347173340121176, score: 0.46700577529675036
Test set RMSE: 1.0510323073614725, score: 0.45778089839236114
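Summary
In terms of the gap between test-set and training-set RMSE, linear regression has the largest gap and ridge regression the smallest.
Ridge regression also has the highest goodness of fit on the test set. Overall, ridge regression performs slightly better on this dataset.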
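The comparison can be checked directly from the metrics printed earlier. The sketch below copies those numbers into a dictionary (the name `reported` is just for illustration) and computes the test-minus-training RMSE gap for each model.

```python
# Metrics copied from the printed results of the three models.
reported = {
    'linear': {'train_rmse': 1.0347173340121176, 'test_rmse': 1.0510323073614725,
               'test_r2': 0.45778089839236114},
    'ridge':  {'train_rmse': 1.0348076524200298, 'test_rmse': 1.0508046977499312,
               'test_r2': 0.45801571689420706},
    'lasso':  {'train_rmse': 1.0347988070045209, 'test_rmse': 1.050818996520012,
               'test_r2': 0.45800096674816204},
}

# Gap between test and training RMSE: a rough indicator of how much a model overfits.
gaps = {name: m['test_rmse'] - m['train_rmse'] for name, m in reported.items()}
largest_gap = max(gaps, key=gaps.get)
smallest_gap = min(gaps, key=gaps.get)
best_test_r2 = max(reported, key=lambda name: reported[name]['test_r2'])

for name, gap in gaps.items():
    print(name, round(gap, 6), round(reported[name]['test_r2'], 6))
```

The differences between the three models only appear in the third or fourth decimal place, so the advantage of ridge regression here is small.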

Judging by these scores, none of the models above is very good yet. Other models are worth exploring, such as decision trees,
random forests and neural networks.