How Are Random Forests and Gradient-Boosted Trees (GBTs) Implemented in MLlib?
 

ÒëÕߣº²®ÀÖÔÚÏß - Den À´Ô´£ºdatabricks.com ·¢²¼ÓÚ£º2015-4-24

 

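Spark 1.2 introduces Random Forests and Gradient-Boosted Trees (GBTs) into MLlib. Both methods apply to classification and regression, and they are among the most widely used and most successful machine learning algorithms. Random Forests and GBTs are ensemble learning algorithms: they combine multiple decision trees to produce a stronger model. In this post, we describe these models and their distributed implementation in MLlib. We also give a few simple examples and pointers on how to get started.

Ensemble Methods

Simply put, an ensemble learning method is a machine learning algorithm that builds on other machine learning algorithms by combining their models. The combination can be more powerful and more accurate than any of the individual models.

In MLlib 1.2, we use decision trees as the base models. We provide two ensemble methods: Random Forests and Gradient-Boosted Trees (GBTs). The main difference between the two is the order in which the component trees are trained.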

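Random Forests train each tree independently, using a random sample of the data. This randomness makes the model more robust than a single decision tree and less likely to overfit the training data.

GBTs train one tree at a time, and each new tree helps correct the errors made by the trees trained before it. With each tree added, the model becomes more expressive.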
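To make the "each new tree corrects earlier errors" idea concrete, here is a minimal sketch of boosting for squared-error regression on a single numeric feature. It is not MLlib's implementation: `BoostingSketch`, `Stump`, `fitStump`, `train`, and the `learningRate` parameter are hypothetical names chosen for illustration, and the base learner is a depth-1 tree (a "stump") rather than a full decision tree.

```scala
object BoostingSketch {
  // A depth-1 regression tree on one numeric feature (illustrative only).
  final case class Stump(threshold: Double, leftValue: Double, rightValue: Double) {
    def predict(x: Double): Double = if (x <= threshold) leftValue else rightValue
  }

  private def mean(values: Seq[Double]): Double =
    if (values.isEmpty) 0.0 else values.sum / values.size

  // Sum of squared errors of a stump against the given targets.
  def squaredError(stump: Stump, xs: Seq[Double], targets: Seq[Double]): Double =
    xs.zip(targets).map { case (x, t) => math.pow(t - stump.predict(x), 2) }.sum

  // Try every observed value as a threshold; keep the stump with the lowest squared error.
  def fitStump(xs: Seq[Double], targets: Seq[Double]): Stump = {
    require(xs.nonEmpty && xs.size == targets.size, "need matching, non-empty inputs")
    val pairs = xs.zip(targets)
    val candidates = xs.distinct.map { threshold =>
      val (left, right) = pairs.partition { case (x, _) => x <= threshold }
      Stump(threshold, mean(left.map(_._2)), mean(right.map(_._2)))
    }
    candidates.minBy(stump => squaredError(stump, xs, targets))
  }

  // Train one stump per iteration; each stump is fit to the residuals left by the previous ones.
  def train(xs: Seq[Double], ys: Seq[Double], numIterations: Int, learningRate: Double): List[Stump] = {
    var residuals = ys
    val stumps = List.newBuilder[Stump]
    for (_ <- 0 until numIterations) {
      val stump = fitStump(xs, residuals)
      stumps += stump
      residuals = xs.zip(residuals).map { case (x, r) => r - learningRate * stump.predict(x) }
    }
    stumps.result()
  }

  // The ensemble prediction is the (scaled) sum of all stump predictions.
  def predict(stumps: Seq[Stump], learningRate: Double, x: Double): Double =
    stumps.map(stump => learningRate * stump.predict(x)).sum
}
```

Calling `BoostingSketch.train(xs, ys, numIterations = 20, learningRate = 0.5)` on a small dataset and comparing it with a single iteration shows the training error shrinking as stumps are added, which is the sequential dependence that separates boosting from Random Forests.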

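In the end, both methods produce a weighted collection of decision trees. The ensemble model makes predictions by combining the results of the individual trees. The figure below shows a simple example of an ensemble of 3 decision trees.

In the example regression ensemble above, each tree predicts a real value. These predictions are then combined to produce the ensemble's final prediction. Here, we combine them by taking the mean (different prediction tasks call for different combination methods).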
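The averaging step can be sketched in a few lines of plain Scala. This is an illustration only: `EnsembleAverage`, `Tree`, and `toyTrees` are hypothetical names, each "tree" is modeled as a plain function, and the split points and leaf values are made up.

```scala
object EnsembleAverage {
  // A "tree" is modeled as any function from a feature vector to a real value.
  type Tree = Array[Double] => Double

  // Regression ensemble: the final prediction is the mean of the per-tree predictions.
  def predict(trees: Seq[Tree], features: Array[Double]): Double = {
    require(trees.nonEmpty, "ensemble must contain at least one tree")
    trees.map(tree => tree(features)).sum / trees.size
  }

  // Three toy trees, each a single split on feature 0 (values are invented).
  val toyTrees: Seq[Tree] = Seq(
    (f: Array[Double]) => if (f(0) <= 1.0) 10.0 else 20.0,
    (f: Array[Double]) => if (f(0) <= 2.0) 12.0 else 24.0,
    (f: Array[Double]) => if (f(0) <= 3.0) 14.0 else 28.0
  )
}
```

For the input `Array(2.5)` the three toy trees predict 20.0, 24.0, and 14.0, so `EnsembleAverage.predict` returns their mean, 58.0 / 3. A classification ensemble would typically swap the mean for a vote.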

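Distributed Learning of Ensembles

In MLlib, both Random Forests and GBTs partition the data by instance (row). The implementation builds on the original decision tree code, which distributes the learning of each single tree (described in an earlier blog post). Many of our optimizations follow Google's PLANET project, in particular its published work on learning tree ensembles in a distributed setting.

Random Forests: Since each tree in a Random Forest is trained independently, multiple trees can be trained in parallel (in addition to the parallelization within each single tree). MLlib does exactly this: on each iteration, the number of subtrees trained in parallel is adjusted dynamically, based on memory constraints.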
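One way to picture this memory-bounded scheduling is a greedy selection over a queue of tree nodes waiting to be split. The sketch below is an assumption-laden illustration, not MLlib's actual scheduler: `NodeBatching`, `PendingNode`, `statsBytes`, and `selectBatch` are hypothetical names, and the per-node memory estimate is simply given as input.

```scala
object NodeBatching {
  // Hypothetical description of a tree node waiting to be split,
  // with an estimate of the memory its split statistics would need.
  final case class PendingNode(treeIndex: Int, nodeId: Int, statsBytes: Long)

  // Greedily take nodes from the front of the queue until the next one would
  // exceed the memory budget. At least one node is always taken so that
  // training makes progress even if a single node is over budget.
  // Returns (nodes to process in this iteration, nodes left in the queue).
  def selectBatch(
      queue: List[PendingNode],
      memoryBudgetBytes: Long): (List[PendingNode], List[PendingNode]) = {
    var usedBytes = 0L
    val batch = queue.zipWithIndex.takeWhile { case (node, index) =>
      val fits = index == 0 || usedBytes + node.statsBytes <= memoryBudgetBytes
      if (fits) usedBytes += node.statsBytes
      fits
    }.map { case (node, _) => node }
    (batch, queue.drop(batch.size))
  }
}
```

Because the queue can mix nodes from different trees, a larger budget lets more trees advance in the same pass over the data, while a smaller budget falls back toward processing them one after another.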

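GBTs: Since GBTs must train one tree at a time, training is parallelized only at the single-tree level.

We would like to highlight two key optimizations used in MLlib:

1. Memory: Random Forests train each tree on a different subsample of the data. Instead of replicating the data explicitly for each subsample, we save memory by using a TreePoint structure, which stores the number of replicas of each instance in each subsample.

2. Communication: While decision trees are often trained by considering all features at each decision node, Random Forests often restrict the choice at each node to a random subset of the features. MLlib's implementation takes advantage of this subsampling to reduce communication: for example, if only 1/3 of the features are used at each node, communication shrinks to roughly 1/3 of what it would otherwise be.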
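Both optimizations can be illustrated with a small, self-contained sketch. It does not reproduce MLlib's internal classes: `SubsamplingSketch`, `BaggedRow`, `bag`, and `featureSubset` are hypothetical names, and the sampling scheme (a plain bootstrap with replacement) is an assumption made for the example.

```scala
import scala.util.Random

object SubsamplingSketch {
  // One training row stored once, plus how many times it appears
  // in each tree's bootstrap sample (instead of copying the row per tree).
  final case class BaggedRow(label: Double, features: Array[Double], countsPerTree: Array[Int])

  // Draw a bootstrap sample (with replacement) for every tree, recording only counts.
  def bag(rows: IndexedSeq[(Double, Array[Double])], numTrees: Int, seed: Long): IndexedSeq[BaggedRow] = {
    val rng = new Random(seed)
    val counts = Array.fill(rows.size)(new Array[Int](numTrees))
    for (tree <- 0 until numTrees; _ <- rows.indices) {
      counts(rng.nextInt(rows.size))(tree) += 1
    }
    rows.zipWithIndex.map { case ((label, features), i) => BaggedRow(label, features, counts(i)) }
  }

  // Pick the random subset of feature indices considered at one node,
  // e.g. fraction = 1.0 / 3 to use about a third of the features.
  def featureSubset(numFeatures: Int, fraction: Double, rng: Random): Seq[Int] = {
    val k = math.max(1, math.ceil(numFeatures * fraction).toInt)
    rng.shuffle((0 until numFeatures).toList).take(k).sorted
  }
}
```

With this layout the feature arrays exist once no matter how many trees are trained, and only statistics for the selected feature indices would need to be aggregated for a node, which is where the communication saving comes from.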

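For details, see the Ensembles section of the MLlib Programming Guide.

Using MLlib Ensembles

We demonstrate how to learn ensemble models using MLlib. The following Scala examples show how to read in a dataset, split it into training and test sets, learn a model, and print the model and its test error. For Java and Python examples, see the MLlib Programming Guide. Note that GBTs do not yet have a Python API, but we expect one to be included in the Spark 1.3 release (via Github PR 3951).

Random Forest Example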

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.configuration.Strategy
import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file (`sc` is the SparkContext, e.g. from spark-shell).
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
// Split the data into training (70%) and test (30%) sets.
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a RandomForest model.
val treeStrategy = Strategy.defaultStrategy("Classification")
val numTrees = 3 // Use more in practice.
val featureSubsetStrategy = "auto" // Let the algorithm choose.
val model = RandomForest.trainClassifier(trainingData,
  treeStrategy, numTrees, featureSubsetStrategy, seed = 12345)

// Evaluate the model on test instances and compute the test error,
// i.e. the fraction of misclassified instances.
val testErr = testData.map { point =>
  val prediction = model.predict(point.features)
  if (point.label == prediction) 0.0 else 1.0
}.mean()
println("Test Error = " + testErr)
println("Learned Random Forest:\n" + model.toDebugString)

GBTs Example

import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file (`sc` is the SparkContext, e.g. from spark-shell).
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
// Split the data into training (70%) and test (30%) sets.
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a GradientBoostedTrees model.
val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.numIterations = 3 // Note: Use more in practice.
val model = GradientBoostedTrees.train(trainingData, boostingStrategy)

// Evaluate the model on test instances and compute the test error,
// i.e. the fraction of misclassified instances.
val testErr = testData.map { point =>
  val prediction = model.predict(point.features)
  if (point.label == prediction) 0.0 else 1.0
}.mean()
println("Test Error = " + testErr)
println("Learned GBT model:\n" + model.toDebugString)

Scalability

We demonstrate the scalability of MLlib ensembles with empirical results. Each of the figures below compares GBTs and Random Forests, with trees built out to different maximum depths.

These tests were on a regression task: predicting a song's release date from its audio features (the YearPredictionMSD dataset from the UCI ML repository). We used EC2 r3.2xlarge machines. Algorithm parameters were left at their default values unless noted otherwise.

Scaling model size: Training time and test error

The two figures below show the effect of increasing the number of trees in the ensemble. For both GBTs and Random Forests, adding trees increases training time (first figure) and also improves prediction accuracy, measured by mean squared error on the test set (second figure).
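For reference, the test metric used in these comparisons, mean squared error, can be computed from (label, prediction) pairs as in the short sketch below. `Metrics` and `meanSquaredError` are hypothetical names used only for this illustration, and the function works on an in-memory `Seq` rather than a distributed dataset.

```scala
object Metrics {
  // Mean squared error over (label, prediction) pairs.
  def meanSquaredError(labelsAndPredictions: Seq[(Double, Double)]): Double = {
    require(labelsAndPredictions.nonEmpty, "need at least one (label, prediction) pair")
    labelsAndPredictions.map { case (label, prediction) =>
      val diff = label - prediction
      diff * diff
    }.sum / labelsAndPredictions.size
  }
}
```

For example, with two invented test songs labeled 2001 and 1995 and predicted as 2000 and 1998, the squared errors are 1 and 9, so the function returns 5.0. Lower values mean the predicted release years are closer to the true ones.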

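Comparing the two, Random Forests are faster to train, but they need deeper trees to reach the same accuracy as GBTs. GBTs can reduce the error significantly on each iteration, but after too many iterations they tend to overfit (the test error increases). Random Forests do not overfit as easily, and their test error levels off.

The figures below plot mean squared error for individual decision tree depths of 2, 5, and 10.

Details: 463,715 training instances. 16 nodes.

Scaling training dataset size: Training time and test error

The next two figures show the effect of using training sets of different sizes. With more data, both methods take longer to train but achieve better test results.

Further scaling: More nodes, faster training

The final figure shows the effect of using a larger compute cluster to solve the same problem. Both GBTs and Random Forests train significantly faster on larger clusters. For example, GBTs with trees of depth 2 train about 4.7 times faster on 16 nodes than on 2 nodes. The larger the dataset, the greater the speedup.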
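The 4.7x figure can be put in context with a quick parallel-efficiency calculation: going from 2 to 16 nodes is an 8x increase in hardware, so ideal scaling would give an 8x speedup. The sketch below uses hypothetical names (`Scaling`, `parallelEfficiency`) and takes only the numbers quoted above as input.

```scala
object Scaling {
  // Parallel efficiency = measured speedup / ideal speedup,
  // where the ideal speedup is the ratio of node counts.
  def parallelEfficiency(speedup: Double, baseNodes: Int, scaledNodes: Int): Double = {
    require(baseNodes > 0 && scaledNodes > 0, "node counts must be positive")
    speedup / (scaledNodes.toDouble / baseNodes)
  }
}
```

`Scaling.parallelEfficiency(4.7, baseNodes = 2, scaledNodes = 16)` evaluates 4.7 / 8, or about 0.59, meaning the depth-2 GBT run keeps roughly 59% of the ideal linear speedup.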

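What's Next

GBTs will soon have a Python API. Another topic for future development is pluggability: ensemble methods can be applied to almost any classification or regression algorithm, not only decision trees. The Pipelines API, introduced in Spark 1.2 in the experimental spark.ml package, will allow us to generalize ensemble methods and make them truly pluggable.

Further Reading

For APIs and examples, see the MLlib ensembles documentation.

For more background on the decision trees used to build ensembles, see the earlier blog post.

Acknowledgements

MLlib ensemble algorithms were developed collaboratively by the authors of this blog post, Qiping Li (Alibaba), Sung Chung (Alpine Data Labs), and Davies Liu (Databricks). We also thank Lee Yang, Andrew Feng, and Hirakendu Das (Yahoo) for their help with design and testing. We welcome your contributions too!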

   