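Spark 1.2 introduces Random Forests and Gradient-Boosted Trees (GBTs) into MLlib. Suitable for both classification and regression, they are among the most widely used and most successful machine learning methods. Random Forests and GBTs are both ensemble learning algorithms: they combine multiple decision trees to produce a more powerful model. In this post, we describe these models and their distributed implementation in MLlib. We also give some simple examples and pointers on how to get started.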
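Ensemble Methods
Simply put, ensemble learning methods are machine learning algorithms built on top of other machine learning algorithms, which they combine effectively. The combined model is more powerful and more accurate than any of its individual component models.
In MLlib 1.2, we use decision trees as the base models. We provide two ensemble methods: Random Forests and Gradient-Boosted Trees (GBTs). The main difference between the two is the order in which the component trees are trained.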
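Random Forests train each tree independently, using a random sample of the data. This randomness makes the model more robust than a single decision tree and less likely to overfit the training data.
GBTs train one tree at a time, and each new tree helps correct the errors made by the previously trained trees. With each tree added, the model becomes more expressive.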
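To make the sequential idea concrete, here is a minimal, self-contained sketch of boosting for squared-error regression on one-dimensional inputs. It is not MLlib's implementation: `Stump`, `fitStump`, `boost`, and the `learningRate` parameter are hypothetical names chosen for illustration, and each "tree" is only a depth-1 stump.

```scala
object BoostingSketch {
  // A depth-1 regression tree on one feature: predict `left` if x <= threshold, else `right`.
  final case class Stump(threshold: Double, left: Double, right: Double) {
    def predict(x: Double): Double = if (x <= threshold) left else right
  }

  private def mean(xs: Seq[Double]): Double = if (xs.isEmpty) 0.0 else xs.sum / xs.size

  // Try every observed value as a threshold and keep the stump with the lowest squared error.
  def fitStump(xs: Seq[Double], targets: Seq[Double]): Stump = {
    val pairs = xs.zip(targets)
    val candidates = xs.distinct.map { t =>
      val (l, r) = pairs.partition(_._1 <= t)
      val stump = Stump(t, mean(l.map(_._2)), mean(r.map(_._2)))
      val sse = pairs.map { case (x, y) => math.pow(y - stump.predict(x), 2) }.sum
      (sse, stump)
    }
    candidates.minBy(_._1)._2
  }

  // Train stumps one at a time; each new stump is fit to the residuals of the ensemble so far.
  def boost(xs: Seq[Double], ys: Seq[Double], numIterations: Int, learningRate: Double): Seq[Stump] = {
    var ensemble = Vector.empty[Stump]
    var current = Seq.fill(xs.size)(0.0) // current ensemble prediction for each training point
    for (_ <- 0 until numIterations) {
      val residuals = ys.zip(current).map { case (y, p) => y - p }
      val stump = fitStump(xs, residuals)
      ensemble = ensemble :+ stump
      current = xs.zip(current).map { case (x, p) => p + learningRate * stump.predict(x) }
    }
    ensemble
  }

  // The ensemble prediction is the (scaled) sum of the stump predictions.
  def predict(ensemble: Seq[Stump], learningRate: Double, x: Double): Double =
    ensemble.map(s => learningRate * s.predict(x)).sum

  // Mean squared error of the ensemble on a dataset.
  def mse(ensemble: Seq[Stump], learningRate: Double, xs: Seq[Double], ys: Seq[Double]): Double =
    mean(xs.zip(ys).map { case (x, y) => math.pow(y - predict(ensemble, learningRate, x), 2) })
}
```

On a small training set, calling `boost` with more iterations should give a lower training error from `mse`, which is the "each tree corrects the previous ones" behavior in miniature.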
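In the end, both methods produce a weighted collection of decision trees. The ensemble model makes predictions by combining the results from the individual trees. The figure below shows a simple example of an ensemble of 3 decision trees.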

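In the regression ensemble of the example above, each tree predicts a real value. These predictions are then combined to produce the ensemble's final prediction. Here, we take the average to obtain the final prediction (although different prediction tasks call for different combining algorithms).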
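The combining step can be sketched in a few lines of plain Scala. This is an illustration only, not MLlib code: `Tree` is a hypothetical stand-in for a trained tree (any function from features to a prediction), and majority voting is shown as one common choice for classification.

```scala
object EnsembleSketch {
  // Stand-in for a trained tree: any function from a feature vector to a prediction.
  type Tree = Vector[Double] => Double

  // Regression: weighted average of the individual tree predictions.
  def predictRegression(trees: Seq[Tree], weights: Seq[Double], features: Vector[Double]): Double = {
    require(trees.nonEmpty && trees.size == weights.size, "need one weight per tree")
    val weightedSum = trees.zip(weights).map { case (tree, w) => w * tree(features) }.sum
    weightedSum / weights.sum
  }

  // Classification: each tree votes for a label; the label with the most votes wins.
  def predictClassification(trees: Seq[Tree], features: Vector[Double]): Double = {
    require(trees.nonEmpty, "need at least one tree")
    trees.map(tree => tree(features)).groupBy(identity).maxBy(_._2.size)._1
  }
}
```

With three trees and equal weights, `predictRegression` reduces to the plain average of the three predictions.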
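Distributed Learning of Ensembles
In MLlib, both Random Forests and GBTs partition data by instances (rows). The implementation builds on the original decision tree code, which distributes the learning of a single tree (described in an earlier blog post). Many of our optimizations are based on Google's PLANET project, in particular its published work on learning ensembles in the distributed setting.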
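Random Forests: Since each tree in a Random Forest is trained independently, multiple trees can be trained in parallel (in addition to the parallelization within each single tree). MLlib does exactly this: the number of subtrees trained in parallel is adjusted dynamically on each iteration, based on memory constraints.
GBTs: Since GBTs must train one tree at a time, training is only parallelized at the single-tree level.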
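We would like to highlight two key optimizations used in MLlib:
1. Memory: Random Forests use a different subsample of the data to train each tree. Instead of replicating the data explicitly for each subsample, we save memory by using a TreePoint structure, which stores the number of replicas of each instance in each subsample.
2. Communication: Although decision trees are often trained by considering all features at every decision node of the tree, Random Forests typically limit the selection to a random subset of features at each node. MLlib's implementation takes advantage of this subsampling to reduce communication: for example, if only 1/3 of the features are used at each node, we cut communication to roughly 1/3.
For details, see the ensembles section of the MLlib Programming Guide.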
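Both optimizations can be illustrated with a small plain-Scala sketch. It is an assumption-laden illustration rather than MLlib's internal code: `WeightedPoint`, `bootstrapCounts`, and `featureSubset` are hypothetical names, and the bootstrap here draws exactly n instances per tree, which may differ from the sampling scheme MLlib uses.

```scala
import scala.util.Random

object SubsamplingSketch {
  // One stored copy of an instance, plus how many times it appears in each tree's subsample.
  final case class WeightedPoint(label: Double, features: Vector[Double], countsPerTree: Array[Int])

  // Draw a bootstrap sample of size n for each tree, but record only per-instance counts
  // instead of materializing numTrees copies of the data.
  def bootstrapCounts(data: Seq[(Double, Vector[Double])], numTrees: Int, seed: Long): Seq[WeightedPoint] = {
    val rng = new Random(seed)
    val n = data.size
    val counts = Array.fill(n, numTrees)(0)
    for (tree <- 0 until numTrees; _ <- 0 until n) {
      counts(rng.nextInt(n))(tree) += 1
    }
    data.zipWithIndex.map { case ((label, features), i) => WeightedPoint(label, features, counts(i)) }
  }

  // Pick the random subset of feature indices considered at one tree node.
  // Only statistics for these features would need to be aggregated and communicated.
  def featureSubset(numFeatures: Int, fraction: Double, rng: Random): Seq[Int] = {
    val k = math.max(1, math.ceil(numFeatures * fraction).toInt)
    rng.shuffle((0 until numFeatures).toList).take(k).sorted
  }
}
```

The dataset is held once regardless of the number of trees; each tree only adds one integer per instance. Likewise, a node that considers 3 of 9 features only needs split statistics for those 3.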
Using MLlib Ensembles
We demonstrate how to learn ensemble models using MLlib. The following Scala examples show how to read in a dataset, split the data into training and test sets, learn a model, and print the model and its test error. For Java and Python examples, see the MLlib Programming Guide. Note that GBTs do not yet have a Python API, but we expect it to be included in the Spark 1.3 release (via Github PR 3951).
Random Forest Example

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.configuration.Strategy
import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file (sc is an existing SparkContext, e.g. in spark-shell).
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
// Split data into training/test sets
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a RandomForest model.
val treeStrategy = Strategy.defaultStrategy("Classification")
val numTrees = 3 // Use more in practice.
val featureSubsetStrategy = "auto" // Let the algorithm choose.
val model = RandomForest.trainClassifier(trainingData,
  treeStrategy, numTrees, featureSubsetStrategy, seed = 12345)

// Evaluate model on test instances and compute test error
// (the fraction of test instances whose prediction differs from the label).
val testErr = testData.map { point =>
  val prediction = model.predict(point.features)
  if (point.label == prediction) 0.0 else 1.0
}.mean()
println("Test Error = " + testErr)
println("Learned Random Forest:\n" + model.toDebugString)
GBTs Example

import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file (sc is an existing SparkContext, e.g. in spark-shell).
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
// Split data into training/test sets
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a GradientBoostedTrees model.
val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.numIterations = 3 // Note: Use more in practice
val model = GradientBoostedTrees.train(trainingData, boostingStrategy)

// Evaluate model on test instances and compute test error
// (the fraction of test instances whose prediction differs from the label).
val testErr = testData.map { point =>
  val prediction = model.predict(point.features)
  if (point.label == prediction) 0.0 else 1.0
}.mean()
println("Test Error = " + testErr)
println("Learned GBT model:\n" + model.toDebugString)
Scalability
We demonstrate the scalability of MLlib ensembles with empirical results on a regression task. Each figure below compares GBTs with Random Forests, where the trees are built out to different maximum depths.
The task is to predict a song's release date from audio features (the YearPredictionMSD dataset from the UCI ML repository). We used EC2 r3.2xlarge machines. Algorithm parameters were left at their default values unless otherwise noted.
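Scaling model size: Training time and test error
The two figures below show the effect of increasing the number of trees in the ensemble. For both GBTs and Random Forests, adding trees increases training time (first figure), and it also improves prediction accuracy, as measured by mean squared error (MSE) on the test set (second figure).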
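For reference, the test metric used in these comparisons is straightforward to compute. The following plain-Scala sketch uses a hypothetical helper name, `meanSquaredError`, and operates on an in-memory sequence of (label, prediction) pairs rather than on a distributed dataset.

```scala
object MseSketch {
  // Mean squared error over (label, prediction) pairs.
  def meanSquaredError(labelsAndPredictions: Seq[(Double, Double)]): Double = {
    require(labelsAndPredictions.nonEmpty, "need at least one pair")
    labelsAndPredictions.map { case (label, prediction) =>
      val diff = label - prediction
      diff * diff
    }.sum / labelsAndPredictions.size
  }
}
```

For example, predicting 2000 for a song released in 2001 and 2001 for a song released in 1999 gives squared errors of 1 and 4, so the MSE is 2.5.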
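Comparing the two, Random Forests are faster to train, but they often need deeper trees to reach the same error as GBTs. GBTs can reduce the error significantly on each iteration, but after too many iterations they may start to overfit (test error increases). Random Forests do not overfit as easily, and their test error levels off.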

The figure below shows the MSE curves for individual trees of depth 2, 5, and 10.

Details: 463,715 training instances. 16 workers.
Scaling training dataset size: Training time and test error
The next two figures show the effect of using training sets of different sizes. They show that with more data, both methods take longer to train but achieve better test results.


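Strong scaling: Faster training with more workers
The final figure shows the effect of using a larger compute cluster to solve the same problem: both GBTs and Random Forests train significantly faster on larger clusters. For example, GBTs with trees of depth 2 train about 4.7 times faster on 16 workers than on 2 workers. The larger the dataset, the greater the speedup.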
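One way to read the 4.7x figure is as parallel efficiency: the observed speedup divided by the ideal, linear speedup. The helper below is a hypothetical illustration of that arithmetic, not part of MLlib.

```scala
object ScalingSketch {
  // Parallel efficiency: observed speedup divided by the ideal (linear) speedup
  // when going from baseWorkers to scaledWorkers.
  def efficiency(speedup: Double, baseWorkers: Int, scaledWorkers: Int): Double = {
    require(baseWorkers > 0 && scaledWorkers > 0, "worker counts must be positive")
    speedup / (scaledWorkers.toDouble / baseWorkers)
  }
}
```

Going from 2 to 16 workers is an 8x increase in resources, so a 4.7x speedup corresponds to `efficiency(4.7, 2, 16)`, about 0.59 of the linear ideal.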

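What's Next?
GBTs will soon include a Python API. Another topic for future development is pluggability: ensemble methods can be applied to almost any classification or regression algorithm, not only decision trees. The Pipelines API introduced in Spark 1.2 by the experimental spark.ml package will allow us to generalize ensemble methods and make them truly pluggable.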
Further Reading
For the API and examples, see the MLlib ensembles documentation.
For more background on the decision trees used to build ensembles, see the previous blog post.
Acknowledgements
MLlib ensemble algorithms have been developed collaboratively by the authors of this blog post, Qiping Li (Alibaba), Sung Chung (Alpine Data Labs), and Davies Liu (Databricks). We also thank Lee Yang, Andrew Feng, and Hirakendu Das (Yahoo) for their help with design and testing. We welcome your contributions too!