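This article introduces Weiflow, Weibo's machine learning framework, and describes its application and best practices at Weibo from three angles: development efficiency (ease of use), extensibility, and execution efficiency.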
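In the previous installment, "Large-Scale Machine Learning on Spark at Weibo", we noted that model training is the least time-consuming step of a machine learning workflow. If the workflow is compared to cooking, model training is the final stir-fry; most of the cooking time actually goes into choosing ingredients and seasonings, washing and trimming vegetables, and preparing the ingredients (dicing, cutting, flash-frying, preheating). In Weibo's machine learning workflow, raw sample generation, data processing, feature engineering, training sample generation, and the later testing and evaluation of models take up more than 80% of the time and effort of the whole process. How to develop machine learning workflows efficiently end to end, and how to use online feedback to promptly select highly discriminative features, optimize models, validate their effectiveness, speed up model iteration, and meet online requirements, are the problems we need to solve.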
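Weiflow grew out of the business needs of Weibo's machine learning workflow. In that workflow (see Figure 1), several data streams (such as the posting stream, the exposure stream, and the interaction stream) are processed in real time by Spark Streaming and Storm, stored in the feature engineering system, and used to generate offline raw samples. In the offline system, business developers draw on their experience to apply all kinds of data processing (statistics, cleaning, filtering, sampling, and so on), feature processing, and feature mapping to the raw samples, producing samples that can be used for training. Depending on the business scenario (ranking, recommendation), they choose among algorithms and models (LR, GBDT, frequent itemsets, SVM, DNN, and so on) and carry out model training, prediction, testing, and evaluation. Once model iteration meets the requirements, automated deployment pushes the model file and the mapping rules online. The online system uses the model file and mapping rules to pull the relevant feature values from the feature engineering system, preprocesses them according to the mapping rules, generates samples in a format suitable for prediction, and performs real-time online prediction. The prediction result (how interested a user is in a piece of Weibo content) is then output for online services to call.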

Figure 1: Weibo's machine learning workflow
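Weiflow was designed to make developing Weibo's machine learning workflows simple, even foolproof. It frees business developers from the tangle of data processing, feature engineering, and model engineering so that they can spend their time and energy on developing and optimizing business scenarios, which substantially raises their productivity and development efficiency.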
¿¼Âǵ½Î¢²©ÒµÎñ³¡¾°Ô½À´Ô½¸´ÔÓ¡¢¶àÑùµÄÇ÷ÊÆ£¬WeiflowÔÚÉè¼ÆÖ®³õ¾Í³ä·Ö¿¼ÂDz¢È¨ºâÁË¿ò¼ÜµÄ¿ª·¢Ð§ÂÊ¡¢¿ÉÀ©Õ¹ÐÔºÍÖ´ÐÐЧÂÊ¡£Weiflowͨ¹ýͳһ¸ñʽµÄÅäÖÃÎļþʽ¿ª·¢£¨XMLÁ÷³ÌÎļþ£©£¬ÔÊÐíÒµÎñÈËÔ±Ïñ´î»ýľһÑùÁé»îµØ½«ÐèÒªÓõ½µÄÄ£¿é£¨Êý¾Ý´¦Àí¡¢ÌØÕ÷Ó³Éä¡¢Éú³ÉѵÁ·Ñù±¾¡¢Ä£Ð͵ÄѵÁ·¡¢Ô¤²â¡¢²âÊÔ¡¢ÆÀ¹ÀµÈ£©¶Ñµþµ½Ò»Æð£¬¸ù¾ÝÒÀÀµ¹ØÏµÐγɼÆËãÁ÷ͼ£¨Directed
Acyclic GraphÓÐÏòÎÞ»·Í¼£©£¬Weiflow½«×Ô¶¯½âÎö²»Í¬Ä£¿éÖ®¼äµÄÒÀÀµ¹ØÏµ£¬²¢µ÷ÓÃÿ¸öÄ£Ð͵ÄÖ´ÐÐÀà½øÐÐÁ÷Ë®ÏßʽµÄ×÷Òµ¡£¶ÔÓÚÿһ¸ö¼ÆËãÄ£¿é£¬Óû§ÎÞÐè¹ØÐÄÆäÄÚ²¿ÊµÏÖ¡¢Ö´ÐÐЧÂÊ£¬Ö»Ðè¹ØÐÄÓëÒµÎñ¿ª·¢Ïà¹ØµÄ²ÎÊýµ÷ÓÅ£¬ÈçËã·¨µÄ³¬²ÎÊý¡¢Êý¾Ý²ÉÑùÂÊ¡¢²ÉÑù·½Ê½¡¢ÌØÕ÷Ó³É乿Ôò¡¢Êý¾Ýͳ¼Æ·½Ê½¡¢Êý¾ÝÇåÏ´¹æÔòµÈµÈ£¬´Ó¶ø´ó·ùÌáÉý¿ª·¢Ð§ÂÊ¡¢Ä£Ð͵ü´úËÙ¶È¡£ÎªÁËÈøü¶àµÄ¿ª·¢Õߣ¨°üÀ¨¾ßÓдúÂëÄÜÁ¦µÄÒµÎñÈËÔ±£©Äܹ»²ÎÓëµ½WeiflowµÄ¿ª·¢ÖÐÀ´£¬WeiflowÉè¼Æ²¢ÌṩÁ˷ḻµÄ¶à²ã´Î³éÏ󣬻ùÓÚÔ¤¶¨ÒåµÄ»ùÀàºÍ½Ó¿Ú£¬ÔÊÐí¿ª·¢Õ߸ù¾ÝеÄÒµÎñÐèÇóʵÏÖ×Ô¼ºµÄ´¦ÀíÄ£¿é£¨ÈçеÄË㷨ģÐÍѵÁ·¡¢Ô¤²â¡¢ÆÀ¹ÀÄ£¿é£©¡¢¼ÆË㺯Êý£¨È縴ÔÓµÄÌØÕ÷¼ÆË㹫ʽ¡¢ÌØÕ÷×éºÏº¯ÊýµÈ£©£¬´Ó¶ø²»¶Ï·á¸»¡¢À©Õ¹WeiflowµÄ¹¦ÄÜ¡£ÔÚ¿ò¼ÜµÄÖ´ÐÐЧÂÊ·½Ã棬ÔÚµÚ¶þ²ãDAGÖУ¨ºóÃæ½«Ïêϸ½éÉÜWeiflowµÄË«²ãDAG½á¹¹£©£¬³ä·ÖÀûÓø÷ÖÖ¼ÆËãÒýÇæ£¨Spark¡¢Tensorflow¡¢Hive¡¢Storm¡¢FlinkµÈ£©µÄÓÅ»¯»úÖÆ£¬Í¬Ê±½áºÏÇÉÃîµÄÊý¾Ý½á¹¹Éè¼ÆÓ뿪·¢ÓïÑÔ£¨ÈçScalaµÄCurrying¡¢Partial
FunctionsµÈ£©±¾ÉíµÄÌØÐÔ£¬±£Ö¤¿ò¼ÜÔÚÌṩ×ã¹»µÄÁé»îÐԺͽüºõÎÞÏ޵ĿÉÀ©Õ¹ÐԵĻù´¡ÉÏ£¬¾¡¿ÉÄܵØÌáÉýÖ´ÐÐÐÔÄÜ¡£
To cope with Weibo's diverse computing environments (Spark, TensorFlow, Hive, Storm, Flink, and so on), Weiflow adopts a two-layer DAG task flow design, as shown in Figure 2.

Figure 2: Weiflow's two-layer DAG task flow design
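The outer DAG consists of nodes, each with its own independent execution environment, namely one of the computing engines mentioned above (Spark, TensorFlow, Hive, Storm, Flink, and so on). The idea behind the outer DAG is to let the right hammer hit the right nail. Because of the historical constraints at the time they were designed, most computing engines cannot serve every type of workload equally well. Each supports certain workloads better than others (batch processing, real-time stream processing, ad hoc queries, analytical data warehousing, machine learning, graph computation, transactional databases, and so on). Our approach is therefore to let users choose the engine that best fits their workload.

The inner DAG depends on the computing engine. It uses the engine's features and optimization mechanisms to implement an abstraction that carries data between the computation modules in the DAG. In a Spark node, for example, we make full use of Spark's existing optimization strategies and data structures, such as Datasets, DataFrames, Tungsten, and whole-stage code generation, and use the DataFrame as the carrier of the data flow in the DAG inside the Spark node.

Inside each node, three kinds of operations are abstracted according to their upstream or downstream position in the DAG: Input, Process, and Output. The Input base class defines the format, reading, and parsing conventions for input data in a Spark node. Users can create Inputs in various formats based on the data sources Spark supports, such as the Parquet, ORC, JSON, Text, and CSV examples in Figure 2. Users can also define their own input formats, such as the Libsvm example in Figure 2. Some of Weibo's model training scenarios require training samples in Libsvm format, and users can build a Libsvm reader module by implementing the conventions and interfaces defined in Input. Data read through an Input is wrapped as a DataFrame and passed to the downstream Process modules.

The Process base class defines the general conventions and interfaces for user computation logic. By implementing the functions in the Process base class, developers can flexibly implement their own logic, such as the statistics, cleaning, filtering, combination, sampling, and transformation examples in Figure 2. Machine learning steps such as model training, prediction, and testing can also be implemented as Process modules. Data handled by a Process is again wrapped as a DataFrame and passed to the downstream Output modules.

An Output class further processes the data passed from a Process class, for example evaluating a model, storing output data, storing a model file, or reporting AUC, and finally emits the result in some form (writing to disk, printing to the screen, and so on). Note that every data format that Input can read has a corresponding storage format in Output, so the two form a logical closed loop.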
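The Input, Process, and Output roles described above can be sketched in a few lines of Scala. This is only an illustration under stated assumptions: the trait names, method names, and the `Frame` type are hypothetical, and `Frame` is a plain in-memory stand-in for Spark's DataFrame so that the sketch runs without Spark. It is not Weiflow's actual API.

```scala
// Hypothetical sketch of the Input -> Process -> Output chain inside one node.
object IoSketch {
  // Stand-in for a Spark DataFrame: a sequence of rows keyed by column name.
  type Frame = Seq[Map[String, String]]

  trait Input   { def read(): Frame }
  trait Process { def process(in: Frame): Frame }
  trait Output  { def write(in: Frame): String }

  // Reads delimiter-separated text lines; `columns` plays the role of a meta file.
  class TextInput(lines: Seq[String], columns: Seq[String], delimiter: String) extends Input {
    def read(): Frame = lines.map(l => columns.zip(l.split(delimiter).toSeq).toMap)
  }

  // Keeps only the rows whose `column` value satisfies `keep`.
  class FilterProcess(column: String, keep: String => Boolean) extends Process {
    def process(in: Frame): Frame = in.filter(row => row.get(column).exists(keep))
  }

  // Reports the row count instead of writing to storage.
  class CountOutput extends Output {
    def write(in: Frame): String = s"rows=${in.size}"
  }

  // Chains the three stages in upstream-to-downstream order.
  def run(input: Input, process: Process, output: Output): String =
    output.write(process.process(input.read()))

  def main(args: Array[String]): Unit = {
    val input = new TextInput(Seq("u1\t1", "u2\t0", "u3\t1"), Seq("user", "label"), "\t")
    println(run(input, new FilterProcess("label", _ == "1"), new CountOutput))
  }
}
```

Because every stage consumes and produces the same carrier type, any Process can follow any Input, which is the property that makes the building-block style of development possible.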
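To use Weiflow, business developers define the computation logic of the two-layer DAG in an XML configuration file, following an agreed set of conventions and formats. From the dependencies and processing module classes specified in the XML, Weiflow automatically generates the DAG task flow graph and, at run time, calls the implementation classes of the processing modules to complete the task flow. Code 1 shows a development example of the GBDT+LR model training flow that is widely used at Weibo (for reasons of space, only the details of the first node are kept). The two-layer DAG dependencies and task flow graph formed by the training flow in Code 1 are shown in Figure 3. By stacking the required computation modules in the XML configuration file according to their dependencies (the node dependencies of the outer layer and the computation logic dependencies of the inner layer), developers can build configuration-driven, modular pipeline jobs the way one stacks building blocks.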
<!-- Outer DAG: node 2 depends on node 1 -->
<weiflow>
    <node id="1" preid="-1">GBDTtraining</node>
    <node id="2" preid="1">GBDTplusLR</node>
</weiflow>
<!-- Inner DAG of each node; the details of GBDTplusLR are omitted -->
<nodes>
    <node name="GBDTtraining">
        <input name="input1">
            <className>com.weibo.datasys.dataflow.input.InputSparkText</className>
            <dataPath>hdfs://path/of/your/data</dataPath>
            <metaPath>/path/of/your/meta</metaPath>
            <fieldDelimiter>\t</fieldDelimiter>
        </input>
        <process name="process1">
            <className>com.weibo.datasys.dataflow.process.ProcessSparkGBDTTraining</className>
            <dependency>input1</dependency>
            <conf>gbdt.data.conf</conf>
        </process>
        <output name="output1">
            <className>com.weibo.datasys.dataflow.output.OutputSparkGBDTModel</className>
            <dependency>process1</dependency>
            <modelPath>hdfs://path/of/your/data</modelPath>
        </output>
    </node>
</nodes>
Code 1: Building Weibo's GBDT+LR model training flow with Weiflow
Figure 3: Two-layer DAG dependencies and task flow graph of Weibo's GBDT+LR model training flow in Weiflow
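As a rough illustration of how the outer-layer declarations in Code 1 could be turned into an execution order, the sketch below orders nodes by their distance from a root. It assumes, based only on the example, that each node has a single `preid` and that `preid="-1"` marks a node with no predecessor. `NodeDecl` and `executionOrder` are hypothetical names, and the approach is not taken from Weiflow's source.

```scala
// Hypothetical sketch: ordering outer-DAG nodes from (id, preid) declarations.
object OuterDagSketch {
  final case class NodeDecl(id: Int, preId: Int, name: String)

  // Returns node names so that every node appears after its predecessor.
  def executionOrder(nodes: Seq[NodeDecl]): Seq[String] = {
    val byId = nodes.map(n => n.id -> n).toMap

    // Number of predecessors up the chain; fails on unknown ids or cycles.
    def depth(n: NodeDecl, seen: Set[Int]): Int = {
      require(!seen(n.id), s"cycle detected at node ${n.id}")
      if (n.preId == -1) 0 // assumption: -1 means "no predecessor"
      else depth(byId.getOrElse(n.preId, sys.error(s"unknown preid ${n.preId}")), seen + n.id) + 1
    }

    // sortBy is stable, so nodes at the same depth keep their declared order.
    nodes.sortBy(n => depth(n, Set.empty)).map(_.name)
  }

  def main(args: Array[String]): Unit =
    println(executionOrder(Seq(NodeDecl(2, 1, "GBDTplusLR"), NodeDecl(1, -1, "GBDTtraining"))))
}
```

A real implementation would need to handle nodes with several predecessors; the single-predecessor form is used here only because it is all the example shows.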
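Flexible, modular development has greatly improved the efficiency of business developers' machine learning and data science work. As Weibo's business scenarios grow more complex and its business needs more diverse, more developers need to be able to extend Weiflow flexibly, so extensibility was a primary consideration in the framework's design from the start. Weiflow offers nearly unlimited extensibility through multi-level, modular abstraction.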
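The multi-level abstraction serves the extensibility of the computing engines in the outer DAG (Spark, TensorFlow, Hive, Storm, Flink, and so on, as mentioned above). The top-level abstraction provides highly abstract definitions, and each computing engine in the outer DAG only has to inherit the properties and methods defined there to implement the engine-level abstraction. As shown in Figure 4, the top-level abstraction in the black text box provides several abstract Base classes, and the execution engines in the blue text boxes inherit their properties and methods to provide more concrete abstractions. When a new computing engine (such as Apache Flink) needs to be added to Weiflow, the user only has to make the newly defined engine class inherit from the top-level abstract classes to provide that engine's abstraction.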
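The modular abstraction starts from the business processing point of view. It distills basic, general-purpose module concepts from business needs and then defines the basic properties and methods of those modules. As shown in Figure 4, the text boxes define, inherit, and implement four basic modules: Node, Input, Process, and Output.

The Node base class defines the basic engine-related properties, such as the data exchange medium, the execution environment, the run-time data flow mode, and the abstraction of run-time parameters. The Input base class defines all the input types supported within an engine (for example, Parquet, ORC, JSON, CSV, and Text in the Spark engine) and converts each input type into the data exchange medium (such as DataFrame or RDD for the Spark execution engine). In implementing Weiflow (best practices for implementation and optimization are described in detail later), the modules inside each node make full use of the data structures and optimization mechanisms of the existing engine. In a Spark node, for example, we take full advantage of Spark's native functionality (such as its support for many data sources) and performance optimizations (such as the Tungsten optimization mechanism, binary data structures, and whole-stage code generation). In the Input base class, for instance, we rely on Spark's native data source support to offer users a choice of compressed and plain-text input formats. By implementing the objects and methods defined in the Input base class, developers can flexibly add whatever data format the business needs, such as the Libsvm format mentioned earlier.

The Process base class covers all business processing logic, and its implementation likewise uses the native support offered by the engine it runs on. In a Spark node, most processing logic, such as statistics, cleaning, filtering, and joins, is easy to implement with Spark SQL or the DataFrame DSL (domain-specific language). When developers need new business logic, such as up-sampling or down-sampling data by a given ratio, they only have to inherit the properties and methods defined in the Process base class, use the open APIs of Spark DataFrame and RDD, and wrap the sampling implementation in the predefined interface. That completes the development and extends Weiflow with a new capability for business developers to use.

Corresponding to Input, the Output base class defines how Weiflow outputs the various data formats within a computing engine, and provides interfaces that mirror those of Input: where Input provides a read interface, Output provides a write interface, forming a closed loop at the logical level.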
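The ratio-based sampling extension mentioned above might look roughly like the following. The sketch assumes a hypothetical `Process` trait with a single `process` method and uses a plain in-memory `Frame` in place of a Spark DataFrame; the class name `RatioSampling` and its parameters are invented for illustration.

```scala
// Hypothetical sketch: extending a Process base trait with ratio-based sampling.
object SamplingSketch {
  // Stand-in for a Spark DataFrame so the example runs without Spark.
  type Frame = Seq[Map[String, String]]

  trait Process { def process(in: Frame): Frame }

  // ratio < 1 down-samples; ratio > 1 up-samples by replicating rows.
  class RatioSampling(ratio: Double, seed: Long) extends Process {
    require(ratio >= 0.0, "ratio must be non-negative")

    def process(in: Frame): Frame = {
      val rng   = new scala.util.Random(seed)
      val whole = ratio.toInt     // guaranteed copies of every row
      val frac  = ratio - whole   // probability of one extra copy
      in.flatMap { row =>
        val copies = whole + (if (rng.nextDouble() < frac) 1 else 0)
        Seq.fill(copies)(row)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val rows: Frame = Seq(Map("id" -> "1"), Map("id" -> "2"), Map("id" -> "3"))
    println(new RatioSampling(2.0, 42L).process(rows).size)
  }
}
```

In an actual Spark node the same logic would more likely be expressed with the sampling operations that Spark DataFrame and RDD already provide; the point here is only that a new capability is added by implementing one predefined interface.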
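With Weiflow's multi-level, modular abstraction mechanism, developers can easily extend both the execution engines and the business functionality to keep up with changing business needs. As noted earlier, Weibo's business has enjoyed a second boom since 2016. Its business lines have become more varied and complex, data about users, posts, and interactions has grown explosively, and the challenge of scaling machine learning has become urgent. To meet Weibo's need for machine learning at scale, Weiflow took execution efficiency in the implementation into account from the very beginning of its design.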

Figure 4: Abstraction levels of Weiflow's open API
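In its implementation, Weiflow optimizes task execution performance from several angles: language features, data structures, and engine optimizations. Given the flexibility of Scala as a functional programming language, its rich set of operators, its very high development efficiency, and its concurrency capabilities, the core of the Weiflow framework and part of the business implementation in Spark nodes are written in Scala.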
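For business developers, the XML configuration file is the entry point to Weiflow. Weiflow parses the user's XML file with Scala's built-in XML module and generates the corresponding data structures, such as the DAG nodes and the dependencies between modules. After the module dependencies have been parsed successfully, Weiflow uses Scala's lazy values and call-by-name mechanism to turn the dependencies into a DAG, and triggers a backtracking execution of the whole DAG by calling the action function provided by the Output implementation class (Output.write). During backtracking execution, Weiflow takes the implementation classes named in the user's XML file, instantiates them at run time through the reflection mechanism available in Scala, and executes the computation logic.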
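The interplay of lazy values, call-by-name parameters, and an action that triggers backtracking can be shown with a small, self-contained sketch. Everything here is hypothetical (`Module`, `instantiate`, and the module names reuse the names from Code 1 purely as labels), integer sequences stand in for DataFrames, and plain JVM reflection is used in place of whatever reflection facilities Weiflow actually relies on.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch: wiring a DAG lazily and running it by backtracking.
object LazyDagSketch {
  // `compute` is passed by name, so wiring modules together evaluates nothing.
  class Module(name: String, log: ArrayBuffer[String])(compute: => Seq[Int]) {
    // Evaluated at most once, on first access.
    lazy val data: Seq[Int] = {
      val out = compute // pulls the upstream module's data first
      log += name       // record that this module has now executed
      out
    }
  }

  // Creates an object from a class name at run time via JVM reflection
  // (the class must have a public no-argument constructor).
  def instantiate[T](className: String): T =
    Class.forName(className).getDeclaredConstructor().newInstance().asInstanceOf[T]

  // Returns (nothing ran before the action, result, execution order).
  def run(): (Boolean, Seq[Int], List[String]) = {
    val log     = ArrayBuffer.empty[String]
    val input   = new Module("input1", log)(Seq(1, 2, 3, 4))
    val process = new Module("process1", log)(input.data.filter(_ % 2 == 0))
    val output  = new Module("output1", log)(Seq(process.data.sum))
    val idleBeforeAction = log.isEmpty // the graph is wired but not executed
    val result = output.data           // the "action" that triggers execution
    (idleBeforeAction, result, log.toList)
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Accessing the output's value walks back through `process1` to `input1`, so modules run in dependency order even though only the last one was asked for.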
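On the execution side, Weiflow relies on Scala's language features to improve overall performance substantially. In most of Weibo's machine learning scenarios, raw features have to be mapped into a high-dimensional feature space using various processing functions (such as log10, hash, feature combination, and formula evaluation). Some of the more complex functions take several input parameters. One example is pickcat, which looks up the index of a string in a list of strings. Such a function first uses its first parameter (for pickcat, the string list, which becomes extremely large in large-scale machine learning applications) to build a predefined data structure. It then uses the second parameter to search that data structure and returns the parameter's index in it.

Implemented as an ordinary function in the style of traditional programming languages, this requirement incurs a huge computational cost. Once defined, the processing function is shipped in a closure to each execution node (such as an Executor in Spark). As the execution node iterates over the data, every call first reads the string list parameter and builds the data structure, and only then reads the second string parameter, searches the data structure, and returns the index. What business developers actually care about is the index returned for the second parameter. There is no need to rebuild the data structure for every record, so running the function this way on the execution nodes adds a large amount of unnecessary computation.

Scala's currying solves this problem easily. In Scala, functions are first-class citizens and every function is an object. By currying pickcat, the processing of its first parameter is wrapped into another function (pickcat_), which is then shipped to the execution nodes in a closure. When the execution engine iterates over the data, the function it sees, pickcat_, takes only one parameter, namely the second parameter of the original pickcat, and only has to perform the index lookup. Currying is just one of the many features Scala offers as a functional programming language. Others, such as partial functions, case classes, pattern matching, and function chaining, are also used in Weiflow's implementation.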
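A minimal sketch of the pickcat idea follows. The function bodies, the return value of -1 for a missing string, and the `builds` counter are assumptions made for illustration; only the names pickcat and pickcat_ come from the text. Note that the curried version is written as a function that returns a function, so that the lookup structure is built when the first argument is supplied rather than on every call.

```scala
// Hypothetical sketch: avoiding repeated setup work by currying pickcat.
object PickcatSketch {
  // Counts how often the lookup structure is built (for demonstration only).
  var builds = 0

  // Uncurried form: rebuilds the lookup structure on every single call.
  def pickcatNaive(categories: Seq[String], value: String): Int = {
    builds += 1
    val index = categories.zipWithIndex.toMap
    index.getOrElse(value, -1) // assumption: -1 signals "not found"
  }

  // Curried form: supplying the first argument builds the structure once
  // and returns a one-argument function that only performs the lookup.
  def pickcat(categories: Seq[String]): String => Int = {
    builds += 1
    val index = categories.zipWithIndex.toMap
    value => index.getOrElse(value, -1)
  }

  def main(args: Array[String]): Unit = {
    val categories = Seq("sports", "music", "tech")
    val pickcat_   = pickcat(categories) // built once, then shipped in a closure
    println(Seq("tech", "sports", "unknown").map(pickcat_))
    println(s"structure built $builds time(s)")
  }
}
```

In a Spark job, `pickcat_` is the value that would be captured by the closure passed to a transformation, so each record costs one hash lookup instead of a full rebuild of the dictionary.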
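In the design and choice of data structures, Weiflow's implementation evolved from crude and simple to carefully refined. Early versions of Weiflow had not yet met the challenge of large-scale computation, so for the sake of development speed they made heavy use of fixed-length arrays, and no performance bottleneck appeared at that stage. Once Weiflow began to carry large-scale workloads, however, execution performance became almost intolerable. Investigation showed that feature mapping involves a great many lookups of a value's index in a data dictionary, as with the pickcat function described above. With tens or hundreds of millions of records to look up, storing the dictionary in a fixed-length array means every lookup by value is a linear scan, and the cost is obvious. After adjusting the dictionary structure and comparing and testing several alternatives, we replaced the fixed-length array with a HashMap, which solved the lookup performance problem.

The stage that generates Libsvm-format samples after feature mapping also used arrays heavily, storing Libsvm values as dense arrays. When the dimensionality of the feature space rose to billions or tens of billions, sample generation could hardly complete at all. A careful analysis of the business scenarios showed that almost every feature space is extremely sparse. In a feature space of one billion dimensions, for example, the number of active features is typically only in the thousands to tens of thousands, so storing and computing the feature space as a dense matrix is a huge waste. Replacing the dense matrix with a sparse matrix solved this performance problem.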
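Both changes can be illustrated with a short sketch. The function names are hypothetical and the code is not Weiflow's. The sample line follows the general LIBSVM text layout of a label followed by `index:value` pairs in ascending index order, with only non-zero features written out.

```scala
import scala.collection.immutable.HashMap

// Hypothetical sketch: hash-based dictionary lookup and sparse sample output.
object DataStructureSketch {
  // Array-backed dictionary: every lookup is a linear scan over the array.
  def lookupInArray(dict: Array[String], value: String): Int = dict.indexOf(value)

  // Hash-backed dictionary: built once, then each lookup is a hash probe.
  def buildIndex(dict: Array[String]): HashMap[String, Int] =
    HashMap(dict.zipWithIndex.toSeq: _*)

  // Sparse representation: only non-zero features are stored and emitted,
  // regardless of how many dimensions the feature space has.
  def toLibsvm(label: Int, features: Map[Int, Double]): String =
    (label.toString +: features.toSeq.sortBy(_._1).map { case (i, v) => s"$i:$v" }).mkString(" ")

  def main(args: Array[String]): Unit = {
    val dict = Array("sports", "music", "tech")
    println(buildIndex(dict)("tech"))
    println(toLibsvm(1, Map(7 -> 1.0, 3 -> 0.5)))
  }
}
```

With the sparse form, the cost of writing a sample depends on the number of active features rather than on the dimensionality of the feature space.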

Table 1: Quantitative comparison of development efficiency, extensibility, and execution efficiency before and after adopting Weiflow
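As mentioned earlier, in Weiflow's two-layer DAG design, the implementation of the inner DAG makes full use of the existing features of the execution engine to improve performance. Taking Spark as an example, the business modules in Weiflow use a range of Spark performance optimization techniques, such as mapPartitions, broadcast variables, DataFrames, aggregateByKey, filter followed by coalesce, and data salting.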
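Of the techniques listed, data salting is perhaps the least self-explanatory, so here is a rough sketch of the idea using plain Scala collections in place of a Spark job. The function name and the deterministic, index-based salt are assumptions for illustration; in a distributed job the salt is commonly chosen at random, and the two aggregation stages would run as separate shuffles.

```scala
// Hypothetical sketch: two-stage aggregation with salted keys.
object SaltingSketch {
  def saltedSum(records: Seq[(String, Long)], buckets: Int): Map[String, Long] = {
    require(buckets > 0, "buckets must be positive")

    // Stage 1: spread each key over up to `buckets` salted sub-keys and
    // pre-aggregate, so a single hot key no longer lands in one group.
    val partial: Seq[(String, Long)] = records.zipWithIndex
      .map { case ((key, value), i) => ((key, i % buckets), value) }
      .groupBy(_._1)
      .toSeq
      .map { case ((key, _), group) => (key, group.map(_._2).sum) }

    // Stage 2: drop the salt and merge the partial sums per original key.
    partial.groupBy(_._1).map { case (key, group) => (key, group.map(_._2).sum) }
  }

  def main(args: Array[String]): Unit = {
    val records = Seq.fill(5)("hot" -> 1L) :+ ("cold" -> 2L)
    println(saltedSum(records, 3))
  }
}
```

The final result is the same as a direct sum per key; what changes is that the heavy first stage is split across several groups, which is what relieves skew when those groups are processed by different tasks.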
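After performance optimization on all these fronts, Weiflow's execution efficiency is fully able to meet Weibo's needs for machine learning at scale. As the comparison in Table 1 shows, execution performance improved more than sixfold after optimization. Table 1 also lists Weiflow's advantages and improvements in development efficiency, ease of use, and extensibility.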
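This article has introduced Weiflow, Weibo's machine learning framework, and its application and best practices at Weibo from three angles: development efficiency (ease of use), extensibility, and execution efficiency. We hope it offers readers a useful reference.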