This article discusses machine learning concepts and how to use Spark MLlib for predictive analytics. A sample application later in the article shows what Spark MLlib can do in the machine learning space.
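1. Introduction
The Spark machine learning API consists of two packages: spark.mllib and spark.ml.
spark.mllib contains the original Spark machine learning API, built on resilient distributed datasets (RDDs). The machine learning techniques it provides include correlation, classification and regression, collaborative filtering, clustering, and dimensionality reduction.
spark.ml provides a machine learning API built on DataFrames, which are a core part of Spark SQL. This package supports developing and managing machine learning pipelines, and it covers feature extractors, transformers, selectors, and machine learning algorithms such as classification, regression, and clustering.
This article focuses on Spark MLlib and discusses the individual machine learning algorithms.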
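2. Machine Learning and Data Science
Machine learning learns from existing data in order to make predictions about future data. It builds a model from an input data set and uses that model to make data-driven decisions.
Data science extracts knowledge from massive data sets (structured and unstructured) to give business teams data insights and to influence business decisions and roadmaps. The role of the data scientist has become more important than that of the people who previously solved such problems with traditional numerical methods.
Machine learning models fall into the following categories:
Supervised learning models
Unsupervised learning models
Semi-supervised learning models
Reinforcement learning models
The following briefly describes and compares each type of model:
Supervised learning: a supervised learning model is trained on a labeled training data set and then makes predictions for unlabeled data;
Supervised learning in turn has two sub-models: regression models and classification models.
Unsupervised learning: an unsupervised learning model finds hidden patterns or relationships in raw data (there is no training data), so it works on unlabeled data sets;
Semi-supervised learning: a semi-supervised learning model combines supervised and unsupervised machine learning for predictive analytics, using both labeled and unlabeled data. A typical scenario mixes a small amount of labeled data with a large amount of unlabeled data. Semi-supervised learning generally uses classification and regression methods;
Reinforcement learning: a reinforcement learning model tries different actions in order to maximize a target reward function.
Here is an example for each type of machine learning model:
Supervised learning: anomaly detection;
Unsupervised learning: social networks, language prediction;
Semi-supervised learning: image classification, speech recognition;
Reinforcement learning: artificial intelligence (AI).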
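3. Steps in a Machine Learning Project
When developing a machine learning project, data preprocessing, cleansing, and analysis are just as important as the actual learning models and algorithms used to solve the business problem.
The general steps of a typical machine learning solution are:
Feature engineering
Model training
Model evaluation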

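Figure 1
If the raw data is not cleansed or preprocessed, the final results can be inaccurate or unusable, and important details may even be lost.
The quality of the training data is critical to the final predictions. If the training data is not random enough, the resulting model will be inaccurate; if the data set is too small, the learned model will also be inaccurate.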
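To make the point about random, adequately sized training data concrete, here is a minimal sketch in plain Java (not a Spark API) that shuffles the records with a fixed seed and holds part of them out for evaluation. The class name TrainTestSplitSketch, the 80/20 ratio, and the seed are illustrative assumptions and are not taken from the sample application.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class TrainTestSplitSketch {

    /** Holds the two partitions produced by split(). */
    static final class Split<T> {
        final List<T> training;
        final List<T> test;

        Split(List<T> training, List<T> test) {
            this.training = training;
            this.test = test;
        }
    }

    /**
     * Shuffles a copy of the data with a fixed seed, then cuts it so that
     * roughly trainFraction of the records go to the training set.
     * The input list itself is left untouched.
     */
    static <T> Split<T> split(List<T> data, double trainFraction, long seed) {
        if (trainFraction < 0.0 || trainFraction > 1.0) {
            throw new IllegalArgumentException("trainFraction must be in [0, 1]");
        }
        List<T> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(shuffled.size() * trainFraction);
        return new Split<>(
                new ArrayList<>(shuffled.subList(0, cut)),
                new ArrayList<>(shuffled.subList(cut, shuffled.size())));
    }

    public static void main(String[] args) {
        List<Integer> records = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            records.add(i);
        }
        Split<Integer> parts = split(records, 0.8, 42L);
        System.out.println("training size = " + parts.training.size());
        System.out.println("test size = " + parts.test.size());
    }
}
```

Fixing the seed keeps the split reproducible between runs, which makes it easier to compare models trained on the same data.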
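Use cases:
Business use cases span many domains, including personalized recommendation engines (for example, a food recommendation engine), predictive analytics (stock price prediction or flight delay prediction), advertising, anomaly detection, image and video recognition, and various other kinds of artificial intelligence.
Next, let's look at two popular machine learning applications: personalized recommendation engines and anomaly detection.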
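4. Machine Learning Applications
4.1 Recommendation Engines
A personalized recommendation engine uses item attributes and user behavior to make predictions. Recommendation engines are generally implemented with one of two algorithms: content-based filtering or collaborative filtering.
Collaborative filtering solutions generally perform better than the other approaches, and Spark MLlib implements collaborative filtering with the ALS (alternating least squares) algorithm. Collaborative filtering in Spark MLlib comes in two forms: explicit feedback and implicit feedback. Explicit feedback is based on what users say directly about the items they have purchased or watched (for example, a rating for a movie). Explicit feedback is good data, but in many cases it is skewed. Implicit feedback is based on user behavior data such as views, clicks, and likes. Implicit feedback is now widely used in industry for predictive analytics because this kind of data is easy to collect.
Recommendation engines can also be implemented with model-based methods, which are not covered here.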
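The idea behind ALS can be illustrated without Spark. The following is a minimal sketch, assuming a single latent factor per user and per item (rank 1), explicit ratings, and plain L2 regularization. It is not Spark MLlib's implementation, which handles higher ranks on distributed data. All names (RankOneAlsSketch, Obs, train, predict) and the toy ratings are hypothetical.

```java
import java.util.Arrays;

class RankOneAlsSketch {

    /** One observed explicit rating: user index, item index, rating value. */
    static final class Obs {
        final int user;
        final int item;
        final double rating;

        Obs(int user, int item, double rating) {
            this.user = user;
            this.item = item;
            this.rating = rating;
        }
    }

    /**
     * Solves the regularized least-squares problem for one side while the
     * other side is held fixed. For each target index t:
     * target[t] = sum(rating * fixed[f]) / (lambda + sum(fixed[f]^2)),
     * where the sums run over the ratings observed for t.
     */
    static void solve(double[] target, double[] fixed, Obs[] data,
                      boolean targetIsUser, double lambda) {
        double[] num = new double[target.length];
        double[] den = new double[target.length];
        for (Obs o : data) {
            int t = targetIsUser ? o.user : o.item;
            int f = targetIsUser ? o.item : o.user;
            num[t] += o.rating * fixed[f];
            den[t] += fixed[f] * fixed[f];
        }
        for (int i = 0; i < target.length; i++) {
            double d = lambda + den[i];
            // Guard against division by zero (only possible when lambda is 0).
            if (d > 0.0) {
                target[i] = num[i] / d;
            }
        }
    }

    /** Returns {userFactors, itemFactors} after the given number of alternations. */
    static double[][] train(Obs[] data, int users, int items,
                            int iterations, double lambda) {
        double[] u = new double[users];
        double[] v = new double[items];
        Arrays.fill(v, 1.0); // deterministic start, chosen for this sketch only
        for (int it = 0; it < iterations; it++) {
            solve(u, v, data, true, lambda);   // fix items, solve for users
            solve(v, u, data, false, lambda);  // fix users, solve for items
        }
        return new double[][] {u, v};
    }

    /** Predicted rating is the product of the user factor and the item factor. */
    static double predict(double[][] model, int user, int item) {
        return model[0][user] * model[1][item];
    }

    public static void main(String[] args) {
        Obs[] data = {
            new Obs(0, 0, 5.0), new Obs(0, 1, 3.0),
            new Obs(1, 0, 4.0), new Obs(1, 2, 1.0),
            new Obs(2, 1, 2.0), new Obs(2, 2, 1.0)
        };
        double[][] model = train(data, 3, 3, 20, 0.1);
        System.out.println("predicted rating of user 0 for item 2: "
                + predict(model, 0, 2));
    }
}
```

The name "alternating" comes from the two solve calls: with one side fixed, the other side has a closed-form least-squares solution, so the loop simply switches sides on each pass.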
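4.2 Anomaly Detection
Anomaly detection is another widely used machine learning technique because it can solve thorny problems in the financial industry quickly and accurately. Financial services firms need to decide within a few hundred milliseconds whether an online transaction is fraudulent.
Neural network techniques are used for anomaly detection at the point of sale. Companies such as PayPal use a range of machine learning algorithms (for example, linear regression, neural networks, and deep learning) for risk management.
The Spark MLlib library provides several algorithms for this, such as linear SVM, logistic regression, decision trees, and naive Bayes. It also provides ensemble models such as random forests and gradient-boosted trees.
Now let's begin our journey into machine learning with the Apache Spark framework.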
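5. Spark MLlib
Spark MLlib is Spark's machine learning library. It makes machine learning scalable and easy to use, and it includes classification, regression, clustering, collaborative filtering, and dimensionality reduction algorithms together with the corresponding APIs. Beyond these algorithms, Spark MLlib also provides a range of data processing functions and data analysis tools:
Frequent itemset mining and association analysis with the FP-growth algorithm;
Sequential pattern mining with the PrefixSpan algorithm;
Summary statistics and hypothesis testing;
Feature transformations;
Machine learning model evaluation and hyperparameter tuning.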

Figure 2: The Spark ecosystem
The Spark MLlib API supports programming in Scala, Java, and Python.
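6. Spark MLlib in Practice
This section uses Spark MLlib to implement a recommendation engine. A recommendation engine is best used to predict unknown items a user may be interested in, based on that user's known behavior toward items. The engine trains a predictive model on known data (that is, the training data) and then uses the trained model to make predictions.
A top-movie recommendation engine is implemented in the following steps:
Load the movie data;
Load the ratings data that you provide yourself;
Load the ratings data provided by the community;
Join the ratings data into a single RDD;
Train the model with the ALS algorithm;
Identify the movies that the given user (userId = 1) has not rated;
Predict ratings for the movies the user has not rated;
Get the top N recommendations (here N = 5);
Display the recommendations in the terminal.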
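The last few steps above (finding the movies a user has not rated, predicting their ratings, and keeping the top N) can be sketched in plain Java. This is a minimal illustration rather than the sample application's code: the predictor is a hypothetical stand-in for a trained model, and the class and method names are assumptions. In the actual application the scores would come from the trained ALS model.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.IntToDoubleFunction;

class TopNRecommendationSketch {

    /** A candidate movie together with its predicted rating. */
    static final class Scored {
        final int movieId;
        final double predictedRating;

        Scored(int movieId, double predictedRating) {
            this.movieId = movieId;
            this.predictedRating = predictedRating;
        }
    }

    /**
     * Keeps the movies the user has not rated, scores each one with the
     * supplied predictor, and returns the n highest-scoring ones, best first.
     */
    static List<Scored> topN(List<Integer> allMovieIds, Set<Integer> ratedByUser,
                             IntToDoubleFunction predictor, int n) {
        List<Scored> candidates = new ArrayList<>();
        for (int movieId : allMovieIds) {
            if (!ratedByUser.contains(movieId)) {
                candidates.add(new Scored(movieId, predictor.applyAsDouble(movieId)));
            }
        }
        candidates.sort(
                Comparator.comparingDouble((Scored s) -> s.predictedRating).reversed());
        int limit = Math.max(0, Math.min(n, candidates.size()));
        return new ArrayList<>(candidates.subList(0, limit));
    }

    public static void main(String[] args) {
        List<Integer> movies = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8);
        Set<Integer> rated = new HashSet<>(Arrays.asList(2, 5));
        // Hypothetical stand-in for a trained model's predictions for userId = 1.
        IntToDoubleFunction predictor = movieId -> (movieId * 7 % 10) / 2.0;
        for (Scored s : topN(movies, rated, predictor, 5)) {
            System.out.println("movie " + s.movieId + " -> " + s.predictedRating);
        }
    }
}
```

Passing the predictor in as a function keeps the ranking logic independent of how the model was trained.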
If you want to analyze the output further, you can store the predictions in a database such as Cassandra or MongoDB.
7. Technologies Used
The Spark MLlib program here is written in Java and run in stand-alone mode. It uses the following MLlib Java classes from the package org.apache.spark.mllib.recommendation:
ALS
MatrixFactorizationModel
Rating

Figure 3: Architecture of the Spark machine learning sample program
Running the program:
Package the finished program and set the environment variables for the JDK (JAVA_HOME), Maven (MAVEN_HOME), and Spark (SPARK_HOME).
On Windows:
set JAVA_HOME=[JDK_INSTALL_DIRECTORY]
set PATH=%PATH%;%JAVA_HOME%\bin
set MAVEN_HOME=[MAVEN_INSTALL_DIRECTORY]
set PATH=%PATH%;%MAVEN_HOME%\bin
set SPARK_HOME=[SPARK_INSTALL_DIRECTORY]
set PATH=%PATH%;%SPARK_HOME%\bin
cd c:\dev\projects\spark-mllib-sample-app
mvn clean install
mvn eclipse:clean eclipse:eclipse
On Linux or macOS:
export JAVA_HOME=[JDK_INSTALL_DIRECTORY]
export PATH=$PATH:$JAVA_HOME/bin
export MAVEN_HOME=[MAVEN_INSTALL_DIRECTORY]
export PATH=$PATH:$MAVEN_HOME/bin
export SPARK_HOME=[SPARK_INSTALL_DIRECTORY]
export PATH=$PATH:$SPARK_HOME/bin
cd /Users/USER_NAME/spark-mllib-sample-app
mvn clean install
mvn eclipse:clean eclipse:eclipse
Run the Spark program with the following command:
On Windows:
%SPARK_HOME%\bin\spark-submit --class "org.apache.spark.examples.mllib.JavaRecommendationExample" --master local[*] target\spark-mllib-sample-1.0.jar
On Linux or macOS:
$SPARK_HOME/bin/spark-submit --class "org.apache.spark.examples.mllib.JavaRecommendationExample" --master local[*] target/spark-mllib-sample-1.0.jar
Monitoring the Spark MLlib Application
You can monitor the program's running status in the Spark web console. Only the directed acyclic graph (DAG) of the program run is shown here:

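Figure 4: DAG visualization
8. Conclusion
Spark MLlib is one of the machine learning libraries that Spark provides. It is often used for predictive analytics on business data, for example in personalized recommendation engines and anomaly detection systems.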