Editor's pick:
This article comes from 网络大数据 (Network Big Data). It introduces some features of Apache Spark and shows how to set up Spark with Python so that the two can be used together.
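Apache Spark is one of the most widely used frameworks for processing and working with big data, and Python is one of the most widely used programming languages in data analysis, machine learning, and related fields. If you want stronger machine learning capabilities, why not use Spark and Python together?
Abroad, the average annual salary of an Apache Spark developer is $110,000. Spark is undoubtedly widely used in the industry. Thanks to its rich set of libraries, Python is also used by most data scientists and analytics experts. Integrating the two is not that difficult. Spark is developed in Scala, a language very similar to Java, which compiles program code into JVM bytecode for Spark's big data processing. To integrate Spark with Python, the Apache Spark community released PySpark.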
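Apache Spark is an open-source cluster computing framework for real-time processing, developed by the Apache Software Foundation. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Below are some features of Apache Spark that give it an edge over other frameworks:
Speed: up to 100 times faster than traditional large-scale data processing frameworks.
Powerful caching: a simple programming layer provides powerful caching and disk persistence capabilities.
Deployment: it can be deployed through Mesos, YARN, or Spark's own cluster manager.
Real time: in-memory computation gives real-time processing with low latency.
Polyglot: one of the most important features of the framework, since it can be programmed in Scala, Java, Python, and R.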
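Spark was designed in Scala, which can make it 10 times faster than Python, but Scala only shows this speed advantage when the number of cores in use is small. Since most analysis and processing today requires a large number of cores, Scala's performance advantage is not that significant.
For programmers, Python is comparatively easier to learn because of its syntax and rich standard library. It is also a dynamically typed language, which means RDDs can hold objects of multiple types.
Although Scala has Spark MLlib, it does not have enough libraries and tools for machine learning and NLP. Scala also lacks data visualization.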
Setting Up Spark with Python (PySpark)
First, download and install Spark. Once you have extracted the Spark archive, installed it, and added its path to your .bashrc file, you need to run source .bashrc
To open the PySpark shell, run the command ./bin/pyspark
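PySpark SparkContext and Data Flow
When you connect to Spark from Python, you work with RDDs, which is made possible by the Py4j library. The PySpark shell links the Python API to Spark Core and initializes the Spark Context. The SparkContext is the heart of any Spark application.
The Spark Context sets up internal services and establishes a connection to the Spark execution environment.
The Spark Context object in the driver program coordinates all the distributed processes and allows resource allocation.
The cluster manager provides executors, which are JVM processes that carry the application logic.
The Spark Context object sends the application to the executors.
The Spark Context executes tasks in each executor.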
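Inside the PySpark shell the context is already available as `sc`, so nothing needs to be created by hand. For a standalone script, a minimal sketch of starting the driver-side context might look like the following. The application name `kdd-example`, the `local[*]` master, and the helper names `spark_settings` and `create_context` are assumptions made for illustration, not part of the original article.

```python
def spark_settings(app_name="kdd-example", master="local[*]"):
    """Return configuration pairs for a local, single-machine run.

    Both default values are illustrative assumptions.
    """
    return {"spark.app.name": app_name, "spark.master": master}


def create_context(settings):
    """Start (or reuse) the driver-side SparkContext for these settings."""
    # Imported lazily so this module can be loaded without PySpark installed.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAll(list(settings.items()))
    # getOrCreate() reuses a running context instead of starting a second one.
    return SparkContext.getOrCreate(conf)
```

Calling `sc = create_context(spark_settings())` in a script would then give roughly the same `sc` object that the shell provides.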
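PySpark KDD Use Case
Now let's look at a use case. The data comes from the KDD'99 Cup (the International Knowledge Discovery and Data Mining Tools Competition; similar competitions with open datasets also exist in China, for example on Zhihu). Here we take only a portion of the dataset, because the original dataset is too large.
Creating the RDD:
Now we can use this file to create our RDD.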
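A minimal sketch of this step, assuming the reduced dataset has been downloaded to a local file called `kddcup.data_10_percent.gz` (the path is an assumption; the text does not name the file) and that `sc` is the SparkContext provided by the PySpark shell:

```python
# Assumed local path to the reduced KDD'99 file; adjust to your download location.
DATA_FILE = "./kddcup.data_10_percent.gz"


def load_raw_data(sc, data_file=DATA_FILE):
    """Create an RDD of strings, one element per line of the file."""
    # textFile() is lazy: nothing is read until an action runs on the RDD.
    return sc.textFile(data_file)
```

In the shell this would be used as `raw_data = load_raw_data(sc)`.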
Filtering:
Suppose we want to count how many normal interactions we have in the dataset. We can do this by filtering our raw_data RDD.
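One way to sketch the filter, assuming each line of the file ends with its label and that normal interactions carry the label `normal.` (the helper names are hypothetical):

```python
NORMAL_LABEL = "normal."  # assumed label text for normal interactions


def is_normal(line):
    """Return True if the raw text line is tagged as a normal interaction."""
    return NORMAL_LABEL in line


def filter_normal(raw_data):
    """Keep only the normal interactions from an RDD of raw lines."""
    # filter() is a transformation: it returns a new RDD and runs no job yet.
    return raw_data.filter(is_normal)
```

With the RDD from the previous step, `normal_raw_data = filter_normal(raw_data)` gives the filtered RDD.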
Counting:
Now we can count how many elements the new RDD contains.
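A small sketch of the counting step, with timing added to show that `count()` is the point where Spark actually does the work. The helper name `timed_count` is an assumption, and `rdd` stands for the filtered RDD of normal interactions:

```python
from time import perf_counter


def timed_count(rdd):
    """Run the count() action and return (number_of_elements, seconds_taken)."""
    start = perf_counter()
    n = rdd.count()  # action: triggers the pending transformations
    return n, perf_counter() - start
```

For example, `normal_count, seconds = timed_count(normal_raw_data)`.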
Mapping:
In this case, we want to read the data file as a CSV-formatted file. We can do this by applying a lambda function to each element of the RDD, using the map() transformation and the take() action.
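A minimal sketch of this step, assuming `raw_data` is the RDD of text lines created earlier. A named function is used here in place of the lambda so it can be checked on its own, and the helper names are hypothetical:

```python
def split_csv_line(line):
    """Split one comma-separated line into a list of field strings."""
    return line.split(",")


def first_rows(raw_data, n=5):
    """Return the first n lines of the RDD, each parsed into a list of fields."""
    csv_data = raw_data.map(split_csv_line)  # transformation, evaluated lazily
    return csv_data.take(n)                  # action, returns a Python list
```

The equivalent one-liner with a lambda would be `raw_data.map(lambda x: x.split(",")).take(5)`.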
Splitting:
Now we want each element of the RDD to be a key-value pair, where the key is the tag (for example, normal) and the value is the whole list of elements that represents the row in the CSV-formatted file.
We can do this with line.split() and map().
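A sketch of the key-value step, assuming the tag is the last field of each line (the function name `parse_interaction` is chosen for illustration):

```python
def parse_interaction(line):
    """Turn one CSV line into a (tag, fields) key-value pair."""
    elems = line.split(",")
    tag = elems[-1]  # assumption: the label is the last field of the row
    return (tag, elems)
```

Applied to the RDD this becomes `key_csv_data = raw_data.map(parse_interaction)`, and `key_csv_data.take(5)` shows the first few pairs.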
Collecting:
The collect() action brings all the elements of the RDD into memory on the driver, so it must be used with care on large RDDs.
Of course, this takes longer than any of the operations we ran before. Every Spark worker node that holds a fragment of the RDD has to be coordinated so that its part can be retrieved, and then everything is gathered together.
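One possible way to apply that caution in code is to check the size before collecting. This is only a sketch: the helper `collect_if_small` and its default limit are assumptions, not something Spark provides, and the extra `count()` costs a second pass over the data.

```python
def collect_if_small(rdd, max_elements=100_000):
    """Collect the RDD to the driver only if it has at most max_elements items."""
    n = rdd.count()  # first job: find out how large the result would be
    if n > max_elements:
        raise ValueError(
            f"RDD has {n} elements, more than the limit of {max_elements}"
        )
    return rdd.collect()  # second job: bring every element to the driver
```

When the RDD is known to fit in memory, a plain `raw_data.collect()` does the same without the extra job.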
As a final example that combines everything above, we want to collect all the normal interactions as key-value pairs.
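Putting the steps together might look like the sketch below. It assumes `raw_data` is the RDD of text lines, that the tag is the last field of each line, and that normal interactions are tagged `normal.`; all helper names are hypothetical.

```python
def parse_interaction(line):
    """Turn one CSV line into a (tag, fields) key-value pair."""
    elems = line.split(",")
    return (elems[-1], elems)  # assumption: the label is the last field


def is_normal_pair(pair):
    """Return True if the key of a (tag, fields) pair marks a normal interaction."""
    return pair[0] == "normal."


def collect_normal_pairs(raw_data):
    """Map lines to key-value pairs, keep the normal ones, and collect them."""
    key_csv_data = raw_data.map(parse_interaction)
    normal_pairs = key_csv_data.filter(is_normal_pair)
    return normal_pairs.collect()  # action: results are returned to the driver
```

Because the filter runs before `collect()`, only the normal interactions are transferred to the driver.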