What Is Spark
Apache Spark is a big data processing framework built around speed, ease of use, and sophisticated analytics. It was originally developed in 2009 at UC Berkeley's AMPLab and became one of Apache's open source projects in 2010.
Compared with other big data and MapReduce technologies such as Hadoop and Storm, Spark has the following advantages.
First of all, Spark offers a comprehensive, unified framework for managing big data processing requirements across data sets that are diverse in nature (text data, graph data, and so on) and in source (batch data or real-time streaming data).
Spark can run applications in a Hadoop cluster up to 100 times faster in memory, and up to 10 times faster even when running on disk.
Spark lets developers quickly write programs in Java, Scala, or Python. It comes with a built-in set of more than 80 high-level operators, and you can also use it to query data interactively within the shell.
Besides Map and Reduce operations, it supports SQL queries, streaming data, machine learning, and graph data processing. Developers can use any one of these capabilities stand-alone in a data pipeline use case or combine them together.
In this first part of the Apache Spark article series, we will look at what Spark is, how it compares with a typical MapReduce solution, and how it provides a complete suite of tools for big data processing.
Hadoop and Spark
Hadoop, the big data processing technology, has been around for about ten years and is seen as the solution of choice for processing large data sets. MapReduce is a great solution for one-pass computations, but it is not very efficient for use cases that require multi-pass computations and algorithms. Every step in the data processing workflow needs one Map phase and one Reduce phase, and to leverage this solution you have to convert every use case into the MapReduce pattern.
The job output data of each step has to be stored in the distributed file system before the next step can begin. As a result, replication and disk storage make this approach slow. Also, Hadoop solutions typically include clusters that are hard to set up and manage, and they require the integration of several different tools for different big data use cases (such as Mahout for machine learning and Storm for streaming data processing).
If you want to do something complicated, you have to string together a series of MapReduce jobs and execute them in sequence. Each of those jobs is high-latency, and no job can start until the previous one has completely finished.
Spark, on the other hand, allows programmers to develop complex, multi-step data pipelines using a directed acyclic graph (DAG). It also supports in-memory data sharing across DAGs, so that different jobs can work with the same data.
Spark runs on top of the existing Hadoop Distributed File System (HDFS) infrastructure to provide additional, enhanced functionality. It supports deploying Spark applications in an existing Hadoop v1 cluster (with SIMR – Spark-Inside-MapReduce), a Hadoop v2 YARN cluster, or even Apache Mesos.
We should look at Spark as an alternative to Hadoop MapReduce rather than a replacement for Hadoop. The intent is not to replace Hadoop but to provide a comprehensive and unified solution for managing different big data use cases and requirements.
Spark Features
Spark takes MapReduce to the next level with less expensive shuffles during data processing. With in-memory data storage and near real-time processing capabilities, Spark's performance can be several times faster than other big data processing technologies.
Spark also supports lazy evaluation of big data queries, which helps optimize the steps in data processing workflows. It provides a higher-level API to improve developer productivity and, beyond that, a consistent architecture model for big data solutions.
Spark holds intermediate results in memory rather than writing them to disk, which is especially useful when the same data set needs to be processed multiple times. Spark was designed from the start as an execution engine that works both in memory and on disk. Spark operators perform external operations when the data does not fit in memory, so Spark can be used to process data sets larger than the aggregate memory of a cluster.
Spark will attempt to store as much data in memory as possible and then spill to disk. It can store part of a data set in memory and the rest on disk. Developers need to assess the memory requirements based on their data and use cases. Spark's performance advantage comes from this in-memory data storage.
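This memory-plus-disk behavior can be requested explicitly through an RDD's storage level. The sketch below assumes a running Spark shell (where the `sc` context is predefined) and a local `README.md` file; it is an illustration of the spill behavior described above, not a required step:

```scala
import org.apache.spark.storage.StorageLevel

// Read a text file into an RDD (assumes README.md exists in the working directory).
val lines = sc.textFile("README.md")

// MEMORY_AND_DISK keeps as many partitions in memory as fit
// and spills the remaining partitions to disk.
lines.persist(StorageLevel.MEMORY_AND_DISK)

// persist() is lazy; the data is actually materialized on the first action.
lines.count()
```

Note that `cache()` is shorthand for persisting with the default memory-only storage level.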
Other Spark features include:
1. Supports more than just Map and Reduce functions.
2. Optimizes arbitrary operator graphs.
3. Lazy evaluation of big data queries, which helps optimize the overall data processing workflow.
4. Provides concise, consistent APIs in Scala, Java, and Python.
5. Offers an interactive shell for Scala and Python. (This is not available in Java yet.)
Spark is written in the Scala programming language and runs in a Java Virtual Machine (JVM) environment. It currently supports the following languages for developing Spark applications:
1. Scala
2. Java
3. Python
4. Clojure
5. R
Spark Ecosystem
Besides the Spark core API, the Spark ecosystem includes additional libraries that provide further capabilities in the big data analytics and machine learning areas.
These libraries include:
1. Spark Streaming:
Spark Streaming can be used for processing real-time streaming data. It is based on a micro-batch style of computing and processing, and it uses the DStream, which is basically a series of Resilient Distributed Datasets (RDDs), to process real-time data.
2. Spark SQL:
Spark SQL provides the capability to expose Spark data sets over a JDBC API and to run SQL-like queries on Spark data using traditional BI and visualization tools. Users can also use Spark SQL to ETL data in different formats (such as JSON, Parquet, or a database), transform it, and expose it for ad hoc querying.
3. Spark MLlib:
MLlib is Spark's scalable machine learning library, consisting of common learning algorithms and utilities, including binary classification, linear regression, clustering, collaborative filtering, and gradient descent, as well as underlying optimization primitives.
4. Spark GraphX:
GraphX is the new (alpha) Spark API for graphs and graph-parallel computation. GraphX extends the Spark RDD by introducing the Resilient Distributed Property Graph, a directed multigraph with properties attached to each vertex and edge. To support graph computation, GraphX exposes a set of fundamental operators (such as subgraph, joinVertices, and aggregateMessages) as well as an optimized variant of the Pregel API. In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.
³ýÁËÕâЩ¿âÒÔÍ⣬»¹ÓÐһЩÆäËûµÄ¿â£¬ÈçBlinkDBºÍTachyon¡£
BlinkDBÊÇÒ»¸ö½üËÆ²éѯÒýÇæ£¬ÓÃÓÚÔÚº£Á¿Êý¾ÝÉÏÖ´Ðн»»¥Ê½SQL²éѯ¡£BlinkDB¿ÉÒÔͨ¹ýÎþÉüÊý¾Ý¾«¶ÈÀ´ÌáÉý²éѯÏìӦʱ¼ä¡£Í¨¹ýÔÚÊý¾ÝÑù±¾ÉÏÖ´Ðвéѯ²¢Õ¹Ê¾°üº¬ÓÐÒâÒåµÄ´íÎóÏß×¢½âµÄ½á¹û£¬²Ù×÷´óÊý¾Ý¼¯ºÏ¡£
TachyonÊÇÒ»¸öÒÔÄÚ´æÎªÖÐÐÄµÄ ·Ö²¼Ê½Îļþϵͳ£¬Äܹ»ÌṩÄÚ´æ¼¶±ðËٶȵĿ缯Ⱥ¿ò¼Ü£¨ÈçSparkºÍMapReduce£©µÄ¿ÉÐÅÎļþ¹²Ïí¡£Ëü½«¹¤×÷¼¯Îļþ»º´æÔÚÄÚ´æÖУ¬´Ó¶ø±ÜÃâµ½´ÅÅÌÖÐ
¼ÓÔØÐèÒª¾³£¶ÁÈ¡µÄÊý¾Ý¼¯¡£Í¨¹ýÕâÒ»»úÖÆ£¬²»Í¬µÄ×÷Òµ/²éѯºÍ¿ò¼Ü¿ÉÒÔÒÔÄÚ´æ¼¶µÄËÙ¶È·ÃÎÊ»º´æµÄÎļþ¡£
´ËÍ⣬»¹ÓÐһЩÓÃÓÚÓëÆäËû²úÆ·¼¯³ÉµÄÊÊÅäÆ÷£¬ÈçCassandra£¨Spark Cassandra Á¬½ÓÆ÷£©ºÍR£¨SparkR£©¡£Cassandra
Connector¿ÉÓÃÓÚ·ÃÎÊ´æ´¢ÔÚCassandraÊý¾Ý¿âÖеÄÊý¾Ý²¢ÔÚÕâЩÊý¾ÝÉÏÖ´ÐÐÊý¾Ý·ÖÎö¡£
The following diagram shows how these different libraries in the Spark ecosystem relate to each other.
Figure 1. Libraries in the Spark framework
We will explore these Spark libraries step by step in this article series.
Spark Architecture
Spark architecture includes the following three main components:
1. Data storage
2. API
3. Management framework
Let's look at each of these components in more detail.
Data storage:
Spark uses the HDFS file system for data storage. It works with any Hadoop-compatible data source, including HDFS, HBase, Cassandra, etc.
API:
The API lets application developers create Spark-based applications using a standard API interface. Spark provides APIs for the Scala, Java, and Python programming languages.
Resource management:
Spark can be deployed on a stand-alone server or on a distributed computing framework such as Mesos or YARN.
Figure 2 below shows these components of the Spark architecture model.
Figure 2. Spark architecture
Resilient Distributed Datasets
The Resilient Distributed Dataset (based on Matei's research paper), or RDD, is the core concept in the Spark framework. Think of an RDD as a table in a database that can hold any type of data. Spark stores data in RDDs on different partitions.
RDDs help with rearranging computations and optimizing data processing.
They are also fault tolerant, because an RDD knows how to recreate and recompute its data sets.
RDDs are immutable. You can modify an RDD with a transformation, but the transformation returns a brand-new RDD while the original RDD remains unchanged.
RDDs support two types of operations:
1. Transformation
2. Action
Transformation: Transformations don't return a single value; they return a new RDD. Nothing gets evaluated when you call a transformation function; it just takes an RDD and returns a new RDD.
Transformation functions include map, filter, flatMap, groupByKey, reduceByKey, aggregateByKey, pipe, and coalesce.
Action: Action operations evaluate and return a new value. When an action function is called on an RDD object, all of the data processing queries are computed at that moment and the result value is returned.
Action operations include reduce, collect, count, first, take, countByKey, and foreach.
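The transformation/action split can be seen directly in a Spark shell session. The sketch below assumes a running Spark shell (with `sc` predefined) and a local `README.md` file:

```scala
// Transformations: nothing is computed yet; each call just builds a new RDD.
val lines  = sc.textFile("README.md")          // RDD of lines
val words  = lines.flatMap(l => l.split(" "))  // RDD of words
val sparky = words.filter(w => w.contains("Spark"))

// Action: calling count() triggers the whole chain above to execute
// and returns a single value to the driver.
val n = sparky.count()
println(s"words containing 'Spark': $n")
```

Note the division of labor: the transformations describe the pipeline, and the action at the end is what actually runs it.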
How to Install Spark
There are a few different ways to install and use Spark. You can install it on your own machine as a stand-alone framework, obtain a Spark virtual machine image from a vendor such as Cloudera, HortonWorks, or MapR and use it directly, or use Spark already installed and configured in a cloud environment (such as Databricks Cloud).
In this article, we will install Spark as a stand-alone framework and launch it locally. Spark recently released version 1.2.0, and we will use that version to demonstrate the sample application code.
How to Run Spark
Once you have Spark installed on your local machine or are using a cloud-based installation, there are a few different modes in which you can connect to the Spark engine.
The following table shows the Master URL parameter needed for the different Spark running modes.
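As a quick reference, the master URL values documented for Spark releases of this era can be passed to the shell via the `--master` flag; the hostnames and ports below are placeholders:

```shell
bin/spark-shell --master local              # run locally with one worker thread
bin/spark-shell --master local[4]           # run locally with 4 worker threads
bin/spark-shell --master spark://HOST:7077  # connect to a stand-alone Spark cluster
bin/spark-shell --master mesos://HOST:5050  # connect to an Apache Mesos cluster
bin/spark-shell --master yarn-client        # connect to a YARN cluster in client mode
```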
How to Interact with Spark
SparkÆô¶¯²¢ÔËÐк󣬿ÉÒÔÓÃSpark shellÁ¬½Óµ½SparkÒýÇæ½øÐн»»¥Ê½Êý¾Ý·ÖÎö¡£Spark
shellÖ§³ÖScalaºÍPythonÁ½ÖÖÓïÑÔ¡£Java²»Ö§³Ö½»»¥Ê½µÄShell£¬Òò´ËÕâÒ»¹¦ÄÜÔÝδÔÚJavaÓïÑÔÖÐʵÏÖ¡£
¿ÉÒÔÓÃspark-shell.cmdºÍpyspark.cmdÃüÁî·Ö±ðÔËÐÐScala°æ±¾ºÍPython°æ±¾µÄSpark
Shell¡£
Spark Web Console
Regardless of which mode Spark is running in, you can view the Spark job results and other statistics by visiting the Spark web console at the following URL:
http://localhost:4040
The Spark console is shown in Figure 3 below, with tabs for Stages, Storage, Environment, and Executors.
Figure 3. Spark web console
Shared Variables
Spark provides two types of shared variables to make running Spark programs in a cluster environment more efficient: broadcast variables and accumulators.
Broadcast variables: Broadcast variables let you keep a read-only variable cached on each machine instead of shipping a copy of it with every task. They can be used to give the nodes in the cluster copies of large input data sets more efficiently.
The following code snippet shows how to use broadcast variables.
//
// Broadcast Variables
//
val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar.value
Accumulators: Accumulators are only "added" to through an associative operation, so they can be efficiently supported in parallel. They can be used to implement counters (as in MapReduce) or sums. Tasks running on the cluster can add to an accumulator variable with the add method, but the tasks cannot read its value; only the driver program can read an accumulator's value.
The code snippet below shows how to use an accumulator shared variable:
//
// Accumulators
//
val accum = sc.accumulator(0, "My Accumulator")
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
accum.value
Sample Spark Application
The sample application in this article is a simple word count application, the same example you would cover when learning big data processing with Hadoop. We will run some data analytics queries on a text file. The text file and data set in this example are small, but the same Spark queries can be used on large data sets without any code modifications.
To keep the discussion simple, we will use the Spark Scala shell.
First, let's look at how to install Spark on your own machine.
Prerequisites:
For Spark to work on your local machine, you need the Java Development Kit (JDK) installed. This is covered in step one below.
You also need the Spark software installed on your machine. Step two below describes how to do this.
Note: The following instructions use a Windows environment as an example. If you use a different operating system, you will need to adjust the system variables and directory paths to match your environment.
I. Install the JDK
1) Download the JDK from the Oracle website. JDK version 1.7 is recommended.
Install the JDK in a directory whose path contains no spaces. Windows users should install the JDK in a folder like c:\dev, not in the "c:\Program Files" folder. The "c:\Program Files" folder name contains a space, which causes problems when software is installed there.
Note: Do not install the JDK or the Spark software (described in step two) in the "c:\Program Files" folder.
2) After installing the JDK, change to the "bin" folder under the JDK 1.7 directory and type the following command to verify that the JDK is installed correctly:
java -version
If the JDK is installed correctly, the command above will display the Java version.
II. Install the Spark software:
Download the latest Spark release from the Spark website. At the time this article was published, the latest Spark version was 1.2. You can choose a specific Spark build depending on your Hadoop version. I downloaded Spark for Hadoop 2.4 or later; the file name is spark-1.2.0-bin-hadoop2.4.tgz.
Unpack the installation file into a local folder (for example, c:\dev).
To verify the Spark installation, change to the Spark folder and launch the Spark shell with the following commands. These commands are for Windows. If you use Linux or Mac OS, edit the commands accordingly so they run correctly on your platform.
c:
cd c:\dev\spark-1.2.0-bin-hadoop2.4
bin\spark-shell
If Spark is installed correctly, you should see messages like the following in the console output.
….
15/01/17 23:17:46 INFO HttpServer: Starting HTTP Server
15/01/17 23:17:46 INFO Utils: Successfully started service 'HTTP class server' on port 58132.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.2.0
/_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Type :help for more information.
….
15/01/17 23:17:53 INFO BlockManagerMaster: Registered BlockManager
15/01/17 23:17:53 INFO SparkILoop: Created spark context..
Spark context available as sc.
You can type the following commands to check whether the Spark shell is working correctly:
sc.version
(or)
sc.appName
After completing the above steps, you can exit the Spark shell window by typing the following command:
:quit
To launch the Spark Python shell, you need Python installed on your machine. You can download and install Anaconda, a free Python distribution that includes several popular Python packages for science, math, engineering, and data analysis.
Then you can start the Spark Python shell by running the following commands:
c:
cd c:\dev\spark-1.2.0-bin-hadoop2.4
bin\pyspark
Word Count Application
Once Spark is installed and started, you can run data analytics queries with the Spark API.
These commands to read a text file and process the data are quite simple. We will introduce more advanced use cases of the Spark framework in later articles in this series.
First, let's use the Spark API to run the popular word count example. Open a new Scala shell window if you don't already have one running. Here are the commands for this example:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
val txtFile = "README.md"
val txtData = sc.textFile(txtFile)
txtData.cache()
We can call the cache function to save the RDD object created in the previous step to the cache, so that Spark does not have to recompute it for every subsequent data query. Note that cache() is a lazy operation: Spark does not immediately store the data in memory when we call cache. That only actually happens when an action is invoked on the RDD.
Now we can call the count function to see how many lines of data are in the text file.
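Continuing the shell session above, that call is simply:

```scala
// count() is an action, so it triggers the read of README.md
// and returns the number of lines to the driver.
txtData.count()
```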
Then we can run the following commands to perform the word count. The count shows up next to each word in the text file.
val wcData = txtData.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wcData.collect().foreach(println)
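As a further sketch building on the `wcData` pair RDD above (this step is an illustration, not part of the original session), you can chain more transformations before collecting, for example keeping only the words that appear more than once:

```scala
// filter is a transformation, so nothing runs until collect() is called.
val repeated = wcData.filter { case (word, count) => count > 1 }
repeated.collect().foreach(println)
```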
If you want to see more code examples of using the Spark core API, refer to the Spark documentation on the project website.
What's Next
In upcoming articles in this series, we will learn more about other parts of the Spark ecosystem, starting with Spark SQL. After that, we will move on to Spark Streaming, Spark MLlib, and Spark GraphX. We will also get a chance to look at frameworks like Tachyon and BlinkDB.
Conclusions
ÔÚ±¾ÎÄÖУ¬ÎÒÃÇÁ˽âÁËApache Spark¿ò¼ÜÈçºÎͨ¹ýÆä±ê×¼API°ïÖúÍê³É´óÊý¾Ý´¦ÀíºÍ·ÖÎö¹¤×÷¡£ÎÒÃÇ»¹¶ÔSparkºÍ´«Í³µÄMapReduceʵÏÖ£¨ÈçApache
Hadoop£©½øÐÐÁ˱Ƚϡ£SparkÓëHadoop»ùÓÚÏàͬµÄHDFSÎļþ´æ´¢ÏµÍ³£¬Òò´ËÈç¹ûÄãÒѾÔÚHadoopÉϽøÐÐÁË´óÁ¿Í¶×ʺͻù´¡ÉèÊ©½¨É裬¿É
ÒÔÒ»ÆðʹÓÃSparkºÍMapReduce¡£
´ËÍ⣬Ҳ¿ÉÒÔ½«Spark´¦ÀíÓëSpark SQL¡¢»úÆ÷ѧϰÒÔ¼°Spark Streaming½áºÏÔÚÒ»Æð¡£¹ØÓÚÕâ·½ÃæµÄÄÚÈÝÎÒÃǽ«ÔÚºóÐøµÄÎÄÕÂÖнéÉÜ¡£
ÀûÓÃSparkµÄһЩ¼¯³É¹¦ÄܺÍÊÊÅäÆ÷£¬ÎÒÃÇ¿ÉÒÔ½«ÆäËû¼¼ÊõÓëSpark½áºÏÔÚÒ»Æð¡£ÆäÖÐÒ»¸ö°¸Àý¾ÍÊǽ«Spark¡¢KafkaºÍApache
Cassandra½áºÏÔÚÒ»Æð£¬ÆäÖÐKafka¸ºÔðÊäÈëµÄÁ÷ʽÊý¾Ý£¬SparkÍê³É¼ÆË㣬×îºóCassandra
NoSQLÊý¾Ý¿âÓÃÓÚ±£´æ¼ÆËã½á¹ûÊý¾Ý¡£
²»¹ýÐèÒªÀμǵÄÊÇ£¬SparkÉú̬ϵͳÈÔ²»³ÉÊ죬ÔÚ°²È«ºÍÓëBI¹¤¾ß¼¯³ÉµÈÁìÓòÈÔÈ»ÐèÒª½øÒ»²½µÄ¸Ä½ø¡£