Äú¿ÉÒÔ¾èÖú£¬Ö§³ÖÎÒÃǵĹ«ÒæÊÂÒµ¡£

1Ôª 10Ôª 50Ôª





ÈÏÖ¤Â룺  ÑéÖ¤Âë,¿´²»Çå³þ?Çëµã»÷Ë¢ÐÂÑéÖ¤Âë ±ØÌî



  ÇóÖª ÎÄÕ ÎÄ¿â Lib ÊÓÆµ iPerson ¿Î³Ì ÈÏÖ¤ ×Éѯ ¹¤¾ß ½²×ù Modeler   Code  
»áÔ±   
 
   
 
 
     
   
 ¶©ÔÄ
  ¾èÖú
ÓÃApache Spark½øÐдóÊý¾Ý´¦Àí
 
À´Ô´£ºOpen¾­Ñé¿â ·¢²¼ÓÚ£º2015-04-27
  2763  次浏览      29
 

ʲôÊÇSpark

Apache SparkÊÇÒ»¸öÎ§ÈÆËÙ¶È¡¢Ò×ÓÃÐԺ͸´ÔÓ·ÖÎö¹¹½¨µÄ´óÊý¾Ý´¦Àí¿ò¼Ü¡£×î³õÔÚ2009ÄêÓɼÓÖÝ´óѧ²®¿ËÀû·ÖУµÄAMPLab¿ª·¢£¬²¢ÓÚ2010Äê³ÉΪApacheµÄ¿ªÔ´ÏîĿ֮һ¡£

ÓëHadoopºÍStormµÈÆäËû´óÊý¾ÝºÍMapReduce¼¼ÊõÏà±È£¬SparkÓÐÈçÏÂÓÅÊÆ¡£

Ê×ÏÈ£¬SparkΪÎÒÃÇÌṩÁËÒ»¸öÈ«Ãæ¡¢Í³Ò»µÄ¿ò¼ÜÓÃÓÚ¹ÜÀí¸÷ÖÖÓÐ×Ų»Í¬ÐÔÖÊ£¨Îı¾Êý¾Ý¡¢Í¼±íÊý¾ÝµÈ£©µÄÊý¾Ý¼¯ºÍÊý¾ÝÔ´£¨ÅúÁ¿Êý¾Ý»òʵʱµÄÁ÷Êý¾Ý£©µÄ´óÊý¾Ý´¦ÀíµÄÐèÇó¡£

Spark¿ÉÒÔ½«Hadoop¼¯ÈºÖеÄÓ¦ÓÃÔÚÄÚ´æÖеÄÔËÐÐËÙ¶ÈÌáÉý100±¶£¬ÉõÖÁÄܹ»½«Ó¦ÓÃÔÚ´ÅÅÌÉϵÄÔËÐÐËÙ¶ÈÌáÉý10±¶¡£

SparkÈÿª·¢Õß¿ÉÒÔ¿ìËÙµÄÓÃJava¡¢Scala»òPython±àд³ÌÐò¡£Ëü±¾Éí×Ô´øÁËÒ»¸ö³¬¹ý80¸ö¸ß½×²Ù×÷·û¼¯ºÏ¡£¶øÇÒ»¹¿ÉÒÔÓÃËüÔÚshellÖÐÒÔ½»»¥Ê½µØ²éѯÊý¾Ý¡£

³ýÁËMapºÍReduce²Ù×÷Ö®Í⣬Ëü»¹Ö§³ÖSQL²éѯ£¬Á÷Êý¾Ý£¬»úÆ÷ѧϰºÍͼ±íÊý¾Ý´¦Àí¡£¿ª·¢Õß¿ÉÒÔÔÚÒ»¸öÊý¾Ý¹ÜµÀÓÃÀýÖе¥¶ÀʹÓÃijһÄÜÁ¦»òÕß½«ÕâЩÄÜÁ¦½áºÏÔÚÒ»ÆðʹÓá£

ÔÚÕâ¸öApache SparkÎÄÕÂϵÁеĵÚÒ»²¿·ÖÖУ¬ÎÒÃǽ«Á˽⵽ʲôÊÇSpark£¬ËüÓëµäÐ͵ÄMapReduce½â¾ö·½°¸µÄ±È½ÏÒÔ¼°ËüÈçºÎΪ´óÊý¾Ý´¦ÀíÌṩÁËÒ»Ì×ÍêÕûµÄ¹¤¾ß¡£

HadoopºÍSpark

HadoopÕâÏî´óÊý¾Ý´¦Àí¼¼Êõ´ó¸ÅÒÑÓÐÊ®ÄêÀúÊ·£¬¶øÇÒ±»¿´×öÊÇÊ×Ñ¡µÄ´óÊý¾Ý¼¯ºÏ´¦ÀíµÄ½â¾ö·½°¸¡£MapReduceÊÇһ·¼ÆËãµÄÓÅÐã½â¾ö·½°¸£¬²» ¹ý¶ÔÓÚÐèÒª¶à·¼ÆËãºÍËã·¨µÄÓÃÀýÀ´Ëµ£¬²¢·ÇÊ®·Ö¸ßЧ¡£Êý¾Ý´¦ÀíÁ÷³ÌÖеÄÿһ²½¶¼ÐèÒªÒ»¸öMap½×¶ÎºÍÒ»¸öReduce½×¶Î£¬¶øÇÒÈç¹ûÒªÀûÓÃÕâÒ»½â¾ö·½°¸£¬ ÐèÒª½«ËùÓÐÓÃÀý¶¼×ª»»³ÉMapReduceģʽ¡£

ÔÚÏÂÒ»²½¿ªÊ¼Ö®Ç°£¬ÉÏÒ»²½µÄ×÷ÒµÊä³öÊý¾Ý±ØÐëÒª´æ´¢µ½·Ö²¼Ê½ÎļþϵͳÖС£Òò´Ë£¬¸´ÖƺʹÅÅÌ´æ´¢»áµ¼ÖÂÕâÖÖ·½Ê½ËٶȱäÂý¡£ÁíÍâHadoop½â¾ö·½°¸ÖРͨ³£»á°üº¬ÄÑÒÔ°²×°ºÍ¹ÜÀíµÄ¼¯Èº¡£¶øÇÒΪÁË´¦Àí²»Í¬µÄ´óÊý¾ÝÓÃÀý£¬»¹ÐèÒª¼¯³É¶àÖÖ²»Í¬µÄ¹¤¾ß£¨ÈçÓÃÓÚ»úÆ÷ѧϰµÄMahoutºÍÁ÷Êý¾Ý´¦ÀíµÄStorm£©¡£

Èç¹ûÏëÒªÍê³É±È½Ï¸´ÔӵŤ×÷£¬¾Í±ØÐ뽫һϵÁеÄMapReduce×÷Òµ´®ÁªÆðÀ´È»ºó˳ÐòÖ´ÐÐÕâЩ×÷Òµ¡£Ã¿Ò»¸ö×÷Òµ¶¼ÊǸßʱÑӵ쬶øÇÒÖ»ÓÐÔÚǰһ¸ö×÷ÒµÍê³ÉÖ®ºóÏÂÒ»¸ö×÷Òµ²ÅÄÜ¿ªÊ¼Æô¶¯¡£

¶øSparkÔòÔÊÐí³ÌÐò¿ª·¢ÕßʹÓÃÓÐÏòÎÞ»·Í¼£¨DAG£©¿ª·¢¸´ÔӵĶಽÊý¾Ý¹ÜµÀ¡£¶øÇÒ»¹Ö§³Ö¿çÓÐÏòÎÞ»·Í¼µÄÄÚ´æÊý¾Ý¹²Ïí£¬ÒԱ㲻ͬµÄ×÷Òµ¿ÉÒÔ¹²Í¬´¦Àíͬһ¸öÊý¾Ý¡£

SparkÔËÐÐÔÚÏÖÓеÄHadoop·Ö²¼Ê½Îļþϵͳ»ù´¡Ö®ÉÏ£¨HDFS£©Ìṩ¶îÍâµÄÔöÇ¿¹¦ÄÜ¡£ËüÖ§³Ö½«SparkÓ¦Óò¿Êðµ½ÏÖ´æµÄHadoop v1¼¯Èº£¨with SIMR ¨C Spark-Inside-MapReduce£©»òHadoop v2 YARN¼¯ÈºÉõÖÁÊÇApache MesosÖ®ÖС£

ÎÒÃÇÓ¦¸Ã½«Spark¿´×÷ÊÇHadoop MapReduceµÄÒ»¸öÌæ´úÆ·¶ø²»ÊÇHadoopµÄÌæ´úÆ·¡£ÆäÒâͼ²¢·ÇÊÇÌæ´úHadoop£¬¶øÊÇΪÁËÌṩһ¸ö¹ÜÀí²»Í¬µÄ´óÊý¾ÝÓÃÀýºÍÐèÇóµÄÈ«ÃæÇÒͳһµÄ½â¾ö·½°¸¡£

SparkÌØÐÔ

Sparkͨ¹ýÔÚÊý¾Ý´¦Àí¹ý³ÌÖгɱ¾¸üµÍµÄÏ´ÅÆ£¨Shuffle£©·½Ê½£¬½«MapReduceÌáÉýµ½Ò»¸ö¸ü¸ßµÄ²ã´Î¡£ÀûÓÃÄÚ´æÊý¾Ý´æ´¢ºÍ½Ó½üʵʱµÄ´¦ÀíÄÜÁ¦£¬Spark±ÈÆäËûµÄ´óÊý¾Ý´¦Àí¼¼ÊõµÄÐÔÄÜÒª¿ìºÜ¶à±¶¡£

Spark»¹Ö§³Ö´óÊý¾Ý²éѯµÄÑÓ³Ù¼ÆË㣬Õâ¿ÉÒÔ°ïÖúÓÅ»¯´óÊý¾Ý´¦ÀíÁ÷³ÌÖеĴ¦Àí²½Öè¡£Spark»¹Ìṩ¸ß¼¶µÄAPIÒÔÌáÉý¿ª·¢ÕßµÄÉú²úÁ¦£¬³ý´ËÖ®Í⻹Ϊ´óÊý¾Ý½â¾ö·½°¸ÌṩһÖµÄÌåϵ¼Ü¹¹Ä£ÐÍ¡£

Spark½«Öмä½á¹û±£´æÔÚÄÚ´æÖжø²»Êǽ«ÆäдÈë´ÅÅÌ£¬µ±ÐèÒª¶à´Î´¦ÀíͬһÊý¾Ý¼¯Ê±£¬ÕâÒ»µãÌØ±ðʵÓá£SparkµÄÉè¼Æ³õÖÔ¾ÍÊǼȿÉÒÔÔÚÄÚ´æÖÐÓÖ¿É ÒÔÔÚ´ÅÅÌÉϹ¤×÷µÄÖ´ÐÐÒýÇæ¡£µ±ÄÚ´æÖеÄÊý¾Ý²»ÊÊÓÃʱ£¬Spark²Ù×÷·û¾Í»áÖ´ÐÐÍⲿ²Ù×÷¡£Spark¿ÉÒÔÓÃÓÚ´¦Àí´óÓÚ¼¯ÈºÄÚ´æÈÝÁ¿×ܺ͵ÄÊý¾Ý¼¯¡£

Spark»á³¢ÊÔÔÚÄÚ´æÖд洢¾¡¿ÉÄܶàµÄÊý¾ÝÈ»ºó½«ÆäдÈë´ÅÅÌ¡£Ëü¿ÉÒÔ½«Ä³¸öÊý¾Ý¼¯µÄÒ»²¿·Ö´æÈëÄÚ´æ¶øÊ£Óಿ·Ö´æÈë´ÅÅÌ¡£¿ª·¢ÕßÐèÒª¸ù¾ÝÊý¾ÝºÍÓÃÀýÆÀ¹À¶ÔÄÚ´æµÄÐèÇó¡£SparkµÄÐÔÄÜÓÅÊÆµÃÒæÓÚÕâÖÖÄÚ´æÖеÄÊý¾Ý´æ´¢¡£

SparkµÄÆäËûÌØÐÔ°üÀ¨£º

1.Ö§³Ö±ÈMapºÍReduce¸ü¶àµÄº¯Êý¡£

2.ÓÅ»¯ÈÎÒâ²Ù×÷Ëã×Óͼ£¨operator graphs£©¡£

3.¿ÉÒÔ°ïÖúÓÅ»¯ÕûÌåÊý¾Ý´¦ÀíÁ÷³ÌµÄ´óÊý¾Ý²éѯµÄÑÓ³Ù¼ÆËã¡£

4.Ìṩ¼òÃ÷¡¢Ò»ÖµÄScala£¬JavaºÍPython API¡£

5.Ìṩ½»»¥Ê½ScalaºÍPython Shell¡£Ä¿Ç°Ôݲ»Ö§³ÖJava¡£

SparkÊÇÓÃScala³ÌÐòÉè¼ÆÓïÑÔ±àд¶ø³É£¬ÔËÐÐÓÚJavaÐéÄâ»ú£¨JVM£©»·¾³Ö®ÉÏ¡£Ä¿Ç°Ö§³ÖÈçϳÌÐòÉè¼ÆÓïÑÔ±àдSparkÓ¦Óãº

1.Scala

2.Java

3.Python

4.Clojure

5.R

SparkÉú̬ϵͳ

³ýÁËSparkºËÐÄAPIÖ®Í⣬SparkÉú̬ϵͳÖл¹°üÀ¨ÆäËû¸½¼Ó¿â£¬¿ÉÒÔÔÚ´óÊý¾Ý·ÖÎöºÍ»úÆ÷ѧϰÁìÓòÌṩ¸ü¶àµÄÄÜÁ¦¡£

ÕâЩ¿â°üÀ¨£º

1.Spark Streaming:

Spark Streaming»ùÓÚ΢ÅúÁ¿·½Ê½µÄ¼ÆËãºÍ´¦Àí£¬¿ÉÒÔÓÃÓÚ´¦ÀíʵʱµÄÁ÷Êý¾Ý¡£ËüʹÓÃDStream£¬¼òµ¥À´Ëµ¾ÍÊÇÒ»¸öµ¯ÐÔ·Ö²¼Ê½Êý¾Ý¼¯£¨RDD£©ÏµÁУ¬´¦ÀíʵʱÊý¾Ý¡£

2.Spark SQL:

Spark SQL¿ÉÒÔͨ¹ýJDBC API½«SparkÊý¾Ý¼¯±©Â¶³öÈ¥£¬¶øÇÒ»¹¿ÉÒÔÓô«Í³µÄBIºÍ¿ÉÊÓ»¯¹¤¾ßÔÚSparkÊý¾ÝÉÏÖ´ÐÐÀàËÆSQLµÄ²éѯ¡£Óû§»¹¿ÉÒÔÓÃSpark SQL¶Ô²»Í¬¸ñʽµÄÊý¾Ý£¨ÈçJSON£¬ParquetÒÔ¼°Êý¾Ý¿âµÈ£©Ö´ÐÐETL£¬½«Æäת»¯£¬È»ºó±©Â¶¸øÌض¨µÄ²éѯ¡£

3.Spark MLlib:

MLlibÊÇÒ»¸ö¿ÉÀ©Õ¹µÄSpark»úÆ÷ѧϰ¿â£¬ÓÉͨÓõÄѧϰËã·¨ºÍ¹¤¾ß×é³É£¬°üÀ¨¶þÔª·ÖÀà¡¢ÏßÐԻع顢¾ÛÀࡢЭͬ¹ýÂË¡¢ÌݶÈϽµÒÔ¼°µ×²ãÓÅ»¯Ô­Óï¡£

4.Spark GraphX:

GraphXÊÇÓÃÓÚͼ¼ÆËãºÍ²¢ÐÐͼ¼ÆËãµÄ еģ¨alpha£©Spark API¡£Í¨¹ýÒýÈ뵯ÐÔ·Ö²¼Ê½ÊôÐÔͼ£¨Resilient Distributed Property Graph£©£¬Ò»ÖÖ¶¥µãºÍ±ß¶¼´øÓÐÊôÐÔµÄÓÐÏò¶àÖØÍ¼£¬À©Õ¹ÁËSpark RDD¡£ÎªÁËÖ§³Öͼ¼ÆË㣬GraphX±©Â¶ÁËÒ»¸ö»ù´¡²Ù×÷·û¼¯ºÏ£¨Èçsubgraph£¬joinVerticesºÍaggregateMessages£© ºÍÒ»¸ö¾­¹ýÓÅ»¯µÄPregel API±äÌå¡£´ËÍ⣬GraphX»¹°üÀ¨Ò»¸ö³ÖÐøÔö³¤µÄÓÃÓÚ¼ò»¯Í¼·ÖÎöÈÎÎñµÄͼËã·¨ºÍ¹¹½¨Æ÷¼¯ºÏ¡£
³ýÁËÕâЩ¿âÒÔÍ⣬»¹ÓÐһЩÆäËûµÄ¿â£¬ÈçBlinkDBºÍTachyon¡£

BlinkDBÊÇÒ»¸ö½üËÆ²éѯÒýÇæ£¬ÓÃÓÚÔÚº£Á¿Êý¾ÝÉÏÖ´Ðн»»¥Ê½SQL²éѯ¡£BlinkDB¿ÉÒÔͨ¹ýÎþÉüÊý¾Ý¾«¶ÈÀ´ÌáÉý²éѯÏìӦʱ¼ä¡£Í¨¹ýÔÚÊý¾ÝÑù±¾ÉÏÖ´Ðвéѯ²¢Õ¹Ê¾°üº¬ÓÐÒâÒåµÄ´íÎóÏß×¢½âµÄ½á¹û£¬²Ù×÷´óÊý¾Ý¼¯ºÏ¡£

TachyonÊÇÒ»¸öÒÔÄÚ´æÎªÖÐÐÄµÄ ·Ö²¼Ê½Îļþϵͳ£¬Äܹ»ÌṩÄÚ´æ¼¶±ðËٶȵĿ缯Ⱥ¿ò¼Ü£¨ÈçSparkºÍMapReduce£©µÄ¿ÉÐÅÎļþ¹²Ïí¡£Ëü½«¹¤×÷¼¯Îļþ»º´æÔÚÄÚ´æÖУ¬´Ó¶ø±ÜÃâµ½´ÅÅÌÖÐ ¼ÓÔØÐèÒª¾­³£¶ÁÈ¡µÄÊý¾Ý¼¯¡£Í¨¹ýÕâÒ»»úÖÆ£¬²»Í¬µÄ×÷Òµ/²éѯºÍ¿ò¼Ü¿ÉÒÔÒÔÄÚ´æ¼¶µÄËÙ¶È·ÃÎÊ»º´æµÄÎļþ¡£
´ËÍ⣬»¹ÓÐһЩÓÃÓÚÓëÆäËû²úÆ·¼¯³ÉµÄÊÊÅäÆ÷£¬ÈçCassandra£¨Spark Cassandra Á¬½ÓÆ÷£©ºÍR£¨SparkR£©¡£Cassandra Connector¿ÉÓÃÓÚ·ÃÎÊ´æ´¢ÔÚCassandraÊý¾Ý¿âÖеÄÊý¾Ý²¢ÔÚÕâЩÊý¾ÝÉÏÖ´ÐÐÊý¾Ý·ÖÎö¡£

ÏÂͼչʾÁËÔÚSparkÉú̬ϵͳÖУ¬ÕâЩ²»Í¬µÄ¿âÖ®¼äµÄÏ໥¹ØÁª¡£

ͼ1. Spark¿ò¼ÜÖеĿâ

ÎÒÃǽ«ÔÚÕâһϵÁÐÎÄÕÂÖÐÖð²½Ì½Ë÷ÕâЩSpark¿â

SparkÌåϵ¼Ü¹¹

SparkÌåϵ¼Ü¹¹°üÀ¨ÈçÏÂÈý¸öÖ÷Òª×é¼þ£º

1.Êý¾Ý´æ´¢

2.API

3.¹ÜÀí¿ò¼Ü

½ÓÏÂÀ´ÈÃÎÒÃÇÏêϸÁ˽âÒ»ÏÂÕâЩ×é¼þ¡£

Êý¾Ý´æ´¢£º

SparkÓÃHDFSÎļþϵͳ´æ´¢Êý¾Ý¡£Ëü¿ÉÓÃÓÚ´æ´¢ÈκμæÈÝÓÚHadoopµÄÊý¾ÝÔ´£¬°üÀ¨HDFS£¬HBase£¬CassandraµÈ¡£

API£º

ÀûÓÃAPI£¬Ó¦Óÿª·¢Õß¿ÉÒÔÓñê×¼µÄAPI½Ó¿Ú´´½¨»ùÓÚSparkµÄÓ¦Óá£SparkÌṩScala£¬JavaºÍPythonÈýÖÖ³ÌÐòÉè¼ÆÓïÑÔµÄAPI¡£

×ÊÔ´¹ÜÀí£º

Spark¼È¿ÉÒÔ²¿ÊðÔÚÒ»¸öµ¥¶ÀµÄ·þÎñÆ÷Ò²¿ÉÒÔ²¿ÊðÔÚÏñMesos»òYARNÕâÑùµÄ·Ö²¼Ê½¼ÆËã¿ò¼ÜÖ®ÉÏ¡£

ÏÂͼ2չʾÁËSparkÌåϵ¼Ü¹¹Ä£ÐÍÖеĸ÷¸ö×é¼þ¡£

ͼ2 SparkÌåϵ¼Ü¹¹

µ¯ÐÔ·Ö²¼Ê½Êý¾Ý¼¯

µ¯ÐÔ·Ö²¼Ê½Êý¾Ý¼¯£¨»ùÓÚMateiµÄÑо¿ÂÛÎÄ£©»òRDDÊÇSpark¿ò¼ÜÖеĺËÐĸÅÄî¡£¿ÉÒÔ½«RDDÊÓ×÷Êý¾Ý¿âÖеÄÒ»ÕÅ±í¡£ÆäÖпÉÒÔ±£´æÈκÎÀàÐ͵ÄÊý¾Ý¡£Spark½«Êý¾Ý´æ´¢ÔÚ²»Í¬·ÖÇøÉϵÄRDDÖ®ÖС£

RDD¿ÉÒÔ°ïÖúÖØÐ°²ÅżÆËã²¢ÓÅ»¯Êý¾Ý´¦Àí¹ý³Ì¡£

´ËÍ⣬Ëü»¹¾ßÓÐÈÝ´íÐÔ£¬ÒòΪRDDÖªµÀÈçºÎÖØÐ´´½¨ºÍÖØÐ¼ÆËãÊý¾Ý¼¯¡£

RDDÊDz»¿É±äµÄ¡£Äã¿ÉÒÔÓñ任£¨Transformation£©ÐÞ¸ÄRDD£¬µ«ÊÇÕâ¸ö±ä»»Ëù·µ»ØµÄÊÇÒ»¸öȫеÄRDD£¬¶øÔ­ÓеÄRDDÈÔÈ»±£³Ö²»±ä¡£

RDDÖ§³ÖÁ½ÖÖÀàÐ͵IJÙ×÷£º

1.±ä»»£¨Transformation£©

2.Ðж¯£¨Action£©

±ä»»£º±ä»»µÄ·µ»ØÖµÊÇÒ»¸öеÄRDD¼¯ºÏ£¬¶ø²»Êǵ¥¸öÖµ¡£µ÷ÓÃÒ»¸ö±ä»»·½·¨£¬²»»áÓÐÈκÎÇóÖµ¼ÆË㣬ËüÖ»»ñȡһ¸öRDD×÷Ϊ²ÎÊý£¬È»ºó·µ»ØÒ»¸öеÄRDD¡£

±ä»»º¯Êý°üÀ¨£ºmap£¬filter£¬flatMap£¬groupByKey£¬reduceByKey£¬aggregateByKey£¬pipeºÍcoalesce¡£

Ðж¯£ºÐж¯²Ù×÷¼ÆËã²¢·µ»ØÒ»¸öеÄÖµ¡£µ±ÔÚÒ»¸öRDD¶ÔÏóÉϵ÷ÓÃÐж¯º¯Êýʱ£¬»áÔÚÕâһʱ¿Ì¼ÆËãÈ«²¿µÄÊý¾Ý´¦Àí²éѯ²¢·µ»Ø½á¹ûÖµ¡£

Ðж¯²Ù×÷°üÀ¨£ºreduce£¬collect£¬count£¬first£¬take£¬countByKeyÒÔ¼°foreach¡£

ÈçºÎ°²×°Spark

°²×°ºÍʹÓÃSparkÓм¸ÖÖ²»Í¬·½Ê½¡£Äã¿ÉÒÔÔÚ×Ô¼ºµÄµçÄÔÉϽ«Spark×÷Ϊһ¸ö¶ÀÁ¢µÄ¿ò¼Ü°²×°»òÕß´ÓÖîÈçCloudera£¬HortonWorks»òMapRÖ®ÀàµÄ¹©Ó¦ÉÌ´¦»ñȡһ¸öSparkÐéÄâ»ú¾µÏñÖ±½ÓʹÓ᣻òÕßÄãÒ²¿ÉÒÔʹÓÃÔÚÔÆ¶Ë»·¾³£¨ÈçDatabricks Cloud£©°²×°²¢ÅäÖúõÄSpark¡£

ÔÚ±¾ÎÄÖУ¬ÎÒÃǽ«°ÑSpark×÷Ϊһ¸ö¶ÀÁ¢µÄ¿ò¼Ü°²×°²¢ÔÚ±¾µØÆô¶¯Ëü¡£×î½üSpark¸Õ¸Õ·¢²¼ÁË1.2.0°æ±¾¡£ÎÒÃǽ«ÓÃÕâÒ»°æ±¾Íê³ÉʾÀýÓ¦ÓõĴúÂëչʾ¡£

ÈçºÎÔËÐÐSpark

µ±ÄãÔÚ±¾µØ»úÆ÷°²×°ÁËSpark»òʹÓÃÁË»ùÓÚÔÆ¶ËµÄSparkºó£¬Óм¸ÖÖ²»Í¬µÄ·½Ê½¿ÉÒÔÁ¬½Óµ½SparkÒýÇæ¡£

ϱíչʾÁ˲»Í¬µÄSparkÔËÐÐģʽËùÐèµÄMaster URL²ÎÊý¡£

 

ÈçºÎÓëSpark½»»¥

SparkÆô¶¯²¢ÔËÐк󣬿ÉÒÔÓÃSpark shellÁ¬½Óµ½SparkÒýÇæ½øÐн»»¥Ê½Êý¾Ý·ÖÎö¡£Spark shellÖ§³ÖScalaºÍPythonÁ½ÖÖÓïÑÔ¡£Java²»Ö§³Ö½»»¥Ê½µÄShell£¬Òò´ËÕâÒ»¹¦ÄÜÔÝδÔÚJavaÓïÑÔÖÐʵÏÖ¡£

¿ÉÒÔÓÃspark-shell.cmdºÍpyspark.cmdÃüÁî·Ö±ðÔËÐÐScala°æ±¾ºÍPython°æ±¾µÄSpark Shell¡£

SparkÍøÒ³¿ØÖÆÌ¨

²»ÂÛSparkÔËÐÐÔÚÄÄÒ»ÖÖģʽÏ£¬¶¼¿ÉÒÔͨ¹ý·ÃÎÊSparkÍøÒ³¿ØÖÆÌ¨²é¿´SparkµÄ×÷Òµ½á¹ûºÍÆäËûµÄͳ¼ÆÊý¾Ý£¬¿ØÖÆÌ¨µÄURLµØÖ·ÈçÏ£º

http://localhost:4040

Spark¿ØÖÆÌ¨ÈçÏÂͼ3Ëùʾ£¬°üÀ¨Stages£¬Storage£¬EnvironmentºÍExecutorsËĸö±êǩҳ

ͼ3. SparkÍøÒ³¿ØÖÆÌ¨

¹²Ïí±äÁ¿

SparkÌṩÁ½ÖÖÀàÐ͵Ĺ²Ïí±äÁ¿¿ÉÒÔÌáÉý¼¯Èº»·¾³ÖеÄSpark³ÌÐòÔËÐÐЧÂÊ¡£·Ö±ðÊǹ㲥±äÁ¿ºÍÀÛ¼ÓÆ÷¡£

¹ã²¥±äÁ¿£º¹ã²¥±äÁ¿¿ÉÒÔÔÚÿ̨»úÆ÷ÉÏ»º´æÖ»¶Á±äÁ¿¶ø²»ÐèҪΪ¸÷¸öÈÎÎñ·¢Ë͸ñäÁ¿µÄ¿½±´¡£ËûÃÇ¿ÉÒÔÈôóµÄÊäÈëÊý¾Ý¼¯µÄ¼¯Èº¿½±´ÖеĽڵã¸ü¼Ó¸ßЧ¡£

ÏÂÃæµÄ´úÂëÆ¬¶ÎչʾÁËÈçºÎʹÓù㲥±äÁ¿¡£

//
// Broadcast Variables
//
val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar.value

ÀÛ¼ÓÆ÷£ºÖ»ÓÐÔÚʹÓÃÏà¹Ø²Ù×÷ʱ²Å»áÌí¼ÓÀÛ¼ÓÆ÷£¬Òò´ËËü¿ÉÒԺܺõØÖ§³Ö²¢ÐС£ÀÛ¼ÓÆ÷¿ÉÓÃÓÚʵÏÖ¼ÆÊý£¨¾ÍÏñÔÚMapReduceÖÐÄÇÑù£©»òÇóºÍ¡£¿ÉÒÔÓÃadd·½·¨½«ÔËÐÐÔÚ¼¯ÈºÉϵÄÈÎÎñÌí¼Óµ½Ò»¸öÀÛ¼ÓÆ÷±äÁ¿ÖС£²»¹ýÕâЩÈÎÎñÎÞ·¨¶ÁÈ¡±äÁ¿µÄÖµ¡£Ö»ÓÐÇý¶¯³ÌÐò²ÅÄܹ»¶ÁÈ¡ÀÛ¼ÓÆ÷µÄÖµ¡£

ÏÂÃæµÄ´úÂëÆ¬¶ÎչʾÁËÈçºÎʹÓÃÀÛ¼ÓÆ÷¹²Ïí±äÁ¿£º

//
// Accumulators
//

val accum = sc.accumulator(0, "My Accumulator")

sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)

accum.value

SparkÓ¦ÓÃʾÀý

±¾ÆªÎÄÕÂÖÐËùÉæ¼°µÄʾÀýÓ¦ÓÃÊÇÒ»¸ö¼òµ¥µÄ×ÖÊýͳ¼ÆÓ¦Óá£ÕâÓëѧϰÓÃHadoop½øÐдóÊý¾Ý´¦ÀíʱµÄʾÀýÓ¦ÓÃÏàͬ¡£ÎÒÃǽ«ÔÚÒ»¸öÎı¾ÎļþÉÏÖ´ÐÐһЩÊý ¾Ý·ÖÎö²éѯ¡£±¾Ê¾ÀýÖеÄÎı¾ÎļþºÍÊý¾Ý¼¯¶¼ºÜС£¬²»¹ýÎÞÐëÐÞ¸ÄÈκδúÂ룬ʾÀýÖÐËùÓõ½µÄSpark²éѯͬÑù¿ÉÒÔÓõ½´óÈÝÁ¿Êý¾Ý¼¯Ö®ÉÏ¡£

ΪÁËÈÃÌÖÂÛ¾¡Á¿¼òµ¥£¬ÎÒÃǽ«Ê¹ÓÃSpark Scala Shell¡£

Ê×ÏÈÈÃÎÒÃÇ¿´Ò»ÏÂÈçºÎÔÚÄã×Ô¼ºµÄµçÄÔÉϰ²×°Spark¡£

ǰÌáÌõ¼þ£º

ΪÁËÈÃSparkÄܹ»ÔÚ±¾»úÕý³£¹¤×÷£¬ÄãÐèÒª°²×°Java¿ª·¢¹¤¾ß°ü£¨JDK£©¡£Õ⽫°üº¬ÔÚÏÂÃæµÄµÚÒ»²½ÖС£

ͬÑù»¹ÐèÒªÔÚµçÄÔÉϰ²×°SparkÈí¼þ¡£ÏÂÃæµÄµÚ¶þ²½½«½éÉÜÈçºÎÍê³ÉÕâÏ×÷¡£

×¢£ºÏÂÃæÕâЩָÁî¶¼ÊÇÒÔWindows»·¾³ÎªÀý¡£Èç¹ûÄãʹÓò»Í¬µÄ²Ù×÷ϵͳ»·¾³£¬ÐèÒªÏàÓ¦µÄÐÞ¸Äϵͳ±äÁ¿ºÍĿ¼·¾¶ÒÑÆ¥ÅäÄãµÄ»·¾³¡£

I. °²×°JDK

1£©´ÓOracleÍøÕ¾ÉÏÏÂÔØJDK¡£ÍƼöʹÓÃJDK 1.7°æ±¾¡£

½«JDK°²×°µ½Ò»¸öûÓпոñµÄĿ¼Ï¡£¶ÔÓÚWindowsÓû§£¬ÐèÒª½«JDK°²×°µ½Ïñc:\devÕâÑùµÄÎļþ¼ÐÏ£¬¶ø²»Äܰ²×°µ½¡°c: \Program Files¡±Îļþ¼ÐÏ¡£¡°c:\Program Files¡±Îļþ¼ÐµÄÃû×ÖÖаüº¬¿Õ¸ñ£¬Èç¹ûÈí¼þ°²×°µ½Õâ¸öÎļþ¼ÐÏ»ᵼÖÂһЩÎÊÌâ¡£

×¢£º²»ÒªÔÚ¡°c:\Program Files¡±Îļþ¼ÐÖа²×°JDK»ò£¨µÚ¶þ²½ÖÐËùÃèÊöµÄ£©SparkÈí¼þ¡£

2£©Íê³ÉJDK°²×°ºó£¬Çл»ÖÁJDK 1.7Ŀ¼Ïµġ±bin¡°Îļþ¼Ð£¬È»ºó¼üÈëÈçÏÂÃüÁÑéÖ¤JDKÊÇ·ñÕýÈ·°²×°£º

java -version

Èç¹ûJDK°²×°ÕýÈ·£¬ÉÏÊöÃüÁÏÔʾJava°æ±¾¡£

II. °²×°SparkÈí¼þ£º

´ÓSparkÍøÕ¾ÉÏÏÂÔØ×îа汾 µÄSpark¡£ÔÚ±¾ÎÄ·¢±íʱ£¬×îеÄSpark°æ±¾ÊÇ1.2¡£Äã¿ÉÒÔ¸ù¾ÝHadoopµÄ°æ±¾Ñ¡ÔñÒ»¸öÌØ¶¨µÄSpark°æ±¾°²×°¡£ÎÒÏÂÔØÁËÓëHadoop 2.4»ò¸ü¸ß°æ±¾Æ¥ÅäµÄSpark£¬ÎļþÃûÊÇspark-1.2.0-bin-hadoop2.4.tgz¡£

½«°²×°Îļþ½âѹµ½±¾µØÎļþ¼ÐÖУ¨È磺c:\dev£©¡£

ΪÁËÑéÖ¤Spark°²×°µÄÕýÈ·ÐÔ£¬Çл»ÖÁSparkÎļþ¼ÐÈ»ºóÓÃÈçÏÂÃüÁîÆô¶¯Spark Shell¡£ÕâÊÇWindows»·¾³ÏµÄÃüÁî¡£Èç¹ûʹÓÃLinux»òMac OS£¬ÇëÏàÓ¦µØ±à¼­ÃüÁîÒÔ±ãÄܹ»ÔÚÏàÓ¦µÄƽ̨ÉÏÕýÈ·ÔËÐС£

c:
cd c:\dev\spark-1.2.0-bin-hadoop2.4
bin\spark-shell

Èç¹ûSpark°²×°ÕýÈ·£¬¾ÍÄܹ»ÔÚ¿ØÖÆÌ¨µÄÊä³öÖп´µ½ÈçÏÂÐÅÏ¢¡£

¡­.
15/01/17 23:17:46 INFO HttpServer: Starting HTTP Server
15/01/17 23:17:46 INFO Utils: Successfully started service 'HTTP class server' on port 58132.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.2.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Type :help for more information.
¡­.
15/01/17 23:17:53 INFO BlockManagerMaster: Registered BlockManager
15/01/17 23:17:53 INFO SparkILoop: Created spark context..
Spark context available as sc.

¿ÉÒÔ¼üÈëÈçÏÂÃüÁî¼ì²éSpark ShellÊÇ·ñ¹¤×÷Õý³£¡£

sc.version

£¨»ò£©

sc.appName

Íê³ÉÉÏÊö²½ÖèÖ®ºó£¬¿ÉÒÔ¼üÈëÈçÏÂÃüÁîÍ˳öSpark Shell´°¿Ú£º

:quit

Èç¹ûÏëÆô¶¯Spark Python Shell£¬ÐèÒªÏÈÔÚµçÄÔÉϰ²×°Python¡£Äã¿ÉÒÔÏÂÔØ²¢°²×°Anaconda£¬ÕâÊÇÒ»¸öÃâ·ÑµÄPython·¢Ðа汾£¬ÆäÖаüÀ¨ÁËһЩ±È½ÏÁ÷ÐеĿÆÑ§¡¢Êýѧ¡¢¹¤³ÌºÍÊý¾Ý·ÖÎö·½ÃæµÄPython°ü¡£

È»ºó¿ÉÒÔÔËÐÐÈçÏÂÃüÁîÆô¶¯Spark Python Shell£º

c:
cd c:\dev\spark-1.2.0-bin-hadoop2.4
bin\pyspark

SparkʾÀýÓ¦ÓÃ

Íê³ÉSpark°²×°²¢Æô¶¯ºó£¬¾Í¿ÉÒÔÓÃSpark APIÖ´ÐÐÊý¾Ý·ÖÎö²éѯÁË¡£

ÕâЩ´ÓÎı¾ÎļþÖжÁÈ¡²¢´¦ÀíÊý¾ÝµÄÃüÁî¶¼ºÜ¼òµ¥¡£ÎÒÃǽ«ÔÚÕâһϵÁÐÎÄÕµĺóÐøÎÄÕÂÖÐÏò´ó¼Ò½éÉܸü¸ß¼¶µÄSpark¿ò¼ÜʹÓõÄÓÃÀý¡£

Ê×ÏÈÈÃÎÒÃÇÓÃSpark APIÔËÐÐÁ÷ÐеÄWord CountʾÀý¡£Èç¹û»¹Ã»ÓÐÔËÐÐSpark Scala Shell£¬Ê×ÏÈ´ò¿ªÒ»¸öScala Shell´°¿Ú¡£Õâ¸öʾÀýµÄÏà¹ØÃüÁîÈçÏÂËùʾ£º

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val txtFile = "README.md"
val txtData = sc.textFile(txtFile)
txtData.cache()

ÎÒÃÇ¿ÉÒÔµ÷ÓÃcacheº¯Êý½«ÉÏÒ»²½Éú³ÉµÄRDD¶ÔÏó±£´æµ½»º´æÖУ¬ÔÚ´ËÖ®ºóSpark¾Í²»ÐèÒªÔÚÿ´ÎÊý¾Ý²éѯʱ¶¼ÖØÐ¼ÆËã¡£ÐèҪעÒâµÄ ÊÇ£¬cache()ÊÇÒ»¸öÑÓ³Ù²Ù×÷¡£ÔÚÎÒÃǵ÷ÓÃcacheʱ£¬Spark²¢²»»áÂíÉϽ«Êý¾Ý´æ´¢µ½ÄÚ´æÖС£Ö»Óе±ÔÚij¸öRDDÉϵ÷ÓÃÒ»¸öÐж¯Ê±£¬²Å»áÕæÕýÖ´ ÐÐÕâ¸ö²Ù×÷¡£

ÏÖÔÚ£¬ÎÒÃÇ¿ÉÒÔµ÷ÓÃcountº¯Êý£¬¿´Ò»ÏÂÔÚÎı¾ÎļþÖÐÓжàÉÙÐÐÊý¾Ý¡£

txtData.count()

È»ºó£¬ÎÒÃÇ¿ÉÒÔÖ´ÐÐÈçÏÂÃüÁî½øÐÐ×ÖÊýͳ¼Æ¡£ÔÚÎı¾ÎļþÖÐͳ¼ÆÊý¾Ý»áÏÔʾÔÚÿ¸öµ¥´ÊµÄºóÃæ¡£

val wcData = txtData.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wcData.collect().foreach(println)

Èç¹ûÏë²é¿´¸ü¶à¹ØÓÚÈçºÎʹÓÃSparkºËÐÄAPIµÄ´úÂëʾÀý£¬Çë²Î¿¼ÍøÕ¾ÉϵÄSparkÎĵµ¡£

ºóÐø¼Æ»®

ÔÚºóÐøµÄϵÁÐÎÄÕÂÖУ¬ÎÒÃǽ«´ÓSpark SQL¿ªÊ¼£¬Ñ§Ï°¸ü¶à¹ØÓÚSparkÉú̬ϵͳµÄÆäËû²¿·Ö¡£Ö®ºó£¬ÎÒÃǽ«¼ÌÐøÁ˽âSpark Streaming£¬Spark MLlibºÍSpark GraphX¡£ÎÒÃÇÒ²»áÓлú»áѧϰÏñTachyonºÍBlinkDBµÈ¿ò¼Ü¡£

С½á

ÔÚ±¾ÎÄÖУ¬ÎÒÃÇÁ˽âÁËApache Spark¿ò¼ÜÈçºÎͨ¹ýÆä±ê×¼API°ïÖúÍê³É´óÊý¾Ý´¦ÀíºÍ·ÖÎö¹¤×÷¡£ÎÒÃÇ»¹¶ÔSparkºÍ´«Í³µÄMapReduceʵÏÖ£¨ÈçApache Hadoop£©½øÐÐÁ˱Ƚϡ£SparkÓëHadoop»ùÓÚÏàͬµÄHDFSÎļþ´æ´¢ÏµÍ³£¬Òò´ËÈç¹ûÄãÒѾ­ÔÚHadoopÉϽøÐÐÁË´óÁ¿Í¶×ʺͻù´¡ÉèÊ©½¨É裬¿É ÒÔÒ»ÆðʹÓÃSparkºÍMapReduce¡£

´ËÍ⣬Ҳ¿ÉÒÔ½«Spark´¦ÀíÓëSpark SQL¡¢»úÆ÷ѧϰÒÔ¼°Spark Streaming½áºÏÔÚÒ»Æð¡£¹ØÓÚÕâ·½ÃæµÄÄÚÈÝÎÒÃǽ«ÔÚºóÐøµÄÎÄÕÂÖнéÉÜ¡£

ÀûÓÃSparkµÄһЩ¼¯³É¹¦ÄܺÍÊÊÅäÆ÷£¬ÎÒÃÇ¿ÉÒÔ½«ÆäËû¼¼ÊõÓëSpark½áºÏÔÚÒ»Æð¡£ÆäÖÐÒ»¸ö°¸Àý¾ÍÊǽ«Spark¡¢KafkaºÍApache Cassandra½áºÏÔÚÒ»Æð£¬ÆäÖÐKafka¸ºÔðÊäÈëµÄÁ÷ʽÊý¾Ý£¬SparkÍê³É¼ÆË㣬×îºóCassandra NoSQLÊý¾Ý¿âÓÃÓÚ±£´æ¼ÆËã½á¹ûÊý¾Ý¡£

²»¹ýÐèÒªÀμǵÄÊÇ£¬SparkÉú̬ϵͳÈÔ²»³ÉÊ죬ÔÚ°²È«ºÍÓëBI¹¤¾ß¼¯³ÉµÈÁìÓòÈÔÈ»ÐèÒª½øÒ»²½µÄ¸Ä½ø¡£

   
2763 ´Îä¯ÀÀ       29
Ïà¹ØÎÄÕÂ

»ùÓÚEAµÄÊý¾Ý¿â½¨Ä£
Êý¾ÝÁ÷½¨Ä££¨EAÖ¸ÄÏ£©
¡°Êý¾Ýºþ¡±£º¸ÅÄî¡¢ÌØÕ÷¡¢¼Ü¹¹Óë°¸Àý
ÔÚÏßÉ̳ÇÊý¾Ý¿âϵͳÉè¼Æ ˼·+Ч¹û
 
Ïà¹ØÎĵµ

GreenplumÊý¾Ý¿â»ù´¡Åàѵ
MySQL5.1ÐÔÄÜÓÅ»¯·½°¸
ijµçÉÌÊý¾ÝÖÐ̨¼Ü¹¹Êµ¼ù
MySQL¸ßÀ©Õ¹¼Ü¹¹Éè¼Æ
Ïà¹Ø¿Î³Ì

Êý¾ÝÖÎÀí¡¢Êý¾Ý¼Ü¹¹¼°Êý¾Ý±ê×¼
MongoDBʵս¿Î³Ì
²¢·¢¡¢´óÈÝÁ¿¡¢¸ßÐÔÄÜÊý¾Ý¿âÉè¼ÆÓëÓÅ»¯
PostgreSQLÊý¾Ý¿âʵսÅàѵ
×îл¼Æ»®
DeepSeekÔÚÈí¼þ²âÊÔÓ¦ÓÃʵ¼ù 4-12[ÔÚÏß]
DeepSeek´óÄ£ÐÍÓ¦Óÿª·¢Êµ¼ù 4-19[ÔÚÏß]
UAF¼Ü¹¹ÌåϵÓëʵ¼ù 4-11[±±¾©]
AIÖÇÄÜ»¯Èí¼þ²âÊÔ·½·¨Óëʵ¼ù 5-23[ÉϺ£]
»ùÓÚ UML ºÍEA½øÐзÖÎöÉè¼Æ 4-26[±±¾©]
ÒµÎñ¼Ü¹¹Éè¼ÆÓ뽨ģ 4-18[±±¾©]

MySQLË÷Òý±³ºóµÄÊý¾Ý½á¹¹
MySQLÐÔÄܵ÷ÓÅÓë¼Ü¹¹Éè¼Æ
SQL ServerÊý¾Ý¿â±¸·ÝÓë»Ö¸´
ÈÃÊý¾Ý¿â·ÉÆðÀ´ 10´óDB2ÓÅ»¯
oracleµÄÁÙʱ±í¿Õ¼äдÂú´ÅÅÌ
Êý¾Ý¿âµÄ¿çƽ̨Éè¼Æ


²¢·¢¡¢´óÈÝÁ¿¡¢¸ßÐÔÄÜÊý¾Ý¿â
¸ß¼¶Êý¾Ý¿â¼Ü¹¹Éè¼ÆÊ¦
HadoopÔ­ÀíÓëʵ¼ù
Oracle Êý¾Ý²Ö¿â
Êý¾Ý²Ö¿âºÍÊý¾ÝÍÚ¾ò
OracleÊý¾Ý¿â¿ª·¢Óë¹ÜÀí


GE Çø¿éÁ´¼¼ÊõÓëʵÏÖÅàѵ
º½Ìì¿Æ¹¤Ä³×Ó¹«Ë¾ Nodejs¸ß¼¶Ó¦Óÿª·¢
ÖÐÊ¢Òæ»ª ׿Խ¹ÜÀíÕß±ØÐë¾ß±¸µÄÎåÏîÄÜÁ¦
ijÐÅÏ¢¼¼Êõ¹«Ë¾ PythonÅàѵ
ij²©²ÊITϵͳ³§ÉÌ Ò×ÓÃÐÔ²âÊÔÓëÆÀ¹À
ÖйúÓÊ´¢ÒøÐÐ ²âÊÔ³ÉÊì¶ÈÄ£Ðͼ¯³É(TMMI)
ÖÐÎïÔº ²úÆ·¾­ÀíÓë²úÆ·¹ÜÀí