Spark
StreamingÊÇ´ó¹æÄ£Á÷ʽÊý¾Ý´¦ÀíµÄй󣬽«Á÷ʽ¼ÆËã·Ö½â³ÉһϵÁжÌСµÄÅú´¦Àí×÷Òµ¡£±¾ÎIJûÊÍÁËSpark
StreamingµÄ¼Ü¹¹¼°±à³ÌÄ£ÐÍ£¬²¢½áºÏʵ¼ù¶ÔÆäºËÐļ¼Êõ½øÐÐÁËÉîÈëµÄÆÊÎö£¬¸ø³öÁ˾ßÌåµÄÓ¦Óó¡¾°¼°ÓÅ»¯·½°¸¡£
Ìáµ½Spark Streaming£¬ÎÒÃDz»µÃ²»ËµÒ»ÏÂBDAS£¨Berkeley
Data Analytics Stack£©£¬Õâ¸ö²®¿ËÀû´óѧÌá³öµÄ¹ØÓÚÊý¾Ý·ÖÎöµÄÈí¼þÕ»¡£´ÓËüµÄÊÓ½ÇÀ´¿´£¬Ä¿Ç°µÄ´óÊý¾Ý´¦Àí¿ÉÒÔ·ÖΪÈçÒÔÏÂÈý¸öÀàÐÍ¡£
1.¸´ÔÓµÄÅúÁ¿Êý¾Ý´¦Àí£¨batch data processing£©£¬Í¨³£µÄʱ¼ä¿ç¶ÈÔÚÊýÊ®·ÖÖÓµ½ÊýСʱ֮¼ä¡£
2.»ùÓÚÀúÊ·Êý¾ÝµÄ½»»¥Ê½²éѯ£¨interactive query£©£¬Í¨³£µÄʱ¼ä¿ç¶ÈÔÚÊýÊ®Ãëµ½Êý·ÖÖÓÖ®¼ä¡£
3.»ùÓÚʵʱÊý¾ÝÁ÷µÄÊý¾Ý´¦Àí£¨streaming data processing£©£¬Í¨³£µÄʱ¼ä¿ç¶ÈÔÚÊý°ÙºÁÃëµ½ÊýÃëÖ®¼ä¡£
ĿǰÒÑÓкܶàÏà¶Ô³ÉÊìµÄ¿ªÔ´Èí¼þÀ´´¦ÀíÒÔÉÏÈýÖÖÇé¾°£¬ÎÒÃÇ¿ÉÒÔÀûÓÃMapReduceÀ´½øÐÐÅúÁ¿Êý¾Ý´¦Àí£¬¿ÉÒÔÓÃImpalaÀ´½øÐн»»¥Ê½²éѯ£¬¶ÔÓÚÁ÷ʽÊý¾Ý´¦Àí£¬ÎÒÃÇ¿ÉÒÔ²ÉÓÃStorm¡£¶ÔÓÚ´ó¶àÊý»¥ÁªÍø¹«Ë¾À´Ëµ£¬Ò»°ã¶¼»áͬʱÓöµ½ÒÔÉÏÈýÖÖÇé¾°£¬ÄÇôÔÚʹÓõĹý³ÌÖÐÕâЩ¹«Ë¾¿ÉÄÜ»áÓöµ½ÈçϵIJ»±ã¡£
1.ÈýÖÖÇé¾°µÄÊäÈëÊä³öÊý¾ÝÎÞ·¨ÎÞ·ì¹²Ïí£¬ÐèÒª½øÐиñʽÏ໥ת»»¡£
2.ÿһ¸ö¿ªÔ´Èí¼þ¶¼ÐèÒªÒ»¸ö¿ª·¢ºÍά»¤ÍŶӣ¬Ìá¸ßÁ˳ɱ¾¡£
3.ÔÚͬһ¸ö¼¯ÈºÖжԸ÷¸öϵͳе÷×ÊÔ´·ÖÅä±È½ÏÀ§ÄÑ¡£
BDAS¾ÍÊÇÒÔSparkΪ»ù´¡µÄÒ»Ì×Èí¼þÕ»£¬ÀûÓûùÓÚÄÚ´æµÄͨÓüÆËãÄ£Ðͽ«ÒÔÉÏÈýÖÖÇé¾°Ò»Íø´ò¾¡£¬Í¬Ê±Ö§³ÖBatch¡¢Interactive¡¢StreamingµÄ´¦Àí£¬ÇÒ¼æÈÝÖ§³ÖHDFSºÍS3µÈ·Ö²¼Ê½Îļþϵͳ£¬¿ÉÒÔ²¿ÊðÔÚYARNºÍMesosµÈÁ÷Ðеļ¯Èº×ÊÔ´¹ÜÀíÆ÷Ö®ÉÏ¡£BDASµÄ¹¹¼ÜÈçͼ1Ëùʾ£¬ÆäÖÐSpark¿ÉÒÔÌæ´úMapReduce½øÐÐÅú´¦Àí£¬ÀûÓÃÆä»ùÓÚÄÚ´æµÄÌØµã£¬ÌرðÉó¤µü´úʽºÍ½»»¥Ê½Êý¾Ý´¦Àí£»Shark´¦Àí´ó¹æÄ£Êý¾ÝµÄSQL²éѯ£¬¼æÈÝHiveµÄHQL¡£±¾ÎÄÒªÖØµã½éÉܵÄSpark
Streaming£¬ÔÚÕû¸öBDASÖнøÐдó¹æÄ£Á÷ʽ´¦Àí¡£

ͼ1 BDASÈí¼þÕ»
Spark Streaming¹¹¼Ü
¼ÆËãÁ÷³Ì£ºSpark StreamingÊǽ«Á÷ʽ¼ÆËã·Ö½â³ÉһϵÁжÌСµÄÅú´¦Àí×÷Òµ¡£ÕâÀïµÄÅú´¦ÀíÒýÇæÊÇSpark£¬Ò²¾ÍÊǰÑSpark
StreamingµÄÊäÈëÊý¾Ý°´ÕÕbatch size£¨Èç1Ã룩·Ö³ÉÒ»¶ÎÒ»¶ÎµÄÊý¾Ý£¨Discretized
Stream£©£¬Ã¿Ò»¶ÎÊý¾Ý¶¼×ª»»³ÉSparkÖеÄRDD£¨Resilient Distributed Dataset£©£¬È»ºó½«Spark
StreamingÖжÔDStreamµÄTransformation²Ù×÷±äΪÕë¶ÔSparkÖжÔRDDµÄTransformation²Ù×÷£¬½«RDD¾¹ý²Ù×÷±ä³ÉÖмä½á¹û±£´æÔÚÄÚ´æÖС£Õû¸öÁ÷ʽ¼ÆËã¸ù¾ÝÒµÎñµÄÐèÇó¿ÉÒÔ¶ÔÖмäµÄ½á¹û½øÐеþ¼Ó£¬»òÕß´æ´¢µ½ÍⲿÉ豸¡£Í¼2ÏÔʾÁËSpark
StreamingµÄÕû¸öÁ÷³Ì¡£

ͼ2 Spark Streaming¹¹¼Üͼ
ÈÝ´íÐÔ£º¶ÔÓÚÁ÷ʽ¼ÆËãÀ´Ëµ£¬ÈÝ´íÐÔÖÁ¹ØÖØÒª¡£Ê×ÏÈÎÒÃÇÒªÃ÷È·Ò»ÏÂSparkÖÐRDDµÄÈÝ´í»úÖÆ¡£Ã¿Ò»¸öRDD¶¼ÊÇÒ»¸ö²»¿É±äµÄ·Ö²¼Ê½¿ÉÖØËãµÄÊý¾Ý¼¯£¬Æä¼Ç¼×ÅÈ·¶¨ÐԵIJÙ×÷¼Ì³Ð¹ØÏµ£¨lineage£©£¬ËùÒÔÖ»ÒªÊäÈëÊý¾ÝÊÇ¿ÉÈÝ´íµÄ£¬ÄÇôÈÎÒâÒ»¸öRDDµÄ·ÖÇø£¨Partition£©³ö´í»ò²»¿ÉÓ㬶¼ÊÇ¿ÉÒÔÀûÓÃÔʼÊäÈëÊý¾Ýͨ¹ýת»»²Ù×÷¶øÖØÐÂËã³öµÄ¡£

ͼ3 Spark StreamingÖÐRDDµÄlineage¹ØÏµÍ¼
¶ÔÓÚSpark StreamingÀ´Ëµ£¬ÆäRDDµÄ´«³Ð¹ØÏµÈçͼ3Ëùʾ£¬Í¼ÖеÄÿһ¸öÍÖÔ²Ðαíʾһ¸öRDD£¬ÍÖÔ²ÐÎÖеÄÿ¸öÔ²Ðδú±íÒ»¸öRDDÖеÄÒ»¸öPartition£¬Í¼ÖеÄÿһÁеĶà¸öRDD±íʾһ¸öDStream£¨Í¼ÖÐÓÐÈý¸öDStream£©£¬¶øÃ¿Ò»ÐÐ×îºóÒ»¸öRDDÔò±íʾÿһ¸öBatch
SizeËù²úÉúµÄÖмä½á¹ûRDD¡£ÎÒÃÇ¿ÉÒÔ¿´µ½Í¼ÖеÄÿһ¸öRDD¶¼ÊÇͨ¹ýlineageÏàÁ¬½ÓµÄ£¬ÓÉÓÚSpark
StreamingÊäÈëÊý¾Ý¿ÉÒÔÀ´×ÔÓÚ´ÅÅÌ£¬ÀýÈçHDFS£¨¶à·Ý¿½±´£©»òÊÇÀ´×ÔÓÚÍøÂçµÄÊý¾ÝÁ÷£¨Spark Streaming»á½«ÍøÂçÊäÈëÊý¾ÝµÄÿһ¸öÊý¾ÝÁ÷¿½±´Á½·Ýµ½ÆäËûµÄ»úÆ÷£©¶¼Äܱ£Ö¤ÈÝ´íÐÔ¡£ËùÒÔRDDÖÐÈÎÒâµÄPartition³ö´í£¬¶¼¿ÉÒÔ²¢ÐеØÔÚÆäËû»úÆ÷ÉϽ«È±Ê§µÄPartition¼ÆËã³öÀ´¡£Õâ¸öÈÝ´í»Ö¸´·½Ê½±ÈÁ¬Ðø¼ÆËãÄ£ÐÍ£¨ÈçStorm£©µÄЧÂʸü¸ß¡£
ʵʱÐÔ£º¶ÔÓÚʵʱÐÔµÄÌÖÂÛ£¬»áÇ£Éæµ½Á÷ʽ´¦Àí¿ò¼ÜµÄÓ¦Óó¡¾°¡£Spark Streaming½«Á÷ʽ¼ÆËã·Ö½â³É¶à¸öSpark
Job£¬¶ÔÓÚÿһ¶ÎÊý¾ÝµÄ´¦Àí¶¼»á¾¹ýSpark DAGͼ·Ö½â£¬ÒÔ¼°SparkµÄÈÎÎñ¼¯µÄµ÷¶È¹ý³Ì¡£¶ÔÓÚĿǰ°æ±¾µÄSpark
Streaming¶øÑÔ£¬Æä×îСµÄBatch SizeµÄѡȡÔÚ0.5~2ÃëÖÓÖ®¼ä£¨StormĿǰ×îСµÄÑÓ³ÙÊÇ100ms×óÓÒ£©£¬ËùÒÔSpark
StreamingÄܹ»Âú×ã³ý¶ÔʵʱÐÔÒªÇó·Ç³£¸ß£¨Èç¸ßƵʵʱ½»Ò×£©Ö®ÍâµÄËùÓÐÁ÷ʽ׼ʵʱ¼ÆË㳡¾°¡£
À©Õ¹ÐÔÓëÍÌÍÂÁ¿£ºSparkĿǰÔÚEC2ÉÏÒÑÄܹ»ÏßÐÔÀ©Õ¹µ½100¸ö½Úµã£¨Ã¿¸ö½Úµã4Core£©£¬¿ÉÒÔÒÔÊýÃëµÄÑÓ³Ù´¦Àí6GB/sµÄÊý¾ÝÁ¿£¨60M
records/s£©£¬ÆäÍÌÍÂÁ¿Ò²±ÈÁ÷ÐеÄStorm¸ß2¡«5±¶£¬Í¼4ÊÇBerkeleyÀûÓÃWordCountºÍGrepÁ½¸öÓÃÀýËù×öµÄ²âÊÔ£¬ÔÚGrepÕâ¸ö²âÊÔÖУ¬Spark
StreamingÖеÄÿ¸ö½ÚµãµÄÍÌÍÂÁ¿ÊÇ670k records/s£¬¶øStormÊÇ115k records/s¡£

ͼ4 Spark StreamingÓëStormÍÌÍÂÁ¿±È½Ïͼ
Spark StreamingµÄ±à³ÌÄ£ÐÍ
Spark StreamingµÄ±à³ÌºÍSparkµÄ±à³ÌÈç³öÒ»ÕÞ£¬¶ÔÓÚ±à³ÌµÄÀí½âÒ²·Ç³£ÀàËÆ¡£¶ÔÓÚSparkÀ´Ëµ£¬±à³Ì¾ÍÊǶÔÓÚRDDµÄ²Ù×÷£»¶ø¶ÔÓÚSpark
StreamingÀ´Ëµ£¬¾ÍÊǶÔDStreamµÄ²Ù×÷¡£ÏÂÃæ½«Í¨¹ýÒ»¸ö´ó¼ÒÊìϤµÄWordCountµÄÀý×ÓÀ´ËµÃ÷Spark
StreamingÖеÄÊäÈë²Ù×÷¡¢×ª»»²Ù×÷ºÍÊä³ö²Ù×÷¡£
Spark Streaming³õʼ»¯£ºÔÚ¿ªÊ¼½øÐÐDStream²Ù×÷֮ǰ£¬ÐèÒª¶ÔSpark
Streaming½øÐгõʼ»¯Éú³ÉStreamingContext¡£²ÎÊýÖбȽÏÖØÒªµÄÊǵÚÒ»¸öºÍµÚÈý¸ö£¬µÚÒ»¸ö²ÎÊýÊÇÖ¸¶¨Spark
StreamingÔËÐеļ¯ÈºµØÖ·£¬¶øµÚÈý¸ö²ÎÊýÊÇÖ¸¶¨Spark StreamingÔËÐÐʱµÄbatch´°¿Ú´óС¡£ÔÚÕâ¸öÀý×ÓÖоÍÊǽ«1ÃëÖÓµÄÊäÈëÊý¾Ý½øÐÐÒ»´ÎSpark
Job´¦Àí¡£
val ssc = new StreamingContext(¡°Spark://¡¡±, ¡°WordCount¡±, Seconds(1), [Homes], [Jars]) |
Spark StreamingµÄÊäÈë²Ù×÷£ºÄ¿Ç°Spark StreamingÒÑÖ§³ÖÁ˷ḻµÄÊäÈë½Ó¿Ú£¬´óÖ·ÖΪÁ½ÀࣺһÀàÊÇ´ÅÅÌÊäÈ룬ÈçÒÔbatch
size×÷Ϊʱ¼ä¼ä¸ô¼à¿ØHDFSÎļþϵͳµÄij¸öĿ¼£¬½«Ä¿Â¼ÖÐÄÚÈݵı仯×÷ΪSpark StreamingµÄÊäÈ룻ÁíÒ»Àà¾ÍÊÇÍøÂçÁ÷µÄ·½Ê½£¬Ä¿Ç°Ö§³ÖKafka¡¢Flume¡¢TwitterºÍTCP
socket¡£ÔÚWordCountÀý×ÓÖУ¬¼Ù¶¨Í¨¹ýÍøÂçsocket×÷ΪÊäÈëÁ÷£¬¼àÌýij¸öÌØ¶¨µÄ¶Ë¿Ú£¬×îºóµÃ³öÊäÈëDStream£¨lines£©¡£
val lines = ssc.socketTextStream(¡°localhost¡±,8888) |
Spark StreamingµÄת»»²Ù×÷£ºÓëSpark RDDµÄ²Ù×÷¼«ÎªÀàËÆ£¬Spark
StreamingÒ²¾ÍÊÇͨ¹ýת»»²Ù×÷½«Ò»¸ö»ò¶à¸öDStreamת»»³ÉеÄDStream¡£³£ÓõIJÙ×÷°üÀ¨map¡¢filter¡¢flatmapºÍjoin£¬ÒÔ¼°ÐèÒª½øÐÐshuffle²Ù×÷µÄgroupByKey/reduceByKeyµÈ¡£ÔÚWordCountÀý×ÓÖУ¬ÎÒÃÇÊ×ÏÈÐèÒª½«DStream(lines)Çзֳɵ¥´Ê£¬È»ºó½«Ïàͬµ¥´ÊµÄÊýÁ¿½øÐеþ¼Ó,
×îÖյõ½µÄwordCounts¾ÍÊÇÿһ¸öbatch sizeµÄ£¨µ¥´Ê£¬ÊýÁ¿£©Öмä½á¹û¡£
val words = lines.flatMap(_.split(¡° ¡±))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _) |
ÁíÍ⣬Spark StreamingÓÐÌØ¶¨µÄ´°¿Ú²Ù×÷£¬´°¿Ú²Ù×÷Éæ¼°Á½¸ö²ÎÊý£ºÒ»¸öÊÇ»¬¶¯´°¿ÚµÄ¿í¶È£¨Window
Duration£©£»ÁíÒ»¸öÊÇ´°¿Ú»¬¶¯µÄƵÂÊ£¨Slide Duration£©£¬ÕâÁ½¸ö²ÎÊý±ØÐëÊÇbatch
sizeµÄ±¶Êý¡£ÀýÈçÒÔ¹ýÈ¥5ÃëÖÓΪһ¸öÊäÈë´°¿Ú£¬Ã¿1Ãëͳ¼ÆÒ»ÏÂWordCount£¬ÄÇôÎÒÃǻὫ¹ýÈ¥5ÃëÖÓµÄÿһÃëÖÓµÄWordCount¶¼½øÐÐͳ¼Æ£¬È»ºó½øÐеþ¼Ó£¬µÃ³öÕâ¸ö´°¿ÚÖеĵ¥´Êͳ¼Æ¡£
val wordCounts = words.map(x => (x, 1)).reduceByKeyAndWindow(_ + _, Seconds(5s)£¬seconds(1)) |
µ«ÉÏÃæÕâÖÖ·½Ê½»¹²»¹»¸ßЧ¡£Èç¹ûÎÒÃÇÒÔÔöÁ¿µÄ·½Ê½À´¼ÆËã¾Í¸ü¼Ó¸ßЧ£¬ÀýÈ磬¼ÆËãt+4ÃëÕâ¸öʱ¿Ì¹ýÈ¥5Ãë´°¿ÚµÄWordCount£¬ÄÇôÎÒÃÇ¿ÉÒÔ½«t+3ʱ¿Ì¹ýÈ¥5ÃëµÄͳ¼ÆÁ¿¼ÓÉÏ[t+3£¬t+4]µÄͳ¼ÆÁ¿£¬ÔÚ¼õÈ¥[t-2£¬t-1]µÄͳ¼ÆÁ¿£¨Èçͼ5Ëùʾ£©£¬ÕâÖÖ·½·¨¿ÉÒÔ¸´ÓÃÖмäÈýÃëµÄͳ¼ÆÁ¿£¬Ìá¸ßͳ¼ÆµÄЧÂÊ¡£
val wordCounts = words.map(x => (x, 1)).reduceByKeyAndWindow(_ + _, _ - _, Seconds(5s)£¬seconds(1)) |

ͼ5 Spark StreamingÖ묶¯´°¿ÚµÄµþ¼Ó´¦ÀíºÍÔöÁ¿´¦Àí
Spark StreamingµÄÊäÈë²Ù×÷£º¶ÔÓÚÊä³ö²Ù×÷£¬SparkÌṩÁ˽«Êý¾Ý´òÓ¡µ½ÆÁÄ»¼°ÊäÈëµ½ÎļþÖС£ÔÚWordCountÖÐÎÒÃǽ«DStream
wordCountsÊäÈëµ½HDFSÎļþÖС£
wordCounts = saveAsHadoopFiles(¡°WordCount¡±) |
Spark StreamingÆô¶¯£º¾¹ýÉÏÊöµÄ²Ù×÷£¬Spark Streaming»¹Ã»ÓнøÐй¤×÷£¬ÎÒÃÇ»¹ÐèÒªµ÷ÓÃStart²Ù×÷£¬Spark
Streaming²Å¿ªÊ¼¼àÌýÏàÓ¦µÄ¶Ë¿Ú£¬È»ºóÊÕÈ¡Êý¾Ý£¬²¢½øÐÐͳ¼Æ¡£
Spark Streaming°¸Àý·ÖÎö
ÔÚ»¥ÁªÍøÓ¦ÓÃÖУ¬ÍøÕ¾Á÷Á¿Í³¼Æ×÷ΪһÖÖ³£ÓõÄÓ¦ÓÃģʽ£¬ÐèÒªÔÚ²»Í¬Á£¶ÈÉ϶Բ»Í¬Êý¾Ý½øÐÐͳ¼Æ£¬¼ÈÓÐʵʱÐÔµÄÐèÇó£¬ÓÖÐè񻃾¼°µ½¾ÛºÏ¡¢È¥ÖØ¡¢Á¬½ÓµÈ½ÏΪ¸´ÔÓµÄͳ¼ÆÐèÇó¡£´«Í³ÉÏ£¬ÈôÊÇʹÓÃHadoop
MapReduce¿ò¼Ü£¬ËäÈ»¿ÉÒÔÈÝÒ×µØÊµÏÖ½ÏΪ¸´ÔÓµÄͳ¼ÆÐèÇ󣬵«ÊµÊ±ÐÔÈ´ÎÞ·¨µÃµ½±£Ö¤£»·´Ö®ÈôÊDzÉÓÃStormÕâÑùµÄÁ÷ʽ¿ò¼Ü£¬ÊµÊ±ÐÔËä¿ÉÒԵõ½±£Ö¤£¬µ«ÐèÇóµÄʵÏÖ¸´ÔÓ¶ÈÒ²´ó´óÌá¸ßÁË¡£Spark
StreamingÔÚÁ½ÕßÖ®¼äÕÒµ½ÁËÒ»¸öƽºâµã£¬Äܹ»ÒÔ׼ʵʱµÄ·½Ê½ÈÝÒ×µØÊµÏÖ½ÏΪ¸´ÔÓµÄͳ¼ÆÐèÇó¡£ ÏÂÃæ½éÉÜÒ»ÏÂʹÓÃKafkaºÍSpark
Streaming´î½¨ÊµÊ±Á÷Á¿Í³¼Æ¿ò¼Ü¡£
1.Êý¾ÝÔݴ棺Kafka×÷Ϊ·Ö²¼Ê½ÏûÏ¢¶ÓÁУ¬¼ÈÓзdz£ÓÅÐãµÄÍÌÍÂÁ¿£¬ÓÖÓнϸߵĿɿ¿ÐÔºÍÀ©Õ¹ÐÔ£¬ÔÚÕâÀï²ÉÓÃKafka×÷ΪÈÕÖ¾´«µÝÖмä¼þÀ´½ÓÊÕÈÕÖ¾£¬×¥È¡¿Í»§¶Ë·¢Ë͵ÄÁ÷Á¿ÈÕÖ¾£¬Í¬Ê±½ÓÊÜSpark
StreamingµÄÇëÇ󣬽«Á÷Á¿ÈÕÖ¾°´Ðò·¢Ë͸øSpark Streaming¼¯Èº¡£
2.Êý¾Ý´¦Àí£º½«Spark Streaming¼¯ÈºÓëKafka¼¯Èº¶Ô½Ó£¬Spark
Streaming´ÓKafka¼¯ÈºÖлñÈ¡Á÷Á¿ÈÕÖ¾²¢½øÐд¦Àí¡£Spark Streaming»áʵʱµØ´ÓKafka¼¯ÈºÖлñÈ¡Êý¾Ý²¢½«Æä´æ´¢ÔÚÄÚ²¿µÄ¿ÉÓÃÄÚ´æ¿Õ¼äÖС£µ±Ã¿Ò»¸öbatch´°¿Úµ½À´Ê±£¬±ã¶ÔÕâЩÊý¾Ý½øÐд¦Àí¡£
3.½á¹û´æ´¢£ºÎªÁ˱ãÓÚǰ¶ËչʾºÍÒ³ÃæÇëÇ󣬴¦ÀíµÃµ½µÄ½á¹û½«Ð´Èëµ½Êý¾Ý¿âÖС£
Ïà±ÈÓÚ´«Í³µÄ´¦Àí¿ò¼Ü£¬Kafka+Spark StreamingµÄ¼Ü¹¹ÓÐÒÔϼ¸¸öÓŵ㡣
1.Spark¿ò¼ÜµÄ¸ßЧºÍµÍÑÓ³Ù±£Ö¤ÁËSpark Streaming²Ù×÷µÄ׼ʵʱÐÔ¡£
2.ÀûÓÃSpark¿ò¼ÜÌṩµÄ·á¸»APIºÍ¸ßÁé»îÐÔ£¬¿ÉÒÔ¾«¼òµØÐ´³ö½ÏΪ¸´ÔÓµÄËã·¨¡£
3.±à³ÌÄ£Ð͵ĸ߶ÈÒ»ÖÂʹµÃÉÏÊÖSpark StreamingÏ൱ÈÝÒ×£¬Í¬Ê±Ò²¿ÉÒÔ±£Ö¤ÒµÎñÂß¼ÔÚʵʱ´¦ÀíºÍÅú´¦ÀíÉϵĸ´Óá£
ÔÚ»ùÓÚKafka+Spark StreamingµÄÁ÷Á¿Í³¼ÆÓ¦ÓÃÔËÐйý³ÌÖУ¬ÓÐʱ»áÓöµ½ÄÚ´æ²»×ã¡¢GC×èÈûµÈ¸÷ÖÖÎÊÌâ¡£ÏÂÃæ½éÉÜÒ»ÏÂÈçºÎ¶ÔSpark
StreamingÓ¦ÓóÌÐò½øÐе÷ÓÅÀ´¼õÉÙÉõÖÁ±ÜÃâÕâЩÎÊÌâµÄÓ°Ïì¡£
ÐÔÄܵ÷ÓÅ
ÓÅ»¯ÔËÐÐʱ¼ä
Ôö¼Ó²¢Ðжȡ£È·±£Ê¹ÓÃÕû¸ö¼¯ÈºµÄ×ÊÔ´£¬¶ø²»ÊǰÑÈÎÎñ¼¯ÖÐÔÚ¼¸¸öÌØ¶¨µÄ½ÚµãÉÏ¡£¶ÔÓÚ°üº¬shuffleµÄ²Ù×÷£¬Ôö¼ÓÆä²¢ÐжÈÒÔÈ·±£¸üΪ³ä·ÖµØÊ¹Óü¯Èº×ÊÔ´¡£
¼õÉÙÊý¾ÝÐòÁл¯¡¢·´ÐòÁл¯µÄ¸ºµ£¡£Spark StreamingĬÈϽ«½ÓÊÕµ½µÄÊý¾ÝÐòÁл¯ºó´æ´¢ÒÔ¼õÉÙÄÚ´æµÄʹÓᣵ«ÐòÁл¯ºÍ·´ÐòÁл¯ÐèÒª¸ü¶àµÄCPUʱ¼ä£¬Òò´Ë¸ü¼Ó¸ßЧµÄÐòÁл¯·½Ê½£¨Kryo£©ºÍ×Ô¶¨ÒåµÄÐòÁл¯½Ó¿Ú¿ÉÒÔ¸ü¸ßЧµØÊ¹ÓÃCPU¡£
ÉèÖúÏÀíµÄbatch´°¿Ú¡£ÔÚSpark StreamingÖУ¬JobÖ®¼äÓпÉÄÜ´æÔÚ×ÅÒÀÀµ¹ØÏµ£¬ºóÃæµÄJob±ØÐëÈ·±£Ç°ÃæµÄJobÖ´ÐнáÊøºó²ÅÄÜÌá½»¡£ÈôÇ°ÃæµÄJobÖ´ÐÐʱ¼ä³¬³öÁËÉèÖõÄbatch´°¿Ú£¬ÄÇôºóÃæµÄJob¾ÍÎÞ·¨°´Ê±Ìá½»£¬ÕâÑù¾Í»á½øÒ»²½ÍÏÑÓ½ÓÏÂÀ´µÄJob£¬Ôì³ÉºóÐøJobµÄ×èÈû¡£Òò´Ë£¬ÉèÖÃÒ»¸öºÏÀíµÄbatch´°¿ÚÈ·±£JobÄܹ»ÔÚÕâ¸öbatch´°¿ÚÖнáÊøÊDZØÐëµÄ¡£
¼õÉÙÈÎÎñÌá½»ºÍ·Ö·¢Ëù´øÀ´µÄ¸ºµ£¡£Í¨³£Çé¿öÏÂAkka¿ò¼ÜÄܹ»¸ßЧµØÈ·±£ÈÎÎñ¼°Ê±·Ö·¢£¬µ«µ±batch´°¿Ú·Ç³£Ð¡£¨500ms£©Ê±£¬Ìá½»ºÍ·Ö·¢ÈÎÎñµÄÑӳپͱäµÃ²»¿É½ÓÊÜÁË¡£Ê¹ÓÃStandaloneģʽºÍCoarse-grained
Mesosģʽͨ³£»á±ÈʹÓÃFine-Grained MesosģʽÓиüСµÄÑÓ³Ù¡£
ÓÅ»¯ÄÚ´æÊ¹ÓÃ
¿ØÖÆbatch size¡£Spark Streaming»á°Ñbatch´°¿ÚÄÚ½ÓÊÕµ½µÄËùÓÐÊý¾Ý´æ·ÅÔÚSparkÄÚ²¿µÄ¿ÉÓÃÄÚ´æÇøÓòÖУ¬Òò´Ë±ØÐëÈ·±£µ±Ç°½ÚµãSparkµÄ¿ÉÓÃÄÚ´æÖÁÉÙÄܹ»ÈÝÄÉÕâ¸öbatch´°¿ÚÄÚËùÓеÄÊý¾Ý£¬·ñÔò±ØÐëÔö¼ÓеÄ×ÊÔ´ÒÔÌá¸ß¼¯ÈºµÄ´¦ÀíÄÜÁ¦¡£
¼°Ê±ÇåÀí²»ÔÙʹÓõÄÊý¾Ý¡£ÉÏÃæËµµ½Spark Streaming»á½«½ÓÊÕµ½µÄÊý¾ÝÈ«²¿´æ´¢ÓÚÄÚ²¿µÄ¿ÉÓÃÄÚ´æÇøÓòÖУ¬Òò´Ë¶ÔÓÚ´¦Àí¹ýµÄ²»ÔÙÐèÒªµÄÊý¾ÝÓ¦¼°Ê±ÇåÀíÒÔÈ·±£Spark
StreamingÓи»ÓàµÄ¿ÉÓÃÄÚ´æ¿Õ¼ä¡£Í¨¹ýÉèÖúÏÀíµÄspark.cleaner.ttlʱ³¤À´¼°Ê±ÇåÀí³¬Ê±µÄÎÞÓÃÊý¾Ý¡£
¹Û²ì¼°Êʵ±µ÷ÕûGC²ßÂÔ¡£GC»áÓ°ÏìJobµÄÕý³£ÔËÐУ¬ÑÓ³¤JobµÄÖ´ÐÐʱ¼ä£¬ÒýÆðһϵÁв»¿ÉÔ¤ÁϵÄÎÊÌâ¡£¹Û²ìGCµÄÔËÐÐÇé¿ö£¬²ÉÈ¡²»Í¬µÄGC²ßÂÔÒÔ½øÒ»²½¼õСÄÚ´æ»ØÊÕ¶ÔJobÔËÐеÄÓ°Ïì¡£
×ܽá
Spark StreamingÌṩÁËÒ»Ì׸ßЧ¡¢¿ÉÈÝ´íµÄ׼ʵʱ´ó¹æÄ£Á÷ʽ´¦Àí¿ò¼Ü£¬ËüÄܺÍÅú´¦Àí¼°¼´Ê±²éѯ·ÅÔÚͬһ¸öÈí¼þÕ»ÖУ¬½µµÍѧϰ³É±¾¡£Èç¹ûÄãѧ»áÁËSpark±à³Ì£¬ÄÇôҲ¾Íѧ»áÁËSpark
Streaming±à³Ì£¬Èç¹ûÀí½âÁËSparkµÄµ÷¶ÈºÍ´æ´¢£¬ÄÇôSpark StreamingÒ²ÀàËÆ¡£¶Ô¿ªÔ´Èí¼þ¸ÐÐËȤµÄ¶ÁÕߣ¬ÎÒÃÇ¿ÉÒÔÒ»Æð¹±Ï×ÉçÇø¡£Ä¿Ç°SparkÒÑÔÚApache·õ»¯Æ÷ÖС£°´ÕÕĿǰµÄ·¢Õ¹Ç÷ÊÆ£¬Spark
StreamingÒ»¶¨½«»áµÃµ½¸ü´ó·¶Î§µÄʹÓᣠ|