±à¼ÍƼö: |
±¾ÎÄÖ÷Òª½éÉÜÁË
Spark Streaming¸ÅÊöºÍÏà¹ØÊõÓStreamingÔËÐÐÔÀí¡¢¼Ü¹¹¡¢±à³ÌÄ£Ðͼ°ÈÝ´í¡¢³Ö¾Ã»¯ºÍÐÔÄܵ÷ÓŵÈÏà¹ØÄÚÈÝ¡£
±¾ÎÄÀ´×Ô²©¿ÍÔ°£¬ÓÉ»ðÁú¹ûÈí¼þAnna±à¼¡¢ÍƼö¡£ |
|
1¡¢Spark Streaming¼ò½é
1.1 ¸ÅÊö
Spark Streaming ÊÇSparkºËÐÄAPIµÄÒ»¸öÀ©Õ¹£¬¿ÉÒÔʵÏÖ¸ßÍÌÍÂÁ¿µÄ¡¢¾ß±¸ÈÝ´í»úÖÆµÄʵʱÁ÷Êý¾ÝµÄ´¦Àí¡£Ö§³Ö´Ó¶àÖÖÊý¾ÝÔ´»ñÈ¡Êý¾Ý£¬°üÀ¨Kafk¡¢Flume¡¢Twitter¡¢ZeroMQ¡¢Kinesis
ÒÔ¼°TCP sockets£¬´ÓÊý¾ÝÔ´»ñÈ¡Êý¾ÝÖ®ºó£¬¿ÉÒÔʹÓÃÖîÈçmap¡¢reduce¡¢joinºÍwindowµÈ¸ß¼¶º¯Êý½øÐи´ÔÓËã·¨µÄ´¦Àí¡£×îºó»¹¿ÉÒÔ½«´¦Àí½á¹û´æ´¢µ½Îļþϵͳ£¬Êý¾Ý¿âºÍÏÖ³¡ÒDZíÅÌ¡£ÔÚ¡°One
Stack rule them all¡±µÄ»ù´¡ÉÏ£¬»¹¿ÉÒÔʹÓÃSparkµÄÆäËû×Ó¿ò¼Ü£¬È缯Ⱥѧϰ¡¢Í¼¼ÆËãµÈ£¬¶ÔÁ÷Êý¾Ý½øÐд¦Àí¡£
Spark Streaming´¦ÀíµÄÊý¾ÝÁ÷ͼ£º

SparkµÄ¸÷¸ö×Ó¿ò¼Ü£¬¶¼ÊÇ»ùÓÚºËÐÄSparkµÄ£¬Spark StreamingÔÚÄÚ²¿µÄ´¦Àí»úÖÆÊÇ£¬½ÓÊÕʵʱÁ÷µÄÊý¾Ý£¬²¢¸ù¾ÝÒ»¶¨µÄʱ¼ä¼ä¸ô²ð·Ö³ÉÒ»ÅúÅúµÄÊý¾Ý£¬È»ºóͨ¹ýSpark
Engine´¦ÀíÕâЩÅúÊý¾Ý£¬×îÖյõ½´¦ÀíºóµÄÒ»ÅúÅú½á¹ûÊý¾Ý¡£
¶ÔÓ¦µÄÅúÊý¾Ý£¬ÔÚSparkÄں˶ÔÓ¦Ò»¸öRDDʵÀý£¬Òò´Ë£¬¶ÔÓ¦Á÷Êý¾ÝµÄDStream¿ÉÒÔ¿´³ÉÊÇÒ»×éRDDs£¬¼´RDDµÄÒ»¸öÐòÁС£Í¨Ë×µãÀí½âµÄ»°£¬ÔÚÁ÷Êý¾Ý·Ö³ÉÒ»ÅúÒ»Åúºó£¬Í¨¹ýÒ»¸öÏȽøÏȳöµÄ¶ÓÁУ¬È»ºó
Spark Engine´Ó¸Ã¶ÓÁÐÖÐÒÀ´ÎÈ¡³öÒ»¸ö¸öÅúÊý¾Ý£¬°ÑÅúÊý¾Ý·â×°³ÉÒ»¸öRDD£¬È»ºó½øÐд¦Àí£¬ÕâÊÇÒ»¸öµäÐ͵ÄÉú²úÕßÏû·ÑÕßÄ£ÐÍ£¬¶ÔÓ¦µÄ¾ÍÓÐÉú²úÕßÏû·ÑÕßÄ£Ð͵ÄÎÊÌ⣬¼´ÈçºÎе÷Éú²úËÙÂʺÍÏû·ÑËÙÂÊ¡£
1.2 ÊõÓﶨÒå
lÀëÉ¢Á÷£¨discretized stream£©»òDStream£ºÕâÊÇSpark Streaming¶ÔÄÚ²¿³ÖÐøµÄʵʱÊý¾ÝÁ÷µÄ³éÏóÃèÊö£¬¼´ÎÒÃÇ´¦ÀíµÄÒ»¸öʵʱÊý¾ÝÁ÷£¬ÔÚSpark
StreamingÖжÔÓ¦ÓÚÒ»¸öDStream ʵÀý¡£
lÅúÊý¾Ý£¨batch data£©£ºÕâÊÇ»¯ÕûΪÁãµÄµÚÒ»²½£¬½«ÊµÊ±Á÷Êý¾ÝÒÔʱ¼äƬΪµ¥Î»½øÐзÖÅú£¬½«Á÷´¦Àíת»¯ÎªÊ±¼äƬÊý¾ÝµÄÅú´¦Àí¡£Ëæ×ųÖÐøÊ±¼äµÄÍÆÒÆ£¬ÕâЩ´¦Àí½á¹û¾ÍÐγÉÁ˶ÔÓ¦µÄ½á¹ûÊý¾ÝÁ÷ÁË¡£
lʱ¼äƬ»òÅú´¦Àíʱ¼ä¼ä¸ô£¨ batch interval£©£ºÕâÊÇÈËΪµØ¶ÔÁ÷Êý¾Ý½øÐж¨Á¿µÄ±ê×¼£¬ÒÔʱ¼äƬ×÷ΪÎÒÃDzð·ÖÁ÷Êý¾ÝµÄÒÀ¾Ý¡£Ò»¸öʱ¼äƬµÄÊý¾Ý¶ÔÓ¦Ò»¸öRDDʵÀý¡£
l´°¿Ú³¤¶È£¨window length£©£ºÒ»¸ö´°¿Ú¸²¸ÇµÄÁ÷Êý¾ÝµÄʱ¼ä³¤¶È¡£±ØÐëÊÇÅú´¦Àíʱ¼ä¼ä¸ôµÄ±¶Êý£¬
l»¬¶¯Ê±¼ä¼ä¸ô£ºÇ°Ò»¸ö´°¿Úµ½ºóÒ»¸ö´°¿ÚËù¾¹ýµÄʱ¼ä³¤¶È¡£±ØÐëÊÇÅú´¦Àíʱ¼ä¼ä¸ôµÄ±¶Êý
lInput DStream :Ò»¸öinput DStreamÊÇÒ»¸öÌØÊâµÄDStream£¬½«Spark
StreamingÁ¬½Óµ½Ò»¸öÍⲿÊý¾ÝÔ´À´¶ÁÈ¡Êý¾Ý¡£
1.3 StormÓëSpark Streming±È½Ï
l´¦ÀíÄ£ÐÍÒÔ¼°ÑÓ³Ù
ËäÈ»Á½¿ò¼Ü¶¼ÌṩÁË¿ÉÀ©Õ¹ÐÔ(scalability)ºÍ¿ÉÈÝ´íÐÔ(fault tolerance)£¬µ«ÊÇËüÃǵĴ¦ÀíÄ£ÐÍ´Ó¸ù±¾ÉÏ˵ÊDz»Ò»ÑùµÄ¡£Storm¿ÉÒÔʵÏÖÑÇÃ뼶ʱÑӵĴ¦Àí£¬¶øÃ¿´ÎÖ»´¦ÀíÒ»Ìõevent£¬¶øSpark
Streaming¿ÉÒÔÔÚÒ»¸ö¶ÌÔݵÄʱ¼ä´°¿ÚÀïÃæ´¦Àí¶àÌõ(batches)Event¡£ËùÒÔ˵Storm¿ÉÒÔʵÏÖÑÇÃ뼶ʱÑӵĴ¦Àí£¬¶øSpark
StreamingÔòÓÐÒ»¶¨µÄʱÑÓ¡£
lÈÝ´íºÍÊý¾Ý±£Ö¤
È»¶øÁ½ÕߵĴú¼Û¶¼ÊÇÈÝ´íʱºòµÄÊý¾Ý±£Ö¤£¬Spark StreamingµÄÈÝ´íΪÓÐ״̬µÄ¼ÆËãÌṩÁ˸üºÃµÄÖ§³Ö¡£ÔÚStormÖУ¬Ã¿Ìõ¼Ç¼ÔÚϵͳµÄÒÆ¶¯¹ý³ÌÖж¼ÐèÒª±»±ê¼Ç¸ú×Ù£¬ËùÒÔStormÖ»Äܱ£Ö¤Ã¿Ìõ¼Ç¼×îÉÙ±»´¦ÀíÒ»´Î£¬µ«ÊÇÔÊÐí´Ó´íÎó״̬»Ö¸´Ê±±»´¦Àí¶à´Î¡£Õâ¾ÍÒâζ×ſɱä¸üµÄ״̬¿ÉÄܱ»¸üÐÂÁ½´Î´Ó¶øµ¼Ö½á¹û²»ÕýÈ·¡£
ÈÎÒ»·½Ã棬Spark Streaming½ö½öÐèÒªÔÚÅú´¦Àí¼¶±ð¶Ô¼Ç¼½øÐÐ×·×Ù£¬ËùÒÔËûÄܱ£Ö¤Ã¿¸öÅú´¦Àí¼Ç¼½ö½ö±»´¦ÀíÒ»´Î£¬¼´Ê¹ÊÇnode½Úµã¹Òµô¡£ËäȻ˵StormµÄ
Trident library¿ÉÒÔ±£Ö¤Ò»Ìõ¼Ç¼±»´¦ÀíÒ»´Î£¬µ«ÊÇËüÒÀÀµÓÚÊÂÎñ¸üÐÂ״̬£¬¶øÕâ¸ö¹ý³ÌÊǺÜÂýµÄ£¬²¢ÇÒÐèÒªÓÉÓû§È¥ÊµÏÖ¡£
lʵÏֺͱà³ÌAPI
StormÖ÷ÒªÊÇÓÉClojureÓïÑÔʵÏÖ£¬Spark StreamingÊÇÓÉScalaʵÏÖ¡£Èç¹ûÄãÏë¿´¿´ÕâÁ½¸ö¿ò¼ÜÊÇÈçºÎʵÏֵĻòÕßÄãÏë×Ô¶¨ÒåһЩ¶«Î÷Äã¾ÍµÃ¼ÇסÕâÒ»µã¡£StormÊÇÓÉBackTypeºÍ
Twitter¿ª·¢£¬¶øSpark StreamingÊÇÔÚUC Berkeley¿ª·¢µÄ¡£
StormÌṩÁËJava API£¬Í¬Ê±Ò²Ö§³ÖÆäËûÓïÑÔµÄAPI¡£ Spark StreamingÖ§³ÖScalaºÍJavaÓïÑÔ(ÆäʵҲ֧³ÖPython)¡£
lÅú´¦Àí¿ò¼Ü¼¯³É
Spark StreamingµÄÒ»¸öºÜ°ôµÄÌØÐÔ¾ÍÊÇËüÊÇÔÚSpark¿ò¼ÜÉÏÔËÐеġ£ÕâÑùÄã¾Í¿ÉÒÔÏëʹÓÃÆäËûÅú´¦Àí´úÂëÒ»ÑùÀ´Ð´Spark
Streaming³ÌÐò£¬»òÕßÊÇÔÚSparkÖн»»¥²éѯ¡£Õâ¾Í¼õÉÙÁ˵¥¶À±àдÁ÷ÅúÁ¿´¦Àí³ÌÐòºÍÀúÊ·Êý¾Ý´¦Àí³ÌÐò¡£
lÉú²úÖ§³Ö
StormÒѾ³öÏֺöàÄêÁË£¬¶øÇÒ×Ô´Ó2011Ä꿪ʼ¾ÍÔÚTwitterÄÚ²¿Éú²ú»·¾³ÖÐʹÓ㬻¹ÓÐÆäËûһЩ¹«Ë¾¡£¶øSpark
StreamingÊÇÒ»¸öеÄÏîÄ¿£¬²¢ÇÒÔÚ2013Äê½ö½ö±»SharethroughʹÓÃ(¾Ý×÷ÕßÁ˽â)¡£
StormÊÇ Hortonworks HadoopÊý¾Ýƽ̨ÖÐÁ÷´¦ÀíµÄ½â¾ö·½°¸£¬¶øSpark Streaming³öÏÖÔÚ
MapRµÄ·Ö²¼Ê½Æ½Ì¨ºÍClouderaµÄÆóÒµÊý¾Ýƽ̨ÖС£³ý´ËÖ®Í⣬DatabricksÊÇΪSparkÌṩ¼¼ÊõÖ§³ÖµÄ¹«Ë¾£¬°üÀ¨ÁËSpark
Streaming¡£
ËäȻ˵Á½Õß¶¼¿ÉÒÔÔÚ¸÷×Եļ¯Èº¿ò¼ÜÖÐÔËÐУ¬µ«ÊÇStorm¿ÉÒÔÔÚMesosÉÏÔËÐÐ, ¶øSpark Streaming¿ÉÒÔÔÚYARNºÍMesosÉÏÔËÐС£
2¡¢ÔËÐÐÔÀí
2.1 Streaming¼Ü¹¹
SparkStreamingÊÇÒ»¸ö¶ÔʵʱÊý¾ÝÁ÷½øÐиßͨÁ¿¡¢ÈÝ´í´¦ÀíµÄÁ÷ʽ´¦Àíϵͳ£¬¿ÉÒÔ¶Ô¶àÖÖÊý¾ÝÔ´£¨ÈçKdfka¡¢Flume¡¢Twitter¡¢ZeroºÍTCP
Ì×½Ó×Ö£©½øÐÐÀàËÆMap¡¢ReduceºÍJoinµÈ¸´ÔÓ²Ù×÷£¬²¢½«½á¹û±£´æµ½ÍⲿÎļþϵͳ¡¢Êý¾Ý¿â»òÓ¦Óõ½ÊµÊ±ÒDZíÅÌ¡£
l¼ÆËãÁ÷³Ì£ºSpark StreamingÊǽ«Á÷ʽ¼ÆËã·Ö½â³ÉһϵÁжÌСµÄÅú´¦Àí×÷Òµ¡£ÕâÀïµÄÅú´¦ÀíÒýÇæÊÇSpark
Core£¬Ò²¾ÍÊǰÑSpark StreamingµÄÊäÈëÊý¾Ý°´ÕÕbatch size£¨Èç1Ã룩·Ö³ÉÒ»¶ÎÒ»¶ÎµÄÊý¾Ý£¨Discretized
Stream£©£¬Ã¿Ò»¶ÎÊý¾Ý¶¼×ª»»³ÉSparkÖеÄRDD£¨Resilient Distributed
Dataset£©£¬È»ºó½«Spark StreamingÖжÔDStreamµÄTransformation²Ù×÷±äΪÕë¶ÔSparkÖжÔRDDµÄTransformation²Ù×÷£¬½«RDD¾¹ý²Ù×÷±ä³ÉÖмä½á¹û±£´æÔÚÄÚ´æÖС£Õû¸öÁ÷ʽ¼ÆËã¸ù¾ÝÒµÎñµÄÐèÇó¿ÉÒÔ¶ÔÖмäµÄ½á¹û½øÐеþ¼Ó»òÕß´æ´¢µ½ÍⲿÉ豸¡£ÏÂͼÏÔʾÁËSpark
StreamingµÄÕû¸öÁ÷³Ì¡£

ͼSpark Streaming¹¹¼Ü
lÈÝ´íÐÔ£º¶ÔÓÚÁ÷ʽ¼ÆËãÀ´Ëµ£¬ÈÝ´íÐÔÖÁ¹ØÖØÒª¡£Ê×ÏÈÎÒÃÇÒªÃ÷È·Ò»ÏÂSparkÖÐRDDµÄÈÝ´í»úÖÆ¡£Ã¿Ò»¸öRDD¶¼ÊÇÒ»¸ö²»¿É±äµÄ·Ö²¼Ê½¿ÉÖØËãµÄÊý¾Ý¼¯£¬Æä¼Ç¼×ÅÈ·¶¨ÐԵIJÙ×÷¼Ì³Ð¹ØÏµ£¨lineage£©£¬ËùÒÔÖ»ÒªÊäÈëÊý¾ÝÊÇ¿ÉÈÝ´íµÄ£¬ÄÇôÈÎÒâÒ»¸öRDDµÄ·ÖÇø£¨Partition£©³ö´í»ò²»¿ÉÓ㬶¼ÊÇ¿ÉÒÔÀûÓÃÔʼÊäÈëÊý¾Ýͨ¹ýת»»²Ù×÷¶øÖØÐÂËã³öµÄ¡£
¶ÔÓÚSpark StreamingÀ´Ëµ£¬ÆäRDDµÄ´«³Ð¹ØÏµÈçÏÂͼËùʾ£¬Í¼ÖеÄÿһ¸öÍÖÔ²Ðαíʾһ¸öRDD£¬ÍÖÔ²ÐÎÖеÄÿ¸öÔ²Ðδú±íÒ»¸öRDDÖеÄÒ»¸öPartition£¬Í¼ÖеÄÿһÁеĶà¸öRDD±íʾһ¸öDStream£¨Í¼ÖÐÓÐÈý¸öDStream£©£¬¶øÃ¿Ò»ÐÐ×îºóÒ»¸öRDDÔò±íʾÿһ¸öBatch
SizeËù²úÉúµÄÖмä½á¹ûRDD¡£ÎÒÃÇ¿ÉÒÔ¿´µ½Í¼ÖеÄÿһ¸öRDD¶¼ÊÇͨ¹ýlineageÏàÁ¬½ÓµÄ£¬ÓÉÓÚSpark
StreamingÊäÈëÊý¾Ý¿ÉÒÔÀ´×ÔÓÚ´ÅÅÌ£¬ÀýÈçHDFS£¨¶à·Ý¿½±´£©»òÊÇÀ´×ÔÓÚÍøÂçµÄÊý¾ÝÁ÷£¨Spark
Streaming»á½«ÍøÂçÊäÈëÊý¾ÝµÄÿһ¸öÊý¾ÝÁ÷¿½±´Á½·Ýµ½ÆäËûµÄ»úÆ÷£©¶¼Äܱ£Ö¤ÈÝ´íÐÔ£¬ËùÒÔRDDÖÐÈÎÒâµÄPartition³ö´í£¬¶¼¿ÉÒÔ²¢ÐеØÔÚÆäËû»úÆ÷ÉϽ«È±Ê§µÄPartition¼ÆËã³öÀ´¡£Õâ¸öÈÝ´í»Ö¸´·½Ê½±ÈÁ¬Ðø¼ÆËãÄ£ÐÍ£¨ÈçStorm£©µÄЧÂʸü¸ß¡£

Spark StreamingÖÐRDDµÄlineage¹ØÏµÍ¼
lʵʱÐÔ£º¶ÔÓÚʵʱÐÔµÄÌÖÂÛ£¬»áÇ£Éæµ½Á÷ʽ´¦Àí¿ò¼ÜµÄÓ¦Óó¡¾°¡£Spark Streaming½«Á÷ʽ¼ÆËã·Ö½â³É¶à¸öSpark
Job£¬¶ÔÓÚÿһ¶ÎÊý¾ÝµÄ´¦Àí¶¼»á¾¹ýSpark DAGͼ·Ö½âÒÔ¼°SparkµÄÈÎÎñ¼¯µÄµ÷¶È¹ý³Ì¡£¶ÔÓÚĿǰ°æ±¾µÄSpark
Streaming¶øÑÔ£¬Æä×îСµÄBatch SizeµÄѡȡÔÚ0.5~2ÃëÖÓÖ®¼ä£¨StormĿǰ×îСµÄÑÓ³ÙÊÇ100ms×óÓÒ£©£¬ËùÒÔSpark
StreamingÄܹ»Âú×ã³ý¶ÔʵʱÐÔÒªÇó·Ç³£¸ß£¨Èç¸ßƵʵʱ½»Ò×£©Ö®ÍâµÄËùÓÐÁ÷ʽ׼ʵʱ¼ÆË㳡¾°¡£
lÀ©Õ¹ÐÔÓëÍÌÍÂÁ¿£ºSparkĿǰÔÚEC2ÉÏÒÑÄܹ»ÏßÐÔÀ©Õ¹µ½100¸ö½Úµã£¨Ã¿¸ö½Úµã4Core£©£¬¿ÉÒÔÒÔÊýÃëµÄÑÓ³Ù´¦Àí6GB/sµÄÊý¾ÝÁ¿£¨60M
records/s£©£¬ÆäÍÌÍÂÁ¿Ò²±ÈÁ÷ÐеÄStorm¸ß2¡«5±¶£¬Í¼4ÊÇBerkeleyÀûÓÃWordCountºÍGrepÁ½¸öÓÃÀýËù×öµÄ²âÊÔ£¬ÔÚGrepÕâ¸ö²âÊÔÖУ¬Spark
StreamingÖеÄÿ¸ö½ÚµãµÄÍÌÍÂÁ¿ÊÇ670k records/s£¬¶øStormÊÇ115k records/s¡£

Spark StreamingÓëStormÍÌÍÂÁ¿±È½Ïͼ
2.2 ±à³ÌÄ£ÐÍ
DStream£¨Discretized Stream£©×÷ΪSpark StreamingµÄ»ù´¡³éÏó£¬Ëü´ú±í³ÖÐøÐÔµÄÊý¾ÝÁ÷¡£ÕâЩÊý¾ÝÁ÷¼È¿ÉÒÔͨ¹ýÍⲿÊäÈëÔ´Àµ»ñÈ¡£¬Ò²¿ÉÒÔͨ¹ýÏÖÓеÄDstreamµÄtransformation²Ù×÷À´»ñµÃ¡£ÔÚÄÚ²¿ÊµÏÖÉÏ£¬DStreamÓÉÒ»×éʱ¼äÐòÁÐÉÏÁ¬ÐøµÄRDDÀ´±íʾ¡£Ã¿¸öRDD¶¼°üº¬ÁË×Ô¼ºÌض¨Ê±¼ä¼ä¸ôÄÚµÄÊý¾ÝÁ÷¡£Èçͼ7-3Ëùʾ¡£

ͼ7-3 DStreamÖÐÔÚʱ¼äÖáÏÂÉú³ÉÀëÉ¢µÄRDDÐòÁÐ

¶ÔDStreamÖÐÊý¾ÝµÄ¸÷ÖÖ²Ù×÷Ò²ÊÇÓ³Éäµ½ÄÚ²¿µÄRDDÉÏÀ´½øÐеģ¬Èçͼ7-4Ëùʾ£¬¶ÔDtreamµÄ²Ù×÷¿ÉÒÔͨ¹ýRDDµÄtransformationÉú³ÉеÄDStream¡£ÕâÀïµÄÖ´ÐÐÒýÇæÊÇSpark¡£
2.2.1 ÈçºÎʹÓÃSpark Streaming
×÷Ϊ¹¹½¨ÓÚSparkÖ®ÉϵÄÓ¦Óÿò¼Ü£¬Spark Streaming³ÐÏ®ÁËSparkµÄ±à³Ì·ç¸ñ£¬¶ÔÓÚÒѾÁ˽âSparkµÄÓû§À´ËµÄܹ»¿ìËÙµØÉÏÊÖ¡£½ÓÏÂÀ´ÒÔSpark
Streaming¹Ù·½ÌṩµÄWordCount´úÂëΪÀýÀ´½éÉÜSpark StreamingµÄʹÓ÷½Ê½¡£
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
// Create a local StreamingContext with two working
thread and batch interval of 1 second.
// The master requires 2 cores to prevent from a
starvation scenario.
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
// Create a DStream that will connect to hostname:port,
like localhost:9999
val lines = ssc.socketTextStream("localhost",
9999)
// Split each line into words
val words = lines.flatMap(_.split(" "))
import org.apache.spark.streaming.StreamingContext._
// Count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
// Print the first ten elements of each RDD generated
in this DStream to the console
wordCounts.print()
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation
to terminate
1.´´½¨StreamingContext¶ÔÏó ͬSpark³õʼ»¯ÐèÒª´´½¨SparkContext¶ÔÏóÒ»Ñù£¬Ê¹ÓÃSpark
Streaming¾ÍÐèÒª´´½¨StreamingContext¶ÔÏó¡£´´½¨StreamingContext¶ÔÏóËùÐèµÄ²ÎÊýÓëSparkContext»ù±¾Ò»Ö£¬°üÀ¨Ö¸Ã÷Master£¬É趨Ãû³Æ(ÈçNetworkWordCount)¡£ÐèҪעÒâµÄÊDzÎÊýSeconds(1)£¬Spark
StreamingÐèÒªÖ¸¶¨´¦ÀíÊý¾ÝµÄʱ¼ä¼ä¸ô£¬ÈçÉÏÀýËùʾµÄ1s£¬ÄÇôSpark Streaming»áÒÔ1sΪʱ¼ä´°¿Ú½øÐÐÊý¾Ý´¦Àí¡£´Ë²ÎÊýÐèÒª¸ù¾ÝÓû§µÄÐèÇóºÍ¼¯ÈºµÄ´¦ÀíÄÜÁ¦½øÐÐÊʵ±µÄÉèÖã»
2.´´½¨InputDStreamÈçͬStormµÄSpout£¬Spark StreamingÐèÒªÖ¸Ã÷Êý¾ÝÔ´¡£ÈçÉÏÀýËùʾµÄsocketTextStream£¬Spark
StreamingÒÔsocketÁ¬½Ó×÷ΪÊý¾ÝÔ´¶ÁÈ¡Êý¾Ý¡£µ±È»Spark StreamingÖ§³Ö¶àÖÖ²»Í¬µÄÊý¾ÝÔ´£¬°üÀ¨Kafka¡¢
Flume¡¢HDFS/S3¡¢KinesisºÍTwitterµÈÊý¾ÝÔ´£»
3.²Ù×÷DStream¶ÔÓÚ´ÓÊý¾ÝÔ´µÃµ½µÄDStream£¬Óû§¿ÉÒÔÔÚÆä»ù´¡ÉϽøÐи÷ÖÖ²Ù×÷£¬ÈçÉÏÀýËùʾµÄ²Ù×÷¾ÍÊÇÒ»¸öµäÐ͵ÄWordCountÖ´ÐÐÁ÷³Ì£º¶ÔÓÚµ±Ç°Ê±¼ä´°¿ÚÄÚ´ÓÊý¾ÝÔ´µÃµ½µÄÊý¾ÝÊ×ÏȽøÐзָȻºóÀûÓÃMapºÍReduceByKey·½·¨½øÐмÆË㣬µ±È»×îºó»¹ÓÐʹÓÃprint()·½·¨Êä³ö½á¹û£»
4.Æô¶¯Spark Streaming֮ǰËù×÷µÄËùÓв½ÖèÖ»ÊÇ´´½¨ÁËÖ´ÐÐÁ÷³Ì£¬³ÌÐòûÓÐÕæÕýÁ¬½ÓÉÏÊý¾ÝÔ´£¬Ò²Ã»ÓжÔÊý¾Ý½øÐÐÈκβÙ×÷£¬Ö»ÊÇÉ趨ºÃÁËËùÓеÄÖ´Ðмƻ®£¬µ±ssc.start()Æô¶¯ºó³ÌÐò²ÅÕæÕý½øÐÐËùÓÐÔ¤ÆÚµÄ²Ù×÷¡£
ÖÁ´Ë¶ÔÓÚSpark StreamingµÄÈçºÎʹÓÃÓÐÁËÒ»¸ö´ó¸ÅµÄÓ¡Ïó£¬ÔÚºóÃæµÄÕ½ÚÎÒÃÇ»áͨ¹ýÔ´´úÂëÉîÈë̽¾¿Ò»ÏÂSpark
StreamingµÄÖ´ÐÐÁ÷³Ì¡£
2.2.2 DStreamµÄÊäÈëÔ´
ÔÚSpark StreamingÖÐËùÓеIJÙ×÷¶¼ÊÇ»ùÓÚÁ÷µÄ£¬¶øÊäÈëÔ´ÊÇÕâһϵÁвÙ×÷µÄÆðµã¡£ÊäÈë DStreams
ºÍ DStreams ½ÓÊÕµÄÁ÷¶¼´ú±íÊäÈëÊý¾ÝÁ÷µÄÀ´Ô´£¬ÔÚSpark Streaming ÌṩÁ½ÖÖÄÚÖÃÊý¾ÝÁ÷À´Ô´£º
l »ù´¡À´Ô´ ÔÚ StreamingContext API ÖÐÖ±½Ó¿ÉÓõÄÀ´Ô´¡£ÀýÈ磺Îļþϵͳ¡¢Socket£¨Ì×½Ó×Ö£©Á¬½ÓºÍ
Akka actors£»
l ¸ß¼¶À´Ô´ Èç Kafka¡¢Flume¡¢Kinesis¡¢Twitter µÈ£¬¿ÉÒÔͨ¹ý¶îÍâµÄʵÓù¤¾ßÀà´´½¨¡£
2.2.2.1 »ù´¡À´Ô´
ÔÚÇ°Ãæ·ÖÎöÔõÑùʹÓÃSpark StreamingµÄÀý×ÓÖÐÎÒÃÇÒÑ¿´µ½ssc.socketTextStream()·½·¨£¬¿ÉÒÔͨ¹ý
TCP Ì×½Ó×ÖÁ¬½Ó£¬´Ó´ÓÎı¾Êý¾ÝÖд´½¨ÁËÒ»¸ö DStream¡£³ýÁËÌ×½Ó×Ö£¬StreamingContext
µÄAPI»¹ÌṩÁË·½·¨´ÓÎļþºÍ Akka actors Öд´½¨ DStreams×÷ΪÊäÈëÔ´¡£
Spark StreamingÌṩÁËstreamingContext.fileStream(dataDirectory)·½·¨¿ÉÒÔ´ÓÈκÎÎļþϵͳ(È磺HDFS¡¢S3¡¢NFS
µÈ£©µÄÎļþÖжÁÈ¡Êý¾Ý£¬È»ºó´´½¨Ò»¸öDStream¡£Spark Streaming ¼à¿Ø dataDirectory
Ŀ¼ºÍÔÚ¸ÃĿ¼ÏÂÈκÎÎļþ±»´´½¨´¦Àí(²»Ö§³ÖÔÚǶÌ×Ŀ¼ÏÂдÎļþ)¡£ÐèҪעÒâµÄÊÇ£º¶ÁÈ¡µÄ±ØÐëÊǾßÓÐÏàͬµÄÊý¾Ý¸ñʽµÄÎļþ£»´´½¨µÄÎļþ±ØÐëÔÚ
dataDirectory Ŀ¼Ï£¬²¢Í¨¹ý×Ô¶¯Òƶ¯»òÖØÃüÃû³ÉÊý¾ÝĿ¼£»ÎļþÒ»µ©Òƶ¯¾Í²»Äܱ»¸Ä±ä£¬Èç¹ûÎļþ±»²»¶Ï×·¼Ó,еÄÊý¾Ý½«²»»á±»ÔĶÁ¡£¶ÔÓÚ¼òµ¥µÄÎı¾ÎÄ£¬¿ÉÒÔʹÓÃÒ»¸ö¼òµ¥µÄ·½·¨streamingContext.textFileStream(dataDirectory)À´¶ÁÈ¡Êý¾Ý¡£
Spark StreamingÒ²¿ÉÒÔ»ùÓÚ×Ô¶¨Òå Actors µÄÁ÷´´½¨DStream £¬Í¨¹ý Akka
actors ½ÓÊÜÊý¾ÝÁ÷£¬Ê¹Ó÷½·¨streamingContext.actorStream(actorProps,
actor-name)¡£Spark StreamingʹÓà streamingContext.queueStream(queueOfRDDs)·½·¨¿ÉÒÔ´´½¨»ùÓÚ
RDD ¶ÓÁеÄDStream£¬Ã¿¸öRDD ¶ÓÁн«±»ÊÓΪ DStream ÖÐÒ»¿éÊý¾ÝÁ÷½øÐмӹ¤´¦Àí¡£
2.2.2.2 ¸ß¼¶À´Ô´
ÕâÒ»ÀàµÄÀ´Ô´ÐèÒªÍⲿ non-Spark ¿âµÄ½Ó¿Ú£¬ÆäÖÐһЩÓи´ÔÓµÄÒÀÀµ¹ØÏµ(Èç Kafka¡¢Flume)¡£Òò´Ëͨ¹ýÕâЩÀ´Ô´´´½¨
DStreams ÐèÒªÃ÷È·ÆäÒÀÀµ¡£ÀýÈ磬Èç¹ûÏë´´½¨Ò»¸öʹÓà Twitter tweets µÄÊý¾ÝµÄDStream
Á÷£¬±ØÐë°´ÒÔϲ½ÖèÀ´×ö£º
1£©ÔÚ SBT »ò Maven¹¤³ÌÀïÌí¼Ó spark-streaming-twitter_2.10
ÒÀÀµ¡£
2£©¿ª·¢£ºµ¼Èë TwitterUtils °ü£¬Í¨¹ý TwitterUtils.createStream
·½·¨´´½¨Ò»¸öDStream¡£
3£©²¿Êð£ºÌí¼ÓËùÓÐÒÀÀµµÄ jar °ü(°üÀ¨ÒÀÀµµÄspark-streaming-twitter_2.10
¼°ÆäÒÀÀµ)£¬È»ºó²¿ÊðÓ¦ÓóÌÐò¡£
ÐèҪעÒâµÄÊÇ£¬ÕâЩ¸ß¼¶µÄÀ´Ô´Ò»°ãÔÚSpark ShellÖв»¿ÉÓã¬Òò´Ë»ùÓÚÕâЩ¸ß¼¶À´Ô´µÄÓ¦Óò»ÄÜÔÚSpark
ShellÖнøÐвâÊÔ¡£Èç¹ûÄã±ØÐëÔÚSpark shellÖÐʹÓÃËüÃÇ£¬ÄãÐèÒªÏÂÔØÏàÓ¦µÄMaven¹¤³ÌµÄJarÒÀÀµ²¢Ìí¼Óµ½Àà·¾¶ÖС£
ÆäÖÐһЩ¸ß¼¶À´Ô´ÈçÏ£º
lTwitter Spark StreamingµÄTwitterUtils¹¤¾ßÀàʹÓÃTwitter4j£¬Twitter4J
¿âÖ§³Öͨ¹ýÈκη½·¨ÌṩÉí·ÝÑéÖ¤ÐÅÏ¢£¬Äã¿ÉÒԵõ½¹«ÖÚµÄÁ÷£¬»òµÃµ½»ùÓڹؼü´Ê¹ýÂËÁ÷¡£
lFlume Spark Streaming¿ÉÒÔ´ÓFlumeÖнÓÊÜÊý¾Ý¡£
lKafka Spark Streaming¿ÉÒÔ´ÓKafkaÖнÓÊÜÊý¾Ý¡£
lKinesis Spark Streaming¿ÉÒÔ´ÓKinesisÖнÓÊÜÊý¾Ý¡£
ÐèÒªÖØÉêµÄÒ»µãÊÇÔÚ¿ªÊ¼±àд×Ô¼ºµÄ SparkStreaming ³ÌÐò֮ǰ£¬Ò»¶¨Òª½«¸ß¼¶À´Ô´ÒÀÀµµÄJarÌí¼Óµ½SBT
»ò Maven ÏîÄ¿ÏàÓ¦µÄartifactÖС£³£¼ûµÄÊäÈëÔ´ºÍÆä¶ÔÓ¦µÄJar°üÈçÏÂͼËùʾ¡£

ÁíÍ⣬ÊäÈëDStreamÒ²¿ÉÒÔ´´½¨×Ô¶¨ÒåµÄÊý¾ÝÔ´£¬ÐèÒª×öµÄ¾ÍÊÇʵÏÖÒ»¸öÓû§¶¨ÒåµÄ½ÓÊÕÆ÷¡£
2.2.3 DStreamµÄ²Ù×÷
ÓëRDDÀàËÆ£¬DStreamÒ²ÌṩÁË×Ô¼ºµÄһϵÁвÙ×÷·½·¨£¬ÕâЩ²Ù×÷¿ÉÒÔ·Ö³ÉÈýÀࣺÆÕͨµÄת»»²Ù×÷¡¢´°¿Úת»»²Ù×÷ºÍÊä³ö²Ù×÷¡£
2.2.3.1 ÆÕͨµÄת»»²Ù×÷
ÆÕͨµÄת»»²Ù×÷ÈçϱíËùʾ£º

ÔÚÉÏÃæÁгöµÄÕâЩ²Ù×÷ÖУ¬transform()·½·¨ºÍupdateStateByKey()·½·¨ÖµµÃÎÒÃÇÉîÈëµÄ̽ÌÖһϣº
l transform(func)²Ù×÷
¸Ãtransform²Ù×÷£¨×ª»»²Ù×÷£©Á¬Í¬ÆäÆäÀàËÆµÄ transformWith²Ù×÷ÔÊÐíDStream
ÉÏÓ¦ÓÃÈÎÒâRDD-to-RDDº¯Êý¡£Ëü¿ÉÒÔ±»Ó¦ÓÃÓÚδÔÚ DStream API Öб©Â¶ÈκεÄRDD²Ù×÷¡£ÀýÈ磬ÔÚÿÅú´ÎµÄÊý¾ÝÁ÷ÓëÁíÒ»Êý¾Ý¼¯µÄÁ¬½Ó¹¦Äܲ»Ö±½Ó±©Â¶ÔÚDStream
API ÖУ¬µ«¿ÉÒÔÇáËɵØÊ¹ÓÃtransform²Ù×÷À´×öµ½ÕâÒ»µã£¬ÕâʹµÃDStreamµÄ¹¦Äܷdz£Ç¿´ó¡£ÀýÈ磬Äã¿ÉÒÔͨ¹ýÁ¬½ÓÔ¤ÏȼÆËãµÄÀ¬»øÓʼþÐÅÏ¢µÄÊäÈëÊý¾ÝÁ÷£¨¿ÉÄÜÒ²ÓÐSparkÉú³ÉµÄ£©£¬È»ºó»ùÓÚ´Ë×öʵʱÊý¾ÝÇåÀíµÄɸѡ£¬ÈçÏÂÃæ¹Ù·½ÌṩµÄα´úÂëËùʾ¡£ÊÂʵÉÏ£¬Ò²¿ÉÒÔÔÚtransform·½·¨ÖÐʹÓûúÆ÷ѧϰºÍͼÐμÆËãµÄËã·¨¡£
l updateStateByKey²Ù×÷
¸Ã updateStateByKey ²Ù×÷¿ÉÒÔÈÃÄã±£³ÖÈÎÒâ״̬£¬Í¬Ê±²»¶ÏÓÐеÄÐÅÏ¢½øÐиüС£ÒªÊ¹Óô˹¦ÄÜ£¬±ØÐë½øÐÐÁ½¸ö²½Öè
£º
£¨1£© ¶¨Òå״̬ - ״̬¿ÉÒÔÊÇÈÎÒâµÄÊý¾ÝÀàÐÍ¡£
£¨2£© ¶¨Òå״̬¸üк¯Êý - ÓÃÒ»¸öº¯ÊýÖ¸¶¨ÈçºÎʹÓÃÏÈǰµÄ״̬ºÍ´ÓÊäÈëÁ÷ÖлñÈ¡µÄÐÂÖµ ¸üÐÂ״̬¡£
ÈÃÎÒÃÇÓÃÒ»¸öÀý×ÓÀ´ËµÃ÷£¬¼ÙÉèÄãÒª½øÐÐÎı¾Êý¾ÝÁ÷Öе¥´Ê¼ÆÊý¡£ÔÚÕâÀÕýÔÚÔËÐеļÆÊýÊÇ״̬¶øÇÒËüÊÇÒ»¸öÕûÊý¡£ÎÒÃǶ¨ÒåÁ˸üй¦ÄÜÈçÏ£º

´Ëº¯ÊýÓ¦ÓÃÓÚº¬ÓмüÖµ¶ÔµÄDStreamÖУ¨ÈçÇ°ÃæµÄʾÀýÖУ¬ÔÚDStreamÖк¬ÓУ¨word£¬1£©¼üÖµ¶Ô£©¡£Ëü»áÕë¶ÔÀïÃæµÄÿ¸öÔªËØ£¨ÈçwordCountÖеÄword£©µ÷ÓÃһϸüк¯Êý£¬newValuesÊÇ×îеÄÖµ£¬runningCountÊÇ֮ǰµÄÖµ¡£

2.2.3.2 ´°¿Úת»»²Ù×÷
Spark Streaming »¹ÌṩÁË´°¿ÚµÄ¼ÆË㣬ËüÔÊÐíÄãͨ¹ý»¬¶¯´°¿Ú¶ÔÊý¾Ý½øÐÐת»»£¬´°¿Úת»»²Ù×÷ÈçÏ£º


Åú´¦Àí¼ä¸ôʾÒâͼ
ÔÚSpark StreamingÖУ¬Êý¾Ý´¦ÀíÊǰ´Åú½øÐе쬶øÊý¾Ý²É¼¯ÊÇÖðÌõ½øÐеģ¬Òò´ËÔÚSpark
StreamingÖлáÏÈÉèÖúÃÅú´¦Àí¼ä¸ô£¨batch duration£©£¬µ±³¬¹ýÅú´¦Àí¼ä¸ôµÄʱºò¾Í»á°Ñ²É¼¯µ½µÄÊý¾Ý»ã×ÜÆðÀ´³ÉΪһÅúÊý¾Ý½»¸øÏµÍ³È¥´¦Àí¡£
¶ÔÓÚ´°¿Ú²Ù×÷¶øÑÔ£¬ÔÚÆä´°¿ÚÄÚ²¿»áÓÐN¸öÅú´¦ÀíÊý¾Ý£¬Åú´¦ÀíÊý¾ÝµÄ´óСÓÉ´°¿Ú¼ä¸ô£¨window duration£©¾ö¶¨£¬¶ø´°¿Ú¼ä¸ôÖ¸µÄ¾ÍÊÇ´°¿ÚµÄ³ÖÐøÊ±¼ä£¬ÔÚ´°¿Ú²Ù×÷ÖУ¬Ö»Óд°¿ÚµÄ³¤¶ÈÂú×ãÁ˲Żᴥ·¢ÅúÊý¾ÝµÄ´¦Àí¡£³ýÁË´°¿ÚµÄ³¤¶È£¬´°¿Ú²Ù×÷»¹ÓÐÁíÒ»¸öÖØÒªµÄ²ÎÊý¾ÍÊÇ»¬¶¯¼ä¸ô£¨slide
duration£©£¬ËüÖ¸µÄÊǾ¹ý¶à³¤Ê±¼ä´°¿Ú»¬¶¯Ò»´ÎÐγÉеĴ°¿Ú£¬»¬¶¯´°¿ÚĬÈÏÇé¿öϺÍÅú´Î¼ä¸ôµÄÏàͬ£¬¶ø´°¿Ú¼ä¸ôÒ»°ãÉèÖõÄÒª±ÈËüÃÇÁ½¸ö´ó¡£ÔÚÕâÀï±ØÐë×¢ÒâµÄÒ»µãÊÇ»¬¶¯¼ä¸ôºÍ´°¿Ú¼ä¸ôµÄ´óСһ¶¨µÃÉèÖÃΪÅú´¦Àí¼ä¸ôµÄÕûÊý±¶¡£
ÈçÅú´¦Àí¼ä¸ôʾÒâͼËùʾ£¬Åú´¦Àí¼ä¸ôÊÇ1¸öʱ¼äµ¥Î»£¬´°¿Ú¼ä¸ôÊÇ3¸öʱ¼äµ¥Î»£¬»¬¶¯¼ä¸ôÊÇ2¸öʱ¼äµ¥Î»¡£¶ÔÓÚ³õʼµÄ´°¿Útime
1-time 3£¬Ö»Óд°¿Ú¼ä¸ôÂú×ãÁ˲Ŵ¥·¢Êý¾ÝµÄ´¦Àí¡£ÕâÀïÐèҪעÒâµÄÒ»µãÊÇ£¬³õʼµÄ´°¿ÚÓпÉÄÜÁ÷ÈëµÄÊý¾ÝûÓгÅÂú£¬µ«ÊÇËæ×Åʱ¼äµÄÍÆ½ø£¬´°¿Ú×îÖջᱻ³ÅÂú¡£µ±Ã¿¸ö2¸öʱ¼äµ¥Î»£¬´°¿Ú»¬¶¯Ò»´Îºó£¬»áÓÐеÄÊý¾ÝÁ÷Èë´°¿Ú£¬Õâʱ´°¿Ú»áÒÆÈ¥×îÔçµÄÁ½¸öʱ¼äµ¥Î»µÄÊý¾Ý£¬¶øÓë×îеÄÁ½¸öʱ¼äµ¥Î»µÄÊý¾Ý½øÐлã×ÜÐγÉеĴ°¿Ú£¨time3-time5£©¡£
¶ÔÓÚ´°¿Ú²Ù×÷£¬Åú´¦Àí¼ä¸ô¡¢´°¿Ú¼ä¸ôºÍ»¬¶¯¼ä¸ôÊǷdz£ÖØÒªµÄÈý¸öʱ¼ä¸ÅÄÊÇÀí½â´°¿Ú²Ù×÷µÄ¹Ø¼üËùÔÚ¡£
2.2.3.3 Êä³ö²Ù×÷
Spark StreamingÔÊÐíDStreamµÄÊý¾Ý±»Êä³öµ½Íⲿϵͳ£¬ÈçÊý¾Ý¿â»òÎļþϵͳ¡£ÓÉÓÚÊä³ö²Ù×÷ʵ¼ÊÉÏʹtransformation²Ù×÷ºóµÄÊý¾Ý¿ÉÒÔͨ¹ýÍⲿϵͳ±»Ê¹Óã¬Í¬Ê±Êä³ö²Ù×÷´¥·¢ËùÓÐDStreamµÄtransformation²Ù×÷µÄʵ¼ÊÖ´ÐУ¨ÀàËÆÓÚRDD²Ù×÷£©¡£ÒÔϱíÁгöÁËĿǰÖ÷ÒªµÄÊä³ö²Ù×÷£º

dstream.foreachRDDÊÇÒ»¸ö·Ç³£Ç¿´óµÄÊä³ö²Ù×÷£¬ËüÔʽ«ÐíÊý¾ÝÊä³öµ½Íⲿϵͳ¡£µ«ÊÇ £¬ÈçºÎÕýÈ·¸ßЧµØÊ¹ÓÃÕâ¸ö²Ù×÷ÊǺÜÖØÒªµÄ£¬ÏÂÃæÕ¹Ê¾ÁËÈçºÎÈ¥±ÜÃâһЩ³£¼ûµÄ´íÎó¡£
ͨ³£½«Êý¾ÝдÈëµ½ÍⲿϵͳÐèÒª´´½¨Ò»¸öÁ¬½Ó¶ÔÏó£¨Èç TCPÁ¬½Óµ½Ô¶³Ì·þÎñÆ÷£©£¬²¢ÓÃËüÀ´·¢ËÍÊý¾Ýµ½Ô¶³Ìϵͳ¡£³öÓÚÕâ¸öÄ¿µÄ£¬¿ª·¢Õß¿ÉÄÜÔÚ²»¾Òâ¼äÔÚSpark
driver¶Ë´´½¨ÁËÁ¬½Ó¶ÔÏ󣬲¢³¢ÊÔʹÓÃËü±£´æRDDÖеļǼµ½Spark workerÉÏ£¬ÈçÏÂÃæ´úÂ룺

ÕâÊDz»ÕýÈ·µÄ£¬ÕâÐèÒªÁ¬½Ó¶ÔÏó½øÐÐÐòÁл¯²¢´ÓDriver¶Ë·¢Ë͵½WorkerÉÏ¡£Á¬½Ó¶ÔÏóºÜÉÙÔÚ²»Í¬»úÆ÷¼ä½øÐÐÕâÖÖ²Ù×÷£¬´Ë´íÎó¿ÉÄܱíÏÖΪÐòÁл¯´íÎó£¨Á¬½Ó¶Ô²»¿ÉÐòÁл¯£©£¬³õʼ»¯´íÎó£¨Á¬½Ó¶ÔÏóÔÚÐèÒªÔÚWorker
ÉϽøÐÐÐèÒª³õʼ»¯£© µÈµÈ£¬ÕýÈ·µÄ½â¾ö°ì·¨ÊÇÔÚ workerÉÏ´´½¨µÄÁ¬½Ó¶ÔÏó¡£
ͨ³£Çé¿öÏ£¬´´½¨Ò»¸öÁ¬½Ó¶ÔÏóÓÐʱ¼äºÍ×ÊÔ´¿ªÏú¡£Òò´Ë£¬´´½¨ºÍÏú»ÙµÄÿÌõ¼Ç¼µÄÁ¬½Ó¶ÔÏó¿ÉÄÜÕÐÖ²»±ØÒªµÄ×ÊÔ´¿ªÏú£¬²¢ÏÔÖø½µµÍϵͳÕûÌåµÄÍÌÍÂÁ¿
¡£Ò»¸ö¸üºÃµÄ½â¾ö·½°¸ÊÇʹÓÃrdd.foreachPartition·½·¨´´½¨Ò»¸öµ¥¶ÀµÄÁ¬½Ó¶ÔÏó£¬È»ºóʹÓøÃÁ¬½Ó¶ÔÏóÊä³öµÄËùÓÐRDD·ÖÇøÖеÄÊý¾Ýµ½Íⲿϵͳ¡£
Õ⻺½âÁË´´½¨¶àÌõ¼Ç¼Á¬½ÓµÄ¿ªÏú¡£×îºó£¬»¹¿ÉÒÔ½øÒ»²½Í¨¹ýÔÚ¶à¸öRDDs/ batchesÉÏÖØÓÃÁ¬½Ó¶ÔÏó½øÐÐÓÅ»¯¡£Ò»¸ö±£³ÖÁ¬½Ó¶ÔÏóµÄ¾²Ì¬³Ø¿ÉÒÔÖØÓÃÔÚ¶à¸öÅú´¦ÀíµÄRDDÉϽ«ÆäÊä³öµ½Íⲿϵͳ£¬´Ó¶ø½øÒ»²½½µµÍÁË¿ªÏú¡£
ÐèҪעÒâµÄÊÇ£¬ÔÚ¾²Ì¬³ØÖеÄÁ¬½ÓÓ¦¸Ã°´ÐèÑÓ³Ù´´½¨£¬ÕâÑù¿ÉÒÔ¸üÓÐЧµØ°ÑÊý¾Ý·¢Ë͵½Íⲿϵͳ¡£ÁíÍâÐèҪҪעÒâµÄÊÇ£ºDStreamsÑÓ³ÙÖ´Ðе쬾ÍÏñRDDµÄ²Ù×÷ÊÇÓÉactions´¥·¢Ò»Ñù¡£Ä¬ÈÏÇé¿öÏ£¬Êä³ö²Ù×÷»á°´ÕÕËüÃÇÔÚStreamingÓ¦ÓóÌÐòÖж¨ÒåµÄ˳ÐòÒ»¸ö¸öÖ´ÐС£
2.3 ÈÝ´í¡¢³Ö¾Ã»¯ºÍÐÔÄܵ÷ÓÅ
2.3.1 ÈÝ´í
DStream»ùÓÚRDD×é³É£¬RDDµÄÈÝ´íÐÔÒÀ¾ÉÓÐЧ£¬ÎÒÃÇÊ×ÏÈ»ØÒäÒ»ÏÂSparkRDDµÄ»ù±¾ÌØÐÔ¡£
lRDDÊÇÒ»¸ö²»¿É±äµÄ¡¢È·¶¨ÐԵĿÉÖØ¸´¼ÆËãµÄ·Ö²¼Ê½Êý¾Ý¼¯¡£RDDµÄijЩpartition¶ªÊ§ÁË£¬¿ÉÒÔͨ¹ýѪͳ£¨lineage£©ÐÅÏ¢ÖØÐ¼ÆËã»Ö¸´£»
lÈç¹ûRDDÈκηÖÇøÒòworker½Úµã¹ÊÕ϶ø¶ªÊ§£¬ÄÇôÕâ¸ö·ÖÇø¿ÉÒÔ´ÓÔÀ´ÒÀÀµµÄÈÝ´íÊý¾Ý¼¯Öлָ´£»
lÓÉÓÚSparkÖÐËùÓеÄÊý¾ÝµÄת»»²Ù×÷¶¼ÊÇ»ùÓÚRDDµÄ£¬¼´Ê¹¼¯Èº³öÏÖ¹ÊÕÏ£¬Ö»ÒªÊäÈëÊý¾Ý¼¯´æÔÚ£¬ËùÓеÄÖмä½á¹û¶¼ÊÇ¿ÉÒÔ±»¼ÆËãµÄ¡£
Spark StreamingÊÇ¿ÉÒÔ´ÓHDFSºÍS3ÕâÑùµÄÎļþϵͳ¶ÁÈ¡Êý¾ÝµÄ£¬ÕâÖÖÇé¿öÏÂËùÓеÄÊý¾Ý¶¼¿ÉÒÔ±»ÖØÐ¼ÆË㣬²»Óõ£ÐÄÊý¾ÝµÄ¶ªÊ§¡£µ«ÊÇÔÚ´ó¶àÊýÇé¿öÏ£¬Spark
StreamingÊÇ»ùÓÚÍøÂçÀ´½ÓÊÜÊý¾ÝµÄ£¬´ËʱΪÁËʵÏÖÏàͬµÄÈÝ´í´¦Àí£¬ÔÚ½ÓÊÜÍøÂçµÄÊý¾Ýʱ»áÔÚ¼¯ÈºµÄ¶à¸öWorker½Úµã¼ä½øÐÐÊý¾ÝµÄ¸´ÖÆ£¨Ä¬Èϵĸ´ÖÆÊýÊÇ2£©£¬Õâµ¼Ö²úÉúÔÚ³öÏÖ¹ÊÕÏʱ±»´¦ÀíµÄÁ½ÖÖÀàÐ͵ÄÊý¾Ý£º
1£©Data received and replicated £ºÒ»µ©Ò»¸öWorker½ÚµãʧЧ£¬ÏµÍ³»á´ÓÁíÒ»·Ý»¹´æÔÚµÄÊý¾ÝÖÐÖØÐ¼ÆËã¡£
2£©Data received but buffered for replication £ºÒ»µ©Êý¾Ý¶ªÊ§£¬¿ÉÒÔͨ¹ýRDDÖ®¼äµÄÒÀÀµ¹ØÏµ£¬´ÓHDFSÕâÑùµÄÍⲿÎļþϵͳ¶ÁÈ¡Êý¾Ý¡£
´ËÍ⣬ÓÐÁ½ÖÖ¹ÊÕÏ£¬ÎÒÃÇÓ¦¸Ã¹ØÐÄ£º
£¨1£©Worker½ÚµãʧЧ£ºÍ¨¹ýÉÏÃæµÄ½²½âÎÒÃÇÖªµÀ£¬Õâʱϵͳ»á¸ù¾Ý³öÏÖ¹ÊÕϵÄÊý¾ÝµÄÀàÐÍ£¬Ñ¡ÔñÊÇ´ÓÁíÒ»¸öÓи´ÖƹýÊý¾ÝµÄ¹¤×÷½ÚµãÉÏÖØÐ¼ÆË㣬»¹ÊÇÖ±½Ó´Ó´ÓÍⲿÎļþϵͳ¶ÁÈ¡Êý¾Ý¡£
£¨2£©Driver£¨Çý¶¯½Úµã£©Ê§Ð§ £ºÈç¹ûÔËÐÐ Spark StreamingÓ¦ÓÃʱÇý¶¯½Úµã³öÏÖ¹ÊÕÏ£¬ÄÇôºÜÃ÷ÏÔµÄStreamingContextÒѾ¶ªÊ§£¬Í¬Ê±ÔÚÄÚ´æÖеÄÊý¾ÝÈ«²¿¶ªÊ§¡£¶ÔÓÚÕâÖÖÇé¿ö£¬Spark
StreamingÓ¦ÓóÌÐòÔÚ¼ÆËãÉÏÓÐÒ»¸öÄÚÔڵĽṹ¡ª¡ªÔÚÿ¶Îmicro-batchÊý¾ÝÖÜÆÚÐÔµØÖ´ÐÐͬÑùµÄSpark¼ÆËã¡£ÕâÖֽṹÔÊÐí°ÑÓ¦ÓõÄ״̬£¨Òà³Æcheckpoint£©ÖÜÆÚÐԵر£´æµ½¿É¿¿µÄ´æ´¢¿Õ¼äÖУ¬²¢ÔÚdriverÖØÐÂÆô¶¯Ê±»Ö¸´¸Ã״̬¡£¾ßÌå×ö·¨ÊÇÔÚssc.checkpoint(<checkpoint
directory>)º¯ÊýÖнøÐÐÉèÖã¬Spark Streaming¾Í»á¶¨ÆÚ°ÑDStreamµÄÔªÐÅϢдÈëµ½HDFSÖУ¬Ò»µ©Çý¶¯½ÚµãʧЧ£¬¶ªÊ§µÄStreamingContext»áͨ¹ýÒѾ±£´æµÄ¼ì²éµãÐÅÏ¢½øÐлָ´¡£
×îºóÎÒÃÇ̸һÏÂSpark StreamµÄÈÝ´íÔÚSpark 1.2°æ±¾µÄһЩ¸Ä½ø£º
ʵʱÁ÷´¦Àíϵͳ±ØÐëÒªÄÜÔÚ24/7ʱ¼äÄÚ¹¤×÷£¬Òò´ËËüÐèÒª¾ß±¸´Ó¸÷ÖÖϵͳ¹ÊÕÏÖлָ´¹ýÀ´µÄÄÜÁ¦¡£×ʼ£¬SparkStreaming¾ÍÖ§³Ö´ÓdriverºÍworker¹ÊÕϻָ´µÄÄÜÁ¦¡£È»¶øÓÐЩÊý¾ÝÔ´µÄÊäÈë¿ÉÄÜÔÚ¹ÊÕϻָ´ÒÔºó¶ªÊ§Êý¾Ý¡£ÔÚSpark1.2°æ±¾ÖУ¬SparkÒѾÔÚSparkStreamingÖжÔԤдÈÕÖ¾£¨Ò²±»³ÆÎªjournaling£©×÷Á˳õ²½Ö§³Ö£¬¸Ä½øÁ˻ָ´»úÖÆ£¬²¢Ê¹¸ü¶àÊý¾ÝÔ´µÄÁãÊý¾Ý¶ªÊ§ÓÐÁ˿ɿ¿¡£
¶ÔÓÚÎļþÕâÑùµÄÔ´Êý¾Ý£¬driver»Ö¸´»úÖÆ×ãÒÔ×öµ½ÁãÊý¾Ý¶ªÊ§£¬ÒòΪËùÓеÄÊý¾Ý¶¼±£´æÔÚÁËÏñHDFS»òS3ÕâÑùµÄÈÝ´íÎļþϵͳÖÐÁË¡£µ«¶ÔÓÚÏñKafkaºÍFlumeµÈÆäËüÊý¾ÝÔ´£¬ÓÐЩ½ÓÊÕµ½µÄÊý¾Ý»¹Ö»»º´æÔÚÄÚ´æÖУ¬ÉÐδ±»´¦Àí£¬ËüÃǾÍÓпÉÄܻᶪʧ¡£ÕâÊÇÓÉÓÚSparkÓ¦Óõķֲ¼²Ù×÷·½Ê½ÒýÆðµÄ¡£µ±driver½ø³Ìʧ°Üʱ£¬ËùÓÐÔÚstandalone/yarn/mesos¼¯ÈºÔËÐеÄexecutor£¬Á¬Í¬ËüÃÇÔÚÄÚ´æÖеÄËùÓÐÊý¾Ý£¬Ò²Í¬Ê±±»ÖÕÖ¹¡£¶ÔÓÚSpark
StreamingÀ´Ëµ£¬´ÓÖîÈçKafkaºÍFlumeµÄÊý¾ÝÔ´½ÓÊÕµ½µÄËùÓÐÊý¾Ý£¬ÔÚËüÃÇ´¦ÀíÍê³É֮ǰ£¬Ò»Ö±¶¼»º´æÔÚexecutorµÄÄÚ´æÖС£×ÝÈ»driverÖØÐÂÆô¶¯£¬ÕâЩ»º´æµÄÊý¾ÝÒ²²»Äܱ»»Ö¸´¡£ÎªÁ˱ÜÃâÕâÖÖÊý¾ÝËðʧ£¬ÔÚSpark1.2·¢²¼°æ±¾ÖÐÒý½øÁËԤдÈÕÖ¾£¨WriteAheadLogs£©¹¦ÄÜ¡£
ԤдÈÕÖ¾¹¦ÄܵÄÁ÷³ÌÊÇ£º1£©Ò»¸öSparkStreamingÓ¦ÓÿªÊ¼Ê±£¨Ò²¾ÍÊÇdriver¿ªÊ¼Ê±£©£¬Ïà¹ØµÄStreamingContextʹÓÃSparkContextÆô¶¯½ÓÊÕÆ÷³ÉΪ³¤×¤ÔËÐÐÈÎÎñ¡£ÕâЩ½ÓÊÕÆ÷½ÓÊÕ²¢±£´æÁ÷Êý¾Ýµ½SparkÄÚ´æÖÐÒÔ¹©´¦Àí¡£2£©½ÓÊÕÆ÷֪ͨdriver¡£3£©½ÓÊÕ¿éÖеÄÔªÊý¾Ý£¨metadata£©±»·¢Ë͵½driverµÄStreamingContext¡£Õâ¸öÔªÊý¾Ý°üÀ¨£º£¨a£©¶¨Î»ÆäÔÚexecutorÄÚ´æÖÐÊý¾ÝµÄ¿éreferenceid£¬£¨b£©¿éÊý¾ÝÔÚÈÕÖ¾ÖÐµÄÆ«ÒÆÐÅÏ¢£¨Èç¹ûÆôÓÃÁË£©¡£
Óû§´«ËÍÊý¾ÝµÄÉúÃüÖÜÆÚÈçÏÂͼËùʾ¡£

ÀàËÆKafkaÕâÑùµÄϵͳ¿ÉÒÔͨ¹ý¸´ÖÆÊý¾Ý±£³Ö¿É¿¿ÐÔ¡£ÔÊÐíԤдÈÕÖ¾Á½´Î¸ßЧµØ¸´ÖÆÍ¬ÑùµÄÊý¾Ý£ºÒ»´ÎÓÉKafka£¬¶øÁíÒ»´ÎÓÉSparkStreaming¡£SparkδÀ´°æ±¾½«°üº¬KafkaÈÝ´í»úÖÆµÄÔÉúÖ§³Ö£¬´Ó¶ø±ÜÃâµÚ¶þ¸öÈÕÖ¾¡£
2.3.2 ³Ö¾Ã»¯
ÓëRDDÒ»Ñù£¬DStreamͬÑùÒ²ÄÜͨ¹ýpersist()·½·¨½«Êý¾ÝÁ÷´æ·ÅÔÚÄÚ´æÖУ¬Ä¬Èϵij־û¯·½Ê½ÊÇMEMORY_ONLY_SER£¬Ò²¾ÍÊÇÔÚÄÚ´æÖдæ·ÅÊý¾ÝͬʱÐòÁл¯µÄ·½Ê½£¬ÕâÑù×öµÄºÃ´¦ÊÇÓöµ½ÐèÒª¶à´Îµü´ú¼ÆËãµÄ³ÌÐòʱ£¬ËÙ¶ÈÓÅÊÆÊ®·ÖµÄÃ÷ÏÔ¡£¶ø¶ÔÓÚһЩ»ùÓÚ´°¿ÚµÄ²Ù×÷£¬ÈçreduceByWindow¡¢reduceByKeyAndWindow£¬ÒÔ¼°»ùÓÚ״̬µÄ²Ù×÷£¬ÈçupdateStateBykey£¬ÆäĬÈϵij־û¯²ßÂÔ¾ÍÊDZ£´æÔÚÄÚ´æÖС£
¶ÔÓÚÀ´×ÔÍøÂçµÄÊý¾ÝÔ´£¨Kafka¡¢Flume¡¢socketsµÈ£©£¬Ä¬Èϵij־û¯²ßÂÔÊǽ«Êý¾Ý±£´æÔÚÁ½Ì¨»úÆ÷ÉÏ£¬ÕâÒ²ÊÇΪÁËÈÝ´íÐÔ¶øÉè¼ÆµÄ¡£
ÁíÍ⣬¶ÔÓÚ´°¿ÚºÍÓÐ״̬µÄ²Ù×÷±ØÐëcheckpoint£¬Í¨¹ýStreamingContextµÄcheckpointÀ´Ö¸¶¨Ä¿Â¼£¬Í¨¹ý
DtreamµÄcheckpointÖ¸¶¨¼ä¸ôʱ¼ä£¬¼ä¸ô±ØÐëÊÇ»¬¶¯¼ä¸ô£¨slide interval£©µÄ±¶Êý¡£
2.3.3 ÐÔÄܵ÷ÓÅ
1. ÓÅ»¯ÔËÐÐʱ¼ä
l Ôö¼Ó²¢ÐÐ¶È È·±£Ê¹ÓÃÕû¸ö¼¯ÈºµÄ×ÊÔ´£¬¶ø²»ÊǰÑÈÎÎñ¼¯ÖÐÔÚ¼¸¸öÌØ¶¨µÄ½ÚµãÉÏ¡£¶ÔÓÚ°üº¬shuffleµÄ²Ù×÷£¬Ôö¼ÓÆä²¢ÐжÈÒÔÈ·±£¸üΪ³ä·ÖµØÊ¹Óü¯Èº×ÊÔ´£»
l ¼õÉÙÊý¾ÝÐòÁл¯£¬·´ÐòÁл¯µÄ¸ºµ£ Spark StreamingĬÈϽ«½ÓÊܵ½µÄÊý¾ÝÐòÁл¯ºó´æ´¢£¬ÒÔ¼õÉÙÄÚ´æµÄʹÓᣵ«ÊÇÐòÁл¯ºÍ·´ÐòÁл°ÐèÒª¸ü¶àµÄCPUʱ¼ä£¬Òò´Ë¸ü¼Ó¸ßЧµÄÐòÁл¯·½Ê½£¨Kryo£©ºÍ×Ô¶¨ÒåµÄϵÁл¯½Ó¿Ú¿ÉÒÔ¸ü¸ßЧµØÊ¹ÓÃCPU£»
l ÉèÖúÏÀíµÄbatch duration£¨Åú´¦Àíʱ¼ä¼ä£© ÔÚSpark StreamingÖУ¬JobÖ®¼äÓпÉÄÜ´æÔÚÒÀÀµ¹ØÏµ£¬ºóÃæµÄJob±ØÐëÈ·±£Ç°ÃæµÄ×÷ÒµÖ´ÐнáÊøºó²ÅÄÜÌá½»¡£ÈôÇ°ÃæµÄJobÖ´ÐеÄʱ¼ä³¬³öÁËÅú´¦Àíʱ¼ä¼ä¸ô£¬ÄÇôºóÃæµÄJob¾ÍÎÞ·¨°´Ê±Ìá½»£¬ÕâÑù¾Í»á½øÒ»²½ÍÏÑÓ½ÓÏÂÀ´µÄJob£¬Ôì³ÉºóÐøJobµÄ×èÈû¡£Òò´ËÉèÖÃÒ»¸öºÏÀíµÄÅú´¦Àí¼ä¸ôÒÔÈ·±£×÷ÒµÄܹ»ÔÚÕâ¸öÅú´¦Àí¼ä¸ôÄÚ½áÊøÊ±±ØÐëµÄ£»
l ¼õÉÙÒòÈÎÎñÌá½»ºÍ·Ö·¢Ëù´øÀ´µÄ¸ºµ£ ͨ³£Çé¿öÏ£¬Akka¿ò¼ÜÄܹ»¸ßЧµØÈ·±£ÈÎÎñ¼°Ê±·Ö·¢£¬µ«Êǵ±Åú´¦Àí¼ä¸ô·Ç³£Ð¡£¨500ms£©Ê±£¬Ìá½»ºÍ·Ö·¢ÈÎÎñµÄÑӳپͱäµÃ²»¿É½ÓÊÜÁË¡£Ê¹ÓÃStandaloneºÍCoarse-grained
Mesosģʽͨ³£»á±ÈʹÓÃFine-grained MesosģʽÓиüСµÄÑÓ³Ù¡£
2. ÓÅ»¯ÄÚ´æÊ¹ÓÃ
l¿ØÖÆbatch size£¨Åú´¦Àí¼ä¸ôÄÚµÄÊý¾ÝÁ¿£© Spark Streaming»á°ÑÅú´¦Àí¼ä¸ôÄÚ½ÓÊÕµ½µÄËùÓÐÊý¾Ý´æ·ÅÔÚSparkÄÚ²¿µÄ¿ÉÓÃÄÚ´æÇøÓòÖУ¬Òò´Ë±ØÐëÈ·±£µ±Ç°½ÚµãSparkµÄ¿ÉÓÃÄÚ´æÖÐÉÙÄÜÈÝÄÉÕâ¸öÅú´¦Àíʱ¼ä¼ä¸ôÄÚµÄËùÓÐÊý¾Ý£¬·ñÔò±ØÐëÔö¼ÓеÄ×ÊÔ´ÒÔÌá¸ß¼¯ÈºµÄ´¦ÀíÄÜÁ¦£»
l¼°Ê±ÇåÀí²»ÔÙʹÓõÄÊý¾Ý Ç°Ãæ½²µ½Spark Streaming»á½«½ÓÊܵÄÊý¾ÝÈ«²¿´æ´¢µ½ÄÚ²¿¿ÉÓÃÄÚ´æÇøÓòÖУ¬Òò´Ë¶ÔÓÚ´¦Àí¹ýµÄ²»ÔÙÐèÒªµÄÊý¾ÝÓ¦¼°Ê±ÇåÀí£¬ÒÔÈ·±£Spark
StreamingÓи»ÓàµÄ¿ÉÓÃÄÚ´æ¿Õ¼ä¡£Í¨¹ýÉèÖúÏÀíµÄspark.cleaner.ttlʱ³¤À´¼°Ê±ÇåÀí³¬Ê±µÄÎÞÓÃÊý¾Ý£¬Õâ¸ö²ÎÊýÐèҪСÐÄÉèÖÃÒÔÃâºóÐø²Ù×÷ÖÐËùÐèÒªµÄÊý¾Ý±»³¬Ê±´íÎó´¦Àí£»
l¹Û²ì¼°Êʵ±µ÷ÕûGC²ßÂÔ GC»áÓ°ÏìJobµÄÕý³£ÔËÐУ¬¿ÉÄÜÑÓ³¤JobµÄÖ´ÐÐʱ¼ä£¬ÒýÆðһϵÁв»¿ÉÔ¤ÁϵÄÎÊÌâ¡£¹Û²ìGCµÄÔËÐÐÇé¿ö£¬²ÉÓò»Í¬µÄGC²ßÂÔÒÔ½øÒ»²½¼õСÄÚ´æ»ØÊÕ¶ÔJobÔËÐеÄÓ°Ïì¡£ |