Spark and Spark Streaming: Core Principles and Practice
 
Author: Jiang Zhuan
 
2020-9-17 
 
±à¼­ÍƼö:
±¾ÎÄÖ÷Òª½éÉÜÁËspark Éú̬¼°ÔËÐÐÔ­Àí¡¢sparkÔËÐмܹ¹¡¢SparkºËÐĸÅÄîÖ®RDD¡¢SparkºËÐĸÅÄîÖ®Jobs / Stage¡¢Spark StreamingÔËÐÐÔ­Àí¡¢Spark ×ÊÔ´µ÷ÓŵÈÄÚÈÝ
±¾ÎÄÀ´×Ô²©¿ÍÔ°£¬ÓÉ»ðÁú¹ûÈí¼þAnna±à¼­¡¢ÍƼö¡£

Introduction

Spark has become the system of choice for big data computing scenarios such as advertising, reporting, and recommendation systems; its efficiency, ease of use, and generality are winning it more and more favor. Over the past six months of working with Spark and Spark Streaming I have accumulated some experience and lessons about using the technology, which I share here.

This article covers, in turn, the Spark ecosystem, its operating principles, its basic concepts, the principles and practice of Spark Streaming, and Spark tuning and environment setup. I hope it is helpful.

Spark ecosystem and operating principles

Spark features

Fast execution: Spark has a DAG execution engine and supports iterative computation on data held in memory. Official figures show that when data is read from disk Spark is more than 10x faster than Hadoop MapReduce, and when data is read from memory it can be over 100x faster.

Broad applicability: big data analytics and statistics, real-time data processing, graph computation, and machine learning.

Ease of use: programs are simple to write, more than 80 high-level operators are supported, multiple languages are supported, data sources are plentiful, and Spark can be deployed on several kinds of clusters.

High fault tolerance: Spark introduces the Resilient Distributed Dataset (RDD) abstraction, a read-only collection of objects distributed across a set of nodes. These collections are resilient: if part of a dataset is lost, it can be rebuilt from its "lineage" (the recorded derivation of the data). In addition, fault tolerance during RDD computation can be provided through checkpointing, which comes in two forms, CheckPoint Data and Logging The Updates, and the user can choose which one to use.

Spark's applicable scenarios

Current big data processing scenarios fall into the following types:

Complex batch processing (Batch Data Processing), where the emphasis is on handling massive volumes of data; processing speed only needs to be tolerable, typically tens of minutes to several hours;

Interactive queries over historical data (Interactive Query), typically taking tens of seconds to tens of minutes;

Processing of real-time data streams (Streaming Data Processing), typically taking hundreds of milliseconds to a few seconds.

Spark success stories

At Internet companies, big data is currently applied mainly to advertising, reporting, and recommendation systems. Advertising needs big data for application analysis, effectiveness analysis, and targeting optimization; recommendation systems need it to optimize ranking, personalized recommendation, and hot-click analysis. What these scenarios have in common is a heavy computational load and high efficiency requirements.

Tencent / Yahoo / Taobao / Youku Tudou

sparkÔËÐмܹ¹

spark»ù´¡ÔËÐмܹ¹ÈçÏÂËùʾ£º

spark½áºÏyarn¼¯Èº±³ºóµÄÔËÐÐÁ÷³ÌÈçÏÂËùʾ£º

spark ÔËÐÐÁ÷³Ì£º

Spark¼Ü¹¹²ÉÓÃÁË·Ö²¼Ê½¼ÆËãÖеÄMaster-SlaveÄ£ÐÍ¡£MasterÊǶÔÓ¦¼¯ÈºÖеĺ¬ÓÐMaster½ø³ÌµÄ½Úµã£¬SlaveÊǼ¯ÈºÖк¬ÓÐWorker½ø³ÌµÄ½Úµã¡£

Master×÷ΪÕû¸ö¼¯ÈºµÄ¿ØÖÆÆ÷£¬¸ºÔðÕû¸ö¼¯ÈºµÄÕý³£ÔËÐУ»

WorkerÏ൱ÓÚ¼ÆËã½Úµã£¬½ÓÊÕÖ÷½ÚµãÃüÁîÓë½øÐÐ״̬»ã±¨£»

Executor¸ºÔðÈÎÎñµÄÖ´ÐУ»

Client×÷ΪÓû§µÄ¿Í»§¶Ë¸ºÔðÌá½»Ó¦Óã»

Driver¸ºÔð¿ØÖÆÒ»¸öÓ¦ÓõÄÖ´ÐС£

Spark¼¯Èº²¿Êðºó£¬ÐèÒªÔÚÖ÷½ÚµãºÍ´Ó½Úµã·Ö±ðÆô¶¯Master½ø³ÌºÍWorker½ø³Ì£¬¶ÔÕû¸ö¼¯Èº½øÐпØÖÆ¡£ÔÚÒ»¸öSparkÓ¦ÓõÄÖ´Ðйý³ÌÖУ¬DriverºÍWorkerÊÇÁ½¸öÖØÒª½ÇÉ«¡£Driver ³ÌÐòÊÇÓ¦ÓÃÂß¼­Ö´ÐÐµÄÆðµã£¬¸ºÔð×÷ÒµµÄµ÷¶È£¬¼´TaskÈÎÎñµÄ·Ö·¢£¬¶ø¶à¸öWorkerÓÃÀ´¹ÜÀí¼ÆËã½ÚµãºÍ´´½¨Executor²¢Ðд¦ÀíÈÎÎñ¡£ÔÚÖ´Ðн׶Σ¬Driver»á½«TaskºÍTaskËùÒÀÀµµÄfileºÍjarÐòÁл¯ºó´«µÝ¸ø¶ÔÓ¦µÄWorker»úÆ÷£¬Í¬Ê±Executor¶ÔÏàÓ¦Êý¾Ý·ÖÇøµÄÈÎÎñ½øÐд¦Àí¡£

Excecutor /Task ÿ¸ö³ÌÐò×ÔÓУ¬²»Í¬³ÌÐò»¥Ïà¸ôÀ룬task¶àÏ̲߳¢ÐÐ

¼¯Èº¶ÔSpark͸Ã÷£¬SparkÖ»ÒªÄÜ»ñÈ¡Ïà¹Ø½ÚµãºÍ½ø³Ì

Driver ÓëExecutor±£³ÖͨÐÅ£¬Ð­×÷´¦Àí

Three cluster modes:

1. Standalone, Spark's own independent cluster

2. Mesos, Apache Mesos

3. YARN, Hadoop YARN
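
As an illustration (not from the original article), a minimal sketch of how an application points at each of these cluster managers through its SparkConf; the master URLs and host names below are placeholders, and the "yarn-client" / "yarn-cluster" values apply to the Spark 1.x era this article describes:

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder master URLs; substitute your own cluster addresses.
val standaloneConf = new SparkConf().setAppName("demo").setMaster("spark://master-host:7077")
val mesosConf = new SparkConf().setAppName("demo").setMaster("mesos://mesos-host:5050")
val yarnConf = new SparkConf().setAppName("demo").setMaster("yarn-client")  // or "yarn-cluster"

val sc = new SparkContext(standaloneConf)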

Basic concepts:

Application ⇨ a Spark application, consisting of one Driver program and several Executors

SparkContext ⇨ the entry point of a Spark application; it schedules the computing resources and coordinates the Executors on the Worker Nodes

Driver Program ⇨ runs the Application's main() function and creates the SparkContext

Executor ⇨ a process running on a Worker node on behalf of an Application; it runs Tasks and keeps data in memory or on disk. Each Application requests its own Executors to handle its tasks

Cluster Manager ⇨ an external service for acquiring resources on the cluster (e.g. Standalone, Mesos, YARN)

Worker Node ⇨ any node in the cluster that can run Application code; it runs one or more Executor processes

Task ⇨ a unit of work that runs on an Executor

Job ⇨ the concrete computation submitted through the SparkContext; it usually corresponds to one Action

Stage ⇨ each Job is split into several groups of tasks; each group is called a Stage, also known as a TaskSet

RDD ⇨ short for Resilient Distributed Dataset; it is Spark's most central module and class

DAGScheduler ⇨ builds a Stage-based DAG from a Job and submits the Stages to the TaskScheduler

TaskScheduler ⇨ submits TaskSets to the Worker node cluster for execution and returns the results

Transformations ⇨ one kind of Spark API operation; a Transformation returns another RDD. All Transformations are lazy: merely submitting a Transformation does not trigger any computation

Action ⇨ the other kind of Spark API operation; an Action returns not an RDD but a value such as a Scala collection. Computation is triggered only when an Action is submitted.

SparkºËÐĸÅÄîÖ®RDD

SparkºËÐĸÅÄîÖ®Transformations / Actions

Transformation·µ»ØÖµ»¹ÊÇÒ»¸öRDD¡£ËüʹÓÃÁËÁ´Ê½µ÷ÓõÄÉè¼ÆÄ£Ê½£¬¶ÔÒ»¸öRDD½øÐмÆËãºó£¬±ä»»³ÉÁíÍâÒ»¸öRDD£¬È»ºóÕâ¸öRDDÓÖ¿ÉÒÔ½øÐÐÁíÍâÒ»´Îת»»¡£Õâ¸ö¹ý³ÌÊÇ·Ö²¼Ê½µÄ¡£ Action·µ»ØÖµ²»ÊÇÒ»¸öRDD¡£ËüҪôÊÇÒ»¸öScalaµÄÆÕͨ¼¯ºÏ£¬ÒªÃ´ÊÇÒ»¸öÖµ£¬ÒªÃ´Êǿգ¬×îÖÕ»ò·µ»Øµ½Driver³ÌÐò£¬»ò°ÑRDDдÈëµ½ÎļþϵͳÖС£

ActionÊÇ·µ»ØÖµ·µ»Ø¸ødriver»òÕß´æ´¢µ½Îļþ£¬ÊÇRDDµ½resultµÄ±ä»»£¬TransformationÊÇRDDµ½RDDµÄ±ä»»¡£

Ö»ÓÐactionÖ´ÐÐʱ£¬rdd²Å»á±»¼ÆËãÉú³É£¬ÕâÊÇrddÀÁ¶èÖ´Ðеĸù±¾ËùÔÚ¡£
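
A minimal sketch of this laziness (the variable names are illustrative and sc is an existing SparkContext): the map and filter calls below only record lineage, and nothing runs until the collect and count actions are invoked.

val nums = sc.parallelize(1 to 10)      // create an RDD from a local collection
val doubled = nums.map(_ * 2)           // transformation: returns a new RDD, nothing computed yet
val evens = doubled.filter(_ % 4 == 0)  // transformation: still only lineage is recorded
val result = evens.collect()            // action: triggers the computation, returns an Array[Int] to the driver
val n = evens.count()                   // action: triggers another pass over the same lineage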

SparkºËÐĸÅÄîÖ®Jobs / Stage

Job ⇨°üº¬¶à¸ötaskµÄ²¢ÐмÆË㣬һ¸öaction´¥·¢Ò»¸öjob

stage ⇨Ò»¸öjob»á±»²ðΪ¶à×étask£¬Ã¿×éÈÎÎñ³ÆÎªÒ»¸östage£¬ÒÔshuffle½øÐл®·Ö
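
As a sketch of how this plays out (the path and names are placeholders, sc is an existing SparkContext): the pipeline below produces one job when collect is called, and that job is split into two stages because reduceByKey requires a shuffle.

val words = sc.textFile("hdfs://namenode:8020/path/to/input")  // placeholder HDFS path
val pairs = words.flatMap(_.split(" ")).map((_, 1))            // narrow transformations stay in one stage
val counts = pairs.reduceByKey(_ + _)                          // the shuffle here marks the stage boundary
counts.collect()                                               // one action => one job, here with two stages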

SparkºËÐĸÅÄîÖ®Shuffle

ÒÔreduceByKeyΪÀý½âÊÍshuffle¹ý³Ì¡£

ÔÚûÓÐtaskµÄÎļþ·ÖƬºÏ²¢ÏµÄshuffle¹ý³ÌÈçÏ£º£¨spark.shuffle.consolidateFiles=false£©

fetch À´µÄÊý¾Ý´æ·Åµ½ÄÄÀ

¸Õ fetch À´µÄ FileSegment ´æ·ÅÔÚ softBuffer »º³åÇø£¬¾­¹ý´¦ÀíºóµÄÊý¾Ý·ÅÔÚÄÚ´æ + ´ÅÅÌÉÏ¡£ÕâÀïÎÒÃÇÖ÷ÒªÌÖÂÛ´¦ÀíºóµÄÊý¾Ý£¬¿ÉÒÔÁé»îÉèÖÃÕâЩÊý¾ÝÊÇ¡°Ö»ÓÃÄڴ桱»¹ÊÇ¡°Äڴ棫´ÅÅÌ¡±¡£Èç¹ûspark.shuffle.spill = false¾ÍÖ»ÓÃÄÚ´æ¡£ÓÉÓÚ²»ÒªÇóÊý¾ÝÓÐÐò£¬shuffle write µÄÈÎÎñºÜ¼òµ¥£º½«Êý¾Ý partition ºÃ£¬²¢³Ö¾Ã»¯¡£Ö®ËùÒÔÒª³Ö¾Ã»¯£¬Ò»·½ÃæÊÇÒª¼õÉÙÄÚ´æ´æ´¢¿Õ¼äѹÁ¦£¬ÁíÒ»·½ÃæÒ²ÊÇΪÁË fault-tolerance¡£

shuffleÖ®ËùÒÔÐèÒª°ÑÖмä½á¹û·Åµ½´ÅÅÌÎļþÖУ¬ÊÇÒòΪËäÈ»ÉÏÒ»Åútask½áÊøÁË£¬ÏÂÒ»Åútask»¹ÐèҪʹÓÃÄÚ´æ¡£Èç¹ûÈ«²¿·ÅÔÚÄÚ´æÖУ¬ÄÚ´æ»á²»¹»¡£ÁíÍâÒ»·½ÃæÎªÁËÈÝ´í£¬·ÀÖ¹ÈÎÎñ¹Òµô¡£

The problems with this are:

Too many FileSegments are produced. Each ShuffleMapTask produces R (the number of reducers) FileSegments, so M ShuffleMapTasks produce M * R files. In a typical Spark job both M and R are large, so the disk ends up holding a huge number of data files.

The buffers consume a lot of memory. Each ShuffleMapTask has to open R buckets, so M ShuffleMapTasks create M * R buckets. Although a ShuffleMapTask's buffers can be reclaimed when it finishes, the number of buckets that exist simultaneously on one worker node can reach cores * R (a worker normally runs cores ShuffleMapTasks at a time), so the memory used reaches cores * R * 32 KB. For 8 cores and 1000 reducers that is about 256 MB.

To fix these problems we can use file consolidation.

The shuffle process with consolidation of the tasks' file segments looks like this (spark.shuffle.consolidateFiles=true):

It is easy to see that ShuffleMapTasks that run consecutively on the same core can share one output file, the ShuffleFile. The ShuffleMapTask that finishes first forms ShuffleBlock i; a later ShuffleMapTask appends its output directly after ShuffleBlock i, forming ShuffleBlock i'. Each ShuffleBlock is called a FileSegment. A reducer in the next stage then only needs to fetch the whole ShuffleFile. This way, the number of files held by each worker drops to cores * R. The FileConsolidation feature is enabled with spark.shuffle.consolidateFiles=true.
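
A sketch of turning this on programmatically (the setting belongs to the hash-based shuffle of Spark 1.x, which this article describes; it was removed in later Spark versions):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("shuffle-consolidation-demo")
  .set("spark.shuffle.consolidateFiles", "true")  // let ShuffleMapTasks on one core share ShuffleFiles
val sc = new SparkContext(conf)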

SparkºËÐĸÅÄîÖ®Cache

val rdd1 = ... // ¶ÁÈ¡hdfsÊý¾Ý£¬¼ÓÔØ³ÉRDD

rdd1.cache

val rdd2 = rdd1.map(...)

val rdd3 = rdd1.filter(...)

rdd2.take(10).foreach(println)

rdd3.take(10).foreach(println)

rdd1.unpersist

cache and unpersist are two special operations: they are neither actions nor transformations. cache only marks an RDD as needing to be cached; the actual caching happens the first time a relevant action runs on it. unpersist removes that mark and frees the memory immediately. Only when an action executes does rdd1 actually get built and do the subsequent RDD transformations get computed.

cache is in fact implemented by calling the persist function, just with the persistence level fixed to MEMORY_ONLY.

The RDD persistence levels supported by persist are as follows:
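
A sketch of choosing a level explicitly (the path is a placeholder and sc is an existing SparkContext; StorageLevel values such as MEMORY_AND_DISK_SER are part of the standard Spark API):

import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("hdfs://namenode:8020/path/to/logs")
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)  // keep serialized partitions in memory, spill the rest to disk
rdd.count()                                    // the first action materializes and caches the data
rdd.unpersist()                                // drop the cached copies when no longer needed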

Things to watch out for:

When serializing in a cache or shuffle scenario, Spark's serialization does not support protobuf messages; it needs objects that are Java serializable. As soon as an object that is not Java serializable is involved in serialization, this kind of error occurs.

Whenever Spark writes to disk it uses serialization. Apart from the shuffle stage and persist, which do serialize, RDD processing otherwise stays in memory and serialization is not involved.

How Spark Streaming works

A Spark program uses one Spark application instance to process a batch of historical data in one go; Spark Streaming turns a continuously arriving data stream into a series of batch slices and processes them with a succession of Spark application instances.

In principle, what does Spark need to build to turn a traditional batch program into a streaming program?

It needs to build four things (a minimal code sketch follows this list):

A static RDD DAG template that represents the processing logic;

A dynamic job controller that slices the continuous streaming data into segments and, following the template, stamps out new RDD DAG instances to process each segment;

A Receiver that produces and imports the raw data; the Receiver merges the data it receives into blocks and stores them in memory or on disk for later consumption by the batch RDDs;

Guarantees for a long-running job, including rebuilding input data after it is lost and rescheduling processing tasks after they fail.
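
A minimal word-count sketch of these pieces (the host, port, and 10-second interval are placeholders): the flatMap/map/reduceByKey chain is the static DAG template, the StreamingContext is the controller that stamps out one instance of it per batch interval, and socketTextStream supplies the Receiver.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-demo")
val ssc = new StreamingContext(conf, Seconds(10))      // each 10-second slice becomes one batch RDD

val lines = ssc.socketTextStream("source-host", 9999)  // the Receiver ingests the raw stream
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)  // the DAG template
counts.print()                                         // instantiated and run once per batch interval

ssc.start()              // the controller starts slicing the stream and replaying the template
ssc.awaitTermination()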

For the detailed internals of streaming, see the source-code analysis series produced by the Guangdiantong (Tencent ads) team:

For Spark Streaming, pay attention to the following three points:

Try to keep the data on each worker node from spilling to disk, to improve execution efficiency.

Make sure each batch of data can be fully processed within the batch interval, to avoid a growing backlog.

Use the framework provided by Steven to pre-process data as it is received, cutting the storage and transfer of unnecessary data. Filter after receiving from TDBank and before dumping it, rather than waiting to filter until the tasks actually process the data.

Spark resource tuning

Memory management:

An Executor's memory is mainly divided into three parts:

The first part is used when tasks execute the code we wrote ourselves; by default it takes 20% of the Executor's total memory;

The second part is used when tasks, having pulled the output of the previous stage's tasks through the shuffle, perform aggregation and similar operations; by default it also takes 20% of the Executor's total memory;

The third part is used for RDD persistence; by default it takes 60% of the Executor's total memory.
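
A sketch of shifting these fractions (these settings belong to the legacy, pre-unified memory manager that this article describes; the values are illustrative only):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.storage.memoryFraction", "0.4")  // shrink the RDD cache region from its default of 0.6
  .set("spark.shuffle.memoryFraction", "0.3")  // grow the shuffle/aggregation region from its default of 0.2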

The memory taken by each task and by each executor deserves some analysis. Each task processes one partition of data, so if there are too few partitions, each one is large and memory can run out.

Other resource configuration:
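
As an illustrative sketch (not from the original article) of the usual knobs set alongside memory, on a SparkConf; the values are placeholders:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.instances", "50")    // number of executors to request (on YARN)
  .set("spark.executor.cores", "4")         // concurrent tasks per executor
  .set("spark.executor.memory", "8g")       // heap size per executor
  .set("spark.default.parallelism", "400")  // default number of partitions for shuffles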

 
   