±à¼ÍƼö: |
±¾ÎÄÊ×ÏȽéÉÜ´«Í³ÏµÍ³µÄÎÊÌâ¡¢Êý¾ÝϵͳµÄ¸ÅÄî¡¢Lambda¼Ü¹¹µÄServing
Layer£¬Speed LayerµÈÏà¹ØÄÚÈÝ£¬Ï£Íû¶Ô´ó¼ÒÓÐËù°ïÖú¡£
±¾ÎÄÀ´×Ôcsdn£¬ÓÉ»ðÁú¹ûÈí¼þAnna±à¼¡¢ÍƼö¡£ |
|
Nathan MarzµÄ´ó×÷Big Data: Principles
and best practices of scalable real-time data systems½éÉÜÁËLabmda
ArchitectureµÄ¸ÅÄÓÃÓÚÔÚ´óÊý¾Ý¼Ü¹¹ÖУ¬ÈçºÎÈÃreal-timeÓëbatch job¸üºÃµØ½áºÏÆðÀ´£¬ÒÔ´ï³É¶Ô´óÊý¾ÝµÄʵʱ´¦Àí¡£
´«Í³ÏµÍ³µÄÎÊÌâ
ÔÚ´«Í³Êý¾Ý¿âµÄÉè¼ÆÖУ¬ÎÞ·¨ºÜºÃµØÖ§³ÖϵͳµÄ¿ÉÉìËõÐÔ¡£µ±Óû§·ÃÎÊÁ¿Ôö¼Óʱ£¬Êý¾Ý¿âÎÞ·¨Âú×ãÈÕÒæÔö³¤µÄÓû§ÇëÇó¸ºÔØ£¬´Ó¶øµ¼ÖÂÊý¾Ý¿â·þÎñÆ÷ÎÞ·¨¼°Ê±ÏìÓ¦Óû§ÇëÇ󣬳öÏÖ³¬Ê±´íÎó¡£
½â¾öµÄ°ì·¨ÊÇÔÚWeb·þÎñÆ÷ÓëÊý¾Ý¿âÖ®¼äÔö¼ÓÒ»¸öÒì²½´¦ÀíµÄ¶ÓÁС£ÈçÏÂͼËùʾ£º

ÒýÈë¶ÓÁÐ
µ±Web ServerÊÕµ½Ò³ÃæÇëÇóʱ£¬»á½«ÏûÏ¢Ìí¼Óµ½¶ÓÁÐÖС£ÔÚDB¶Ë£¬´´½¨Ò»¸öWorker¶¨ÆÚ´Ó¶ÓÁÐÖÐÈ¡³öÏûÏ¢½øÐд¦Àí£¬ÀýÈçÿ´Î¶ÁÈ¡100ÌõÏûÏ¢¡£ÕâÏ൱ÓÚÔÚÁ½ÕßÖ®¼ä½¨Á¢ÁËÒ»¸ö»º³å¡£
µ«ÊÇ£¬ÕâÒ»·½°¸²¢Ã»Óдӱ¾ÖÊÉϽâ¾öÊý¾Ý¿âoverloadµÄÎÊÌ⣬ÇÒµ±workerÎÞ·¨¸úÉÏwriterµÄÇëÇóʱ£¬¾ÍÐèÒªÔö¼Ó¶à¸öworker²¢·¢Ö´ÐУ¬Êý¾Ý¿âÓÖ½«ÔٴγÉΪÏìÓ¦ÇëÇóµÄÆ¿¾±¡£Ò»¸ö½â¾ö°ì·¨ÊǶÔÊý¾Ý¿â½øÐзÖÇø£¨horizontal
partitioning»òÕßsharding£©¡£·ÖÇøµÄ·½Ê½Í¨³£ÒÔHashÖµ×÷Ϊkey¡£ÕâÑù¾ÍÐèÒªÓ¦ÓóÌÐò¶ËÖªµÀÈçºÎȥѰÕÒÿ¸ökeyËùÔڵķÖÇø¡£
ÎÊÌâÈÔÈ»»áËæ×ÅÓû§ÇëÇóµÄÔö¼Ó½Óõà¶øÀ´¡£µ±Ö®Ç°µÄ·ÖÇøÎÞ·¨Âú×ã¸ºÔØÊ±£¬¾ÍÐèÒªÔö¼Ó¸ü¶à·ÖÇø£¬Õâʱ¾ÍÐèÒª¶ÔÊý¾Ý¿â½øÐÐreshard¡£reshardingµÄ¹¤×÷·Ç³£ºÄʱ¶øÍ´¿à£¬ÒòΪÐèҪе÷ºÜ¶à¹¤×÷£¬ÀýÈçÊý¾ÝµÄÇ¨ÒÆ¡¢¸üпͻ§¶Ë·ÃÎʵķÖÇøµØÖ·£¬¸üÐÂÓ¦ÓóÌÐò´úÂë¡£Èç¹ûϵͳ±¾Éí»¹ÌṩÁËÔÚÏß·ÃÎÊ·þÎñ£¬¶ÔÔËάµÄÒªÇó¾Í¸ü¸ß¡£ÉÔÓв»É÷£¬¾Í¿ÉÄܵ¼ÖÂÊý¾Ýдµ½´íÎóµÄ·ÖÇø£¬Òò´Ë±ØÐëÒª±àд½Å±¾À´×Ô¶¯Íê³É£¬ÇÒÐèÒª³ä·ÖµÄ²âÊÔ¡£
¼´Ê¹·ÖÇøÄܹ»½â¾öÊý¾Ý¿â¸ºÔØÎÊÌ⣬ȴ»¹´æÔÚÈÝ´íÐÔ£¨Fault-Tolerance£©µÄÎÊÌâ¡£½â¾ö°ì·¨£º
¸Ä±äqueue/workerµÄʵÏÖ¡£µ±ÏûÏ¢·¢Ë͸ø²»¿ÉÓõķÖÇøÊ±£¬½«ÏûÏ¢·Åµ½¡°pending¡±¶ÓÁУ¬È»ºóÿ¸ôÒ»¶Îʱ¼ä¶Ôpending¶ÓÁÐÖеÄÏûÏ¢½øÐд¦Àí¡£
ʹÓÃÊý¾Ý¿âµÄreplication¹¦ÄÜ£¬ÎªÃ¿¸ö·ÖÇøÔö¼Óslave¡£
ÎÊÌⲢûÓеõ½ÍêÃÀµØ½â¾ö¡£¼ÙÉèϵͳ³öÏÖÎÊÌ⣬ÀýÈçÔÚÓ¦ÓÃϵͳ´úÂë¶Ë²»Ð¡ÐÄÒýÈëÁËÒ»¸öbug£¬Ê¹µÃ¶ÔÒ³ÃæµÄÇëÇóÖØ¸´Ìá½»ÁËÒ»´Î£¬Õâ¾Íµ¼ÖÂÁËÖØ¸´µÄÇëÇóÊý¾Ý¡£Ôã¸âµÄÊÇ£¬Ö±µ½24Сʱ֮ºó²Å·¢ÏÖÁ˸ÃÎÊÌ⣬´Ëʱ¶ÔÊý¾ÝµÄÆÆ»µÒѾÔì³ÉÁË¡£¼´Ê¹Ã¿ÖܵÄÊý¾Ý±¸·ÝÒ²ÎÞ·¨½â¾ö´ËÎÊÌ⣬ÒòΪËü²»ÖªµÀµ½µ×ÊÇÄÄЩÊý¾ÝÊܵ½ÁËÆÆ»µ£¨corrupiton£©¡£ÓÉÓÚÈËΪ´íÎó×ÜÊDz»¿É±ÜÃâµÄ£¬ÎÒÃÇÔڼܹ¹Ê±Ó¦¸ÃÈçºÎ¹æ±Ü´ËÎÊÌ⣿
ÏÖÔÚ£¬¼Ü¹¹±äµÃÔ½À´Ô½¸´ÔÓ£¬Ôö¼ÓÁ˶ÓÁС¢·ÖÇø¡¢¸´ÖÆ¡¢ÖØ·ÖÇø½Å±¾£¨resharding scripts£©¡£Ó¦ÓóÌÐò»¹ÐèÒªÁ˽âÊý¾Ý¿âµÄschema£¬²¢ÄÜ·ÃÎʵ½ÕýÈ·µÄ·ÖÇø¡£ÎÊÌâÔÚÓÚ£ºÊý¾Ý¿â¶ÔÓÚ·ÖÇøÊDz»Á˽âµÄ£¬ÎÞ·¨°ïÖúÄãÓ¦¶Ô·ÖÇø¡¢¸´ÖÆÓë·Ö²¼Ê½²éѯ¡£×îÔã¸âµÄÎÊÌâÊÇϵͳ²¢Ã»ÓÐΪÈËΪ´íÎó½øÐй¤³ÌÉè¼Æ£¬½ö¿¿±¸·ÝÊDz»ÄÜÖα¾µÄ¡£¹é¸ù½áµ×£¬ÏµÍ³»¹ÐèÒªÏÞÖÆÒòΪÈËΪ´íÎóµ¼ÖÂµÄÆÆ»µ¡£
Êý¾ÝϵͳµÄ¸ÅÄî
´óÊý¾Ý´¦Àí¼¼ÊõÐèÒª½â¾öÕâÖÖ¿ÉÉìËõÐÔÓ븴ÔÓÐÔ¡£Ê×ÏÈÒªÈÏʶµ½ÕâÖÖ·Ö²¼Ê½µÄ±¾ÖÊ£¬ÒªºÜºÃµØ´¦Àí·ÖÇøÓë¸´ÖÆ£¬²»»áµ¼Ö´íÎó·ÖÇøÒýÆð²éѯʧ°Ü£¬¶øÊÇÒª½«ÕâЩÂß¼ÄÚ»¯µ½Êý¾Ý¿âÖС£µ±ÐèÒªÀ©Õ¹ÏµÍ³Ê±£¬¿ÉÒԷdz£·½±ãµØÔö¼Ó½Úµã£¬ÏµÍ³Ò²Äܹ»Õë¶ÔÐÂ½Úµã½øÐÐrebalance¡£
Æä´ÎÊÇÒªÈÃÊý¾Ý³ÉΪ²»¿É±äµÄ¡£ÔʼÊý¾ÝÓÀÔ¶¶¼²»Äܱ»Ð޸ģ¬ÕâÑù¼´Ê¹·¸ÁË´íÎó£¬Ð´ÁË´íÎóÊý¾Ý£¬ÔÀ´ºÃµÄÊý¾Ý²¢²»»áÊܵ½ÆÆ»µ¡£
ºÎν¡°Êý¾Ýϵͳ¡±£¿Nathan MarzÈÏΪ£º
Èç¹ûÊý¾Ýϵͳͨ¹ý²éÕÒ¹ýÈ¥µÄÊý¾ÝÈ¥»Ø´ðÎÊÌ⣬Ôòͨ³£ÐèÒª·ÃÎÊÕû¸öÊý¾Ý¼¯¡£
Òò´Ë¿ÉÒÔ¸ødata systemµÄ×îͨÓõ͍Ò壺
Query = function(all
data) |
½ÓÏÂÀ´£¬±¾Êé×÷Õß½éÉÜÁËBig Data SystemËùÐè¾ß±¸µÄÊôÐÔ£º
½¡×³ÐÔºÍÈÝ´íÐÔ£¨RobustnessºÍFault Tolerance£©
µÍÑӳٵĶÁÓë¸üУ¨Low Latency reads and updates£©
¿ÉÉìËõÐÔ£¨Scalability£©
ͨÓÃÐÔ£¨Generalization£©
¿ÉÀ©Õ¹ÐÔ£¨Extensibility£©
ÄÚÖòéѯ£¨Ad hoc queries£©
ά»¤×îС£¨Minimal maintenance£©
¿Éµ÷ÊÔÐÔ£¨Debuggability£©
Lambda¼Ü¹¹
Lambda¼Ü¹¹µÄÖ÷Ҫ˼Ïë¾ÍÊǽ«´óÊý¾Ýϵͳ¹¹½¨Îª¶à¸ö²ã´Î£¬ÈçÏÂͼËùʾ£º

lambda layer
ÀíÏë״̬Ï£¬ÈκÎÊý¾Ý·ÃÎʶ¼¿ÉÒÔ´Ó±í´ïʽQuery = function(all data)¿ªÊ¼£¬µ«ÊÇ£¬ÈôÊý¾Ý´ïµ½Ï൱´óµÄÒ»¸ö¼¶±ð£¨ÀýÈçPB£©£¬ÇÒ»¹ÐèÒªÖ§³Öʵʱ²éѯʱ£¬¾ÍÐèÒªºÄ·Ñ·Ç³£ÅÓ´óµÄ×ÊÔ´¡£
Ò»¸ö½â¾ö·½Ê½ÊÇÔ¤ÔËËã²éѯº¯Êý£¨precomputed query
funciton£©¡£ÊéÖн«ÕâÖÖÔ¤ÔËËã²éѯº¯Êý³ÆÖ®ÎªBatch View£¬ÕâÑùµ±ÐèÒªÖ´Ðвéѯʱ£¬¿ÉÒÔ´ÓBatch
ViewÖжÁÈ¡½á¹û¡£ÕâÑùÒ»¸öÔ¤ÏÈÔËËãºÃµÄViewÊÇ¿ÉÒÔ½¨Á¢Ë÷ÒýµÄ£¬Òò¶ø¿ÉÒÔÖ§³ÖËæ»ú¶ÁÈ¡¡£ÓÚÊÇϵͳ¾Í±ä³É£º
batch view =
function(all data)
query = function(batch view) |
Batch Layer
ÔÚLambda¼Ü¹¹ÖУ¬ÊµÏÖbatch view = function(all data)µÄ²¿·Ö±»³ÆÖ®Îªbatch
layer¡£Ëü³Ðµ£ÁËÁ½¸öÖ°Ôð£º
´æ´¢Master Dataset£¬ÕâÊÇÒ»¸ö²»±äµÄ³ÖÐøÔö³¤µÄÊý¾Ý¼¯
Õë¶ÔÕâ¸öMaster Dataset½øÐÐÔ¤ÔËËã
ÏÔÈ»£¬Batch LayerÖ´ÐеÄÊÇÅúÁ¿´¦Àí£¬ÀýÈçHadoop»òÕßSparkÖ§³ÖµÄMap-Reduce·½Ê½¡£
ËüµÄÖ´Ðз½Ê½¿ÉÒÔÓÃÒ»¶Îα´úÂëÀ´±íʾ£º
function runBatchLayer():
while (true):
recomputeBatchViews() |
ÀýÈçÕâÑùÒ»¶Î´úÂ룺
Api.execute(Api.hfsSeqfile ("/tmp/pageview-counts"),
new Subquery ("?url", "?count")
.predicate (Api.hfsSeqfile("/data/pageviews"),
"?url", "?user", "?timestamp")
.predicate(new Count(), "?count"); |
´úÂë²¢ÐеضÔhdfsÎļþ¼ÐϵÄpage views½øÐÐͳ¼Æ£¨count£©£¬ºÏ²¢½á¹û£¬²¢½«×îÖÕ½á¹û±£´æÔÚpageview-countsÎļþ¼ÐÏ¡£
ÀûÓÃBatch Layer½øÐÐÔ¤ÔËËãµÄ×÷ÓÃʵ¼ÊÉϾÍÊǽ«´óÊý¾Ý±äС£¬´Ó¶øÓÐЧµØÀûÓÃ×ÊÔ´£¬¸ÄÉÆÊµÊ±²éѯµÄÐÔÄÜ¡£µ«ÕâÀïÓÐÒ»¸öǰÌᣬ¾ÍÊÇÎÒÃÇÐèÒªÔ¤ÏÈÖªµÀ²éѯÐèÒªµÄÊý¾Ý£¬Èç´Ë²ÅÄÜÔÚBatch
LayerÖа²ÅÅÖ´Ðмƻ®£¬¶¨ÆÚ¶ÔÊý¾Ý½øÐÐÅúÁ¿´¦Àí¡£´ËÍ⣬»¹ÒªÇóÕâЩԤÔËËãµÄͳ¼ÆÊý¾ÝÊÇÖ§³ÖºÏ²¢£¨merge£©µÄ¡£
Serving Layer
Batch Layerͨ¹ý¶Ômaster datasetÖ´Ðвéѯ»ñµÃÁËbatch view£¬¶øServing
Layer¾ÍÒª¸ºÔð¶Ôbatch view½øÐвÙ×÷£¬´Ó¶øÎª×îÖÕµÄʵʱ²éѯÌṩ֧³Å¡£Òò´ËServing
LayerµÄÖ°Ôð°üº¬£º
¶Ôbatch viewµÄËæ»ú·ÃÎÊ
¸üÐÂbatch view
Serving LayerÓ¦¸ÃÊÇÒ»¸öרÓõķֲ¼Ê½Êý¾Ý¿â£¬ÀýÈçElephant DB£¬ÒÔÖ§³Ö¶Ôbatch
viewµÄ¼ÓÔØ¡¢Ëæ»ú¶ÁÈ¡ÒÔ¼°¸üС£×¢Ò⣬Ëü²¢²»Ö§³Ö¶Ôbatch viewµÄËæ»úд£¬ÒòÎªËæ»úд»áΪÊý¾Ý¿âÒýÀ´Ðí¶à¸´ÔÓÐÔ¡£¼òµ¥µÄÌØÐÔ²ÅÄÜʹϵͳ±äµÃ¸ü½¡×³¡¢¿ÉÔ¤²â¡¢Ò×ÅäÖã¬Ò²Ò×ÓÚÔËά¡£
Speed Layer
Ö»Òªbatch layerÍê³É¶Ôbatch viewµÄÔ¤¼ÆË㣬serving layer¾Í»á¶ÔÆä½øÐиüС£ÕâÒâζ×ÅÔÚÔËÐÐÔ¤¼ÆËãʱ½øÈëµÄÊý¾Ý²»»áÂíÉϳÊÏÖµ½batch
viewÖС£Õâ¶ÔÓÚÒªÇóÍêȫʵʱµÄÊý¾Ýϵͳ¶øÑÔÊDz»ÄܽÓÊܵġ£Òª½â¾öÕâ¸öÎÊÌ⣬¾ÍҪͨ¹ýspeed layer¡£´Ó¶ÔÊý¾ÝµÄ´¦ÀíÀ´¿´£¬speed
layerÓëbatch layer·Ç³£ÏàËÆ£¬ËüÃÇÖ®¼ä×î´óµÄÇø±ðÊÇǰÕßÖ»´¦Àí×î½üµÄÊý¾Ý£¬ºóÕßÔòÒª´¦ÀíËùÓеÄÊý¾Ý¡£ÁíÒ»¸öÇø±ðÊÇΪÁËÂú×ã×îСµÄÑÓ³Ù£¬speed
layer²¢²»»áÔÚͬһʱ¼ä¶ÁÈ¡ËùÓеÄÐÂÊý¾Ý£¬Ïà·´£¬Ëü»áÔÚ½ÓÊÕµ½ÐÂÊý¾Ýʱ£¬¸üÐÂrealtime view£¬¶ø²»»áÏñbatch
layerÄÇÑùÖØÐÂÔËËãÕû¸öview¡£speed layerÊÇÒ»ÖÖÔöÁ¿µÄ¼ÆË㣬¶ø·ÇÖØÐÂÔËË㣨recomputation£©¡£
Òò¶ø£¬Speed LayerµÄ×÷ÓðüÀ¨£º
¶Ô¸üе½serving layer´øÀ´µÄ¸ßÑÓ³ÙµÄÒ»ÖÖ²¹³ä
¿ìËÙ¡¢ÔöÁ¿µÄËã·¨
×îÖÕBatch Layer»á¸²¸Çspeed layer
Speed LayerµÄµÈʽ±í´ïÈçÏÂËùʾ£º
realtime view
= function(realtime view, new data) |
×¢Ò⣬realtime viewÊÇ»ùÓÚÐÂÊý¾ÝºÍÒÑÓеÄrealtime view¡£
×ܽáÏÂÀ´£¬Lambda¼Ü¹¹¾ÍÊÇÈçϵÄÈý¸öµÈʽ£º
batch view =
function(all data)
realtime view = function(realtime view, new data)
query = function(batch view . realtime view) |
Õû¸öLambda¼Ü¹¹ÈçÏÂͼËùʾ£º

lambda architecture
»ùÓÚLambda¼Ü¹¹£¬Ò»µ©Êý¾Ýͨ¹ýbatch layer½øÈëµ½serving layer£¬ÔÚrealtime
viewÖеÄÏàÓ¦½á¹û¾Í²»ÔÙÐèÒªÁË¡£
|