Learning Big Data Frameworks: From Hadoop to Spark
 
 2018-6-14 
 
Editor's note:
This article comes from tencent.com. It introduces the basics of Hadoop and covers Spark, an in-memory computing framework.

Hadoop

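1. What is Hadoop

The Hadoop software library is a framework for the distributed processing of large data sets across clusters of computers using a simple programming model.

Features: low deployment cost, easy scaling, and a simple programming model.

Hadoop provides reliable, scalable distributed computing on industry-standard servers, so you can manage petabytes of data and more on a modest budget, without supercomputers or other expensive specialized hardware.

Hadoop can also scale from a single server to thousands of machines, and it detects and handles failures at the application layer, which improves reliability.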

2. Components of Hadoop (Hadoop 2.0)

1. Hadoop Common: The common utilities that support the other Hadoop modules.

2. Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.

3. Hadoop YARN: A framework for job scheduling and cluster resource management.

4. Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

The components we work with most often are HDFS, YARN, and MapReduce.

In practice: with HDFS, we access the cluster through a client; with YARN and MapReduce, we check the execution status of the jobs we have submitted.

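3. Hadoop architecture

The evolution of Hadoop: Hadoop 1.0 vs Hadoop 2.0

In the Hadoop 1.0 era, two core components of Hadoop, the HDFS NameNode and the JobTracker, were both single points of failure, and the NameNode was the more serious of the two. The NameNode stores the metadata of the entire HDFS, so once it goes down the whole file system becomes inaccessible. Every component of the Hadoop ecosystem that depends on HDFS, including MapReduce, Hive, Pig, and HBase, then stops working as well, and restarting the NameNode and recovering its data is slow.

These problems troubled Hadoop users and severely limited where Hadoop could be used. For a long time Hadoop served only for offline storage and offline computation, and could not be applied to online scenarios with strict availability and data consistency requirements.

In Hadoop 1.0, HDFS consists of one NameNode and multiple DataNodes, and MapReduce execution is handled by one JobTracker and multiple TaskTrackers.

To address the way a single NameNode limited HDFS scalability in Hadoop 1.0, Hadoop 2.0 introduced HDFS Federation, which lets multiple NameNodes manage different directories and so provides access isolation and horizontal scaling. Hadoop 2.0 also resolves the NameNode single point of failure (through NameNode high availability, a feature separate from Federation).

Hadoop 2.0 also splits the JobTracker's resource management and job control functions into two components, ResourceManager and ApplicationMaster. The ResourceManager allocates resources for all applications, while each ApplicationMaster manages exactly one application. The result is YARN, a new general-purpose resource management framework. On YARN, users can run many types of applications (no longer only MapReduce, as in 1.0), from offline computation with MapReduce to online (stream) processing with Storm. YARN is not tied to MapReduce; other frameworks such as Tez, Spark, and Storm can use it as well.

Figure: HDFS 1.0

Figure: HDFS 2.0

Figure: Hadoop 1.0

Figure: Hadoop 2.0

This figure shows the components of Hadoop together with some related projects, such as Hive, Pig, and Spark. Hive and Pig build on MapReduce; for example, Hive SQL is converted into MapReduce programs for execution. Spark has its own execution engine and shares the storage and resource layers (HDFS and YARN).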

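4. MapReduce principles and process

Traditional distributed programming (for example, MPI) is very complex. Users have to handle many details, such as data sharding, data transfer, and communication between nodes, so the barrier to writing distributed programs is high.

One of Hadoop's key design goals is to simplify distributed programming. The design details that every parallel program has to deal with are abstracted into common modules implemented by the system, and users only need to focus on their own application logic. This simplifies distributed programming and improves development efficiency.

A Map Task first iterates over its split and parses it into key/value pairs, calls the user-defined map() function on each pair in turn, and finally stores the intermediate results on local disk. The intermediate data is divided into several partitions, and each partition is processed by one Reduce Task.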
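The map-side steps above can be sketched in plain Scala, without Hadoop. This is a minimal single-process model, not Hadoop's API: `MapSideSketch`, `mapFn`, `partitionOf`, and `runMapTask` are hypothetical names, word counting stands in for the user-defined map(), and hashing the key modulo the number of reduce tasks is an assumed partitioning rule.

```scala
object MapSideSketch {
  // Stand-in for the user-defined map(): emit (word, 1) for every word in a line.
  def mapFn(line: String): Seq[(String, Int)] =
    line.split("\\s+").toSeq.filter(_.nonEmpty).map(word => (word, 1))

  // Assumed partitioning rule: non-negative hash of the key modulo the reducer count.
  def partitionOf(key: String, numReducers: Int): Int =
    ((key.hashCode % numReducers) + numReducers) % numReducers

  // One map task: parse the split into records, apply mapFn to each record,
  // then bucket the intermediate pairs by the reduce task that will read them.
  def runMapTask(split: Seq[String], numReducers: Int): Map[Int, Seq[(String, Int)]] =
    split.flatMap(mapFn).groupBy { case (key, _) => partitionOf(key, numReducers) }
}
```

Calling `MapSideSketch.runMapTask(Seq("a b a", "b c"), 2)` returns the intermediate pairs grouped by partition number. In real Hadoop these buckets would be spilled to local disk rather than kept in a Map.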

The Reduce Task execution process is shown in Figure 2-8. It has three phases: (1) reading the Map Task intermediate results from remote nodes (the "Shuffle phase"); (2) sorting the key/value pairs by key (the "Sort phase"); (3) reading each <key, value list> in turn, calling the user-defined reduce() function on it, and saving the final results to HDFS (the "Reduce phase").
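The three reduce-side phases can be modeled the same way in plain Scala. This is a sketch under stated assumptions, not Hadoop code: `ReduceSideSketch`, `reduceFn`, and `runReduceTask` are hypothetical names, summing stands in for the user-defined reduce(), and the map outputs are in-memory Maps instead of files fetched from remote nodes.

```scala
object ReduceSideSketch {
  // Stand-in for the user-defined reduce(): sum all values of one key.
  def reduceFn(key: String, values: Seq[Int]): (String, Int) = (key, values.sum)

  // One reduce task for the given partition.
  // mapOutputs(i) is the partitioned intermediate output of map task i.
  def runReduceTask(
      mapOutputs: Seq[Map[Int, Seq[(String, Int)]]],
      partition: Int): Seq[(String, Int)] = {
    // Shuffle phase: fetch this partition's pairs from every map task.
    val fetched = mapOutputs.flatMap(_.getOrElse(partition, Seq.empty[(String, Int)]))
    // Sort phase: order the pairs by key so that equal keys become adjacent.
    val sorted = fetched.sortBy(_._1)
    // Reduce phase: visit each key in order and reduce its list of values.
    val keysInOrder = sorted.map(_._1).distinct
    keysInOrder.map(k => reduceFn(k, sorted.filter(_._1 == k).map(_._2)))
  }
}
```

The returned sequence plays the role of the file a real Reduce Task would write to HDFS.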

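A typical workload needs several MapReduce jobs chained together for iterative computation (for example, Hive SQL). Both the map and the reduce phase write to disk, and the data passed between two MapReduce jobs also has to go through HDFS.

Job submission

Figure: Hadoop 1.0

Figure: YARN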

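5. How Hive HQL works

They came together because they chose each other, and then each made the other successful. But they were not made for each other from the start.

HiveQL is submitted through the CLI, the web UI, or an external interface such as Thrift, ODBC, or JDBC. The compiler uses the metadata in the Metastore for type checking and syntax analysis and generates a logical plan. After some simple optimization, it produces MapReduce tasks represented as a directed acyclic graph (DAG).

The whole compilation process has six stages:

1. Antlr defines the SQL grammar rules and performs lexical and syntax parsing, turning the SQL into an abstract syntax tree (AST).

2. The AST is traversed to extract the basic building block of a query, the QueryBlock.

3. The QueryBlock is traversed and translated into an execution operator tree, the OperatorTree.

4. The logical optimizer transforms the OperatorTree, merging unnecessary ReduceSinkOperators to reduce the amount of shuffled data.

5. The OperatorTree is traversed and translated into MapReduce tasks.

6. The physical optimizer transforms the MapReduce tasks and generates the final execution plan.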
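Stage 4 is the one that directly reduces shuffle volume, so it is worth a small illustration. The following is a toy model, not Hive's optimizer: the operator types, `mergeRedundantReduceSinks`, and `shuffleCount` are hypothetical, the plan is a flat chain rather than a tree, and the rule (drop a ReduceSink that shuffles on exactly the same keys as the previous one) is an assumed simplification of what "merging unnecessary ReduceSinkOperators" means.

```scala
object OperatorTreeSketch {
  sealed trait Op
  case class TableScan(table: String) extends Op
  case class Filter(condition: String) extends Op
  case class ReduceSink(keys: List[String]) extends Op // marks a shuffle
  case class GroupBy(keys: List[String]) extends Op
  case object FileSink extends Op

  // Drop a ReduceSink when the previous ReduceSink in the chain already
  // shuffled the rows on exactly the same keys.
  def mergeRedundantReduceSinks(plan: List[Op]): List[Op] = {
    val (kept, _) = plan.foldLeft((List.empty[Op], Option.empty[List[String]])) {
      case ((acc, lastKeys), rs @ ReduceSink(keys)) =>
        if (lastKeys.contains(keys)) (acc, lastKeys) else (rs :: acc, Some(keys))
      case ((acc, lastKeys), op) => (op :: acc, lastKeys)
    }
    kept.reverse
  }

  // Number of shuffles the plan would trigger.
  def shuffleCount(plan: List[Op]): Int = plan.count(_.isInstanceOf[ReduceSink])
}
```

For a chain that shuffles twice on the same key, the rewritten plan keeps a single ReduceSink, which is the effect the stage description refers to.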

Example: converting TDW Hive to MapReduce

When TDW Hive SQL is converted to MapReduce, you can first look at the SQL execution plan in the IDE. Each Stage consists of one MapReduce job, although a Stage may have no Reduce. Between two Stages, the output of the previous reduce is written to HDFS.

6. Tez, a DAG computation framework

When a computation needs several chained MapReduce jobs, every job has to read from and write to HDFS, which wastes disk and network I/O. Tez, as a DAG framework, can turn multiple dependent MapReduce jobs into a single job, which improves performance.
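The saving can be made concrete with a toy model that counts writes. This is not Tez or Hadoop code: `ChainVsDagSketch`, `Step`, `FakeHdfs`, `runAsSeparateJobs`, and `runAsOneDag` are hypothetical names, and each job is reduced to a pure function over an in-memory sequence.

```scala
object ChainVsDagSketch {
  // A job step is modeled as a pure function over in-memory records.
  type Step = Seq[Int] => Seq[Int]

  // Toy stand-in for HDFS that only counts how often data is written to it.
  final class FakeHdfs {
    var writes: Int = 0
    def write(data: Seq[Int]): Seq[Int] = { writes += 1; data }
  }

  // Chained MapReduce jobs: every job materializes its output to HDFS.
  def runAsSeparateJobs(input: Seq[Int], steps: Seq[Step], hdfs: FakeHdfs): Seq[Int] =
    steps.foldLeft(input)((data, step) => hdfs.write(step(data)))

  // One DAG job: intermediate results flow from step to step,
  // and only the final result is written.
  def runAsOneDag(input: Seq[Int], steps: Seq[Step], hdfs: FakeHdfs): Seq[Int] =
    hdfs.write(steps.foldLeft(input)((data, step) => step(data)))
}
```

Both functions return the same result for the same steps; they differ only in the `writes` counter, which grows with the number of steps in the chained version and stays at one in the DAG version.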

Spark: an in-memory computing framework

1. Core concept: RDD

A Resilient Distributed Dataset (RDD) is a fault-tolerant, parallel data structure that lets users explicitly persist data to disk or memory and control how the data is partitioned. RDDs also provide a rich set of operations for working with the data. All data processing in Spark revolves around RDDs.

An RDD can only be created through deterministic operations on data in stable storage or on other RDDs.

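// Path components are elided in the original; the glob matches the gz files under ds=20170101.
val hdfsURL = "hdfs://**/**/**/**/ds=20170101/*gz"
// textFile returns an RDD[String] with one element per line of the matched files.
val hdfsRdd = sparkSession.sparkContext.textFile(hdfsURL)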

2. RDD operations: transformations and actions

Transformations are lazily evaluated: a transformation that derives one RDD from another is not executed immediately. The computation is only triggered when an Action is invoked.

Actions: these operators cause SparkContext to submit a Job.
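The lazy/eager split can be demonstrated without a Spark cluster by using a standard Scala `Iterator`, whose `map` is also lazy. This is only an analogy, not Spark's implementation: `LazySketch`, `timesTwo`, `transformed`, and `collect` are hypothetical names, and the counter exists solely to show when work actually happens.

```scala
object LazySketch {
  // Counts how many times the transformation function actually runs.
  var evaluations: Int = 0

  def timesTwo(x: Int): Int = { evaluations += 1; x * 2 }

  // "Transformation": describes the computation but does not run it,
  // because Iterator.map is lazy.
  def transformed(data: Seq[Int]): Iterator[Int] = data.iterator.map(timesTwo)

  // "Action": consuming the iterator forces the pending computation.
  def collect(it: Iterator[Int]): List[Int] = it.toList
}
```

After `transformed` returns, `evaluations` is still unchanged; it only increases once `collect` consumes the iterator, which mirrors how an RDD lineage stays unevaluated until an action runs.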

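3. Execution process

Narrow and wide dependencies

First, narrow dependencies allow all parent partitions to be computed in a pipelined fashion on a single cluster node, for example applying a map and then a filter element by element. Wide dependencies require the data of all parent partitions to be computed first and then shuffled across nodes, much like MapReduce.

Second, narrow dependencies make recovery from a failed node more efficient: only the parent partitions of the lost RDD partitions need to be recomputed, and that recomputation can run in parallel on different nodes. In a lineage graph with wide dependencies, a single node failure can cause partitions to be lost from every ancestor of an RDD, which forces a complete recomputation.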
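The difference between the two dependency types can be sketched with partitions modeled as plain vectors. This is a single-process illustration, not Spark's scheduler: `DependencySketch`, `mapThenFilter`, `shuffleByParity`, and `recoverNarrow` are hypothetical names, and routing by parity stands in for partitioning by key.

```scala
object DependencySketch {
  type Partition = Vector[Int]

  // Narrow dependency: each output partition depends on exactly one parent partition,
  // so map and filter are pipelined element by element inside that partition.
  def mapThenFilter(parents: Vector[Partition]): Vector[Partition] =
    parents.map(p => p.iterator.map(_ * 10).filter(_ > 10).toVector)

  // Wide dependency: each output partition may need records from every parent,
  // so all parents must be available before records are redistributed (shuffled).
  def shuffleByParity(parents: Vector[Partition]): Vector[Partition] = {
    val all = parents.flatten
    Vector(all.filter(_ % 2 == 0), all.filter(_ % 2 != 0))
  }

  // Recovery under a narrow dependency: recompute only the lost partition
  // from its single parent, leaving the other partitions untouched.
  def recoverNarrow(parents: Vector[Partition], lost: Int): Partition =
    mapThenFilter(Vector(parents(lost))).head
}
```

Note that `recoverNarrow` reads one parent partition, whereas rebuilding any output of `shuffleByParity` needs all of them.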

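4. Where Spark is more efficient than MapReduce

In MapReduce, one Map and one Reduce make up a stage, although a stage may also have no reduce (for example, a simple query that involves no reduce).

Spark is similar: every wide dependency (which requires a shuffle, like Reduce) marks the boundary between two Stages.

Spark's notion of a stage is very close to that of Tez. Unlike MapReduce, Map and Reduce do not have to appear in pairs, so much more work can be done in the map phase, which cuts unnecessary network I/O and time spent writing to HDFS. Within a Stage, data is also processed in memory, which reduces local disk I/O.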
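The stage boundary rule can be written down as a small function over a linear lineage. This is a simplified sketch, not Spark's DAGScheduler: `StageSketch`, `Operation`, and `splitIntoStages` are hypothetical names, the lineage is assumed to be a straight chain, and a wide operation is assumed to begin the next stage as a whole (in real Spark the shuffle write belongs to the preceding stage).

```scala
object StageSketch {
  // An operation in a linear lineage; wide = true means it needs a shuffle.
  case class Operation(name: String, wide: Boolean)

  // Cut the lineage into stages: a wide operation starts a new stage,
  // narrow operations are pipelined into the current stage.
  def splitIntoStages(lineage: List[Operation]): List[List[String]] =
    lineage.foldLeft(List.empty[List[String]]) {
      case (Nil, op) => List(List(op.name))
      case (stages, op) if op.wide => stages :+ List(op.name)
      case (stages, op) => stages.init :+ (stages.last :+ op.name)
    }
}
```

A lineage such as textFile, map, filter, reduceByKey, map ends up as two stages, with the three narrow operations before the shuffle pipelined together.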

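5. DataSet: structured RDDs

In Spark, a DataFrame is a distributed data set built on RDDs, similar to a two-dimensional table in a traditional database. The main difference between a DataFrame and an RDD is that the former carries schema metadata: every column of the table that a DataFrame represents has a name and a type. This gives Spark SQL insight into the structure of the data, so it can apply targeted optimizations to the data sources behind a DataFrame and to the transformations applied to it, which substantially improves runtime efficiency. With an RDD, by contrast, Spark Core has no way of knowing the internal structure of the stored elements, so it can only perform simple, generic pipeline optimization at the stage level.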
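What schema metadata buys can be shown with a toy table type. This is not the Spark SQL API or its optimizer: `SchemaSketch`, `Column`, `Table`, and `select` are hypothetical names, and the only "optimization" modeled is checking and pruning columns by name before any row is touched.

```scala
object SchemaSketch {
  // Schema metadata: every column has a name and a type.
  case class Column(name: String, dataType: String)
  case class Table(schema: Vector[Column], rows: Vector[Vector[Any]])

  // Because column names are known up front, a projection can be validated
  // before any row is read, and only the requested columns are kept.
  def select(table: Table, names: String*): Either[String, Table] = {
    val indexByName = table.schema.map(_.name).zipWithIndex.toMap
    names.find(n => !indexByName.contains(n)) match {
      case Some(missing) => Left(s"unknown column: $missing")
      case None =>
        val idx = names.map(indexByName).toVector
        Right(Table(idx.map(table.schema), table.rows.map(row => idx.map(row))))
    }
  }
}
```

With opaque elements, as in an RDD, neither the early error for a misspelled column nor the column pruning would be possible, because the engine cannot see inside each record.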

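6. Using Spark

Spark SQL's predecessor is Shark, and Shark in turn grew out of Hive in the Hadoop ecosystem.

Constrained by the platform (named 络子 in the original), development currently seems to be possible only in Scala.

For Python SQL tasks, if the SQL conforms to Spark SQL syntax, the task is executed by the Spark engine.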

   