|
Catalyst
CatalystÊÇÓëSpark½âñîµÄÒ»¸ö¶ÀÁ¢¿â£¬ÊÇÒ»¸öimpl-freeµÄÖ´Ðмƻ®µÄÉú³ÉºÍÓÅ»¯¿ò¼Ü¡£
ĿǰÓëSpark Core»¹ÊÇñîºÏµÄ£¬¶Ô´ËuserÓʼþ×éÀïÓÐÈ˶ԴËÌá³öÒÉÎÊ£¬¼ûmail¡£
ÒÔÏÂÊÇCatalyst½ÏÔçʱºòµÄ¼Ü¹¹Í¼£¬Õ¹Ê¾µÄÊÇ´úÂë½á¹¹ºÍ´¦ÀíÁ÷³Ì¡£

Catalyst¶¨Î»
ÆäËûϵͳÈç¹ûÏë»ùÓÚSpark×öһЩÀàsql¡¢±ê×¼sqlÉõÖÁÆäËû²éѯÓïÑԵIJéѯ£¬ÐèÒª»ùÓÚCatalystÌṩµÄ½âÎöÆ÷¡¢Ö´Ðмƻ®Ê÷½á¹¹¡¢Âß¼Ö´Ðмƻ®µÄ´¦Àí¹æÔòÌåϵµÈÀàÌåϵÀ´ÊµÏÖÖ´Ðмƻ®µÄ½âÎö¡¢Éú³É¡¢ÓÅ»¯¡¢Ó³É乤×÷¡£
¶ÔÓ¦ÉÏͼÖУ¬Ö÷ÒªÊÇ×ó²àµÄTreeNodelib¼°ÖмäÈý´Îת»¯¹ý³ÌÖÐÉæ¼°µ½µÄÀà½á¹¹¶¼ÊÇCatalystÌṩµÄ¡£ÖÁÓÚÓÒ²àÎïÀíÖ´Ðмƻ®Ó³ÉäÉú³É¹ý³Ì£¬ÎïÀíÖ´Ðмƻ®»ùÓڳɱ¾µÄÓÅ»¯Ä£ÐÍ£¬¾ßÌåÎïÀíËã×ÓµÄÖ´Ðж¼ÓÉϵͳ×Ô¼ºÊµÏÖ¡£
CatalystÏÖ×´
ÔÚ½âÎöÆ÷·½ÃæÌṩµÄÊÇÒ»¸ö¼òµ¥µÄscalaдµÄsql parser£¬Ö§³ÖÓïÒåÓÐÏÞ£¬¶øÇÒÓ¦¸ÃÊDZê×¼sqlµÄ¡£
ÔÚ¹æÔò·½Ã棬ÌṩµÄÓÅ»¯¹æÔòÊDZȽϻù´¡µÄ(ºÍPig/Hive±ÈûÓÐÄÇô·á¸»)£¬²»¹ýһЩÓÅ»¯¹æÔòÆäʵÊÇ񻃾¼°µ½¾ßÌåÎïÀíËã×ӵģ¬ËùÒÔ²¿·Ö¹æÔòÐèÒªÔÚϵͳ·½ÄÇ×Ô¼ºÖƶ¨ºÍʵÏÖ(Èçspark-sqlÀïµÄSparkStrategy)¡£
CatalystÒ²ÓÐ×Ô¼ºµÄÒ»Ì×Êý¾ÝÀàÐÍ¡£
ÏÂÃæ½éÉÜCatalystÀXÌ×ÖØÒªµÄÀà½á¹¹¡£
TreeNodeÌåϵ
TreeNodeÊÇCatalystÖ´Ðмƻ®±íʾµÄÊý¾Ý½á¹¹£¬ÊÇÒ»¸öÊ÷½á¹¹£¬¾ß±¸Ò»Ð©scala collectionµÄ²Ù×÷ÄÜÁ¦ºÍÊ÷±éÀúÄÜÁ¦¡£Õâ¿ÃÊ÷Ò»Ö±ÔÚÄÚ´æÀïά»¤£¬²»»ádumpµ½´ÅÅÌÒÔijÖÖ¸ñʽµÄÎļþ´æÔÚ£¬ÇÒÎÞÂÛÔÚÓ³ÉäÂß¼Ö´Ðмƻ®½×¶Î»¹ÊÇÓÅ»¯Âß¼Ö´Ðмƻ®½×¶Î£¬Ê÷µÄÐÞ¸ÄÊÇÒÔÌæ»»ÒÑÓнڵãµÄ·½Ê½½øÐеġ£
TreeNode£¬ÄÚ²¿´øÒ»¸öchildren: Seq[BaseType]±íʾº¢×ӽڵ㣬¾ß±¸foreach¡¢map¡¢collectµÈÕë¶Ô½Úµã²Ù×÷µÄ·½·¨£¬ÒÔ¼°transformDown(ĬÈÏ£¬Ç°Ðò±éÀú)¡¢transformUpÕâÑùµÄ±éÀúÊ÷ÉϽڵ㣬¶ÔÆ¥Åä½Úµãʵʩ±ä»¯µÄ·½·¨¡£
ÌṩUnaryNode,BinaryNode, LeafNodeÈýÖÖtrait£¬¼´·ÇÒ¶×Ó½ÚµãÔÊÐíÓÐÒ»¸ö»òÁ½¸ö×ӽڵ㡣
TreeNodeÌṩµÄÊÇ·¶ÐÍ¡£
TreeNodeÓÐÁ½¸ö×ÓÀà¼Ì³ÐÌåϵ£¬QueryPlanºÍExpression¡£QueryPlanÏÂÃæÊÇÂß¼ºÍÎïÀíÖ´Ðмƻ®Á½¸öÌåϵ£¬Ç°ÕßÔÚCatalystÀïÓÐÏêϸʵÏÖ£¬ºóÕßÐèÒªÔÚϵͳ×Ô¼ºÊµÏÖ¡£ExpressionÊDZí´ïʽÌåϵ£¬ºóÃæÕ½ڶ¼»áÕ¹¿ª½éÉÜ¡£

TreeµÄtransformationʵÏÖ£º
´«ÈëPartialFunction[TreeType,TreeType]£¬Èç¹ûÓë²Ù×÷·ûÆ¥Å䣬Ôò½Úµã»á±»½á¹ûÌæ»»µô£¬·ñÔò½Úµã²»»á±ä¶¯¡£Õû¸ö¹ý³ÌÊǶÔchildrenµÝ¹éÖ´Ðеġ£
Ö´Ðмƻ®±íʾģÐÍ
Âß¼Ö´Ðмƻ®
QueryPlan¼Ì³Ð×ÔTreeNode£¬ÄÚ²¿´øÒ»¸öoutput: Seq[Attribute],¾ß±¸transformExpressionDown¡¢transformExpressionUp·½·¨¡£
ÔÚCatalystÖУ¬QueryPlanµÄÖ÷Òª×ÓÀàÌåϵÊÇLogicalPlan£¬¼´Âß¼Ö´Ðмƻ®±íʾ¡£ÆäÎïÀíÖ´Ðмƻ®±íʾÓÉʹÓ÷½ÊµÏÖ(spark-sqlÏîÄ¿ÖÐ)¡£
LogicalPlan¼Ì³Ð×ÔQueryPlan£¬ÄÚ²¿´øÒ»¸öreference:Set[Attribute]£¬Ö÷Òª·½·¨Îªresolve(name:String): Option[NamedeExpression]£¬ÓÃÓÚ·ÖÎöÉú³É¶ÔÓ¦µÄNamedExpression¡£
LogicalPlanÓÐÐí¶à¾ßÌå×ÓÀ࣬Ҳ·ÖΪUnaryNode, BinaryNode, LeafNodeÈýÀ࣬¾ßÌåÔÚorg.apache.spark.sql.catalyst.plans.logical·¾¶Ï¡£

Âß¼Ö´Ðмƻ®ÊµÏÖ
LeafNodeÖ÷Òª×ÓÀàÊÇCommandÌåϵ£º

¸÷commandµÄÓïÒå¿ÉÒÔ´Ó×ÓÀàÃû×Ö¿´³ö£¬´ú±íµÄÊÇϵͳ¿ÉÒÔÖ´ÐеÄnon-queryÃüÁÈçDDL¡£
UnaryNodeµÄ×ÓÀࣺ

BinaryNodeµÄ×ÓÀࣺ

ÎïÀíÖ´Ðмƻ®
ÁíÒ»·½Ã棬ÎïÀíÖ´Ðмƻ®½ÚµãÔÚ¾ßÌåϵͳÀïʵÏÖ£¬±ÈÈçspark-sql¹¤³ÌÀïµÄSparkPlan¼Ì³ÐÌåϵ¡£

ÎïÀíÖ´Ðмƻ®ÊµÏÖ
ÿ¸ö×ÓÀ඼ҪʵÏÖexecute()·½·¨£¬´óÖÂÓÐÒÔÏÂʵÏÖ×ÓÀà(²»È«)¡£

Ìáµ½ÎïÀíÖ´Ðмƻ®£¬»¹ÒªÌáÒ»ÏÂCatalystÌṩµÄ·ÖÇø±íʾģÐÍ¡£
Ö´Ðмƻ®Ó³Éä
Catalyst»¹ÌṩÁËÒ»¸öQueryPlanner[Physical <: TreeNode[PhysicalPlan]]³éÏóÀ࣬ÐèÒª×ÓÀàÖÆ¶¨Ò»Åústrategies: Seq[Strategy]£¬Æäapply·½·¨Ò²ÊÇÀàËÆ¸ù¾ÝÖÆ¶¨µÄ¾ßÌå²ßÂÔÀ´°ÑÂß¼Ö´Ðмƻ®Ëã×ÓÓ³Éä³ÉÎïÀíÖ´Ðмƻ®Ëã×Ó¡£ÓÉÓÚÎïÀíÖ´Ðмƻ®µÄ½ÚµãÊÇÔÚ¾ßÌåϵͳÀïʵÏֵģ¬ËùÒÔQueryPlanner¼°ÀïÃæµÄstrategiesÒ²ÐèÒªÔÚ¾ßÌåϵͳÀïʵÏÖ¡£

ÔÚspark-sqlÏîÄ¿ÖУ¬SparkStrategies¼Ì³ÐÁËQueryPlanner[SparkPlan]£¬ÄÚ²¿Öƶ¨ÁËLeftSemiJoin, HashJoin,PartialAggregation, BroadcastNestedLoopJoin, CartesianProductµÈ¼¸ÖÖ²ßÂÔ£¬Ã¿ÖÖ²ßÂÔ½ÓÊܵͼÊÇÒ»¸öLogicalPlan£¬Éú³ÉµÄÊÇSeq[SparkPlan]£¬Ã¿¸öSparkPlanÀí½âΪ¾ßÌåRDDµÄËã×Ó²Ù×÷¡£
±ÈÈçÔÚBasicOperatorsÕâ¸öStrategyÀÒÔmatch-caseÆ¥ÅäµÄ·½Ê½´¦ÀíÁ˺ܶà»ù±¾Ëã×Ó£¨¿ÉÒÔÒ»¶ÔÒ»Ö±½ÓÓ³Éä³ÉRDDËã×Ó£©£¬ÈçÏ£º

ExpressionÌåϵ
Expression£¬¼´±í´ïʽ£¬Ö¸²»ÐèÒªÖ´ÐÐÒýÇæ¼ÆË㣬¶ø¿ÉÒÔÖ±½Ó¼ÆËã»ò´¦ÀíµÄ½Úµã£¬°üÀ¨Cast²Ù×÷£¬Projection²Ù×÷£¬ËÄÔòÔËË㣬Âß¼²Ù×÷·ûÔËËãµÈ¡£
¾ßÌå¿ÉÒԲο¼org.apache.spark.sql.expressionspackageϵÄÀà¡£
RulesÌåϵ
·²ÊÇÐèÒª´¦ÀíÖ´Ðмƻ®Ê÷(Analyze¹ý³Ì£¬Optimize¹ý³Ì£¬SparkStrategy¹ý³Ì)£¬ÊµÊ©¹æÔòÆ¥ÅäºÍ½Úµã´¦ÀíµÄ£¬¶¼ÐèÒª¼Ì³ÐRuleExecutor[TreeType]³éÏóÀà¡£
RuleExecutorÄÚ²¿ÌṩÁËÒ»¸öSeq[Batch]£¬ÀïÃæ¶¨ÒåµÄÊǸÃRuleExecutorµÄ´¦Àí²½Ö衣ÿ¸öBatch´ú±í×ÅÒ»Ì×¹æÔò£¬Å䱸һ¸ö²ßÂÔ£¬¸Ã²ßÂÔ˵Ã÷Á˵ü´ú´ÎÊý(Ò»´Î»¹ÊǶà´Î)¡£

Rule[TreeType <: TreeNode[_]]ÊÇÒ»¸ö³éÏóÀ࣬×ÓÀàÐèÒª¸´Ð´apply(plan: TreeType)·½·¨À´Öƶ¨´¦ÀíÂß¼¡£
RuleExecutorµÄapply(plan: TreeType): TreeType·½·¨»á°´ÕÕbatches˳ÐòºÍbatchÄÚµÄRules˳Ðò£¬¶Ô´«ÈëµÄplanÀïµÄ½Úµãµü´ú´¦Àí£¬´¦ÀíÂ߼ΪÓɾßÌåRule×ÓÀàʵÏÖ¡£
HiveÏà¹Ø
HiveÖ§³Ö·½Ê½
Spark SQL¶ÔhiveµÄÖ§³ÖÊǵ¥¶ÀµÄspark-hiveÏîÄ¿£¬¶ÔHiveµÄÖ§³Ö°üÀ¨HQL²éѯ¡¢hive metaStoreÐÅÏ¢¡¢hive SerDes¡¢hive UDFs/UDAFs/ UDTFs£¬ÀàËÆShark¡£
Ö»ÓÐÔÚHiveContextÏÂͨ¹ýhive api»ñµÃµÄÊý¾Ý¼¯£¬²Å¿ÉÒÔʹÓÃhql½øÐвéѯ£¬ÆähqlµÄ½âÎöÒÀÀµµÄÊÇorg.apache.hadoop.hive.ql.parse.ParseDriverÀàµÄparse·½·¨£¬Éú³ÉHive AST¡£
ʵ¼ÊÉÏsqlºÍhql£¬²¢²»ÊÇÒ»ÆðÖ§³ÖµÄ¡£¿ÉÒÔÀí½âΪhqlÊǶÀÁ¢Ö§³ÖµÄ£¬Äܱ»hql²éѯµÄÊý¾Ý¼¯±ØÐë¶ÁÈ¡×Ôhive api¡£ÏÂͼÖеÄparquet¡¢jsonµÈÆäËûÎļþÖ§³ÖÖ»·¢ÉúÔÚsql»·¾³ÏÂ(SQLContext)¡£

Hive on Spark
Hive¹Ù·½Ìá³öÁËHive onSparkµÄJIRA¡£Shark½áÊøÖ®ºó£¬²ð·ÖΪÁ½¸ö·½Ïò£º

´ÓÕâÀï¿´£¬¶ÔHiveµÄ¼æÈÝÖ§³Ö½«×ªÒƵ½Hive on SparkÉÏ£¬Ö®Ç°SharkµÄ¾Ñ齫ÔÚHiveÉçÇøµÄÕâ¸öÖ§³ÖÉÏÌåÏÖ¡£ÎÒÀí½â£¬Ä¿Ç°SparkSQLÀïµÄÄÇÖÖHiveÖ§³Ö·½Ê½£¬Ö»ÊÇΪÁËÔÚSpark»·¾³Ï¼¯³É²Ù×ÝHiveÊý¾Ý£¬ËüµÄhqlÖ´ÐÐÊǵ÷ÓÃHive¿Í»§¶ËDriver£¬ÅÜÔÚhadoop MRÉϵ쬱¾Éí²»ÊÇHive on SparkµÄʵÏÖ£¬Ö»ÊÇΪÁËʹÓÃRDD¼ä½Ó²Ù×÷HiveÊý¾Ý¼¯¡£
ËùÒÔÈç¹ûÏëÒª°ÑÏÖÓÐHiveÈÎÎñÇ¨ÒÆµ½SparkÉÏ£¬Ó¦¸ÃʹÓÃShark»òÕߵȴýHive on Spark¡£
Spark SQLÀïµÄHiveÖ§³Ö²»ÊÇhive on sparkµÄʵÏÖ£¬¶ø¸üÏñÒ»¸ö¶ÁдHiveÊý¾ÝµÄ¿Í»§¶Ë¡£ÇÒÆähqlÖ§³ÖÖ»°üº¬hiveÊý¾Ý£¬Óësql»·¾³ÊÇ»¥Ïà¶ÀÁ¢µÄ¡£
ÒÔÉÏÁ½½ÚÊÇSpark SQL Hive¡¢Shark¡¢Hive on SparkµÄÇø±ðºÍÀí½â¡£
SQL Core
Spark SQLµÄºËÐÄÊǰÑÒÑÓеÄRDD£¬´øÉÏSchemaÐÅÏ¢£¬È»ºó×¢²á³ÉÀàËÆsqlÀïµÄ¡±Table¡±£¬¶ÔÆä½øÐÐsql²éѯ¡£ÕâÀïÃæÖ÷Òª·ÖÁ½²¿·Ö£¬Ò»ÊÇÉú³ÉSchemaRD£¬¶þÊÇÖ´Ðвéѯ¡£
Éú³ÉSchemaRDD
Èç¹ûÊÇspark-hiveÏîÄ¿£¬ÄÇô¶ÁÈ¡metadataÐÅÏ¢×÷ΪSchema¡¢¶ÁÈ¡hdfsÉÏÊý¾ÝµÄ¹ý³Ì½»¸øHiveÍê³É£¬È»ºó¸ù¾ÝÕâÁ©²¿·ÖÉú³ÉSchemaRDD£¬ÔÚHiveContextϽøÐÐhql()²éѯ¡£
¶ÔÓÚSpark SQLÀ´Ëµ£¬
Êý¾Ý·½Ã棬RDD¿ÉÒÔÀ´×ÔÈκÎÒÑÓеÄRDD£¬Ò²¿ÉÒÔÀ´×ÔÖ§³ÖµÄµÚÈý·½¸ñʽ£¬Èçjson file¡¢parquet file¡£
SQLContextÏ»á°Ñ´øcase classµÄRDDÒþʽת»¯ÎªSchemaRDD

ExsitingRddµ¥ÀýÀï»á·´Éä³öcase classµÄattributes£¬²¢°ÑRDDµÄÊý¾Ýת»¯³ÉCatalystµÄGenericRow£¬×îºó·µ»ØRDD[Row]£¬¼´Ò»¸öSchemaRDD¡£ÕâÀïµÄ¾ßÌåת»¯Âß¼¿ÉÒԲο¼ExsitingRddµÄproductToRowRddºÍconvertToCatalyst·½·¨¡£
Ö®ºó¿ÉÒÔ½øÐÐSchemaRDDÌṩµÄ×¢²átable²Ù×÷¡¢Õë¶ÔSchema¸´Ð´µÄ²¿·ÖRDDת»¯²Ù×÷¡¢DSL²Ù×÷¡¢saveAs²Ù×÷µÈµÈ¡£
RowºÍGenericRowÊÇCatalystÀïµÄÐбíʾģÐÍ
RowÓÃSeq[Any]À´±íʾvalues£¬GenericRowÊÇRowµÄ×ÓÀ࣬ÓÃÊý×é±íʾvalues¡£RowÖ§³ÖÊý¾ÝÀàÐͰüÀ¨Int, Long, Double, Float, Boolean, Short, Byte, String¡£Ö§³Ö°´ÐòÊý(ordinal)¶Áȡijһ¸öÁеÄÖµ¡£¶ÁȡǰÐèÒª×öisNullAt(i: Int)µÄÅжϡ£
¸÷×Ô¶¼ÓÐMutableÀ࣬ÌṩsetXXX(i: int, value: Any)ÐÞ¸ÄijÐòÊýÉϵÄÖµ¡£
²ã´Î½á¹¹

ÏÂͼ´óÖ¶ԱÈÁËPig£¬Spark SQL£¬SharkÔÚʵÏÖ²ã´ÎÉϵÄÇø±ð£¬½ö×ö²Î¿¼¡£


²éѯÁ÷³Ì
SQLContextÀï¶ÔsqlµÄÒ»¸ö½âÎöºÍÖ´ÐÐÁ÷³Ì£º
1.µÚÒ»²½parseSql(sql: String)£¬simple sql parser×ö´Ê·¨Óï·¨½âÎö£¬Éú³ÉLogicalPlan¡£
2.µÚ¶þ²½analyzer(logicalPlan)£¬°Ñ×öÍê´Ê·¨Óï·¨½âÎöµÄÖ´Ðмƻ®½øÐгõ²½·ÖÎöºÍÓ³É䣬
ĿǰSQLContextÄÚµÄAnalyzerÓÉCatalystÌṩ£¬¶¨ÒåÈçÏ£º
new Analyzer(catalog, EmptyFunctionRegistry, caseSensitive =true)
catalogΪSimpleCatalog£¬catalogÊÇÓÃÀ´×¢²átableºÍ²éѯrelationµÄ¡£
¶øÕâÀïµÄFunctionRegistry²»Ö§³ÖlookupFunction·½·¨£¬ËùÒÔ¸Ãanalyzer²»Ö§³ÖFunction×¢²á£¬¼´UDF¡£
AnalyzerÄÚ¶¨ÒåÁ˼¸Åú¹æÔò£º

3.´ÓµÚ¶þ²½µÃµ½µÄÊdzõ²½µÄlogicalPlan£¬½ÓÏÂÀ´µÚÈý²½ÊÇoptimizer(plan)¡£
OptimizerÀïÃæÒ²ÊǶ¨ÒåÁ˼¸Åú¹æÔò£¬»á°´Ðò¶ÔÖ´Ðмƻ®½øÐÐÓÅ»¯²Ù×÷¡£

4.ÓÅ»¯ºóµÄÖ´Ðмƻ®£¬»¹Òª¶ª¸øSparkPlanner´¦Àí£¬ÀïÃæ¶¨ÒåÁËһЩ²ßÂÔ£¬Ä¿µÄÊǸù¾ÝÂß¼Ö´Ðмƻ®Ê÷Éú³É×îºó¿ÉÒÔÖ´ÐеÄÎïÀíÖ´Ðмƻ®Ê÷£¬¼´µÃµ½SparkPlan¡£

5.ÔÚ×îÖÕÕæÕýÖ´ÐÐÎïÀíÖ´Ðмƻ®Ç°£¬×îºó»¹Òª½øÐÐÁ½´Î¹æÔò£¬SQLContextÀﶨÒåÕâ¸ö¹ý³Ì½ÐprepareForExecution£¬Õâ¸ö²½ÖèÊǶîÍâÔö¼ÓµÄ£¬Ö±½Ónew RuleExecutor[SparkPlan]½øÐеġ£

6.×îºóµ÷ÓÃSparkPlanµÄexecute()Ö´ÐмÆËã¡£Õâ¸öexecute()ÔÚÿÖÖSparkPlanµÄʵÏÖÀﶨÒ壬һ°ã¶¼»áµÝ¹éµ÷ÓÃchildrenµÄexecute()·½·¨£¬ËùÒԻᴥ·¢Õû¿ÃTreeµÄ¼ÆËã¡£
ÆäËûÌØÐÔ
ÄÚ´æÁд洢
SQLContextÏÂcache/uncache tableµÄʱºò»áµ÷ÓÃÁд洢ģ¿é¡£
¸ÃÄ£¿é½è¼ø×ÔShark£¬Ä¿µÄÊǵ±°Ñ±íÊý¾ÝcacheÔÚÄÚ´æµÄʱºò×öÐÐתÁвÙ×÷£¬ÒÔ±ãѹËõ¡£
ʵÏÖÀà
InMemoryColumnarTableScanÀàÊÇSparkPlan LeafNodeµÄʵÏÖ£¬¼´ÊÇÒ»¸öÎïÀíÖ´Ðмƻ®¡£´«ÈëÒ»¸öSparkPlan(È·ÈÏÁ˵ÄÎïÀíÖ´ÐмÆ)ºÍÒ»¸öÊôÐÔÐòÁУ¬ÄÚ²¿°üº¬Ò»¸öÐÐתÁС¢´¥·¢¼ÆËã²¢cacheµÄ¹ý³Ì(ÇÒÊÇlazyµÄ)¡£
ColumnBuilderÕë¶Ô²»Í¬µÄÊý¾ÝÀàÐÍ(boolean, byte, double, float, int, long, short, string)Óɲ»Í¬µÄ×ÓÀà°ÑÊý¾Ýдµ½ByteBufferÀ¼´°ü×°RowµÄÿ¸öfield£¬Éú³ÉColumns¡£ÓëÆä¶ÔÓ¦µÄColumnAccessorÊÇ·ÃÎÊcolumn£¬½«Æäת»ØRow¡£
CompressibleColumnBuilderºÍCompressibleColumnAccessorÊÇ´øÑ¹ËõµÄÐÐÁÐת»»builder£¬ÆäByteBufferÄÚ²¿´æ´¢½á¹¹ÈçÏÂ

CompressionScheme×ÓÀàÊDz»Í¬µÄѹËõʵÏÖ

¶¼ÊÇscalaʵÏֵģ¬Î´½èÖúµÚÈý·½¿â¡£²»Í¬µÄʵÏÖ£¬Ö¸¶¨ÁËÖ§³ÖµÄcolumn dataÀàÐÍ¡£ÔÚbuild()µÄʱºò£¬»á±È½ÏÿÖÖѹËõ£¬Ñ¡ÔñѹËõÂÊ×îСµÄ£¨ÈôÈÔ´óÓÚ0.8¾Í²»Ñ¹ËõÁË£©¡£
ÕâÀïµÄ¹ÀËãÂß¼£¬À´×Ô×ÓÀàʵÏÖµÄgatherCompressibilityStats·½·¨¡£
CacheÂß¼
cache֮ǰ£¬ÐèÒªÏȰѱ¾´ÎcacheµÄtableµÄÎïÀíÖ´Ðмƻ®Éú³É³öÀ´¡£
ÔÚcacheÕâ¸ö¹ý³ÌÀInMemoryColumnarTableScan²¢Ã»Óд¥·¢Ö´ÐУ¬µ«ÊÇÉú³ÉÁËÒÔInMemoryColumnarTableScanΪÎïÀíÖ´Ðмƻ®µÄSparkLogicalPlan£¬²¢´æ³ÉtableµÄplan¡£
ÆäʵÔÚcacheµÄʱºò£¬Ê×ÏÈÈ¥catalogÀïѰÕÒÕâ¸ötableµÄÐÅÏ¢ºÍtableµÄÖ´Ðмƻ®£¬È»ºó»á½øÐÐÖ´ÐУ¨Ö´Ðе½ÎïÀíÖ´Ðмƻ®Éú³É£©£¬È»ºó°ÑÕâ¸ötableÔÙ·Å»ØcatalogÀïά»¤ÆðÀ´£¬Õâ¸öʱºòµÄÖ´Ðмƻ®ÒѾÊÇ×îÖÕÒªÖ´ÐеÄÎïÀíÖ´Ðмƻ®ÁË¡£µ«ÊÇ´ËʱColumnerÄ£¿éÏà¹ØµÄת»»µÈ²Ù×÷¶¼ÊÇûÓд¥·¢µÄ¡£
ÕæÕýµÄ´¥·¢»¹ÊÇÔÚexecute()µÄʱºò£¬Í¬ÆäËûSparkPlanµÄexecute()·½·¨´¥·¢³¡¾°ÊÇÒ»ÑùµÄ¡£
UncacheÂß¼
UncacheTableµÄʱºò£¬³ýÁËɾ³ýcatalogÀïµÄtableÐÅÏ¢Ö®Í⣬»¹µ÷ÓÃÁËInMemoryColumnarTableScanµÄcacheColumnBuffers·½·¨£¬µÃµ½RDD¼¯ºÏ£¬²¢½øÐÐÁËunpersist()²Ù×÷¡£cacheColumnBuffersÖ÷Òª×öÁ˰ÑRDDÿ¸öpartitionÀïµÄROWµÄÿ¸öField´æµ½ÁËColumnBuilderÄÚ¡£
UDF£¨Ôݲ»Ö§³Ö£©
ÈçÇ°Ãæ¶ÔSQLContextÀïAnalyzerµÄ·ÖÎö£¬ÆäFunctionRegistryûÓÐʵÏÖlookupFunction¡£
ÔÚspark-hiveÏîÄ¿ÀHiveContextÀïÊÇʵÏÖÁËFunctionRegistryÕâ¸ötraitµÄ£¬ÆäʵÏÖΪHiveFunctionRegistry£¬ÊµÏÖÂß¼¼ûorg.apache.spark.sql.hive.hiveUdfs
JSONÖ§³Ö
SQLContextÏ£¬Ôö¼ÓÁËjsonFileµÄ¶ÁÈ¡·½·¨£¬¶øÇÒĿǰ¿´£¬´úÂëÀïʵÏÖµÄÊÇhadoop textfileµÄ¶ÁÈ¡£¬Ò²¾ÍÊÇÕâ·ÝjsonÎļþÓ¦¸ÃÊÇÔÚHDFSÉϵġ£¾ßÌåÕâ·ÝjsonÎļþµÄÔØÈ룬InputFormatÊÇTextInputFormat£¬key classÊÇLongWritable£¬value classÊÇText£¬×îºóµÃµ½µÄÊÇvalue²¿·ÖµÄÄǶÎStringÄÚÈÝ£¬¼´RDD[String]¡£
¶ÁÈ¡jsonÎļþÖ®ºó£¬×ª»»³ÉSchemaRDD¡£JsonRDD.inferSchema(RDD[String])ÀïÓÐÏêϸµÄ½âÎöjsonºÍÓ³Éä³öschemaµÄ¹ý³Ì£¬×îºóµÃµ½¸ÃjsonµÄLogicalPlan¡£
JsonµÄ½âÎöʹÓõÄÊÇFasterXML/jackson-databind¿â£¬GitHubµØÖ·£¬wiki
°ÑÊý¾ÝÓ³Éä³ÉMap[String, Any]
JsonµÄÖ§³Ö·á¸»ÁËSpark SQLÊý¾Ý½ÓÈ볡¾°¡£
JDBCÖ§³Ö
Jdbc support branchis under going
SQL92
Spark SQLĿǰµÄSQLÓï·¨Ö§³ÖÇé¿ö¼ûSqlParserÀࡣĿ±êÊÇÖ§³ÖSQL92£¿£¿
1.»ù±¾Ó¦ÓÃÉÏ£¬sql server ºÍoracle¶¼×ñÑsql 92Óï·¨±ê×¼¡£
2.ʵ¼ÊÓ¦ÓÃÖдó¼Ò¶¼»á³¬³öÒÔÉϱê×¼£¬Ê¹Óø÷¼ÒÊý¾Ý¿â³§É̶¼ÌṩµÄ·á¸»µÄ×Ô¶¨Òå±ê×¼º¯Êý¿âºÍÓï·¨¡£
3.΢Èísql serverµÄsql À©Õ¹½ÐT-SQL(Transcate SQL).
4.Oracle µÄsql À©Õ¹½ÐPL-SQL.
×ܽá
ÒÔÉÏÕûÀíÁ˶ÔSpark SQL¸÷¸öÄ£¿éµÄʵÏÖÇé¿ö£¬´úÂë½á¹¹£¬Ö´ÐÐÁ÷³ÌÒÔ¼°×Ô¼º¶ÔSpark SQLµÄÀí½â¡£ |