Applying Spark to Gene Sequence Analysis
 
Source: Internet  Published: 2016-12-12
 

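Introduction

Life science is booming. From identifying bacterial cultures in the food industry to rapid cancer diagnosis, applications based on DNA analysis keep emerging, yet genomic analysis also faces major challenges. Many new technologies and methods are being applied to gene sequence analysis, including Spark, FPGAs, and GPU coprocessor acceleration. These technologies allow most life science applications, both open source and ISV software, to run in parallel without complex MPI programming. In addition, Spark's in-memory computing improves analysis efficiency, accelerates workflows, and shortens analysis time, leaving room for more new discoveries. This article describes how to run common gene sequence analysis applications with Spark, covering how to run them in the different Spark modes, the run process, and analysis of the results, and it compares performance and speedup across computing platforms and run parameters.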

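1. Gene sequence analysis workflow

The gene sequence analysis workflow follows the GATK Best Practices. It takes the raw FASTQ files as input and runs from BWA-mem alignment through GATK's HaplotypeCaller to complete the sequencing analysis of an entire sample.

Figure 1. GATK Best Practices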

In the first stage of the workflow, BWA-mem aligns the input FASTQ files and produces a Sequence Alignment/Map (SAM) file. SortSam then produces a sorted BAM file. A BAM file is simply the binary form of a SAM file, and all subsequent processing operates on the binary BAM file.

The BAM file is passed to the Picard tool MarkDuplicates, which removes duplicate reads and produces a merged, de-duplicated BAM file. The steps that follow, RealignerTargetCreator, IndelRealigner, BaseRecalibrator, PrintReads, and HaplotypeCaller, are all part of GATK, a software package for analyzing high-throughput sequencing data.

Figure 2. Breakdown of the gene sequencing workflow

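Most of the work in sequence analysis is data preprocessing; the preprocessed data can then be used by downstream analysis. In the preprocessing stage, alignment and sorting are compute-intensive and time-consuming. Multiple processors or multithreading can improve efficiency, but in practice, because the computational methods are complex and the volume of data to analyze is growing rapidly, a single analysis run can still take more than one day and cost between 200 and 600 US dollars. Spark can parallelize the serial analysis, partition the data into optimized segments, and apply dynamic load balancing to improve efficiency.

GATK4 is the Spark-based gene sequence analysis package released by the Broad Institute. The GATK4 data preprocessing flow is as follows:

Figure 3. Spark-based data preprocessing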

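Merge the input files with the reference file

Split the data into blocks

The number of blocks depends on the cluster size and the available resources

The workflow is divided into multiple stages, and the data is partitioned only once, before processing starts

The staging is similar to MapReduce

The TaskManager assigns tasks to executors

The BlockManager uses spark.broadcast.blockSize to set the size of each piece of a block. A value that is too large reduces parallelism during broadcast (and slows the run down), while a value that is too small hurts BlockManager performance. The default is 4 MB

Free memory keeps shrinking; when too little remains, the task is interrupted and evicted. Spark tries to restart the task, and if it still fails after the configured number of retries, the job terminates abnormally.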
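As a rough illustration of how a partition size turns into a block count, the sketch below uses plain ceiling division. It assumes blocks are simple fixed-size slices of the input, which ignores record boundaries that a real BAM splitter has to respect. The helper name partition_count and the 10 GiB file size are hypothetical; 134217728 bytes (128 MiB) is the bamPartitionSize value used in Listing 1 of this article.

```shell
# Hypothetical helper (not part of GATK or Spark):
# ceiling division of the file size by the partition size.
partition_count() {
  file_bytes=$1
  partition_bytes=$2
  echo $(( (file_bytes + partition_bytes - 1) / partition_bytes ))
}

# Assumed example: a 10 GiB BAM file read in 128 MiB partitions.
partition_count 10737418240 134217728
```

For these assumed sizes the function prints 80. A smaller partition size yields more, smaller tasks to spread over the executors, at the price of more scheduling overhead.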

GATK4 is still under active development, and the tools and workflows it provides keep growing. At the time of writing, the workflows provided by the latest GATK4 release include:

BQSRPipelineSpark

Runs the two BQSR steps (BaseRecalibrator and ApplyBQSR) on Spark

BwaAndMarkDuplicatesPipelineSpark

Takes a name-sorted file as input and runs BWA and MarkDuplicates

ReadsPipelineSpark

Takes BWA-aligned reads as input and runs MarkDuplicates and BQSR; its output is used for downstream analysis

Figure 4 is a schematic of the ReadsPipelineSpark workflow.

Figure 4. ReadsPipelineSpark workflow

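2. Optimizing sequence analysis

The applications in a sequence analysis workflow place different demands on system resources. As Figure 5 shows, some are CPU-heavy: BWA, for example, consumes a large share of processor resources and runs for a long time. Others, such as HaplotypeCaller, need a large amount of memory and also take a long time to run.

Figure 5. System resource demands of different applications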

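In general, there are four ways to optimize and accelerate genomic processing:

-nt parallelizes at the engine level, processing different parts of the genome sequence in parallel

-nct parallelizes at the walker level, speeding up the processing of each individual region of the genome sequence

MapReduce launches many instances at once, each processing a different (arbitrary) part of the genome sequence

Optimization with scientific libraries

In a GATK workflow, setting the -nt and -nct parameters can improve job efficiency.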

GATK4 is the Spark-based version of GATK. It includes many tools and workflows that run in a Spark environment, and it runs jobs in stages, working much like MapReduce. There are three run modes:

Non-Spark standalone mode

Spark standalone mode

Spark cluster mode

With the input data and reference files set up correctly, most GATK4 tools run successfully in Spark cluster mode. Some applications, such as CountReadsSpark, run significantly faster in cluster mode. Others, particularly workflows that need more system resources, cannot run in Spark standalone mode and report a "Not enough space to cache RDD in memory" error, yet run without problems in Spark cluster mode; CollectInsertSizeMetricsSpark is one example.

Figure 6. CollectInsertSizeMetricsSpark results

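Another way to accelerate sequence analysis applications is to provide scientific libraries optimized for the POWER8 processor. Take HaplotypeCaller as an example: it uses a large amount of memory during analysis and has the longest run time. Vendors are also developing acceleration libraries for their own software stacks, such as the Intel Genomics Kernel Library (GKL).

IBM provides a PairHMM algorithm optimized for POWER8 systems that takes full advantage of the new software and hardware features of POWER8. The optimized library currently runs on POWER8 under Ubuntu 14 and RHEL 7.

The latest version of the library accelerates HaplotypeCaller with the same floating-point precision as Java on POWER8, and it makes full use of simultaneous multithreading (SMT) and vector instructions. Its speedup of HaplotypeCaller exceeds that of earlier versions, especially in single-threaded mode (that is, when the -nct option is not specified). In single-threaded mode with PairHMM acceleration, HaplotypeCaller takes only about half the time, a speedup of 1.88x.

Invoking PairHMM on a P8 system:

Other ways to accelerate gene sequencing applications include FPGAs, GPGPU computing, and the fully hardware-accelerated Edico DRAGEN solution; these are outside the scope of this document.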

3. Introduction to Spark

Spark is a cluster computing environment similar to Hadoop, but it adds in-memory distributed datasets, which makes it perform better on certain workloads. Besides supporting interactive queries, Spark can also optimize iterative workloads.

Spark's main features include:

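Speed

Spark has an advanced DAG execution engine that supports iterative data flow and in-memory computing. Applications run up to 100 times faster than Hadoop MapReduce in memory, or 10 times faster on disk;

Ease of use

Applications can be written quickly in Java, Scala, Python, and R. Spark offers more than 80 high-level operators that make it easy to build parallel applications, and it can be used interactively from Scala, Python, and R;

Generality

Spark supports a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. These libraries can be combined seamlessly in the same application, bringing SQL, streaming, and complex analytics together.

Runs everywhere

Spark runs on Hadoop, on Mesos, in standalone mode, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3. Spark can be run in different modes, including local mode, standalone mode, Mesos mode, and YARN mode.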
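Each of these modes is selected through the master URL handed to Spark. The sketch below maps a mode name to the corresponding URL form. The function name spark_master_url and the mode keywords are hypothetical; the URL schemes and the default ports (7077 for a standalone master, 5050 for a Mesos master) are the standard Spark ones, and the host name c712f6n10 is the one used elsewhere in this article.

```shell
# Hypothetical helper: print the Spark master URL for a given run mode.
spark_master_url() {
  mode=$1
  host=$2
  case "$mode" in
    local)      echo 'local[*]' ;;              # all cores of the local machine
    standalone) echo "spark://${host}:7077" ;;  # Spark standalone cluster master
    mesos)      echo "mesos://${host}:5050" ;;  # Mesos master
    yarn)       echo 'yarn' ;;                  # cluster located via the Hadoop configuration
    *)          echo "unknown mode: $mode" >&2; return 1 ;;
  esac
}

spark_master_url standalone c712f6n10
```

The call at the end prints spark://c712f6n10:7077, the same master URL that is passed to gatk-launch in the tests described in this article.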

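IBM Spectrum Conductor with Spark simplifies deployment of Apache Spark, the open source big data analytics platform, and speeds up its analytics by nearly 60%. As an open source big data analytics framework, Apache Spark offers compelling performance advantages. Implementing Spark is challenging, however: it requires investment in new skills, tools, and workflows. Setting up ad hoc Spark clusters can lead to inefficient resource use and introduces management and security challenges. IBM Spectrum Conductor with Spark helps address these problems. It integrates a Spark distribution with resource, infrastructure, and data lifecycle management to create an enterprise-grade, multitenant Spark environment in a streamlined way. To help manage the fast-moving Spark lifecycle, IBM Spectrum Conductor with Spark supports running multiple instances and versions of Spark at the same time.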

The tests in this document were run on an IBM Conductor with Spark architecture consisting of three Firestone servers: one driver node and two worker nodes, that is, a 1+2 layout, as shown in the following figure:

Conductor with Spark cluster architecture

The runtime environment is Conductor with Spark running Spark 1.6.1 with Spark's default DAGScheduler. The gatk-launch options are --sparkRunner SPARK --sparkMaster spark://c712f6n10:7077.

4. Gene sequence analysis with Spark

Take the ReadsPipelineSpark workflow as an example. It is a predefined GATK4 workflow that takes a BAM file as input and runs MarkDuplicates and BQSR; its output file is used in the next stage of analysis.

Duplicates are a set of reads that share the same unclipped start and end. MarkDuplicates picks the "best" copy among them, which mitigates the effect of errors.
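The idea of keeping the "best" copy can be sketched with a few lines of awk. This is a deliberately simplified illustration, not how MarkDuplicates is implemented: it assumes each input line already carries a read name, chromosome, unclipped start, unclipped end, and a precomputed sum of base qualities, and it ignores strand, read pairing, and library information. The function name pick_best_duplicates and the sample records are hypothetical.

```shell
# Hypothetical helper: among reads that share chromosome, unclipped start
# and unclipped end, keep the one with the highest summed base quality.
# Input columns: read_name chrom start end quality_sum
pick_best_duplicates() {
  awk '
    {
      key = $2 ":" $3 ":" $4
      score = $5 + 0
      if (!(key in best_score) || score > best_score[key]) {
        best_score[key] = score
        best_read[key] = $1
      }
    }
    END { for (key in best_read) print best_read[key] }
  ' | sort
}

# Made-up records: readA and readB are duplicates of each other.
printf '%s\n' \
  'readA chr1 100 200 350' \
  'readB chr1 100 200 410' \
  'readC chr1 250 350 390' | pick_best_duplicates
```

With the made-up records above, readB wins over readA because of its higher quality sum, and readC is kept because it has no duplicate, so the output is readB followed by readC.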

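BQSR recalibrates the base quality scores of sequencing-by-synthesis data in an already sorted BAM file. After recalibration, the QUAL field of each read in the output BAM is more accurate: the reported quality scores are closer to the actual probability of a mismatch against the reference genome.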
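To make the link between a quality score and a probability concrete, the sketch below converts a score to an error probability. It assumes the scores follow the standard Phred scale, where p = 10^(-Q/10); the function name phred_to_error is hypothetical.

```shell
# Hypothetical helper: error probability for a Phred-scaled quality score Q,
# computed as p = 10^(-Q/10) and printed with four decimal places.
phred_to_error() {
  awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

phred_to_error 20
phred_to_error 30
```

The two calls print 0.0100 and 0.0010: a base reported at Q20 is expected to be wrong once in 100 calls, and a base at Q30 once in 1000. Recalibration adjusts the reported scores so that these expectations better match the mismatch rates actually observed.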

Run command:

Listing 1.

# Options used below:
#   ReadsPipelineSpark             pipeline name
#   -I, -R, -O                     input file, reference file, output file
#   --bamPartitionSize             maximum number of bytes to read from a file
#                                  into each partition of reads
#   --knownSites                   known sites file (see the input files below)
#   --shardedOutput                write the output in multiple pieces
#   --duplicates_scoring_strategy  MarkDuplicatesScoringStrategy
#   --sparkRunner, --sparkMaster   run mode and Spark cluster
./gatk/gatk-launch \
  ReadsPipelineSpark \
  -I "$bam" \
  -R "$ref" \
  -O "$bamout" \
  --bamPartitionSize 134217728 \
  --knownSites "$dbsnp" \
  --shardedOutput true \
  --duplicates_scoring_strategy SUM_OF_BASE_QUALITIES \
  --sparkRunner SPARK \
  --sparkMaster spark://c712f6n10:7077 \
  --conf spark.driver.memory=5g --conf spark.executor.memory=16g

 

Input files:

-I CEUTrio.HiSeq.WEx.b37.NA12892.bam
-R human_g1k_v37.2bit
--knownSites dbsnp_138.b37.excluding_sites_after_129.vcf

 

 

Depending on the resource manager, you can also set --num-executors, --executor-memory, and --executor-cores so that resources are allocated and scheduled sensibly for the size of the available compute resources.
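One simple way to derive such settings is to divide the cores of each worker among equally sized executors. The sketch below does only that arithmetic and prints the resulting options; it does not launch anything. The helper name executor_flags and the figure of 20 cores per worker are hypothetical, the two workers and the 16g executor memory come from the test setup in this article, and the option names are the spark-submit ones (which of them a given resource manager honors varies).

```shell
# Hypothetical helper: derive executor options from the cluster shape.
# Arguments: workers, cores per worker, cores per executor, memory per executor (GB)
executor_flags() {
  workers=$1
  cores_per_worker=$2
  cores_per_executor=$3
  mem_gb=$4
  # Whole executors that fit on one worker, times the number of workers.
  executors=$(( workers * (cores_per_worker / cores_per_executor) ))
  echo "--num-executors ${executors} --executor-cores ${cores_per_executor} --executor-memory ${mem_gb}g"
}

# Assumed shape: 2 workers, 20 cores each, 5 cores and 16 GB per executor.
executor_flags 2 20 5 16
```

For the assumed shape this prints --num-executors 8 --executor-cores 5 --executor-memory 16g. The memory per executor multiplied by the executors per worker must of course still fit in each worker's physical memory.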

5. ÐÔÄܺͼÓËٱȷÖÎö

±¾ÎĵµÒÔ CountReadsSpark ΪÀý£¬¶Ô±È·ÖÎöÔÚ²»Í¬Ä£Ê½ÏµÄÔËÐнá¹û¡£

µ¥»ú¡¢·Ç Spark ģʽÏÂÔËÐÐ CountReads:

Çåµ¥ 2.

#./gatk-launch CountReads -I /home/dlspark/SRR034975.Sort_all.bam

Running:
/home/dlspark/gatk/build/install/gatk/bin/gatk CountReads -I /home/dlspark/SRR034975.Sort_all.bam
[May 31, 2016 9:52:01 PM EDT] org.broadinstitute.hellbender.tools.CountReads --input /home/dlspark/SRR034975.Sort_all.bam --disable_all_read_filters false --interval_set_rule UNION --interval_padding 0 --readValidationStringency SILENT --secondsBetweenProgressUpdates 10.0 --disableSequenceDictionaryValidation false --createOutputBamIndex true --createOutputBamMD5 false --addOutputSAMProgramRecord true --help false --version false --verbosity INFO --QUIET false

Output: 34929382
Elapsed time: 12.14 minutes

 

 

Running CountReadsSpark in Spark standalone mode

Listing 3.

# ./gatk-launch CountReadsSpark -I /home/dlspark/SRR034975.Sort_all.bam
Running:
/home/dlspark/gatk/build/install/gatk/bin/gatk CountReadsSpark -I /home/dlspark/SRR034975.Sort_all.bam
[June 1, 2016 1:11:34 AM EDT] org.broadinstitute.hellbender.tools.spark.pipelines.CountReadsSpark --input /home/dlspark/SRR034975.Sort_all.bam --readValidationStringency SILENT --interval_set_rule UNION --interval_padding 0 --bamPartitionSize 0 --disableSequenceDictionaryValidation false --shardedOutput false --numReducers 0 --sparkMaster local[*] --help false --version false --verbosity INFO --QUIET false

Output: 34929382
Elapsed time: 9.50 minutes

 

 

Running CountReadsSpark in Spark cluster mode

Listing 4.

# ./gatk-launch CountReadsSpark -I /gpfs1/yrx/SRR034975.Sort_all.bam -O /gpfs1/yrx/gatk4-test.output --sparkRunner SPARK --sparkMaster spark://c712f6n10:7077
Running:
spark-submit --master spark://c712f6n10:7077
--conf spark.kryoserializer.buffer.max=512m
--conf spark.driver.maxResultSize=0
--conf spark.driver.userClassPathFirst=true
--conf spark.io.compression.codec=lzf
--conf spark.yarn.executor.memoryOverhead=600
--conf spark.yarn.dist.files=/home/dlspark/gatk/build/libIntelDeflater.so
--conf spark.driver.extraJavaOptions=-Dsamjdk.intel_deflater_so_path=libIntelDeflater.so -Dsamjdk.compression_level=1 -DGATK_STACKTRACE_ON_USER_EXCEPTION=true
--conf spark.executor.extraJavaOptions=-Dsamjdk.intel_deflater_so_path=libIntelDeflater.so -Dsamjdk.compression_level=1 -DGATK_STACKTRACE_ON_USER_EXCEPTION=true

Output: 34929382
Elapsed time: 0.60 minutes
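With the elapsed times of Listings 2 to 4 in hand, the relative speedups can be checked with a little arithmetic. The sketch below is only an illustration: speedup is a hypothetical helper name, and the three times (12.14, 9.50, and 0.60 minutes) are copied from the listings.

```shell
# Hypothetical helper: speedup = baseline time / improved time,
# printed with one decimal place.
speedup() {
  awk -v base="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", base / fast }'
}

speedup 12.14 9.50   # non-Spark vs. Spark standalone
speedup 9.50 0.60    # Spark standalone vs. Spark cluster
speedup 12.14 0.60   # non-Spark vs. Spark cluster
```

The three calls print 1.3, 15.8, and 20.2. All three runs report the same read count (34929382), so the comparison is between equivalent results.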

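Figure 7 compares the results of the three run modes. In terms of run time, CountReadsSpark in Spark cluster mode is about 15 times faster than in Spark standalone mode (0.60 minutes versus 9.50 minutes).

Figure 7. Comparison of results across the three run modes

Summary

As the analysis above shows, Spark-based life science solutions let originally serial applications run in parallel and in memory with no or few code changes, or process large volumes of data on a cluster with a few lines of Java code, without complex MPI or OpenMP programming, so scientists can focus more on researching new methods and making new discoveries. As Spark-based life science solutions continue to mature, more and more tools and workflows can run on the Spark platform, which is fault tolerant and increasingly scalable. Parallelized life science applications will shorten sequence analysis time from the current ten-plus hours to under one hour, and distributed job scheduling on low-cost clusters will also greatly reduce analysis cost.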

 

   