Hadoop vs Spark Performance Comparison
 
Author: magenfeng's blog, Huolongguo Software    Published 2014-08-15
 

1. Kmeans

Data: self-generated three-dimensional points, clustered around the 8 vertices of a cube:

{0, 0, 0}, {0, 10, 0}, {0, 0, 10}, {0, 10, 10},

{10, 0, 0}, {10, 0, 10}, {10, 10, 0}, {10, 10, 10}

Program logic:

Read the blocks on HDFS into memory; each block becomes an RDD containing vectors.

Then run a map over the RDD that assigns each vector (point) to its class, emitting (K, V) pairs of the form (class, (Point, 1)) to form a new RDD.

Then, before the reduce, combine within each new RDD, computing the partial center sum for each class locally, so that each RDD emits at most K key-value pairs.

Finally, a reduce produces a new RDD (Key is the class, Value is the center sum), and a final map yields the final centers.
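The described pipeline can be sketched roughly as follows. This is only an illustrative sketch, not the spark.examples.SparkKMeans source used in the test; it is written against the newer org.apache.spark API, and the names KMeansSketch, step, points and centers are invented for the example.

import org.apache.spark.rdd.RDD

object KMeansSketch {

  type Vec = Array[Double]

  def add(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x + y }

  def dist2(a: Vec, b: Vec): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // One iteration: assign each point to its closest center, combine partial
  // sums per class on the map side, reduce globally, then average.
  def step(points: RDD[Vec], centers: Array[Vec]): Array[Vec] = {
    val closest = points.map { p =>
      val cls = centers.indices.minBy(i => dist2(p, centers(i)))
      (cls, (p, 1L))                               // (class, (point, 1))
    }
    // reduceByKey combines inside each partition first, so every partition
    // emits at most K key-value pairs, as described above.
    val sums = closest.reduceByKey { (a, b) => (add(a._1, b._1), a._2 + b._2) }
    sums.map { case (_, (sum, count)) => sum.map(_ / count) }.collect()
  }
}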

First upload the data to HDFS, then run on the Master:

root@master:/opt/spark# ./run spark.examples.SparkKMeans 
master@master:5050 hdfs://master:9000/user/LijieXu/Kmeans/Square-10GB.txt 8 2.0

The Kmeans algorithm is executed iteratively.

160 tasks in total (160 * 64MB = 10GB).

32 CPU cores and 18.9GB of memory were used.

Memory consumption per machine is 4.5GB (40GB in total): the points data itself accounts for 10GB*2, and the post-Map intermediate data (K, V) => (int, (vector, 1)) for roughly 10GB.

Final result:

0.505246194 s

Final centers: Map(5 -> (13.997101228817169, 9.208875044622895, -2.494072457488311), 8 -> (-2.33522333047955, 9.128892414676326, 1.7923150585737604), 7 -> (8.658031587043952, 2.162306996983008, 17.670646829079146), 3 -> (11.530154433698268, 0.17834347219956842, 9.224352885937776), 4 -> (12.722903153986868, 8.812883284216143, 0.6564509961064319), 1 -> (6.458644369071984, 11.345681702383024, 7.041924994173552), 6 -> (12.887793408866614, -1.5189406469928937, 9.526393664105957), 2 -> (2.3345459304412164, 2.0173098597285533, 1.4772489989976143))


For reference, the time to scan 10GB at different throughputs:

50MB/s 10GB => 3.5min
10MB/s 10GB => 15min

Testing on 20GB of data

Run the test command:

root@master:/opt/spark# ./run spark.examples.SparkKMeans master@master:5050
 hdfs://master:9000/user/LijieXu/Kmeans/Square-20GB.txt 8 2.0 | tee mylogs/sqaure-20GB-kmeans.log

Clustering result:

Final centers: Map(5 -> (-0.47785701742763115, -1.5901830956323306, 
-0.18453046159033773), 
8 -> (1.1073911553593858, 9.051671594514225, -0.44722211311446924), 
7 -> (1.4960397239284795, 10.173412443492643, -1.7932911100570954), 
3 -> (-1.4771114031182642, 9.046878176063172, -2.4747981387714444), 
4 -> (-0.2796747780312184, 0.06910629855122015, 10.268115903887612),
 1 -> (10.467618592186486, -1.168580362309453, -1.0462842137817263), 
6 -> (0.7569895433952736, 0.8615441990490469, 9.552726007309518), 
2 -> (10.807948500515304, -0.5368803187391366, 0.04258123037074164))

These are essentially the 8 center points.

Memory consumption: about 5.8GB per node, roughly 50GB in total.

Memory analysis:

20GB of raw data, 20GB of Map output.

12/06/05 11:11:08 INFO spark.CacheTracker: Looking for RDD partition 2:302

12/06/05 11:11:08 INFO spark.CacheTracker: Found partition in cache!

Testing on 20GB of data (with more iterations):

root@master:/opt/spark# ./run spark.examples.SparkKMeans master@master:5050
 hdfs://master:9000/user/LijieXu/Kmeans/Square-20GB.txt 8 0.8

Number of tasks: 320

Elapsed time:

Effect of the number of iterations on memory footprint:

Essentially none; the main memory consumption is the 20GB input-data RDD plus 20GB of intermediate data.

Final centers: Map(5 -> (-4.728089224526789E-5, 3.17334874733142E-5, -2.0605806380414582E-4), 
8 -> (1.1841686358289191E-4, 10.000062966002101, 9.999933240005394), 7 -> (9.999976672588097, 
10.000199556926772, -2.0695123602840933E-4), 
3 -> (-1.3506815993198176E-4, 9.999948270638338, 2.328148782609023E-5),
 4 -> (3.2493629851483764E-4, -7.892413981250518E-5, 10.00002515017671), 1 -> (10.00004313126956, 7.431996896171192E-6, 
7.590402882208648E-5), 6 -> (9.999982611661382, 10.000144597573051, 10.000037734639696), 
2 -> (9.999958673426654, -1.1917651103354863E-4, 9.99990217533504))

Result visualization

2. HdfsTest

Test logic:

package spark.examples

import spark._

object HdfsTest {
  def main(args: Array[String]) {
    val sc = new SparkContext(args(0), "HdfsTest")
    val file = sc.textFile(args(1))
    val mapped = file.map(s => s.length).cache()
    for (iter <- 1 to 10) {
      val start = System.currentTimeMillis()
      for (x <- mapped) { x + 2 }
      // println("Processing: " + x)
      val end = System.currentTimeMillis()
      println("Iteration " + iter + " took " + (end - start) + " ms")
    }
  }
}

First, a text file is read from HDFS and kept in file.

Then the character count of each line of file is computed and kept in the in-memory RDD mapped.

Then each count in mapped is read and 2 is added to it, measuring the time spent on the read + add.

There is only a map, no reduce.

Testing on the 10GB Wikipedia dump

What is actually measured is RDD read performance: the first pass reads from HDFS and fills the cache, while subsequent passes read the cached mapped RDD from memory.

root@master:/opt/spark# ./run spark.examples.HdfsTest master@master:5050 
hdfs://master:9000/user/LijieXu/Wikipedia/txt/enwiki-20110405.txt

Test results:

Iteration 1 took 12900 ms = 12s

Iteration 2 took 388 ms

Iteration 3 took 472 ms

Iteration 4 took 490 ms

Iteration 5 took 459 ms

Iteration 6 took 492 ms

Iteration 7 took 480 ms

Iteration 8 took 501 ms

Iteration 9 took 479 ms

Iteration 10 took 432 ms

Each node consumes 2.7GB of memory (9.4GB * 3 in total).


Testing on 90GB of RandomText data

root@master:/opt/spark# ./run spark.examples.HdfsTest master@master:5050
 hdfs://master:9000/user/LijieXu/RandomText90GB/RandomText90GB

Elapsed time:

Total memory consumption is around 30GB.

Resource consumption on a single node:

3. WordCount test

The program:

import spark.SparkContext
import SparkContext._

object WordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: wordcount <master> <jar>")
      System.exit(1)
    }
    val sp = new SparkContext(args(0), "wordcount", "/opt/spark", List(args(1)))
    val file = sp.textFile("hdfs://master:9000/user/LijieXu/Wikipedia/txt/enwiki-20110405.txt")
    val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://master:9000/user/Output/WikiResult3")
  }
}

Package it as mySpark.jar and upload it to /opt/spark/newProgram on the Master.

Run the program:

root@master:/opt/spark# ./run -cp newProgram/mySpark.jar WordCount master@master:5050 newProgram/mySpark.jar

Mesos automatically copies the jar to the executor nodes and then runs it.

Memory consumption: 10GB input file + 10GB after flatMap + 15GB of intermediate (word, 1) Map output.

Part of the memory usage could not be traced to a particular allocation.

Elapsed time: 50 sec (without sorting).

Hadoop WordCount elapsed time: 120 sec to 140 sec.

Results are unsorted.

Single node:

Hadoop tests

Kmeans

Run the Kmeans job shipped with Mahout:

root@master:/opt/mahout-distribution-0.6# bin/mahout
 org.apache.mahout.clustering.syntheticcontrol.kmeans.Job -Dmapred.reduce.tasks=36
 -i /user/LijieXu/Kmeans/Square-20GB.txt -o output -t1 3 -t2 1.5 -cd 0.8 -k 8 -x 6

Resource consumption on one slave while the job (320 maps, 1 reduce) is running "Canopy Driver running buildClusters over input: output/data":

Completed Jobs

Resource consumption when running Kmeans multiple times on the 10GB and 20GB datasets:

Hadoop WordCount test

Running Spark interactively

Go to /opt/spark on the Master and run:

MASTER=master@master:5050 ./spark-shell

This starts the Mesos-backed Spark shell.

The framework can be seen at master:8080:

Active Frameworks

scala> val file = sc.textFile("hdfs://master:9000/user/LijieXu/Wikipedia/txt/enwiki-20110405.txt")

scala> file.first

scala> val words = file.map(_.split(' ')).filter(_.size < 100) // yields RDD[Array[String]]

scala> words.cache

scala> words.filter(_.contains("Beijing")).count

12/06/06 22:12:33 INFO SparkContext: Job finished in 10.862765819 s

res1: Long = 855

scala> words.filter(_.contains("Beijing")).count

12/06/06 22:12:52 INFO SparkContext: Job finished in 0.71051464 s

res2: Long = 855

scala> words.filter(_.contains("Shanghai")).count

12/06/06 22:13:23 INFO SparkContext: Job finished in 0.667734427 s

res3: Long = 614

scala> words.filter(_.contains("Guangzhou")).count

12/06/06 22:13:42 INFO SparkContext: Job finished in 0.800617719 s

res4: Long = 134

Due to GC problems, very large datasets cannot be cached.
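One possible mitigation, not part of the original experiment: with a newer Spark release (the tests above used a pre-1.0 build on Mesos), an RDD can be persisted in serialized form and allowed to spill to disk, which usually reduces GC pressure compared with a plain cache, at the cost of some deserialization CPU. A sketch, continuing the spark-shell session above:

scala> import org.apache.spark.storage.StorageLevel

scala> val words = file.map(_.split(' ')).filter(_.size < 100)

scala> // store serialized in memory, spilling to disk when it does not fit
scala> words.persist(StorageLevel.MEMORY_AND_DISK_SER)

scala> words.filter(_.contains("Beijing")).count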

   