hadoop分析

分享到

hadoop分析

作者：kntao ，发布于2012-8-20，来源：博客

hadoop分析之一HDFS元数据解析
hadoop分析之二元数据备份方案的机制
hadoop分析之三org.apache.hadoop.hdfs.server.namenode各个类的功能与角色
hadoop分析之四：关于hadoop namenode的双机热备份方案

hadoop分析之一HDFS元数据解析

1、元数据（Metadata）：维护HDFS文件系统中文件和目录的信息，分为内存元数据和元数据文件两种。NameNode维护整个元数据。

HDFS实现时，没有采用定期导出元数据的方法，而是采用元数据镜像文件（FSImage）+日子文件（edits）的备份机制。

2、Block：文件内容而言。

寻路径流程：

路径信息 bocks[] triplets[]

Client ------------》INode---------------------》BlockInfo --------------------------》DataNode。

INode：文件的基本元素：文件和目录

BlockInfo：文件内容对象

DatanodeDescriptor:具体存储对象。

3 、 FSImage和edits的checkPoint。FSImage有2个状态，分别是FsImage和FsImage.ckpt,后者表示正在checkpoint的过程中，上传后将会修改为FSImage文件，同理edits也有两个状态，edits和edits.new。

4、NameNode format情景分析：

遍历元数据存储目录，提示用户是否格式化？(NameNode.java里format函数）

01.private static boolean format( Configuration conf ,  
02.                                boolean isConfirmationNeeded )  
03.      throws IOException {  
04.    Collection<URI > dirsToFormat = FSNamesystem. getNamespaceDirs(conf );  
05.    Collection<URI > editDirsToFormat =  
06.                 FSNamesystem .getNamespaceEditsDirs (conf );  
07.    for( Iterator< URI> it = dirsToFormat.iterator (); it. hasNext() ;) {  
08.      File curDir = new File (it .next (). getPath()) ;  
09.      if (! curDir. exists())  
10.        continue;  
11.      if (isConfirmationNeeded ) {  
12.        System .err .print ("Re-format filesystem in " + curDir + " ? (Y or N) ");  
13.        if (! (System .in .read () == 'Y')) {  
14.          System .err .println ("Format aborted in " + curDir );  
15.          return true ;  
16.        }  
17.        while(System .in .read () != '\n') ; // discard the enter-key  
18.      }  
19.    }  
20.  
21.    FSNamesystem nsys = new FSNamesystem (new FSImage(dirsToFormat ,  
22.                                         editDirsToFormat ), conf) ;  
23.    nsys.dir.fsImage .format ();  
24.    return false;  
25.  }

创建元数据内存镜像，包括类FSNamesystem实例化对象，类FSDirectory实例化对象，类FSImage对象，类Edits对象。创建FsNameSystem对象主要完成：BlockManager，FSDirectory对象以及初始化成员变量。FSImage对象主要完成对layoutVersion、namespaceID，CTime赋值为0，实例化FSEditLog。在类FSDirectory，创建了HDFS根目录节点rootDir。

01.FSNamesystem( FSImage fsImage, Configuration conf ) throws IOException {  
02.    this. blockManager = new BlockManager (this, conf) ;  
03.    setConfigurationParameters (conf );  
04.    this. dir = new FSDirectory(fsImage , this, conf );  
05.    dtSecretManager = createDelegationTokenSecretManager (conf );  
06.  }  
07.  
08.  FSImage( Collection< URI> fsDirs , Collection< URI> fsEditsDirs )  
09.      throws IOException {  
10.    this() ;  
11.    setStorageDirectories( fsDirs, fsEditsDirs );  
12.  }  
13.  
14. void setStorageDirectories(Collection <URI > fsNameDirs,  
15.                             Collection< URI> fsEditsDirs ) throws IOException {  
16.    this. storageDirs = new ArrayList <StorageDirectory >() ;  
17.    this. removedStorageDirs = new ArrayList <StorageDirectory >() ;  
18.     
19.   // Add all name dirs with appropriate NameNodeDirType  
20.    for (URI dirName : fsNameDirs ) {  
21.      checkSchemeConsistency (dirName );  
22.      boolean isAlsoEdits = false;  
23.      for (URI editsDirName : fsEditsDirs) {  
24.        if (editsDirName .compareTo (dirName ) == 0) {  
25.          isAlsoEdits = true;  
26.          fsEditsDirs .remove (editsDirName );  
27.          break;  
28.        }  
29.      }  
30.      NameNodeDirType dirType = (isAlsoEdits ) ?  
31.                          NameNodeDirType .IMAGE_AND_EDITS :  
32.                          NameNodeDirType .IMAGE ;  
33.      // Add to the list of storage directories, only if the  
34.      // URI is of type file://  
35.      if(dirName .getScheme (). compareTo( JournalType.FILE .name (). toLowerCase())  
36.          == 0){  
37.        this.addStorageDir (new StorageDirectory(new File(dirName. getPath()) ,  
38.            dirType ));  
39.      }  
40.    }  
41.     
42.    // Add edits dirs if they are different from name dirs  
43.    for (URI dirName : fsEditsDirs ) {  
44.      checkSchemeConsistency (dirName );  
45.      // Add to the list of storage directories, only if the  
46.      // URI is of type file://  
47.      if(dirName .getScheme (). compareTo( JournalType.FILE .name (). toLowerCase())  
48.          == 0)  
49.        this.addStorageDir (new StorageDirectory(new File(dirName. getPath()) ,  
50.                    NameNodeDirType .EDITS ));  
51.    }  
52.  }

对内存镜像数据中的数据结构进行初始化：主要有FSImage的format函数完成，layoutVersion：软件所处的版本。namespaceID：在Format时候产生，当data node注册到Name Node后，会获得该NameNode的NameSpaceID，并作为后续与NameNode通讯的身份标识。对于未知身份的Data Node，NameNode拒绝通信。CTime：表示FSimage产生的时间。checkpointTime：表示NameSpace第一次checkpoint的时间。

01.public void format () throws IOException {  
02.   this. layoutVersion = FSConstants .LAYOUT_VERSION ;  
03.   this. namespaceID = newNamespaceID ();  
04.   this. cTime = 0L ;  
05.   this. checkpointTime = FSNamesystem .now ();  
06.   for (Iterator <StorageDirectory > it =  
07.                          dirIterator (); it. hasNext() ;) {  
08.     StorageDirectory sd = it .next ();  
09.     format (sd );  
10.   }  
11. }

对内存镜像写入元数据备份目录。FSImage的format方法会遍历所有的目录进行备份。如果是FSImage的文件目录，则调用saveFSImage保存FSImage，如果是Edits，则调用editLog.createEditLogFile,最后调用sd.write方法创建fstime和VERSION文件。VERSION文件通常最后写入。

01.void format(StorageDirectory sd ) throws IOException {  
02.    sd.clearDirectory (); // create currrent dir  
03.    sd.lock ();  
04.    try {  
05.      saveCurrent (sd );  
06.    } finally {  
07.      sd .unlock ();  
08.    }  
09.    LOG.info ("Storage directory " + sd. getRoot()  
10.             + " has been successfully formatted.");  
11.  }

最后分析一下元数据应用的场景：

1、格式化时。

2、Hadoop启动时。

3、元数据更新操作时。

4、如果NameNode与Secondary NameNode、Backup Node或checkpoint Node配合使用时，会进行checkPoint操作。

hadoop分析之二元数据备份方案的机制

1、NameNode启动加载元数据情景分析

NameNode函数里调用FSNamesystemm读取dfs.namenode.name.dir和dfs.namenode.edits.dir构建FSDirectory。
FSImage类recoverTransitionRead和saveNameSpace分别实现了元数据的检查、加载、内存合并和元数据的持久化存储。
saveNameSpace将元数据写入到磁盘，具体操作步骤：首先将current目录重命名为lastcheckpoint.tmp;然后在创建新的current目录，并保存文件；最后将lastcheckpoint.tmp重命名为privios.checkpoint.
checkPoint的过程：Secondary NameNode会通知nameNode产生一个edit log文件edits.new，之后所有的日志操作写入到edits.new文件中。接下来Secondary NameNode会从namenode下载fsimage和edits文件，进行合并产生新的fsimage.ckpt;然后Secondary会将fsimage.ckpt文件上传到namenode。最后namenode会重命名fsimage.ckpt为fsimage，edtis.new为edits；

2、元数据更新及日志写入情景分析

以mkdir为例:

logSync代码分析：

代码:

01.public void logSync () throws IOException {  
02.ArrayList<EditLogOutputStream > errorStreams = null ;  
03.long syncStart = 0;  
04.  
05.// Fetch the transactionId of this thread.  
06.long mytxid = myTransactionId .get (). txid;  
07.EditLogOutputStream streams[] = null;  
08.boolean sync = false;  
09.try {  
10.synchronized (this) {  
11.assert editStreams. size() > 0 : "no editlog streams" ;  
12.printStatistics (false);  
13.// if somebody is already syncing, then wait  
14.while (mytxid > synctxid && isSyncRunning) {  
15.try {  
16.wait (1000 );  
17.} catch (InterruptedException ie ) {  
18.}  
19.}  
20.//  
21.// If this transaction was already flushed, then nothing to do  
22.//  
23.if (mytxid <= synctxid ) {  
24.numTransactionsBatchedInSync ++;  
25.if (metrics != null) // Metrics is non-null only when used inside name node  
26.metrics .transactionsBatchedInSync .inc ();  
27.return;  
28.}  
29.// now, this thread will do the sync  
30.syncStart = txid ;  
31.isSyncRunning = true;  
32.sync = true;  
33.// swap buffers  
34.for( EditLogOutputStream eStream : editStreams ) {  
35.eStream .setReadyToFlush ();  
36.}  
37.streams =  
38.editStreams .toArray (new EditLogOutputStream[editStreams. size()]) ;  
39.}  
40.// do the sync  
41.long start = FSNamesystem.now();  
42.for (int idx = 0; idx < streams. length; idx++ ) {  
43.EditLogOutputStream eStream = streams [idx ];  
44.try {  
45.eStream .flush ();  
46.} catch (IOException ie ) {  
47.FSNamesystem .LOG .error ("Unable to sync edit log." , ie );  
48.//  
49.// remember the streams that encountered an error.  
50.//  
51.if (errorStreams == null) {  
52.errorStreams = new ArrayList <EditLogOutputStream >( 1) ;  
53.}  
54.errorStreams .add (eStream );  
55.}  
56.}  
57.long elapsed = FSNamesystem.now() - start ;  
58.processIOError (errorStreams , true);  
59.if (metrics != null) // Metrics non-null only when used inside name node  
60.metrics .syncs .inc (elapsed );  
61.} finally {  
62.synchronized (this) {  
63.synctxid = syncStart ;  
64.if (sync ) {  
65.isSyncRunning = false;  
66.}  
67.this.notifyAll ();  
68.}  
69.}  
70.}

3、Backup Node 的checkpoint的过程分析：

01./** 
02.* Create a new checkpoint 
03.*/  
04.void doCheckpoint() throws IOException {  
05.long startTime = FSNamesystem.now ();  
06.NamenodeCommand cmd =  
07.getNamenode().startCheckpoint( backupNode. getRegistration());  
08.CheckpointCommand cpCmd = null;  
09.switch( cmd. getAction()) {  
10.case NamenodeProtocol .ACT_SHUTDOWN :  
11.shutdown() ;  
12.throw new IOException ("Name-node " + backupNode .nnRpcAddress  
13.+ " requested shutdown.");  
14.case NamenodeProtocol .ACT_CHECKPOINT :  
15.cpCmd = (CheckpointCommand )cmd ;  
16.break;  
17.default:  
18.throw new IOException ("Unsupported NamenodeCommand: "+cmd.getAction()) ;  
19.}  
20.  
21.CheckpointSignature sig = cpCmd. getSignature();  
22.assert FSConstants.LAYOUT_VERSION == sig .getLayoutVersion () :  
23."Signature should have current layout version. Expected: "  
24.+ FSConstants.LAYOUT_VERSION + " actual " + sig. getLayoutVersion();  
25.assert !backupNode .isRole (NamenodeRole .CHECKPOINT ) ||  
26.cpCmd. isImageObsolete() : "checkpoint node should always download image.";  
27.backupNode. setCheckpointState(CheckpointStates .UPLOAD_START );  
28.if( cpCmd. isImageObsolete()) {  
29.// First reset storage on disk and memory state  
30.backupNode. resetNamespace();  
31.downloadCheckpoint(sig);  
32.}  
33.  
34.BackupStorage bnImage = getFSImage() ;  
35.bnImage. loadCheckpoint(sig);  
36.sig.validateStorageInfo( bnImage) ;  
37.bnImage. saveCheckpoint();  
38.  
39.if( cpCmd. needToReturnImage())  
40.uploadCheckpoint(sig);  
41.  
42.getNamenode() .endCheckpoint (backupNode .getRegistration (), sig );  
43.  
44.bnImage. convergeJournalSpool();  
45.backupNode. setRegistration(); // keep registration up to date  
46.if( backupNode. isRole( NamenodeRole.CHECKPOINT ))  
47.getFSImage() .getEditLog (). close() ;  
48.LOG. info( "Checkpoint completed in "  
49.+ (FSNamesystem .now() - startTime )/ 1000 + " seconds."  
50.+ " New Image Size: " + bnImage .getFsImageName (). length()) ;  
51.}  
52.}

4、元数据可靠性机制。

配置多个备份路径。NameNode在更新日志或进行Checkpoint的过程，会将元数据放在多个目录下。
对于没一个需要保存的元数据文件，都创建一个输出流，对访问过程中出现的异常输出流进行处理，将其移除。并再合适的时机再次检查移除的数据量是否恢复正常。有效的保证了备份输出流的异常问题。
采用了多种机制来保证元数据的可靠性。例如在checkpoint的过程中，分为几个阶段，通过不同的文件名来标识当前所处的状态。为存储失败后进行恢复提供了可能。

5、元数据的一致性机制。

首先从NameNode启动时，对每个备份目录是否格式化、目录元数据文件名是否正确等进行检查，确保元数据文件间的状态一致性，然后选取最新的加载到内存，这样可以确保HDFS当前状态和最后一次关闭时的状态一致性。
其次，通过异常输出流的处理，可以确保正常输出流数据的一致性。
运用同步机制，确保了输出流一致性问题。

hadoop分析之三org.apache.hadoop.hdfs.server.namenode各个类的功能与角色

以hadoop0.21为例。

NameNode.java: 主要维护文件系统的名字空间和文件的元数据，以下是代码中的说明。

01./********************************************************** 
02. * NameNode serves as both directory namespace manager and 
03. * "inode table" for the Hadoop DFS.  There is a single NameNode 
04. * running in any DFS deployment.  (Well, except when there 
05. * is a second backup/failover NameNode.) 
06. * 
07. * The NameNode controls two critical tables: 
08. *   1)  filename ->blocksequence (namespace) 
09. *   2)  block ->machinelist ("inodes") 
10. * 
11. * The first table is stored on disk and is very precious. 
12. * The second table is rebuilt every time the NameNode comes 
13. * up. 
14. * 
15. * 'NameNode' refers to both this class as well as the 'NameNode server'. 
16. * The 'FSNamesystem' class actually performs most of the filesystem 
17. * management.  The majority of the 'NameNode' class itself is concerned 
18. * with exposing the IPC interface and the http server to the outside world, 
19. * plus some configuration management. 
20. * 
21. * NameNode implements the ClientProtocol interface, which allows 
22. * clients to ask for DFS services.  ClientProtocol is not 
23. * designed for direct use by authors of DFS client code.  End -users 
24. * should instead use the org.apache.nutch.hadoop.fs.FileSystem class. 
25. * 
26. * NameNode also implements the DatanodeProtocol interface, used by 
27. * DataNode programs that actually store DFS data blocks.  These 
28. * methods are invoked repeatedly and automatically by all the 
29. * DataNodes in a DFS deployment. 
30. * 
31. * NameNode also implements the NamenodeProtocol interface, used by 
32. * secondary namenodes or rebalancing processes to get partial namenode's 
33. * state, for example partial blocksMap etc. 
34. **********************************************************/

FSNamesystem.java: 主要维护几个表的信息：维护了文件名与block列表的映射关系；有效的block的集合；block与节点列表的映射关系；节点与block列表的映射关系；更新的heatbeat节点的LRU cache

01./*************************************************** 
02. * FSNamesystem does the actual bookkeeping work for the 
03. * DataNode. 
04. * 
05. * It tracks several important tables. 
06. * 
07. * 1)  valid fsname --> blocklist  (kept on disk, logged) 
08. * 2)  Set of all valid blocks (inverted #1) 
09. * 3)  block --> machinelist (kept in memory, rebuilt dynamically from reports) 
10. * 4)  machine --> blocklist (inverted #2) 
11. * 5)  LRU cache of updated -heartbeat machines 
12. ***************************************************/

INode.java：HDFS将文件和文件目录抽象成INode。

01./** 
02. * We keep an in-memory representation of the file/block hierarchy. 
03. * This is a base INode class containing common fields for file and 
04. * directory inodes. 
05. */

FSImage.java：需要将INode信息持久化到磁盘上FSImage上。

01./** 
02. * FSImage handles checkpointing and logging of the namespace edits. 
03. * 
04. */

FSEditLog.java:写Edits文件

01./** 
02. * FSEditLog maintains a log of the namespace modifications. 
03. * 
04. */

BlockInfo.java:INode主要是所文件和目录信息的，而对于文件的内容来说，这是用block描述的。我们假设一个文件的长度大小为Size,那么从文件的0偏移开始，按照固定大小，顺序对文件划分并编号，划分好的每一块为一个block

01./** 
02. * Internal class for block metadata. 
03. */

DatanodeDescriptor.java:代表的具体的存储对象。

01./************************************************** 
02. * DatanodeDescriptor tracks stats on a given DataNode, 
03. * such as available storage capacity, last update time, etc., 
04. * and maintains a set of blocks stored on the datanode. 
05. * 
06. * This data structure is a data structure that is internal 
07. * to the namenode. It is *not* sent over- the- wire to the Client 
08. * or the Datnodes. Neither is it stored persistently in the 
09. * fsImage. 
10. 
11. **************************************************/

FSDirectory.java: 代表了HDFS中的所有目录和结构属性

01./************************************************* 
02. * FSDirectory stores the filesystem directory state. 
03. * It handles writing/loading values to disk, and logging 
04. * changes as we go. 
05. * 
06. * It keeps the filename->blockset mapping always- current 
07. * and logged to disk. 
08. * 
09. *************************************************/

EditLogOutputStream.java:所有的日志记录都是通过EditLogOutputStream输出，在具体实例化的时候，这一组EditLogOutputStream包含多个EditLogFIleOutputStream和一个EditLogBackupOutputStream

01./** 
02. * A generic abstract class to support journaling of edits logs into 
03. * a persistent storage. 
04. */

EditLogFileOutputStream.java:将日志记录写到edits或edits.new中。

01./** 
02. * An implementation of the abstract class {@link EditLogOutputStream}, which 
03. * stores edits in a local file. 
04. */

EditLogBackupOutputStream.java：将日志通过网络发送到backupnode上。

01./** 
02. * An implementation of the abstract class {@link EditLogOutputStream}, 
03. * which streams edits to a backup node. 
04. * 
05. * @see org.apache.hadoop.hdfs.server.protocol.NamenodeProtocol#journal 
06. * (org.apache.hadoop.hdfs.server.protocol.NamenodeRegistration, 
07. *  int, int, byte[]) 
08. */

BackupNode.java：name Node的backup：升级阶段：Secondary Name Node -》Checkpoint Node(定期保存元数据，定期checkpoint) -》Backup Node(在内存中保持一份和Name Node完全一致的镜像，当元数据发生变化时，其元数据进行更新，可以利用自身的镜像来checkpoint，无需从nameNode下载)-》Standby Node（可以进行热备）

01./** 
02. * BackupNode. 
03. * <p> 
04. * Backup node can play two roles. 
05. * <ol> 
06. * <li>{@link NamenodeRole#CHECKPOINT} node periodically creates checkpoints, 
07. * that is downloads image and edits from the active node, merges them, and 
08. * uploads the new image back to the active. </li> 
09. * <li>{@link NamenodeRole#BACKUP} node keeps its namespace in sync with the 
10. * active node, and periodically creates checkpoints by simply saving the 
11. * namespace image to local disk(s).</li> 
12. * </ol> 
13. */

BackupStorage.java：在Backup Node备份目录下创建jspool，并创建edits.new,将输出流指向edits.new

01./** 
02. * Load checkpoint from local files only if the memory state is empty.
 
03. * Set new checkpoint time received from the name -node. 
 
04. * Move lastcheckpoint.tmp  to previous.checkpoint . 
05. * @throws IOException 
06. */

TransferFsImage.java:负责从name Node去文件。

01./** 
02. * This class provides fetching a specified file from the NameNode. 
03. */

GetImageServlet.java：是httpServlet的子类，处理doGet请求。

01./** 
02. * This class is used in Namesystem's jetty to retrieve a file. 
03. * Typically used by the Secondary NameNode to retrieve image and 
04. * edit file for periodic checkpointing. 
05. */

hadoop分析之四：关于hadoop namenode的双机热备份方案

关于hadoopnamenode的双机热备份方案

1、前言

目前hadoop-0.20.2没有提供name node的备份，只是提供了一个secondary node，尽管它在一定程度上能够保证对name node的备份，但当name node所在的机器出现故障时，secondary node不能提供实时的进行切换，并且可能出现数据丢失的可能性。

我们采用drbd + heartbeat方案实现name node的HA。

采用drbd实现共享存储，采用heartbeat实现心跳监控，所有服务器都配有双网卡，其中一个网卡专门用于建立心跳网络连接。

2、基本配置

2.1、硬件环境

采用VMWare的虚拟机作为测试机，一共三台，其中两台分别提供2个网卡（其中一个用作网络通讯，一个为heartbeat的心跳），和一个空白的大小相同的分区（供drbd使用）。软件环境：RedHat Linux AS 5，hadoop-0.20.2, 大体情况如下图：

2.1、网络配置

2.2.1、修改server1和server3的hosts（相同）文件

vi /etc/hosts
10.10.140.140  server1
10.10.140.117  server2
10.10.140.84   server3
10.10.140.200  servervip
10.0.0.201     server1
10.0.0.203      server3

2.2.2、server1和server3的网络配置如下：

server1的网络配置：

[root@server1 ~]#cat /etc/sysconfig/network-scripts/ifcfg-eth0
# Advanced MicroDevices [AMD] 79c970 [PCnet32 LANCE]
DEVICE=eth0
BOOTPROTO=none
HWADDR=00:0C:29:18:65:F5
ONBOOT=yes
IPADDR=10.10.140.140
NETMASK=255.255.254.0
GATEWAY=10.10.140.1
TYPE=Ethernet
 
[root@server1 ~]#cat /etc/sysconfig/network-scripts/ifcfg-eth1
# Please read/usr/share/doc/initscripts-*/sysconfig.txt
# for thedocumentation of these parameters.
GATEWAY=10.0.0.1
TYPE=Ethernet
DEVICE=eth1
HWADDR=00:0c:29:18:65:ff
BOOTPROTO=none
NETMASK=255.255.255.0
IPADDR=10.0.0.201
ONBOOT=yes
USERCTL=no
IPV6INIT=no
PEERDNS=yes

Server3的网络配置：

[root@server3 ~]#cat /etc/sysconfig/network-scripts/ifcfg-eth0
# Advanced MicroDevices [AMD] 79c970 [PCnet32 LANCE]
DEVICE=eth0
BOOTPROTO=none
HWADDR=00:0C:29:D9:6A:53
ONBOOT=yes
IPADDR=10.10.140.84
NETMASK=255.255.254.0
GATEWAY=10.10.140.1
TYPE=Ethernet
 
[root@server3 ~]#cat /etc/sysconfig/network-scripts/ifcfg-eth1
# Please read/usr/share/doc/initscripts-*/sysconfig.txt
# for thedocumentation of these parameters.
GATEWAY=10.0.0.1
TYPE=Ethernet
DEVICE=eth1
HWADDR=00:0c:29:d9:6a:5d
BOOTPROTO=none
NETMASK=255.255.255.0
IPADDR=10.0.0.203
ONBOOT=yes
USERCTL=no
IPV6INIT=no
PEERDNS=yes

2.2.3、修改主机名

[root@server1 ~]#cat /etc/sysconfig/network
NETWORKING=yes
NETWORKING_IPV6=yes
HOSTNAME=server1
 
[root@server3 ~]#cat /etc/sysconfig/network
NETWORKING=yes
NETWORKING_IPV6=yes
HOSTNAME=server3

2.2.4、关闭防火墙

[root@server1 ~]#chkconfig iptables off
[root@server3 ~]# chkconfig iptables off

3、 DRBD安装与配置

3.1、DRBD的原理

DRBD（DistributedReplicated Block Device）是基于Linux系统下的块复制分发设备。它可以实时的同步远端主机和本地主机之间的数据，类似与Raid1的功能，我们可以将它看作为网络 Raid1。在服务器上部署使用DRBD，可以用它代替共享磁盘阵列的功能，因为数据同时存在于本地和远端的服务器上，当本地服务器出现故障时，可以使用远端服务器上的数据继续工作，如果要实现无间断的服务，可以通过drbd结合另一个开源工具heartbeat，实现服务的无缝接管。DRBD的工作原理如下图：

3.2、安装

下载安装包：wget http://oss.linbit.com/drbd/8.3/drbd-8.3.0.tar.gz，执行以下命令：

tar xvzf drbd-8.3.0.tar.gz
cd drbd-8.3.0
cd drbd 
make clean all 
cd .. 
make tools 
make install
make install-tools

验证安装是否正确：

# insmod drbd/drbd.ko 或者 # modprobe drbd
# lsmod | grep drbd
drbd                 220056  2

显示则安装正确。主要在server1上和和server3上都要安装

3.3、配置

3.3.1、DRBD使用的硬盘分区

server1和server3分区的大小，格式必须相同。并且必须都为空白分区，可以在装系统前预留分区，如果已经安装好的系统，建议使用gparted工具进行分区。

使用方法可以参考：http://hi.baidu.com/migicq/blog/item/5e13f1c5c675ccb68226ac38.html

server1：ip地址为10.10.140.140，drbd的分区为：/dev/sda4

server3：ip地址为10.10.140.84，drbd的分区为：/dev/sda4

3.3.2、主要的配置文件

DRBD运行时，会读取一个配置文件/etc/drbd.conf。这个文件里描述了DRBD设备与硬盘分区的映射关系，和DRBD的一些配置参数。

[root@server1 ~]#vi /etc/drbd.conf 
#是否参加DRBD使用者统计.默认是yes
global {
    usage-count yes;
}
# 设置主备节点同步时的网络速率最大值,单位是字节
common {
  syncer { rate 10M; }
# 一个DRBD设备(即:/dev/drbdX),叫做一个"资源".里面包含一个DRBD设备的主备#节点的相关信息。
resource r0 {
  # 使用协议C.表示收到远程主机的写入确认后,则认为写入完成. 
  protocol C;
  net {
                        # 设置主备机之间通信使用的信息算法.
        cram-hmac-alg sha1;
        shared-secret"FooFunFactory";
        allow-two-primaries;
  }
  syncer {
    rate 10M;
  }
  # 每个主机的说明以"on"开头,后面是主机名.在后面的{}中为这个主机的配置  on server1 {
    device    /dev/drbd0; 
#使用的磁盘分区是/dev/sda4
    disk     /dev/sda4;
# 设置DRBD的监听端口,用于与另一台主机通信
    address   10.10.140.140:7788; 
    flexible-meta-disk  internal;
  }
  on server3 {
    device   /dev/drbd0;
    disk    /dev/sda4;
    address  10.10.140.84:7788;
    meta-disk internal;
  }
}

3.3.3、将drbd.conf文件复制到备机上/etc目录下

[root@server1 ~]#scp /etc/drbd.conf root@server3:/etc/

3.4、DRBD启动

准备启动之前，需要分别在2个主机上的空白分区上创建相应的元数据保存的数据块：

常见之前现将两块空白分区彻底清除数据

分别在两个主机上执行

#dd if=/dev/zero of=/dev/sdbX bs=1M count=128

否则下一步会出现

.........
 
Device size would be truncated,which
 would corrupt data and result in
 'access beyond end of device' errors.
 You need to either
  * use external meta data (recommended)
  * shrink that filesystem first
  * zero out the device (destroy thefilesystem)
 Operation refused.
 ..........

分别在server1和server3上面执行

3.4.1、#drbdadmcreate-md r0 创建元数据

确保成功后，接下来就可以启动drbd进程了(在server01和server02同时启用)：

3.4.2 在server1和server3上分别执行

[root@server01~]# /etc/init.d/drbd start 或servicedrbd start

StartingDRBD resources: [ d(r0) s(r0) n(r0) ].

3.4.3 设置主节点

在server1执行以下命令(第一次)，设置server1为主节点,以后可以用 drbdadmprimary db

#drbdsetup /dev/drbd0 primary –o

3.4.4 查看连接

在第一次启动会同步磁盘的数据。

3.4.5 对空白磁盘进行格式化并mount到文件系统中

此操作只在primary节点上执行。

[root@server1 ~]# mkfs.ext2/dev/drbd0 
mke2fs 1.39 (29-May-2006)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
655360 inodes, 1309232 blocks
65461 blocks (5.00%) reserved forthe super user 
First data block=0
Maximum filesystemblocks=1342177280
40 block groups
32768 blocks per group, 32768fragments per group
16384 inodes per group
Superblock backups stored onblocks: 
        32768, 98304, 163840, 229376, 294912,819200, 884736
Writing inode tables: done                           
Creating journal (32768 blocks):done
Writing superblocks and filesystemaccounting information: done
This filesystem will beautomatically checked every 35 mounts or
180 days, whichever comesfirst.  Use tune2fs -c or -i to override.
[root@server1 ~]# mount /dev/drbd0 /home/share

3.4.6 设置drbd开机时自动启动

chkconfig--level 35 drbd on

3.5、DRBD测试

3.5.1 主备机手动切换

先卸载主机上drbd设备

[root@server1 ~]# umount /dev/drbd0

将server1降为从节点

[root@server1 ~]# drbdadm secondary r0

查询server1的状态

把server3升级为主节点

[root@server3 ~]# drbdadm primary r0

在server3上挂在到drbd设备上

[root@server3 ~]# mount /dev/drbd0 /home/share

查看server3的状态

4、 Heartbeat的安装与配置

4.1 Heartbeat的安装

在server1和server3利用yum安装heartbeat

[root@server1~]# yum install heartbeat

4.2 Heartbeat的配置

配置/etc/ha.d/ha.cf

1、使用下面的命令查找Heartbeat RPM包安装后释放的ha.cf样本配置文件：

rpm -qd heartbeat | grepha.cf

2、使用下面的命令将样本配置文件复制到适当的位置：

cp/usr/share/doc/packages/heartbeat/ha.cf /etc/ha.d/

3、编辑/etc/ha.d/ha.cf文件，取消注释符号或增加以下内容：

udpport 694

#采用ucast方式，使用网卡eth1在主服务器和备用服务器之间发送心跳消息。指定对端ip，即在server1上指定10.0.0.203，在server3上指定10.0.0.201

ucast eth1 10.0.0.203

4、同时，取消keepalive，deadtime和initdead这三行的注释符号：

keepalive 2

deadtime 30

initdead 120

initdead行指出heartbeat守护进程首次启动后应该等待120秒后再启动主服务器上的资源，keepalive行指出心跳消息之间应该间隔多少秒，deadtime行指出备用服务器在由于主服务器出故障而没有收到心跳消息时，应该等待多长时间，Heartbeat可能会发送警告消息指出你设置了不正确的值（例如：你可能设置deadtime的值非常接近keepalive的值以确保一个安全配置）。

5、将下面两行添加到/etc/ha.d/ha.cf文件的末尾：

node server1

node server3

这里填写主、备用服务器的名字（uname -n命令返回的值）

5、去掉以下注释可以查看heartbeat的运行日志，对错误分析有很大帮助

debugfile /var/log/ha-debug

logfile /var/log/ha-log

配置 /etc/ha.d/authkeys

1、使用下面的命令定位样本authkeys文件，并将其复制到适当的位置： rpm -qd heartbeat | grep authkeys

cp/usr/share/doc/packages/heartbeat/authkeys /etc/ha.d

2、编辑/etc/ha.d/authkeys文件，取消下面两行内容前的注释符号：

auth1

1 crc

3、确保authkeys文件只能由root读取：

chmod 600/etc/ha.d/authkeys

4.3 在备用服务器上安装Heartbeat

把配置文件拷贝到备用服务器上

[root@server1 ~]# scp -r/etc/ha.d root@server3:/etc/ha.d

4.4 启动Heartbeat

1 在主服务器和备用服务器上把heartbeat配置为开机自动启动

chkconfig --level 35 heartbeat on

2 手工启停方法

/etc/init.d/heartbeat start

或者

service heartbeat start

/etc/init.d/heartbeat stop

或者

service heartbeat stop

5、 Hadoop主要配置文件的配置

提示：在启动heartbeat前，应该先formatnamenode在drbd分区中产生元数据。

masters

01.servervip

slaves

server2

core-site.xml

01.<property>  
02. <name>hadoop.tmp.dir</name>  
03. <value>/home/share/hadoopdata/</value>  
04. <description>A base for other temporary directories.</description>  
05.</property>  
06.<property>  
07. <name>fs.default.name</name>  
08. <value>hdfs://servervip:9000</value>  
09. <description>The name of the default file system.  A URI whose  
10.  schemeand authority determine the FileSystem implementation.  The  
11.  uri'sscheme determines the config property (fs.SCHEME.impl) naming  
12.  theFileSystem implementation class.  Theuri's authority is used to  
13. determine the host, port, etc. for a filesystem.</description>  
14.</property  
15.<property>  
16. <name>fs.checkpoint.dir</name>  
17. <value>${hadoop.tmp.dir}/dfs/namesecondary</value>  
18. <description>Determines where on the local filesystem the DFSsecondary  
19.      namenode should store the temporary images to merge.  
20.      Ifthis is a comma-delimited list of directories then the image is  
21.     replicated in all of the directories for redundancy.  
22. </description>  
23.</property  
24.<property>  
25. <name>fs.checkpoint.edits.dir</name>  
26. <value>${fs.checkpoint.dir}</value>  
27. <description>Determines where on the local filesystem the DFSsecondary  
28.      namenode should store the temporary edits to merge.  
29.      Ifthis is a comma-delimited list of directoires then teh edits is  
30.     replicated in all of the directoires for redundancy.  
31.     Default value is same as fs.checkpoint.dir  
32. </description>  
33.</property>

hdfs-site.xml

01.<property>  
02.  <name>dfs.name.dir</name>  
03.  <value>${hadoop.tmp.dir}/dfs/name</value>  
04.  <description>Determines where on the local filesystem the DFS  
05.   namenode should store the name table(fsimage). If this is a  
06.   comma-delimitedlist of directories then the name table is  
07.  replicated in all of the directories, for  
08.  redundancy.</description>  
09.</property>  
10.     <property>  
11.  <name>dfs.name.edits.dir</name>  
12.  <value>${dfs.name.dir}</value>  
13.  <description>Determines where on the local filesystem the DFS  
14.   namenode should store the transaction (edits) file. If this is  
15.   acomma-delimited list of directories then the transaction file  
16.   isreplicated in all of the directories, for redundancy.  
17.  Default value is same as dfs.name.dir</description>  
18.</property>

mapred-site.xml

01.<property>  
02. <name>mapred.job.tracker</name>  
03. <value>servervip:9001</value>  
04. <description>The host and port that the MapReduce job tracker runs  
05.  at.  If "local", then jobs are run in-processas a single map  
06.  andreduce task.  
07. </description>  
08.</property>

6、通过haresource配置自动切换

如果不使用heartbeat的情况下，DRBD只能手工切换主从关系，现在修改heartbeat的配置文件，使DRBD可以通过heartbeat自动切换。

6.1 创建资源脚本

1、新建脚本hadoop-hdfs，用于启停hdfs文件系统，同理也可以建脚本hadoop-all,hadoop-jobtracker等资源文件，以hdfs为例内容如下：

[root@server1 conf]# cat/etc/ha.d/resource.d/hadoop-hdfs

01.cd /etc/ha.d/resource.d  
02.vi hadoop-hdfs  
03.#!/bin/sh  
04.case "$1" in  
05.start)  
06.# Start commands go here  
07.cd /home/hadoop-0.20.2/bin  
08.msg=`su - root -c "sh/home/hadoop-0.20.2/bin/start-dfs.sh"`  
09.logger $msg  
10.;;  
11.stop)  
12.# Stop commands go here  
13.cd /home/hadoop-0.20.2/bin  
14.msg=`su - root -c "sh/home/hadoop-0.20.2/bin/stop-dfs.sh"`  
15.logger $msg  
16.;;  
17.status)  
18.# Status commands go here  
19.;;

2、修改权限

[root@server1 conf]# chmod755 /etc/ha.d/resource.d/hadoop-hdfs

3、把脚本拷贝到备份机并同样修改权限

[root@server1 conf]# scp/etc/ha.d/resource.d/hadoop-hdfs server3: /etc/ha.d/resource.d/

6.2 配置haresources

[root@server1 conf]# cat /etc/ha.d/haresources

server1 IPaddr::10.10.140.200 drbddisk::r0 Filesystem::/dev/drbd0::/home/share::ext2hadoop-hdfs

注释：

Server1 主服务器名

10.10.140.200 对外服务IP别名

drbddisk::r0 资源drbddisk，参数为r0

Filesystem::/dev/drbd0::/home/share::ext2资源Filesystem，mount设备/dev/drbd0到/home/share目录，类型为ext2

Hadoop-hdfs文件系统资源

7、 DRBD、heartbeat、hadoop联调

7.1创建文件和目录

1、在server1（主节点）上drbd和heartbeat运行着。由于heartbeat启动后，虚拟地址10.10.140.200被分配到主节点上。用命令查看：

用命令cat /proc/drbd查看server1和server3是否通信正常，可以看到server1和server3分别为主从节点。

查看drbd分区是否挂载

2、查看hadoop dfs是否启动，打开：http://10.10.140.200:50070/dfshealth.jsp

3、向hadoop上传文件

创建一个目录并上传一个测试文件，

[root@server1hadoop-0.20.2]# bin/hadoop dfs -mkdir testdir

[root@server1 hadoop-0.20.2]# bin/hadoop dfs-copyFromLocal /home/share/temp2 testdir

查看文件：

7.2 主备机切换

1、在server1上停止heartbeat

[root@server1 /]# service heartbeat stop

Stopping High-Availabilityservices:

[ OK ]

2、可以查看虚拟IP已经切换到server3上了

3、验证server3上查看hadoop文件系统

7.3 主备机再次切换

1、在server1上启动heartbeat

[root@server1 /]# service heartbeatstart

Starting High-Availability services:

2012/07/25_15:03:31 INFO: Resource is stopped

[ OK ]

2、查看虚拟IP已经切换到server1上。

3、验证server1上查看hadoop文件系统

8、其他问题

8.1 split brain问题处理

split brain实际上是指在某种情况下，造成drbd的两个节点断开了连接，都以primary的身份来运行。当drbd某primary节点连接对方节点准备发送信息的时候如果发现对方也是primary状态，那么会会立刻自行断开连接，并认定当前已经发生split brain了，这时候他会在系统日志中记录以下信息：“Split-Brain detected,droppingconnection!”当发生split brain之后，如果查看连接状态，其中至少会有一个是StandAlone状态，另外一个可能也是StandAlone（如果是同时发现split brain状态），也有可能是WFConnection的状态。

1 节点重新启动时，在dmesg中出现错误提示：

drbd0: Split-Brain detected, dropping connection!

drbd0: self055F46EA3829909E:899EC0EBD8690AFD:FEA4014923297FC8:3435CD2BACCECFCB

drbd0: peer 7E18F3FEEA113778:899EC0EBD8690AFC:FEA4014923297FC8:3435CD2BACCECFCB

drbd0: helper command: /sbin/drbdadm split-brain minor-0

drbd0: meta connection shut down by peer.

2在203查看cat/proc/drbd，203运行为StandAlone状态

version: 8.3.0 (api:88/proto:86-89)

GIT-hash: 9ba8b93e24d842f0dd3fb1f9b90e8348ddb95829build by root@ost3, 2008-12-30 17:16:32

0: cs:StandAlone ro:Secondary/Unknownds:UpToDate/DUnknown r---

ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0ua:0 ap:0 ep:1 wo:b oos:664

3在202查看cat /proc/drbd，202运行为StandAlone状态

version: 8.3.0 (api:88/proto:86-89)

GIT-hash:9ba8b93e24d842f0dd3fb1f9b90e8348ddb95829 build by root@ost2, 2008-12-3017:23:44

0: cs:StandAlone ro:Primary/Unknownds:UpToDate/DUnknown r---

ns:0 nr:0 dw:4 dr:21 al:1 bm:0 lo:0pe:0 ua:0 ap:0 ep:1 wo:b oos:68

4 原因分析

由于节点重启导致数据不一致，而配置文件中没有配置自动修复错误的内容，因而导致握手失败，数据无法同步。

split brain有两种解决办法：手动处理和自动处理。

手动处理

1 在203上停止heartbeat

Heartbeat会锁定资源，只有停止后才能释放

/etc/init.d/heartbeat stop

2 在作为secondary的节点上放弃该资源的数据

在ost3上

/sbin/drbdadm -- --discard-my-dataconnect r0

3在作为primary的节点重新连接secondary

在ost2上

/sbin/drbdadm disconnect r0

/sbin/drbdadm connect r0

把ost2设置为主节点

/sbin/drbdadm primary r0

4在203上重新启动heartbeat

/etc/init.d/heartbeat start

5 查看202状态 cat /proc/drbd，显示为Connected，已经恢复了正常。

version: 8.3.0 (api:88/proto:86-89)

GIT-hash: 9ba8b93e24d842f0dd3fb1f9b90e8348ddb95829 build byroot@ost2, 2008-12-30 17:23:44

0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r---

ns:768 nr:0 dw:800 dr:905 al:11 bm:10 lo:0 pe:0 ua:0 ap:0 ep:1wo:b oos:0

6查看203状态 cat/proc/drbd，显示为Connected，已经恢复了正常。

version: 8.3.0 (api:88/proto:86-89)

GIT-hash:9ba8b93e24d842f0dd3fb1f9b90e8348ddb95829 build by root@ost3, 2008-12-3017:16:32

0: cs:Connected ro:Secondary/Primaryds:UpToDate/UpToDate C r---

ns:0 nr:768 dw:768 dr:0 al:0 bm:10 lo:0pe:0 ua:0 ap:0 ep:1 wo:b oos:0

自动处理

通过/etc/drbd.conf配置中设置自动处理策略，在发生数据不一致时自动处理。自动处理策略定义如下：

1 after-sb-0pri.

当两个节点的状态都是secondary时，可以通过after-sb-0pri策略自动恢复。

1）disconnect

默认策略，没有自动恢复，简单的断开连接。

2）discard-younger-primary

在split brain发生前从主节点自动同步。

3）discard-older-primary

在split brain发生时从变成primary的节点同步数据。

4）discard-least-changes

在split brain发生时从块最多的节点同步数据。

5）discard-node-NODENAME

自动同步到名字节点

2 after-sb-1pri

当两个节点的状态只有一个是primary时，可以通过after-sb-1pri策略自动恢复。

1）disconnect

默认策略，没有自动恢复，简单的断开连接。

2）consensus

丢弃secondary或者简单的断开连接。

3）discard-secondary

丢弃secondary数据。

4）call-pri-lost-after-sb

按照after-sb-0pri的策略执行。

3 after-sb-2pri

当两个节点的状态都是primary时，可以通过after-sb-2pri策略自动恢复。

1）disconnect

默认策略，没有自动恢复，简单的断开连接。

2）violently-as0p

按照after-sb-0pri的策略执行。

3）call-pri-lost-after-sb

按照after-sb-0pri的策略执行，并丢弃其他节点。

4 配置自动恢复

编辑/etc/drbd.conf，找到resource r0部分，配置策略如下，所有节点完全一致。

#after-sb-0pri disconnect;

after-sb-0pri discard-younger-primary;

#after-sb-1pri disconnect;

after-sb-1pri discard-secondary;

#after-sb-2pri disconnect;

after-sb-2pri call-pri-lost-after-sb;

参考资料：Hadoop_HDFS系统双机热备方案.pdf

DRBD安装配置(主从模式)--详细步骤图文并茂.doc

企业架构、TOGAF与ArchiMate概览

架构师之路-如何做好业务建模？

大型网站电商网站架构案例和技术架构的示例

完整的Archimate视点指南（包括示例）