Data sharing with a GFS storage cluster
BOoRFGOnZ posted on 2008-01-29 19:24
Data sharing with a GFS storage cluster
by Mark Hlawatschek and Marc Grimme, ATIX, Munich, Germany
Introduction

Linux server clustering has become an important technique to provide scalable performance and high availability for IT services. These services quite often require that data be shared between servers. In addition, even small companies often have many computers, including desktops and servers, that must share data. Hence, data sharing is a requirement for both small and large companies.

Some services have static data that can easily be split between servers. Using duplication, each server in the cluster hosts a copy of all the data. However, other services use dynamic data that changes rapidly, which is much more difficult to duplicate. For example, databases and file services (based on protocols like SQL, NFS, or CIFS) would have to distribute new information synchronously to all other servers after each write. This would lead to very long response times and an extremely high network load. Another disadvantage is the higher cost of maintaining duplicate copies and the associated increase in system management complexity. What these applications really need is access to a single data store that can be read from and written to by all servers simultaneously.

The use of a file server (network attached storage server) supporting the NFS and CIFS protocols is the traditional approach for this kind of shared data. Linux, of course, offers these popular data sharing protocols, and this solution is suitable for some applications, but a single file server often becomes a performance bottleneck and single point of failure (SPoF) in the complete system.

To overcome the limitations of these traditional approaches to scalable and simplified data sharing, every server in the cluster should have direct access to the storage device, and each server should be able to read and write to the data store simultaneously. The Red Hat Global File System (GFS) is the heart of such a solution: it combines Linux servers and a storage area network (SAN) into a data sharing cluster built on a single shared file system.

GFS internals

The Global File System was created as a 64-bit cluster file system. It enables several servers connected to a storage area network (SAN) to access a common, shared file store at the same time with standard UNIX/POSIX file system semantics.

The development of GFS began in 1995 at the University of Minnesota. At that time, large-scale computing clusters in use at the University were generating huge data sets, which had to be written efficiently to a central storage pool. To solve this problem, Matthew O'Keefe, then a professor at the University of Minnesota, started to develop GFS with a group of students. Since then, Matthew has become Red Hat's Director of Storage Strategy. These efforts over time resulted in the Red Hat GFS cluster file system, which is released under the GPL. At the moment GFS is only available for Linux.

GFS is a journaling file system, and each cluster node is allocated its own journal. Changes to the file system metadata are written to a journal and then to the file system itself, as in other journaling file systems. In case of a node failure, file system consistency can be recovered by replaying the metadata operations. Optionally, both data and metadata can be journaled.
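
To make the recovery idea concrete, here is a minimal Python sketch of the write-ahead pattern: a metadata change is logged in the node's journal before it is applied, so that after a crash the journal can simply be replayed. This is not GFS code; the class and field names are invented for illustration.

# Minimal, hypothetical sketch of per-node metadata journaling with replay.
# This is not GFS code; the structures are simplified for illustration only.

class MetadataJournal:
    """One journal per cluster node: log a metadata change, then apply it."""

    def __init__(self):
        self.journal = []      # pending (logged but not yet retired) changes
        self.metadata = {}     # the committed on-disk metadata state

    def update(self, key, value):
        self.journal.append((key, value))   # 1. write the change to the journal
        self.metadata[key] = value          # 2. apply it to the file system
        self.journal.pop()                  # 3. retire the journal entry

    def replay(self):
        """After a node failure, re-apply any logged but unretired changes."""
        for key, value in self.journal:
            self.metadata[key] = value
        self.journal.clear()

node_journal = MetadataJournal()
node_journal.update("/data/report.txt", {"size": 4096, "blocks": [17, 18]})
node_journal.replay()   # a no-op here; after a crash it restores consistency
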
GFS saves its file system descriptors in inodes that are allocated dynamically (referred to as dynamic inodes, or dinodes). Each dinode occupies a whole file system block (4096 bytes is the standard file system block size in Linux kernels). In a cluster file system, multiple servers access the file system at the same time; pooling multiple dinodes in one block would therefore lead to more competing block accesses and false contention. For space efficiency and fewer disk accesses, file data is stored (stuffed) in the dinode itself if the file is small enough to fit completely inside it. In this case, only one block access is necessary to reach small files. For bigger files, GFS uses a "flat file" structure: all pointers in a dinode have the same depth, and there are only direct, indirect, or double indirect pointers. The tree height grows only as much as necessary to store the file data, as shown in Figure 1.

"Extendible hashing" (ExHash) is used to store the index structure for directories. For every filename, a multi-bit hash is saved as an index into a hash table, and the corresponding pointer in the table points at a "leaf node." Every leaf node can be referenced by multiple pointers. If a hash table leaf node becomes too small to hold its directory entries, the size of the whole hash table is doubled; if a single leaf node is too small, it is split into two leaf nodes of the same size. If there are only a few directory entries, the directory information is saved within the dinode block, just like file data. This data structure lets each directory search be performed in a number of disk accesses proportional to the depth of the extendible hashing tree structure, which is very flat. Even for very large directories with thousands or millions of files, only a small number of disk accesses are required to find a directory entry.
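
The following sketch illustrates the extendible hashing scheme described above, assuming an in-memory toy directory rather than the GFS on-disk format; the hash function, table size, and leaf capacity are arbitrary choices. It shows how doubling the table and splitting an overflowing leaf keep lookups down to a small, constant number of steps.

# Hypothetical sketch of extendible hashing for directory lookups.
# Not the GFS on-disk layout: hash, table size, and leaf capacity are made up.

class Leaf:
    def __init__(self, depth):
        self.depth = depth            # number of hash bits this leaf depends on
        self.entries = {}             # filename -> dinode number

class Directory:
    LEAF_CAPACITY = 4                 # tiny on purpose, to force splits

    def __init__(self):
        self.global_depth = 1
        self.table = [Leaf(1), Leaf(1)]   # hash table pointing at leaf nodes

    def _slot(self, name):
        return hash(name) & ((1 << self.global_depth) - 1)

    def lookup(self, name):
        return self.table[self._slot(name)].entries.get(name)

    def insert(self, name, dinode):
        leaf = self.table[self._slot(name)]
        if name in leaf.entries or len(leaf.entries) < self.LEAF_CAPACITY:
            leaf.entries[name] = dinode
            return
        if leaf.depth == self.global_depth:
            self.table = self.table + self.table   # double the whole hash table
            self.global_depth += 1
        # Split the full leaf into two leaves; half of its pointers move over.
        leaf.depth += 1
        sibling = Leaf(leaf.depth)
        bit = 1 << (leaf.depth - 1)
        for i, l in enumerate(self.table):
            if l is leaf and (i & bit):
                self.table[i] = sibling
        for entry in list(leaf.entries):
            target = self.table[self._slot(entry)]
            if target is not leaf:
                target.entries[entry] = leaf.entries.pop(entry)
        self.insert(name, dinode)     # retry; may split again on skewed hashes

d = Directory()
for i in range(50):
    d.insert("file%d.txt" % i, 1000 + i)
print(d.lookup("file7.txt"))          # -> 1007
print(d.global_depth)                 # grows as the directory fills up
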
The latest version, GFS 6.0, offers new features including file access control lists (ACLs), quota support, direct I/O (to accelerate database performance), and dynamic online enlargement of the file system.

Figure 1: GFS metadata structure

Structure

Figure 2 shows the structure of a typical GFS storage cluster. The GFS file system is mapped onto a pool volume, which is constructed from one or more independent storage units. The servers are connected over a storage area network (SAN), with one or more data paths to the pool volume, and the individual cluster servers are also connected via one or more data paths to the network. Thus every server can directly access the storage arrays onto which the pool volume is mapped, greatly increasing I/O system performance and providing scalability far beyond what can be achieved with a single NAS server.

Figure 2: GFS storage cluster

The servers in the GFS storage cluster use Linux as the operating system. A simple cluster volume manager, the GFS pool layer, virtualizes the storage units (e.g., /dev/sda) and aggregates them into a single logical pool volume (e.g., /dev/pool/foo). Multiple devices can be combined by striping or by concatenation. Changes in the pool configuration are visible to all cluster servers. The pool volume manager allows pool volumes to be resized online and provides I/O multi-pathing, so that single failures in the SAN path can be tolerated. However, the pool volume manager does not provide volume mirroring or snapshots. These capabilities will be provided in the future by CLVM, the Cluster Logical Volume Manager, an LVM2-based cluster volume manager which allows multiple servers to share access to a storage volume on a SAN.
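
As a rough illustration of what such a volume manager does, the sketch below maps a logical block address on a pool volume to a physical device and offset for a concatenated and for a striped layout. It is a simplification, not the pool implementation; the device names, sizes, and stripe width are invented.

# Hypothetical sketch of logical-to-physical block mapping in a pool volume.
# Not the GFS pool code; device names, sizes, and stripe width are examples.

def map_concatenated(block, devices):
    """devices: list of (name, size_in_blocks) laid out one after another."""
    for name, size in devices:
        if block < size:
            return name, block
        block -= size
    raise ValueError("block lies beyond the end of the pool volume")

def map_striped(block, devices, stripe_blocks=8):
    """Stripes of stripe_blocks are spread round-robin across the devices."""
    stripe, offset = divmod(block, stripe_blocks)
    name, _size = devices[stripe % len(devices)]
    return name, (stripe // len(devices)) * stripe_blocks + offset

devices = [("/dev/sda", 1000), ("/dev/sdb", 1000)]
print(map_concatenated(1200, devices))   # ('/dev/sdb', 200)
print(map_striped(1200, devices))        # ('/dev/sda', 600)
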
The lock server coordinates the multiple servers which access the same physical file system blocks in a GFS storage cluster, and it ensures the file system's data consistency. From the beginning, GFS has been provided with a modular locking layer. In early GFS versions, lock information was exchanged over the SCSI protocol (DLOCK, DMEP). Since GFS version 5, a redundant, IP-based user space locking service (RLM), which runs on all nodes, has been used. Red Hat is working to integrate its distributed lock manager (DLM) into GFS 6.1, which will be released in the summer of 2005.

Each server in the GFS cluster must heartbeat the lock server on a regular basis. If a server fails to heartbeat, it is selected by the lock manager to be removed from the cluster, an operation called fencing. GFS supports several fencing mechanisms, including various network power switches and HP's iLO interface.
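
A minimal sketch of this heartbeat-and-fencing logic is shown below. It is purely illustrative, not the RLM implementation; the interval, the missed-heartbeat limit, and the fence action are placeholder assumptions.

# Hypothetical sketch of the heartbeat/fencing decision described above.
# Not the actual lock manager; intervals and the fence action are placeholders.

import time

HEARTBEAT_INTERVAL = 5     # seconds between expected heartbeats (assumed value)
MISSED_LIMIT = 3           # missed heartbeats tolerated before a node is fenced

last_seen = {}             # node name -> timestamp of its last heartbeat

def record_heartbeat(node):
    last_seen[node] = time.time()

def nodes_to_fence(now):
    """Nodes that stopped heartbeating and must be removed from the cluster."""
    deadline = HEARTBEAT_INTERVAL * MISSED_LIMIT
    return [node for node, seen in last_seen.items() if now - seen > deadline]

def fence(node):
    # A real cluster would power-cycle the node through a network power switch
    # or a management interface so it can no longer write to the shared SAN.
    print("fencing %s: cutting it off from shared storage" % node)

record_heartbeat("node1")
record_heartbeat("node2")
for node in nodes_to_fence(now=time.time() + 60):   # pretend a minute passed
    fence(node)   # in this artificial example both nodes appear to have failed
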
Scalability

A classic IT system consists of services and applications that run on individual servers and that are generally limited to running on a particular server. If the hardware to which a particular application is limited is no longer sufficient, the application generally cannot exploit the additional memory, processing power, or storage capacity contained in the rest of the cluster. In contrast, applications that can run in parallel on a storage cluster are much easier to scale. In case of a capacity shortage, new components (servers, storage) can easily be integrated into the system until the required capacity is reached. The common use of a storage pool not only removes the need for laborious duplication of data to multiple servers but also offers elegant scaling possibilities. With growing storage requirements, the common storage pool can be expanded and is immediately available to all servers.

Availability

The availability of the complete system is an important aspect of providing IT services. To achieve Class 3 availability (99% to 99.9% uptime), it is necessary to eliminate every single point of failure (SPoF). For Class 4 availability (99.9% to 99.99% uptime), it is necessary to have a high availability cluster, mirrored data, and a second data center for disaster recovery. The services must be able to run on multiple servers at different locations, and the breakdown of one server or of a whole data center must not prevent access to the services for more than a short time. A GFS cluster can be connected to the central storage system via the SAN through redundant I/O paths to survive the failure of individual infrastructure components like switches, host bus adapters, and cables. I/O multi-pathing can be implemented either by the fibre channel driver for the host bus adapter or by the GFS pool. Unfortunately, the GFS storage cluster is not yet able to mirror file blocks redundantly to multiple storage devices from the host servers, but it can of course take advantage of the hardware redundancy available on good RAID storage arrays. Host-based mirroring in a GFS cluster arrives later in 2005 with the Cluster Logical Volume Manager (CLVM).

The lock server, which is essential for GFS, is available in two versions: a simple version (Single Lock Manager, or SLM), which is a SPoF for the complete system, and a redundant version (Redundant Lock Manager, or RLM). With the RLM it is possible to define multiple lock servers, which can transparently take over the role of an active lock server in case of a failure. In addition, Red Hat Cluster Suite can be used to provide application fail-over in GFS clusters.

LAN-free backup

A data backup is normally done from backup client machines (which are usually production application servers) either over the local area network (LAN) to a dedicated backup server (via products like Legato NetWorker or Veritas NetBackup), or LAN-free from the application server directly to the backup device. Because every connected server using a cluster file system has access to all data and file systems, it is possible to convert a server into a backup server. The backup server can perform a backup during ongoing operations without affecting the application servers. It is also very useful to generate snapshots or clones of GFS volumes using the hardware snapshot capabilities of many storage products. These snapshot volumes can be mounted and backed up by a GFS backup server. To enable this capability, GFS includes a file system quiesce capability to ensure a consistent data state. To quiesce means that all accesses to the file system are halted after a file system sync operation, which ensures that all metadata and data are written to the storage unit in a consistent state before the snapshot is taken.
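
The sequence can be outlined as in the following sketch: quiesce, snapshot, resume, then back up the snapshot from a separate server. The three steps are hypothetical stand-ins for the storage array and GFS tooling, not real commands; only the overall ordering reflects the description above.

# Hypothetical outline of a quiesce -> snapshot -> backup cycle.
# The step functions are stand-ins, not real GFS or storage array commands.

import contextlib

@contextlib.contextmanager
def quiesced(filesystem):
    """Halt file system access around the snapshot so the image is consistent."""
    print("sync and quiesce %s (all writes flushed and held)" % filesystem)
    try:
        yield
    finally:
        print("resume normal access to %s" % filesystem)

def take_hardware_snapshot(volume):
    snapshot = volume + "-snap"
    print("storage array snapshots %s as %s" % (volume, snapshot))
    return snapshot

def backup_from(snapshot):
    print("backup server mounts %s and streams it to tape over the SAN" % snapshot)

filesystem, volume = "/gfs/data", "/dev/pool/foo"
with quiesced(filesystem):                 # applications pause only briefly
    snapshot = take_hardware_snapshot(volume)
backup_from(snapshot)                      # runs while production continues
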
Diskless shared root clustering

As all servers in a GFS storage cluster access their data through a shared storage area network, additional servers can be added to easily scale server capacity. Hence, each server can be viewed as just another resource in the pool of available servers. The system data and operating system images reside on the shared storage, so server and storage can be seen as effectively independent of each other. The result is a diskless shared root cluster in which no server needs a local hard disk; each server can instead boot directly from the SAN. Both application data and the operating system images are shared, which means the root (/) partition is the same for all cluster nodes. As a consequence, management is simplified: changes have to be made only once and are immediately valid for all servers. Constructing shared root disk clusters with GFS is quite hardware- and kernel-version-specific, and this feature should only be deployed with the help of Red Hat professional services or Red Hat partners like ATIX GmbH.

Case study: IP Tech AG

The deployment of GFS at IP Tech, one of the biggest Internet hosting and service providers in Switzerland, demonstrates how effectively Red Hat cluster technologies are already used in enterprises. Since the beginning of 2002, a Red Hat GFS cluster with 14 nodes has been in production at IP Tech. This cluster supports database (MySQL), email (Qmail), and web serving (Apache) applications in special configurations. Over 1,500 MySQL databases, 10,000 mail domains, and 28,000 web domains are hosted for mostly Swiss companies at IP Tech. Current daily operations at IP Tech support about 5-7 million web accesses, 3.0-3.5 million POP3 (email server) connections, 1.0-1.5 million SMTP connections (email relay), and 3.5-4.0 million MySQL connections over this GFS-based infrastructure, and over 100,000 individual email users are supported.

In addition to the classic web, database, and email services, IP Tech recently introduced the hosting of virtual machines and the dedicated allocation of servers in a GFS storage cluster. Customers can now dynamically allocate GFS servers on the fly and run multiple virtual servers on a single GFS node. This is an excellent way to improve system utilization and squeeze the most out of the GFS data sharing cluster infrastructure.

Recently, IP Tech migrated its services to a centralized blade-based infrastructure with two terabytes of redundant shared root storage and about 22 diskless blade servers. All applications except for the virtual machines run on GFS. This configuration minimizes hardware repair times: a failed blade is simply replaced and pointed to boot off the shared root boot image. Server and storage scalability can be achieved during ongoing operation. Additionally, each night the file system data is replicated to a second storage system via a LAN-free backup using GFS and the SAN. Figure 3 illustrates the IP Tech infrastructure.

Figure 3: IP Tech infrastructure

IP Tech initially used NFS for the data sharing requirements of their IP hosting environment, but they had significant problems maintaining it because it was unstable under heavy load. File services and mounts would come and go without warning, stopping operations at critical times, and quite often during high load periods where downtime would have the maximum negative impact. Two years ago, IP Tech migrated to Red Hat GFS and, in contrast to their NFS-based storage infrastructure, the cluster running Red Hat GFS "ran by itself."

By using a Red Hat Enterprise Linux cluster with Red Hat GFS, IP Tech could achieve both high availability and performance. If any of the servers crashed or if an application (e.g., httpd or qmail) hung, the server could be rebooted quickly and brought back into the cluster infrastructure without disrupting the services being provided by the other servers. In addition, IP Tech uses the shared GFS root disk feature, which simplifies the process of updating software and performing static application service load-balancing in the cluster. For example, when IP Tech is hit by a spam attack, they can quickly convert some web servers into mail servers to keep the mail service running while they detect and counter the attack, and all cluster services continue to run. They can also scale the infrastructure within minutes using new servers and storage arrays. Finally, IP Tech performs regular backups on a point-in-time snapshot of the GFS volume using a separate server in the cluster. This approach allows a GFS file system to be backed up with almost no impact on other system operations.

In summary, the key benefits IP Tech found in using Red Hat GFS were:
- stability under heavy load, in contrast to their earlier NFS-based infrastructure
- high availability combined with high performance
- simplified management and fast reconfiguration through the shared GFS root disk
- scaling of servers and storage within minutes during ongoing operation
- LAN-free backups from point-in-time snapshots with almost no impact on running services

About the author

Marc Grimme, academically qualified computer scientist, and Mark Hlawatschek, academically qualified engineer, are two of the three founders of ATIX GmbH. Their main focus lies in the development and implementation of tailor-made enterprise storage solutions based on SAN/NAS. Additionally, they are responsible for the realisation of highly scalable IT infrastructures for enterprise applications on a Linux base.

Copyright © 2008 Red Hat, Inc. All rights reserved.