High Availability Hadoop

Johannes Kirschnick, Steve Loughran
June 2010
Making Hadoop highly available
Using an alternative file system – HP IBRIX

Something about me
- I work at HP Labs, Bristol, UK
- Degree in computer science, TU Munich
- Automated Infrastructure Lab
  - Automated, secure, dynamic instantiation and management of cloud computing infrastructure and services
- Personal interest
  - Cloud services
  - Automated service deployment
  - Storage service

What do I want to talk about
- Motivate high availability, introduce the context
- Overview of Hadoop
- Highlight the Hadoop modes of failure operation
- Introduce HP IBRIX
- Performance results
- Summary

Context of this talk
- High availability
  - Continued availability in times of failures
    - Hadoop service
    - Data operated on
- Fault tolerant operation
  - What happens if a node dies
  - Reduce time to restart

Hadoop in a nutshell
Example: Wordcount across a number of documents
[Diagram: a Job runs through the stages Input, Map, Copy, Sort, Reduce and Output. The input is a set of text documents. Each MapTask emits intermediate pairs of the form <Word>,1 (for example "and,1" and "the,1"); each ReduceTask receives the pairs for a word and writes the totals to the output.]

map(name, document) {
  for each word w in document:
    emitIntermediate(w, 1)
}

reduce(key, values ...) {
  count = 0
  for each value v in values:
    count += v
  emit(key, count)
}
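The map and reduce pseudocode on this slide can be tried out on a single machine. The following is a minimal Python sketch, not Hadoop code: the function names (`map_document`, `shuffle`, `reduce_counts`, `word_count`) and the two sample documents are assumptions made for illustration, and the copy and sort stages are collapsed into one in-memory grouping step.

```python
from collections import defaultdict

def map_document(name, document):
    """Map step: emit one (word, 1) pair per word in the document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Copy/sort step: group intermediate values by key, keys in sorted order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reduce_counts(key, values):
    """Reduce step: sum the counts emitted for one word."""
    return key, sum(values)

def word_count(documents):
    """Run map over every document, then reduce each group of values."""
    intermediate = []
    for name, text in documents.items():
        intermediate.extend(map_document(name, text))
    return dict(reduce_counts(k, v) for k, v in shuffle(intermediate).items())

# Hypothetical input documents, far smaller than the texts on the slide.
documents = {"doc1": "the cat and the dog", "doc2": "and the bird"}
counts = word_count(documents)
print(counts)
```

In a real cluster the map calls run in parallel on different nodes and the grouping step moves data across the network, which is why the file system underneath matters for the rest of this talk.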

Hadoop components
- Map Reduce layer
  - Provides the map and reduce programming framework
  - Can break up jobs into tasks
  - Keeps track of execution status
- File system layer
  - Pluggable file system
  - Support for location aware file systems
  - Access through an API layer
  - Default is HDFS (Hadoop Distributed File System)
- HDFS
  - Provides high availability by replicating individual files
  - Consists of a central metadata server, the NameNode
  - And a number of Data Nodes, which store copies of files (or parts of them)

Hadoop operation (with HDFS)
[Diagram, built up over four slides. The Master runs the JobTracker (MapReduce layer) and the NameNode (file system layer). Each Slave Node runs a TaskTracker (MapReduce layer) and a Data Node (file system layer) on top of a local disk. A Job is handed to the Scheduler, which uses location information to assign Tasks to the TaskTrackers.]

Failure scenarios and responses
Failure in Map Reduce components
- TaskTracker
  - Sends heartbeat to JobTracker
  - If unresponsive for x seconds, the JobTracker marks the TaskTracker as dead and stops assigning work to it
  - Scheduler reschedules tasks running on that TaskTracker
- JobTracker
  - No built-in heartbeat mechanism
  - Checkpoints to file system
  - Can be restarted and resumes operation
- Individual Tasks
  - TaskTracker monitors progress
  - Can restart failed Tasks
  - Complex failure handling
    - E.g. skip parts of input data which produce failures
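The TaskTracker heartbeat rule on this slide can be sketched in a few lines. This is a simplified Python illustration, not the JobTracker implementation: the class `HeartbeatMonitor`, its method names and the 10 second timeout are assumptions, and time is passed in explicitly so the example is deterministic.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat per TaskTracker and expires silent ones."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # tracker name -> time of last heartbeat
        self.running = {}    # tracker name -> set of task ids assigned to it

    def heartbeat(self, tracker, now):
        """Record that a tracker reported in at time `now`."""
        self.last_seen[tracker] = now
        self.running.setdefault(tracker, set())

    def assign(self, tracker, task_id):
        """Remember that a task is running on a live tracker."""
        self.running[tracker].add(task_id)

    def expire(self, now):
        """Mark silent trackers as dead and return their tasks for rescheduling."""
        orphaned = []
        for tracker, seen in list(self.last_seen.items()):
            if now - seen > self.timeout_seconds:
                orphaned.extend(sorted(self.running.pop(tracker)))
                del self.last_seen[tracker]
        return orphaned

# Hypothetical timeline: tt2 stops sending heartbeats after t=0.
monitor = HeartbeatMonitor(timeout_seconds=10)
monitor.heartbeat("tt1", now=0)
monitor.heartbeat("tt2", now=0)
monitor.assign("tt1", "map-0")
monitor.assign("tt2", "map-1")
monitor.heartbeat("tt1", now=8)
orphaned = monitor.expire(now=12)
print(orphaned)
```

The returned task ids are what the scheduler would hand to another TaskTracker; a dead tracker no longer appears in `last_seen`, so no new work is assigned to it.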

Failure scenarios and responses (2)
Failure of data storage
- Pluggable file system implementation needs to detect and remedy error scenarios
HDFS
- Failure of Data Node
  - Keeps track of replication count for files (parts of files)
  - Can re-replicate missing pieces
  - Tries to place individual copies physically apart from each other
    - Same rack vs. different racks
- Failure of NameNode
  - Operations are written to logs, which makes a restart possible
  - During restart the file system is in read-only mode
  - A secondary NameNode can periodically read these logs, to reduce the time to become available
  - BUT: if the secondary NameNode takes over, a restart of the whole cluster is needed, since the assigned hostnames have changed
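The Data Node failure handling described on this slide (track the replica count, re-replicate missing pieces, keep copies on different racks) can be illustrated as follows. This is a hedged Python sketch and not the HDFS placement policy: the helper names `under_replicated` and `pick_target`, the node and rack names, and the replication target of 2 are all assumptions.

```python
def under_replicated(block_locations, live_nodes, target):
    """Return the blocks whose number of live replicas is below the target."""
    result = {}
    for block, nodes in block_locations.items():
        live = [n for n in nodes if n in live_nodes]
        if len(live) < target:
            result[block] = live
    return result

def pick_target(live_replicas, live_nodes, rack_of):
    """Choose a node for a new replica, preferring a rack not used so far."""
    used_racks = {rack_of[n] for n in live_replicas}
    candidates = sorted(n for n in live_nodes if n not in live_replicas)
    for node in candidates:
        if rack_of[node] not in used_racks:
            return node
    # Fall back to any free node if every rack already holds a copy.
    return candidates[0] if candidates else None

# Hypothetical cluster: four nodes on two racks, node n3 has just failed.
rack_of = {"n1": "rackA", "n2": "rackA", "n3": "rackB", "n4": "rackB"}
block_locations = {"blk_1": ["n1", "n3"], "blk_2": ["n1", "n2"]}
live_nodes = {"n1", "n2", "n4"}

missing = under_replicated(block_locations, live_nodes, target=2)
new_home = pick_target(missing["blk_1"], live_nodes, rack_of)
print(missing, new_home)
```

Only `blk_1` lost a copy, and the new replica goes to `n4` because it sits on a different rack than the surviving copy on `n1`.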

Availability takeaway
- Map Reduce layer
  - Checkpoints to the persistent file system to resume work
  - TaskTracker
    - Can be restarted
  - JobTracker
    - Can be restarted
- HDFS
  - Single point of failure is the NameNode
  - Restarts can take a long time, depending on the amount of data stored and the number of operations in the log itself
    - In the region of hours

A different file system
HP IBRIX
- Software solution which runs on top of storage configurations
- Fault tolerant, high availability file system
- Segmented file system
  - Disks (LUNs) are treated as segments
  - Segments are managed by segment servers
  - Aggregated into global file system(s)
- File systems provide a single namespace
- Each file system supports up to 16 petabytes

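IBRIX in a nutshell
[Diagram: several clients access the file system via NFS, CIFS or the native client. Below them sit the Fusion Manager and a row of segment servers, each attached to disks. Arrows labelled "increase performance" (along the segment servers) and "increase capacity" (along the disks).]
- No single metadata server / segmented file system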

What does it look like
- Fusion Manager web console
  - Based on command line interface
  - Global management view of the installation
  - Here segments correspond to disks attached to servers

What does it look like (2)
- A client simply mounts the file system via:
  - NFS
  - CIFS / Samba
  - Native client
- Each segment server is automatically a client
- Mount points and exports need to be created first on the Fusion Manager
- Clients access the file system via “normal” file system calls

Fault tolerant
- Supports failover
- Different hardware topology configurations
  - Couplet configuration
  - Best suited for a hybrid of performance and capacity
[Diagram: six servers and three RAID arrays presented as a single namespace.]

Location aware Hadoop on IBRIX

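Hadoop internals – with IBRIX
[Diagram: same layout as the HDFS operation diagram, with a different file system layer. The Master runs the JobTracker (MapReduce layer) and an IBRIX client (file system layer). Each Slave Node runs a TaskTracker and a segment server on top of a local disk. A Job is handed to the Scheduler, which uses location information to assign Tasks to the TaskTrackers.]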
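How a scheduler can use the location information shown in this diagram is sketched below. This is an illustrative Python example under simplifying assumptions, not the Hadoop scheduler: the function `schedule`, the greedy one-task-per-tracker rule, and all node and file names are hypothetical.

```python
def schedule(tasks, file_locations, free_trackers):
    """Assign each task to a free tracker, preferring one that holds the input.

    Returns task -> (tracker, is_local). Greedy: one task per tracker.
    """
    assignments = {}
    free = list(free_trackers)
    for task, input_file in tasks.items():
        if not free:
            break  # no capacity left; remaining tasks wait for the next round
        holders = file_locations.get(input_file, ())
        local = [t for t in free if t in holders]
        chosen = local[0] if local else free[0]
        free.remove(chosen)
        assignments[task] = (chosen, bool(local))
    return assignments

# Hypothetical job: three map tasks, each reading one input file.
tasks = {"map-0": "part-0", "map-1": "part-1", "map-2": "part-2"}
# Location information as reported by the file system layer.
file_locations = {"part-0": ["node2"], "part-1": ["node1"], "part-2": ["node1"]}
free_trackers = ["node1", "node2", "node3"]

plan = schedule(tasks, file_locations, free_trackers)
print(plan)
```

Here `map-0` and `map-1` run next to their data, while `map-2` has to read remotely because the only node holding `part-2` is already busy.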

Performance test
- 1 GB of randomly generated data, spread across 10 input files
  - RandomWriter
- Use HadoopSort to sort the records, measure time spent sorting
  - Includes mapping, sorting and reducing time
- Vary the number of slave nodes
- File access test
  - Actual computation on each TaskTracker is low
  - Governing factors for execution time are
    - Time to read and write files
    - Time to distribute data to the reducers

Performance results
[Chart: execution time (sec)]

Performance results
- Comparable performance to native HDFS system
- For smaller workloads even increased performance, due to no replication
- Can take advantage of location information
- Is dependent on distribution and type of input data
  - Across the segment servers
  - Prefers many smaller files, since they can be distributed better

Further feature comparison

Feature                         | HDFS                                                                     | IBRIX
Single point of failure         | Yes, NameNode                                                            | No
Needs RAID                      | No, replicates                                                           | Yes
Can expose location information | Yes                                                                      | Yes
Individual file replication     | Yes                                                                      | No, only complete file systems
Respond to node failure         | Re-replication, mark as dead                                             | Failover, mark as dead, can fall back
Homogeneous file system         | Yes                                                                      | No, can define tiers
Split files across nodes        | Yes, files are split into chunks which are distributed individually      | Only if a segment is full

Summary
- Hadoop provides a number of failure handling methods
  - Dependent on persistent file system
- IBRIX as alternative file system
  - Not specifically built for Hadoop
  - Lightweight file system plug-in for Hadoop
  - Location aware design enables computation close to the data
- Comparable performance while gaining on fault tolerance
  - Fault tolerant persistence, no single point of failure
  - Reduced storage requirement
  - Storage not exclusive to Hadoop
- Future work
  - Making the JobTracker failure independent
  - Moving into a virtual environment
  - Short-lived Hadoop clusters

IBRIX details
- IBRIX uses iNodes as backend store
  - Extends them by a file-based globally unique identifier
- Each segment server is responsible for a fixed number of iNodes
  - Determined by block size within that segment and overall size
- Example
  - 4 GB segment size, 4 KB block size -> 1,048,576 iNodes (1M)
- Simplified calculation example
  - Where is iNode 1,800,000? Divide by 1M ≈ 1.72 -> lives on segment server 1
- iNodes do not store the data but have a reference to the actual data
- Backend storage for IBRIX is an ext3 file system
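The simplified iNode calculation on this slide can be written out as a short worked example. This is only a sketch of the arithmetic given above, not of how IBRIX resolves iNodes internally: the function names are hypothetical, all segments are assumed to have the same size, and segment servers are assumed to be numbered from 0.

```python
def inodes_per_segment(segment_bytes, block_bytes):
    """Number of iNodes in one segment: segment size divided by block size."""
    return segment_bytes // block_bytes

def segment_for_inode(inode, per_segment):
    """Segment server index (counted from 0) that holds the given iNode."""
    return inode // per_segment

# 4 GB segment with 4 KB blocks, as in the example on the slide.
per_segment = inodes_per_segment(4 * 1024**3, 4 * 1024)
segment = segment_for_inode(1_800_000, per_segment)
print(per_segment, segment)
```

The integer division drops the fractional part of 1,800,000 / 1,048,576, which is why the iNode lands on segment server 1.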

More details
- Based on distributed iNodes
[Diagram: a segment on a disk holds a range of iNodes, from the 1st iNode to the Nth iNode, on top of the local file system.]

Security
- File system respects POSIX-like interface
  - Files belong to user/group and have read/write/execute flags
- Native client
  - Needs to be bound to a Fusion Manager
  - Export control can be enforced
  - Mounting only possible from the Fusion Manager console
- CIFS / Samba
  - Requires Active Directory to translate Windows IDs to Linux IDs
  - Export only a sub path of the file system (e.g. /filesystem/sambadirectory)
- NFS
  - Create exports on segment server
  - Limit clients by IP mask
  - Export only a sub path of the file system (e.g. /filesystem/nfsdirectory)
  - Normal NFS properties (read/write/root squash)

Features
- Multiple logical file systems
  - Select different segments as base for them
- Task manager / policy
  - Rebalancing between different segment servers
  - Tiering of data
    - Some segments could be better/worse than others
    - Move data to/from them based on policy/rule
  - Replicate complete logical file systems
    - Replicate to remote cluster
- Failover
  - Buddy system of two (or more) segment servers (active/active standby)
  - Native clients will automatically fail over
- Growing
  - Segment servers register with Fusion Manager
  - New segments (disks) need to be programmatically discovered
  - Can be added to logical file systems
- Is location aware
  - By nature of design
  - For each file, the segment server(s) where it is stored can be determined

Features (2)
- De-duplication
- Caching
  - On segment server owning a particular file
- Distributed metadata
  - No single point of failure
- Supports snapshotting of whole file systems
  - Creates a new virtual file system
- Policy for storing new files
  - Distribute them randomly across segment servers
  - Or assign them to the “local” segment server
- Separate data network
  - Allows configuring the network interface to use for storage communication
