Deploying Hadoop with Serengeti…

I read Richard McDougall’s blog post about Project Serengeti, in which he describes how you can deploy a Hadoop cluster in literally 10 minutes using Serengeti. I am not a Hadoop expert, so I am probably the best qualified to test the 10-minute claim.

First of all, download the OVA. I would also suggest downloading the user guide; I needed it for the username/password to log in to the Serengeti VM (which is: serengeti / password). So what do you need to do to deploy a Hadoop cluster in your vSphere environment? This is what I did:

  • Import the OVA
    • I decided to provide a static IP instead of using DHCP, as I don’t have DHCP in my datacenter
  • Upgrade VMware Tools
    • Just right-click the VM and upgrade the tools automatically; works like a charm. Note that this is not a requirement!
  • Login to the console
    • ssh serengeti@
    • username/password: serengeti/password
  • Go to the Serengeti CLI by typing
    • serengeti
  • I don’t run DHCP, so in order for the Hadoop nodes to get an IP address I needed to tell Serengeti which IPs to use. I had to remove the default network first and create a new one
    • network list
    • network delete --name defaultNetwork
    • network add --name defaultNetwork --portGroup "VM Network" --ip --dns --gateway --mask
  • Create the Hadoop cluster by running the following commands
    • cluster create --name myHadoop
  • Now you will see a whole set of new virtual machines being created, and once they are up your Hadoop cluster is ready
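Put together, the network and cluster steps above look roughly like this in the Serengeti CLI. Note that the IP addresses, the range syntax for `--ip`, and the portgroup name below are hypothetical lab values I am using for illustration, not values from the walkthrough; check the Serengeti user guide for the exact flag format in your build.

```shell
# Enter the Serengeti CLI from the appliance shell
serengeti

# Replace the default (DHCP) network with a static IP pool.
# All addresses below are example values for a hypothetical lab;
# substitute your own portgroup, range, DNS, gateway, and netmask.
network delete --name defaultNetwork
network add --name defaultNetwork --portGroup "VM Network" \
  --ip 192.168.1.50-192.168.1.80 --dns 192.168.1.1 \
  --gateway 192.168.1.254 --mask 255.255.255.0

# Deploy the cluster; Serengeti clones the node VMs from its template
cluster create --name myHadoop
```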

How long did that take me? Indeed, ~10 minutes… I tested whether the cluster worked as follows: first SSH in to the Hadoop client node, then run the following:

  • cd /usr/local/share/pig-0.9.2/test/e2e/pig/lib/
  • hadoop jar hadoop-examples.jar teragen 1000000000 tera-data
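As a follow-up sanity check, you can list the generated output in HDFS and run the companion terasort job from the same examples jar. These are standard Hadoop example commands, not steps from the walkthrough above; I use a much smaller row count here so the sort finishes quickly.

```shell
# Confirm teragen actually wrote data to HDFS
# (tera-data sits under the current user's HDFS home directory)
hadoop fs -ls tera-data

# Generate a smaller data set and sort it with the companion
# terasort example job, then inspect the sorted output
hadoop jar hadoop-examples.jar teragen 1000000 tera-small
hadoop jar hadoop-examples.jar terasort tera-small tera-sorted
hadoop fs -ls tera-sorted
```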

Now you should see the worker nodes in vSphere ramping up to 100% CPU and memory utilization. It worked for me… So why did I deploy it? Just to see if Richard was right and it would only take me 10 minutes? That wasn’t the reason, of course. I wanted to see how it deployed and how it leverages vSphere components. Note that this is not a GA release yet; it is version 0.5 and probably still under heavy development. It is exciting to see which direction we are heading in, though, and I am looking forward to the integration points with vSphere and (eventually) vCloud Director.

One of the integration points that had my interest is the HA part mentioned in Richard’s post. Hortonworks was responsible for that, and apparently they plug in to vSphere HA’s VM and Application Monitoring. I haven’t been able to test that yet, but when I do I will update you on it. If you get to the point of testing this yourself, please note (and this is not well documented) that you will have to enable “VM and Application Monitoring” explicitly, as this is not enabled by default on a vSphere HA cluster:

  • Right click your cluster
  • Click “Edit Settings”
  • Click the “VM Monitoring” tab and make sure “VM and Application Monitoring” is turned on.
  • Also check the settings for the individual VMs; they will need to be set to “Include” where applicable

That is it for now… Again, I am by no means a Hadoop expert and am not going to pretend to be one; I am just exploring and broadening my horizon.



    1. says

      Nice, but I’d like to see if a Hadoop chap/chapess can do this in 10 mins.. after all, you’re a bit of a whiz with VMware and most of these actions are VMware related… need to find a hadoop person! :)

    2. Duncan says

      I would hope so… It is one action -> import ovf. But then again, I wouldn’t expect a Hadoop person to manage a vSphere environment :-)

    3. says

      Mmmm good point – so who would do the serengeti install then? The VMware guy or the Hadoop guy?

      How about making this available via vCD?

    4. Duncan says

      According to the Serengeti Google group, vCD would be next. I would expect the vSphere admin to deploy the vApp and then provide the IP details to the Hadoop’er

    5. Ed says

      Hi Duncan,

      Great stuff. I am a bit stumped by an issue trying to deploy a cluster with the Big Data Extension. I see that Serengeti provisions the correct IPs from the defined range for the cloned VMs, but the VMs never actually get that IP assigned, thus failing the cluster creation. The udev rules file (70-persistent-network…) is removed from the template, the ifcfg-eth… files do not contain HWADDR or UUID info, and the Serengeti snapshot is removed from the template each time after changes. The eth.. interfaces *do* get created after VM power-on – they just don’t get the IP address…

      Any ideas, pointers, where to look to get this going? I am running out of ideas and have exceeded *well* beyond my 10 minutes by now! 😉

      Many thanks,