I read Richard McDougall’s blog post about Project Serengeti. Richard describes how you can deploy a Hadoop cluster in literally 10 minutes using Serengeti. I am not a Hadoop expert so I am probably the best qualified to test the 10 minute claim.
First of all, download the OVA. I would also suggest downloading the user guide; I needed it for the username and password to log in to the Serengeti VM (serengeti / password). So what do you need to do to deploy a Hadoop cluster in your vSphere environment? This is what I did:
- Import the OVA
- I decided to provide a static IP instead of using DHCP, as I don’t have DHCP in my datacenter
- Upgrade VMware Tools
- Just right-click the VM and upgrade the tools automatically; it works like a charm. Note that this is not a requirement!
- Log in to the console
- ssh email@example.com
- username/password: serengeti/password
- Go to the Serengeti CLI by typing serengeti
- I don’t run DHCP, so for the Hadoop nodes to get an IP address I need to tell Serengeti which IPs to use. I had to remove the default network first and then create a new one
- network list
- network delete --name defaultNetwork
- network add --name defaultNetwork --portGroup "VM Network" --ip 10.27.51.165-200 --dns 10.27.51.122 --gateway 10.27.51.254 --mask 255.255.255.0
- Create the Hadoop cluster by running the following command
- cluster create --name myHadoop
- Now you will see a whole set of new virtual machines being created, and once they are up your Hadoop cluster is ready
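Since I am handing out static addresses, it is worth checking the pool before feeding it to network add. Below is a minimal Python sketch of that check, not anything Serengeti ships. It assumes that a value like 10.27.51.165-200 means a range over the last octet only (that is how I read it, but treat it as an assumption), and the helper name expand_range is my own invention.

```python
import ipaddress


def expand_range(spec):
    """Expand 'A.B.C.start-end' into a list of IPv4 addresses.

    Assumption: the part after the dash replaces the last octet only.
    """
    base, _, end = spec.partition("-")
    first = ipaddress.IPv4Address(base)
    if not end:
        return [first]
    prefix = base.rsplit(".", 1)[0]
    last = ipaddress.IPv4Address(prefix + "." + end)
    if last < first:
        raise ValueError("range end is below range start: " + spec)
    return [ipaddress.IPv4Address(i) for i in range(int(first), int(last) + 1)]


# Values taken from the network add command above
pool = expand_range("10.27.51.165-200")
gateway = ipaddress.IPv4Address("10.27.51.254")
dns = ipaddress.IPv4Address("10.27.51.122")

# Derive the subnet from the gateway address and the netmask
subnet = ipaddress.IPv4Interface("10.27.51.254/255.255.255.0").network

outside = [a for a in pool if a not in subnet]
clashes = [a for a in pool if a in (gateway, dns)]

print("addresses in pool:", len(pool))
print("outside subnet:", outside)
print("clash with gateway/DNS:", clashes)
```

For my range this gives a pool of 36 addresses, all inside 10.27.51.0/24 and none overlapping the gateway or DNS server, so there is room for a lot more nodes than the cluster needs.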
How long did the deployment take me? Indeed, about 10 minutes… I tested whether the cluster worked as follows: first SSH into the Hadoop client node, then run the following:
- cd /usr/local/share/pig-0.9.2/test/e2e/pig/lib/
- hadoop jar hadoop-examples.jar teragen 1000000000 tera-data
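Before you fire that off, it helps to know how much data you are asking for. TeraGen writes fixed 100-byte rows, so the number on the command line translates directly into bytes. Here is a quick back-of-the-envelope sketch in Python; the function name is mine, and the replication factor of 3 is the stock HDFS default, which I am assuming rather than having checked on a Serengeti-built cluster.

```python
ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows


def teragen_bytes(rows, replication=1):
    """Bytes written to HDFS for a TeraGen run with the given row count."""
    return rows * ROW_BYTES * replication


rows = 1000000000  # the value passed to teragen above

logical = teragen_bytes(rows)
# Assumption: dfs.replication is left at the HDFS default of 3
on_disk = teragen_bytes(rows, replication=3)

print("logical data: %d GB" % (logical // 10**9))
print("on disk with 3x replication: %d GB" % (on_disk // 10**9))
```

So the command above generates roughly 100 GB of data, and about 300 GB on disk if three replicas are kept. If your datastores are small, pass a smaller row count.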
Once teragen is running, you should see the worker nodes in vSphere ramping up to 100% CPU and memory utilization. It works for me… So why did I deploy it? Just to see if Richard was right and it would only take me 10 minutes? That wasn’t the reason, of course. I wanted to see how it deploys and how it leverages vSphere components. Note that this is not a GA release yet: it is version 0.5 and is probably still under heavy development. It is exciting to see which direction we are heading in, though, and I am looking forward to the integration points with vSphere and (eventually) vCloud Director.
One of the integration points that had my interest was the HA part mentioned in Richard’s post. Hortonworks was responsible for that, and apparently they plug in to vSphere HA’s VM and Application Monitoring. I haven’t been able to test that yet, but when I do I will post an update. If you are at the point of testing this yourself, please note (and this is not well documented) that you will have to enable “VM and Application Monitoring” explicitly, as it is not enabled by default on a vSphere HA cluster.
- Right click your cluster
- Click “Edit Settings”
- Click the “VM Monitoring” tab and make sure “VM and Application Monitoring” is turned on.
- Also check the settings for the individual VMs; they will need to be set to “include” where applicable
That is it for now… again, I am by no means a Hadoop expert and am not going to pretend to be one; I am just exploring and broadening my horizons.