big data Archives - Yellow Bricks

I read Richard McDougall’s blog post about Project Serengeti. Richard describes how you can deploy a Hadoop cluster in literally 10 minutes using Serengeti. I am not a Hadoop expert, so I am probably the best qualified to test the 10-minute claim.

First of all, download the OVA. I would also suggest downloading the user guide; I needed it for the username / password to log in to the Serengeti VM (which is: serengeti / password). So what do you need to do to deploy a Hadoop cluster in your vSphere environment? This is what I did:

Import the OVA
- I decided to provide a static IP instead of using DHCP, as I don’t have DHCP in my datacenter
Upgrade VMware Tools
- Just right-click the VM and upgrade the tools automatically; it works like a charm. Note that this is not a requirement!
Log in to the console
- ssh serengeti@10.27.51.21
- username/password: serengeti/password
Go to the Serengeti CLI by typing
- serengeti
I don’t run DHCP, so in order for the Hadoop nodes to get an IP address I need to tell Serengeti which IPs to use. I had to remove the default network first and then create a new one
- network list
- network delete --name defaultNetwork
- network add --name defaultNetwork --portGroup "VM Network" --ip 10.27.51.165-200 --dns 10.27.51.122 --gateway 10.27.51.254 --mask 255.255.255.0
Create the Hadoop cluster by running the following command
- cluster create --name myHadoop
You will now see a whole set of new virtual machines being created; once they are up, your Hadoop cluster is ready
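
Before handing a static range to Serengeti, it can be worth sanity-checking it. The following is a minimal Python sketch, not part of Serengeti itself; the helper name `expand_last_octet_range` is made up for illustration, and it assumes that the part after the dash in `10.27.51.165-200` is only the last octet of the end address. It counts the addresses in the pool and checks that they fall inside the subnet implied by the gateway and mask and do not collide with the other addresses used in this walkthrough.

```python
import ipaddress

def expand_last_octet_range(ip_range):
    """Expand 'A.B.C.START-END' into a list of IPv4Address objects.

    Assumption: the part after the dash is only the final octet of the
    end address, which is how the range in this post is written.
    """
    start_text, end_octet = ip_range.split("-")
    prefix = start_text.rsplit(".", 1)[0]
    start = ipaddress.IPv4Address(start_text)
    end = ipaddress.IPv4Address(prefix + "." + end_octet)
    if end < start:
        raise ValueError("range end lies before range start")
    return [ipaddress.IPv4Address(n) for n in range(int(start), int(end) + 1)]

# Values copied from the "network add" command in this post.
pool = expand_last_octet_range("10.27.51.165-200")
subnet = ipaddress.IPv4Interface("10.27.51.254/255.255.255.0").network

# Addresses already in use elsewhere in the walkthrough.
reserved = {
    "gateway": ipaddress.IPv4Address("10.27.51.254"),
    "dns": ipaddress.IPv4Address("10.27.51.122"),
    "serengeti_vm": ipaddress.IPv4Address("10.27.51.21"),
}

# Pool addresses that fall outside the gateway's subnet (should be none).
outside = [ip for ip in pool if ip not in subnet]
# Reserved addresses that also appear in the pool (should be none).
clashes = {name: ip for name, ip in reserved.items() if ip in pool}

print("addresses in pool:", len(pool))
print("outside subnet:", outside)
print("clashes with reserved addresses:", clashes)
```

With the values from this post the pool holds 36 addresses, all inside 10.27.51.0/24, and none of them clash with the gateway, the DNS server or the Serengeti VM.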

How long did that take me? Indeed ~10 minutes… I tested whether the cluster worked as follows: first SSH in to the Hadoop client node and then run the following:

cd /usr/local/share/pig-0.9.2/test/e2e/pig/lib/
hadoop jar hadoop-examples.jar teragen 1000000000 tera-data
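
Keep in mind that the first argument to teragen is a number of rows, not a number of bytes, and TeraGen writes fixed-size 100-byte rows. The small Python sketch below (the helper names are only illustrative) shows how much data the command above asks for before any HDFS replication, and how to pick a smaller row count if you just want a quick smoke test.

```python
ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows

def teragen_bytes(rows):
    """Size in bytes of the data set TeraGen produces for a row count."""
    return rows * ROW_BYTES

def rows_for(target_bytes):
    """Row count to pass to teragen for roughly target_bytes of data."""
    return target_bytes // ROW_BYTES

# The row count used in the command above.
total = teragen_bytes(1_000_000_000)
print(total / 10**9, "GB before replication")

# Row count for a smaller test of roughly 10 GB.
print(rows_for(10 * 10**9), "rows for about 10 GB")
```

So the command in this post generates about 100 GB of data, which explains why the worker nodes get busy.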

Now you should see the worker nodes in vSphere ramping up to 100% CPU and memory utilization. It works for me… So why did I deploy it? Just to see if Richard was right and it would only take me 10 minutes? That wasn’t the reason, of course. I wanted to see how it deployed and how it leveraged vSphere components. Note that this is not a GA release yet. It is version 0.5 and is probably still under heavy development. It is exciting, though, to see which direction we are heading in, and I am looking forward to the integration points with vSphere and (eventually) vCloud Director.

One of the integration points that had my interest was the HA part that is mentioned in Richard’s post. Hortonworks was responsible for that, and apparently they plug in to vSphere HA’s VM and Application Monitoring. I haven’t been able to test that yet, but when I do I will post an update. If you are at the point of testing this yourself, please note (and this is not well documented) that you will have to enable “VM and Application Monitoring” explicitly, as it is not enabled by default on a vSphere HA cluster.

Right click your cluster
Click “Edit Settings”
Click the “VM Monitoring” tab and make sure “VM and Application Monitoring” is turned on.
Also check the settings for the individual VMs; they will need to be set to “include” where applicable

That is it for now… again I am by no means a Hadoop expert and am not going to try to pretend, just exploring and broadening my horizon.

<disclaimer: I am a technical advisor for CloudPhysics>

Today at the New England VMUG, CloudPhysics makes its first official “public appearance”. Yes, some of you have heard the name a couple of times before, and some of you might even know who the brains behind this new start-up are… For those who don’t, let me give a brief introduction.

CloudPhysics was recently founded by John Blumenthal and Irfan Ahmad. Some of you might recognize their names, as they used to work at VMware: John was a Product Manager for storage and Irfan was the person responsible for awesome features like Storage DRS and Storage IO Control. Together with several other brilliant people, including none other than Carl “TPS / DRS” Waldspurger acting as an advisor and consultant, they founded a new company.

So what is CloudPhysics about? CloudPhysics is about big data, about centralized data, about analytics, about modeling data. CloudPhysics is essentially about helping you! How? Well let me try to explain that without revealing too much.

We’ve all monitored and managed environments. Some of you are responsible for 3 hosts and some might be responsible for 80 hosts in different sites and in different companies. We all face several challenges, and in many cases these are similar… How do you find common themes? How do you validate that best practices are applied on all levels in your environment? How do you validate whether your practices are actually used by others, and whether you benefit from them? How do you know if you sized correctly? How do you solve specific problems? Would you benefit from a different storage platform or SSD? All of these are questions or problems you probably face daily, and that is where CloudPhysics aims to come into play.

CloudPhysics will enable you to find common best practices and problems in your environment. CloudPhysics will provide you with guidance; this could be custom, but also generic, for instance through a link to a VMware KB article. They will enable you to compare and explore performance results, find patterns in your environment, see trends, and get meaningful statistics about your environment. Sounds amazing, right? And probably something you wouldn’t mind testing today… The CloudPhysics product will come as a virtual appliance. The data gathered will go up to the cloud and all of the analysis will happen outside of your environment, of course with various degrees of anonymity.

CloudPhysics is constructing an analytics platform for vSphere for the application of collective intelligence to individual, local vSphere environments and users. At the same time the platform is intended to service the needs of consulting companies, customers and the blogging community by providing APIs to enable unique exploration and discovery within the dynamic, changing dataset CloudPhysics continuously generates. Access to this dataset enables them to transform qualitative discussions into quantitative views of vSphere design and operation. CloudPhysics is not seeking to build a community; rather, it exists to empower the engineer and architect in all of us, particularly the commentators and critics essential to the industry.

For those who can’t wait, sign up at www.cloudphysics.com now for announcements and news on the beta. I am excited about CloudPhysics and I hope you all are as well.