Tuesday, September 30, 2014

A Big Data Solution Running On Top of Rackspace Private Cloud

http://www.cloudypoint.com/openstack/big-data-solution-running-top-rackspace-private-cloud/
Why Rackspace Private Cloud?
The first half of the battle is figuring out if you have big data. The second half is figuring out how you can make use of it all.

Once you determine that you actually have big data to process, why would you want to use Rackspace Private Cloud to process all of it? You could just as easily use a public cloud to run your big data solution.

In the last webinar, I built a narrative around using Rackspace Private Cloud to support your software development lifecycle.

Let's again build a similar narrative for running a big data solution on Rackspace Private Cloud. Similar to the last webinar, there are some key factors as to why you may want to use a private cloud instead of a public cloud.

The Narrative
Your company recently acquired another company that constantly captures a large volume and variety of raw data. The company you acquired has no idea how to take advantage of this raw data. It is quickly determined that this raw data could have a huge return on investment if the proper analyses are performed on it to create actionable information. If even a small sample of this raw data is seen by competing companies, it could put the acquisition at risk.

So, based on that, you have decided to take this raw data and process it on top of a Rackspace Private Cloud environment. What are some of the factors influencing your decision to run your environment within a private cloud instead of a public cloud?

Privacy and Security

First, privacy and security.

Some data is crucial to a business; other data is not. If you truly have big data to process into valuable information, there is a good chance that the raw data and the way you process it are crucial to your business. When using a private cloud to run your big data workloads, you run a much lower risk of your raw data being compromised than you would on a public cloud. Rackspace Private Cloud provides you the privacy and security needed to ensure your data stays within your organization.

Performance

Second, performance.

One of the three “V’s” of big data you heard earlier was volume. When you have a high volume of raw data, you may want to process some of it as fast as possible, because the quicker you process it, the quicker you can make decisions that could positively impact your business. With a Rackspace Private Cloud, you run your workloads on OpenStack Instances that run on your own bare metal servers. There are no other tenants besides you in the environment. Because of this, you do not have to worry about noisy neighbors affecting the performance of your workloads as you would in a public cloud.

More Control

Third, more control.

The raw data you have to process may require different types of server configurations or OpenStack Instance types. Running a Rackspace Private Cloud lets you customize such things as spinning hard drives vs. SSDs in your compute nodes and 1 Gb or 10 Gb networking. It also lets you define your OpenStack Flavors exactly, so you can create OpenStack Instances with the right amount of vCPU, RAM, and storage.
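To make the idea of sizing Flavors concrete, here is a minimal Python sketch. It models a Flavor by the three sizing fields mentioned above (vCPU, RAM, storage) and picks the smallest one that satisfies a workload. The flavor names, sizes, and the `smallest_fit` helper are all hypothetical and exist only for illustration; this does not call any OpenStack API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Flavor:
    """Holds the three sizing fields of a flavor: vCPU, RAM, and disk."""
    name: str
    vcpus: int
    ram_mb: int
    disk_gb: int


# Hypothetical flavors for this example; not a Rackspace or OpenStack default set.
FLAVORS = [
    Flavor("bigdata.worker", vcpus=8, ram_mb=32768, disk_gb=500),
    Flavor("bigdata.master", vcpus=4, ram_mb=16384, disk_gb=100),
    Flavor("bigdata.edge", vcpus=2, ram_mb=4096, disk_gb=40),
]


def smallest_fit(flavors: List[Flavor], vcpus: int, ram_mb: int,
                 disk_gb: int) -> Optional[Flavor]:
    """Return the flavor with the least RAM that still meets every requirement."""
    candidates = [
        f for f in flavors
        if f.vcpus >= vcpus and f.ram_mb >= ram_mb and f.disk_gb >= disk_gb
    ]
    # Ties on RAM are broken by vCPU count, then disk size.
    return min(candidates, key=lambda f: (f.ram_mb, f.vcpus, f.disk_gb),
               default=None)


choice = smallest_fit(FLAVORS, vcpus=4, ram_mb=8192, disk_gb=80)
print(choice.name if choice else "no flavor fits")  # prints "bigdata.master"
```

In a real environment you would create the flavors in OpenStack itself; the sketch only shows the kind of sizing decision that custom Flavors make possible.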

Cost

Fourth, cost.

Big data workloads can run for a very short time, for a very long time, or anything in between. If your workloads are running in a public cloud, your cost continually increases until your workload finishes. If your workloads are running in a private cloud, your cost is flat regardless of whether your workload is running. With a Rackspace Private Cloud, you only pay for the gear you currently have. You can create and run as many OpenStack Instances as you want on it without the cost changing. You only add cost when you add physical servers to your environment.
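The difference between the two cost models can be sketched in a few lines of Python. Every number below (the hourly rate, the instance count, the flat monthly fee) is a made-up placeholder, not Rackspace or public cloud pricing; the point is only the shape of the comparison, where one cost grows with runtime and the other stays flat.

```python
def public_cloud_cost(hours: float, instances: int, rate_per_hour: float) -> float:
    """Usage-based cost: grows with every hour each instance runs."""
    return hours * instances * rate_per_hour


def private_cloud_cost(monthly_fee: float) -> float:
    """Flat cost: the same whether or not any workload is running."""
    return monthly_fee


def break_even_hours(monthly_fee: float, instances: int, rate_per_hour: float) -> float:
    """Hours of runtime per month at which both models cost the same."""
    return monthly_fee / (instances * rate_per_hour)


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
RATE = 0.50          # assumed public cloud price per instance-hour
INSTANCES = 20       # assumed cluster size
MONTHLY_FEE = 5000.0 # assumed flat private cloud cost per month

print(break_even_hours(MONTHLY_FEE, INSTANCES, RATE))  # 500.0
print(public_cloud_cost(720, INSTANCES, RATE))         # 7200.0 for a full 30-day month
print(private_cloud_cost(MONTHLY_FEE))                 # 5000.0
```

Under these assumed numbers, a cluster that runs more than 500 hours a month is cheaper on the flat model, while a short-lived job is cheaper on the usage-based one.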

Big Data Tools
So, with those four factors, and there are always more, you have decided to run your big data solution on a Rackspace Private Cloud.

But, what sort of tools would you be running on your Rackspace Private Cloud to process your big data?

There are too many big data tools to list and talk about, but I will briefly discuss some of the more popular ones.

Analysis Tools

As for big data analysis tools, there is of course Hadoop. Whenever big data is discussed, Hadoop is not far from being mentioned. But, what is Hadoop?

From hadoop.apache.org:

"The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage."
Hadoop was designed with cloud environments and infrastructure in mind.

Another useful tool is MapReduce, which often comes up alongside Hadoop.

From Wikipedia:

"MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster."
Hadoop gives you the ability to run MapReduce jobs across a cluster so you can process your raw data more quickly.
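
To show what the programming model looks like, here is a minimal word count sketch in plain Python. It runs in a single process and does not use Hadoop or any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`) are illustrative. On a real cluster, the map and reduce steps would run in parallel on many machines and the framework would handle the grouping step between them.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data on private cloud", "big data tools"])
print(counts["big"], counts["data"], counts["tools"])  # 2 2 1
```

Because each line is mapped independently and each key is reduced independently, both phases can be split across machines, which is what lets a cluster work through a large data set faster than a single server.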

Database Backends

As for database backends, you have Cassandra and MongoDB.

Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store. It is one of the NoSQL databases. Hadoop has support for using Cassandra as a data storage mechanism.

MongoDB, another NoSQL database, is a document-oriented database that uses JSON-like documents with dynamic schemas. It too has a connector that allows you to use it as a data storage mechanism in Hadoop.
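
As a rough picture of what "JSON-like documents with dynamic schemas" means, the Python sketch below uses plain dictionaries to stand in for documents. It does not use MongoDB or its driver; the sensor data, field names, and the `find` and `with_field` helpers are hypothetical. The thing to notice is that documents in the same collection do not have to share the same fields.

```python
import json
from typing import Any, Dict, List

Document = Dict[str, Any]

# Hypothetical sensor readings; each document carries a different set of fields.
collection: List[Document] = [
    {"_id": 1, "sensor": "a17", "temp_c": 21.5},
    {"_id": 2, "sensor": "b02", "temp_c": 19.0, "humidity": 0.43},
    {"_id": 3, "sensor": "c09", "location": {"lat": 29.42, "lon": -98.49}},
]


def find(docs: List[Document], **criteria: Any) -> List[Document]:
    """Return documents whose top-level fields equal every given criterion."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


def with_field(docs: List[Document], field: str) -> List[Document]:
    """Return documents that contain the given top-level field at all."""
    return [d for d in docs if field in d]


print(json.dumps(find(collection, sensor="b02")))
print([d["_id"] for d in with_field(collection, "temp_c")])  # [1, 2]
```

A relational table would need a schema change to add `humidity` or `location`; here a new field simply appears on the documents that have it.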
