全部版块 我的主页
论坛 计量经济学与统计论坛 五区 计量经济学与统计软件 winbugs及其他软件专版
701 0
2016-08-07

By Dan Kellett, Director of Data Science, Capital One.

Disclaimer: This is my attempt to explain some of the ‘Big Data’ concepts using basic analogies. There are inevitably nuances my analogy misses.

What is HDFS?

When people talk about ‘Hadoop’ they are usually referring to either the efficient storing or processing of large amounts of data. MapReduce is a framework for efficient processing using a parallel, distributed algorithm (see my previous blog here). The standard approach to reliable, scalable data storage in Hadoop is through the use of HDFS (Hadoop Distributed File System).

Imagine you wanted to find out how much food each type of animal eats in a day. How would you do it?

The gigantic warehouse

One approach would be to buy or rent out a huge warehouse and store some of every type of animal in the world. Then you could study each type one at a time to get the information you needed – presumably starting with aardvarks and finishing with zebras.



The downside of this approach (other than the smell & the risk the lions would eat the antelopes) is that this would take a looooong time and would be very expensive (renting out a huge building for many years would add up).

This is similar to the approach we take with our data at the moment. Huge amounts of information about our customers are available and we are restricted to analyzing it through a fairly narrow pipeline.

Mini-centers

An alternative approach would be to split up the animals and send them to lots of smaller centers where each would could be assessed. Maybe the penguins go to Penzance, the lemurs to Leeds and the oxen to Oxford.



This would mean each center could study their animals and send the information back to a head office to be summarized. This would be much faster and a lot cheaper.

This is an example of parallel processing whereby you split up your data, analyze each part separately and bring it back together at the end.

Disaster!

The key drawback is that you are highly susceptible to failure in one of these mini-centers. What happens if there’s a break-in at the center in Lincoln and all the chickens escape? What if the systems go down in Dundee and all the information on sparrows is lost?



The solution to this is to still use mini-centers but to send animals to multiple centers. For example, you may send some rabbits to Birmingham, some to Edinburgh and some to Cardiff. You are then protected from individual failures and can still carry out the survey quickly.

This is from a very high level what HDFS (Hadoop Distributed File System) does. Data are split up in a way that the overall task is not impacted if an individual node fails. Each node carries out its designated task and then passes the results back to a central node to be aggregated.

When would I use HDFS?

As with any technique related to data science HDFS is one of many approaches you could take to solve a business problem using large amounts of data. The key is being able pick and choose when to take HDFS off the shelf. At a high level: HDFS may help you if your problem is huge but not hard. If you can parallelize your problem, then HDFS coupled with MapReduce should help.


二维码

扫码加我 拉你入群

请注明:姓名-公司-职位

以便审核进群资格,未注明则拒绝

相关推荐
栏目导航
热门文章
推荐文章

说点什么

分享

扫码加好友,拉您进群
各岗位、行业、专业交流群