Talk:Apache Hadoop

This is the talk page for discussing improvements to the Apache Hadoop article.
This is not a forum for general discussion of the article's subject.

Java Low‑importance

This article is within the scope of WikiProject Java, a collaborative effort to improve the coverage of Java on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as Low-importance on the project's importance scale.

Computing Low‑importance

This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as Low-importance on the project's importance scale.

Cleanup

I started taking a crack at cleaning this page up, beginning with the description. Please let me know if this looks okay. If so, I'll add more citations and proceed to fixing the main body. Vinod (talk) 16:12, 28 October 2013 (UTC)

I still have no idea WHAT this thing does!

And I'm in IT! What is the (at least intended) reading audience for this article? If it's anyone other than Hadoop experts then it needs a total rewrite. — Preceding unsigned comment added by 98.23.29.8 (talk) 19:11, 25 January 2013 (UTC)

Haha, my thoughts exactly! This article is a huge pile of buzzwords, none of which is understandable to the general public. A translation from Marketing to English would be much appreciated. 213.112.197.68 (talk) 10:27, 25 August 2013 (UTC)

This is an important part of IT infrastructure that underpins many emerging technologies. Sadly, the whole article is so badly written that many readers are struggling to understand Hadoop from this explanation. Vote: +1 rewrite. Andmark (talk) 11:56, 1 September 2013 (UTC)

Uh... Yes... But... I can clarify that I've read lots of little bits about Hadoop over the last few years, but my proximate cause for reading the article was some corporate presentations about so-called big data. Overall the article has failed to provide me with the kind of insight that I was hoping for, but I admit that part of my framing was that I expected to see more on the position of Amazon vis-à-vis Hadoop, insofar as the company that employs me is mentioned in several places and I had somehow reached the conclusion that Amazon was the big barrier to overcome here... Should I conclude that Amazon is much less relevant to Hadoop than I thought, or that the article has a PoV against Amazon? I sort of agree with the comment about too many buzzwords, too, but mostly I'm just kind of disappointed in the lack of enlightenment after spending the time to read the entire thing fairly carefully. I actually feel I do have a small idea of "WHAT this thing [Hadoop] does", but it wasn't helped or refined by the article. Shanen (talk) 07:45, 9 December 2013 (UTC)

Visit their website; I will too. -- Charles Edwin Shipp (talk) 15:42, 30 May 2014 (UTC)

I have a degree in Computing Science and I still don't know what Hadoop is after reading this article. There needs to be an explanation of what it does, and what it is for, before any talk about its components. If describing a car, I would start with it being a vehicle that is driven by a person, and that cars typically transport between 1 and 7 people. I would not begin by describing it as a framework consisting of an engine, gearbox and body. The article needs to be written as an encyclopedia article. FreeFlow99 (talk) 13:15, 6 June 2014 (UTC)

Another request for a plain explanation of what Hadoop is. The fog arrived for me when "data set" was used where I expected "data". Seriously, this topic is important enough to deserve the attention of a subject matter expert to explain what this magical blend of project, framework, distributed file system, distributed database, distributed operating system, etc. is. I echo the above comments. patsw (talk) 14:55, 19 August 2014 (UTC)

What Hadoop is Not

It is not a cluster. Distributed computing and cluster computing are two very different things.

A 100-computer Hadoop system with 100 CPUs can never have all 100 CPUs working on the data held on one system. Each system with one CPU can have precisely one CPU working on the data on that node. You own 100 CPUs but get the benefit of only the CPUs that happen to be local to the data being worked on. A true cluster (Slurm, MPI, Ganglia, etc.) allows all the CPUs you own to work with whatever data you have, to whatever extent they can access the data and share the computation. In Hadoop, data needing more processor attention must be duplicated as many times as you need processors to work on it. In practice, 10 or more copies of the same data may be needed. If the data is already massive, this duplication can be costly and prohibitive. It is possible that data-aware clustering has obsoleted Hadoop and similar poor man's cluster technologies. Rocks clusters and Beowulf-style clusters with a data-aware Slurm implementation can outperform it at a lower cost and with less duplication of data.

In any case, published Hadoop data from government users reveal that the cost of electricity is often so high that no savings are realized over traditional data warehousing. Scottprovost (talk) 18:52, 1 September 2013 (UTC)

With the new versions of Hadoop applications out, it may be possible to describe the ecosystem more clearly. The Berkeley Data Analytics Stack and a multitude of add-ons and replacements have changed the landscape so drastically that the term Hadoop has come to refer to everything and nothing. An article about this word that pretends to be about a computer software application is a falsity and should be removed. A disambiguation page with over 1,000 links to applications and systems once known as the Hadoop ecosystem would be more appropriate. As fast as the word Hadoop trended to popularity, it has now fallen, with the word Hadoop being synonymous with massive software debacle or administrative failure. Wikipedia now needs a page for the word Hadooped. Scottprovost (talk) 22:22, 10 April 2015 (UTC)

The article already mentions the ecosystem in the intro, so the issue has not been ignored. But perhaps now a new separate article should be created just about the Hadoop Ecosystem? Michaelmalak (talk) 23:59, 10 April 2015 (UTC)

What Hadoop Is

Over the years Hadoop has become many things and applications, most of which can be run without HDFS or even any core Hadoop components. It would be a good addition to this article to provide a list of, and links to, the 40-plus components that have become known as part of, or in themselves, "Hadoop", sometimes referred to as the alphabet soup of the "Hadoop Ecosystem": 2. Apache Pig 3. Apache Hive 4. Apache HCatalog 5. Apache HBase 6. Apache ZooKeeper 7. Apache Oozie 8. Apache Sqoop 9. Apache Flume 10. Apache Mahout ... Scottprovost (talk) 16:58, 15 March 2014 (UTC)

Hadoop is central to some vendors' Big Data solutions (Dell/EMC as an example). Big data implementations provide a way to move and aggregate large amounts of data without redundant bits. This is one example that is far from Apache; it is a use of Hadoop but almost completely out of context of the original implementation. I do not work for EMC but I've had experience with their solutions. For reference (not to be included in the article): https://www.dellemc.com/en-us/storage/unstructured-data-analytics/solutions-use-case.htm?CID=314887&VEN1=sP0BFNI8u%2C268143709895%2C901qz26673%2Cc%2C%2C&VEN2=b&LID=5957906&DGC=ST&ACD=1230921248720564&VEN3=823148740449067458 — Preceding unsigned comment added by 144.160.98.94 (talk) 16:09, 8 June 2018 (UTC)

Yahoo

The article says "On February 19, 2008, Yahoo! launched what it claimed was the world's largest Hadoop production application. The Yahoo! Search Webmap is a Hadoop application that runs on more than 10,000 core Linux cluster and produces data that is now used in every Yahoo! Web search query."

I thought that Bing was powering Yahoo search??? Kitplane01 (talk) 18:54, 26 August 2010 (UTC)

Y! are switching/have switched to Bing for index and search; I don't know what they use those same clusters for now, but as of August they were running a 4000-machine cluster, as mentioned on the Hadoop general mailing list [1]. That cluster has the largest number of machines in a single Hadoop cluster, though it is believed that Facebook have a bigger filestore in a cluster with fewer machines. (Newer servers have more higher-capacity disks in them.) I have a photo of Arun and Owen from Y! running Terasort on one of Y!'s clusters at Apachecon 2009; this includes a screen shot of the laptop as they set the then-record for the petasort benchmark; this might make a good addition to the article.

Yahoo! runs more than 38,000 nodes across its various Hadoop clusters, the largest of which are 4,000 nodes. Even after the Bing switch-over, the clusters are used for analytics, machine-learning, ad targeting, content customization, etc. Yahoo! is still by far the largest user of Hadoop. —Preceding unsigned comment added by 99.23.190.196 (talk) 07:30, 28 September 2010 (UTC)

Untitled

This feels a little too much like promotional literature to me.

I don't think that's the case, but it is just fairly minimal right now. What we need is some information on the underlying architecture, and some discussion of its strengths (it scales) and weaknesses (the name node is a single point of failure, base performance is not great, and it can be tricky to nurture if you don't know how to manage a cluster). Are you volunteering to add these? SteveLoughran (talk) 21:46, 23 June 2008 (UTC)

I agree that it looks more like a marketing brochure than a real Wikipedia entry. 14:00, 30 October 2009 (UTC) — Preceding unsigned comment added by 193.109.175.80 (talk)

Added an architecture section, including coverage of limitations and specifics of the filesystems. Better?

I think this is a good overview of Hadoop ... concise ... relates the project and product well to the Who, What, Where and Why you'd be looking for in an encyclopedia entry. The only thing I'd add is a comparative discussion of other ways similar problems are solved, to anchor context. (FreddyMack (talk) 14:02, 14 April 2009 (UTC))

I would like to know what is involved in implementing it. What sort of limitations are imposed on developers writing data processing code for this system? What sort of techniques can be used to make code more efficient for such a setup? Chillum 03:41, 21 May 2009 (UTC)

Google patents Hadoop?

Excerpt from http://www.theregister.co.uk/2010/02/22/google_mapreduce_patent/

In mid-January, Google won a patent for MapReduce, the distributed data crunching platform that underpins its globe-spanning online infrastructure. And that means there's at least a question mark hanging over Hadoop, the much-hyped open source platform that helps drive Yahoo!, Facebook, Microsoft's Bing, and an ever-expanding array of other web services and back-end business applications.

66.192.121.51 (talk) 17:35, 23 February 2010 (UTC)

Oh yeah? So they want to forbid anyone else from slicing an SQL query across several servers within a cluster? Doesn't make any sense to me... --178.197.236.109 (talk) 12:25, 12 January 2014 (UTC)

Podcast with Hadoop

A recent Software Engineering Radio podcast was about Hadoop:

Episode 157: Hadoop with Philip Zeyliger. Released 2010-03-08. Direct download URL for MP3. Length: 51 minutes 04 seconds.

It could be included in the article, e.g. in External Links, as in the article "Aspect-oriented programming".

--Mortense (talk) 12:00, 9 March 2010 (UTC)

Belatedly done. Ross Fraser (talk)

Hadoop Podcast Focused On All Things Hadoop

http://allthingshadoop.com/podcast

Perhaps this can be added to the main Hadoop page as a resource. —Preceding unsigned comment added by Omniomega (talk • contribs) 04:42, 5 September 2010 (UTC)

What is the problem that Apache Hadoop is trying to solve

I read the article, but was unable to separate out the problem that the system seeks to solve from the implementation details. As far as I can tell, it seems to be useful wherever there is a large quantity of file-based data which can be processed independently from other data, but is expensive to transfer. This seems to read as if the problem is to create an index (hashmap?) that can direct you to an appropriate node to compute on. Is this right? Can someone splice in a section after the lede to aid understanding this? 129.67.86.189 (talk) 11:46, 19 April 2011 (UTC)

Actually, it's solving very different things. E.g. MapReduce is about accessing different clusters containing different data (where a cluster consists of several servers containing the exact same data). So it's basically distributing the SQL query and afterwards asking each server for the result of a different subset, and finally merging the data to create one data set. However, this can be easily done and probably any large-scale DB developer already does it. Finally, I think that Hadoop is great as a distributed file server, but only that, since distributed DB queries can easily be done without Hadoop. Anyway, it's basically a Java query implementation; the question is, do we need it, or shouldn't we just implement our own map-reducing systems? --178.197.236.109 (talk) 12:31, 12 January 2014 (UTC)
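
For readers following this thread, the split/process/merge pattern described above can be sketched in a few lines of plain Python. This is only an illustrative single-process word count, not Hadoop's API: the names map_phase, shuffle, reduce_phase and word_count are hypothetical, and a real job would run the map and reduce steps on many nodes in parallel.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: merge the values for one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle and reduce in sequence (hypothetical driver)."""
    mapped = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node"]))
```

Each map call looks at one record only, which is what lets the work be spread over the machines that hold the data; only the much smaller grouped results have to be moved and merged.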

Hadoop inspired by Google's GFS and MapReduce

The introduction erroneously says that Hadoop inspired Google's MapReduce and GFS. It is the other way around. Sanjay Ghemawat et al. published the GFS paper in 2003 [2], and Jeffrey Dean and Sanjay Ghemawat published the MapReduce paper in 2004 [3]. Hadoop developers have clearly stated that they used these works as inspiration to solve their scalability problems [4] [5]. 96.250.77.130 (talk) 13:44, 1 June 2011 (UTC)

Well spotted! Someone edited the page last week and flipped the credits. Reverted and added another warning to the IP address. SteveLoughran (talk) 20:38, 1 June 2011 (UTC)

Current Hadoop Versions are wrong

The current Hadoop versions rendered in the infobox are wrong. Version 1.0.0 is the current beta version for the 1.0X branch and 0.20.203.X is the current stable version from the 0.20 branch [6]. — Preceding unsigned comment added by Aalexand85 (talk • contribs) 15:07, 2 February 2012 (UTC)

HDFS Not Mountable?

The section on HDFS contains the following paragraph, "Another limitation of HDFS is that it cannot be directly mounted by an existing operating system. Getting data into and out of the HDFS file system, an action that often needs to be performed before and after executing a job, can be inconvenient. A Filesystem in Userspace (FUSE) virtual file system has been developed to address this problem, at least for Linux and some other Unix systems."

That's a pretty big contradiction: with a FUSE-based filesystem for HDFS, it can be mounted by an existing operating system. Also, what's the deal with the phrase "existing operating system"? Is that opposed to an operating system that doesn't even exist? Onlynone (talk) 17:22, 20 April 2012 (UTC)

A version of Microsoft Windows which can mount FUSE-based filesystems would be an example of an operating system that doesn't even exist?
Of course, Linux, and other "UNIX-like" operating systems have been able to use FUSE for quite some time. Because of the involved metadata, direct mounting and accessing it like it was a directory of photos could have detrimental effects much like digging into your favorite relational database with a text/hex editor...

Copyvio?

This website has some of the same content as the article: [7]

Do we think it's someone copying Wikipedia, or could it be a copyvio? Andrew³²⁷ 07:57, 2 April 2013 (UTC)

Data nodes can talk to themselves?

"Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high. " This is wrong!

Jargon and techno-babble

The intro claims that Hadoop provides "reliability and data motion to applications". Data motion? Is this like interpretive dance? A bit ballet? The term "data motion" is undefined elsewhere in WP (thank heaven). It is also the name of a company and product line (that has nothing to do with Hadoop). As well, a previous entry in this talk page draws attention to text in the article where process nodes "talk to each other". Do they do this via Twitter? Or do they use couriers on cyber bikes like in the movie Tron? The intro also refers to "computation-independent computers". Nice to see computers finally moving away from being dependent on computation...

This whole article needs a re-write to avoid sloppy writing, breezy jargon, and dubious techno-babble. Ross Fraser (talk) 22:12, 15 July 2013 (UTC)

A glossary and advisory statement at the beginning of the article would go a long way toward demystifying it. 2601:2:8D00:1E3:E986:AB21:7172:AA44 (talk) 19:40, 22 July 2014 (UTC) John Beale

Stratosphere extends Hadoop

http://stratosphere.eu/

There is no mention of Stratosphere in the article. — Preceding unsigned comment added by 179.234.179.107 (talk) 09:28, 26 March 2014 (UTC)

Unbelievably bad bad bad article

The beginning of this article reads as follows:

"Apache Hadoop is an open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware. "

This is fine, but the crucial word "processing" is one of the vaguest verbs in the English language and requires immediate elaboration.

Unfortunately, the word is nowhere elaborated. As a result, readers are left with the impression that Hadoop does nothing whatsoever.

Instead, all we get is innumerable paragraphs about its underlying architecture.

This is totally unacceptable. Hadoop is above all defined by what it does, not by how it is built. So: if the architecture paragraphs are left in the article, they belong only after a good description of what Hadoop does.

As currently constituted, this article is exactly as if an article about Facebook mentioned in its first sentence that it was "social software" and then, with no elaboration on that description, proceeded to discuss for many, many paragraphs the software architecture of Facebook. That is how utterly ridiculous this article is.

I strongly urge that this article either be fixed immediately to explain what Hadoop does, or that it be removed, lest it give other unknowledgeable editors the wrong idea about what an encyclopedia article should be. Daqu (talk) 16:12, 30 October 2014 (UTC)

OK, I rewrote the introduction. I apologize, though, for the WP:SELFPUBLISH -- I couldn't find any other good source. Michaelmalak (talk) 17:17, 30 October 2014 (UTC)

The criticism that this article contains too many buzzwords is simply not true. It does contain a lot of technical terms (apparently mistaken for buzzwords) that make this a very useful article. I read it without knowing beforehand what Hadoop was, and I now have a clear understanding of what it is and where it fits into Big Data. — Preceding unsigned comment added by 94.193.190.1 (talk) 14:27, 13 May 2015 (UTC)

Probably written originally by an Apache Foundation documentation writer. Worst documentation anywhere, ever. — Preceding unsigned comment added by 208.81.212.222 (talk) 21:13, 24 June 2015 (UTC)

ASF documentation is all open source: contributions to improve it are welcome. One aspect of the Apache docs is that they generally assume some foundation of knowledge of, or interest in, the area. This article can't make so many assumptions about the audience. Even so, it's hard to write it without assuming some level of knowledge. SteveLoughran (talk) 17:00, 4 September 2015 (UTC)