OK, you have decided to set up a Hadoop cluster for your business.
The next step is planning the cluster. But Hadoop is a complex stack, and you might have many questions:
- HDFS handles replication and MapReduce creates files… How can I plan my storage needs?
- How do I plan my CPU needs?
- How do I plan my memory needs? Should I consider different needs for some nodes of the cluster?
- I have heard that MapReduce moves its job code to where the data to process is located… What does that imply in terms of network bandwidth?
- To what extent should I take into account, during planning, what the end users will actually process on the cluster?
These are the questions this article tries to make clearer, by providing explanations and formulas to help you best estimate your needs.
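As a taste of the kind of formula this article works toward, here is a minimal back-of-envelope sketch of the storage question from the list above. The replication factor of 3 is the HDFS default; the overhead and reserve fractions are illustrative assumptions, not recommendations:

```python
def raw_storage_needed(data_tb, replication=3, mr_temp_overhead=0.25, non_hdfs_reserve=0.20):
    """Rough raw-disk estimate for an HDFS cluster (all ratios are illustrative)."""
    hdfs_tb = data_tb * replication        # HDFS stores each block `replication` times
    hdfs_tb *= 1 + mr_temp_overhead        # temporary space for MapReduce intermediate files
    return hdfs_tb / (1 - non_hdfs_reserve)  # headroom for OS, logs and non-HDFS data

print(raw_storage_needed(100))  # 100 TB of data -> 468.75 TB of raw disk
```

Even this simple sketch shows why raw capacity ends up several times larger than the data you actually want to store.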