Thursday, June 25, 2015

How should I start Big Data?

Many companies are recognizing the benefits of Big Data and are starting to use their data more effectively. Companies now have an opportunity to harness the power of real-time information and use analytics to interact with their consumers in real time.
We know the advantages of Big Data: understanding your customers, improving customer loyalty, and gaining a competitive advantage. I have recently started to learn Big Data, and at the beginning I asked this very popular question many times: “How should I start Big Data?”
It is a great question, since there are numerous resources for learning about Big Data and it is difficult to pick one to start with. I therefore decided to write this post to share a summary of what I found.
So how do we start our journey to Big Data? Here are the five tips I recommend to help you get started.

1-Learning the tools and technologies: start with any tool that you have access to, such as Python, SAS, SPSS, SQL, or R (which is available as open source), and try to learn it at a deep and practical level. Once you have that foundation, you will know enough to search for and study related topics and grow your knowledge. Remember that people with in-depth knowledge of one particular tool are preferred over those who know a little bit about everything! So it is strongly recommended to master one tool and a few of its techniques to have a better chance of getting opportunities and delivering on them. For example, you can try Introduction to Data Science from the University of Washington on Coursera. Just remember to plan in advance which tools and technologies you want to learn.

2-Learning the tricks: a supplementary step in mastering a tool is learning its tricks from an experienced colleague in your company or from professional courses. Note that self-study courses and tutorials mostly will not teach you the key secrets and tricks that are crucial for solving real-life problems.

3-Look for an opportunity to apply analytics in your organization. The main difficulty is identifying where to start. If you know the sources of the data and where it is being collected (such as a data repository) for a certain business process, then you have a good chance of using it in your first Big Data scenario. Start by generating simple insights from the data that are not presently captured in the business reports, and create simple metrics that can add tremendous value to the business and that you can show to the top management interested in what you are doing. Remember that most organizations do not perform even the most obvious data analysis.

4-Create a case study of your work and show your analytics to your superiors. If they don’t support you, start a job search at outside companies that need your new skills.

5-Read more and more: it is strongly recommended to join blogs and forums on Big Data, closely follow companies in related domains, and participate in the latest Big Data discussions and events, for example on LinkedIn. This helps you stay aware of how Big Data is being applied in different business applications and functions, and increases your knowledge.
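As an illustration of the "simple metrics" mentioned in tip 3, here is a minimal Python sketch using only the standard library. The order records, field names, and the two functions `revenue_by_region` and `repeat_customer_rate` are all hypothetical; in practice the records would come from whatever data repository your business process already fills.

```python
from collections import defaultdict

# Hypothetical order records, standing in for rows pulled from a
# company data repository (all names and values are made up).
orders = [
    {"customer": "A", "region": "North", "amount": 120.0},
    {"customer": "B", "region": "South", "amount": 80.0},
    {"customer": "A", "region": "North", "amount": 40.0},
    {"customer": "C", "region": "South", "amount": 60.0},
    {"customer": "D", "region": "North", "amount": 30.0},
]


def revenue_by_region(records):
    """Sum the order amounts for each region."""
    totals = defaultdict(float)
    for record in records:
        totals[record["region"]] += record["amount"]
    return dict(totals)


def repeat_customer_rate(records):
    """Return the share of customers who placed more than one order."""
    order_counts = defaultdict(int)
    for record in records:
        order_counts[record["customer"]] += 1
    if not order_counts:
        return 0.0
    repeat_customers = sum(1 for n in order_counts.values() if n > 1)
    return repeat_customers / len(order_counts)


if __name__ == "__main__":
    print(revenue_by_region(orders))
    print(repeat_customer_rate(orders))
```

Metrics like these are deliberately small: the point is to surface something the existing business reports do not show, not to build a full analytics pipeline.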


  1. Thanks for your post.

    Do you have any idea how large a volume of data has to be before we can call it Big Data?

    1. Once I heard that it must be big enough not to fit on your hard disk or in memory.
      Big Data for your smartphone is not necessarily Big Data for a computing server.

    2. Thanks for your reply. I heard that 1 TB is big, but this is a loose definition because it depends on how one encodes the data. Suppose the data fits on the hard disk when compressed but could not be held on it after decompressing! It looks like "Big Data" is a new hype that has not yet been formally defined!

  2. This comment has been removed by the author.

  3. I think the significance of "big data" has nothing to do with storage capabilities. Even if storage is considered one of the many challenges of big data (such as capture, search, sharing, transfer, and analysis complexity, just to name a few), the definition of "big" data is not a function of hardware. So what qualifies as "big" for hardware configuration A (disk, memory, ...) is still considered big data for hardware configuration B. The question is still relevant: is there a metric or a threshold beyond which a data set qualifies as "big"? I think that, size-wise, it is commonly held that 1 TB is big but, as far as I know, there is no official size attribute of big data!

    1. The common definition of big data (that I know of) is: not being able to store the data on the machine. Big data for my cell phone is not big data for my desktop, and big data for my desktop is not big data for my server. Big data leaves the experts unable to analyse it using conventional methods.

    2. In my opinion, big data has nothing to do with the size of data!
      I can either change the encoding of a dataset (compression, low-rank factorization for sparse data) so that it does fit on your device, or I can change the data structure of the existing one on your device so that it cannot fit any more! I think we need a better, concrete definition in terms of mathematics and computer science so that it becomes hardware-independent. Anyway, it is still vague, and it is discussed more as a new attention-grabbing term than addressed by researchers in Machine Learning, Data Mining or Statistics. But in hardware/software engineering, MapReduce, cloud computing, Hadoop, Spark, ... are designed and implemented for this task.