Introduction to Data Science

This year I'm embracing the big data topic, and I thought that spending part of my summer attending the Introduction to Data Science class on Coursera would be a good way to improve my knowledge of the subject. Although I went through some pain, I did it!

the course

Commerce and research are being transformed by data-driven discovery and prediction. Skills required for data analytics at massive levels – scalable data management on and off the cloud, parallel algorithms, statistical modeling, and proficiency with a complex ecosystem of tools and platforms – span a variety of disciplines and are not easy to obtain through conventional curricula. Tour the basic techniques of data science, including both SQL and NoSQL solutions for massive data management (e.g., MapReduce and contemporaries), algorithms for data mining (e.g., clustering and association rule mining), and basic statistical modeling (e.g., linear and non-linear regression).

the course description

the cons

I want to start by talking about the pain. The video lectures are sometimes too abstract and general; at other times they take a mathematical or statistical background for granted without introducing the concepts they use. In general, the video lectures seem disorganized.

Just to dissuade you if you aren't really motivated: the course is long (eight weeks) and the workload is high and demanding, ten to twelve hours per week. I know, the class covers a lot of topics (for example MapReduce, NoSQL, data visualization, machine learning, graph analytics), and a high workload is generally required in order to get a real sense of data science.

OK, if you got this far, the cons haven't stopped you.

the pros

The contribution of the homework is priceless: you get a feel for many aspects of data science, learn about some new technologies, and use the technologies you already know in a different way (for example, you will multiply matrices using SQL queries).
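To give an idea of the SQL trick, here is a minimal sketch, not the actual homework solution. It assumes each matrix is stored sparsely as (row, column, value) triples; the table names, column names, and the `sparse_matmul` helper are my own choices for illustration, and I use Python's built-in `sqlite3` module so it runs without any setup.

```python
import sqlite3


def sparse_matmul(a, b):
    """Multiply two matrices given as lists of (row, col, value) triples.

    Hypothetical helper: the schema below is an assumption made for this
    sketch, not something prescribed by the course.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE a (row_num INTEGER, col_num INTEGER, value REAL)")
    conn.execute("CREATE TABLE b (row_num INTEGER, col_num INTEGER, value REAL)")
    conn.executemany("INSERT INTO a VALUES (?, ?, ?)", a)
    conn.executemany("INSERT INTO b VALUES (?, ?, ?)", b)

    # Join every cell of A with the cells of B whose row matches A's column,
    # then sum the products for each (row of A, column of B) pair.
    rows = conn.execute(
        """
        SELECT a.row_num, b.col_num, SUM(a.value * b.value)
        FROM a JOIN b ON a.col_num = b.row_num
        GROUP BY a.row_num, b.col_num
        ORDER BY a.row_num, b.col_num
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    # A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]] as triples.
    a = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
    b = [(0, 0, 5), (0, 1, 6), (1, 0, 7), (1, 1, 8)]
    for row in sparse_matmul(a, b):
        print(row)
```

The join plus `GROUP BY` is the whole algorithm: cells that are zero are simply never stored, so they never take part in the join.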

The homework assignments are very challenging: you end up solving practical problems that you would have considered too complex before taking this class. For example, you have to grab some tweets and calculate a sentiment value for each one that tells you whether, and how much, a tweet about a certain topic is positive or negative!
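A very rough sketch of how such a sentiment value could be computed is shown below. It assumes a simple word-list approach in which every known word carries a score and a tweet's sentiment is the sum of its word scores. The tiny `SENTIMENT_SCORES` dictionary and the `tweet_sentiment` function are made up for this example; a real assignment would use a proper sentiment lexicon and real tweets.

```python
import re

# Hypothetical mini-lexicon, invented for this sketch only.
SENTIMENT_SCORES = {
    "good": 2,
    "great": 3,
    "love": 3,
    "bad": -2,
    "awful": -3,
    "hate": -3,
}


def tweet_sentiment(text, scores=SENTIMENT_SCORES):
    """Return the sum of the scores of the known words in a tweet.

    Words missing from the lexicon count as 0, so a positive total means
    a positive tweet and a negative total means a negative one.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return sum(scores.get(word, 0) for word in words)


if __name__ == "__main__":
    for tweet in ["I love this great course", "Awful lectures, I hate them"]:
        print(tweet, "->", tweet_sentiment(tweet))
```

The sign of the result tells you if the tweet is positive or negative, and its magnitude tells you how much.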

Another awesome example is machine learning: the class gives you the opportunity to participate in a Kaggle competition. Kaggle is a great platform where you can compete with other data scientists in a real-world contest to make predictions on a given topic.

I chose a bike sharing competition in which I created and submitted a model to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C., using information such as temperature, season, whether the user is registered with the program or not, whether the day is a holiday or a working day, weather, wind speed, and humidity. Submissions are scored using an error index calculated on a test dataset provided by the platform.
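To make the idea of an error index concrete, here is a small sketch. It assumes the index is the Root Mean Squared Logarithmic Error (RMSLE), a common choice for count forecasts like rental demand; I don't name the exact metric above, so treat this as an illustration rather than the official scoring code. The `rmsle` function name is my own.

```python
import math


def rmsle(predicted, actual):
    """Root Mean Squared Logarithmic Error between two equal-length lists.

    Assumption: this is the error index used for scoring. Lower is better,
    and 0 means a perfect forecast.
    """
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    # log1p(x) = log(1 + x), which keeps hours with zero rentals well defined.
    squared_errors = (
        (math.log1p(p) - math.log1p(a)) ** 2
        for p, a in zip(predicted, actual)
    )
    return math.sqrt(sum(squared_errors) / len(predicted))


if __name__ == "__main__":
    # Made-up hourly rental counts, for illustration only.
    print(rmsle([12, 40, 130], [10, 45, 120]))
```

Because the error is computed on logarithms, being off by 10 bikes on a quiet hour costs far more than being off by 10 bikes on a busy one.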

This is a great example of a challenge: in seven days I studied the basics of machine learning theory, chose the right technology to use (Python? Excel? R?), learned a new language (R), and analyzed and implemented the prediction model. What a beautiful result in such a short time!!
Just a little note about R: it's the best example of how the cost of learning a new language rewards you for all your effort if the language totally matches your target problem. So don't be lazy, and always choose the right tool for the job.

I found machine learning to be a really powerful tool. If you want to enhance your skills in this area (I'll be doing it!), you can follow the Machine Learning course.

at the end

In this course I had the opportunity to learn how important it is to be pragmatic in data science in order to choose the best tool matching the context. This gave me the chance to learn new technologies such as:

  • NumPy, a Python library for working with multi-dimensional arrays
  • pandas, a Python library that makes it easy to work with table-like structures
  • scikit-learn, a Python library for machine learning
  • R, a special-purpose language for statistics, machine learning, and data visualization

The class forum is worthy of a special mention. It's a great way to meet new smart people and exchange information about the course, especially the homework.

If you are curious, you can access the video lectures without joining the course (and without any login).

This blog will talk about big data very often from now on, so stay tuned!
