Coursera - Introduction to Data Science

This was my first experience with a MOOC - and it was incredible. Online courses like this are definitely going to be one of the best things to come out of the Internet. Here is my report on one of the courses: Coursera - Introduction to Data Science by Bill Howe of the University of Washington. This is an 8-week course, and the blurb says it needs around 8-10 hours of attention per week and a basic programming background. In reality, it is worth allotting 10-20 hours per week, and having more than just basic programming skills helps free up time to go through the textbook and the papers on the recommended reading lists. The video lectures themselves take around 2 hours per week.

The textbook was Mining of Massive Datasets by Anand Rajaraman and Jeff Ullman, and it is worth reading at least once - even if lack of time necessitates skipping over the difficult parts. Some of the assigned papers were also quite readable - and it is always fun to see the progression of ideas, especially the continuing march of MapReduce techniques towards ACID database concepts and vice versa.

I highly recommend these online courses to everyone interested in keeping in touch with the latest technology or getting an introduction to a new area of interest. They get things done in a short time, and the key is that motivation is provided by the homework assignments as well as the discussion forums associated with the course. And amazingly, this is all free.

The three best things about this course and probably MOOCs in general:

  1. Course syllabus was very good. This course covers Databases, Relational Algebra, MapReduce, Analytics (Statistics, Modeling), and Visualization. It also introduces a bunch of tools, databases, and algorithms, with optional assignments that involve using Pig Latin on Hadoop to mine a 500GB dataset with Amazon Elastic MapReduce. Woo-hoo!
  2. Assignments covered SQL programming, some Python, an optional Amazon Elastic MapReduce task (costs US$5-20 for each student, though Amazon may offer grants for students that cover the full cost), R programming, etc. Other optional assignments included real-life programming projects, and Kaggle competitions with peer assessments.
  3. Discussion forums were quite active. This is one reason why these online courses have a start and end date - it allows students to have good, relevant discussions. And given the many problems getting tools like Python (tokenizing text) and Amazon AWS (painful to set up correctly) to work, having a discussion forum was great.

The problems were minor:

  • It was the first time the course had been offered, so there were many hiccups at the start, and a good percentage of time was spent on the accidental complexities of programming languages and tools. The assignments had to be tweaked to pass the auto-grader scripts, even when that did not seem like the right thing to do.
  • The coursera.org online course framework works fine on a computer, but not so well on a tablet. It would have been nice if the videos played easily, and well, on a tablet; getting this to work was kind of clumsy. Coursera also made the assignments difficult to discover - navigation to locate assignments vs quizzes was clumsy, and it did not distinguish between homework that would be graded and homework that would not be graded (survey-only questions).
  • While Amazon Web Services is quite mature and has mostly good documentation, it still fails in mysterious ways, and does not provide good error messages. A lot of time (and money!) is wasted when this happens. The course discussion forum was invaluable in helping resolve these issues. Still, it was very satisfying to see a 500GB dataset mined in under 40 minutes on 19 Amazon m1.large machines (in their words, using 78 instance hours).

Problems specific to the Introduction to Data Science course:

  • First week was really bad. The Python assignments were too open-ended, and the auto-graders were quite opaque, so a lot of time was wasted trying to match the auto-grader. I spent too much time trying to do a good job of scrubbing Twitter data, but this ended up causing submission failures, and I had to delete code and fall back to simpler tokenization to pass the tests.
    It would have been much better if sample code to tokenize the Twitter stream had been provided, so that issues of case and punctuation characters would not be the focus of the homework. It took over 16 hours to complete the first week of lectures, papers, and exercises.
    That left no time to experiment with the assignments or other data science related projects.
  • Second week was much better. The SQL tests were concrete, and did not take long to complete. More importantly, there was enough time left over (allotting 8-10 hours per week) to experiment and try multiple ways of doing things, and time to read up on the things mentioned in the course - such as cosine similarity.
  • Third week was even better - the assignments clarified how lines were to be broken into tokens (thus allowing more focus and experimentation on the subject matter of this course).
  • The visualization assignment was another bad experience. It required Windows and was not that useful - the lectures were sufficient for this part, and using Tableau for homework was just a waste of time, since most people would never use it again. An assignment in R would have been far better.
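
For anyone wondering what the "simpler tokenization" from the first week and the cosine similarity from the second week look like, here is a minimal Python sketch. It is not the course's sample code or what the auto-grader expected; the function names `tokenize` and `cosine_similarity` and the two sample tweets are hypothetical, and the tokenizer assumes plain English text where lowercasing and dropping punctuation is good enough.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes.

    Assumption: everything else (punctuation, '#', '@', whitespace)
    is treated as a separator and thrown away.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between the term-count vectors of two token lists."""
    counts_a = Counter(tokens_a)
    counts_b = Counter(tokens_b)

    # Dot product only needs the terms that appear in both lists.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)

    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))

    # An empty token list has no direction, so call the similarity 0.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical tweets, made up for illustration.
    tweet_1 = "Loving the new #DataScience course!"
    tweet_2 = "loving the NEW datascience course"
    print(tokenize(tweet_1))
    print(cosine_similarity(tokenize(tweet_1), tokenize(tweet_2)))
```

With this kind of normalization, two tweets that differ only in case and punctuation produce the same tokens, which is the sort of detail that mattered more to the auto-grader than careful scrubbing did.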

Overall, by the end of the course, it was a very positive experience. I would recommend this to anyone looking for an introduction to data science - and for a chance to try out the Amazon Elastic MapReduce platform, participate in a Kaggle competition, and catch up on current happenings in this field.