Machine Learning

Machine Learning Project

Search Engine Project

Yelp runs a program that is popular among AI enthusiasts, called the Yelp Dataset Challenge. In this challenge, Yelp opens up its data for data scientists all over the world to use in innovative ways. I took on the challenge as part of a machine learning course project at Aalto University. Our objective was to predict user votes based on their reviews. We did this by counting the occurrences of selected words in the reviews; the count of each word became a feature for our machine learning models.

We tackled both a classification problem and a regression problem on similar data sets. In the classification problem, we tried to predict the usefulness feedback from users based on their reviews. In the regression problem, we predicted the star rating given by the user based on their review. For classification, we used a decision tree model and logistic regression with some feature selection. For regression, we used linear regression trained with gradient descent. We implemented everything in Python and coded some of the models from scratch.
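To make the regression setup more concrete, here is a minimal sketch of the idea: word counts as features, and linear regression fitted by batch gradient descent. It is not the project code. The vocabulary, the toy reviews, the star ratings, and the function names `word_count_features`, `fit_linear_regression`, and `predict` are all made up for illustration, and the learning rate and epoch count are arbitrary choices that happen to work on this tiny example.

```python
from collections import Counter


def word_count_features(review, vocabulary):
    """Count how often each vocabulary word occurs in a review."""
    counts = Counter(review.lower().split())
    return [float(counts[word]) for word in vocabulary]


def predict(features, weights, bias):
    """Linear model: weighted sum of the features plus a bias term."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def fit_linear_regression(X, y, learning_rate=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on mean squared error."""
    n_samples, n_features = len(X), len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, target in zip(X, y):
            error = predict(features, weights, bias) - target
            # Gradient of the mean squared error for this sample.
            for j in range(n_features):
                grad_w[j] += 2.0 * error * features[j] / n_samples
            grad_b += 2.0 * error / n_samples
        # Step against the gradient.
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


# Hypothetical toy data, not taken from the Yelp data set.
vocabulary = ["great", "terrible"]
reviews = [
    "great food great service",
    "terrible wait",
    "great place",
    "terrible terrible food",
]
stars = [5.0, 2.0, 4.0, 1.0]

X = [word_count_features(review, vocabulary) for review in reviews]
weights, bias = fit_linear_regression(X, stars)
```

On this toy data the positive word ends up with a positive weight and the negative word with a negative one. A real data set would need a much larger vocabulary, and usually feature scaling or a tuned learning rate for gradient descent to converge.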


This was my first attempt at tackling a real-world machine learning problem. Last year I did something loosely related to machine learning in my information retrieval project, but I did not dig deep into machine learning approaches for the search engine. It was very challenging for me to learn the underlying math behind the major machine learning models, because it had been a while since I last touched the hard math subjects like calculus, probability theory, and statistics. I had to relearn everything that I had learned years ago during my bachelor's studies to be able to get a good understanding of the whole system.
This was a school project for the Search Engine and Information Retrieval course at KTH, Stockholm. The task was to implement several techniques used to build an efficient search engine. We had to build a standalone search engine for the DavisWiki dataset, coded in Java. I implemented techniques such as the vector space model, the PageRank algorithm, and a probabilistic Monte Carlo approach to PageRank. We were also tasked with evaluating the search engine that we built.
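As a rough illustration of the two PageRank variants mentioned above, here is a small sketch of power iteration next to a Monte Carlo estimate based on random walks. The project itself was written in Java; this sketch uses Python to stay short and is not the original code. The three-page graph, the function names `pagerank` and `monte_carlo_pagerank`, the damping factor of 0.85, and the treatment of dangling pages (jumping to a uniformly random page) are all assumptions made for the example.

```python
import random


def all_pages(links):
    """Collect every page that appears as a source or a target."""
    return sorted(set(links) | {t for targets in links.values() for t in targets})


def pagerank(links, damping=0.85, iterations=100):
    """Power iteration; `links` maps each page to the pages it links to."""
    pages = all_pages(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives the teleportation share first.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            # A dangling page (no outlinks) spreads its rank over all pages.
            targets = links.get(page) or pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


def monte_carlo_pagerank(links, damping=0.85, walks_per_page=2000, seed=0):
    """Estimate PageRank from visit counts of random walks started at every page."""
    rng = random.Random(seed)
    pages = all_pages(links)
    visits = {page: 0 for page in pages}
    for start in pages:
        for _ in range(walks_per_page):
            current = start
            while True:
                visits[current] += 1
                # Each walk stops with probability 1 - damping at every step.
                if rng.random() > damping:
                    break
                targets = links.get(current) or pages
                current = rng.choice(targets)
    total = sum(visits.values())
    return {page: count / total for page, count in visits.items()}


# Hypothetical three-page graph, not taken from the DavisWiki dataset.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
exact = pagerank(graph)
approx = monte_carlo_pagerank(graph)
```

The Monte Carlo version trades accuracy for simplicity: it never touches the full transition matrix, and its estimate gets closer to the power iteration result as `walks_per_page` grows.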