Restaurant Rating System

Evaluating Restaurants by Extracting and Categorizing Information from Reviews

Our system achieved an average tagging accuracy of 97.08% in classifying reviews from three restaurants into seven self-defined categories. The pipeline comprises three core components:

Data Collection and Labeling: Scrape review data from Yelp into a .csv file, parse it into a .txt file, and label the reviews.
Feature Engineering and Model Training: Develop effective feature sets and train the model.
Testing and Rating Calculation: Test the model on additional reviews and compute ratings for each review.

To classify reviews, we defined seven categories with both positive and negative tags for annotation and classification:

Overall: LOVE/HATE
Taste: YUMMY/GROSS
Service: FRIENDLY/INHOSPITABLE
Price: CHEAP/EXPENSIVE
Sanitation: CLEAN/DIRTY
Location: CONVENIENT/INCONVENIENT
Other: OTHER

Feature Engineering

During feature extraction, we identified five key features to optimize classification performance. These include:

Sequence Part-of-Speech (POS) Features
POS Combinations (e.g., NN+VB+JJ)
Token Word Appearance
Special Lists: Synonyms relevant to each category from Thesaurus.com
Special Sentence Structures: Common patterns in reviews

Model Training

We trained the system using the Maximum-Entropy Markov Model (MEMM) implemented in the OpenNLP library.

Results and Insights

Our system produced category ratings—Overall, Taste, Service, Price, Sanitation, and Location—that closely matched Yelp’s aggregate ratings. This demonstrates the system’s potential to help business owners gain more detailed insights from Yelp reviews.

Future Enhancements

To further improve accuracy and reliability, we plan to pursue three major enhancements:

Incorporate Sentiment Analysis: Identify words most likely to convey positive or negative sentiment.
Enhance the Dictionary: Expand category-specific synonym lists and account for different POS variations.
Increase Training Data: Label more reviews to improve model performance through larger datasets.

With these improvements, we believe our system will provide even more accurate and actionable insights for businesses and users alike. We trained the system using the Maximum-Entropy Markov Model (MEMM) implemented in the OpenNLP library.

Life is really simple, but we insist on making it complicated. - Confucius