A comprehensive machine learning system that predicts mobile game success metrics using scraped data from app stores. The platform combines web scraping, classification algorithms, learning-to-rank techniques, and a web interface to provide game analytics and performance predictions.
- Overview
- Features
- System Architecture
- Installation
- Usage
- Database Schema
- API Reference
- Machine Learning Pipeline
- Web Interface
- Development
- Screenshots
- Contributing
- License
This system provides two main services:
- Game Performance Prediction: Predict download ranges and rankings for games based on their attributes
- Optimal Attribute Recommendations: Suggest optimal game attributes to achieve desired performance criteria
The prediction models are trained on data from more than 40,000 mobile games.
- Web Scraping Pipeline: Automated data collection from 42matters.com using Selenium and BeautifulSoup
- Machine Learning Prediction: GradientBoostingClassifier for download tier classification
- Learning-to-Rank System: RankLib integration for genre-based game ranking
- Trend Analysis Engine: Statistical analysis for optimal attribute recommendations
- Interactive Web Interface: AngularJS frontend with real-time predictions
- Comprehensive Database: MySQL storage with structured game metadata
- Cross-Validation Testing: Model accuracy optimization across multiple algorithms
┌─────────────────────┐    ┌──────────────────────┐    ┌─────────────────────┐
│   Data Collection   │    │   Processing Layer   │    │   Prediction API    │
│                     │    │                      │    │                     │
│ ┌─────────────────┐ │    │ ┌──────────────────┐ │    │ ┌─────────────────┐ │
│ │    scrape.py    │ │    │ │     rough.py     │ │    │ │    server.py    │ │
│ │   (Selenium)    │ │────┤ │   (Score Calc)   │ │────┤ │  (Prediction)   │ │
│ └─────────────────┘ │    │ └──────────────────┘ │    │ └─────────────────┘ │
│                     │    │                      │    │                     │
│ ┌─────────────────┐ │    │ ┌──────────────────┐ │    │ ┌─────────────────┐ │
│ │   crawler.php   │ │    │ │     rank.py      │ │    │ │   server2.py    │ │
│ │ (Detail Scrape) │ │    │ │ (Genre Ranking)  │ │    │ │   (Analysis)    │ │
│ └─────────────────┘ │    │ └──────────────────┘ │    │ └─────────────────┘ │
└─────────────────────┘    └──────────────────────┘    └─────────────────────┘
                                      │
                           ┌──────────────────────┐
                           │    MySQL Database    │
                           │                      │
                           │ • url                │
                           │ • games              │
                           │ • games_formatted    │
                           │ • scores             │
                           │ • ranked_games       │
                           └──────────────────────┘
- Python 2.7
- MySQL Server
- Java Runtime Environment (JRE) 8+
- Firefox WebDriver
- PHP with simple_html_dom library
# Install MySQL
# macOS
brew install mysql
# Ubuntu
sudo apt-get install mysql-server
# Install Java
# macOS
brew install openjdk@8
# Ubuntu
sudo apt-get install openjdk-8-jre
# Install PHP
# macOS
brew install php
# Ubuntu
sudo apt-get install php php-mysql
# Install required Python packages
pip install scikit-learn==0.18.2
pip install pandas==0.20.3
pip install selenium==3.141.0
pip install beautifulsoup4==4.6.0
pip install MySQL-python==1.2.4
- Create MySQL database:
CREATE DATABASE dataset;
USE dataset;
-- Create required tables
CREATE TABLE url (
id INT PRIMARY KEY,
game_name VARCHAR(255),
game_url TEXT,
price FLOAT
);
CREATE TABLE games (
id INT PRIMARY KEY,
title VARCHAR(255),
genre VARCHAR(100),
rating FLOAT,
rating_count INT,
date VARCHAR(50),
size FLOAT,
downloads VARCHAR(100),
price FLOAT
);
CREATE TABLE games_formatted (
id INT PRIMARY KEY,
title VARCHAR(255),
genre VARCHAR(100),
rating FLOAT,
rating_count INT,
date VARCHAR(50),
size FLOAT,
downloads VARCHAR(50),
price FLOAT
);
CREATE TABLE scores (
id INT PRIMARY KEY,
name VARCHAR(255),
genre INT,
size FLOAT,
price FLOAT,
rating FLOAT,
review_count INT,
downloads INT,
score FLOAT
);
CREATE TABLE ranked_games (
id INT PRIMARY KEY,
`rank` INT, -- quoted because RANK is a reserved word in MySQL 8.0.2 and later
name VARCHAR(255),
genre INT,
size FLOAT,
price FLOAT,
rating FLOAT,
review_count INT,
downloads INT,
score FLOAT
);
- Update database credentials in configuration files:
# Update in all Python files:
import MySQLdb

db = MySQLdb.connect(
user='root',
passwd='your_password', # Change from 'suna'
db='dataset',
host='localhost'
)
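As an alternative to editing the password in every script, the connection settings could be read from the environment. The following is a minimal sketch, not the project's own code: the GAMEDB_* variable names and the db_settings helper are hypothetical, and the defaults simply mirror the values shown above.

```python
import os


def db_settings(env=None):
    """Build keyword arguments for MySQLdb.connect from environment variables.

    The GAMEDB_* names are hypothetical; nothing in the project reads them yet.
    """
    env = os.environ if env is None else env
    return {
        "user": env.get("GAMEDB_USER", "root"),
        "passwd": env.get("GAMEDB_PASSWORD", ""),
        "db": env.get("GAMEDB_NAME", "dataset"),
        "host": env.get("GAMEDB_HOST", "localhost"),
    }


# Usage (requires the MySQL-python package):
#   import MySQLdb
#   db = MySQLdb.connect(**db_settings())
```

Each script would then call db_settings() instead of carrying its own copy of the credentials.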
# Download Firefox GeckoDriver
# macOS
brew install geckodriver
# Ubuntu
wget https://github.com/mozilla/geckodriver/releases/download/v0.26.0/geckodriver-v0.26.0-linux64.tar.gz
tar -xzf geckodriver-v0.26.0-linux64.tar.gz
sudo mv geckodriver /usr/local/bin/
- Clone the repository:
git clone <repository-url>
cd Machine-Learning-Project
- Update file paths in configuration:
  - Update the CSV path in Server/server.py and Server/Max_Accuracy.py
  - Update the RankLib jar path in Server/server.py
  - Verify that all absolute paths match your system
- Start the Prediction Engine (Port 8000):
cd Server
python server.py
- Start the Analysis Engine (Port 8001):
cd Server
python server2.py
- Serve the Web Interface:
# Using Python's built-in server
cd "Executable UI"
python -m SimpleHTTPServer 8080
# Visit: http://localhost:8080/bark.html
- Scrape Basic Game Data:
cd Scraper
python scrape.py
- Crawl Detailed Metadata:
cd Scraper
php crawler.php
- Process and Score Games:
cd Ranking
python rough.py
- Generate Rankings by Genre:
cd Ranking
python rank.py
- Compare Algorithm Performance:
cd Server
python Max_Accuracy.py
- Run Cross-Validation:
cd Server
python cross_validation.py
- Train Ranking Model:
cd Ranking
java -jar RankLib-2.1-patched.jar -train dummy.txt -save mymodel.txt
Table | Purpose | Key Fields |
---|---|---|
url | Scraped URLs | id, game_name, game_url, price |
games | Raw game metadata | id, title, genre, rating, rating_count, size, downloads |
games_formatted | Processed game data | id, title, genre, rating, rating_count, size, downloads |
scores | Games with ML scores | id, name, genre, size, price, rating, review_count, downloads, score |
ranked_games | Final ranked games | id, rank, name, genre, size, price, rating, review_count, downloads, score |
Raw URLs → Detailed Scraping → Formatting      → Scoring → Ranking
↓          ↓                   ↓                 ↓         ↓
url      → games             → games_formatted → scores  → ranked_games
Endpoint: POST http://localhost:8000/
Request Parameters:
- name: Game name (optional)
- genre: Genre ID (0-19)
- size: Game size (float)
- sizeType: Size unit ('mb' or 'gb')
- price: Game price (float)
- rating: Average rating (float, 1-5)
- review_count: Number of reviews (integer)
Response: {download_range};{estimated_rank}
Example:
curl -X POST http://localhost:8000/ \
-d "name=TestGame&genre=2&size=25&sizeType=mb&price=0.99&rating=4.2&review_count=1000"
Response: "1,000,000 - 10,000,000;1205"
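The same request can be prepared and its reply decoded from Python. This is a sketch based only on the parameter list and response format documented above; the helper names are hypothetical, and it assumes the reply may or may not be wrapped in double quotes.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2.7


def build_prediction_body(genre, size, size_type, price, rating,
                          review_count, name=""):
    """Form-encode the documented request parameters, in documented order."""
    fields = [
        ("name", name),
        ("genre", genre),
        ("size", size),
        ("sizeType", size_type),
        ("price", price),
        ("rating", rating),
        ("review_count", review_count),
    ]
    return urlencode(fields)


def parse_prediction(response_text):
    """Split '{download_range};{estimated_rank}' into (range, rank)."""
    text = response_text.strip().strip('"')
    # The rank is the part after the last semicolon.
    download_range, rank = text.rsplit(";", 1)
    return download_range, int(rank)
```

The encoded body can be sent with any HTTP client as the POST payload to http://localhost:8000/, and the reply text passed to parse_prediction.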
Endpoint: POST http://localhost:8001/
Request Parameters:
- name: Game name (optional)
- genre: Target genre ID (0-19, -1 for any)
- downloads: Target download tier (0-4, -1 for any)
- rank: Target rank threshold (-1 for any)
Response: {size};{rating};{review_count};{genre};{download_distribution}
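A small client-side sketch for the analysis endpoint follows. It uses the documented -1 wildcard as the default for every criterion and splits the reply into its five documented fields. The helper names are hypothetical, and because the README does not specify the field types or the layout of download_distribution, every value is left as a raw string.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2.7

ANY = -1  # documented wildcard: "-1 for any"

RESPONSE_FIELDS = ["size", "rating", "review_count", "genre",
                   "download_distribution"]


def build_analysis_body(genre=ANY, downloads=ANY, rank=ANY, name=""):
    """Form-encode the analysis criteria; unset criteria default to ANY."""
    return urlencode([
        ("name", name),
        ("genre", genre),
        ("downloads", downloads),
        ("rank", rank),
    ])


def parse_analysis(response_text):
    """Map the semicolon-separated reply onto the documented field names."""
    # Split at most four times so the last field keeps any inner semicolons.
    parts = response_text.strip().split(";", len(RESPONSE_FIELDS) - 1)
    if len(parts) != len(RESPONSE_FIELDS):
        raise ValueError("expected %d fields, got %d"
                         % (len(RESPONSE_FIELDS), len(parts)))
    return dict(zip(RESPONSE_FIELDS, parts))
```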
Input Features:
- genre: Game category (0-19)
- size: Game size in MB
- price: Game price in USD
- rating: Average user rating (1-5)
- review_count: Total number of reviews

Target Variable: downloads (classified into 5 tiers)
Tier | Download Range | Description |
---|---|---|
0 | < 100,000 | Very Low |
1 | 100,000 - 1,000,000 | Low |
2 | 1,000,000 - 10,000,000 | Medium |
3 | 10,000,000 - 100,000,000 | High |
4 | 100,000,000 - 1,000,000,000 | Very High |
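The tier table can be expressed as a lookup over the lower bounds of tiers 1 to 4. This sketch assumes each lower bound is inclusive (so exactly 100,000 downloads falls in tier 1), which the table leaves ambiguous, and it places any count above 1,000,000,000 in tier 4. The names are illustrative only.

```python
import bisect

# Lower bounds of tiers 1..4, taken from the table above.
TIER_LOWER_BOUNDS = [100000, 1000000, 10000000, 100000000]
TIER_LABELS = ["Very Low", "Low", "Medium", "High", "Very High"]


def download_tier(downloads):
    """Return the tier index (0-4) for a raw download count."""
    if downloads < 0:
        raise ValueError("downloads must be non-negative")
    # bisect_right counts how many lower bounds are <= downloads.
    return bisect.bisect_right(TIER_LOWER_BOUNDS, downloads)


def tier_label(downloads):
    """Return the table's description for a raw download count."""
    return TIER_LABELS[download_tier(downloads)]
```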
Action, Adventure, Arcade, Board, Brain Games, Card, Casino, Casual, Creativity, Educational, Music, Pretend Play, Puzzle, Racing, Role Playing, Simulation, Sports, Strategy and Tools, Trivia, Word
The system evaluates four algorithms:
- DecisionTreeClassifier: Basic tree-based classification
- RandomForestClassifier: Ensemble of decision trees
- ExtraTreesClassifier: Extremely randomized trees
- GradientBoostingClassifier: Sequential boosting (selected for production)
Production Model: GradientBoostingClassifier
- n_estimators=100: Number of boosting stages
- learning_rate=1.0: Scales the contribution of each tree (1.0 applies no shrinkage)
- max_depth=None: Maximum tree depth (None sets no limit)
- random_state=0: Fixed seed for reproducible results
Training Configuration:
- Training samples: 37,500 games
- Test samples: Remaining games from 40K+ dataset
- Data source: /Ranking/data40Ksklearn.csv
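The configuration above could be instantiated roughly as follows. This is a sketch under several assumptions: it targets a recent scikit-learn with NumPy rather than the pinned 0.18.2, it uses small synthetic data because the column layout of data40Ksklearn.csv is not documented here, and the 250/50 split only mimics the idea of training on the first 37,500 rows and testing on the rest.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the real dataset (assumption, not project data).
rng = np.random.RandomState(0)
n_samples = 300
X = np.column_stack([
    rng.randint(0, 20, n_samples),       # genre ID (0-19)
    rng.uniform(1, 2000, n_samples),     # size in MB
    rng.uniform(0, 10, n_samples),       # price in USD
    rng.uniform(1, 5, n_samples),        # average rating
    rng.randint(0, 100000, n_samples),   # review_count
])
y = rng.randint(0, 5, n_samples)         # download tier (0-4)

# Fixed-size training prefix, remainder held out for testing.
n_train = 250
X_train, y_train = X[:n_train], y[:n_train]
X_test, y_test = X[n_train:], y[n_train:]

# Hyperparameters as listed for the production model.
model = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=1.0,
    max_depth=None,
    random_state=0,
)
model.fit(X_train, y_train)

predicted_tiers = model.predict(X_test)
accuracy = model.score(X_test, y_test)
```

On random labels the accuracy is meaningless; the point is only the shape of the feature matrix and the constructor call.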
The system uses RankLib for learning-to-rank:
- Feature Format: target qid:genre 1:size 2:price 3:rating 4:reviews
- Training: Java-based RankLib model training
- Scoring: Generates relevance scores for ranking
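To make the feature format concrete, here is a sketch that writes one line per game in that layout. The function names and the dictionary keys are hypothetical. Sorting by genre keeps all rows of one query ID together, which LETOR-style readers generally expect; whether the project's own scripts do this is not stated.

```python
def ranklib_line(target, genre, size, price, rating, reviews):
    """Format one game as 'target qid:genre 1:size 2:price 3:rating 4:reviews'."""
    return "{0} qid:{1} 1:{2} 2:{3} 3:{4} 4:{5}".format(
        target, genre, size, price, rating, reviews)


def ranklib_lines(games):
    """Format a list of game dicts, grouped by genre (the query ID)."""
    ordered = sorted(games, key=lambda g: g["genre"])
    return [
        ranklib_line(g["target"], g["genre"], g["size"],
                     g["price"], g["rating"], g["reviews"])
        for g in ordered
    ]
```

For example, ranklib_line(3, 2, 25.0, 0.99, 4.2, 1000) produces the text 3 qid:2 1:25.0 2:0.99 3:4.2 4:1000.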
Technology Stack:
- AngularJS 1.x
- Bootstrap 3.x
- jQuery 2.2.1
- WOW.js animations
- Custom CSS styling
- Input game attributes (genre, size, price, rating, reviews)
- Real-time prediction with progress animation
- Download range prediction and estimated rank
- Visual feedback with animated circles
- Specify desired performance criteria (downloads, genre, rank)
- Statistical trend analysis
- Optimal attribute recommendations
- Download distribution visualization
- Smooth scroll navigation
- WOW.js entrance animations
- Responsive Bootstrap layout
- Real-time AJAX predictions
- Interactive progress indicators
- Tooltip guidance
Machine-Learning-Project/
├── Server/ # ML prediction engines
│ ├── server.py # Main prediction API (port 8000)
│ ├── server2.py # Analysis API (port 8001)
│ ├── Max_Accuracy.py # Algorithm comparison
│ ├── cross_validation.py # Model validation
│ └── test.txt # Ranking test data
├── Scraper/ # Data collection
│ ├── scrape.py # Selenium scraper
│ └── crawler.php # Detail crawler
├── Ranking/ # ML ranking system
│ ├── rough.py # Score calculation
│ ├── rank.py # Genre-based ranking
│ ├── data40Ksklearn.csv # Training dataset (40K+ games)
│ ├── RankLib-2.1-patched.jar # Learning-to-rank library
│ ├── dummy.txt # Training data for ranking
│ ├── mymodel.txt # Trained ranking model
│ └── mysco.txt # Model scores
├── Executable UI/ # Web interface
│ ├── bark.html # Main frontend
│ └── files/ # Static assets
│ ├── css/ # Stylesheets
│ ├── js/ # JavaScript
│ └── fonts/ # Typography
├── ScreenShots/ # Application screenshots
└── MySQL-python-1.2.4b4/ # Database connector
- Data Collection: Run scrapers to gather new game data
- Data Processing: Process raw data into ML-ready format
- Model Training: Train and validate prediction models
- Ranking Generation: Create genre-based rankings
- API Testing: Verify prediction and analysis endpoints
- Frontend Integration: Test web interface functionality
Key Configuration Files:
- Database credentials: Update in all Python files
- File paths: Update absolute paths for your system
- API endpoints: Configured in app.js for frontend
- Model parameters: Adjust in respective Python files
- New Algorithms: Extend Max_Accuracy.py with additional classifiers
- Additional Features: Modify feature extraction in data processing scripts
- New Genres: Update genre mappings in both backend and frontend
- Enhanced UI: Extend AngularJS controllers and templates
The ScreenShots/ directory contains application screenshots:
- landing page.png: Main interface
- Database.png: Database structure
- MART Begin.png / MART Completed.png: Model training process
- prediction server_ start.png: Server startup
- Type1_*.png: Game prediction interface
- Type2_*.png: Attribute recommendation interface
- Fork the repository
- Create a feature branch (git checkout -b feature/new-feature)
- Update absolute paths in configuration files
- Test data pipeline and prediction accuracy
- Commit changes (git commit -am 'Add new feature')
- Push to branch (git push origin feature/new-feature)
- Create Pull Request
- Maintain Python 2.7 compatibility
- Update database schema migrations
- Add comprehensive error handling
- Test ML model performance impact
- Document API changes
- Verify frontend compatibility
- Dataset Size: 40,000+ mobile games
- Training Speed: ~2-3 minutes on modern hardware
- Prediction Latency: <1 second per request
- Accuracy: Optimized through cross-validation testing
- Database: MySQL with indexed queries for fast lookups
- Concurrent Users: Limited by single-threaded Python HTTP servers
- Data Updates: Batch processing for new game data integration
- Model Retraining: Offline process with model replacement
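If the servers are built on the standard-library HTTP server (an assumption; the source only says they are single-threaded), one low-effort way to lift the concurrency limit is the threading mix-in. The handler below is a placeholder that returns a fixed string, not the project's prediction logic, and all class and function names are hypothetical.

```python
try:
    from http.server import BaseHTTPRequestHandler, HTTPServer  # Python 3
    from socketserver import ThreadingMixIn
except ImportError:
    from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer  # Python 2.7
    from SocketServer import ThreadingMixIn


class ThreadedHTTPServer(ThreadingMixIn, HTTPServer):
    """HTTP server that handles each request in its own thread."""
    daemon_threads = True  # do not block interpreter exit on open requests


class PlaceholderHandler(BaseHTTPRequestHandler):
    """Stand-in handler; a real one would run the prediction here."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)  # consume the form body
        body = b"placeholder;0"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the console quiet


def make_server(port=8000, host="localhost"):
    """Create (but do not start) a threaded server on the given address."""
    return ThreadedHTTPServer((host, port), PlaceholderHandler)


# Usage: make_server().serve_forever()
```

Threads help only while requests wait on I/O; a shared model object would still need to be safe to call from several threads at once.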
- Database credentials hardcoded (development only)
- No authentication on API endpoints
- CORS enabled for cross-origin requests
- Input validation limited
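Input validation for the prediction endpoint could start from the ranges documented in the API reference. The sketch below is illustrative and not part of the project; the function name is hypothetical, and the rules for size, price, and review_count (positive, non-negative, non-negative) are assumptions, since the README only gives their types.

```python
def validate_prediction_params(params):
    """Return a list of problems found in a dict of raw request parameters."""
    errors = []

    def number(key, cast):
        # Convert a raw value, recording an error if it is absent or malformed.
        try:
            return cast(params[key])
        except (KeyError, TypeError, ValueError):
            errors.append("%s: missing or not a valid %s" % (key, cast.__name__))
            return None

    genre = number("genre", int)
    if genre is not None and not 0 <= genre <= 19:
        errors.append("genre: must be between 0 and 19")

    size = number("size", float)
    if size is not None and not size > 0:
        errors.append("size: must be positive")

    if params.get("sizeType") not in ("mb", "gb"):
        errors.append("sizeType: must be 'mb' or 'gb'")

    price = number("price", float)
    if price is not None and not price >= 0:
        errors.append("price: must not be negative")

    rating = number("rating", float)
    if rating is not None and not 1 <= rating <= 5:
        errors.append("rating: must be between 1 and 5")

    reviews = number("review_count", int)
    if reviews is not None and reviews < 0:
        errors.append("review_count: must not be negative")

    return errors
```

An empty list means the request passed every check; a server could otherwise answer with HTTP 400 and the collected messages.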
- Database Connection Errors:
  - Verify MySQL server is running
  - Check credentials in Python files
  - Ensure database 'dataset' exists
- Selenium WebDriver Issues:
  - Update Firefox WebDriver version
  - Check browser compatibility
  - Verify WebDriver in system PATH
- Java RankLib Errors:
  - Confirm Java JRE 8+ installed
  - Check RankLib jar file path
  - Verify input data format
- Python Import Errors:
  - Install required packages with pip
  - Check Python 2.7 compatibility
  - Verify package versions
- Database Indexing: Add indexes on frequently queried columns
- Caching: Implement Redis for prediction caching
- Load Balancing: Deploy multiple API server instances
- Async Processing: Convert to async Python framework
- Real-time data streaming from app stores
- Advanced deep learning models (TensorFlow/PyTorch)
- Multi-platform support (iOS, web games)
- RESTful API with authentication
- Real-time recommendation updates
- A/B testing framework for model comparison
- Migration to Python 3.x
- Containerized deployment (Docker)
- Cloud database integration
- Automated CI/CD pipeline
- Comprehensive API documentation (Swagger)
- Performance monitoring and analytics
This project is available for educational and research purposes. Please ensure compliance with data scraping policies and terms of service for external platforms.
Contact: For questions about this machine learning platform, please refer to the project documentation or create an issue in the repository.
Last Updated: September 2025