Mobile Game Analytics and Prediction Platform

A comprehensive machine learning system that predicts mobile game success metrics using scraped data from app stores. The platform combines web scraping, classification algorithms, learning-to-rank techniques, and a web interface to provide game analytics and performance predictions.

Overview

This system provides two main services:

Game Performance Prediction: Predict download ranges and rankings for games based on their attributes
Optimal Attribute Recommendations: Suggest optimal game attributes to achieve desired performance criteria

The models are trained on data from over 40,000 mobile games, using gradient boosting for download-tier classification and RankLib learning-to-rank for rank estimates.

Features

Web Scraping Pipeline: Automated data collection from 42matters.com using Selenium and BeautifulSoup
Machine Learning Prediction: GradientBoostingClassifier for download tier classification
Learning-to-Rank System: RankLib integration for genre-based game ranking
Trend Analysis Engine: Statistical analysis for optimal attribute recommendations
Interactive Web Interface: AngularJS frontend with real-time predictions
Comprehensive Database: MySQL storage with structured game metadata
Cross-Validation Testing: Model accuracy optimization across multiple algorithms

System Architecture

┌─────────────────────┐    ┌──────────────────────┐    ┌─────────────────────┐
│   Data Collection   │    │   Processing Layer   │    │   Prediction API    │
│                     │    │                      │    │                     │
│ ┌─────────────────┐ │    │ ┌──────────────────┐ │    │ ┌─────────────────┐ │
│ │ scrape.py       │ │    │ │ rough.py         │ │    │ │ server.py       │ │
│ │ (Selenium)      │ │────┤ │ (Score Calc)     │ │────┤ │ (Prediction)    │ │
│ └─────────────────┘ │    │ └──────────────────┘ │    │ └─────────────────┘ │
│                     │    │                      │    │                     │
│ ┌─────────────────┐ │    │ ┌──────────────────┐ │    │ ┌─────────────────┐ │
│ │ crawler.php     │ │    │ │ rank.py          │ │    │ │ server2.py      │ │
│ │ (Detail Scrape) │ │    │ │ (Genre Ranking)  │ │    │ │ (Analysis)      │ │
│ └─────────────────┘ │    │ └──────────────────┘ │    │ └─────────────────┘ │
└─────────────────────┘    └──────────────────────┘    └─────────────────────┘
                                      │
                             ┌──────────────────────┐
                             │   MySQL Database     │
                             │                      │
                             │ • url               │
                             │ • games             │ 
                             │ • games_formatted   │
                             │ • scores            │
                             │ • ranked_games      │
                             └──────────────────────┘

Installation

Prerequisites

Python 2.7
MySQL Server
Java Runtime Environment (JRE) 8+
Firefox with GeckoDriver (the Firefox WebDriver)
PHP with simple_html_dom library

System Dependencies

# Install MySQL
# macOS
brew install mysql

# Ubuntu
sudo apt-get install mysql-server

# Install Java
# macOS  
brew install openjdk@8

# Ubuntu
sudo apt-get install openjdk-8-jre

# Install PHP
# macOS
brew install php

# Ubuntu
sudo apt-get install php php-mysql

Python Dependencies

# Install required Python packages
pip install scikit-learn==0.18.2
pip install pandas==0.20.3
pip install selenium==3.141.0
pip install beautifulsoup4==4.6.0
pip install MySQL-python==1.2.4

Database Setup

Create MySQL database:

CREATE DATABASE dataset;
USE dataset;

-- Create required tables
CREATE TABLE url (
    id INT PRIMARY KEY,
    game_name VARCHAR(255),
    game_url TEXT,
    price FLOAT
);

CREATE TABLE games (
    id INT PRIMARY KEY,
    title VARCHAR(255),
    genre VARCHAR(100),
    rating FLOAT,
    rating_count INT,
    date VARCHAR(50),
    size FLOAT,
    downloads VARCHAR(100),
    price FLOAT
);

CREATE TABLE games_formatted (
    id INT PRIMARY KEY,
    title VARCHAR(255),
    genre VARCHAR(100),
    rating FLOAT,
    rating_count INT,
    date VARCHAR(50),
    size FLOAT,
    downloads VARCHAR(50),
    price FLOAT
);

CREATE TABLE scores (
    id INT PRIMARY KEY,
    name VARCHAR(255),
    genre INT,
    size FLOAT,
    price FLOAT,
    rating FLOAT,
    review_count INT,
    downloads INT,
    score FLOAT
);

CREATE TABLE ranked_games (
    id INT PRIMARY KEY,
    `rank` INT,  -- backticks needed: RANK is a reserved word in MySQL 8.0.2 and later
    name VARCHAR(255),
    genre INT,
    size FLOAT,
    price FLOAT,
    rating FLOAT,
    review_count INT,
    downloads INT,
    score FLOAT
);

Update database credentials in configuration files:

# Update in all Python files:
db = MySQLdb.connect(
    user='root',
    passwd='your_password',  # Change from 'suna'
    db='dataset',
    host='localhost'
)
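As an alternative to editing the password into every Python file, the connection settings could be read from environment variables. This is a minimal sketch, not the project's current behaviour: the GAME_DB_* variable names and the db_settings_from_env helper are hypothetical.

```python
import os


def db_settings_from_env(environ=None):
    """Build keyword arguments for MySQLdb.connect() from environment variables.

    The GAME_DB_* names are hypothetical; the defaults mirror the values
    shown in the snippet above.
    """
    env = os.environ if environ is None else environ
    return {
        "user": env.get("GAME_DB_USER", "root"),
        "passwd": env.get("GAME_DB_PASSWORD", ""),
        "db": env.get("GAME_DB_NAME", "dataset"),
        "host": env.get("GAME_DB_HOST", "localhost"),
    }


# Usage in a script (requires the MySQL-python package):
#   import MySQLdb
#   db = MySQLdb.connect(**db_settings_from_env())
```

Keeping the lookup in one helper means the credentials only need to change in one place.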

Download WebDriver

# Download Firefox GeckoDriver
# macOS
brew install geckodriver

# Ubuntu
wget https://github.com/mozilla/geckodriver/releases/download/v0.26.0/geckodriver-v0.26.0-linux64.tar.gz
tar -xzf geckodriver-v0.26.0-linux64.tar.gz
sudo mv geckodriver /usr/local/bin/

Project Setup

Clone the repository:

git clone <repository-url>
cd Machine-Learning-Project

Update file paths in configuration:
- Update CSV path in Server/server.py and Server/Max_Accuracy.py
- Update RankLib jar path in Server/server.py
- Verify all absolute paths match your system
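One way to avoid per-machine absolute paths is to resolve them relative to the script's own location. The sketch below assumes each script sits one directory below the repository root (as Server/ and Ranking/ do); project_path is a hypothetical helper, not something the repository ships.

```python
import os


def project_path(script_dir, *parts):
    """Resolve a path relative to the repository root.

    Assumes script_dir is a directory directly under the root,
    for example Server/ or Ranking/.
    """
    root = os.path.dirname(os.path.abspath(script_dir))
    return os.path.join(root, *parts)


# Inside Server/server.py this might be used as:
#   here = os.path.dirname(os.path.abspath(__file__))
#   csv_path = project_path(here, "Ranking", "data40Ksklearn.csv")
#   jar_path = project_path(here, "Ranking", "RankLib-2.1-patched.jar")
```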

Usage

Starting the Services

Start the Prediction Engine (Port 8000):

cd Server
python server.py

Start the Analysis Engine (Port 8001):

cd Server
python server2.py

Serve the Web Interface:

# Using Python's built-in server
cd "Executable UI"
python -m SimpleHTTPServer 8080
# Visit: http://localhost:8080/bark.html

Data Collection Workflow

Scrape Basic Game Data:

cd Scraper
python scrape.py

Crawl Detailed Metadata:

cd Scraper
php crawler.php

Process and Score Games:

cd Ranking
python rough.py

Generate Rankings by Genre:

cd Ranking
python rank.py

Machine Learning Operations

Compare Algorithm Performance:

cd Server
python Max_Accuracy.py

Run Cross-Validation:

cd Server
python cross_validation.py

Train Ranking Model:

cd Ranking
java -jar RankLib-2.1-patched.jar -train dummy.txt -save mymodel.txt

Database Schema

Primary Tables

Table	Purpose	Key Fields
`url`	Scraped URLs	id, game_name, game_url, price
`games`	Raw game metadata	id, title, genre, rating, rating_count, date, size, downloads, price
`games_formatted`	Processed game data	id, title, genre, rating, rating_count, date, size, downloads, price
`scores`	Games with ML scores	id, name, genre, size, price, rating, review_count, downloads, score
`ranked_games`	Final ranked games	id, rank, name, genre, size, price, rating, review_count, downloads, score

Data Flow

Raw URLs → Detailed Scraping → Formatting → Scoring → Ranking
   ↓              ↓               ↓          ↓         ↓
  url    →     games     →  games_formatted → scores → ranked_games

API Reference

Prediction API (Port 8000)

Endpoint: POST http://localhost:8000/

Request Parameters:

name: Game name (optional)
genre: Genre ID (0-19)
size: Game size (float)
sizeType: Size unit ('mb' or 'gb')
price: Game price (float)
rating: Average rating (float, 1-5)
review_count: Number of reviews (integer)

Response: {download_range};{estimated_rank}

Example:

curl -X POST http://localhost:8000/ \
  -d "name=TestGame&genre=2&size=25&sizeType=mb&price=0.99&rating=4.2&review_count=1000"

Response: "1,000,000 - 10,000,000;1205"
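The same request can be made from Python. The following is a rough client sketch that works on Python 2.7 and 3; the function names are hypothetical, and it assumes the body is form-encoded and that the reply is plain text in the {download_range};{estimated_rank} form, possibly wrapped in quotes.

```python
try:  # Python 3
    from urllib.parse import urlencode
    from urllib.request import urlopen
except ImportError:  # Python 2.7
    from urllib import urlencode
    from urllib2 import urlopen


def parse_prediction(response_text):
    """Split '{download_range};{estimated_rank}' into (str, int)."""
    cleaned = response_text.strip().strip('"')
    # Split on the last ';' so the range text is kept intact.
    download_range, rank = cleaned.rsplit(";", 1)
    return download_range, int(rank)


def predict(params, url="http://localhost:8000/"):
    """POST form-encoded game attributes and return the parsed prediction."""
    body = urlencode(params).encode("utf-8")
    reply = urlopen(url, body).read().decode("utf-8")
    return parse_prediction(reply)


# Example call (needs server.py running on port 8000):
#   predict({"name": "TestGame", "genre": 2, "size": 25, "sizeType": "mb",
#            "price": 0.99, "rating": 4.2, "review_count": 1000})
```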

Analysis API (Port 8001)

Endpoint: POST http://localhost:8001/

Request Parameters:

name: Game name (optional)
genre: Target genre ID (0-19, -1 for any)
downloads: Target download tier (0-4, -1 for any)
rank: Target rank threshold (-1 for any)

Response: {size};{rating};{review_count};{genre};{download_distribution}
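A small helper pair can hide the -1 sentinel and split the reply. This is a sketch under assumptions: the helper names are hypothetical, and the response fields are kept as raw strings because their exact types and the layout of download_distribution are not specified here.

```python
def build_analysis_params(genre=None, downloads=None, rank=None, name=""):
    """Build the Analysis API form fields, sending -1 for 'any'."""
    def or_any(value):
        return -1 if value is None else int(value)

    if genre is not None and not 0 <= genre <= 19:
        raise ValueError("genre must be 0-19")
    if downloads is not None and not 0 <= downloads <= 4:
        raise ValueError("downloads tier must be 0-4")
    return {
        "name": name,
        "genre": or_any(genre),
        "downloads": or_any(downloads),
        "rank": or_any(rank),
    }


def parse_analysis(response_text):
    """Split the five ';'-separated response fields into a dict of strings."""
    keys = ["size", "rating", "review_count", "genre", "download_distribution"]
    # maxsplit=4 keeps any ';' inside the last field untouched.
    values = response_text.strip().split(";", 4)
    if len(values) != len(keys):
        raise ValueError("expected 5 fields, got %d" % len(values))
    return dict(zip(keys, values))
```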

Machine Learning Pipeline

Feature Engineering

Input Features:

genre: Game category (0-19)
size: Game size in MB
price: Game price in USD
rating: Average user rating (1-5)
review_count: Total number of reviews

Target Variable: downloads (classified into 5 tiers)
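To illustrate how request values might be turned into a feature row, here is a minimal sketch. The function name, the column order, and the GB-to-MB factor are assumptions; the project's own scripts may differ.

```python
def to_feature_row(genre, size, size_type, price, rating, review_count):
    """Validate request values and return
    [genre, size_mb, price, rating, review_count].

    The column order follows the feature list above and is an assumption.
    """
    genre = int(genre)
    if not 0 <= genre <= 19:
        raise ValueError("genre must be in 0-19")
    rating = float(rating)
    if not 1.0 <= rating <= 5.0:
        raise ValueError("rating must be in 1-5")
    unit = str(size_type).lower()
    if unit not in ("mb", "gb"):
        raise ValueError("sizeType must be 'mb' or 'gb'")
    # Assumption: 1 GB = 1024 MB; the project may use a different factor.
    size_mb = float(size) * (1024.0 if unit == "gb" else 1.0)
    return [genre, size_mb, float(price), rating, int(review_count)]
```

Validating here also addresses part of the limited input validation on the API side.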

Classification Tiers

Tier	Download Range	Description
0	< 100,000	Very Low
1	100,000 - 1,000,000	Low
2	1,000,000 - 10,000,000	Medium
3	10,000,000 - 100,000,000	High
4	100,000,000 - 1,000,000,000	Very High
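The tier table can be expressed as a lookup over the lower bounds. In this sketch the names are hypothetical, and it assumes a count that falls exactly on a boundary belongs to the higher tier, which the table does not state.

```python
from bisect import bisect_right

# Lower bounds of tiers 1-4, taken from the table above.
TIER_LOWER_BOUNDS = [100000, 1000000, 10000000, 100000000]
TIER_LABELS = ["Very Low", "Low", "Medium", "High", "Very High"]


def download_tier(downloads):
    """Return the tier index (0-4) for a raw download count.

    Assumption: a count exactly on a boundary goes to the higher tier.
    """
    if downloads < 0:
        raise ValueError("downloads must be non-negative")
    return bisect_right(TIER_LOWER_BOUNDS, downloads)
```

For example, TIER_LABELS[download_tier(2500000)] gives "Medium".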

Genre Categories (0-19)

Action, Adventure, Arcade, Board, Brain Games, Card, Casino, Casual, Creativity, Educational, Music, Pretend Play, Puzzle, Racing, Role Playing, Simulation, Sports, Strategy and Tools, Trivia, Word

Model Performance

The system evaluates four algorithms:

DecisionTreeClassifier: Basic tree-based classification
RandomForestClassifier: Ensemble of decision trees
ExtraTreesClassifier: Extremely randomized trees
GradientBoostingClassifier: Sequential boosting (selected for production)

Production Model: GradientBoostingClassifier

n_estimators=100: Number of boosting stages
learning_rate=1.0: Scales the contribution of each tree (1.0 applies no shrinkage)
max_depth=None: No limit on the depth of individual trees
random_state=0: Reproducible results
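The comparison could be reproduced along these lines with scikit-learn. Only the GradientBoostingClassifier parameters come from the list above; the default settings for the other three classifiers, the function names, and the use of cross_val_score are assumptions rather than the contents of Max_Accuracy.py.

```python
import numpy as np
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def candidate_models():
    """Return the four classifiers named above, keyed by a short label."""
    return {
        "DecisionTree": DecisionTreeClassifier(random_state=0),
        "RandomForest": RandomForestClassifier(random_state=0),
        "ExtraTrees": ExtraTreesClassifier(random_state=0),
        # Parameters as listed for the production model.
        "GradientBoosting": GradientBoostingClassifier(
            n_estimators=100, learning_rate=1.0, max_depth=None,
            random_state=0),
    }


def compare_models(X, y, folds=5):
    """Return the mean cross-validated accuracy for each candidate."""
    return {name: float(np.mean(cross_val_score(model, X, y, cv=folds)))
            for name, model in candidate_models().items()}
```

Here X would hold the five input features per game and y the download tier (0-4).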

Training Configuration:

Training samples: 37,500 games
Test samples: Remaining games from 40K+ dataset
Data source: Ranking/data40Ksklearn.csv

Learning-to-Rank Integration

The system uses RankLib for learning-to-rank:

Feature Format: target qid:genre 1:size 2:price 3:rating 4:reviews
Training: Java-based RankLib model training
Scoring: Generates relevance scores for ranking
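The feature format above can be produced with a few lines of Python. This is an illustrative sketch with hypothetical function names; rows are sorted by genre because RankLib groups consecutive lines with the same qid into one query list.

```python
def ranklib_line(target, genre, size, price, rating, reviews):
    """Format one game as 'target qid:genre 1:size 2:price 3:rating 4:reviews'."""
    return "{} qid:{} 1:{} 2:{} 3:{} 4:{}".format(
        target, int(genre), size, price, rating, reviews)


def write_ranklib_file(path, rows):
    """Write (target, genre, size, price, rating, reviews) tuples to a file."""
    with open(path, "w") as handle:
        # Sort by genre so all lines sharing a qid are contiguous.
        for row in sorted(rows, key=lambda r: r[1]):
            handle.write(ranklib_line(*row) + "\n")
```

A file written this way could then be passed to the -train option shown under Machine Learning Operations.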

Web Interface

BARK Inc. Frontend

Technology Stack:

AngularJS 1.x
Bootstrap 3.x
jQuery 2.2.1
WOW.js animations
Custom CSS styling

Two Main Interfaces

1. Game Performance Prediction ("I have a Game in Mind")

Input game attributes (genre, size, price, rating, reviews)
Real-time prediction with progress animation
Download range prediction and estimated rank
Visual feedback with animated circles

2. Attribute Recommendations ("Need attrs for a new Game")

Specify desired performance criteria (downloads, genre, rank)
Statistical trend analysis
Optimal attribute recommendations
Download distribution visualization

User Experience Features

Smooth scroll navigation
WOW.js entrance animations
Responsive Bootstrap layout
Real-time AJAX predictions
Interactive progress indicators
Tooltip guidance

Development

Project Structure

Machine-Learning-Project/
├── Server/                     # ML prediction engines
│   ├── server.py              # Main prediction API (port 8000)
│   ├── server2.py             # Analysis API (port 8001)
│   ├── Max_Accuracy.py        # Algorithm comparison
│   ├── cross_validation.py    # Model validation
│   └── test.txt               # Ranking test data
├── Scraper/                   # Data collection
│   ├── scrape.py             # Selenium scraper
│   └── crawler.php           # Detail crawler
├── Ranking/                   # ML ranking system
│   ├── rough.py              # Score calculation
│   ├── rank.py               # Genre-based ranking
│   ├── data40Ksklearn.csv    # Training dataset (40K+ games)
│   ├── RankLib-2.1-patched.jar  # Learning-to-rank library
│   ├── dummy.txt             # Training data for ranking
│   ├── mymodel.txt           # Trained ranking model
│   └── mysco.txt             # Model scores
├── Executable UI/             # Web interface
│   ├── bark.html             # Main frontend
│   └── files/                # Static assets
│       ├── css/              # Stylesheets
│       ├── js/               # JavaScript
│       └── fonts/            # Typography
├── ScreenShots/              # Application screenshots
└── MySQL-python-1.2.4b4/    # Database connector

Development Workflow

Data Collection: Run scrapers to gather new game data
Data Processing: Process raw data into ML-ready format
Model Training: Train and validate prediction models
Ranking Generation: Create genre-based rankings
API Testing: Verify prediction and analysis endpoints
Frontend Integration: Test web interface functionality

Configuration Management

Key Configuration Files:

Database credentials: Update in all Python files
File paths: Update absolute paths for your system
API endpoints: Configured in app.js for frontend
Model parameters: Adjust in respective Python files

Adding New Features

New Algorithms: Extend Max_Accuracy.py with additional classifiers
Additional Features: Modify feature extraction in data processing scripts
New Genres: Update genre mappings in both backend and frontend
Enhanced UI: Extend AngularJS controllers and templates

Screenshots

The ScreenShots/ directory contains application screenshots:

landing page.png: Main interface
Database.png: Database structure
MART Begin.png / MART Completed.png: Model training process
prediction server_ start.png: Server startup
Type1_*.png: Game prediction interface
Type2_*.png: Attribute recommendation interface

Contributing

Fork the repository
Create a feature branch (git checkout -b feature/new-feature)
Update absolute paths in configuration files
Test data pipeline and prediction accuracy
Commit changes (git commit -am 'Add new feature')
Push to branch (git push origin feature/new-feature)
Create Pull Request

Development Guidelines

Maintain Python 2.7 compatibility
Update database schema migrations
Add comprehensive error handling
Test ML model performance impact
Document API changes
Verify frontend compatibility

Technical Specifications

Performance Metrics

Dataset Size: 40,000+ mobile games
Training Speed: ~2-3 minutes on modern hardware
Prediction Latency: <1 second per request
Accuracy: Optimized through cross-validation testing

Scalability Considerations

Database: MySQL with indexed queries for fast lookups
Concurrent Users: Limited by single-threaded Python HTTP servers
Data Updates: Batch processing for new game data integration
Model Retraining: Offline process with model replacement
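The single-threaded limit could be relaxed with the standard library's ThreadingMixIn, which handles each request in its own thread. The sketch below runs on Python 2.7 and 3; EchoHandler is a hypothetical stand-in for the project's real request handler, and whether the prediction code is safe to run concurrently would need checking.

```python
try:  # Python 3
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from socketserver import ThreadingMixIn
except ImportError:  # Python 2.7
    from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer
    from SocketServer import ThreadingMixIn


class ThreadedHTTPServer(ThreadingMixIn, HTTPServer):
    """HTTP server that serves each request in a separate thread."""
    daemon_threads = True  # do not block interpreter exit on open requests


class EchoHandler(BaseHTTPRequestHandler):
    """Placeholder handler: echoes the POST body back to the caller."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet


# To run: ThreadedHTTPServer(("localhost", 8000), EchoHandler).serve_forever()
```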

Security Notes

Database credentials hardcoded (development only)
No authentication on API endpoints
CORS enabled for cross-origin requests
Input validation limited

Troubleshooting

Common Issues

Database Connection Errors:
- Verify MySQL server is running
- Check credentials in Python files
- Ensure database 'dataset' exists

Selenium WebDriver Issues:
- Update Firefox WebDriver version
- Check browser compatibility
- Verify WebDriver in system PATH

Java RankLib Errors:
- Confirm Java JRE 8+ installed
- Check RankLib jar file path
- Verify input data format

Python Import Errors:
- Install required packages with pip
- Check Python 2.7 compatibility
- Verify package versions

Performance Optimization

Database Indexing: Add indexes on frequently queried columns
Caching: Implement Redis for prediction caching
Load Balancing: Deploy multiple API server instances
Async Processing: Convert to async Python framework

Future Enhancements

Planned Features

Real-time data streaming from app stores
Advanced deep learning models (TensorFlow/PyTorch)
Multi-platform support (iOS, web games)
RESTful API with authentication
Real-time recommendation updates
A/B testing framework for model comparison

Technical Improvements

Migration to Python 3.x
Containerized deployment (Docker)
Cloud database integration
Automated CI/CD pipeline
Comprehensive API documentation (Swagger)
Performance monitoring and analytics

License

This project is available for educational and research purposes. Please ensure compliance with data scraping policies and terms of service for external platforms.

Contact: For questions about this machine learning platform, please refer to the project documentation or create an issue in the repository.

Last Updated: September 2025

arvindrk/Machine-Learning-Project
