cse-analytics


⚠️ This repository is source-available, not open-source. The code is licensed under the PolyForm Strict License 1.0.0: no redistribution, no modification, no commercial use. See LICENSE for details.

Overview

cse-analytics is a Python toolkit for scraping, extracting, and analyzing quarterly financial report PDFs from the Colombo Stock Exchange (CSE). It automates the end-to-end workflow: scraping report links, downloading PDFs, extracting structured P&L data using LLMs, merging results, and providing a FastAPI backend for analytics and dashboarding.


Features

  • Automated Scraping: Uses Playwright and BeautifulSoup to navigate CSE company pages and extract quarterly report links.
  • LLM-Powered Extraction: Extracts structured financial data from PDFs using Google Gemini (GenAI) models.
  • Concurrent PDF Downloading: Downloads all company PDFs asynchronously with robust error handling.
  • Data Merging: Merges extracted JSONs into tidy CSV/Parquet tables for analysis.
  • REST API: FastAPI backend for orchestrating scraping, extraction, and data access.
  • Dashboard-ready: Output is suitable for visualization and further analytics.

Project Structure

app/
  controllers/         # Core business logic (PDF download, extraction, merging)
  routes/              # FastAPI route definitions
  utils/               # Logging, Google GenAI client, helpers
  models.py            # Pydantic models and schemas
  main.py              # FastAPI app entrypoint
scraping/
  scrape.py            # Playwright-based HTML scraper for CSE company pages
  exract_pdf_links.py  # Extracts quarterly report links from cleaned HTML using Gemini
output/                # Scraped HTML and intermediate files
output_files/          # Extracted financial data JSONs
LICENSE                # PolyForm-Strict 1.0.0
README.md              # Project documentation

Workflow

  1. Scrape CSE Company Page

    • Run scraping/scrape.py to navigate to a company's CSE profile, click through to the Quarterly Reports tab, and save cleaned HTML.
    • Output: output/playwright_explore_rpe/quarterly_reports_tab_cleaned.html
  2. Extract Quarterly Report Links

    • Run scraping/exract_pdf_links.py to parse the cleaned HTML and extract CDN links to quarterly report PDFs using Gemini.
    • Output: urls_{company}.json (e.g., urls_dpl.json)
  3. Download PDFs

    • Use the FastAPI endpoint /api/v1/download-pdfs/{company_name} to download all PDFs for a company asynchronously.
    • PDFs are saved in a folder named after the company.
  4. Extract Financial Data from PDFs

    • Use the FastAPI endpoint /api/v1/extract-financial-data/ with a POST request containing the company name and directory path.
    • Each PDF is processed via Gemini to extract structured P&L data, saved as JSON.
  5. Merge Extracted Data

    • Use the FastAPI endpoint /api/v1/merge-quarterlies to merge all JSONs in a directory into a single CSV and Parquet file.
    • Output: rpe_quarterlies.csv, rpe_quarterlies.parquet

API Endpoints

  • POST /api/v1/download-pdfs/{company_name}
    • Triggers background download of all PDFs for a company.
  • POST /api/v1/extract-financial-data/
    • Extracts financial data from all PDFs in a directory for a given company.
  • POST /api/v1/merge-quarterlies
    • Merges all extracted JSONs in a directory into CSV/Parquet.

See app/routes/ for implementation details.
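For reference, the three endpoints can be driven end-to-end from a short client script. A minimal sketch using httpx, assuming the API runs on http://localhost:8000; the JSON field names in the POST bodies are assumptions, check app/models.py for the actual request schemas:

# hedged sketch: drives the pipeline against a locally running API;
# base URL and JSON field names are assumptions, see app/models.py
import httpx

BASE = "http://localhost:8000/api/v1"
company = "rpe"  # illustrative company key

with httpx.Client(timeout=120) as client:
    # 1. trigger the background PDF download for the company
    client.post(f"{BASE}/download-pdfs/{company}").raise_for_status()

    # NOTE: downloads run in the background; in practice, wait for them
    # to finish before extracting.

    # 2. extract structured P&L data from the downloaded PDFs
    client.post(
        f"{BASE}/extract-financial-data/",
        json={"company_name": company, "directory_path": company},  # assumed fields
    ).raise_for_status()

    # 3. merge the per-PDF JSONs into CSV/Parquet
    resp = client.post(f"{BASE}/merge-quarterlies", json={"directory_path": "output_files"})
    print(resp.json())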


Chatbot Agent (Agno UI)

This project includes an agentic chatbot built with Agno, which can be accessed via a local web UI.

1. Setup and Run the Chatbot Backend

The chatbot backend is defined in chatbot/agentic_chatbot.py. To run the chatbot backend:

uv run python -m chatbot.agentic_chatbot

Alternatively, you can use the Makefile command:

make run-chatbot

This will start the Agno agent server, typically on port 8000 (as configured in serve_playground_app; a structural sketch of the backend follows below). Ensure your PostgreSQL service (pgvector) is running, as the chatbot uses it for storage. You can start it using:

make pgvector
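For orientation, the backend in chatbot/agentic_chatbot.py follows the usual Agno playground pattern. A minimal sketch of that pattern; the agent name, model id, and settings here are illustrative, not the repository's exact configuration:

# minimal Agno playground server; illustrative, not the repo's exact setup
from agno.agent import Agent
from agno.models.google import Gemini
from agno.playground import Playground, serve_playground_app

rag_agent = Agent(
    name="RAG Agent",
    model=Gemini(id="gemini-2.0-flash"),  # model id is an assumption
    markdown=True,
)

app = Playground(agents=[rag_agent]).get_app()

if __name__ == "__main__":
    # the app is passed as an import string so uvicorn can reload it
    serve_playground_app("chatbot.agentic_chatbot:app", reload=True)

The Agent UI in the next step connects to whatever host and port this server reports on startup.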

2. Setup and Run the Agno Agent UI

The Agno Agent UI provides a frontend to interact with the chatbot.

Installation (first time only):

Use npx to create and set up the Agent UI project. This will clone the necessary files and install dependencies.

npx create-agent-ui@latest

Follow the prompts. When asked if you want to install dependencies, you can choose 'y'.

❯ npx create-agent-ui@latest
Need to install the following packages:
create-agent-ui@latest
Ok to proceed? (y) y

🚀 Creating a new Agent UI project: agent-ui

Cloning into 'agent-ui'...
remote: Enumerating objects: 114, done.
remote: Counting objects: 100% (114/114), done.
remote: Compressing objects: 100% (109/109), done.
remote: Total 114 (delta 1), reused 48 (delta 1), pack-reused 0 (from 0)
Receiving objects: 100% (114/114), 136.66 KiB | 361.00 KiB/s, done.
Resolving deltas: 100% (1/1), done.

✅ Project cloned successfully!

Do you want to install all dependencies now? (y/n): y

📦 Installing dependencies...

Run the Agent UI:

Once the agent-ui directory is created and dependencies are installed, navigate into it and start the development server:

cd agent-ui
npm run dev

This will typically start the Agent UI on http://localhost:3000.

   ▲ Next.js 15.2.3
   - Local:        http://localhost:3000
   - Network:      http://192.168.1.61:3000 # (Network IP may vary)

 ✓ Starting...

 ✓ Ready in 1850ms

3. Using the Chatbot

  • Open your browser and navigate to http://localhost:3000.
  • Connect to your local agent server (usually http://localhost:8000 or the address shown when you started the chatbot backend).
  • You can now interact with the "RAG Agent".

Scraping & Extraction Scripts

  • scraping/scrape.py: Uses Playwright to automate browser navigation and save cleaned HTML of the Quarterly Reports tab (see the sketch after this list).
  • scraping/exract_pdf_links.py: Uses Gemini to extract quarterly report links from HTML and outputs a JSON file.
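The core navigation pattern in scrape.py looks roughly like this; the URL, tab label, and output path are illustrative placeholders, see the script for the real values:

# simplified sketch of the Playwright flow in scraping/scrape.py;
# URL, tab label, and output path are placeholders
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.cse.lk/...")  # company profile URL (placeholder)
    page.get_by_text("Quarterly Reports").click()  # tab label assumed
    page.wait_for_load_state("networkidle")
    with open("output/quarterly_reports_tab.html", "w", encoding="utf-8") as f:
        f.write(page.content())
    browser.close()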

LLM Extraction

  • app/controllers/data_extractor_controller.py: Handles interaction with Google Gemini for extracting structured financial data from PDFs (outlined below).
  • app/routes/extract_data.py: Orchestrates batch extraction for all PDFs in a directory.
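In outline, the extraction uploads each PDF and asks Gemini for schema-constrained JSON. A sketch using the google-genai SDK; the model id, prompt, and Pydantic schema are placeholders, the real schemas live in app/models.py:

# sketch of LLM-powered P&L extraction with the google-genai SDK;
# model id, prompt, and schema are placeholders, not the repo's actual ones
from google import genai
from pydantic import BaseModel

class PnLQuarter(BaseModel):  # hypothetical schema; see app/models.py
    quarter: str
    revenue: float
    gross_profit: float
    net_profit: float

client = genai.Client()  # reads the Gemini API key from the environment

pdf = client.files.upload(file="rpe/2024_Q1.pdf")  # illustrative path
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[pdf, "Extract the consolidated P&L figures for the quarter."],
    config={
        "response_mime_type": "application/json",
        "response_schema": PnLQuarter,
    },
)
print(response.parsed)  # a PnLQuarter instance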

Data Merging

  • app/controllers/merge_financials_controller.py: Merges all extracted JSONs into a tidy DataFrame and outputs CSV and Parquet (sketched below).
  • app/routes/merge_financials.py: API endpoint for merging and previewing results.
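Conceptually, the merge step is a read-and-concatenate over the extracted JSONs. A minimal sketch, assuming one flat JSON object per PDF; the directory and output names follow the workflow above:

# conceptual sketch of the merge; the real column handling lives in
# app/controllers/merge_financials_controller.py
import json
from pathlib import Path

import pandas as pd

records = [
    json.loads(p.read_text(encoding="utf-8"))
    for p in sorted(Path("output_files").glob("*.json"))
]
df = pd.DataFrame(records)

df.to_csv("rpe_quarterlies.csv", index=False)
df.to_parquet("rpe_quarterlies.parquet", index=False)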

Logging & Error Handling

  • Uses structlog for structured logging (see the example below).
  • All major steps log progress and errors for traceability.
  • Failed downloads and extraction errors are logged and saved for review.
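Log calls follow the usual structlog key-value style; the event and field names below are illustrative:

# typical structlog usage pattern; event and field names are illustrative
import structlog

logger = structlog.get_logger()
logger.info("pdf_download_complete", company="rpe", count=12)
logger.error("extraction_failed", pdf="rpe/2024_Q1.pdf", error="timeout")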

Requirements & Setup

  • Python 3.13+
  • Install dependencies using uv:
    uv sync
  • Set up Google Gemini API credentials in your environment (see .env usage in scripts; a loading sketch follows this list).
  • For scraping, install Playwright and its browser drivers:
    uv pip install playwright
    playwright install
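A minimal pattern for loading the credentials with python-dotenv; the variable name GOOGLE_API_KEY is an assumption, match whatever your scripts read:

# load Gemini credentials from a .env file; the variable name is an assumption
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory
api_key = os.environ["GOOGLE_API_KEY"]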

Licensing

This repository is licensed under the PolyForm Strict License 1.0.0:

  • No redistribution
  • No modification
  • No commercial use
  • Source-available for personal, research, and noncommercial institutional use only

See LICENSE for full terms.

Copyright © 2025 Amod
