@xukunzh xukunzh commented Aug 29, 2025

This PR is the result of @xukunzh's Google Summer of Code (GSoC) 2025 project, which integrates capa with Frida for Android dynamic analysis.

This project provides a complete automation framework: it generates Frida monitoring scripts from an API configuration JSON file, runs dynamic analysis on Android devices, writes the observed API calls in JSONL format, and converts that behavioral data into capa features through FridaExtractor for malware capability detection.
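
The script-generation step can be pictured with a minimal sketch. It assumes a hypothetical configuration shape (a list of objects with `class` and `method` keys) and hypothetical payload field names (`api_name`, `arguments`), and it uses the standard library's `string.Template` as a stand-in for a real template engine so that it runs without extra dependencies. The emitted JavaScript relies on Frida's `Java.perform`, `Java.use`, and `send`, but it is not taken from the PR's actual templates.

```python
import json
from string import Template

# Hypothetical config shape; the real API configuration JSON schema may differ.
API_CONFIG = json.loads("""
[
  {"class": "java.io.File", "method": "exists"},
  {"class": "android.telephony.TelephonyManager", "method": "getDeviceId"}
]
""")

# One hook per configured Java method. string.Template is a stand-in for a
# fuller template engine; the payload field names are assumptions.
HOOK_TEMPLATE = Template("""\
    var cls_$index = Java.use("$cls");
    cls_$index.$method.overloads.forEach(function (overload) {
        overload.implementation = function () {
            var args = Array.prototype.slice.call(arguments).map(String);
            send({api: {java_api: {api_name: "$cls.$method", arguments: args}}});
            return overload.apply(this, arguments);
        };
    });
""")


def render_script(config):
    """Render a Frida script that hooks every configured Java method."""
    hooks = [
        HOOK_TEMPLATE.substitute(index=i, cls=entry["class"], method=entry["method"])
        for i, entry in enumerate(config)
    ]
    return "Java.perform(function () {\n" + "".join(hooks) + "});\n"


script = render_script(API_CONFIG)
```

Hooking every overload keeps the sketch independent of exact method signatures, at the cost of logging calls the analyst may not care about.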

A summary of this project: GSoC report.

xukunzh and others added 30 commits May 29, 2025 15:05
Add Frida log to capa analysis workflow
Accidentally merged unreviewed commit, reverting.

This reverts commit 8ed3cd1.
Implement basic Frida JSONL output and parser
 Integrate FridaExtractor into capa and add arguments
Auto-generate Frida hooks from APIs JSON file
add Java native & static method support and update model with Pydantic

google-cla bot commented Aug 29, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@github-actions github-actions bot left a comment

Please add bug fixes, new features, breaking changes and anything else you think is worthwhile mentioning to the master (unreleased) section of CHANGELOG.md. If no CHANGELOG update is needed add the following to the PR description: [x] No CHANGELOG update needed

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @xukunzh, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands capa's analytical capabilities by adding support for dynamic analysis of Android applications via Frida. It provides a robust system to ingest and interpret runtime behavioral data, allowing for the detection of capabilities that are only observable during execution. The changes streamline the process of setting up an analysis environment, capturing API interactions, and integrating this rich dynamic information into capa's existing feature extraction and rule matching engine.

Highlights

  • Frida Dynamic Analysis Integration: Introduced a comprehensive framework for integrating Frida dynamic analysis reports into capa. This enables capa to analyze behavioral data from Android applications, complementing static analysis.
  • New Frida Extractor and Data Models: Added a new FridaExtractor to parse JSONL reports generated by Frida, extracting features such as OS, architecture, package name, and detailed API calls with arguments. New Pydantic models (FridaReport, Call, Process, etc.) are defined to structure this dynamic data.
  • Automated Frida Analysis Workflow: Provided a suite of Python scripts (scripts/frida/) to automate the entire dynamic analysis process. This includes tools for Android emulator creation and setup, APK metadata extraction, dynamic Frida script generation using Jinja2 templates, and orchestration of Frida execution and result retrieval.
  • Core Capa Integration and Dependencies: Updated core capa components (capa/features/common.py, capa/helpers.py, capa/loader.py, capa/main.py) to recognize and process Frida reports as a new input format and backend. New Python dependencies (frida, jinja2) have been added to pyproject.toml to support these capabilities.
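
As a rough illustration of the report models and feature extraction described in the highlights above, the sketch below uses standard-library dataclasses in place of Pydantic. The class names follow the summary (FridaReport, Call, Process), but every field name is an assumption, and plain `(kind, value)` tuples stand in for capa's feature classes.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


# Hypothetical field names; the PR's actual Pydantic models may differ.
@dataclass
class Call:
    api_name: str
    arguments: List[str] = field(default_factory=list)


@dataclass
class Process:
    package_name: str
    calls: List[Call] = field(default_factory=list)


@dataclass
class FridaReport:
    os: str
    arch: str
    process: Process


def iter_features(report: FridaReport) -> Iterator[Tuple[str, str]]:
    """Yield (kind, value) pairs; tuples stand in for capa's feature classes."""
    yield ("os", report.os)
    yield ("arch", report.arch)
    yield ("package", report.process.package_name)
    for call in report.process.calls:
        yield ("api", call.api_name)
        for arg in call.arguments:
            yield ("argument", arg)


report = FridaReport(
    os="android",
    arch="aarch64",
    process=Process(
        package_name="com.example.app",
        calls=[Call("java.io.File.exists", ["/sdcard/a.txt"])],
    ),
)
features = list(iter_features(report))
```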

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces significant new functionality by adding support for Frida-based dynamic analysis of Android applications. It includes a new Frida extractor in capa, along with a suite of scripts for generating Frida trace data. The changes are extensive and well-structured. I've identified a couple of areas for improvement related to performance and usability in the new scripts and models.

Comment on lines +70 to +83
        with open(jsonl_path, "r") as f:
            content = f.read()
            for line in content.splitlines():
                record = json.loads(line)

                if "metadata" in record:
                    metadata = Metadata(**record["metadata"])
                elif "api" in record:
                    if "java_api" in record["api"]:
                        call = Call(**record["api"]["java_api"])
                        api_calls.append(call)
                    elif "native_api" in record["api"]:
                        call = Call(**record["api"]["native_api"])
                        api_calls.append(call)

medium

Reading the entire file into memory with f.read() can be inefficient for large JSONL files. It's better to iterate over the file line by line to reduce memory consumption. This change also adds encoding='utf-8' for robustness and handles empty or malformed JSON lines.

        with open(jsonl_path, "r", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue

                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    continue

                if "metadata" in record:
                    metadata = Metadata(**record["metadata"])
                elif "api" in record:
                    if "java_api" in record["api"]:
                        call = Call(**record["api"]["java_api"])
                        api_calls.append(call)
                    elif "native_api" in record["api"]:
                        call = Call(**record["api"]["native_api"])
                        api_calls.append(call)
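
For trying the streaming approach outside the PR, here is a self-contained sketch of the same idea. Plain dictionaries stand in for the `Metadata` and `Call` models, the field names in the sample records are hypothetical, and the count of skipped lines is an addition (not part of the suggestion above) so that malformed input is not dropped without a trace.

```python
import io
import json


def parse_jsonl(lines):
    """Stream JSONL records and return (metadata, api_calls, skipped_count)."""
    metadata = None
    api_calls = []
    skipped = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines are ignored, not counted as errors
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1
            continue
        if not isinstance(record, dict):
            skipped += 1  # valid JSON, but not an object we can inspect
            continue
        if "metadata" in record:
            metadata = record["metadata"]
        elif "api" in record:
            api = record["api"]
            for kind in ("java_api", "native_api"):
                if kind in api:
                    api_calls.append(api[kind])
                    break
    return metadata, api_calls, skipped


# Sample input with hypothetical field names, one blank line, one bad line.
sample = io.StringIO(
    '{"metadata": {"package_name": "com.example.app"}}\n'
    "\n"
    '{"api": {"java_api": {"api_name": "java.io.File.exists"}}}\n'
    "not json\n"
    '{"api": {"native_api": {"api_name": "open"}}}\n'
)
metadata, api_calls, skipped = parse_jsonl(sample)
```

Because the function accepts any iterable of lines, an open file handle can be passed directly and the file is never held in memory as a whole.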

    if not has_connected_device():
        logger.info("Found no devices. Make sure emulator is running")
        response = input("Auto-create an emulator? (y/n): ")
        if response == "y":

medium

The check for user input y is case-sensitive. It's better to convert the input to lowercase to handle both 'y' and 'Y', which provides a better user experience.

Suggested change
-        if response == "y":
+        if response.lower() == "y":
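
A slightly more tolerant variant is sketched below. The helper name `is_affirmative` is hypothetical and not part of the PR; it also strips surrounding whitespace and accepts "yes", which goes a little beyond the suggestion above.

```python
def is_affirmative(response: str) -> bool:
    """Return True for 'y' or 'yes' in any case, ignoring surrounding whitespace."""
    return response.strip().lower() in {"y", "yes"}
```

With such a helper, the check would read `if is_affirmative(response):`.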

@github-actions github-actions bot dismissed their stale review August 29, 2025 20:19

CHANGELOG updated or no update needed, thanks! 😄
