Fast Rust library and CLI tool for file type identification and detection using extensions, content analysis, and shebang parsing

grok-rs/file-identify

file-identify

File identification library for Rust.

Given a file (or some information about a file), return a set of standardized tags identifying what the file is.

This is a Rust port of the Python identify library.

Features

  • 🚀 Fast: Built in Rust with Perfect Hash Functions (PHF) for compile-time optimization
  • 📁 Comprehensive: Identifies 315+ file types and formats
  • 🔍 Smart detection: Uses file extensions, content analysis, and shebang parsing
  • 📦 Library + CLI: Use as a Rust library or command-line tool
  • Zero overhead: PHF tables are generated at compile time, giving O(1) lookups with no hash table construction at runtime
  • 🎯 Memory efficient: Static data structures with no lazy initialization
  • Well-tested: Extensive test suite ensuring reliability

Installation

Add this to your Cargo.toml:

[dependencies]
file-identify = "0.2.0"

Usage

Library Usage

use file_identify::{tags_from_path, tags_from_filename, tags_from_interpreter};

// Identify a file from its path
let tags = tags_from_path("/path/to/file.py").unwrap();
println!("{:?}", tags); // {"file", "text", "python", "non-executable"}

// Identify from filename only
let tags = tags_from_filename("script.sh");
println!("{:?}", tags); // {"text", "shell", "bash"}

// Identify from interpreter
let tags = tags_from_interpreter("python3");
println!("{:?}", tags); // {"python", "python3"}

Command Line Usage

# Install the CLI tool
cargo install file-identify

# Identify a file
file-identify setup.py
["file", "non-executable", "python", "text"]

# Use filename only (don't read file contents)
file-identify --filename-only setup.py
["python", "text"]

# Get help
file-identify --help

How it works

A call to tags_from_path does this:

  1. What is the type: file, symlink, or directory? If it's not a file, stop here.
  2. Is it executable? Add the appropriate tag.
  3. Do we recognize the file extension? If so, add the appropriate tags and stop here. These tags include binary/text.
  4. Peek at the first 1KB of the file. Use these bytes to determine whether it is binary or text, and add the appropriate tag.
  5. If identified as text above, try to read and interpret the shebang, and add appropriate tags.

By design, this means we don't need to partially read files where we recognize the file extension.
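
The steps above can be sketched with the standard library alone. This is an illustrative sketch, not the crate's actual implementation: the function names (`sketch_tags_from_path`, `looks_like_text`, `interpreter_from_shebang`), the three-entry `EXTENSIONS` table, and the text heuristic are all assumptions made for the example. It uses a linear scan where the crate uses PHF tables, and it inserts the raw interpreter name as a tag instead of mapping it through `tags_from_interpreter`.

```rust
use std::collections::BTreeSet;
use std::fs::{self, File};
use std::io::{self, Read};
use std::path::Path;

// Hypothetical stand-in for the crate's extension table (the real one is far larger).
const EXTENSIONS: &[(&str, &[&str])] = &[
    ("py", &["text", "python"]),
    ("sh", &["text", "shell", "bash"]),
    ("png", &["binary", "image", "png"]),
];

// Assumed heuristic: the bytes count as text if none of them is a control
// byte other than common whitespace and escape characters.
fn looks_like_text(bytes: &[u8]) -> bool {
    bytes
        .iter()
        .all(|&b| matches!(b, 7..=13 | 27 | 0x20..=0x7e | 0x80..=0xff))
}

// Extract the interpreter name from a shebang line such as
// "#!/usr/bin/env python3" or "#!/bin/bash".
fn interpreter_from_shebang(bytes: &[u8]) -> Option<String> {
    let text = String::from_utf8_lossy(bytes);
    let first_line = text.lines().next()?;
    let rest = first_line.strip_prefix("#!")?;
    let mut parts = rest.split_whitespace();
    let mut cmd = parts.next()?;
    // With "env", the interpreter is the next token.
    if cmd.rsplit('/').next() == Some("env") {
        cmd = parts.next()?;
    }
    cmd.rsplit('/').next().map(str::to_string)
}

#[cfg(unix)]
fn is_executable(meta: &fs::Metadata) -> bool {
    use std::os::unix::fs::PermissionsExt;
    meta.permissions().mode() & 0o111 != 0
}

#[cfg(not(unix))]
fn is_executable(_meta: &fs::Metadata) -> bool {
    false
}

fn sketch_tags_from_path(path: &Path) -> io::Result<BTreeSet<String>> {
    let mut tags = BTreeSet::new();

    // Step 1: file, symlink, or directory? Anything but a file stops here.
    let meta = fs::symlink_metadata(path)?;
    if meta.file_type().is_symlink() {
        tags.insert("symlink".to_string());
        return Ok(tags);
    }
    if meta.is_dir() {
        tags.insert("directory".to_string());
        return Ok(tags);
    }
    tags.insert("file".to_string());

    // Step 2: executable bit.
    let exec_tag = if is_executable(&meta) { "executable" } else { "non-executable" };
    tags.insert(exec_tag.to_string());

    // Step 3: a known extension supplies its tags without reading the file.
    let ext = path.extension().and_then(|e| e.to_str()).unwrap_or("");
    if let Some((_, ext_tags)) = EXTENSIONS.iter().find(|(name, _)| *name == ext) {
        tags.extend(ext_tags.iter().map(|t| t.to_string()));
        return Ok(tags);
    }

    // Step 4: peek at up to the first 1KB to classify binary vs. text.
    let mut head = Vec::new();
    File::open(path)?.take(1024).read_to_end(&mut head)?;
    if !looks_like_text(&head) {
        tags.insert("binary".to_string());
        return Ok(tags);
    }
    tags.insert("text".to_string());

    // Step 5: for text files, try the shebang.
    if let Some(interpreter) = interpreter_from_shebang(&head) {
        tags.insert(interpreter);
    }
    Ok(tags)
}
```

Calling `sketch_tags_from_path` on an extensionless script that starts with `#!/usr/bin/env python3` reaches steps 4 and 5, while a file named `setup.py` returns at step 3 without being opened.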

Development

Setup

# Clone the repository
git clone git@github.com:grok-rs/file-identify.git
cd file-identify

# Build the project
cargo build

# Run tests
cargo test

# Run the CLI
cargo run -- path/to/file

Pre-commit hooks

This project uses pre-commit hooks to ensure code quality:

pip install pre-commit
pre-commit install

Testing

# Run all tests
cargo test

# Run with coverage (requires cargo-tarpaulin)
cargo install cargo-tarpaulin
cargo tarpaulin --out html

License

MIT
