PhysicsNeMo-Curator is a Python-based library designed to streamline and accelerate the process of data curation for engineering datasets.

NVIDIA/physicsnemo-curator

PhysicsNeMo-Curator

Project Status: Active. Code style: black

PhysicsNeMo Curator | Getting started | Documentation | Contributing Guidelines | Communication

What is PhysicsNeMo Curator?

PhysicsNeMo Curator is a sub-module of the PhysicsNeMo framework: a Pythonic library designed to streamline and accelerate data curation at scale for engineering and scientific datasets used in training and inference. It uses GPUs to accelerate curation.

It provides customizable interfaces and pipelines for extracting, transforming, and loading (ETL) data in supported formats and schemas. The DoMINO ETL example illustrates the concept.

This package is intended to be used as part of the PhysicsNeMo framework.

Installation and Usage

The recommended way to use PhysicsNeMo-Curator is inside the PhysicsNeMo Docker image, which can be pulled from the NVIDIA Container Registry.

Current limitations:

  • Only the linux/amd64 platform is supported
  • No PyPI wheel is provided yet; install from source instead

PhysicsNeMo Container (Recommended)

The steps below show how to get started with PhysicsNeMo-Curator inside the PhysicsNeMo Docker container.

docker pull nvcr.io/nvidia/physicsnemo/physicsnemo:25.06

# Install from source
git clone git@github.com:NVIDIA/physicsnemo-curator.git && cd physicsnemo-curator

pip install --upgrade pip
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

Getting Started

New to PhysicsNeMo-Curator?

If you're new to the framework, start with our comprehensive Tutorial. It walks you through building a complete ETL pipeline from scratch. You'll learn how to:

  • Define data schemas
  • Implement schema validation, data sources, transformations, and sinks
  • Convert HDF5 data to ML-optimized Zarr format
  • Configure and run parallel processing pipelines
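The stages listed above can be sketched in plain Python. This is a minimal illustration of the source, validation, transformation, and sink pattern, not the PhysicsNeMo-Curator API: every name here (`Sample`, `validate`, `read_source`, `normalize`, `run_pipeline`) is hypothetical, and in-memory dictionaries stand in for the HDF5 input and Zarr output so the example runs with the standard library alone.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Sample:
    """Hypothetical schema: one simulation case with a pressure field."""
    case_id: str
    pressure: List[float]


def validate(sample: Sample) -> Sample:
    """Schema validation: reject empty IDs and empty fields."""
    if not sample.case_id:
        raise ValueError("case_id must be non-empty")
    if not sample.pressure:
        raise ValueError(f"{sample.case_id}: pressure field is empty")
    return sample


def read_source(raw: Dict[str, List[float]], case_id: str) -> Sample:
    """Data source: stands in for reading one case from an input file."""
    return Sample(case_id=case_id, pressure=list(raw[case_id]))


def normalize(sample: Sample) -> Sample:
    """Transformation: scale the field by its maximum absolute value."""
    peak = max(abs(p) for p in sample.pressure) or 1.0  # avoid dividing by zero
    return Sample(sample.case_id, [p / peak for p in sample.pressure])


def run_pipeline(raw: Dict[str, List[float]],
                 sink: Dict[str, List[float]],
                 workers: int = 2) -> None:
    """Read, validate, and transform cases in parallel, then write to the sink."""
    def process(case_id: str) -> Sample:
        return normalize(validate(read_source(raw, case_id)))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Results are written to the sink from the calling thread only.
        for sample in pool.map(process, sorted(raw)):
            sink[sample.case_id] = sample.pressure


raw_data = {"case_a": [2.0, -4.0], "case_b": [1.0, 0.5]}
store: Dict[str, List[float]] = {}
run_pipeline(raw_data, store)
print(store)
```

Keeping each stage a small function that takes and returns a `Sample` makes the stages easy to test in isolation and to swap out; the Tutorial covers how the framework itself structures these pieces.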

Working with Your CFD Data

Have CFD simulation data from a solver like Fluent? PhysicsNeMo-Curator can process your data through the following approaches:

Option 1: Convert to Supported Formats (Recommended)

Currently Supported Formats:

  • VTK formats: VTU (volume mesh data), VTP (surface mesh data)
  • STL: Geometry files
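Before running a pipeline, it can help to check which of your exported files fall into the supported formats. The following is a small sketch that groups paths by file extension; the `classify` helper and the example file names are hypothetical and are not part of PhysicsNeMo-Curator.

```python
from pathlib import Path
from typing import Dict, Iterable, List

# Extensions taken from the supported-formats list above.
SUPPORTED = {
    ".vtu": "volume mesh",
    ".vtp": "surface mesh",
    ".stl": "geometry",
}


def classify(paths: Iterable[str]) -> Dict[str, List[str]]:
    """Group file paths by supported format, matching extensions case-insensitively."""
    groups: Dict[str, List[str]] = {kind: [] for kind in SUPPORTED.values()}
    groups["unsupported"] = []
    for p in paths:
        kind = SUPPORTED.get(Path(p).suffix.lower(), "unsupported")
        groups[kind].append(p)
    return groups


result = classify([
    "run1/volume.vtu",
    "run1/surface.VTP",
    "run1/car.stl",
    "run1/case.cas",  # example of a file that would need converting first
])
for kind, files in result.items():
    print(f"{kind}: {files}")
```

Anything that lands in the `unsupported` group would need to be converted first, or handled by extending the framework as described in Option 2.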

Next Steps:

  1. Organize your converted data according to one of the supported dataset formats
  2. Use the built-in DoMINO pipeline to convert your data to a format ready for AI model training
  3. Train your DoMINO Model on your own data by following the example in PhysicsNeMo!

Option 2: Extend the Framework for Custom Formats

If your data is in a format other than the directly supported ones (VTU, VTP, STL), you can extend the framework. The Tutorial demonstrates building a complete pipeline that reads HDF5 data and converts it to Zarr.

Getting Help

  • Domain-Specific Examples: Check whether your use case matches our automotive aerodynamics pipeline, an example ETL pipeline for training DoMINO models for automotive aerodynamics applications. For questions about the formats, refer to the Data Processing Reference.
  • Architecture Questions: See the Tutorial for framework concepts and for how to extend the pipeline.
  • Anything else: Please open a GitHub issue and we'll work with you to answer your questions!

Contributing to PhysicsNeMo-Curator

PhysicsNeMo-Curator and PhysicsNeMo are open source collaborations, and their success is rooted in community contributions that further the field of Physics-ML. Thank you for contributing to the project so others can build on your work.

For guidance on contributing to PhysicsNeMo-Curator, please refer to the contributing guidelines.

Cite PhysicsNeMo-Curator

If PhysicsNeMo-Curator helped your research and you would like to cite it, please refer to the guidelines.

Communication

  • GitHub Discussions: Discuss new data formats, transformations, Physics-ML research, etc.
  • GitHub Issues: Bug reports, feature requests, install issues, etc.
  • PhysicsNeMo Forum: The PhysicsNeMo Forum hosts an audience of new to moderate-level users and developers for general chat, online discussions, collaboration, etc.

Feedback

Want to suggest some improvements to PhysicsNeMo-Curator? Use our feedback form.

License

PhysicsNeMo-Curator is provided under the Apache License 2.0; see LICENSE.txt for the full license text.
