PhysicsNeMo Curator
PhysicsNeMo Curator is a sub-module of the PhysicsNeMo framework: a Pythonic library designed to streamline and accelerate data curation at scale for engineering and scientific datasets used in training and inference. It leverages GPUs to accelerate curation.
This includes customizable interfaces and pipelines for extracting, transforming and loading data in supported formats and schema. Please refer to the DoMINO ETL example that illustrates the concept.
This package is intended to be used as part of the PhysicsNeMo framework.
The recommended way of using PhysicsNeMo-Curator is to leverage the PhysicsNeMo docker image. This can be pulled from the NVIDIA Container Registry.
Current limitations:
- Currently only the linux/amd64 platform is supported
- We do not currently provide a PyPI wheel; installation from source is supported
The instructions to get started with PhysicsNeMo-Curator within the PhysicsNeMo docker container are shown below.
docker pull nvcr.io/nvidia/physicsnemo/physicsnemo:25.06
# Install from source
git clone git@github.com:NVIDIA/physicsnemo-curator.git && cd physicsnemo-curator
pip install --upgrade pip
pip install -e ".[dev]"
# Install pre-commit hooks
pre-commit install
If you're new to the framework, start with our comprehensive Tutorial. It walks you through building a complete ETL pipeline from scratch. You'll learn how to:
- Define data schemas
- Implement schema validation, data sources, transformations, and sinks
- Convert HDF5 data to ML-optimized Zarr format
- Configure and run parallel processing pipelines
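As a rough illustration of the concepts listed above (schema, validation, source, transformation, sink, parallel execution), here is a minimal standard-library sketch. It is not the PhysicsNeMo-Curator API: every name in it (`Sample`, `validate`, `extract`, `transform`, `load`, `run_pipeline`) is hypothetical, and plain dictionaries stand in for the HDF5 input and the Zarr output. Refer to the Tutorial for the framework's actual interfaces.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Sample:
    """Hypothetical schema: one simulation sample with a list of pressure values."""

    name: str
    pressure: List[float]


def validate(sample: Sample) -> Sample:
    """Schema validation: reject samples that are unnamed or empty."""
    if not sample.name:
        raise ValueError("sample has no name")
    if not sample.pressure:
        raise ValueError(f"sample {sample.name!r} has no pressure values")
    return sample


def extract(raw: Dict[str, List[float]]) -> List[Sample]:
    """Data source: stands in for reading samples from files such as HDF5."""
    return [Sample(name, values) for name, values in sorted(raw.items())]


def transform(sample: Sample) -> Sample:
    """Transformation: scale each sample by its maximum absolute value."""
    scale = max(abs(v) for v in sample.pressure) or 1.0  # avoid dividing by zero
    return Sample(sample.name, [v / scale for v in sample.pressure])


def load(samples: List[Sample], store: Dict[str, List[float]]) -> None:
    """Sink: stands in for writing to a store such as Zarr."""
    for sample in samples:
        store[sample.name] = sample.pressure


def run_pipeline(
    raw: Dict[str, List[float]],
    store: Dict[str, List[float]],
    max_workers: int = 2,
) -> None:
    """Extract, validate, transform in parallel, then load."""
    samples = [validate(s) for s in extract(raw)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        transformed = list(pool.map(transform, samples))  # map preserves input order
    load(transformed, store)


raw_data = {"run_002": [2.0, -4.0], "run_001": [1.0, 0.5]}
output_store: Dict[str, List[float]] = {}
run_pipeline(raw_data, output_store)
```

Keeping each stage a small, independent function is what makes the per-sample transformation easy to hand to a worker pool; a thread pool is used here only for brevity.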
Have CFD simulation data from a solver like Fluent? PhysicsNeMo-Curator can process your data through the following approaches:
Currently Supported Formats:
- VTK formats: VTU (volume mesh data), VTP (surface mesh data)
- STL: Geometry files
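Before running a pipeline, it can help to check which files in a case directory fall under the formats listed above. The sketch below is a standalone helper, not part of PhysicsNeMo-Curator; the function name `group_by_format`, the category labels, and the example file names are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Dict, Iterable, List

# Extensions taken from the supported-formats list; labels are illustrative.
SUPPORTED_FORMATS = {
    ".vtu": "volume mesh",
    ".vtp": "surface mesh",
    ".stl": "geometry",
}


def group_by_format(paths: Iterable[str]) -> Dict[str, List[str]]:
    """Group file paths by supported format; anything else goes under 'unsupported'."""
    groups: Dict[str, List[str]] = {}
    for path in paths:
        # Compare extensions case-insensitively so "mesh.VTP" matches ".vtp".
        kind = SUPPORTED_FORMATS.get(Path(path).suffix.lower(), "unsupported")
        groups.setdefault(kind, []).append(path)
    return groups


# Hypothetical file names for a single simulation run.
files = [
    "run_1/volume.vtu",
    "run_1/boundary.VTP",
    "run_1/car.stl",
    "run_1/case.cas.h5",
]
grouped = group_by_format(files)
```

Files that land under `"unsupported"` would need to be exported to VTU, VTP, or STL first, or handled by a custom pipeline.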
Next Steps:
- Organize your converted data according to one of the supported dataset formats
- Use the built-in DoMINO pipeline to convert your data to an AI model training ready format
- Train your DoMINO Model on your own data by following the example in PhysicsNeMo!
If your data is in a format other than the directly supported ones (VTU, VTP, STL), you can extend the framework. The Tutorial demonstrates how to build a complete pipeline that reads HDF5 data and converts it to Zarr.
- Domain-Specific Examples: Check whether your use case matches our automotive aerodynamics pipeline, an example ETL pipeline for training DoMINO models for automotive aerodynamics applications. For more questions about the formats, please refer to the Data Processing Reference.
- Architecture Questions: See the Tutorial for framework concepts and for guidance on extending the pipeline.
- Anything else: Please open a GitHub issue and we'll engage with you to answer the questions!
PhysicsNeMo-Curator and PhysicsNeMo are open source collaborations and their success is rooted in community contribution to further the field of Physics-ML. Thank you for contributing to the project so others can build on top of your contribution.
For guidance on contributing to PhysicsNeMo-Curator, please refer to the contributing guidelines.
If PhysicsNeMo-Curator helped your research and you would like to cite it, please refer to the guidelines.
- GitHub Discussions: Discuss new data formats, transformations, Physics-ML research, etc.
- GitHub Issues: Bug reports, feature requests, install issues, etc.
- PhysicsNeMo Forum: The PhysicsNeMo Forum hosts an audience of new to moderate-level users and developers for general chat, online discussions, collaboration, etc.
Want to suggest some improvements to PhysicsNeMo-Curator? Use our feedback form.
PhysicsNeMo-Curator is provided under the Apache License 2.0. Please see LICENSE.txt for the full license text.