Merged
33 commits
67da3a1
Add checklists for GH and HF expectations
egrace479 Feb 14, 2025
9ac2681
Update formatting for Mkdocs
egrace479 Feb 14, 2025
532bb03
update URLs to more precise references
egrace479 Feb 17, 2025
c04c618
Add references to other checklists
egrace479 Feb 17, 2025
8a4c138
update checklist descriptions at tops of pages
egrace479 Feb 17, 2025
86f99bb
Add link to repo issues to encourage dialog in case of questions/comm…
egrace479 Mar 19, 2025
6a0ba1e
Add description of what to expect from using checklist
egrace479 Apr 8, 2025
6039add
Rename to checklist
egrace479 Apr 8, 2025
9f4facf
Add help link at bottom of page
egrace479 Apr 8, 2025
868ba65
Add page explaining FAIR and providing context for checklists
egrace479 Apr 8, 2025
caf6e54
Reformulate Metadata Guide as FAIR Guide
egrace479 Apr 8, 2025
ef6701e
Rename Metadata-Guide file to Metadata-Checklist for consistency with…
egrace479 Apr 9, 2025
e211756
Remove course-specific aspect of note
egrace479 Apr 9, 2025
0eba37d
chore: linting
gwtaylor Apr 11, 2025
cec31d8
Update FAIR Guide navigation to clarify section title
gwtaylor Apr 11, 2025
71cdd3b
Clarify reproducibility context in FAIR Guide and update references
gwtaylor Apr 11, 2025
5382a0f
lint Metadata-Checklist.md
gwtaylor Apr 11, 2025
5ea502f
minor edits FAIR-Guide.md and Metadata-Checklist.md
gwtaylor Apr 11, 2025
0bb6d90
lint Code-Checklist.md
gwtaylor Apr 11, 2025
15e9bb7
fix: move nested checklist to 4 spaces indent for Python-Markdown com…
gwtaylor Apr 11, 2025
9485411
revert to 4 space indentation due to issues with Python-Markdown and …
gwtaylor Apr 11, 2025
4126ec3
minor edits to Code Repo Checklist
gwtaylor Apr 11, 2025
729648f
fix: lint Data-Checklist.md
gwtaylor Apr 11, 2025
8249e02
minor edits to Data-Checklist.md
gwtaylor Apr 11, 2025
7140de0
fix: lint Model-Checklist.md
gwtaylor Apr 11, 2025
4de9506
minor edits to Model-Checklist.md
gwtaylor Apr 11, 2025
f626ad6
fix: lint DOI-Generation.md
gwtaylor Apr 11, 2025
37b4ceb
fix: improve flow of intro to DOI guide
gwtaylor Apr 11, 2025
dfddaa0
minor edits to DOI-Generation.md
gwtaylor Apr 11, 2025
a9ce38c
Ensure examples are viable format for HF cards
egrace479 Apr 24, 2025
7abbf6d
Add clarification on what constitutes an issue tracker for HF
egrace479 Apr 24, 2025
eddffa9
Adjust sub-bullet formatting for clearer comment on the source
egrace479 Apr 25, 2025
bb676d7
Add more helpful references
egrace479 Jun 3, 2025
5 changes: 5 additions & 0 deletions .markdownlint.json
@@ -0,0 +1,5 @@
{
  "MD007": { "indent": 4 },
  "no-hard-tabs": false,
  "MD013": false
}
2 changes: 1 addition & 1 deletion docs/index.md
@@ -15,7 +15,7 @@ Check out our guides to get your project off on the right foot!

- [The Hugging Face Repo Guide](wiki-guide/Hugging-Face-Repo-Guide.md): Analogous expected and suggested repository contents for Hugging Face repositories; there are notable differences from GitHub in both content and structure.

- [Metadata Guide](wiki-guide/Metadata-Guide.md): Guide to metadata collection and documentation. This closely follows our [HF Dataset Card Template](wiki-guide/HF_DatasetCard_Template_mkdocs.md) sections.
- [FAIR Guide](wiki-guide/FAIR-Guide.md): Guide to producing FAIR digital products, from metadata collection through product documentation and publication. This builds on the content in both the GitHub and Hugging Face Repository Guides, providing checklists to ensure [code](wiki-guide/Code-Checklist.md), [data](wiki-guide/Data-Checklist.md), and [model](wiki-guide/Model-Checklist.md) repositories are FAIR. The latter two closely follow our [HF Templates](wiki-guide/About-Templates.md).

### Project repo up, what's next?
Check out our workflow guides for how to interact with your new repo:
107 changes: 107 additions & 0 deletions docs/wiki-guide/Code-Checklist.md
@@ -0,0 +1,107 @@
# Code Checklist

This checklist provides an overview of essential and recommended elements to include in a GitHub repository to ensure that it conforms to FAIR principles and best practices for reproducibility. Along with the generation of a DOI (see [DOI Generation](DOI-Generation.md) and [Digital Products Release and Licensing Policy](Digital-products-release-licensing-policy.md)), following this checklist ensures compliance with the FAIR Principles for research software.[^1]
[^1]: Barker, M., Chue Hong, N. P., Katz, D. S., Lamprecht, A. L., Martinez-Ortiz, C., Psomopoulos, F., Harrow, J., Castro, L. J., Gruenpeter, M., Martinez, P. A., & Honeyman, T. (2022). Introducing the FAIR Principles for research software. _Scientific data_, 9(1), 622. [URL](https://doi.org/10.1038/s41597-022-01710-x).

!!! tip "Pro tip"

Use the eye icon at the top of this page to access the source and copy the markdown for the checklist below into an issue on your GitHub [Repo](GitHub-Repo-Guide.md) or [Project](Guide-to-GitHub-Projects.md) so you can check the boxes as you add each element to your GitHub repository.

## Required Files

- [ ] **License**: Verify and include an appropriate license (e.g., `MIT`, `CC0-1.0`, etc.). See discussion in the [Repo Guide](GitHub-Repo-Guide.md/#license).
- [ ] **README File**: Following the [Repo Guide](GitHub-Repo-Guide.md/#readme), provide a detailed `README.md` with:
    - [ ] Overview of the project.
    - [ ] Installation instructions.
    - [ ] Basic usage examples.
    - [ ] Links to related/created dataset(s).
    - [ ] Links to related/created model(s).
    - [ ] Acknowledge source code dependencies and contributors.
    - [ ] Reference related datasets used in training or evaluation.
- [ ] **Requirements File**: Provide a [file detailing software requirements](GitHub-Repo-Guide.md/#software-requirements-file), such as a `requirements.txt` or `pyproject.toml` for Python dependencies.
- [ ] **Gitignore File**: Include a `.gitignore` file. GitHub has premade `.gitignore` files ([here](https://github.com/github/gitignore)) tailored to particular languages (e.g., [R](https://github.com/github/gitignore/blob/main/R.gitignore) or [Python](https://github.com/github/gitignore/blob/main/Python.gitignore)), operating systems, etc.
- [ ] **CITATION CFF**: Include a `CITATION.cff` file to facilitate citation of your work; follow the guidance provided in the [Repo Guide](GitHub-Repo-Guide.md/#citation).

### Data-Related

- [ ] Preprocessing code.
- [ ] Description of dataset(s), including description of training and testing sets (with links to relevant portions of dataset card, which will have more information).

### Model-Related

- [ ] Training code.
- [ ] Inference/evaluation code.
- [ ] Model weights (if not in Hugging Face model repository).
- [ ] Description of model(s)/benchmark(s).
- [ ] Explanation of training and testing (with links to relevant portions of model card, which will have more information).

!!! note
The [bioclip GitHub repository](https://github.com/Imageomics/bioclip) provides an example of incorporating data- and model-related code into a GitHub repository as published open-source code for both data and model development.

## General Information

- [ ] **Repository Structure**: Ensure the code repository follows a clear and logical directory structure. (See [Repo Guide](GitHub-Repo-Guide.md/#general-repository-structure).)
- [ ] **Code Comments**: Include meaningful inline comments and function descriptions for clarity.
- [ ] **Random Seed Control**: Save seed(s) for random number generator(s) to ensure reproducible results.
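
As a minimal sketch of seed control, assuming a Python project that relies only on the standard-library `random` module, the seed can be set in one place and saved with the results. The `set_seed` and `run_experiment` helpers and the seed value are hypothetical names chosen for illustration; projects using NumPy, PyTorch, or similar libraries would also need to seed those generators.

```python
import json
import random


def set_seed(seed: int) -> None:
    """Seed Python's built-in random number generator (hypothetical helper)."""
    random.seed(seed)


def run_experiment(seed: int) -> dict:
    """Run a toy 'experiment' and return its results together with the seed."""
    set_seed(seed)
    samples = [random.random() for _ in range(3)]
    # Saving the seed with the output lets others reproduce the same run.
    return {"seed": seed, "samples": samples}


if __name__ == "__main__":
    print(json.dumps(run_experiment(42), indent=2))
```

Calling `run_experiment` twice with the same seed returns identical samples, which is the property a reader needs in order to reproduce reported numbers.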

## Security Considerations

- [ ] **Sensitive Data Handling**: Ensure no hardcoded sensitive information (e.g., API keys, credentials) is included in your repository. These can be shared through a config file on OSC.
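
One common way to keep secrets out of source code, sketched here as an assumption rather than an Institute requirement, is to read them from an environment variable at runtime. The variable name `MY_SERVICE_API_KEY` and the `get_api_key` helper are hypothetical.

```python
import os


def get_api_key(var_name: str = "MY_SERVICE_API_KEY") -> str:
    """Read an API key from the environment instead of hardcoding it.

    `MY_SERVICE_API_KEY` is a hypothetical variable name used for illustration.
    """
    key = os.environ.get(var_name)
    if not key:
        # Fail with a clear message rather than continuing without credentials.
        raise RuntimeError(
            f"Environment variable {var_name} is not set; "
            "set it locally rather than committing the key to the repository."
        )
    return key
```

With this pattern the key never appears in the repository history, and a missing key produces an explicit error instead of a confusing downstream failure.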

!!! note
The best practices described below will help you meet the above requirements. The more advanced development practices noted further down are included for educational purposes and are highly recommended—though these may go beyond what is expected for a given project, we advise collaborators to at least have a discussion about the topics covered in [Code Quality](#code-quality) and whether other practices discussed would be appropriate for their project.

---

## Best Practices

The [Repo Guide](GitHub-Repo-Guide.md/) provides general guidance on repository structure, [collaborative workflow](The-GitHub-Workflow.md/), and [how to make and review pull requests (PR)](The-GitHub-Pull-Request-Guide.md/). Below, we highlight some best practices in checklist form to help you meet the requirements described above for a FAIR and Reproducible project.

### Reproducibility

- **Version Control**: Use Git for version control and commit regularly.
- **Modularization**: Structure code into reusable and independent modules.
- **Code Execution**: Provide Notebooks to demonstrate how to reproduce results.

### Code Review & Maintenance

- **Code Reviews**: Regular peer reviews for quality assurance. Refer to the [GitHub PR Review Guide](The-GitHub-Pull-Request-Guide.md/#2-review-a-pull-request).
- **Issue Tracking**: Use GitHub issues for tracking bugs and feature requests.
- **Versioning**: Tag releases; changelogs can be auto-generated and are informative when PRs are appropriately scoped.

### Installation and Dependencies

- [ ] **Environment Setup**: Include setup instructions (e.g., `conda` environment file, `Dockerfile`).
- [ ] **Dependency Management**: Use virtual environments and the frameworks that manage them (e.g., `venv`, `conda`, `uv` for Python) to isolate dependencies.

---

## More Advanced Development

### Documentation

- [ ] **API Documentation**: Generate API documentation (e.g., [`MkDocs`](https://www.mkdocs.org) for Python or wiki pages in the repo).
- [ ] **Docstrings**: Add comprehensive docstrings for all functions, classes, and modules. These can be incorporated to help generate documentation. Note that generative AI tools with access to your code, such as GitHub Copilot, can be quite accurate in generating these, especially if you are using type annotations.
- [ ] **Example Scripts**: Include example scripts for common use cases.
- [ ] **Configuration Files**: Use `yaml`, `json`, or `ini` for configuration settings.
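
The docstring and configuration-file items above can be combined in a small sketch, assuming a Python project that stores its settings in JSON. The `load_config` function and the `config.json` file name are hypothetical; the docstring layout shown is one widely used convention, not a required one.

```python
import json
from pathlib import Path


def load_config(path: str) -> dict:
    """Load settings from a JSON configuration file.

    Args:
        path: Location of the JSON file (e.g., a hypothetical `config.json`).

    Returns:
        A dictionary of configuration settings.

    Raises:
        FileNotFoundError: If no file exists at `path`.
    """
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)
```

Type annotations plus a structured docstring give documentation generators, and readers, enough information to understand the function without opening its body.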

### Code Quality

- [ ] **Consistent Style**: Follow coding style guidelines (e.g., `PEP 8` for Python).
- [ ] **Linting**: Ensure the code passes a linter (e.g., `Ruff` for Python).
- [ ] **Logging**: Use logging instead of print statements for better debugging (e.g., `logging` in Python).
- [ ] **Error Handling**: Implement robust exception handling to avoid crashes or bogus results from input outside of code expectations.
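
To illustrate the logging and error-handling items, here is a minimal sketch using Python's standard `logging` module. The `safe_ratio` function is a hypothetical example; the point is that invalid input raises an exception instead of silently producing a bogus result.

```python
import logging

logger = logging.getLogger(__name__)


def safe_ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, rejecting input the code cannot handle."""
    if denominator == 0:
        # Fail loudly instead of returning a misleading value.
        logger.error("Denominator is zero; cannot compute ratio.")
        raise ValueError("denominator must be non-zero")
    result = numerator / denominator
    logger.debug("Computed ratio %s / %s = %s", numerator, denominator, result)
    return result


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    try:
        safe_ratio(1.0, 0.0)
    except ValueError as err:
        logger.warning("Skipping invalid input: %s", err)
```

Unlike `print`, log messages carry a severity level, so verbose debugging output can be switched on or off through configuration without editing the code.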

### Testing

- [ ] **Unit Tests**: Write unit tests to validate core functionality.
- [ ] **Integration Tests**: Ensure components work together correctly.
- [ ] **Test Coverage**: Check test coverage, e.g., using [Coverage](https://coverage.readthedocs.io/).
- [ ] **Continuous Integration (CI)**: Set up CI/CD pipelines (e.g., [GitHub Actions](https://docs.github.com/en/actions)) for automated testing.
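
A unit test can be as small as the following sketch, which uses Python's built-in `unittest` module. The `normalize` function is a toy stand-in for a project's core functionality, not code from any Imageomics repository.

```python
import unittest


def normalize(values: list[float]) -> list[float]:
    """Scale values so they sum to 1 (toy function used only for this example)."""
    total = sum(values)
    if total == 0:
        raise ValueError("values must not sum to zero")
    return [v / total for v in values]


class TestNormalize(unittest.TestCase):
    def test_sums_to_one(self):
        self.assertAlmostEqual(sum(normalize([1.0, 3.0])), 1.0)

    def test_known_values(self):
        self.assertEqual(normalize([1.0, 3.0]), [0.25, 0.75])

    def test_zero_total_raises(self):
        # Invalid input should raise, not return a bogus result.
        with self.assertRaises(ValueError):
            normalize([0.0, 0.0])


if __name__ == "__main__":
    unittest.main()
```

Tests like these can be run locally and are the kind of check a CI pipeline would execute automatically on each push or pull request.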

### Code Distribution & Deployment

- [ ] **Packaging**: Provide installation instructions (e.g., `setup.py`, `hatch`, `poetry`, `uv` for Python).
- [ ] **Deployment Guide**: Document deployment procedures.

!!! question "[Questions, Comments, or Concerns?](https://github.com/Imageomics/Imageomics-guide/issues)"
17 changes: 7 additions & 10 deletions docs/wiki-guide/DOI-Generation.md
@@ -1,31 +1,28 @@
# DOI Generation

This guide discusses DOI generation for digital artifacts that may be associated with publications, such as datasets, models, and software.
You are likely familiar with DOIs from citing (journal/arXiv/conference) papers, for which they are generated by the publisher and regularly used in citations. However, they are also invaluable for proper citation of code, models, and data. One may think of this in the manner they are handled on arXiv, where there are options for "Cite as:" or "for this version" (with the "v#" at the end) option when citing a preprint.
You are likely familiar with DOIs from citing (journal/arXiv/conference) papers, for which they are generated by the publisher and regularly used in citations. However, they are also invaluable for proper citation of code, models, and data. Similar to how DOIs help track different versions of preprints on repositories like arXiv, they can provide persistent identification and versioning for your research artifacts beyond traditional publications.

## What is a DOI?

A DOI (Digital Object Identifier) is a _persistent_ (permanent) digital identifier for any object (data, model, code, etc.) that _uniquely_ distinguishes it from other objects and links to information—metadata—about the object. The International DOI Foundation (IDF) is responsible for developing and administering the DOI system. See their [What is a DOI](https://www.doi.org/the-identifier/what-is-a-doi/) article for more information.

A DOI (Digital Object Identifier) is a _persistent_ (permanent) digital identifier for any object (data, model, code, etc.) that _uniquely_ distinguishes it from other objects and links to information—metadata—about the object. The International DOI Foundation (IDF) is responsible for developing and administering the DOI system. See their [What is a DOI?](https://www.doi.org/the-identifier/what-is-a-doi/) article for more information.
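
Because a DOI resolves through the `https://doi.org/` resolver, a bare identifier can be turned into a persistent link with a small helper. This is an illustrative sketch; `doi_to_url` is a hypothetical function, and the DOI used is the one for the FAIR research software paper cited in the Code Checklist.

```python
def doi_to_url(doi: str) -> str:
    """Turn a bare DOI into a resolvable https://doi.org/ link."""
    doi = doi.strip()
    # Drop common prefixes so the same DOI always maps to the same URL.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return f"https://doi.org/{doi}"


if __name__ == "__main__":
    print(doi_to_url("10.1038/s41597-022-01710-x"))
```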

## How do you generate a DOI?

When publishing code, data, or models, there are various options for DOI generation, and selecting one is generally dependent on where the object of interest is published. We will go over the two standard methods used by the Institute here, and we mention a third option for completeness. A comparison of these three options is provided in the [Data Archive Options Comparative Overview](../pdfs/Data_Archive-Publication-Options-Comparative-Overview.pdf).


### 1. Generate a DOI on Hugging Face

This is the simplest method for generating a DOI for a model or dataset since [Hugging Face partnered with DataCite to offer this option](https://huggingface.co/blog/introducing-doi).

!!! warning "Warning"
Though it is a very simple process, it is not one to be taken lightly, as there is no removing data once this has been done--any changes require generation of a ***new*** DOI for the updated version: the old version will be maintained in perpetuity!
Though it is a very simple process, it is not one to be taken lightly, as there is no removing data once this has been done--any changes require generation of a _**new**_ DOI for the updated version: the old version will be maintained in perpetuity!

!!! warning "Warning"
As stated in the [Imageomics Digital Products Release and Licensing Policy](Digital-products-release-licensing-policy.md), DOIs are not to be generated for Imageomics Organization Repositories until approval has been granted by the Senior Data Scientist or Institute Leadership.

Hugging Face allows for the generation of a DOI through the settings tab on the Model or Dataset. For details on _how_ to generate a DOI with Hugging Face, please see the [Hugging Face DOI Documentation](https://huggingface.co/docs/hub/doi).


### 2. Generate a DOI with Zenodo

This is the most common method used for generating a DOI for a GitHub repository, because [Zenodo](https://zenodo.org/) has a [GitHub integration](https://zenodo.org/account/settings/github/), which is accessed through your Zenodo account settings (for more information, please see [GitHub's associated Docs](https://docs.github.com/articles/referencing-and-citing-content)). Zenodo can also be used to generate DOIs for data, as is relatively common in biology. However, for direct use of ML models and datasets, there are many more advantages to using Hugging Face; please see the [Data Archive Options Comparative Overview](../pdfs/Data_Archive-Publication-Options-Comparative-Overview.pdf) for more information.[^1]
@@ -38,11 +35,11 @@ When your GitHub and Zenodo accounts are linked, there will be a list of availab
![Zenodo instructions and enabled repos](images/doi-generation/enabled_repos+intstructions.png){ loading=lazy, width="800" }

!!! info "The Sync now button"
There is a "Sync now" button at the top right of the instructions, with information on when the last sync occurred. Observe that a badge appears for the enabled repository that <b>_has_</b> a DOI, while the one without just shows up as enabled; this will also be true for repositories to which you have access but that you did not submit to Zenodo yourself.
There is a "Sync now" button at the top right of the instructions, with information on when the last sync occurred. Observe that a badge appears for the enabled repository that **_has_** a DOI, while the one without just shows up as enabled; this will also be true for repositories to which you have access but that you did not submit to Zenodo yourself.

#### Metadata Tracking

When automatically generating a DOI with Zenodo, it uses information provided in your `CITATION.cff` file to populate the metadata for the record. However, there is important information that is not supported through this integration despite its inclusion in the `CITATION.cff` format in some cases.

If your repository is likely to be updated repeatedly (i.e., generating new releases), then you may consider adding a `.zenodo.json` to preserve the remaining metadata on release sync with Zenodo for DOI. This metadata includes grant (funding) information, references (which may be included in your `CITATION.cff`), and a description of your repository/code.

@@ -70,8 +67,8 @@ Building on the alternate edit options, there is also the option to simply gener

When creating a new record on Zenodo, please ensure that other members of your project have access, as appropriate. In particular, there should be at least one member of Institute leadership or the Senior Data Scientist added to the record with management permissions. This ensures the ability to maintain the metadata and address matters related to the record (which may extend beyond your tenure with the Institute) in a timely manner.


### 3. Generate a DOI with Dryad

[Dryad](https://datadryad.org/stash/about) is another research data repository, similar to Zenodo, through which one can archive digital objects (such as, but not limited to, data) supporting scholarly publications, and obtain a DOI. It has a review process when depositing data and requires dedication to the public domain (CC0) of all digital objects uploaded. Imageomics through OSU is a member organization of Dryad, reducing or eliminating data deposit charge(s). To determine whether Dryad is a suitable archive for Institute data products supporting your publication, please consider the [Data Archive Options Comparative Overview](../pdfs/Data_Archive-Publication-Options-Comparative-Overview.pdf) for more information, and consult with the Institute's Senior Data Scientist.[^1]

!!! question "[Questions, Comments, or Concerns?](https://github.com/Imageomics/Imageomics-guide/issues)"