fragSMILES4Reactions is a scientific project focused on the analysis and modeling of chemical reactions using the fragSMILES representation, compared with other notations such as SMILES and SELFIES and with fragment-based notations such as SAFE and t-SMILES. This repository contains all the code, data, and scripts needed to reproduce the experiments and results described in the associated research work.
The main repository for the fragSMILES algorithm can be found here: chemicalGoF GitHub.
We also acknowledge and thank the authors of the following open-source projects, which are fundamental references for molecular representations:
- SELFIES – https://github.com/aspuru-guzik-group/selfies
- SAFE – https://github.com/datamol-io/safe
- t-SMILES – https://github.com/juanniwu/t-SMILES
- `rawdata_reactions/` – Raw reaction dataset, already split into training, validation, and test sets.
- `data_reactions/` – Processed reaction data used as input for the experiments.
- `experiments_reactions/` – Outputs from model training and prediction. Folder names follow the convention `{key}={value}-{key}={value}-...`
- `floats/` – Figures (in PDF format) and tables (in LaTeX format) generated during analysis.
- `notebooks/` – Jupyter notebooks for data exploration and post-processing of prediction results.
- `bestof_setup/` – CSV files reporting the best configuration found for each model.
- `scripts/` – Python scripts for preprocessing, training, prediction, and SMILES conversion tasks.
- `src/` – Main source code of the project.
- `extra/` – Contains an example reaction used for creating the introductory figure/chart.
- `shell/` – Includes `run.sh`, a script to launch experiments using the best configurations for each model.
- `requirements.txt` – List of required Python dependencies for setting up the environment.
- `chemicalgof/`, `SAFE/`, `datamol/`, `t-SMILES/` – External static repositories adopted for this project. The `SAFE` package has been modified to detect and report the reasons for invalid sampled sequences; `chemicalgof/` is the version adopted in this work to handle the fragSMILES notation.
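Since folder names in `experiments_reactions/` encode their configuration as `{key}={value}` pairs joined by `-`, a name can be mapped back to its parameters. A minimal sketch (the helper name is hypothetical, and it assumes neither keys nor values contain `-` or `=`):

```python
def parse_experiment_name(name: str) -> dict:
    """Split a folder name like 'task=forward-notation=fragsmiles' into a dict.

    Hypothetical helper, not part of the repository: it assumes keys and
    values contain no '-' or '=' characters themselves.
    """
    return dict(part.split("=", 1) for part in name.split("-"))

params = parse_experiment_name("task=forward-notation=fragsmiles-model_dim=256")
```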
The output of the experiments is already included in `experiments_reactions/`, including the model checkpoints (`.ckpt` files) used for the analysis.
You can train the models yourself, making sure not to resume from the existing checkpoints.
The prediction phase (see the Scripts section) can be executed directly if the trained models are stored in the appropriate experiment folder (see the [Project Structure](#project-structure) section for the folder-naming scheme).
However, the results of such predictions have already been analyzed and are available in this repository.
To reproduce our experiments:

1. Clone the repository:

   ```shell
   git clone https://github.com/molML/fragSMILES4reaction.git
   cd fragSMILES4Reactions
   ```

2. Set up the Python environment:

   ```shell
   python -m venv .venv
   source .venv/bin/activate  # On Windows: .venv\Scripts\activate
   pip install -r requirements.txt
   ```

3. Run the experiments using the best configurations for each model through the provided shell script.

   NOTE: set the Python environment path to be activated on line 3 of `shell/run.sh`.

   ```shell
   bash shell/run.sh
   ```

   ⚠️ These experiments were conducted using 4 GPUs in parallel. Running on fewer or lower-memory devices may result in out-of-memory errors.

4. Explore the Jupyter notebooks in the Notebooks section to analyze the datasets and prediction results.
| Parameter | Description |
|---|---|
| `task` | Task to perform: either `forward` (i.e., synthesis) or `backward` (i.e., retrosynthesis). |
| `notation` | Molecular representation format used as input/output: `smiles`, `selfies`, `safe`, or `fragsmiles`. |
| `model_dim` | Dimensionality of the model's hidden layers (e.g., transformer embedding size). |
| `num_heads` | Number of attention heads in the multi-head attention mechanisms. |
| `num_layers` | Number of layers (e.g., encoder or decoder blocks) in the model architecture. |
| `batch_size` | Number of training samples processed simultaneously during one training step. |
| `lr` | Learning rate used by the optimizer to update model weights. |
| `dropout` | Dropout rate for regularization to prevent overfitting (only the value 0.3 was adopted in this work). |
These parameters are used as arguments in the Python scripts (see the Scripts section) for training and prediction.
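For illustration, the parameter table above maps naturally onto a command-line interface built with `argparse`. The sketch below is a hypothetical approximation, not the repository's actual CLI definition: the flag names follow the table, but the defaults and choices are assumptions.

```python
import argparse

# Hypothetical sketch of a CLI matching the parameter table;
# defaults and choices are illustrative assumptions.
parser = argparse.ArgumentParser(description="Reaction prediction experiment")
parser.add_argument("--task", choices=["forward", "backward"], required=True)
parser.add_argument("--notation", choices=["smiles", "selfies", "safe", "fragsmiles"], required=True)
parser.add_argument("--model_dim", type=int, default=256)
parser.add_argument("--num_heads", type=int, default=8)
parser.add_argument("--num_layers", type=int, default=4)
parser.add_argument("--batch_size", type=int, default=64)
parser.add_argument("--lr", type=float, default=1e-3)
parser.add_argument("--dropout", type=float, default=0.3)

# Unspecified flags fall back to the (assumed) defaults above.
args = parser.parse_args(["--task", "forward", "--notation", "fragsmiles"])
```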
We recommend running scripts from the root directory. Example:
```shell
python scripts/script_file.py --argument1 value1 --argument2 value2
```
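When launching many configurations programmatically, the generic `--argument value` pattern above can be assembled from a parameter dict. A minimal sketch (`build_command` is a hypothetical helper, not part of the repository):

```python
def build_command(script: str, params: dict) -> list:
    """Build an argument list for a repository script from a parameter dict.

    Hypothetical helper mirroring the '--key value' pattern; the resulting
    list can be passed to subprocess.run(...) from the repository root.
    """
    cmd = ["python", f"scripts/{script}"]
    for key, value in params.items():
        cmd += [f"--{key}", str(value)]
    return cmd

cmd = build_command("train.py", {"task": "forward", "notation": "fragsmiles"})
```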
- `convert_dataset.py` – Prepares the dataset used for the experiments, starting from the raw data. Please explore the arguments to be provided (`python scripts/convert_dataset.py --help`). The most important arguments are `notation`, `split`, and `ncpus` (for multiprocessing). When a notation-based dataset is obtained, a CSV file is written to track sequence lengths.
- `train.py` – Trains a model using the selected configuration (see the dedicated section). The model checkpoint (`.ckpt` file) is stored in the corresponding experiment folder, and the vocabulary file (`vocab.pt`) is stored in the respective notation folder.
- `predict.py` – Predicts the test set with a trained model using the selected configuration (see the dedicated section). The output includes encoded predicted sequences stored in the respective experiment folder, with filenames containing the `tokens` substring.
- `convert_prediction_strict.py` – Converts the encoded predicted sequences obtained by a model, specified via its parameters. Invalid decoded sequences include those with an erroneous chirality label assigned to atoms.
- `convert_prediction_strict_from_path.py` – Same as above, but only requires the path to the experiment folder.
- `convert_prediction.py` and `convert_prediction_from_path.py` – Similar to the strict versions, but invalid sequences do not include those with an erroneous chirality label assigned to atoms.
- `fragment_dataset.py` – Fragments the SMILES of the data to obtain their scaffolds, cycles, and acyclic chains. Used only on the test set, as demonstrated in `05_struggle.ipynb`.
The Jupyter notebooks in `notebooks/` provide an interactive way to explore the datasets and experiment outputs.

NOTE: the IPython package is required to work with the notebooks.

- `data_analysis.ipynb` – Can be explored before running the experiments. It reports sequence lengths and dataset sizes per split.
- `bestof_selection.ipynb` – Visualizes and compares loss curves for different hyperparameter settings to identify the optimal configurations.
- `accuracy.ipynb` – Computes performance metrics for the best models and outputs tables ready for publication.
- `similarity.ipynb` – Analyzes similarity distributions between incorrect but valid predictions and their target molecules (forward task only).
- `struggle.ipynb` – Investigates failure cases in prediction, including reasons for invalid samples and substructure matching in erroneous predictions.
- `stereoselectivity.ipynb` – Additional analysis of different types of reactions involving stereocenters, including stereoselective reactions.
If you find this work useful, feel free to cite the following publication:
Fabrizio Mastrolorito, Fulvio Ciriaco, Orazio Nicolotti, Francesca Grisoni
Enhancing deep chemical reaction prediction with advanced chirality and fragment representation
Chem. Commun., 2025, The Royal Society of Chemistry.
https://doi.org/10.1039/D5CC02641E
@article{mastrolorito2025fragsmiles4reactions,
author = "Mastrolorito, Fabrizio and Ciriaco, Fulvio and Nicolotti, Orazio and Grisoni, Francesca",
title = "Enhancing deep chemical reaction prediction with advanced chirality and fragment representation",
journal = "Chem. Commun.",
year = "2025",
pages = "-",
publisher = "The Royal Society of Chemistry",
doi = "10.1039/D5CC02641E",
url = "http://dx.doi.org/10.1039/D5CC02641E",
}
This project is licensed under the MIT License. See the LICENSE file for details.