fragSMILES4Reactions is a scientific project focused on the analysis and modeling of chemical reactions using the fragSMILES representation, compared with other notations such as SMILES and SELFIES and with fragment-based notations such as SAFE and t-SMILES. This repository contains all the code, data, and scripts needed to reproduce the experiments and results described in the associated research work.
The main repository for the fragSMILES algorithm can be found here: chemicalGoF GitHub.
We also acknowledge and thank the authors of the following open-source projects, which are fundamental references for molecular representations:
- SELFIES – https://github.com/aspuru-guzik-group/selfies
- SAFE – https://github.com/datamol-io/safe
- t-SMILES – https://github.com/juanniwu/t-SMILES
- `rawdata_reactions/` – Raw reaction dataset, already split into training, validation, and test sets.
- `data_reactions/` – Processed reaction data used as input for the experiments.
- `experiments_reactions/` – Outputs from model training and prediction. Folder names follow the convention `{key}={value}-{key}={value}-...`
- `floats/` – Figures (in PDF format) and tables (in LaTeX format) generated during analysis.
- `notebooks/` – Jupyter notebooks for data exploration and post-processing of prediction results.
- `bestof_setup/` – CSV files reporting the best configuration found for each model.
- `scripts/` – Python scripts for preprocessing, training, prediction, and SMILES conversion tasks.
- `src/` – Main source code of the project.
- `extra/` – Contains an example reaction used for creating the introductory figure/chart.
- `shell/` – Includes `run.sh`, a script to launch experiments using the best configurations for each model.
- `requirements.txt` – List of required Python dependencies for setting up the environment.
- `chemicalgof/`, `SAFE/`, `datamol/`, `t-SMILES/` – External static repositories adopted for this project. The `SAFE` package has been modified to detect and report the reasons for invalid sampled sequences; `chemicalgof/` is the version adopted in this work to handle the fragSMILES notation.
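Since folder names in `experiments_reactions/` encode their configuration as `{key}={value}` pairs joined by `-`, a name can be mapped back to its parameters. A minimal sketch (the helper name is hypothetical, and it assumes neither keys nor values contain `-` or `=`):

```python
def parse_experiment_name(name: str) -> dict:
    """Split a folder name like 'task=forward-notation=fragsmiles' into a dict.

    Hypothetical helper, not part of the repository: it assumes keys and
    values contain no '-' or '=' characters themselves.
    """
    return dict(part.split("=", 1) for part in name.split("-"))

params = parse_experiment_name("task=forward-notation=fragsmiles-model_dim=256")
```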
The output of the experiments is already included in `experiments_reactions/`, including the model checkpoints (`.ckpt` files) used for the analysis.
You can train the models yourself, making sure not to resume from the existing checkpoints.
The prediction phase (see the Scripts section) can be executed directly if the trained models are stored in the appropriate experiment folder (see the [Project Structure](#project-structure) section for the folder-naming scheme).
However, the results of such predictions have already been analyzed and are available in this repository.
To reproduce our experiments:

1. Clone the repository:

   ```shell
   git clone https://github.com/molML/fragSMILES4reaction.git
   cd fragSMILES4Reactions
   ```

2. Set up the Python environment:

   ```shell
   python -m venv .venv
   source .venv/bin/activate  # On Windows: .venv\Scripts\activate
   pip install -r requirements.txt
   ```

3. Run the experiments using the best configurations for each model through the provided shell script.

   NOTE: set the Python environment path to be activated on line 3 of `shell/run.sh`.

   ```shell
   bash shell/run.sh
   ```

   ⚠️ These experiments were conducted using 4 GPUs in parallel. Running on fewer or lower-memory devices may result in out-of-memory errors.

4. Explore the Jupyter notebooks in the Notebooks section to analyze the datasets and prediction results.
| Parameter | Description |
|---|---|
| `task` | Task to perform: either `forward` (i.e., synthesis) or `backward` (i.e., retrosynthesis). |
| `notation` | Molecular representation format used as input/output: `smiles`, `selfies`, `safe`, or `fragsmiles`. |
| `model_dim` | Dimensionality of the model's hidden layers (e.g., transformer embedding size). |
| `num_heads` | Number of attention heads in the multi-head attention mechanisms. |
| `num_layers` | Number of layers (e.g., encoder or decoder blocks) in the model architecture. |
| `batch_size` | Number of training samples processed simultaneously during one training step. |
| `lr` | Learning rate used by the optimizer to update model weights. |
| `dropout` | Dropout rate for regularization to prevent overfitting (only the value 0.3 was adopted in this work). |
These parameters are used as arguments in the Python scripts (see the Scripts section) for training and prediction.
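For illustration, the parameter table above maps naturally onto a command-line interface built with `argparse`. The sketch below is a hypothetical approximation, not the repository's actual CLI definition: the flag names follow the table, but the defaults and choices are assumptions.

```python
import argparse

# Hypothetical sketch of a CLI matching the parameter table;
# defaults and choices are illustrative assumptions.
parser = argparse.ArgumentParser(description="Reaction prediction experiment")
parser.add_argument("--task", choices=["forward", "backward"], required=True)
parser.add_argument("--notation", choices=["smiles", "selfies", "safe", "fragsmiles"], required=True)
parser.add_argument("--model_dim", type=int, default=256)
parser.add_argument("--num_heads", type=int, default=8)
parser.add_argument("--num_layers", type=int, default=4)
parser.add_argument("--batch_size", type=int, default=64)
parser.add_argument("--lr", type=float, default=1e-3)
parser.add_argument("--dropout", type=float, default=0.3)

# Unspecified flags fall back to the (assumed) defaults above.
args = parser.parse_args(["--task", "forward", "--notation", "fragsmiles"])
```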
We recommend running scripts from the root directory. Example:
```shell
python scripts/script_file.py --argument1 value1 --argument2 value2
```
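When launching many configurations programmatically, the generic `--argument value` pattern above can be assembled from a parameter dict. A minimal sketch (`build_command` is a hypothetical helper, not part of the repository):

```python
def build_command(script: str, params: dict) -> list:
    """Build an argument list for a repository script from a parameter dict.

    Hypothetical helper mirroring the '--key value' pattern; the resulting
    list can be passed to subprocess.run(...) from the repository root.
    """
    cmd = ["python", f"scripts/{script}"]
    for key, value in params.items():
        cmd += [f"--{key}", str(value)]
    return cmd

cmd = build_command("train.py", {"task": "forward", "notation": "fragsmiles"})
```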
- `convert_dataset.py` – Prepares the dataset used for the experiments, starting from the raw data. Please explore the arguments to be provided (`python scripts/convert_dataset.py --help`). The most important arguments are `notation`, `split`, and `ncpus` (for multiprocessing). When a notation-based dataset is obtained, a CSV file is written to track sequence lengths.
- `train.py` – Trains a model using the selected configuration (see the dedicated section). The model checkpoint (`.ckpt` file) is stored in the corresponding experiment folder, and the vocabulary file (`vocab.pt`) is stored in the respective notation folder.
- `predict.py` – Predicts the test set with a trained model using the selected configuration (see the dedicated section). The output includes encoded predicted sequences stored in the respective experiment folder, with filenames containing the `tokens` substring.
- `convert_prediction_strict.py` – Converts the encoded predicted sequences obtained by a model, specified via its parameters. Invalid decoded sequences include those with an erroneous chirality label assigned to atoms.
- `convert_prediction_strict_from_path.py` – Same as above, but only requires the path to the experiment folder.
- `convert_prediction.py` and `convert_prediction_from_path.py` – Similar to the strict versions, but invalid sequences do not include those with an erroneous chirality label assigned to atoms.
- `fragment_dataset.py` – Fragments the SMILES of the data to obtain their scaffolds, cycles, and acyclic chains. Used only on the test set, as demonstrated in `05_struggle.ipynb`.
The Jupyter notebooks in `notebooks/` provide an interactive way to explore the datasets and experiment outputs.

NOTE: the IPython package is required to work with the notebooks.

- `data_analysis.ipynb` – Can be explored before running the experiments. It reports sequence lengths and dataset sizes per split.
- `bestof_selection.ipynb` – Visualizes and compares loss curves for different hyperparameter settings to identify the optimal configurations.
- `accuracy.ipynb` – Computes performance metrics for the best models and outputs tables ready for publication.
- `similarity.ipynb` – Analyzes similarity distributions between incorrect but valid predictions and their target molecules (forward task only).
- `struggle.ipynb` – Investigates failure cases in prediction, including reasons for invalid samples and substructure matching in erroneous predictions.
- `stereoselectivity.ipynb` – Additional analysis of different types of reactions involving stereocenters, including stereoselective reactions.
If you find this work useful, feel free to cite the following publication:
Fabrizio Mastrolorito, Fulvio Ciriaco, Orazio Nicolotti, Francesca Grisoni
Enhancing deep chemical reaction prediction with advanced chirality and fragment representation
Chem. Commun., 2025, The Royal Society of Chemistry.
https://doi.org/10.1039/D5CC02641E
@article{mastrolorito2025fragsmiles4reactions,
author = "Mastrolorito, Fabrizio and Ciriaco, Fulvio and Nicolotti, Orazio and Grisoni, Francesca",
title = "Enhancing deep chemical reaction prediction with advanced chirality and fragment representation",
journal = "Chem. Commun.",
year = "2025",
pages = "-",
publisher = "The Royal Society of Chemistry",
doi = "10.1039/D5CC02641E",
url = "http://dx.doi.org/10.1039/D5CC02641E",
}
This project is licensed under the MIT License. See the LICENSE file for details.