
A curated collection of ready-to-use training recipes for machine learning on Baseten. Whether you’re starting from scratch or fine-tuning an existing model, these recipes provide practical, copy-paste solutions for every stage of your ML pipeline.
- Training recipes - End-to-end examples for training models from scratch
- Fine-tuning workflows - Adapt pre-trained models to your specific use case
- Best practices - Optimized configurations and common patterns
From data preprocessing to trained and checkpointed models, these recipes cover the complete ML lifecycle on Baseten's platform.
Before getting started, ensure you have the following:
- A Baseten account. Sign up here if you don't have one.
- Add any access tokens, API keys (e.g., a Hugging Face access token, a Weights & Biases access token), and passwords to your Baseten secrets so your models can securely access these credentials.
- This is required to access gated models on Hugging Face. More information on setting up Hugging Face access tokens can be found here.
- Python 3.8 to 3.11 installed. A Conda environment is recommended.
- Install Truss, Baseten's open-source model packaging tool, to configure and containerize model code.
pip install --upgrade truss
git clone https://github.com/basetenlabs/ml-cookbook.git
Fine-tune GPT OSS 20B with LoRA and TRL
If using a model with gated access, make sure you have access to the model on Hugging Face and your API token uploaded to your secrets. This example requires a Hugging Face access token and an optional Weights & Biases access token. To disable W&B, comment out any lines containing wandb in examples/oss-gpt-20b-lora/training/config.py and examples/oss-gpt-20b-lora/training/train.py.
examples/oss-gpt-20b-lora/training/train.py contains all of the training code.
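For orientation, here is a minimal sketch of the LoRA-plus-TRL pattern a train.py like this one follows. The dataset, Hub repo, and hyperparameters below are illustrative placeholders, not the recipe's actual values; see train.py and run.sh for those.

```python
# Sketch of LoRA fine-tuning with TRL's SFTTrainer (illustrative values).
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")  # placeholder dataset

peft_config = LoraConfig(
    r=16,                          # LoRA rank
    lora_alpha=32,                 # LoRA scaling factor
    target_modules="all-linear",   # apply adapters to all linear layers
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model="openai/gpt-oss-20b",
    train_dataset=dataset,
    peft_config=peft_config,
    args=SFTConfig(
        output_dir="checkpoints",
        per_device_train_batch_size=1,
        num_train_epochs=1,
        push_to_hub=True,                   # write checkpoints to the Hub,
        hub_model_id="your-org/your-repo",  # as this recipe does via run.sh
    ),
)
trainer.train()
```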
examples/oss-gpt-20b-lora/training/config.py is the entry point for starting training, where you define your training configuration. It also includes the start commands that launch your training job. Make sure these commands include any file permission changes needed to make shell scripts executable; we do not change any file system permissions.
Make sure to update hf_access_token in config.py so it matches the name under which this access token is saved in your secrets. In this example, trained checkpoints are written directly to Hugging Face; the Hub IDs for models and datasets are configured in examples/oss-gpt-20b-lora/training/run.sh. Update run.sh with a repo you have write access to.
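As a rough guide, a config.py for this kind of job has the following shape. This is a sketch assuming the truss_train SDK; the base image, accelerator, and project name are illustrative, and the real values live in the recipe's config.py.

```python
# Sketch of a truss train config.py (illustrative values).
from truss.base import truss_config
from truss_train import definitions

runtime = definitions.Runtime(
    # Start commands must themselves make shell scripts executable;
    # the platform does not change file system permissions.
    start_commands=["/bin/sh -c 'chmod +x ./run.sh && ./run.sh'"],
    environment_variables={
        # The name must match the secret saved in your Baseten workspace.
        "HF_TOKEN": definitions.SecretReference(name="hf_access_token"),
    },
)

job = definitions.TrainingJob(
    image=definitions.Image(base_image="pytorch/pytorch:2.4.0-cuda12.1-cudnn9-runtime"),
    compute=definitions.Compute(
        accelerator=truss_config.AcceleratorSpec(
            accelerator=truss_config.Accelerator.H100,
            count=1,
        ),
    ),
    runtime=runtime,
)

project = definitions.TrainingProject(name="gpt-oss-20b-lora", job=job)
```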
cd examples/oss-gpt-20b-lora/training
truss train push config.py
Upon successful submission, the CLI will output helpful information about your job:
✨ Training job successfully created!
🪵 View logs for your job via `truss train logs --job-id e3m512w [--tail]`
🔍 View metrics for your job via `truss train metrics --job-id e3m512w`
Keep the Job ID handy, as you’ll use it for managing and monitoring your job.
Alternatively, you can view all your training jobs at [https://app.baseten.co/training/](https://app.baseten.co/training/).
- As checkpoints are generated, you can access them on Hugging Face at the location defined in run.sh.
Fine-tune Llama 3.1 8B Instruct with LoRA and Unsloth
If using a model with gated access, make sure you have access to the model on Hugging Face and your API token uploaded to your secrets.
examples/llama-finetune-8b-lora/training/train.py contains the training code.
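For a feel of the Unsloth side, here is a minimal sketch of the pattern train.py follows; the hyperparameters are illustrative, and loading the gated base model requires your Hugging Face token.

```python
# Sketch of LoRA setup with Unsloth (illustrative hyperparameters).
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Meta-Llama-3.1-8B-Instruct",  # gated model
    max_seq_length=2048,
    load_in_4bit=True,  # 4-bit base weights (QLoRA-style)
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,            # LoRA rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
# The adapter-wrapped model is then trained with a standard
# Hugging Face / TRL training loop, as in train.py.
```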
examples/llama-finetune-8b-lora/training/config.py is the entry point for starting training, where you define your training configuration. It also includes the start commands that launch your training job. Make sure these commands include any file permission changes needed to make shell scripts executable; we do not change any file system permissions.
cd examples/llama-finetune-8b-lora/training
truss train push config.py
Upon successful submission, the CLI will output helpful information about your job:
✨ Training job successfully created!
🪵 View logs for your job via `truss train logs --job-id e3m512w [--tail]`
🔍 View metrics for your job via `truss train metrics --job-id e3m512w`
Alternatively, you can view all your training jobs at [https://app.baseten.co/training/](https://app.baseten.co/training/).
In this example, since checkpointing is enabled in config.py, checkpoints are stored in cloud storage and can be accessed with:
truss train get_checkpoint_urls --job-id $JOB_ID
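Assuming the command prints downloadable (e.g., presigned) URLs, you can fetch a checkpoint with any HTTP client; a minimal Python sketch, where the URL and output filename are placeholders:

```python
# Sketch: download one checkpoint URL printed by get_checkpoint_urls.
import requests

checkpoint_url = "https://<presigned-checkpoint-url>"  # paste a URL from the command above
with requests.get(checkpoint_url, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    with open("checkpoint.tar", "wb") as f:  # placeholder filename
        for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
            f.write(chunk)
```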
Train an MNIST classifier on a single GPU
examples/mnist-single-gpu/training/train_mnist.py contains a PyTorch example of a CNN-based MNIST classifier.
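As a point of reference, a compact MNIST CNN in PyTorch looks like the following. This is a sketch for orientation; train_mnist.py defines the recipe's actual architecture and training loop.

```python
# Minimal MNIST CNN sketch in PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MnistCnn(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3)   # 28x28 -> 26x26
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3)  # 26x26 -> 24x24
        self.fc1 = nn.Linear(64 * 12 * 12, 128)        # after 2x2 max-pool
        self.fc2 = nn.Linear(128, 10)                  # 10 digit classes

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2)
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)  # logits; pair with nn.CrossEntropyLoss
```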
examples/mnist-single-gpu/training/config.py is the entry point for starting training, where you define your training configuration. It also includes the start commands that launch your training job. Make sure these commands include any file permission changes needed to make shell scripts executable; we do not change any file system permissions.
cd examples/mnist-single-gpu/training
truss train push config.py
Upon successful submission, the CLI will output helpful information about your job:
✨ Training job successfully created!
🪵 View logs for your job via `truss train logs --job-id e3m512w [--tail]`
🔍 View metrics for your job via `truss train metrics --job-id e3m512w`
Keep the Job ID handy, as you’ll use it for managing and monitoring your job.
In this example, since checkpointing is enabled in config.py, checkpoints are stored in cloud storage and can be accessed with:
truss train get_checkpoint_urls --job-id $JOB_ID
Contributions are welcome! Please open issues or submit pull requests.