Releases: aws/aws-parallelcluster-cookbook
AWS ParallelCluster v2.4.1
We're excited to announce the release of AWS ParallelCluster Cookbook 2.4.1.
This is associated with AWS ParallelCluster v2.4.1.
Enhancements
- Install IntelMPI on Alinux, Centos 7 and Ubuntu 1604
- Upgrade EFA to version 1.4.1
- Run all node daemons and cookbook recipes in isolated Python virtualenvs. This allows our code to always
run with the required Python dependencies and solves all conflicts and runtime failures that were being
caused by user packages installed in the system Python
Changes
- Torque: upgrade to version 6.1.2
- Run all node daemons with Python 3.6
- Torque: changed the following parameters in global configuration:
  - server node_check_rate = 120: specifies the minimum duration (in seconds) that a node can fail to send a status update before being marked down by the pbs_server daemon. Previously was 600. This reduces scaling reaction times in case of instance failure or unexpected termination (especially with spot).
  - server node_ping_rate = 60: specifies the maximum interval (in seconds) between successive "pings" sent from the pbs_server daemon to the pbs_mom daemon to determine node/daemon health. Previously was 300. It is now set to half the node_check_rate.
  - server timeout_for_job_delete = 30: the specific timeout used when deleting jobs because the node they are executing on is being deleted. Previously was 120. This prevents job deletion from hanging for more than 30 seconds when the node the jobs are running on is being deleted.
  - server timeout_for_job_requeue = 30: the specific timeout used when requeuing jobs because the node they are executing on is being deleted. Previously was 120. This prevents node deletion from hanging for more than 30 seconds when a job cannot be rescheduled.
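Server attributes like the Torque ones listed above are normally set through Torque's qmgr utility. The following is a minimal sketch, not the cookbook's actual recipe: the helper names torque_server_settings and qmgr_commands are hypothetical, and the code only builds command strings without running anything. It derives node_ping_rate as half of node_check_rate, as the notes describe.

```python
from typing import Dict, List


def torque_server_settings(node_check_rate: int = 120, job_timeout: int = 30) -> Dict[str, int]:
    """Return the four server parameters named in the release notes (hypothetical helper)."""
    return {
        "node_check_rate": node_check_rate,
        # The notes set the ping rate to half the check rate.
        "node_ping_rate": node_check_rate // 2,
        "timeout_for_job_delete": job_timeout,
        "timeout_for_job_requeue": job_timeout,
    }


def qmgr_commands(settings: Dict[str, int]) -> List[str]:
    """Render each setting as a qmgr command string; nothing is executed."""
    return [f'qmgr -c "set server {name} = {value}"' for name, value in settings.items()]


if __name__ == "__main__":
    for command in qmgr_commands(torque_server_settings()):
        print(command)
```

Keeping the ping rate derived from the check rate means the two values cannot drift apart if the check rate is tuned later.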
Bug Fixes
- Restore correct value for filehandle_limit that was getting reset when setting memory_limit for EFA
- Torque: fix configuration of server operators that was preventing compute nodes from disabling themselves before termination
Support
Need help / have a feature request?
AWS Support: https://console.aws.amazon.com/support/home
ParallelCluster Issues tracker on GitHub: https://github.com/aws/aws-parallelcluster
The HPC Forum on the AWS Forums page: https://forums.aws.amazon.com/forum.jspa?forumID=192
AWS ParallelCluster v2.4.0
We're excited to announce the release of AWS ParallelCluster Cookbook 2.4.0.
This is associated with AWS ParallelCluster v2.4.0.
Enhancements
- Add support for EFA on Centos 7, Amazon Linux and Ubuntu 1604
- Add support for Ubuntu in China region cn-northwest-1
Changes
- SGE: changed following parameters in global configuration
  - max_unheard 00:03:00: allows a faster reaction in case of faulty nodes
  - reschedule_unknown 00:00:30: enables rescheduling of jobs running on failing nodes
  - qmaster_params ENABLE_FORCED_QDEL_IF_UNKNOWN: forces job deletion on unresponsive nodes
  - qmaster_params ENABLE_RESCHEDULE_KILL: forces rescheduling or killing of jobs running on failing nodes
- Slurm: decrease SlurmdTimeout to 120 seconds to speed up replacement of faulty nodes
- Always use full master FQDN when mounting NFS on compute nodes. This solves some issues occurring with some networking setups and custom DNS configurations
- Set soft and hard ulimit on open files to 10000 for all supported OSs
- Pin python supervisor version to 3.4.0
- Remove unused compute_instance_type from jobwatcher.cfg
- Removed unused max_queue_size from sqswatcher.cfg
- Remove double quoting of the post_install args
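To confirm on a node that the open-files limit change listed above took effect, the limits of the current process can be read with Python's standard resource module. This is a small sketch for checking, not part of the cookbook; the helper names and the EXPECTED_NOFILE constant are introduced here for illustration, and the reported values depend on how the process was started.

```python
import resource
from typing import Tuple

EXPECTED_NOFILE = 10000  # value stated in the release notes


def open_file_limits() -> Tuple[int, int]:
    """Return the (soft, hard) limit on open files for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def limits_match(limits: Tuple[int, int], expected: int = EXPECTED_NOFILE) -> bool:
    """True when both the soft and the hard limit equal the expected value."""
    soft, hard = limits
    return soft == expected and hard == expected


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print(f"soft={soft} hard={hard} matches={limits_match((soft, hard))}")
```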
Bug Fixes
- Fix issue that was preventing Torque from being used on Centos 7
- Start node daemons at the end of instance initialization. The time spent for post-install script and node initialization is not counted as part of node idletime anymore.
- Fix issue which was causing an additional and invalid EBS mount point to be added in case of multiple EBS
- Install Slurm libpmpi/libpmpi2 that is distributed in a separate package since Slurm 17
AWS ParallelCluster 2.3.1
We're excited to announce the release of AWS ParallelCluster Cookbook 2.3.1.
This is associated with AWS ParallelCluster v2.3.1.
Enhancements
- FSx Lustre - add support in Amazon Linux
Changes
- Slurm - upgrade to version 18.08.6.2
- Slurm - declare nodes in separate config file and use FUTURE for dummy nodes
- Slurm - set ReturnToService=1 in scheduler config in order to recover instances that were initially marked as down due to a transient issue.
- NVIDIA - update drivers to version 418.56
- CUDA - update toolkit to version 10.0
- Increase default EBS volume size from 15GB to 17GB
- Add LocalHostname to COMPUTE_READY events
- Pin future, retrying and six packages in Ubuntu 14.04
- Add stackname and max_queue_size to sqswatcher configuration
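Slurm settings such as ReturnToService are plain Key=Value lines in the scheduler configuration. As a rough sketch for verifying such a setting, the code below parses a configuration fragment; parse_slurm_conf and SAMPLE_CONF are hypothetical names, and the sample text is illustrative only, not the configuration shipped by the cookbook.

```python
from typing import Dict


def parse_slurm_conf(text: str) -> Dict[str, str]:
    """Parse simple single-key 'Key=Value' lines, ignoring comments and blanks.

    Lines carrying several pairs (for example node definitions) are
    deliberately skipped by this sketch.
    """
    settings: Dict[str, str] = {}
    for raw_line in text.splitlines():
        # Drop anything after a '#' and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line or " " in line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


# Illustrative fragment only (assumption, not the real generated file).
SAMPLE_CONF = """
# scheduler configuration excerpt
ReturnToService=1  # recover nodes marked down by a transient issue
"""

if __name__ == "__main__":
    print(parse_slurm_conf(SAMPLE_CONF).get("ReturnToService"))
```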
AWS ParallelCluster 2.2.1
We're excited to announce the release of AWS ParallelCluster Cookbook 2.2.1.
This is associated with AWS ParallelCluster v2.2.1.
Features
- Support for FSx Lustre with Centos 7
- Check AWS EC2 account limits before starting cluster creation
- Allow users to force job deletion with SGE scheduler
Changes
- Set default value to compute for placement_group option
- pcluster ssh: use private IP when the public one is not available
- pcluster ssh: now works also when stack is not completed as long as the master IP is available
Bugfixes
- awsbsub: fix file upload with absolute path
- pcluster ssh: fix issue that was preventing the command from working correctly when stack status is UPDATE_ROLLBACK_COMPLETE
- Fix block device conversion to correctly attach EBS nvme volumes
- Wait for Torque scheduler initialization before completing master node setup
- pcluster version: now works also when no ParallelCluster config is present
- Improve nodewatcher daemon logic to detect if a SGE compute node has running jobs
AWS ParallelCluster 2.1.1
We're excited to announce the release of AWS ParallelCluster Cookbook 2.1.1.
This is associated with AWS ParallelCluster v2.1.1.
Features
- Support for AWS Beijing Region (cn-north-1) and Ningxia Region (cn-northwest-1)
Bugfixes
- No longer schedule jobs on compute nodes that are terminating
AWS ParallelCluster v2.1.0
We're excited to announce the release of AWS ParallelCluster Cookbook 2.1.0.
This is associated with AWS ParallelCluster v2.1.0.
Features
- Support for Elastic File System (EFS)
- AWS Batch Multinode Parallel support
- Support for RAID 0 and 1 EBS Volumes
- Support for AWS Stockholm Region (eu-north-1)
Bugfixes
- No longer schedule jobs on compute nodes that are terminating
AWS ParallelCluster v2.0.2
We're excited to announce the release of AWS ParallelCluster Cookbook 2.0.2.
This is associated with AWS ParallelCluster v2.0.2.
Features
- Support for new GovCloud region us-gov-east-1
Bugfixes
- Fix regression with shared_dir parameter in the cluster configuration section.
- Fixed issue with jq that prevented customers from using extra_json
- Fixed issue with awscli version on ubuntu1404
AWS ParallelCluster v2.0.0
We're excited to announce the release of AWS ParallelCluster Cookbook 2.0.0!
This is associated with AWS ParallelCluster v2.0.0.
Features
- AWS Batch integration
- Multiple EBS Volumes
- Support for custom AMIs
Support
Need help / have a feature request?
AWS Support: https://console.aws.amazon.com/support/home
ParallelCluster Issues tracker on GitHub: https://github.com/aws/aws-parallelcluster (note: we've moved cookbook issues to the main package, please create new issues there)
The HPC Forum on the AWS Forums page: https://forums.aws.amazon.com/forum.jspa?forumID=192
CfnCluster v1.6.0
This is a release of the cfncluster-cookbook v1.6.0, associated with CfnCluster v1.6.0.
Features:
- Refactor scaling up to take into account the number of pending/requested jobs/slots and instance slots.
- Refactor scaling down to scale down faster and take advantage of per-second billing.
- Add scaledown_idletime parameter as part of scale-down refactoring
- Lock hosts before termination to ensure removal of dead compute nodes from host list
- Fix HTTP proxy support
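The notes above do not spell out the scale-up calculation. Purely as an illustration of what taking pending slots and instance slots into account could look like, the sketch below uses ceiling division with a cap. The function name, its parameters and the cap are all assumptions made for this example, not the actual CfnCluster logic.

```python
import math


def instances_to_request(pending_slots: int, slots_per_instance: int, max_new_instances: int) -> int:
    """Estimate how many instances cover the pending slots (illustrative only)."""
    if slots_per_instance <= 0:
        raise ValueError("slots_per_instance must be positive")
    # Round up so a partially filled instance is still requested.
    needed = math.ceil(pending_slots / slots_per_instance)
    # Never go below zero or above the assumed cap.
    return max(0, min(needed, max_new_instances))


if __name__ == "__main__":
    print(instances_to_request(pending_slots=10, slots_per_instance=4, max_new_instances=8))
```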
CfnCluster v1.5.4
This is a release of the cfncluster-cookbook v1.5.4, associated with CfnCluster v1.5.4.
Features:
- Set SGE Accounting summary to be true, this reports a single accounting record for a mpi job
- Add option to disable ganglia
  extra_json = { "cfncluster" : { "ganglia_enabled" : "no" } }
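Because the extra_json value shown above must be valid JSON, generating it avoids quoting mistakes. A minimal sketch follows, assuming a hypothetical helper named ganglia_extra_json. Only the "no" value appears in these notes; the "yes" counterpart used here is an assumption.

```python
import json


def ganglia_extra_json(enabled: bool) -> str:
    """Build an extra_json config line toggling ganglia_enabled (hypothetical helper)."""
    # "no" comes from the release notes; "yes" is assumed as its opposite.
    payload = {"cfncluster": {"ganglia_enabled": "yes" if enabled else "no"}}
    return "extra_json = " + json.dumps(payload)


if __name__ == "__main__":
    print(ganglia_extra_json(enabled=False))
```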