---
title: "Deploying and testing an OSG Hosted CE via SLATE"
overview: Blog
published: true
permalink: blog/deploy-hosted-ce.html
attribution: The SLATE Team
layout: post
type: markdown
tag: draft
---

# Deploying an OSG HostedCE
An OSG Hosted Compute Element is an application that allows a site to contribute idle HPC or HTC compute resources to the [Open Science Grid](https://opensciencegrid.org). The CE is responsible for receiving jobs from the grid and routing them to your local cluster(s). These are preemptible jobs that will run when the cluster resources would have otherwise gone idle.

Using SLATE you can deploy the HostedCE locally at your site and easily scale to as many instances as needed. It is recommended to create a unique HostedCE instance for each individual cluster you intend to support.

## Prerequisites
You must have a functional batch system on the cluster that you would like to contribute to OSG.

The OSG HostedCE uses SSH to submit pilot jobs that will connect back to the
OSG central pool. In order for the HostedCE to work, you'll first need to
create a local service account that OSG will use to submit jobs. This should be
done according to whatever process you normally use to create accounts for users.
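For example, on a typical Linux login node this could be as simple as the following (a minimal sketch; the account name `osguserl` is just the one used throughout this walkthrough, and your site may use LDAP or another account-management system instead):

```
# Hypothetical example; follow your site's normal account-creation process
sudo useradd --create-home --comment "OSG pilot job submission" osguserl
```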

## Generating and storing the key
Once the account has been created, you'll want to create a new SSH key pair.
The private part of the key will be stored within SLATE, and the public part of
the key will be installed into the `authorized_keys` file of the OSG user on your
cluster. To generate the key, you'll need to run the following on some machine
with OpenSSH installed:

```
ssh-keygen -f osg-keypair
```

Note that you will need to make this key passphraseless, as the HostedCE
software will consume this key. Once you've created the key, you'll want to
store the public part of it (`osg-keypair.pub`) in the `authorized_keys` file
of the OSG account on your cluster. For example, if your OSG service account
is called `osg`, you'll want to append the contents of `osg-keypair.pub` to
`/home/osg/.ssh/authorized_keys`.
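For instance, assuming the service account is `osguserl` on `lonepeak1.chpc.utah.edu` (the endpoint used later in this walkthrough), you could generate a passphraseless key and install the public half in one step:

```
# Generate the key pair with an empty passphrase (required by the HostedCE)
ssh-keygen -f osg-keypair -N ""

# Append the public key to the service account's authorized_keys
cat osg-keypair.pub | ssh [email protected] \
  'mkdir -p ~/.ssh && chmod 700 ~/.ssh && cat >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys'
```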

The private part of the keypair will need to be stored on a SLATE cluster for use by the
CE. In this particular example, I'll be operating under the `slate-dev` group and using the `uutah-prod` cluster to host a CE pointed at a SLURM cluster at the University of Utah. To do that:

```
slate secret create utah-lp-hostedce-privkey --from-file=bosco.key=osg-keypair --group slate-dev --cluster uutah-prod
```

Here `utah-lp-hostedce-privkey` will be the name of the secret, and `osg-keypair` is the
path to your private key (assumed to be in the current working directory).
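To double-check that the secret was created on the right cluster, you can list the secrets visible to your group (assuming `slate secret list` accepts these filter flags, as the other SLATE commands in this post do):

```
slate secret list --group slate-dev --cluster uutah-prod
```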

## Configuring the CE from SLATE
You'll want to download the application configuration template:

```
slate app get-conf --dev osg-hosted-ce > hosted-ce.yaml
```

There are several things that you'll need to edit here.

### Site Section
First, for the site section you'll want to put in appropriate values that will
ultimately map onto your OSG Topology entry. Let's take a look at this section
first:

```
Site:
  Resource: SLATE_US_UUTAH_LONEPEAK
  ResourceGroup: CHPC Group
  Sponsor: osg:100
  Contact: Mitchell Steinman
  ContactEmail: [email protected]
  City: Salt Lake City
  Country: United States
  Latitude: 40.7608
  Longitude: -111.891
```

Sponsor is safe to leave at its default if you plan to support OSG
users with your CE. From the [OSG Resource Registration
Documentation](https://opensciencegrid.org/docs/common/registration/), here are
the definitions of Resource and ResourceGroup:

| Level | Definition |
| -------------- | ----------- |
| Resource Group | A logical grouping of resources at a site. Production and testing resources must be placed into separate Resource Groups. |
| Resource | A host belonging to a resource group that provides grid services, e.g. Compute Elements, storage endpoints, or perfSonar hosts. A resource may provide more than one service.|

These will ultimately get mapped into the [OSG
Topology](https://topology.opensciencegrid.org/). Following those, you'll need
to update the Contact and location information as appropriate. The contact
information will be used to reach you in case there are any problems with your
CE or site.

### Cluster Section
Next we'll go through the Cluster section, which defines some of the hardware specifications of the remote cluster.

Memory should be the total per-node memory available on the remote cluster. By default this is interpreted as megabytes, but you can format it as `24G` if you wish. If the remote cluster has different kinds of nodes, you can set the memory to the lowest common denominator. The same principle applies to the CoresPerNode field.

MaxWallTime is the maximum allowed walltime for a job and is expressed in minutes by default.

You can read more in the [OSG Docs](https://opensciencegrid.org/docs/other/configuration-with-osg-configure/#subcluster-resource-entry).

VOs are virtual organizations within the OSG, and they correspond to different research groups. AllowedVOs specifies which groups may run jobs through our CE.

Finally you'll need to refer to the private key we stored in SLATE previously.
I had called mine `utah-lp-hostedce-privkey`, so my configuration ends up
looking like this:

```
Cluster:
  PrivateKeySecret: utah-lp-hostedce-privkey # maps to SLATE secret
  Memory: 24576
  CoresPerNode: 4
  MaxWallTime: 4320
  AllowedVOs: osg, cms, atlas, glow, hcc, fermilab, ligo, virgo, sdcc, sphenix, gluex, icecube, xenon
```



### Storage Section
For the Storage section, you'll want to define the path where the OSG Worker Node Client is installed as well as the location of temp/scratch space for your workers.

The OSG Worker Node Client is expected to be installed in the OSG user's home directory in a subdirectory called `bosco-osg-wn-client`, and you need to provide the fully expanded absolute path to that location. In my case, it lives on a shared file system. The HostedCE SLATE application will install the Worker Node Client into this directory for you.
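If you are unsure what the expanded path should be, one quick check (using the key and service account from earlier, and the same endpoint used later in this walkthrough) is to ask the remote shell directly:

```
ssh -i osg-keypair [email protected] 'echo $HOME/bosco-osg-wn-client'
```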

For WorkerNodeTemp, I have a specific scratch directory I want the OSG jobs to use. `/tmp` is also a typical location for this.

```
Storage:
  GridDir: /uufs/chpc.utah.edu/common/home/osguserl/bosco-osg-wn-client
  WorkerNodeTemp: /scratch/local/.osgscratch
```



### Squid Section
If you have a Squid service running, you can ensure that your workers will
communicate with your local squid here. I will point my cluster to the local
squid I have deployed with SLATE:

```
Squid:
  Location: sl-uu-es1.slateci.io:31726
```



(It's also possible to launch a Squid service through SLATE; see
[here](https://portal.slateci.io/applications/osg-frontier-squid).)
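If you don't already run a Squid and want to deploy one through SLATE, a rough sketch of doing so (the group, cluster, and configuration values here are placeholders; review the generated configuration before installing):

```
slate app get-conf osg-frontier-squid > squid.yaml
# edit squid.yaml as needed, then:
slate app install osg-frontier-squid --group slate-dev --cluster uutah-prod --conf squid.yaml
```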

### Networking Section
The HostedCE application requires both forward and reverse DNS resolution for its publicly routable IP. Most SLATE clusters come pre-configured with a handful of "LoadBalancer" IP addresses that can be allocated automatically to different applications. You must set up the DNS records for this address, so it is a good idea to request a specific address from the pool. This is my network configuration:


```
Networking:
  Hostname: "sl-uu-hce2.slateci.io"
  RequestIP: 155.101.6.238
```

It simply consists of a RequestIP and corresponding hostname.
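Before deploying, it's worth confirming that both DNS directions resolve correctly. A quick check with `dig`, using my hostname and address as the example:

```
dig +short sl-uu-hce2.slateci.io
# should return 155.101.6.238
dig +short -x 155.101.6.238
# should return sl-uu-hce2.slateci.io.
```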



### HTCondorCeConfig Section
The HTCondorCeConfig value contains additional configuration for the CE itself. Most importantly, it contains the required JOB_ROUTER_ENTRIES setting, which is the configuration that allows the CE to route jobs to your remote cluster. It has the following format:

```
JOB_ROUTER_ENTRIES @=jre
[
  GridResource = "batch <YOUR BATCH SYSTEM> <REMOTE USER>@<REMOTE ENDPOINT>";
  Requirements = (Owner == "<REMOTE USER>");
]
@jre
```

My remote user is `osguserl`, jobs will be routed to the `lonepeak1.chpc.utah.edu` endpoint, and the remote cluster runs SLURM, so my HTCondorCeConfig looks like this:

```
JOB_ROUTER_ENTRIES @=jre
[
  GridResource = "batch slurm [email protected]";
  Requirements = (Owner == "osguserl");
]
@jre
```


There is more extensive documentation on the [Job Router](https://opensciencegrid.org/docs/compute-element/job-router-recipes/).

### BoscoOverrides Section
The BoscoOverrides section provides a mechanism to override default configuration placed on the remote cluster for the OSG Worker Node Client. This can include things like the local path to the batch system's executables and additional job submission parameters for the batch system.

This will vary depending on your batch system. I needed to override `slurm_binpath` and add some SBATCH parameters. All the overrides are expected to be placed in a git repository with a subdirectory layout that matches `<RESOURCE NAME>/bosco-override`, in my case `SLATE_US_UUTAH_LONEPEAK/bosco-override`.

It may take some trial and error to get the correct overrides in place. The general process is to deploy the CE, check the logs via the application's HTTP log exporter to see what must be changed, and then redeploy with the updated overrides.

There is a [template bosco override repository](https://github.com/slateci/bosco-override-template) that you can fork and tailor to your needs.

Once you've customized your fork, you can simply provide that repository as the `GitEndpoint` for this section.
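As a rough illustration only (the file names below are hypothetical; consult the template repository for the actual layout the application expects), a customized fork might end up looking something like this:

```
SLATE_US_UUTAH_LONEPEAK/
└── bosco-override/
    └── glite/
        └── etc/
            └── blah.config   # hypothetical: override slurm_binpath, add SBATCH defaults
```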

You can use a private git repo and provide the key to the application as a SLATE secret. My configuration with a public git repo is:

```
BoscoOverrides:
  Enabled: true
  GitEndpoint: "https://github.com/slateci/utah-bosco.git"
  RepoNeedsPrivKey: false
  GitKeySecret: none
```

### HTTPLogger Section
This toggles the HTTP logging sidecar. When it is enabled, you can view the CE logs from your browser.

```
HTTPLogger:
  Enabled: true
```

You can get the endpoint for your logger by running `slate instance info <INSTANCE ID>`. The randomly generated credentials will be written to the sidecar's logs, which you can view with:

`slate instance logs <INSTANCE ID>`

### VomsmapOverride Section
Each VO that is enabled must be mapped to a user on the remote cluster. It is standard to create a user for each VO you intend to support. It is possible to map each VO to the same remote user, which I have done here:

```
VomsmapOverride: |+
  "/osg/Role=NULL/Capability=NULL" osguserl
  "/GLOW/Role=htpc/Capability=NULL" osguserl
  "/hcc/Role=NULL/Capability=NULL" osguserl
  "/cms/*" osguserl
  "/fermilab/*" osguserl
  "/osg/ligo/Role=NULL/Capability=NULL" osguserl
  "/virgo/ligo/Role=NULL/Capability=NULL" osguserl
  "/sdcc/Role=NULL/Capability=NULL" osguserl
  "/sphenix/Role=NULL/Capability=NULL" osguserl
  "/atlas/*" osguserl
  "/Gluex/Role=NULL/Capability=NULL" osguserl
  "/dune/Role=pilot/Capability=NULL" osguserl
  "/icecube/Role=pilot/Capability=NULL" osguserl
  "/xenon.biggrid.nl/Role=NULL/Capability=NULL" osguserl
```


### GridmapOverride Section
The GridmapOverride allows you to add a mapping for your own personal grid certificate to the CE, for the purpose of testing basic job submission.

You can obtain a certificate with your institutional credential at [cilogon.org](https://cilogon.org/).

```
GridmapOverride: |+
  "/DC=org/DC=cilogon/C=US/O=University of Utah/CN=HENRY STEINMAN A14364946" osguserl
```

### Certificate Section
Each time the CE is deployed, it requests a new certificate from Let's Encrypt, which enforces rate limits to prevent DoS attacks. This means that if you are redeploying a CE frequently for troubleshooting purposes, you may hit the rate limit.

It is possible to save the certificate (`hostkey.pem` and `hostcert.pem`) and store it as a SLATE secret for re-use. This circumvents the rate limit.

I will leave this disabled because my configuration is stable.

```
Certificate:
  Secret: null
```
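If you do want to reuse a certificate, a rough sketch of storing it as a SLATE secret (the secret name `hosted-ce-cert` is a placeholder, and the expected key names within the secret should be checked against the application's documentation):

```
slate secret create hosted-ce-cert --from-file=hostcert.pem=hostcert.pem --from-file=hostkey.pem=hostkey.pem --group slate-dev --cluster uutah-prod
```

You would then reference `hosted-ce-cert` as the `Secret` value in this section.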

### Developer Section
Leave this disabled. It exists for OSG Internal Testbed (ITB) hosts and is not intended for use with production CEs.

```
Developer:
  Enabled: false
```

## Finalizing the configuration

Now that we've gone through the configuration section by section, let's look at the completed file:

```
Instance: "lonepeak"

Site:
  Resource: SLATE_US_UUTAH_LONEPEAK
  ResourceGroup: CHPC Group
  Sponsor: osg:100
  Contact: Mitchell Steinman
  ContactEmail: [email protected]
  City: Salt Lake City
  Country: United States
  Latitude: 40.7608
  Longitude: -111.891

Cluster:
  PrivateKeySecret: utah-lp-hostedce-privkey # maps to SLATE secret
  Memory: 24000
  CoresPerNode: 4
  MaxWallTime: 4320
  AllowedVOs: osg, cms, atlas, glow, hcc, fermilab, ligo, virgo, sdcc, sphenix, gluex, icecube, xenon

Storage:
  GridDir: /uufs/chpc.utah.edu/common/home/osguserl/bosco-osg-wn-client
  WorkerNodeTemp: /scratch/local/.osgscratch

Squid:
  Location: sl-uu-es1.slateci.io:31726

Networking:
  Hostname: "sl-uu-hce2.slateci.io"
  RequestIP: 155.101.6.238

HTCondorCeConfig: |+
  JOB_ROUTER_ENTRIES @=jre
  [
    GridResource = "batch slurm [email protected]";
    Requirements = (Owner == "osguserl");
  ]
  @jre

BoscoOverrides:
  Enabled: true
  GitEndpoint: "https://github.com/slateci/utah-bosco.git"
  RepoNeedsPrivKey: false
  GitKeySecret: none

HTTPLogger:
  Enabled: true

VomsmapOverride: |+
  "/osg/Role=NULL/Capability=NULL" osguserl
  "/GLOW/Role=htpc/Capability=NULL" osguserl
  "/hcc/Role=NULL/Capability=NULL" osguserl
  "/cms/*" osguserl
  "/fermilab/*" osguserl
  "/osg/ligo/Role=NULL/Capability=NULL" osguserl
  "/virgo/ligo/Role=NULL/Capability=NULL" osguserl
  "/sdcc/Role=NULL/Capability=NULL" osguserl
  "/sphenix/Role=NULL/Capability=NULL" osguserl
  "/atlas/*" osguserl
  "/Gluex/Role=NULL/Capability=NULL" osguserl
  "/dune/Role=pilot/Capability=NULL" osguserl
  "/icecube/Role=pilot/Capability=NULL" osguserl
  "/xenon.biggrid.nl/Role=NULL/Capability=NULL" osguserl

GridmapOverride: |+
  "/DC=org/DC=cilogon/C=US/O=University of Utah/CN=HENRY STEINMAN A14364946" osguserl

Certificate:
  Secret: null

Developer:
  Enabled: false
```

At this point, before deploying the application to SLATE, you may want to test your private key and SSH endpoint and check that job submission is working as expected. For example, here I log in with the key (notice that it *does not* prompt for a passphrase; this is a HostedCE requirement!) and successfully run a job through my local batch system:

```
$ ssh -i osg-keypair [email protected]
Last login: Mon Oct 14 12:51:23 2019 from cdf38.uchicago.edu
[osguser@condor ~]$ condor_run /bin/hostname
htcondor-river-v2-586998c97f-m4vkp
```

This is often the first place to start looking when your HostedCE isn't working, so I would encourage you to test this appropriately at your site.

Once we're happy with the configuration and we have tested basic access, we can ask SLATE to install it:

```
slate app install --cluster uutah-prod --group slate-dev osg-hosted-ce --dev --conf hosted-ce.yaml
```
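Once the install command returns, you can confirm the instance came up and look up its ID for the `slate instance info` and `slate instance logs` commands mentioned earlier:

```
slate instance list
slate instance info <INSTANCE ID>
```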


## Testing the HostedCE

In order to test end-to-end job submission through the CE, you will need a valid grid proxy and the HTCondor-CE client tools. First, you must obtain a certificate.

An easy way to do that is through [CILogon](https://cilogon.org/).

Create a cert and download it. You'll need to remember the password you set.

Next you will need to convert it from PKCS12 format into PEM files for voms. These commands will prompt for the password you set.

`openssl pkcs12 -in usercred.p12 -nocerts -nodes -out hostkey.pem`

`openssl pkcs12 -in usercred.p12 -clcerts -nokeys -out hostcert.pem`

Be sure that both files have the correct file permissions:

`chmod 600 hostkey.pem && chmod 600 hostcert.pem`

You will need to install `htcondor-ce-client` on the machine you would like to submit from.

On EL7, enable the OSG yum repos:

`yum install https://repo.opensciencegrid.org/osg/3.5/osg-3.5-el7-release-latest.rpm && yum update`

Then install the tools:

`yum clean all; yum install htcondor-ce-client`

You should be able to use your cert to initialize your grid proxy:

`voms-proxy-init -cert hostcert.pem -key hostkey.pem --debug`

Here I use the `--debug` flag because `voms-proxy-init` doesn't give very helpful output if it fails.

If that was successful, you should be able to run a job trace against your CE, which will trace the end-to-end submission of a small test job.

`condor_ce_trace --debug sl-uu-hce2.slateci.io`

This command will output a great deal of helpful information about the job submission, its status, and the eventual result. If the job sits idle on the remote cluster for too long, the command may time out.