Commit 2be07a3

Fix CPU efficiency calculation (TotalCPU is now empty for job steps)
- In our Slurm (25.05), TotalCPU now seems to be empty whenever `srun` is
  used. This gave a lot of misleading data, but the information needed to
  compute it is in TRESUsageInTot[cpu]. So now we calculate:
  - TotalCPU: from TRESUsageInTot[cpu]
  - CPUEff: TRESUsageInTot[cpu] / (AllocTRES[cpu] * Elapsed)
- Adjust the `eff` view to use this.
- There are some disadvantages: for example, we can't get CPU efficiency
  from the allocation steps anymore. But this seems to be how it works now
  (much like how memory allocation is reported now).
1 parent b958218 commit 2be07a3
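
The calculation described in the commit message can be sketched in a few lines of standalone Python. This is only an illustration, not the code from the commit: `tres_value`, `to_seconds`, and `cpu_eff` are hypothetical helper names, the sample strings are made up, and the time parser assumes values of the form `[DD-]HH:MM:SS` or `MM:SS`.

```python
import re

def tres_value(tres, name):
    """Return the `name=` value from a comma-separated TRES string, or None."""
    m = re.search(r'(?:^|,)' + re.escape(name) + r'=([^,]*)', tres)
    return m.group(1) if m else None

def to_seconds(text):
    """Convert '[DD-]HH:MM:SS' or 'MM:SS' to seconds (assumed formats)."""
    days, _, clock = text.rpartition('-')
    seconds = 0.0
    for part in clock.split(':'):
        seconds = seconds * 60 + float(part)
    return seconds + 86400 * int(days or 0)

def cpu_eff(alloc_tres, tres_usage_in_tot, elapsed):
    """CPUEff = TRESUsageInTot[cpu] / (AllocTRES[cpu] * Elapsed)."""
    ncpus = tres_value(alloc_tres, 'cpu')
    used = tres_value(tres_usage_in_tot, 'cpu')
    if ncpus is None or used is None:
        return None  # e.g. an allocation row without usage data
    reserved = int(ncpus) * to_seconds(elapsed)
    return to_seconds(used) / reserved if reserved else float('nan')

# Two CPUs reserved for 25 minutes, 25 CPU-minutes used.
print(cpu_eff('billing=2,cpu=2,mem=4G,node=1',
              'cpu=00:25:00,mem=100M',
              '00:25:00'))  # prints 0.5
```

Returning `None` when either `cpu=` entry is missing mirrors the disadvantage noted in the message: rows without `TRESUsageInTot[cpu]` simply have no efficiency value.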

3 files changed: 63 additions and 31 deletions

README.rst

Lines changed: 26 additions & 17 deletions
@@ -201,6 +201,9 @@ percentages, and *unixtime*.
 Columns which are the same in raw ``sacct`` output aren't documented
 specifically here (but note the default units above).

+The syntax ``ColumnName[name]`` means the ``name=`` value from the
+given column name.
+
 Below are some notable columns which do not exist in sacct (for the
 rest, check out the `sacct manual page <https://slurm.schedmd.com/sacct.html#lbAF>`_). It's good
 to verify that any of our custom columns make sense before trusting
@@ -252,13 +255,23 @@ them. For other columns, check ``man sacct``.
     stripped out and give invalid data. File an issue and this will
     be added.

+* **CPU related**
+
+  * ``TotalCPU`` is the total CPU seconds used (time × number of
+    CPUs). In our latest slurm, the raw column is zero for job steps,
+    so this is now extracted from ``TRESUsageInTot[cpu]``.
+
+  * ``CPUEff``: CPU efficiency (0.0-1.0). All the same caveats as above
+    apply: test before trusting. This is calculated as
+    ``TRESUsageInTot[cpu]/(AllocTRES[cpu]*Elapsed)``.
+
 * **Memory related**

-  * ``AllocMem``: The ``mem=`` value from ``AllocTRES`` field. You
-    probably want to use this.
+  * ``AllocMem``: The ``AllocTRES[mem]`` field. You probably want to
+    use this.

-  * ``TotalMem``: The ``mem=`` value from ``TRESUsageInTot`` field.
-    You probably want to use this.
+  * ``TotalMem``: The ``TRESUsageInTot[mem]`` field. You probably
+    want to use this.

   * ``ReqMem``: The raw slurm value from the ReqMem column.

@@ -268,25 +281,23 @@ them. For other columns, check ``man sacct``.
   * ``MemEff``: Computed ``TotalMem / AllocMem``.

 * **GPU information.** These use values from the ``TRESUsageInAve``
-  fields in modern Slurm
+  fields in modern Slurm:

-  * ``ReqGPU``: Number of GPUs requested. Extracted from ``ReqTRES``.
+  * ``ReqGPU``: Number of GPUs requested, from ``ReqTRES[gres/gpu]``.

-  * ``GpuMem``: ``gres/gpumem`` from ``TRESUsageInAve``
+  * ``GpuMem``: From ``TRESUsageInAve[gres/gpumem]``.

-  * ``GpuUtil``: ``gres/gpuutil`` (fraction 0.0-1.0).
+  * ``GpuUtil``: From ``TRESUsageInAve[gres/gpuutil]`` (normalized to
+    fraction 0.0-ngpus).

-  * ``NGpus``: Number of GPUs from ``gres/gpu`` in ``AllocTRES``.
+  * ``NGpus``: Number of GPUs from ``AllocTRES[gres/gpu]``.
     Should be the same as ``ReqGPU``, but who knows.

   * ``GpuUtilTot``, ``GpuMemTot``: like above but using the
     ``TRESUsageInTot`` sacct field.

-  * ``GpuEff``: ``gres/gpuutil`` (from ``TRESUsageInTot``) / (100 *
-    ``gres/gpu`` (from ``AllocTRES``).
-
-  * ``CPUEff``: CPU efficiency (0.0-1.0). All the same caveats as above
-    apply: test before trusting.
+  * ``GpuEff``: ``TRESUsageInTot[gres/gpuutil]`` / (100 *
+    ``AllocTRES[gres/gpu]``).

   * And more, see the code for now.

@@ -298,11 +309,9 @@ accounting database that are hardest to remember:
 * ``CPUTime``: Reserved CPU time (Elapsed * number of CPUs). CPUEff ≈
   TotalCPU/CPUTime = TotalCPU/(NCPUs x Elapsed)

-* ``TotalCPU``: SystemCPU + TotalCPU, seconds of productive work.
-
 The ``eff`` table adds the following:

-* ``CPUEff``: Highest CPUEff for any job step
+* ``CPUEff``: sum(CPU time used over all steps) / reserved CPU time of the allocation

 * ``MemEff``: Highest MemEff for any job step

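The ``ColumnName[name]`` notation documented in this README diff can be read as a two-step lookup: pick the column, then pick the ``name=`` entry inside it. The sketch below is illustrative only and is not how slurm2sql implements it: `parse_tres` and `lookup` are hypothetical helpers, the row values are invented, and `gres/gpuutil` is assumed to be a percentage (which is what the division by 100 in the ``GpuEff`` formula suggests).

```python
import re

# Matches references such as 'AllocTRES[gres/gpu]'.
SPEC = re.compile(r'^(?P<column>\w+)\[(?P<name>[^\]]+)\]$')

def parse_tres(text):
    """Split 'a=1,b=2' into {'a': '1', 'b': '2'}; items without '=' are skipped."""
    pairs = (item.split('=', 1) for item in text.split(',') if '=' in item)
    return {key.strip(): value.strip() for key, value in pairs}

def lookup(row, spec):
    """Resolve a 'ColumnName[name]' reference against a row of raw sacct strings."""
    m = SPEC.match(spec)
    if m is None:
        raise ValueError(f'not a ColumnName[name] reference: {spec!r}')
    return parse_tres(row.get(m['column'], '')).get(m['name'])

# Invented sample row; real sacct output will differ.
row = {
    'AllocTRES': 'billing=8,cpu=8,gres/gpu=2,mem=32G,node=1',
    'TRESUsageInTot': 'cpu=01:00:00,gres/gpuutil=150,mem=10G',
}
ngpus = int(lookup(row, 'AllocTRES[gres/gpu]'))
gpuutil = float(lookup(row, 'TRESUsageInTot[gres/gpuutil]'))

print(lookup(row, 'AllocTRES[mem]'))   # raw source of AllocMem: 32G
print(gpuutil / (100 * ngpus))         # GpuEff per the formula above: 0.75
```

A missing column or a missing entry both resolve to `None`, so callers can tell "not reported" apart from a zero value.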
slurm2sql.py

Lines changed: 13 additions & 6 deletions
@@ -507,19 +507,26 @@ def calc(row):
             return float_bytes(m_used.group(1)) / alloc
         return None

-
+RE_TRES_CPU = re.compile(r'\bcpu=([^,]*)\b')
 class slurmCPUEff(linefunc):
     # This matches the seff tool currently:
     # https://github.com/SchedMD/slurm/blob/master/contribs/seff/seff
+    # Update 2025-10-20: on our slurm TotalCPU is now empty, so we get the CPU value from TRESUsageInTot
     type = 'real'
     @staticmethod
     def calc(row):
-        if not ('Elapsed' in row and 'TotalCPU' in row and 'NCPUS' in row):
+        if not ('Elapsed' in row and 'AllocTRES' in row and 'TRESUsageInTot' in row):
             return
         walltime = slurmtime(row['Elapsed'])
         if not walltime: return None
+        m_cpu_alloc = RE_TRES_CPU.search(row['AllocTRES'])
+        if not m_cpu_alloc: return None
+        m_cpu_used = RE_TRES_CPU.search(row['TRESUsageInTot'])
+        if not m_cpu_used: return None
+        cpu_alloc = int_metric(m_cpu_alloc.group(1))
+        cpu_used = slurmtime(m_cpu_used.group(1))
         try:
-            cpueff = slurmtime(row['TotalCPU']) / (walltime * int(row['NCPUS']))
+            cpueff = cpu_used / (walltime * cpu_alloc)
         except ZeroDivisionError:
             return float('nan')
         return cpueff
@@ -621,7 +628,7 @@ def calc(row):
     'ReqCPUS': nullint, # Requested CPUs
     'AllocCPUS': nullint, # === NCPUS
     'CPUTime': slurmtime, # = Elapsed * NCPUS (= CPUTimeRaw) (not how much used)
-    'TotalCPU': slurmtime, # = Elapsed * NCPUS * efficiency
+    'TotalCPU': ExtractField("TotalCPU", "TRESUsageInTot", "cpu", slurmtime), # = Elapsed * NCPUS * efficiency
     'UserCPU': slurmtime, #
     'SystemCPU': slurmtime, #
     '_CPUEff': slurmCPUEff, # CPU efficiency, should be same as seff
@@ -933,9 +940,9 @@ def infer_type(cd):
     'ReqTRES, '
     'max(Elapsed) AS Elapsed, '
     'max(NCPUS) AS NCPUS, '
-    'max(totalcpu)/max(cputime) AS CPUeff, ' # highest TotalCPU is for the whole allocation
+    'sum(totalcpu)/max(cputime) AS CPUeff, ' # TotalCPU is now reported per step, so sum over steps
     'max(cputime) AS cpu_s_reserved, '
-    'max(totalcpu) AS cpu_s_used, '
+    'sum(totalcpu) AS cpu_s_used, '
     'max(ReqMemNode) AS MemReq, '
     'max(AllocMem) AS AllocMem, '
     'max(TotalMem) AS TotalMem, '
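
To see why the view switches from `max(totalcpu)` to `sum(totalcpu)`, here is a small standalone sketch using Python's built-in `sqlite3`. It is not the real schema: the `steps` table and its columns are hypothetical stand-ins, and the numbers mirror the shape of the `test_cpueff_steps` data (an allocation row with no usage, plus two steps that each used 1500 CPU seconds).

```python
import sqlite3

db = sqlite3.connect(':memory:')
# Hypothetical miniature of the per-step accounting table.
db.execute('CREATE TABLE steps '
           '(JobID TEXT, JobIDnostep INTEGER, TotalCPU REAL, CPUTime REAL)')
db.executemany('INSERT INTO steps VALUES (?, ?, ?, ?)', [
    ('1',   1, None,   3000.0),  # allocation row: usage no longer reported here
    ('1.1', 1, 1500.0, 1500.0),  # first srun step
    ('1.2', 1, 1500.0, 1500.0),  # second srun step
])

old, new = db.execute(
    'SELECT max(TotalCPU)/max(CPUTime), '   # previous definition
    '       sum(TotalCPU)/max(CPUTime) '    # definition after this commit
    'FROM steps GROUP BY JobIDnostep').fetchone()

print(old)  # 0.5: only the busiest single step is counted
print(new)  # 1.0: usage of all steps is added up
```

With usage spread across steps, taking the maximum undercounts the job, while the sum recovers the full CPU time against the reserved `max(CPUTime)` of the allocation.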

test.py

Lines changed: 24 additions & 8 deletions
@@ -153,15 +153,31 @@ def test_queuetime(db, data1):
 #
 def test_cpueff(db):
     data = """
-    JobID,CPUTime,TotalCPU
-    1,50:00,25:00
+    JobID, CPUTime, TotalCPU, TRESUsageInTot
+    1, 50:00, 25:00, cpu=00:25:00
     """
     slurm2sql.slurm2sql(db, [], csv_input=csvdata(data))
     print(db.execute('select * from eff;').fetchall())
     assert fetch(db, 1, 'CPUTime') == 3000
     assert fetch(db, 1, 'TotalCPU') == 1500
     assert fetch(db, 1, 'CPUeff', table='eff') == 0.5

+def test_cpueff_steps(db):
+    data = """
+    JobID,CPUTime,TotalCPU,TRESUsageInTot
+    1, 50:00, 02:00,
+    1.1, 25:00, 24:00, cpu=00:25:00
+    1.2, 25:00, 24:00, cpu=00:25:00
+    """
+    slurm2sql.slurm2sql(db, [], csv_input=csvdata(data))
+    print(db.execute('select * from eff;').fetchall())
+    #assert fetch(db, 1, 'CPUTime') == 3000
+    #assert fetch(db, 1, 'TotalCPU') == 1500
+    assert fetch(db, 1, 'CPUeff', table='eff') == 1.0
+    assert fetch(db, 1, 'cpu_s_reserved', table='eff') == 3000
+    assert fetch(db, 1, 'cpu_s_used', table='eff') == 3000
+
+
 def test_memeff(db):
     data = """
     JobID,AllocTRES,TRESUsageInTot
@@ -231,14 +247,14 @@ def test_sacct(db, capsys):
 #
 def test_seff(db, capsys):
     data = """
-    JobID,Start,End,CPUTime,TotalCPU
-    111,1970-01-01T00:00:00,1970-01-01T00:50:00,50:00,25:00
-    111.2,,,,25:00
+    JobID, Start, End, Elapsed, CPUTime, TotalCPU,TRESUsageInTot
+    111, 1970-01-01T00:00:00, 1970-01-01T00:50:00, 50:00, 50:00, ,
+    111.2, , , , , 25:00, cpu=00:25:00
     """
     slurm2sql.seff_cli(argv=[], csv_input=csvdata(data))
-    captured = capsys.readouterr()
-    assert '111' in captured.out
-    assert '50%' in captured.out
+    captured = capsys.readouterr().out
+    assert '111' in captured
+    assert '50%' in captured

 def test_seff_mem(db, capsys):
     data = """