To switch the HTCondor Central Manager to a new host:

1. Stop `condor.service` on the old Central Manager host.
2. Check that `condor-cm.galaxyproject.eu` resolves to the new host on all nodes: `pssh -h /etc/pssh/cloud -l centos -i 'nslookup condor-cm.galaxyproject.eu'`
3. If the new address is not picked up, change the nameserver in `/etc/resolv.conf` on all nodes to e.g. 1.1.1.1.
4. Start `condor.service` on the new Central Manager.
5. Reconfigure all nodes: `pssh -h /etc/pssh/cloud -l centos -i 'condor_reconfig'`
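To confirm that the nodes now talk to the new Central Manager, a quick check can be run from any worker or submit node (a sketch using standard HTCondor tools; the expected values depend on the setup):

```bash
# central manager this node is configured to use
condor_config_val CONDOR_HOST

# totals reported by the collector on the (new) central manager
condor_status -total
```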
Reassign jobs from handler 11 to handler 3 with gxadmin:
```bash
gxadmin tsvquery queue-detail-by-handler handler_main_11 | cut -f1 | xargs -I{} -n1 gxadmin mutate reassign-job-to-handler {} handler_main_3 --commit
```
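To verify the reassignment, the same query can be re-run for both handlers (a quick sanity check reusing the query above):

```bash
# should shrink towards zero as handler_main_3 picks up the jobs
gxadmin tsvquery queue-detail-by-handler handler_main_11 | wc -l
gxadmin tsvquery queue-detail-by-handler handler_main_3 | wc -l
```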
`condor_qedit` is your friend, together with the HTCondor ClassAds. The most commonly used attributes are `RequestCpus`, `RequestMemory` (in MiB) and `RequestGpus`, but a full list of all Job ClassAd Attributes can be found in the HTCondor docs.

```bash
condor_qedit 37110378 RequestMemory=50000
```
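Before editing, the current values can be inspected for the same job (a sketch; the job ID is the one from the example above):

```bash
# show the current resource requests of the job
condor_q 37110378 -af ClusterId RequestCpus RequestMemory RequestGpus
```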
sed -i 's/"compute"/"dockerhack"/g' /etc/condor/condor_config.local; systemctl reload condor
Test with:
condor_status -autoformat Machine GalaxyGroup GalaxyDockerHack | grep hack | sort -u
The following command fails all jobs of the `service-account` user:

```bash
gxadmin tsvquery jobs --user=service-account --nonterminal | awk '{print $1}' | xargs -I {} -n 1 gxadmin mutate fail-job {} --commit
```
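To see how many jobs would be affected before committing, the same selection can simply be counted first (a sanity-check sketch reusing the query above):

```bash
gxadmin tsvquery jobs --user=service-account --nonterminal | wc -l
```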
This command finds all processes matching a string (here "obabel"), extracts their process group IDs and kills those process groups. This seems to be the only way to remove jobs from the Condor nodes when `condor_rm` was not able to kill them.

```bash
pdsh -g cloud 'ps xao pgid,cmd | grep "[o]babel" | awk "{ print \$1 }" | xargs -I {} sudo kill -9 {}'
```
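Because this issues `kill -9` across the whole cluster, it can be worth listing the matching process groups first; a dry-run sketch of the same pipeline without the kill:

```bash
# show the PGIDs and command lines that would be killed
pdsh -g cloud 'ps xao pgid,cmd | grep "[o]babel"'
```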
```bash
# jobs currently assigned to node cn032
condor_q -autoformat ClusterID JobDescription RemoteHost | grep cn032

# all "spades" jobs with their memory request/usage and hold reason
condor_q -constraint 'JobDescription == "spades"' -af ClusterID JobDescription RemoteHost RequestMemory MemoryUsage HoldReason

# all idle jobs (JobStatus == 1)
condor_q -autoformat:t ClusterId JobDescription RequestMemory RequestCpus JobStatus | grep -P "\t1$"

# sum up the CPUs of all slots in the pool
condor_status -autoformat Name Cpus | cut -f2 -d' ' | paste -s -d'+' | bc
```
As input we had a Galaxy job ID 11ac61790d0cc33b8086442012d093zu (11384941) and a report of an empty file as result. The job was a step of a big collection in which the other steps were successful.
To understand the reason for the problem, I proceeded as follows:
```bash
condor_history | grep 11384941
```

to retrieve the Condor ID, and

```bash
condor_history -l 6545461
```

to retrieve all the job details. There I found this error message:

```
"Error from slot1_1@cn030.bi.uni-freiburg.de: Failed to open '/data/dnb03/galaxy_db/job_working_directory/011/384/11384941/galaxy_11384941.o' as standard output: No such file or directory (errno 2)"
```
A quick check on the compute node

```
[root@cn030 ~]# cd /data/dnb03
-bash: cd: /data/dnb03: No such file or directory
```

showed that the NFS export was not mounted properly.
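To find out whether other nodes are affected as well, the mount can be checked cluster-wide with the same `pdsh` group used above (a sketch; `mountpoint` is assumed to be available on the workers):

```bash
# report nodes where /data/dnb03 is not mounted
pdsh -g cloud 'mountpoint -q /data/dnb03 || echo "/data/dnb03 NOT mounted"'
```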
### Reserve a handler for a tool and move all running jobs to it
Add a new handler to the `job_conf.xml`
```xml
<handlers assign_with="db-skip-locked" max_grab="8">
    <handler id="handler_key_0"/>
    <handler id="handler_key_1"/>
    <handler id="handler_key_2"/>
    <handler id="handler_key_3"/>
    <handler id="handler_key_4"/>
    <handler id="handler_key_5"/>
    <handler id="handler_key_6" tags="special_handlers"/>
</handlers>
```
Associate the tool with that handler:

```xml
<tools>
    <tool id="upload1" destination="gateway_singlerun" />
    <tool id="toolshed.g2.bx.psu.edu/repos/chemteam/gmx_sim/gmx_sim/2020.4+galaxy0" destination="gateway_singlerun" resources="usegpu" />
    <tool id="toolshed.g2.bx.psu.edu/repos/chemteam/gmx_sim/gmx_sim/2019.1.5.1" destination="gateway_singlerun" resources="usegpu" />
    <tool id="gmx_sim" destination="gateway_singlerun" resources="usegpu" />
    <tool id="param_value_from_file" handler="special_handlers" />
</tools>
```
Restart the workflow schedulers and move all running jobs to the new handler:

```bash
for j in `gxadmin query queue-detail --all | grep param_value_from_file | grep -v handler_key_6 | cut -f2 -d'|' | paste -s -d ' '`; do gxadmin mutate reassign-job-to-handler $j handler_key_6 --commit; done
```
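Afterwards, it can be checked that no `param_value_from_file` jobs are left on other handlers (a quick sketch reusing the query above; the output should be empty):

```bash
gxadmin query queue-detail --all | grep param_value_from_file | grep -v handler_key_6
```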
This “peaceful” shutdown of a startd will cause that daemon to wait indefinitely for all existing jobs to exit before shutting down. During this time, no new jobs will start.
```bash
condor_off -peaceful -startd vgcnbwc-worker-c125m425-8231.novalocal
```
To begin running or restarting all daemons (other than condor_master) given in the configuration variable DAEMON_LIST on the host:
```bash
condor_on vgcnbwc-worker-c125m425-8231.novalocal
```
Both commands can be executed from the submitter node.
Since we do not configure `MaxJobRetirementTime` in our Condor setup, running `condor_drain` will kill the jobs immediately, as the default value of `MaxJobRetirementTime` is 0. Instead of `condor_drain`, we can use `condor_off -peaceful -name <worker node name>`; running this command makes the daemons wait for all jobs to finish while ensuring no new jobs are accepted.
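Whether a node has stopped accepting jobs can be checked from the submitter node; a sketch using standard `condor_status` slot attributes (the node name is the example from above):

```bash
# State/Activity of the slots on the node being drained
condor_status -autoformat Machine State Activity | grep vgcnbwc-worker-c125m425-8231
```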
Fridays for Future organizes Global Climate Strikes that take place on specific Fridays to raise awareness of the grim consequences of global warming.
Galaxy Europe has participated in such strikes in the past, by closing the job queue so that new jobs will not start until the strike is over. The strike protocol is as follows:
We have written a TPV rule that takes advantage of HTCondor’s job deferral feature to enforce the strike. A copy of the rule is available below.
```yaml
- id: climate_strike
  # delay jobs using HTCondor's job deferral feature
  # https://htcondor.readthedocs.io/en/latest/users-manual/time-scheduling-for-job-execution.html
  if: True
  execute: |
    from datetime import datetime
    strike_start = datetime(2023, 9, 15, 7, 0)
    strike_end = datetime(2023, 9, 15, 19, 0)
    training_roles = (
        [r.name for r in user.all_roles()
         if not r.deleted and r.name in
         ("training-bma231-ht23", "training-msc-tmr-ws23",
          "training-bio00058m")]
        if user is not None else []
    )
    now = datetime.now()
    if strike_start <= now < strike_end and not training_roles:
        entity.params["deferral_time"] = f"{int(strike_end.timestamp())}"
        entity.params["deferral_prep_time"] = "60"
        entity.params["deferral_window"] = "864000"  # 10 days
```
Before merging the rule into `tool_defaults.yml` (as a list item under the key `tools.default.rules`), make sure to update the strike dates and the tuple of training roles that should be exempt from the strike; training role names consist of `"training-"` prepended to the Event ID.

After merging the rule, all jobs submitted during the strike will remain in the "queued" state (gray) until the strike is over. Note that this TPV rule only affects jobs running on HTCondor.
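During the strike, the deferred jobs can also be spotted on the HTCondor side; a sketch that assumes the deferral parameters surface as the standard `DeferralTime` job attribute:

```bash
# list jobs carrying a deferral time, together with their status
condor_q -constraint 'DeferralTime =!= undefined' -af ClusterId JobDescription DeferralTime JobStatus
```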
Find Condor jobs that have no corresponding Galaxy job by comparing the Condor queue against the Galaxy queue:

```bash
comm -23 <(condor_q --json | jq '.[]? | .ClusterId' | sort) <(gxadmin query queue-detail | awk '{print $5}' | sort)
```

Those IDs can be piped to `condor_rm` if needed.
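For example, the orphaned IDs could be removed like this (a sketch of the piping mentioned above; review the list first, since `condor_rm` acts immediately):

```bash
comm -23 <(condor_q --json | jq '.[]? | .ClusterId' | sort) \
         <(gxadmin query queue-detail | awk '{print $5}' | sort) \
    | xargs -r -n1 condor_rm
```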
```bash
gxadmin query queue-detail --all | awk -F\| '{print$5}' | sort | uniq -c | sort -sn
```

Gives a list of all users that currently have jobs in the queue (new, queued and running) and how many, sorted by the number of jobs in ascending order.
```bash
# show jobs with a human-readable start date (strftime requires gawk)
condor_q -autoformat ClusterId Cmd JobDescription RemoteHost JobStartDate | awk '{ printf "%s %s %s %s %s\n", $1, $2, $3, $4, strftime("%Y-%m-%d %H:%M:%S", $5) }'
```
Helped to solve an issue with `handler_sn06_0`: the query below counts, per job, the state-history entries of jobs (excluding `__DATA_FETCH__`) handled by it and updated within the given time window.

```bash
gxadmin query q "select job.id from job inner join job_state_history jh on job.id = jh.job_id where job.handler = 'handler_sn06_0' and job.tool_id != '__DATA_FETCH__' and ( job.update_time between timestamp '2023-12-14 11:00:00' and '2023-12-14 12:00:00' )" | awk '{print$1}' | sort | uniq -c | sort -sn
```