We have access to a(t least one) A100 in
, although that machine is frequently busy.
has four A100s.
For now, to get an A100 from AWS, you have to rent eight of them with the
p4d.24xlarge instance type. Use the "Deep Learning Base AMI (Amazon Linux 2)" to avoid having to deal with drivers and the like.
sudo yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
sudo yum install https://research.cs.wisc.edu/htcondor/repo/current/htcondor-release-current.amzn2.noarch.rpm
sudo yum-builddep condor
Then set up your HTCondor build tree in the usual way. (Don't forget the
; this instance type has a
NVIDIA has a MIG user guide. Of particular note:
sudo nvidia-smi -i INDEX -mig 1 enables MIG but does not create any GPU instances. Doing this step but not the next one is the (mis)configuration of relevance to HTCONDOR-476.
sudo nvidia-smi mig -i 1 -cgi 19,19,19,19,19,19,19 -C creates a 7-way split of the A100.
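The two steps above can be combined into a small dry-run helper. This is only a sketch: the function name mig_split_commands and its argument order are hypothetical, and the defaults (a 7-way split using profile ID 19) are simply the values from the command above. It prints the commands rather than running them, so it can be checked on a machine without a GPU.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the nvidia-smi commands that enable
# MIG on one GPU and then split it into COUNT GPU instances of one profile.
mig_split_commands() {
    gpu_index="$1"      # GPU index, as used with nvidia-smi -i
    count="${2:-7}"     # number of GPU instances to create (assumed default)
    profile="${3:-19}"  # GPU instance profile ID; 19 is the one used above

    # Build the comma-separated profile list, e.g. 19,19,19
    profiles=""
    i=0
    while [ "$i" -lt "$count" ]; do
        profiles="${profiles:+$profiles,}$profile"
        i=$((i + 1))
    done

    # Step 1: enable MIG mode. Step 2: create GPU and compute instances.
    echo "sudo nvidia-smi -i $gpu_index -mig 1"
    echo "sudo nvidia-smi mig -i $gpu_index -cgi $profiles -C"
}

# Print the commands for a 7-way split of GPU 1.
mig_split_commands 1
```

Stopping after the first printed command reproduces the MIG-enabled-but-no-instances state described above. After running both, sudo nvidia-smi mig -lgi should list the GPU instances that were created.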