Host Setup
Machine Hosting Setup
- Ubuntu 18.04 or newer (required)
- Dedicated machines only: the machine shouldn't run other workloads while rented.
- Fast, reliable internet: at least 10Mbps per machine.
- Nvidia GPU, 10-series or newer (AMD not yet supported, older Nvidia not recommended).
- At least 1 physical CPU core (2 hyperthreads) per GPU.
- Your CPU must support the AVX instruction set (not all lower-end CPUs do).
- At least 4GB of system RAM per GPU.
- Fast SSD storage with at least 128GB per GPU.
- At least 1X PCIE for every 2.5 TFLOPS of GPU performance.
- All GPUs on the machine must be of the same type.
- An open port range mapped to each machine.
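Two of the requirements above (AVX support and RAM per GPU) are easy to check from a shell before installing anything. The following is a minimal sketch, not part of our installer: the function names cpu_has_avx and ram_ok_for_gpus are hypothetical, and the RAM check takes the installed RAM in GB as a number you supply.

```shell
# Hypothetical preflight helpers; the thresholds come from the list above.

# Succeeds if the given cpuinfo file (default /proc/cpuinfo) lists the
# "avx" CPU flag as a whole word (so "avx2" alone does not count).
cpu_has_avx() {
  local cpuinfo="${1:-/proc/cpuinfo}"
  grep -m1 '^flags' "$cpuinfo" | grep -qw 'avx'
}

# Succeeds if the installed RAM (in GB) is at least 4 GB per GPU.
ram_ok_for_gpus() {
  local mem_gb="$1" gpus="$2"
  [ "$gpus" -gt 0 ] && [ "$mem_gb" -ge $(( gpus * 4 )) ]
}

# Example (reads the live system for the AVX check):
#   cpu_has_avx && echo "AVX: ok"
#   ram_ok_for_gpus 32 4 && echo "RAM: ok for 4 GPUs"
```

The remaining requirements (network speed, PCIE lanes, open ports) depend on your hardware and network and are not covered by this sketch.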
Install Ubuntu:
Ubuntu 23.04
Ubuntu 23.04 is possible with some work (and doesn't appear to have the NVML issue of 22.04), but it has not been tested and is a short-lived release.
Ubuntu 22.04
Ubuntu 22.04 has a serious nvidia-container issue that must be mitigated. The simple workaround that seems to work is option 2: explicitly disabling systemd cgroup management in Docker. Add the parameter "exec-opts": ["native.cgroupdriver=cgroupfs"] to the /etc/docker/daemon.json file and restart Docker (for example, with "sudo systemctl restart docker").
Then check the output of "docker info" and look for the Cgroup Driver entry, which should look like this:
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
The install script tests for this NVML bug.
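The 22.04 workaround above can be sketched as two small shell helpers. This is an illustration under assumptions, not what our installer does: the function names are hypothetical, and the writer deliberately refuses to touch an existing daemon.json, because that file may already hold other settings that you need to merge by hand.

```shell
# Writes a minimal daemon.json that selects the cgroupfs driver.
# Refuses to overwrite an existing file, since it may already contain
# other Docker settings that need a manual merge.
write_cgroupfs_daemon_json() {
  local path="${1:-/etc/docker/daemon.json}"
  if [ -e "$path" ]; then
    echo "$path exists; add the exec-opts key by hand" >&2
    return 1
  fi
  printf '{\n  "exec-opts": ["native.cgroupdriver=cgroupfs"]\n}\n' > "$path"
}

# Reads `docker info` text on stdin and prints the Cgroup Driver value.
cgroup_driver() {
  awk -F': *' '/^[ \t]*Cgroup Driver:/ { print $2; exit }'
}

# Example (run as root, then restart Docker and verify):
#   write_cgroupfs_daemon_json && systemctl restart docker
#   docker info | cgroup_driver     # should print: cgroupfs
```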
Ubuntu 20.04
Ubuntu 20.04 is the most tested/stable but may have slower performance for some workloads or lack support for some newer hardware.
Our software requires an XFS disk partition, which is not created by default. If an XFS disk partition is not found, the installer will create a loopback (emulated) XFS partition as a fallback, but we recommend using an actual XFS partition. Our setup script will automatically create the XFS partition from free space on the drive; however, the default Ubuntu install will not leave any free space.
If you have multiple drives and want to use a drive just for Ubuntu, leave unpartitioned space on the SSD for the XFS partition. You can then install Ubuntu using the defaults. For an NVMe SSD, you will also want to create a small partition on the SSD and leave most of the space unpartitioned (as the installer currently assumes each drive has at least one partition).
However, if you only have a single drive, we recommend using the "something else" option on the Installation Type step to create custom partitions:
To set up your single drive:
- If necessary, create a new partition table using the button on the right. This is not necessary if the + button on the left already works.
- Create an "efi" type partition of size ~512MB. If you can't select the "efi" type for a partition, skip this step.
- Create an "ext4" type partition with mount point "/", of size 32GB, for the OS.
- We generally recommend NOT creating a swap partition, as it currently can cause major performance issues when docker containers start to exhaust RAM.
- Leave the rest unpartitioned.
If you are using multiple drives (OS + data), make sure to set up a small partition (ext4 is fine, 256MB) on the NVMe SSD you want to use for the main data drive, and leave the rest unpartitioned. If you are having problems with the Vast install script, you may need to manually set up an XFS partition on your SSD and adjust GRUB to use pquota and trim (see our Discord).
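When you are done, the partition list should show the EFI partition (if you created one), the 32GB ext4 root partition, and the remaining space as free (unpartitioned).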
Now wait for Ubuntu to finish installing, and proceed once it has booted successfully. The rest of the installation will be automated by our installer script.
!Important!: AMD EPYC IOMMU Issue
AMD EPYC systems have known serious compatibility problems with the core Nvidia NCCL library that distributed PyTorch (and other frameworks) use, which may make your machine unusable for many of our users. You can read more, including workarounds, by searching the web (or our Discord history for "EPYC"); see also the PyTorch thread and the Nvidia info page on this issue.
We may test PyTorch distributed (NCCL) on your machine, and if the test fails the machine could be deverified. Ensure this test works and your machine is compatible with Nvidia NCCL.
Install the Nvidia Driver
Install the Vast.ai Manager Software
Note: you may need to install python2.7 to run the install script.
Troubleshooting
If your install succeeded, you should see your new machine show up on the machines page. If the install failed before starting the daemon, you'll want to look at the install log to see what went wrong. If the main install succeeded but the daemon isn't running or is restarting, this suggests an issue with nvidia-drivers, docker, nvidia-docker, or the XFS partition. The main daemon logs are in the daemon directory at: /var/lib/vastai_kaalia.
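When the XFS partition is the suspect, one quick check is whether any XFS filesystem is currently mounted with project quota enabled. The helper below is a rough sketch: the function name is hypothetical, and it assumes the kernel reports the option in /proc/mounts as either "pquota" or "prjquota", so it accepts both.

```shell
# Prints the mount points of XFS filesystems mounted with project quota.
# Takes a file in /proc/mounts format (default: /proc/mounts).
xfs_pquota_mounts() {
  local mounts="${1:-/proc/mounts}"
  # Field 3 is the filesystem type, field 4 the comma-separated options.
  # Wrapping the options in commas lets us match a whole option name.
  awk '$3 == "xfs" && ("," $4 ",") ~ /,(pquota|prjquota),/ { print $2 }' "$mounts"
}

# Example:
#   xfs_pquota_mounts        # no output means no such mount was found
```

An empty result does not by itself prove the partition is broken, but it narrows down where to look in the install log.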
Configure Networking
We strongly recommend opening and mapping a port range from your NAT/firewall to each machine. This is important for compatibility with most server software and tools, which assume open ports. You can still list a machine without open ports, but we prioritize verification of machines with a healthy-sized, properly configured open port range.
1.) First, open a port range (TCP and UDP) on your NAT/firewall and map it to the same destination port range on each target machine. At least 1 open port per GPU on the machine is currently required, but we recommend at least 100 per GPU (to allow SSH, a few web-based tools, a P2P connection, etc.), and some Docker images use many dozens.
2.) Test your open ports. You can use netcat: "nc -l -p PORT" on the target machine, then use a website like https://portchecker.co/ or similar. Make sure they work before going to step 3, because if you go to step 3 with broken port mappings your machine could be deverified.
3.) Inform the daemon by writing a small file with contents "FIRST-LAST" to /var/lib/vastai_kaalia/host_port_range. For example, to map 16384-32768:
echo -n "16384-32768" > /var/lib/vastai_kaalia/host_port_range
If you get permission errors, you may need to wrap it with sudo:
sudo bash -c 'echo -n "16384-32768" > /var/lib/vastai_kaalia/host_port_range'
4.) The daemon will update and inform our servers, no restart required. You should see the #ports update on the machine card in a few minutes.
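Because a broken port mapping can get a machine deverified, it may be worth validating the range before writing it in step 3. The following is a minimal sketch with a hypothetical function name, write_port_range; it only checks that both ends are valid port numbers in order, and it does not test that the ports are actually reachable.

```shell
# Validates FIRST and LAST, then writes "FIRST-LAST" (no trailing newline)
# to the given file (default: the path used in step 3).
write_port_range() {
  local first="$1" last="$2" p
  local file="${3:-/var/lib/vastai_kaalia/host_port_range}"
  for p in "$first" "$last"; do
    case "$p" in
      ''|*[!0-9]*) echo "not a port number: '$p'" >&2; return 1 ;;
    esac
    if [ "$p" -lt 1 ] || [ "$p" -gt 65535 ]; then
      echo "port out of range (1-65535): $p" >&2
      return 1
    fi
  done
  if [ "$first" -gt "$last" ]; then
    echo "first port must not exceed last port" >&2
    return 1
  fi
  printf '%s-%s' "$first" "$last" > "$file"
}

# Example (as root):
#   write_port_range 16384 32768
```

Nothing is written when validation fails, so an existing host_port_range file is left untouched.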
By default your machine's public IP is just the incoming IP that connects to our server. For most networks this is sufficient, but sysadmins of unusual datacenter networks may want to override this behavior (useful for some asymmetric NATs/firewalls) by writing the IP to /var/lib/vastai_kaalia/host_ipaddr:
sudo bash -c 'echo -n "172.13.108.14" > /var/lib/vastai_kaalia/host_ipaddr'
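A typo in the override address would point clients at the wrong host, so a basic format check before writing the file can help. This is a sketch under assumptions: valid_ipv4 and write_host_ipaddr are hypothetical helper names, and the check covers dotted-quad IPv4 syntax only, not whether the address is really yours or publicly routable.

```shell
# Succeeds if the argument is four dot-separated decimal octets, each 0-255.
valid_ipv4() {
  local ip="$1" octet
  case "$ip" in
    ''|*[!0-9.]*|.*|*.|*..*) return 1 ;;
  esac
  local IFS=.
  set -- $ip            # split on dots into positional parameters
  [ "$#" -eq 4 ] || return 1
  for octet in "$@"; do
    [ "${#octet}" -le 3 ] && [ "$octet" -le 255 ] || return 1
  done
}

# Writes the address (no trailing newline) to the override file.
write_host_ipaddr() {
  local ip="$1" file="${2:-/var/lib/vastai_kaalia/host_ipaddr}"
  valid_ipv4 "$ip" || { echo "not an IPv4 address: $ip" >&2; return 1; }
  printf '%s' "$ip" > "$file"
}

# Example (as root):
#   write_host_ipaddr 172.13.108.14
```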
List Your Machine
Go to the Machines page and find your new machine. It can take a few minutes to an hour for the main performance stats to appear. Prepare your machine properly to be online and ready for long periods of continuous work.
If your machine doesn't appear, please go to our Discord server and check the pinned messages and announcements about updates.
List your machine for rental using the List button, and then adjust max rental duration using the Availability date field. On the end date the listing will expire and your machine will unlist. However any existing client jobs will still remain until ended by their owners.
Once you list your machine and it is rented, it is extremely important that you don't interfere with the machine in any way. If your machine has an active client job and then goes offline, crashes, or has performance problems, this could permanently lower your reliability rating. We strongly recommend you test the machine first and only list when ready.
Verification
Disable Auto Updates
You probably should disable auto updates to prevent the nvidia driver from auto-updating, as that can cause NVML mismatch driver errors for nvidia-smi and nvidia-docker, resulting in auto-deverification of your machines.
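This guide does not prescribe a specific method. One common approach on Ubuntu is to set the APT periodic options to "0" in /etc/apt/apt.conf.d/20auto-upgrades. The sketch below assumes that file is where your system keeps these settings; the function name is hypothetical, and it overwrites the target file, so review the existing contents first.

```shell
# Writes APT periodic settings that turn off automatic package-list
# updates and unattended upgrades. Overwrites the target file.
write_auto_upgrades_off() {
  local file="${1:-/etc/apt/apt.conf.d/20auto-upgrades}"
  printf '%s\n' \
    'APT::Periodic::Update-Package-Lists "0";' \
    'APT::Periodic::Unattended-Upgrade "0";' > "$file"
}

# Example (as root):
#   write_auto_upgrades_off
```

With automatic upgrades off, remember to apply security updates manually while the machine is unlisted and idle.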
Customize apt mirrors
The default Ubuntu apt sources use US-based mirrors. If your vast.ai servers don't have good network throughput to these mirrors, containers could load very slowly and even time out. You can override this behavior by writing a small file named "apt-select-out" into your main daemon directory '/var/lib/vastai_kaalia'. This file should contain just a text string that replaces 'archive.ubuntu.com'. For example, if you want to change the apt source to use 'mirror.yandex.ru', run the following commands on each server:
echo 'mirror.yandex.ru' > /var/lib/vastai_kaalia/apt-select-out
echo 'mirror.yandex.ru' > /var/lib/vastai_kaalia/latest/apt-select-out
Set a Min Bid Price or Background Task
Create a background job to run on the machine, along with a floor price. This will create a low-priority job that runs only when there is no higher-priced job to run. The floor price determines the minimum that clients will have to pay to rent your GPUs.
You can use this feature to mine in the background. You can also use this feature to prevent your machine from being rented at a price below your electricity cost by creating a background job that does nothing and setting the price to your electricity cost.
Alternatively, you can use the CLI to set a min bid price if you don't want to mess with mining images/containers.
Background/Mining Jobs
Your job must be packaged with a docker container. Use the "Run custom command on start" option. We have tested the following mining apps (these are all probably now out of date - search the host discord for newer images):
Z-enemy
Image:
nvidia/cuda:8.0-runtime-ubuntu16.04
Run custom command on start:
bash -c 'apt update; apt install -y wget libcurl3 libjansson4; wget -c https://github.com/RCS1/res2/raw/master/z-enemy; chmod +x z-enemy; ./z-enemy -a x16r -o stratum+tcp://us.ravenminer.com:4567 -u RQcdYaLCw37aLbeAw14qEU75BZmGCGcyMC -p d=16 -i 20'
Replace RQcdYaLCw37aLbeAw14qEU75BZmGCGcyMC with your output address and adjust any other z-enemy params as needed.
ethminer
Image:
anthonytatowicz/eth-cuda-miner
Run custom command on start:
-U -S us-west1.nanopool.org:9999 -O 0x5C9314b28Fbf25D1d054a9184C0b6abF27E20d95 --farm-recheck 200
Replace us-west1.nanopool.org:9999 with your pool and 0x5C9314b28Fbf25D1d054a9184C0b6abF27E20d95 with your output address, and adjust any other ethminer params as needed.
ccminer (x16r)
Image:
cgarnier/x16r-ccminer-docker
Run custom command on start:
-a x16r --url="stratum+tcp://rvn.suprnova.cc:6666" --userpass="USER:PWD"
Replace USER and PWD with your suprnova worker username and password.