Remote Execution

RepX provides transparent execution on remote systems via SSH, with optional SLURM workload manager integration. The Lab artifact and its dependencies are synchronized to the target before execution.

Prerequisites

Requirement     Description
SSH access      Passwordless key-based authentication to target host
Storage         Writable directory on target for artifacts and outputs
Configuration   Target definition in config.toml
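
If key-based authentication is not yet in place, it can typically be set up with standard OpenSSH tooling. The sketch below is illustrative; the host name comes from the example target configuration:

# Generate a key pair (if one does not already exist) and install the public key on the target
ssh-keygen -t ed25519
ssh-copy-id user@login.cluster.edu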

Target Configuration

Define remote targets in ~/.config/repx/config.toml:

[targets.cluster]
address = "user@login.cluster.edu"
base_path = "/scratch/user/repx_work"
default_scheduler = "slurm"
node_local_path = "/tmp/user/repx"

Configuration Parameters

Parameter           Required   Description
address             yes        SSH connection string
base_path           yes        Remote working directory
default_scheduler   yes        local or slurm
node_local_path     no         Node-local storage for container caching

Execution

Specify the target via the --target flag:

repx run simulation --lab ./result --target cluster

Resource Allocation (SLURM)

SLURM resource requirements are specified in resources.toml:

[defaults]
time = "01:00:00"
mem = "4G"
cpus-per-task = 1
partition = "standard"

[[rules]]
job_id_glob = "*training*"
time = "12:00:00"
mem = "64G"
sbatch_opts = ["--gres=gpu:1"]

[[rules]]
job_id_glob = "*preprocess*"
mem = "32G"
partition = "highmem"
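
Rules are matched against job IDs by glob pattern and layered over the [defaults] section (the exact precedence is determined by RepX). As an illustration, assuming matching rules override the defaults, a job whose ID matches *training* would result in sbatch directives roughly like the following sketch (not the literal script RepX generates):

#!/usr/bin/env bash
#SBATCH --time=12:00:00        # from the *training* rule
#SBATCH --mem=64G              # from the *training* rule
#SBATCH --cpus-per-task=1      # from [defaults]
#SBATCH --partition=standard   # from [defaults]
#SBATCH --gres=gpu:1           # from sbatch_opts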

Apply resources during execution:

repx run simulation --lab ./result --target cluster --resources resources.toml

Synchronization

RepX employs a multi-phase synchronization strategy:

Phase 1: Bootstrap

Static host tools (bash, jq, rsync, etc.) are deployed to ensure consistent execution regardless of target system configuration.
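
To see which tools were deployed, the host-tools directory (see Directory Structure below) can be inspected directly; the path here follows the example target configuration:

ssh user@login.cluster.edu 'ls /scratch/user/repx_work/host-tools/*/bin/'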

Phase 2: Lab Transfer

The Lab artifact is synchronized using rsync with the following characteristics:

  • Incremental transfer of changed files only
  • Preservation of symbolic links and permissions
  • Atomic updates via temporary staging
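
For reference, a manual rsync invocation with comparable semantics might look like the sketch below; the exact flags, staging location, and destination path RepX uses are assumptions here and may differ:

# -a preserves symlinks and permissions; --delay-updates holds completed files
# in the staging directory and moves them into place at the end of the transfer
rsync -az --partial --delay-updates \
      --temp-dir=.repx-staging \
      ./result/ user@login.cluster.edu:/scratch/user/repx_work/artifacts/lab/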

Phase 3: Container Image Sync (Incremental)

Container images are synchronized incrementally to minimize transfer overhead:

  1. Image manifest is compared against remote cache
  2. Only modified or missing layers are transferred
  3. Existing layers are reused from node_local_path when available

This optimization significantly reduces synchronization time for iterative development workflows where container images change infrequently.
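
Conceptually, the layer-level comparison works along the lines of the sketch below. The manifest format, layer directory, and cache path are illustrative assumptions, not RepX's actual implementation:

# For each layer digest in the local image manifest, transfer the layer only if
# the remote cache (under node_local_path) does not already contain it
for digest in $(jq -r '.layers[].digest' manifest.json); do
  if ! ssh user@login.cluster.edu "test -e /tmp/user/repx/layers/${digest}"; then
    rsync -az "layers/${digest}" user@login.cluster.edu:/tmp/user/repx/layers/
  fi
done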

Phase 4: Job Submission

Scheduler   Mechanism
SLURM       Generates sbatch scripts and submits via sbatch
Local       Spawns RepX agent process for direct execution

Directory Structure

Remote artifacts are organized under base_path:

<base_path>/
  artifacts/         # Lab closure and job packages
  outputs/           # Job execution outputs
    <job-id>/
      out/           # User artifacts
      repx/          # Execution metadata
  host-tools/        # Static tool binaries
    <hash>/bin/
  bin/               # Deployed utilities
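
This layout also makes it straightforward to pull results back manually if needed. For example, assuming a local ./results/ destination:

rsync -az user@login.cluster.edu:/scratch/user/repx_work/outputs/<job-id>/out/ ./results/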

Troubleshooting

Connection Issues

Verify SSH connectivity:

ssh -o BatchMode=yes user@host echo "Connection successful"

Permission Errors

Ensure base_path is writable:

ssh user@host "mkdir -p /scratch/user/repx_work && touch /scratch/user/repx_work/.test"

SLURM Submission Failures

Check SLURM queue status and partition availability:

ssh user@host "sinfo -p partition_name"
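
Standard SLURM tooling can also help diagnose submission problems, for example:

ssh user@host 'squeue -u $USER'                                   # jobs currently queued or running
ssh user@host 'sacct -j <job-id> --format=JobID,State,ExitCode'   # state of a submitted job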