Scaffold and Submit Computational Jobs to HTC Schedulers • submitr

The problem with the handoff

You have an R analysis that runs on your laptop. Maybe it takes a while. Maybe you need to run it many times — once per species, once per county, once per simulation parameter, once per experimental condition. Maybe both.

CHTC’s high-throughput computing infrastructure can run many independent jobs across a large pool of compute resources. The barrier is rarely the value of the computing. The barrier is the handoff: turning a local analysis into something a scheduler can run somewhere else.

That handoff requires several pieces to line up at once. Your R code needs to run without relying on the interactive session where you developed it. Your software environment needs to be portable. Your files need to move to a submit node. HTCondor needs a submit file. The execute node needs a shell script. Your results need to come back.

submitr is designed to make that handoff easier. It generates the HTCondor submit file, generates the executable script, wraps the SSH and SCP commands that move files to and from the submit node, submits the job, checks status, and downloads results — all from R.

If you are new to CHTC, submitr gives you a guided path to your first successful submission. If you already use CHTC, submitr reduces repetitive setup work and makes common submission patterns easier to reproduce, review, and share.

When to use submitr

Use submitr when you are:

sending a containerized R analysis to CHTC for the first time;
teaching researchers the structure of an HTCondor job;
moving from a single local analysis to many independent HTC jobs;
standardizing a submit-file and executable-script pattern across projects;
reducing repeated SSH, SCP, and condor_submit command-line work;
making CHTC submissions easier to review, rerun, and share.

submitr is useful on its own if your project is already organized and containerized. It also fits into a broader workflow for moving from a literate analysis document to a portable, scalable computation.

The toolero family

submitr is the third step in the From the Notebook to the Cluster package family:

toolero     organize, scaffold, split
  └─ containr   freeze the software environment in a container
       └─ submitr    send the analysis to CHTC and retrieve results

Each package is useful on its own. Together, they form a path from a local R project to a completed high-throughput computing run.

toolero helps you start with a maintainable project structure, use Quarto as a source of truth, and split data into job-sized pieces.
containr helps you build a container image from your renv.lock so the software environment can travel with the analysis.
submitr helps you send the containerized analysis to CHTC, monitor the job, and bring results back.

You can adopt these packages one at a time. submitr does not require toolero, and toolero does not require submitr. The family exists so that each step prepares cleanly for the next when your project is ready to scale.

Before you start

submitr assumes your project is already organized and containerized. Before using it, confirm that:

your R script runs with Rscript analysis.R outside RStudio;
your container image is pushed to a registry CHTC can access;
you have SSH access to a CHTC submit node such as ap2002.chtc.wisc.edu.
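One way to check the first item is to run the script in a fresh, non-interactive R process. The sketch below is illustrative only: it writes a small stand-in script to a temporary file (your real `analysis.R` would take its place) and runs it with the `Rscript` binary from the current R installation.

```r
# Stand-in for analysis.R, written to a temporary file for illustration.
script <- tempfile(fileext = ".R")
writeLines('cat("analysis finished\\n")', script)

# Locate the Rscript binary that belongs to the running R installation.
rscript <- file.path(R.home("bin"), "Rscript")

# Run the script in a separate process and capture everything it prints.
output <- system2(rscript, shQuote(script), stdout = TRUE, stderr = TRUE)

# system2() attaches a "status" attribute only when the exit code is non-zero.
status <- attr(output, "status")
ran_ok <- is.null(status) || status == 0
ran_ok
```

If `ran_ok` is `FALSE` for your own script, the captured `output` usually shows which package, file, or object the script expected from the interactive session.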

Set up SSH connection reuse before anything else. Every submitr function that touches CHTC opens an SSH connection, and each new connection can trigger a Duo MFA prompt. ControlMaster caches your authenticated session so that later connections reuse it instead of prompting again. The setup takes about two minutes and is worth doing before your first htc_start() or htc_config() call. Full instructions appear under "Setting up SSH connection reuse" below.

Installation

Install the development version from GitHub:

# install.packages("pak")
pak::pak("erwinlares/submitr")

## A first workflow

```r
library(submitr)

# 1. Start the session (reads htc.cfg, stores config for all calls)
htc_start()

# 2. Generate the submit file
htc_gen_submit(
  output_file     = "analysis.sub",
  container_image = "docker://registry.doit.wisc.edu/your.netid/my-analysis:1.0.0",
  executable      = "analysis.sh",
  output_files    = "results.tar.gz",
  resources       = "small",
  comments        = TRUE
)

# 3. Generate the executable script
htc_gen_executable(
  r_script       = "analysis.R",
  output_file    = "analysis.sh",
  results_folder = "results",
  comments       = TRUE
)

# 4. Upload files to the submit node
htc_upload(files = c("analysis.sub", "analysis.sh"))

# 5. Submit the job
cluster_id <- htc_submit(submit_file = "analysis.sub")

# 6. Check progress
htc_status(cluster_id = cluster_id, watch = TRUE)

# 7. Download results
htc_download(files = "*.tar.gz", local_path = "results/")
```

---

## Core workflow functions

### `htc_start()`

`htc_start()` reads your project's `htc.cfg` and stores the connection
config for the rest of the R session. All subsequent `htc_*()` calls use
it automatically, so you do not need to pass `config = cfg` on every call.

```r
htc_start()
#> v Session started: "your.netid"@"ap2002.chtc.wisc.edu"
```

If this is your first time, htc_start() prompts for your NetID and submit node, writes htc.cfg, and displays ControlMaster setup instructions. On subsequent calls it reads the existing config and validates the connection.

You can still pass config explicitly to any function to override the session config:

other_cfg <- htc_config(path = "other-project/")
htc_upload(files = "job.sub", config = other_cfg)

`htc_config()`

htc_config() is the lower-level function that reads or creates htc.cfg. Most researchers should use htc_start() instead, which calls htc_config() and stores the result for the session. Use htc_config() directly when you need to manage multiple configs or pass a config to a single call without starting a session.

cfg <- htc_config()
#> Reading HTC config from ./htc.cfg
#> v Connected to "ap2002.chtc.wisc.edu" as "your.netid".

Setting up SSH connection reuse

Before continuing, take two minutes to set up ControlMaster. Add this block to ~/.ssh/config:

Host *.chtc.wisc.edu
  ControlMaster auto
  ControlPersist 2h
  ControlPath ~/.ssh/connections/%r@%h:%p

Then create the directory used by ControlPath:

mkdir -p ~/.ssh/connections

With ControlMaster in place, all subsequent SSH connections — uploads, submits, status checks, downloads — reuse the same authenticated session without prompting for Duo MFA. Full documentation is at https://chtc.cs.wisc.edu/uw-research-computing/configure-ssh.
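To confirm the block is present, you could scan the config file from R. This is a minimal sketch, not part of submitr: `has_control_master()` is a hypothetical helper, and it only checks that an uncommented `ControlMaster` directive exists somewhere in the file, not that it sits under the CHTC `Host` entry.

```r
# Return TRUE if any line holds an active (non-commented) ControlMaster
# directive. ssh_config allows "Key value" and "Key=value" forms.
has_control_master <- function(config_lines) {
  pattern <- "^\\s*ControlMaster(\\s+|\\s*=\\s*)\\S+"
  any(grepl(pattern, config_lines, ignore.case = TRUE))
}

# The block from this section, as it would appear in ~/.ssh/config.
example_config <- c(
  "Host *.chtc.wisc.edu",
  "  ControlMaster auto",
  "  ControlPersist 2h",
  "  ControlPath ~/.ssh/connections/%r@%h:%p"
)
has_control_master(example_config)

# Check your own file, if it exists.
ssh_config <- path.expand("~/.ssh/config")
if (file.exists(ssh_config)) {
  has_control_master(readLines(ssh_config, warn = FALSE))
}
```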

`htc_gen_submit()`

Generates the HTCondor .sub submit file. It tells HTCondor which container to use, which executable to run, which files to transfer, what resources to request, and what output files to expect.

htc_gen_submit(
  output_file     = "analysis.sub",
  container_image = "docker://registry.doit.wisc.edu/your.netid/my-analysis:1.0.0",
  executable      = "analysis.sh",
  input_files     = c("analysis.R", "data.csv"),
  output_files    = "results.tar.gz",
  resources       = "small",
  comments        = TRUE
)

Use comments = TRUE on a first submission. The generated file includes explanations of each section, making it useful both as a working submit file and as a learning document.
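For orientation, the sketch below assembles the kind of lines such a submit file typically contains, using standard HTCondor submit commands. It is not submitr's actual output: `sketch_submit_lines()` is a hypothetical helper, the `job.log`, `job.out`, and `job.err` names are placeholders, and the resource values mirror the "small" preset.

```r
# Build an illustrative set of HTCondor submit-file lines.
sketch_submit_lines <- function(container_image, executable,
                                input_files, output_files) {
  c(
    paste("container_image =", container_image),
    paste("executable =", executable),
    # HTCondor expects comma-separated lists of files to transfer.
    paste("transfer_input_files =", paste(input_files, collapse = ", ")),
    paste("transfer_output_files =", paste(output_files, collapse = ", ")),
    "log = job.log",
    "output = job.out",
    "error = job.err",
    "request_cpus = 1",
    "request_memory = 4GB",
    "request_disk = 4GB",
    "queue"
  )
}

submit_lines <- sketch_submit_lines(
  container_image = "docker://registry.doit.wisc.edu/your.netid/my-analysis:1.0.0",
  executable      = "analysis.sh",
  input_files     = c("analysis.R", "data.csv"),
  output_files    = "results.tar.gz"
)
writeLines(submit_lines)
```

Comparing these lines with the file that `htc_gen_submit()` writes is a quick way to see which arguments map to which submit commands.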

Resource presets:

| Preset | CPUs | Memory | Disk | When to use |
| --- | --- | --- | --- | --- |
| small | 1 | 4 GB | 4 GB | first test jobs, lightweight scripts, quick summaries |
| medium | 4 | 16 GB | 15 GB | moderate analyses, multiple input files, model fitting |
| large | 8 | 64 GB | 32 GB | memory-intensive work, large datasets, parallel computation |

Start with "small" for a first test regardless of what your eventual job will need. The HTCondor log file reports actual resource usage after each run, which is the best guide for tuning future submissions. Requesting too little causes jobs to fail; requesting much more than you need makes jobs harder to match with available resources. The log is the ground truth.
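The presets can be read as a lookup from a name to three HTCondor resource requests. The sketch below encodes the table above in a data frame; `preset_requests()` is a hypothetical helper written for illustration, and how submitr stores the presets internally is not shown here.

```r
# Preset values copied from the resource table.
presets <- data.frame(
  preset    = c("small", "medium", "large"),
  cpus      = c(1L, 4L, 8L),
  memory_gb = c(4L, 16L, 64L),
  disk_gb   = c(4L, 15L, 32L)
)

# Translate a preset name into HTCondor request_* lines.
preset_requests <- function(name) {
  row <- presets[presets$preset == name, ]
  if (nrow(row) != 1) {
    stop("Unknown preset: ", name)
  }
  c(
    sprintf("request_cpus = %d", row$cpus),
    sprintf("request_memory = %dGB", row$memory_gb),
    sprintf("request_disk = %dGB", row$disk_gb)
  )
}

preset_requests("medium")
```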

`htc_gen_executable()`

Generates the .sh script that HTCondor runs inside the container. The generated script creates the results directory, runs your R script with Rscript, and archives the results as a .tar.gz file.

htc_gen_executable(
  r_script       = "analysis.R",
  output_file    = "analysis.sh",
  results_folder = "results",
  comments       = TRUE
)
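The three steps described above (create the results directory, run the script, archive the results) correspond to a short shell script. As a rough sketch, and not submitr's actual template, the lines could be assembled like this; `sketch_executable_lines()` is a hypothetical helper, and the archive name is assumed to be the results folder name plus `.tar.gz`.

```r
# Build an illustrative executable script as a character vector.
sketch_executable_lines <- function(r_script, results_folder) {
  archive <- paste0(results_folder, ".tar.gz")
  c(
    "#!/bin/bash",
    "set -e",                                    # stop at the first failing command
    paste("mkdir -p", results_folder),           # create the results directory
    paste("Rscript", r_script),                  # run the analysis
    paste("tar -czf", archive, results_folder)   # archive the results
  )
}

script_lines <- sketch_executable_lines("analysis.R", "results")
writeLines(script_lines)
```

The archive name matters because it has to match the `output_files` value in the submit file, otherwise HTCondor has nothing to transfer back.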

`htc_upload()`

Copies files to the CHTC submit node via scp. Use dry_run = TRUE to preview the command before running it.

# Preview first
htc_upload(
  files   = c("analysis.sub", "analysis.sh", "analysis.R", "data.csv"),
  config  = cfg,
  dry_run = TRUE
)
#> ✔ Dry run -- command that would be executed:
#>   `scp analysis.sub analysis.sh analysis.R data.csv your.netid@ap2002.chtc.wisc.edu:~/`

# Then upload
htc_upload(
  files  = c("analysis.sub", "analysis.sh", "analysis.R", "data.csv"),
  config = cfg
)
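The dry-run output above is an ordinary `scp` command line. A minimal sketch of how such a command string can be assembled follows; `build_scp_command()` is a hypothetical helper, not a submitr function, and it assumes file names without spaces or shell metacharacters.

```r
# Assemble an scp command that copies local files to a remote directory.
build_scp_command <- function(files, user, host, remote_dir = "~/") {
  target <- paste0(user, "@", host, ":", remote_dir)
  paste("scp", paste(files, collapse = " "), target)
}

cmd <- build_scp_command(
  files = c("analysis.sub", "analysis.sh", "analysis.R", "data.csv"),
  user  = "your.netid",
  host  = "ap2002.chtc.wisc.edu"
)
cmd
```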

`htc_submit()`

Runs condor_submit on the remote server via SSH and returns the cluster ID.

cluster_id <- htc_submit(
  submit_file = "analysis.sub",
  config      = cfg,
  verbose     = TRUE
)
#> Submitting "analysis.sub" on "ap2002.chtc.wisc.edu"...
#> 1 job(s) submitted to cluster 6302860.
#> ✔ Job submitted successfully.
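The cluster ID appears in the `condor_submit` message shown above. If you ever need to recover it from saved output, a regular expression is enough. This is a sketch under the assumption that the message keeps the "submitted to cluster N" wording; `parse_cluster_id()` is a hypothetical helper, and the type that `htc_submit()` itself returns may differ.

```r
# Extract the first cluster ID from condor_submit output lines.
parse_cluster_id <- function(condor_output) {
  pattern <- "submitted to cluster ([0-9]+)"
  hits <- regmatches(condor_output, regexec(pattern, condor_output))
  # Each match holds the full text and the captured digits; non-matches are empty.
  ids <- vapply(
    hits,
    function(m) if (length(m) == 2) m[2] else NA_character_,
    character(1)
  )
  ids <- ids[!is.na(ids)]
  if (length(ids) == 0) {
    stop("No cluster ID found in condor_submit output.")
  }
  as.integer(ids[[1]])
}

parse_cluster_id(c(
  "Submitting job(s).",
  "1 job(s) submitted to cluster 6302860."
))
```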

`htc_status()`

Runs condor_q on the remote server. Use watch = TRUE to poll until all jobs in the cluster leave the queue.

# One-shot check
htc_status(cluster_id = cluster_id)

# Watch until complete
htc_status(cluster_id = cluster_id, watch = TRUE)
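Conceptually, watching a job is a polling loop: check, wait, and check again until the queue is empty. The sketch below shows that pattern in isolation. `poll_until()` and the stand-in check function are hypothetical; the real `htc_status()` runs `condor_q` over SSH, and its polling interval is not documented here.

```r
# Call `check()` repeatedly until it returns TRUE; return the number of checks.
poll_until <- function(check, interval = 30, max_checks = 120) {
  for (i in seq_len(max_checks)) {
    if (isTRUE(check())) {
      return(i)
    }
    Sys.sleep(interval)
  }
  stop("Gave up after ", max_checks, " checks.")
}

# Stand-in for a real queue check: reports an empty queue on the third call.
calls <- 0
fake_queue_is_empty <- function() {
  calls <<- calls + 1
  calls >= 3
}

checks_needed <- poll_until(fake_queue_is_empty, interval = 0)
checks_needed
```

The `max_checks` cap keeps the loop from running forever if a job is held and never leaves the queue.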

`htc_download()`

Copies files back from the submit node via scp. Supports glob patterns.

# Download results
htc_download(files = "*.tar.gz", local_path = "results/")

# Download logs
htc_download(files = c("job.log", "job.err"), local_path = "logs/")
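Downloaded results arrive as a `.tar.gz` archive, so the last step is usually to unpack it. The sketch below is self-contained for illustration: it first builds a small stand-in archive in a temporary directory (in practice this would be the file `htc_download()` retrieved) and then extracts it with `utils::untar()`.

```r
# Build a stand-in results archive in a temporary working directory.
work <- tempfile("demo_")
dir.create(file.path(work, "results"), recursive = TRUE)
writeLines(c("species,n", "oak,3"), file.path(work, "results", "summary.csv"))

old_wd <- setwd(work)
utils::tar("results.tar.gz", files = "results", compression = "gzip")
setwd(old_wd)

# Unpack the archive into a separate directory and list what came back.
unpacked <- file.path(work, "unpacked")
utils::untar(file.path(work, "results.tar.gz"), exdir = unpacked)
extracted <- list.files(unpacked, recursive = TRUE)
extracted
```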

Scaling to many jobs

Once a single job works, scaling up is mostly a matter of changing the queue. Use toolero::write_by_group() to split your dataset and produce a manifest, then switch to multiple-job mode:

htc_gen_submit(
  output_file     = "analysis.sub",
  container_image = "docker://registry.doit.wisc.edu/your.netid/my-analysis:1.0.0",
  executable      = "analysis.sh",
  input_files     = "analysis.R",
  mode            = "multiple",
  queue_from      = "data/jobs/manifest.csv",
  resources       = "medium",
  comments        = TRUE
)

htc_gen_executable(
  r_script       = "analysis.R",
  output_file    = "analysis.sh",
  results_folder = "results",
  mode           = "multiple",
  comments       = TRUE
)
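To picture what the split-and-manifest step produces, here is a base R sketch. It is not `toolero::write_by_group()`: the file naming and the one-column manifest layout are assumptions made for illustration, so check the toolero documentation for the real format.

```r
# Toy dataset with one grouping column.
observations <- data.frame(
  species = c("oak", "oak", "maple", "pine", "pine"),
  height  = c(12.1, 10.4, 8.7, 15.2, 14.8)
)

jobs_dir <- file.path(tempdir(), "jobs")
dir.create(jobs_dir, showWarnings = FALSE, recursive = TRUE)

# Write one CSV per group and keep track of each file name.
groups <- split(observations, observations$species)
subset_files <- vapply(names(groups), function(g) {
  file_name <- paste0(g, ".csv")
  write.csv(groups[[g]], file.path(jobs_dir, file_name), row.names = FALSE)
  file_name
}, character(1))

# Hypothetical manifest layout: one subset file name per row.
manifest <- data.frame(file = unname(subset_files))
write.csv(manifest, file.path(jobs_dir, "manifest.csv"), row.names = FALSE)
manifest
```

Each row of the manifest then corresponds to one job, which is what lets a single submit file fan out across every subset.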

In multiple-job mode, HTCondor passes each subset filename to your R script as a positional argument. Your script should read that argument explicitly:

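# Read the subset filename passed by HTCondor as the first argument
args       <- commandArgs(trailingOnly = TRUE)
if (length(args) < 1) stop("Expected the input file name as the first argument.")
input_file <- args[[1]]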

data <- readr::read_csv(input_file)

Quick function reference

| Function | What it does |
| --- | --- |
| `htc_start()` | Start a session: reads config and stores it for all calls |
| `htc_config()` | Create or read `htc.cfg`, validate connection |
| `htc_gen_submit()` | Generate the HTCondor `.sub` submit file |
| `htc_gen_executable()` | Generate the `.sh` executable script |
| `htc_upload()` | Copy files to the submit node via `scp` |
| `htc_submit()` | Run `condor_submit` on the submit node |
| `htc_status()` | Check job progress via `condor_q` |
| `htc_download()` | Copy results back from the submit node |

What submitr does not do

submitr reduces friction. It does not replace understanding.

It does not decide whether your workload is appropriate for CHTC.
It does not manage input files larger than 1 GB. Those belong in CHTC’s staging area and require a different transfer pattern.
It does not validate that your container image is correct or that your analysis script will run successfully inside it. Test both locally before submitting to CHTC.
It does not replace CHTC consultation for complex workloads, custom scheduling requirements, or non-standard resource requests.
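The 1 GB guideline in the list above can be checked locally before you upload anything. The sketch below is a hypothetical helper, not part of submitr, and it treats 1 GB as 1e9 bytes, which is an assumption. The demonstration uses small temporary files and a tiny limit so that it runs quickly.

```r
# Return the paths whose size exceeds the limit (default: 1e9 bytes).
find_large_files <- function(paths, limit_bytes = 1e9) {
  sizes <- file.size(paths)   # NA for files that do not exist
  paths[!is.na(sizes) & sizes > limit_bytes]
}

# Demonstration with two temporary files and a 1000-byte limit.
small_file <- tempfile(fileext = ".csv")
big_file   <- tempfile(fileext = ".csv")
writeLines("x", small_file)
writeLines(strrep("x", 5000), big_file)

flagged <- find_large_files(c(small_file, big_file), limit_bytes = 1000)
flagged
```

In a real project you would pass your input file paths and keep the default limit; anything flagged is a candidate for the staging area rather than `htc_upload()`.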

The CHTC facilitation team is the right resource for complex workflow questions.

Learn more

The package vignette walks through a complete first submission step by step, with annotated output at each stage:

From the Notebook to the Cluster: Your First CHTC Job with submitr

submitr is part of the From the Notebook to the Cluster package family:

toolero — organize and scaffold the project, use Quarto as the source of truth, and split datasets for parallel jobs
containr — containerize the software environment
submitr — submit to CHTC and retrieve results (this package)
