From the Notebook to the Cluster: Your First CHTC Job with submitr

The promise of a first CHTC job

Many research coding projects begin in a notebook-style workflow: an RStudio project, a Quarto document, a few scripts, a folder of input files, and enough local experimentation to understand what the analysis needs to do.

That is a good place to start. A laptop is often the right place to explore data, write early code, make plots, and decide what question the analysis is actually answering. But at some point, the same local workflow can become the wrong place to keep pushing.

Maybe the analysis takes too long. Maybe you need to run the same model across hundreds of parameter combinations. Maybe you need one job per participant, county, simulation, image, genome, or bootstrap sample. Maybe you simply want a workflow that will be easier to rerun six months from now.

That is where high-throughput computing becomes useful.

The UW-Madison Center for High Throughput Computing (CHTC) gives researchers access to large pools of computing capacity. Instead of asking one computer to do everything in sequence, you can break work into independent jobs and let the HTC system run those jobs when resources are available.

submitr helps you take the final step in the From the Notebook to the Cluster workflow: sending a prepared R project to CHTC. It is designed for researchers who know R but may not yet be comfortable with HTCondor submit files, executable shell scripts, ssh, scp, or the rhythm of working on a remote submit node. It is also useful for regular CHTC users who want to reduce repetitive setup work and make job submission easier to reproduce, review, and share.

The goal is not to hide CHTC from you. The goal is to make the standard path visible, repeatable, and less fragile.

Before you submit anything

A successful CHTC submission starts before condor_submit.

Before using submitr, confirm that:

your R script runs with Rscript analysis.R outside RStudio;
your container image is pushed to a registry CHTC can access;
you have SSH access to a CHTC submit node such as ap2002.chtc.wisc.edu.

Set up SSH connection reuse now, before anything else. Every submitr function that touches CHTC opens an SSH connection, which can trigger a Duo MFA prompt. ControlMaster caches your authenticated session so all subsequent calls – uploads, submits, status checks, downloads – reuse the same connection without prompting again. The setup takes two minutes and is worth doing before your first htc_config() call. Full instructions appear right after Step 1.

For a first submission, choose something small and intentionally boring. The goal is not to prove that your full analysis can scale yet. The goal is to prove that the pathway works.

A small example analysis

Suppose your project has this shape:

my-analysis/
├── analysis.R
├── data.csv
├── renv.lock
└── results/

Your analysis.R script might look like this:

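library(readr)
library(dplyr)

# Read the input file that is transferred alongside the script
input <- read_csv("data.csv")

# One row per group: mean of `value` and number of observations
# (named group_summary so it does not mask base::summary)
group_summary <- input |>
  group_by(group) |>
  summarise(
    mean_value = mean(value, na.rm = TRUE),
    n          = dplyr::n(),
    .groups    = "drop"
  )

# Create the output folder if it is missing, then write the result
if (!dir.exists("results")) dir.create("results")

write_csv(group_summary, "results/summary.csv")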

This script is deliberately modest. A first CHTC job should be easy to inspect. Once the small version works, you can scale the pattern with more confidence.
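If you do not yet have a small input file to test with, a synthetic one can stand in. The sketch below assumes only what analysis.R shows, namely a group column and a value column. The values are invented, and the file is written to a temporary folder so that a real data.csv is never overwritten.

```r
# Build a tiny synthetic data set with the two columns analysis.R expects.
# The values are invented for illustration only.
set.seed(1)
toy <- data.frame(
  group = rep(c("a", "b", "c"), each = 4),
  value = round(rnorm(12, mean = 10, sd = 2), 2)
)

# Write it to a temporary project folder rather than over a real data.csv.
toy_dir  <- tempfile("my-analysis-")
dir.create(toy_dir)
toy_path <- file.path(toy_dir, "data.csv")
write.csv(toy, toy_path, row.names = FALSE)

# Base-R version of the grouped summary, useful as a cross-check
# against results/summary.csv once the job has run.
expected <- aggregate(value ~ group, data = toy, FUN = mean)
names(expected)[2] <- "mean_value"
expected$n <- as.integer(table(toy$group)[expected$group])
expected
```

Copying the toy file into the project folder and running Rscript analysis.R there is a quick way to confirm the script works before anything is uploaded.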

Step 1: configure your CHTC connection

Load submitr and create a project-level configuration:

library(submitr)

cfg <- htc_config()

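On first use, htc_config() prompts for your NetID and submit node. It writes an htc.cfg file to the project directory so later calls can reuse the same connection information, and it displays ControlMaster setup instructions. The configuration file is deliberately project-local: different projects may need different submit nodes, paths, or connection settings.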

A later call should look something like this:

cfg <- htc_config()
#> Reading HTC config from ./htc.cfg
#> ✔ Connected to "ap2002.chtc.wisc.edu" as "your.netid".

Setting up SSH connection reuse

Before continuing, take two minutes to configure ControlMaster. Add this block to ~/.ssh/config:

Host *.chtc.wisc.edu
  ControlMaster auto
  ControlPersist 2h
  ControlPath ~/.ssh/connections/%r@%h:%p

Then create the directory used by ControlPath:

mkdir -p ~/.ssh/connections

With ControlMaster in place, all subsequent SSH connections reuse the same authenticated session. You authenticate once when the connection is first established; everything that follows – file uploads, job submission, status checks, result downloads – happens without prompting for Duo MFA again. Full documentation is at https://chtc.cs.wisc.edu/uw-research-computing/configure-ssh.

The rest of this vignette assumes ControlMaster is in place.
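To confirm the block was saved correctly, you can check the file from R. The helper below is a hypothetical convenience, not part of submitr. It is a plain text scan that looks for a Host entry mentioning chtc.wisc.edu followed by a ControlMaster setting, and it does not understand Include or Match directives.

```r
# Return TRUE if the ssh config lines contain a Host block for CHTC hosts
# that enables ControlMaster. `lines` is a character vector, for example
# the result of readLines() on an ssh config file.
has_chtc_control_master <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines) & !startsWith(lines, "#")]
  in_chtc_block <- FALSE
  for (line in lines) {
    parts <- strsplit(line, "[[:space:]=]+")[[1]]
    key   <- tolower(parts[1])
    if (key == "host") {
      # A new Host line starts a new block; note whether it targets CHTC.
      in_chtc_block <- any(grepl("chtc.wisc.edu", parts[-1], fixed = TRUE))
    } else if (in_chtc_block && key == "controlmaster") {
      return(length(parts) > 1 && tolower(parts[2]) %in% c("auto", "yes"))
    }
  }
  FALSE
}

example_config <- c(
  "Host *.chtc.wisc.edu",
  "  ControlMaster auto",
  "  ControlPersist 2h",
  "  ControlPath ~/.ssh/connections/%r@%h:%p"
)
has_chtc_control_master(example_config)
```

On your own machine the call would be has_chtc_control_master(readLines(path.expand("~/.ssh/config"))), assuming the file exists.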

Step 2: generate the submit file

The submit file is the main HTCondor instruction file. It answers the question: what should the HTC system run, and what does it need?

htc_gen_submit(
  output_file     = "analysis.sub",
  container_image = "docker://registry.doit.wisc.edu/your.netid/my-analysis:1.0.0",
  executable      = "analysis.sh",
  input_files     = c("analysis.R", "data.csv"),
  output_files    = "results.tar.gz",
  resources       = "small",
  comments        = TRUE,
  output          = "."
)

For a first submission, keep comments = TRUE. The generated file includes explanations of the main sections, making it easier to inspect, learn from, and share with a collaborator or consultant.

The resources argument uses presets. For a first test, always start with "small" regardless of what your eventual job will need:

preset	cpus	memory	disk	when to use
small	1	4 GB	4 GB	first test jobs, lightweight scripts
medium	4	16 GB	15 GB	moderate analyses, model fitting
large	8	64 GB	32 GB	memory-intensive work, large datasets

The HTCondor log file reports actual resource usage after each run. Treat that log, rather than guesswork, as the ground truth for tuning future submissions. Requesting too little can cause jobs to fail; requesting much more than you need makes jobs harder to match with available resources.
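Once a test job has reported its usage, choosing the next preset is simple arithmetic. The sketch below copies the preset table into a data frame and picks the smallest preset that covers the observed usage. The suggest_preset() helper and its 25% headroom are assumptions made for illustration; they are neither a submitr function nor a CHTC recommendation.

```r
# The preset values from the table above, held in a data frame.
# This illustrates the table; it is not submitr's internal representation.
presets <- data.frame(
  preset    = c("small", "medium", "large"),
  cpus      = c(1, 4, 8),
  memory_gb = c(4, 16, 64),
  disk_gb   = c(4, 15, 32)
)

# Pick the smallest preset that covers observed usage plus some headroom.
# `headroom` is an assumed safety margin applied to memory and disk.
suggest_preset <- function(cpus, memory_gb, disk_gb, headroom = 1.25) {
  fits <- presets$cpus >= cpus &
    presets$memory_gb >= memory_gb * headroom &
    presets$disk_gb >= disk_gb * headroom
  if (!any(fits)) return(NA_character_)  # nothing fits: ask for custom values
  presets$preset[which(fits)[1]]
}

suggest_preset(cpus = 1, memory_gb = 2.5, disk_gb = 1)  # "small"
suggest_preset(cpus = 2, memory_gb = 10,  disk_gb = 6)  # "medium"
```

A result of NA means the job is larger than any preset, which is a good moment to talk to CHTC staff before submitting.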

Step 3: generate the executable script

The executable script answers a different question: once the job starts, what commands should run?

htc_gen_executable(
  r_script       = "analysis.R",
  output_file    = "analysis.sh",
  results_folder = "results",
  comments       = TRUE
)

The generated script handles a standard sequence: create the results folder, run the R script with Rscript, and archive the results as results.tar.gz. That sequence is not complicated, but it is exactly the kind of glue code that can become a barrier for researchers who are new to shell scripts. submitr writes the standard version so you can focus on the analysis.
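The same three steps can be rehearsed locally in R before anything is uploaded. The sketch below is not the shell script submitr generates. run_and_archive() is a hypothetical helper that creates the results folder, runs the script in a fresh Rscript process, and archives the folder with utils::tar().

```r
# Rehearse the executable's sequence locally: create the results folder,
# run the script with Rscript, then archive the folder.
run_and_archive <- function(project_dir, r_script = "analysis.R",
                            results_folder = "results",
                            archive = "results.tar.gz") {
  old <- setwd(project_dir)
  on.exit(setwd(old), add = TRUE)

  dir.create(results_folder, showWarnings = FALSE)

  # --vanilla skips saved workspaces and profiles, which is closer to
  # what a clean execute node will see.
  rscript <- file.path(R.home("bin"), "Rscript")
  status  <- system2(rscript, c("--vanilla", shQuote(r_script)))
  if (status != 0) stop("Rscript exited with status ", status)

  # R's built-in tar implementation avoids depending on a system tar.
  utils::tar(archive, files = results_folder,
             compression = "gzip", tar = "internal")
  file.path(project_dir, archive)
}
```

Calling run_and_archive("my-analysis") should leave a results.tar.gz in the project folder. If the local rehearsal fails, the job is likely to fail on CHTC for the same reason.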

Step 4: preview and upload files

Before copying files to the submit node, do a dry run:

htc_upload(
  files   = c("analysis.sub", "analysis.sh", "analysis.R", "data.csv"),
  config  = cfg,
  dry_run = TRUE
)
#> ✔ Dry run -- command that would be executed:
#>   `scp analysis.sub analysis.sh analysis.R data.csv your.netid@ap2002.chtc.wisc.edu:~/`

A dry run is a safety habit. It lets you see the command before it changes anything on the remote system. Once the command looks right, upload the files:

htc_upload(
  files  = c("analysis.sub", "analysis.sh", "analysis.R", "data.csv"),
  config = cfg
)
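The command shown in the dry run is ordinary scp. As a way of reading it, the sketch below assembles the same string from its parts. build_scp_command() is a hypothetical helper, not a submitr function, and it does no quoting, so file names containing spaces would need extra care.

```r
# Assemble an scp command of the same shape as the dry-run output:
# scp <files...> <netid>@<host>:<remote_dir>
build_scp_command <- function(files, netid, host, remote_dir = "~/") {
  stopifnot(length(files) > 0, nzchar(netid), nzchar(host))
  paste(
    "scp",
    paste(files, collapse = " "),
    sprintf("%s@%s:%s", netid, host, remote_dir)
  )
}

cmd <- build_scp_command(
  files = c("analysis.sub", "analysis.sh", "analysis.R", "data.csv"),
  netid = "your.netid",
  host  = "ap2002.chtc.wisc.edu"
)
cmd
```

Reading the command this way makes it easier to spot a wrong NetID, host, or missing file in a dry run.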

Step 5: submit the job

cluster_id <- htc_submit(
  submit_file = "analysis.sub",
  config      = cfg,
  verbose     = TRUE
)
#> Submitting "analysis.sub" on "ap2002.chtc.wisc.edu"...
#> Submitting job(s)...
#> 1 job(s) submitted to cluster 6302860.
#> ✔ Job submitted from "~/analysis.sub" on "ap2002.chtc.wisc.edu".

The cluster ID is the handle for this submission. Store it in an object so you can check the job later without having to look it up.
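The cluster ID is the number in the "submitted to cluster" message above. htc_submit() already hands it back, so the sketch below is only for cases where you have the message text itself, for example pasted from a terminal. parse_cluster_id() is a hypothetical helper.

```r
# Extract the cluster ID from condor_submit-style output lines.
# Returns NA if no "submitted to cluster <number>" message is present.
parse_cluster_id <- function(submit_output) {
  hit <- regmatches(
    submit_output,
    regexpr("submitted to cluster [0-9]+", submit_output)
  )
  if (length(hit) == 0) return(NA_integer_)
  as.integer(sub("submitted to cluster ", "", hit[1], fixed = TRUE))
}

parse_cluster_id(c(
  "Submitting job(s)...",
  "1 job(s) submitted to cluster 6302860."
))
```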

Step 6: check progress

# One-shot status check
htc_status(cluster_id = cluster_id, config = cfg)

# Watch until the job completes
htc_status(cluster_id = cluster_id, config = cfg, watch = TRUE)

For a small test job, watch = TRUE is useful. For larger workloads, occasional one-shot checks are usually a better fit than keeping an R session occupied.

Step 7: download results

When the job is complete, download the result archive and logs:

# Results
htc_download(
  files      = "*.tar.gz",
  config     = cfg,
  local_path = "results/"
)

# Logs
htc_download(
  files      = c("job.log", "job.err"),
  config     = cfg,
  local_path = "logs/"
)

The logs are not just for failures. They record what happened when the job ran – including actual resource usage, which informs future resource requests.
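To turn the log into numbers, you can pull the resource table out of the downloaded file. The sketch below assumes the job log ends with a "Partitionable Resources" table that has Usage, Request, and Allocated columns. The sample lines are invented to show that layout, and the exact format may differ between HTCondor versions. parse_resource_usage() is a hypothetical helper.

```r
# Extract usage, request, and allocation for CPUs, disk, and memory from
# the lines of an HTCondor job log. Returns a zero-row data frame when the
# expected table is not found.
parse_resource_usage <- function(log_lines) {
  pattern <- paste0(
    "^[[:space:]]*(Cpus|Disk \\(KB\\)|Memory \\(MB\\))[[:space:]]*:",
    "[[:space:]]*([0-9.]+)[[:space:]]+([0-9.]+)[[:space:]]+([0-9.]+)"
  )
  hits <- regmatches(log_lines, regexec(pattern, log_lines))
  hits <- hits[lengths(hits) == 5]  # full match plus four captured groups
  data.frame(
    resource  = vapply(hits, `[`, character(1), 2),
    usage     = as.numeric(vapply(hits, `[`, character(1), 3)),
    request   = as.numeric(vapply(hits, `[`, character(1), 4)),
    allocated = as.numeric(vapply(hits, `[`, character(1), 5))
  )
}

# Invented sample lines in the assumed layout (not real job output).
sample_log <- c(
  "005 (6302860.000.000) 2026-05-01 10:15:42 Job terminated.",
  "\t(1) Normal termination (return value 0)",
  "\tPartitionable Resources :    Usage  Request Allocated",
  "\t   Cpus                 :        0        1         1",
  "\t   Disk (KB)            :       40  4194304   4203309",
  "\t   Memory (MB)          :      112     4096      4096"
)
parse_resource_usage(sample_log)
```

With a real log, the call would be parse_resource_usage(readLines("logs/job.log")). Comparing the usage and request columns shows whether the next submission can ask for less.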

From one test job to many HTC jobs

A first job proves that the path works. The next step is to think like an HTC user: how can the analysis be divided into many independent pieces?

Common patterns include one job per simulation replicate, model specification, input file, county, participant, sample, or parameter set. This is where toolero::write_by_group() helps upstream. It splits a data frame into separate CSV files and writes a manifest describing those files. Then submitr queues one job per row of the manifest:

htc_gen_submit(
  output_file     = "analysis.sub",
  container_image = "docker://registry.doit.wisc.edu/your.netid/my-analysis:1.0.0",
  executable      = "analysis.sh",
  input_files     = "analysis.R",
  mode            = "multiple",
  queue_from      = "data/manifest.csv",
  resources       = "medium",
  comments        = TRUE
)

htc_gen_executable(
  r_script       = "analysis.R",
  output_file    = "analysis.sh",
  results_folder = "results",
  mode           = "multiple",
  comments       = TRUE
)

In multiple-job mode, the generated executable passes the per-job input file to your R script as the first command-line argument. Your script should read that argument explicitly:

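# The per-job input file arrives as the first command-line argument
args <- commandArgs(trailingOnly = TRUE)
if (length(args) < 1) stop("Expected the per-job input file as the first argument.")
input_file <- args[[1]]

input <- readr::read_csv(input_file)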

This is a key pattern. The script stays the same; each job receives a different input.
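The upstream splitting step can be pictured in base R. The sketch below is not toolero::write_by_group(), whose arguments, file names, and manifest columns may differ. write_groups_sketch() is a hypothetical stand-in that writes one CSV per group and a manifest with one row per file.

```r
# Split a data frame into one CSV per group and write a manifest listing
# the files. Group values are used as file names, so they are assumed to
# be safe to use in a path.
write_groups_sketch <- function(df, group_col, out_dir) {
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  pieces <- split(df, df[[group_col]], drop = TRUE)
  files  <- file.path(out_dir, paste0(names(pieces), ".csv"))
  for (i in seq_along(pieces)) {
    write.csv(pieces[[i]], files[i], row.names = FALSE)
  }
  manifest <- data.frame(
    input_file = files,
    n_rows     = unname(vapply(pieces, nrow, integer(1)))
  )
  write.csv(manifest, file.path(out_dir, "manifest.csv"), row.names = FALSE)
  manifest
}

# Invented example data: three rows in two groups.
df       <- data.frame(group = c("a", "a", "b"), value = 1:3)
out_dir  <- tempfile("split-")
manifest <- write_groups_sketch(df, "group", out_dir)
manifest
```

Each manifest row corresponds to one job, and the file named in that row is the one the job's script receives as its first argument.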

Where containr fits

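CHTC needs to know what software environment your job should use. Your laptop may have the right R packages installed, but the execute node will not automatically have the same setup. A container image solves that problem by packaging the R version, packages, and system libraries needed to run the analysis. containr handles that step: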

containr::generate_dockerfile(r_version = "4.4.0", output = ".")
containr::build_image(verbose = TRUE)
imgs <- containr::list_images()
containr::push_image(
  image_id = imgs$image_id[1],
  netid    = "your.netid",
  project  = "my-analysis",
  tag      = "1.0.0"
)

Use explicit image tags such as "1.0.0" rather than "latest". A versioned tag makes it unambiguous which software environment was used for a particular analysis.
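One way to enforce that habit is to build the container_image string in one place and refuse "latest". The sketch below follows the docker://registry.doit.wisc.edu/netid/project:tag shape used in this vignette. image_reference() is a hypothetical helper, not part of containr or submitr.

```r
# Build a container_image reference and reject missing or "latest" tags.
image_reference <- function(netid, project, tag,
                            registry = "registry.doit.wisc.edu") {
  if (!nzchar(tag) || identical(tolower(tag), "latest")) {
    stop("Use an explicit version tag such as \"1.0.0\", not \"latest\".")
  }
  sprintf("docker://%s/%s/%s:%s", registry, netid, project, tag)
}

image_reference("your.netid", "my-analysis", "1.0.0")
```

The returned string can be passed as the container_image argument of htc_gen_submit().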

A practical first-submission checklist

Before scaling up, confirm that the small job works end to end:

The script runs locally with Rscript analysis.R.
ControlMaster is configured and the session is authenticated.
The container image is pushed to a registry CHTC can access.
The image tag is explicit, not "latest".
The submit file lists the correct executable and input files.
The executable script points to the correct R script and results folder.
The dry-run upload shows the expected files.
htc_config() connects to the submit node without error.
The resource request is reasonable for a test job.
The job produces logs and a result archive.

Once this works, you have something valuable: a known-good pathway from local R project to CHTC.

What submitr does not do

submitr reduces friction, but it does not remove the need to make sound research-computing decisions.

It does not:

decide whether your workload is a good fit for CHTC;
make interactive R code safe for batch execution;
guarantee that your container image contains every system dependency;
manage restricted or sensitive data;
replace CHTC documentation or consultation for complex workflows.

That boundary is intentional. Good tools should make the common path easier while still leaving the important decisions visible.

The CHTC facilitation team is the right resource for complex workflow questions.

A good first goal

Do not make your first submission your largest analysis.

Make your first goal smaller: send one boring job to CHTC, watch it run, and download one result file.

After that, the cluster becomes less mysterious. You can inspect the generated files, adjust resources, split work into many jobs, and grow the workflow with more confidence. That is the role of submitr: to help you take the first successful step from local R code to high-throughput research computing.

The larger idea: make the right choice easy

submitr is part of a small family of R packages for research computing workflows:

local R project
  └─ toolero: organize the project and prepare job-sized inputs
      └─ containr: capture the R software environment in a container image
          └─ submitr: send the containerized job to CHTC

You can use each package on its own.

Use toolero if you want a better project skeleton, cleaner data-loading habits, Quarto scaffolding, or a simple way to split a dataset into many job-sized files.

Use containr if you already have a project with an renv.lock file and want to build a container image that can run somewhere other than your laptop.

Use submitr if your project is already organized and containerized, and you are ready to submit it to CHTC.

Used together, the packages support a practical arc: start with a project that is easier to understand, make its software environment portable, then send it to CHTC with fewer command-line hurdles.

What submitr does

A CHTC job needs a few pieces of information:

what code to run;
what input files to transfer;
what container image to use;
how much CPU, memory, and disk to request;
what output files to retrieve;
how many jobs to queue.

In HTCondor, that information is split across two main files.

The submit file tells HTCondor how to run the job. It describes the executable script, container image, input files, output files, log files, resource requests, and queue instructions.

The executable script tells the job what to do after it starts. For an R analysis, that usually means creating an output folder, running Rscript, and packaging results.

submitr helps you generate those files and then use them:

submitr::htc_config()         # configure your submit-node connection
submitr::htc_gen_submit()     # generate the HTCondor submit file
submitr::htc_gen_executable() # generate the executable shell script
submitr::htc_upload()         # copy files to the submit node
submitr::htc_submit()         # submit the job
submitr::htc_status()         # check progress
submitr::htc_download()       # copy results back
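To make the submit file less abstract, the sketch below writes out the kind of lines such a file typically contains, using standard HTCondor submit commands. It is not the file htc_gen_submit() produces, which may use different commands, names, and comments. sketch_submit_file() and the job.out file name are assumptions made for illustration.

```r
# Return the lines of a minimal, illustrative HTCondor submit file.
sketch_submit_file <- function(image, executable, inputs, outputs,
                               cpus = 1, memory_gb = 4, disk_gb = 4) {
  c(
    "universe                = container",
    paste("container_image         =", image),
    paste("executable              =", executable),
    paste("transfer_input_files    =", paste(inputs, collapse = ", ")),
    paste("transfer_output_files   =", paste(outputs, collapse = ", ")),
    "log                     = job.log",
    "output                  = job.out",  # assumed name for stdout
    "error                   = job.err",
    paste("request_cpus            =", cpus),
    paste0("request_memory          = ", memory_gb, "GB"),
    paste0("request_disk            = ", disk_gb, "GB"),
    "queue"
  )
}

submit_lines <- sketch_submit_file(
  image      = "docker://registry.doit.wisc.edu/your.netid/my-analysis:1.0.0",
  executable = "analysis.sh",
  inputs     = c("analysis.R", "data.csv"),
  outputs    = "results.tar.gz"
)
writeLines(submit_lines)
```

Each item in the list of information a job needs maps to one or two of these lines, with queue controlling how many jobs are created.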

Before you submit anything

A successful CHTC submission starts before condor_submit.

Before using submitr, make sure you have:

an R script that can run with Rscript;
the input files needed by that script;
an renv.lock file or another clear record of package dependencies;
a container image available from a registry CHTC can access;
SSH access to a CHTC submit node, such as ap2002.chtc.wisc.edu.

The most important check is simple: your analysis should run outside RStudio.

Rscript analysis.R

If that command fails locally, the same analysis is likely to fail on CHTC. Fix that first. CHTC will not know about objects in your Global Environment, local RStudio settings, manually clicked files, or packages that happen to be installed on your laptop.

Created 2026-04-30 | Last updated 2026-05-16
