The problem with starting from scratch

Every research coding project begins with a blank slate and a set of early decisions: where to put the data, how to name the scripts, whether to track dependencies, and whether to use version control. These decisions feel low-stakes in the moment. They rarely are. The cost usually appears later, when the project needs to be shared, reviewed, rerun, containerized, or moved to a larger computing system.

A project that starts with a flat folder, no dependency tracking, and scripts that mix data loading, cleaning, modeling, and reporting can be rescued later, but the rescue is hard. Collaborators cannot reproduce results because the package versions are unknown. The analysis breaks when moved to a different machine. The manuscript references outputs that no longer exist in the file system.

toolero is a small, opinionated set of tools designed to make good research workflow decisions easier to adopt. It does not impose a rigid framework. It provides practical defaults for common research projects and gets out of the way when you need to customize.

If you are new to research computing, toolero gives you a solid starting point without requiring you to know in advance why each piece matters. If you are experienced, it automates the setup work you would otherwise do by hand at the start of every project.


When to use toolero

Use toolero when you are:

  • starting a new research coding project;
  • teaching students or collaborators a reproducible project structure;
  • preparing an analysis that may later need to run outside your laptop;
  • using Quarto as the source of truth for an analysis;
  • reading and cleaning tabular data files at the start of a workflow;
  • splitting data into independent pieces for parallel or high-throughput workflows;
  • standardizing setup across multiple projects;
  • publishing technical documentation that should stay synchronized with its source.

toolero is useful on its own. You do not need to containerize your project or submit work to a cluster to benefit from better project structure, cleaner inputs, literate analysis documents, and repeatable workflows.


The toolero family

toolero is also the first step in a three-package family for reproducible research workflows, from local project setup to containerization and high-throughput computing submission:

toolero     organize, scaffold, split
  └─ containr   freeze the software environment in a container
       └─ submitr    send the analysis to CHTC and retrieve results

Each package is useful on its own. Together, they form a path from a new local R project to a containerized analysis that can run on high-throughput computing infrastructure.

You can adopt these packages one at a time. toolero does not require containr, and containr does not require submitr. The family exists so that each step prepares cleanly for the next when your project is ready to scale.


Installation

Install from CRAN:

install.packages("toolero")

Install the development version from GitHub:

# install.packages("pak")
pak::pak("erwinlares/toolero")

A first workflow

The functions below cover a common path from project creation to analysis-ready data. This example uses a temporary directory so you can try the workflow without writing to your Documents folder.

library(toolero)

project_dir <- file.path(tempdir(), "my-analysis")

# 1. Create a project with sensible defaults
init_project(path = project_dir)

# 2. Audit the project structure
check_project(path = project_dir)

# 3. Scaffold a reproducible Quarto analysis document
create_qmd(path = project_dir, filename = "analysis.qmd")

# 4. Extract the R code from the document into a standalone script
qmd_to_r(
  input  = file.path(project_dir, "analysis.qmd"),
  output = file.path(project_dir, "R", "analysis.R")
)

# 5. Read and clean a CSV file (input.csv is a placeholder for your own raw data)
data <- read_clean_csv(
  file.path(project_dir, "data-raw", "input.csv"),
  na      = c("", "NA", "N/A", "."),
  summary = TRUE
)

# 6. Write the cleaned data
write_clean_csv(data, file.path(project_dir, "data", "clean.csv"))

# 7. Split data into per-group subsets for parallel processing
write_by_group(
  data,
  group_col  = "species",
  output_dir = file.path(project_dir, "data", "jobs"),
  manifest   = TRUE
)

In a real project, replace project_dir with the path where you want the project to live. The important idea is that toolero helps you start with a structure that can grow: local analysis first, reproducible execution later, and scalable computing when needed.


Core workflow functions

init_project()

Creates a new R project with a standard folder structure suited for research workflows. By default it also initializes renv for dependency management and git for version control. Both are optional, but both are on by default because both matter.

The default structure includes data/, data-raw/, R/, scripts/, plots/, images/, results/, and docs/. Extra folders can be added without disrupting the defaults.

# Standard project
init_project(path = "~/Documents/my-project")

# With additional folders
init_project(
  path          = "~/Documents/my-project",
  extra_folders = c("notebooks", "presentations")
)

The renv lockfile that init_project() creates is also what containr::generate_dockerfile() reads to containerize the project later. Starting with init_project() means that step is already prepared, even if you never need it.
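
To make the default layout concrete, the base R sketch below builds the same folder names by hand in a temporary directory. It is only an illustration: scaffold_folders() is a hypothetical helper, not a toolero function, and it does not set up renv, git, or an .Rproj file the way init_project() does.

```r
# Hypothetical helper (not a toolero function): create the default folders
# listed above, plus any extras, under a project root.
scaffold_folders <- function(path, extra_folders = character()) {
  default_folders <- c("data", "data-raw", "R", "scripts",
                       "plots", "images", "results", "docs")
  folders <- unique(c(default_folders, extra_folders))
  for (folder in folders) {
    dir.create(file.path(path, folder), recursive = TRUE, showWarnings = FALSE)
  }
  invisible(folders)
}

# Try it in a throwaway location
demo_root <- file.path(tempdir(), "scaffold-demo")
created   <- scaffold_folders(demo_root, extra_folders = "notebooks")
sort(list.files(demo_root))
```

Extras are appended to the defaults rather than replacing them, which mirrors the behaviour described for extra_folders.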


check_project()

Audits an existing project directory and reports whether it follows toolero conventions. Useful both for projects initialized with init_project() and for any existing R project you want to evaluate.

The report checks for the expected folder structure, an .Rproj file, renv.lock, a git repository, a README, and a .gitignore. It also notes the presence of hidden files like .RData and .Rhistory that are common sources of reproducibility problems.

# Audit the current project
check_project()

# Return results as a tibble for programmatic use
issues <- check_project(error = FALSE)
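
As a rough picture of what such an audit involves, the base R sketch below checks a directory for some of the items named above and returns a small data frame. audit_folder() is a hypothetical helper written for this example. It is not how check_project() is implemented, and its output format is an assumption.

```r
# Hypothetical helper (not a toolero function): report which of the items
# named above are present in a project directory.
audit_folder <- function(path = ".") {
  all_files <- list.files(path, all.files = TRUE)
  data.frame(
    item = c(".Rproj file", "renv.lock", "git repository", "README",
             ".gitignore", ".RData", ".Rhistory"),
    present = c(
      any(grepl("\\.Rproj$", all_files)),
      file.exists(file.path(path, "renv.lock")),
      dir.exists(file.path(path, ".git")),
      any(grepl("^README", all_files, ignore.case = TRUE)),
      file.exists(file.path(path, ".gitignore")),
      file.exists(file.path(path, ".RData")),
      file.exists(file.path(path, ".Rhistory"))
    ),
    stringsAsFactors = FALSE
  )
}

# A throwaway directory with a README and a stray .Rhistory
audit_dir <- file.path(tempdir(), "audit-demo")
dir.create(audit_dir, showWarnings = FALSE)
invisible(file.create(file.path(audit_dir, "README.md")))
invisible(file.create(file.path(audit_dir, ".Rhistory")))

audit <- audit_folder(audit_dir)
audit
```

For the last two rows, present = TRUE is the warning sign: .RData and .Rhistory are the hidden files the report flags, not items a project should have.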

create_qmd()

Scaffolds a new Quarto document from a reproducible template with optional sample data, custom styling, YAML pre-population, and a post-render hook that automatically extracts R code from the rendered document into a companion .R file.

The function has two main motivations. First, it reduces repetitive setup work. If you regularly create Quarto documents with the same author information, institutional metadata, or preferred format settings, the yaml_data argument lets you pre-populate the YAML header from a personal configuration file instead of rebuilding the same header by hand.

Second, it helps reduce code drift. In a literate programming workflow, the .qmd document can serve as the source of truth: prose, code, results, and interpretation live together. The post-render hook derives the standalone .R script from the document automatically, so you do not have to maintain a separate script by hand. This pattern is discussed in more detail in the post From the Notebook to the Cluster. Part 1: Start with the Document.

Arguments:

  • filename – name of the .qmd file. Must be supplied explicitly.
  • path – directory where the document is created. Defaults to ".".
  • yaml_data – path to a YAML file for pre-populating the header.
  • overwrite – whether to overwrite existing files. Defaults to FALSE.
  • use_purl – if TRUE (default), scaffolds _quarto.yml and R/purl.R.
  • include_examples – if TRUE (default), copies a sample dataset into data-raw/, a placeholder logo into assets/, and uses a worked example template. If FALSE, creates a blank skeleton.
  • use_style – controls custom styling. FALSE (default) produces plain Quarto output. TRUE scans assets/ for .css and .html files and wires them into the YAML. A directory path scans that directory instead.

# Blank skeleton -- no examples, no styling, no purl hook
create_qmd(path = "my-project", filename = "analysis.qmd",
           include_examples = FALSE, use_purl = FALSE)

# Full worked example with sample data and placeholder logo (default)
create_qmd(path = "my-project", filename = "analysis.qmd")

# Blank document wired to branding assets in assets/
create_qmd(path = "my-project", filename = "report.qmd",
           include_examples = FALSE, use_style = TRUE)

# Blank document with custom branding from another directory
create_qmd(path = "my-project", filename = "report.qmd",
           include_examples = FALSE, use_style = "my-branding/")

# Pre-populate YAML from a personal config file
create_qmd(path = "my-project", filename = "analysis.qmd",
           yaml_data = "my-config.yml")

qmd_to_r()

Extracts R code chunks from any .qmd file into a standalone .R script. This is the direct counterpart to the purl hook in create_qmd() — it works on any Quarto document regardless of how it was created.

The output path defaults to the same directory as the input with the .qmd extension replaced by .R. The documentation argument controls how much context is preserved in the extracted script: chunk labels only (1, the default), full roxygen blocks (2), or pure code with no comments (0).

# Default output: same directory, .R extension
qmd_to_r(input = "analysis.qmd")

# Explicit output path
qmd_to_r(
  input  = "analysis.qmd",
  output = "scripts/analysis.R"
)
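
The idea behind the extraction can be sketched in a few lines of base R: walk the document, keep what sits between an opening R chunk fence and its closing fence, and skip everything else. extract_r_chunks() is a hypothetical helper for illustration only. It is not the implementation of qmd_to_r(), and it ignores chunk labels and the documentation levels described above.

```r
# Hypothetical helper (not toolero's implementation): collect the contents
# of R chunks from the lines of a Quarto document.
fence <- strrep("`", 3)

extract_r_chunks <- function(lines) {
  open_pattern  <- paste0("^", fence, "\\{r[ ,}]")
  close_pattern <- paste0("^", fence, "\\s*$")
  in_chunk <- FALSE
  code <- character()
  for (line in lines) {
    if (!in_chunk && grepl(open_pattern, line)) {
      in_chunk <- TRUE             # opening fence of an R chunk
    } else if (in_chunk && grepl(close_pattern, line)) {
      in_chunk <- FALSE            # closing fence
      code <- c(code, "")          # blank line between chunks
    } else if (in_chunk) {
      code <- c(code, line)
    }
  }
  code
}

# A tiny in-memory document with two R chunks and one non-R chunk
qmd_lines <- c(
  "---", "title: Demo", "---", "",
  "Some prose.", "",
  paste0(fence, "{r}"), "x <- 1 + 1", fence, "",
  paste0(fence, "{python}"), "print('not R')", fence, "",
  paste0(fence, "{r}"), "y <- x * 2", fence
)

extracted <- extract_r_chunks(qmd_lines)
writeLines(extracted)
```

The prose, the YAML header, and the non-R chunk are dropped, so the result can be written to a .R file and run on its own.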

read_clean_csv()

Reads a CSV file into a tibble and cleans the column names in one step. Column names become lowercase, spaces become underscores, and special characters are removed. Beyond name cleaning, the function supports explicit missing-value handling, selective row dropping, and an optional ingest summary that surfaces common data problems immediately.

# Basic usage
data <- read_clean_csv("data-raw/input.csv")

# Explicit missing-value codes and ingest summary
data <- read_clean_csv(
  "data-raw/input.csv",
  na      = c("", "NA", "N/A", ".", "-999", "unknown"),
  summary = TRUE
)

# Drop rows missing in specific columns
data <- read_clean_csv(
  "data-raw/input.csv",
  drop_na = c("participant_id", "response_score")
)
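
For readers who want to see the cleaning rules spelled out, here is a rough base R approximation using an in-memory CSV. simple_clean_names() is a hypothetical helper, and the column names and values are invented for the example. The package's own name cleaning may handle more cases than this sketch does.

```r
# Hypothetical helper: a rough approximation of the renaming rule
# (lowercase, spaces to underscores, special characters removed).
simple_clean_names <- function(x) {
  x <- tolower(trimws(x))
  x <- gsub("[[:space:]]+", "_", x)   # spaces become underscores
  gsub("[^a-z0-9_]", "", x)           # drop remaining special characters
}

# Invented example data with messy headers and several missing-value codes
csv_text <- c(
  "Participant ID,Response Score,Site (2024)",
  "p01,85,north",
  "p02,-999,unknown",
  "p03,N/A,south"
)

raw <- read.csv(
  text             = paste(csv_text, collapse = "\n"),
  na.strings       = c("", "NA", "N/A", ".", "-999", "unknown"),
  check.names      = FALSE,
  stringsAsFactors = FALSE
)
names(raw) <- simple_clean_names(names(raw))

# Keep only rows that have a value in the column of interest
kept <- raw[!is.na(raw$response_score), , drop = FALSE]
```

Declaring the missing-value codes at read time means -999 and "unknown" never reach the analysis as if they were real values.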

write_clean_csv()

Writes a cleaned data frame to a CSV file with cli feedback. The natural counterpart to read_clean_csv(), reinforcing the convention that data-raw/ holds original inputs and data/ holds analysis-ready outputs.

If the data frame’s column names are not already clean, write_clean_csv() applies janitor::clean_names() before writing and warns you about the affected columns, so the output file always has consistent names regardless of what was passed in.

data <- read_clean_csv("data-raw/input.csv")

write_clean_csv(data, "data/clean.csv")

# Overwrite an existing file
write_clean_csv(data, "data/clean.csv", overwrite = TRUE)

detect_execution_context()

Identifies which of three environments the code is currently running in — an interactive R session, a quarto render call, or a plain Rscript invocation — and returns "interactive", "quarto", or "rscript". Useful for writing code that resolves input file paths correctly across all three contexts without maintaining separate versions.

context <- detect_execution_context()

input_file <- switch(context,
  interactive = "data/sample.csv",
  quarto      = params$input_file,
  rscript     = commandArgs(trailingOnly = TRUE)[1]
)
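
One way such a check could work is sketched below. This is an assumption about the general approach, not toolero's implementation: guess_context() is a hypothetical helper that treats knitr's "knitr.in.progress" option (set to TRUE while knitr is knitting a document) as the sign of a render, and falls back to interactive() otherwise.

```r
# Hypothetical helper (not toolero's implementation). Assumption: a render is
# recognised through the "knitr.in.progress" option that knitr sets while
# it is knitting a document.
guess_context <- function() {
  if (isTRUE(getOption("knitr.in.progress"))) {
    "quarto"
  } else if (interactive()) {
    "interactive"
  } else {
    "rscript"
  }
}

current <- guess_context()

# Simulate a render by setting the option temporarily, then restore it
old_opts      <- options(knitr.in.progress = TRUE)
during_render <- guess_context()
options(old_opts)
```

The render check comes first so that a document rendered from inside an interactive session is still treated as a render.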

write_by_group()

Splits a data frame by a grouping column and writes each group to a separate CSV file. Filenames are derived from sanitized group values — lowercase, with spaces and special characters replaced by dashes. Optionally writes a manifest.csv listing all output files, group values, and row counts.

This is useful any time a project needs independent input files: one file per county, participant, simulation parameter, model specification, or study site. For high-throughput workflows, the manifest is the input to submitr::htc_gen_submit() in multiple-job mode.

sample_path <- system.file("templates", "sample.csv", package = "toolero")
penguins    <- read_clean_csv(sample_path)

write_by_group(
  penguins,
  group_col  = "species",
  output_dir = "data/jobs",
  manifest   = TRUE
)
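
The splitting pattern itself is simple enough to sketch in base R. The example below is an approximation for illustration: sanitize_group() is a hypothetical helper, the toy data is invented, and the manifest column names (file, group, rows) are assumptions rather than the layout write_by_group() actually produces.

```r
# Hypothetical helper: turn a group value into a safe file stem.
# An approximation of the rule described above, not the package's code.
sanitize_group <- function(x) {
  x <- tolower(trimws(as.character(x)))
  x <- gsub("[^a-z0-9]+", "-", x)   # spaces and special characters -> dash
  gsub("^-+|-+$", "", x)            # trim leading and trailing dashes
}

# Invented toy data
toy <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo", "Chinstrap (juv.)"),
  mass_g  = c(3750, 3800, 5000, 3500),
  stringsAsFactors = FALSE
)

out_dir <- file.path(tempdir(), "jobs-demo")
dir.create(out_dir, showWarnings = FALSE)

# One data frame per group, plus a manifest describing each output file
groups <- split(toy, toy$species)
manifest <- data.frame(
  file  = paste0(sanitize_group(names(groups)), ".csv"),
  group = names(groups),
  rows  = vapply(groups, nrow, integer(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)

for (i in seq_along(groups)) {
  write.csv(groups[[i]], file.path(out_dir, manifest$file[i]), row.names = FALSE)
}
write.csv(manifest, file.path(out_dir, "manifest.csv"), row.names = FALSE)
```

Because each file is independent and the manifest lists them all, a downstream job runner can process one row of the manifest per job.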

Documentation and communication utilities

generate_kb_xml()

Produces a UW-Madison Knowledge Base importable XML file from a rendered Quarto document. Write and maintain the guide in Quarto, then generate the KB-ready XML from that source. The Quarto document remains the maintained version and the XML becomes a derived artifact, reducing documentation drift.

generate_kb_xml(
  html_path  = "docs/analysis.html",
  output_dir = "exports"
)

When importing the resulting XML into the KB, check the Decode HTML entity in body content option.


arborize()

Renders a syntactic tree as a standalone PNG image using Quarto’s Typst engine. Accepts bracket notation for simple trees or structured notation for trees requiring movement arrows and per-node styling. A provenance .yaml file is written alongside the PNG by default, recording the tree string and render settings so the image can be reproduced or modified later.

# Simple bracket notation
arborize(
  "[NP [Det the] [N cat]]",
  output    = "figures/np-tree.png",
  papersize = "a6"
)

The papersize argument controls how tightly the image is cropped around the tree. Use "a6" or "a7" for small trees, "a5" (the default) for medium trees, and "a4" or "a3" for wide or deep trees. Requires Quarto 1.4+ with Typst support and the pdftools package.


Dependencies

toolero builds on a focused set of R packages for project setup, file handling, data import, documentation, and workflow automation:

cli, fs, glue, janitor, purrr, readr, renv, tibble, tidyr, usethis,
yaml, rlang, rvest, xml2, quarto, withr, lifecycle

toolero is the first step in a family of packages for reproducible research workflows:

  • toolero — organize and scaffold research projects
  • containr — containerize an R project
  • submitr — submit containerized R jobs to CHTC and retrieve results

Each package can be used independently. The shared design goal is to make good research-computing practices easier to adopt before a project becomes difficult to change.


Citation

citation("toolero")

License

MIT © Erwin Lares