Elaborating a core feature of Catlab.jl, we argue that attributed Csets, or “acsets” for short, offer a fundamental data structure for technical computing that generalizes both data frames and graphs, as well as more elaborate structures such as wiring diagrams and Petri nets.
Publications

Categorical data structures for technical computing, 2021. Evan Patterson, Owen Lynch, James Fairbanks.
Abstract
Many mathematical objects can be represented as functors from finitelypresented categories š¯–¢ to š¯–²š¯–¾š¯—¨. For instance, graphs are functors to š¯–¢ from the category with two parallel arrows. Such functors are known informally as š¯–¢sets. In this paper, we describe and implement an extension of š¯–¢sets having data attributes with fixed types, such as graphs with labeled vertices or realvalued edge weights. We call such structures "acsets," short for "attributed š¯–¢sets." Derived from previous work on algebraic databases, acsets are a joint generalization of graphs and data frames. They also encompass more elaborate graphlike objects such as wiring diagrams and Petri nets with rate constants. We develop the mathematical theory of acsets and then describe a generic implementation in the Julia programming language, which uses advanced language features to achieve performance comparable with specialized data structures.

Operadic modeling of dynamical systems: Mathematics and computation, 2021. Sophie Libkind, Andrew Baas, Evan Patterson, James Fairbanks.
Abstract
Dynamical systems are ubiquitous in science and engineering as models of phenomena that evolve over time. Although complex dynamical systems tend to have important modular structure, conventional modeling approaches suppress this structure. Building on recent work in applied category theory, we show how deterministic dynamical systems, discrete and continuous, can be composed in a hierarchical style. In mathematical terms, we reformulate some existing operads of wiring diagrams and introduce new ones, using the general formalism of Csets (copresheaves). We then establish dynamical systems as algebras of these operads. In a computational vein, we show that Euler's method is functorial for undirected systems, extending a previous result for directed systems. All the ideas in this paper are implemented as practical software using Catlab and the AlgebraicJulia ecosystem, written in the Julia programming language for scientific computing.
BibTeX
@inproceedings{libkind2021, title={Operadic modeling of dynamical systems: Mathematics and computation}, author={Libkind, Sophie and Baas, Andrew and Patterson, Evan and Fairbanks, James}, booktitle={Proceedings of the 2021 Applied Category Theory Conference}, year={2021}, eprinttype={arXiv}, eprint={2105.12282} }
Using the mathematics of operads and operad algebras, we show how to hierarchically compose deterministic dynamical systems, both discrete and continuous. The method is implemented in the Julia package AlgebraicDynamics.jl.
We will present this project as a distinguished talk at Applied Category Theory 2021.

Wiring diagrams as normal forms for computing in symmetric monoidal categories, 2020. Evan Patterson, David I. Spivak, Dmitry Vagner.
Abstract
Applications of category theory often involve symmetric monoidal categories (SMCs), in which abstract processes or operations can be composed in series and parallel. However, in 2020 there remains a dearth of computational tools for working with SMCs. We present an "unbiased" approach to implementing symmetric monoidal categories, based on an operad of directed, acyclic wiring diagrams. Because the interchange law and other laws of a SMC hold identically in a wiring diagram, no rewrite rules are needed to compare diagrams. We discuss the mathematics of the operad of wiring diagrams, as well as its implementation in the software package Catlab.
BibTeX
@inproceedings{patterson2020, title={Wiring diagrams as normal forms for computing in symmetric monoidal categories}, author={Patterson, Evan and Spivak, David I. and Vagner, Dmitry}, booktitle={Proceedings of the 2020 Applied Category Theory Conference}, year={2020}, eprinttype={arXiv}, eprint={2101.12046} }
We introduce a formalism for directed, acyclic wiring diagrams and explain how it leads to a normal form for morphism expressions in a symmetric monoidal category. We also discuss how directed wiring diagrams are implemented in Catlab.jl.
We presented this paper at Applied Category Theory 2020 (video).

The algebra and machine representation of statistical models, 2020. Evan Patterson.
Abstract
As the twin movements of open science and open source bring an ever greater share of the scientific process into the digital realm, new opportunities arise for the metascientific study of science itself, including of data science and statistics. Future science will likely see machines play an active role in processing, organizing, and perhaps even creating scientific knowledge. To make this possible, large engineering efforts must be undertaken to transform scientific artifacts into useful computational resources, and conceptual advances must be made in the organization of scientific theories, models, experiments, and data. This dissertation takes steps toward digitizing and systematizing two major artifacts of data science, statistical models and data analyses. Using tools from algebra, particularly categorical logic, a precise analogy is drawn between models in statistics and logic, enabling statistical models to be seen as models of theories, in the logical sense. Statistical theories, being algebraic structures, are amenable to machine representation and are equipped with morphisms that formalize the relations between different statistical methods. Turning from mathematics to engineering, a software system for creating machine representations of data analyses, in the form of Python or R programs, is designed and implemented. The representations aim to capture the semantics of data analyses, independent of the programming language and libraries in which they are implemented.
BibTeX
@phdthesis{patterson2020, title={The algebra and machine representation of statistical models}, author={Patterson, Evan}, school={Stanford University, Department of Statistics}, eprinttype={arXiv}, eprint={2006.08945}, year={2020} }
In my PhD thesis, I draw a precise analogy between models in logic and statistics, enabling statistical models to be seen as models of statistical theories, in the logical sense. This is accomplished using tools from algebra, particularly categorical logic. I develop the algebra of statistical theories and models and present a wide range of examples from classical statistics. In addition, I recapitulate and extend previous work on the semantic modeling of data science code.
I presented the new material at the MIT Categories Seminar (slides) and the 2020 Workshop on Categorical Probability and Statistics (slides).

Hausdorff and Wasserstein metrics on graphs and other structured data, 2019. Evan Patterson.
Abstract
Optimal transport is widely used in pure and applied mathematics to find probabilistic solutions to hard combinatorial matching problems. We extend the Wasserstein metric and other elements of optimal transport from the matching of sets to the matching of graphs and other structured data. This structurepreserving form of optimal transport relaxes the usual notion of homomorphism between structures. It applies to graphs, directed and undirected, labeled and unlabeled, and to any other structure that can be realized as a Cset for some finitely presented category C. We construct both Hausdorffstyle and Wassersteinstyle metrics on Csets and we show that the latter are convex relaxations of the former. Like the classical Wasserstein metric, the Wasserstein metric on Csets is the value of a linear program and is therefore efficiently computable.
BibTeX
@article{patterson2020, title={{Hausdorff} and {Wasserstein} metrics on graphs and other structured data}, author={Patterson, Evan}, journal={Information and Inference: A Journal of the {IMA}}, volume={iaaa025}, year={2020}, eprinttype={arXiv}, eprint={1907.00257}, doi={10.1093/imaiai/iaaa025} }
I extend ideas from optimal transport to structured data by probabilistically relaxing the notion of a homomorphism. As an application, I construct Wassersteinstyle metrics on graphs, simplicial sets, and other structures.
I talked about this project at the 2019 AMS Special Session on Applied Category Theory (slides).

String diagrams for assembly planning, 2019. Jade Master, Evan Patterson, Shahin Yousfi, Arquimedes Canedo.
Abstract
Assembly planning is a difficult problem for companies. Many disciplines such as design, planning, scheduling, and manufacturing execution need to be carefully engineered and coordinated to create successful product assembly plans. Recent research in the field of design for assembly has proposed new methodologies to design product structures in such a way that their assembly is easier. However, present assembly planning approaches lack the engineering tool support to capture all the constraints associated to assembly planning in a unified manner. This paper proposes CompositionalPlanning, a string diagram based framework for assembly planning. In the proposed framework, string diagrams and their compositional properties serve as the foundation for an engineering tool where CAD designs interact with planning and scheduling algorithms to automatically create highquality assembly plans. These assembly plans are then executed in simulation to measure their performance and to visualize their key build characteristics. We demonstrate the versatility of this approach in the LEGO assembly domain. We developed two reference LEGO CAD models that are processed by CompositionalPlanning's algorithmic pipeline. We compare sequential and parallel assembly plans in a Minecraft simulation and show that the timetobuild performance can be optimized by our algorithms.
BibTeX
@inproceedings{2019compositionalplanning, title={String diagrams for assembly planning}, author={Master, Jade and Patterson, Evan and Yousfi, Shahin and Canedo, Arquimedes}, booktitle={International Conference on Theory and Application of Diagrams (Diagrams 2020)}, pages={167183}, year={2020}, eprinttype={arXiv}, eprint={1909.10475}, doi={10.1007/9783030542498_14} }
We explore using monoidal categories and string diagrams as a foundation for assembly planning. Working in the simplified domain of LEGO assembly in Minecraft, we design and implement a proofofconcept system that spans the assembly process, from the object’s CAD description to planning and scheduling to simulation.

Conformalized quantile regression, 2019. Yaniv Romano, Evan Patterson, Emmanuel J. CandĆØs.
Abstract
Conformal prediction is a technique for constructing prediction intervals that attain valid coverage in finite samples, without making distributional assumptions. Despite this appeal, existing conformal methods can be unnecessarily conservative because they form intervals of constant or weakly varying length across the input space. In this paper we propose a new method that is fully adaptive to heteroscedasticity. It combines conformal prediction with classical quantile regression, inheriting the advantages of both. We establish a theoretical guarantee of valid coverage, supplemented by extensive experiments on popular regression datasets. We compare the efficiency of conformalized quantile regression to other conformal methods, showing that our method tends to produce shorter intervals.
BibTeX
@inproceedings{romano2019, title={Conformalized quantile regression}, author={Romano, Yaniv and Patterson, Evan and Cand{\`e}s, Emmanuel J.}, booktitle={Thirtythird Conference on Neural Information Processing Systems (NeurIPS19)}, eprinttype={arXiv}, eprint={1905.03222}, year={2019} }
We propose a new method to construct prediction intervals, based on conformal inference. Our philosophy is that if interval estimation is the goal, then one should bypass point estimation and directly estimate the interval endpoints by quantile regression. One can then conformalize the intervals to attain valid, distributionfree coverage in finite samples.

Teaching machines to understand data science code by semantic enrichment of dataflow graphs, 2018. Evan Patterson, Ioana Baldini, Aleksandra Mojsilovic, Kush R. Varshney.
Abstract
Your computer is continuously executing programs, but does it really understand them? Not in any meaningful sense. That burden falls upon human knowledge workers, who are increasingly asked to write and understand code. They deserve to have intelligent tools that reveal the connections between code and its subject matter. Towards this prospect, we develop an AI system that forms semantic representations of computer programs, using techniques from knowledge representation and program analysis. To create the representations, we introduce an algorithm for enriching dataflow graphs with semantic information. The semantic enrichment algorithm is undergirded by a new ontology language for modeling computer programs and a new ontology about data science, written in this language. Throughout the paper, we focus on code written by data scientists and we locate our work within a larger movement towards collaborative, open, and reproducible science.
BibTeX
@inproceedings{patterson2018, title={Teaching machines to understand data science code by semantic enrichment of dataflow graphs}, author={Patterson, Evan and Baldini, Ioana and Mojsilovi{\'c}, Aleksandra and Varshney, Kush R}, booktitle={Proceedings of 2018 KDD Workshop on the Fragile Earth: Theory Guided Data Science to Enhance Scientific Discovery}, eprinttype={arXiv}, eprint={1807.05691}, year={2018} }
Building on previous work, we refine our method for creating semantic representations of data science programs. We implement the method for both Python and R and we release the first version of the Data Science Ontology. The ontology is written in a new ontology language called Monocl. All source code is available on GitHub.
We demoed the system at IJCAIECAI 2018 and we presented a shorter paper at the 2018 KDD Fragile Earth Workshop. Later, we presented the project at PyData NYC 2019 (slides).

Knowledge representation in bicategories of relations, 2017. Evan Patterson.
Abstract
We introduce the relational ontology log, or relational olog, a knowledge representation system based on the category of sets and relations. It is inspired by Spivak and Kent's olog, a recent categorical framework for knowledge representation. Relational ologs interpolate between ologs and description logic, the dominant formalism for knowledge representation today. In this paper, we investigate relational ologs both for their own sake and to gain insight into the relationship between the algebraic and logical approaches to knowledge representation. On a practical level, we show by example that relational ologs have a friendly and intuitiveā€”yet fully preciseā€”graphical syntax, derived from the string diagrams of monoidal categories. We explain several other useful features of relational ologs not possessed by most description logics, such as a type system and a rich, flexible notion of instance data. In a more theoretical vein, we draw on categorical logic to show how relational ologs can be translated to and from logical theories in a fragment of firstorder logic. Although we make extensive use of categorical language, this paper is designed to be selfcontained and has considerable expository content. The only prerequisites are knowledge of firstorder logic and the rudiments of category theory.
BibTeX
@article{patterson2017, title={Knowledge Representation in Bicategories of Relations}, author={Patterson, Evan}, eprinttype={arXiv}, eprint={1706.00526}, year={2017} }
I explain how to represent relational knowledge in a mathematical structure called a bicategory of relations. I compare this structure to the more familiar systems of firstorder logic and description logic. Despite the heavy dose of monoidal category theory, the paper is supposed to be readable by noncategorytheorists. Many pictures and informal explanations are provided.
I talked about this project at the 2017 CSLI Workshop (slides) and later at the 2017 AMS Special Session on Applied Category Theory (slides).

Dataflow representation of data analyses: Toward a platform for collaborative data science, 2017. Evan Patterson, Robert McBurney, Hollie Schmidt, Ioana Baldini, Aleksandra Mojsilovic, Kush R. Varshney.
Abstract
Data science plays an increasingly important role in solving today's scientific and social challenges. To promote progress toward a cure for multiple sclerosis, the Accelerated Cure Project has created an open repository of biological and survey data on patients with multiple sclerosis. Similar largescale repositories are being created in other domains. As the open, datadriven model of science proliferates, the research community faces a growing need for a cloud platform for collaborative data science. Such a platform should facilitate collaboration between domain experts and data scientists and possess artificial intelligence capabilities for organizing, recommending, and manipulating data analyses. In this paper, we present some foundational technologies motivated by this vision. Our system automatically extracts a highlevel dataflow graph from a data analysis. This graph describes how data flows through an analysis pipeline, including which statistical methods are used and how they fit together. The system requires no special annotations from the data analyst and consumes analyses written in Python using standard tools, such as Scikitlearn and Statsmodels. In this paper, we explain how our system works and how it fits into our larger vision for a collaborative data science platform.
BibTeX
@article{patterson2017, author={Patterson, Evan and McBurney, Robert and Schmidt, Hollie and Baldini, Ioana and Mojsilovi{\'c}, Aleksandra and Varshney, Kush R.}, journal={IBM Journal of Research and Development}, title={Dataflow representation of data analyses: Toward a platform for collaborative data science}, volume={61}, number={6}, pages={9:19:13}, year={2017}, doi={10.1147/JRD.2017.2736278} }
We argue that a cloud platform for data science, powered by artificial intelligence, could accelerate scientific discovery by helping data scientists and domain scientists collaborate. Existing data science platforms do not meet this need. We further argue that to enable these AI capabilities, we must create useful, machineinterpretable representations of data analyses. We develop a first version of this technology (largely superseded by later work).
We presented a shorter paper (slides) at the AAAI 2017 Spring Symposium on Artificial Intelligence for the Social Good.