We introduce a new framework for categorical doctrines based on doublecategorical functorial semantics. For a brief introduction, see the blog post.
Publications

Cartesian double theories: A doublecategorical framework for categorical doctrines, 2023. Michael Lambert, Evan Patterson.
Abstract
The categorified theories known as "doctrines" specify a category equipped with extra structure, analogous to how ordinary theories specify a set with extra structure. We introduce a new framework for doctrines based on double category theory. A cartesian double theory is defined to be a small double category with finite products and a model of a cartesian double theory to be a finite productpreserving lax functor out of it. Many familiar categorical structures are models of cartesian double theories, including categories, presheaves, monoidal categories, braided and symmetric monoidal categories, 2groups, multicategories, and cartesian and cocartesian categories. We show that every cartesian double theory has a unital virtual double category of models, with lax maps between models given by cartesian lax natural transformations, bimodules between models given by cartesian modules, and multicells given by multimodulations. In many cases, the virtual double category of models is representable, hence is a genuine double category. Moreover, when restricted to pseudo maps, every cartesian double theory has a virtual equipment of models, hence an equipment of models in the representable case. Compared with 2monads, double theories have the advantage of being straightforwardly presentable by generators and relations, as we illustrate through a large number of examples.

Structured and decorated cospans from the viewpoint of double category theory, 2023. Evan Patterson.
Abstract
Structured and decorated cospans are broadly applicable frameworks for building bicategories or double categories of open systems. We streamline and generalize these frameworks using central concepts of double category theory. We show that, under mild hypotheses, double categories of structured cospans are cocartesian (have finite doublecategorical coproducts) and are equipments. The proofs are simple as they utilize appropriate doublecategorical universal properties. Maps between double categories of structured cospans are studied from the same perspective. We then give a new construction of the double category of decorated cospans using the recently introduced double Grothendieck construction. Besides its conceptual value, this reconstruction leads to a natural generalization of decorated cospans, which we illustrate through an example motivated by statistical theories and other theories of processes.
I argue that open systems, as modeled by structured and decorated cospans, should be understood using concepts from double category theory. Structured cospans are shown to form a cocartesian equipment and decorated cospans to arise from the double Grothendieck construction.
I presented this paper at Applied Category Theory 2023 (slides). It is based on a series of blog posts (1, 2, 3).

A compositional framework for convex model predictive control, 2023. Tyler Hanks, Baike She, Matthew Hale, Evan Patterson, Matthew Klawonn, James Fairbanks.
Abstract
Model predictive control (MPC) is an optimal control technique which involves solving a sequence of constrained optimization problems across a given time horizon. In this paper we show that the structure of problems in this sequence is elegantly captured via a novel categorytheoretic viewpoint. Specifically, we develop a category  called Para(Conv)  whose objects are Euclidean spaces and whose morphisms represent constrained convex optimization problems. We go on to show that convex MPC optimization problems result from compositions of simpler subproblems in Para(Conv), and we also show that Para(Conv) can be used to model both initial and terminal state constraints, which often arise in practice. We present a novel Julia library that leverages our theoretical results to automate the implementation of correctbyconstruction MPC problems in software. Specifically we provide code snippets showing the ease with which MPC problems can be implemented and solved using our library.
Inspired by Rockafellar’s notion of convex bifunction, we propose a compositional framework for model predictive control (MPC) and we implement it in the Julia package AlgebraicControl.jl.

A compositional account of motifs, mechanisms, and dynamics in biochemical regulatory networks, 2023. Rebekah Aduddell, James Fairbanks, Amit Kumar, Pablo S. Ocal, Evan Patterson, Brandon T. Shapiro.
Abstract
Regulatory networks depict promoting or inhibiting interactions between molecules in a biochemical system. We introduce a categorytheoretic formalism for regulatory networks, using signed graphs to model the networks and signed functors to describe occurrences of one network in another, especially occurrences of network motifs. With this foundation, we establish functorial mappings between regulatory networks and other mathematical models in biochemistry. We construct a functor from reaction networks, modeled as Petri nets with signed links, to regulatory networks, enabling us to precisely define when a reaction network could be a physical mechanism underlying a regulatory network. Turning to quantitative models, we associate a regulatory network with a LotkaVolterra system of differential equations, defining a functor from the category of signed graphs to a category of parameterized dynamical systems. We extend this result from closed to open systems, demonstrating that LotkaVolterra dynamics respects not only inclusions and collapsings of regulatory networks, but also the process of building up complex regulatory networks by gluing together simpler pieces. Formally, we use the theory of structured cospans to produce a lax double functor from the double category of open signed graphs to that of open parameterized dynamical systems. Throughout the paper, we ground the categorical formalism in examples inspired by systems biology.
Inspired by the philosophy of systems biology, we introduce a categorical formalism for biochemical regulatory networks, using signed graphs to model the networks and signed functors to describe occurrences of one network in another. We then give a functorial semantics to regulatory networks using LotkaVolterra systems of differential equations.

Compositional modeling with stock and flow diagrams, 2022. John Baez, Xiaoyan Li, Sophie Libkind, Nathaniel Osgood, Evan Patterson.
Abstract
Stock and flow diagrams are widely used in epidemiology to model the dynamics of populations. Although tools already exist for building these diagrams and simulating the systems they describe, we have created a new package called StockFlow, part of the AlgebraicJulia ecosystem, which uses ideas from category theory to overcome notable limitations of existing software. Compositionality is provided by the theory of decorated cospans: stock and flow diagrams can composed to form larger ones in an intuitive way formalized by the operad of undirected wiring diagrams. Our approach also cleanly separates the syntax of stock and flow diagrams from the semantics they can be assigned. We consider semantics in ordinary differential equations, although others are possible. As an example, we explain code in StockFlow that implements a simplified version of a COVID19 model used in Canada.
BibTeX
@inproceedings{2022stockflow, title={Compositional modeling with stock and flow diagrams}, author={Baez, John and Li, Xiaoyan and Libkind, Sophie and Osgood, Nathaniel and Patterson, Evan}, booktitle={Proceedings of the Fifth International Conference on Applied Category Theory (ACT 2022)}, pages={7796}, year={2022}, doi={10.4204/EPTCS.380.5}, eprinttype={arXiv}, eprint={2205.08373} }
Stock and flow diagrams, originating in the field of system dynamics, are also used in epidemiology to specify compartmental models. We give a categorytheoretic and compositional account of stock and flow diagrams, which is implemented in the package StockFlow.jl. This paper can be seen as complementary to our previous work on Petri nets in epidemiology modeling.
We presented this paper at Applied Category Theory 2022 (slides, video).

A diagrammatic view of differential equations in physics, 2022. Evan Patterson, Andrew Baas, Timothy Hosgood, James Fairbanks.
Abstract
Presenting systems of differential equations in the form of diagrams has become common in certain parts of physics, especially electromagnetism and computational physics. In this work, we aim to put such use of diagrams on a firm mathematical footing, while also systematizing a broadly applicable framework to reason formally about systems of equations and their solutions. Our main mathematical tools are categorytheoretic diagrams, which are well known, and morphisms between diagrams, which have been less appreciated. As an application of the diagrammatic framework, we show how complex, multiphysical systems can be modularly constructed from basic physical principles. A wealth of examples, drawn from electromagnetism, transport phenomena, fluid mechanics, and other fields, is included.
BibTeX
@article{2022diagrammaticdiffeqs, title={A diagrammatic view of differential equations in physics}, author={Patterson, Evan and Baas, Andrew and Hosgood, Timothy and Fairbanks, James}, journal={Mathematics in Engineering}, volume={5}, number={2}, pages={159}, year={2023}, doi={10.3934/mine.2023036}, eprinttype={arXiv}, eprint={2204.01843} }
We formalize and systematize Tonti diagrams, a graphical mode of presenting partial differential equations in physics. More generally, we explain how to reason formally about systems of equations and their solutions using categorytheoretic diagrams. For an overview of this long paper, see our series of blog posts (part 1, part 2).

An algebraic framework for structured epidemic modelling, 2022. Sophie Libkind, Andrew Baas, Micah Halter, Evan Patterson, James Fairbanks.
Abstract
Pandemic management requires that scientists rapidly formulate and analyse epidemiological models in order to forecast the spread of disease and the effects of mitigation strategies. Scientists must modify existing models and create novel ones in light of new biological data and policy changes such as social distancing and vaccination. Traditional scientific modelling workflows detach the structure of a model—its submodels and their interactions—from its implementation in software. Consequently, incorporating local changes to model components may require global edits to the code base through a manual, timeintensive and errorprone process. We propose a compositional modelling framework that uses highlevel algebraic structures to capture domainspecific scientific knowledge and bridge the gap between how scientists think about models and the code that implements them. These algebraic structures, grounded in applied category theory, simplify and expedite modelling tasks such as model specification, stratification, analysis and calibration. With their structure made explicit, models also become easier to communicate, criticize and refine in light of stakeholder feedback.
BibTeX
@article{2023structuredepimodeling, title={An algebraic framework for structured epidemic modelling}, author={Libkind, Sophie and Baas, Andrew and Halter, Micah and Patterson, Evan and Fairbanks, James P.}, journal={Philosophical Transactions of the Royal Society A}, volume={380}, number={2233}, pages={20210309}, year={2022}, doi={10.1098/rsta.2021.0309}, eprinttype={arXiv}, eprint={2203.16345} }
We introduce an algebraic framework for building compartmental models in epidemiology, with highlevel operations for hierarchically composing and stratifying models. The framework is implemented in the packages AlgebraicPetri.jl and AlgebraicDynamics.jl, and the paper’s examples are reproduced in Jupyter notebooks.

Computational categorytheoretic rewriting, 2022. Kristopher Brown, Evan Patterson, Tyler Hanks, James Fairbanks.
Abstract
We demonstrate how category theory provides specifications that can efficiently be implemented via imperative algorithms and apply this to the field of graph transformation. By examples, we show how this paradigm of software development makes it easy to quickly write correct and performant code. We provide a modern implementation of graph rewriting techniques at the level of abstraction of finitelypresented Csets and clarify the connections between Csets and the typed graphs supported in existing rewriting software. We emphasize that our opensource library is extensible: by taking new categorical constructions (such as slice categories, structured cospans, and distributed graphs) and relating their limits and colimits to those of their underlying categories, users inherit efficient algorithms for pushout complements and (final) pullback complements. This allows one to perform double, single, and sesquipushout rewriting over a broad class of data structures. Graph transformation researchers, scientists, and engineers can then use this library to computationally manipulate rewriting systems and apply them to their domains of interest.
BibTeX
@article{2023acsetrewriting, title={Computational categorytheoretic rewriting}, author={Brown, Kristopher and Patterson, Evan and Hanks, Tyler and Fairbanks, James}, journal={Journal of Logical and Algebraic Methods in Programming}, volume={134}, year={2023}, sortyear={2303}, doi={10.1016/j.jlamp.2023.100888}, eprinttype={arXiv}, eprint={2111.03784} }
We explain how to rewrite categorical databases (Csets) using span rewriting, including double pushout (DPO), single pushout (SPO), and sesquipushout rewriting. We implement these methods in the Julia package AlgebraicRewriting.jl.
We presented the conference version of this paper at at 2022 International Conference on Graph Transformation.

Categorical data structures for technical computing, 2021. Evan Patterson, Owen Lynch, James Fairbanks.
Abstract
Many mathematical objects can be represented as functors from finitelypresented categories C to Set. For instance, graphs are functors to Set from the category with two parallel arrows. Such functors are known informally as Csets. In this paper, we describe and implement an extension of Csets having data attributes with fixed types, such as graphs with labeled vertices or realvalued edge weights. We call such structures "acsets," short for "attributed Csets." Derived from previous work on algebraic databases, acsets are a joint generalization of graphs and data frames. They also encompass more elaborate graphlike objects such as wiring diagrams and Petri nets with rate constants. We develop the mathematical theory of acsets and then describe a generic implementation in the Julia programming language, which uses advanced language features to achieve performance comparable with specialized data structures.
BibTeX
@article{patterson2022, title={Categorical data structures for technical computing}, author={Patterson, Evan and Lynch, Owen and Fairbanks, James}, journal={Compositionality}, volume={4}, number={5}, year={2022}, eprinttype={arXiv}, eprint={2106.04703}, doi={10.32408/compositionality45} }
Presenting a core feature of Catlab.jl, we argue that attributed Csets, or “acsets” for short, offer a fundamental data structure for technical computing that generalizes both data frames and graphs, as well as more elaborate structures such as wiring diagrams and Petri nets.

Operadic modeling of dynamical systems: Mathematics and computation, 2021. Sophie Libkind, Andrew Baas, Evan Patterson, James Fairbanks.
Abstract
Dynamical systems are ubiquitous in science and engineering as models of phenomena that evolve over time. Although complex dynamical systems tend to have important modular structure, conventional modeling approaches suppress this structure. Building on recent work in applied category theory, we show how deterministic dynamical systems, discrete and continuous, can be composed in a hierarchical style. In mathematical terms, we reformulate some existing operads of wiring diagrams and introduce new ones, using the general formalism of Csets (copresheaves). We then establish dynamical systems as algebras of these operads. In a computational vein, we show that Euler's method is functorial for undirected systems, extending a previous result for directed systems. All the ideas in this paper are implemented as practical software using Catlab and the AlgebraicJulia ecosystem, written in the Julia programming language for scientific computing.
BibTeX
@inproceedings{2021algdynamics, title={Operadic modeling of dynamical systems: Mathematics and computation}, author={Libkind, Sophie and Baas, Andrew and Patterson, Evan and Fairbanks, James}, booktitle={Proceedings of the Fourth International Conference on Applied Category Theory (ACT 2021)}, pages={192206}, year={2021}, doi={10.4204/EPTCS.372.14}, eprinttype={arXiv}, eprint={2105.12282} }
Using the mathematics of operads and operad algebras, we show how to hierarchically compose deterministic dynamical systems, both discrete and continuous. The method is implemented in the Julia package AlgebraicDynamics.jl.
We presented this paper as a distinguished talk at Applied Category Theory 2021 (slides, video).

Wiring diagrams as normal forms for computing in symmetric monoidal categories, 2020. Evan Patterson, David I. Spivak, Dmitry Vagner.
Abstract
Applications of category theory often involve symmetric monoidal categories (SMCs), in which abstract processes or operations can be composed in series and parallel. However, in 2020 there remains a dearth of computational tools for working with SMCs. We present an "unbiased" approach to implementing symmetric monoidal categories, based on an operad of directed, acyclic wiring diagrams. Because the interchange law and other laws of a SMC hold identically in a wiring diagram, no rewrite rules are needed to compare diagrams. We discuss the mathematics of the operad of wiring diagrams, as well as its implementation in the software package Catlab.
BibTeX
@inproceedings{2020wdnormalform, title={Wiring diagrams as normal forms for computing in symmetric monoidal categories}, author={Patterson, Evan and Spivak, David I. and Vagner, Dmitry}, booktitle={Proceedings of the Third International Applied Category Theory Conference (ACT 2020)}, pages={4964}, year={2020}, doi={10.4204/EPTCS.333.4}, eprinttype={arXiv}, eprint={2101.12046} }
We introduce a formalism for directed, acyclic wiring diagrams and explain how it leads to a normal form for morphism expressions in a symmetric monoidal category. We also discuss how directed wiring diagrams are implemented in Catlab.jl.
We presented this paper at Applied Category Theory 2020 (video).

The algebra and machine representation of statistical models, 2020. Evan Patterson.
Abstract
As the twin movements of open science and open source bring an ever greater share of the scientific process into the digital realm, new opportunities arise for the metascientific study of science itself, including of data science and statistics. Future science will likely see machines play an active role in processing, organizing, and perhaps even creating scientific knowledge. To make this possible, large engineering efforts must be undertaken to transform scientific artifacts into useful computational resources, and conceptual advances must be made in the organization of scientific theories, models, experiments, and data. This dissertation takes steps toward digitizing and systematizing two major artifacts of data science, statistical models and data analyses. Using tools from algebra, particularly categorical logic, a precise analogy is drawn between models in statistics and logic, enabling statistical models to be seen as models of theories, in the logical sense. Statistical theories, being algebraic structures, are amenable to machine representation and are equipped with morphisms that formalize the relations between different statistical methods. Turning from mathematics to engineering, a software system for creating machine representations of data analyses, in the form of Python or R programs, is designed and implemented. The representations aim to capture the semantics of data analyses, independent of the programming language and libraries in which they are implemented.
BibTeX
@phdthesis{patterson2020, title={The algebra and machine representation of statistical models}, author={Patterson, Evan}, school={Stanford University, Department of Statistics}, eprinttype={arXiv}, eprint={2006.08945}, year={2020} }
In my PhD thesis, I draw a precise analogy between models in logic and statistics, enabling statistical models to be seen as models of statistical theories, in the logical sense. This is accomplished using tools from algebra, particularly categorical logic. I develop the algebra of statistical theories and models and present a wide range of examples from classical statistics. In addition, I recapitulate and extend previous work on the semantic modeling of data science code.
I presented the material about statistical theories and models at the MIT Categories Seminar (slides) and the 2020 Workshop on Categorical Probability and Statistics (slides).

Hausdorff and Wasserstein metrics on graphs and other structured data, 2019. Evan Patterson.
Abstract
Optimal transport is widely used in pure and applied mathematics to find probabilistic solutions to hard combinatorial matching problems. We extend the Wasserstein metric and other elements of optimal transport from the matching of sets to the matching of graphs and other structured data. This structurepreserving form of optimal transport relaxes the usual notion of homomorphism between structures. It applies to graphs, directed and undirected, labeled and unlabeled, and to any other structure that can be realized as a Cset for some finitely presented category C. We construct both Hausdorffstyle and Wassersteinstyle metrics on Csets and we show that the latter are convex relaxations of the former. Like the classical Wasserstein metric, the Wasserstein metric on Csets is the value of a linear program and is therefore efficiently computable.
BibTeX
@article{patterson2020, title={{Hausdorff} and {Wasserstein} metrics on graphs and other structured data}, author={Patterson, Evan}, journal={Information and Inference: A Journal of the {IMA}}, volume={iaaa025}, year={2020}, eprinttype={arXiv}, eprint={1907.00257}, doi={10.1093/imaiai/iaaa025} }
I extend ideas from optimal transport to structured data by probabilistically relaxing the notion of a homomorphism. As an application, I construct Wassersteinstyle metrics on graphs, simplicial sets, and other combinatorial structures.
I talked about this project at the 2019 AMS Special Session on Applied Category Theory (slides).

String diagrams for assembly planning, 2019. Jade Master, Evan Patterson, Shahin Yousfi, Arquimedes Canedo.
Abstract
Assembly planning is a difficult problem for companies. Many disciplines such as design, planning, scheduling, and manufacturing execution need to be carefully engineered and coordinated to create successful product assembly plans. Recent research in the field of design for assembly has proposed new methodologies to design product structures in such a way that their assembly is easier. However, present assembly planning approaches lack the engineering tool support to capture all the constraints associated to assembly planning in a unified manner. This paper proposes CompositionalPlanning, a string diagram based framework for assembly planning. In the proposed framework, string diagrams and their compositional properties serve as the foundation for an engineering tool where CAD designs interact with planning and scheduling algorithms to automatically create highquality assembly plans. These assembly plans are then executed in simulation to measure their performance and to visualize their key build characteristics. We demonstrate the versatility of this approach in the LEGO assembly domain. We developed two reference LEGO CAD models that are processed by CompositionalPlanning's algorithmic pipeline. We compare sequential and parallel assembly plans in a Minecraft simulation and show that the timetobuild performance can be optimized by our algorithms.
BibTeX
@inproceedings{2019compositionalplanning, title={String diagrams for assembly planning}, author={Master, Jade and Patterson, Evan and Yousfi, Shahin and Canedo, Arquimedes}, booktitle={International Conference on Theory and Application of Diagrams (Diagrams 2020)}, pages={167183}, year={2020}, eprinttype={arXiv}, eprint={1909.10475}, doi={10.1007/9783030542498_14} }
We explore using monoidal categories and string diagrams as a foundation for assembly planning. Working in the simplified domain of LEGO assembly in Minecraft, we design and implement a proofofconcept system that spans the assembly process, from the object’s CAD description to planning and scheduling to simulation.

Conformalized quantile regression, 2019. Yaniv Romano, Evan Patterson, Emmanuel J. Candès.
Abstract
Conformal prediction is a technique for constructing prediction intervals that attain valid coverage in finite samples, without making distributional assumptions. Despite this appeal, existing conformal methods can be unnecessarily conservative because they form intervals of constant or weakly varying length across the input space. In this paper we propose a new method that is fully adaptive to heteroscedasticity. It combines conformal prediction with classical quantile regression, inheriting the advantages of both. We establish a theoretical guarantee of valid coverage, supplemented by extensive experiments on popular regression datasets. We compare the efficiency of conformalized quantile regression to other conformal methods, showing that our method tends to produce shorter intervals.
BibTeX
@inproceedings{romano2019, title={Conformalized quantile regression}, author={Romano, Yaniv and Patterson, Evan and Cand{\`e}s, Emmanuel J.}, booktitle={Thirtythird Conference on Neural Information Processing Systems (NeurIPS19)}, eprinttype={arXiv}, eprint={1905.03222}, year={2019} }
We propose a new method to construct prediction intervals, based on conformal inference. Our philosophy is that if interval estimation is the goal, then one should bypass point estimation and directly estimate the interval endpoints by quantile regression. One can then conformalize the intervals to attain valid, distributionfree coverage in finite samples.

Teaching machines to understand data science code by semantic enrichment of dataflow graphs, 2018. Evan Patterson, Ioana Baldini, Aleksandra Mojsilovic, Kush R. Varshney.
Abstract
Your computer is continuously executing programs, but does it really understand them? Not in any meaningful sense. That burden falls upon human knowledge workers, who are increasingly asked to write and understand code. They deserve to have intelligent tools that reveal the connections between code and its subject matter. Towards this prospect, we develop an AI system that forms semantic representations of computer programs, using techniques from knowledge representation and program analysis. To create the representations, we introduce an algorithm for enriching dataflow graphs with semantic information. The semantic enrichment algorithm is undergirded by a new ontology language for modeling computer programs and a new ontology about data science, written in this language. Throughout the paper, we focus on code written by data scientists and we locate our work within a larger movement towards collaborative, open, and reproducible science.
BibTeX
@inproceedings{patterson2018, title={Teaching machines to understand data science code by semantic enrichment of dataflow graphs}, author={Patterson, Evan and Baldini, Ioana and Mojsilovi{\'c}, Aleksandra and Varshney, Kush R}, booktitle={Proceedings of 2018 KDD Workshop on the Fragile Earth: Theory Guided Data Science to Enhance Scientific Discovery}, eprinttype={arXiv}, eprint={1807.05691}, year={2018} }
Building on previous work, we refine our method for creating semantic representations of data science programs. We implement the method for both Python and R and we release the first version of the Data Science Ontology. The ontology is written in a new ontology language called Monocl. All source code is available on GitHub.
We demoed the system at IJCAIECAI 2018 and we presented a shorter paper at the 2018 KDD Fragile Earth Workshop. Later, we presented the project at PyData NYC 2019 (slides).

Knowledge representation in bicategories of relations, 2017. Evan Patterson.
Abstract
We introduce the relational ontology log, or relational olog, a knowledge representation system based on the category of sets and relations. It is inspired by Spivak and Kent's olog, a recent categorical framework for knowledge representation. Relational ologs interpolate between ologs and description logic, the dominant formalism for knowledge representation today. In this paper, we investigate relational ologs both for their own sake and to gain insight into the relationship between the algebraic and logical approaches to knowledge representation. On a practical level, we show by example that relational ologs have a friendly and intuitive—yet fully precise—graphical syntax, derived from the string diagrams of monoidal categories. We explain several other useful features of relational ologs not possessed by most description logics, such as a type system and a rich, flexible notion of instance data. In a more theoretical vein, we draw on categorical logic to show how relational ologs can be translated to and from logical theories in a fragment of firstorder logic. Although we make extensive use of categorical language, this paper is designed to be selfcontained and has considerable expository content. The only prerequisites are knowledge of firstorder logic and the rudiments of category theory.
BibTeX
@article{patterson2017, title={Knowledge Representation in Bicategories of Relations}, author={Patterson, Evan}, eprinttype={arXiv}, eprint={1706.00526}, year={2017} }
I explain how to represent relational knowledge in a mathematical structure called a bicategory of relations. I compare this structure to the more familiar systems of firstorder logic and description logic. Despite the heavy dose of monoidal category theory, the paper is supposed to be readable by noncategorytheorists. Many pictures and informal explanations are provided.
I talked about this project at the 2017 CSLI Workshop (slides) and later at the 2017 AMS Special Session on Applied Category Theory (slides).

Dataflow representation of data analyses: Toward a platform for collaborative data science, 2017. Evan Patterson, Robert McBurney, Hollie Schmidt, Ioana Baldini, Aleksandra Mojsilovic, Kush R. Varshney.
Abstract
Data science plays an increasingly important role in solving today's scientific and social challenges. To promote progress toward a cure for multiple sclerosis, the Accelerated Cure Project has created an open repository of biological and survey data on patients with multiple sclerosis. Similar largescale repositories are being created in other domains. As the open, datadriven model of science proliferates, the research community faces a growing need for a cloud platform for collaborative data science. Such a platform should facilitate collaboration between domain experts and data scientists and possess artificial intelligence capabilities for organizing, recommending, and manipulating data analyses. In this paper, we present some foundational technologies motivated by this vision. Our system automatically extracts a highlevel dataflow graph from a data analysis. This graph describes how data flows through an analysis pipeline, including which statistical methods are used and how they fit together. The system requires no special annotations from the data analyst and consumes analyses written in Python using standard tools, such as Scikitlearn and Statsmodels. In this paper, we explain how our system works and how it fits into our larger vision for a collaborative data science platform.
BibTeX
@article{patterson2017, author={Patterson, Evan and McBurney, Robert and Schmidt, Hollie and Baldini, Ioana and Mojsilovi{\'c}, Aleksandra and Varshney, Kush R.}, journal={IBM Journal of Research and Development}, title={Dataflow representation of data analyses: Toward a platform for collaborative data science}, volume={61}, number={6}, pages={9:19:13}, year={2017}, doi={10.1147/JRD.2017.2736278} }
We argue that a cloud platform for data science, powered by artificial intelligence, could accelerate scientific discovery by helping data scientists and domain scientists collaborate. Existing data science platforms do not meet this need. We further argue that to enable these AI capabilities, we must create useful, machineinterpretable representations of data analyses. We develop a first version of this technology (largely superseded by later work).
We presented a shorter version of this paper at the AAAI 2017 Spring Symposium on Artificial Intelligence for the Social Good (slides).