Evan Patterson
Stanford University, Statistics Department
Ioana Baldini, Aleksandra Mojsilović, Kush R. Varshney
IBM Research AI
PyData NYC
November 4, 2019
...to people, with varying backgrounds, professional roles, and levels of expertise
...to machines, for the sake of:
Not independent: Machine understanding is the means, human understanding is the end.
Benefits of the traditional scientific report:
Problems:
Computational notebooks like Jupyter and R Markdown improve:
However:
Understanding the code is hard because data science is a big tent, encompassing different:
Open source and open science are producing ever more data science code.
But once it is published, how do we utilize it?
To help people share the results of data analysis, and machines process it, we need a model of data science code that is
- machine-interpretable
- language- and library-agnostic
- not too hard to generate
Goal: Create semantic models of data science code, with minimal human intervention.
import numpy as np
from scipy.cluster.vq import kmeans2

iris = np.genfromtxt('iris.csv', dtype='f8', delimiter=',', skip_header=1)
iris = np.delete(iris, 4, axis=1)
centroids, clusters = kmeans2(iris, 3)
import pandas as pd
from sklearn.cluster import KMeans

iris = pd.read_csv('iris.csv')
iris = iris.drop('Species', axis=1)
kmeans = KMeans(n_clusters=3).fit(iris.values)
centroids = kmeans.cluster_centers_
clusters = kmeans.labels_
iris = read.csv("datasets/iris.csv", stringsAsFactors=FALSE)
iris = iris[, names(iris) != "Species"]
km = kmeans(iris, 3)
centroids = km$centers
clusters = km$cluster
A real data analysis from the DREAM Challenge for rheumatoid arthritis (RA):
We analyze one top-ranking submission, written in R.
library("caret")
library("VIF")
library("Cubist")

merge.p.with.template <- function(p) {
  template = read.csv("RAchallenge_Q1_final_template.csv")
  template$row = 1:nrow(template)
  template = template[, c(1, 3)]
  ids = data.resp$IID[is.na(y)]
  p = data.frame(ID=ids, Response.deltaDAS=p)
  p = merge(template, p)
  p = p[order(p$row), ]
  p[, c(1, 3)]
}

data = readRDS("pred.rds")
resp = readRDS("resp.rds")

# non-clinical model
data.resp = merge(data, resp[c("FID", "IID", "Response.deltaDAS")])
y = data.resp$Response.deltaDAS
y.training = y[!is.na(y)]
data.resp2 = data.resp[!(names(data.resp) %in% c("Response.deltaDAS", "FID", "IID"))]
dummy = predict(dummyVars(~., data=data.resp2), newdata=data.resp2)
dummy.training = dummy[!is.na(y), ]
dummy.testing = dummy[is.na(y), ]
v = vif(y.training, dummy.training, dw=5, w0=5, trace=F)
dummy.training.selected = as.data.frame(dummy.training[, v$select])
dummy.testing.selected = as.data.frame(dummy.testing[, v$select])
m1 = cubist(dummy.training.selected, y.training, committees=100)
p1 = predict(m1, newdata=dummy.testing.selected)

# clinical model
dummy = data.resp[c("baselineDAS", "Drug", "Age", "Gender", "Mtx")]
dummy = predict(dummyVars(~., data=dummy), newdata=dummy)
dummy.training = dummy[!is.na(y), ]
dummy.testing = dummy[is.na(y), ]
m2 = cubist(dummy.training, y.training, committees=100)
p2 = predict(m2, newdata=dummy.testing)

## create csv files
p1.df = merge.p.with.template(p1)
p2.df = merge.p.with.template(p2)
write.csv(p1.df, quote=F, row.names=F, file="clinical_and_genetic.csv")
write.csv(p2.df, quote=F, row.names=F, file="clinical_only.csv")
How we construct a semantic model of data science code:
Step 1. Dataflow analysis using computer program analysis:
Step 2. Semantic enrichment using knowledge-based methods:
Programming model: "Everything that happens [in R] is a function call" (John Chambers)
In practice, not everything that happens is a function call at the syntactic level, especially in Python.
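A minimal illustration of this point: Python spells many operations as infix operators or attribute access rather than explicit function calls, but they desugar to special-method calls, which is what an AST transformation can make explicit. (The snippet below is my own illustration, not code from the talk.)

```python
import math

x, y = 3, 4

# 'x + y' is surface syntax for the special-method call type(x).__add__(x, y)
assert x + y == type(x).__add__(x, y)

# attribute access 'math.pi' desugars to a getattr call
assert math.pi == getattr(math, 'pi')
```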
In the static phase, transform the abstract syntax tree (AST):
In the dynamic phase, record the raw flow graph by executing the transformed code, using tracing hooks such as sys.settrace in Python:
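As a minimal sketch of the dynamic phase, sys.settrace can observe every Python-level call made while a snippet runs, which is the kind of raw event stream a dataflow tracer consumes. (The function names here are illustrative, not from the actual implementation.)

```python
import sys

calls = []  # record of Python-level function calls observed during execution

def tracer(frame, event, arg):
    if event == 'call':
        calls.append(frame.f_code.co_name)
    return None  # no per-line tracing needed

def analysis():
    data = list(range(5))
    return sum(data)

sys.settrace(tracer)
result = analysis()
sys.settrace(None)

# only Python-level calls appear; C-level builtins like sum() do not
assert 'analysis' in calls
assert result == 10
```

A real tracer would also record argument and return values at each call to reconstruct the flow of data between calls, not just the call events themselves.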
Assumption: The data analysis must be executable: we need the code, the data, and the environment, the same requirement as for reproducibility.
Problem: Determine the "meaning" (semantics) of data science code.
↯ Simply impossible without restricting the scope.
Assumption: Code uses semantically meaningful classes and functions from standard packages.
Strategy: Express standard classes and functions in terms of universal concepts.
k-means clustering
fit supervised model
k-means clustering in scikit-learn
k-means clustering in SciPy
Assumption: Some code will be unannotated, hence semantics will be partial.
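A toy sketch of the annotation idea (the schema here is invented for illustration): library-specific definitions are mapped to universal concepts, so code calling scipy.cluster.vq.kmeans2 and code using sklearn.cluster.KMeans can both be recognized as k-means clustering, while unannotated code yields no concept and the semantics stay partial.

```python
# hypothetical annotation records: library definition -> universal concept
annotations = [
    {"language": "python", "package": "sklearn",
     "definition": "sklearn.cluster.KMeans",
     "concept": "k-means clustering"},
    {"language": "python", "package": "scipy",
     "definition": "scipy.cluster.vq.kmeans2",
     "concept": "k-means clustering"},
]

def concept_of(qualified_name):
    """Look up the universal concept for a library definition, if annotated."""
    for note in annotations:
        if note["definition"] == qualified_name:
            return note["concept"]
    return None  # unannotated code: semantics remain partial

assert concept_of("scipy.cluster.vq.kmeans2") == "k-means clustering"
assert concept_of("my_custom_helper") is None
```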
Transform raw flow graph to semantic flow graph:
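The transformation can be sketched, in a highly simplified form, as relabeling the nodes of the raw flow graph with universal concepts where annotations exist. (The data structures below are invented for illustration; the real flow graphs also carry edges for the data flowing between calls.)

```python
# hypothetical concept table for library calls
concepts = {
    "pandas.read_csv": "read tabular data",
    "sklearn.cluster.KMeans.fit": "fit k-means clustering",
}

raw_flow_graph = [
    # (node id, library call observed at that node)
    (0, "pandas.read_csv"),
    (1, "my_custom_cleanup"),          # unannotated user code
    (2, "sklearn.cluster.KMeans.fit"),
]

# relabel annotated nodes with concepts; None marks unannotated nodes
semantic_flow_graph = [(node, concepts.get(call))
                       for node, call in raw_flow_graph]

assert semantic_flow_graph[0] == (0, "read tabular data")
assert semantic_flow_graph[1] == (1, None)
```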
Architecture: Implemented in Julia, rather than Python or R, partly for maximal decoupling from the languages being analyzed.
We need your help! Please contact me if interested in contributing.