Data-intensive science

“Data-intensive science” is an umbrella term for science that involves larger or more diverse data than is traditional. Jim Gray called it “eScience” or the “fourth paradigm”; Michael Nielsen speaks of “networked science” and “data-driven intelligence”. No single term has become standard.

Case studies

Large-scale collaboration

  • Polymath projects (Nielsen, 2012, Ch. 1)
  • Kasparov vs the World (Nielsen, 2012, Ch. 2-3)
  • MathWorks programming competition (Nielsen, 2012, Ch. 4)
  • Kaggle competitions

Data-driven science

  • Don Swanson: discovery of a link between migraine and magnesium
    • Swanson, 1990: Medical literature as a potential source of new knowledge (PubMed)
    • See also Nielsen, 2012, Ch. 6
  • Purvesh Khatri: Sepsis multi-cohort analysis using data from GEO (see the data-access sketch after this list)
    • Sweeney et al, 2015: A comprehensive time-course-based multicohort analysis of sepsis… (doi)
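
A rough illustration of the data-access side of this case study: GEO series can be pulled programmatically and their sample annotations inspected before any multi-cohort analysis. The sketch below assumes the third-party GEOparse Python library; the accession is a placeholder, not one confirmed to be used in the paper.

    # Hypothetical sketch: download one expression series from GEO and
    # inspect its sample annotations with GEOparse (pip install GEOparse).
    import GEOparse

    accession = "GSE28750"  # placeholder sepsis-related series, for illustration only
    gse = GEOparse.get_GEO(geo=accession, destdir="./geo_cache")

    # Sample-level metadata (phenotype table) as a pandas DataFrame.
    print(gse.phenotype_data.head())

    # Per-sample expression tables are also exposed as DataFrames.
    for name, gsm in list(gse.gsms.items())[:2]:
        print(name, gsm.table.shape)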

Literature

Books

  • Nielsen, 2012: Reinventing Discovery: The New Era of Networked Science
    • Ch. 3: “Restructuring expert attention”. This chapter is about how online tools can be used to direct the attention of domain experts to the problems where their expertise is most relevant. It introduces some punchy terminology: “microexpertise”, “designed serendipity”, and “architecture of attention”.
    • Ch. 6: “All the world’s knowledge”. The latter half of this chapter is about the “data web” and “data-driven intelligence”. It’s the only part of the book that discusses the semantic representation of data and scientific knowledge.
    • Ch. 8-9: These final two chapters are about the incentive problems that impede open science and how to overcome them. They include an interesting historical discussion of science in the pre-journal era.
  • Hey et al, 2009: The Fourth Paradigm: Data-Intensive Scientific Discovery
    • A collection of essays published by Microsoft Research
    • Gray: “eScience: A transformed scientific method”
      • Four paradigms: empirical, theoretical, computational (numerics and simulations), eScience (data-intensive)
      • Part 1 of talk: data tools
        • Need “self-describing data” (data with built-in schema; see the sketch after the book list)
        • Need better data analysis tools [Gray seems off here: he mentions MATLAB and Excel but not Python or R, which were already going strong in 2009]
      • Part 2 of talk: scholarly communication
        • “In principle, [the Internet] can unify all the scientific data with all the literature to create a world in which the data and the literature interoperate with each other”
        • Overlay journals
  • Borgman, 2007: Scholarship in the Digital Age
  • Borgman, 2017: Big Data, Little Data, No Data: Scholarship in the Networked World
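
Gray’s “self-describing data” is easy to make concrete: the schema (field names, types, units) travels with the values, so a consumer needs no external documentation to interpret the file. Formats like HDF5 and NetCDF work this way; the minimal Python sketch below uses plain JSON, and the field names and values are invented for illustration.

    # Toy "self-describing" file: the schema is stored alongside the data,
    # so any reader can interpret the rows without outside documentation.
    # (Field names, units, and values are invented for illustration.)
    import json

    record = {
        "schema": {
            "fields": [
                {"name": "time", "type": "float", "unit": "s"},
                {"name": "flux", "type": "float", "unit": "Jy"},
            ]
        },
        "rows": [[0.0, 1.23], [0.5, 1.31], [1.0, 1.28]],
    }

    with open("observation.json", "w") as f:
        json.dump(record, f, indent=2)

    # A consumer reads the embedded schema first, then interprets the rows.
    with open("observation.json") as f:
        loaded = json.load(f)
    names = [field["name"] for field in loaded["schema"]["fields"]]
    for row in loaded["rows"]:
        print(dict(zip(names, row)))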