Tags: [[🌲️/&]] - [[Data Science]]
Title: [[& 50 Years of Data Science]]
Author: [[@David Donoho]]
DOI: <https://www.tandfonline.com/doi/full/10.1080/10618600.2017.1384734>
Reference:
Publish Year: 2017
Publish Month: October
Publish ISO: 2017-10-02
----
# Today's Data Science Moment
# Data Science "Versus" Statistics
## The "Big Data" Meme
## The "Skills" Meme
## The "Jobs" Meme
## What Here is Real?
## A Better Framework
# The Future of Data Analysis, 1962
John Tukey coined the term Data Analysis. His 1962 paper _The Future of Data Analysis_ is abbreviated FoDA below.
# The 50 Years Since FoDA
## Exhortations
Cleveland proposed six foci of activity, even suggesting allocations of effort:
- Multidisciplinary investigations (25%)
- Models and Methods for Data (20%)
- Computing with Data (15%)
- Pedagogy (15%)
- Tool Evaluation (5%)
- Theory (20%)
At the time, several academic statistics departments could fit 100% of their activity into the 20% Cleveland allotted to Theory.
## Reification
# Breiman's "Two Cultures" 2001
Breiman's article _Statistical Modeling: The Two Cultures_ describes two cultural outlooks on extracting value from data.
Statistics starts with data. Think of the data as being generated by a black box in which a vector of input variables x (independent variables) go in one side, and on the other side the response variables y come out. Inside the black box, nature functions to associate the predictor variables with the response variables... There are two goals in analyzing the data:
- **Prediction**: To be able to predict what the responses are going to be to future input variables.
- **Inference**: To infer how nature is associating the response variables to the input variables.
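The two goals can be told apart on a toy example. The sketch below is not from the paper: the linear "nature", its coefficients (1.0 and 2.0), the noise level, and the helper names `fit_line` and `predict` are all assumptions chosen for illustration. Inference reads the fitted coefficients as a statement about the black box; prediction only asks for the response at a new input.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical "nature": the black box that turns inputs x into responses y.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 0.5) for x in xs]

intercept, slope = fit_line(xs, ys)

# Inference: what does the fit say about how nature links x to y?
print(f"estimated intercept={intercept:.2f}, slope={slope:.2f}")

# Prediction: what response do we expect for a future input?
def predict(x_new):
    return intercept + slope * x_new

print(f"predicted y at x=12: {predict(12.0):.2f}")
```

In this toy case one fitted model serves both goals; Breiman's point is that the two cultures weight them very differently.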
# The Predictive Culture's Secret Sauce
## The Common Task Framework
**CTF**: _Common Task Framework_
An instance of CTF has these ingredients:
- A publicly available training dataset involving, for each observation, a list of (possibly many) feature measurements, and a class label for that observation
- A set of enrolled competitors whose common task is to infer a class prediction rule from the training data
- A scoring referee, to which competitors can submit their prediction rule. The referee runs the prediction rule against a testing dataset, which is sequestered behind a Chinese wall. The referee objectively and automatically reports the score (prediction accuracy) achieved by the submitted rule.
## Experience with CTF
The general experience with CTF was summarized by Liberman as follows:
- Error rates decline by a fixed percentage each year, to an asymptote depending on task and data quality
- Progress usually comes from many small improvements; a change of 1% can be a reason to break out the champagne
- Shared data plays a crucial role - and is reused in unexpected ways.
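One possible reading of the first point, as a sketch: the gap between the current error rate and the asymptote shrinks by the same fraction every year. The function name `projected_error` and all numbers below (30% starting error, 5% asymptote, 10% yearly reduction) are hypothetical and only illustrate the shape of the curve; they are not figures from Liberman or the paper.

```python
def projected_error(initial, asymptote, yearly_reduction, years):
    """Error after `years` if the gap to the asymptote shrinks by a fixed fraction each year."""
    gap = initial - asymptote
    return asymptote + gap * (1 - yearly_reduction) ** years

# Hypothetical task: 30% error today, 5% floor, gap shrinks 10% per year.
for year in range(0, 11, 2):
    print(year, round(projected_error(0.30, 0.05, 0.10, year), 4))
```

Under this reading the yearly gains get smaller in absolute terms as the error approaches the floor, which fits the remark that a 1% change can be worth celebrating.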
## The Secret Sauce
The combination of predictive modeling culture together with CTF is the "secret sauce" of machine learning.
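A minimal sketch of how the CTF ingredients fit together, assuming a toy one-feature classification task. Everything named here (`make_data`, `Referee`, `fit_threshold_rule`, the data-generating rule) is hypothetical and not taken from the paper. The competitor sees only the training data; the referee keeps the test data private and reports nothing but the accuracy of a submitted rule.

```python
import random

def make_data(n, rng):
    """Hypothetical toy task: one feature x, label 1 if a noisy copy of x exceeds 0.5."""
    rows = []
    for _ in range(n):
        x = rng.random()
        label = 1 if x + rng.gauss(0, 0.1) > 0.5 else 0
        rows.append((x, label))
    return rows

class Referee:
    """Holds the sequestered test set and reports only a score."""

    def __init__(self, test_rows):
        self._test_rows = test_rows  # never handed to competitors

    def score(self, rule):
        """Return the prediction accuracy of `rule` on the hidden test data."""
        correct = sum(1 for x, label in self._test_rows if rule(x) == label)
        return correct / len(self._test_rows)

rng = random.Random(42)
train = make_data(500, rng)             # publicly available training data
referee = Referee(make_data(500, rng))  # test data stays with the referee

def fit_threshold_rule(train_rows):
    """Competitor: pick the threshold with the best training accuracy."""
    def accuracy(t):
        hits = sum(1 for x, label in train_rows if (1 if x > t else 0) == label)
        return hits / len(train_rows)
    best = max((i / 100 for i in range(101)), key=accuracy)
    return lambda x: 1 if x > best else 0

def baseline(x):
    """Trivial submission: always predict class 1."""
    return 1

rule = fit_threshold_rule(train)
print("baseline accuracy:", referee.score(baseline))
print("threshold rule accuracy:", referee.score(rule))
```

Because every competitor is scored by the same referee on the same hidden data, the scores are directly comparable, which is what lets many small improvements accumulate.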
## Required Skills
# Teaching of Today's Consensus Data Science
# The Full Scope of Data Science
## The Six Divisions
The activities of Greater Data Science (GDS) are classified into six divisions:
- Data Gathering, Preparation, and Exploration
- Data Representation and Transformation
- Computing with Data
- Data Modeling
- Data Visualization and Presentation
- Science about Data Science
Tukey also coined the term Exploratory Data Analysis (EDA).
## Discussion
## Teaching of GDS
## Research in GDS
## Discussion
# Science About Data Science
## Science-Wide Meta-Analysis
## Cross-Study Analysis
## Cross-Workflow Analysis
## Summary
# The Next 50 Years of Data Science
## Open Science Takes Over
_"An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures."_
## Science as Data
## Scientific Data Analysis, Tested Empirically
## Data Science in 2065
# Conclusion