Status:
Tags: [[🌲️/&]]
Title: [[& 2020-11-16 Tidy Data]]
Author: [[@Hadley Wickham]]
Keywords: [[Data Cleaning]] - [[Data Tidying]] - [[Relational Databases]] - [[DevLog/r]]
DOI: <http://dx.doi.org/10.18637/jss.v059.i10>
File: [File](zotero://open-pdf/library/items/28GS4KL8)
Publish Date: 2014-08-01
Reviewed Date:

---

# Extracted Annotations (2020-11-16)

> "Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table. This framework makes it easy to tidy messy datasets because only a small set of tools are needed to deal with a wide range of un-tidy datasets. This structure also makes it easier to develop tidy tools for data analysis, tools that both input and output tidy datasets." ([Wickham 2014:1](zotero://open-pdf/library/items/28GS4KL8?page=1))

Utilizing a common framework and standardization creates a common "API" for data manipulation. If the inputs are always shaped the same way, then operations can be standardized and the tool sets built on them work together with fewer incongruencies. Reference Jenny Bryan's tweet on the cognitive load of programming with a style guide? ([note on p.1](zotero://open-pdf/library/items/28GS4KL8?page=1))

> "A standard makes initial data cleaning easier because you do not need to start from scratch and reinvent the wheel every time" ([Wickham 2014:1](zotero://open-pdf/library/items/28GS4KL8?page=1))

A standard for dataset structure, and the tools that operate on that structure, lessens cognitive load: the desired end state of the prepared dataset is already defined, so there is no need to reinvent it each time. This lets one approach the cleaning process with a clear end goal in mind, or at least a far less obscure one than when starting from scratch.
([note on p.1](zotero://open-pdf/library/items/28GS4KL8?page=1))

> "Happy families are all alike; every unhappy family is unhappy in its own way. Leo Tolstoy" ([Wickham 2014:2](zotero://open-pdf/library/items/28GS4KL8?page=2))

> "A dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative). Values are organized in two ways. Every value belongs to a variable and an observation. A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes." ([Wickham 2014:3](zotero://open-pdf/library/items/28GS4KL8?page=3))

![[202011151522 Pre tidying data.png]]
*The before data often seen in the wild*

![[202011151523 Tidy Dataset.png]]
*Akin to a pivot table, with the larger scopes to the left and the more granular variables, scopes, and finally values to the right*

> "Table 3 is the tidy version of Table 1. Each row represents an observation, the result of one treatment on one person, and each column is a variable." ([Wickham 2014:4](zotero://open-pdf/library/items/28GS4KL8?page=4))

> "While the order of variables and observations does not affect analysis, a good ordering makes it easier to scan the raw values. One way of organizing variables is by their role in the analysis: are values fixed by the design of the data collection, or are they measured during the course of the experiment? Fixed variables describe the experimental design and are known in advance. Computer scientists often call fixed variables dimensions, and statisticians usually denote them with subscripts on random variables. Measured variables are what we actually measure in the study. Fixed variables should come first, followed by measured variables, each ordered so that related variables are contiguous.
Rows can then be ordered by the first variable, breaking ties with the second and subsequent (fixed) variables. This is the convention adopted by all tabular displays in this paper." ([Wickham 2014:5](zotero://open-pdf/library/items/28GS4KL8?page=5))

![[202011151550 Tuberculosis messy data.png]]
*Tuberculosis messy data*

> "Storing the values in this form resolves another problem in the original data. We want to compare rates, not counts. But to compute rates, we need to know the population. In the original format, there is no easy way to add a population variable. It has to be stored in a separate table, which makes it hard to correctly match populations to counts." ([Wickham 2014:9](zotero://open-pdf/library/items/28GS4KL8?page=9))

![[Pasted image 20201115155048.png]]
*Molten then tidied data*

![[Pasted image 20201115155604.png]]

> "This dataset is mostly tidy, but we have two variables stored in rows: tmin and tmax, the type of observation" ([Wickham 2014:10](zotero://open-pdf/library/items/28GS4KL8?page=10))

The difference between this example and the earlier Treatment A / Treatment B one: here the column is named "element" and merely holds the names of two distinct variables (tmin and tmax), whereas in the first example the column was "treatment" because both entries genuinely were treatments. There the variable was treatment, with two separate values, each paired with its resulting measurement. ([note on p.10](zotero://open-pdf/library/items/28GS4KL8?page=10))

> "In my experience, these are the four fundamental verbs of data manipulation: Filter: subsetting or removing observations based on some condition. Transform: adding or modifying variables. These modifications can involve either a single variable (e.g., log-transformation), or multiple variables (e.g., computing density from weight and volume). Aggregate: collapsing multiple values into a single value (e.g., by summing or taking means). Sort: changing the order of observations."
([Wickham 2014:13](zotero://open-pdf/library/items/28GS4KL8?page=13))
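The two mechanics this note keeps returning to are melting a messy wide table into tidy form (one observation per row) and then applying the four verbs to the result. A minimal sketch in plain Python, since the paper itself works in R; the `melt()` helper, the column names, and the data values here are all hypothetical illustrations, not the paper's code:

```python
import math

def melt(rows, id_col, value_cols, var_name, value_name):
    """Wide -> long: each value column becomes its own (variable, value)
    row, so every output row is a single observation."""
    return [
        {id_col: row[id_col], var_name: col, value_name: row[col]}
        for row in rows
        for col in value_cols
    ]

# Messy form: treatment names are stored as column headers,
# as in the Treatment A / Treatment B example above.
messy = [
    {"person": "Jane Doe",   "a": 16, "b": 11},
    {"person": "John Smith", "a": 4,  "b": 2},
]
tidy = melt(messy, "person", ["a", "b"], "treatment", "result")

# The four verbs, applied to the tidy table:

# Filter: subset observations based on a condition.
b_only = [r for r in tidy if r["treatment"] == "b"]

# Transform: add or modify a variable (here a single-variable log transform).
logged = [{**r, "log_result": math.log(r["result"])} for r in tidy]

# Aggregate: collapse multiple values into one (mean result per treatment).
by_treatment = {}
for r in tidy:
    by_treatment.setdefault(r["treatment"], []).append(r["result"])
means = {t: sum(vals) / len(vals) for t, vals in by_treatment.items()}

# Sort: change the order of observations.
ordered = sorted(tidy, key=lambda r: r["result"])
```

Because every operation takes and returns the same row-per-observation shape, the verbs compose freely, which is the "common API" point from the opening annotation.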