DDQ THU 2021-10-14

23. Design of Experiments

23.1. Agenda

General Announcements
Discussion & Activity

Table 23.1 Current Assignments for Everyone
Category	Assignment	Day	Date
Term Project	Milestone 3: Design Alternatives	FRI	2021-11-05

23.2. Activity

23.2.1. What is an Experiment?

experiment

A series of observations conducted under controlled conditions to study a relationship with the purpose of drawing causal inferences about that relationship.

An experiment involves the manipulation of one or more independent variables, the measurement of a dependent variable, and the exposure of various participants to one or more of the conditions being studied.

conditions

The different levels of one or more independent variables.

23.2.2. Models

\[y = f(x_1, x_2, \dots) + \epsilon\]

In math class, you are typically given \(f\) and your goal is to determine \(y\) from given values of \(x_1\), \(x_2\), etc.
In an experiment, you measure \(y\) for different values of \(x_1\), \(x_2\), etc., and your goal is to determine \(f\). That is, your goal is to determine the relationship. The term \(\epsilon\) represents the variation in \(y\) that \(f\) does not explain (e.g., measurement noise).
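As a minimal sketch of the experimental direction (measure \(y\), then estimate \(f\)), the snippet below simulates measurements and fits a linear model by least squares. The "true" relationship, the noise level, and the assumption that \(f\) is linear are all made up for illustration; they do not come from the course material.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" relationship, unknown to the experimenter:
#   y = 2*x1 - 3*x2 + epsilon
n = 200
x1 = rng.uniform(0, 10, size=n)
x2 = rng.uniform(0, 10, size=n)
epsilon = rng.normal(0, 0.5, size=n)
y = 2.0 * x1 - 3.0 * x2 + epsilon

# Assume f is linear (an assumption, not a given) and estimate its
# coefficients from the measurements using least squares.
X = np.column_stack([x1, x2])
coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

print("estimated coefficients:", coef)
```

The estimated coefficients should land close to the values used to generate the data, which is the sense in which the experiment "recovers" \(f\).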

23.2.3. Ronald Fisher’s Principles for Designed Experiments

Comparison: Treatments are often compared against a scientific control or a traditional treatment that acts as a baseline.

treatment
The conditions applied to one or more groups that are expected to cause change in some outcome.

control
The regulation of all extraneous conditions and variables in an experiment so that any change in the dependent variable being measured can be attributed solely to manipulation of the independent variable(s).
Randomization: The process of assigning individuals at random to the different groups in an experiment, so that each individual is equally likely to be assigned to any given group.

confound
An independent variable that is conceptually distinct but empirically inseparable from one or more other independent variables.

Randomization mitigates confounding, i.e., it helps minimize the presence of confounds. This is important because confounding makes it impossible for those conducting the experiment to differentiate a variable’s effects in isolation from its effects in conjunction with other variables.
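A minimal sketch of random assignment, assuming a hypothetical list of participant IDs and two conditions. The function name `randomize` and the labels are illustrative only.

```python
import random

def randomize(participants, conditions, seed=None):
    """Randomly assign participants to conditions in (near-)equal numbers."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)  # random order removes any systematic pattern
    groups = {c: [] for c in conditions}
    for i, person in enumerate(shuffled):
        # Deal participants out to the conditions in round-robin fashion.
        groups[conditions[i % len(conditions)]].append(person)
    return groups

# Hypothetical participant IDs.
participants = [f"P{i:02d}" for i in range(1, 13)]
groups = randomize(participants, ["control", "treatment"], seed=42)

for condition, members in groups.items():
    print(condition, members)
```

Because the shuffle decides who lands in which group, no participant characteristic (known or unknown) can systematically line up with a condition, which is how randomization mitigates confounding.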
Statistical Replication: Measurements are repeated and full experiments are replicated to help identify the sources of variation.
Blocking: Blocking is the non-random arrangement of experimental units into groups consisting of units that are similar to one another.

block
A relatively homogeneous subset of the study participants.

The purpose of a block design is to ensure that a characteristic of the study participants that is related to the target outcome is distributed equally across treatments.
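One common way to combine blocking with randomization is to group participants into blocks first and then randomize within each block. The sketch below assumes a hypothetical blocking characteristic (experience level) and hypothetical participant IDs; it is one possible arrangement, not the only one.

```python
import random
from collections import defaultdict

def blocked_assignment(participants, block_of, conditions, seed=None):
    """Randomize within each block so every block is spread across conditions."""
    rng = random.Random(seed)

    # Non-random step: group participants by the blocking characteristic.
    blocks = defaultdict(list)
    for person in participants:
        blocks[block_of[person]].append(person)

    # Random step: shuffle within each block, then deal out to conditions.
    assignment = {}
    for label in sorted(blocks):
        members = blocks[label]
        rng.shuffle(members)
        for i, person in enumerate(members):
            assignment[person] = conditions[i % len(conditions)]
    return assignment

# Hypothetical data: four novices and four experts.
participants = [f"P{i}" for i in range(1, 9)]
block_of = {p: ("novice" if i < 4 else "expert") for i, p in enumerate(participants)}

assignment = blocked_assignment(
    participants, block_of, ["control", "treatment"], seed=7
)
for person in participants:
    print(person, block_of[person], assignment[person])
```

Each block contributes the same number of participants to each condition, so the blocking characteristic is distributed equally across treatments.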
Orthogonality: Ideally, independent variables in an experiment are uncorrelated.

Note for Grad Students

If some of the independent variables are correlated, then this may also break some statistical models. For example, if you suspect that the dependent variable is some linear combination of the independent variables, then you might try to use Linear Regression. The most common way to estimate the parameters in a linear regression model is the Ordinary Least Squares (OLS) method, and the calculus-based approach for OLS involves the computer taking the inverse of a square matrix constructed using the values of the independent variables.

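If the correlation is perfect (i.e., one independent variable is an exact linear combination of the others), then the matrix becomes singular and it becomes impossible to take the inverse! Strong but imperfect correlation leaves the matrix invertible but ill-conditioned, which makes the estimates numerically unstable. If that happens, then you might try approximating the inverse with a pseudoinverse computed from a Singular Value Decomposition (SVD) of the matrix (or perhaps some other eigenvalue-based decomposition); however, it may be more appropriate to remove some independent variables from the model to reduce or remove correlation instead.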

Most techniques for model reduction (i.e., removing some of the independent variables) seek to minimize information loss (e.g., PCA and AIC- or SBIC-based elimination methods). You may also need to reduce a model for it to fit in memory, but that’s another point entirely.
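The note above can be illustrated with a small sketch. The data are simulated, and the second independent variable is deliberately constructed as an exact multiple of the first, so the design matrix is rank-deficient. The variable names and the choice of NumPy are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50

# Hypothetical independent variables: x2 is an exact multiple of x1,
# so the two are perfectly correlated.
x1 = rng.normal(size=n)
x2 = 2.0 * x1
y = 3.0 * x1 + rng.normal(scale=0.1, size=n)

X = np.column_stack([x1, x2])

# X has two columns but only rank 1, so the square matrix X^T X
# (which has the same rank as X) is singular and has no inverse.
rank = np.linalg.matrix_rank(X)
print("rank of X:", rank)

# Option 1: SVD-based pseudoinverse (np.linalg.pinv uses an SVD).
beta_pinv = np.linalg.pinv(X, rcond=1e-10) @ y

# Option 2: remove the redundant variable and fit the reduced model.
X_reduced = x1.reshape(-1, 1)
beta_reduced, _, _, _ = np.linalg.lstsq(X_reduced, y, rcond=None)

print("pseudoinverse coefficients:", beta_pinv)
print("reduced-model coefficient:", beta_reduced)
```

Both options produce the same fitted values here, but the reduced model has a single, interpretable coefficient, whereas the pseudoinverse splits the effect across two indistinguishable variables.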
Factorial Design: Instead of testing each discrete independent variable one at a time, groups are formed so that all combinations of the discrete independent variables (and their levels, if appropriate) are tested.

Suppose you want to understand the relationship between some dependent variable \(y\) and two independent variables \({\rm sex}\) (M or F) and \({\rm enrolled}\) (Y or N). You could try to understand the impact of each variable separately using two experiments (thus collecting two datasets):

Experiment 1: \(y = f({\rm sex}) + \epsilon\), with dataset columns:
\(y\)	\({\rm sex}\)

Experiment 2: \(y = f({\rm enrolled}) + \epsilon\), with dataset columns:
\(y\)	\({\rm enrolled}\)

Using those datasets, you might attempt to construct a table with statistics for the various combinations of the independent variables:

	\({\rm sex}\) = M	\({\rm sex}\) = F
\({\rm enrolled}\) = Y	stats(Y, M)	stats(Y, F)
\({\rm enrolled}\) = N	stats(N, M)	stats(N, F)

In a factorial design, you can cut the work down to one experiment (and thus one dataset) if you’re careful about the way you construct the sample.

\(y = f({\rm sex}, {\rm enrolled}) + \epsilon\), with dataset columns:
\(y\)	\({\rm sex}\)	\({\rm enrolled}\)

With this one dataset, you can compute the same kind of statistics table that was presented before with less work. In order for it to be a true factorial design, you’ll want all the possible combinations of values for the discrete independent variables to be covered, and you’ll want those combinations to be uniformly distributed among your samples. The latter point just means that the number of observations you have to compute each “stats” entry in the statistics table above is roughly, if not exactly, equal.
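A minimal sketch of a balanced two-factor design for this example. The number of observations per cell and the measurements themselves are placeholders (random numbers stand in for real measurements of \(y\)); only the structure of the design is the point.

```python
import itertools
import random
from collections import defaultdict
from statistics import mean

# Levels of each discrete independent variable.
levels = {"enrolled": ["Y", "N"], "sex": ["M", "F"]}

# A full factorial design covers every combination of levels.
cells = list(itertools.product(levels["enrolled"], levels["sex"]))

# Balanced design: the same number of observations in every cell.
per_cell = 5  # illustrative choice
rng = random.Random(0)
observations = []
for enrolled, sex in cells:
    for _ in range(per_cell):
        y = rng.gauss(0, 1)  # placeholder for a real measurement
        observations.append({"enrolled": enrolled, "sex": sex, "y": y})

# One dataset is enough to fill in every entry of the statistics table.
by_cell = defaultdict(list)
for obs in observations:
    by_cell[(obs["enrolled"], obs["sex"])].append(obs["y"])

cell_stats = {cell: {"n": len(ys), "mean": mean(ys)} for cell, ys in by_cell.items()}
for cell in cells:
    print(cell, cell_stats[cell])
```

Each `(enrolled, sex)` key corresponds to one stats(·, ·) entry in the table, and the equal `n` per cell is what makes the design balanced.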


23.2.4. Breakout Groups¶

No in-class breakout group activity today.

23.2.5. After Breakout Groups¶

N/A.

23.2.6. After Class¶

Before 11:55PM today, pick one of the components of Ronald Fisher’s Principles for Designed Experiments that you are or were least familiar with, then discuss that component in the corresponding followup discussion in Piazza @74.

Discussion Points

Discussion points can include questions, commentary, insight, and reflection on the topic and/or what’s written by others participating in the discussion.
1. Comparison: Piazza @74_f1
2. Randomization: Piazza @74_f2
3. Statistical Replication: Piazza @74_f3
4. Blocking: Piazza @74_f4
5. Orthogonality: Piazza @74_f5
6. Factorial Design: Piazza @74_f6
Before 11:55PM on SUN 10-17, pick one of the components of Ronald Fisher’s Principles for Designed Experiments that you are or were most familiar with, then contribute to its ongoing followup discussion in Piazza @74.

Contributions

Try to make your contribution a “value add.” You may include references, where appropriate, but only if you include corresponding commentary to frame those references.
1. Comparison: Piazza @74_f1
2. Randomization: Piazza @74_f2
3. Statistical Replication: Piazza @74_f3
4. Blocking: Piazza @74_f4
5. Orthogonality: Piazza @74_f5
6. Factorial Design: Piazza @74_f6

Continue reading the Design module, and make sure you’re aware of current assignments and their due dates.
Extra Practice: Consider reproducing an app you use every day as a mockup in Adobe XD or Figma.
