Causal Inference

Potential Outcomes

Conditional Independence

We write $ X \perp\kern{-10mu}\perp Y \mid Z $ to say that $ X $ and $ Y $ are conditionally independent given the random variable $ Z $. The definition has the following equivalent forms:

  • Factorization of the joint: $ f(x, y \mid z) = f(x \mid z) f(y \mid z) $
  • Invariance of the conditional: $ f(y \mid x, z)=f(y \mid z) $, or equivalently $ f(x \mid y, z) = f(x \mid z) $

Core meaning: once $ Z $ is known, additionally observing $ X $ provides no new information for predicting $ Y $.

Useful corollaries (when $ X \perp\kern{-10mu}\perp Y \mid Z $):

  1. The product expectation factorizes: $ \mathbb{E}[XY \mid Z] = \mathbb{E}[X \mid Z]\mathbb{E}[Y \mid Z] $
  2. Conditional expectations are invariant:
    • $ \mathbb{E}[Y \mid X, Z] = \mathbb{E}[Y \mid Z] $
    • $ \mathbb{E}[X \mid Y, Z] = \mathbb{E}[X \mid Z] $
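Corollary 1 can be checked by brute-force enumeration of a small discrete joint distribution. The numbers below are hypothetical, chosen so that $X \perp\kern{-10mu}\perp Y \mid Z$ holds by construction:

```python
from itertools import product

# Hypothetical conditional distributions; the joint is built as
# p(x, y, z) = p(z) p(x|z) p(y|z), so X ⊥ Y | Z holds by construction.
p_z = {0: 0.4, 1: 0.6}
p_x_given_z = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}  # p_x_given_z[z][x]
p_y_given_z = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}  # p_y_given_z[z][y]

joint = {(x, y, z): p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y]
         for x, y, z in product([0, 1], repeat=3)}

def cond_exp(f, z):
    """E[f(X, Y) | Z = z], computed by direct enumeration of the joint."""
    mass = sum(p for (x, y, zz), p in joint.items() if zz == z)
    return sum(f(x, y) * p for (x, y, zz), p in joint.items() if zz == z) / mass

# E[XY | Z=z] equals E[X | Z=z] E[Y | Z=z] in every stratum of Z
for z in (0, 1):
    lhs = cond_exp(lambda x, y: x * y, z)
    rhs = cond_exp(lambda x, y: x, z) * cond_exp(lambda x, y: y, z)
    assert abs(lhs - rhs) < 1e-12
```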

Neyman–Rubin Causal Model (Potential Outcomes Framework)

For individual $i$, define $$ D_i= \begin{cases}1 & \text { if individual } i \text { receives treatment } \\ 0 & \text { if individual } i \text { does not receive treatment }\end{cases} $$ and let $Y_i(1), Y_i(0)$ denote the individual's outcome with and without treatment, respectively. The individual treatment effect is $$ \tau_i = Y_i(1) - Y_i(0) $$ In general only one of $Y_i(1), Y_i(0)$ can be observed, so we target treatment effects at the population level:

| Estimand | Definition | Interpretation |
| --- | --- | --- |
| ATE | $ \mathbb{E}[Y(1)-Y(0)] $ | average treatment effect in the full population |
| ATT | $ \mathbb{E}[Y(1)-Y(0) \mid D=1] $ | average effect among the treated |
| ATU | $ \mathbb{E}[Y(1)-Y(0) \mid D=0] $ | average effect among the untreated |

Note that the naive comparison of observed group means decomposes as $$ \begin{aligned} E[Y \mid D=1] - E[Y \mid D=0] & = E[Y(1) \mid D=1] - E[Y(0)\mid D=0] \\ & = \big(E[Y(1) \mid D=1] - E[Y(0)\mid D=1]\big) + \big(E[Y(0) \mid D=1] - E[Y(0)\mid D=0]\big) \\ & = \tau_{ATT} + \big(E[Y(0) \mid D=1] - E[Y(0)\mid D=0]\big) \end{aligned} $$

The term $E[Y(0) \mid D=1] - E[Y(0)\mid D=0]$ is called the selection bias. Unless treatment is assigned completely at random, it is generally nonzero.

In practice, some individuals may know in advance whether treatment would pay off for them — for instance, some people know that spending a lot of money on a college education would not necessarily improve their outcome. This induces $$ E[Y(1) \mid D=1] - E[Y(1) \mid D=0] > 0 $$ i.e., abler students self-select into college, so the average graduate salary of those who chose to attend exceeds the average salary that those who chose not to attend would have earned had they attended.

Selection bias is precisely the failure of independence between $D$ and $(Y_0, Y_1)$; this must be confronted whenever the experiment is natural rather than randomized.

If $P(D=1) = \pi=1-P(D=0)$, then $$ \tau_{ATE} = \pi \cdot \tau_{ATT} + (1-\pi) \cdot \tau_{ATU} $$ i.e., the ATE is a weighted average of the ATT and the ATU.
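Both the weighting identity and the selection-bias term can be computed exactly on a hypothetical finite population where, unrealistically, both potential outcomes are known for every unit:

```python
# A hypothetical finite population in which both potential outcomes are
# known, so ATE, ATT, ATU, and selection bias can all be computed exactly.
# Each row: (D_i, Y_i(0), Y_i(1)).
population = [
    (1, 5.0, 9.0),
    (1, 6.0, 8.0),
    (0, 4.0, 5.0),
    (0, 3.0, 5.0),
    (0, 2.0, 4.0),
]

def mean(xs):
    return sum(xs) / len(xs)

ate = mean([y1 - y0 for _, y0, y1 in population])
att = mean([y1 - y0 for d, y0, y1 in population if d == 1])
atu = mean([y1 - y0 for d, y0, y1 in population if d == 0])
pi = mean([d for d, _, _ in population])  # P(D = 1)

# ATE is the weighted average of ATT and ATU
assert abs(ate - (pi * att + (1 - pi) * atu)) < 1e-12

# Selection bias: E[Y(0) | D=1] - E[Y(0) | D=0]; nonzero here because the
# treated units have higher baseline outcomes
sel_bias = (mean([y0 for d, y0, _ in population if d == 1])
            - mean([y0 for d, y0, _ in population if d == 0]))
```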

Model Assumptions

SUTVA

SUTVA (Stable Unit Treatment Value Assumption) is the classic assumption of the potential-outcomes framework, and it has two components:

  • No Interference: each unit's outcome depends only on its own treatment, not on the treatment of others
  • Consistency: if the treatment is $D$, then the observed outcome is $Y=Y(D)$, i.e., the observation coincides with the corresponding potential outcome

Together the two assumptions say: $$ Y_i = Y_i(D_i) $$

When studying a question such as "the effect of obesity on lifespan", "obesity" is non-manipulable: different interventions $D$ (say, exercise vs. dieting) can produce the same "obese" state yet different $Y(D)$, because exercise and dieting affect $Y$ differently [1].

This breaks consistency and is something to watch for when working with observational data; the issue does not arise in an RCT.

Ignorability / Unconfoundedness

$$ (Y_0, Y_1) \perp\kern{-10mu}\perp D \mid X $$ means that, after controlling for the covariates $ X $, treatment assignment is as good as random. This guarantees that the ATE is identifiable from observational data.

Positivity / Common Support

$$ 0 < P(D=1 \mid X=x) < 1 $$

Intuitively, when positivity fails, some subgroup $X=x$ is either entirely treated or entirely untreated, and the ATE cannot be computed: in $$ E[Y(1)-Y(0)] = E_X[E[Y \mid T=1, X] - E[Y\mid T=0, X]] $$ both inner conditional expectations must be well defined for every $x$ in the support.

Positivity-Unconfoundedness Tradeoff

Although conditioning on more covariates raises the chance of satisfying unconfoundedness, it also raises the chance of violating positivity.

Adjustment Formula

Combining the assumptions above yields the adjustment formula: $$ \begin{aligned} \mathbb{E}[Y(1)-Y(0)] = & \mathbb{E}[Y(1)]-\mathbb{E}[Y(0)] \notag \\ = & \mathbb{E}_X[\mathbb{E}[Y(1) \mid X]-\mathbb{E}[Y(0) \mid X]] & \text{(law of iterated expectations)}\\ = & \mathbb{E}_X[\mathbb{E}[Y(1) \mid T=1, X]-\mathbb{E}[Y(0) \mid T=0, X]] & \text {(unconfoundedness and positivity)} \\ = & \mathbb{E}_X[\mathbb{E}[Y \mid T=1, X]-\mathbb{E}[Y \mid T=0, X]] & \text{(consistency) } \end{aligned} $$
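A minimal numeric sketch of the adjustment formula, with invented conditional means (unconfoundedness is assumed to hold given $X$, so $E[Y \mid T=t, X=x] = E[Y(t) \mid X=x]$):

```python
# Hypothetical discrete example: X confounds T and Y, and unconfoundedness
# holds given X, so E[Y | T=t, X=x] = E[Y(t) | X=x].
p_x = {0: 0.5, 1: 0.5}
e = {0: 0.2, 1: 0.8}              # propensity P(T=1 | X=x)
mu = {(1, 0): 2.0, (1, 1): 5.0,   # mu[(t, x)] = E[Y | T=t, X=x]
      (0, 0): 1.0, (0, 1): 3.0}

# Adjustment formula: E[Y(1) - Y(0)] = E_X[ E[Y|T=1,X] - E[Y|T=0,X] ]
ate = sum(p_x[x] * (mu[(1, x)] - mu[(0, x)]) for x in p_x)

# The naive contrast E[Y|T=1] - E[Y|T=0] reweights X by P(x | T=t), not P(x)
def p_x_given_t(t):
    w = {x: p_x[x] * (e[x] if t == 1 else 1 - e[x]) for x in p_x}
    s = sum(w.values())
    return {x: w[x] / s for x in p_x}

naive = (sum(p_x_given_t(1)[x] * mu[(1, x)] for x in p_x)
         - sum(p_x_given_t(0)[x] * mu[(0, x)] for x in p_x))
# ate == 1.5 while naive == 3.0: the naive contrast is biased upward here
```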

Propensity Score

Define $e(X) = E[D \mid X]=P(D=1\mid X)$: the probability that an individual with covariates $X$ receives treatment.

Balancing property: $D \perp \kern{-10mu} \perp X \mid e(X)$

Unconfoundedness given the propensity score: $$ D \perp \kern{-10mu} \perp (Y_0, Y_1) \mid e(X) $$ This is equivalent to showing $$ P(D=1 \mid Y_0, Y_1, e(X)) = P(D=1\mid e(X)) $$ The key step is to apply $D \perp \kern{-10mu} \perp (Y_0, Y_1)\mid X$ via the tower property of conditional expectation.

Indeed, $$ \begin{aligned} E[D\mid Y_0, Y_1, e(X)] & = E[E[D\mid Y_0, Y_1, e(X), X] \mid Y_0, Y_1, e(X)] \\ & =E[E[D \mid X] \mid Y_0, Y_1, e(X)] \\ & = E[e(X) \mid Y_0, Y_1, e(X)] \\ & = e(X) \\ & = E[D \mid e(X)] \end{aligned} $$

Viewed through the causal graph: if $D$ is binary, the influence of $X$ on $D$ is fully captured by $e(X)$, so blocking at $e(X)$ controls the association between $D$ and $Y$ induced by $X$.

graph LR
X --> A["e(X)"]
A --> D
X --> Y
D --> Y
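The balancing property can be illustrated exactly on a hypothetical discrete covariate where two values of $X$ happen to share the same propensity score and hence fall in the same $e(X)$ stratum:

```python
# Hypothetical covariate distribution in which two covariate values share
# the same propensity score, so they fall in the same e(X) stratum.
p_x = {"a": 0.25, "b": 0.25, "c": 0.5}
e = {"a": 0.3, "b": 0.3, "c": 0.6}   # e(X) = P(D=1 | X)

def p_x_given(d, score):
    """P(X = x | D = d, e(X) = score), by Bayes on the discrete joint."""
    w = {x: p_x[x] * (e[x] if d == 1 else 1 - e[x])
         for x in p_x if e[x] == score}
    s = sum(w.values())
    return {x: w[x] / s for x in w}

def p_x_given_score(score):
    """P(X = x | e(X) = score)."""
    w = {x: p_x[x] for x in p_x if e[x] == score}
    s = sum(w.values())
    return {x: w[x] / s for x in w}

# Balancing: within the stratum e(X) = 0.3, the distribution of X is the
# same whether or not we additionally condition on D.
for d in (0, 1):
    for x, p in p_x_given(d, 0.3).items():
        assert abs(p - p_x_given_score(0.3)[x]) < 1e-12
```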

Inverse Probability Weighting

Under consistency, unconfoundedness, and positivity, $$ \mathbb{E}[Y(1)] = \mathbb{E}\left[\frac{DY}{e(X)}\right], \qquad \mathbb{E}[Y(0)] = \mathbb{E}\left[\frac{(1-D)Y}{1-e(X)}\right] $$ so that $$ \tau_{ATE} = \mathbb{E}\left[\frac{DY}{e(X)} - \frac{(1-D)Y}{1-e(X)}\right] $$
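Under consistency, unconfoundedness, and positivity, IPW identifies $\mathbb{E}[Y(1)]$ as $\mathbb{E}[DY/e(X)]$; this can be checked by exact enumeration on an invented discrete example — the division by $e(X)$ exactly undoes the treatment-assignment probabilities:

```python
# IPW checked by exact enumeration on a hypothetical discrete example
# (unconfoundedness holds by construction, so E[Y | D=d, X=x] = E[Y(d) | X=x]).
p_x = {0: 0.5, 1: 0.5}
e = {0: 0.2, 1: 0.8}                 # e(X) = P(D=1 | X)
mu1 = {0: 2.0, 1: 5.0}               # E[Y(1) | X=x]
mu0 = {0: 1.0, 1: 3.0}               # E[Y(0) | X=x]

true_ate = sum(p_x[x] * (mu1[x] - mu0[x]) for x in p_x)

# E[ DY/e(X) - (1-D)Y/(1-e(X)) ], taking the expectation over (X, D) exactly:
# given X=x, D=1 occurs w.p. e(x) and contributes mu1(x)/e(x); D=0 occurs
# w.p. 1-e(x) and contributes mu0(x)/(1-e(x)).
ipw_ate = sum(
    p_x[x] * (e[x] * mu1[x] / e[x] - (1 - e[x]) * mu0[x] / (1 - e[x]))
    for x in p_x
)
assert abs(ipw_ate - true_ate) < 1e-12
```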

Causation in Graphs

Basic Structures

Chain (Mediator)

graph LR
X1 --> X2
X2 --> X3

$X_2$ blocks the dependence between $X_1$ and $X_3$, i.e., $X_1 \perp \!\!\! \perp X_3 \mid X_2$. We have $$ \begin{gathered} p(x_1, x_2, x_3) = p(x_1)p(x_2 \mid x_1)p(x_3 \mid x_2) \\ p(x_1, x_3 \mid x_2) = p(x_1 \mid x_2) p(x_3 \mid x_2) \end{gathered} $$

Fork (Common Cause) 🍴

graph LR
X2 --> X1
X2 --> X3

Conditioning on $X_2$ removes the dependence between $X_1$ and $X_3$. For example, let $X_2$ be the temperature, and let $X_1$ and $X_3$ be ice-cream sales and the number of drownings, respectively.

Collider (Common Effect)

graph LR
X1 --> X2
X3 --> X2

$X_1$ and $X_3$ are independent, but given $X_2$ they become dependent.
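This "explaining away" effect can be seen exactly in a tiny example (hypothetical mechanism: two independent fair coins and their logical OR as the collider):

```python
from itertools import product

# Collider demo: X1, X3 ~ independent fair coins, X2 = X1 OR X3.
# Conditioning on the collider X2 induces dependence between X1 and X3.
joint = {}
for x1, x3 in product([0, 1], repeat=2):
    x2 = x1 | x3
    joint[(x1, x2, x3)] = 0.25

def prob(pred):
    return sum(p for xs, p in joint.items() if pred(*xs))

# Marginally, X1 and X3 are independent:
assert prob(lambda x1, x2, x3: x1 == 1 and x3 == 1) == prob(
    lambda x1, x2, x3: x1 == 1) * prob(lambda x1, x2, x3: x3 == 1)

# Given the collider X2 = 1, they are dependent ("explaining away"):
p_x1_given_x2 = prob(lambda x1, x2, x3: x1 == 1 and x2 == 1) / prob(
    lambda x1, x2, x3: x2 == 1)                      # = 2/3
p_x1_given_x2_x3 = prob(lambda x1, x2, x3: x1 == 1 and x2 == 1 and x3 == 1) / prob(
    lambda x1, x2, x3: x2 == 1 and x3 == 1)          # = 1/2
assert abs(p_x1_given_x2 - 2 / 3) < 1e-12
assert abs(p_x1_given_x2_x3 - 0.5) < 1e-12
```

Knowing $X_2 = 1$ and $X_3 = 1$ "explains away" $X_1$: its conditional probability drops back from 2/3 to 1/2.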

d-separation

d-separation is a graphical criterion for determining whether a set of variables $Z$ renders two other variables (or sets of variables) $X$ and $Y$ conditionally independent. If $Z$ d-separates $X$ from $Y$, then $X$ is independent of $Y$ given $Z$, written as $X \perp \!\!\! \perp Y \mid Z$.

A path between $X$ and $Y$ is said to be blocked by a set of nodes $Z$ if at least one of the following conditions holds for a node $W$ on the path:

  1. Chain or Fork: $W$ is in the path $X \dots \rightarrow W \rightarrow \dots Y$ or $X \dots \leftarrow W \rightarrow \dots Y$, and $W$ is in the conditioning set $Z$.
  2. Collider: $W$ is in the path $X \dots \rightarrow W \leftarrow \dots Y$ (i.e., $W$ is a collider on the path), and neither $W$ nor any of its descendants are in the conditioning set $Z$.

Conversely, a path is open (or active) if it is not blocked. A path with a collider $W$ becomes open if $W$ itself or one of its descendants is in the conditioning set $Z$.

Definition of d-separation: Two nodes $X$ and $Y$ are d-separated by a set of nodes $Z$ if every path between $X$ and $Y$ is blocked by $Z$.

Implication: If $X$ and $Y$ are d-separated by $Z$ in a directed acyclic graph (DAG) $G$, then $X \perp \!\!\! \perp Y \mid Z$ for any probability distribution $P$ that is Markovian with respect to $G$.

Summary of Path Blocking/Opening:

  • Chain ($A \rightarrow W \rightarrow B$):
    • Blocked if $W \in Z$.
    • Open if $W \notin Z$.
  • Fork ($A \leftarrow W \rightarrow B$):
    • Blocked if $W \in Z$.
    • Open if $W \notin Z$.
  • Collider ($A \rightarrow W \leftarrow B$):
    • Blocked if $W \notin Z$ AND no descendant of $W$ is in $Z$.
    • Open if $W \in Z$ OR some descendant of $W$ is in $Z$.

Understanding d-separation is fundamental for identifying which variables need to be controlled for (or not controlled for) when trying to estimate causal effects or assess associations from observational data.
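The blocking rules above translate directly into a small checker. This is a sketch of the stated rules, not a standard library API; it enumerates all simple undirected paths, which is fine for small graphs:

```python
# Minimal d-separation checker: enumerate all acyclic undirected paths
# between x and y and test whether each is blocked by Z per the rules above.
def descendants(node, edges):
    """All descendants of `node` in the DAG given by directed `edges`."""
    out, stack = set(), [node]
    while stack:
        n = stack.pop()
        for a, b in edges:
            if a == n and b not in out:
                out.add(b)
                stack.append(b)
    return out

def d_separated(x, y, z, edges):
    z = set(z)
    def neighbors(n):                       # undirected adjacency
        for a, b in edges:
            if a == n:
                yield b
            if b == n:
                yield a
    def paths(cur, path):                   # all simple paths cur -> y
        if cur == y:
            yield path
            return
        for nxt in neighbors(cur):
            if nxt not in path:
                yield from paths(nxt, path + [nxt])
    for path in paths(x, [x]):
        blocked = False
        for i in range(1, len(path) - 1):
            w = path[i]
            into_left = (path[i - 1], w) in edges   # arrow  prev -> w
            into_right = (path[i + 1], w) in edges  # arrow  next -> w
            if into_left and into_right:            # collider on this path
                if w not in z and not (descendants(w, edges) & z):
                    blocked = True
            elif w in z:                            # chain or fork node in Z
                blocked = True
        if not blocked:
            return False                    # found an open path
    return True

# Chain A -> B -> C: conditioning on B blocks the only path.
chain = {("A", "B"), ("B", "C")}
assert d_separated("A", "C", {"B"}, chain)
assert not d_separated("A", "C", set(), chain)

# Collider A -> B <- C with B -> D: open only when B or its descendant D is in Z.
coll = {("A", "B"), ("C", "B"), ("B", "D")}
assert d_separated("A", "C", set(), coll)
assert not d_separated("A", "C", {"B"}, coll)
assert not d_separated("A", "C", {"D"}, coll)
```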

Causal Association vs. Non-causal Association

Causal Association ➡️

A causal association between two variables $X$ and $Y$ exists if one causally influences the other. This is typically represented by a directed path from $X$ to $Y$ (e.g., $X \rightarrow \dots \rightarrow Y$) or $Y$ to $X$ (e.g., $Y \rightarrow \dots \rightarrow X$) in a causal DAG.

Identification: If there's an open directed path between $X$ and $Y$ after conditioning on a set $Z$, the remaining association may be causal.

Non-causal Association (Spurious Association) 👻

A non-causal association (or spurious association) is a statistical dependency between $X$ and $Y$ that does not arise from $X$ causing $Y$ or $Y$ causing $X$.

Using d-separation to distinguish:
  • An observed association between $X$ and $Y$ is causal if it's transmitted along directed paths from $X$ to $Y$ (or $Y$ to $X$) that remain open after appropriate conditioning.
  • An observed association is non-causal (confounding) if it's transmitted along "backdoor paths" (paths that are not directed from $X$ to $Y$ but have an arrow into $X$). These paths often involve common causes. Conditioning on appropriate variables (identified by d-separation) can block these paths.
  • An observed association is non-causal (collider bias) if it's created by inappropriately conditioning on a collider or its descendant, thereby opening a path that was previously blocked.

Causal Models

do-operator

Definition: the do-operator denotes an intervention, i.e., externally setting the variable $X$ to a specific value (overriding its natural generating process), written $do(X=x)$.
Key distinction

  • $P(Y|X=x)$ is the conditional probability of $Y$ upon observing $X=x$ (it includes association through confounding paths)
  • $P(Y|do(X=x))$ is the distribution of $Y$ when $X=x$ is set by force (it retains only the effect along causal paths)

Truncated Factorization $$ P(x_1, \ldots, x_n \mid do(S=s)) = \prod_{i \notin S} P(x_i \mid \mathrm{pa}_i) $$ (for configurations consistent with $S=s$; the probability is zero otherwise). That is, the causal mechanisms of the intervened variables $S$ are deleted while every other mechanism is left intact.

Example: Confounders

graph LR
X --> Y
X --> T
T --> Y

Estimating $P(Y\mid T)$ directly mixes in non-causal association (through the edges $X \rightarrow T$ and $X \rightarrow Y$), so we must adjust for $X$.

Here $X$ is a confounder: $$ P(y, x \mid d o(t))=P(x) P(y \mid t, x) \implies P(y \mid d o(t))=\sum_x P(y \mid t, x) P(x) $$ whereas $$ P(y \mid t) = \sum_{x} P(y, x \mid t) = \sum_x P(y \mid t, x) P(x \mid t) $$ which differs because in general $P(x \mid t) \neq P(x)$.
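The gap between $P(y \mid do(t))$ and $P(y \mid t)$ can be computed exactly for this graph with invented probabilities:

```python
# Hypothetical binary example for the confounder graph X -> T, X -> Y, T -> Y:
# all probabilities below are invented for illustration.
p_x = {0: 0.5, 1: 0.5}
p_t1_x = {0: 0.2, 1: 0.8}                      # P(T=1 | X=x)
p_y1_tx = {(1, 0): 0.6, (1, 1): 0.9,
           (0, 0): 0.3, (0, 1): 0.7}           # P(Y=1 | T=t, X=x)

# Truncated factorization: P(y, x | do(t)) = P(x) P(y | t, x)
p_y1_do_t1 = sum(p_x[x] * p_y1_tx[(1, x)] for x in p_x)              # 0.75

# Plain conditioning: P(y | t) = sum_x P(y | t, x) P(x | t)
p_t1 = sum(p_x[x] * p_t1_x[x] for x in p_x)
p_x_given_t1 = {x: p_x[x] * p_t1_x[x] / p_t1 for x in p_x}
p_y1_given_t1 = sum(p_x_given_t1[x] * p_y1_tx[(1, x)] for x in p_x)  # 0.84

# The two differ because P(x | t) != P(x) when X confounds T.
assert abs(p_y1_do_t1 - 0.75) < 1e-12
assert abs(p_y1_given_t1 - 0.84) < 1e-12
```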

Backdoor Adjustment

Definition of Backdoor Criterion

A set of variables $Z$ satisfies the backdoor criterion if:

  1. It blocks all backdoor paths: every path between $T$ and $Y$ with an arrow into $T$ (such as $T \leftarrow X \rightarrow Y$) is blocked
  2. It introduces no new bias: $Z$ contains no descendant of $T$

Backdoor Adjustment Theorem

If $Z$ satisfies the backdoor criterion, the causal effect is identified as: $$ P(Y=y \mid do(T=t)) = \sum_{z} P(Y=y \mid T=t, Z=z) P(Z=z) $$ i.e., compute the conditional effect within each stratum of the confounders $Z$, then average with weights given by the distribution of $Z$.

Front-Door Criterion

This applies when there is an unobserved confounder and backdoor adjustment cannot be used directly:

graph LR
U((U)) --> T
U --> Y
T --> M
M --> Y
style U fill:#f9f,stroke:#333,stroke-width:1px

($U$ is unobserved, so the usual backdoor adjustment fails)

The front-door criterion

There must exist a set of variables $M$ such that:

  1. Mediation: $T \rightarrow M \rightarrow Y$, and $M$ intercepts every directed path from $T$ to $Y$
  2. No confounding:
    • there is no backdoor path from $T$ to $M$
    • every backdoor path from $M$ to $Y$ is blocked by $T$
Front-door adjustment formula

$$ P(Y=y \mid do(T=t)) = \sum_{m} P(m \mid t) \sum_{t'} P(y \mid t', m) P(t') $$ What the computation does:

  1. Estimate the causal effect of $T$ on $M$: $P(m \mid t)$
  2. Estimate the causal effect of $M$ on $Y$: $\sum_{t'} P(y \mid t', m) P(t')$ (adjusting for confounding through $T$)
  3. Compose the two along $T \rightarrow M \rightarrow Y$
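A sanity check of the front-door formula on a hypothetical SCM for this graph (all probabilities invented): the formula, which uses only observational quantities, must reproduce the interventional distribution computed directly from the mechanism:

```python
from itertools import product

# Hypothetical SCM for the front-door graph (all numbers invented):
# U -> T, U -> Y, T -> M, M -> Y, with U unobserved.
p_u = {0: 0.5, 1: 0.5}
p_t1_u = {0: 0.3, 1: 0.7}                      # P(T=1 | U=u)
p_m1_t = {0: 0.2, 1: 0.8}                      # P(M=1 | T=t)
p_y1_mu = {(0, 0): 0.1, (0, 1): 0.5,
           (1, 0): 0.4, (1, 1): 0.9}           # P(Y=1 | M=m, U=u)

def bern(p, v):                                 # P(V=v) for V ~ Bernoulli(p)
    return p if v == 1 else 1 - p

joint = {(u, t, m, y): p_u[u] * bern(p_t1_u[u], t) * bern(p_m1_t[t], m)
                     * bern(p_y1_mu[(m, u)], y)
         for u, t, m, y in product([0, 1], repeat=4)}

def p(pred):                                    # probability of an event
    return sum(q for xs, q in joint.items() if pred(*xs))

# Ground truth from the SCM: P(y=1 | do(T=1)) = sum_u P(u) sum_m P(m|1) P(y=1|m,u)
truth = sum(p_u[u] * sum(bern(p_m1_t[1], m) * p_y1_mu[(m, u)] for m in (0, 1))
            for u in (0, 1))

# Front-door estimate using only observational quantities (U never appears):
def p_m_given_t(m, t):
    return p(lambda u_, t_, m_, y_: t_ == t and m_ == m) / p(
        lambda u_, t_, m_, y_: t_ == t)

def p_y1_given_tm(t, m):
    return p(lambda u_, t_, m_, y_: t_ == t and m_ == m and y_ == 1) / p(
        lambda u_, t_, m_, y_: t_ == t and m_ == m)

front_door = sum(
    p_m_given_t(m, 1)
    * sum(p(lambda u_, t_, m_, y_: t_ == tp) * p_y1_given_tm(tp, m)
          for tp in (0, 1))
    for m in (0, 1)
)
assert abs(front_door - truth) < 1e-9           # both equal 0.58 here
```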

Randomized Experiments

Field experiments make the distribution of the covariates identical in the treatment and control groups, i.e., $X \perp \!\!\! \perp T$: $$ P(X \mid T=0) \equiv P(X \mid T=1) $$ This is called covariate balance.

Covariate balance implies that association is causation: $$ P(y \mid do(t)) = P(y \mid t) $$

do-calculus

Let $G_{\overline{X}}$ denote the graph $G$ with all edges into $X$ removed, and $G_{\underline{X}}$ the graph with all edges out of $X$ removed. Pearl's three rules are:

  1. Insertion/deletion of observations: $P(y \mid do(x), z, w) = P(y \mid do(x), w)$ if $(Y \perp \!\!\! \perp Z \mid X, W)$ in $G_{\overline{X}}$
  2. Action/observation exchange: $P(y \mid do(x), do(z), w) = P(y \mid do(x), z, w)$ if $(Y \perp \!\!\! \perp Z \mid X, W)$ in $G_{\overline{X}\underline{Z}}$
  3. Insertion/deletion of actions: $P(y \mid do(x), do(z), w) = P(y \mid do(x), w)$ if $(Y \perp \!\!\! \perp Z \mid X, W)$ in $G_{\overline{X}\overline{Z(W)}}$, where $Z(W)$ is the set of nodes in $Z$ that are not ancestors of any node in $W$ in $G_{\overline{X}}$


  1. Hernán, M., Taubman, S. Does obesity shorten life? The importance of well-defined interventions to answer causal questions. Int J Obes 32 (Suppl 3), S8–S14 (2008). https://doi.org/10.1038/ijo.2008.82 
