edsteva.models.step_function.algos.loss_minimization

loss_minimization

loss_minimization(
    predictor: pd.DataFrame,
    index: List[str],
    x_col: str = "date",
    y_col: str = "c",
    loss_function: Callable = l2_loss,
) -> pd.DataFrame

Computes the threshold \(t_0\) of a predictor \(c(t)\) by minimizing the following loss function:

\[ \begin{aligned} \mathcal{L}(t_0) & = \frac{\sum_{t = t_{min}}^{t_{max}} \mathcal{l}(c(t), f_{t_0}(t))}{t_{max} - t_{min}} \\ \hat{t_0} & = \underset{t_0}{\mathrm{argmin}}(\mathcal{L}(t_0)) \end{aligned} \]

Here the loss function \(\mathcal{l}\) is the L2 distance by default, \(f_{t_0}\) is the step function with threshold \(t_0\), and the estimated completeness \(c_0\) is the mean completeness after \(t_0\):

\[ \begin{aligned} \mathcal{l}(c(t), f_{t_0}(t)) & = |c(t) - f_{t_0}(t)|^2 \\ c_0 & = \frac{\sum_{t = t_0}^{t_{max}} c(t)}{t_{max} - t_0} \end{aligned} \]
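The minimization above can be illustrated on a single series with a brute-force search. This is a minimal sketch, not the library's implementation: `fit_step` is a hypothetical helper, and it assumes that \(f_{t_0}(t)\) equals \(c_0\) for \(t \geq t_0\) and 0 before, that time points are equally spaced, and that \(c_0\) is the plain average of the points from \(t_0\) onwards. It divides the summed loss by the number of points rather than by \(t_{max} - t_{min}\), which does not change the argmin.

```python
from typing import List, Tuple


def fit_step(c: List[float]) -> Tuple[int, float, float]:
    """Brute-force search for the threshold index minimizing the mean L2 loss.

    Assumption (not stated in the source): f_{t_0}(t) = c_0 if t >= t_0 else 0,
    with equally spaced time points.
    """
    if not c:
        raise ValueError("c must contain at least one value")
    n = len(c)
    best = None
    for k in range(n):  # candidate threshold position t_0 = k
        after = c[k:]
        c_0 = sum(after) / len(after)  # mean completeness from t_0 onwards
        # Squared residuals: the step is 0 before t_0 and c_0 afterwards
        before_loss = sum(v ** 2 for v in c[:k])
        after_loss = sum((v - c_0) ** 2 for v in after)
        loss = (before_loss + after_loss) / n
        if best is None or loss < best[2]:
            best = (k, c_0, loss)
    return best


c = [0.0, 0.0, 0.0, 0.1, 0.9, 1.0, 0.95, 1.0]
k, c_0, loss = fit_step(c)
print(k, round(c_0, 4))  # prints: 4 0.9625
```

On this toy series the loss is smallest when the step is placed at index 4, where the completeness jumps from 0.1 to 0.9.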
| PARAMETER | DESCRIPTION | TYPE | DEFAULT |
| --- | --- | --- | --- |
| predictor | \(c(t)\) computed in the Probe | pd.DataFrame | |
| index | Columns by which the data is grouped. EXAMPLE: ["care_site_level", "stay_type", "note_type", "care_site_id"] | List[str] | |
| x_col | Column name for the time variable \(t\) | str | 'date' |
| y_col | Column name for the completeness variable \(c(t)\) | str | 'c' |
| loss_function | The pointwise loss function \(\mathcal{l}\) | Callable | l2_loss |

Source code in edsteva/models/step_function/algos/loss_minimization.py
def loss_minimization(
    predictor: pd.DataFrame,
    index: List[str],
    x_col: str = "date",
    y_col: str = "c",
    loss_function: Callable = l2_loss,
) -> pd.DataFrame:
    r"""Computes the threshold $t_0$ of a predictor $c(t)$ by minimizing the following loss function:

    $$
    \begin{aligned}
    \mathcal{L}(t_0) & = \frac{\sum_{t = t_{min}}^{t_{max}} \mathcal{l}(c(t), f_{t_0}(t))}{t_{max} - t_{min}} \\
    \hat{t_0} & = \underset{t_0}{\mathrm{argmin}}(\mathcal{L}(t_0))
    \end{aligned}
    $$

    Here the loss function $\mathcal{l}$ is the L2 distance by default, $f_{t_0}$ is the step function with threshold $t_0$, and the estimated completeness $c_0$ is the mean completeness after $t_0$:

    $$
    \begin{aligned}
    \mathcal{l}(c(t), f_{t_0}(t)) & = |c(t) - f_{t_0}(t)|^2 \\
    c_0 & = \frac{\sum_{t = t_0}^{t_{max}} c(t)}{t_{max} - t_0}
    \end{aligned}
    $$


    Parameters
    ----------
    predictor : pd.DataFrame
        $c(t)$ computed in the Probe
    index : List[str]
        Columns by which the data is grouped

        **EXAMPLE**: `["care_site_level", "stay_type", "note_type", "care_site_id"]`
    x_col : str, optional
        Column name for the time variable $t$
    y_col : str, optional
        Column name for the completeness variable $c(t)$
    loss_function : Callable, optional
        The pointwise loss function $\mathcal{l}$
    """
    check_columns(df=predictor, required_columns=[*index, x_col, y_col])
    predictor = predictor.sort_values(x_col)
    cols = [*index, x_col, y_col]
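    # One group per unique combination of the index columns
    groups = predictor[cols].groupby(index)
    results = []
    for partition, group in tqdm.tqdm(groups):
        # A single grouping key may come back as a scalar: normalize it to a tuple
        if not isinstance(partition, tuple):
            partition = (partition,)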
        row = dict(zip(index, partition))
        t_0, c_0 = _compute_one_threshold(
            group,
            x_col,
            y_col,
            loss_function,
        )
        row["t_0"] = t_0
        row["c_0"] = c_0
        results.append(row)

    return pd.DataFrame(results)
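
The shape of the returned DataFrame (one row per group, with the index columns followed by `t_0` and `c_0`) can be seen on toy data. The sketch below mirrors the grouping loop above but is not the library code: `_compute_one_threshold` is not shown in this listing, so `toy_threshold` is a hypothetical stand-in that simply takes the first date where completeness exceeds 0.5, and `thresholds_per_group` and the sample data are likewise made up for illustration.

```python
import pandas as pd


def toy_threshold(group: pd.DataFrame, x_col: str, y_col: str):
    """Hypothetical stand-in for `_compute_one_threshold` (not the real algorithm)."""
    above = group[group[y_col] > 0.5]
    if above.empty:
        return pd.NaT, float("nan")
    t_0 = above[x_col].iloc[0]  # first date with completeness above 0.5
    c_0 = group.loc[group[x_col] >= t_0, y_col].mean()  # mean completeness after t_0
    return t_0, c_0


def thresholds_per_group(predictor, index, x_col="date", y_col="c"):
    predictor = predictor.sort_values(x_col)
    rows = []
    for partition, group in predictor.groupby(index):
        # Depending on the pandas version, a one-column index may yield a scalar key
        if not isinstance(partition, tuple):
            partition = (partition,)
        row = dict(zip(index, partition))
        row["t_0"], row["c_0"] = toy_threshold(group, x_col, y_col)
        rows.append(row)
    return pd.DataFrame(rows)


dates = pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01"])
predictor = pd.DataFrame(
    {
        "care_site_id": ["A"] * 4 + ["B"] * 4,
        "date": list(dates) * 2,
        "c": [0.0, 0.1, 0.9, 1.0, 0.0, 0.8, 0.9, 1.0],
    }
)
result = thresholds_per_group(predictor, index=["care_site_id"])
print(result)
```

Each care site gets its own threshold: with this stand-in rule, site A switches on in March 2020 and site B in February 2020.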