Training data

HyCastle, the lens and building a training set

In the previous notebook, we showed an end-to-end exemplar of the Hylode platform - but we skated at top speed over some details worth spending more time on.

Here we take a more measured pace and zoom in on HyCastle and the lens. Together these two abstractions make it easy to ask Hylode for a retrospective training set and then to pre-process that training set.

Hylode does this in a way that allows the same underlying code to furnish the live data needed for deployable prediction.

HyCastle

The HyCastle module is the main workhorse for pulling the complete available feature set out of hylode_db (Hylode’s internal database). Having defined our features in HyGear (covered in vignette 3), HyCastle can do two main things:

- it can pick out a training set comprising all the features for each hourly slice for each patient
- it can give us a live set of features for the patients currently on the ward

Let’s try it out…

from hycastle.icu_store.retro import retro_dataset
from hycastle.icu_store.live import live_dataset # <-- includes PII

ward = 'T03'
# the retro_dataset function gives us all the historical episode slices to build up our training set
train_df = retro_dataset(ward)
train_df.shape
# and we can see the various feature columns we have generated
train_df.head()

Then using the same machinery, we can get the corresponding features for the patients currently on the ward.

This matters because the same code generates both our training features and the features we will use when deploying the model, ruling out unwanted surprises from divergence between the two!

predict_df = live_dataset(ward)
predict_df.shape
predict_df['horizon_dt'].head()
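
Since both dataframes are produced from the same feature definitions, every feature column in the training set should also appear in the live set (the live set will additionally carry the PII columns flagged above). A quick sanity check along these lines (a sketch; exactly which live-only columns appear is deployment-specific):

# any training feature absent from the live set would signal divergence
missing = set(train_df.columns) - set(predict_df.columns)
# live-only columns are expected to be the PII / identifier fields
extra = set(predict_df.columns) - set(train_df.columns)
print(f"missing from live set: {missing}")
print(f"live-only columns: {extra}")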

The lens

In the code above, we saw that HyCastle neatly delivers all the features we have pre-defined in hylode_db. But a question naturally arises: what if we want to use only a subset of those features? Or to pre-process them in a specific way?

Won’t this require custom code, exposing us once again to the risk of divergence between training and deployment?

Our answer to this is the lens. It is an abstraction that provides a more robust (transferable) way to subset and pre-process the features coming out of HyCastle. Let’s have a look at a very simple example.

from typing import List

from sklearn.compose import ColumnTransformer

from hycastle.lens.base import BaseLens
from hycastle.lens.transformers import DateTimeExploder


class SimpleLens(BaseLens):
    numeric_output = True
    index_col = "episode_slice_id"

    @property
    def input_cols(self) -> List[str]:
        # the columns the lens pulls through from the HyCastle feature set
        return [
            "episode_slice_id",
            "admission_dt",
        ]

    def specify(self) -> ColumnTransformer:
        # how each selected column is pre-processed
        return ColumnTransformer(
            [
                (
                    "select",
                    "passthrough",
                    [
                        "episode_slice_id"
                    ],
                ),
                (
                    "admission_dt_exp",
                    DateTimeExploder(),
                    ["admission_dt"],
                ),
            ]
        )

Notice that what we really have here is a list of 3-tuples used to initialise the ColumnTransformer (a standard scikit-learn class). For instance, the triple:

                (
                    "admission_dt_exp",
                    DateTimeExploder(),
                    ["admission_dt"],
                )
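
…says: take the admission_dt column, apply a DateTimeExploder() to it, and label the resulting step admission_dt_exp. Each 3-tuple follows scikit-learn’s (name, transformer, columns) convention. To make that concrete, here is a minimal standalone example on a toy dataframe (nothing Hylode-specific):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

toy_df = pd.DataFrame({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})

# (name, transformer, columns): keep "id" as-is, standardise "value"
ct = ColumnTransformer(
    [
        ("select", "passthrough", ["id"]),
        ("value_scaled", StandardScaler(), ["value"]),
    ]
)
ct.fit_transform(toy_df)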

Let’s see what happens when we put this lens to work on the output from HyCastle:

lens = SimpleLens()

X = lens.fit_transform(train_df)
X.head()

…we now have the episode_slice_id for every slice, plus a set of features derived from admission_dt. Recall that in our original HyCastle dataset, admission_dt is a plain series of datetimes:

train_df['admission_dt'].head()

…but after transforming the retro dataframe, we have these additional admission features. This is the work of the triple quoted above and its DateTimeExploder(). Let’s take a look at that code…

??DateTimeExploder
??DateTimeExploder.transform
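
If you are reading this outside the notebook and can’t run ??, the sketch below illustrates what a transformer like DateTimeExploder plausibly does (an illustration only, not Hylode’s actual implementation): it expands each datetime column into numeric component columns, consistent with the output we saw above.

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DateTimeExploderSketch(BaseEstimator, TransformerMixin):
    """Illustrative stand-in for DateTimeExploder, not the real code."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # expand each datetime column into numeric parts
        out = pd.DataFrame(index=X.index)
        for col in X.columns:
            dt = pd.to_datetime(X[col])
            out[f"{col}_year"] = dt.dt.year
            out[f"{col}_month"] = dt.dt.month
            out[f"{col}_day"] = dt.dt.day
            out[f"{col}_hour"] = dt.dt.hour
        return out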

In short, defining a lens means specifying the set of input columns from HyCastle that we want to work with, together with a sequence of column transformations (as a ColumnTransformer object) that precisely defines our pre-processing pathway.

This lens can then be used consistently between model training and deployment.
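
Concretely, the pattern is to fit the lens once on retrospective data, persist it alongside the trained model, and reuse it on live data. The sketch below assumes the lens follows the usual scikit-learn fit/transform convention (which fit_transform above suggests) and uses joblib for persistence purely as an illustration:

from joblib import dump, load

# training time: fit the lens on the retrospective dataset
lens = SimpleLens()
X_train = lens.fit_transform(train_df)
dump(lens, "lens.joblib")  # persist next to the trained model

# deployment time: reload the fitted lens and apply it to live features
lens = load("lens.joblib")
X_live = lens.transform(live_dataset(ward))  # assumes a transform() method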

Appendix 1: A more complete example

Here’s a fuller example of a lens (along the lines of what we will use in the next vignette).

It might be worthwhile using the ?? shortcut to get a sense of the different transformations being applied.

from sklearn.preprocessing import (
    FunctionTransformer,
    OneHotEncoder,
    OrdinalEncoder,
    StandardScaler,
)
from sklearn.impute import MissingIndicator, SimpleImputer
from hycastle.lens.transformers import timedelta_as_hours


class DemoLens(BaseLens):
    numeric_output = True
    index_col = "episode_slice_id"

    @property
    def input_cols(self) -> List[str]:
        return [
            "episode_slice_id",
            "admission_age_years",
            "avg_heart_rate_1_24h",
            "max_temp_1_12h",
            "avg_resp_rate_1_24h",
            "elapsed_los_td",
            "admission_dt",
            "horizon_dt",
            "n_inotropes_1_4h",
            "wim_1",
            "bay_type",
            "sex",
            "vent_type_1_4h",
        ]

    def specify(self) -> ColumnTransformer:
        return ColumnTransformer(
            [
                (
                    "select",
                    "passthrough",
                    [
                        "episode_slice_id",
                        "admission_age_years",
                        "n_inotropes_1_4h",
                        "wim_1",
                    ],
                ),
                ("bay_type_enc", OneHotEncoder(), ["bay_type"]),
                (
                    "sex_enc",
                    OrdinalEncoder(
                        handle_unknown="use_encoded_value", unknown_value=-1
                    ),
                    ["sex"],
                ),
                (
                    "admission_dt_exp",
                    DateTimeExploder(),
                    ["admission_dt", "horizon_dt"],
                ),
                (
                    "vent_type_1_4h_enc",
                    OrdinalEncoder(
                        handle_unknown="use_encoded_value", unknown_value=-1
                    ),
                    ["vent_type_1_4h"],
                ),
                (
                    "vitals_impute",
                    SimpleImputer(strategy="mean", add_indicator=False),
                    [
                        "avg_heart_rate_1_24h",
                        "max_temp_1_12h",
                        "avg_resp_rate_1_24h",
                    ],
                ),
                (
                    "elapsed_los_td_hrs",
                    FunctionTransformer(timedelta_as_hours),
                    ["elapsed_los_td"],
                ),
            ]
        )

lens = DemoLens()

X = lens.fit_transform(train_df)
X.head()
X.dtypes
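
As an aside, timedelta_as_hours (used with FunctionTransformer above) presumably converts a timedelta column into a float count of hours. A plausible sketch of such a helper (not Hylode’s actual code):

import pandas as pd

def timedelta_as_hours_sketch(X: pd.DataFrame) -> pd.DataFrame:
    # convert each timedelta column into a float number of hours
    return X.apply(lambda col: col.dt.total_seconds() / 3600)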