Training data
HyCastle, the lens and building a training set

from hycastle.icu_store.retro import retro_dataset
from hycastle.icu_store.live import live_dataset  # <-- includes PII

ward = 'T03'
In the previous notebook, we showed an end-to-end exemplar of the Hylode platform, but we skated at top speed over some details worth spending more time on.
Here we take a more measured pace and zoom in on HyCastle and the lens. Together these two abstractions make it easy to ask Hylode for a retrospective training set and then to pre-process that training set.
Hylode does this in a way that allows the same underlying code to furnish the live data needed for deployable prediction.
HyCastle
The HyCastle module is the main workhorse for pulling the complete available feature set out of hylode_db (Hylode's internal database). Having defined our features in HyGear (covered in vignette 3), HyCastle can do two main things:
- it can pick out a training set comprising all the features for each hourly slice for each patient
- it can give us a live set of features for the patients currently on the ward
Let’s try it out…
# the retro_dataset function gives us all the historical episode slices to build up our training set
train_df = retro_dataset(ward)
train_df.shape
# and we can see the various feature columns we have generated
train_df.head()
Then using the same machinery, we can get the corresponding features for the patients currently on the ward.
This is important because the same code generates both our training features and the features we will use when deploying the model, ruling out unwanted surprises from divergence between the two!
predict_df = live_dataset(ward)
predict_df.shape

predict_df['horizon_dt'].head()
The lens
In the code above, we saw that HyCastle is very nifty in delivering us all the features we have pre-defined in hylode_db. But the question naturally arises: what if we want to use a subset of those features? Or to pre-process them in a specific way? Will this not require custom code, exposing us to the same risk of code divergence between training and deployment?
Our answer to this is the lens. It is an abstraction that provides a more robust (transferable) way to subset and pre-process the features coming out of HyCastle. Let's have a look at a very simple example.
from hycastle.lens.base import BaseLens
from typing import List
from sklearn.compose import ColumnTransformer
from hycastle.lens.transformers import DateTimeExploder
class SimpleLens(BaseLens):
    numeric_output = True
    index_col = "episode_slice_id"

    @property
    def input_cols(self) -> List[str]:
        return [
            "episode_slice_id",
            "admission_dt",
        ]

    def specify(self) -> ColumnTransformer:
        return ColumnTransformer(
            [
                ("select", "passthrough", ["episode_slice_id"]),
                (
                    "admission_dt_exp",
                    DateTimeExploder(),
                    ["admission_dt"],
                ),
            ]
        )
Notice that what we really have here is a list of 3-tuples used to initialise the ColumnTransformer (which is a standard SKLearn class). For instance, the triple:

(
    "admission_dt_exp",
    DateTimeExploder(),
    ["admission_dt"],
)

names the transformation admission_dt_exp, applies the DateTimeExploder() transformer, and targets the admission_dt column.
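To make the (name, transformer, columns) convention concrete outside of Hylode, here is a minimal standalone sketch using only stock scikit-learn and pandas pieces (toy_df and its column names are purely illustrative, not part of the platform):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# a toy frame standing in for a HyCastle output
toy_df = pd.DataFrame({"id": [1, 2, 3], "hr": [60.0, 80.0, 100.0]})

ct = ColumnTransformer(
    [
        # (name, transformer, columns): keep "id" untouched, standardise "hr"
        ("select", "passthrough", ["id"]),
        ("hr_scaled", StandardScaler(), ["hr"]),
    ]
)
ct.fit_transform(toy_df)  # -> array with the id column plus a standardised hr column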
Let's see what happens when we put this lens to work on the output from HyCastle…
lens = SimpleLens()

X = lens.fit_transform(train_df)
X.head()
…basically we seem to have the episode_slice_id for every slice, and then a bunch of features about the admission_dt. In our original HyCastle dataset, we notice that admission_dt is a series of datetimes:

train_df['admission_dt'].head()
…but after we have transformed the retro dataframe, we have these additional admission features. This is thanks to the triple quoted above and the DateTimeExploder(). Let's have a look to see what that code looks like…
??DateTimeExploder
??DateTimeExploder.transform
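For readers without a running notebook to hand, a transformer along these lines might look roughly like the sketch below. This is an illustrative reconstruction, not the actual hycastle source; the real DateTimeExploder may emit a different set of fields:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DateTimeExploderSketch(BaseEstimator, TransformerMixin):
    """Illustrative stand-in: explode datetime columns into numeric parts."""

    def fit(self, X, y=None):
        # stateless transformer, nothing to learn
        return self

    def transform(self, X):
        out = pd.DataFrame(index=X.index)
        for col in X.columns:
            dt = pd.to_datetime(X[col])
            out[f"{col}_year"] = dt.dt.year
            out[f"{col}_month"] = dt.dt.month
            out[f"{col}_day"] = dt.dt.day
            out[f"{col}_hour"] = dt.dt.hour
        return out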
In short, defining a lens means specifying a set of input columns from HyCastle that we want to work with, together with a sequence of column transformations (as a ColumnTransformer object) that pins down our pre-processing pathway. This lens can then be used consistently between model training and deployment.
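If the lens follows the usual scikit-learn fit/transform contract, that consistency works out as below. Note that only fit_transform appears in this vignette, so calling transform on the live frame is an assumption on our part:

lens = SimpleLens()
X_train = lens.fit_transform(train_df)  # fit and transform the retrospective set
X_live = lens.transform(predict_df)     # assumption: re-apply the same fitted lens to live data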
Appendix 1: A more complete example
Here's a fuller example of a lens (along the lines of what we will use in the next vignette).
It might be worthwhile using the ?? shortcut to get a sense of the different transformations being applied.
from sklearn.preprocessing import (
    FunctionTransformer,
    OneHotEncoder,
    OrdinalEncoder,
    StandardScaler,
)
from sklearn.impute import MissingIndicator, SimpleImputer
from hycastle.lens.transformers import timedelta_as_hours
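As an aside, timedelta_as_hours is presumably a small helper that converts a timedelta column into a float count of hours. A rough guess at what it might do (an illustrative sketch, not the hycastle source):

import pandas as pd

def timedelta_as_hours_sketch(X):
    # convert every timedelta column in the frame to a float number of hours
    return X.apply(lambda col: col.dt.total_seconds() / 3600.0)

demo = pd.DataFrame({"elapsed_los_td": pd.to_timedelta(["90min", "2 days"])})
timedelta_as_hours_sketch(demo)  # -> 1.5 and 48.0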
class DemoLens(BaseLens):
    numeric_output = True
    index_col = "episode_slice_id"

    @property
    def input_cols(self) -> List[str]:
        return [
            "episode_slice_id",
            "admission_age_years",
            "avg_heart_rate_1_24h",
            "max_temp_1_12h",
            "avg_resp_rate_1_24h",
            "elapsed_los_td",
            "admission_dt",
            "horizon_dt",
            "n_inotropes_1_4h",
            "wim_1",
            "bay_type",
            "sex",
            "vent_type_1_4h",
        ]

    def specify(self) -> ColumnTransformer:
        return ColumnTransformer(
            [
                (
                    "select",
                    "passthrough",
                    [
                        "episode_slice_id",
                        "admission_age_years",
                        "n_inotropes_1_4h",
                        "wim_1",
                    ],
                ),
                ("bay_type_enc", OneHotEncoder(), ["bay_type"]),
                (
                    "sex_enc",
                    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
                    ["sex"],
                ),
                (
                    "admission_dt_exp",
                    DateTimeExploder(),
                    ["admission_dt", "horizon_dt"],
                ),
                (
                    "vent_type_1_4h_enc",
                    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
                    ["vent_type_1_4h"],
                ),
                (
                    "vitals_impute",
                    SimpleImputer(strategy="mean", add_indicator=False),
                    [
                        "avg_heart_rate_1_24h",
                        "max_temp_1_12h",
                        "avg_resp_rate_1_24h",
                    ],
                ),
                (
                    "elapsed_los_td_hrs",
                    FunctionTransformer(timedelta_as_hours),
                    ["elapsed_los_td"],
                ),
            ]
        )
lens = DemoLens()

X = lens.fit_transform(train_df)
X.head()
X.dtypes
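Finally, for the fitted lens to serve deployment as well as training, it has to travel with the model. One way this could be done is a simple pickle round-trip (a sketch only; the platform may well provide its own persistence mechanism, and the filename here is illustrative):

import pickle

# persist the fitted lens so the deployment side can apply
# exactly the same pre-processing to live_dataset() output
with open("demo_lens.pkl", "wb") as f:
    pickle.dump(lens, f)

with open("demo_lens.pkl", "rb") as f:
    deployed_lens = pickle.load(f)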