META-LEARNING IN MEDICINE
A Specialization Project Report
Presented to the Faculty of the Graduate School
of Cornell University
in Partial Fulfillment of the Requirements for the Degree of
MSc.
by
Yong Huang, Pargol Gheissari
May 2020
© 2020 Yong Huang, Pargol Gheissari
ALL RIGHTS RESERVED
ABSTRACT
In recent years, the amount of digital information stored in electronic health
records (EHRs) has increased dramatically. At the same time, advances in
machine learning, specifically deep learning, have created an opportunity for
knowledge discovery and data mining algorithms to gain insight from this
digital health data. Predictive modeling of clinical risks from EHRs, such as
in-hospital mortality, in-hospital length of stay and chronic disease onset,
can help improve the quality of healthcare delivery. However, this clinical
data poses many challenges, such as sparsity, irregularity and temporality.
These challenges provide an opportunity for meta-learning methodologies to
have a large impact on medicine and the quality of healthcare delivery. In this
paper, we provide the background of this problem, review the commonly used
strategies for solving it, and discuss state-of-the-art meta-learning models.
To address the clinical challenges associated with EHR data, we propose a
meta-learning model, which uses latent-ODE as the base-learner and LSTM as the
meta-learner, to solve disease phenotyping tasks. We then demonstrate that our
proposed method outperforms the state-of-the-art models addressing
classification tasks on healthcare data.
ACKNOWLEDGEMENTS
We thank all advisors and staff who supported us throughout this process,
especially Dr. Deborah Estrin, Dr. Fei Wang, Dr. Xi Sheryl Zhang and
Dr. Calvin Zang.
TABLE OF CONTENTS
1 Introduction
2 Related Work
2.1 Challenges and Solutions Associated with EHR Systems
2.2 Neural ODE on EHR
2.3 Meta-learning
3 Methods
3.1 Base-Learner
3.1.1 ODE-RNN
3.2 Latent ODE
3.3 Meta-Learner
3.3.1 Meta-LSTM
3.3.2 Our Modification
4 Results
4.1 Data Pre-Processing
4.2 Meta-Learner Experiments
4.3 Base-Learner Experiments
4.4 Disease Classification in a Normal Data Regime
4.5 Disease classification in few-shot learning setting
5 Discussion
5.1 Future work
6 Conclusion
Bibliography
LIST OF TABLES
4.1 5-shot 5-class Accuracy on MiniImageNet
4.2 Mortality prediction on Physionet
4.3 5-shot 5-class experiments on MIMIC-III
LIST OF FIGURES
2.1 Common tasks using EHR data and benchmark models for these tasks
2.2 Continuous dynamics of ODE network [4]
2.3 Example of meta-learning setup [15]
3.1 LSTM as a meta-learner
4.1 AUC of tested models on acute diseases within MIMIC-III
CHAPTER 1
INTRODUCTION
Each year over 30 million patients visit hospitals in the United States, and
83% of these hospitals use an electronic health record (EHR) system [8]. With
the surge in the availability of digital clinical data, interest in using data
mining algorithms to improve the quality of healthcare delivery has increased
significantly. Predictive modeling of clinical risks from patient EHRs, such as
in-hospital mortality, length of stay, chronic disease onset and phenotype
classification, has attracted attention in this field [26]. Accurate clinical
risk prediction models can help clinicians identify potential risks at an early
stage and take appropriate action in a timely manner, thereby improving
healthcare delivery.
Although there has been steady growth in the development of such algorithms,
including both conventional approaches and deep learning models, several
obstacles have slowed progress in harnessing digital health data. In medical
settings, labeled data samples are typically limited and are often very
expensive to obtain. In addition, sequentiality, sparsity, noisiness and
irregularity are among the challenges of working with EHR data [26].
Furthermore, there is accumulating evidence that many prediction tasks are
interrelated. For instance, the highest-risk and highest-cost patients are
often those with complex comorbidities, while decompensating patients have a
higher risk of poor outcomes. Thus, making efficient use of limited patient
samples becomes essential to accurately predict complicated clinical risks
[2][8].
Another challenge in this field is the absence of widely accepted bench-
marks for evaluating competing models. Such benchmarks accelerate progress
in machine learning by bringing the community into focus and facilitating re-
producibility and competition. For example, the winning error rate in the Ima-
geNet Large Scale Visual Recognition Challenge (ILSVRC) plummeted an order
of magnitude from 2010 (0.2819) to 2016 (0.02991). In contrast, practical progress
in clinical machine learning has been difficult to measure due to variability in
data sets and task definitions [8].
Our machine learning techniques, similar to human intelligence, should be
able to learn and adapt quickly from a few examples and continue to adapt as
more data becomes available [23]. Meta-learning, also known as learning to
learn, addresses this problem [24] [15]. It is the science of systematically ob-
serving how different machine learning approaches perform on a wide range of
learning tasks, and then learning from this experience [7] [22]. Although meta-
learning algorithms have been explored in applications such as robotics and
neural machine translation to address the similar problem of limited samples,
the application of these algorithms to medical problems is rarely explored. In
order to address the problem of learning from a few examples, we reviewed
the commonly used strategies in modern state-of-the-art meta-learning meth-
ods and constructed a meta-learning model to address the task of phenotyping
in ICU data.
CHAPTER 2
RELATED WORK
In this section, we systematically explore the literature to elucidate and
summarize the application of meta-learning in clinical settings, specifically
on EHR datasets. This allowed us to discover potential gaps and areas for
innovation in this field. We identified the research questions and the relevant
methods used to address them, which subsequently allowed us to develop an
innovative model for solving this problem. We first discuss the solutions
proposed to date that address some of the challenges of EHR systems. Next, we
discuss state-of-the-art meta-learning models and the application of certain
models to EHR data.
2.1 Challenges and Solutions Associated with EHR Systems
EHRs are a comprehensive collection of patient care details, such as medication
orders, procedures, lab tests, and diagnoses. They are an important source of
information for healthcare technology [6]. The patient records in these
databases are usually recorded in thorough detail and include numerical,
structured, and unstructured textual data. The sources of numerical data are
measurements such as heart rate, blood pressure, and clinical test results. The
structured data are in the form of medical codes associated with each patient
record. These codes are manually assigned by the clinical decision maker for
billing and administration purposes. The unstructured data comes from the
clinical notes, which contain detailed natural language descriptions of the
healthcare provided to the patients during the admission [12].
As publicly available data for running experiments in clinical settings is
difficult to obtain, many studies use the public Medical Information Mart for
Intensive Care (MIMIC) data. This data is associated with over forty thousand
patients who stayed in critical care units of the Beth Israel Deaconess Medical
Center between 2001 and 2012, and includes information such as demographics,
vital sign measurements, laboratory test results, procedures, medications,
imaging reports, and mortality, both in and out of hospital. Recently,
investigators have also explored how interpretation mechanisms for deep
learning models could be applied to clinical predictions [9]. The most common
tasks involving EHR data include, but are not limited to, information
extraction, representation learning, outcome prediction, phenotyping and
de-identification. A more detailed description of these tasks and benchmark
methods can be seen in figure 2.1.
Figure 2.1: Common tasks using EHR data and benchmark models for
these tasks
Deep learning approaches have achieved significant results in many of these
tasks, such as medical concept representation. Efficient representations of
medical concepts are an important element in healthcare applications. Medical
concepts contain rich latent relationships that cannot be represented by simple
one-hot coding. For example, pneumonia and bronchitis are clearly more related
than pneumonia and obesity. In one-hot coding, such relationships between
different codes are not represented [5]. Recently, studies using deep learning
approaches in this field have demonstrated significant improvement in the
performance of various predictive models without the need for medical
expertise. Methods such as word embedding, recurrent neural networks (RNN),
convolutional neural networks (CNN) and stacked denoising autoencoders (SDA)
have been developed [6]. These models do not require expert-engineered features
and have been reported to perform significantly better than logistic regression
or multilayer perceptron models.
Nonetheless, efficiently learning the representations of healthcare concepts
remains a challenge. First, EHR data have a unique structure where the visits
are ordered with respect to time but the medical codes within a visit are
unordered. As there is a sequential relationship between the visits, the
sequence of visits cannot be captured by simply aggregating code-level
representations. Second, while the interpretability of state-of-the-art
representation learning methods in healthcare is essential, many of these
models, such as recurrent neural networks (RNN), are difficult to interpret.
Finally, the algorithm should be scalable enough to handle large EHR datasets
with hundreds of thousands of patients and millions of visits [5]. Moreover,
the good performance of current deep learning methods relies heavily on the
amount of high-quality labeled data. In many prediction tasks, labeled data are
desired but very limited. As discussed above, meta-learning approaches have
strong potential for solving this problem. Modern meta-learning algorithms,
such as Model-Agnostic Meta-Learning, enable us to quickly adapt to new tasks
and make accurate predictions with few examples. In the following sections, we
discuss the most commonly used strategies in meta-learning approaches.
2.2 Neural ODE on EHR
RNNs are currently the dominant model class for high-dimensional,
regularly-sampled time series data. However, EHR data typically consists of
sporadically observed longitudinal patient data with no standard method for
aligning patient trajectories, so standard RNN models, which assume regularly
spaced inputs, are a poor fit. Neural ODEs, on the other hand, are essentially
a continuous version of neural networks, which makes them well suited to
modeling such time series. Thus, the difficulties associated with the analysis
of these datasets provide an opportunity for neural ODEs to have a large
impact. In these models, instead of specifying a discrete sequence of hidden
layers, the continuous dynamics of the hidden units are parameterized using an
ordinary differential equation (ODE) specified by a neural network [4].
These models have several benefits, including memory efficiency, adaptive
computation, parameter efficiency, scalable and invertible normalizing flows,
and continuous time-series models.
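As a minimal sketch of the idea, the code below integrates a hidden state under dynamics dh/dt = f(h, t) with a fixed-step Euler solver. The one-layer `dynamics` function, the weight values and the step count are illustrative assumptions, not the setup of [4], which uses adaptive solvers and trains f with the adjoint method (neither is shown here).

```python
import math

def dynamics(h, t, weights):
    """Toy stand-in for the neural network f(h, t): a single tanh layer."""
    return [math.tanh(sum(w * x for w, x in zip(row, h))) for row in weights]

def odesolve_euler(f, h0, t0, t1, weights, steps=100):
    """Integrate dh/dt = f(h, t) from t0 to t1 with fixed-step Euler."""
    h, t = list(h0), t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        dh = f(h, t, weights)
        h = [hi + dt * dhi for hi, dhi in zip(h, dh)]
        t += dt
    return h

# Illustrative weights (assumption): the state can be queried at any time t1,
# which is what makes the approach attractive for irregularly sampled data.
WEIGHTS = [[0.0, 1.0], [-1.0, 0.0]]
h_end = odesolve_euler(dynamics, [0.5, -0.2], 0.0, 1.0, WEIGHTS)
h_none = odesolve_euler(dynamics, [0.5, -0.2], 0.3, 0.3, WEIGHTS)  # zero-length gap
```

Because the solver takes arbitrary start and end times, no binning or alignment of observation times is required; a zero-length interval simply leaves the state unchanged.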
Figure 2.2: Continuous dynamics of ODE network [4]
2.3 Meta-learning
Meta-learning, also known as learning to learn, essentially allows for machines
to learn new skills and concepts fast with a few training examples. A good
meta-learning model is expected to be capable of adapting to new tasks and
new environments quickly, even if it has never seen those before during training
time. Meta-learning methods have a wide range of applications in supervised
learning and reinforcement learning. Few-shot classification problems are
examples of meta-learning in the supervised learning setting. In this paper, we
focus on discussing meta-learning in a supervised learning setting.
In a typical machine learning setting, the dataset D is split such that the
parameters θ are optimized on the training set D_train and their generalization
is evaluated on the test set D_test. In a few-shot learning setting, on the
other hand, we work with meta-sets D_meta that contain multiple regular
datasets, where each D ∈ D_meta has a split of D_train (support set) and D_test
(query set). Typically, a K-shot N-class classification task is considered in
such problems, where the support set contains K labelled examples for each of N
classes. The objective during meta-training is to learn an efficient learning
procedure (the meta-learner) that can produce a classifier (the base-learner)
with high average classification performance on its corresponding test set
D_test. During meta-testing, the generalization performance is evaluated on
D_meta-test, which contains labels not seen during the meta-training process.
An example of this setup can be seen in figure 2.3. In this figure, the top
represents the meta-training set D_meta-train, where inside each gray box is a
separate dataset that consists of the training set D_train (left side of the
dashed line) and the test set D_test (right side of the dashed line). In this
illustration, we are considering the 1-shot, 5-class classification task where,
for each dataset, we have one example from each of 5 classes (each given a
label 1-5) in the training set and 2 examples for evaluation in the test set.
The meta-test set D_meta-test is defined in the same way, but with a different
set of datasets that cover classes not present in any of the datasets in
D_meta-train (similarly, we additionally have a meta-validation set that is
used to determine hyper-parameters) [15].
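The episode structure described above can be sketched as follows. This is a minimal illustration, not the authors' data pipeline: the function name `sample_episode`, the toy data and the convention that `n_query` counts query examples per class are all assumptions.

```python
import random

def sample_episode(data_by_class, n_way, k_shot, n_query, rng):
    """Draw one N-way K-shot episode: (support, query) lists of (x, label)."""
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        # Sample without replacement so support and query never overlap.
        examples = rng.sample(data_by_class[cls], k_shot + n_query)
        support += [(x, label) for x in examples[:k_shot]]
        query += [(x, label) for x in examples[k_shot:]]
    return support, query

# Hypothetical toy data: 8 classes with 10 examples each.
toy = {f"class_{c}": [f"x_{c}_{i}" for i in range(10)] for c in range(8)}
rng = random.Random(0)
support, query = sample_episode(toy, n_way=5, k_shot=1, n_query=3, rng=rng)
```

Meta-training and meta-testing would call such a sampler on disjoint class pools, so that the classes seen at meta-test time are new to the meta-learner.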
Figure 2.3: Example of meta-learning setup [15]
Overall, modern meta-learning models can be categorized into three different
types: metric-based meta-learning [20][21][16][11][3], model-based
meta-learning [14][25][18], and optimization-based meta-learning [16]. Most
recent meta-learning research has focused on optimization-based models.
Traditional gradient-based optimization for training deep models is not
designed to deal with a small number of samples or to converge within a small
number of optimization steps. Optimization-based meta-learning methodologies
attempt to solve this problem by designing special optimization algorithms that
are able to learn a good initialization of the model's parameters with a small
amount of data and a small number of optimization steps. One benefit of
optimization-based meta-learning models is that they explicitly introduce the
concepts of base-learner and meta-learner and decouple the functions of the
two. This allows flexibility in the design of the base-learner, so that
meta-learning methods can be applied not just to few-shot image classification
problems, but also to many other problems. In meta-learning, the base-learner
is the learning component of the model that learns how to perform a specific
task. For instance, a base-learner can be a convolutional neural network that
learns to classify images. The meta-learner is another learning component of
the model that learns how to update the base-learner's parameters according to
the loss and gradient information from the base-learner on the support set.
CHAPTER 3
METHODS
3.1 Base-Learner
Problems associated with EHR data, such as sparsity and irregularity, create
the opportunity for finding new methodologies that can help mitigate these
problems. To be more specific, EHR data is recorded in a sporadic manner, which
means that the time intervals between observations are not fixed. A simple way
to handle irregularly-timed samples is to include the time gap between
observations in the RNN update function; however, this approach assumes that
the hidden state from the previous observation time-point can be used directly
in the next hidden state update step. An alternative to this approach is to
introduce an exponential decay of the hidden state between observations. Recent
research on neural ODEs has introduced a more accurate way to model the hidden
state dynamics. In our work, we follow one key observation from a previous
study: an RNN with an exponentially-decayed hidden state implicitly obeys the
following ODE [13]:
\[
\frac{dh(t)}{dt} = -\tau h, \qquad h(t_0) = h_0
\]
The recently proposed ODE-RNN and Latent ODE models [17] provide a more
efficient and explicable way to model the hidden state update. We adapted these
two ODE-based models as our base-learners and applied them to the disease
classification task.
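The decay ODE above has the closed-form solution h(t) = h_0 exp(−τ(t − t_0)), which is why an exponentially decayed hidden state is a special case of ODE-driven hidden dynamics. The following sketch checks this numerically; the values of τ, h_0 and the time span are illustrative assumptions.

```python
import math

def decay_closed_form(h0, tau, t0, t):
    """Exact solution of dh/dt = -tau * h with h(t0) = h0."""
    return h0 * math.exp(-tau * (t - t0))

def decay_euler(h0, tau, t0, t, steps=10000):
    """Numerical solution of the same ODE with fixed-step Euler."""
    h, dt = h0, (t - t0) / steps
    for _ in range(steps):
        h += dt * (-tau * h)
    return h

exact = decay_closed_form(1.0, 0.5, 0.0, 3.0)   # equals exp(-1.5)
approx = decay_euler(1.0, 0.5, 0.0, 3.0)
```

Replacing the fixed right-hand side −τh with a learned function is the step that leads from the decay heuristic to the ODE-based models used here.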
3.1.1 ODE-RNN
As mentioned, time series data with non-uniform intervals, especially in
medical settings, are difficult to model using standard autoregressive models
such as RNNs. These models are often sufficient for densely sampled data, but
perform worse when observations are sparse. ODE-RNN models the evolution of the
hidden state between observations with a neural ODE and updates it with a
standard RNN cell at each observation. A detailed description of ODE-RNN is
given in Algorithm 1.
Algorithm 1: ODE-RNN
Input: Data points and their timestamps {(x_i, t_i)}_{i=1..N}
Output: Output at each timestamp and the last hidden state
procedure ODE-RNN
    h_0 ← 0
    for i in 1, 2, ..., N do
        h'_i = ODESolve(f_θ, h_{i−1}, (t_{i−1}, t_i))
        h_i = GRUCell(h'_i, x_i)
    o_i = OutputNN(h_i) for all i = 1..N
    return {o_i}_{i=1..N}; h_N
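A runnable sketch of the loop in Algorithm 1 is given below, under simplifying assumptions that are not part of the original method: a fixed-step Euler solver replaces the ODE solver, a plain tanh cell stands in for the GRU cell, the output network is a sum, the dynamics f are a hand-written decay rather than a learned network, and the first interval is taken to have zero length.

```python
import math

def ode_solve(f, h, t_start, t_end, steps=20):
    """Fixed-step Euler integration of dh/dt = f(h) from t_start to t_end."""
    dt = (t_end - t_start) / steps
    for _ in range(steps):
        h = [hi + dt * di for hi, di in zip(h, f(h))]
    return h

def rnn_cell(h, x, w_h=0.5, w_x=1.0):
    """Toy tanh cell standing in for the GRU cell of Algorithm 1."""
    return [math.tanh(w_h * hi + w_x * x) for hi in h]

def ode_rnn(xs, ts, f, hidden_size=2):
    """Evolve h with the ODE between observations, update it at each one."""
    h = [0.0] * hidden_size
    t_prev = ts[0]                      # assumption: t_0 coincides with t_1
    outputs = []
    for x, t in zip(xs, ts):
        h_prime = ode_solve(f, h, t_prev, t)   # h'_i from h_{i-1}
        h = rnn_cell(h_prime, x)               # h_i from h'_i and x_i
        outputs.append(sum(h))                 # o_i, here a plain sum
        t_prev = t
    return outputs, h

decay = lambda h: [-hi for hi in h]            # toy dynamics f(h) = -h
outputs, h_last = ode_rnn([0.3, -0.1, 0.8], [0.0, 0.4, 2.5], decay)
```

Note that the observation times 0.0, 0.4 and 2.5 are unevenly spaced; the gap only changes how long the hidden state is integrated before the next update.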
3.2 Latent ODE
RNNs can be generalized to have continuous-time hidden dynamics defined by
ODEs to address such difficulties in modeling. Latent ODEs define a generative
process over time series based on the deterministic evolution of an initial latent
state, and can be trained as a variational autoencoder [10]. Latent-ODEs can
handle time gaps between observations, and remove the need to group obser-
vations into equally-timed bins [17].
In 2018, Chen et al. proposed a latent-variable time series model, where the
generative model is defined by an ODE whose initial latent state z_0 determines
the entire trajectory [4]:
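\[
z_0 \sim p(z_0)
\]
\[
z_0, z_1, \ldots, z_N = \mathrm{ODESolve}(f_\theta, z_0, (t_0, t_1, \ldots, t_N))
\]
\[
\text{each } x_i \overset{\text{indep.}}{\sim} p(x_i \mid z_i), \quad i = 0, 1, \ldots, N
\]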
In this model, a variational autoencoder framework is used for both train-
ing and prediction, which is essentially an encoder-decoder architecture. The
input to the encoder is a variable-length sequence, such as time series data, that
is then encoded into a fixed-dimensional embedding. The output from the en-
coder is then decoded to another variable-length sequence and the trajectory is
reconstructed. There are several benefits to using a latent variable framework.
First, this framework allows for examining the dynamics of the ODE system,
the likelihood of observations, and the recognition model, separately without
any association to each other. Second, the posterior distribution over latent
states provides an explicit measure of uncertainty, which is not available in stan-
dard RNNs and ODE-RNNs. Finally, it becomes easier to answer non-standard
queries, such as making predictions backwards in time, or conditioning on a
subset of observations.
Algorithm 2: Latent-ODE
Input: Data points and their timestamps {(x_i, t_i)}_{i=1..N}
procedure LATENT-ODE
    z'_0 = ODE-RNN({x_i}_{i=1..N})
    µ_{z_0}, σ_{z_0} = g_µ(z'_0), g_σ(z'_0)      ▷ g_µ and g_σ are MLPs
    z_0 ∼ N(µ_{z_0}, σ_{z_0})
    {z_i} = ODESolve(f, z_0, (t_0 ... t_N))
    x̃_i = OutputNN(z_i) for all i = 1..N
    return {x̃_i}_{i=1..N}
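The steps of Algorithm 2 can be sketched with a scalar latent state as follows. Everything concrete here is an assumption made for illustration: `encode` is a hand-written stand-in for the ODE-RNN encoder and the MLPs g_µ and g_σ, the dynamics are a fixed decay, the solver is fixed-step Euler, and the decoder is a simple scaling.

```python
import math
import random

def encode(xs):
    """Stand-in for the ODE-RNN encoder: returns (mu, sigma) of q(z0 | x)."""
    mean = sum(xs) / len(xs)
    spread = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return mean, 0.1 + spread          # keep sigma strictly positive

def ode_solve_path(f, z0, ts, steps=50):
    """Euler-integrate dz/dt = f(z) and record z at every requested time."""
    zs, z, t_prev = [], z0, ts[0]
    for t in ts:
        dt = (t - t_prev) / steps
        for _ in range(steps):
            z += dt * f(z)
        zs.append(z)
        t_prev = t
    return zs

def latent_ode(xs, ts, f, rng):
    mu, sigma = encode(xs)
    z0 = mu + sigma * rng.gauss(0.0, 1.0)   # reparameterised sample of z0
    zs = ode_solve_path(f, z0, ts)          # whole trajectory follows from z0
    return [2.0 * z for z in zs]            # toy decoder in place of OutputNN

rng = random.Random(0)
recon = latent_ode([1.0, 0.8, 0.3], [0.0, 0.5, 3.0], lambda z: -0.5 * z, rng)
```

The only source of randomness is the draw of z_0; once it is fixed, the trajectory is deterministic, which is the property the generative model above relies on.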
3.3 Meta-Learner
Optimization-based meta-learning algorithms are designed to adjust the opti-
mization algorithm commonly used in deep learning so that the model can ex-
cel in learning with a few examples. Meta-LSTM is one of the approaches that
tries to explicitly model the optimization algorithm.
3.3.1 Meta-LSTM
We followed Ravi and Larochelle’s [15] work in our project. This paper
pioneered the explicit introduction of the concept of a “meta-learner”, where
the model that handles the specific task is called the “learner”. The goal of
the meta-learner is to efficiently update the learner’s parameters using a
small support set so that the learner can adapt to the new task quickly.
We denote the learner model as Mθ parameterized by θ, the meta-learner as
RΘ with parameters Θ, and the loss function L.
There are two key intuitions behind this idea. First, the gradient-based
parameter update used with backpropagation resembles the cell-state update in
an LSTM. Second, knowing the history of gradients benefits the gradient update,
as in the momentum algorithm.
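The update for the learner’s parameters at time step t with a learning rate α_t
is:
\[
\theta_t = \theta_{t-1} - \alpha_t \nabla_{\theta_{t-1}} L_t
\]
It has the same form as the cell state update in an LSTM if we set the forget
gate f_t = 1, the input gate i_t = α_t, the cell state c_t = θ_t (so that
c_{t−1} = θ_{t−1}), and the candidate cell state c̃_t = −∇_{θ_{t−1}} L_t:
\[
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
    = \theta_{t-1} - \alpha_t \nabla_{\theta_{t-1}} L_t
\]
Figure 3.1: LSTM as a meta-learner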
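The correspondence can be checked numerically. In the sketch below, the candidate cell state is taken to be the negative gradient, which is what makes the LSTM cell-state update coincide with a gradient descent step; the parameter values, gradients and learning rate are illustrative assumptions.

```python
def lstm_cell_state(c_prev, f_gate, i_gate, c_tilde):
    """Element-wise LSTM cell-state update c_t = f * c_{t-1} + i * c~_t."""
    return [f * c + i * ct for f, c, i, ct in zip(f_gate, c_prev, i_gate, c_tilde)]

def sgd_step(theta, grad, lr):
    """Plain gradient descent step theta - lr * grad."""
    return [th - lr * g for th, g in zip(theta, grad)]

theta_prev = [0.5, -1.0, 2.0]
grad = [0.2, -0.4, 1.0]
lr = 0.1

theta_lstm = lstm_cell_state(
    c_prev=theta_prev,            # c_{t-1} = theta_{t-1}
    f_gate=[1.0] * 3,             # forget gate fixed at 1
    i_gate=[lr] * 3,              # input gate plays the role of the learning rate
    c_tilde=[-g for g in grad],   # candidate cell state = negative gradient
)
theta_sgd = sgd_step(theta_prev, grad, lr)
```

In the meta-learner the two gates are no longer constants but learned functions, so the learning rate and the amount of weight decay can vary per step.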
3.3.2 Our Modification
In our work, we extend the original model by stacking a standard LSTM cell on
top of the meta-LSTM cell. The purpose of this design is to reconcile the input
and output dimensions of the meta-LSTM: its output is the new base-learner
parameters, while its input consists of the gradients and the loss in addition
to the original base-learner parameters.
CHAPTER 4
RESULTS
4.1 Data Pre-Processing
As mentioned in previous sections, despite the rapid growth in applying machine
learning methods to clinical data, progress in this field has been less
significant than in other applications of machine learning. In addition to
factors such as data complexity, noisiness and sparsity, the absence of
community benchmarks contributes to the slower progress of machine learning in
clinical settings. Benchmarks can play an important role in accelerating
progress in machine learning research; they facilitate reproducibility and
direct comparison of competing ideas.
Recently, many studies have focused on the public Medical Information Mart for
Intensive Care (MIMIC) data. This dataset integrates de-identified,
comprehensive clinical data of patients who stayed in critical care units of
the Beth Israel Deaconess Medical Center between 2001 and 2012, and includes
information such as demographics, vital sign measurements and laboratory test
results [9]. Before any analysis can be conducted on this data, the training
required by the data use agreement must be completed.
In our analysis, we mainly focus on disease classification. Thus, we executed
the necessary filtering steps to obtain the subset of patients with acute
diseases. In this process, we first generated a directory per patient that
included the ICU stay information, diagnoses and events for that patient. Next,
we filtered out ICU stays with a missing ICU stay ID, a missing length of stay,
or no events. Approximately 80% of the data remained after this filtering step.
Because we are mainly focusing on disease classification, we then excluded
patients with mixed diseases and patients with chronic diseases. Finally, we
constructed the episodes needed for few-shot learning.
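The filtering steps above can be sketched as a single predicate over per-stay records. The record layout and field names (`icustay_id`, `los`, `events`, `acute_diagnoses`, `has_chronic`) are hypothetical and do not reflect the actual MIMIC-III schema or the authors' code, and reading "mixed diseases" as "more than one acute diagnosis" is an assumption.

```python
def keep_stay(stay):
    """Apply the filters described above to one hypothetical ICU-stay record."""
    if stay.get("icustay_id") is None:      # missing ICU stay ID
        return False
    if stay.get("los") is None:             # missing ICU length of stay
        return False
    if not stay.get("events"):              # stay with no recorded events
        return False
    diagnoses = stay.get("acute_diagnoses", [])
    # Assumption: keep only stays with exactly one acute disease and no chronic one.
    if stay.get("has_chronic") or len(diagnoses) != 1:
        return False
    return True

stays = [
    {"icustay_id": 1, "los": 2.5, "events": ["hr"], "acute_diagnoses": ["sepsis"]},
    {"icustay_id": None, "los": 1.0, "events": ["hr"], "acute_diagnoses": ["sepsis"]},
    {"icustay_id": 3, "los": 4.0, "events": [], "acute_diagnoses": ["sepsis"]},
    {"icustay_id": 4, "los": 3.0, "events": ["bp"],
     "acute_diagnoses": ["sepsis", "pneumonia"]},
    {"icustay_id": 5, "los": None, "events": ["bp"], "acute_diagnoses": ["pneumonia"]},
]
kept = [s["icustay_id"] for s in stays if keep_stay(s)]
```

The stays that pass the predicate would then be grouped by diagnosis to build the few-shot episodes.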
4.2 Meta-Learner Experiments
In this section, we describe the results of our meta-learner experiments and the
comparison against other meta-learning models. We followed the standard ex-
periment settings in the meta-learning community and tested our meta-learner
performance on the MiniImageNet dataset. The MiniImageNet dataset was proposed
by Ravi & Larochelle [15], and involves 64 training classes, 16 validation
classes, and 20 test classes. Our base learner follows the same architecture as
the CNN used by Vinyals et al. [23], which has 4 modules, each consisting of
3 × 3 convolutions with 64 filters, followed by batch normalization, a ReLU
nonlinearity, and a 2 × 2 max-pooling layer.

Methods                    Accuracy
Matching Network           51.09 ± 0.71%
Matching Network FCE       55.31 ± 0.73%
RNN-VAE                    0.515 ± 0.040
ODE-RNN                    0.833 ± 0.009
Latent-ODE                 0.826 ± 0.007
Table 4.1: 5-shot 5-class Accuracy on MiniImageNet

Methods                    AUC-ROC
RNN-impute                 0.764 ± 0.016
RNN-decay                  0.807 ± 0.003
RNN-VAE                    0.515 ± 0.040
ODE-RNN                    0.833 ± 0.009
Latent-ODE                 0.826 ± 0.007
Table 4.2: Mortality prediction on Physionet
One of the major challenges in this implementation is compressing the parameter
space of the LSTM meta-learner. As the meta-learner models the parameters of
another neural network, it has hundreds of thousands of variables to learn. We
followed the idea proposed by Andrychowicz et al. [1], which is to share
parameters across coordinates. Furthermore, to simplify the training process,
the meta-learner assumes that the loss L_t and the gradient ∇_{θ_{t−1}} L_t are
independent of the meta-learner's parameters. The results of the experiments
are shown in table 4.1. Although Model-Agnostic Meta-Learning (MAML) [7]
achieved better performance than our approach, it requires a large amount of
memory and is computationally expensive, since it involves second-order
derivative computation. For our project, we take model complexity into
consideration as well.
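To give a sense of scale, the sketch below counts the parameters of a 4-module base-learner of the kind described above and then applies one shared update rule to every coordinate. The input size (84 × 84, 3 channels), size-preserving padding, the final 5-way linear layer, and the three-parameter sigmoid rule in `shared_update` are all assumptions made for illustration; the actual meta-LSTM uses a richer gate than this.

```python
import math

def conv_module_params(c_in, c_out=64, k=3):
    conv = k * k * c_in * c_out + c_out     # convolution weights + biases
    batch_norm = 2 * c_out                  # scale and shift
    return conv + batch_norm

def base_learner_params(in_channels=3, image_size=84, n_classes=5, n_modules=4):
    total, c_in, size = 0, in_channels, image_size
    for _ in range(n_modules):
        total += conv_module_params(c_in)
        c_in, size = 64, size // 2          # 2 x 2 max-pooling halves each side
    total += size * size * 64 * n_classes + n_classes   # final linear layer
    return total

def shared_update(theta_i, grad_i, loss, meta_params):
    """One small rule applied identically to every coordinate of theta."""
    w_grad, w_loss, bias = meta_params
    lr_i = 1.0 / (1.0 + math.exp(-(w_grad * grad_i + w_loss * loss + bias)))
    return theta_i - lr_i * grad_i

n_params = base_learner_params()
meta_params = (1.0, 0.5, -2.0)              # size independent of n_params
theta = [0.5, -1.0, 2.0]
grads = [0.2, -0.4, 1.0]
new_theta = [shared_update(t, g, 0.7, meta_params) for t, g in zip(theta, grads)]
```

Under these assumptions the base-learner has on the order of 10^5 parameters, while the shared rule keeps the number of meta-learner parameters fixed regardless of how many coordinates it updates.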
4.3 Base-Learner Experiments
We evaluated ODE-based models on the PhysioNet Challenge 2012 dataset [19].
This dataset contains 8000 time series, each of which contains measurements
from the first 48 hours of a different patient's admission to the ICU.
Measurements were made at irregular times and cover sparse subsets of the 37
possible features. Due to label imbalance, we used AUC-ROC as our evaluation
metric. Our results show that ODE-based models outperform traditional RNN-based
models. In our experiments, we constructed a classifier to predict in-hospital
mortality. We passed the hidden state at the last measured time point into a
two-layer classifier and jointly trained both the encoder and decoder by
maximizing the evidence lower bound (ELBO). Through these experiments, we
conclude that the best performance is achieved by reconstructing the trajectory
at the same time as predicting disease categories. The results of these
experiments are shown in table 4.2.
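The choice of AUC-ROC under label imbalance can be illustrated with a small example. The function below computes AUC as the probability that a randomly chosen positive example is scored above a randomly chosen negative one (ties count one half); the labels and scores are made up for illustration and are not taken from the experiments.

```python
def auc_roc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Imbalanced toy labels: 9 negatives, 1 positive.
labels = [0] * 9 + [1]
constant = [0.0] * 10                        # a model that never predicts the positive
accuracy_all_negative = sum(1 for y in labels if y == 0) / len(labels)
auc_constant = auc_roc(labels, constant)
auc_ranked = auc_roc(labels, [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.2, 0.3, 0.9])
```

A classifier that always predicts the majority class reaches 90% accuracy on this toy set but only chance-level AUC, whereas a model that ranks the positive case highest reaches an AUC of 1.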
4.4 Disease Classification in a Normal Data Regime
First, we compared the performance of the ODE-based models in a normal machine
learning setting, where the amount of labeled data is not restricted as it is
in the few-shot learning setting. In our experiments, we chose the imputed LSTM
and GRU-D as our comparison models. The best LSTM baseline performance is
achieved using a standard LSTM network with a missing-value imputation strategy
and without multitask training. Our ODE-based model achieved good performance
without using any imputation strategy. The results from our experiments
indicate that the best ODE-based model performance is achieved when we
reconstruct the trajectory at the same time during training. Furthermore, our
model significantly outperformed the standard LSTM without imputation. However,
the imputed LSTM achieved results similar to those of the ODE-based model. One
possible reason is that standard RNNs ignore the time gaps between points.
Typically, standard RNNs work well on regularly spaced data with few missing
values, or when the time intervals between points are short. When a proper
imputation strategy is applied, the input to the LSTM is very dense, which
boosts its performance.
Figure 4.1: AUC of tested models on acute diseases within MIMIC-III
Methods                    Accuracy
Meta-LSTM + RNN-impute     0.647 ± 0.013
Meta-LSTM + RNN-decay      0.673 ± 0.004
Meta-LSTM + RNN-VAE        0.563 ± 0.027
Meta-LSTM + ODE-RNN        0.737 ± 0.007
Meta-LSTM + Latent-ODE     0.719 ± 0.008
Table 4.3: 5-shot 5-class experiments on MIMIC-III
4.5 Disease classification in few-shot learning setting
Finally, we constructed a few-shot learning sub-dataset from MIMIC-III to
validate the effectiveness of both our meta-learner and base-learner. We
performed 5-shot 5-class experiments on MIMIC-III data using the adjusted
meta-LSTM as our meta-learner. The query set in each episode contains 15
examples. We achieved 0.737 accuracy with a standard deviation of 0.007. We
also compared the performance against other base-learners under this low data
regime. The results in table 4.3 show that the combination of meta-LSTM and
ODE-RNN performed best among the tested base-learners, even with very limited
training data.
CHAPTER 5
DISCUSSION
5.1 Future work
In this work, we built a meta-learning model using Latent-ODE as the
base-learner and meta-LSTM as the meta-learner. We primarily focused on
phenotyping acute diseases within the MIMIC-III dataset and on optimizing the
model through the base-learner. The meta-learner design therefore remains open
to further investigation, for example using neural ODEs. Furthermore, tasks
beyond phenotyping can be explored on the MIMIC-III dataset using this model.
The scope of this project can also be expanded to other health datasets outside
the ICU setting, such as MRI images. The few-shot learning setting we
constructed may not be very practical in real-world clinical settings. A more
appropriate use of meta-learning in medicine could be rare disease
classification, where we could construct a meta-learning training scheme and
place the rare diseases in the meta-test phase.
CHAPTER 6
CONCLUSION
In this project, we explored meta-learning and its applications in medicine. We
selected an optimization-based meta-learning approach, which gave us
flexibility in designing our base-learner for a specific task or problem. In
our case, we investigated a neural-ODE-based base-learner on EHR data. This
data is recorded sporadically, and missing values are common. These properties
of the data are among the major challenges for traditional neural networks. Our
results show that a neural ODE can achieve accurate disease classification even
without the imputation strategies commonly used in traditional methods. Not
only does this simplify prediction, it also makes our model more explainable.
BIBLIOGRAPHY
[1] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman,
David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas.
Learning to learn by gradient descent by gradient descent. In Advances
in neural information processing systems, pages 3981–3989, 2016.
[2] Riccardo Bellazzi and Blaz Zupan. Predictive data mining in clinical
medicine: Current issues and guidelines. 7:81–97, 2006.
[3] Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak
Shah. Signature verification using a “siamese” time delay neural network.
In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural
Information Processing Systems 6, pages 737–744. Morgan-Kaufmann, 1994.
[4] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David Duve-
naud. Neural ordinary differential equations, 2018.
[5] Edward Choi, Mohammad Taha Bahadori, Elizabeth Searles, Catherine
Coffey, Michael Thompson, James Bost, Javier Tejedor-sojo, and Jimeng
Sun. Multi-layer Representation Learning for Medical Concepts. 2016.
[6] Edward Choi, Mohammad Taha Bahadori, Le Song, Walter F. Stewart, and
Jimeng Sun. GRAM: Graph-based Attention Model for Healthcare Repre-
sentation Learning. pages 1–15, 2016.
[7] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-Agnostic Meta-
Learning for Fast Adaptation of Deep Networks. 2017.
[8] Hrayr Harutyunyan, Hrant Khachatrian, David C Kale, and Greg Ver
Steeg. Multitask Learning and Benchmarking with Clinical Time Series
Data. pages 1–19, 2018.
[9] Alistair E W Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling
Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi,
and Roger G Mark. Data Descriptor: MIMIC-III, a freely accessible critical
care database. pages 1–9, 2016.
[10] Diederik P Kingma and Max Welling. Auto-encoding variational bayes,
2013.
[11] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural
networks for one-shot image recognition. 2015.
[12] Jeong Min Lee. Diagnosis Code Prediction from Electronic Health Records
as Multilabel Text Classification: A Survey. 2017.
[13] Michael C Mozer, Denis Kazakov, and Robert V Lindsey. Discrete event,
continuous time rnns. arXiv preprint arXiv:1710.04110, 2017.
[14] Tsendsuren Munkhdalai and Hong Yu. Meta networks, 2017.
[15] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. pages 1–11, 2017.
[16] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot
learning. In ICLR, 2017.
[17] Yulia Rubanova, Ricky T. Q. Chen, and David Duvenaud. Latent odes for
irregularly-sampled time series, 2019.
[18] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and
Timothy Lillicrap. Meta-learning with memory-augmented neural net-
works. In Proceedings of the 33rd International Conference on International
Conference on Machine Learning - Volume 48, ICML’16, pages 1842–1850.
JMLR.org, 2016.
[19] Ikaro Silva, George Moody, Daniel J Scott, Leo A Celi, and Roger G Mark.
Predicting in-hospital mortality of icu patients: The physionet/computing
in cardiology challenge 2012. In 2012 Computing in Cardiology, pages 245–
248. IEEE, 2012.
[20] Jake Snell, Kevin Swersky, and Richard S. Zemel. Prototypical networks
for few-shot learning, 2017.