{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pulse rate analysis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This tutorial shows how to extract pulse rate estimates using photoplethysmography (PPG) data and accelerometer data. The pipeline consists of a stepwise approach to determine signal quality, assessing both PPG morphology and accounting for periodic artifacts using the accelerometer. The usage of accelerometer is optional but is recommended to specifically account for periodic motion artifacts. Based on the signal quality, we extract high-quality segments and estimate the pulse rate for every 2 s using the smoothed pseudo Wigner-Ville Distribution.\n",
"\n",
"In this tutorial, we use two days of data from a participant of the Personalized Parkinson Project to demonstrate the functionalities. Since `ParaDigMa` expects contiguous time series, the collected data was stored in two segments each with contiguous timestamps. Per segment, we load the data and perform the following steps:\n",
"1. Preprocess the time series data\n",
"2. Extract signal quality features\n",
"3. Signal quality classification\n",
"4. Pulse rate estimation\n",
"\n",
"We then combine the output of the different segments for the final step:\n",
"\n",
"5. Pulse rate aggregation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import required modules"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"from importlib.resources import files\n",
"from pathlib import Path\n",
"\n",
"import pandas as pd\n",
"import tsdf\n",
"\n",
"from paradigma.classification import ClassifierPackage\n",
"from paradigma.config import IMUConfig, PPGConfig, PulseRateConfig\n",
"from paradigma.constants import DataColumns\n",
"from paradigma.pipelines.pulse_rate_pipeline import (\n",
" aggregate_pulse_rate,\n",
" estimate_pulse_rate,\n",
" extract_signal_quality_features,\n",
" signal_quality_classification,\n",
")\n",
"from paradigma.preprocessing import preprocess_ppg_data\n",
"from paradigma.util import load_tsdf_dataframe, write_df_data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This pipeline requires PPG data and can be enhanced with accelerometer data (optional). Here, we start by loading a single contiguous time series (segment), for which we continue running steps 1-4. [Below](#multiple_segments_cell) we show how to run these steps for multiple segments. The channel `green` represents the values obtained with PPG using green light.\n",
"\n",
"In this example we use the internally developed `TSDF` ([documentation](https://biomarkersparkinson.github.io/tsdf/)) to load and store data [[1](https://arxiv.org/abs/2211.11294)]. However, we are aware that there are other common data formats. For example, the following functions can be used depending on the file extension of the data:\n",
"- _.csv_: `pandas.read_csv()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html))\n",
"- _.json_: `json.load()` ([documentation](https://docs.python.org/3/library/json.html#json.load))\n"
]
},
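{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, if your data is stored as _.csv_ files, loading could look like the sketch below; the filenames and column names are assumptions to adapt to your own data.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical filenames and column layouts\n",
"df_ppg = pd.read_csv('ppg_segment0001.csv')  # columns: time, green\n",
"df_acc = pd.read_csv('imu_segment0001.csv')  # columns: time, accelerometer_x/y/z\n",
"```"
]
},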
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Set the path to where the prepared data is saved and load the data.\n",
"# Note: the test data is stored in TSDF, but you can load your data in your own way\n",
"path_to_prepared_data = Path('../../example_data/verily')\n",
"\n",
"ppg_prefix = 'ppg'\n",
"imu_prefix = 'imu'\n",
"\n",
"segment_nr = '0001'\n",
"\n",
"df_ppg, metadata_time_ppg, metadata_values_ppg = load_tsdf_dataframe(\n",
" path_to_data=path_to_prepared_data / ppg_prefix,\n",
" prefix=f'PPG_segment{segment_nr}'\n",
")\n",
"\n",
"# Only relevant if you have IMU data available\n",
"df_imu, metadata_time_imu, metadata_values_imu = load_tsdf_dataframe(\n",
" path_to_data=path_to_prepared_data / imu_prefix,\n",
" prefix=f'IMU_segment{segment_nr}'\n",
")\n",
"\n",
"time_col = ['time']\n",
"acc_cols = ['accelerometer_x', 'accelerometer_y', 'accelerometer_z']\n",
"df_acc = df_imu[time_col + acc_cols]\n",
"\n",
"display(df_ppg, df_acc)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1: Preprocess data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first step after loading the data is preprocessing using the [preprocess_ppg_data](https://biomarkersparkinson.github.io/paradigma/autoapi/paradigma/preprocessing/index.html#paradigma.preprocessing.preprocess_ppg_data). This begins by isolating segments containing both PPG and IMU data, discarding portions where one modality (e.g., PPG) extends beyond the other, such as when the PPG recording is longer than the accelerometer data. This functionality requires the starting times (`metadata_time_ppg.start_iso8601` and `metadata_time_imu.start_iso8601`) in iso8601 format as inputs. After this step, the preprocess_ppg_data function resamples the PPG and accelerometer data to uniformly distributed timestamps, addressing the fixed but non-uniform sampling rates of the sensors. If the difference between timestamps is larger than a specified tolerance (`config.tolerance`, in seconds), it will return an error that the timestamps are not contiguous. If you still want to process the data in this case, you can create segments from discontiguous samples using the function [`create_segments`](https://biomarkersparkinson.github.io/paradigma/autoapi/paradigma/segmenting/index.html#paradigma.segmenting.create_segments) and analyze these segments consecutively as shown in [here](#multiple_segments_cell). After resampling, a bandpass Butterworth filter (4th-order, bandpass frequencies: 0.4--3.5 Hz) is applied to the PPG signal, while a high-pass Butterworth filter (4th-order, cut-off frequency: 0.2 Hz) is applied to the accelerometer data.\n",
"\n",
"Note: the printed shapes are (rows, columns) with each row corresponding to a single data point and each column representing a data column (e.g.time). The number of rows of the overlapping segments of PPG and accelerometer are not the same due to sampling differences (other sensors and possibly other sampling frequencies)."
]
},
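{
"cell_type": "markdown",
"metadata": {},
"source": [
"For illustration, the two filters described above can be reproduced with `scipy`. This is a sketch of equivalent filters, not ParaDigMa's internal implementation; the sampling frequencies and raw column names are assumptions.\n",
"\n",
"```python\n",
"from scipy.signal import butter, sosfiltfilt\n",
"\n",
"fs_ppg, fs_acc = 30, 100  # assumed sampling frequencies (Hz)\n",
"\n",
"# 4th-order Butterworth band-pass (0.4-3.5 Hz) for the PPG signal\n",
"sos_ppg = butter(4, [0.4, 3.5], btype='bandpass', fs=fs_ppg, output='sos')\n",
"ppg_filtered = sosfiltfilt(sos_ppg, df_ppg['green'])\n",
"\n",
"# 4th-order Butterworth high-pass (0.2 Hz cut-off) for the accelerometer\n",
"sos_acc = butter(4, 0.2, btype='highpass', fs=fs_acc, output='sos')\n",
"acc_filtered = sosfiltfilt(sos_acc, df_acc['accelerometer_x'])\n",
"```"
]
},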
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Set column names: replace DataColumn.* with your actual column names.\n",
"# It is only necessary to set the columns that are present in your data, and\n",
"# only if they differ from the default names defined in DataColumns.\n",
"imu_column_mapping = {\n",
" 'TIME': DataColumns.TIME,\n",
" 'ACCELEROMETER_X': DataColumns.ACCELEROMETER_X,\n",
" 'ACCELEROMETER_Y': DataColumns.ACCELEROMETER_Y,\n",
" 'ACCELEROMETER_Z': DataColumns.ACCELEROMETER_Z,\n",
"}\n",
"\n",
"ppg_column_mapping = {\n",
" 'TIME': DataColumns.TIME,\n",
" 'PPG': DataColumns.PPG,\n",
"}\n",
"\n",
"ppg_config = PPGConfig(column_mapping=ppg_column_mapping)\n",
"imu_config = IMUConfig(column_mapping=imu_column_mapping)\n",
"\n",
"print(\n",
" f\"The tolerance for checking contiguous timestamps is set to \"\n",
" f\"{ppg_config.tolerance:.3f} seconds for PPG data and \"\n",
" f\"{imu_config.tolerance:.3f} seconds for accelerometer data.\"\n",
")\n",
"print(\n",
" f\"Original data shapes:\\n- PPG data: {df_ppg.shape}\\n\"\n",
" f\"- Accelerometer data: {df_imu.shape}\"\n",
")\n",
"df_ppg_proc, df_acc_proc = preprocess_ppg_data(\n",
" df_ppg=df_ppg,\n",
" ppg_config=ppg_config,\n",
" start_time_ppg=metadata_time_ppg.start_iso8601, # Optional\n",
" df_acc=df_acc, # Optional\n",
" imu_config=imu_config, # Optional\n",
" start_time_imu=metadata_time_imu.start_iso8601 # Optional\n",
")\n",
"\n",
"print(f\"Overlapping preprocessed data shapes:\\n- PPG data: {df_ppg_proc.shape}\\n\"\n",
" f\"- Accelerometer data: {df_acc_proc.shape}\")\n",
"display(df_ppg_proc, df_acc_proc)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2: Extract signal quality features"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The preprocessed data is windowed into overlapping windows of length `ppg_config.window_length_s` with a window step of `ppg_config.window_step_length_s`. From the PPG windows, 10 time- and frequency domain features are extracted to assess PPG morphology. In case of using the accelerometer data (optional), one relative power feature is calculated per window to assess periodic motion artifacts.\n",
"\n",
"The detailed steps are encapsulated in [`extract_signal_quality_features`](https://biomarkersparkinson.github.io/paradigma/autoapi/paradigma/pipelines/pulse_rate_pipeline/index.html#paradigma.pipelines.pulse_rate_pipeline.extract_signal_quality_features)."
]
},
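{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate the windowing, below is a minimal sketch with plain `numpy`. The 6 s window, 1 s step, and 30 Hz sampling frequency are assumptions (the pipeline uses `ppg_config.window_length_s` and `ppg_config.window_step_length_s`, printed in the next cell), and we assume the preprocessed dataframe exposes the standardized `DataColumns.PPG` column.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"fs = 30  # assumed PPG sampling frequency (Hz)\n",
"win, step = 6 * fs, 1 * fs  # assumed 6 s windows with a 1 s step\n",
"\n",
"ppg = df_ppg_proc[DataColumns.PPG].to_numpy()\n",
"starts = np.arange(0, len(ppg) - win + 1, step)\n",
"windows = np.stack([ppg[s:s + win] for s in starts])\n",
"print(windows.shape)  # (n_windows, win): one row per overlapping window\n",
"```"
]
},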
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pulse_rate_ppg_config = PulseRateConfig(\n",
" sensor='ppg',\n",
" ppg_sampling_frequency=ppg_config.sampling_frequency,\n",
")\n",
"\n",
"pulse_rate_acc_config = PulseRateConfig(\n",
" sensor='imu',\n",
" imu_sampling_frequency=imu_config.sampling_frequency,\n",
" accelerometer_colnames=imu_config.accelerometer_colnames,\n",
")\n",
"\n",
"print(\"The default window length for the signal quality feature extraction is \"\n",
" f\"set to {pulse_rate_ppg_config.window_length_s} seconds.\")\n",
"print(\"The default step size for the signal quality feature extraction is \"\n",
" f\"set to {pulse_rate_ppg_config.window_step_length_s} seconds.\")\n",
"\n",
"# Remove optional arguments if you don't have accelerometer data\n",
"# (set to None or remove arguments)\n",
"df_features = extract_signal_quality_features(\n",
" df_ppg=df_ppg_proc,\n",
" df_acc=df_acc_proc,\n",
" ppg_config=pulse_rate_ppg_config,\n",
" acc_config=pulse_rate_acc_config,\n",
")\n",
"\n",
"display(df_features)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 3: Signal quality classification"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A trained logistic classifier is used to predict PPG signal quality and returns the `pred_sqa_proba`, which is the posterior probability of a PPG window to look like the typical PPG morphology (higher probability indicates toward the typical PPG morphology). \n",
"If accelerometer is used, the relative power feature from the accelerometer is compared to a threshold for periodic artifacts and therefore `pred_sqa_acc_label` is used to return a label indicating predicted periodic motion artifacts (label 0) or no periodic motion artifacts (label 1).\n",
"\n",
"The classification step is implemented in [`signal_quality_classification`](https://biomarkersparkinson.github.io/paradigma/autoapi/paradigma/pipelines/pulse_rate_pipeline/index.html#paradigma.pipelines.pulse_rate_pipeline.signal_quality_classification).\n",
"\n",
"**_Note on scale sensitivity_** \n",
"The PPG sensor used for developing this pipeline records in arbitrary units. Some features are scale sensitive and require rescaling when applying the pipeline to other datasets or PPG sensors. \n",
"In this pipeline, the logistic classifier for PPG morphology was trained on z-scored features, using the means (μ) and standard deviations (σ) from the Personalized Parkinson Project training set (documentation of the training set can be found in the Signal Quality Assessment section [here](https://www.medrxiv.org/content/10.1101/2025.08.15.25333751v1.full-text)). These μ and σ values are stored in the `ppg_quality_classifier_package`. When applying the code to another dataset, users are advised to recalculate **_μ_** and **_σ_** for each feature on their (training) data and update the classifier package accordingly. Example code to recalculate the **_μ_** and **_σ_** can be found in the code cell below and in section 7.3.1. [here](https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling)."
]
},
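{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, z-scoring a feature matrix amounts to $z = (x - \\mu) / \\sigma$ per feature, as sketched below with a hypothetical `x_train` array of shape `(n_samples, n_features)`.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"mu = x_train.mean(axis=0)          # per-feature means\n",
"sigma = x_train.std(axis=0)        # per-feature standard deviations\n",
"x_scaled = (x_train - mu) / sigma  # z-scored features\n",
"```"
]
},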
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"config = PulseRateConfig()\n",
"\n",
"ppg_quality_classifier_package_filename = 'ppg_quality_clf_package.pkl'\n",
"full_path_to_classifier_package = (\n",
" files('paradigma')\n",
" / 'assets'\n",
" / ppg_quality_classifier_package_filename\n",
")\n",
"\n",
"# Load the classifier package\n",
"clf_package = ClassifierPackage.load(full_path_to_classifier_package)\n",
"\n",
"# If you use a different sensor or dataset, you have to update the classifier\n",
"# package with the mu and sigma calculated on your training data.\n",
"# Example code to recalculate the mean and std to update the clf_package:\n",
"# import numpy as np\n",
"\n",
"# # create random x_train for demonstration purposes, 100 samples and 10 features\n",
"# x_train = np.random.rand(100, 10)\n",
"# clf_package.update_scaler(x_train)\n",
"\n",
"df_sqa = signal_quality_classification(\n",
" df=df_features,\n",
" config=config,\n",
" clf_package=clf_package\n",
")\n",
"\n",
"df_sqa"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Store as TSDF\n",
"The predicted probabilities (and optionally other features) can be stored and loaded in TSDF as demonstrated below. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Set 'path_to_data' to the directory where you want to save the data\n",
"metadata_time_store = tsdf.TSDFMetadata(\n",
" metadata_time_ppg.get_plain_tsdf_dict_copy(),\n",
" path_to_prepared_data\n",
")\n",
"metadata_values_store = tsdf.TSDFMetadata(\n",
" metadata_values_ppg.get_plain_tsdf_dict_copy(),\n",
" path_to_prepared_data\n",
")\n",
"\n",
"# Select the columns to be saved\n",
"metadata_time_store.channels = ['time']\n",
"metadata_values_store.channels = ['pred_sqa_proba', 'pred_sqa_acc_label']\n",
"\n",
"# Set the units\n",
"metadata_time_store.units = ['Relative seconds']\n",
"metadata_values_store.units = ['Unitless', 'Unitless']\n",
"metadata_time_store.data_type = float\n",
"metadata_values_store.data_type = float\n",
"\n",
"# Set the filenames\n",
"meta_store_filename = f'segment{segment_nr}_meta.json'\n",
"values_store_filename = meta_store_filename.replace('_meta.json', '_values.bin')\n",
"time_store_filename = meta_store_filename.replace('_meta.json', '_time.bin')\n",
"\n",
"metadata_values_store.file_name = values_store_filename\n",
"metadata_time_store.file_name = time_store_filename\n",
"\n",
"write_df_data(\n",
" metadata_time_store,\n",
" metadata_values_store,\n",
" path_to_prepared_data,\n",
" meta_store_filename,\n",
" df_sqa\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_sqa, _, _ = load_tsdf_dataframe(\n",
" path_to_prepared_data,\n",
" prefix=f'segment{segment_nr}'\n",
")\n",
"df_sqa.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 4: Pulse rate estimation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For pulse rate estimation, we extract segments of `config.tfd_length` using [estimate_pulse_rate](https://biomarkersparkinson.github.io/paradigma/autoapi/paradigma/pipelines/pulse_rate_pipeline/index.html#paradigma.pipelines.pulse_rate_pipeline.estimate_pulse_rate). We calculate the smoothed-pseudo Wigner-Ville Distribution (SPWVD) to obtain the frequency content of the PPG signal over time. We extract for every timestamp in the SPWVD the frequency with the highest power. For every non-overlapping 2 s window we average the corresponding frequencies to obtain a pulse rate per window.\n",
"\n",
"Note: for the test data we set the tfd_length to 10 s instead of the default of 30 s, because the small PPP test data doesn't have 30 s of consecutive high-quality PPG data."
]
},
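{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate the last part of this step: for each timestamp, the dominant frequency is the argmax of the time-frequency distribution over the frequency axis, and each non-overlapping 2 s window's pulse rate is the mean of those frequencies converted to beats per minute. Below is a sketch with a hypothetical `tfd` matrix and frequency axis; the real SPWVD is computed inside `estimate_pulse_rate`.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(0)\n",
"freqs = np.linspace(0, 5, 128)  # frequency axis (Hz), assumed\n",
"tfd = rng.random((128, 300))    # stand-in for the SPWVD (n_freqs, n_times)\n",
"fs_time = 30                    # assumed time resolution of the SPWVD (Hz)\n",
"\n",
"dominant = freqs[np.argmax(tfd, axis=0)]  # dominant frequency per timestamp\n",
"n_per_win = 2 * fs_time                   # samples per non-overlapping 2 s window\n",
"n_win = len(dominant) // n_per_win\n",
"pr_bpm = dominant[:n_win * n_per_win].reshape(n_win, -1).mean(axis=1) * 60\n",
"```"
]
},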
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\n",
" \"The standard default minimal window length for the pulse rate \"\n",
" \"extraction is set to\", pulse_rate_ppg_config.tfd_length, \"seconds.\"\n",
")\n",
"\n",
"df_pr = estimate_pulse_rate(\n",
" df_sqa=df_sqa,\n",
" df_ppg_preprocessed=df_ppg_proc,\n",
" config=pulse_rate_ppg_config\n",
")\n",
"\n",
"df_pr"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Run steps 1 - 4 for multiple segments "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If your data is also stored in multiple segments, you can modify `segments` in the cell below to a list of the filenames of your respective segmented data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Set the path to where the prepared data is saved\n",
"path_to_prepared_data = Path('../../example_data/verily')\n",
"\n",
"ppg_prefix = 'ppg'\n",
"imu_prefix = 'imu'\n",
"\n",
"# Set the path to the classifier package\n",
"ppg_quality_classifier_package_filename = 'ppg_quality_clf_package.pkl'\n",
"full_path_to_classifier_package = (\n",
" files('paradigma')\n",
" / 'assets'\n",
" / ppg_quality_classifier_package_filename\n",
")\n",
"\n",
"# Create a list of dataframes to store the estimated pulse rates of all segments\n",
"list_df_pr = []\n",
"\n",
"segments = ['0001', '0002'] # list with all available segments\n",
"\n",
"for segment_nr in segments:\n",
"\n",
" # Load the data\n",
" df_ppg, metadata_time_ppg, _ = load_tsdf_dataframe(\n",
" path_to_data=path_to_prepared_data / ppg_prefix,\n",
" prefix=f'PPG_segment{segment_nr}'\n",
" )\n",
" df_imu, metadata_time_imu, _ = load_tsdf_dataframe(\n",
" path_to_data=path_to_prepared_data / imu_prefix,\n",
" prefix=f'IMU_segment{segment_nr}'\n",
" )\n",
"\n",
" # Drop the gyroscope columns from the IMU data\n",
" cols_to_drop = df_imu.filter(regex='^gyroscope_').columns\n",
" df_acc = df_imu.drop(cols_to_drop, axis=1)\n",
"\n",
" # 1: Preprocess the data\n",
"\n",
" ppg_config = PPGConfig()\n",
" imu_config = IMUConfig()\n",
"\n",
" df_ppg_proc, df_acc_proc = preprocess_ppg_data(\n",
" df_ppg=df_ppg,\n",
" df_acc=df_acc,\n",
" ppg_config=ppg_config,\n",
" imu_config=imu_config,\n",
" start_time_ppg=metadata_time_ppg.start_iso8601,\n",
" start_time_imu=metadata_time_imu.start_iso8601\n",
" )\n",
"\n",
" # 2: Extract signal quality features\n",
" ppg_config = PulseRateConfig(\n",
" sensor='ppg',\n",
" ppg_sampling_frequency=ppg_config.sampling_frequency\n",
" )\n",
" acc_config = PulseRateConfig(\n",
" sensor='imu',\n",
" imu_sampling_frequency=imu_config.sampling_frequency,\n",
" accelerometer_colnames=imu_config.accelerometer_colnames,\n",
" )\n",
"\n",
" df_features = extract_signal_quality_features(\n",
" df_ppg=df_ppg_proc,\n",
" df_acc=df_acc_proc,\n",
" ppg_config=ppg_config,\n",
" acc_config=acc_config,\n",
" )\n",
"\n",
" # 3: Signal quality classification\n",
" df_sqa = signal_quality_classification(\n",
" df=df_features,\n",
" config=ppg_config,\n",
" clf_package=clf_package\n",
" )\n",
"\n",
" # 4: Estimate pulse rate\n",
" df_pr = estimate_pulse_rate(\n",
" df_sqa=df_sqa,\n",
" df_ppg_preprocessed=df_ppg_proc,\n",
" config=ppg_config\n",
" )\n",
"\n",
" # Add the hr estimations of the current segment to the list\n",
" df_pr['segment_nr'] = segment_nr\n",
" list_df_pr.append(df_pr)\n",
"\n",
"df_pr = pd.concat(list_df_pr, ignore_index=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 5: Pulse rate aggregation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The final step is to aggregate all 2 s pulse rate estimates using [aggregate_pulse_rate](https://biomarkersparkinson.github.io/paradigma/autoapi/paradigma/pipelines/pulse_rate_pipeline/index.html#paradigma.pipelines.pulse_rate_pipeline.aggregate_pulse_rate). In the current example, the mode and 99th percentile are calculated. We hypothesize that the mode gives representation of the resting pulse rate while the 99th percentile indicates the maximum pulse rate. In Parkinson's disease, we expect that these two measures could reflect autonomic (dys)functioning. The `nr_pr_est` in the metadata indicates based on how many 2 s windows these aggregates are determined."
]
},
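{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the two aggregates can also be computed directly from the 2 s estimates with `numpy`, as sketched below; the 1 bpm histogram binning used for the mode of continuous values is an assumption, not necessarily how `aggregate_pulse_rate` computes it.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"pr = df_pr['pulse_rate'].to_numpy()\n",
"p99 = np.percentile(pr, 99)  # proxy for the maximum pulse rate\n",
"\n",
"# Mode of continuous values via a 1 bpm histogram (assumed binning)\n",
"counts, edges = np.histogram(pr, bins=np.arange(30, 201, 1))\n",
"mode = edges[np.argmax(counts)] + 0.5  # bin centre, proxy for resting pulse rate\n",
"```"
]
},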
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pr_values = df_pr['pulse_rate'].values\n",
"df_pr_agg = aggregate_pulse_rate(\n",
" pr_values=pr_values,\n",
" aggregates = ['mode', '99p']\n",
")\n",
"\n",
"print(json.dumps(df_pr_agg, indent=2))"
]
}
],
"metadata": {},
"nbformat": 4,
"nbformat_minor": 2
}