Data Input Formats Guide

ParaDigMa’s run_paradigma() function supports multiple flexible input formats for providing data to the analysis pipeline.

Prerequisites

Before using ParaDigMa, ensure your data meets the requirements described in Required DataFrame Columns and Data Preparation Parameters below.

Input Format Options

The dfs parameter accepts three input formats:

1. Single DataFrame

Use when you have a single prepared DataFrame to analyze:

import pandas as pd
from paradigma.orchestrator import run_paradigma

# Load your data
df = pd.read_parquet('data.parquet')

# Process with a single DataFrame
results = run_paradigma(
    dfs=df,  # Single DataFrame
    pipelines=['gait'],
    watch_side='right',  # Required for gait pipeline
    save_intermediate=['aggregation']  # Saves to ./output by default
)

The DataFrame is automatically assigned the identifier 'df_1' internally.

2. List of DataFrames

Use when you have multiple DataFrames that should be automatically assigned sequential IDs:

# Load multiple data segments
df1 = pd.read_parquet('morning_session.parquet')
df2 = pd.read_parquet('afternoon_session.parquet')
df3 = pd.read_parquet('evening_session.parquet')

# Process as list - automatically assigned to 'df_1', 'df_2', 'df_3'
results = run_paradigma(
    dfs=[df1, df2, df3],
    pipelines=['gait'],
    watch_side='right',
    save_intermediate=['quantification', 'aggregation']
)

Benefits:

  • Automatic segment ID assignment

  • Each DataFrame processed independently before aggregation

  • Aggregation performed across all input DataFrames

3. Dictionary of DataFrames

Use when you need custom identifiers for your data segments:

# Create dictionary with custom segment identifiers
dfs = {
    'patient_001_morning': pd.read_parquet('session1.parquet'),
    'patient_001_evening': pd.read_parquet('session2.parquet'),
    'patient_002_morning': pd.read_parquet('session3.parquet'),
}

# Process with custom segment identifiers
results = run_paradigma(
    dfs=dfs,
    pipelines=['gait'],
    watch_side='right',
    save_intermediate=[]  # No files saved - results only in memory
)

Benefits:

  • Custom segment identifiers in output

  • Improved traceability of data sources

  • Useful for multi-patient or multi-session datasets

Loading Data from Disk

To automatically load data files from a directory:

from paradigma.orchestrator import run_paradigma

# Load all files from a directory
results = run_paradigma(
    data_path='./data/patient_001/',
    pipelines=['gait'],
    watch_side='right',
    file_pattern='*.parquet',  # Optional: filter by pattern
    save_intermediate=['aggregation']
)

Supported file formats:

  • Pandas: .parquet, .csv, .pkl, .pickle

  • TSDF: .meta + .bin pairs

  • Device-specific: .avro (Empatica), .cwa (Axivity)

See Supported Devices for device-specific loading examples.

Required DataFrame Columns

Your DataFrame must contain the following columns, which depend on the pipeline:

For Gait and Tremor Pipelines

# Required columns
df.columns = ['time', 'accelerometer_x', 'accelerometer_y', 'accelerometer_z',
              'gyroscope_x', 'gyroscope_y', 'gyroscope_z']

  • time: Timestamp (float seconds or datetime)

  • accelerometer_x, accelerometer_y, accelerometer_z: Accelerometer data

  • gyroscope_x, gyroscope_y, gyroscope_z: Gyroscope data
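
A minimal sketch of a DataFrame with these columns, using random placeholder values rather than real sensor data (the 100 Hz sampling rate here is only an assumption for illustration):

import numpy as np
import pandas as pd

n_samples = 1000
fs = 100.0  # assumed sampling frequency (Hz), for illustration only

df = pd.DataFrame({
    'time': np.arange(n_samples) / fs,  # relative time in seconds
    'accelerometer_x': np.random.randn(n_samples),
    'accelerometer_y': np.random.randn(n_samples),
    'accelerometer_z': np.random.randn(n_samples),
    'gyroscope_x': np.random.randn(n_samples),
    'gyroscope_y': np.random.randn(n_samples),
    'gyroscope_z': np.random.randn(n_samples),
})

# Check that no required column is missing before running the pipeline
required = ['time', 'accelerometer_x', 'accelerometer_y', 'accelerometer_z',
            'gyroscope_x', 'gyroscope_y', 'gyroscope_z']
missing = [col for col in required if col not in df.columns]
assert not missing, f"Missing required columns: {missing}"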

For Pulse Rate Pipeline

# Required columns
df.columns = ['time', 'ppg']  # Accelerometer optional

  • time: Timestamp (float seconds or datetime)

  • ppg: PPG/BVP signal

Custom Column Names

If your data uses different column names, rename the columns or use column_mapping:

results = run_paradigma(
    dfs=df,
    pipelines=['gait'],
    watch_side='left',
    column_mapping={
        'timestamp': 'time',
        'acc_x': 'accelerometer_x',
        'acc_y': 'accelerometer_y',
        'acc_z': 'accelerometer_z',
        'gyr_x': 'gyroscope_x',
        'gyr_y': 'gyroscope_y',
        'gyr_z': 'gyroscope_z'
    }
)

Data Preparation Parameters

If your data needs preparation (unit conversion, resampling, etc.), ParaDigMa can handle it automatically:

results = run_paradigma(
    dfs=df_raw,
    pipelines=['gait'],
    watch_side='left',
    skip_preparation=False,  # Default: perform preparation

    # Unit conversion
    accelerometer_units='m/s^2',  # Auto-converts to 'g'
    gyroscope_units='rad/s',      # Auto-converts to 'deg/s'

    # Resampling
    target_frequency=100.0,

    # Time handling
    time_input_unit='relative_s',  # Or 'absolute_datetime'

    # Orientation correction
    device_orientation=['x', 'y', 'z'],

    # Segmentation for non-contiguous data
    split_by_gaps=True,
    max_gap_seconds=1.5,
    min_segment_seconds=1.5,
)

If your data is already prepared (correct units, sampling rate, column names), skip preparation:

results = run_paradigma(
    dfs=df_prepared,
    pipelines=['gait', 'tremor'],
    watch_side='left',
    skip_preparation=True
)

Output Control

Output Directory

results = run_paradigma(
    dfs=df,
    pipelines=['gait'],
    watch_side='left',
    output_dir='./results',  # Custom output directory (default: './output')
)

Saving Intermediate Results

Control which intermediate steps are saved to disk:

results = run_paradigma(
    dfs=df,
    pipelines=['gait'],
    watch_side='left',
    save_intermediate=[
        'preparation',      # Prepared data
        'preprocessing',    # Preprocessed data
        'classification',   # Gait/tremor bout classifications
        'quantification',   # Segment-level measures
        'aggregation'       # Aggregated measures
    ]
)

To keep results only in memory without saving files:

results = run_paradigma(
    dfs=df,
    pipelines=['gait'],
    watch_side='left',
    save_intermediate=[]  # No files saved
)

Results Structure

Regardless of input format, results are returned in the same structure:

results = {
    'quantifications': {
        'gait': pd.DataFrame,    # Segment-level gait measures
        'tremor': pd.DataFrame,  # Segment-level tremor measures
    },
    'aggregations': {
        'gait': dict,            # Time-period aggregated gait measures
        'tremor': dict,          # Time-period aggregated tremor measures
    },
    'metadata': dict,            # Analysis metadata
    'errors': list               # List of errors (empty if successful)
}
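
For example, assuming the gait pipeline was among those requested, the returned objects can be accessed directly:

# Segment-level gait measures (DataFrame) and aggregated gait measures (dict)
gait_quantifications = results['quantifications']['gait']
gait_aggregations = results['aggregations']['gait']

print(gait_quantifications.head())
print(gait_aggregations)

# Analysis metadata
print(results['metadata'])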

Error Tracking

The errors list contains any errors encountered during processing. Always check this after running:

if results['errors']:
    print(f"Warning: {len(results['errors'])} error(s) occurred")
    for error in results['errors']:
        print(f"  Stage: {error['stage']}")
        print(f"  Error: {error['error']}")
        if 'file' in error:
            print(f"  File: {error['file']}")

Each error dict contains:

  • stage: Where the error occurred (loading, preparation, pipeline_execution, aggregation)

  • error: Error message

  • file: Filename (if file-specific, optional)

  • pipeline: Pipeline name (if pipeline-specific, optional)
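
For instance, to group errors by stage or isolate those from a specific pipeline (a sketch based on the fields above; note that file and pipeline may be absent from some entries):

# Group errors by the stage in which they occurred
errors_by_stage = {}
for error in results['errors']:
    errors_by_stage.setdefault(error['stage'], []).append(error)

# Errors raised by the gait pipeline specifically
gait_errors = [e for e in results['errors'] if e.get('pipeline') == 'gait']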

File Key Column

When processing multiple files, the quantifications DataFrame includes a file_key column:

  • Single DataFrame input: No file_key column

  • List input (2+ files): 'df_1', 'df_2', etc.

  • Dict input (2+ files): Custom keys you provided

This preserves traceability while keeping single-file results concise.
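
For example, assuming the gait pipeline was run on multiple inputs, you can group results per input segment (a minimal sketch; the file_key column is only present for multi-file input):

gait_quantifications = results['quantifications']['gait']

if 'file_key' in gait_quantifications.columns:
    # Number of quantified gait segments per input DataFrame or file
    print(gait_quantifications.groupby('file_key').size())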

Best Practices

  1. Single DataFrame: Use for single files or pre-aggregated data

  2. List of DataFrames: Use when you don’t need specific naming

  3. Dictionary of DataFrames: Use when segment identifiers are important for traceability

  4. Check file_key column: Trace results back to input segments in multi-file processing

  5. Skip preparation: Set skip_preparation=True if data is already standardized

  6. Save selectively: Only save intermediate results you need to reduce disk usage

See Also