# Data Input Formats Guide

ParaDigMa's `run_paradigma()` function supports multiple flexible input formats for providing data to the analysis pipeline.

## Prerequisites

Before using ParaDigMa, ensure your data meets the requirements:

- **Sensor requirements**: See [Sensor Requirements](sensor_requirements.md)
- **Device compatibility**: See [Supported Devices](supported_devices.md)
- **Data format**: Pandas DataFrame with required columns (see below)

## Input Format Options

The `dfs` parameter accepts three input formats:

### 1. Single DataFrame

Use when you have a single prepared DataFrame to analyze:

```python
import pandas as pd

from paradigma.orchestrator import run_paradigma

# Load your data
df = pd.read_parquet('data.parquet')

# Process with a single DataFrame
results = run_paradigma(
    dfs=df,                            # Single DataFrame
    pipelines=['gait'],
    watch_side='right',                # Required for gait pipeline
    save_intermediate=['aggregation']  # Saves to ./output by default
)
```

The DataFrame is automatically assigned the identifier `'df_1'` internally.

### 2. List of DataFrames

Use when you have multiple DataFrames that should be automatically assigned sequential IDs:

```python
# Load multiple data segments
df1 = pd.read_parquet('morning_session.parquet')
df2 = pd.read_parquet('afternoon_session.parquet')
df3 = pd.read_parquet('evening_session.parquet')

# Process as list - automatically assigned to 'df_1', 'df_2', 'df_3'
results = run_paradigma(
    dfs=[df1, df2, df3],
    pipelines=['gait'],
    watch_side='right',
    save_intermediate=['quantification', 'aggregation']
)
```

**Benefits:**

- Automatic segment ID assignment
- Each DataFrame processed independently before aggregation
- Aggregation performed across all input DataFrames

### 3. Dictionary of DataFrames

Use when you need custom identifiers for your data segments:

```python
# Create dictionary with custom segment identifiers
dfs = {
    'patient_001_morning': pd.read_parquet('session1.parquet'),
    'patient_001_evening': pd.read_parquet('session2.parquet'),
    'patient_002_morning': pd.read_parquet('session3.parquet'),
}

# Process with custom segment identifiers
results = run_paradigma(
    dfs=dfs,
    pipelines=['gait'],
    watch_side='right',
    save_intermediate=[]  # No files saved - results only in memory
)
```

**Benefits:**

- Custom segment identifiers in output
- Improved traceability of data sources
- Useful for multi-patient or multi-session datasets

## Loading Data from Disk

To automatically load data files from a directory:

```python
from paradigma.orchestrator import run_paradigma

# Load all files from a directory
results = run_paradigma(
    data_path='./data/patient_001/',
    pipelines=['gait'],
    watch_side='right',
    file_pattern='*.parquet',  # Optional: filter by pattern
    save_intermediate=['aggregation']
)
```

**Supported file formats:**

- Pandas: `.parquet`, `.csv`, `.pkl`, `.pickle`
- TSDF: `.meta` + `.bin` pairs
- Device-specific: `.avro` (Empatica), `.cwa` (Axivity)

See [Supported Devices](supported_devices.md) for device-specific loading examples.
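Directory loading also extends naturally to multi-subject studies. The sketch below loops over per-patient folders and writes each patient's output to its own directory; the `./data/patient_*` folder layout and the `all_results` dictionary are illustrative assumptions, not part of the ParaDigMa API:

```python
from pathlib import Path

from paradigma.orchestrator import run_paradigma

data_root = Path('./data')  # Hypothetical layout: one folder of parquet files per patient

all_results = {}
for patient_dir in sorted(data_root.glob('patient_*')):
    # Each patient's directory is loaded and analyzed independently
    all_results[patient_dir.name] = run_paradigma(
        data_path=str(patient_dir),
        pipelines=['gait'],
        watch_side='right',
        file_pattern='*.parquet',
        output_dir=f'./output/{patient_dir.name}',  # Keep per-patient outputs separate
        save_intermediate=['aggregation']
    )
```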
## Required DataFrame Columns

Your DataFrame must contain the following columns, depending on the pipeline:

### For Gait and Tremor Pipelines

```python
# Required columns
df.columns = ['time',
              'accelerometer_x', 'accelerometer_y', 'accelerometer_z',
              'gyroscope_x', 'gyroscope_y', 'gyroscope_z']
```

- `time`: Timestamp (float seconds or datetime)
- `accelerometer_x`, `accelerometer_y`, `accelerometer_z`: Accelerometer data
- `gyroscope_x`, `gyroscope_y`, `gyroscope_z`: Gyroscope data

### For Pulse Rate Pipeline

```python
# Required columns
df.columns = ['time', 'ppg']  # Accelerometer optional
```

- `time`: Timestamp (float seconds or datetime)
- `ppg`: PPG/BVP signal

### Custom Column Names

If your data uses different column names, rename the columns or use `column_mapping`:

```python
results = run_paradigma(
    dfs=df,
    pipelines=['gait'],
    watch_side='left',
    column_mapping={
        'timestamp': 'time',
        'acc_x': 'accelerometer_x',
        'acc_y': 'accelerometer_y',
        'acc_z': 'accelerometer_z',
        'gyr_x': 'gyroscope_x',
        'gyr_y': 'gyroscope_y',
        'gyr_z': 'gyroscope_z'
    }
)
```

## Data Preparation Parameters

If your data needs preparation (unit conversion, resampling, etc.), ParaDigMa can handle it automatically:

```python
results = run_paradigma(
    dfs=df_raw,
    pipelines=['gait'],
    watch_side='left',
    skip_preparation=False,  # Default: perform preparation

    # Unit conversion
    accelerometer_units='m/s^2',  # Auto-converts to 'g'
    gyroscope_units='rad/s',      # Auto-converts to 'deg/s'

    # Resampling
    target_frequency=100.0,

    # Time handling
    time_input_unit='relative_s',  # Or 'absolute_datetime'

    # Orientation correction
    device_orientation=['x', 'y', 'z'],

    # Segmentation for non-contiguous data
    split_by_gaps=True,
    max_gap_seconds=1.5,
    min_segment_seconds=1.5,
)
```

If your data is already prepared (correct units, sampling rate, column names), skip preparation:

```python
results = run_paradigma(
    dfs=df_prepared,
    pipelines=['gait', 'tremor'],
    watch_side='left',
    skip_preparation=True
)
```

## Output Control

### Output Directory

```python
results = run_paradigma(
    dfs=df,
    pipelines=['gait'],
    watch_side='left',
    output_dir='./results',  # Custom output directory (default: './output')
)
```

### Saving Intermediate Results

Control which intermediate steps are saved to disk:

```python
results = run_paradigma(
    dfs=df,
    pipelines=['gait'],
    watch_side='left',
    save_intermediate=[
        'preparation',     # Prepared data
        'preprocessing',   # Preprocessed data
        'classification',  # Gait/tremor bout classifications
        'quantification',  # Segment-level measures
        'aggregation'      # Aggregated measures
    ]
)
```

To keep results only in memory without saving files:

```python
results = run_paradigma(
    dfs=df,
    pipelines=['gait'],
    watch_side='left',
    save_intermediate=[]  # No files saved
)
```

## Results Structure

Regardless of input format, results are returned in the same structure:

```python
results = {
    'quantifications': {
        'gait': pd.DataFrame,    # Segment-level gait measures
        'tremor': pd.DataFrame,  # Segment-level tremor measures
    },
    'aggregations': {
        'gait': dict,    # Time-period aggregated gait measures
        'tremor': dict,  # Time-period aggregated tremor measures
    },
    'metadata': dict,  # Analysis metadata
    'errors': list     # List of errors (empty if successful)
}
```
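For instance, after a gait-pipeline run you can pull the individual pieces out of this dictionary directly (a minimal sketch; it assumes the `'gait'` entries exist because the gait pipeline was run):

```python
# Segment-level gait measures (a pandas DataFrame)
gait_quantifications = results['quantifications']['gait']
print(gait_quantifications.head())

# Aggregated gait measures (a dictionary of time-period aggregates)
gait_aggregations = results['aggregations']['gait']
print(gait_aggregations)

# Analysis metadata
print(results['metadata'])
```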
### Error Tracking

The `errors` list contains any errors encountered during processing. Always check it after running:

```python
if results['errors']:
    print(f"Warning: {len(results['errors'])} error(s) occurred")
    for error in results['errors']:
        print(f"  Stage: {error['stage']}")
        print(f"  Error: {error['error']}")
        if 'file' in error:
            print(f"  File: {error['file']}")
```

Each error dict contains:

- `stage`: Where the error occurred (`loading`, `preparation`, `pipeline_execution`, `aggregation`)
- `error`: Error message
- `file`: Filename (optional; present if the error is file-specific)
- `pipeline`: Pipeline name (optional; present if the error is pipeline-specific)

### File Key Column

When processing multiple files, each DataFrame in `quantifications` includes a `file_key` column:

- **Single DataFrame input**: No `file_key` column
- **List input (2+ files)**: `'df_1'`, `'df_2'`, etc.
- **Dict input (2+ files)**: The custom keys you provided

This preserves traceability while keeping single-file results concise.

## Best Practices

1. **Single DataFrame**: Use for single files or pre-aggregated data
2. **List of DataFrames**: Use when you don't need specific naming
3. **Dictionary of DataFrames**: Use when segment identifiers are important for traceability
4. **Check the `file_key` column**: Trace results back to input segments in multi-file processing
5. **Skip preparation**: Set `skip_preparation=True` if your data is already standardized
6. **Save selectively**: Only save the intermediate results you need, to reduce disk usage

## See Also

- [Sensor Requirements](sensor_requirements.md) - What sensor specs are needed
- [Supported Devices](supported_devices.md) - Device-specific loading examples
- [Data Preparation Tutorial](../tutorials/data_preparation.html) - Step-by-step preparation guide