Skip to main content

Bayesian Softmax / MPR

Python: NimbusSoftmax | Julia: RxPolyaModel
Mathematical Model: Bayesian Multinomial Logistic Regression
Bayesian Softmax (Python) / Bayesian MPR (Julia) is a Bayesian multinomial classification model with uncertainty quantification. It uses continuous transitions to map features to a latent space for probabilistic classification, making it ideal for complex multinomial tasks.
Available in Both SDKs:
  • Python SDK: NimbusSoftmax - Polya-Gamma variational inference
  • Julia SDK: RxPolyaModel - Multinomial Probit Regression with RxInfer.jl
Both provide Bayesian multinomial classification with uncertainty quantification.

Overview

Bayesian Softmax/MPR provides a powerful alternative to traditional Gaussian classifiers:
  • Full Bayesian inference with posterior probability distributions
  • Uncertainty quantification for each prediction
  • Continuous transition mapping from features to latent space
  • Flexible decision boundaries (non-Gaussian)
  • Fast inference (~15-25ms per trial)
  • Training and calibration support
  • Batch and streaming inference modes

Quick Start

from nimbus_bci import NimbusSoftmax
import numpy as np

# Create and fit classifier
clf = NimbusSoftmax(w_scale=1.0)
clf.fit(X_train, y_train)

# Predict with uncertainty
predictions = clf.predict(X_test)
probabilities = clf.predict_proba(X_test)

# Good for non-Gaussian decision boundaries

When to Use Bayesian MPR

Bayesian MPR is ideal for:
  • Complex multinomial classification tasks
  • When you need full posterior uncertainty over multinomial distributions
  • Advanced BCI applications requiring flexible classification
  • Tasks where Bayesian multinomial probit regression is theoretically appropriate
  • When RxLDA/RxGMM accuracy is insufficient for complex distributions
Consider Bayesian LDA or Bayesian GMM instead if:
  • You need faster inference (RxLDA/RxGMM are 5-10ms faster)
  • Classes are well-separated with Gaussian distributions
  • Interpretability of class centers is important (Bayesian MPR is discriminative)
  • You want Mahalanobis distance-based outlier detection

Model Architecture

Mathematical Foundation (Bayesian Multinomial Probit Regression)

Bayesian MPR implements Bayesian Multinomial Probit Regression using continuous transitions and MultinomialPolya likelihood: Generative Model:
B ~ N(ξβ, Wβ⁻¹)           # Regression coefficients prior
W ~ Wishart(ν, Λ)          # Precision matrix prior
Ψᵢ ~ ContinuousTransition(xᵢ, B, W)  # Latent transformation
yᵢ ~ MultinomialPolya(N, Ψᵢ)         # Observation model
Where:
  • B ∈ ℝ^((K-1)×D) = regression coefficient matrix
  • W = precision matrix in (K-1)-dimensional latent space
  • xᵢ = feature vector for observation i
  • Ψᵢ = latent (K-1)-dimensional representation
  • N = number of trials per observation (typically 1 for classification)
Key Feature: The model works in (K-1) dimensions due to the sum-to-one constraint in multinomial distributions, with the K-th class serving as the reference category.

Hyperparameters

Bayesian MPR supports configurable hyperparameters for optimal performance tuning: Available Hyperparameters:
ParameterTypeDefaultDescription
NInt1Number of trials per observation (typically 1 for classification)
ξβVectorones((K-1)×D)Prior mean for regression coefficients B
Matrix1e-5 × IPrior precision for regression coefficients B
W_dfFloat64K + 5Wishart degrees of freedom for precision matrix W
W_scaleMatrixIWishart scale matrix (K-1 × K-1)
Parameter Effects:
  • N: Number of trials per observation
    • Typically set to 1 for classification tasks
    • Higher values model count data (less common in BCI)
  • ξβ: Prior mean for regression coefficients
    • Default: ones((K-1)×D) provides mild regularization
    • Can be customized if you have prior knowledge of coefficient values
  • : Prior precision for regression coefficients
    • Lower values (1e-6) → Weaker prior, more data-driven
    • Higher values (1e-4) → Stronger prior, more regularization
    • Default (1e-5) provides balanced regularization
  • W_df: Wishart degrees of freedom
    • Controls strength of precision matrix prior
    • Higher values → Stronger regularization
    • Default (K + 5) provides reasonable regularization
  • W_scale: Wishart scale matrix
    • Shape of the prior precision distribution
    • Default (identity matrix) assumes no prior covariance structure
Hyperparameter configuration allows you to optimize model behavior for your specific dataset characteristics (SNR, trial count, data quality).

Model Structure

struct RxPolyaModel <: BCIModel
    B_posterior                        # Learned regression coefficients posterior
    W_posterior                        # Learned precision matrix posterior
    metadata::ModelMetadata            # Model info
    N::Int                            # Trials per observation
    ξβ::Vector{Float64}               # Prior mean for B
::Matrix{Float64}               # Prior precision for B
    W_df::Float64                     # Wishart DOF
    W_scale::Matrix{Float64}          # Wishart scale matrix
end

RxInfer Implementation

The RxPolya model uses RxInfer.jl for variational Bayesian inference: Learning Phase:
@model function RxPolya_learning_model(obs, N, X, n_classes, n_features, 
                                       ξβ, Wβ, W_df, W_scale)
    # Prior on regression coefficients
    B ~ MvNormalWeightedMeanPrecision(ξβ, Wβ)
    
    # Prior on precision matrix
    W ~ Wishart(W_df, W_scale)
    
    # Likelihood with continuous transition
    for i in eachindex(obs)
        Ψ[i] ~ ContinuousTransition(X[i], B, W)
        obs[i] ~ MultinomialPolya(N, Ψ[i])
    end
end
Prediction Phase:
@model function RxPolya_predictive(obs, N, X, B_posterior, W_posterior, 
                                   n_classes, n_features)
    B ~ B_posterior
    W ~ W_posterior
    
    # Predict latent representations
    for i in eachindex(X)
        Ψ[i] ~ ContinuousTransition(X[i], B, W)
        obs[i] ~ MultinomialPolya(N, Ψ[i])
    end
end

Usage

1. Load Pre-trained Model

Python SDK: The Python SDK trains models locally. See Python SDK Quickstart for training examples.
from nimbus_bci import NimbusSoftmax
import pickle

# Python SDK: Train locally
clf = NimbusSoftmax(reg_strength=1.0)
clf.fit(X_train, y_train)

# Save and load
with open("my_motor_imagery_softmax.pkl", "wb") as f:
    pickle.dump(clf, f)

with open("my_motor_imagery_softmax.pkl", "rb") as f:
    clf = pickle.load(f)

print(f"Model info: {clf.classes_} classes, {clf.n_features_in_} features")

2. Train Custom Model

from nimbus_bci import NimbusSoftmax
from nimbus_bci.compat import extract_csp_features
import mne
import pickle

# Load and preprocess EEG data
raw = mne.io.read_raw_gdf("motor_imagery.gdf", preload=True)
raw.filter(8, 30)

events = mne.find_events(raw)
event_id = {'left': 1, 'right': 2, 'feet': 3, 'tongue': 4}
epochs = mne.Epochs(raw, events, event_id, tmin=0, tmax=4, preload=True)

# Extract CSP features
csp_features, csp = extract_csp_features(epochs, n_components=8)
labels = epochs.events[:, 2]

# Train NimbusSoftmax with default hyperparameters
clf = NimbusSoftmax()
clf.fit(csp_features, labels)

# Or train with custom hyperparameters
clf_tuned = NimbusSoftmax(
    reg_strength=1.0    # L2 regularization strength
)
clf_tuned.fit(csp_features, labels)

# Save
with open("my_motor_imagery_softmax.pkl", "wb") as f:
    pickle.dump(clf_tuned, f)

print(f"Training accuracy: {clf_tuned.score(csp_features, labels):.1%}")
Training Parameters:
  • iterations: Number of variational inference iterations (default: 50)
    • More iterations = better convergence but slower training
    • 50-100 is typically sufficient
  • showprogress: Display progress bar during training
  • name: Model identifier
  • description: Model description for documentation
  • N: Trials per observation (default: 1, range: [1, ∞))
  • ξβ: Prior mean for B (default: ones((K-1)×D))
  • : Prior precision for B (default: 1e-5 × I)
  • W_df: Wishart degrees of freedom (default: K + 5)
  • W_scale: Wishart scale matrix (default: I)

3. Subject-Specific Calibration

Fine-tune a pre-trained model with subject-specific data (much faster than training from scratch):
from nimbus_bci import NimbusSoftmax
import pickle

# Load baseline model
with open("motor_imagery_softmax_baseline.pkl", "rb") as f:
    base_clf = pickle.load(f)

# Collect 10-20 calibration trials from new subject
X_calib, y_calib = collect_calibration_trials()

# Personalize using online learning
personalized_clf = NimbusSoftmax()
personalized_clf.fit(X_baseline, y_baseline)

for _ in range(10):
    personalized_clf.partial_fit(X_calib, y_calib)

# Save
with open("subject_001_calibrated.pkl", "wb") as f:
    pickle.dump(personalized_clf, f)
Calibration Benefits:
  • Requires only 10-20 trials per class (vs 50-100 for training from scratch)
  • Faster: 20 iterations vs 50-100
  • Better generalization: Uses pre-trained model as prior
  • Typical accuracy improvement: 5-15% over generic model
  • Hyperparameters preserved: calibrate_model() automatically uses the same hyperparameters as the base model

4. Batch Inference

Process multiple trials efficiently:
import numpy as np

# Run batch inference
predictions = clf.predict(X_test)
probabilities = clf.predict_proba(X_test)
confidences = np.max(probabilities, axis=1)

# Analyze results
print(f"Predictions: {predictions}")
print(f"Mean confidence: {np.mean(confidences):.3f}")

# Calculate accuracy
accuracy = np.mean(predictions == y_test)
print(f"Accuracy: {accuracy * 100:.1f}%")

# Calculate ITR
from nimbus_bci.metrics import calculate_itr
itr = calculate_itr(accuracy, n_classes=4, trial_time=4.0)
print(f"ITR: {itr:.1f} bits/minute")

5. Streaming Inference

Real-time chunk-by-chunk processing:
For detailed Python streaming examples, see Python SDK Streaming Inference.
from nimbus_bci import StreamingSession

# Initialize streaming session
session = StreamingSession(clf.model_, metadata_with_chunk_size)

# Process chunks as they arrive
for chunk in eeg_feature_stream:
    result = session.process_chunk(chunk)
    print(f"Chunk: pred={result.prediction}, conf={result.confidence:.3f}")

# Finalize trial with aggregation
final_result = session.finalize_trial()
print(f"Final: pred={final_result.prediction}, conf={final_result.confidence:.3f}")

session.reset()

Hyperparameter Tuning

Fine-tune Bayesian MPR for optimal performance on your specific dataset.

When to Tune Hyperparameters

Consider tuning when:
  • Default performance is unsatisfactory
  • You have specific data characteristics (very noisy or very clean)
  • You have limited or extensive training data
  • You want to optimize for your specific paradigm

Tuning Strategies

For High SNR / Clean Data / Many Trials

Use weaker priors to let the data drive the model:
from nimbus_bci import NimbusSoftmax

# Weaker regularization for clean data
clf = NimbusSoftmax(reg_strength=0.1)  # Lower regularization
clf.fit(X_train, y_train)
Use when:
  • SNR > 5 dB
  • 100+ trials per class
  • Clean, artifact-free data
  • Well-controlled experimental conditions

For Low SNR / Noisy Data / Few Trials

Use stronger priors for stability:
from nimbus_bci import NimbusSoftmax

# Stronger regularization for noisy data
clf = NimbusSoftmax(reg_strength=10.0)  # Higher regularization
clf.fit(X_train, y_train)
Use when:
  • SNR < 2 dB
  • 40-80 trials per class
  • Noisy data or limited artifact removal
  • Challenging recording conditions

Balanced / Default Settings

The defaults work well for most scenarios:
from nimbus_bci import NimbusSoftmax

# Default regularization (balanced)
clf = NimbusSoftmax(reg_strength=1.0)  # Default
clf.fit(X_train, y_train)
Use when:
  • Moderate SNR (2-5 dB)
  • 80-150 trials per class
  • Standard BCI recording conditions
  • Starting point for experimentation

Hyperparameter Search Example

Systematically search for optimal hyperparameters:
from sklearn.model_selection import GridSearchCV, train_test_split
from nimbus_bci import NimbusSoftmax

# Split data
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define search grid
param_grid = {
    'reg_strength': [0.1, 1.0, 10.0]
}

# GridSearchCV
clf = NimbusSoftmax()
grid_search = GridSearchCV(
    clf, param_grid, cv=5, scoring='accuracy', verbose=2
)
grid_search.fit(X_train, y_train)

# Best parameters
print("\nBest hyperparameters:")
print(f"  reg_strength: {grid_search.best_params_['reg_strength']}")
print(f"  Validation accuracy: {grid_search.best_score_ * 100:.1f}%")

# Use best model
best_clf = grid_search.best_estimator_

Quick Tuning Guidelines

Scenario scaleW_df offsetNotes
Excellent data quality1e-62Minimal regularization
Good data quality1e-5 (default)5 (default)Balanced approach
Moderate data quality1e-5 to 1e-45-8Slight regularization
Poor data quality1e-410Strong regularization
Very limited trials1e-415Maximum regularization
Pro Tip: Start with defaults (Wβ = 1e-5 × I, W_df = K + 5) and only tune if performance is unsatisfactory. The defaults are optimized for typical BCI scenarios.

Training Requirements

Data Requirements

  • Minimum: 40 trials per class (160 total for 4-class)
  • Recommended: 80+ trials per class (320+ total for 4-class)
  • For calibration: 10-20 trials per class sufficient
RxPolya requires at least 2 observations to estimate regression coefficients and the precision matrix. Single-trial training will raise an error.

Feature Normalization

Critical for cross-session BCI performance!Normalize your features before training for 15-30% accuracy improvement across sessions.
from sklearn.preprocessing import StandardScaler
import pickle

# Estimate normalization from training data
scaler = StandardScaler()
X_train_norm = scaler.fit_transform(X_train)

# Train with normalized features
clf = NimbusSoftmax()
clf.fit(X_train_norm, y_train)

# Save model and scaler together
with open("model_with_scaler.pkl", "wb") as f:
    pickle.dump({'model': clf, 'scaler': scaler}, f)

# Later: Apply same normalization to test data
X_test_norm = scaler.transform(X_test)
predictions = clf.predict(X_test_norm)
See the Feature Normalization guide for complete details.

Feature Requirements

Bayesian MPR expects preprocessed features, not raw EEG: Required preprocessing:
  • Bandpass filtering (8-30 Hz for motor imagery)
  • Artifact removal (ICA recommended)
  • Spatial filtering (CSP for motor imagery)
  • Feature extraction (log-variance for CSP features)
NOT accepted:
  • Raw EEG channels
  • Unfiltered data
  • Non-extracted features
See Preprocessing Requirements for details.

Performance Characteristics

Computational Performance

OperationLatencyNotes
Training15-40 seconds50 iterations, 100 trials per class
Calibration8-20 seconds20 iterations, 20 trials per class
Batch Inference15-25ms per trial10 iterations
Streaming Chunk15-25ms10 iterations per chunk
Slightly slower than RxLDA/RxGMM due to continuous transition complexity.

Classification Accuracy

ParadigmClassesTypical AccuracyWhen to Use RxPolya
Motor Imagery2-470-85%When RxLDA/RxGMM insufficient
P3002 (Target/Non-target)85-95%Complex distributions
SSVEP2-685-98%Advanced applications
Bayesian MPR typically provides comparable or slightly better accuracy than RxLDA/RxGMM for complex distributions, at the cost of ~5-10ms additional latency.

Model Inspection

View Model Parameters

import numpy as np

# Model coefficients
print("Regression coefficients (weights):")
print(f"  Shape: {clf.model_['coef'].shape}")
print(f"  Coefficients:\n{clf.model_['coef']}")

print("\nIntercepts:")
print(f"  Intercepts: {clf.model_['intercept']}")

# Model info
print("\nModel info:")
print(f"  Features: {clf.n_features_in_}")
print(f"  Classes: {clf.classes_}")
print(f"  Regularization: {clf.reg_strength}")

# Class probabilities (from training)
print("\nClass priors (estimated from training):")
for k, class_label in enumerate(clf.classes_):
    prior = np.sum(y_train == class_label) / len(y_train)
    print(f"  Class {class_label}: {prior:.3f}")

Compare Models

from sklearn.model_selection import cross_val_score
from nimbus_bci import NimbusSoftmax

# Compare different regularization strengths
results = []
for reg in [0.1, 1.0, 10.0]:
    clf = NimbusSoftmax(reg_strength=reg)
    scores = cross_val_score(clf, X_train, y_train, cv=5)
    accuracy = np.mean(scores)
    
    print(f"Regularization: {reg} → Accuracy: {accuracy * 100:.1f}%")
    results.append((reg, clf, accuracy))

Advantages & Limitations

Advantages

Flexible Multinomial Classification: Natural handling of multinomial distributions
Continuous Transition Mapping: Sophisticated feature-to-latent-space transformation
Full Bayesian Uncertainty: Complete posterior distributions over predictions
No Gaussian Assumption: More flexible than Gaussian classifiers (RxLDA/RxGMM)
Production-Ready: Battle-tested in real BCI applications
Calibration Support: Fast subject-specific adaptation

Limitations

More Complex: More parameters than RxLDA, requires careful hyperparameter tuning
Slower Inference: ~15-25ms vs ~10-15ms for RxLDA
(K-1) Dimensional Space: Works in reduced dimension (not always intuitive)
No Mahalanobis Distance: Discriminative model lacks explicit class centers for outlier detection
Requires Multiple Trials: Cannot train on single trial (minimum 2 observations)
Less Interpretable: Harder to interpret than generative models with explicit means

Comparison: Bayesian MPR vs RxLDA vs RxGMM

AspectBayesian MPRRxLDARxGMM
Mathematical ModelBayesian Multinomial Probit RegressionPooled Gaussian Classifier (PGC)Heteroscedastic Gaussian Classifier (HGC)
Representation(K-1) dimensional latent spaceK class means with shared precisionK class means with class-specific precisions
Training SpeedModerateFastestModerate
Inference Speed15-25ms10-15ms15-20ms
FlexibilityHighestLowestHigh
Best ForComplex multinomial tasksWell-separated classesOverlapping classes with different covariances
InterpretabilityLow (discriminative)High (generative with means)High (generative with means)
Mahalanobis Distance❌ No (no explicit means)✅ Yes✅ Yes
Entropy Metrics✅ Yes✅ Yes✅ Yes
Free Energy✅ Yes (training only)✅ Yes✅ Yes

Decision Tree

Do you need Mahalanobis-based outlier detection?
├─ Yes → Use RxLDA or RxGMM
└─ No → Are classes well-separated with Gaussian structure?
    ├─ Yes → Use RxLDA (fastest)
    └─ No → Do classes have different covariances?
        ├─ Yes → Use RxGMM
        └─ No → Are distributions complex/non-Gaussian?
            ├─ Yes → Use Bayesian MPR
            └─ No → Use RxLDA (simplest)

Practical Examples

Motor Imagery Classification

from nimbus_bci import NimbusSoftmax
from sklearn.metrics import classification_report

# Train on motor imagery data (4-class MI)
clf = NimbusSoftmax(reg_strength=1.0)
clf.fit(X_train, y_train)  # CSP features, 4 classes

# Test
predictions = clf.predict(X_test)

# Calculate per-class accuracy
print(classification_report(y_test, predictions))

# Per-class breakdown
for class_id in clf.classes_:
    class_mask = y_test == class_id
    class_acc = np.mean(predictions[class_mask] == class_id)
    print(f"Class {class_id} accuracy: {class_acc * 100:.1f}%")

P300 Detection

from nimbus_bci import NimbusSoftmax
from sklearn.metrics import confusion_matrix

# Train on P300 data (binary: 0=non-target, 1=target)
clf = NimbusSoftmax(reg_strength=1.0)
clf.fit(X_train, y_train)  # ERP features

# Test
predictions = clf.predict(X_test)

# Calculate sensitivity/specificity
cm = confusion_matrix(y_test, predictions)
tn, fp, fn, tp = cm.ravel()

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

print(f"Sensitivity: {sensitivity * 100:.1f}%")
print(f"Specificity: {specificity * 100:.1f}%")

Comparing with RxLDA/RxGMM

from nimbus_bci import NimbusLDA, NimbusGMM, NimbusSoftmax

# Train all three models
lda_clf = NimbusLDA()
lda_clf.fit(X_train, y_train)

gmm_clf = NimbusGMM()
gmm_clf.fit(X_train, y_train)

softmax_clf = NimbusSoftmax()
softmax_clf.fit(X_train, y_train)

# Evaluate all three
models = [
    ("NimbusLDA", lda_clf),
    ("NimbusGMM", gmm_clf),
    ("NimbusSoftmax", softmax_clf)
]

for name, clf in models:
    predictions = clf.predict(X_test)
    probas = clf.predict_proba(X_test)
    
    accuracy = np.mean(predictions == y_test)
    mean_conf = np.mean(np.max(probas, axis=1))
    mean_entropy = -np.mean(np.sum(probas * np.log(probas + 1e-10), axis=1))
    
    print(f"{name}:")
    print(f"  Accuracy: {accuracy * 100:.1f}%")
    print(f"  Mean confidence: {mean_conf:.3f}")
    print(f"  Mean entropy: {mean_entropy:.3f} bits")
    print()

Next Steps

References

Implementation: Theory:
  • Albert, J. H., & Chib, S. (1993). “Bayesian analysis of binary and polychotomous response data”
  • Imai, K., & van Dyk, D. A. (2005). “A Bayesian analysis of the multinomial probit model”
  • Multinomial probit regression with Bayesian inference
BCI Applications:
  • Blankertz et al. (2008). “Optimizing spatial filters for robust EEG single-trial analysis”
  • Lotte et al. (2018). “A review of classification algorithms for EEG-based BCI”