Bayesian Multinomial (NimbusSoftmax / NimbusProbit)

Python: NimbusSoftmax | Julia: NimbusProbit
Model family: Non-Gaussian multinomial classification with uncertainty quantification
Bayesian Softmax is a Bayesian multinomial classifier that maps features to a latent space via continuous transitions, yielding probabilistic predictions with full uncertainty quantification. It is well suited to complex multinomial tasks where Gaussian class-conditional models fall short.
Availability
  • Python SDK: ✅ NimbusSoftmax (Polya-Gamma VI; optional dependency)
  • Julia SDK: ✅ NimbusProbit

Start Here

Quickstart

Start with SDK setup and first inference workflow.

Model Selection

Compare Nimbus models by data characteristics and use case.

Examples

See practical BCI examples for training and inference.

Overview

Bayesian Softmax provides a powerful alternative to traditional Gaussian classifiers:
  • ✅ Full Bayesian inference with posterior probability distributions
  • ✅ Uncertainty quantification for each prediction
  • ✅ Continuous transition mapping from features to latent space
  • ✅ Flexible decision boundaries (non-Gaussian)
  • ✅ Fast inference (~15-25ms per trial)
  • ✅ Training and calibration support
  • ✅ Batch and streaming inference modes

Quick Start

from nimbus_bci import NimbusSoftmax
import numpy as np

# Create and fit classifier
clf = NimbusSoftmax(reg_strength=1.0)
clf.fit(X_train, y_train)

# Predict with uncertainty
predictions = clf.predict(X_test)
probabilities = clf.predict_proba(X_test)

# Good for non-Gaussian decision boundaries

When to Use Bayesian Softmax

Bayesian Softmax is ideal for:
  • Complex multinomial classification tasks
  • When you need full posterior uncertainty over multinomial distributions
  • Advanced BCI applications requiring flexible classification
  • Problems where Gaussian class-conditional structure (NimbusLDA/NimbusQDA) is a poor fit
  • When NimbusLDA/NimbusQDA accuracy is insufficient for complex distributions
Consider Bayesian LDA or Bayesian QDA instead if:
  • You need faster inference (NimbusLDA/NimbusQDA are 5-10ms faster)
  • Classes are well-separated with Gaussian distributions
  • Interpretability of class centers is important (Bayesian Softmax is discriminative)
  • You want Mahalanobis distance-based outlier detection (for Bayesian Softmax, an entropy-based alternative is sketched below)
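Because Bayesian Softmax is discriminative (no explicit class means), Mahalanobis-based outlier detection is unavailable, but predictive entropy works as a substitute for flagging ambiguous trials. A minimal sketch, assuming a fitted clf with the predict_proba API shown throughout this page; the 1.5-bit threshold is illustrative:

import numpy as np

def entropy_outlier_mask(clf, X, threshold_bits=1.5):
    """Flag trials whose predictive entropy (in bits) exceeds a threshold."""
    probas = clf.predict_proba(X)                        # (n_trials, n_classes)
    entropy = -np.sum(probas * np.log2(probas + 1e-12), axis=1)
    return entropy > threshold_bits                      # True = ambiguous trial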

Model Architecture

Mathematical Foundation (Bayesian Multinomial Logistic Regression)

Bayesian Softmax implements Bayesian multinomial logistic regression using continuous transitions and a MultinomialPolya likelihood.

Generative Model:
B ~ N(ξβ, Wβ⁻¹)           # Regression coefficients prior
W ~ Wishart(ν, Λ)          # Precision matrix prior
Ψᵢ ~ ContinuousTransition(xᵢ, B, W)  # Latent transformation
yᵢ ~ MultinomialPolya(N, Ψᵢ)         # Observation model
Where:
  • B ∈ ℝ^((K-1)×D) = regression coefficient matrix
  • W = precision matrix in (K-1)-dimensional latent space
  • xᵢ = feature vector for observation i
  • Ψᵢ = latent (K-1)-dimensional representation
  • N = number of trials per observation (typically 1 for classification)
Key Feature: The model works in (K-1) dimensions due to the sum-to-one constraint in multinomial distributions, with the K-th class serving as the reference category.
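
To make the (K-1)-dimensional latent space concrete, here is a minimal numpy sketch of the standard multinomial logistic link that turns a latent vector Ψ ∈ ℝ^(K-1) into K class probabilities, with class K as the reference category (a generic illustration of the link function, not the library's internal code):

import numpy as np

def latent_to_probs(psi):
    """Map a (K-1)-dim latent vector to K probabilities (class K = reference)."""
    logits = np.append(psi, 0.0)   # reference class has logit 0
    logits -= logits.max()         # shift for numerical stability
    exp_l = np.exp(logits)
    return exp_l / exp_l.sum()     # length K, sums to 1

print(latent_to_probs(np.array([0.5, -1.0, 2.0])))  # K = 4 classes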

Hyperparameters

Bayesian Softmax supports configurable hyperparameters for performance tuning.

Available Hyperparameters:

Parameter | Type    | Default        | Description
N         | Int     | 1              | Number of trials per observation (typically 1 for classification)
ξβ        | Vector  | ones((K-1)×D)  | Prior mean for regression coefficients B
Wβ        | Matrix  | 1e-5 × I       | Prior precision for regression coefficients B
W_df      | Float64 | K + 5          | Wishart degrees of freedom for precision matrix W
W_scale   | Matrix  | I              | Wishart scale matrix, (K-1) × (K-1)
Parameter Effects:
  • N: Number of trials per observation
    • Typically set to 1 for classification tasks
    • Higher values model count data (less common in BCI)
  • ξβ: Prior mean for regression coefficients
    • Default: ones((K-1)×D) provides mild regularization
    • Can be customized if you have prior knowledge of coefficient values
  • Wβ: Prior precision for regression coefficients
    • Lower values (1e-6) → Weaker prior, more data-driven
    • Higher values (1e-4) → Stronger prior, more regularization
    • Default (1e-5) provides balanced regularization
  • W_df: Wishart degrees of freedom
    • Controls strength of precision matrix prior
    • Higher values → Stronger regularization
    • Default (K + 5) provides reasonable regularization
  • W_scale: Wishart scale matrix
    • Shape of the prior precision distribution
    • Default (identity matrix) assumes no prior covariance structure
Hyperparameter configuration allows you to optimize model behavior for your specific dataset characteristics (SNR, trial count, data quality).
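
As a quick sanity check on the Wishart prior: for W ~ Wishart(ν, Λ), the prior mean is E[W] = νΛ, so increasing W_df raises the expected precision and strengthens the shrinkage. A minimal numpy sketch with the defaults (ν = K + 5, Λ = I) for K = 4 classes:

import numpy as np

K = 4                      # number of classes
W_df = K + 5               # default degrees of freedom
W_scale = np.eye(K - 1)    # default scale matrix, (K-1) x (K-1)

E_W = W_df * W_scale       # Wishart prior mean: E[W] = df * scale
print(E_W)                 # larger W_df -> larger expected precision -> stronger regularization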

Model Structure

struct NimbusProbit <: BCIModel
    B_posterior                        # Learned regression coefficients posterior
    W_posterior                        # Learned precision matrix posterior
    metadata::ModelMetadata            # Model info
    N::Int                            # Trials per observation
    ξβ::Vector{Float64}               # Prior mean for B
    Wβ::Matrix{Float64}               # Prior precision for B
    W_df::Float64                     # Wishart DOF
    W_scale::Matrix{Float64}          # Wishart scale matrix
end

RxInfer Implementation

The NimbusSoftmax model uses RxInfer.jl for variational Bayesian inference.

Learning Phase:
@model function RxPolya_learning_model(obs, N, X, n_classes, n_features, 
                                       ξβ, Wβ, W_df, W_scale)
    # Prior on regression coefficients
    B ~ MvNormalWeightedMeanPrecision(ξβ, Wβ)
    
    # Prior on precision matrix
    W ~ Wishart(W_df, W_scale)
    
    # Likelihood with continuous transition
    for i in eachindex(obs)
        Ψ[i] ~ ContinuousTransition(X[i], B, W)
        obs[i] ~ MultinomialPolya(N, Ψ[i])
    end
end
Prediction Phase:
@model function RxPolya_predictive(obs, N, X, B_posterior, W_posterior, 
                                   n_classes, n_features)
    B ~ B_posterior
    W ~ W_posterior
    
    # Predict latent representations
    for i in eachindex(X)
        Ψ[i] ~ ContinuousTransition(X[i], B, W)
        obs[i] ~ MultinomialPolya(N, Ψ[i])
    end
end

Usage

1. Load Pre-trained Model

Python SDK: The Python SDK trains models locally. See Python SDK Quickstart for training examples.
from nimbus_bci import NimbusSoftmax
import pickle

# Python SDK: Train locally
clf = NimbusSoftmax(reg_strength=1.0)
clf.fit(X_train, y_train)

# Save and load
with open("my_motor_imagery_softmax.pkl", "wb") as f:
    pickle.dump(clf, f)

with open("my_motor_imagery_softmax.pkl", "rb") as f:
    clf = pickle.load(f)

print(f"Model info: {clf.classes_} classes, {clf.n_features_in_} features")

2. Train Custom Model

from nimbus_bci import NimbusSoftmax
from nimbus_bci.compat import extract_csp_features
import mne
import pickle

# Load and preprocess EEG data
raw = mne.io.read_raw_gdf("motor_imagery.gdf", preload=True)
raw.filter(8, 30)

events = mne.find_events(raw)
event_id = {'left': 1, 'right': 2, 'feet': 3, 'tongue': 4}
epochs = mne.Epochs(raw, events, event_id, tmin=0, tmax=4, preload=True)

# Extract CSP features
csp_features, csp = extract_csp_features(epochs, n_components=8)
labels = epochs.events[:, 2]

# Train NimbusSoftmax with default hyperparameters
clf = NimbusSoftmax()
clf.fit(csp_features, labels)

# Or train with custom hyperparameters
clf_tuned = NimbusSoftmax(
    reg_strength=1.0    # L2 regularization strength
)
clf_tuned.fit(csp_features, labels)

# Save
with open("my_motor_imagery_softmax.pkl", "wb") as f:
    pickle.dump(clf_tuned, f)

print(f"Training accuracy: {clf_tuned.score(csp_features, labels):.1%}")
Training Parameters:
  • iterations: Number of variational inference iterations (default: 50)
    • More iterations = better convergence but slower training
    • 50-100 is typically sufficient
  • showprogress: Display progress bar during training
  • name: Model identifier
  • description: Model description for documentation
  • N: Trials per observation (default: 1, range: [1, ∞))
  • ξβ: Prior mean for B (default: ones((K-1)×D))
  • Wβ: Prior precision for B (default: 1e-5 × I)
  • W_df: Wishart degrees of freedom (default: K + 5)
  • W_scale: Wishart scale matrix (default: I)

3. Subject-Specific Calibration

Fine-tune a pre-trained model with subject-specific data (much faster than training from scratch):
from nimbus_bci import NimbusSoftmax
import pickle

# Load baseline model
with open("motor_imagery_softmax_baseline.pkl", "rb") as f:
    base_clf = pickle.load(f)

# Collect 10-20 calibration trials from the new subject
X_calib, y_calib = collect_calibration_trials()  # placeholder: your own acquisition routine

# Personalize the pre-trained model using online learning
personalized_clf = base_clf

for _ in range(10):
    personalized_clf.partial_fit(X_calib, y_calib)

# Save
with open("subject_001_calibrated.pkl", "wb") as f:
    pickle.dump(personalized_clf, f)
Calibration Benefits:
  • Requires only 10-20 trials per class (vs 50-100 for training from scratch)
  • Faster: 20 iterations vs 50-100
  • Better generalization: Uses pre-trained model as prior
  • Typical accuracy improvement: 5-15% over generic model
  • Hyperparameters preserved: calibration automatically reuses the base model's hyperparameters

4. Batch Inference

Process multiple trials efficiently:
import numpy as np

# Run batch inference
predictions = clf.predict(X_test)
probabilities = clf.predict_proba(X_test)
confidences = np.max(probabilities, axis=1)

# Analyze results
print(f"Predictions: {predictions}")
print(f"Mean confidence: {np.mean(confidences):.3f}")

# Calculate accuracy
accuracy = np.mean(predictions == y_test)
print(f"Accuracy: {accuracy * 100:.1f}%")

# Calculate ITR
from nimbus_bci.metrics import calculate_itr
itr = calculate_itr(accuracy, n_classes=4, trial_time=4.0)
print(f"ITR: {itr:.1f} bits/minute")
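
For reference, the information transfer rate here presumably follows the standard Wolpaw formula (N = number of classes, P = accuracy, T = trial time in seconds); a minimal sketch if you want it without the library helper:

import numpy as np

def wolpaw_itr(accuracy, n_classes, trial_time):
    """Wolpaw ITR in bits/minute: bits per trial times trials per minute."""
    p, n = accuracy, n_classes
    if p <= 1.0 / n:
        return 0.0                                  # at or below chance
    if p >= 1.0:
        return np.log2(n) * (60.0 / trial_time)     # perfect accuracy
    bits = np.log2(n) + p * np.log2(p) + (1 - p) * np.log2((1 - p) / (n - 1))
    return bits * (60.0 / trial_time)

print(f"{wolpaw_itr(0.80, n_classes=4, trial_time=4.0):.1f} bits/minute")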

5. Streaming Inference

Real-time chunk-by-chunk processing (for detailed Python streaming examples, see Python SDK Streaming Inference):
from nimbus_bci import StreamingSession

# Initialize streaming session (metadata_with_chunk_size: your session metadata, including chunk size)
session = StreamingSession(clf.model_, metadata_with_chunk_size)

# Process chunks as they arrive
for chunk in eeg_feature_stream:
    result = session.process_chunk(chunk)
    print(f"Chunk: pred={result.prediction}, conf={result.confidence:.3f}")

# Finalize trial with aggregation
final_result = session.finalize_trial()
print(f"Final: pred={final_result.prediction}, conf={final_result.confidence:.3f}")

session.reset()
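
One common way to combine chunk-level outputs into a trial-level decision is to average log-probabilities across chunks. This is purely illustrative of the idea; finalize_trial's actual aggregation is internal to the SDK:

import numpy as np

def aggregate_chunks(chunk_probas):
    """Combine per-chunk class probabilities by averaging log-probabilities."""
    log_p = np.log(np.asarray(chunk_probas) + 1e-12)   # (n_chunks, n_classes)
    mean_log_p = log_p.mean(axis=0)
    probas = np.exp(mean_log_p - mean_log_p.max())     # exponentiate safely
    probas /= probas.sum()                             # renormalize
    return probas.argmax(), probas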

Hyperparameter Tuning

Fine-tune Bayesian Softmax for optimal performance on your specific dataset.

When to Tune Hyperparameters

Consider tuning when:
  • Default performance is unsatisfactory
  • You have specific data characteristics (very noisy or very clean)
  • You have limited or extensive training data
  • You want to optimize for your specific paradigm

Tuning Strategies

For High SNR / Clean Data / Many Trials

Use weaker priors to let the data drive the model:
from nimbus_bci import NimbusSoftmax

# Weaker regularization for clean data
clf = NimbusSoftmax(reg_strength=0.1)  # Lower regularization
clf.fit(X_train, y_train)
Use when:
  • SNR > 5 dB
  • 100+ trials per class
  • Clean, artifact-free data
  • Well-controlled experimental conditions

For Low SNR / Noisy Data / Few Trials

Use stronger priors for stability:
from nimbus_bci import NimbusSoftmax

# Stronger regularization for noisy data
clf = NimbusSoftmax(reg_strength=10.0)  # Higher regularization
clf.fit(X_train, y_train)
Use when:
  • SNR < 2 dB
  • 40-80 trials per class
  • Noisy data or limited artifact removal
  • Challenging recording conditions

Balanced / Default Settings

The defaults work well for most scenarios:
from nimbus_bci import NimbusSoftmax

# Default regularization (balanced)
clf = NimbusSoftmax(reg_strength=1.0)  # Default
clf.fit(X_train, y_train)
Use when:
  • Moderate SNR (2-5 dB)
  • 80-150 trials per class
  • Standard BCI recording conditions
  • Starting point for experimentation

Hyperparameter Search Example

Systematically search for optimal hyperparameters:
from sklearn.model_selection import GridSearchCV, train_test_split
from nimbus_bci import NimbusSoftmax

# Split data
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define search grid
param_grid = {
    'reg_strength': [0.1, 1.0, 10.0]
}

# GridSearchCV
clf = NimbusSoftmax()
grid_search = GridSearchCV(
    clf, param_grid, cv=5, scoring='accuracy', verbose=2
)
grid_search.fit(X_train, y_train)

# Best parameters
print("\nBest hyperparameters:")
print(f"  reg_strength: {grid_search.best_params_['reg_strength']}")
print(f"  Validation accuracy: {grid_search.best_score_ * 100:.1f}%")

# Use best model
best_clf = grid_search.best_estimator_

Quick Tuning Guidelines

Scenario               | Wβ scale       | W_df offset | Notes
Excellent data quality | 1e-6           | 2           | Minimal regularization
Good data quality      | 1e-5 (default) | 5 (default) | Balanced approach
Moderate data quality  | 1e-5 to 1e-4   | 5-8         | Slight regularization
Poor data quality      | 1e-4           | 10          | Strong regularization
Very limited trials    | 1e-4           | 15          | Maximum regularization
Here the W_df offset is the amount added to the class count, i.e. W_df = K + offset.
Pro Tip: Start with defaults (Wβ = 1e-5 × I, W_df = K + 5) and only tune if performance is unsatisfactory. The defaults are optimized for typical BCI scenarios.

Training Requirements

Data Requirements

  • Minimum: 40 trials per class (160 total for 4-class)
  • Recommended: 80+ trials per class (320+ total for 4-class)
  • For calibration: 10-20 trials per class sufficient
NimbusSoftmax requires at least 2 observations to estimate regression coefficients and the precision matrix. Single-trial training will raise an error.
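
A quick pre-flight check along these lines can catch undersized datasets before training (a minimal sketch; the 40-trial threshold comes from the guidance above):

import numpy as np

def check_trial_counts(y, min_per_class=40):
    """Report per-class trial counts and whether each meets the recommended minimum."""
    classes, counts = np.unique(y, return_counts=True)
    for cls, count in zip(classes, counts):
        status = "OK" if count >= min_per_class else "LOW"
        print(f"Class {cls}: {count} trials [{status}]")
    return bool(np.all(counts >= min_per_class))

check_trial_counts(y_train)  # assumes the training labels from the examples above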

Feature Normalization

Critical for cross-session BCI performance! Normalize your features before training; this typically yields a 15-30% accuracy improvement across sessions.
from sklearn.preprocessing import StandardScaler
import pickle

# Estimate normalization from training data
scaler = StandardScaler()
X_train_norm = scaler.fit_transform(X_train)

# Train with normalized features
clf = NimbusSoftmax()
clf.fit(X_train_norm, y_train)

# Save model and scaler together
with open("model_with_scaler.pkl", "wb") as f:
    pickle.dump({'model': clf, 'scaler': scaler}, f)

# Later: Apply same normalization to test data
X_test_norm = scaler.transform(X_test)
predictions = clf.predict(X_test_norm)
See the Feature Normalization guide for complete details.

Feature Requirements

Bayesian Softmax expects preprocessed features, not raw EEG.

Required preprocessing:
  • Bandpass filtering (8-30 Hz for motor imagery)
  • Artifact removal (ICA recommended)
  • Spatial filtering (CSP for motor imagery)
  • Feature extraction (log-variance for CSP features)
NOT accepted:
  • Raw EEG channels
  • Unfiltered data
  • Non-extracted features
See Preprocessing Requirements for details.

Performance Characteristics

Computational Performance

Operation       | Latency           | Notes
Training        | 15-40 seconds     | 50 iterations, 100 trials per class
Calibration     | 8-20 seconds      | 20 iterations, 20 trials per class
Batch Inference | 15-25ms per trial | 10 iterations
Streaming Chunk | 15-25ms           | 10 iterations per chunk
Slightly slower than NimbusLDA/NimbusQDA due to continuous transition complexity.
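
To verify these latencies on your own hardware, a minimal timing sketch using the predict API shown above (assumes a fitted clf and an X_test array):

import time

start = time.perf_counter()
clf.predict(X_test)
elapsed = time.perf_counter() - start

print(f"Mean latency: {1000 * elapsed / len(X_test):.1f} ms per trial")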

Classification Accuracy

Paradigm      | Classes               | Typical Accuracy | When to Use NimbusSoftmax
Motor Imagery | 2-4                   | 70-85%           | When NimbusLDA/NimbusQDA insufficient
P300          | 2 (Target/Non-target) | 85-95%           | Complex distributions
SSVEP         | 2-6                   | 85-98%           | Advanced applications
Bayesian Softmax typically provides comparable or slightly better accuracy than NimbusLDA/NimbusQDA for complex distributions, at the cost of ~5-10ms additional latency.

Model Inspection

View Model Parameters

import numpy as np

# Model coefficients
print("Regression coefficients (weights):")
print(f"  Shape: {clf.model_['coef'].shape}")
print(f"  Coefficients:\n{clf.model_['coef']}")

print("\nIntercepts:")
print(f"  Intercepts: {clf.model_['intercept']}")

# Model info
print("\nModel info:")
print(f"  Features: {clf.n_features_in_}")
print(f"  Classes: {clf.classes_}")
print(f"  Regularization: {clf.reg_strength}")

# Class probabilities (from training)
print("\nClass priors (estimated from training):")
for k, class_label in enumerate(clf.classes_):
    prior = np.sum(y_train == class_label) / len(y_train)
    print(f"  Class {class_label}: {prior:.3f}")

Compare Models

import numpy as np
from sklearn.model_selection import cross_val_score
from nimbus_bci import NimbusSoftmax

# Compare different regularization strengths
results = []
for reg in [0.1, 1.0, 10.0]:
    clf = NimbusSoftmax(reg_strength=reg)
    scores = cross_val_score(clf, X_train, y_train, cv=5)
    accuracy = np.mean(scores)
    
    print(f"Regularization: {reg} → Accuracy: {accuracy * 100:.1f}%")
    results.append((reg, clf, accuracy))

Advantages & Limitations

Advantages

Flexible Multinomial Classification: Natural handling of multinomial distributions
Continuous Transition Mapping: Sophisticated feature-to-latent-space transformation
Full Bayesian Uncertainty: Complete posterior distributions over predictions
No Gaussian Assumption: More flexible than Gaussian classifiers (NimbusLDA/NimbusQDA)
Production-Ready: Battle-tested in real BCI applications
Calibration Support: Fast subject-specific adaptation

Limitations

More Complex: More parameters than NimbusLDA, requires careful hyperparameter tuning
Slower Inference: ~15-25ms vs ~10-15ms for NimbusLDA
(K-1) Dimensional Space: Works in reduced dimension (not always intuitive)
No Mahalanobis Distance: Discriminative model lacks explicit class centers for outlier detection
Requires Multiple Trials: Cannot train on single trial (minimum 2 observations)
Less Interpretable: Harder to interpret than generative models with explicit means

Comparison: Bayesian Softmax vs NimbusLDA vs NimbusQDA vs NimbusSTS

Aspect | Bayesian Softmax | NimbusLDA | NimbusQDA | NimbusSTS
Mathematical Model | Bayesian Multinomial Logistic Regression | Pooled Gaussian Classifier (PGC) | Heteroscedastic Gaussian Classifier (HGC) | State-Space Model (EKF)
Representation | (K-1)-dimensional latent space | K class means with shared precision | K class means with class-specific precisions | Features + latent state dynamics
Temporal Modeling | Static (stateless) | Static (stateless) | Static (stateless) | Dynamic (stateful)
Training Speed | Moderate | Fastest | Moderate | Moderate
Inference Speed | 15-25ms | 10-15ms | 15-20ms | 20-30ms
Flexibility | Highest | Lowest | High | Adaptive
Best For | Complex multinomial tasks | Well-separated classes | Overlapping classes with different covariances | Non-stationary, long sessions
Interpretability | Low (discriminative) | High (generative with means) | High (generative with means) | Moderate (state evolution)
Mahalanobis Distance | ❌ No (no explicit means) | ✅ Yes | ✅ Yes | ❌ N/A
Entropy Metrics | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes
Free Energy | ✅ Yes (training only) | ✅ Yes | ✅ Yes | ✅ Yes

Decision Tree

Is your data non-stationary (temporal drift)?
├─ Yes → Use NimbusSTS (handles temporal dynamics)
└─ No → Do you need Mahalanobis-based outlier detection?
    ├─ Yes → Use NimbusLDA or NimbusQDA
    └─ No → Are classes well-separated with Gaussian structure?
        ├─ Yes → Use NimbusLDA (fastest)
        └─ No → Do classes have different covariances?
            ├─ Yes → Use NimbusQDA
            └─ No → Are distributions complex/non-Gaussian?
                ├─ Yes → Use Bayesian Softmax
                └─ No → Use NimbusLDA (simplest)
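
The same selection logic as a small, purely illustrative helper (returns model names only; no SDK calls are assumed):

def choose_model(non_stationary, need_mahalanobis, well_separated_gaussian,
                 different_covariances, complex_non_gaussian):
    """Encode the decision tree above as a function returning a model name."""
    if non_stationary:
        return "NimbusSTS"
    if need_mahalanobis:
        return "NimbusLDA or NimbusQDA"
    if well_separated_gaussian:
        return "NimbusLDA"
    if different_covariances:
        return "NimbusQDA"
    if complex_non_gaussian:
        return "Bayesian Softmax (NimbusSoftmax)"
    return "NimbusLDA"

print(choose_model(False, False, False, False, True))  # -> Bayesian Softmax (NimbusSoftmax)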

Practical Examples

Motor Imagery Classification

import numpy as np
from nimbus_bci import NimbusSoftmax
from sklearn.metrics import classification_report

# Train on motor imagery data (4-class MI)
clf = NimbusSoftmax(reg_strength=1.0)
clf.fit(X_train, y_train)  # CSP features, 4 classes

# Test
predictions = clf.predict(X_test)

# Calculate per-class accuracy
print(classification_report(y_test, predictions))

# Per-class breakdown
for class_id in clf.classes_:
    class_mask = y_test == class_id
    class_acc = np.mean(predictions[class_mask] == class_id)
    print(f"Class {class_id} accuracy: {class_acc * 100:.1f}%")

P300 Detection

from nimbus_bci import NimbusSoftmax
from sklearn.metrics import confusion_matrix

# Train on P300 data (binary: 0=non-target, 1=target)
clf = NimbusSoftmax(reg_strength=1.0)
clf.fit(X_train, y_train)  # ERP features

# Test
predictions = clf.predict(X_test)

# Calculate sensitivity/specificity
cm = confusion_matrix(y_test, predictions)
tn, fp, fn, tp = cm.ravel()

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

print(f"Sensitivity: {sensitivity * 100:.1f}%")
print(f"Specificity: {specificity * 100:.1f}%")

Comparing with NimbusLDA/NimbusQDA

import numpy as np
from nimbus_bci import NimbusLDA, NimbusQDA, NimbusSoftmax, NimbusSTS

# Train all models
lda_clf = NimbusLDA()
lda_clf.fit(X_train, y_train)

qda_clf = NimbusQDA()
qda_clf.fit(X_train, y_train)

softmax_clf = NimbusSoftmax()
softmax_clf.fit(X_train, y_train)

sts_clf = NimbusSTS()
sts_clf.fit(X_train, y_train)

# Evaluate all models
models = [
    ("NimbusLDA", lda_clf),
    ("NimbusQDA", qda_clf),
    ("NimbusSoftmax", softmax_clf),
    ("NimbusSTS", sts_clf)
]

for name, clf in models:
    predictions = clf.predict(X_test)
    probas = clf.predict_proba(X_test)
    
    accuracy = np.mean(predictions == y_test)
    mean_conf = np.mean(np.max(probas, axis=1))
    mean_entropy = -np.mean(np.sum(probas * np.log2(probas + 1e-10), axis=1))
    
    print(f"{name}:")
    print(f"  Accuracy: {accuracy * 100:.1f}%")
    print(f"  Mean confidence: {mean_conf:.3f}")
    print(f"  Mean entropy: {mean_entropy:.3f} bits")
    print()

Next Read

Bayesian LDA (NimbusLDA)

Faster model with shared covariance

Bayesian QDA (NimbusQDA)

Flexible model with class-specific covariances

Bayesian STS (NimbusSTS)

Adaptive model for non-stationary data

Training Guide

Complete training tutorial

Code Examples

Working examples

Preprocessing Requirements

Feature extraction guide

References

Implementation and theory:
  • Polson, N. G., Scott, J. G., & Windle, J. (2013). “Bayesian inference for logistic models using Pólya–Gamma latent variables”
  • Windle, J., Polson, N. G., & Scott, J. G. (2014). “Sampling Pólya–Gamma random variates: alternative and approximate techniques”
  • Bayesian multinomial logistic regression (softmax) with Pólya–Gamma augmentation and variational inference
BCI Applications:
  • Blankertz et al. (2008). “Optimizing spatial filters for robust EEG single-trial analysis”
  • Lotte et al. (2018). “A review of classification algorithms for EEG-based BCI”