Snapshot date: 2026-04-07
Provide a visual and tabular status view of dataset shape, held-out classifier quality, and Bayesian posterior behavior across four Phase 3 variants:

- multinomial_naive_bayes (phase-3-nb)
- logreg_tfidf (phase-3-logreg)
- logreg_tfidf with Bayesian scoring disabled (phase-3-no-bayes)
- privacybert (phase-3-privacybert)

Classifier metrics remain the baseline quality surface. Bayesian posterior outputs are included as a decision surface where available; no-bayes runs are marked as N/A by design.
- artifacts/phase-3-nb/dataset_manifest.json
- artifacts/phase-3-nb/classifier_metrics.json
- artifacts/phase-3-nb/bayesian_risk_test.json
- artifacts/phase-3-logreg/dataset_manifest.json
- artifacts/phase-3-logreg/classifier_metrics.json
- artifacts/phase-3-logreg/bayesian_risk_test.json
- artifacts/phase-3-no-bayes/dataset_manifest.json
- artifacts/phase-3-no-bayes/classifier_metrics.json
- artifacts/phase-3-privacybert/dataset_manifest.json
- artifacts/phase-3-privacybert/classifier_metrics.json
- artifacts/phase-3-privacybert/bayesian_risk_test.json
- artifacts/phase-3-nb/calibration_test.json
- artifacts/phase-3-logreg/calibration_test.json
- artifacts/phase-3-no-bayes/calibration_test.json
- artifacts/phase-3-privacybert/calibration_test.json
- artifacts/phase-3-nb/threshold_sweep_test.json
- artifacts/phase-3-logreg/threshold_sweep_test.json
- artifacts/phase-3-no-bayes/threshold_sweep_test.json
- artifacts/phase-3-privacybert/threshold_sweep_test.json
- artifacts/phase-3-nb/bootstrap_ci_test.json
- artifacts/phase-3-logreg/bootstrap_ci_test.json
- artifacts/phase-3-no-bayes/bootstrap_ci_test.json
- artifacts/phase-3-privacybert/bootstrap_ci_test.json
- artifacts/phase3_run_history.jsonl

| Area | Metric | Current Value |
|---|---|---|
| Dataset | Total rows | 19720 |
| Dataset | Class mix (org/system/user) | 84.18% / 3.81% / 12.00% (0.841836 / 0.038134 / 0.120030) |
| Split | Train / Validation / Test | 79.34% / 9.14% / 11.52% (0.793357 / 0.091430 / 0.115213) |
| Model | NB test accuracy / macro F1 | 81.47% / 64.01% (0.814701 / 0.640117) |
| Model | LogReg test accuracy / macro F1 | 89.26% / 77.90% (0.892606 / 0.779024) |
| Model | No-Bayes test accuracy / macro F1 | 89.26% / 77.90% (0.892606 / 0.779024) |
| Model | PrivacyBERT test accuracy / macro F1 | 95.60% / 89.20% (0.955986 / 0.891999) |
| Delta | LogReg vs NB (test accuracy / macro F1) | +7.79% / +13.89% (+0.077905 / +0.138907) |
| Delta | No-Bayes vs NB (test accuracy / macro F1) | +7.79% / +13.89% (+0.077905 / +0.138907) |
| Delta | PrivacyBERT vs NB (test accuracy / macro F1) | +14.13% / +25.19% (+0.141285 / +0.251882) |
| Bayesian | Test posterior overall (NB/LogReg/NoB/PB) | 0.952613 / 0.809707 / N/A / 0.987510 |
| Leakage | Policy overlap (all split pairs) | 0 |
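The zero policy-overlap check in the table above amounts to pairwise set intersections over the policy IDs assigned to each split. A minimal sketch, assuming the split manifests expose policy ID collections (the function name and signature here are illustrative, not the pipeline's actual API):

```python
def policy_overlap(train_ids, val_ids, test_ids):
    """Count policy IDs shared between each pair of splits; all counts should be 0."""
    splits = {"train": set(train_ids), "validation": set(val_ids), "test": set(test_ids)}
    names = list(splits)
    return {
        f"{a}/{b}": len(splits[a] & splits[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }
```

A nonzero count for any pair would indicate policy-level leakage between splits.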

| Figure ID | Figure Preview | Key Takeaway |
|---|---|---|
| Fig 5 | ![]() | Corpus remains organization-heavy, so macro and per-class metrics are still required for fair model comparison. |
| Fig 6 | ![]() | Train-heavy split is stable for learning while preserving validation/test hold-outs with zero policy overlap. |
| Fig 7 | ![]() | PrivacyBERT leads aggregate quality; LogReg and No-Bayes are classifier-identical and both improve over NB. |
| Fig 8 | ![]() | Heatmap highlights strongest macro precision/recall/F1 concentration in PrivacyBERT and mid-tier gains from LogReg. |
| Fig 9 | ![]() | Minority-class F1 gains are largest when moving from NB to LogReg/PrivacyBERT, especially on system. |
| Fig 10 | ![]() | Confusion small multiples show spillover reduction from NB to LogReg and strongest diagonal concentration in PrivacyBERT. |
| Fig 11 | ![]() | Bayesian posterior means/intervals separate model uncertainty profiles; No-Bayes is intentionally excluded as N/A. |
| Fig 12 | ![]() | Delta map makes improvement vs NB explicit and shows that classifier gains for LogReg and No-Bayes are identical. |
| Fig 13 | ![]() | Reliability curves expose calibration quality by comparing predicted confidence vs observed accuracy per model. |
| Fig 14 | ![]() | ECE summary highlights which variants are most/least calibrated on held-out predictions. |
| Fig 15 | ![]() | Threshold sweeps show user/system precision-recall operating point tradeoffs across models. |
| Fig 16 | ![]() | Bootstrap intervals quantify uncertainty around held-out accuracy and macro F1 comparisons. |
| Fig 17 | ![]() | Dated trend snapshots track run-to-run metric drift and ranking stability over time. |
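The ECE summary behind Fig 14 typically follows the standard equal-width binning scheme: bucket held-out predictions by confidence, then take the sample-weighted mean of each bin's |accuracy − confidence| gap. A minimal sketch (the pipeline's actual bin count and binning edges are assumptions):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean |observed accuracy - mean confidence| per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)  # half-open bins (lo, hi]
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in this bin
    return ece
```

A perfectly calibrated model (75% confidence, 75% observed accuracy) yields an ECE of 0; overconfident predictions push ECE toward the confidence-accuracy gap.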

What this means:

- policy_overlap.* == 0 confirms policy-level leakage protection.
- privacybert is highest on validation and test for both accuracy and macro F1.
- logreg_tfidf and logreg_tfidf (no bayes) have matching classifier bars, confirming that disabling Bayesian scoring affects the scoring surface, not the classifier output.
- The largest minority-class F1 gains are on system.
artifacts/phase3_run_history.jsonl is written by each pipeline execution.

| Model | Split | Rows | Accuracy | Macro F1 |
|---|---|---|---|---|
| multinomial_naive_bayes | Validation | 1803 | 81.64% (0.816417) | 62.59% (0.625907) |
| multinomial_naive_bayes | Test | 2272 | 81.47% (0.814701) | 64.01% (0.640117) |
| logreg_tfidf | Validation | 1803 | 88.96% (0.889628) | 75.80% (0.757970) |
| logreg_tfidf | Test | 2272 | 89.26% (0.892606) | 77.90% (0.779024) |
| logreg_tfidf (no bayesian) | Validation | 1803 | 88.96% (0.889628) | 75.80% (0.757970) |
| logreg_tfidf (no bayesian) | Test | 2272 | 89.26% (0.892606) | 77.90% (0.779024) |
| privacybert | Validation | 1803 | 95.12% (0.951192) | 87.24% (0.872441) |
| privacybert | Test | 2272 | 95.60% (0.955986) | 89.20% (0.891999) |

| Label | NB F1 | LogReg F1 | No-Bayes F1 | PrivacyBERT F1 |
|---|---|---|---|---|
| user | 55.24% (0.552408) | 71.34% (0.713433) | 71.34% (0.713433) | 87.31% (0.873096) |
| system | 48.18% (0.481818) | 68.82% (0.688172) | 68.82% (0.688172) | 82.89% (0.828947) |
| organization | 88.61% (0.886125) | 93.55% (0.935466) | 93.55% (0.935466) | 97.40% (0.973954) |
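Macro F1 in the summary tables is the unweighted mean of the per-class F1 scores above, which is why it penalizes weak minority-class performance even in this organization-heavy corpus. The NB row can be checked directly:

```python
# Per-class NB test F1 scores, copied from the table above.
nb_f1 = {"user": 0.552408, "system": 0.481818, "organization": 0.886125}

# Macro F1 = unweighted mean across classes, regardless of class frequency.
macro_f1 = sum(nb_f1.values()) / len(nb_f1)
print(round(macro_f1, 6))  # → 0.640117, matching the NB test macro F1 reported above
```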

| Model | Overall Posterior Mean | User Mean [95% CI] | System Mean [95% CI] | Organization Mean [95% CI] |
|---|---|---|---|---|
| multinomial_naive_bayes | 0.952613 | 0.916301 [0.890126, 0.942477] | 0.918496 [0.875283, 0.961709] | 0.964695 [0.955930, 0.973460] |
| logreg_tfidf | 0.809707 | 0.751016 [0.708318, 0.793714] | 0.763389 [0.687348, 0.839430] | 0.825634 [0.807973, 0.843295] |
| logreg_tfidf (no bayesian) | N/A | N/A | N/A | N/A |
| privacybert | 0.987510 | 0.972697 [0.954701, 0.990694] | 0.925694 [0.870264, 0.981124] | 0.992613 [0.988749, 0.996478] |
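Each posterior cell above reports a mean and a central 95% interval. Assuming the Bayesian scorer exposes Monte Carlo draws per class (an assumption about the pipeline internals, not a documented API), the summary reduces to a mean plus a percentile interval:

```python
import numpy as np

def summarize_posterior(draws):
    """Posterior mean and central 95% credible interval from Monte Carlo draws."""
    draws = np.asarray(draws, dtype=float)
    lo, hi = np.percentile(draws, [2.5, 97.5])  # central 95% interval
    return draws.mean(), (lo, hi)
```

Applied per class, this yields the "Mean [lo, hi]" cells; the no-bayes variant has no draws to summarize, hence N/A.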
- Calibration views (fig-13, fig-14) with reliability bins and ECE summaries.
- Threshold sweeps (fig-15) focused on user/system classes.
- Bootstrap confidence intervals (fig-16) for held-out accuracy and macro F1.
- Trend snapshots (fig-17) backed by canonical run-history indexing.

Regeneration command:

```
PYTHONPATH=src python scripts/generate_phase3_dashboard_figures.py
```
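The bootstrap intervals behind fig-16 can be reproduced with a standard percentile bootstrap over held-out predictions. A minimal sketch for accuracy (the resample count, seed, and function name are illustrative, not the pipeline's actual settings):

```python
import numpy as np

def bootstrap_accuracy_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for held-out accuracy."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = len(y_true)
    accs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample test rows with replacement
        accs[i] = (y_true[idx] == y_pred[idx]).mean()
    lo, hi = np.percentile(accs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```

The same resampling loop, with a macro F1 statistic in place of accuracy, gives the macro F1 intervals.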