Model Research
Research into the models that came before: which areas to keep and what could be improved. Privacy BERT-LSTM and Polisis were good models; however, their limited tokenization created a bottleneck in how much data the model could process and keep track of. Adding the LSTM to the transformer allowed short-term memory to be kept for a longer period, but the word embeddings were not designed for longer pieces of text.
The newly proposed model, Prert-CNM (Pr-ivacy B-ERT Contextual Neural Memory), increases the number of specialised parameters (a larger embedding layer) and the token budget (from a 100-token input with a 512 limit to 4,096), and swaps the general-purpose BERT-LSTM for the specialised DeBERTa v3 (with several other models working as helper agents) so that it can process larger documents and hold onto more memory.
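To make the 4,096-token budget concrete, here is a minimal sketch of chunking a long policy into overlapping windows with a Hugging Face tokenizer. The checkpoint name and file path are illustrative assumptions, not the actual Prert-CNM artifacts, and `return_overflowing_tokens` requires the fast tokenizer.

```python
from transformers import AutoTokenizer

# Assumes the fast DeBERTa v3 tokenizer is available; the checkpoint and
# file path below are illustrative, not the actual Prert-CNM artifacts.
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

with open("privacy_policy.txt") as f:
    policy_text = f.read()

# Split the policy into overlapping 4,096-token windows so nothing is
# silently truncated; the stride carries context across window edges.
windows = tokenizer(
    policy_text,
    max_length=4096,
    stride=256,
    truncation=True,
    return_overflowing_tokens=True,
)
print(f"{len(windows['input_ids'])} windows of up to 4,096 tokens")
```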
Chapters
Prert-CNM
Table of Contents
- Memory
- Transformer Swaps
- Fine-tuning LLMs
- Synthetic Dataset Generation
- Non-Compliance Handling
- Policy Weighting
- Hierarchical Transformer
- Multi-label Mapping
- Return Attributes
Memory
Context
Split into 3 areas (see the sketch after this list)
- Broad Context
  - Flags a broad policy
  - DOES NOT apply a label
- Specialized Context
  - Checks flags using the CNM
  - DOES apply a label
- Fine-Grain Context
  - Checks the label against a specific control
  - Flags if a policy is broken
  - Applies nothing if the data is OK
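A minimal sketch of how the three tiers could chain together; the keyword tables and function names are hypothetical stand-ins for the trained components.

```python
# Hypothetical keyword tables standing in for the trained models.
BROAD_FLAGS = {"retain": "Data Retention", "password": "Access Control"}
CONTROL_RULES = {"Data Retention": "indefinitely", "Access Control": "no password"}

def broad_context(segment: str) -> list[str]:
    # Tier 1: flag broad policy areas; does NOT apply a label.
    return [area for kw, area in BROAD_FLAGS.items() if kw in segment.lower()]

def specialized_context(segment: str, flags: list[str]) -> list[str]:
    # Tier 2: the CNM checks each flag and applies a label (stubbed here).
    return [f"label:{area}" for area in flags]

def fine_grain_context(segment: str, labels: list[str]) -> list[str]:
    # Tier 3: check each label against a specific control; flag only breaches.
    broken = []
    for label in labels:
        area = label.removeprefix("label:")
        if CONTROL_RULES[area] in segment.lower():
            broken.append(area)
    return broken  # empty list when the data is OK

text = "we retain user data indefinitely"
print(fine_grain_context(text, specialized_context(text, broad_context(text))))
# -> ['Data Retention']
```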
Broad Context
- Trained on the language of privacy
  - Specialized vocabulary
  - Trained on >130k policies
- Word Embeddings
  - Subword embeddings (see the sketch after this list)
    - FastText
    - BERT's WordPiece
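As a quick illustration of subword embeddings, BERT's WordPiece tokenizer breaks a rare privacy term into pieces it does know instead of emitting an unknown token; FastText achieves a similar effect with character n-grams.

```python
from transformers import AutoTokenizer

# WordPiece in action: a rare privacy term falls back to subword pieces
# instead of a single out-of-vocabulary token.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("pseudonymization"))
# e.g. ['pseudo', '##ny', '##mi', '##zation']
```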
Applying a Contextual Neural Memory (CNM)
- Agent Framework (see the sketch after this list)
  - Focus
    - Data Layer (Extraction)
      - Segments text into chunks
    - Application Layer (Interface)
      - Can ask questions about the policy
    - ML Layer (Analysis)
      - AI annotation of the text
  - Design
    - NIST AI Risk Management Framework (AI RMF)
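A sketch of the three layers as plain Python classes; the names and method signatures are hypothetical, and each class mirrors one bullet above.

```python
class DataLayer:
    """Extraction: segment policy text into chunks."""
    def segment(self, text: str, size: int = 500) -> list[str]:
        words = text.split()
        return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

class MLLayer:
    """Analysis: AI annotation of each chunk (stubbed)."""
    def annotate(self, chunk: str) -> dict:
        return {"chunk": chunk, "labels": []}  # a real model would label here

class ApplicationLayer:
    """Interface: answer questions about the policy from the annotations."""
    def ask(self, annotations: list[dict], question: str) -> str:
        return f"{len(annotations)} annotated chunks available for: {question!r}"

data, ml, app = DataLayer(), MLLayer(), ApplicationLayer()
notes = [ml.annotate(c) for c in data.segment("long policy text ...")]
print(app.ask(notes, "How long is data retained?"))
```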
Weightings
- Class Weighting
- Advanced Data Augmentation
- Attention Mechanism
  - Attention Weights (see the sketch after this list)
    - Heatmap
      - Real-time highlighting of the exact words
      - Tokens that trigger an ISO compliance failure
    - Audit Trail
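A rough sketch of how attention weights could be pulled out of a transformer encoder to rank tokens for the heatmap. The checkpoint is a stand-in for the fine-tuned model, and averaging the last layer's heads over the [CLS] row is one simple heuristic, not the only option.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# "bert-base-uncased" is a stand-in; the real system would load the
# fine-tuned checkpoint instead.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

text = "We retain user data indefinitely and share it with partners."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# Last layer, averaged over heads; the [CLS] row gives a crude per-token
# importance score. An audit trail would persist these scores per document.
attn = out.attentions[-1].mean(dim=1)[0, 0]          # shape: (seq_len,)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, score in sorted(zip(tokens, attn.tolist()), key=lambda p: -p[1])[:5]:
    print(f"{tok:>12}  {score:.3f}")
```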
Cluster Control
- Vector Database (see the sketch after this list)
  - https://www.pinecone.io/
  - A specialized storage system designed to manage, index, and search high-dimensional vector embeddings
- Have an AI manage the cluster
  - Tasked with ensuring data is processed correctly
  - A local model that lives on the server
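A minimal sketch of the vector-database idea using Pinecone's Python client (v3-style API); the index name, embedding dimension, IDs, and metadata are all hypothetical.

```python
from pinecone import Pinecone

# Hypothetical index storing one embedding per policy chunk.
pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("policy-embeddings")

# Upsert an embedding with metadata that feeds the audit trail.
index.upsert(vectors=[
    ("chunk-001", [0.1] * 768, {"policy": "acme-2024", "section": "retention"}),
])

# Retrieve the chunks most similar to a query embedding.
results = index.query(vector=[0.1] * 768, top_k=3, include_metadata=True)
for match in results.matches:
    print(match.id, match.score)
```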
LSTM Layer
- Swap the LSTM for a pure transformer (see the sketch after this list)
- RoBERTa
  - Superior language understanding (trained on Meta's data)
  - Dynamic masking algorithm
  - Model variants
    - Base: 12 layers, 125 million parameters
    - Large: 24 layers, 355 million parameters
  - Note: BERT-LSTM is a hybrid, whereas RoBERTa is a pure transformer
- DeBERTa
  - Improves on RoBERTa and BERT with disentangled attention
    - Separates a word's content from its position during processing
  - Replaces the output softmax layer with an enhanced mask decoder
  - Fine-tuned LLM with specialized training/testing
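Loading DeBERTa v3 as the swapped-in classifier is straightforward with Hugging Face; the label count (one per tracked ISO control) is an assumed example, not a Prert-CNM specification.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NUM_CONTROLS = 14  # hypothetical: one label per tracked ISO control

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base",
    num_labels=NUM_CONTROLS,
    problem_type="multi_label_classification",  # sigmoid outputs + BCE loss
)
```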
Dataset
- Mix datasets
- Synthetic Dataset
  - Mix of SMS and company data
  - Based on real-world data
ISO Controls
- ISO Non-Compliance
  - Employ synthetic data generation (linked to the Synthetic Dataset)
    - Non-compliance is a "rare event" in real data
  - Apply weights to policies (see the sketch after this list)
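For the weighting step, scikit-learn's balanced class weights make the rare non-compliant class count proportionally more in the loss; the label counts below are invented for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 1 = non-compliant clause (the "rare event"), 0 = compliant.
y = np.array([0] * 950 + [1] * 50)

weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))  # e.g. {0: 0.526..., 1: 10.0}

# These weights can then be handed to the loss function so rare non-compliant
# examples count more, e.g. torch.nn.CrossEntropyLoss(weight=...).
```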
Model Architecture
Hierarchical Transformer (see the sketch after this list)
- LLM (operates through)
  - Glass-box
- Top Level
  - Category Layer
    - High-level ISO domains
      - Access Control (linked to the CNM)
      - Data Retention (linked to the CNM)
- Bottom Level
  - Attribute Layer
    - Secondary classifier
    - Fine-grained requirements
      - Password length
      - Encryption standards
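A sketch of the two levels wired together: the category layer picks the high-level ISO domain and routes the segment to that domain's attribute classifier. The keyword stubs stand in for the two transformer heads.

```python
def category_layer(segment: str) -> str:
    # Top level: pick the high-level ISO domain (stubbed with keywords).
    return "Access Control" if "password" in segment.lower() else "Data Retention"

def access_control_attributes(segment: str) -> list[str]:
    # Bottom level: fine-grained requirements for Access Control.
    return ["Password Length"] if "8 characters" not in segment else []

def data_retention_attributes(segment: str) -> list[str]:
    return ["Retention Period"] if "indefinitely" in segment.lower() else []

ATTRIBUTE_LAYER = {  # one secondary classifier per domain
    "Access Control": access_control_attributes,
    "Data Retention": data_retention_attributes,
}

segment = "Passwords may be of any length."
domain = category_layer(segment)            # category layer
flags = ATTRIBUTE_LAYER[domain](segment)    # secondary classifier
print(domain, flags)                        # Access Control ['Password Length']
```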
Output
- Multi-label
- Hierarchical Attributes
  - Mapped to ISO controls
- Returns (see the sketch after this list)
  - Number of flags
  - What data triggered a flag
  - Which control was broken
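A sketch of what the returned payload could look like; the field names are hypothetical but mirror the three bullets above.

```python
from dataclasses import dataclass, field

@dataclass
class ComplianceReport:
    flag_count: int = 0
    triggers: list[str] = field(default_factory=list)        # data that fired a flag
    broken_controls: list[str] = field(default_factory=list) # ISO controls broken

report = ComplianceReport(
    flag_count=1,
    triggers=["we retain user data indefinitely"],
    broken_controls=["A.8.10 Information deletion"],  # example ISO 27001:2022 control
)
print(report)
```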