Overview

Aparecium v2 is a neural network designed for embedding reversal: reconstructing the original text (or semantically equivalent text) from a compressed vector representation. This is the inverse of the standard NLP embedding process.

The model accepts a single 768-dimensional pooled vector from the sentence-transformers/all-mpnet-base-v2 encoder and produces natural language text that captures the semantic content encoded in that vector.

Input (768-D vector) → EmbAdapter (pseudo-sequence) → Decoder (token generation) → Output (natural text)

Key Features

  • Pooled-only input - Works directly from a single 768-D vector, no token-level matrix required
  • Crypto-domain specialization - Trained on synthetic crypto social media posts (markets, DeFi, L2s, MEV, NFTs)
  • Beam search decoding - Deterministic or stochastic generation with surrogate-based reranking
  • Constraint support - Optional constraints for tickers, hashtags, and amounts
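How constraints are matched internally is not described here. As a rough sketch only, a decoded candidate could be checked against required tickers, hashtags, and amounts with regular expressions. The patterns and the helper names `extract_entities` and `satisfies` are illustrative assumptions, not part of the Aparecium package.

```python
import re

# Illustrative patterns only; the package's actual matching rules are not documented here.
TICKER_RE = re.compile(r"\$[A-Z]{2,10}\b")   # cashtags such as $BTC
HASHTAG_RE = re.compile(r"#\w+")             # hashtags such as #DeFi
AMOUNT_RE = re.compile(r"\$?\d+(?:,\d{3})*(?:\.\d+)?[kKmMbB%]?")  # 1,200 / $3.5M / 12%


def extract_entities(text):
    """Collect tickers, hashtags, and amounts found in a piece of text."""
    return {
        "tickers": TICKER_RE.findall(text),
        "hashtags": HASHTAG_RE.findall(text),
        "amounts": AMOUNT_RE.findall(text),
    }


def satisfies(candidate, required):
    """True if every required entity appears verbatim in the candidate."""
    found = extract_entities(candidate)
    return all(set(required.get(kind, [])) <= set(found[kind]) for kind in found)


post = "$ETH up 12% to $3,500 #DeFi"
print(extract_entities(post))
print(satisfies(post, {"tickers": ["$ETH"], "amounts": ["12%"]}))
```

A check like this could be used to filter or penalize beam candidates that drop a required entity.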

Architecture

Aparecium v2 consists of four main components working together to reverse the embedding process:

1. EmbAdapter
  • Pooled vector e (768-D)
  • Multi-scale expansion
  • Memory H (B × S × D)

2. Sketcher (optional)
  • URL detection
  • Ticker extraction
  • Plan signals

3. RealizerDecoder
  • GPT-style transformer
  • Cross-attention over H
  • d=768, layers=12, heads=8

4. Surrogate Scorer r(x, e)
  • Cosine similarity approximation
  • Candidate reranking

Component Details

EmbAdapter

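Transforms each pooled embedding e ∈ R^768 into a multi-scale pseudo-sequence memory H ∈ R^(B × S × D), where B is the batch size, S is the pseudo-sequence length, and D is the width of each memory slot. The expansion gives the decoder a sequence of S slots to cross-attend over, even though the input is a single vector.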
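The shape change can be illustrated with a toy example. The sketch below reads "multi-scale" as one linear projection per scale, each producing a different number of slots. That reading, the scale set (1, 2, 4), the random weights, and the tiny dimensions are all assumptions made for illustration. The actual EmbAdapter may be built differently.

```python
import random


def make_linear(d_in, d_out, rng):
    """Random (d_out x d_in) weight matrix standing in for a trained layer."""
    scale = d_in ** -0.5
    return [[rng.gauss(0.0, scale) for _ in range(d_in)] for _ in range(d_out)]


def apply_linear(weight, vec):
    """Matrix-vector product for plain Python lists."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def expand(e, weights_by_scale, d_model):
    """Map one pooled vector to a list of S slots, each of width d_model."""
    memory = []
    for n_slots, weight in weights_by_scale:
        flat = apply_linear(weight, e)  # length n_slots * d_model
        memory.extend(flat[i * d_model:(i + 1) * d_model] for i in range(n_slots))
    return memory


rng = random.Random(0)
d_in, d_model, scales = 16, 8, (1, 2, 4)  # toy sizes; the real input is 768-D
weights = [(s, make_linear(d_in, s * d_model, rng)) for s in scales]
e = [rng.gauss(0.0, 1.0) for _ in range(d_in)]
H = expand(e, weights, d_model)           # S = 1 + 2 + 4 = 7 slots
print(len(H), len(H[0]))
```

For a batch, the same expansion is applied to each of the B vectors, giving the B × S × D layout.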

RealizerDecoder

A 12-layer transformer decoder with the following configuration:

  • d_model = 768 - Model dimension matching MPNet
  • n_layer = 12 - Transformer layers
  • n_head = 8 - Attention heads
  • d_ff = 3072 - Feed-forward dimension
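As a quick sanity check on these numbers, the sketch below derives the per-head width and a rough weight count. The count assumes each layer holds a self-attention block, a cross-attention block, and a feed-forward block with standard shapes, and that the memory width equals d_model. Biases, layer norms, and embeddings are ignored. `DecoderConfig` is a hypothetical name, not a class from the package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderConfig:
    d_model: int = 768
    n_layer: int = 12
    n_head: int = 8
    d_ff: int = 3072

    @property
    def head_dim(self) -> int:
        if self.d_model % self.n_head:
            raise ValueError("d_model must be divisible by n_head")
        return self.d_model // self.n_head

    def approx_block_params(self) -> int:
        """Weight-matrix entries across all layers (no biases, norms, embeddings)."""
        attn = 4 * self.d_model ** 2        # Q, K, V, and output projections
        ffn = 2 * self.d_model * self.d_ff  # up- and down-projection
        per_layer = 2 * attn + ffn          # self-attention + cross-attention + FFN
        return self.n_layer * per_layer


cfg = DecoderConfig()
print(cfg.head_dim, cfg.approx_block_params())
```

Under these assumptions each head is 96-dimensional and the decoder blocks hold roughly 113M weights.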

Surrogate Scorer

A neural network r(x, e) that approximates the cosine similarity between the MPNet embedding of a generated text x and the target embedding e. It is used for sequence-level reranking of candidates during beam search.
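The exact reranking formula is not given here. A minimal sketch, assuming the final score is the language-model log-probability plus a weighted surrogate score, might look like this. The additive form and the function name `rerank` are assumptions.

```python
def rerank(candidates, lm_logp, surrogate, alpha=1.0):
    """Sort candidates by lm_logp + alpha * surrogate, best first."""
    if not (len(candidates) == len(lm_logp) == len(surrogate)):
        raise ValueError("inputs must have the same length")
    scored = [
        (logp + alpha * sim, text)
        for text, logp, sim in zip(candidates, lm_logp, surrogate)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(text, score) for score, text in scored]


beams = ["btc etf inflows rise", "bitcoin etf inflows hit weekly high"]
ranked = rerank(beams, lm_logp=[-1.2, -1.5], surrogate=[0.60, 0.90], alpha=1.5)
print(ranked[0][0])
```

With alpha = 0 the ordering falls back to pure language-model likelihood. Larger values favor candidates the surrogate judges closer to the target embedding.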

Usage

The Aparecium package provides a high-level API for embedding reversal. Install from PyPI and use the Aparecium wrapper class.

Installation

pip install aparecium

Basic Usage

from aparecium import Aparecium
from sentence_transformers import SentenceTransformer

# 1. Encode text to embedding
encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
text = "Bitcoin ETF inflows hit a new weekly high as markets turn risk-on."
e = encoder.encode([text], convert_to_numpy=True, normalize_embeddings=True)[0]

# 2. Load Aparecium (auto-downloads from HuggingFace)
model = Aparecium()

# 3. Reverse the embedding back to text
result = model.invert_embedding(e, beam=5, max_len=64)

print("Reconstruction:", result.text)
print("Candidates:", result.candidates)
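One way to judge a reconstruction is to re-encode it and compare the result with the original embedding. The helper below is a plain cosine similarity. The commented lines show how it might be wired to the variables from the example above, assuming `result.text` is a string.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# With the names from the example above (not executed here):
#   e_hat = encoder.encode([result.text], convert_to_numpy=True,
#                          normalize_embeddings=True)[0]
#   print("cosine:", cosine(e, e_hat))
```

A value near 1.0 indicates that the reconstruction lands close to the original text in embedding space, even when the wording differs.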

End-to-End Inversion

from aparecium import Aparecium

model = Aparecium()

# Directly invert from text (embeds internally)
text = "Ethereum L2 blob fees spiked after EIP-4844; MEV still shapes order flow."
result = model.invert_text(text, beam=5, max_len=64)

print(result.text)

CLI Usage

# Pipe text through the CLI
echo "Macro: DXY rallies while risk assets chop; crypto narratives rotate to AI tokens." | \
  python -m aparecium

Inference API

For production deployments, Aparecium includes a FastAPI inference service. Run the service and POST embedding vectors to the /invert endpoint.

Starting the Service

python -m aparecium.aparecium.infer.service --ckpt checkpoints/aparecium_v2_s1.pt

Request Format

{
    "embedding": [0.123, 0.456, ..., 0.789],  // 768 floats
    "deterministic": true,
    "beam": 5,
    "max_len": 64,
    "constraints": true,
    "final_mpnet": true
}
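A client for this request could be sketched with the standard library alone. The field names follow the request format above. The base URL `http://localhost:8000` is a placeholder, not a documented default, and the helper names are hypothetical.

```python
import json
import urllib.request

EMBED_DIM = 768


def build_invert_payload(embedding, beam=5, max_len=64, deterministic=True,
                         constraints=True, final_mpnet=True):
    """Validate the vector length and assemble the JSON body."""
    values = [float(x) for x in embedding]
    if len(values) != EMBED_DIM:
        raise ValueError(f"expected {EMBED_DIM} floats, got {len(values)}")
    return {
        "embedding": values,
        "deterministic": deterministic,
        "beam": beam,
        "max_len": max_len,
        "constraints": constraints,
        "final_mpnet": final_mpnet,
    }


def post_invert(payload, base_url="http://localhost:8000"):
    """POST the payload to /invert; base_url is a placeholder host and port."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/invert",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_invert_payload([0.0] * EMBED_DIM)
# response = post_invert(payload)  # requires a running service
```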

Response Format

{
    "text": "Top-1 reconstruction result",
    "candidates": ["candidate 1", "candidate 2", ...],
    "scores": {
        "lm_logp": [-1.23, -1.45, ...],
        "cos_mpnet": [0.89, 0.87, ...]  // if final_mpnet=true
    },
    "plan": { ... }  // optional plan information
}
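A response of this shape can be post-processed on the client side. The sketch below picks the candidate with the highest `cos_mpnet`, assuming the score lists are parallel to `candidates` (index i scores candidate i). That alignment is an assumption, and `best_by_cosine` is a hypothetical helper.

```python
def best_by_cosine(response):
    """Return (candidate, cosine) with the highest cos_mpnet, or None if unavailable."""
    candidates = response.get("candidates", [])
    cosines = response.get("scores", {}).get("cos_mpnet")
    if not candidates or not cosines or len(cosines) != len(candidates):
        return None
    index = max(range(len(candidates)), key=lambda i: cosines[i])
    return candidates[index], cosines[index]


example = {
    "text": "candidate 1",
    "candidates": ["candidate 1", "candidate 2"],
    "scores": {"lm_logp": [-1.23, -1.45], "cos_mpnet": [0.87, 0.89]},
}
print(best_by_cosine(example))
```

The helper returns None when `cos_mpnet` is missing, which is expected when the request sets final_mpnet to false.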

Training Pipeline

Aparecium uses a two-stage training process: supervised learning (S1) followed by optional self-critical sequence training (S2/SCST).

Stage 1: Supervised Training

The model learns to reconstruct text from embeddings using cross-entropy loss on teacher-forced token predictions.
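The objective can be written out in a few lines. This is a framework-free sketch of teacher-forced cross-entropy, not the project's training code. The function names and the optional `pad_id` argument are illustrative.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def teacher_forced_loss(step_logits, target_ids, pad_id=None):
    """Mean negative log-likelihood of the reference tokens.

    step_logits[t] are the logits produced after the decoder has been fed the
    reference prefix target_ids[:t], so position t is always scored against
    target_ids[t], whatever the model itself would have generated.
    """
    total, count = 0.0, 0
    for logits, target in zip(step_logits, target_ids):
        if target == pad_id:
            continue  # padding positions do not contribute to the loss
        total -= log_softmax(logits)[target]
        count += 1
    return total / max(count, 1)


logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
loss = teacher_forced_loss(logits, [0, 1])
print(round(loss, 4))
```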

python -m aparecium.aparecium.train.train_s1_supervised \
    --shards ./data/train \
    --val_shards ./data/val \
    --save_dir ./checkpoints \
    --batch_size 64 \
    --epochs 1 \
    --steps 6000 \
    --lr 3e-4 \
    --max_len 96

Surrogate Scorer Training

Train the surrogate scorer r(x, e) to approximate embedding similarity:

python -m aparecium.aparecium.train.train_surrogate_r \
    --shards ./data/train \
    --save_dir ./checkpoints

Stage 2: SCST Fine-tuning (Optional)

Self-critical sequence training optimizes directly for embedding similarity using REINFORCE with a baseline:
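In the standard SCST formulation, the baseline is the reward of the model's own greedy decode. Whether Aparecium follows that exactly, and whether the reward is the true MPNet cosine or the surrogate score, is not stated here. The sketch below therefore shows only the generic per-sequence loss, under those assumptions.

```python
def scst_loss(sample_logps, sample_reward, greedy_reward):
    """Self-critical REINFORCE loss for one sampled sequence.

    sample_logps: per-token log-probabilities of the sampled sequence.
    sample_reward / greedy_reward: rewards (for example, embedding similarity)
    of the sampled sequence and of the greedy decode used as the baseline.
    In a real training loop the advantage is treated as a constant and
    gradients flow only through the log-probabilities.
    """
    advantage = sample_reward - greedy_reward
    return -advantage * sum(sample_logps)


better = scst_loss([-0.5, -1.0], sample_reward=0.9, greedy_reward=0.8)
worse = scst_loss([-0.5, -1.0], sample_reward=0.7, greedy_reward=0.8)
print(round(better, 4), round(worse, 4))
```

Minimizing this loss raises the probability of samples that beat the greedy baseline and lowers it for samples that fall short.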

python -m aparecium.aparecium.train.train_s2_scst \
    --s1_ckpt ./checkpoints/aparecium_v2_s1.pt \
    --save_dir ./checkpoints

Limitations

Important Considerations
  • Reconstruction is not exact - Outputs preserve semantic meaning but may differ in wording, style, or specific details
  • Domain specificity - Best performance on crypto/finance social media; other domains may show quality degradation
  • Synthetic training data - Trained on synthetic crypto posts, not real social media; domain shift may occur
  • Privacy considerations - Do not use to reconstruct sensitive or personally identifiable content

Quality Factors

Reconstruction quality depends on:

  • Encoder alignment with sentence-transformers/all-mpnet-base-v2
  • Domain match (crypto/finance social media performs best)
  • Decode settings (beam size, constraints, rerank weights)

Recommended Decode Settings

Parameter        Recommended   Notes
beam             5             Balance between quality and speed
max_len          64-96         Match typical social media length
deterministic    true          Consistent results
alpha (rerank)   1.0-1.5       Surrogate scorer weight