Overview
Aparecium v2 is a neural network designed for embedding reversal: reconstructing the original text (or semantically equivalent text) from a compressed vector representation. This is the inverse of the standard NLP embedding process.
The model accepts a single 768-dimensional pooled vector from the
sentence-transformers/all-mpnet-base-v2 encoder and produces natural language
text that captures the semantic content encoded in that vector.
Key Features
- Pooled-only input - Works directly from a single 768-D vector, no token-level matrix required
- Crypto-domain specialization - Trained on synthetic crypto social media posts (markets, DeFi, L2s, MEV, NFTs)
- Beam search decoding - Deterministic or stochastic generation with surrogate-based reranking
- Constraint support - Optional constraints for tickers, hashtags, and amounts
Architecture
Aparecium v2 consists of four main components working together to reverse the embedding process:
Component Details
EmbAdapter
Transforms the pooled embedding e ∈ R^768 into a multi-scale pseudo-sequence memory
H ∈ R^(B × S × D), where B is the batch size, S is the pseudo-sequence length, and D is
the model dimension. This expansion provides the decoder with rich positional information
despite the input being a single vector.
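To make the shape of this expansion concrete, here is a minimal sketch in plain Python with toy dimensions. It assumes one linear projection plus one learned position offset per memory slot. The actual EmbAdapter layers, the multi-scale mechanism, and the value of S are not specified in this document, and `expand_embedding` is a hypothetical name used only for illustration.

```python
import random

def expand_embedding(e, projections, offsets):
    """Expand one pooled vector (length D) into S pseudo-tokens (each length D).

    projections: S matrices of shape D x D (hypothetical learned weights)
    offsets:     S vectors of length D (hypothetical learned position offsets)
    """
    memory = []
    for W, p in zip(projections, offsets):
        # token = W @ e + p, written out without external libraries
        token = [sum(w_ij * x_j for w_ij, x_j in zip(row, e)) + p_i
                 for row, p_i in zip(W, p)]
        memory.append(token)
    return memory  # S x D; a batch would stack B of these into B x S x D

# Toy sizes for illustration only (the real model uses D = 768).
D, S = 4, 3
rng = random.Random(0)
e = [rng.uniform(-1, 1) for _ in range(D)]
projections = [[[rng.uniform(-1, 1) for _ in range(D)] for _ in range(D)]
               for _ in range(S)]
offsets = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(S)]

H = expand_embedding(e, projections, offsets)
print(len(H), len(H[0]))  # number of slots, width of each slot
```

Each slot sees the same input vector through a different projection, which is one simple way a single vector can yield distinct positions for a decoder to attend over.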
RealizerDecoder
A 12-layer transformer decoder with the following configuration:
- d_model = 768 - Model dimension matching MPNet
- n_layer = 12 - Transformer layers
- n_head = 8 - Attention heads
- d_ff = 3072 - Feed-forward dimension
Surrogate Scorer
A neural network that approximates the cosine similarity between the MPNet embedding of generated text and the target embedding. Used for sequence-level reranking during beam search.
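The reranking idea can be sketched as follows. The snippet computes cosine similarity directly and combines it with the language-model log-probability as `lm_logp + alpha * similarity`. That combination rule is an assumption made for illustration, as are the function names and the toy scores. In the real system the similarity would come from the surrogate's prediction rather than from re-encoding each candidate.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for zero vectors
    return dot / (norm_a * norm_b)

def rerank(candidates, alpha=1.0):
    """Sort (text, lm_logp, similarity) tuples by the combined score.

    alpha weights the similarity term (assumed scoring rule, see lead-in).
    """
    return sorted(candidates, key=lambda c: c[1] + alpha * c[2], reverse=True)

# Toy beam: A is more fluent (higher log-prob), B is closer to the target.
beams = [
    ("candidate A", -1.20, 0.70),
    ("candidate B", -1.50, 0.95),
]
print(rerank(beams, alpha=1.0)[0][0])
print(rerank(beams, alpha=1.5)[0][0])
```

With these toy numbers, raising alpha shifts the top choice from the more fluent candidate to the one with higher similarity, which is the trade-off a rerank weight controls.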
Usage
The Aparecium package provides a high-level API for embedding reversal. Install from PyPI
and use the Aparecium wrapper class.
Installation
pip install aparecium
Basic Usage
from aparecium import Aparecium
from sentence_transformers import SentenceTransformer
# 1. Encode text to embedding
encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
text = "Bitcoin ETF inflows hit a new weekly high as markets turn risk-on."
e = encoder.encode([text], convert_to_numpy=True, normalize_embeddings=True)[0]
# 2. Load Aparecium (auto-downloads from HuggingFace)
model = Aparecium()
# 3. Reverse the embedding back to text
result = model.invert_embedding(e, beam=5, max_len=64)
print("Reconstruction:", result.text)
print("Candidates:", result.candidates)
End-to-End Inversion
from aparecium import Aparecium
model = Aparecium()
# Directly invert from text (embeds internally)
text = "Ethereum L2 blob fees spiked after EIP-4844; MEV still shapes order flow."
result = model.invert_text(text, beam=5, max_len=64)
print(result.text)
CLI Usage
# Pipe text through the CLI
echo "Macro: DXY rallies while risk assets chop; crypto narratives rotate to AI tokens." | \
python -m aparecium
Inference API
For production deployments, Aparecium includes a FastAPI inference service. Run the service
and POST embedding vectors to the /invert endpoint.
Starting the Service
python -m aparecium.aparecium.infer.service --ckpt checkpoints/aparecium_v2_s1.pt
Request Format
{
"embedding": [0.123, 0.456, ..., 0.789], // 768 floats
"deterministic": true,
"beam": 5,
"max_len": 64,
"constraints": true,
"final_mpnet": true
}
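A client for this request format might look like the sketch below, using only the Python standard library. The field names come from the request shown above. The helper names and the `base_url` default are assumptions, since the host and port the service listens on are not stated here.

```python
import json
import urllib.request

EMBEDDING_DIM = 768  # size of the pooled all-mpnet-base-v2 vector

def build_invert_request(embedding, beam=5, max_len=64, deterministic=True,
                         constraints=True, final_mpnet=True):
    """Validate the embedding and assemble the JSON payload for /invert."""
    values = [float(v) for v in embedding]
    if len(values) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} floats, got {len(values)}")
    return {
        "embedding": values,
        "deterministic": deterministic,
        "beam": beam,
        "max_len": max_len,
        "constraints": constraints,
        "final_mpnet": final_mpnet,
    }

def post_invert(payload, base_url="http://localhost:8000"):
    """POST the payload to the /invert endpoint and return the parsed JSON.

    base_url is an assumption; use the address your service listens on.
    """
    request = urllib.request.Request(
        base_url + "/invert",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Checking the vector length on the client side catches the most likely mistake (passing a token-level matrix or a vector from a different encoder) before a request is sent.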
Response Format
{
"text": "Top-1 reconstruction result",
"candidates": ["candidate 1", "candidate 2", ...],
"scores": {
"lm_logp": [-1.23, -1.45, ...],
"cos_mpnet": [0.89, 0.87, ...] // if final_mpnet=true
},
"plan": { ... } // optional plan information
}
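One way to consume this response is to pick the candidate with the highest `cos_mpnet` score and fall back to `text` when those scores are absent. The sketch below assumes the score lists are aligned by index with `candidates`, which the format above suggests but does not state. `best_candidate` is a hypothetical helper, not part of the package.

```python
def best_candidate(response):
    """Return the candidate with the highest cos_mpnet score.

    Falls back to the top-1 "text" field when cos_mpnet is missing or its
    length does not match the candidate list.
    """
    candidates = response.get("candidates", [])
    cos = response.get("scores", {}).get("cos_mpnet")
    if not candidates or not cos or len(cos) != len(candidates):
        return response.get("text")
    best_index = max(range(len(candidates)), key=lambda i: cos[i])
    return candidates[best_index]

# Toy response with made-up values, shaped like the format above.
example = {
    "text": "first",
    "candidates": ["first", "second"],
    "scores": {"lm_logp": [-1.0, -1.4], "cos_mpnet": [0.81, 0.88]},
}
print(best_candidate(example))
```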
Training Pipeline
Aparecium uses a two-stage training process: supervised learning (S1) followed by optional self-critical sequence training (S2/SCST).
Stage 1: Supervised Training
The model learns to reconstruct text from embeddings using cross-entropy loss on teacher-forced token predictions.
python -m aparecium.aparecium.train.train_s1_supervised \
--shards ./data/train \
--val_shards ./data/val \
--save_dir ./checkpoints \
--batch_size 64 \
--epochs 1 \
--steps 6000 \
--lr 3e-4 \
--max_len 96
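For readers unfamiliar with the Stage 1 objective, the following plain-Python sketch shows what teacher-forced cross-entropy computes. It is not the training script's code: the function names are hypothetical, and a real implementation would use a tensor library and handle padding.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def teacher_forced_nll(step_logits, target_ids):
    """Mean negative log-likelihood of the reference tokens.

    step_logits[t] holds the decoder's logits at step t, produced while the
    decoder is fed the reference tokens before t (teacher forcing) instead
    of its own earlier predictions.
    """
    losses = [-log_softmax(logits)[tok]
              for logits, tok in zip(step_logits, target_ids)]
    return sum(losses) / len(losses)

# Uniform logits over a 4-token vocabulary give a loss of log(4).
loss = teacher_forced_nll([[0.0, 0.0, 0.0, 0.0]] * 2, [1, 3])
print(loss, math.log(4))
```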
Surrogate Scorer Training
Train the surrogate scorer r(x, e) to approximate embedding similarity:
python -m aparecium.aparecium.train.train_surrogate_r \
--shards ./data/train \
--save_dir ./checkpoints
Stage 2: SCST Fine-tuning (Optional)
Self-critical sequence training optimizes directly for embedding similarity using REINFORCE with a baseline:
python -m aparecium.aparecium.train.train_s2_scst \
--s1_ckpt ./checkpoints/aparecium_v2_s1.pt \
--save_dir ./checkpoints
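The core of a REINFORCE-with-baseline update can be sketched in a few lines. In the standard self-critical formulation the baseline is the reward of the model's own greedy decode. Whether Aparecium uses exactly that baseline, and the reward values below, are assumptions for illustration. `scst_loss` is a hypothetical name.

```python
def scst_loss(sample_logp, sample_reward, baseline_reward):
    """REINFORCE loss with a baseline for one sampled sequence.

    sample_logp:     summed token log-probabilities of the sampled sequence
    sample_reward:   reward of the sample (e.g. embedding cosine similarity)
    baseline_reward: reward of the baseline sequence (assumed greedy decode)
    """
    advantage = sample_reward - baseline_reward
    # Minimizing this raises the log-probability of samples that beat the
    # baseline and lowers it for samples that fall short.
    return -advantage * sample_logp

better = scst_loss(sample_logp=-2.0, sample_reward=0.9, baseline_reward=0.8)
worse = scst_loss(sample_logp=-2.0, sample_reward=0.6, baseline_reward=0.8)
print(better, worse)
```

Subtracting the baseline leaves the expected gradient unchanged while reducing its variance, which is why a sample that merely matches the baseline contributes nothing to the update.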
Limitations
- Reconstruction is not exact - Outputs preserve semantic meaning but may differ in wording, style, or specific details
- Domain specificity - Best performance on crypto/finance social media; other domains may show quality degradation
- Synthetic training data - Trained on synthetic crypto posts, not real social media; domain shift may occur
- Privacy considerations - Do not use to reconstruct sensitive or personally identifiable content
Quality Factors
Reconstruction quality depends on:
- Encoder alignment with sentence-transformers/all-mpnet-base-v2
- Domain match (crypto/finance social media performs best)
- Decode settings (beam size, constraints, rerank weights)
Recommended Decode Settings
| Parameter | Recommended | Notes |
|---|---|---|
| beam | 5 | Balance between quality and speed |
| max_len | 64-96 | Match typical social media length |
| deterministic | true | Consistent results |
| alpha (rerank) | 1.0-1.5 | Surrogate scorer weight |