Clustering

EMBER-based binary clustering with UMAP + HDBSCAN

Clustering Parameters

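UMAP neighbors: Controls the balance between local and global structure in the projection

UMAP minimum distance: Minimum distance between points in the projected space

HDBSCAN minimum cluster size: Minimum number of points required to form a cluster

Sample limit: Maximum number of samples to process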
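
As a rough illustration, the four settings above could be bundled and sanity-checked before a run. This is only a sketch: the class name, field names, and default values are assumptions made for illustration (the defaults mirror the library defaults of umap-learn and hdbscan where they exist), not the tool's actual configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusteringParams:
    """Hypothetical container for the four clustering settings."""

    n_neighbors: int = 15       # UMAP: small values favor local structure, large values global
    min_dist: float = 0.1       # UMAP: minimum distance between points in the projection
    min_cluster_size: int = 5   # HDBSCAN: minimum points for a cluster
    max_samples: int = 1000     # illustrative cap on the number of samples processed

    def __post_init__(self):
        # Basic sanity checks; the exact limits enforced by the tool are not known.
        if self.n_neighbors < 2:
            raise ValueError("n_neighbors must be at least 2")
        if self.min_dist < 0:
            raise ValueError("min_dist must be non-negative")
        if self.min_cluster_size < 2:
            raise ValueError("min_cluster_size must be at least 2")
        if self.max_samples < 1:
            raise ValueError("max_samples must be at least 1")
```

For example, `ClusteringParams(n_neighbors=50)` would put more weight on global structure, while `ClusteringParams(min_dist=-0.1)` raises a `ValueError`.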

How Clustering Works

  • EMBER Extraction: Extracts 2381-dimensional static features from PE files (imports, exports, headers, etc.)
  • UMAP: Reduces 2381 dimensions to 3D while preserving similarity structure
  • HDBSCAN: Density-based clustering that determines the number of clusters automatically
  • Noise Points (gray): Samples that don't fit well into any cluster; these may be unique or rare binaries
  • Use Cases: Group similar macro binaries, identify malware families, detect outliers
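
The steps listed above can be sketched in Python as follows. This is a minimal sketch, not the tool's implementation: it assumes the EMBER feature vectors have already been extracted into an array of shape (n_samples, 2381) and that the third-party `umap-learn` and `hdbscan` packages are installed. The function names, default values, and the simple truncation used for the sample limit are assumptions. HDBSCAN labels noise points as -1, which is what the summary helper relies on.

```python
from collections import Counter

NOISE_LABEL = -1  # HDBSCAN assigns -1 to points it leaves unclustered


def cluster_features(features, n_neighbors=15, min_dist=0.1,
                     min_cluster_size=5, max_samples=None, random_state=42):
    """Project feature vectors to 3D with UMAP, then cluster with HDBSCAN."""
    # Imported lazily because these are heavy third-party dependencies.
    import numpy as np
    import umap
    import hdbscan

    X = np.asarray(features, dtype=np.float32)
    if max_samples is not None:
        X = X[:max_samples]  # simple truncation; the tool may sample differently

    # Reduce the 2381-dimensional vectors to 3 dimensions.
    reducer = umap.UMAP(n_components=3, n_neighbors=n_neighbors,
                        min_dist=min_dist, random_state=random_state)
    coords = reducer.fit_transform(X)

    # Cluster in the reduced space; the number of clusters is not specified.
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size)
    labels = clusterer.fit_predict(coords)
    return coords, labels


def summarize_labels(labels):
    """Count clusters and noise points in an HDBSCAN label array."""
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(NOISE_LABEL, 0)
    return {
        "n_clusters": len(counts),
        "n_noise": n_noise,
        "cluster_sizes": dict(counts),
    }
```

Calling `summarize_labels(labels)` on the output of `cluster_features` gives a quick view of how many samples ended up as noise, which is a useful signal when tuning the minimum cluster size.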