I am building a B2B product search system using vector embeddings and would like advice specifically on how to generate embeddings for structured product attributes.
Context
Domain: B2B ecommerce
Queries: Short keyword-style searches (4 to 5 tokens), often containing numbers, units, and alphanumeric attributes
Examples:
“12 kva diesel generator”
“5 hp air compressor”
“cnc milling machine 3 axis”
Search architecture
Initial candidate retrieval using product title embeddings
Reranking using product attribute embeddings
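To make the two stages concrete, here is a minimal sketch of the flow in Python, assuming the embeddings are already computed and stored as plain lists of floats. The names `search`, `title_vec`, `attr_vec`, `k_candidates`, and `k_final` are placeholders for illustration, not our production code, and a real system would use a vector index rather than a full sort.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query_vec, products, k_candidates=100, k_final=10):
    """Two-stage search over products (hypothetical dicts with 'title_vec' and 'attr_vec')."""
    # Stage 1: candidate retrieval by similarity to the title embedding.
    candidates = sorted(
        products,
        key=lambda p: cosine(query_vec, p["title_vec"]),
        reverse=True,
    )[:k_candidates]
    # Stage 2: rerank only the candidates by similarity to the attribute embedding.
    reranked = sorted(
        candidates,
        key=lambda p: cosine(query_vec, p["attr_vec"]),
        reverse=True,
    )
    return reranked[:k_final]
```

Note that a product dropped in stage 1 can never be recovered by the attribute reranker, so the candidate count bounds what reranking can fix.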
Product data
Each product has a title and a set of structured attributes stored as key-value pairs.
Example:
Product: Diesel Generator
Attributes:
“power_rating: 12 kva”
“fuel_type: diesel”
“phase: 3”
“cooling_type: air cooled”
“application: industrial backup”
Main question
What is the best way to preprocess and embed these attributes for semantic reranking?
Attribute embedding strategies we are considering
Flat concatenation
power rating 12 kva fuel type diesel phase 3 cooling type air cooled application industrial backup
Key-value with separators
power_rating: 12 kva | fuel_type: diesel | phase: 3 | cooling_type: air cooled | application: industrial backup
Line-separated attributes
power_rating: 12 kva
fuel_type: diesel
phase: 3
cooling_type: air cooled
application: industrial backup
Natural language passage
This diesel generator has a power rating of 12 kva, uses diesel fuel, supports 3 phase operation, and is air cooled for industrial backup usage.
Per-attribute embeddings
- Generate one embedding per attribute and aggregate scores during reranking
Any other recommended method?
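For reference, the text variants above can be generated from the same attribute dictionary with a few small helpers. This is only a sketch: the function names are made up for illustration, the natural language passage is omitted because it needs a per-category template, and the weighted mean in `aggregate` is just one assumed way to combine per-attribute scores.

```python
def flat(attrs):
    """Flat concatenation: underscores in keys become spaces, no separators."""
    return " ".join(f"{key.replace('_', ' ')} {value}" for key, value in attrs.items())

def key_value(attrs, sep=" | "):
    """Key-value pairs joined by a separator, keys kept verbatim."""
    return sep.join(f"{key}: {value}" for key, value in attrs.items())

def line_separated(attrs):
    """Same key-value pairs, one per line."""
    return key_value(attrs, sep="\n")

def per_attribute(attrs):
    """One short text per attribute, each to be embedded separately."""
    return [f"{key}: {value}" for key, value in attrs.items()]

def aggregate(scores, weights=None):
    """Weighted mean of per-attribute similarity scores (assumed aggregation rule)."""
    if not scores:
        return 0.0
    if weights is None:
        weights = [1.0] * len(scores)
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    return sum(s * w for s, w in zip(scores, weights)) / total_weight

attrs = {
    "power_rating": "12 kva",
    "fuel_type": "diesel",
    "phase": "3",
    "cooling_type": "air cooled",
    "application": "industrial backup",
}
print(flat(attrs))
print(key_value(attrs))
print(line_separated(attrs))
print(per_attribute(attrs))
```

With per-attribute embeddings, the weights could favor attributes that actually appear in the query (for example power_rating for “12 kva diesel generator”), but how to choose them is part of what we are asking.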
Specific questions
Should attributes be embedded as a single combined text or as individual attribute embeddings?
Does explicitly preserving attribute keys help embedding quality?
Are separator tokens or structured formatting important for short, attribute-heavy queries?
Are there best practices for handling numeric values, units, and alphanumeric attributes?
Does passage-style text perform better than structured key-value text for dense retrieval?
Model considerations
Currently considering the Marqo ecommerce embedding model (large)
Open to recommendations for other models that work well for:
Short B2B queries
Numeric and unit-heavy matching
Attribute-based reranking