Background
I am maintaining a Spring Boot (Java) backend for a university campus platform. We recently implemented an Agentic RAG pipeline using Google's GenAI SDK (Gemini models) and Vertex AI Ranking API. The database is PostgreSQL with pgvector for hybrid search.
Right now, the system is extremely slow (often 5-10+ seconds before the stream starts), and the LLM's answer quality and instruction adherence are poor. Because of these issues, we are open to changing the AI models (e.g., switching away from Gemini or Vertex AI) or reworking the architecture entirely if there is a better industry standard.
Tech Stack
Backend: Spring Boot 3.x, Java 17, Spring Data JPA
Database: PostgreSQL + pgvector
LLMs: gemini-2.5-flash-lite (Router & Title Gen), gemini-2.5-flash (Main Chat), gemini-embedding-001 (Embeddings)
External APIs: Google GenAI SDK (Java), Vertex AI Ranking API (REST via RestTemplate)
Current Architecture & Flow
Our streamChat logic follows a strict synchronous chain before emitting the first token:
1. Router Phase (gemini-2.5-flash-lite): We send the user's question, conversation history, and current context to the Router LLM. It returns a structured JSON (via responseSchema) deciding needsSearch, target tables (intents: NOTICE, POST, PETITION), searchQueries (contextually rewritten), and keywords.
2. Embedding Phase (gemini-embedding-001): If needsSearch is true, we embed the rewritten query.
3. Hybrid Retrieval (pgvector + GIN): We query our DB using vector similarity and exact keyword matching. We fetch up to 20 candidate chunks (8 notices, 8 posts, 4 petitions).
4. Re-Ranking Phase (Vertex AI Ranking): We prepend meta-information (e.g., [NOTICE] Target: CS Dept | Category: Academic | Views: 150\n) to the chunks and send them to the Vertex AI Ranking REST API. We filter out records with a score < 0.2.
5. Context Assembly: We fetch the full JPA entities for the top-ranked documents to prevent LazyInitializationException and build a large context string using specific formatting ([Document 1]...).
6. Generation Phase (Main Gemini Model): We pass the massive System Instruction, user question, and Google Search Tool (Grounding) to the main model and stream the response via SSE.
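To make the re-ranking step concrete, here is a minimal plain-Java sketch of the metadata prefix and the 0.2 cutoff described above. The class, record, and field names (RerankSketch, Chunk, Scored) are hypothetical placeholders, not the actual code, and the call to the Ranking API itself is left out.

```java
import java.util.List;

public class RerankSketch {
    // Hypothetical shape of a retrieved chunk; field names are assumptions.
    public record Chunk(String type, String target, String category, int views, String body) {}

    // Hypothetical shape of one scored record returned by the ranking call.
    public record Scored(Chunk chunk, double score) {}

    // Cutoff quoted in the flow above.
    public static final double MIN_SCORE = 0.2;

    // Prepends the meta-information line to the chunk body before it is sent for ranking.
    public static String toRankingText(Chunk c) {
        return "[%s] Target: %s | Category: %s | Views: %d\n%s"
                .formatted(c.type(), c.target(), c.category(), c.views(), c.body());
    }

    // Drops records scoring below the cutoff and orders the rest best-first.
    public static List<Scored> keepRelevant(List<Scored> scored) {
        return scored.stream()
                .filter(s -> s.score() >= MIN_SCORE)
                .sorted((a, b) -> Double.compare(b.score(), a.score()))
                .toList();
    }
}
```

Note that the prefix becomes part of the text the ranker scores, so it competes with the body for relevance to the query.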
The Core Problems
1. High Latency Bottleneck
The TTFT (Time To First Token) is extremely high. The synchronous execution of: Router LLM -> Embedding API -> DB Search -> Vertex Ranking API -> Main LLM init causes the user to wait too long.
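One option being considered is running the Router and a retrieval on the raw question at the same time instead of strictly in sequence. The sketch below shows that idea with CompletableFuture. The names SpeculativeRetrieval and RouterDecision and the two function parameters are hypothetical stand-ins for the real router and search calls, and it uses the common pool rather than a dedicated executor.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

public class SpeculativeRetrieval {
    // Hypothetical router result mirroring the needsSearch / searchQueries fields.
    public record RouterDecision(boolean needsSearch, String rewrittenQuery) {}

    /**
     * Starts the router call and a speculative search on the raw question together.
     * The speculative result is used only if the router wants a search and kept the
     * query unchanged; otherwise it is discarded or the search reruns on the rewrite.
     */
    public static List<String> retrieve(String question,
                                        Function<String, RouterDecision> router,
                                        Function<String, List<String>> search) {
        CompletableFuture<RouterDecision> routerFuture =
                CompletableFuture.supplyAsync(() -> router.apply(question));
        CompletableFuture<List<String>> speculative =
                CompletableFuture.supplyAsync(() -> search.apply(question));

        RouterDecision decision = routerFuture.join();
        if (!decision.needsSearch()) {
            // The result is ignored; the background task may still run to completion.
            speculative.cancel(false);
            return List.of();
        }
        if (decision.rewrittenQuery().equals(question)) {
            return speculative.join();
        }
        // Query was rewritten: the speculative work is wasted and we search again.
        return search.apply(decision.rewrittenQuery());
    }
}
```

This only saves time when the raw question is usable as the search query; when the router rewrites it, the chain is as long as before, which is part of what I am unsure about.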
2. Prompt Engineering & Instruction Adherence
Our Main Model's system prompt is massive. We are injecting persona instructions, real-time system dates, user academic context (GPA, courses taken), RAG documents, and very strict output rules. Issue: Despite explicit instructions, the LLM frequently hallucinates citations (e.g., citing a document 3 when only 2 documents were provided) or fails to trigger the Google Search tool when the internal documents lack specific numerical data.
Here is our exact System Prompt structure (translated to English for context):
[System Time] 2026-06-02
- Current: 2026 Semester 1
- Next: 2026 Semester 2
[User Affiliation] XYZ University Engineering Campus
[User Details]
- Name: John Doe
- Major: Computer Science
- Credits: 95 / Required: 132 (User input, not official)
... (Past courses list) ...
Your name is 'Cambi', a friendly and smart senior at the university. Use casual, friendly Korean (banmal).
[Added Reference Materials]
[Document 1] Notice Title...
- Content: ...
- Info: 2026-05-10 | Target: Common
**[Answering Guidelines]:**
1. Answer accurately based on the [Added Reference Materials].
2. MUST add citation tags N.
---
## 🚨 STRICT RULES (CRITICAL) 🚨
1. Single Output.
2. Ignore Irrelevant Departments.
3. Selective Referencing (No TMI).
4. MANDATORY GOOGLE SEARCH: If [Document N] lacks essential numerical data, you MUST actively trigger the Google Search Tool.
5. NO FAKE CITATIONS: ONLY cite exact document numbers provided. DO NOT cite Google Search results.
6. Closing template: Always append a specific follow-up question.
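Since rule 5 is not reliably followed, one fallback would be to check citations in code after generation. The sketch below assumes citations appear as bracketed numbers such as [1]; that syntax is an assumption for illustration, and CitationGuard is a hypothetical helper, not something we currently have. It operates on a complete string, so with SSE it would need buffering or a per-chunk variant.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CitationGuard {
    // Assumption: citations look like [1], [2], ...; adjust to the real tag syntax.
    private static final Pattern CITATION = Pattern.compile("\\[(\\d+)\\]");

    /** Removes citation tags whose number falls outside 1..documentCount. */
    public static String stripInvalidCitations(String answer, int documentCount) {
        Matcher m = CITATION.matcher(answer);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            int n;
            try {
                n = Integer.parseInt(m.group(1));
            } catch (NumberFormatException e) {
                n = -1; // digit run too long to be a real document number
            }
            boolean valid = n >= 1 && n <= documentCount;
            // Keep valid tags verbatim; replace invalid ones with nothing.
            m.appendReplacement(out, valid ? Matcher.quoteReplacement(m.group()) : "");
        }
        m.appendTail(out);
        return out.toString();
    }
}
```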
3. Doubtful Vertex AI Ranking Results
To help Vertex AI understand the importance of university documents, we manually prepend metadata to the text (e.g., [POST] Board: Free Board | Likes: 45\n{Body}). However, Vertex Ranking often drops highly relevant notices (scoring them < 0.2) simply because the semantic overlap of the exact words isn't perfect, ignoring the "Likes" or "Views" meta-context we appended.
My Questions
Standard RAG Architecture & Restructuring: How is a production RAG system usually built? How should I fundamentally restructure this flow to solve the severe latency and performance issues? Should we drop the LLM Router and rely entirely on hybrid search, or parallelize the Router and Retrieval (speculative execution)? We are completely open to changing our approach if there's a better industry standard.
Prompt Optimization & Best Practices: How is prompting typically handled in such a complex RAG setup? Our prompt mixes Persona, User Context, RAG Data, and System Constraints. Is it better to separate the Persona/Constraint logic from the Context injection? How can we force the LLM to strictly obey the citation tag rules and Google Grounding triggers?
Re-Ranking Strategy: Is Vertex AI Ranking suitable for tabular/meta-heavy data (like community posts with likes/views)? Should we replace it with a custom scoring algorithm in Java (e.g., VectorScore * 0.7 + Normalized(Likes) * 0.3) instead of relying on a semantic re-ranker?
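To show what the custom scoring alternative would look like, here is a minimal sketch of the blended score from the re-ranking question. It assumes the vector score is a similarity in [0, 1] (higher is better) and normalizes likes with min-max scaling over the candidate set. BlendedScore and Candidate are hypothetical names, and the 0.7 / 0.3 weights are just the example values, not tuned.

```java
import java.util.Comparator;
import java.util.List;

public class BlendedScore {
    // Hypothetical candidate shape; vectorScore is assumed to be a similarity in [0, 1].
    public record Candidate(String id, double vectorScore, int likes) {}

    // Min-max normalization within the candidate set; 0 when all like counts are equal.
    public static double normalize(int likes, int min, int max) {
        return max == min ? 0.0 : (double) (likes - min) / (max - min);
    }

    // VectorScore * 0.7 + Normalized(Likes) * 0.3
    public static double score(Candidate c, int min, int max) {
        return c.vectorScore() * 0.7 + normalize(c.likes(), min, max) * 0.3;
    }

    // Returns the candidates ordered by blended score, best first.
    public static List<Candidate> rank(List<Candidate> candidates) {
        int min = candidates.stream().mapToInt(Candidate::likes).min().orElse(0);
        int max = candidates.stream().mapToInt(Candidate::likes).max().orElse(0);
        return candidates.stream()
                .sorted(Comparator.comparingDouble((Candidate c) -> score(c, min, max)).reversed())
                .toList();
    }
}
```

With min-max scaling, a single very popular post compresses everyone else's popularity toward zero, so a log scale might behave better; I have not tested either.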
Any architectural insights, prompt restructuring examples, alternative model recommendations, or Spring Boot specific optimizations would be greatly appreciated!