NL-to-SQL pipeline with PostgreSQL — join path selection, reserved word errors, and query accuracy
10:55 06 Mar 2026

I'm building a Natural Language to SQL system that:

  • Connects to any PostgreSQL database and auto-introspects schema using information_schema

  • Builds a graph of table relationships using NetworkX

  • Uses sentence-transformers embeddings to find relevant tables from a plain English query

  • Scores join paths using semantic similarity + keyword boosting + coverage scoring

  • Sends the best path to Groq LLM (Llama 3.3 70B) to generate SQL

  • Has a self-healing retry loop that sends failed SQL back to LLM for fixing

Tech Stack: Python, PostgreSQL, NetworkX, sentence-transformers, Groq API, psycopg2, FastAPI (planned)

Current Problems:

Problem 1 — What is the best way to handle reserved word table names when auto-generating SQL with an LLM? Should I rename the table or handle it in the prompt?
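The usual answer is: don't rename the table, quote the identifier. In production psycopg2's `sql.Identifier` (and `format()`) handles this for parameterized statements; for LLM-generated SQL you can quote every name in the schema description you put in the prompt, so the model only ever sees the safe form. A minimal sketch (the reserved-word set here is illustrative; PostgreSQL exposes the real list via `pg_get_keywords()`):

```python
# Minimal sketch of always-safe identifier quoting for generated SQL.
# The reserved set is a small illustrative subset -- query
# pg_get_keywords() for the authoritative list.
PG_RESERVED = {"order", "user", "group", "table", "select", "where", "offset"}

def quote_ident(name: str) -> str:
    # SQL-standard quoting: wrap in double quotes, double any embedded quote.
    return '"' + name.replace('"', '""') + '"'

def safe_ident(name: str) -> str:
    # Quote only when necessary so the generated SQL stays readable:
    # reserved words, mixed case, or anything not a plain identifier.
    if name.lower() in PG_RESERVED or name != name.lower() or not name.isidentifier():
        return quote_ident(name)
    return name
```

Running `safe_ident` over every table and column name before building the prompt (and again as a post-processing pass on the LLM's output, as a belt-and-braces check) avoids both renaming the schema and hoping the model remembers a prompt instruction.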

Problem 2 — LLM Self-Healing Loop Not Working

for attempt in range(3):
    result = execute(sql)
    if it fails → send the error to the LLM → get fixed SQL → retry

The LLM returns the identical broken SQL on all 3 attempts.

Question: What prompt engineering techniques work best for getting an LLM to actually fix broken SQL when given an error message?
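Techniques that tend to help: include the original question plus the full PostgreSQL error in the repair prompt, ask the model to diagnose the error in one sentence *before* emitting the fix, explicitly require the new query to differ from the failed one, and track previously seen queries so you can bail out (or raise temperature) when the model repeats itself. A sketch of such a loop, where `execute` and `llm` are hypothetical callables (e.g. a cursor wrapper and a thin Groq chat-completion wrapper):

```python
def self_heal(question, sql, execute, llm, max_attempts=3):
    """Retry loop that forces the LLM to actually change the query.
    `execute(sql)` runs the query and raises on error; `llm(prompt)`
    returns model text (hypothetical wrappers, not a real API here)."""
    seen = {sql.strip()}
    last_err = None
    for attempt in range(max_attempts):
        try:
            return execute(sql)
        except Exception as err:
            last_err = err
            prompt = (
                f"User question: {question}\n"
                f"This SQL failed:\n{sql}\n"
                f"PostgreSQL error: {err}\n"
                "First state in one sentence what the error means. "
                "Then output a corrected query that MUST differ from the "
                "failed one. Output only the SQL, no prose or markdown."
            )
            # Tip: raising temperature slightly per attempt also helps
            # break the model out of repeating the same completion.
            sql = llm(prompt).strip()
            if sql in seen:
                raise RuntimeError("LLM returned identical SQL; aborting")
            seen.add(sql)
    raise RuntimeError(f"exhausted {max_attempts} attempts: {last_err}")
```

The `seen` set is the key piece: a loop that silently re-executes identical SQL burns all three attempts on the same failure, which matches the symptom described above.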

Problem 3 — Join Path Ambiguity

Multiple valid paths exist between the same tables:

- patient → prescription → prescription_detail → drug ✅ correct
- patient → controlled_drug_log → drug ✅ also valid, but wrong context


Currently scoring with:

- Semantic similarity (weight 0.8)

- Path length (weight 0.1)

- Nullable FK penalty (weight 0.1)

- Keyword boost (+0.15 per matching keyword)

- Relevant table coverage boost (×0.8)

Question: What is the most reliable approach to disambiguate multiple valid join paths in a graph when the correct path depends on business context?

Problem 4 — Schema Linking Noise

With 23 tables, embedding-based table selection sometimes picks irrelevant tables, which pollutes the path finding. Currently using top_k=5 with forced entity detection.

Question: What is the recommended approach for schema linking in NL-to-SQL systems with 20+ tables?
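A common recipe at this scale: embed a rich per-table document (table name + column names + table/column comments, not just the name), keep a small top_k, then expand the selection by one hop of FK neighbors so the chosen tables stay connected for path finding; optionally drop anything below a similarity floor. A sketch where `embed`, `table_docs`, and `fk_neighbors` are hypothetical inputs (`embed` would be a sentence-transformers model in this pipeline):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def link_schema(question, table_docs, embed, fk_neighbors, top_k=5):
    """Two-stage schema linking sketch: (1) rank tables by similarity
    between the question and a rich per-table description; (2) add one
    hop of FK neighbors so the candidate set stays join-connected."""
    qv = embed(question)
    ranked = sorted(table_docs,
                    key=lambda t: cosine(qv, embed(table_docs[t])),
                    reverse=True)
    selected = set(ranked[:top_k])
    for t in list(selected):
        selected |= fk_neighbors.get(t, set())
    return selected
```

The FK expansion is what lets you shrink top_k (and thus noise) without breaking connectivity: bridge tables like `prescription` get pulled in by their neighbors even when the question never mentions them.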

What I've Tried:

  - Full schema in prompt → LLM gets confused with 23 tables

  - Graph-based path finding with cutoff=4 → reduced paths from 1437 to 6

  - Embedding similarity alone → wrong path selected due to length bias

  - Keyword boosting → improved accuracy significantly

  - Irrelevant table penalty → reduced noise in path selection

python postgresql machine-learning large-language-model