NeuSym-RAG: Hybrid Neural Symbolic Retrieval with Multiview Structuring for PDF Question Answering

**NeuSym-RAG** is a hybrid neural symbolic retrieval framework for PDF question answering.

Abstract

The increasing number of academic papers poses significant challenges for researchers to efficiently acquire key details. While retrieval augmented generation (RAG) shows great promise in large language model (LLM) based automated question answering, previous works often isolate neural and symbolic retrieval despite their complementary strengths. Moreover, conventional single-view chunking neglects the rich structure and layout of PDFs, e.g., sections and tables. In this work, we propose NeuSym-RAG, a hybrid neural symbolic retrieval framework which combines both paradigms in an interactive process. By leveraging multi-view chunking and schema-based parsing, NeuSym-RAG organizes semi-structured PDF content into both the relational database and vectorstore, enabling LLM agents to iteratively gather context until sufficient to generate answers. Experiments on three full PDF-based QA datasets, including a self-annotated one AirQA-Real, show that NeuSym-RAG stably defeats both the vector-based RAG and various structured baselines, highlighting its capacity to unify both retrieval schemes and utilize multiple views.

Framework of NeuSym-RAG

The entire workflow proceeds as follows: - **Parsing**: Firstly, we parse the raw PDF file into a pipeline of functions to segment it in multi-view, extract non-textual elements, and store them in a schema-constrained database (DB). - **Encoding**: Next, we identify those encodable columns in the DB, and utilize embedding models for different modalities to obtain and insert vectors of cell values into the vectorstore (VS). - **Interaction**: Finally, we build an iterative Q&A agent which can predict executable actions to retrieve context from the backend environment (either DB or VS) and answer the input question.

Multiview Document Parsing
Multimodal Vector Encoding
Iterative Agent Interaction

The multiview document parsing process transforms raw PDF files into a structured database (DuckDB) through a comprehensive pipeline:

Querying scholar APIs (e.g., arXiv) to obtain the metadata such as the authors and published conference, such that we can support metadata-based filtering during retrieval.
Splitting the text based on different granularities with tool PyMuPDF (Artifex Software, 2023), e.g., pages, sections, and fixed-length continuous tokens.
Leveraging OCR models (MinerU) to extract tables, figures, and other visual elements
Asking large language models (LLMs) or vision language models (VLMs) to generate concise summaries of the parsed texts, tables, and images.

The retrieved metadata, parsed elements and predicted summaries will all be populated into the symbolic DB. We handcraft the DB schema in advance, which is carefully designed and universal for PDF documents.

The multimodal vector encoding process transforms structured database content into vector representations:

Labeling columns suitable for vectorization (e.g., long text summaries, image bounding boxes)
Encoding models of both modalities to vectorize text snippets or cropped images
Creating a one-to-one mapping between database cells and vector entries, supplemented with metadata for efficient retrieval

These data entries will be inserted into the VS, categorized into different collections based on the encoding model and modality.

After the population of the database and vectorstore, we can start the iterative agent interaction process. The RAG agent can proactively retrieves context from both the DB and VS. In each turn, agents predict one action to interact with the environment and obtain the real-time observation. Five parameterized actions are supported during the interaction, including:

RetrieveFromDatabase

Generate an SQL query to retrieve the desired information from the DuckDB database.

{
  "action_type": "RetrieveFromDatabase",
  "parameters": {
    "sql": "SELECT ..." // str, required
  }
}

RetrieveFromVectorstore

Given a query text, retrieve relevant context from the Milvus vectorstore.

{
  "action_type": "RetrieveFromVectorstore",
  "parameters": {
    "query": "...",           // str, required
    "collection_name": "...", // str, required
    "table_name": "...",      // str, required
    "column_name": "...",     // str, required
    "filter": "",             // str, optional
    "limit": 5                 // int, optional
  }
}

CalculateExpr

Calculate the expression and return the result.

{
  "action_type": "CalculateExpr",
  "parameters": {
    "expr": "2 + 3 * 4" // str, required
  }
}

ViewImage

Retrieve the visual information of the paper by specifying paper id, page number, and optional bounding box.

{
  "action_type": "ViewImage",
  "parameters": {
    "paper_id": "...",         // str, required
    "page_number": 1,           // int, required
    "bounding_box": []          // List[float], optional
  }
}

GenerateAnswer

Terminate the interaction when the retrieved results suffice to answer the user question.

{
  "action_type": "GenerateAnswer",
  "parameters": {
    "answer": ... // Any, required
  }
}

Task Demonstration

We demonstrate NeuSym-RAG's capabilities through the AirQA-Real dataset, showing how the agent processes questions, retrieves information, and generates answers. For more examples, please visit our Task Viewer page.

Data Format
Prompt
Example 1
Example 2
Example 3

The data format of AirQA-Real is as follows:

uuid Unique identifier for each question instance.

question The natural language question about a PDF or set of PDFs.

answer_format Specifies the required format for the answer (e.g., list, string, number).

tags Labels describing the question type, content modality, and evaluation genre.

anchor_pdf The PDF(s) that must be referenced to answer the question.

reference_pdf Additional PDF(s) that may be used for context.

conference Conference(s) and year(s) related to the question or source paper(s).

evaluator The evaluation function and parameters used to assess the answer.

state The status of the question if directly answered by some LLM services.

annotator The human annotator who created the question or the source dataset of the question.

Datasets

We employ three Q&A datasets on AI research papers to evaluate the performance of NeuSym-RAG. You can download them from our Hugging Face page.

AirQA-Real

We construct a human-labeled Q&A dataset on AI research papers, featuring complex questions that require understanding of text, tables, images, formulas, and metadata. The dataset includes 553 questions apanning across 3 types, annotated by 16 researchers.

Total Questions

553

Task Types

3

Question Categories

5

Evaluation functions

18

Single-doc details Multi-doc analysis Paper retrieval

Text Table Image Formula Metadata

hard-coding objective metrics

LLM-based subjective assessment

Other Benchmarks

M3SciQA and SciDQA are two other benchmarks that we converted to the same format as AirQA-Real to evaluate the performance of NeuSym-RAG.

452 test samples

M3SciQA

2,937 test samples

SciDQA

Experiment

Main Results

We evaluate NeuSym-RAG and Classic-RAG on three full-PDF QA datasets using various LLMs. NeuSym-RAG consistently outperforms Classic-RAG across all datasets and models. Notably:

Model	AirQA-Real						M3SciQA			SciDQA
Model	text	table	image	formula	metadata	AVG	table	image	AVG	table	image	formula	AVG
Classic-RAG
GPT-4o-mini	12.3	11.9	12.5	16.7	13.6	13.4	17.9	10.6	15.6	59.4	60.4	59.3	59.8
GPT-4V	13.2	13.9	10.0	13.9	13.6	14.7	12.1	8.8	11.1	56.6	56.8	58.1	57.4
Llama-3.3-70B-Instruct	8.7	7.9	9.5	16.7	0.0	10.0	12.7	8.1	11.3	56.8	58.8	58.9	58.0
Qwen2.5-VL-72B-Instruct	9.6	5.9	11.9	11.1	13.6	10.5	11.6	11.6	11.6	54.8	56.9	56.3	56.2
DeepSeek-R1	11.7	13.9	9.5	30.6	9.1	13.9	11.9	9.5	11.2	63.9	61.3	61.7	62.4
NeuSym-RAG
GPT-4o-mini	33.0	12.9	11.9	19.4	18.2	30.7	18.7	16.6	18.0	63.0	63.6	62.5	63.0
GPT-4V	38.9	18.8	23.8	38.9	27.3	37.3	13.7	13.4	13.6	62.6	63.5	63.2	63.1
Llama-3.3-70B-Instruct	30.6	11.9	16.7	16.7	27.3	29.3	26.3	17.6	23.6	55.5	57.3	56.6	56.4
Qwen2.5-VL-72B-Instruct	43.4	15.8	11.9	25.0	27.3	39.6	20.2	22.7	21.1	60.2	60.6	61.8	60.5
DeepSeek-R1	33.2	16.8	11.9	27.8	18.2	32.4	19.0	13.7	17.4	64.3	64.6	63.9	64.5

1) NeuSym-RAG remarkably outperforms the Classic-RAG baseline across all datasets, with a minimal 17.3% improvement on AIRQA-Real for all LLMs, benefiting from more flexible actions and iterative retrieval.
2) VLMs perform better in tasks that require vision capability, e.g., in M3SciQA where LLMs have to view an anchor image in the first place.
3) Open-source LLMs are capable of handling this complicated interactive procedure in a zero-shot paradigm, and even better than closed-source LLMs.

Method Ablation

To further analyze the contribution of each component, we compare NeuSym-RAG with a series of structured agent baselines. The illustration below summarizes all baselines compared in the ablation study, differing in retrieval paradigm, multi-view support, and whether iterative interaction is allowed.

Illustration of all baseline agent methods compared in the ablation study.

On the whole, NeuSym-RAG defeats all adversaries. It verifies that multiple views and combining two retrieval strategies both contribute to the eventual performance. Notably:

Method	Neural	Symbolic	Multi-view	# Interaction(s)	sgl.	multi.	retr.	subj.	obj.	AVG
Method	Question only	❌	❌	❌	1	5.7	8.0	0.4	9.4	2.7	4.0
Title + Abstract	1				5.7	14.0	0.0	13.1	3.6	5.4
Full-text w/. cutoff	1				28.3	10.7	0.4	26.2	7.6	11.2
Classic RAG		❌	❌	1	18.2	4.0	9.4	8.4	11.0	10.5
Iterative Classic RAG		❌	❌	≥2	8.2	10.0	15.2	5.6	13.2	11.8
Two-stage Neu-RAG		❌		2	19.5	10.0	5.3	15.9	9.4	10.7
Iterative Neu-RAG		❌		≥2	37.7	18.7	48.4	32.7	38.3	37.3
Two-stage Sym-RAG	❌			2	12.2	5.4	9.4	10.6	8.7	9.1
Iterative Sym-RAG	❌			≥2	32.1	14.7	33.6	27.1	28.3	28.0
Graph-RAG		❌		2	22.2	11.1	0.0	21.1	11.5	15.6
Hybrid-RAG				2	23.3	9.3	5.7	16.8	10.5	11.8
NeuSym-RAG(Ours)				≥2	28.3	32.3	58.2	27.1	42.6	39.6

1) Two-stage Neu-RAG outperforms Classic-RAG, while Hybrid-RAG achieves even further improvements, which can be attributed to the fact that agents can adaptively determine the parameters of actions.
2) Iterative retrievals are superior to their two-stage variants. Through multi-turn interaction, the agent can explore the backend environment and select the most relevant information to answer the question.
3) As the number of interactions increases, objective scores rise faster than subjective scores, indicating that with more retrievals, LLMs generate more rational answers.

Citation

@inproceedings{cao-etal-2025-neusym,
    title = "{N}eu{S}ym-{RAG}: Hybrid Neural Symbolic Retrieval with Multiview Structuring for {PDF} Question Answering",
    author = "Cao, Ruisheng  and
      Zhang, Hanchong  and
      Huang, Tiancheng  and
      Kang, Zhangyi  and
      Zhang, Yuxin  and
      Sun, Liangtai  and
      Li, Hanqi  and
      Miao, Yuxun  and
      Fan, Shuai  and
      Chen, Lu  and
      Yu, Kai",
    editor = "Che, Wanxiang  and
      Nabende, Joyce  and
      Shutova, Ekaterina  and
      Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.acl-long.311/",
    pages = "6211--6239",
    ISBN = "979-8-89176-251-0"
}