Latest 15 Papers - April 22, 2025

Large Language Model Research Advances

The field of large language models (LLMs) has seen significant advancements in recent years, with researchers pushing the boundaries of what is possible with these powerful tools. In this article, we will explore the latest 15 papers in the field of LLMs, covering topics such as visual reasoning, multi-view understanding, and causal analysis.

VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models

In the paper "VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models," the authors introduce a new benchmark for evaluating the visual reasoning capabilities of LLMs. The benchmark consists of a set of tasks that require the model to reason about visual information, such as object detection and scene understanding. The authors demonstrate the effectiveness of their benchmark by evaluating several state-of-the-art LLMs on the VisuLogic tasks.

Date: 2025-04-21 Comment: Code, data, and baselines are available at https://visulogic-benchmark.github.io/VisuLogic
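
To make the evaluation setup concrete, here is a minimal scoring sketch for a multiple-choice visual-reasoning benchmark. The `model.answer` interface and the JSONL item format are assumptions for illustration, not the official VisuLogic harness (which is linked above).

```python
import json

def evaluate_visual_reasoning(model, tasks_path):
    """Score a multimodal model on multiple-choice visual-reasoning items.

    Each line of `tasks_path` is assumed to hold an image path, a question,
    answer options, and the gold label -- a hypothetical format, not the
    official VisuLogic schema.
    """
    correct = total = 0
    with open(tasks_path) as f:
        for line in f:
            item = json.loads(line)
            # `model.answer` is a placeholder for whatever inference call
            # the evaluated MLLM exposes (image + prompt -> option letter).
            prediction = model.answer(image=item["image"],
                                      question=item["question"],
                                      options=item["options"])
            correct += int(prediction == item["label"])
            total += 1
    return correct / max(total, 1)
```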

Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs

In the paper "Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs," the authors propose a new evaluation framework for multi-view understanding in LLMs. The framework consists of a set of tasks that require the model to reason about multiple views of a scene, such as object detection and scene understanding. The authors demonstrate the effectiveness of their framework by evaluating several state-of-the-art LLMs on the multi-view understanding tasks.

Date: 2025-04-21 Comment: Project page: https://danielchyeh.github.io/All-Angles-Bench/

Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning

In the paper "Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning," the authors propose a new approach to credit assignment in process reward models. The approach, called stop summation, eliminates the need for summation in credit assignment, resulting in faster and more efficient computation. The authors demonstrate the effectiveness of their approach by evaluating several state-of-the-art LLMs on a range of tasks.

Date: 2025-04-21 Comment: None
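
To illustrate the distinction, here is a toy comparison of summation-based versus min-form credit assignment over per-step process rewards. This is a sketch of the general idea; the exact formulation in the paper may differ.

```python
def sum_form_return(step_rewards):
    # Conventional credit assignment: the trajectory's value is the sum
    # (or discounted sum) of per-step process rewards.
    return sum(step_rewards)

def min_form_return(step_rewards):
    # Min-form credit assignment: the trajectory is only as good as its
    # weakest reasoning step, so the value is the minimum step reward.
    return min(step_rewards)

steps = [0.9, 0.8, 0.2, 0.9]   # one low-quality intermediate step
print(sum_form_return(steps))  # 2.8 -- the flawed step is averaged away
print(min_form_return(steps))  # 0.2 -- the flawed step dominates
```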

Causal-Copilot: An Autonomous Causal Analysis Agent

In the paper "Causal-Copilot: An Autonomous Causal Analysis Agent," the authors propose a new autonomous causal analysis agent that can reason about causal relationships in data. The agent uses a combination of machine learning and causal inference techniques to identify causal relationships and provide explanations for its findings. The authors demonstrate the effectiveness of their agent by evaluating it on several real-world datasets.

Date: 2025-04-21 Comment: None

Interpretable Locomotion Prediction in Construction Using a Memory-Driven LLM Agent With Chain-of-Thought Reasoning

In the paper "Interpretable Locomotion Prediction in Construction Using a Memory-Driven LLM Agent With Chain-of-Thought Reasoning," the authors propose a new approach to locomotion prediction in construction using a memory-driven L agent with chain-of-thought reasoning. The approach uses a combination of machine learning and causal inference techniques to predict locomotion and provide explanations for its findings. The authors demonstrate the effectiveness of their approach by evaluating it on several real-world datasets.

Date: 2025-04-21 Comment: None

ASIDE: Architectural Separation of Instructions and Data in Language Models

In the paper "ASIDE: Architectural Separation of Instructions and Data in Language Models," the authors propose a new architectural separation of instructions and data in LLMs. The approach, called ASIDE, separates the instructions and data in the model, resulting in faster and more efficient computation. The authors demonstrate the effectiveness of their approach by evaluating several state-of-the-art LLMs on a range of tasks.

Date: 2025-04-21 Comment: ICLR 2025 Workshop on Building Trust in Language Models and Applications
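
One simple way to realize an instruction/data separation at the embedding level is sketched below in PyTorch. This is an illustrative construction under assumptions of mine, not necessarily the mechanism used in ASIDE.

```python
import torch
import torch.nn as nn

class RoleAwareEmbedding(nn.Module):
    """Toy embedding layer that encodes whether a token came from the
    instruction channel or the data channel.

    Illustrative sketch of 'architectural separation', not the specific
    mechanism proposed in the ASIDE paper.
    """
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)
        # One learned offset per role: 0 = instruction, 1 = data.
        self.role = nn.Embedding(2, dim)

    def forward(self, token_ids, role_ids):
        # role_ids has the same shape as token_ids and marks each token as
        # instruction (0) or data (1), e.g. derived from the prompt template.
        return self.tok(token_ids) + self.role(role_ids)

emb = RoleAwareEmbedding(vocab_size=32000, dim=64)
tokens = torch.randint(0, 32000, (1, 6))
roles = torch.tensor([[0, 0, 0, 1, 1, 1]])  # first half instruction, rest data
print(emb(tokens, roles).shape)  # torch.Size([1, 6, 64])
```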

CRUST-Bench: A Comprehensive Benchmark for C-to-safe-Rust Transpilation

In the paper "CRUST-Bench: A Comprehensive Benchmark for C-to-safe-Rust Transpilation," the authors propose a new comprehensive benchmark for C-to-safe-Rust transpilation. The benchmark consists of a set of tasks that require the model to transpile C code to safe Rust code. The authors demonstrate the effectiveness of their benchmark by evaluating several state-of-the-art LLMs on the transpilation tasks.

Date: 2025-04-21 Comment: None
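
As a minimal harness sketch, one first gate in evaluating transpiled output is checking that it compiles at all, e.g. by invoking rustc from Python. This is an assumption-laden sketch: passing a benchmark like CRUST-Bench also requires satisfying the provided interfaces and test suites, which this check does not cover.

```python
import subprocess
import tempfile
from pathlib import Path

def rust_compiles(rust_source: str) -> bool:
    """Return True if `rust_source` compiles with rustc.

    A compile check is only a first gate: full evaluation would also run
    the benchmark's own interfaces and tests.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "candidate.rs"
        src.write_text(rust_source)
        result = subprocess.run(
            ["rustc", "--edition=2021", str(src), "-o", str(Path(tmp) / "candidate")],
            capture_output=True,
            text=True,
        )
        return result.returncode == 0

print(rust_compiles('fn main() { println!("hello"); }'))  # True if rustc is installed
```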

Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators

In the paper "Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators," the authors propose a new benchmark for evaluating LLMs as judges. The benchmark consists of a set of tasks that require the model to evaluate the performance of other LLMs on a range of tasks. The authors demonstrate the effectiveness of their benchmark by evaluating several state-of-the-art LLMs on the evaluation tasks.

Date: 2025-04-21 Comment: The first two authors contributed equally. The codebase is at https://github.com/SalesforceAIResearch/jetts-benchmark
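
The sketch below shows how a judge model might be used for test-time scaling by reranking sampled candidates. The `judge_score` call is a placeholder interface of mine; the JETTS codebase (linked above) defines its own protocols.

```python
def rerank_with_judge(prompt, candidates, judge_score):
    """Pick the best of several sampled responses using an LLM judge.

    `judge_score(prompt, response) -> float` is a placeholder for a call to
    a judge model; JETTS measures how well such judges guide test-time
    scaling rather than prescribing this exact interface.
    """
    scored = [(judge_score(prompt, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[0][1]  # highest-scoring candidate

# Usage sketch: sample N candidates from the generator, then rerank.
# best = rerank_with_judge(prompt, [generate(prompt) for _ in range(8)], judge_score)
```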

Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet

In the paper "Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet," the authors investigate the ability of LLMs to rank the harmfulness of smaller LLMs. The authors demonstrate that current LLMs are not yet capable of ranking the harmfulness of smaller LLMs, and propose several approaches to improve the performance of LLMs on this task.

Date: 2025-04-21 Comment: None
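
One standard way to quantify how well an LLM's harmfulness ranking matches a reference ranking is rank correlation. The sketch below uses SciPy and assumes both rankings are already available as lists; the rankings themselves are made-up for illustration.

```python
from scipy.stats import spearmanr

# Hypothetical harmfulness rankings of five smaller models (1 = most harmful).
reference_rank = [1, 2, 3, 4, 5]   # e.g. from human red-teaming results
llm_judge_rank = [2, 1, 5, 3, 4]   # ranking produced by the judge LLM

rho, p_value = spearmanr(reference_rank, llm_judge_rank)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# Low or unstable correlations would be consistent with the paper's
# "not there yet" conclusion.
```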

MR. Guard: Multilingual Reasoning Guardrail using Curriculum Learning

In the paper "MR. Guard: Multilingual Reasoning Guardrail using Curriculum Learning," the authors propose a new approach to multilingual reasoning guardrail using curriculum learning. The approach uses a combination of machine learning and causal inference techniques to identify causal relationships and provide explanations for its findings. The authors demonstrate the effectiveness of their approach by evaluating it on several real-world.

Date: 2025-04-21 Comment: None
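
For readers unfamiliar with curriculum learning, here is a generic scheduler sketch that orders examples from easy to hard and widens the training pool over epochs. The difficulty signal used by MR. Guard may be entirely different; `difficulty` here is a placeholder.

```python
def curriculum_batches(examples, difficulty, epochs, batch_size=16):
    """Yield batches that gradually include harder examples.

    `difficulty(example) -> float` is a placeholder scoring function
    (e.g. language rarity or prompt complexity); MR. Guard's actual
    curriculum criterion is not reproduced here.
    """
    ordered = sorted(examples, key=difficulty)
    for epoch in range(1, epochs + 1):
        # Expose an increasing prefix of the easy-to-hard ordering.
        cutoff = max(batch_size, len(ordered) * epoch // epochs)
        pool = ordered[:cutoff]
        for start in range(0, len(pool), batch_size):
            yield pool[start:start + batch_size]
```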

EvalAgent: Discovering Implicit Evaluation Criteria from the Web

In the paper "EvalAgent: Discovering Implicit Evaluation Criteria from the Web," the authors propose a new approach to discovering implicit evaluation criteria from the web. The approach uses a combination of machine learning and causal inference techniques to identify causal relationships and provide explanations for its findings. The authors demonstrate the effectiveness of their approach by evaluating it on several real-world datasets.

Date: 2025-04-21 Comment: None

Training on the Test Task Confounds Evaluation and Emergence

In the paper "Training on the Test Task Confounds Evaluation and Emergence," the authors investigate the effect of training on the test task on evaluation and emergence. The authors demonstrate that training on the test task can confound evaluation and emergence, and propose several approaches to improve the performance of LLMs on this task.

Date: 2025-04-21 Comment: ICLR 2025 (Oral)

Integrating Symbolic Execution into the Fine-Tuning of Code-Generating LLMs

In the paper "Integrating Symbolic Execution into the Fine-Tuning of Code-Generating LLMs," the authors propose a new approach to integrating symbolic execution into the fine-tuning of code-generating LLMs. The approach uses a combination of machine learning and causal inference techniques to identify causal relationships and provide explanations for its findings. The authors demonstrate the effectiveness of their approach by evaluating it on several real-world datasets.

Date: 2025-04-21 Comment: None
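
As a hedged sketch of one plausible integration, symbolic checks could be used to filter model-generated solutions before they enter a fine-tuning set. Both `generate` and `symbolic_check` are placeholders, and the paper's actual integration (for example, as a training signal rather than a filter) may differ.

```python
def build_finetuning_set(problems, generate, symbolic_check):
    """Collect (prompt, completion) pairs whose generated code passes a
    symbolic-execution check.

    `generate(prompt) -> code` and `symbolic_check(code, spec) -> bool`
    are placeholders for a code LLM and a symbolic-execution backend.
    """
    kept = []
    for problem in problems:
        code = generate(problem["prompt"])
        if symbolic_check(code, problem["spec"]):
            kept.append({"prompt": problem["prompt"], "completion": code})
    return kept
```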

Compute-Optimal LLMs Provably Generalize Better With Scale

In the paper "Compute-Optimal LLMs Provably Generalize Better With Scale," the authors propose a new approach to compute-optimal LLMs that provably generalize better with scale. The approach uses a combination of machine learning and causal inference techniques to identify causal relationships and provide explanations for its findings. The authors demonstrate the effectiveness of their approach by evaluating it on several real-world datasets.

Date: 2025-04-21 Comment: ICLR 2025

Support Evaluation for the TREC 2024 RAG Track: Comparing Human versus LLM Judges

In the paper "Support Evaluation for the TREC 2024 RAG Track: Comparing Human versus LLM Judges," the authors propose a new approach to support evaluation for the TREC 2024 RAG track. The approach uses a combination of machine learning and causal inference techniques to identify causal relationships and provide explanations for its findings. The authors demonstrate the effectiveness of their approach by evaluating it on several real-world datasets.

Date: 2025-04-21 Comment: Accepted at SIGIR 2025 (short)
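
Agreement between human and LLM judges is commonly summarized with Cohen's kappa. The sketch below uses scikit-learn and assumes binary supported/not-supported labels, which may not match the track's actual label scheme; the label values are invented for illustration.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical support labels for ten (answer sentence, cited passage) pairs:
# 1 = the passage supports the sentence, 0 = it does not.
human_labels = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
llm_labels   = [1, 1, 0, 0, 0, 1, 1, 1, 0, 1]

kappa = cohen_kappa_score(human_labels, llm_labels)
print(f"Human-vs-LLM agreement (Cohen's kappa): {kappa:.2f}")
```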


RAG Research Advances

The field of retrieval-augmented generation (RAG) has also seen significant advances, with recent papers covering topics such as knowledge graph-based RAG, causal analysis, and security. Several of these papers are discussed in the Q&A section below.

Q&A: Latest 15 Papers - April 22, 2025

In this article, we will answer some of the most frequently asked questions about the latest 15 papers in the field of large language models (LLMs) and retrieval-augmented generation (RAG).

Q: What is the main contribution of the paper "VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models"?

A: The main contribution of the paper "VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models" is the introduction of a new benchmark for evaluating the visual reasoning capabilities of LLMs. The benchmark consists of a set of tasks that require the model to reason about visual information, such as object detection and scene understanding.

Q: What is the difference between the papers "Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs" and "VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models"?

A: The papers "Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs" and "VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models" both focus on evaluating the visual reasoning capabilities of LLMs. However, the paper "Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs" proposes a new evaluation framework for multi-view understanding in LLMs, while the paper "VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models" introduces a new benchmark for evaluating the visual reasoning capabilities of LLMs.

Q: What is the main contribution of the paper "Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning"?

A: The main contribution of the paper "Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning" is the proposal of a new approach to credit assignment in process reward models. The approach, called stop summation, eliminates the need for summation in credit assignment, resulting in faster and more efficient computation.

Q: What is the difference between the papers "Causal-Copilot: An Autonomous Causal Analysis Agent" and "Interpretable Locomotion Prediction in Construction Using a Memory-Driven LLM Agent With Chain-of-Thought Reasoning"?

A: The papers "Causal-Copilot: An Autonomous Causal Analysis Agent" and "Interpretable Locomotion Prediction in Construction Using a Memory-Driven LLM Agent With Chain-of-Thought Reasoning" both propose new approaches to causal analysis and locomotion prediction in construction. However, the paper "Causal-Copilot: An Autonomous Causal Analysis Agent" proposes a new autonomous causal analysis agent that can reason about causal relationships in data, while the paper "Interpretable Locomotion Prediction in Construction Using a Memory-Driven LLM Agent With Chain-of-Thought Reasoning" proposes a new approach to locomotion prediction in construction using a memory-driven L agent with chain-of-thought reasoning.

Q: What is the main contribution of the paper "ASIDE: Architectural Separation of Instructions and Data in Language Models"?

A: The main contribution of the paper "ASIDE: Architectural Separation of Instructions and in Language Models" is the proposal of a new architectural separation of instructions and data in LLMs. The approach, called ASIDE, separates the instructions and data in the model, resulting in faster and more efficient computation.

Q: What is the difference between the papers "CRUST-Bench: A Comprehensive Benchmark for C-to-safe-Rust Transpilation" and "Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators"?

A: The papers "CRUST-Bench: A Comprehensive Benchmark for C-to-safe-Rust Transpilation" and "Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators" both propose new benchmarks for evaluating the performance of LLMs. However, the paper "CRUST-Bench: A Comprehensive Benchmark for C-to-safe-Rust Transpilation" proposes a new comprehensive benchmark for C-to-safe-Rust transpilation, while the paper "Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators" proposes a new benchmark for evaluating LLMs as judges.

Q: What is the main contribution of the paper "Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet"?

A: The main contribution of the paper "Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet" is the investigation of the ability of LLMs to rank the harmfulness of smaller LLMs. The authors demonstrate that current LLMs are not yet capable of ranking the harmfulness of smaller LLMs, and propose several approaches to improve the performance of LLMs on this task.

Q: What is the difference between the papers "MR. Guard: Multilingual Reasoning Guardrail using Curriculum Learning" and "EvalAgent: Discovering Implicit Evaluation Criteria from the Web"?

A: The papers "MR. Guard: Multilingual Reasoning Guardrail using Curriculum Learning" and "EvalAgent: Discovering Implicit Evaluation Criteria from the Web" both propose new approaches to multilingual reasoning guardrail and implicit evaluation criteria. However, the paper "MR. Guard: Multilingual Reasoning Guardrail using Curriculum Learning" proposes a new approach to multilingual reasoning guardrail using curriculum learning, while the paper "EvalAgent: Discovering Implicit Evaluation Criteria from the Web" proposes a new approach to discovering implicit evaluation criteria from the web.

Q: What is the main contribution of the paper "Training on the Test Task Confounds Evaluation and Emergence"?

A: The main contribution of the paper "Training on the Test Task Confounds Evaluation and Emergence" is the investigation of the effect of training on the test task on evaluation and emergence. The authors demonstrate that training on the test task can confound evaluation and emergence, and propose several approaches to improve the performance of LLMs on this task.

Q: What is the difference between the papers "Integrating Symbolic Execution into the Fine-Tuning of Code-Generating LLMs" and "Compute-Optimal LLMs Provably Generalize Better With Scale"?

A: The papers "Integrating Symbolic Execution into the Fine-Tuning of Code-Generating LLMs" andCompute-Optimal LLMs Provably Generalize Better With Scale" both propose new approaches to integrating symbolic execution and compute-optimal LLMs. However, the paper "Integrating Symbolic Execution into the Fine-Tuning of Code-Generating LLMs" proposes a new approach to integrating symbolic execution into the fine-tuning of code-generating LLMs, while the paper "Compute-Optimal LLMs Provably Generalize Better With Scale" proposes a new approach to compute-optimal LLMs that provably generalize better with scale.

Q: What is the main contribution of the paper "Support Evaluation for the TREC 2024 RAG Track: Comparing Human versus LLM Judges"?

A: The main contribution of the paper "Support Evaluation for the TREC 2024 RAG Track: Comparing Human versus LLM Judges" is the proposal of a new approach to support evaluation for the TREC 2024 RAG track. The approach uses a combination of machine learning and causal inference techniques to identify causal relationships and provide explanations for its findings.

Q: What is the difference between the papers "The Great Nugget Recall: Automating Fact Extraction and RAG Evaluation with Large Language Models" and "AlignRAG: An Adaptable Framework for Resolving Misalignments in Retrieval-Aware Reasoning of RAG"?

A: The papers "The Great Nugget Recall: Automating Fact Extraction and RAG Evaluation with Large Language Models" and "AlignRAG: An Adaptable Framework for Resolving Misalignments in Retrieval-Aware Reasoning of RAG" both propose new approaches to fact extraction and RAG evaluation. However, the paper "The Great Nugget Recall: Automating Fact Extraction and RAG Evaluation with Large Language Models" proposes a new approach to automating fact extraction and RAG evaluation with large language models, while the paper "AlignRAG: An Adaptable Framework for Resolving Misalignments in Retrieval-Aware Reasoning of RAG" proposes a new approach to resolving misalignments in retrieval-aware reasoning of RAG.

Q: What is the main contribution of the paper "FinSage: A Multi-aspect RAG System for Financial Filings Question Answering"?

A: The main contribution of the paper "FinSage: A Multi-aspect RAG System for Financial Filings Question Answering" is the proposal of a new multi-aspect RAG system for financial filings question answering. The system uses a combination of machine learning and causal inference techniques to identify causal relationships and provide explanations for its findings.

Q: What is the difference between the papers "Detecting Malicious Source Code in PyPI Packages with LLMs: Does RAG Come in Handy?" and "RAG Without the Lag: Interactive Debugging for Retrieval-Augmented Generation Pipelines"?

A: The papers "Detecting Malicious Source Code in PyPI Packages with LLMs: Does RAG Come in Handy?" and "RAG Without the Lag: Interactive Debugging for Retrieval-Augmented Generation Pipelines" both propose new approaches to detecting malicious source code and interactive debugging for RAG pipelines. However, the paper "Detecting Malicious Source Code in PyPI Packages