Distilling Desired Comments for Enhanced Code Review with Large Language Models
Recent research has increasingly leveraged large language models (LLMs) to automate code review. However, existing LLM-based methods struggle to generate Desired Review Comments (DRCs), i.e., comments that actually lead to code modifications. This paper proposes Desiview, a method that automatically identifies DRCs in code review datasets and uses them to construct high-quality training sets. Fine-tuning and aligning LLaMA on these datasets yields substantial improvements in DRC generation capability.
Background:
Code review is a critical part of software development but imposes a heavy burden on reviewers. Automating this process with LLMs has gained interest. However, existing LLM-based approaches sometimes generate comments that do not result in code changes, limiting practical usefulness. More effective code review automation requires LLMs that generate comments genuinely leading to code modifications.
Figure 2: Development process of Desiview4FT and Desiview4FA
A review comment $R$ is generated from the original code commit $C_o$, modeled as $P(R \mid C_o)$.
Developers respond to $R$ by making code changes $C_r$, modeled as $P(C_r \mid C_o, R)$.
DRCs are comments that lead to code changes, scored by a desirability score (DS):

$$DS = -\bigl(\mathrm{PPL}(C_r \mid C_o, R) - \mathrm{PPL}(C_r \mid C_o)\bigr)$$

where $\mathrm{PPL}(\cdot)$ denotes perplexity. If $DS > 0$, conditioning on the comment lowered the perplexity of the revised code, so the comment is judged to contribute to code modification and classified as a DRC.
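The DS computation above can be sketched as follows; the per-token log-probabilities would come from a language model scoring $C_r$ with and without the comment in the context (the toy values at the bottom are assumptions for illustration):

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token log-probabilities: exp(-mean log p)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def desirability_score(logprobs_with_comment, logprobs_without_comment):
    """DS = -(PPL(C_r | C_o, R) - PPL(C_r | C_o)).

    Positive DS means the comment R makes the revised code C_r easier
    to predict, so R is classified as a Desired Review Comment (DRC).
    """
    ppl_with = perplexity(logprobs_with_comment)
    ppl_without = perplexity(logprobs_without_comment)
    return -(ppl_with - ppl_without)

# Toy example: conditioning on the comment raises each token's probability,
# so perplexity drops (1.25 vs 2.0) and DS is positive.
with_r = [math.log(0.8)] * 5      # hypothetical per-token log p(C_r | C_o, R)
without_r = [math.log(0.5)] * 5   # hypothetical per-token log p(C_r | C_o)
ds = desirability_score(with_r, without_r)
print(ds > 0)  # True: the comment is judged to contribute to the modification
```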
| Method | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| 10-line rule | 58.33% | 51.92% | 100.00% | 68.35% |
| gpt3.5-turbo-0125 | 68.00% | 60.71% | 81.85% | 69.72% |
| gpt-4o-0513 | 76.50% | 79.72% | 64.07% | 71.05% |
| Desiview | 86.67% | 88.93% | 80.37% | 84.44% |

Table 3: Performance comparison in DRC identification
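As a sanity check on Table 3, the F1-score is the harmonic mean of precision and recall; recomputing it from the Desiview row reproduces the reported value up to input rounding:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Desiview row from Table 3: precision 88.93%, recall 80.37%.
f1 = f1_score(0.8893, 0.8037)
print(round(100 * f1, 2))  # close to the reported 84.44% (inputs are rounded)
```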
| Method | BLEU-4 | Human Position (%) | Human Perfect (%) |
|---|---|---|---|
| LLaMA-Reviewer (LLaMA-3) | 8.33 | 70.33 | 16.67 |
| Desiview4FT (LLaMA-3) | 11.87 (+42.5%) | 76.67 (+9.01%) | 18.33 (+9.96%) |
| Desiview4FA (LLaMA-3) | 13.13 (+57.62%) | 80.00 (+13.75%) | 18.67 (+12.00%) |
| LLaMA-Reviewer (LLaMA-3.1) | 6.86 | 68.67 | 12.67 |
| Desiview4FT (LLaMA-3.1) | 12.48 (+81.92%) | 78.67 (+14.56%) | 16.00 (+26.28%) |
| Desiview4FA (LLaMA-3.1) | 13.57 (+97.81%) | 79.00 (+15.04%) | 16.67 (+31.57%) |

Table 4: Performance on code review comment generation
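The parenthesized percentages in Table 4 are relative gains over the LLaMA-Reviewer baseline; for example, the LLaMA-3 BLEU-4 column checks out:

```python
def relative_gain(new, base):
    """Percentage improvement of `new` over `base`."""
    return 100 * (new - base) / base

# BLEU-4 gains over the LLaMA-Reviewer baseline (Table 4, LLaMA-3 rows).
print(round(relative_gain(11.87, 8.33), 1))   # Desiview4FT: 42.5
print(round(relative_gain(13.13, 8.33), 2))   # Desiview4FA: 57.62
```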
2025-08-08
This paper proposes a novel method called Query-Guided Activation Refilling (ACRE) aimed at improving efficiency and performance in processing long contexts within Large Language Models (LLMs). By combining a two-layer KV cache and query-guided refilling, the approach enables processing of contexts beyond the native context window, significantly enhancing the practicality of long-context information retrieval.
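A toy sketch of the query-guided refilling idea described above (function names, scoring, and budget are assumptions, not the paper's API): keep a compact summary for every chunk of a long context, and refill the detailed cache only for chunks relevant to the current query, so the detailed cache stays within a fixed budget.

```python
def refill(chunks, query_terms, budget=2):
    """Score each chunk by query-term overlap and keep the top `budget`.

    Stands in for query-guided selection of which cached context segments
    get their full (detailed) KV entries restored.
    """
    scored = sorted(
        chunks,
        key=lambda c: -len(set(c.split()) & set(query_terms)),
    )
    return scored[:budget]

chunks = ["kv cache layers", "unrelated filler text", "query guided refilling"]
selected = refill(chunks, ["query", "refilling"])
print(selected[0])  # the most query-relevant chunk is refilled first
```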
2025-08-08
This paper proposes a hybrid Top-k recommendation system that combines traditional recommendation methods with large language models (LLMs). Users are split into "active users" and "weak users," and LLMs are applied to the latter group to improve recommendation accuracy and fairness, while LLM computational costs are capped to keep the system practical.
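The routing idea can be sketched minimally as follows; the threshold, function names, and toy recommenders are assumptions for illustration, not the paper's design:

```python
ACTIVITY_THRESHOLD = 10  # hypothetical cutoff for counting a user as "active"

def recommend(user_history, traditional_rec, llm_rec, k=5):
    """Route Top-k recommendation by user activity level.

    Active users take the low-cost traditional path; weak (sparse-history)
    users are routed to the LLM-based recommender.
    """
    if len(user_history) >= ACTIVITY_THRESHOLD:
        return traditional_rec(user_history, k)
    return llm_rec(user_history, k)

# Toy recommenders for illustration only.
trad = lambda hist, k: [f"trad-{i}" for i in range(k)]
llm = lambda hist, k: [f"llm-{i}" for i in range(k)]

print(recommend(list(range(20)), trad, llm)[0])  # active user -> "trad-0"
print(recommend([1, 2], trad, llm)[0])           # weak user   -> "llm-0"
```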
2025-08-09
Rork is an AI tool that generates native mobile apps from natural-language descriptions and supports building and deploying them to the App Store and Google Play. In this article, we share the process and our impressions of feeding the requirements of a home life management app into Rork and testing the flow from "generation → functional check → store preparation."