RECOMP: IMPROVING RETRIEVAL-AUGMENTED LMS
WITH COMPRESSION AND SELECTIVE AUGMENTATION

Fangyuan Xu¹, Weijia Shi², Eunsol Choi¹
¹Department of Computer Science, The University of Texas at Austin
²University of Washington
{fangyuan,eunsol}@utexas.edu, [email protected]
ABSTRACT
Retrieving documents and prepending them in-context at inference time improves the performance of language models (LMs) on a wide range of tasks. However, these documents, often spanning hundreds of words, make inference substantially more expensive. We propose compressing the retrieved documents into textual summaries prior to in-context integration. This not only reduces the computational costs but also relieves the burden on LMs of identifying relevant information in long retrieved documents. We present two compressors: an extractive compressor, which selects useful sentences from retrieved documents, and an abstractive compressor, which generates summaries by synthesizing information from multiple documents. Both compressors are trained to improve LMs' performance on end tasks when the generated summaries are prepended to the LMs' input, while keeping the summary concise. If the retrieved documents are irrelevant to the input or offer no additional information to the LM, our compressor can return an empty string, implementing selective augmentation. We evaluate our approach on a language modeling task and open-domain question answering tasks. We achieve a compression rate as low as 6% with minimal loss in performance on both tasks, significantly outperforming off-the-shelf summarization models. We show that our compressors trained for one LM can transfer to other LMs on the language modeling task and provide summaries largely faithful to the retrieved documents.¹
1 INTRODUCTION
Retrieval-augmented language models (RALMs) (Khandelwal et al., 2019; Izacard et al., 2022; Lewis
et al., 2020; Borgeaud et al., 2022) have shown impressive performance on knowledge-intensive tasks
(Kwiatkowski et al., 2019; Petroni et al., 2021). Simply prepending retrieved documents to the input
without updating the language models (LMs) (Shi et al., 2023b; Ram et al., 2023; Si et al., 2022)
allows retrieval augmentation even for black-box LMs, but such an approach comes with limitations.
First, it increases computational costs as LMs now encode substantially more tokens. Second, even if
we manage to adapt LMs to efficiently incorporate longer context (Beltagy et al., 2020; Zaheer et al.,
2020), these models struggle to use all information in the context, frequently missing information
placed in the middle (Liu et al., 2023). Third, prepending a large number of documents in-context
can further confuse LMs with irrelevant information, degrading model performance (Mallen et al., 2022; Shi et al., 2023a).
To overcome such limitations, we propose RECOMP (Retrieve, Compress, Prepend), an inter-
mediate step for RALMs which compresses retrieved documents into a textual summary prior to
in-context augmentation. Figure 1 illustrates our approach. The generated summary should be concise
to maximize efficiency, be faithful to the retrieved evidence documents, and guide the RALM to generate the desired output when prepended to the input. To satisfy both the efficiency and effectiveness constraints, our compressor strategically performs selective augmentation, generating an empty summary when the retrieved documents are irrelevant or unhelpful for the target task.
¹ Our code is available at https://github.com/carriex/recomp.
RECOMP during inference
moved from Smyrna, Tennessee, to
Nissan's facility in Canton, Mississippi.
Early US models include X, S and
PRO-4X, with a choice of 6-speed
manual or 5-speed automatic
transmissions, a choice of [...]
moved from Smyrna, Tennessee, to
Nissan's facility in Canton, Mississippi.
Early US models include X, S and
PRO-4X, with a choice of 6-speed
manual or 5-speed automatic
transmissions, a choice of [...]
moved from Smyrna, Tennessee, to
Nissan's facility in Canton, Mississippi.
Early US models include X, S and
PRO-4X, with a choice of 6-speed
manual or 5-speed automatic
transmissions, a choice of [...]
moved from Smyrna, Tennessee, to
Nissan's facility in Canton, Mississippi.
Early US models include X, S and
PRO-4X, with a choice of 6-speed
manual or 5-speed automatic
transmissions, a choice of [...]
moved from Smyrna, Tennessee,
to Nissan's facility in Canton,
Mississippi. Early US models
include X, S and PRO-4X, with a
choice of 6-speed manual…
Retrieved documents D
RECOMP
(58 tokens)
Retrieve Compress Prepend
No retrieval (0 tokens)
RALM (749 tokens)
2010
2015
Summary
Input query x
when did they
stop making the
nissan xterra?
Blackbox
LM M
2015
Compressor
Figure 1: An illustration of RECOMP, which compresses retrieved documents into a textual summary before prepending it as input to a language model at inference time. The compressed summary guides the LM to generate the correct answer, while significantly reducing the computational cost required to encode the documents.
We propose two compressors: (1) an extractive compressor, which selects relevant sentences from the retrieved document set; and (2) an abstractive compressor, which generates a summary synthesizing information from multiple retrieved documents. Both compressors implement multi-document query-focused summarization (Xu & Lapata, 2020), where we summarize the retrieved evidence document set with respect to the input query. As we aim to enable the RALM to generate the correct output when the summary is prepended to the input query, we design training schemes that optimize end task performance. Our extractive compressor is trained with a contrastive learning objective to identify sentences that lead to target outputs, and our abstractive compressor is distilled (West et al., 2022) from an extreme-scale LM (e.g., GPT-3), which achieves impressive summarization performance.
Our experiments show that RECOMP can improve the performance of frozen LMs on language modeling (Merity et al., 2016) and three question answering datasets (Natural Questions (Kwiatkowski et al., 2019), TriviaQA (Joshi et al., 2017) and HotpotQA (Yang et al., 2018)), while prepending significantly fewer tokens than RALM without compression. We present two oracle compression methods: an extractive oracle, which selects the sentence in the evidence documents that leads to the best task performance, and an abstractive oracle, which chooses between a summary generated by an extreme-scale LLM (e.g., GPT-3) and no retrieval augmentation, whichever leads to the best task performance. Both oracle methods achieve a compression rate as low as 6% and significantly outperform prepending full documents. Our trained compressors also show promising results. For language modeling, both trained compressors achieve a compression ratio of 25% with minimal performance drop. When applied to the QA datasets, our best model compresses the documents to 5-10% of the original tokens with less than a 10% relative performance drop. We conclude with careful analyses of our approach that reveal both its strengths and weaknesses, thereby building a foundation for future work.
2 PROBLEM FORMULATION: RECOMP
Given an input sequence x, a target output sequence y and a set of N retrieved documents D ([d_1, d_2, ..., d_N]),² RECOMP compresses the retrieved documents D with respect to x into a summary s which captures the core information in D relevant to x with significantly fewer tokens than D. Our architecture consists of two modules: a compressor c_θ and an LM M. In this work, we assume a blackbox LM and train the compressor. Given the set of N retrieved documents ([d_1, d_2, ..., d_N]) and the input sequence x, the compressor returns a token sequence s. We design our compressor to be substantially smaller than the LM M, as we aim to reduce the computational cost of encoding a set of retrieved documents. The output of the compressor, s, should be: (1) Concise: the summary should be as short as possible to optimize efficiency. If the retrieved documents do not contain relevant information or retrieval augmentation is not necessary, s can be an empty sequence. (2) Effective: when s is prepended to the input sequence x and provided to the LM M as a prompt, the LM should generate the target output sequence y. (3) Faithful: s should be a faithful and interpretable summary of the input document set (i.e., s must be entailed by the input document set [d_1, d_2, ..., d_N]). We focus on training compressors for conciseness and effectiveness. We summarize the key ideas for our two compressors, the extractive compressor and the abstractive compressor, here, and discuss their training schemes formally in Section 3.
² Improving the retriever is not the focus of this work, so we assume a set of retrieved documents is provided.
Extractive Compressor  Given n sentences [s_1, s_2, ..., s_n] in the input document set ([d_1, d_2, ..., d_N]), we train a dual encoder model enc_θ which embeds a sentence s_i and the input sequence x into fixed-dimensional embeddings. Their inner product represents how helpful it would be for the LM M to prepend s_i to the input x when generating y. The final summary s from the compressor is a concatenation of the top N sentences ranked by their inner product with the input. As this approach is extractive, we assume the faithfulness criterion is mostly satisfied.³
Abstractive Compressor  We train an encoder-decoder model encdec_θ to serve as an abstractive compressor, which takes the input sequence x and a concatenation of the retrieved document set D ([d_1; d_2; ...; d_N]) and outputs a summary s. Although we do not have human annotations to train this model, prior work (Goyal et al., 2022; Chen et al., 2023; Potluri et al., 2023) suggests that extreme-scale LMs can generate good query-focused summaries when prompted carefully. Yet, using an extreme-scale model as the compressor is not desirable, as we want the compressor to be substantially smaller than the LM. Thus, we perform distillation (Hinton et al., 2015) from extreme-scale LMs to build a lightweight abstractive compressor encdec_θ. We do not train specifically for faithfulness, but manually evaluate faithfulness later in Section 6.
3 LEARNING THE COMPRESSORS
Our compressor resembles text summarization models in that its output should be faithful to the original input, yet the main goal is different. Instead of capturing salient information for human readers, compressors aim to produce concise text that is useful for an LM on an end task. In this section, we describe how to train the extractive compressor (§3.1) and the abstractive compressor (§3.2) by leveraging end task signals. Further training details can be found in Appendix A.2.
3.1 EXTRACTIVE COMPRESSION
As we formulate extractive compression as a ranking problem, training the extractive compressor resembles training a reranker for the retrieved documents,⁴ with two differences. First, our compressor considers a different granularity of input (sentences) than the initial retrieval unit (paragraphs). Second, each sentence is evaluated based on whether it is useful as input to the LM M on the downstream task (Shi et al., 2023b; Ram et al., 2023).
Input: Base LM M, compressor enc_θ, training data {x_i, S_i, y_i}_1^T where x_i is the input, S_i = {s_j}_1^n is a set of candidate sentences from the retrieved documents for x_i, y_i is the target answer, and a score threshold ϵ.
Output: An updated extractive compressor encoder enc_θ
1:  T ← ∅
2:  for i ∈ {1, . . . , T} do
3:      p_i ← argmax_{s_j ∈ S_i} Score(M, y_i, [s_j; x_i])
4:      L ← ∅
5:      for j ∈ {1, . . . , n} do
6:          if Score(M, y_i, [s_j; x_i]) + ϵ < Score(M, y_i, [p_i; x_i]) then
7:              L ← L ∪ {s_j}
8:      if |L| > 0 then
9:          N_i ← argTop5_{s_j ∈ L} ⟨enc_θ(s_j), enc_θ(x_i)⟩
10:         T ← T ∪ {(x_i, p_i, N_i)}
11: enc_θ = Finetune(enc_θ, T)

Figure 2: Learning an extractive compressor for the language modeling task.
Model  We train a dual-encoder model enc_θ which encodes the input context x and a candidate sentence s_i separately. We obtain embeddings of x and s_i by taking the representation of the [CLS] token of each, and compute their similarity as the inner product of the two. We initialize our model from the Contriever checkpoint (Izacard et al., 2021). This model consists of 110M parameters, satisfying the efficiency desideratum of the compressor.
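To make this concrete, below is a minimal sketch of using such a dual encoder to rank candidate sentences against an input, assuming the HuggingFace facebook/contriever checkpoint and the [CLS]-token pooling described above; the example query and candidate sentences are purely illustrative.

import torch
from transformers import AutoTokenizer, AutoModel

# Minimal sketch of the dual-encoder compressor at inference time (assumptions:
# HuggingFace checkpoint name, [CLS]-token pooling, illustrative inputs).
tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
encoder = AutoModel.from_pretrained("facebook/contriever").eval()

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state   # (batch, seq_len, dim)
    return hidden[:, 0]                               # [CLS] representation

query = "when did they stop making the nissan xterra?"
candidates = [
    "Nissan Xterra: Production of the second generation ended in 2015.",
    "Nissan Xterra: Early US models include X, S and PRO-4X trims.",
]
scores = embed(candidates) @ embed([query]).T          # inner-product helpfulness scores
top = scores.squeeze(-1).topk(k=1).indices.tolist()    # keep the top-ranked sentence(s)
summary = " ".join(candidates[i] for i in top)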
Training  Figure 2 presents pseudocode for training an extractive compressor with a contrastive loss for the language modeling task. For each input query x_i, we identify positive and negative sentences from the retrieved documents.
³ Recent work (Zhang et al., 2022) shows that the extractive approach does not always preserve faithfulness, but such cases are still rare compared to abstractive approaches, which can easily hallucinate.
⁴ Ram et al. (2023) propose a document reranker based on a cross-encoder model, a similar set-up to our sentence selector, but less compute efficient.
For each pair of input sequence x_i and candidate sentence s_j, we measure Score(M, y_i, [s_j; x_i]) = log p_M(y_i | [s_j; x_i]), the log likelihood assigned to the target output by the LM M when the candidate sentence is prepended to the input. We consider the sentence with the highest log likelihood as the positive example p_i (line 3). To construct the negative examples N_i = {n_k}_{k=1}^5, we choose up to five sentences with the top Contriever scores whose log likelihood is lower than that of the positive sentence by at least a threshold ϵ (line 6).
Training a compressor for the QA tasks works similarly, except that scoring evaluates whether the LM generates the correct answer with the summary prepended (a change in line 6). Pseudocode for the QA tasks is in Figure 6 in the Appendix. We train our encoder with a contrastive loss (Karpukhin et al., 2020a), maximizing the similarity of positive pairs (x_i, p_i) and minimizing that of negative pairs (x_i, N_i). The training objective is to minimize −log [ e^{sim(x_i, p_i)} / ( e^{sim(x_i, p_i)} + Σ_{n_j ∈ N_i} e^{sim(x_i, n_j)} ) ].
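As a concrete illustration of this scoring step, here is a hedged sketch of computing Score(M, y_i, [s_j; x_i]) with GPT-2 via HuggingFace Transformers; the helper name and the whitespace-based concatenation are illustrative choices, not the exact released implementation.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Sketch of Score(M, y, [s; x]): log-likelihood of the target continuation y under
# the base LM when candidate sentence s is prepended to the input context x.
lm_tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def score(sentence: str, context: str, target: str) -> float:
    prefix_ids = lm_tok(sentence + " " + context, return_tensors="pt").input_ids
    target_ids = lm_tok(" " + target, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, target_ids], dim=1)
    with torch.no_grad():
        logits = lm(input_ids).logits                      # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # next-token predictions
    next_tokens = input_ids[0, 1:]
    token_lls = log_probs[torch.arange(next_tokens.size(0)), next_tokens]
    return token_lls[prefix_ids.size(1) - 1:].sum().item() # only the target tokens

# The positive p_i is the argmax of this score over candidates; negatives are the
# top-Contriever candidates whose score falls below that of p_i by at least ϵ.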
Data  For the language modeling task, we generate training data using the training split of the WikiText-103 dataset, selecting the top 20 sentences from the top 5 BM25-retrieved documents for each input context x. For the QA tasks, we generate training data using the training split and consider the top 20 sentences from the top 5 contriever-ms-marco⁵ retrieved documents. We report detailed statistics for the training data in Table 5 in the appendix. For each sentence from the retrieved documents, we prepend the Wikipedia page title to it for decontextualization.
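As an illustration of this candidate construction, the sketch below splits retrieved documents into sentences with NLTK (the package the appendix reports using) and prepends the Wikipedia page title; the document schema, the title-prefix format, and truncating to the first 20 candidates are assumptions standing in for the actual top-20 selection.

import nltk

nltk.download("punkt", quiet=True)   # sentence tokenizer model

def build_candidates(retrieved_docs, max_candidates=20):
    """retrieved_docs: list of {"title": str, "text": str} dicts (illustrative schema),
    assumed to be the top-5 documents in descending retrieval-score order."""
    candidates = []
    for doc in retrieved_docs:
        for sent in nltk.sent_tokenize(doc["text"]):
            # Prepend the Wikipedia page title to decontextualize the sentence.
            candidates.append(f'{doc["title"]}: {sent}')
    return candidates[:max_candidates]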
3.2 ABSTRACTIVE COMPRESSION
To train an abstractive compressor, we distill the query-focused summarization ability of an extreme-scale LM by generating a training dataset from it, filtering the generated data, and training an encoder-decoder model on the filtered dataset (West et al., 2022). In contrast to prior work (Jung et al., 2023), which uses intrinsic summarization metrics for filtering, we filter based on the LM's performance on the end task with the generated summaries prepended. Figure 3 presents the pseudo-algorithm for training the abstractive compressor.
3.2.1 CREATING TRAINING DATASET FOR DISTILLATION
Input: Teacher LM M_t, LM M, summarization prompt set {p_i}_1^n, compressor encdec_θ, training data {x_i, D_i, y_i}_1^T where x_i is the input, D_i is the set of retrieved documents for x_i, and y_i is the target answer.
Output: An updated encdec_θ
1:  T ← ∅
2:  for i ∈ {1, . . . , T} do
3:      v_r ← −∞
4:      for j ∈ {1, . . . , n} do
5:          s_j = Decode(M_t, [p_j; x_i; D_i])
6:          v_j = Score(M, y_i, [s_j; x_i])
7:          if v_j > v_r then
8:              s_t ← s_j, v_r ← v_j
9:      v_d = Score(M, y_i, [x_i])
10:     if v_r < v_d then
11:         T ← T ∪ {(x_i, D_i, ∅)}
12:         continue
13:     T ← T ∪ {(x_i, D_i, s_t)}
14: encdec_θ = Finetune(encdec_θ, T)

Figure 3: Learning an abstractive compressor for the language modeling task.
Generation From Teacher Model  For the language modeling task, we manually construct four prompts to summarize the evidence document set ({p_i}_1^n).⁶ Given an input x_i, a retrieved document set D_i, and a prompt p_j to summarize the document set with respect to the input, GPT-3.5⁷ generates a summary (line 5).
Filtering with Critic  After generating a summary from each prompt template, we select the summary that results in the highest end task performance for each example (s_t) as the target summary (lines 4-8). Score(M, y_i, [s_j; x_i]) is the same as for the extractive compressor above. We then compare the end task performance with the target summary prepended and with the input x_i alone (i.e., no retrieval) on the base model M (lines 9-10). If the end task performance gets worse (e.g., perplexity increases) when prepending the summary, we set the target summary to an empty string (line 11); otherwise we add the target summary to the training set (line 13). This allows for selective augmentation and mitigates the risk of prepending irrelevant documents.
⁵ https://huggingface.co/facebook/contriever-msmarco
⁶ The exact prompts can be found in Table 6 in A.2.
⁷ We use gpt-3.5-turbo in all our experiments.
Constructing training datasets for the question answering tasks works similarly, with the following modifications. As summarization for the question answering task is more straightforward, we use a single prompt for each dataset. We filter out examples where prepending the summary does not lead to a performance improvement. Pseudocode for the QA tasks is in Figure 7 in the Appendix.
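To make the generate-then-filter recipe concrete, here is a hedged sketch of the critic filtering loop for the QA setting. The generate_summary helper stands in for the gpt-3.5-turbo call with the dataset-specific prompt, and score stands in for the end-task metric of the base LM M (e.g., exact match of its decoded answer); both are hypothetical placeholders, and the handling of ties reflects the filtering described above rather than the released code.

def build_distillation_data(examples, generate_summary, score):
    """examples: iterable of (question, retrieved_docs, gold_answer) triples.
    generate_summary(question, docs) -> str: hypothetical teacher-LM wrapper.
    score(answer, context, question) -> float: end-task metric of the base LM M."""
    train_set = []
    for question, docs, answer in examples:
        summary = generate_summary(question, docs)            # teacher-generated summary
        with_summary = score(answer, context=summary, question=question)
        no_retrieval = score(answer, context="", question=question)
        if with_summary < no_retrieval:
            # The summary hurts the end task: train on an empty summary instead
            # (selective augmentation).
            train_set.append((question, docs, ""))
        elif with_summary > no_retrieval:
            train_set.append((question, docs, summary))
        # Examples where the summary brings no improvement are filtered out.
    return train_set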
Model & Training We use encoder-decoder LM (775M), initialized from T5-large checkpoint (Raf-
fel et al., 2020). This model has been trained with summarization datasets (Hermann et al., 2015).
Data  We summarize the top 5 retrieved documents for both the language modeling and question answering tasks. We generate training examples using 2% of the training set of the WikiText-103 dataset. We generate training examples from the entire NQ and TriviaQA training sets. For HotpotQA, we only generate summaries for training data where the gold answer is in the retrieved documents (56% of the training data) to reduce API costs. We report the percentage of data filtered and the percentage of empty summaries in Table 5 in A.1.
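For illustration, below is a hedged sketch of the final distillation step: fine-tuning a T5 compressor on the distilled (query + documents → summary) pairs. The t5-large checkpoint name, the input formatting string, and the bare single-example training loop are assumptions for the sketch; the actual optimizer and hyperparameters are reported in A.2.

import torch
from transformers import T5TokenizerFast, T5ForConditionalGeneration

# Hedged sketch: fine-tune a T5-large compressor on distilled (query + documents)
# -> summary pairs. Input format and the manual loop are illustrative only.
tok = T5TokenizerFast.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

def training_step(question, documents, target_summary):
    source = f"Question: {question} Documents: {' '.join(documents)}"  # assumed format
    enc = tok(source, truncation=True, max_length=1024, return_tensors="pt")
    labels = tok(target_summary, truncation=True, max_length=128,
                 return_tensors="pt").input_ids      # empty target -> learn to emit ""
    loss = model(input_ids=enc.input_ids,
                 attention_mask=enc.attention_mask,
                 labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()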
4 EXPERIMENTAL SETTINGS
We evaluate our approach on language modeling and open-domain QA following prior work (Shi
et al., 2023b; Ram et al., 2023). For both tasks, we report the task performance as a measure of
effectiveness and the number of tokens provided in context as a measure of efficiency.
4.1 LANGUAGE MODELING
We evaluate language modeling perplexity on the WikiText-103 (Merity et al., 2016) benchmark with three open-sourced LMs of varying scale: GPT2 (117M), GPT2-XL (1.5B; Radford et al. (2019)) and GPT-J (6B; Wang & Komatsuzaki (2021)). We train our compressors using GPT2 as the base model and evaluate whether the trained compressors transfer to GPT2-XL and GPT-J. We use the BM25 retriever (Robertson & Zaragoza, 2009) to retrieve from the Wikipedia corpus from Dec. 20, 2018 (Karpukhin et al., 2020a). The articles are truncated into non-overlapping documents of 100 words. During retrieval, articles containing the input sequence x are removed from the corpus to prevent data contamination. Following Ram et al. (2023), we perform retrieval every 32 tokens.
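As an illustration of this evaluation protocol, the sketch below scores each 32-token stride with GPT-2 after prepending whatever in-context evidence was produced for that stride; retrieve_and_compress is a hypothetical stand-in for BM25 retrieval plus compression, and the newline-based concatenation is an assumption.

import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Hedged sketch of stride-based evaluation: retrieval happens every `stride` tokens,
# the compressed summary is prepended, and perplexity is measured only on the next
# `stride` tokens. `retrieve_and_compress` is a hypothetical retrieval+RECOMP stub.
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def stride_perplexity(text, retrieve_and_compress, stride=32):
    ids = tok(text, return_tensors="pt").input_ids[0]
    total_ll, n_tokens = 0.0, 0
    for start in range(stride, len(ids), stride):
        context_ids = ids[:start]
        summary = retrieve_and_compress(tok.decode(context_ids))  # "" means no augmentation
        summary_ids = (tok(summary + "\n", return_tensors="pt").input_ids[0]
                       if summary else ids[:0])
        full = torch.cat([summary_ids, context_ids, ids[start:start + stride]]).unsqueeze(0)
        with torch.no_grad():
            log_probs = torch.log_softmax(lm(full).logits[0, :-1], dim=-1)
        next_tokens = full[0, 1:]
        token_lls = log_probs[torch.arange(len(next_tokens)), next_tokens]
        n_target = len(ids[start:start + stride])
        total_ll += token_lls[-n_target:].sum().item()            # score only the stride
        n_tokens += n_target
    return math.exp(-total_ll / n_tokens)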
4.2 OPEN-DOMAIN QA
Datasets  We evaluate our model on three benchmark datasets: Natural Questions (NQ) (Kwiatkowski et al., 2019), TriviaQA (Joshi et al., 2017) and HotpotQA (Yang et al., 2018). We report results on the development set of NQ, the test set of TriviaQA, and 500 randomly sampled examples from the HotpotQA development set. We report Exact Match (EM) and token-level F1 of the answer strings to measure end task performance.
Base Language Models & Retrieval Corpus  We use Flan-UL2 (20B) (Chung et al., 2022), a large-scale instruction-tuned LM. We use the Contriever model trained on the MS MARCO dataset (Campos et al., 2016) as the retriever over the Wikipedia corpus from Dec. 20, 2018 for all three datasets. The articles are truncated into non-overlapping documents of 100 words.
Prompt Format  We include few-shot in-context examples in the prompt, followed by the retrieved documents and the question. We use five randomly sampled training examples as in-context examples, which constitute 110, 147, and 149 tokens on average for NQ, TQA and HotpotQA respectively. We concatenate the retrieved documents in ascending order of retrieval score, with the highest-scored document closest to the question (Si et al., 2022). We do not include retrieved documents for the in-context examples, as it did not improve performance. An example input can be found in Appendix Table 7.
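A minimal sketch of how such a prompt might be assembled follows, with the retrieved (or compressed) evidence placed in ascending order of retrieval score so the highest-scored text sits closest to the question; the field labels are assumptions, not the paper's verbatim template (see Table 7 in the appendix for the real format).

def build_prompt(few_shot_examples, evidence_docs, question):
    """few_shot_examples: list of (question, answer) pairs; evidence_docs: retrieved
    documents (or a single compressed summary), sorted by descending retrieval score."""
    lines = []
    for q, a in few_shot_examples:                    # in-context examples, no evidence
        lines.append(f"Question: {q}\nAnswer: {a}\n")
    # Concatenate evidence in ascending score order: highest-scored closest to the question.
    for doc in reversed(evidence_docs):
        lines.append(doc)
    lines.append(f"Question: {question}\nAnswer:")
    return "\n".join(lines)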
4.3 BASELINES AND ORACLES
Baselines  We first consider two heuristic token- and phrase-level compression methods: BoW, which converts the retrieved documents to an ordered list of unigrams and concatenates them together
Table 1: Results on the language modeling task. We report results on GPT2, GPT2-XL and GPT-J with compressors trained with GPT2. GPT2 is in-domain for the trained compressors; GPT2-XL and GPT-J are out-of-domain.

In-context evidence                          | GPT2 (117M) # tokens / PPL | GPT2-XL (1.5B) # tokens / PPL | GPT-J (6B) # tokens / PPL
None                                         | 0 / 37.84                  | 0 / 19.89                     | 0 / 11.44
RALM without compression
  Top 1 document                             | 141 / 32.90                | 141 / 17.86                   | 141 / 10.57
  Top 5 documents                            | 512 / 35.53                | - / -                         | - / -
Phrase/token level compression
  Top 1 document (BoW)                       | 66 / 36.13                 | 66 / 18.85                    | 66 / 10.97
  Top 1 document (NE)                        | 34 / 37.23                 | 33 / 19.67                    | 33 / 11.39
Extractive compression of top 5 documents (select top 1 sentence)
  Oracle                                     | 32 / 30.36                 | 32 / 16.58                    | 31 / 9.92
  Oracle (w/ gpt2)                           | 32 / 30.36                 | 32 / 16.99                    | 32 / 10.22
  Random                                     | 27 / 36.98                 | 27 / 19.55                    | 27 / 11.32
  BM25                                       | 33 / 36.63                 | 33 / 19.02                    | 33 / 11.08
  Contriever                                 | 33 / 35.54                 | 33 / 18.98                    | 33 / 11.05
  Ours (init. w/ Contriever)                 | 31 / 33.67                 | 31 / 18.19                    | 31 / 10.73
Abstractive compression of top 5 documents
  Oracle                                     | 68 / 30.67                 | 66 / 16.87                    | 65 / 10.10
  Oracle (w/ gpt2)                           | 68 / 30.67                 | 68 / 17.23                    | 68 / 10.37
  GPT-3.5                                    | 33 / 34.84                 | 33 / 18.70                    | 33 / 10.96
  T5                                         | 15 / 37.80                 | 15 / 19.92                    | 15 / 11.5
  Ours (init. w/ T5)                         | 15 / 33.64                 | 15 / 18.09                    | 15 / 10.66
and Named Entities (NE), which extracts a list of ordered named entities from the retrieved documents and concatenates them. For the extractive compressor on the language modeling task, we use BM25 and Contriever (Izacard et al., 2021), which rank the sentences by their similarity to the input x, as baselines. For the QA datasets, we report results using BM25, Contriever fine-tuned on MS MARCO, and DPR (Karpukhin et al., 2020b) fine-tuned on NQ. We also report a Random baseline which randomly selects a sentence from the retrieved documents. For abstractive compression, we report the performance of the off-the-shelf T5 (large, 770M) model and that of the GPT-3.5 model. As we experimented with multiple prompts for the language modeling task, we report the performance of the summaries generated by the GPT-3.5 model with the best single prompt.
Oracle  We explore the performance upper bound of compression by considering two oracle approaches. For the extractive approach, we construct an oracle compressor by considering all sentences s_i in the evidence document set and choosing the sentence that leads to the best end task performance (i.e., lowest perplexity or highest answer accuracy) for each example. For the abstractive approach, we consider the summaries generated from different prompts ({s_j}_1^n in Figure 3) and the empty summary, and choose the one that leads to the best end task performance. As oracle compression is model dependent, we also report model-independent results by always using GPT-2 as a reference LM (Oracle w/ gpt2) to test how well oracle sentences for one model transfer to other models on the language modeling task.
5 RESULTS
Language modeling  Table 1 reports the results on the language modeling task. All retrieval augmentation methods improve perplexity over the no-retrieval setting across the three LMs. Heuristic token/phrase-level compression methods (BoW and NE) are worse than prepending uncompressed documents, potentially due to the disfluency of the prepended text.
Both oracle settings show substantial gains over prepending the entire document set, with only 6-13% of the tokens. More tokens are not always better: prepending the top 1 document outperforms prepending the top 5 documents. This confirms that the naive retrieve-and-prepend approach has significant room for improvement, as prepending irrelevant documents can hurt performance.
Table 2: Open-domain QA results with Flan-UL2 (20B) as the LM M. We report the number of tokens provided as in-context evidence, excluding the in-context examples. We train separate compressors (one extractive, one abstractive) for each dataset. The extractive compressor selects one sentence for NQ/TQA and two sentences for HotpotQA.

In-context evidence             | NQ # tok / EM / F1   | TQA # tok / EM / F1  | HotpotQA # tok / EM / F1
None                            | 0 / 21.99 / 29.38    | 0 / 49.33 / 54.85    | 0 / 17.80 / 26.10
RALM without compression
  Top 1 document                | 132 / 33.07 / 41.45  | 136 / 57.84 / 64.94  | 138 / 28.80 / 40.58
  Top 5 documents                | 660 / 39.39 / 48.28  | 677 / 62.37 / 70.09  | 684 / 32.80 / 43.90
Phrase/token level compression
  Top 5 documents (NE)          | 338 / 23.60 / 31.02  | 128 / 54.96 / 61.19  | 157 / 22.20 / 31.89
  Top 5 documents (BoW)         | 450 / 28.48 / 36.84  | 259 / 58.16 / 65.15  | 255 / 25.60 / 36.00
Extractive compression of top 5 documents
  Oracle                        | 34 / 60.22 / 64.25   | 32 / 79.29 / 82.06   | 70 / 41.80 / 51.07
  Random                        | 32 / 23.27 / 31.09   | 31 / 50.18 / 56.24   | 61 / 21.00 / 29.86
  BM25                          | 36 / 25.82 / 33.63   | 37 / 54.67 / 61.19   | 74 / 26.80 / 38.02
  DPR                           | 39 / 34.32 / 43.38   | 41 / 56.58 / 62.96   | 78 / 27.40 / 38.15
  Contriever                    | 36 / 30.06 / 31.92   | 40 / 53.67 / 60.01   | 78 / 28.60 / 39.48
  Ours                          | 37 / 36.57 / 44.22   | 38 / 58.99 / 65.26   | 75 / 30.40 / 40.14
Abstractive compression of top 5 documents
  Oracle                        | 51 / 45.68 / 53.66   | 37 / 71.01 / 76.38   | 102 / 35.80 / 46.25
  GPT-3.5                       | 56 / 37.12 / 46.35   | 41 / 62.03 / 69.66   | 107 / 31.60 / 42.65
  T5                            | 10 / 25.90 / 34.63   | 7 / 55.18 / 62.34    | 7 / 23.20 / 33.19
  Ours                          | 36 / 37.04 / 45.47   | 32 / 58.68 / 66.34   | 64 / 28.20 / 37.91
Our trained extractive compressor significantly outperforms the other extractive baselines (Contriever and BM25) across all three LMs, while prepending slightly fewer tokens. Compared to prepending one document, we achieve a compression ratio of 25% with minimal performance drop. Our trained abstractive compressor performs the best across the board, achieving the lowest perplexity and the highest compression ratio. It achieves this high compression rate through selective augmentation, prepending summaries to only 33% of examples (the length distribution of generated summaries is in Fig. 8).
Open-domain QA  We report the results on the QA tasks in Table 2. Similar to the language modeling task, all retrieval augmentation methods improve performance over the no-retrieval setting across the three datasets, consistent with previous studies on other LMs (Shi et al., 2023b; Mallen et al., 2022; Si et al., 2022). Unlike language modeling, prepending five documents shows significant gains over prepending a single document, motivating the use of compression to incorporate more documents.
We find that the extractive oracle outperforms the abstractive one on all datasets. The extractive oracle selects the best of N candidate sentences, while the abstractive oracle selects between two options: prepending the GPT-3.5 summary or prepending nothing. Both oracles show improvements over prepending all information, suggesting that removing irrelevant information benefits the model.⁸
Among extractive baselines, DPR performs the best, as it has been fine-tuned on high-quality NQ data. On NQ, selecting the top DPR-ranked sentence from the top 5 documents outperforms prepending the top 1 document, with far fewer tokens (39 vs. 132). However, its performance degrades on out-of-domain datasets. The off-the-shelf summarization model (T5) achieves the highest level of compression, yielding 4-6 point gains in EM over no retrieval while adding a mere 7-10 tokens.
The trained compressors, both extractive and abstractive, show promising performance. On NQ and TQA, the abstractive approach is more effective. On NQ, it achieves a compression ratio of 5% of the tokens while losing 2 EM points compared to prepending the full documents. On TQA, we observe similar trends: a compression ratio of 5% of the tokens while losing 3.7 EM points compared to prepending the full set of documents.
⁸ We provide an example where our compressed summary yields the correct answer while prepending the full documents does not; see Table 9 in the appendix.
On HotpotQA, which requires multi-hop understanding of documents, we find the extractive approach to be more helpful, achieving an 11% compression rate while losing 2.4 EM points compared to prepending the full documents. We find that learning an abstractive compressor for more complex tasks, such as HotpotQA, demands further study. While extreme-scale LLMs boast competitive summarization performance in the single-document setting, they are not good at synthesizing information from multiple documents (Shaib et al., 2023) and hallucinate more often; see Section 6 for further analysis.
6 ANALYSIS AND DISCUSSIONS
Transferring Across Different LMs  One benefit of textual summaries is that they can transfer to other LMs, unlike approaches such as soft prompts (Wingate et al., 2022; Chevalier et al., 2023; Mu et al., 2023). We evaluate whether our compressors, trained to achieve high performance with respect to a specific LM (GPT2 for language modeling, Flan-UL2 for open-domain QA), can transfer to other LMs. For language modeling, we find that the trained compressors transfer well to other LMs (GPT2-XL and GPT-J), even though they are much larger (Table 1). For open-domain QA, we tested transferring our compressors to the LLaMA-13B (Touvron et al., 2023) model. The results can be found in Table 10 in the appendix. Overall, the performance is worse than with the LM the compressors were trained for, sometimes unable to outperform other compression baselines (e.g., no clear gain from our trained Contriever over the off-the-shelf Contriever on TQA/HotpotQA), leaving a considerable gap to the oracle compression for LLaMA itself. Yet, on NQ/TQA, our compressor obtains a 5% compression ratio with less than a 5-point EM drop compared to the full-document setting, showing the robustness of our retrieve-compress-prepend paradigm.
Figure 4: Histogram of abstractive summary length (# tokens) distribution.
How does the length of the summaries vary?  Can the learned compressor reliably determine when LMs require retrieved documents and when they do not? As retrieved documents hurt model performance for some input queries, 4-24% of the training examples for the abstractive compressors contain an empty summary. Fig. 4 presents the length distribution of abstractive summaries on NQ and Wikitext (histograms for the other datasets are in Fig. 8 in the appendix). The input document lengths do not vary significantly across examples, yet the abstractive summaries vary significantly in length, suggesting that the abstractive compressor enables selective retrieval augmentation. We have not experimented with selective compression for the extractive compressor, fixing the number of prepended sentences for the entire dataset (1 for Wikitext, 1 for NQ/TQA, 2 for HotpotQA). Allowing adaptive augmentation with the extractive summarizer is a promising direction for future work.
Table 3: Analysis of the in-context evidence used to answer questions in the NQ dev set. For the last column, we report how frequently the model copies from its evidence (1) on the subset where the gold answer is in the evidence document / (2) when it is not.

Evidence            | EM   | % Gold in Evi. | % Pred in Evi.
Top 1               | 33.1 | 36             | 92 / 51
Top 5               | 39.3 | 57             | 96 / 81
NE                  | 26.0 | 46             | 84 / 48
Oracle sent         | 60.2 | 34             | 93 / 16
Contriever          | 30.2 | 25             | 88 / 36
Ours (extractive)   | 36.6 | 28             | 90 / 33
GPT-3.5             | 37.1 | 45             | 98 / 85
T5                  | 25.9 | 30             | 52 / 20
Ours (abstractive)  | 37.0 | 34             | 98 / 39
How does the model leverage the in-context documents?  We evaluate whether retrieval-augmented LMs tend to copy answers verbatim from in-context evidence documents or generate answers not present in the documents. Copying is desired behavior only when the gold answer is in the evidence. We first report how frequently a gold answer span is present in the evidence text (% Gold in Evi.). As expected, full documents contain the answer most frequently, followed by NE and GPT-3.5. However, having more gold answers in the evidence does not equate to better performance, as the model cannot always identify the correct answer in the evidence (84% for NE vs. 98% for T5 (ours)).
We also observe that the model can be easily distracted by irrelevant context, copying a span from a document even when it does not contain the gold answer, echoing findings from prior work (Shi et al., 2023a). Prepending the top 5 documents has a higher frequency (81%) of copying incorrectly than the top 1 document (51%), and GPT-3.5 compression leads to an even higher incorrect copying frequency (85%), potentially because query-focused summarization generates sentences that seemingly contain the answer. Our compressor successfully reduces such erroneous behavior to 39%.
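For reference, the copying statistics above amount to simple (normalized) substring checks over the evidence text, sketched below; the normalization and the record schema are assumptions rather than the exact evaluation code.

import re
import string

def normalize(text: str) -> str:
    # Lowercase, strip punctuation and collapse whitespace (typical answer normalization).
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def copy_rates(records):
    """records: list of dicts with 'evidence', 'prediction', 'gold_answers' keys."""
    gold_in, pred_in_when_gold, pred_in_when_no_gold = [], [], []
    for r in records:
        evi = normalize(r["evidence"])
        has_gold = any(normalize(g) in evi for g in r["gold_answers"])
        gold_in.append(has_gold)
        bucket = pred_in_when_gold if has_gold else pred_in_when_no_gold
        bucket.append(normalize(r["prediction"]) in evi)
    pct = lambda xs: 100 * sum(xs) / len(xs) if xs else float("nan")
    return pct(gold_in), pct(pred_in_when_gold), pct(pred_in_when_no_gold)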
Is the generated summary faithful and comprehensive?  We (the authors) manually evaluate the outputs of the abstractive compressors on two axes (Chen et al., 2023): Faithfulness, whether the summary can be entailed by the retrieved documents, and Comprehensiveness, whether the summary contains sufficient information to answer the question, regardless of whether the generated information comes from the retrieved documents. For both, we select one of three labels (Yes, Partially, No) and report the % of Useful summaries, which are both faithful and comprehensive. An annotation sample can be found in Table 11 in the appendix. We evaluate the summaries generated by GPT-3.5 and by our abstractive compressor, randomly sampling 30 non-empty summaries from the test set.
Table 4: Manual analysis of the abstractive summaries generated for the NQ, TQA and HotpotQA (HQA) datasets.

Dataset | Model   | % Faithful (Y / P / N) | % Compre. (Y / P / N) | % Use.
NQ      | GPT-3.5 | 90 / 0 / 10            | 97 / 0 / 3            | 83
NQ      | Ours    | 80 / 13 / 7            | 100 / 0 / 0           | 80
TQA     | GPT-3.5 | 97 / 0 / 3             | 90 / 0 / 10           | 83
TQA     | Ours    | 83 / 3 / 14            | 96 / 0 / 4            | 77
HQA     | GPT-3.5 | 74 / 0 / 26            | 78 / 0 / 22           | 50
HQA     | Ours    | 67 / 0 / 33            | 74 / 0 / 26           | 40
Table 4 presents the annotation results. GPT-3.5, which is substantially bigger than our compressor, generates more useful summaries across all three datasets. Overall, our abstractive compressors were less faithful than GPT-3.5, their teacher, while improving comprehensiveness. The effectiveness of summarization also depends on the dataset: summaries from both models were the most faithful for TQA and the least faithful for HotpotQA. In terms of comprehensiveness, we find both models easily locate the information for NQ, but struggle with HotpotQA. These results partially explain why the performance gain was limited for HotpotQA.
7 RELATED WORK
Efficient RALM  He et al. (2021) improve the efficiency of RALMs by improving retrieval components, through techniques such as datastore compression and dimensionality reduction for the neural retriever. A line of work also reduces retrieval frequency through selective retrieval (He et al., 2021; Mallen et al., 2022) or a larger stride (Martins et al., 2022). In this work, we improve the efficiency of RALMs by compressing retrieved documents into a concise summary or an empty sequence, facilitating selective retrieval augmentation.
Prompt Compression Recent work (Wingate et al., 2022; Chevalier et al., 2023; Mu et al., 2023)
proposes compressing long contexts into summary vectors (soft prompts) that can be used by LMs,
rather than shorter textual summaries. Such soft prompts can serve as efficient replacements for
plain-text demonstrations, minimizing the computational costs during inference. Another related line
of work proposes context distillation (Snell et al., 2022; Choi et al., 2022; Padmanabhan et al., 2023),
which injects the prepended context into the parameters of an LM. Compared to the above approaches, our approach yields a more interpretable textual summary that can transfer across different LMs, and
can be applied to black box LMs without requiring gradient updates. Prior work has studied textual
compression for other tasks, such as political fact checking (Chen et al., 2023) and instruction
learning (Yin et al., 2023).
Distillation / Goal Oriented Summarization  Recent work introduces symbolic knowledge distillation (West et al., 2022), which transfers knowledge from a teacher model by generating a training dataset with the teacher model and training a student model on it. For better performance, they introduce critic criteria, which filter undesirable examples from the generated training dataset. Such distillation techniques have been applied to various applications, including summarization (Jung et al., 2023), which aims to generate high-quality summaries, whereas we optimize for generating summaries that are effective for downstream LMs. The work most similar to our setting is Hsu & Tan (2021), which trains an extractive summarization model to optimize the prediction accuracy of a sentiment prediction model based on the summary.
8 CONCLUSION
We introduce RECOMP, a method which compresses retrieved documents into textual summaries before prepending them, to improve in-context retrieval-augmented language models. We present two compression models: an extractive compressor and an abstractive compressor. We design a training scheme which leverages end task signals from a blackbox LM to generate useful summaries and allows the compression models to perform selective augmentation. Our experiments show that our compressors can significantly improve the efficiency of retrieval-augmented LMs with minimal drop in performance.
ACKNOWLEDGEMENT
We thank the members of the UT and UW NLP community for feedback on the project. We especially
thank Alisa Liu, Junyi Jessy Li and Greg Durrett for providing comments on the draft. The project is
partially funded by NSF grant (IIS-2312948).
ETHICS STATEMENT
We use a commercial language model to generate training data for our compressors, which might include factual errors. We conduct careful human evaluation of the generated data and present our analysis in the paper.
REPRODUCIBILITY STATEMENT
We publicly release our code, prompts, and the data generated with API access.
REFERENCES
Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer.
ArXiv, abs/2004.05150, 2020. URL https://api.semanticscholar.org/CorpusID:215737171.
Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Mil-
lican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark,
Diego De Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron
Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Ge-
offrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack Rae, Erich Elsen, and
Laurent Sifre. Improving language models by retrieving from trillions of tokens. In Kama-
lika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato
(eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of
Proceedings of Machine Learning Research, pp. 2206–2240. PMLR, 17–23 Jul 2022. URL
https://proceedings.mlr.press/v162/borgeaud22a.html.
Daniel Fernando Campos, Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Ran-
gan Majumder, Li Deng, and Bhaskar Mitra. Ms marco: A human generated machine reading com-
prehension dataset. ArXiv, abs/1611.09268, 2016. URL https://api.semanticscholar.org/CorpusID:1289517.
Jifan Chen, Grace Kim, Aniruddh Sriram, Greg Durrett, and Eunsol Choi. Complex claim ver-
ification with evidence retrieved in the wild. ArXiv, abs/2305.11859, 2023. URL https://api.semanticscholar.org/CorpusID:258822852.
Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. Adapting language models to
compress contexts. arXiv preprint arXiv:2305.14788, 2023.
Eunbi Choi, Yongrae Jo, Joel Jang, and Minjoon Seo. Prompt injection: Parameterization of
fixed inputs. ArXiv, abs/2206.11349, 2022. URL https://api.semanticscholar.org/CorpusID:249953762.
Hyung Won Chung, Le Hou, S. Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang,
Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac
Suzgun, Xinyun Chen, Aakanksha Chowdhery, Dasha Valter, Sharan Narang, Gaurav Mishra,
Adams Wei Yu, Vincent Zhao, Yanping Huang, Andrew M. Dai, Hongkun Yu, Slav Petrov, Ed Huai
hsin Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei.
Scaling instruction-finetuned language models. ArXiv, abs/2210.11416, 2022.
Tanya Goyal, Junyi Jessy Li, and Greg Durrett. News summarization and evaluation in the era of
gpt-3. arXiv preprint arXiv:2209.12356, 2022.
Junxian He, Graham Neubig, and Taylor Berg-Kirkpatrick. Efficient nearest neighbor language
models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language
Processing, pp. 5703–5714, Online and Punta Cana, Dominican Republic, November 2021.
Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.461. URL https://aclanthology.org/2021.emnlp-main.461.
Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. ArXiv, abs/1506.03340, 2015. URL https://api.semanticscholar.org/CorpusID:6203757.
Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network.
ArXiv, abs/1503.02531, 2015. URL https://api.semanticscholar.org/CorpusID:7200347.
Matthew Honnibal, Ines Montani, Sophie Van Landeghem, and Adriane Boyd. spaCy: Industrial-strength natural language processing in python, 2020.
Chao-Chun Hsu and Chenhao Tan. Decision-focused summarization. In Proceedings of the 2021
Conference on Empirical Methods in Natural Language Processing, pp. 117–132, Online and
Punta Cana, Dominican Republic, November 2021. Association for Computational Linguis-
tics. doi: 10.18653/v1/2021.emnlp-main.10. URL https://aclanthology.org/2021.emnlp-main.10.
Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand
Joulin, and Edouard Grave. Unsupervised dense information retrieval with contrastive learning,
2021. URL https://arxiv.org/abs/2112.09118.
Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane A.
Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. Few-shot learning with retrieval
augmented language models. ArXiv, abs/2208.03299, 2022.
Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly
supervised challenge dataset for reading comprehension. arXiv preprint 1705.03551, 2017.
Jaehun Jung, Peter West, Liwei Jiang, Faeze Brahman, Ximing Lu, Jillian Fisher, Taylor Sorensen,
and Yejin Choi. Impossible distillation: from low-quality model to high-quality dataset & model
for summarization and paraphrasing. arXiv preprint arXiv:2305.16635, 2023.
Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi
Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing
(EMNLP), pp. 6769–6781, Online, November 2020a. Association for Computational Linguis-
tics. doi: 10.18653/v1/2020.emnlp-main.550. URL https://aclanthology.org/2020.emnlp-main.550.
Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Yu Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Conference on Empirical Methods in Natural Language Processing, 2020b. URL https://api.semanticscholar.org/CorpusID:215737187.
Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization
through memorization: Nearest neighbor language models. ArXiv, abs/1911.00172, 2019.
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR,
abs/1412.6980, 2014.
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris
Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion
Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav
Petrov. Natural questions: A benchmark for question answering research. Transactions of the
Association for Computational Linguistics, 7:452–466, 2019. doi: 10.1162/tacl_a_00276. URL https://aclanthology.org/Q19-1026.
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks. ArXiv, abs/2005.11401, 2020.
Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni,
and Percy Liang. Lost in the middle: How language models use long contexts. arXiv preprint
arXiv:2307.03172, 2023.
Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Hannaneh Hajishirzi, and Daniel Khashabi.
When not to trust language models: Investigating effectiveness and limitations of parametric and
non-parametric memories. ArXiv, abs/2212.10511, 2022.
Pedro Henrique Martins, Zita Marinho, and André F. T. Martins. Chunk-based nearest neigh-
bor machine translation. In Proceedings of the 2022 Conference on Empirical Methods in
Natural Language Processing, pp. 4228–4245, Abu Dhabi, United Arab Emirates, December
2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.284. URL
https://aclanthology.org/2022.emnlp-main.284.
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture
models. ArXiv, abs/1609.07843, 2016.
Jesse Mu, Xiang Lisa Li, and Noah D. Goodman. Learning to compress prompts with gist tokens.
ArXiv, abs/2304.08467, 2023. URL https://api.semanticscholar.org/CorpusID:258179012.
Shankar Padmanabhan, Yasumasa Onoe, Michael J.Q. Zhang, Greg Durrett, and Eunsol Choi.
Propagating knowledge updates to lms through distillation. ArXiv, abs/2306.09306, 2023. URL
https://api.semanticscholar.org/CorpusID:259165330.
Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao,
James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, Vassilis Plachouras, Tim
Rocktäschel, and Sebastian Riedel. KILT: a benchmark for knowledge intensive language tasks.
In Proceedings of the 2021 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, pp. 2523–2544, Online, June 2021.
Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.200. URL https://aclanthology.org/2021.naacl-main.200.
Abhilash Potluri, Fangyuan Xu, and Eunsol Choi. Concise answers to complex questions: Sum-
marization of long-form answers. In Proceedings of the 61st Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers), pp. 9709–9728, Toronto, Canada, July
2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.541. URL
https://aclanthology.org/2023.acl-long.541.
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language
models are unsupervised multitask learners. 2019.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi
Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text
transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and
Yoav Shoham. In-context retrieval-augmented language models. ArXiv, abs/2302.00083, 2023.
Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-
networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language
Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-
IJCNLP), pp. 3982–3992, Hong Kong, China, November 2019. Association for Computational
Linguistics. doi: 10.18653/v1/D19-1410. URL https://aclanthology.org/D19-1410.
Stephen E. Robertson and Hugo Zaragoza. The probabilistic relevance framework: Bm25 and beyond.
Found. Trends Inf. Retr., 3:333–389, 2009.
Chantal Shaib, Millicent Li, Sebastian Joseph, Iain Marshall, Junyi Jessy Li, and Byron Wal-
lace. Summarizing, simplifying, and synthesizing medical evidence using GPT-3 (with vary-
ing success). In Proceedings of the 61st Annual Meeting of the Association for Computa-
tional Linguistics (Volume 2: Short Papers), pp. 1387–1407, Toronto, Canada, July 2023. As-
sociation for Computational Linguistics. doi: 10.18653/v1/2023.acl-short.119. URL https://aclanthology.org/2023.acl-short.119.
Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed Huai hsin Chi,
Nathanael Scharli, and Denny Zhou. Large language models can be easily distracted by ir-
relevant context. In International Conference on Machine Learning, 2023a. URL https://api.semanticscholar.org/CorpusID:256459776.
Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettle-
moyer, and Wen tau Yih. Replug: Retrieval-augmented black-box language models. ArXiv,
abs/2301.12652, 2023b.
Chenglei Si, Zhe Gan, Zhengyuan Yang, Shuohang Wang, Jianfeng Wang, Jordan L. Boyd-Graber,
and Lijuan Wang. Prompting gpt-3 to be reliable. ArXiv, abs/2210.09150, 2022.
Charles Burton Snell, Dan Klein, and Ruiqi Zhong. Learning by distilling context.
ArXiv, abs/2209.15189, 2022. URL https://api.semanticscholar.org/CorpusID:252668389.
Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei,
Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Daniel M. Bikel, Lukas
Blecher, Cristian Cantón Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes,
Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony S.
Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa,
Isabel M. Kloumann, A. V. Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril,
Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar
Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan
Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, R. Subramanian, Xia Tan, Binh Tang,
Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zhengxu Yan, Iliyan Zarov, Yuchen
Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic,
Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models.
ArXiv, abs/2307.09288, 2023. URL https://api.semanticscholar.org/CorpusID:259950998.
Ben Wang and Aran Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model.
https://github.com/kingoflolz/mesh-transformer-jax, May 2021.
Peter West, Chandra Bhagavatula, Jack Hessel, Jena Hwang, Liwei Jiang, Ronan Le Bras, Ximing Lu,
Sean Welleck, and Yejin Choi. Symbolic knowledge distillation: from general language models to
commonsense models. In Proceedings of the 2022 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies, pp. 4602–4625,
Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.341. URL https://aclanthology.org/2022.naacl-main.341.
David Wingate, Mohammad Shoeybi, and Taylor Sorensen. Prompt compression and con-
trastive conditioning for controllability and toxicity reduction in language models. In Find-
ings of the Association for Computational Linguistics: EMNLP 2022, pp. 5621–5634, Abu
Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics.
doi: 10.18653/v1/2022.findings-emnlp.412. URL https://aclanthology.org/2022.findings-emnlp.412.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi,
Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. Huggingface's
transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771, 2019.
Yumo Xu and Mirella Lapata. Coarse-to-fine query focused multi-document summarization. In
Conference on Empirical Methods in Natural Language Processing, 2020. URL https://api.semanticscholar.org/CorpusID:226262229.
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov,
and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question
answering. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018.
Fan Yin, Jesse Vig, Philippe Laban, Shafiq R. Joty, Caiming Xiong, and Chien-Sheng Wu. Did you
read the instructions? rethinking the effectiveness of task definitions in instruction learning. In
Annual Meeting of the Association for Computational Linguistics, 2023. URL https://api.semanticscholar.org/CorpusID:259063796.
Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontañón, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. Big bird: Transformers for longer sequences. ArXiv, abs/2007.14062, 2020. URL https://api.semanticscholar.org/CorpusID:220831004.
Shiyue Zhang, David Wan, and Mohit Bansal. Extractive is not faithful: An investigation of
broad unfaithfulness problems in extractive summarization. ArXiv, abs/2209.03549, 2022. URL https://api.semanticscholar.org/CorpusID:252118883.
Figure 5: Data distribution on the NQ, TriviaQA and HotpotQA dev sets, comparing end task performance for the base model (Flan-UL2) when prepending the oracle compression (oracle sentence or GPT-3.5 summaries) versus prepending nothing.
A APPENDIX
A.1 COMPRESSOR TRAINING DATA GENERATION
We report the statistics of the data used to train compressors in Table 5. We use SpaCy (Honnibal
et al., 2020) to extract named entities.
Extractive Data Generation We generate data using the training data for the four datasets we
tested (Wikitext, NQ, TQA and HotpotQA). We use the NLTK package to perform sentence splitting.
We remove examples without any negatives.
Abstractive Data Generation  We report the prompts used to generate summaries in Table 8. We queried the OpenAI API with a temperature of 0.7 and top p = 1. For the language modeling task, we use an ensemble of four prompts and choose the one which leads to the lowest perplexity as the target. If none of the summaries leads to a perplexity decrease, we treat an empty summary as the target. We generate four summaries per example for a randomly sampled 2% of the training data (48,013 examples).
A.2 COMPRESSOR TRAINING DETAILS
Extractive Compressor  For language modeling, we use the Contriever checkpoint⁹ trained with unsupervised data. For the QA tasks, we use the Contriever checkpoint fine-tuned on the MS MARCO task (Campos et al., 2016),¹⁰ following prior work (Si et al., 2022; Shi et al., 2023b). We implement the model using Transformers (Wolf et al., 2019) and the sentence-transformers library (Reimers & Gurevych, 2019). We train with the Adam optimizer (Kingma & Ba, 2014), using a batch size of 64, a learning rate of 2e-5 and 1000 warmup steps, for 3 epochs. We report results for the model with the best reranked perplexity on our validation set for the language modeling task and the best reranked accuracy for the QA tasks.
⁹ https://huggingface.co/facebook/contriever
¹⁰ https://huggingface.co/facebook/contriever-msmarco
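A hedged sketch of one fine-tuning step with the contrastive objective from Section 3.1, using [CLS] inner products as the similarity; the single-example update and helper names are illustrative and omit the batching, warmup, and other details of the released implementation.

import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("facebook/contriever")
enc = AutoModel.from_pretrained("facebook/contriever")
optimizer = torch.optim.Adam(enc.parameters(), lr=2e-5)

def cls_embed(texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    return enc(**batch).last_hidden_state[:, 0]          # [CLS] embeddings

def contrastive_step(query, positive, negatives):
    """One update on (x_i, p_i, N_i): pull sim(x, p) above sim(x, n) for n in N_i."""
    q = cls_embed([query])                               # (1, dim)
    cands = cls_embed([positive] + negatives)            # (1 + |N_i|, dim)
    sims = (q @ cands.T).squeeze(0)                      # inner-product similarities
    loss = -torch.log_softmax(sims, dim=0)[0]            # -log e^{s_pos} / sum_j e^{s_j}
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()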
Input: Base LM M, compressor encoder enc_θ, training data {x_i, S_i, y_i}_1^T where x_i is the input, S_i = {s_j}_1^n is a set of candidate sentences from the retrieved documents for x_i, and y_i is the target answer.
Output: An updated extractive compressor encoder enc_θ
1:  T ← ∅
2:  for i ∈ {1, . . . , T} do
3:      p_i ← argmax_{s_j ∈ S_i} Score(M, y_i, [s_j; x_i])
4:      L ← ∅
5:      for j ∈ {1, . . . , n} do
6:          if Score(M, y_i, [s_j; x_i]) < Score(M, y_i, [p_i; x_i]) then
7:              L ← L ∪ {s_j}
8:      if |L| > 0 then
9:          N_i ← argTop5_{s_j ∈ L} ⟨enc_θ(s_j), enc_θ(x_i)⟩
10:         T ← T ∪ {(x_i, p_i, N_i)}
11: enc_θ = Finetune(enc_θ, T)

Figure 6: Learning an extractive compressor for the QA task. Score here is the exact match between the decoded answer and the gold answers.
Input: Teacher LM M_t, base LM M, summarization prompt p, compressor encdec_θ, training data {x_i, D_i, y_i}_1^T where x_i is the input, D_i is the set of retrieved documents for x_i, and y_i is the target answer.
Output: An updated encdec_θ
1:  T ← ∅
2:  for i ∈ {1, . . . , T} do
3:      s_i = Decode(M_t, [p; x_i; D_i])
4:      v_s = Score(M, y_i, [s_i; x_i])
5:      v_d = Score(M, y_i, [x_i])
6:      if v_s < v_d then
7:          T ← T ∪ {(x_i, D_i, ∅)}
8:          continue
9:      T ← T ∪ {(x_i, D_i, s_i)}
10: encdec_θ = Finetune(encdec_θ, T)

Figure 7: Learning an abstractive compressor for the QA task. Score here is the exact match between the decoded answer and the gold answers.
Table 5: Training data statistics for abstractive and extractive compressors.

Dataset   | Extractive: Train / Validation / % filtered / |N| | Abstractive: Train / Validation / % filtered / % empty
NQ        | 42,149 / 9,769 / 46 / 4.44                        | 39,466 / 4,931 / 50 / 25
TQA       | 70,032 / 8,753 / 56 / 4.37                        | 48,322 / 5,887 / 32 / 16
HotpotQA  | 24,526 / 3,068 / 69 / 4.33                        | 26,556 / 2,937 / 42 / 4
Wikitext  | 1,398,318 / 15,483 / 41 / 4.04                    | 38,410 / 9,603 / 0 / 24
Figure 8: Histogram of abstractive summary length (# tokens) for the test sets of NQ, TQA, HotpotQA
and Wikitext.
Abstractive Compressor We implement the model using the Transformers library (Wolf et al., 2019). We
train the abstractive compressor with the Adam optimizer (Kingma & Ba, 2014), using a batch size of 16,
a learning rate of 1e-5 and 1000 warmup steps for 3 epochs.
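A minimal sketch of this fine-tuning setup with the Transformers Trainer API is shown below; the T5 checkpoint name and the pre-tokenized train_dataset / eval_dataset objects are assumptions, and the Trainer's default AdamW optimizer stands in for Adam.

```python
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments,
                          DataCollatorForSeq2Seq)

# The specific T5 checkpoint is an assumption; only the optimizer and schedule are fixed above.
model_name = "t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

args = Seq2SeqTrainingArguments(
    output_dir="abstractive-compressor",
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    warmup_steps=1000,
    num_train_epochs=3,
)

# `train_dataset` / `eval_dataset` are assumed to hold tokenized
# (query + retrieved documents -> summary) pairs built as in Figure 7.
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()
```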
Table 6: Example abstractive and extractive compression on wikitext-103 dev set and NQ.
Wikitext-103 Input    Original Top 1 document
present in most of the Mediterranean Sea, only missing from the section east of Crete, and along only
the north @-@ west coast of the Black Sea
Sea of Crete” Sea of Crete The Sea of Crete (, ””Kritiko Pelagos””) or Cretan Sea, is a sea, part of the Aegean Sea, located in
its Southern extremity. The sea stretches to the North of the island of Crete, East of the islands of Kythera and Antikythera,
South of the Cyclades, and West of the Dodecanese islands of Rhodes, Karpathos and Kassos. The bounding sea to the
West is the Ionian Sea. To the Northwest is the Myrtoan Sea, a subdivision of the Mediterranean Sea that lies between the
Cyclades and Peloponnese. To the East-SE is the rest of the Mediterranean Sea,
Method Compressed document
BoW
present Mediterranean Sea missing section east Crete along north west coast Black The Kritiko Pelagos Cretan sea part
Aegean located Southern extremity stretches North island East islands Kythera Antikythera South Cyclades West Dodecanese
Rhodes Karpathos Kassos bounding Ionian To Northwest Myrtoan subdivision lies Peloponnese SE rest
NE
Kythera the Aegean Sea the Ionian Sea Crete Southern South Rhodes the Myrtoan Sea Cretan Sea Antikythera Dodecanese
Kassos Karpathos West the Black Sea & Sea of Crete Peloponnese the Mediterranean Sea Cyclades
Extractive compression
To the Northwest is the Myrtoan Sea, a subdivision of the Mediterranean Sea that lies between the Cyclades and Peloponnese.
NQ Input Original Top 5 document
who got the first nobel prize in physics
receive a diploma, a medal and a document confirming the prize amount. Nobel Prize in Physics The Nobel Prize in Physics
() is a yearly award given by the Royal Swedish Academy of Sciences for those who have made the most outstanding
contributions for mankind in the field of physics. It is one of the five Nobel Prizes established by the will of Alfred Nobel in
1895 and awarded since 1901; the others being the Nobel Prize in Chemistry, Nobel Prize in Literature, Nobel Peace Prize,
and Nobel Prize in Physiology or Medicine. The first Nobel Prize in Physics was
science, Ernest Lawrence won the Nobel Prize in Physics in 1939. Lars Onsager won the 1968 Nobel Prize in Chemistry.
Norman Borlaug, father of the Green Revolution, won the Nobel Peace Prize in 1970. Christian B. Anfinsen won the Nobel
Prize for chemistry in 1972. Ivar Giaever won the Nobel Prize in Physics 1973. Carl Richard Hagen is noted for his work in
physics. In engineering, Clayton Jacobson II is credited with the invention of the modern personal watercraft. Ole Singstad
was a pioneer of underwater tunnels. Ole Evinrude invented the first outboard motor with practical commercial application,
recognizable today Nobel Prize in Physics The Nobel Prize in Physics () is a yearly award given by the Royal Swedish
Academy of Sciences for those who have made the most outstanding contributions for mankind in the field of physics. It is
one of the five Nobel Prizes established by the will of Alfred Nobel in 1895 and awarded since 1901; the others being the
Nobel Prize in Chemistry, Nobel Prize in Literature, Nobel Peace Prize, and Nobel Prize in Physiology or Medicine. The
first Nobel Prize in Physics was awarded to physicist Wilhelm Röntgen in recognition of the extraordinary services he
was also awarded the Abel prize. In addition, eight "normaliens" have gone on to receive the Nobel Prize in Physics: Claude
Cohen-Tannoudji, Pierre-Gilles de Gennes, Albert Fert, Alfred Kastler, Gabriel Lippmann, Louis Néel, Jean Baptiste Perrin
and Serge Haroche, while other ENS physicists include such major figures as Paul Langevin,
famous for developing Langevin dynamics and the Langevin equation. Alumnus Paul Sabatier won the Nobel Prize in
Chemistry. A ranking of universities worldwide based on ratios of alumni to Nobel prize-winners published in 2016 by
American scholars Stephen Hsu and Jonathan Wai placed ENS as the first university worldwide, far
rendered by the discovery of the remarkable rays (or x-rays). This award is administered by the Nobel Foundation and
widely regarded as the most prestigious award that a scientist can receive in physics. It is presented in Stockholm at an
annual ceremony on 10 December, the anniversary of Nobel’s death. Through 2018, a total of 209 individuals have been
awarded the prize. Only three women (1.4% of laureates) have won the Nobel Prize in Physics: Marie Curie in 1903, Maria
Goeppert Mayer in 1963, and Donna Strickland in 2018. Alfred Nobel, in his last will and testament, stated that his
Method Compressed document
T5    Wilhelm Röntgen received the first Nobel Prize in Physics in recognition of his extraordinary services. It is one of the five
Nobel Prizes established by Alfred Nobel in 1895 and awarded since 1901.
GPT-3.5-turbo    The first Nobel Prize in Physics was awarded to physicist Wilhelm Röntgen in 1901 for his discovery of the
remarkable rays (or x-rays). Since then, 209 individuals have been awarded the prize, with only three women (1.4% of laureates)
having won it.
Table 7: Example input to Flan-UL2 for NQ with in-context examples and retrieved documents.
Dataset Prompts
NQ who won a million on deal or no deal Answer: Tomorrow Rodriguez
who is the woman washing the car in cool hand luke Answer: Joy Harmon
who is the actor that plays ragnar on vikings Answer: Travis Fimmel
who said it’s better to have loved and lost Answer: Alfred , Lord Tennyson
name the first indian woman to be crowned as miss world Answer: Reita Faria
Retrieved Docs
Question
Answer:
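For concreteness, the input above can be assembled roughly as follows; the exact separators and whitespace between the blocks are our assumption.

```python
# In-context QA examples from Table 7, followed by the (compressed) evidence and the test question.
IN_CONTEXT_EXAMPLES = (
    "who won a million on deal or no deal Answer: Tomorrow Rodriguez\n"
    "who is the woman washing the car in cool hand luke Answer: Joy Harmon\n"
    "who is the actor that plays ragnar on vikings Answer: Travis Fimmel\n"
    "who said it's better to have loved and lost Answer: Alfred , Lord Tennyson\n"
    "name the first indian woman to be crowned as miss world Answer: Reita Faria\n"
)

def build_flan_ul2_input(retrieved_docs: str, question: str) -> str:
    """Concatenate in-context examples, retrieved documents (or their compression),
    and the test question in the layout shown in Table 7."""
    return f"{IN_CONTEXT_EXAMPLES}{retrieved_docs}\n{question} Answer:"
```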
Table 8: Prompts used to generate summaries from GPT-3.5-turbo. {query} and {docs} represent the
actual input query and retrieved documents.
Dataset     Prompts
NQ          Compress the information in the retrieved documents into a 2-sentence summary that could
            be used to answer the question: Question: {query} Retrieved documents: {docs} Compressed documents:
TQA         Compress the information in the retrieved documents into a 2-sentence summary that could
            be used to answer the question: Question: {query} Retrieved documents: {docs} Compressed documents:
HotpotQA    Source documents: {docs} Question: {query} Generate a reasoning chain to answer the question:
Wikitext    Generate the next two sentences of the given query using the information from the provided
            documents. \nSource Documents: {docs} \nQuery: {query} \n
Wikitext    Select sentences from the retrieved docs that are most likely be in the next sentence.\nSource
            Documents: {docs} \nQuery: {query}\n
Wikitext    Generate the next one sentence of the given query using the information from the provided
            documents\nSource Documents: {docs} \nQuery: {query} \n
Wikitext    Summarize the information from the provided documents \nSource Documents: {docs} \nQuery: {query}\n
Table 9: Case study of how compressing the retrieved documents helps the model to identify the right
answer from NQ dev set.
Question: host of the late show who was once a correspondent for the daily show.
Gold answer: Stephen Colbert
Type In-context documents Predicted Answers
None Chelsea Handler
Top 5    by Conan O'Brien, in 2009. Leno explained that he did not want to see a repeat of the hard
feelings and controversy that occurred when he was given the show over David Letterman following
Carson's retirement in 1992. O'Brien's last "Late Night" episode was taped on February 20, 2009.
Former Saturday Night Live alum Jimmy Fallon took over as host of "Late Night with Jimmy Fallon" on
March 2, 2009. The Colbert Report that aired four days a week on Comedy Central from October 17,
2005, was hosted by Stephen Colbert, one of the regulars on Comedy Central's The Daily season as
host began with a notable interview with former British prime minister Tony Blair. The live interview
occurred the night before a book signing at Eason's which attracted international attention when Blair
was pelted with shoes and eggs and successfully evaded an attempted citizen's arrest on charges of war
crimes. On 1 February 2013, Pat Kenny returned to host that night's edition when Tubridy's father died.
In 2015, Tubridy's tone and choice of questions when interviewing Anti-Austerity Alliance TD Paul
Murphy in relation to the campaign against the implementation of a water tax was much criticised.
Opponents of the 'Michigan, interviewing Eminem. Colbert has been given near-full control of the
show, with little interference from CBS management in regard to format. Colbert brought most of his
staff from "The Colbert Report" with him to "The Late Show", as well as outsiders such as Brian Stack,
who is best known for his work on Conan O'Brien's programs, and Jon Stewart, former host of Colbert's
previous sister program "The Daily Show", who is credited as executive producer. Colbert no longer
uses the character he had portrayed on "The Colbert Report", jokingly remarking to Jeb Bush that "I
used to play a Show" has had three regular hosts: Gay Byrne, Pat Kenny and Ryan Tubridy. Frank Hall
deputised for Byrne for one season in the 1960s. There have been at least four occasions on which
another presenter has hosted the show. The first was when Byrne became unexpectedly and seriously
ill. Frequent panelist Ted Bonner presented instead. The second time was towards the end of a show
about feminism, when Byrne ushered a young Marian Finucane into his seat to present the remainder
of the show. On another occasion, radio broadcaster and former news reader Andy O'Mahony replaced
Byrne for an interview popular acclaim. Colbert would host the program until he was chosen to replace
David Letterman as host of CBS's "Late Show" in 2015. Ed Helms, a former correspondent from 2002
to 2006, also starred on NBC's "The Office" and was a main character in the 2009 hit "The Hangover".
After filling in as host during Stewart's two-month absence in the summer of 2013, John Oliver went
on to host his own show on HBO, "Last Week Tonight with John Oliver". In 2016, former correspondent
Samantha Bee launched her own late-night talk show "Full Frontal with Samantha Bee". Bee's
husband Jason
Samantha Bee
GPT-3.5-turbo
Former Daily Show correspondent Stephen Colbert was chosen to replace David Letterman
as host of CBS’s ”Late Show” in 2015, while Ed Helms, a former correspondent from 2002
to 2006, also starred on NBC’s ”The Office” and John Oliver, who filled in as host during Jon
Stewart’s absence in 2013, went on to host his own show on HBO.
T5 (ours)
Stephen Colbert was a former correspondent for The Daily Show and later became host
of CBS’s ”Late Show” in 2015. He has since brought most of his staff from ”The Colbert
Report” with him to ”The Late Show”, with little interference from CBS management in
regard to format.
Stephen Colbert
Table 10: Open-domain QA results on LLaMA-13B. We report the oracle compression results with
Flan-UL2, the base model for the compressors (Oracle w/ FLAN), as well as the oracle compression
results for LLaMA-13B.
NQ TQA HotpotQA
In-context evidence # tok EM F1 # tok EM F1 # tok EM F1
- 0 30.89 40.73 0 65.00 71.18 0 24.20 34.50
RALM without compression
Top 1 document 132 33.35 43.13 136 66.62 73.10 138 34.40 44.17
Top 5 documents 660 37.04 47.60 667 70.61 77.51 684 37.00 47.11
Phrase / token level compression
Top 5 documents (BoW) 450 33.05 43.36 259 66.59 73.40 255 30.00 39.13
Top 5 documents (NE) 338 34.60 44.91 128 65.88 72.59 157 29.20 37.93
Extractive compression of top 5 documents
Oracle 31 56.62 68.89 31 84.61 80.46 69 42.20 51.34
Oracle (w/ FLAN) 34 40.89 50.06 32 68.52 74.96 70 35.20 45.13
Random 32 30.33 39.85 31 62.80 69.25 61 27.40 36.27
Contriever 36 32.52 42.01 40 65.88 72.44 78 34.60 43.99
Ours (init. w/ Contriever) 37 34.38 44.15 38 65.28 71.85 75 33.20 42.88
Abstractive compression of top 5 documents
Oracle 50 45.60 84.87 38 74.37 79.83 98 41.40 51.54
Oracle (w/ FLAN) 51 38.98 49.40 37 69.86 76.46 102 35.40 46.17
T5 10 33.38 43.54 7 63.18 70.92 7 30.40 40.60
Ours (init. w/ T5) 36 36.32 46.10 32 66.27 73.12 81 30.80 40.61
Table 11: Example summaries and their manual analysis labels. See Table 12 for more examples.
Dataset Model Query, Passages and Summary Evaluation
NQ Ours Question: when will miraculous ladybug season 2 episode 12 come out
Passages: 2016 on TVNZ’s TV2. In Japan, Disney Channel streamed the episode ”Stormy
Weather” through its mobile application on 1 July 2018, before the official premiere on 23
July in the same year. The second season premiere is scheduled for a global launch around
September–November 2017 in Europe, At a panel at San Diego Comic-Con 2017, it was
announced that the second season would have its North American release on Netflix in
December 2017, with 13 episodes to be released. KidsClick will start airing season 2 of this
show in the US starting 30 August 2018, marking the first time that Korea on 1 September
2015 on EBS1. In the United States, the series debuted on Nickelodeon on 6 December. In
the United Kingdom and Ireland, the show premiered on 30 January 2016 on Disney Channel.
A Christmas special was released in 2016 and the second season premiered in French on
TF1 and in English on Disney Channel UK in 2017. Netflix acquired the U.S. video-on-
demand streaming rights and further seasons are in production. Set in modern-day Paris, the
series focuses on teenagers Marinette Dupain-Cheng and Adrien Agreste. When evil arises,
Marinette transforms into her superhero persona Ladybug, while Adrien transforms rights in
Europe, Eastern Europe and Scandinavia, and free-to-air rights in Spain, Germany, Russia
and Turkey. South Korea was the first country to premiere ”Ladybug”, with girl group Fiestar
to sing its translated theme song. It aired on 1 September 2015 on EBS1, and ran for 13
episodes until November 2015, with repeats through February 2016, and its second half of
the season airing from 1 March 2016. SK Broadband, having participated in the production,
provided the episodes on video on demand exclusively to subscribers of their IPTV platform
B TV, about a half-hour following the South Korean broadcast of Season 2 of this show airing
on American over-the-air television. A third season is also in development. On 22 January
2018, Zag posted on Instagram that the crew was working on a fourth and fifth season. On 31
August premiered a new series with Miraculous: Tales of Ladybug & Cat Noir which this
time is in chibi style. On 5 September 2018, Skydance Media has acquired the live-action
film and live-action television rights. Kimberly Cooper, a blog writer who has contributed
to news media such as ”The Huffington Post”, wrote that the show has inspired teens and
adults to create Channel on 1 November 2016. In the United Kingdom and Ireland, the show
premiered on 30 January 2016 on Disney Channel The first series was also made available to
Netflix subscribers in the UK. Among the free-to-air terrestrial television broadcasters in the
Republic of Ireland, Raidió Teilifís Éireann premiered the show in 2018 on RTÉ2's teenage
programming block, TRTÉ. In Australia, the Australian Broadcasting Corporation released
the show on its ABC iview streaming service on 1 January 2016, and premiered the show on
its linear channel ABC3 on 22 March 2016. In New Zealand, the show premiered on 27 April
Summary: Miraculous Ladybug season 2 episode 12 is scheduled for release in December
2017 on Netflix in North America. The show premiered in South Korea on 1 September 2015
and has since been aired in various countries, including in the United States, the UK, and
New Zealand.
Not faithful, Comprehensive
TQA Ours
Question: In which novel does Flora Poste move to live with her relatives, the Starkadders,
in Sussex?
Passages: to be a writer, decides that the only way for her to live whilst researching her
writing is to stay with relatives. Her city-based relatives show no interest, so she sends letters
to her country relatives. There are a few responses, most of them unsuitable, but one is
intriguing. Flora decides to stay for a while with the Starkadder family on their rundown farm.
The Starkadders are an assortment of rustic, uncouth, and truly eccentric characters, each of
whom has a hurdle (be it physical, emotional, or spiritual) to overcome before reaching his
or her potential. Flora quickly realises that, relatives at the isolated Cold Comfort Farm in
the fictional village of Howling in Sussex. The inhabitants of the farm – Aunt Ada Doom,
the Starkadders, and their extended family and workers feel obliged to take her in to
atone for an unspecified wrong once done to her father. As is typical in a certain genre of
romantic 19th-century and early 20th-century literature, each of the farm’s inhabitants has
some long-festering emotional problem caused by ignorance, hatred, or fear, and the farm is
badly run. Flora, being a level-headed, urban woman in the dandy tradition, determines that
she must Have Always Been Starkadders at Cold Comfort Farm””, set several years later and
based on ””Conference at Cold Comfort Farm””, when Flora is married with several children,
was broadcast. In 1995 a television film was produced which was generally well-received,
with critics. Janet Maslin in the ””New York Times”” wrote that this screen version ””gets
it exactly right””. The film starred Kate Beckinsale as Flora, Joanna Lumley as her friend
and mentor Mary Smiling, Rufus Sewell as Seth, Ian McKellen as Amos Starkadder, Eileen
Atkins as Judith, Stephen Fry as Mybug, Miriam Margolyes as Mrs. Beetle, and Angela
Thorne as dies suddenly of a heart attack and Lady Place is rented out, with the view that
Titus, once grown up, will return to the home and run the business. After twenty years of
being a live-in aunt Laura finds herself feeling increasingly stifled both by her obligations to
the family and by living in London. When shopping for flowers on the Moscow Road, Laura
decides she wishes to move to the Chiltern Hills and, buying a guide book and map to the
area, she picks the village of Great Mop as her new home. Against the wishes of her extended
respects to her brother. Agnis is moving to the ancestral family home in Newfoundland,
which has been abandoned for 44 years. Realizing that Quoyle is at a total loss through grief,
she first offers to stay a few more days and help him through the crisis, and then persuades
him to move with her. While struggling to rebuild his life, fix up the derelict house, and care
for his daughter, Quoyle meets local resident Wavey Prowse, a widow who has a pre-teen
boy with a learning disability. Wavey’s son and Quoyle’s daughter become friends, while the
two adults become friends,
Summary: Flora Poste moves to live with her relatives, the Starkadders, on their rundown
farm in Sussex. The Starkadders are eccentric characters with various obstacles to overcome.
Faithful, Not comprehensive
Table 12: Example summaries and their manual analysis labels (continued).
Dataset Model Query, Passages and Summary Evaluation
HotpotQA
GPT-3.5
Question: The composer of the music for the ballet ”The Seasons” was the director of what
organization from 1905 to 1928?
Passages: The Seasons (ballet) The Seasons (, ””Vremena goda””; also ) is an allegorical
ballet in one act, four scenes, by the choreographer Marius Petipa, with music by Alexander
Glazunov, his Op. 67. The work was composed in 1899 and first performed by the Imperial
Ballet in 1900 in St. Petersburg, Russia. The score for Marius Petipa’s ””Les Saisons””
(””The Seasons””) was originally intended to have been composed by the Italian composer
and conductor Riccardo Drigo, who was Glazunov’s colleague and close friend. Since
1886, Drigo held the posts of director of music and ””chef d’orchestre”” to the Ballet of the
harmonium, guitar and even mandolin). ””The Seasons”” was commenced shortly after the
premiere of Tchaikovsky’s First Piano Concerto, and continued while he was completing
his first ballet, ””Swan Lake””. In 1875, Nikolay Matveyevich Bernard, the editor of the
St. Petersburg music magazine ””Nouvellist””, commissioned Tchaikovsky to write 12
short piano pieces, one for each month of the year. Bernard suggested a subtitle for each
month’s piece. Tchaikovsky accepted the commission and all of Bernard’s subtitles, and
in the December 1875 edition of the magazine, readers were promised a new Tchaikovsky
piece each month throughout 1876. The January and February pieces The Seasons (Cage)
The Seasons is a ballet with music by John Cage and choreography by Merce Cunningham,
first performed in 1947. It was Cage’s first piece for orchestra and also the first to use what
Cage later called the ””gamut”” technique, albeit in an early form. Cage composed the music
in early 1947, in the midst of working on ””Sonatas and Interludes””. A piano version was
first completed, and an orchestral arrangement followed. Cage dedicated ””The Seasons”” to
Lincoln Kirstein. The ballet was premiered on May 17, 1947 by the Ballet Society (by which
the work was commissioned) at a new orchestra version. Morton Gould retained the piano
part for many of the pieces and orchestrated the work throughout, recording it with himself
at the piano in 1951 for American Columbia. In 1965, Kurt-Heinz Stolze orchestrated a
number of the pieces as part of the music for John Cranko’s ballet ””Onegin””. More recent
orchestral versions have been produced by David Matthews (for symphony orchestra), Peter
Breiner (for solo violin and symphony orchestra), and Georgii Cherkin (for solo piano and
symphony orchestra). French composer Philippe Sarde arranged the Barcarolle as a main
theme for the 1988 movie ””The Bear””. Slava Cranko was not invited to direct, and Sir John
Gielgud was brought in. Prosecuted for homosexual activity, Cranko left the UK for Stuttgart,
and in 1961 was appointed director of the Stuttgart Ballet, where he assembled a group of
talented performers such as Marcia Haydée, Egon Madsen, Richard Cragun, Birgit Keil
and Suzanne Hanke. Among his following choreographies were ””Romeo and Juliet”” by
William Shakespeare in 1962, set to music by Prokofiev, ””Onegin”” in 1965, an adaptation
of the verse novel ””Eugene Onegin”” by Alexander Pushkin, set to music by Tchaikovsky
(mainly ””The Seasons””), orchestrated by Kurt-Heinz Stolze, ””The Taming
Summary: 1. The composer of the music for the ballet ””The Seasons”” is Alexander
Glazunov.
2. Alexander Glazunov composed the music for the ballet ””The Seasons”” in 1899.
3. The ballet ””The Seasons”” was first performed by the Imperial Ballet in 1900.
4. From 1905 to 1928, the director of an organization was Marius Petipa.
5. Marius Petipa was the choreographer of the ballet ””The Seasons””.
6. Therefore, the composer of the music for the ballet ””The Seasons”” was the director of
the Imperial Ballet from 1905 to 1928.
Not Faithful, Comprehensive