TalkLess: Blending Extractive and Abstractive Summarization for Editing Speech to Preserve Content and Style

The University of Texas at Austin, University of California, Berkeley

UIST 2025

TalkLess is a system that flexibly combines extraction and abstraction to condense speech while preserving its content and style. TalkLess's interface provides creators control over automated edits by separating low-level wording edits (via the compression pane) from major content edits (via the outline pane).

Abstract

Millions of people listen to podcasts, audio stories, and lectures, but editing speech remains tedious and time-consuming. Creators remove unnecessary words, cut tangential discussions, and even re-record speech to make recordings concise and engaging. Prior work automatically summarized speech by removing full sentences (extraction), but rigid extraction limits expressivity. AI tools can summarize and then re-synthesize speech (abstraction), but abstraction strips the speaker's style. We present TalkLess, a system that flexibly combines extraction and abstraction to condense speech while preserving its content and style. To edit speech, TalkLess first generates possible transcript edits, selects edits to maximize compression, coverage, and audio quality, then uses a speech editing model to translate the transcript edits into audio edits. TalkLess's interface gives creators control over automated edits by separating low-level wording edits (via the compression pane) from major content edits (via the outline pane). TalkLess achieves higher coverage and removes more speech errors than a state-of-the-art extractive approach. A comparison study (N=12) showed that TalkLess significantly decreased cognitive load and editing effort. We further demonstrate TalkLess's potential in an exploratory study (N=3) where creators edited their own speech.

Design Goals


G1. Support surfacing and removing unnecessary speech including filler words, repetitions, and tangential content.

G2. Preserve important information and achieve high coverage of the core message.

G3. Preserve speaker style including linguistic and para-linguistic variations that carry identity and message delivery.

G4. Support granular control and efficient review of automated edits with visualizations and flexible review levels.

G5. Avoid audio errors by maintaining natural pacing and preventing artifacts during cuts.

Editing Views

TalkLess interface with three main components: Compression Pane for transcript-level editing, Outline Pane for content navigation, and Audio Pane for playback

A speech snippet displayed in all three views within the compression pane: The final view displays the final edited transcript with rendered cuts, the diff view renders both inserted and deleted parts of the transcript, and the edit types view supports skimming and reviewing of automated edits through edit types.

System


TalkLess transcribes, aligns, and segments the original audio, generates potential transcript edits using an LLM, then selects a set of edits that maximize compression and content coverage without compromising audio quality.


TalkLess uses an optimization-based approach to select the best transcript edits. For each segment, we generate 25 candidate shortened transcripts and evaluate them using a multi-objective function that balances several key criteria:

Evaluation Function

$$E(C_{i,j}, \tau) = \lambda_1 \cdot E_{\text{comp}}(C_{i,j}, \tau) + \lambda_2 \cdot E_{\text{edits}}(C_{i,j}) + \lambda_3 \cdot E_{\text{len}}(C_{i,j}) + \lambda_4 \cdot E_{\text{cov}}(C_{i,j})$$
Variable Definitions
  • $C_{i,j}$: The j-th candidate shortened transcript for segment $S_i$
  • $\tau$: Target compression ratio (e.g., 0.15, 0.25, 0.5, 0.75)
  • $S_i$: Original transcript segment i
  • $\lambda_1, \lambda_2, \lambda_3, \lambda_4$: Weighting coefficients for each optimization component
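The weighted combination above can be sketched as a small scoring helper. The paper does not list the $\lambda$ values in this section, so the equal weights below are a placeholder assumption:

```python
def evaluate_candidate(e_comp, e_edits, e_len, e_cov,
                       weights=(0.25, 0.25, 0.25, 0.25)):
    """E(C_ij, tau): weighted sum of the four component scores.

    Equal weights are a placeholder; the actual lambda values
    are not specified here.
    """
    l1, l2, l3, l4 = weights
    return l1 * e_comp + l2 * e_edits + l3 * e_len + l4 * e_cov

# Pick the highest-scoring candidate among several (toy component scores).
scores = [evaluate_candidate(0.9, 0.8, 1.0, 0.85),
          evaluate_candidate(0.7, 0.9, 0.9, 0.95)]
best = max(range(len(scores)), key=lambda j: scores[j])
```

In the full system this score would be computed for each of the 25 candidates per segment, and the argmax kept.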

Optimization Components:

1. Compression Score
$$E_{\text{comp}}(C_{i,j}, \tau) = 1 - \left|\frac{\text{length}(C_{i,j})}{\text{length}(S_i)} - \tau\right|$$

Measures how well the candidate achieves the target compression ratio $\tau$. Higher scores indicate better alignment with the desired compression level.
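A minimal sketch of the compression score, assuming lengths are word counts (the paper may measure length in tokens or audio duration):

```python
def compression_score(candidate_words, original_words, tau):
    """E_comp: 1 minus the gap between achieved and target ratio."""
    ratio = len(candidate_words) / len(original_words)
    return 1 - abs(ratio - tau)

# Toy example: a 12-word segment shortened to 5 words, target tau = 0.5.
orig = "so um basically what we want to do here is compress speech".split()
cand = "we want to compress speech".split()
score = compression_score(cand, orig, tau=0.5)
```

A candidate whose length ratio exactly hits $\tau$ receives the maximum score of 1.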

2. Number of Edits
$$E_{\text{edits}}(C_{i,j}) = 1 - \frac{\text{number of edits}}{\text{length}(S_i)/2}$$

Minimizes audio artifacts by reducing the number of required edit operations. The Needleman-Wunsch algorithm computes the minimum edit distance between the original and shortened transcripts, which is then normalized by half the segment length.
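The edit count can be computed with the standard dynamic-programming alignment. The sketch below assumes unit-cost word-level insertions, deletions, and substitutions:

```python
def word_edit_count(original, candidate):
    """Minimum number of word-level edit operations between two
    word sequences (Needleman-Wunsch / Levenshtein with unit costs)."""
    n, m = len(original), len(candidate)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if original[i - 1] == candidate[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete a word
                           dp[i][j - 1] + 1,        # insert a word
                           dp[i - 1][j - 1] + cost) # match / substitute
    return dp[n][m]

def edits_score(original, candidate):
    """E_edits: fewer edit operations yields a score closer to 1."""
    return 1 - word_edit_count(original, candidate) / (len(original) / 2)

# Toy example: three filler words removed from an 8-word segment.
orig = "so um we will now talk about compression".split()
cand = "we will talk about compression".split()
```

Normalizing by half the segment length means a candidate that edits every other word would score 0.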

3. Insertion Length
$$E_{\text{len}}(C_{i,j}) = 1 - \frac{\sum \text{length of insertions}}{\text{number of insertions}}$$

Encourages candidates that avoid long insertions while preserving important content. Penalizes lengthy inserted phrases that may disrupt natural speech flow.
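A sketch of the insertion-length penalty, where insertions are the inserted phrases recovered from the transcript alignment. Treating a candidate with no insertions (purely extractive) as receiving the maximum score is our assumption about the edge case:

```python
def insertion_length_score(insertions):
    """E_len: 1 minus the average insertion length (in words).

    `insertions` is a list of inserted phrases, each a list of words.
    No insertions at all (pure extraction) gets the maximum score
    (an assumption about the edge case).
    """
    if not insertions:
        return 1.0
    total = sum(len(phrase) for phrase in insertions)
    return 1 - total / len(insertions)
```

Longer average insertions drive the score down, discouraging abstractive rewrites that synthesize long spans of new speech.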

4. Coverage Score
$$E_{\text{cov}}(C_{i,j}) = \frac{1}{|S_i|} \sum_{s \in S_i} \max_{c \in C_{i,j}} \text{sim}(s, c)$$

Ensures preservation of important information by matching each sentence $s$ in the original segment to the most similar sentence $c$ in the candidate using sentence transformers, then averaging similarity scores.
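The coverage computation can be sketched as follows. To keep the example self-contained, a simple Jaccard word-overlap similarity stands in for the sentence-transformer similarity the system actually uses:

```python
def coverage_score(original_sents, candidate_sents, sim):
    """E_cov: match each original sentence to its most similar
    candidate sentence, then average the best-match similarities."""
    if not candidate_sents:
        return 0.0
    return sum(max(sim(s, c) for c in candidate_sents)
               for s in original_sents) / len(original_sents)

def jaccard(a, b):
    """Word-overlap stand-in for sentence-transformer similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

# Toy example: two original sentences compressed into one.
orig = ["We slice the peppers.", "Then we season them."]
cand = ["We slice and season the peppers."]
score = coverage_score(orig, cand, jaccard)
```

A candidate that drops an original sentence entirely leaves that sentence with only a weak best match, lowering the average and thus the coverage score.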

Evaluation


TalkLess user study results showing significantly lower cognitive load and higher user satisfaction compared to baseline

TalkLess significantly decreased cognitive load and editing effort. All editors preferred TalkLess over the extractive baseline system for editing lecture speech recordings.



Results created with TalkLess

Audio examples are provided at compression levels of 0%, 15%, 25%, 50%, and 75%.

Video Overview

BibTeX

@article{benharrak2025talkless,
  title={TalkLess: Blending Extractive and Abstractive Speech Summarization for Editing Speech to Preserve Content and Style},
  author={Benharrak, Karim and Peng, Puyuan and Pavel, Amy},
  journal={arXiv preprint arXiv:2507.15202},
  year={2025}
}