TalkLess: Blending Extractive and Abstractive Summarization for Editing Speech to Preserve Content and Style

The University of Texas at Austin, University of California, Berkeley

UIST 2025

TalkLess is a system that flexibly combines extraction and abstraction to condense speech while preserving its content and style. TalkLess's interface provides creators control over automated edits by separating low-level wording edits (via the compression pane) from major content edits (via the outline pane).

Abstract

Millions of people listen to podcasts, audio stories, and lectures, but editing speech remains tedious and time-consuming. Creators remove unnecessary words, cut tangential discussions, and even re-record speech to make recordings concise and engaging. Prior work automatically summarized speech by removing full sentences (extraction), but rigid extraction limits expressivity. AI tools can summarize and then re-synthesize speech (abstraction), but abstraction strips the speaker's style. We present TalkLess, a system that flexibly combines extraction and abstraction to condense speech while preserving its content and style. To edit speech, TalkLess first generates possible transcript edits, selects edits to maximize compression, coverage, and audio quality, then uses a speech editing model to translate the transcript edits into audio edits. TalkLess's interface gives creators control over automated edits by separating low-level wording edits (via the compression pane) from major content edits (via the outline pane). TalkLess achieves higher coverage and removes more speech errors than a state-of-the-art extractive approach. A comparison study (N=12) showed that TalkLess significantly decreased cognitive load and editing effort. We further demonstrate TalkLess's potential in an exploratory study (N=3) where creators edited their own speech.

Design Goals


G1. Support surfacing and removing unnecessary speech including filler words, repetitions, and tangential content.

G2. Preserve important information and achieve high coverage of the core message.

G3. Preserve speaker style including linguistic and para-linguistic variations that carry identity and message delivery.

G4. Support granular control and efficient review of automated edits with visualizations and flexible review levels.

G5. Avoid audio errors by maintaining natural pacing and preventing artifacts during cuts.

Editing Views

TalkLess interface with three main components: Compression Pane for transcript-level editing, Outline Pane for content navigation, and Audio Pane for playback

A speech snippet displayed in all three views within the compression pane: The final view displays the final edited transcript with rendered cuts, the diff view renders both inserted and deleted parts of the transcript, and the edit types view supports skimming and reviewing of automated edits through edit types.

System


TalkLess transcribes, aligns, and segments the original audio, generates potential transcript edits using an LLM, then selects a set of edits that maximize compression and content coverage without compromising audio quality.


TalkLess uses an optimization-based approach to select the best transcript edits. For each segment, we generate 25 candidate shortened transcripts and evaluate them using a multi-objective function that balances several key criteria:

Evaluation Function

$$E(C_{i,j}, \tau) = \lambda_1 \cdot E_{\text{comp}}(C_{i,j}, \tau) + \lambda_2 \cdot E_{\text{edits}}(C_{i,j}) + \lambda_3 \cdot E_{\text{len}}(C_{i,j}) + \lambda_4 \cdot E_{\text{cov}}(C_{i,j})$$
Variable Definitions
  • $C_{i,j}$: The j-th candidate shortened transcript for segment $S_i$
  • $\tau$: Target compression ratio (e.g., 0.15, 0.25, 0.5, 0.75)
  • $S_i$: Original transcript segment i
  • $\lambda_1, \lambda_2, \lambda_3, \lambda_4$: Weighting coefficients for each optimization component
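The weighted combination above can be sketched as a small scoring helper. The paper does not list the $\lambda$ values in this section, so the equal weights below are a placeholder assumption:

```python
def evaluate_candidate(e_comp, e_edits, e_len, e_cov,
                       weights=(0.25, 0.25, 0.25, 0.25)):
    """E(C_ij, tau): weighted sum of the four component scores.

    Equal weights are a placeholder; the actual lambda values
    are not specified here.
    """
    l1, l2, l3, l4 = weights
    return l1 * e_comp + l2 * e_edits + l3 * e_len + l4 * e_cov

# Pick the highest-scoring candidate among several (toy component scores).
scores = [evaluate_candidate(0.9, 0.8, 1.0, 0.85),
          evaluate_candidate(0.7, 0.9, 0.9, 0.95)]
best = max(range(len(scores)), key=lambda j: scores[j])
```

In the full system this score would be computed for each of the 25 candidates per segment, and the argmax kept.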

Optimization Components:

1. Compression Score
$$E_{\text{comp}}(C_{i,j}, \tau) = 1 - \left|\frac{\text{length}(C_{i,j})}{\text{length}(S_i)} - \tau\right|$$

Measures how well the candidate achieves the target compression ratio $\tau$. Higher scores indicate better alignment with the desired compression level.
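A minimal sketch of the compression score, assuming lengths are word counts (the paper may measure length in tokens or audio duration):

```python
def compression_score(candidate_words, original_words, tau):
    """E_comp: 1 minus the gap between achieved and target ratio."""
    ratio = len(candidate_words) / len(original_words)
    return 1 - abs(ratio - tau)

# Toy example: a 12-word segment shortened to 5 words, target tau = 0.5.
orig = "so um basically what we want to do here is compress speech".split()
cand = "we want to compress speech".split()
score = compression_score(cand, orig, tau=0.5)
```

A candidate whose length ratio exactly hits $\tau$ receives the maximum score of 1.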

2. Number of Edits
$$E_{\text{edits}}(C_{i,j}) = 1 - \frac{\text{number of edits}}{\text{length}(S_i)/2}$$

Minimizes audio artifacts by reducing the number of required edit operations. The Needleman-Wunsch algorithm computes the minimum edit distance between the original and shortened transcripts, which is then normalized by half the segment length.
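The edit count can be computed with the standard dynamic-programming alignment. The sketch below assumes unit-cost word-level insertions, deletions, and substitutions:

```python
def word_edit_count(original, candidate):
    """Minimum number of word-level edit operations between two
    word sequences (Needleman-Wunsch / Levenshtein with unit costs)."""
    n, m = len(original), len(candidate)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if original[i - 1] == candidate[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete a word
                           dp[i][j - 1] + 1,        # insert a word
                           dp[i - 1][j - 1] + cost) # match / substitute
    return dp[n][m]

def edits_score(original, candidate):
    """E_edits: fewer edit operations yields a score closer to 1."""
    return 1 - word_edit_count(original, candidate) / (len(original) / 2)

# Toy example: three filler words removed from an 8-word segment.
orig = "so um we will now talk about compression".split()
cand = "we will talk about compression".split()
```

Normalizing by half the segment length means a candidate that edits every other word would score 0.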

3. Insertion Length
$$E_{\text{len}}(C_{i,j}) = 1 - \frac{\sum \text{length of insertions}}{\text{number of insertions}}$$

Encourages candidates that avoid long insertions while preserving important content. Penalizes lengthy inserted phrases that may disrupt natural speech flow.
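A sketch of the insertion-length penalty, where insertions are the inserted phrases recovered from the transcript alignment. Treating a candidate with no insertions (purely extractive) as receiving the maximum score is our assumption about the edge case:

```python
def insertion_length_score(insertions):
    """E_len: 1 minus the average insertion length (in words).

    `insertions` is a list of inserted phrases, each a list of words.
    No insertions at all (pure extraction) gets the maximum score
    (an assumption about the edge case).
    """
    if not insertions:
        return 1.0
    total = sum(len(phrase) for phrase in insertions)
    return 1 - total / len(insertions)
```

Longer average insertions drive the score down, discouraging abstractive rewrites that synthesize long spans of new speech.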

4. Coverage Score
$$E_{\text{cov}}(C_{i,j}) = \frac{1}{|S_i|} \sum_{s \in S_i} \max_{c \in C_{i,j}} \text{sim}(s, c)$$

Ensures preservation of important information by matching each sentence $s$ in the original segment to the most similar sentence $c$ in the candidate using sentence transformers, then averaging similarity scores.
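The coverage computation can be sketched as follows. To keep the example self-contained, a simple Jaccard word-overlap similarity stands in for the sentence-transformer similarity the system actually uses:

```python
def coverage_score(original_sents, candidate_sents, sim):
    """E_cov: match each original sentence to its most similar
    candidate sentence, then average the best-match similarities."""
    if not candidate_sents:
        return 0.0
    return sum(max(sim(s, c) for c in candidate_sents)
               for s in original_sents) / len(original_sents)

def jaccard(a, b):
    """Word-overlap stand-in for sentence-transformer similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

# Toy example: two original sentences compressed into one.
orig = ["We slice the peppers.", "Then we season them."]
cand = ["We slice and season the peppers."]
score = coverage_score(orig, cand, jaccard)
```

A candidate that drops an original sentence entirely leaves that sentence with only a weak best match, lowering the average and thus the coverage score.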

Evaluation


TalkLess user study results showing significantly lower cognitive load and higher user satisfaction compared to baseline

TalkLess significantly decreased cognitive load and editing effort. All editors preferred TalkLess over the extractive baseline system for editing lecture speech recordings.



Results created with TalkLess

Audio examples are provided at compression levels of 0%, 15%, 25%, 50%, and 75%.

Video Overview

BibTeX

@article{benharrak2025talkless,
  title={TalkLess: Blending Extractive and Abstractive Speech Summarization for Editing Speech to Preserve Content and Style},
  author={Benharrak, Karim and Peng, Puyuan and Pavel, Amy},
  journal={arXiv preprint arXiv:2507.15202},
  year={2025}
}