
Transcribing Theta's own AMA on Theta's own Whisper

  • Whisper
  • EdgeCloud
  • AMA
  • Builder Log

Theta Labs publishes its AMAs as one-hour YouTube livestreams a few times per year. There's no official transcript, no podcast feed, no searchable text. If you miss the stream, you watch the replay or you don't get the content.

A few people in the community have manually transcribed past AMAs, but nobody runs it as a repeatable pipeline.

So I tried something obvious: take Theta Labs' own audio, run it through Theta EdgeCloud's own Whisper service, and see what happens.

About this transcript. The text linked at the end of this post is auto-generated by Whisper on Theta EdgeCloud. It is roughly 90% accurate; the errors cluster in names, technical terms, and about 5–10% of the content overall. The official Theta Labs AMA on YouTube is the source of truth. Use the transcript as a search index and quick reference, and verify any quote against the recording before citing it.

The setup

Three steps (the first two are sketched in code after the list):

  1. Pull audio from the YouTube livestream using yt-dlp
  2. Confirm the file with ffmpeg
  3. Upload to Whisper via the Theta EdgeCloud playground
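
A minimal sketch of the first two steps, assuming yt-dlp and ffmpeg/ffprobe are installed and on PATH. The URL and filenames are placeholders, not the actual stream link:

```python
# Steps 1-2: pull the audio and confirm the file. Assumes yt-dlp and
# ffmpeg/ffprobe are on PATH; URL and filenames are placeholders.
import subprocess

URL = "https://www.youtube.com/watch?v=EXAMPLE"
OUT_TEMPLATE = "ama.%(ext)s"   # yt-dlp fills in the extension; result is ama.mp3

# Step 1: extract the audio track and re-encode it to mp3.
subprocess.run(
    ["yt-dlp", "--extract-audio", "--audio-format", "mp3",
     "-o", OUT_TEMPLATE, URL],
    check=True,
)

# Step 2: confirm the file decodes and check duration/codec before uploading.
subprocess.run(["ffprobe", "-hide_banner", "ama.mp3"], check=True)
```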

The original plan was simpler: have Claude orchestrate the whole flow through Theta's MCP server, the same way I ran the multi-step text/image/vision pipeline a few days earlier. That would have meant one prompt — "transcribe this audio" — and Claude handles the upload, the inference call, and the result.

It almost worked. Whisper itself is exposed through MCP and runs fine. But the upload step — get_upload_url for audio files — returned an error every time I tried it. So I fell back to the playground and uploaded the file manually through the browser. Same result, more clicks.

When that upload path gets fixed (and it will), the entire pipeline collapses into a single prompt. Drop the file in chat, ask for the transcript, get it back. No browser tab, no manual uploads, no context-switching. That's the version of this experiment I actually wanted to write about. This is the prequel.

The audio file: 65 minutes, ~50 MB after extraction. Standard mp3, nothing special.

The first version of this experiment was a 10-minute pilot. That ran fine for the first 8 minutes, then the output broke down into incoherent text in the last 30 seconds. I assumed it was a context-window limit and considered splitting the full hour into segments.

That assumption was wrong. The full hour ran clean from start to finish. The pilot's tail was likely a quirk of how I clipped the audio, not a model limit.

What it cost

This is where it got interesting.

Processing time: about 5 minutes 15 seconds for 65 minutes of audio. Roughly 12x faster than realtime.

Cost: I can't tell you exactly.

The credit balance read $6.88 before the job and $6.88 after. The TDROP rebate counter sat at 4.20 before and 4.20 after. Theta's Billing History tab only logs deposits, not individual API calls. There's no per-job receipt visible to a customer.

What I know for certain: a full hour of Whisper transcription on Theta EdgeCloud cost less than half a cent. The actual number is somewhere below the precision of the dashboard.

For comparison, OpenAI's hosted Whisper API charges $0.006 per minute. An hour there is around $0.36. On Theta EdgeCloud it was a rounding error.
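
The arithmetic behind that comparison, for this specific file:

```python
# Back-of-envelope cost comparison for the 65-minute AMA audio.
# OpenAI's published hosted Whisper price is $0.006 per minute at time of writing.
minutes = 65
openai_cost = minutes * 0.006
print(f"OpenAI hosted Whisper: ${openai_cost:.2f}")   # ~$0.39

# EdgeCloud: the dashboard balance did not move at cent precision,
# so all that can be said is the job cost less than ~$0.01.
print("Theta EdgeCloud: < $0.01 (below dashboard resolution)")
```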

That's a useful data point even if the exact number isn't measurable. The order-of-magnitude difference is the story, not the second decimal place.

A small product note for Theta Labs if anyone reads this: usage-level transparency would help. For anyone evaluating the platform on cost-sensitive workloads, "see what each call cost you" is table stakes.

What it produced

The transcript runs about 56 KB of plain text. Coherent from beginning to end. Speaker turns blur into each other (Whisper doesn't do speaker diarization out of the box) but the conversation is readable. Download the full transcript →

A few things stood out.

Theta-specific terminology landed mostly correct. EdgeCloud, MiniMax M 2.5, GPT OSS 120B, h200s, RTX GPUs, TNT20, TFUEL — all transcribed accurately. That matters. Generic Whisper models often mangle product names and crypto vocabulary.

The most stubborn error was also the most ironic. Whisper consistently heard "Theta" as "data." Theta's own infrastructure couldn't catch its own name. There's a real fix here: Whisper supports prompt biasing where you pre-load expected vocabulary. A Theta-tuned variant — with "Theta," "TFUEL," "TDROP," "EdgeCloud" weighted up — would likely close that gap. That feels like a product opportunity, not just a bug.
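
Here's what that biasing looks like with the open-source whisper package. This shows the general mechanism only; whether EdgeCloud's hosted endpoint exposes an equivalent parameter is an assumption on my part:

```python
# Vocabulary biasing with the open-source openai-whisper package.
# Illustrates the mechanism; whether Theta EdgeCloud's hosted Whisper
# accepts an equivalent parameter is an assumption, not something I verified.
import whisper

model = whisper.load_model("base")
result = model.transcribe(
    "ama.mp3",
    # Terms in initial_prompt bias the decoder toward the expected vocabulary,
    # making "Theta" far less likely to come out as "data".
    initial_prompt="Theta, Theta Labs, TFUEL, TDROP, EdgeCloud, TNT20",
)
print(result["text"])
```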

Some named entities slipped. "Deutsche Telekom" came through as "george telecom." "Wes" appeared as "weston" in one spot. "Cloudician" (the Alibaba Cloud validator partner) came through as "cloudation." Names are where general-purpose ASR usually breaks down, and this followed the pattern.

The structure held. Question, answer, follow-up — all readable. Numbers came through (500,000 TFUEL, 70,000 for Bitcoin, 22% in the last week). The shape of the conversation survives.

Roughly 90% of the content is usable as-is. The remaining 10% is light cleanup — find/replace "data" → "Theta," fix a few names, smooth a few stutters.
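
That cleanup pass is scriptable. A sketch with placeholder filenames, covering only the substitutions called out above; note that a blanket "data" → "Theta" replace will overcorrect legitimate uses of the word, so the output still deserves a quick manual skim:

```python
# Light cleanup over the raw transcript. Filenames are placeholders.
# A blanket "data" -> "Theta" replace will also hit genuine uses of "data",
# so review the result before relying on it.
import re

FIXES = {
    r"\bdata\b": "Theta",
    r"\bgeorge telecom\b": "Deutsche Telekom",
    r"\bcloudation\b": "Cloudician",
}

with open("ama_transcript.txt", encoding="utf-8") as f:
    text = f.read()

for pattern, replacement in FIXES.items():
    text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)

with open("ama_transcript_clean.txt", "w", encoding="utf-8") as f:
    f.write(text)
```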

Why this matters

This isn't a benchmark. It's a use case.

What I just did for one hour of one AMA is the same pipeline that runs for any organization sitting on hours of unsearchable audio. The mechanism doesn't change with scale; only the volume of input does. The per-minute economics stay the same whether you process one hour or ten thousand.

As I understand it, these are some of the businesses this could serve:

  • Universities with decades of recorded lectures, seminars, and conference talks sitting unsearchable in archives. A single faculty's back-catalog can run into thousands of hours.
  • Media companies with podcast libraries, broadcast archives, and interview footage that nobody can search through without watching it linearly. Public broadcasters often hold hundreds of thousands of hours.
  • Sports and esports organizations logging press conferences, post-game interviews, coaching sessions, and broadcast commentary across full seasons.
  • Law firms with deposition recordings, client interviews, and hearing transcripts where every minute of audio needs to become searchable evidence.
  • Government bodies and parliaments publishing hours of recorded debate every week that currently sit behind video players with no text layer.
  • Customer service and call-center operations wanting to index every recorded call for quality assurance, training, or compliance.
  • Researchers doing qualitative interviews, ethnographic fieldwork, oral history projects — anywhere structured audio needs to become structured text.

The bar to run this is one person with a laptop, not a team with a budget. That's what turns EdgeCloud from a demo into a platform.

— Jacob