5 ways to optimise your automatic transcription output


For those who publish audio and video content such as podcasts and vlogs, transcription and subtitles are very powerful tools, as long as they are put to good use.

In fact, a transcript of a podcast or video can be extremely useful for listeners who want to go back and read it later.

At the same time, many people watch videos on mute on social media, so adding subtitles to posted videos is critical to attract and keep their attention. According to widely cited statistics, the chances of a video being watched from start to finish rise as high as 80% if it has closed captions.

No wonder, then, that multimedia editing tools such as Veed.io and Kapwing.com for videos and Podcastle.ai for podcasts offer automatic speech recognition to meet the needs of content creators who want to climb to the top of the SERP and increase engagement.

While it is true that many of these tools promise very high transcription accuracy rates, the reality is different: texts produced by automatic transcription are often far from perfect and require some manual post-editing, sometimes heavy, before they can be published.

In fact, publishing a text that has not been post-edited may be worse than not publishing it at all, as it may be difficult to read and can divert the audience’s attention instead of attracting it.

Most probably, the poor quality of automatic transcription is due to the fact that speech recognition in languages other than English is still inaccurate. However, while we wait patiently for the technology to improve, we can help artificial intelligence make fewer mistakes by taking good care of the quality of the recording.

Indeed, with a few technical tricks, automatic speech recognition will perform far better, even though it won’t produce a text “as if it had been written by a human”. As a result, our intervention afterwards may be limited to light editing instead of a heavy review of a text full of errors. The time saved is remarkable: for a particularly long recording, it can amount to several hours.

The 5 most important requirements to get a good quality recording are:

  1. Using a good quality microphone
  2. Soundproofing your room
  3. Avoiding overlapping dialogues
  4. Speaking standard language
  5. Cleaning up the audio track

Using a good quality microphone

Smartphone and webcam microphones should be avoided because they cannot guarantee high-quality audio. It is therefore essential to invest in a good microphone (we use the Yeti by Blue), preferably one equipped with a pop filter.

The speaker also needs to keep the right distance from the microphone, neither too close, to avoid voice distortion, nor too far away. Ideally, the microphone should be mounted on an appropriate swivel stand.

If you are recording a person sitting at a table, perhaps in front of a computer, the microphone should be isolated rather than placed on the table, so it does not pick up all the vibrations produced, for example, by their hands tapping on the keyboard or mouse.

Soundproofing your room

Secondly, the environment in which the recording takes place - if it is not a studio - should be as soundproof as possible, or at the very least set up to avoid background noise, resonance, and annoying echoes. Smooth, bare walls should be covered. If the room has curtains, they should be closed to muffle the sound.

If the recording takes place outdoors, on the other hand, stay away, as much as possible, from busy streets and noisy places in general. However, bear in mind that after recording outside, cleaning up the audio (see below) may be more time consuming.

Avoiding overlapping dialogues

Automatic speech recognition tends to work worse when voices overlap. Participants should take turns to speak and not interrupt one another. If each person has a microphone, we recommend muting those of the people who are not talking.

Speaking standard language

Another typical problem with speech recognition is non-standard pronunciation. Some systems can be set up to recognize local variants (e.g., the English of India or the Spanish of Mexico), but only for the most widely used languages, such as English and Spanish.

In any case, it is advisable to use standard pronunciation as much as possible, avoiding foreign words, dialects, regional accents, and local idioms. Moreover, you should speak at a regular speed, without mumbling or whispering.

Cleaning up the audio track

From a technical point of view, the audio should be recorded professionally, without interference, echo, feedback or other similar issues. As a general rule, the clearer the pronunciation and the quieter the surroundings, the better the final result.

Warning! Face masks or other face and mouth coverings will muffle the voice and spoil the recording.

When the quality of the recording is not good enough, an audio editing program can help you improve it. By learning the main features and tricks of audio editing programs such as Audacity, you can clean up files considerably.
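
To give an idea of what "cleaning up" can mean in practice, here is a minimal Python sketch of two basic operations: silencing very quiet samples (a crude noise gate) and scaling the loudest peak to a target level (peak normalisation). It assumes the audio is already available as a list of floating-point samples between -1.0 and 1.0. The function names, the threshold, and the target level are illustrative assumptions, not the way any particular audio editor works internally.

```python
def noise_gate(samples, threshold=0.02):
    """Replace samples quieter than the threshold with silence.

    Crude per-sample gate for illustration only; real tools work on
    a smoothed loudness envelope rather than individual samples.
    """
    return [s if abs(s) >= threshold else 0.0 for s in samples]


def normalize_peak(samples, target_peak=0.9):
    """Scale all samples so that the loudest one reaches target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        # Pure silence (or empty input): nothing to scale.
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]


def clean(samples, threshold=0.02, target_peak=0.9):
    """Gate first, then normalise, so background hiss is not amplified."""
    return normalize_peak(noise_gate(samples, threshold), target_peak)


# Hypothetical samples: low-level hiss around a short burst of speech.
samples = [0.01, -0.005, 0.3, -0.45, 0.2, 0.01]
print(clean(samples))
```

Gating before normalising is a deliberate choice in this sketch: doing it the other way round would boost the background noise together with the voice.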

Limits of automatic transcription

As mentioned earlier, even under ideal conditions and with near-perfect recognition, automatic transcription has a major flaw: the resulting text will lack almost all of the formal features commonly used to organize thought and make reading easier.

In texts produced by artificial intelligence, punctuation is usually limited to commas for shorter pauses and periods for longer ones: computers in fact cannot always tell a question from a statement or exclamation by the tone of voice.

The same is true for paragraph division with its associated headings, italics, bold, bullet and numbered lists: you need to add them manually.

Direct speech (everything said “in quotation marks”) and the distinction between different speakers should also be added manually.

Not to mention non-verbal expressions, such as laughter or sighing, which are an integral part of communication and should be included in the subtitles and transcription for better understanding.

Here is a (fictitious) example, in which we see the automatic transcription of a video recipe compared to the same recipe manually edited by a person, with the objective of publishing it as a blog post.

Automatic transcription (first block) and text manually edited and formatted (second block):

Good morning everyone and welcome back to our regular appointment in the kitchen. Today we will prepare the most classic of Italian desserts the tiramisù. This is a simple recipe that has been tried over and over again and has always been a huge success so if you follow it step by step you will surely impress your guests. So let’s start with the ingredients and quantities. First of all the coffee I use the biggest coffee maker I have six eight cups then eight hundred grams of savoiardi biscuits here are two mascarpone cheese packs of two hundred and fifty grams four whole fresh eggs and two hundred and fifty grams of sugar and finally cocoa powder for the topping. This is the recipe to make six servings and is the non-alcoholic one but if you want you can add half a glass of Marsala to the coffee. Then we start the preparation by putting the coffee maker on the stove and then letting the coffee cool down, this is extremely important otherwise then the savoiardi biscuits will get too soggy. While it is cooling we separate the yolk from the white of the eggs, we put the yolks in a large bowl and mix them with the sugar and mascarpone cheese until they form a fluffy cream while we whisk the egg whites in another bowl. Little by little, we then add the egg whites to the cream, mixing from the bottom up, and begin to spread a layer of cream over the bottom of a rectangular baking dish. We soak the savoiardi in cold coffee for a few seconds and place them in the baking dish then pour a layer of cream making it as even as possible and repeat adding more savoiardi and another layer of cream. At this point we cover the whole thing with foil and store it in the fridge for at least three hours and before serving we sprinkle cocoa powder on top. Enjoy!

Classic Tiramisù

Difficulty: Low

Timing: 45 min + 3 h

Tools:

  • Coffee maker
  • 2 large bowls
  • Whisk blender
  • 25 cm rectangular baking dish

Ingredients for 6 servings:

  • 6-8 cups of coffee
  • 800 g savoiardi biscuits
  • 500 g mascarpone cheese
  • 4 whole fresh eggs
  • 250 g sugar
  • Cocoa powder to taste

Preparation

  1. Prepare coffee and let it cool.
  2. Separate the yolk from the white of the eggs.
  3. Mix the egg yolks with the sugar and mascarpone cheese until fluffy.
  4. Whip the egg whites.
  5. Gradually add the egg whites to the cream, mixing from the bottom up.
  6. Spread a layer of cream in the baking dish.
  7. Soak the savoiardi in the cold coffee and arrange them in the baking dish.
  8. Pour in a layer of cream and spread evenly.
  9. Arrange another layer of savoiardi and one of cream.
  10. Cover with foil and refrigerate for at least 3 hours.
  11. Before serving, dust with cocoa powder.

Subtitles

So far we have only talked about transcription, that is, a written or printed version of an audio recording. The above also applies to video subtitles, but with a few nuances that are worth pointing out.

Automatic systems are perfectly capable of breaking up a transcript into subtitles, but the only criteria they apply are two purely technical parameters:

  • maximum number of characters per line (usually 42) and
  • maximum number of lines that can appear at the same time (usually 2).
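
As a rough illustration of what splitting on these two parameters alone looks like, the following Python sketch wraps a text at 42 characters and groups the resulting lines in pairs. It is only an assumption about how such a splitter might behave, not the code of any specific tool; the function and constant names are made up for the example.

```python
import textwrap

MAX_CHARS_PER_LINE = 42  # typical limit mentioned above
MAX_LINES_PER_CUE = 2    # typical limit mentioned above


def naive_subtitles(text, max_chars=MAX_CHARS_PER_LINE,
                    max_lines=MAX_LINES_PER_CUE):
    """Split text into subtitles using only the two technical limits."""
    # Wrap purely on length: the wrapper knows nothing about grammar.
    lines = textwrap.wrap(text, width=max_chars)
    # Group consecutive lines into subtitles of at most max_lines lines.
    return [lines[i:i + max_lines] for i in range(0, len(lines), max_lines)]


sample = (
    "That branch of Lake Como, which turns to noon, between two unbroken "
    "chains of mountains, all to breasts and gulfs"
)
cues = naive_subtitles(sample)
for cue in cues:
    print("\n".join(cue))
    print()
```

Because the only criterion is length, lines end wherever the limit happens to fall, regardless of punctuation or grammar.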

In some cases you can change the appearance of the subtitles, such as the colour of the text, the background, the position, and little more.

On the other hand, professional subtitlers follow specific ground rules to improve readability. These are simple and logical rules, but ones that artificial intelligence still cannot apply on its own.

For example, guidelines specify where a subtitle may be broken into two lines: after punctuation marks, before conjunctions, and before prepositions.

A line break, on the other hand, should never separate a noun from its article or adjective, a first name from a surname, or a verb from its auxiliary, its subject or reflexive pronoun, or a negation.
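
Some of these rules can be approximated in code. The Python sketch below picks the break point of a two-line subtitle by scoring every possible position: a break after punctuation scores highest, a break before a conjunction or preposition comes next, and a break right after an article, auxiliary or negation is penalised. The word lists are tiny, English-only and purely illustrative, and the scoring scheme is an assumption made for this example. Rules that need real grammatical analysis (adjectives, names, pronouns) are not covered.

```python
import string

MAX_CHARS = 42

# Illustrative word lists (assumptions, far from complete).
BREAK_BEFORE = {"and", "but", "or", "because", "in", "on", "of", "to",
                "with", "between", "from", "for"}
NEVER_BREAK_AFTER = {"a", "an", "the", "not", "is", "are", "was", "were",
                     "has", "have", "had", "will"}


def _bare(word):
    """Lower-case a word and strip surrounding punctuation."""
    return word.strip(string.punctuation).lower()


def break_cue(text, max_chars=MAX_CHARS):
    """Split one subtitle into at most two lines at the best-scoring point."""
    words = text.split()
    full = " ".join(words)
    if len(full) <= max_chars:
        return [full]

    best_key, best_lines = None, None
    for i in range(1, len(words)):
        first, second = " ".join(words[:i]), " ".join(words[i:])
        if len(first) > max_chars or len(second) > max_chars:
            continue  # this break would violate the length limit
        prev_word, next_word = words[i - 1], words[i]
        if _bare(prev_word) in NEVER_BREAK_AFTER:
            score = -1  # would strand an article, auxiliary or negation
        elif prev_word[-1] in ",.;:!?":
            score = 3   # break after punctuation
        elif _bare(next_word) in BREAK_BEFORE:
            score = 2   # break before a conjunction or preposition
        else:
            score = 0
        # Among equal scores, prefer lines of similar length.
        key = (score, -abs(len(first) - len(second)))
        if best_key is None or key > best_key:
            best_key, best_lines = key, [first, second]

    if best_lines is None:
        raise ValueError("text does not fit in two lines")
    return best_lines


print(break_cue("That branch of Lake Como, which turns to noon,"))
```

A real subtitling workflow would still need a human check, since a word list cannot tell, for instance, whether "to" is a preposition or part of an infinitive.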

Here is a simulation with the beginning of a famous novel:

Automatic subtitles (first block) and manual subtitles (second block):

That branch of Lake Como, which turns to
noon, between two unbroken chains

of mountains, all to breasts and gulfs, to
depending on the protrusion and indentation of

those, it comes, almost all of a sudden, to
shrink, and to take course and figure

of river, between a headland on the right, and
a wide coastline on the other side; and the

bridge, which there connects the two banks, par
that makes the eye even more sensitive to the

this transformation, and mark the point in
where the lake ceases, and the Adda begins again,

to regain then the name of a lake where the
shores, moving away again, leaving the

the water relax and slow down into new
gulfs and into new breasts.

That branch of Lake Como,
turning to noon,

between two unbroken chains of mountains,
all to breasts and gulfs, according to

of the protrusion and retraction of those,
comes, almost suddenly, to shrink,

and to take course and figure of river,
between a promontory to the right,

and a wide coastline on the other side;
and the bridge, which connects the two banks there,

it seems to make it even more sensitive to the eye
this transformation, and mark the point

in which the lake ceases,
and the Adda begins again,

to take back the name of a lake
where the shores, moving away again,

let the water stretch out and slow down
into new gulfs and into new breasts.

These differences may seem minor and may be missed by non-experts, but they dramatically improve the readability and enjoyment of a captioned video.

Conclusion

Relying on automatic systems to transcribe an audio recording or subtitle a video is a first step, but it is not enough to provide the audience with a top-notch experience and thus increase the number of views.

Even with maximum care for recording quality, many technical limitations will prevent artificial intelligence from achieving results that can be compared to manual work.

For best results, it is worth investing time in fine-tuning the speech recognition output before publishing it, following the directions in this article.

For those who prefer to devote themselves to creating audio and video content instead of worrying about text quality, we have launched two new services: transcription and subtitling. They are designed especially for those who publish podcasts and videos on social media and want to offer their followers editorial-quality texts ready to be posted or added to videos as subtitles.

We can offer you the first 5 minutes of your audio transcription or video subtitles for free within 48 hours!

Technical translator, project manager, mentor, and admirer of ingenuity. Founding member of Qabiria.
