Efficient SRT Translation Using ChatGPT: Tips and Tricks
Translating SubRip Text (SRT) files is a common yet intricate task in video production and subtitling. Managing SRT files and translating them used to be cumbersome, but with the help of AI such as ChatGPT it has become much easier. In this post, I'll go through how ChatGPT can be used to translate SRT files into different languages and which challenges came up along the way.
Understanding the SRT format and ChatGPT's role
First, we need to understand what SRT files are and how they are used.
An SRT file contains a series of time codes and the subtitles that appear on the video during those times. Each subtitle entry, which I'll call an SRT part, consists of a sequential number, a time range, and the subtitle text. For example, this is what an excerpt of an SRT file looks like:
...
208
00:19:32,000 --> 00:19:46,000
Lukas and I are going fishing to the lake. The lake

209
00:19:46,000 --> 00:19:56,000
is located near where Wilhelm lives, around the corner of the school
...
A part may end in the middle of a sentence. That is valid SRT, but as you will see later in this post, it can cause problems for translation.
These SRT files are easy to edit with any text editor and are supported by most video players and platforms, such as VLC and YouTube. When a video is played, the SRT file is loaded alongside it so that each subtitle is displayed at its specified time.
ChatGPT's role is to translate these files into different languages. This means the format has to be preserved and the text of every SRT part has to be translated as accurately as possible. It's crucial to keep the SRT parts and the file format intact, because any discrepancy can cause synchronization issues in the video or even make the file unusable.
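To make the structure of an SRT part concrete, here is a minimal parsing sketch. It assumes the standard SRT layout (parts separated by a blank line, timing lines of the form HH:MM:SS,mmm --> HH:MM:SS,mmm). The names SrtPart, parse_srt and format_srt are made up for this post and are not taken from any library.

```python
import re
from dataclasses import dataclass


@dataclass
class SrtPart:
    index: int
    start: str
    end: str
    text: str


# Standard SRT timing line, e.g. "00:19:32,000 --> 00:19:46,000".
TIMING = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2},\d{3})")


def parse_srt(content: str) -> list[SrtPart]:
    """Split SRT text into parts; parts are separated by blank lines."""
    parts = []
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 2:
            continue  # not a complete part, skipped in this sketch
        match = TIMING.match(lines[1].strip())
        if not lines[0].strip().isdigit() or match is None:
            continue  # malformed part, skipped in this sketch
        text = "\n".join(lines[2:])
        parts.append(SrtPart(int(lines[0]), match.group(1), match.group(2), text))
    return parts


def format_srt(parts: list[SrtPart]) -> str:
    """Serialize parts back to SRT text with one blank line between parts."""
    blocks = [f"{p.index}\n{p.start} --> {p.end}\n{p.text}" for p in parts]
    return "\n\n".join(blocks) + "\n"
```

Keeping the index and timestamps in separate fields means only the text field ever has to be handed to the translation step, which makes it harder to damage the format by accident.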
Token limit
OpenAI's models have a token limit. That means there is a maximum number of tokens that can be handled in a single API request. Exceeding this limit can cause the request to be rejected or the response to be cut off, so it's essential to consider the length of the request content. Keep in mind that the input (the prompt) and the result share the same token limit: the longer the prompt, the shorter the result we can get.
What counts as a token is unfortunately not standardized, because the tokenization process varies between models. There's a rule of thumb for OpenAI models: "One token generally corresponds to ~4 characters of text for common English text." But a rule of thumb is not accurate enough if we want to do this programmatically.
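Since prompt and result share one budget, it helps to work out how much subtitle text fits into a single request. The sketch below is only an illustration: the function name max_chunk_tokens is made up, and the expansion parameter encodes my assumption that a translation needs roughly as many tokens as its source text (the real ratio depends on the language pair).

```python
def max_chunk_tokens(token_limit: int, instruction_tokens: int, expansion: float = 1.0) -> int:
    """Largest subtitle chunk that still leaves room for the translated reply.

    expansion is the assumed ratio of output tokens to input tokens.
    """
    available = token_limit - instruction_tokens
    if available <= 0:
        raise ValueError("the instructions alone exceed the token limit")
    # chunk + chunk * expansion has to fit into what is left of the limit
    return int(available / (1 + expansion))


# With a 4,096-token limit and 96 tokens of instructions:
print(max_chunk_tokens(4096, 96))  # 2000
```

In other words, under these assumptions only about half of the limit is available for the subtitles themselves.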
To avoid hitting the token limit when we translate a large SRT file, we need to chunk the content and send multiple requests that each fit within the limit. To make an educated guess of how many tokens a file contains, we can use a naive estimate or a library that does the calculation for us. The naive estimate is to count every word and every whitespace character as one token.
The other solution is to use a library like tiktoken, a tokenizer that we can run on the text to get the number of tokens. Both solutions are valid, but the tokenizer is, in my opinion, the safer choice. Either way, add a margin to the number of tokens. If the token limit is 1000 and the tokenizer reports 998, in a perfect world we would not need to chunk the content, but instead of taking that risk, chunk it anyway. Chunking is not always easy, though, and can lead to problems if not done correctly.
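The word-and-whitespace estimate takes only a few lines of Python. This is a rough sketch of the idea as I described it; the function name estimate_tokens is made up, and the number it returns is an estimate that can differ from what a real tokenizer reports.

```python
import re


def estimate_tokens(text: str) -> int:
    """Rough estimate: every word and every whitespace character counts as one token."""
    return len(re.findall(r"\S+|\s", text))


# Six words and five spaces
print(estimate_tokens("Lukas and I are going fishing"))  # 11
```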
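Counting with tiktoken and applying a margin could look roughly like this. tiktoken is a third-party package (pip install tiktoken); the helper names, the default model name and the 10% margin are my own assumptions for illustration, not recommended values.

```python
def count_tokens(text: str, model: str = "gpt-3.5-turbo") -> int:
    """Count tokens with tiktoken for the given model."""
    import tiktoken  # imported lazily so the margin helper works without it

    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


def needs_chunking(token_count: int, token_limit: int, margin: float = 0.1) -> bool:
    """True when the count reaches the limit minus a safety margin."""
    return token_count >= token_limit * (1 - margin)


# The example from above: 998 tokens against a limit of 1000
print(needs_chunking(998, 1000))  # True
```

The margin is there because anything that is added to the request later, such as the translation instructions, also counts against the limit.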
Where to chunk the content
Let's say that the SRT file consists of 20,000 tokens according to the tokenizer, and that the token limit is 4,096 tokens. Because an SRT file consists of SRT parts that follow a specific format, we can't risk chunking in the middle of an SRT part and breaking the format:
index
start_time --> end_time
content
This means we need to chunk the file at a boundary between two SRT parts, never inside one. To find those boundaries, we need to parse the document and work out at which lines the file can be split. Once the chunking is done, we can send each chunk, which is now below the token limit, in its own request to the OpenAI API.
But wait! As the first example shows, the content of an SRT part is not always a full sentence; it can be two halves of two different sentences, or one sentence and the first few words of the next sentence. This would not be a problem if all languages behaved the same in terms of grammar and sentence structure, but they don't. If a sentence fragment is translated without the rest of its sentence, the translation can come out wrong, and the overall quality takes a hit. Another problem, great...
But let's not be frowny about that. As software developers, that's what we do: we solve problems, even the ones we create! So, how do we solve this? Well, the solution is to use a heuristic.
The heuristic is this: if a sentence starts in an SRT part but is not completed there, move those words to the next SRT part. Does this affect the timing? Unfortunately, yes: the moved words now appear on screen slightly later than they are spoken. But that's the cost of using a heuristic. Instead of adjusting the timestamps, which would be a whole different problem by itself, we trade accuracy for speed.
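Setting the sentence problem aside for a moment, the chunking step itself could look roughly like the sketch below. It assumes that SRT parts are separated by blank lines and packs whole parts into a chunk until the next part would exceed the budget. The function names are made up, and count_tokens stands for whichever token counter you prefer.

```python
import re
from typing import Callable


def split_parts(content: str) -> list[str]:
    """Return the raw SRT parts; parts are separated by blank lines."""
    blocks = re.split(r"\n\s*\n", content.strip())
    return [block.strip() for block in blocks if block.strip()]


def chunk_parts(
    parts: list[str],
    max_tokens: int,
    count_tokens: Callable[[str], int],
) -> list[list[str]]:
    """Group whole parts into chunks that stay within max_tokens."""
    chunks: list[list[str]] = []
    current: list[str] = []
    current_tokens = 0
    for part in parts:
        # Count the blank-line separator as well.
        part_tokens = count_tokens(part + "\n\n")
        if current and current_tokens + part_tokens > max_tokens:
            chunks.append(current)
            current, current_tokens = [], 0
        # A single part larger than max_tokens still gets its own chunk.
        current.append(part)
        current_tokens += part_tokens
    if current:
        chunks.append(current)
    return chunks
```

Because a chunk is always a list of complete parts, the split can never land between an index, its timing line and its text.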
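Here is a small sketch of that heuristic, working only on the text of each part and leaving the timestamps untouched. It is my own simplified illustration, not necessarily how Bragir implements it: it treats ".", "!" and "?" as sentence endings (so an abbreviation like "Mr." would fool it), and it leaves a part unchanged when it contains no sentence ending at all, so that no part ends up empty.

```python
import re

# A sentence end: ., ! or ?, optionally followed by closing quotes or
# brackets, and then whitespace or the end of the text.
SENTENCE_END = re.compile(r"[.!?][\"')\]]*(?=\s|$)")


def move_unfinished_sentences(texts: list[str]) -> list[str]:
    """Shift the unfinished tail of each part's text to the start of the next part."""
    result = []
    carry = ""
    for position, text in enumerate(texts):
        text = f"{carry} {text}".strip() if carry else text.strip()
        carry = ""
        ends = list(SENTENCE_END.finditer(text))
        is_last = position == len(texts) - 1
        if ends and not is_last:
            cut = ends[-1].end()
            carry = text[cut:].strip()  # words after the last finished sentence
            text = text[:cut]
        result.append(text)
    return result
```

Applied to the two parts from the first example, "The lake" moves from part 208 to the start of part 209, so that each part contains only whole sentences or the full remainder of one.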
Postprocessing
We now have multiple chunks that each contain multiple SRT parts, but to get a single file as a result, we need to concatenate the results from the requests. The main problem here was that there needs to be a blank line between SRT parts, and that separator was missing where two chunks were joined. An easy fix is to add a newline character between the chunks when concatenating them.
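A minimal sketch of the concatenation step, assuming each translated chunk comes back as a string of SRT parts. The function name join_chunks is made up. Stripping each chunk first means the result does not depend on whether the model returned trailing newlines.

```python
def join_chunks(translated_chunks: list[str]) -> str:
    """Join translated chunks with exactly one blank line between SRT parts."""
    cleaned = [chunk.strip() for chunk in translated_chunks if chunk.strip()]
    return "\n\n".join(cleaned) + "\n"
```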
Final thoughts
And there we have it! Those are the problems that came up when trying to translate an English SRT file into other languages using ChatGPT. To summarize, the root problem is the token limit, which in turn generates more problems. So what can we learn from this experience? Well, firstly, chunking a file that has a strict format is cumbersome and can lead to further problems. Secondly, it's sometimes worth trading some accuracy for speed. Lastly, none of these problems would have occurred if there were no token limit, but of course that only exists in a perfect world.
This blog post talks about how the translation of SRT files is handled in our open-source tool, Bragir. If you want to learn more about Bragir, check out the following blog post.