CUSTOMIZING NMT ENGINES FOR JAPANESE SONG LYRICS

What is NMT? Can it transform lyric translations?

A Neural Machine Translation (NMT) engine can bring several benefits to the localization process. By automating translation tasks, NMT models can streamline workflows and reduce turnaround times and operational costs, which benefits both the vendor and the client side of language services. Because an NMT engine learns from large datasets and captures contextual nuance, it can also produce higher-quality translations that better align with cultural expectations and stay consistent across translated content. Knowing that NMT models can be customized and fine-tuned for specific localization projects, my localization partner, Brian Hsieh, and I took on a pilot project to train an NMT engine to translate Japanese song lyrics into English and test the feasibility of using NMT as a long-term solution in the Japanese music industry.

The goals of this pilot project fall into the following 3 main criteria.
PEMT = Post-Edited Machine Translation, HT = Human Translation

Quality Goals: PEMT is written naturally and without grammatical errors.
Timing Goals: PEMT is 80% faster than HT (1125 Characters/Hour PEMT vs 625 Characters/Hour HT)
Pricing Goals: PEMT offers 30% savings over HT ($0.21/Word for PEMT vs $0.30/Characters for HT)
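
As a quick check, the percentages in the timing and pricing goals follow directly from the quoted rates. The sketch below is only illustrative: the function names are our own, and the pricing comparison treats the two quoted prices as if they shared one unit, which is an assumption since one is stated per word and the other per character.

```python
def percent_faster(new_rate, base_rate):
    """How much faster new_rate is than base_rate, in percent."""
    return (new_rate / base_rate - 1) * 100


def percent_savings(new_price, base_price):
    """How much cheaper new_price is than base_price, in percent."""
    return (1 - new_price / base_price) * 100


# Timing goal: 1125 characters/hour (PEMT) vs 625 characters/hour (HT)
timing_goal = percent_faster(1125, 625)

# Pricing goal: $0.21 (PEMT) vs $0.30 (HT), assumed to be in the same unit
pricing_goal = percent_savings(0.21, 0.30)

print(f"Timing goal: {timing_goal:.0f}% faster")
print(f"Pricing goal: {pricing_goal:.0f}% savings")
```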

For the scope of this project, 2 NMT engines (Microsoft Custom Translator and SYSTRAN) were chosen for training and then evaluated against each goal identified above. Here are our lessons learned and findings from this pilot project, including which NMT engine we preferred and how well each goal was reached.

Step 1: Collecting Data

Collecting data turned out to be the most time-consuming part of this process. Because we needed song lyrics translated by human translators, we had to sift through popular song translation websites and copy segments into an Excel file. In addition to manually collecting data, we also used corpus data to provide segments that we thought would help diversify the NMT model during the iterative rounds of training. We initially misjudged the number of segments we could collect from song data alone, so additional time was spent during the actual process collecting more song segments and researching and aligning corpus data. Here is a summary of how our data was used in the NMT engines. Microsoft used all 3 data sets (training, tuning, and testing), whereas SYSTRAN only required the training and testing sets due to its setup environment.

Training Set

Top 75 pop songs in 2020 – Billboard Japan Charts

Top 75 songs from various genres

10,000 segments JESC Corpus (various fictional TV, Movies, books)

5,000 segments OpenSubtitles Corpus (various Japanese dramas)

Tuning Set

Pop songs ranked 50-100 in 2023 – Billboard Japan Charts

Testing Set

Songs ranked 1-50 in 2023 – Billboard Japan Charts

Example of collecting translated song lyrics in Excel
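
Once segments are in a spreadsheet, they still need to be cleaned and split into aligned source and target text before training. The sketch below shows one possible way to do that, assuming the sheet has been exported as tab-separated text with Japanese in the first column and English in the second. The function names and the sample lyric lines are made up for illustration, and dropping repeated lines (such as choruses) is our own choice here, not a requirement of either engine. Check each engine's documentation for the upload formats it actually accepts.

```python
import csv
import io


def load_segments(tsv_text):
    """Parse tab-separated text holding one Japanese/English pair per row."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    pairs, seen = [], set()
    for row in reader:
        if len(row) < 2:
            continue  # blank or malformed row
        ja, en = row[0].strip(), row[1].strip()
        if not ja or not en:
            continue  # one side of the pair is missing
        if (ja, en) in seen:
            continue  # exact repeat, e.g. a chorus line
        seen.add((ja, en))
        pairs.append((ja, en))
    return pairs


def to_parallel_text(pairs):
    """Return line-aligned Japanese and English text, one segment per line."""
    ja_text = "".join(ja + "\n" for ja, _ in pairs)
    en_text = "".join(en + "\n" for _, en in pairs)
    return ja_text, en_text


# Made-up sample rows: a repeated line, a blank row, and a missing translation
sample = (
    "夢を見ていた\tI was dreaming\n"
    "\n"
    "夢を見ていた\tI was dreaming\n"
    "君に会いたい\tI want to see you\n"
    "歌詞だけ\t\n"
)

pairs = load_segments(sample)
ja_text, en_text = to_parallel_text(pairs)
print(len(pairs), "segments kept")
```

Keeping the two sides line-aligned means segment N in the Japanese text always corresponds to segment N in the English text, which is what alignment in a CAT tool relies on.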

Step 2: Training the NMT Engines

In total, 5 rounds of training were conducted, with BLEU scores as the quality metric. Here is a comparison of the BLEU scores for both NMT engines, Microsoft Custom Translator and SYSTRAN. While SYSTRAN’s scores varied over a wider range, Microsoft’s BLEU scores were higher in every round and on average. From this analysis, it was difficult to draw a clear conclusion on how to change the training and tuning sets to boost the BLEU scores. However, the small gains between rounds did suggest that increasing the segment counts and using more song-related segments may raise the scores.

Training Round	Microsoft Custom Translator	SYSTRAN	Model Details (Segments)
Round 1	20.12	16.51	Songs (10K)
Round 2	19	7.25	JESC (10K)
Round 3	20	11.29	Songs + JESC (10K)
Round 4	20.61	14.42	Songs + JESC (20K)
Round 5	20.76	13.93	Songs + JESC + OpenSubtitles (25K)
Range of Scores (Max – Min)	1.76	9.26
Average of Scores	20.10	12.68
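
The summary rows can be reproduced from the per-round scores. The sketch below recomputes the range and average for each engine and also adds the change from Round 1 to Round 5, which is our own extra figure rather than something reported in the table. It is included because a range (max minus min) does not by itself say whether scores went up or down across rounds.

```python
# BLEU scores per training round, copied from the table above
scores = {
    "Microsoft Custom Translator": [20.12, 19, 20, 20.61, 20.76],
    "SYSTRAN": [16.51, 7.25, 11.29, 14.42, 13.93],
}


def summarize(values):
    """Range, average, and first-to-last change, rounded to 2 decimals."""
    return {
        "range": round(max(values) - min(values), 2),
        "average": round(sum(values) / len(values), 2),
        "change_round1_to_round5": round(values[-1] - values[0], 2),
    }


summary = {engine: summarize(values) for engine, values in scores.items()}

for engine, stats in summary.items():
    print(engine, stats)
```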

The decision to go with Microsoft Custom Translator ultimately came down to the following factors. While Microsoft’s processing time was on average 12 times longer than SYSTRAN’s, Brian and I agreed that this was not a severe downside, since training could run in the background outside working hours. On the other 4 factors, including the BLEU score evaluation and the usability of the test results, Microsoft was much better to work with. Ultimately, Microsoft Custom Translator was the chosen NMT engine.

Factor	Microsoft Custom Translator	SYSTRAN
Processing Time	Longer processing time	Shorter processing time
BLEU Scores	Higher BLEU scores	Lower BLEU scores
Test Results	Includes tuning set	Does not include tuning set
Test Results	No time expiration	3-day time expiration
Test Results	Separate files for source and machine translated segments (easier to align in CAT tool)	Merged source and custom machine translated segments (need to manually unmerge for CAT Tool)
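
The manual unmerging step for merged test results could in principle be scripted. The sketch below is purely hypothetical: the layout of SYSTRAN's merged output is not described here, so it assumes each line holds a source segment and its machine translation separated by a tab. The function name and sample lines are made up for illustration.

```python
def split_merged(merged_text, sep="\t"):
    """Split 'source<sep>translation' lines into two aligned lists.

    Assumption: one segment pair per line, separated by `sep`.
    """
    sources, targets = [], []
    for line in merged_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        source, _, target = line.partition(sep)
        sources.append(source.strip())
        targets.append(target.strip())
    return sources, targets


# Made-up example of a merged export
merged = "夢を見ていた\tI was dreaming\n君に会いたい\tI want to see you\n"

sources, targets = split_merged(merged)
print(len(sources), "source segments,", len(targets), "translated segments")
```

If the real export uses a different separator or layout, only the splitting step would need to change.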

Step 3: Measuring the Quality of the Post-Edited Machine Translations

Once the custom translations were completed in each engine, we estimated how much time could be saved in post-editing compared to a fully human-translated project. Post-editing turned out to be not only faster, as we anticipated, but also faster than our initial goal. Afterwards, we asked 2 human evaluators to assess the quality of the PEMT for naturalness and grammatical correctness on a scale of 1 to 4, with 1 being the poorest quality rating and 4 the highest. Both evaluators gave high marks in each category for the sample of PEMT segments, resulting in passing results. Combining the results of our human evaluation surveys with the time it took to complete the post-editing, we determined that this process produced favorable results for the quality and timing goals. The human evaluation scores are shown below.

Evaluator	Sample	Grammar (PEMT Round 1)	Naturalness (PEMT Round 1)	Grammar (PEMT Round 5)	Naturalness (PEMT Round 5)
Evaluator 1	Song 1	4	3	4	3
Evaluator 1	Song 2	4	4	4	4
Evaluator 1	Song 3	4	4	4	4
Evaluator 1	Song 4	4	3	4	4
Evaluator 1	Song 5	4	4	4	4
Evaluator 2	Song 1	3	4	4	4
Evaluator 2	Song 2	4	4	4	4
Evaluator 2	Song 3	4	4	4	4
Evaluator 2	Song 4	4	4	4	4
Evaluator 2	Song 5	4	4	4	4
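
To compare the two rounds at a glance, the ratings can be averaged per category. The sketch below pools both evaluators' scores from the table. Pooling them is our own simplification, and with only 5 songs and 2 evaluators these averages should be read as indicative at best.

```python
from statistics import mean

columns = [
    "Grammar, Round 1",
    "Naturalness, Round 1",
    "Grammar, Round 5",
    "Naturalness, Round 5",
]

# One tuple per song, in the column order above (Songs 1 to 5)
evaluator_1 = [(4, 3, 4, 3), (4, 4, 4, 4), (4, 4, 4, 4), (4, 3, 4, 4), (4, 4, 4, 4)]
evaluator_2 = [(3, 4, 4, 4), (4, 4, 4, 4), (4, 4, 4, 4), (4, 4, 4, 4), (4, 4, 4, 4)]

# Pool both evaluators and average each column
all_ratings = evaluator_1 + evaluator_2
averages = {
    name: mean(row[i] for row in all_ratings)
    for i, name in enumerate(columns)
}

for name, value in averages.items():
    print(f"{name}: {value:.1f} / 4")
```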

Step 4: Evaluating the Quality Goals

Goal	Initial Goal	Findings	Improvements
Quality Goal	Post-edited Machine Translations (PEMT) are written naturally and without grammatical errors.	Human evaluation determined this goal to be achievable.	Although the goal was met, quality assessment could be more thorough. Need larger human evaluation sample for clearer quality assessment.
Timing Goals	PEMT 80% faster than HT (1125 Characters/Hour PEMT vs 625 Characters/Hour HT)	Post-editing rate was calculated based on a sample of 1,470 words post-edited in 1 hour. This is about 4.7 times the standard human translation rate of 312.5 words/hour.	Only very light post-editing was conducted; having a professional bilingual post-editor do the work would give a better assessment.
Pricing Goals	PEMT 30% savings over human translation HT ($0.21/Word for PEMT vs $0.30/Characters for HT)	Project spending on the machine training process was more than originally estimated. Estimated: 30 hours ($1200) Actual: 37 hours ($1480)	In order to negotiate better cost savings for post-editing, we would need to find ways to be more efficient with the machine training process, particularly in the data collection and alignment phase.
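
The timing and pricing findings can be rechecked from the figures in the table. The sketch below is a rough recalculation only. The variable names are our own, and the characters-per-word conversion is an assumption inferred from the two HT rates quoted (625 characters/hour and 312.5 words/hour), not a figure stated in the project.

```python
# Timing: figures quoted in the findings
post_edited_words_per_hour = 1470
ht_words_per_hour = 312.5
ht_chars_per_hour = 625
goal_chars_per_hour = 1125

# Ratio of the post-editing rate to the human translation rate
speed_ratio = post_edited_words_per_hour / ht_words_per_hour

# Assumption: the two HT rates imply this characters-per-word conversion
chars_per_word = ht_chars_per_hour / ht_words_per_hour
goal_words_per_hour = goal_chars_per_hour / chars_per_word

# Pricing: estimated vs actual spending on machine training
estimated_hours, estimated_cost = 30, 1200
actual_hours, actual_cost = 37, 1480

hourly_rate = estimated_cost / estimated_hours
cost_overrun = actual_cost - estimated_cost
overrun_percent = cost_overrun / estimated_cost * 100

print(f"PEMT rate is {speed_ratio:.2f}x the HT rate")
print(f"Timing goal in words/hour: {goal_words_per_hour}")
print(f"Training overrun: ${cost_overrun} ({overrun_percent:.1f}%)")
```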

You can watch a fully detailed analysis of our findings and lessons learned from this NMT training process in the video below. You can also download a copy of our initial and updated proposals.

INITIAL_MT_Pilot_Project_Proposal-1

FINAL_UPDATED_MT_Pilot_Project_Proposal