TransCoder: Unsupervised Translation of Programming Languages (Paper Explained)

Code migration between languages is an expensive and laborious task. To translate from one language to the other, one needs to be an expert at both. Current automatic tools often produce illegible and complicated code. This paper applies unsupervised neural machine translation to source code of Python, C++, and Java and is able to translate between them, without ever being trained in a supervised fashion.

OUTLINE:
0:00 - Intro & Overview
1:15 - The Transcompiling Problem
5:55 - Neural Machine Translation
8:45 - Unsupervised NMT
12:55 - Shared Embeddings via Token Overlap
20:45 - MLM Objective
25:30 - Denoising Objective
30:10 - Back-Translation Objective
33:00 - Evaluation Dataset
37:25 - Results
41:45 - Tokenization
42:40 - Shared Embeddings
43:30 - Human-Aware Translation
47:25 - Failure Cases
48:05 - Conclusion

Abstract:
A transcompiler, also known as source-to-source translator, is a system that converts source code from a high-level programming language (such as C++ or Python) to another. Transcompilers are primarily used for interoperability, and to port codebases written in an obsolete or deprecated language (e.g. COBOL, Python 2) to a modern one. They typically rely on handcrafted rewrite rules, applied to the source code abstract syntax tree. Unfortunately, the resulting translations often lack readability, fail to respect the target language conventions, and require manual modifications in order to work properly. The overall translation process is time-consuming and requires expertise in both the source and target languages, making code-translation projects expensive. Although neural models significantly outperform their rule-based counterparts in the context of natural language translation, their applications to transcompilation have been limited due to the scarcity of parallel data in this domain. In this paper, we propose to leverage recent approaches in unsupervised machine translation to train a fully unsupervised neural transcompiler. We train our model on source code from open source GitHub projects, and show that it can translate functions between C++, Java, and Python with high accuracy. Our method relies exclusively on monolingual source code, requires no expertise in the source or target languages, and can easily be generalized to other programming languages. We also build and release a test set composed of 852 parallel functions, along with unit tests to check the correctness of translations. We show that our model outperforms rule-based commercial baselines by a significant margin.

Authors: Marie-Anne Lachaux, Baptiste Roziere, Lowik Chanussot, Guillaume Lample

Comments:

Translation from machine code will be even better.

XOPOIIIO

Thanks for such a detailed walkthrough. You are a true disseminator of DL/NLP research. I hope you gain a lot of followers so that you can reach many others looking for guidance in understanding research papers. It is also a very productive way to keep yourself engaged during a PhD, and to reinforce your own understanding of things. May I know what tool/gadget you are using to make this so interactive and expressive?

priyamdey

Talking about a paper. Very good idea!

It immediately prioritizes accessibility.

IslandRai

It's amazing that you can take such difficult papers and explain them so clearly. Thank you very much!

TheAmishWarlord

Besides, I don't know FB and have never had an account there, but I really appreciate the work performed by the people who work there. Anyway, amazing channel, Yannic! I discovered it some time ago. Your work is very impressive. Awesome content and valuable knowledge. Thanks and good luck.

markusbuchholz

This is great! You do a good job of explaining it clearly and quickly. Thanks!

evanalmloff

Wow! Back-translation does the trick. We can use this in various problems where we don't have supervised data. In the future, intellectual work will be human + machine collaboration, with machines doing the heavy lifting and humans fine-tuning...
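The back-translation trick the commenter mentions can be sketched with a toy example. The token tables below are hypothetical stand-ins for the paper's learned transformer models; the point is only the data flow: a noisy forward translation of monolingual data is paired with its original sentence to create synthetic parallel supervision for the reverse direction.

```python
# Toy sketch of back-translation (hypothetical token-table "models",
# not the paper's transformer): a noisy Python->C++ translation of
# monolingual data yields synthetic pairs that supervise C++->Python.

# Monolingual Python-side corpus (token sequences only, no parallel data).
python_corpus = [["def", "f", "return"], ["if", "x", "else"]]

# Stand-in for the current Python->C++ model: a crude token table.
py_to_cpp = {"def": "int", "return": "return", "if": "if", "else": "else"}

def translate(tokens, table):
    # Translate token by token, copying unknown tokens (identifiers) as-is.
    return [table.get(t, t) for t in tokens]

def back_translation_pairs(corpus, forward_table):
    # Pair each noisy forward translation with its original sentence;
    # the pair then trains the reverse direction with a clean target.
    return [(translate(sent, forward_table), sent) for sent in corpus]

pairs = back_translation_pairs(python_corpus, py_to_cpp)
print(pairs[0])  # (['int', 'f', 'return'], ['def', 'f', 'return'])
```

As the translator improves, the synthetic pairs get less noisy, which is why iterating this loop bootstraps translation quality from monolingual data alone.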

shivaramkr

What would be even cooler is to train a model to go from a LeetCode prompt to a solution. I think OpenAI or Google is working on this, but yeah...

DistortedV

Reminds me of the time when I thought some Ruby code was Python written by a kid. I was literally treating Ruby as corrupted Python.

37:00 Have you heard of Test-Driven Development? In that method, the unit tests are written adversarially.

41:00 This perplexes me. Neural code generation papers never attempt to make sure that the generated code always compiles, even though it's quite easy if you generate the abstract syntax tree instead of raw characters or tokens. I remember some Microsoft research doing something similar on the SMILES dataset; it worked great.
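The AST point above can be illustrated with the standard library alone. This is a minimal sketch, not any paper's actual decoder: if a generator emits syntax-tree nodes rather than raw tokens, its output is well-formed by construction and always compiles.

```python
# Minimal sketch: building the expression `2 + 3` directly from AST
# nodes, the way a tree-structured generator would, instead of emitting
# raw characters or tokens.
import ast

tree = ast.Expression(
    body=ast.BinOp(left=ast.Constant(2), op=ast.Add(), right=ast.Constant(3))
)
ast.fix_missing_locations(tree)  # fill in required line/column metadata

# Compiling cannot raise a SyntaxError here: the tree is well-formed by
# construction, a guarantee a token-level decoder does not have.
result = eval(compile(tree, filename="<generated>", mode="eval"))
print(result)  # -> 5
```

A token-level model, by contrast, can emit an unbalanced parenthesis at any step and produce code that never parses.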

herp_derpingson

Nicely done.

A couple of comments on unit tests. First, they should be written before the code they test. They should test all the edge cases first (what should happen if 0, -1, null, just inside of bounds, just out of bounds, invalid types, etc. are entered?) Then they should test expected behavior, the purpose of the function under test. Functions should be short, and only do one thing, and ideally only take a parameter or two. Code which does this is much more testable, predictable, readable (in part because functions/variables are easier to name), and more amenable to change. They are also easier to migrate between compiler versions and across languages (I've done this a lot before and after the advent of unit tests, and much prefer the after). I suspect that will improve the results of automatic translation, too.
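The edge-cases-first discipline described above can be sketched with a small example. The `clamp` function here is hypothetical, chosen only because it is short, single-purpose, and takes few parameters, exactly the shape of code the comment recommends.

```python
# Hypothetical illustration of edge-cases-first testing: a short,
# single-purpose function plus tests that probe boundaries and invalid
# input before the happy path.

def clamp(x, lo, hi):
    """Clamp x into the inclusive range [lo, hi]."""
    if lo > hi:
        raise ValueError("lo must not exceed hi")
    return max(lo, min(x, hi))

# Edge cases first: zero, exact bound, just inside, just outside.
assert clamp(0, -1, 1) == 0
assert clamp(-1, -1, 1) == -1        # exactly on the lower bound
assert clamp(0.99, -1, 1) == 0.99    # just inside the upper bound
assert clamp(1.01, -1, 1) == 1       # just outside the upper bound
try:
    clamp(0, 2, 1)                   # invalid range must raise
except ValueError:
    pass
else:
    raise AssertionError("expected ValueError")

# Then the expected behavior: the purpose of the function under test.
assert clamp(5, 0, 10) == 5
print("all clamp tests passed")
```

A function this small is also exactly the unit a transcompiler handles best, which supports the comment's closing suspicion.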

Example functions for standard algorithms tend to run longer than good quality production code. Most derive from ancestors which were first written back before OO, unit tests, and compilers that didn't penalize breaking up code into readable units. They also had to fit on a textbook page or two, and were meant to be read as a unit. I've keyed in many of them that don't even run as written.

fermigas

I had this idea years ago and was laughed at. It was nice seeing this come up in my feed; I had no idea it was being worked on.

davidprivate

Translating from one scripting language to another is next to useless if the translation doesn't take advantage of the new language's features. I suppose they have to start somewhere, but I anticipate this evolving into a multi-step process where they first translate to the new language, then optimize for that language. The end result will probably be a mess of spaghetti code that's totally unreadable to a human, but that might not matter if all the developer cares about is the high-level scripting.

dkwroot

You should actually look into TDD. If your unit test simply reimplements your original code in a different way, then you're doing unit tests wrong. There are unit-test best practices that prevent that from being the case.

yvrelna

In 5 years, we'll just write a design doc in English and the whole thing will come out.

dippatel

The pirate joke @ 46:31 was great. lol

rbain

I think the main point of unit tests is that when you're writing the code for the first time, you're in a great position to define the behavior of the unit, rather than later when you come back to it. With time you will understand that unit of code less, and the understanding will be even worse when it's another person. Of course, if you don't understand what the code's properties are at the point of writing, nothing will save you from replicating those mistakes in the unit tests.

Batsup

The unit-test idea is really cool; I'd like to explore that with language models.

DistortedV

I really loved the way you explained this paper, thanks for making this video! Arr! :)

sanderbos

Love it. The only thing you could improve is your thumbnails, haha. Thanks for these nice vids!

tech

[17:28] «Naturally, the things that mean the same things are going to be in the same place in embedding space, either because they are the same, or because their statistical relation to things which are the same is the same.»

greencoder