TransCoder: Unsupervised Translation of Programming Languages (Paper Explained)

Code migration between languages is an expensive and laborious task. To translate from one language to the other, one needs to be an expert at both. Current automatic tools often produce illegible and complicated code. This paper applies unsupervised neural machine translation to source code of Python, C++, and Java and is able to translate between them, without ever being trained in a supervised fashion.

OUTLINE:
0:00 - Intro & Overview
1:15 - The Transcompiling Problem
5:55 - Neural Machine Translation
8:45 - Unsupervised NMT
12:55 - Shared Embeddings via Token Overlap
20:45 - MLM Objective
25:30 - Denoising Objective
30:10 - Back-Translation Objective
33:00 - Evaluation Dataset
37:25 - Results
41:45 - Tokenization
42:40 - Shared Embeddings
43:30 - Human-Aware Translation
47:25 - Failure Cases
48:05 - Conclusion

Abstract:
A transcompiler, also known as source-to-source translator, is a system that converts source code from a high-level programming language (such as C++ or Python) to another. Transcompilers are primarily used for interoperability, and to port codebases written in an obsolete or deprecated language (e.g. COBOL, Python 2) to a modern one. They typically rely on handcrafted rewrite rules, applied to the source code abstract syntax tree. Unfortunately, the resulting translations often lack readability, fail to respect the target language conventions, and require manual modifications in order to work properly. The overall translation process is time-consuming and requires expertise in both the source and target languages, making code-translation projects expensive. Although neural models significantly outperform their rule-based counterparts in the context of natural language translation, their applications to transcompilation have been limited due to the scarcity of parallel data in this domain. In this paper, we propose to leverage recent approaches in unsupervised machine translation to train a fully unsupervised neural transcompiler. We train our model on source code from open source GitHub projects, and show that it can translate functions between C++, Java, and Python with high accuracy. Our method relies exclusively on monolingual source code, requires no expertise in the source or target languages, and can easily be generalized to other programming languages. We also build and release a test set composed of 852 parallel functions, along with unit tests to check the correctness of translations. We show that our model outperforms rule-based commercial baselines by a significant margin.

Authors: Marie-Anne Lachaux, Baptiste Roziere, Lowik Chanussot, Guillaume Lample

Comments:

Translation from machine code will be even better.

XOPOIIIO

Thanks for such a detailed walkthrough. You are a true disseminator of DL/NLP research. I hope you gain a lot of followers so that you can reach many others looking for guidance in understanding research papers. It is also a very productive way to keep yourself engaged during a PhD, and to reinforce your own understanding of things. May I know what tool/gadget you are using to make this so interactive and expressive?

priyamdey

Talking about a paper. Very good idea!

It immediately prioritizes accessibility.

IslandRai

It's amazing that you can take such difficult papers and explain them so clearly. Thank you very much!

TheAmishWarlord

Besides, I don't know FB and have never had an account there, but I really appreciate the work performed by the people who work there. Anyway, amazing channel, Yannic! I discovered it some time ago. Your work is very impressive. Awesome content and valuable knowledge. Thanks and good luck.

markusbuchholz

This is great! You do a good job of explaining it clearly and quickly. Thanks!

evanalmloff

Wow! Back-translation does the trick. We can use this in various problems where we don't have supervised data. In the future, intellectual work will be human + machine collaboration, with machines doing the heavy lifting and humans fine-tuning...
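The back-translation trick the commenter mentions can be sketched with a toy example. The token tables below are hypothetical stand-ins for the paper's learned transformer models; the point is only the data flow: a noisy forward translation of monolingual data is paired with its original sentence to create synthetic parallel supervision for the reverse direction.

```python
# Toy sketch of back-translation (hypothetical token-table "models",
# not the paper's transformer): a noisy Python->C++ translation of
# monolingual data yields synthetic pairs that supervise C++->Python.

# Monolingual Python-side corpus (token sequences only, no parallel data).
python_corpus = [["def", "f", "return"], ["if", "x", "else"]]

# Stand-in for the current Python->C++ model: a crude token table.
py_to_cpp = {"def": "int", "return": "return", "if": "if", "else": "else"}

def translate(tokens, table):
    # Translate token by token, copying unknown tokens (identifiers) as-is.
    return [table.get(t, t) for t in tokens]

def back_translation_pairs(corpus, forward_table):
    # Pair each noisy forward translation with its original sentence;
    # the pair then trains the reverse direction with a clean target.
    return [(translate(sent, forward_table), sent) for sent in corpus]

pairs = back_translation_pairs(python_corpus, py_to_cpp)
print(pairs[0])  # (['int', 'f', 'return'], ['def', 'f', 'return'])
```

As the translator improves, the synthetic pairs get less noisy, which is why iterating this loop bootstraps translation quality from monolingual data alone.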

shivaramkr

What would be even cooler is to train a model to go from a LeetCode prompt to a solution. I think OpenAI or Google is working on this, but yeah...

DistortedV

Reminds me of the time when I thought some Ruby code was Python written by a kid. I was literally treating Ruby as corrupted Python.

37:00 Have you heard of Test-Driven Development? In that method, the unit tests are written adversarially.

41:00 This perplexes me. Neural code generation papers never attempt to make sure that the generated code always compiles, even though it's quite easy if you generate the abstract syntax tree instead of raw characters or tokens. I remember some Microsoft research doing something similar on the SMILES dataset; it worked great.
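The AST point above can be illustrated with the standard library alone. This is a minimal sketch, not any paper's actual decoder: if a generator emits syntax-tree nodes rather than raw tokens, its output is well-formed by construction and always compiles.

```python
# Minimal sketch: building the expression `2 + 3` directly from AST
# nodes, the way a tree-structured generator would, instead of emitting
# raw characters or tokens.
import ast

tree = ast.Expression(
    body=ast.BinOp(left=ast.Constant(2), op=ast.Add(), right=ast.Constant(3))
)
ast.fix_missing_locations(tree)  # fill in required line/column metadata

# Compiling cannot raise a SyntaxError here: the tree is well-formed by
# construction, a guarantee a token-level decoder does not have.
result = eval(compile(tree, filename="<generated>", mode="eval"))
print(result)  # -> 5
```

A token-level model, by contrast, can emit an unbalanced parenthesis at any step and produce code that never parses.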

herp_derpingson

Nicely done.

A couple of comments on unit tests. First, they should be written before the code they test. They should test all the edge cases first (what should happen if 0, -1, null, just inside of bounds, just out of bounds, invalid types, etc. are entered?) Then they should test expected behavior, the purpose of the function under test. Functions should be short, and only do one thing, and ideally only take a parameter or two. Code which does this is much more testable, predictable, readable (in part because functions/variables are easier to name), and more amenable to change. They are also easier to migrate between compiler versions and across languages (I've done this a lot before and after the advent of unit tests, and much prefer the after). I suspect that will improve the results of automatic translation, too.
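The edge-cases-first discipline described above can be sketched with a small example. The `clamp` function here is hypothetical, chosen only because it is short, single-purpose, and takes few parameters, exactly the shape of code the comment recommends.

```python
# Hypothetical illustration of edge-cases-first testing: a short,
# single-purpose function plus tests that probe boundaries and invalid
# input before the happy path.

def clamp(x, lo, hi):
    """Clamp x into the inclusive range [lo, hi]."""
    if lo > hi:
        raise ValueError("lo must not exceed hi")
    return max(lo, min(x, hi))

# Edge cases first: zero, exact bound, just inside, just outside.
assert clamp(0, -1, 1) == 0
assert clamp(-1, -1, 1) == -1        # exactly on the lower bound
assert clamp(0.99, -1, 1) == 0.99    # just inside the upper bound
assert clamp(1.01, -1, 1) == 1       # just outside the upper bound
try:
    clamp(0, 2, 1)                   # invalid range must raise
except ValueError:
    pass
else:
    raise AssertionError("expected ValueError")

# Then the expected behavior: the purpose of the function under test.
assert clamp(5, 0, 10) == 5
print("all clamp tests passed")
```

A function this small is also exactly the unit a transcompiler handles best, which supports the comment's closing suspicion.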

Example functions for standard algorithms tend to run longer than good quality production code. Most derive from ancestors which were first written back before OO, unit tests, and compilers that didn't penalize breaking up code into readable units. They also had to fit on a textbook page or two, and were meant to be read as a unit. I've keyed in many of them that don't even run as written.

fermigas

I had this idea years ago and was laughed at. It was nice seeing this come up in my feed; I had no idea it was being worked on.

davidprivate

Translating from one scripting language to another is next to useless if the translation doesn't take advantage of the new language's features. I suppose they have to start somewhere, but I anticipate this evolving into a multi-step process where they first translate to the new language, then optimize for that language. The end result will probably be a mess of spaghetti code that's totally unreadable to a human, but that might not matter if all the developer cares about is the high-level scripting.

dkwroot

You should actually look into TDD. If your unit test simply reimplements your original code in a different way, then you're doing unit tests wrong. There are unit-test best practices that prevent that from being the case.

yvrelna

In 5 years, we'll just write a design doc in English and the whole thing will come out.

dippatel

The pirate joke @ 46:31 was great. lol

rbain

I think the main point of unit tests is that when you're writing the code for the first time, you're in a great position to define the behavior of the unit, rather than later when you come back to it. With time you will understand that unit of code less, and the understanding will be even worse when it's another person. Of course, if you don't understand what the code's properties are at the point of writing, nothing will save you from replicating those mistakes in the unit tests.

Batsup

The unit-test idea is really cool; I'd like to explore that with language models.

DistortedV

I really loved the way you explained this paper, thanks for making this video! Arr! :)

sanderbos

Love it. The only thing you could improve is your thumbnails, haha. Thanks for these nice vids!

tech

[17:28] «Naturally, the things that mean the same things are going to be in the same place in embedding space, either because they are the same, or because their statistical relation to things which are the same is the same.»

greencoder