Analyzing Source Code Using Neural Networks: A Case Study
https://medium.com/@tusharma/analyzing-source-code-using-neural-networks-a-case-study-f564dd6f1f69
Code smells indicate the presence of quality issues in source code. An excessive number of smells makes a software system hard to evolve and maintain. In this article, we apply deep learning models based on convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to detect code smells without extensive feature engineering, simply by feeding in the source code in tokenized form.
The following figure provides an overview of the setup. We download 1,072 C# repositories from GitHub and analyze the C# code with Designite to detect smells. We use CodeSplit to extract each method and class definition from the C# programs into a separate file. The learning data generator then uses the detected smells to split the code fragments into positive and negative samples for each smell: positive samples contain the smell, while negative samples are free of it. Tokenizer takes a method or class definition and generates an integer token for each token in the source code. We apply one preprocessing operation, duplicate removal, to the output of Tokenizer. The processed output is then ready to be fed to the neural networks.
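To make the last three steps concrete, here is a minimal Python sketch of sample splitting, integer tokenization, and duplicate removal. It is an illustration under stated assumptions, not the actual implementation of the learning data generator or Tokenizer: the function names (`split_samples`, `tokenize`, `remove_duplicates`), the simplified regex lexer, and the choice to start token IDs at 1 are all hypothetical.

```python
import re

# Simplified lexer (an assumption): identifiers/keywords, integer literals,
# or any single non-whitespace symbol. A real C# lexer handles much more.
TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def split_samples(fragments, smelly_names):
    """Split code fragments into positive (smelly) and negative samples.

    fragments: dict mapping a fragment name to its source text.
    smelly_names: set of fragment names in which the smell was detected.
    """
    positive = [code for name, code in fragments.items() if name in smelly_names]
    negative = [code for name, code in fragments.items() if name not in smelly_names]
    return positive, negative


def tokenize(source, vocab):
    """Map each lexical token to an integer ID, extending vocab as needed."""
    ids = []
    for token in TOKEN_PATTERN.findall(source):
        if token not in vocab:
            # IDs start at 1; 0 is left unused (assumed free for padding).
            vocab[token] = len(vocab) + 1
        ids.append(vocab[token])
    return ids


def remove_duplicates(samples):
    """Drop samples whose token sequence was already seen, keeping order."""
    seen = set()
    unique = []
    for sample in samples:
        key = tuple(sample)  # lists are unhashable, tuples are not
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique


if __name__ == "__main__":
    # Hypothetical fragments; "AddCopy" differs from "Add" only in whitespace.
    fragments = {
        "Add": "int Add(int a, int b) { return a + b; }",
        "AddCopy": "int Add(int a,int b){return a+b;}",
        "Big": "void Big() { }",
    }
    positive, negative = split_samples(fragments, smelly_names={"Big"})

    vocab = {}
    negative_ids = remove_duplicates([tokenize(code, vocab) for code in negative])
    print(len(negative), "negative fragments ->", len(negative_ids), "after deduplication")
```

Because the sketch compares token sequences rather than raw text, two fragments that differ only in whitespace count as duplicates. Whether the original pipeline deduplicates at this level is not stated in the article.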