This dataset is released as part of the Machine Learning for Programming project, which aims to create new kinds of programming tools and techniques based on machine learning and statistical models learned over massive codebases. For more information about the project, tools, and other resources, please visit the main project page.

Overview

We provide a dataset consisting of parsed ASTs that were used to train and evaluate the DeepSyn tool. The Python programs were collected from GitHub repositories by removing duplicate files, removing project forks (copies of existing repositories), keeping only programs that parse and have at most 30'000 nodes in their AST, and attempting to remove obfuscated files. Furthermore, we only used repositories with permissive and non-viral licenses such as MIT, BSD, and Apache. For parsing, we used the Python AST parser included in Python 2.7, which we also include as part of the dataset. The dataset is split into two parts -- 100'000 files used for training and 50'000 files used for evaluation.
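As an illustration, the parse-and-size filter could look like the following sketch (the function name is ours, and we use Python 3's ast module here for brevity; the dataset itself was produced with the Python 2.7 parser shipped in parse_python.py):

import ast

MAX_NODES = 30000  # files with larger ASTs were discarded

def passes_size_filter(source):
    """Return True if the source parses and its AST has at most MAX_NODES nodes."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False  # files that do not parse were discarded
    return sum(1 for _ in ast.walk(tree)) <= MAX_NODES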

Download

Version 1.0 [526.6MB]
An archive of the dataset
The archive contains the following files:
  • parse_python.py -- The parser we used to convert each Python source file into the JSON ASTs of this dataset.
  • python100k_train.json -- Parsed ASTs in JSON format, used for training.
  • python50k_eval.json -- Parsed ASTs in JSON format, used for evaluation.
Redistributable Version [June 2020]
Redistributable version containing a subset of the original repositories and files.
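Since each line of python100k_train.json and python50k_eval.json holds one serialized AST (see the Format section below), the splits can be processed one AST at a time. A minimal loading sketch (the function name is ours):

import json

def iter_asts(path):
    """Yield one parsed AST (a list of node objects) per line of a dataset file."""
    with open(path) as f:
        for line in f:
            yield json.loads(line)

# Example: count the ASTs in the training split.
# print(sum(1 for _ in iter_asts("python100k_train.json")))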

Download - Source Files

Version 1.0 Files [190MB]
An archive of source files used to generate the py150 Dataset
The archive contains the following files:
  • data.tar.gz -- Archive containing all the source files
  • python100k_train.txt -- List of files used in the training dataset.
  • python50k_eval.txt -- List of files used in the evaluation dataset.
  • github_repos.txt -- List of GitHub repositories and their revisions used to obtain the dataset.
Note that the order of python100k_train.txt and python100k_train.json (containing the ASTs of the parsed files) is the same. That is, parsing the n-th file listed in python100k_train.txt produces the n-th AST in python100k_train.json.
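This ordering guarantee makes it straightforward to pair each source file with its AST; a minimal sketch (the function name is ours):

import json

def iter_file_ast_pairs(list_path, ast_path):
    """Yield (source file path, parsed AST) pairs, matched by line order."""
    with open(list_path) as names, open(ast_path) as asts:
        for name, ast_line in zip(names, asts):
            yield name.strip(), json.loads(ast_line)

# for path, tree in iter_file_ast_pairs("python100k_train.txt", "python100k_train.json"):
#     ...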


Published research using this dataset may cite the following paper:
Raychev, V., Bielik, P., and Vechev, M. Probabilistic Model for Code with Decision Trees. In Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA '16), ACM, 2016.

Format

We now briefly explain the JSON format in which each AST is stored. The python100k_train.json and python50k_eval.json files include one such JSON per line. As an example, consider a simple program:
x = 7
print x+1
The serialized AST is as follows (shown pretty-printed here; in the data, each JSON is on a single line):
[ {"type":"Module","children":[1,4]},
    {"type":"Assign","children":[2,3]},
      {"type":"NameStore","value":"x"},
      {"type":"Num","value":"7"},
    {"type":"Print","children":[5]},
      {"type":"BinOpAdd","children":[6,7]},
        {"type":"NameLoad","value":"x"},
        {"type":"Num","value":"1"} ]
As can be seen, the JSON contains an array of objects. Each object contains several name/value pairs:
  • (Required) type: string containing the type of the current AST node.
  • (Optional) value: string containing the value (if any) of the current AST node.
  • (Optional) children: array of integers denoting the indices of the children (if any) of the current AST node. Indices are 0-based, starting from the first node in the JSON.
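Because children are given as 0-based indices into the same array, the tree can be reconstructed by following those indices. A minimal sketch that pretty-prints the example AST above (the function is ours, not part of the dataset):

import json

def print_ast(nodes, idx=0, depth=0):
    """Recursively print a flat AST, following the 0-based child indices."""
    node = nodes[idx]
    label = node["type"]
    if "value" in node:
        label += " = " + node["value"]
    print("  " * depth + label)
    for child in node.get("children", []):
        print_ast(nodes, child, depth + 1)

nodes = json.loads('[{"type":"Module","children":[1,4]},'
                   '{"type":"Assign","children":[2,3]},'
                   '{"type":"NameStore","value":"x"},'
                   '{"type":"Num","value":"7"},'
                   '{"type":"Print","children":[5]},'
                   '{"type":"BinOpAdd","children":[6,7]},'
                   '{"type":"NameLoad","value":"x"},'
                   '{"type":"Num","value":"1"}]')
print_ast(nodes)  # prints Module, Assign, NameStore = x, ... indented by depth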