The first step in building a thriving AutoML research community is making sure that enough high-quality datasets are available to it. This corpus contains a large number of datasets collected and developed under the umbrella of DARPA's D3M program. Each dataset was curated and annotated with extensive metadata so that the AutoML community has access to challenging datasets that go beyond simple tabular data and cover a rich set of problem types and data types. The problem and data types covered include classification (binary, multi-class, and multi-label) and regression (univariate and multivariate) over tabular, text, image, video, and audio data; time series forecasting; object detection; graph problems such as link prediction, vertex nomination, community detection, and collaborative filtering; multi-table relational data; and multiple-instance learning. The corpus aims to bring researchers together in exploring new frontiers of AutoML research.
## Organization
This corpus is organized into seed datasets and training datasets.
```
.
├── seed_datasets
└── training_datasets
    ├── LL0
    └── LL1
```
`seed_datasets` contains sample datasets that provide a flavor of all the major data types and problem types. `training_datasets` contains many more datasets and is used for developing deeper AutoML capabilities. Within `training_datasets`, `LL0` contains the simpler level 0 datasets (tabular datasets) and `LL1` contains the harder level 1 datasets (raw data, graph data, relational data, etc.).
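
To get a quick overview of a local copy, the layout described above can be walked with a few lines of Python. This is a minimal sketch, not part of the corpus tooling: it assumes that each dataset lives in its own subdirectory directly under `seed_datasets`, `training_datasets/LL0`, or `training_datasets/LL1`, and the function name `list_datasets` is hypothetical.

```python
from pathlib import Path

# Group directories taken from the layout described above.
GROUPS = ("seed_datasets", "training_datasets/LL0", "training_datasets/LL1")


def list_datasets(corpus_root):
    """Map each group to the sorted names of its immediate subdirectories.

    Assumption: every dataset is a directory directly under its group.
    Groups that are missing from the local copy map to an empty list.
    """
    root = Path(corpus_root)
    listing = {}
    for group in GROUPS:
        group_dir = root / group
        if group_dir.is_dir():
            listing[group] = sorted(
                entry.name for entry in group_dir.iterdir() if entry.is_dir()
            )
        else:
            listing[group] = []
    return listing
```

Calling `list_datasets(".")` from the corpus root returns a dictionary keyed by group, which makes it easy to count the datasets in each group or to iterate over only the `LL0` tabular ones.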