# Public D3M datasets

![D3M](DARPA_D3M_Logo.png)

The first step in building a thriving AutoML research community is making sure that enough high-quality datasets are available to the community. This corpus contains a large number of datasets collected and developed under the umbrella of DARPA's D3M program. Each dataset in this corpus was painstakingly curated and annotated with extensive metadata to ensure that the AutoML community is presented with challenging datasets that go beyond simple tabular data and cover a rich set of problem types and data types. Among the problem and data types covered by this corpus are classification (binary, multi-class, and multi-label) and regression (univariate and multivariate) over tabular, text, image, video, and audio data; time series forecasting; object detection; graph problems such as link prediction, vertex nomination, community detection, and collaborative filtering; multi-table relational data; multiple-instance learning; and more. This corpus hopes to unite researchers in discovering the new frontiers of AutoML research.

## Organization

This corpus is organized into seed datasets and training datasets.

```
.
├── seed_datasets
└── training_datasets
    ├── LL0
    └── LL1
```

`seed_datasets` contains sample datasets that provide a flavor of all the major data types and problem types. `training_datasets` contains many more datasets and is used for developing deeper AutoML capabilities. Within `training_datasets`, `LL0` contains simpler level 0 datasets (tabular datasets) and `LL1` contains harder level 1 datasets (raw data, graph data, relational data, etc.).

## Downloading

Download datasets using [git LFS](https://git-lfs.github.com/):

```
$ git clone --recursive git@datasets.datadrivendiscovery.org:d3m/datasets.git
```

Note: use `git lfs clone` instead of `git clone` because it is faster.
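For example, the clone above becomes:

```
$ git lfs clone --recursive git@datasets.datadrivendiscovery.org:d3m/datasets.git
```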

This will take time and, especially, disk space. Currently all
datasets together are around 54 GB, but the whole directory with the
cloned repository and git metadata is around 84 GB. Running
`git lfs prune` might help by removing old and unreferenced files.
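For example, inside the cloned repository (`--dry-run` previews what would be removed without deleting anything):

```
$ git lfs prune --dry-run
$ git lfs prune
```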

The repository is organized so that all files larger than 100 KB are
stored in git LFS, while smaller files are managed through git
directly. This makes cloning faster because there is no need
to make many slow HTTP requests for small git LFS files.
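To see which files in your checkout are managed by git LFS, you can run:

```
$ git lfs ls-files
```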

## Partial downloading

It is possible to download only part of the repository. First clone
without downloading files managed by git LFS:

```
$ GIT_LFS_SKIP_SMUDGE=1 git clone --recursive git@datasets.datadrivendiscovery.org:d3m/datasets.git
```

This will download and check out all files smaller than 100 KB,
including all the history.
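Files managed by git LFS are instead checked out as small text pointers; such a pointer file looks roughly like this (hash and size are illustrative):

```
version https://git-lfs.github.com/spec/v1
oid sha256:4d7a...e239
size 1048576
```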

After cloning, you can, for example, download all files of just one dataset.
Run inside the cloned repository:

```
$ git lfs pull -I seed_datasets_current/185_baseball/
```
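The `-I` (`--include`) flag accepts a comma-separated list of paths, so several datasets can be fetched at once (the second path below is a placeholder):

```
$ git lfs pull -I "seed_datasets_current/185_baseball/,seed_datasets_current/<another_dataset>/"
```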

Another way to download only part of the repository is to not
clone all git submodules with `--recursive`, but to initialize
only those you are interested in, as sketched below.
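
A minimal sketch, assuming the data you want lives in a git submodule (the submodule path is a placeholder):

```
$ git clone git@datasets.datadrivendiscovery.org:d3m/datasets.git
$ cd datasets
$ git submodule update --init <path/to/submodule>
```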