# Public D3M datasets

![D3M](DARPA_D3M_Logo.png)

The first step in building a thriving AutoML research community is making sure that there are enough high-quality datasets available to the community. This corpus contains a large number of datasets collected and developed under the umbrella of DARPA's D3M program. Each dataset in this corpus was painstakingly curated and annotated with extensive metadata to ensure that the AutoML community is presented with challenging datasets that go beyond simple tabular datasets and cover a rich set of problem types and data types. Problem and data types covered by this corpus include classification (binary, multi-class, and multi-label) and regression (univariate and multivariate) over tabular, text, image, video, and audio data; time series forecasting; object detection; graph problems such as link prediction, vertex nomination, community detection, and collaborative filtering; multi-table relational data; and multiple-instance learning. This corpus aims to unite researchers in discovering the new frontiers of AutoML research.

## Organization

This corpus is organized into seed datasets and training datasets.

```
.
└── seed_datasets
└── training_datasets
    ├── LL0
    └── LL1
```

`seed_datasets` contains sample datasets that provide a flavor of all the major data types and problem types. `training_datasets` contains many more datasets and is used for developing deeper AutoML capabilities. Within `training_datasets`, `LL0` contains simpler level 0 datasets (tabular datasets) and `LL1` contains harder level 1 datasets (raw data, graph data, relational data, etc.).

## Downloading

Download datasets using [git LFS](https://git-lfs.github.com/):

```
$ git clone --recursive git@datasets.datadrivendiscovery.org:d3m/datasets.git
```

Note: use `git lfs clone` instead of `git clone`, because it is faster.
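
For example, the equivalent `git lfs clone` invocation (it accepts the same arguments as `git clone`):

```
$ git lfs clone --recursive git@datasets.datadrivendiscovery.org:d3m/datasets.git
```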

Cloning will take time and, even more so, disk space. Currently all
datasets together are around 54 GB, but the whole directory with the
cloned repository and git metadata is around 84 GB. Running
`git lfs prune` might help by removing old and unreferenced files.
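
To see what pruning would remove without actually deleting anything, you can do a dry run first:

```
$ git lfs prune --dry-run
```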

The repository is organized so that all files larger than 100 KB are
stored in git LFS, while smaller files are managed through git
directly. This makes cloning faster because it avoids making many
slow HTTP requests for small git LFS files.
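
To check which files in your clone are tracked by git LFS, you can list them:

```
$ git lfs ls-files
```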

## Partial downloading

It is possible to download only part of the repository. First, clone
it without downloading files managed by git LFS:

```
$ GIT_LFS_SKIP_SMUDGE=1 git clone --recursive git@datasets.datadrivendiscovery.org:d3m/datasets.git
```

This will download and check out all files smaller than 100 KB,
including all the history.

After cloning, you can, e.g., download all files of just one dataset.
Run inside the cloned repository:

```
$ git lfs pull -I seed_datasets_current/185_baseball/
```
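
`-I` (short for `--include`) accepts a comma-separated list of paths, so several datasets can be fetched at once. The paths below are illustrative, reusing directories mentioned above:

```
$ git lfs pull -I "seed_datasets_current/185_baseball/,training_datasets/LL0/"
```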

Another way to download only part of the repository is to not clone
all git submodules with `--recursive`, but instead initialize only
those you are interested in.
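
A minimal sketch of that approach, assuming `training_datasets/LL0` is one of the submodules (check `.gitmodules` for the actual submodule paths):

```
$ git clone git@datasets.datadrivendiscovery.org:d3m/datasets.git
$ cd datasets
$ git submodule update --init training_datasets/LL0
```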