# Public D3M datasets

This repository contains public D3M datasets.

## Downloading

Download datasets using [git LFS](https://git-lfs.github.com/):

```
$ git lfs clone git@gitlab.datadrivendiscovery.org:d3m/datasets.git
```

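Note: use `git lfs clone` instead of `git clone` because it
is faster with older versions of git LFS. Newer versions deprecate
`git lfs clone` because plain `git clone` is comparably fast.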

Cloning takes time and, above all, disk space. Currently all
datasets are around 54 GB, but the whole directory with the cloned
repository and git metadata is around 84 GB. Running
`git lfs prune` might help by removing old and unreferenced files.
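
Before pruning, it can help to see how much space the local git LFS store uses and what would be removed. The following is a minimal sketch, not part of this repository: the helper names `lfs_store_kb` and `preview_prune` are hypothetical, and both take the path to the cloned repository as an assumed argument.

```shell
# Hypothetical helpers, not part of this repository.

# Print the size, in KB, of the local git LFS object store of a clone.
# $1 is the path to the cloned repository.
lfs_store_kb() {
    du -sk "$1/.git/lfs" | cut -f1
}

# List what `git lfs prune` would delete, without deleting anything.
# $1 is the path to the cloned repository.
preview_prune() {
    (cd "$1" && git lfs prune --dry-run --verbose)
}
```

For example, `lfs_store_kb .` run inside the cloned repository reports the current size, and `preview_prune .` shows the candidates for removal.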

The repository is organized so that all files larger than 100 KB are
stored in git LFS, while smaller files are managed through git
directly. This makes cloning faster because there is no need
to make many slow HTTP requests for small git LFS files.
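
To see which files in a working tree fall above that threshold, something like the following sketch could be used. The helper name `list_large_files` is hypothetical, and the sketch assumes that 100 KB means 100 * 1024 bytes, which this README does not state.

```shell
# Hypothetical helper, not part of this repository.

# List regular files larger than 102400 bytes under directory $1,
# skipping git's own metadata directory.
# Assumption: "100 KB" is taken to mean 100 * 1024 bytes.
list_large_files() {
    find "$1" -path '*/.git' -prune -o -type f -size +102400c -print
}
```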

## Partial downloading

It is possible to download only part of the repository. First clone
without downloading files managed by git LFS:

```
$ git lfs clone git@gitlab.datadrivendiscovery.org:d3m/datasets.git -X "*"
```

This will download and check out all files smaller than 100 KB.
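
After such a clone, files managed by git LFS are present only as small pointer files. One way to check how many are still pointers is to count them in the output of `git lfs ls-files`. This is a minimal sketch: the helper name `count_lfs_pointers` is hypothetical, and it assumes each output line has the form `<oid> <marker> <path>`, where the marker is `*` for a downloaded file and `-` for a pointer.

```shell
# Hypothetical helper, not part of this repository.

# Count LFS-managed files that are still pointers.
# Reads `git lfs ls-files` output on stdin; the second field is assumed
# to be `*` for a downloaded file and `-` for a pointer-only file.
count_lfs_pointers() {
    awk '$2 == "-" { n++ } END { print n + 0 }'
}

# Run inside the cloned repository:
#   git lfs ls-files | count_lfs_pointers
```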

To download all files of one dataset, run inside the cloned repository:

```
$ git lfs pull -I seed_datasets_current/185_baseball/
```
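
Several datasets can be fetched in one call, because `git lfs pull -I` accepts a comma-separated list of include paths. The sketch below builds that list from dataset directory names. The helper names `lfs_include_list` and `pull_datasets` are hypothetical, and the sketch assumes the datasets live under `seed_datasets_current/` as in the example above.

```shell
# Hypothetical helpers, not part of this repository.

# Build a comma-separated include list for `git lfs pull -I` from
# dataset directory names (assumed to be under seed_datasets_current/).
lfs_include_list() {
    list=""
    for name in "$@"; do
        # Add a comma before every entry except the first.
        list="${list:+$list,}seed_datasets_current/$name/"
    done
    printf '%s\n' "$list"
}

# Download the LFS files of the named datasets.
# Run inside the cloned repository.
pull_datasets() {
    git lfs pull -I "$(lfs_include_list "$@")"
}
```

For example, `pull_datasets 185_baseball other_dataset` would fetch both directories in one request, where `other_dataset` is a placeholder for a real dataset directory name.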