# Public D3M datasets

This repository contains public D3M datasets.

Dataset schemas and related documentation are available in the [data-supply repository](https://gitlab.com/datadrivendiscovery/data-supply).

## Downloading

Download datasets using [git LFS](https://git-lfs.github.com/):

```
$ git lfs clone git@datasets.datadrivendiscovery.org:d3m/datasets.git
```

Note: use `git lfs clone` instead of `git clone` because it
is faster.

Cloning takes time and, above all, disk space. Currently all
datasets together are around 46 GB, but the whole directory with the
cloned repository and git metadata is around 65 GB. Running
`git lfs prune` might help by removing old and unreferenced files.
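
Because a full clone needs around 65 GB, it can be worth checking free disk space before starting. The following is only a sketch: `has_space_for_clone` is a hypothetical helper, not part of this repository, and its 65 GB default is simply the approximate figure quoted above.

```shell
# Hypothetical helper (not part of this repository): succeed only if the
# filesystem holding directory $1 has at least $2 GB available.
# The 65 GB default is the approximate clone size quoted in this README.
has_space_for_clone() {
    dir="${1:-.}"
    required_gb="${2:-65}"
    # `df -Pk` prints POSIX-format output in 1024-byte blocks;
    # column 4 of the second line is the available space.
    available_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    required_kb=$((required_gb * 1024 * 1024))
    [ "$available_kb" -ge "$required_kb" ]
}

# Example usage:
#   has_space_for_clone . && echo "enough space" || echo "not enough space"
```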

The repository is organized so that all files larger than 100 KB are
stored in git LFS, while smaller files are managed through git
directly. This makes cloning faster because it avoids making many
slow HTTP requests for small git LFS files.
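
To get a feel for which files fall on the git LFS side of that threshold, one can list files above 100 KB in a working directory. This is a rough sketch only: `list_large_files` is a hypothetical helper, and it assumes "100 KB" means 100 KiB (102400 bytes), which this README does not state.

```shell
# Hypothetical helper: print regular files under directory $1 (default: the
# current directory) that are larger than 100 KiB, skipping any `.git`
# directory. Assumes the README's "100 KB" means 100 * 1024 bytes.
list_large_files() {
    dir="${1:-.}"
    # `-size +100k` matches files whose size, rounded up to 1 KiB units,
    # is more than 100 units.
    find "$dir" -name .git -prune -o -type f -size +100k -print
}

# Example usage (run inside the cloned repository):
#   list_large_files seed_datasets_current/185_baseball
```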

## Partial downloading

It is possible to download only part of the repository. First clone
without downloading files managed by git LFS:

```
$ git lfs clone git@datasets.datadrivendiscovery.org:d3m/datasets.git -X "*"
```

This downloads and checks out only the files smaller than 100 KB, which are managed by git directly.
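
After such a clone, files managed by git LFS are present only as small text pointer files whose first line is `version https://git-lfs.github.com/spec/v1`. A minimal sketch for telling a pointer apart from real content follows; `is_lfs_pointer` is a hypothetical helper, not something shipped with git LFS or this repository.

```shell
# Hypothetical helper: succeed if file $1 looks like a git LFS pointer,
# i.e. it begins with the pointer format's version line.
is_lfs_pointer() {
    expected="version https://git-lfs.github.com/spec/v1"
    # Read only as many bytes as the expected line has, so large binary
    # files are not read in full. `head -c` is widely available but is
    # not required by POSIX.
    first=$(head -c "${#expected}" "$1" 2>/dev/null)
    [ "$first" = "$expected" ]
}

# Example usage (the path is a placeholder):
#   is_lfs_pointer path/to/file && echo "not downloaded yet"
```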

Then, to download all files of one dataset, run inside the cloned repository:

```
$ git lfs pull -I seed_datasets_current/185_baseball/
```
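
The `-I` option accepts a comma-separated list of paths, so several datasets can be fetched in one call. The sketch below builds such a list; `lfs_include_list` is a hypothetical helper, and it assumes the datasets of interest all live under `seed_datasets_current/`, as in the example above.

```shell
# Hypothetical helper: turn dataset directory names into a comma-separated
# include list suitable for `git lfs pull -I`.
# Assumes every dataset lives under seed_datasets_current/.
lfs_include_list() {
    prefix="seed_datasets_current"
    list=""
    for name in "$@"; do
        # Prepend a comma only when the list already has an entry.
        list="${list:+$list,}$prefix/$name/"
    done
    printf '%s\n' "$list"
}

# Example usage (run inside the cloned repository; the second name is a
# placeholder for any other dataset directory):
#   git lfs pull -I "$(lfs_include_list 185_baseball <other_dataset>)"
```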