# Private D3M datasets

This repository contains private D3M datasets. **Do not distribute them.**

**Public D3M datasets are available [here](https://datasets.datadrivendiscovery.org/d3m/datasets).**

Please report any issues with private datasets in the [data-supply repository](https://gitlab.com/datadrivendiscovery/data-supply/issues).

Dataset schemas and related documentation are available in the [data-supply repository](https://gitlab.com/datadrivendiscovery/data-supply).

## Downloading

Download datasets using [git LFS](https://git-lfs.github.com/):

```
$ git lfs clone git@gitlab.datadrivendiscovery.org:d3m/datasets.git
```

Note: use `git lfs clone` instead of `git clone`. It fetches the git LFS
files in a batch after checkout instead of one at a time, which is faster.

This will take time and, above all, disk space. Currently all
datasets are around 54 GB, but the whole directory with the cloned
repository and git metadata is around 84 GB. Running
`git lfs prune` might help by removing old and unreferenced files.
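
To see how much of that space is taken by the local git LFS object store before pruning, something like the following sketch could be used. The helper name `lfs_store_size` is made up for this example, and it assumes the default layout in which LFS objects live under `.git/lfs`.

```shell
# Hypothetical helper: print the size of the local git LFS object store.
# Assumes the default location .git/lfs inside the given repository directory.
lfs_store_size() {
    repo="${1:-.}"
    if [ -d "$repo/.git/lfs" ]; then
        # du prints "<size><TAB><path>"; keep only the size column.
        du -sh "$repo/.git/lfs" | cut -f1
    else
        echo "no git LFS store found in $repo" >&2
        return 1
    fi
}

# Usage, inside the cloned repository:
#   lfs_store_size .            # size before pruning
#   git lfs prune --dry-run     # preview what would be removed
#   git lfs prune               # remove old and unreferenced files
#   lfs_store_size .            # size after pruning
```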

The repository is organized so that all files larger than 100 KB are
stored in git LFS, while smaller files are managed through git
directly. This makes cloning faster because it avoids making
many slow HTTP requests for small git LFS files.

## Partial downloading

It is possible to download only part of the repository. First, clone
without downloading files managed by git LFS:

```
$ git lfs clone git@gitlab.datadrivendiscovery.org:d3m/datasets.git -X "*"
```

This will download and check out all files smaller than 100 KB.
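
Files managed by git LFS are left in the working tree as small text pointer files until they are pulled. As a rough sketch, the following checks whether a given file is still a pointer by looking at its first line. The function name `is_lfs_pointer` is made up for this example.

```shell
# Hypothetical helper: succeed if the file is a git LFS pointer, not real content.
# Pointer files are small text files whose first line names the LFS spec version.
is_lfs_pointer() {
    [ -f "$1" ] &&
        head -c 100 "$1" | head -n 1 |
        grep -qxF 'version https://git-lfs.github.com/spec/v1'
}

# Usage, inside the cloned repository:
#   if is_lfs_pointer path/to/file; then echo "not downloaded yet"; fi
```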

Then, to download all files of one dataset, run inside the cloned repository:

```
$ git lfs pull -I seed_datasets_current/185_baseball/
```
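
The `-I` option accepts a comma-separated list of paths, so several datasets can be pulled in one command. The sketch below builds such a list from dataset directory names. The helper name `lfs_include_list` is made up for this example, and it assumes the datasets of interest all live under `seed_datasets_current/`, as in the command above.

```shell
# Hypothetical helper: join dataset names into the comma-separated
# include list that `git lfs pull -I` accepts.
# Assumes every dataset lives under seed_datasets_current/.
lfs_include_list() {
    prefix="seed_datasets_current"
    list=""
    for name in "$@"; do
        # Add a comma only when the list already has an entry.
        list="${list:+$list,}$prefix/$name/"
    done
    printf '%s\n' "$list"
}

# Usage, inside the cloned repository (the second name is a placeholder):
#   git lfs pull -I "$(lfs_include_list 185_baseball another_dataset)"
```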