Commit d8b38d91 authored by Mitar

Adding README.

parent bbbb4578
.idea
__pycache__
_IGNORE
_ignore
*.pyc
.ipynb_checkpoints
.DS_Store
validators
data_integrity_check.sh
validation.log
upgrade*.sh
validate*.sh
d3m_data_supply
test:
  stage: build
  image: registry.gitlab.com/datadrivendiscovery/images/core:ubuntu-bionic-python36-devel

  variables:
    GIT_STRATEGY: clone
    GIT_SUBMODULE_STRATEGY: recursive

  before_script:
    - git lfs fetch --all
    - pip3 install cerberus==1.3.1 deep_dircmp==0.1.0
    - git clone --recursive https://gitlab.com/datadrivendiscovery/data-supply.git d3m_data_supply
    - git -C d3m_data_supply checkout 51efe8f74ae2ec223a1540782945beee1f05bf00

  script:
    - |
      set -o errexit

      echo "Checking repository."
      ./git-check.sh

      echo "Updating digests."
      ./update-digest.py

      echo "Validating datasets."
      ./validate.py

      if [ "${CI_COMMIT_REF_NAME}" = master ]; then
        if [ -n "${GIT_ACCESS_USER}" -a -n "${GIT_ACCESS_TOKEN}" ]; then
          echo "Pushing updated digests."
          git remote set-url --push origin "https://${GIT_ACCESS_USER}:${GIT_ACCESS_TOKEN}@datasets.datadrivendiscovery.org/${CI_PROJECT_PATH}.git"
          git config --local user.email noreply@datadrivendiscovery.org
          git config --local user.name "D3M CI"
          if ! git diff --quiet ; then
            git commit -a -m "Generated by CI." -m "Source commit: ${CI_COMMIT_SHA}" -m "[skip ci]"
            if [ "${GIT_DEBUG}" = 1 ]; then
              GIT_TRACE=1 GIT_TRANSFER_TRACE=1 GIT_CURL_VERBOSE=1 git push origin HEAD:refs/heads/master
            else
              git push origin HEAD:refs/heads/master &>/dev/null
            fi
          else
            echo "Nothing changed."
          fi
        fi
      fi
In the commands below, change `.` to the directory under which you want to run them.
Change the example version as well.
# Dependencies
```bash
$ apt-get install git git-lfs jq moreutils
$ pip3 install d3m cerberus deep_dircmp pandas
```
# Adding new datasets to the repository
This repository is structured so that all files larger than 100 KB are stored as
git LFS objects. All other files are stored as regular git objects. This has been
determined to be a good approach to take advantage of both git LFS (handling
large files) and git (handling many small files).
Because of this structure, additional care has to be taken when adding new datasets:
* First copy new files into the working directory of the repository, but do **not**
commit them yet.
* Run the [`git-add.sh`](./git-add.sh) script, which will mark all files larger than
100 KB to be handled by git LFS.
* `git add` and `git commit` the new files. This will make the marked files be stored
using git LFS (see the example below).
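For example, adding a hypothetical new dataset named `example_dataset` (the name and source path below are placeholders) could look like this:
```bash
# Copy new files into the working directory, but do not commit them yet.
$ cp -r /path/to/example_dataset seed_datasets_current/example_dataset

# Mark all files larger than 100 KB to be handled by git LFS.
$ ./git-add.sh

# Add and commit; the marked files are stored using git LFS.
# (.gitattributes records the git LFS tracking added by git-add.sh.)
$ git add .gitattributes seed_datasets_current/example_dataset
$ git commit -m "Adding example_dataset."
```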
It is also important to note that git retains all changes to files.
This is great when git is used as a source code repository, but when handling large
files it means that all old versions of large files are retained as well.
Because of this, make sure you add only the final version of files to the repository.
Do not add and commit initial files and then convert them to the final version
through multiple commits. Add only the final version.
Of course, if changes are necessary months after the initial version of a dataset
was added, then just commit the new version. This is a perfect use of git, so that
users can know what has changed between dataset versions.
# Validation
We have a [`validate.py`](./validate.py) script which can help you validate the datasets
you are adding. It checks for some common issues. Feel free to suggest improvements
to the script.
It also runs in CI to validate MRs and the repository itself.
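For example, assuming the script accepts dataset directories as arguments (in the same way `update-digest.py` does), you could validate just one dataset:
```bash
$ ./validate.py seed_datasets_current/185_baseball/
```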
# Listing all current dataset and problem schema versions
```bash
$ find . -name problemDoc.json | xargs -n 1 -I % bash -c "echo \$(jq '.about.problemSchemaVersion' %) %"
$ find . -name datasetDoc.json | xargs -n 1 -I % bash -c "echo \$(jq '.about.datasetSchemaVersion' %) %"
```
# Setting the schema version of datasets and problems
```bash
$ find . -name problemDoc.json | xargs -n 1 -I % bash -c "jq '.about.problemSchemaVersion=\"3.2.0\"' % | sponge %"
$ find . -name datasetDoc.json | xargs -n 1 -I % bash -c "jq '.about.datasetSchemaVersion=\"3.2.0\"' % | sponge %"
```
# Listing all current dataset and problem versions
```bash
$ find . -name problemDoc.json | xargs -n 1 -I % bash -c "echo \$(jq '.about.problemVersion' %) %"
$ find . -name datasetDoc.json | xargs -n 1 -I % bash -c "echo \$(jq '.about.datasetVersion' %) %"
```
# Setting the version of datasets and problems
```bash
$ find . -name problemDoc.json | xargs -n 1 -I % bash -c "jq '.about.problemVersion=\"2.0\"' % | sponge %"
$ find . -name datasetDoc.json | xargs -n 1 -I % bash -c "jq '.about.datasetVersion=\"2.0\"' % | sponge %"
```
# Public D3M datasets
This repository contains public D3M datasets.
Dataset schemas and related documentation are available in the [data-supply repository](https://gitlab.com/datadrivendiscovery/data-supply).
## Downloading
Download datasets using [git LFS](https://git-lfs.github.com/):
```
$ git lfs clone git@datasets.datadrivendiscovery.org:d3m/datasets.git
```
Note: use `git lfs clone` instead of `git clone` because it
is faster.
This will take time and, especially, disk space. Currently all
datasets are around 46 GB, but the whole directory with the cloned
repository and git metadata is around 65 GB. Running
`git lfs prune` might help by removing old and unreferenced files.
The repository is organized so that all files larger than 100 KB are
stored in git LFS, while smaller files are managed through git
directly. This makes cloning faster because there is no need
to make many HTTP requests for small files, which would be slow with git LFS.
## Partial downloading
It is possible to download only part of the repository. First clone
without downloading files managed by git LFS:
```
$ git lfs clone git@datasets.datadrivendiscovery.org:d3m/datasets.git -X "*"
```
This will download and check out all files smaller than 100 KB.
Now, to download all files of one dataset, run inside the cloned repository:
```
$ git lfs pull -I seed_datasets_current/185_baseball/
```
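If you need more than one dataset, the include filter also accepts comma-separated patterns and wildcards, e.g., to fetch everything under one directory:
```
$ git lfs pull -I "seed_datasets_current/*"
```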
#!/bin/bash -e
# Note: This does not escape filenames. This means it will not correctly track filenames
# containing whitespace, [, ], or other characters which have a special meaning in a git
# pattern, which is what "git lfs track" in fact expects. Such filenames
# have to be escaped and tracked manually.
find * -type f -size +100k -exec git lfs track '{}' +
#!/bin/bash -e
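
# Check that the repository follows the size policy:
#  * no plain git objects of 100 KB or more (those should be in git LFS),
#  * no git LFS objects smaller than 100 KB (those should be plain git objects),
#  * no git LFS objects of 2 GB or more.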
if git rev-list --objects --all \
  | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
  | sed -n 's/^blob //p' \
  | awk '$2 >= 100*(2^10)' \
  | awk '{print $3}' \
  | egrep -v '(^|/).gitattributes$' ; then
  echo "Repository contains committed objects of 100 KB or larger."
  exit 1
fi

if git lfs ls-files --name-only | xargs stat -c '%s %n' | awk '$1 < 100*(2^10)' | awk '{print $2}' | grep . ; then
  echo "Repository contains LFS objects smaller than 100 KB."
  exit 1
fi

if git lfs ls-files --name-only | xargs stat -c '%s %n' | awk '$1 >= 2*(2^30)' | awk '{print $2}' | grep . ; then
  echo "Repository contains LFS objects of 2 GB or larger."
  exit 1
fi
#!/usr/bin/env python3

import json
import os
import os.path
import sys
import time

from d3m.container import dataset as dataset_module


def search_directory(datasets_directory):
    datasets_directory = os.path.abspath(datasets_directory)

    for dirpath, dirnames, filenames in os.walk(datasets_directory, followlinks=True):
        if 'datasetDoc.json' in filenames:
            # Do not traverse further (to not parse a "datasetDoc.json" if it
            # exists among raw data files).
            dirnames[:] = []

            dataset_doc_path = os.path.join(dirpath, 'datasetDoc.json')

            print("Processing '{dataset_doc_path}'.".format(dataset_doc_path=dataset_doc_path))

            try:
                before = time.perf_counter()
                dataset_digest = dataset_module.get_d3m_dataset_digest(dataset_doc_path)
                after = time.perf_counter()
            except Exception as error:
                raise RuntimeError("Unable to compute digest for dataset '{dataset_doc_path}'.".format(dataset_doc_path=dataset_doc_path)) from error

            try:
                with open(dataset_doc_path, 'r') as dataset_doc_file:
                    dataset_doc = json.load(dataset_doc_file)

                if dataset_doc['about'].get('digest', None) == dataset_digest:
                    print("Digest match (took {time:.3f}s).".format(time=after - before))
                    continue

                dataset_doc['about']['digest'] = dataset_digest

                with open(dataset_doc_path, 'w') as dataset_doc_file:
                    # In Python 3.6+ the order of keys in dicts is preserved and we use
                    # this here to try to minimize the line-by-line changes of JSON files.
                    json.dump(dataset_doc, dataset_doc_file, ensure_ascii=False, indent=2)

                print("Digest updated (took {time:.3f}s).".format(time=after - before))
            except Exception as error:
                raise RuntimeError("Unable to update digest for dataset '{dataset_doc_path}'.".format(dataset_doc_path=dataset_doc_path)) from error


def main():
    datasets_directories = sys.argv[1:] or ['.']

    for datasets_directory in datasets_directories:
        search_directory(datasets_directory)


if __name__ == '__main__':
    main()