Commit 9250d3ec authored by Swaroop Vattam

synced to internal D3M repo

parent f30ff9ad
Pipeline #10 passed in 64 minutes and 2 seconds
@@ -11,7 +11,7 @@ test:
- git lfs fetch --all
- pip3 install cerberus==1.3.1 deep_dircmp==0.1.0
- git clone --recursive https://gitlab.com/datadrivendiscovery/data-supply.git
- git -C data-supply checkout 4d67a8acee3fe5236900137a528bc48cf05731a3
- git -C data-supply checkout df915cf20a44f948c8ee2aeb3a15e11d130286d9
script:
- |
@@ -40,3 +40,4 @@ test:
fi
fi
fi
- echo "SUCCESS"
In the commands below, replace `.` with the directory you want to operate on, and adjust the example version numbers as needed.
# Dependencies
```bash
$ apt-get install git git-lfs jq moreutils
$ pip3 install d3m cerberus deep_dircmp pandas
```
# Adding new datasets to the repository
This repository is structured so that all files larger than 100 KB are stored as
git LFS objects, while all other files are stored as regular git objects. This approach
has been found to combine the advantages of git LFS (handling
large files) and plain git (handling many small files).
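For reference, files marked for git LFS are recorded in the repository's `.gitattributes` file; a typical entry written by `git lfs track` looks like this (the path is illustrative):

```
datasets/seed/example/tables/learningData.csv filter=lfs diff=lfs merge=lfs -text
```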
Because of this structure, additional care has to be taken when adding new datasets:
* First copy new files into the working directory of the repository, but do **not**
commit them yet.
* Run the [`git-add.sh`](./git-add.sh) script, which marks all files larger than
100 KB to be handled by git LFS.
* `git add` and `git commit` the new files. The marked files will then be stored
using git LFS.
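The selection step that `git-add.sh` performs can be sketched in Python. This is an illustrative sketch only, not the script itself; the threshold mirrors the script's `find * -type f -size +100k` (where find's `k` unit is 1024 bytes):

```python
import os

# Files strictly larger than this many bytes should be tracked by git LFS.
# Mirrors `find * -type f -size +100k` in git-add.sh.
LFS_THRESHOLD = 100 * 1024

def files_for_lfs(root):
    """Return paths under ``root`` that are large enough for git LFS."""
    large = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip git's own metadata directory.
        dirnames[:] = [d for d in dirnames if d != '.git']
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) > LFS_THRESHOLD:
                large.append(path)
    return sorted(large)
```

Each returned path would then be passed to `git lfs track` before committing.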
Note also that git retains the full history of every file.
This is great when git is used for a source code repository, but when handling large
files it means that all old versions of those large files are retained as well.
Because of this, make sure you add only the final version of files to the repository:
do not commit initial files and then, over multiple commits, work them into
their final form. Add only the final version.
Of course, if changes are necessary months after the initial version of a dataset
was added, then just commit the new version. This is a perfect use of git: users
can see exactly what has changed between dataset versions.
# Validation
We have a [`validate.py`](./validate.py) script which can help you validate datasets
you are adding. It checks for some common issues; feel free to suggest improvements
to the script.
It also runs in CI to validate merge requests and the repository itself.
# Listing all current dataset and problem schema versions
```bash
$ find . -name problemDoc.json | xargs -n 1 -I % bash -c "echo \$(jq '.about.problemSchemaVersion' %) %"
$ find . -name datasetDoc.json | xargs -n 1 -I % bash -c "echo \$(jq '.about.datasetSchemaVersion' %) %"
```
# Setting the schema version of datasets and problems
```bash
$ find . -name problemDoc.json | xargs -n 1 -I % bash -c "jq '.about.problemSchemaVersion=\"3.2.0\"' % | sponge %"
$ find . -name datasetDoc.json | xargs -n 1 -I % bash -c "jq '.about.datasetSchemaVersion=\"3.2.0\"' % | sponge %"
```
# Listing all current dataset and problem versions
```bash
$ find . -name problemDoc.json | xargs -n 1 -I % bash -c "echo \$(jq '.about.problemVersion' %) %"
$ find . -name datasetDoc.json | xargs -n 1 -I % bash -c "echo \$(jq '.about.datasetVersion' %) %"
```
# Setting the version of datasets and problems
```bash
$ find . -name problemDoc.json | xargs -n 1 -I % bash -c "jq '.about.problemVersion=\"2.0\"' % | sponge %"
$ find . -name datasetDoc.json | xargs -n 1 -I % bash -c "jq '.about.datasetVersion=\"2.0\"' % | sponge %"
```
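The `jq`/`sponge` pipelines above can also be expressed as a short Python sketch (the file path, key, and version in the example are illustrative):

```python
import json

def set_version(doc_path, key, version):
    """Set ``about.<key>`` in a problem or dataset doc, rewriting it in place."""
    with open(doc_path, 'r', encoding='utf8') as f:
        doc = json.load(f)
    doc.setdefault('about', {})[key] = version
    with open(doc_path, 'w', encoding='utf8') as f:
        json.dump(doc, f, indent=2)

# Example: set_version('problemDoc.json', 'problemVersion', '2.0')
```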
#!/bin/bash -e
# Note: This does not escape filenames, which means it will not correctly track filenames
# containing whitespace, "[", "]", or other characters that have special meaning in a git
# pattern, which is what "git lfs track" in fact expects. Such filenames
# should be escaped manually.
# This requires git LFS 2.9.0 or newer.
find * -type f -size +100k -exec git lfs track '{}' +
find * -type f -size +100k -exec git lfs track --filename '{}' +
@@ -10,12 +10,17 @@ if git rev-list --objects --all \
exit 1
fi
if git lfs ls-files --name-only | xargs stat -c '%s %n' | awk '$1 < 100*(2^10)' | awk '{print $2}' | grep . ; then
if git lfs ls-files --name-only | xargs -r stat -c '%s %n' | awk '$1 < 100*(2^10)' | awk '{print $2}' | grep . ; then
echo "Repository contains LFS objects smaller than 100 KB."
exit 1
fi
if git lfs ls-files --name-only | xargs stat -c '%s %n' | awk '$1 >= 2*(2^30)' | awk '{print $2}' | grep . ; then
if git lfs ls-files --name-only | xargs -r stat -c '%s %n' | awk '$1 >= 2*(2^30)' | awk '{print $2}' | grep . ; then
echo "Repository contains LFS objects not smaller than 2 GB."
exit 1
fi
if find . -mindepth 1 -maxdepth 1 -not -name git-check.sh -not -name .git -exec grep -r 'version https://git-lfs.github.com/spec/v1' '{}' + ; then
echo "Repository contains LFS pointer files which are not correctly checked out."
exit 1
fi
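The last check above looks for files whose content is still a git LFS pointer stub rather than the real data. A pointer file is a small text file beginning with a fixed version line; detection can be sketched in Python like this (the oid in the test data is made up):

```python
# First line of every git LFS pointer file, per the LFS spec.
LFS_POINTER_PREFIX = b'version https://git-lfs.github.com/spec/v1'

def is_lfs_pointer(path):
    """True if ``path`` holds an un-checked-out git LFS pointer stub."""
    with open(path, 'rb') as f:
        return f.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX
```

Running `git lfs pull` replaces such stubs with the actual file contents.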
@@ -25,8 +25,8 @@
# - Test and train split of datasets used in clustering problems should be the same.
# - Require dataset digest.
# - Dataset entry points should have "learningData" as resource ID.
# - Problem descriptions using "f1", "precision", and "recall" metrics
# should have only two distinct values in target columns, have "posLabel" provided,
# - Problem descriptions using "f1", "precision", "recall", and "jaccardSimilarityScore"
# metrics should have only two distinct values in target columns, have "posLabel" provided,
# and that "posLabel" value should be among target values.
# - No other problem descriptions should have "posLabel" set.
# - "hammingLoss" metric can be used only with multi-label problems.
@@ -57,6 +57,7 @@
# and simple or multi, but not mix).
# - When there is a "multiIndex" column, all rows for the same index value should have the same
# values in all columns except "suggestedTarget" columns.
# - Makes sure that "columnsCount" matches the number of columns, when it exists.
import argparse
import collections
@@ -167,7 +168,7 @@ def validate_metrics(problem_description):
existing_metrics = set()
for metric in problem_description.get('inputs', {}).get('performanceMetrics', []):
if metric['metric'] in ['f1', 'precision', 'recall']:
if metric['metric'] in ['f1', 'precision', 'recall', 'jaccardSimilarityScore']:
if 'posLabel' not in metric:
print("ERROR: Problem uses '{metric}' metric, but 'posLabel' is not provided.".format(
metric=metric['metric'],
@@ -179,7 +180,7 @@
))
error = True
elif 'posLabel' in metric:
print("ERROR: Problem does not use 'f1', 'precision', or 'recall' metric, but 'posLabel' is provided.".format(
print("ERROR: Problem does not use 'f1', 'precision', 'recall', or 'jaccardSimilarityScore' metric, but 'posLabel' is provided.".format(
metric=metric['metric'],
))
error = True
@@ -925,7 +926,7 @@ def validate_target_values(problem_paths, dataset_path, problem_description, dat
error = True
for metric in problem_description.get('inputs', {}).get('performanceMetrics', []):
if metric['metric'] in ['f1', 'precision', 'recall']:
if metric['metric'] in ['f1', 'precision', 'recall', 'jaccardSimilarityScore']:
if number_distinct_values != 2:
print("ERROR: Problem {problem_paths} uses '{metric}' metric, but target column does not have 2 distinct values, but {number_distinct_values}.".format(
problem_paths=problem_paths,
@@ -956,20 +957,66 @@ def validate_target_values(problem_paths, dataset_path, problem_description, dat
return error
def get_all_columns(dataset_path, resource_id, data_resource):
    data_path = os.path.join(os.path.dirname(dataset_path), data_resource['resPath'])

    data = read_csv(data_path)

    data_columns = [{
        'colIndex': column_index,
        'colName': column_name,
        'colType': 'unknown',
        'role': [],
    } for column_index, column_name in enumerate(data.columns)]

    columns = data_resource.get('columns', None)

    if columns is None:
        return data_columns

    if 'columnsCount' in data_resource and data_resource['columnsCount'] != len(data_columns):
        raise ValueError("Dataset '{dataset_path}' has resource '{resource_id}' with incorrect columns count {columns_count} (correct {correct_count}).".format(
            dataset_path=dataset_path,
            resource_id=resource_id,
            columns_count=data_resource['columnsCount'],
            correct_count=len(data_columns),
        ))

    if len(columns) >= len(data_columns):
        columns_names = [{'colIndex': c['colIndex'], 'colName': c['colName']} for c in columns]
        data_columns_names = [{'colIndex': c['colIndex'], 'colName': c['colName']} for c in data_columns]

        if columns_names != data_columns_names:
            raise ValueError("Dataset '{dataset_path}' has resource '{resource_id}' where metadata columns do not match data columns.".format(
                dataset_path=dataset_path,
                resource_id=resource_id,
            ))

        return columns

    else:
        for column in columns:
            if column['colName'] != data_columns[column['colIndex']]['colName']:
                raise ValueError("Dataset '{dataset_path}' has resource '{resource_id}' where column name '{metadata_name}' in metadata does not match column name '{data_name}' in data.".format(
                    dataset_path=dataset_path,
                    resource_id=resource_id,
                    metadata_name=column['colName'],
                    data_name=data_columns[column['colIndex']]['colName'],
                ))

            data_columns[column['colIndex']] = column

        return data_columns
def validate_target(problem_paths, dataset_path, problem_description, dataset_description, target, check_target_values):
    error = False

    try:
        for data_resource in dataset_description['dataResources']:
            if data_resource['resID'] == target['resID']:
                for i, column in enumerate(data_resource['columns']):
                    if column['colIndex'] != i:
                        print("ERROR: Dataset '{dataset_path}' has column with invalid column index '{column_index}'.".format(
                            dataset_path=dataset_path,
                            column_index=column['colIndex'],
                        ))
                        error = True

                columns = get_all_columns(dataset_path, data_resource['resID'], data_resource)

                for column in columns:
                    if target['colName'] == column['colName'] or target['colIndex'] == column['colIndex']:
                        if not (target['colName'] == column['colName'] and target['colIndex'] == column['colIndex']):
                            print("ERROR: Problem {problem_paths} has a target '{target_index}' which does not match a column '{column_index}' in dataset '{dataset_path}' fully.".format(
@@ -1003,6 +1050,12 @@ def validate_target(problem_paths, dataset_path, problem_description, dataset_de
                ))
                return True

    except ValueError as error:
        print("ERROR: {error}".format(
            error=error,
        ))

        return True

    return error