Commit 5b5136e5 authored by Swaroop Vattam

third major batch of dataset release

parent ce9086f6
Pipeline #46 passed with stage
in 185 minutes and 36 seconds
ID: uu3_world_development_indicators_raw_dataset
Name: World development indicators: Life expectancy prediction dataset
Description: The World Development Indicators from the World Bank contain over a thousand annual indicators of economic development from hundreds of countries around the world.
License: The World Banl terms of use: http://web.worldbank.org/WBSITE/EXTERNAL/0,,contentMDK:22547097~pagePK:50016803~piPK:50016805~theSitePK:13,00.html
License Link: https://www.worldbank.org/en/about/legal/terms-of-use-for-datasets
Source: Kaggle
Source Link: https://www.kaggle.com/worldbank/world-development-indicators
License: The World Bank Terms of Use
License Link: https://www.worldbank.org/en/about/legal/terms-of-use-for-datasets
Source: Kaggle
Source Link: https://www.kaggle.com/worldbank/world-development-indicators
Citation: The World Bank: World Development Indicators: Kaggle
-----------------NOTICE------------------
This dataset was collected for use within the DARPA Data Driven Discovery of Models (D3M) program.
ID: SEMI_1040_sylva_prior
Name: SEMI_1040_sylva_prior_dataset
Description: SYLVA is the ecology database
ID: SEMI_1040_sylva_prior_dataset
Name: SEMI 1040 sylva prior dataset
Description: SYLVA is the ecology database
The task of SYLVA is to classify forest cover types. The forest cover type for 30 x 30 meter cells is obtained from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. We reduced it to a two-class classification problem (classifying Ponderosa pine vs. everything else). The agnostic learning track data consists of 216 input variables. Each pattern is composed of 4 records: 2 true records matching the target and 2 records picked at random. Thus 1/2 of the features are distracters. The prior knowledge track data is identical to the agnostic learning track data, except that the distracters are removed and the identity of the features is revealed.
License: CC Public Domain Mark 1.0
License Link: https://creativecommons.org/publicdomain/mark/1.0/
Source: OpenML
Source Link: https://www.openml.org/d/1040
Citation: Datasets from the Agnostic Learning vs. Prior Knowledge Challenge (http://www.agnostic.inf.ethz.ch) Note: Derived from the covertype dataset Dataset from: http://www.agnostic.inf.ethz.ch/datasets.php Modified by TunedIT (converted to ARFF format)
-----------------END------------------
\ No newline at end of file
License: CC-BY License
License Link: http://creativecommons.org/licenses/by/4.0/
Source: OpenML
Source Link: http://www.openml.org/d/1040
Citation: @article{OpenML2013,
author = {Vanschoren, Joaquin and van Rijn, Jan N. and Bischl, Bernd and Torgo, Luis},
title = {OpenML: Networked Science in Machine Learning},
journal = {SIGKDD Explorations},
volume = {15},
number = {2},
year = {2013},
pages = {49--60},
url = {http://doi.acm.org/10.1145/2641190.2641198},
doi = {10.1145/2641190.2641198},
publisher = {ACM},
address = {New York, NY, USA},
}
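The feature arithmetic in the SYLVA description above can be checked with a short sketch. It uses only the figures quoted there (216 input variables, 4 records per pattern, 2 of them true) and assumes, as an illustration only, that the variables divide evenly across the four records. The function name `sylva_feature_counts` is hypothetical and not part of the dataset.

```python
# Figures quoted in the SYLVA description; the even split of variables
# across records is an assumption made for this sketch.
AGNOSTIC_INPUT_VARIABLES = 216
RECORDS_PER_PATTERN = 4
TRUE_RECORDS_PER_PATTERN = 2


def sylva_feature_counts(total_variables=AGNOSTIC_INPUT_VARIABLES,
                         records=RECORDS_PER_PATTERN,
                         true_records=TRUE_RECORDS_PER_PATTERN):
    """Return (variables per record, informative variables, distracter variables)."""
    if total_variables % records != 0:
        raise ValueError("variables do not divide evenly across records")
    per_record = total_variables // records
    informative = per_record * true_records
    distracters = total_variables - informative
    return per_record, informative, distracters


per_record, informative, distracters = sylva_feature_counts()
print(per_record, informative, distracters)  # 54 108 108
```

Under that assumption, half of the 216 variables are distracters, which matches the "1/2 of the features" statement in the description.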
-----------------NOTICE------------------
This dataset was collected for use within the DARPA Data Driven Discovery of Models (D3M) program.
ID: SEMI_1044_eye_movements
Name: SEMI_1044_eye_movements_dataset
Description: Competition 1 (preprocessed data)
ID: SEMI_1044_eye_movements_dataset
Name: SEMI 1044 eye movements dataset
Description: Competition 1 (preprocessed data)
A straightforward classification task. We provide pre-computed feature vectors for each word in the eye movement trajectory, with class labels.
The dataset consists of several assignments. Each assignment consists of a question followed by ten sentences (titles of news articles). One of the sentences is the correct answer to the question (C) and five of the sentences are irrelevant to the question (I). Four of the sentences are relevant to the question (R), but they do not answer it.
License: CC Public Domain Mark 1.0
License Link: https://creativecommons.org/publicdomain/mark/1.0/
Source: OpenML
Source Link: https://www.openml.org/d/1044
Citation: Jarkko Salojarvi, Kai Puolamaki, Jaana Simola, Lauri Kovanen, Ilpo Kojo, Samuel Kaski. Inferring Relevance from Eye Movements: Feature Extraction. Helsinki University of Technology, Publications in Computer and Information Science, Report A82. 3 March 2005. Data set at http://www.cis.hut.fi/eyechallenge2005/
-----------------END------------------
\ No newline at end of file
License: CC-BY License
License Link: http://creativecommons.org/licenses/by/4.0/
Source: OpenML
Source Link: http://www.openml.org/d/1044
Citation: @article{OpenML2013,
author = {Vanschoren, Joaquin and van Rijn, Jan N. and Bischl, Bernd and Torgo, Luis},
title = {OpenML: Networked Science in Machine Learning},
journal = {SIGKDD Explorations},
volume = {15},
number = {2},
year = {2013},
pages = {49--60},
url = {http://doi.acm.org/10.1145/2641190.2641198},
doi = {10.1145/2641190.2641198},
publisher = {ACM},
address = {New York, NY, USA},
}
-----------------NOTICE------------------
This dataset was collected for use within the DARPA Data Driven Discovery of Models (D3M) program.
ID: SEMI_1053_jm1
Name: SEMI_1053_jm1_dataset
Description: JM1 software defect prediction
License: CC Public Domain Mark 1.0
License Link: https://creativecommons.org/publicdomain/mark/1.0/
Source: OpenML
Source Link: https://www.openml.org/d/1053
Citation: Author: [Mike Chapman, Galaxy Global Corporation](Robert.Chapman@ivv.nasa.gov) Source: [PROMISE Repository](http://promise.site.uottawa.ca/SERepository) Please cite: please follow the acknowledgment guidelines posted on [the PROMISE repository web page](http://promise.site.uottawa.ca/SERepository). This is a PROMISE data set made publicly available in order to encourage repeatable, verifiable, refutable, and/or improvable predictive models of software engineering. If you publish material based on PROMISE data sets then, please follow the acknowledgment guidelines posted on [the PROMISE repository web page](http://promise.site.uottawa.ca/SERepository). ## Title/Topic JM1/software defect prediction ## Sources * Creators: NASA, then the NASA Metrics Data Program, http://mdp.ivv.nasa.gov. * Contacts: * Mike Chapman, Galaxy Global Corporation (Robert.Chapman@ivv.nasa.gov) +1-304-367-8341 * Pat Callis, NASA, NASA project manager for MDP (Patrick.E.Callis@ivv.nasa.gov) +1-304-367-8309 * Donor: Tim Menzies (tim@barmag.net)
-----------------END------------------
\ No newline at end of file
ID: SEMI_1053_jm1_dataset
Name: SEMI 1053 jm1 dataset
Description: JM1 software defect prediction
License: CC-BY License
License Link: http://creativecommons.org/licenses/by/4.0/
Source: OpenML
Source Link: http://www.openml.org/d/1053
Citation: @article{OpenML2013,
author = {Vanschoren, Joaquin and van Rijn, Jan N. and Bischl, Bernd and Torgo, Luis},
title = {OpenML: Networked Science in Machine Learning},
journal = {SIGKDD Explorations},
volume = {15},
number = {2},
year = {2013},
pages = {49--60},
url = {http://doi.acm.org/10.1145/2641190.2641198},
doi = {10.1145/2641190.2641198},
publisher = {ACM},
address = {New York, NY, USA},
}
-----------------NOTICE------------------
This dataset was collected for use within the DARPA Data Driven Discovery of Models (D3M) program.
ID: SEMI_1217_click_prediction_small
Name: SEMI_1217_click_prediction_small_dataset
Description: SEMI-SUPERVISED VERSION OF smaller sample of version 1
License: CC Public Domain Mark 1.0
License Link: https://creativecommons.org/publicdomain/mark/1.0/
Source: OpenML
Source Link: https://www.openml.org/d/1217
Citation: -NA-
-----------------END------------------
\ No newline at end of file
ID: SEMI_1217_click_prediction_small_dataset
Name: SEMI 1217 click prediction small dataset
Description: SEMI-SUPERVISED VERSION OF smaller sample of version 1
License: CC-BY License
License Link: http://creativecommons.org/licenses/by/4.0/
Source: OpenML
Source Link: http://www.openml.org/d/1217
Citation: @article{OpenML2013,
author = {Vanschoren, Joaquin and van Rijn, Jan N. and Bischl, Bernd and Torgo, Luis},
title = {OpenML: Networked Science in Machine Learning},
journal = {SIGKDD Explorations},
volume = {15},
number = {2},
year = {2013},
pages = {49--60},
url = {http://doi.acm.org/10.1145/2641190.2641198},
doi = {10.1145/2641190.2641198},
publisher = {ACM},
address = {New York, NY, USA},
}
-----------------NOTICE------------------
This dataset was collected for use within the DARPA Data Driven Discovery of Models (D3M) program.
ID: SEMI_1459_artificial_characters
Name: SEMI_1459_artificial_characters_dataset
Description: This database has been artificially generated. It describes the structure of the capital letters A, C, D, E, F, G, H, L, P, R, indicated by a number 1-10, in that order (A=1,C=2,...). Each letter's structure is described by a set of segments (lines) which resemble the way an automatic program would segment an image. The dataset consists of 600 such descriptions per letter.
License: CC Public Domain Mark 1.0
License Link: https://creativecommons.org/publicdomain/mark/1.0/
Source: OpenML
Source Link: https://www.openml.org/d/1459
Citation: [UCI](https://archive.ics.uci.edu/ml/citation_policy.html)
-----------------END------------------
\ No newline at end of file
ID: SEMI_1459_artificial_characters_dataset
Name: SEMI 1459 artificial characters dataset
Description: This database has been artificially generated. It describes the structure of the capital letters A, C, D, E, F, G, H, L, P, R, indicated by a number 1-10, in that order (A=1,C=2,...). Each letter's structure is described by a set of segments (lines) which resemble the way an automatic program would segment an image. The dataset consists of 600 such descriptions per letter.
License: CC-BY License
License Link: http://creativecommons.org/licenses/by/4.0/
Source: OpenML
Source Link: http://www.openml.org/d/1459
Citation: @article{OpenML2013,
author = {Vanschoren, Joaquin and van Rijn, Jan N. and Bischl, Bernd and Torgo, Luis},
title = {OpenML: Networked Science in Machine Learning},
journal = {SIGKDD Explorations},
volume = {15},
number = {2},
year = {2013},
pages = {49--60},
url = {http://doi.acm.org/10.1145/2641190.2641198},
doi = {10.1145/2641190.2641198},
publisher = {ACM},
address = {New York, NY, USA},
}
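The label encoding given in the artificial characters description above (letters A, C, D, E, F, G, H, L, P, R numbered 1-10 in that order) can be written out as a lookup table. This is a minimal sketch; the names `LABEL_TO_LETTER` and `LETTER_TO_LABEL` are hypothetical and not taken from the dataset files.

```python
# Letters in the order listed in the dataset description.
LETTERS = ["A", "C", "D", "E", "F", "G", "H", "L", "P", "R"]

# Class labels start at 1, so A=1, C=2, ..., R=10.
LABEL_TO_LETTER = {label: letter for label, letter in enumerate(LETTERS, start=1)}
LETTER_TO_LABEL = {letter: label for label, letter in LABEL_TO_LETTER.items()}

print(LABEL_TO_LETTER[2])   # C
print(LETTER_TO_LABEL["R"])  # 10
```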
-----------------NOTICE------------------
This dataset was collected for use within the DARPA Data Driven Discovery of Models (D3M) program.
ID: SEMI_155_pokerhand
Name: SEMI_155_pokerhand_dataset
Description: Normalized version of the pokerhand data set.
License: CC Public Domain Mark 1.0
License Link: https://creativecommons.org/publicdomain/mark/1.0/
Source: OpenML
Source Link: https://www.openml.org/d/155
Citation: -NA-
-----------------END------------------
\ No newline at end of file
ID: SEMI_155_pokerhand_dataset
Name: SEMI 155 pokerhand dataset
Description: Normalized version of the pokerhand data set.
License: CC-BY License
License Link: http://creativecommons.org/licenses/by/4.0/
Source: OpenML
Source Link: http://www.openml.org/d/155
Citation: @article{OpenML2013,
author = {Vanschoren, Joaquin and van Rijn, Jan N. and Bischl, Bernd and Torgo, Luis},
title = {OpenML: Networked Science in Machine Learning},
journal = {SIGKDD Explorations},
volume = {15},
number = {2},
year = {2013},
pages = {49--60},
url = {http://doi.acm.org/10.1145/2641190.2641198},
doi = {10.1145/2641190.2641198},
publisher = {ACM},
address = {New York, NY, USA},
}
ID: political_instability_dataset
Name: Forecasting the presence of political instability
Description: Yearly occurrence of political instability from 1974 to 2003
License: CC0 - Public Domain Dedication
License Link: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/29715
Source: Harvard Dataverse
Source Link: https://doi.org/10.7910/DVN/29715
Citation: @data{DVN/29715_2015, author = {Goldsmith, Benjamin E and Butcher, Charles R and Semenovich, Dimitri and Sowmya, Arcot}, publisher = {Harvard Dataverse}, title = {{Replication data for: Forecasting the onset of genocide and politicide: Annual out-of-sample forecasts on a global dataset, 1988–2003}}, UNF = {UNF:5:fqBFZdPDGOQweuIffdQ2pQ==}, year = {2015}, version = {V1}, doi = {10.7910/DVN/29715}, url = {https://doi.org/10.7910/DVN/29715}}
ID: uu3_world_development_indicators_dataset
Name: World development indicators: Life expectancy prediction dataset
Description: The World Development Indicators from the World Bank contain over a thousand annual indicators of economic development from hundreds of countries around the world.
License: The World Banl terms of use: http://web.worldbank.org/WBSITE/EXTERNAL/0,,contentMDK:22547097~pagePK:50016803~piPK:50016805~theSitePK:13,00.html
License Link: https://www.worldbank.org/en/about/legal/terms-of-use-for-datasets
Source: Kaggle
Source Link: https://www.kaggle.com/worldbank/world-development-indicators
License: The World Bank Terms of Use
License Link: https://www.worldbank.org/en/about/legal/terms-of-use-for-datasets
Source: Kaggle
Source Link: https://www.kaggle.com/worldbank/world-development-indicators
Citation: The World Bank: World Development Indicators: Kaggle
-----------------NOTICE------------------
This dataset was collected for use within the DARPA Data Driven Discovery of Models (D3M) program.
ID: uu5_heartstatlog
Name: uu5_heartstatlog_dataset
Description: This is a two-class classification problem to distinguish between absence and presence of heart disease.
ID: uu5_heartstatlog_dataset
Name: Statlog (Heart) Data Set
Description: This is a two-class classification problem to distinguish between absence and presence of heart disease.
For LUPI processing, the features are split into two groups:
- standard features (columns 1-6) are physically observable properties during a routine doctor visit
- privileged features (columns 7-13) are physically observable properties measured during an expensive and time-consuming "stress test" procedure.
License: CC-BY license
License Link: -NA-
Source: OpenML
Source Link: https://archive.ics.uci.edu/ml/datasets/statlog+(heart)
Citation: @misc{Dua:2017, author = {Dheeru, Dua and Karra Taniskidou, Efi}, year = {2017}, title = {{UCI} Machine Learning Repository}, url = {http://archive.ics.uci.edu/ml}, institution = {University of California, Irvine, School of Information and Computer Sciences} }
-----------------END------------------
\ No newline at end of file
License: CC-BY License
License Link: http://creativecommons.org/licenses/by/4.0/
Source: OpenML
Source Link: https://www.openml.org/d/53
Citation: @article{OpenML2013,
author = {Vanschoren, Joaquin and van Rijn, Jan N. and Bischl, Bernd and Torgo, Luis},
title = {OpenML: Networked Science in Machine Learning},
journal = {SIGKDD Explorations},
volume = {15},
number = {2},
year = {2013},
pages = {49--60},
url = {http://doi.acm.org/10.1145/2641190.2641198},
doi = {10.1145/2641190.2641198},
publisher = {ACM},
address = {New York, NY, USA},
}
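The LUPI split described for the Statlog (Heart) data set above (standard columns 1-6, privileged columns 7-13) could be applied to a feature row roughly as follows. This is a sketch under stated assumptions: the column numbers are taken to be 1-based positions among the feature columns only, and the helper `split_lupi` and the placeholder row are hypothetical, not part of the dataset tooling.

```python
def split_lupi(row, standard_cols, privileged_cols):
    """Split one feature row into (standard, privileged) lists.

    Column numbers are 1-based, matching the dataset description (assumed).
    """
    standard = [row[c - 1] for c in standard_cols]
    privileged = [row[c - 1] for c in privileged_cols]
    return standard, privileged


# Column groups as stated in the description: 1-6 standard, 7-13 privileged.
HEART_STANDARD_COLS = range(1, 7)
HEART_PRIVILEGED_COLS = range(7, 14)

# Placeholder row of 13 values; these are not real patient measurements.
example_row = list(range(101, 114))

standard, privileged = split_lupi(example_row, HEART_STANDARD_COLS, HEART_PRIVILEGED_COLS)
print(len(standard), len(privileged))  # 6 7
```

In a LUPI setting the privileged list would be available at training time only, so a model evaluated on new data would see just the standard list.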
-----------------NOTICE------------------
This dataset was collected for use within the DARPA Data Driven Discovery of Models (D3M) program.
ID: uu6_hepatitis
Name: uu6_hepatitis_dataset
Description: This is a two-class classification problem to distinguish between DIE and LIVE classes in the context of hepatitis disease.
ID: uu6_hepatitis_dataset
Name: Hepatitis Data Set
Description: This is a two-class classification problem to distinguish between DIE and LIVE classes in the context of hepatitis disease.
For LUPI processing, the features are split into two groups:
- standard features (columns 2-19) are physically observable medical/lab properties.
- privileged feature (column 20) is the result of histology, which requires its own time-consuming procedure.
License: CC-BY license
License Link: -NA-
Source: UCI
Source Link: https://archive.ics.uci.edu/ml/datasets/hepatitis
Citation: @misc{Dua:2017, author = {Dheeru, Dua and Karra Taniskidou, Efi}, year = {2017}, title = {{UCI} Machine Learning Repository}, url = {http://archive.ics.uci.edu/ml}, institution = {University of California, Irvine, School of Information and Computer Sciences} }
-----------------END------------------
\ No newline at end of file
License: open
License Link: https://archive.ics.uci.edu/ml/citation_policy.html
Source: UCI Machine Learning Repository
Source Link: https://archive.ics.uci.edu/ml/datasets/hepatitis
Citation: @inproceedings{cruz2015grouping, title={Grouping similar trajectories for carpooling purposes},author={Cruz, Michael O and Macedo, Hendrik and Guimaraes, Adolfo},booktitle={Intelligent Systems (BRACIS), 2015 Brazilian Conference on},pages={234--239}, year={2015},organization={IEEE}}
-----------------NOTICE------------------
This dataset was collected for use within the DARPA Data Driven Discovery of Models (D3M) program.
ID: uu7_pima_diabetes
Name: uu7_pima_diabetes_dataset
Description: This is a two-class classification problem to distinguish between absence and presence of diabetes.
ID: uu7_pima_diabetes_dataset
Name: Pima Diabetes Data Set
Description: This is a two-class classification problem to distinguish between absence and presence of diabetes.
For LUPI processing, the features are split into two groups:
- standard features (columns 2-6, 8) are physically observable properties during a routine doctor visit
- privileged features (columns 1, 7) are private information (number of pregnancies and diabetes pedigree function, which is the presence of diabetes among the patient's relatives), which may not be available due to lack of recordkeeping.
License: CC-BY license
License Link: -NA-
Source: OpenML
Source Link: https://www.openml.org/d/37
Citation: @misc{Dua:2017, author = {Dheeru, Dua and Karra Taniskidou, Efi}, year = {2017}, title = {{UCI} Machine Learning Repository}, url = {http://archive.ics.uci.edu/ml}, institution = {University of California, Irvine, School of Information and Computer Sciences} }
-----------------END------------------
\ No newline at end of file
License: CC-BY License
License Link: http://creativecommons.org/licenses/by/4.0/
Source: OpenML
Source Link: https://www.openml.org/d/37
Citation: @article{OpenML2013,
author = {Vanschoren, Joaquin and van Rijn, Jan N. and Bischl, Bernd and Torgo, Luis},
title = {OpenML: Networked Science in Machine Learning},
journal = {SIGKDD Explorations},
volume = {15},
number = {2},
year = {2013},
pages = {49--60},
url = {http://doi.acm.org/10.1145/2641190.2641198},
doi = {10.1145/2641190.2641198},
publisher = {ACM},
address = {New York, NY, USA},
}
-----------------NOTICE------------------
This dataset was collected for use within the DARPA Data Driven Discovery of Models (D3M) program.
ID: uu_101_object_categories
Name: uu_101_object_categories_dataset
Description: There are a total of 9145 pictures of objects, each belonging to one of 101 categories.
ID: uu_101_object_categories_dataset
Name: 101 object categories
Description: There are a total of 9145 pictures of objects, each belonging to one of 101 categories.
There are about 40 to 800 images per category.
License: open
License Link: -NA-
Source: California Institute of Technology
Source Link: http://www.vision.caltech.edu/Image_Datasets/Caltech101/
Citation: @ieee{fei_fei_2004_383, author = {L. Fei-Fei and R. Fergus and P. Perona}, title = {{Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories}}, month = june}
-----------------END------------------
\ No newline at end of file
License: open
License Link:
Source: California Institute of Technology
Source Link: http://www.vision.caltech.edu/Image_Datasets/Caltech101/
Citation: @ieee{fei_fei_2004_383, author = {L. Fei-Fei and R. Fergus and P. Perona}, title = {{Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories}}, month = june} @article{fei2006one, title={One-shot learning of object categories}, author={Fei-Fei, Li and Fergus, Rob and Perona, Pietro}, journal={IEEE transactions on pattern analysis and machine intelligence}, volume={28}, number={4}, pages={594--611}, year={2006}, publisher={IEEE}}