Commit 10188a0b authored by Swaroop Vattam

Merge branch 'add-notices' into 'master'

added notices

See merge request !5
parents ae8a60bd f4983d49
Pipeline #39 passed with stage in 134 minutes and 31 seconds
-----------------NOTICE------------------
This dataset was collected for use within the DARPA Data Driven Discovery of Models (D3M) program.
ID: 124_174_cifar10
Name: 124_174_cifar10_dataset
Description: Image recognition dataset consisting of 60000 32x32 colour images in 10 classes, with 6000 images per class.
License: open
License Link: None
Source: University of Toronto
Source Link: https://www.cs.toronto.edu/~kriz/cifar.html
Citation: @article{krizhevsky2009learning,title={Learning multiple layers of features from tiny images},author={Krizhevsky, Alex and Hinton, Geoffrey},year={2009},publisher={Technical report, University of Toronto}}
-----------------END------------------
\ No newline at end of file
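The notice layout above (a NOTICE marker, `Key: value` lines, an END marker) is regular enough to read mechanically. A minimal sketch, assuming each field fits on one line and that keys never contain a colon; the function name `parse_notice` is hypothetical and not part of any D3M tooling:

```python
def parse_notice(text):
    """Parse one NOTICE...END block into a dict of field name -> value."""
    fields = {}
    inside = False
    for raw in text.splitlines():
        line = raw.strip()
        # Marker lines are runs of dashes around the word NOTICE or END.
        marker = line.strip("-")
        if line.startswith("-") and marker == "NOTICE":
            inside = True
            continue
        if line.startswith("-") and marker == "END":
            break
        # Skip anything outside the block and free-text lines without a key.
        if not inside or ":" not in line:
            continue
        # Split on the first colon only, so URLs in values stay intact.
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


sample = """-----------------NOTICE------------------
This dataset was collected for use within the DARPA Data Driven Discovery of Models (D3M) program.
ID: 124_174_cifar10
License: open
License Link: None
Source Link: https://www.cs.toronto.edu/~kriz/cifar.html
-----------------END------------------"""

notice = parse_notice(sample)
print(notice["ID"])
print(notice["Source Link"])
```

Multi-line values, such as the BibTeX citations in the newer files, would need extra handling that this sketch does not attempt.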
ID: 124_174_cifar10_MIN_METADATA_dataset
Name: cifar10
Description: Image recognition dataset consisting of 60000 32x32 colour images in 10 classes, with 6000 images per class.
License: open
License Link: None
Source: University of Toronto
Source Link: https://www.cs.toronto.edu/~kriz/cifar.html
Citation:
-----------------NOTICE------------------
This dataset was collected for use within the DARPA Data Driven Discovery of Models (D3M) program.
ID: 124_188_usps
Name: 124_188_usps_dataset
Description: Image recognition dataset consisting of 9298 16x16 images of 10 handwritten digits.
License: CC Public Domain Mark 1.0
License Link: https://creativecommons.org/publicdomain/mark/1.0/
Source: USPS
Source Link: http://www.cad.zju.edu.cn/home/dengcai/Data/MLData.html
Citation: @article{Cai11SRKDA, author = {Deng Cai and Xiaofei He and Jiawei Han}, title = {Speed Up Kernel Discriminant Analysis}, journal = {The VLDB Journal}, volume = {20}, number = {1},year = {2011}, pages = {21-33}} || @ARTICLE{Cai11SRKDA, AUTHOR = {Deng Cai and Xiaofei He and Jiawei Han}, TITLE = {Speed Up Kernel Discriminant Analysis}, JOURNAL = {The VLDB Journal}, YEAR = {2011}, volume = {20}, number = {1}, pages = {21-33}, }
-----------------END------------------
\ No newline at end of file
ID: 124_188_usps_MIN_METADATA_dataset
Name: usps
Description: Image recognition dataset consisting of 9298 16x16 images of 10 handwritten digits.
License: CC Public Domain Mark 1.0
License Link: https://creativecommons.org/publicdomain/mark/1.0/
Source: USPS
Source Link: http://www.cad.zju.edu.cn/home/dengcai/Data/MLData.html
Citation:
-----------------NOTICE------------------
This dataset was collected for use within the DARPA Data Driven Discovery of Models (D3M) program.
ID: 124_214_coil20
Name: 124_214_coil20_dataset
Description: Image recognition dataset of 20 objects from 72 different views.
License: open
License Link: None
Source: Columbia University
Source Link: http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php
Citation: @article{nene1996columbia,title={Columbia object image library (coil-20)},author={Nene, Sameer A and Nayar, Shree K and Murase, Hiroshi and others},year={1996},publisher={Technical report CUCS-005-96}}
-----------------END------------------
\ No newline at end of file
ID: 124_214_coil20_MIN_METADATA_dataset
Name: coil-20
Description: Image recognition dataset of 20 objects from 72 different views.
License: open
License Link: None
Source: Columbia University
Source Link: http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php
Citation:
-----------------NOTICE------------------
This dataset was collected for use within the DARPA Data Driven Discovery of Models (D3M) program.
ID: 1491_one_hundred_plants_margin
Name: 1491_one_hundred_plants_margin_dataset
Description: Plant Leaf Classification Using Probabilistic Integration of Shape, Texture and Margin Features. Signal Processing, Pattern Recognition and Applications, in press. 2013.
ID: 1491_one_hundred_plants_margin_MIN_METADATA_dataset
Name: one_hundred_plants_margin
Description: Plant Leaf Classification Using Probabilistic Integration of Shape, Texture and Margin Features. Signal Processing, Pattern Recognition and Applications, in press. 2013.
### Description
......@@ -23,11 +20,21 @@ There is a total of 1600 samples with 16 samples per leaf class (100 classes), a
Three 64 element feature vectors per sample.
License: CC Public Domain Mark 1.0
License Link: https://creativecommons.org/publicdomain/mark/1.0/
Source: OpenML
Source Link: https://www.openml.org/d/1491
Citation: @article{article,author = {Mallah, Charles and Cope, James and Orwell, James},year = {2013},month = {02},pages = {},title = {Plant Leaf Classification using Probabilistic Integration of Shape, Texture and Margin Features},volume = {3842},journal = {Pattern Recognit. Appl.},doi = {10.2316/P.2013.798-098}}
-----------------END------------------
\ No newline at end of file
License: CC Public Domain Mark 1.0 CC-BY License
License Link: https://creativecommons.org/publicdomain/mark/1.0/ http://creativecommons.org/licenses/by/4.0/
Source: OpenML OpenML
Source Link: https://www.openml.org/d/1491 http://www.openml.org/d/1491
Citation:
@article{OpenML2013,
author = {Vanschoren, Joaquin and van Rijn, Jan N. and Bischl, Bernd and Torgo, Luis},
title = {OpenML: Networked Science in Machine Learning},
journal = {SIGKDD Explorations},
volume = {15},
number = {2},
year = {2013},
pages = {49--60},
url = {http://doi.acm.org/10.1145/2641190.2641198},
doi = {10.1145/2641190.2641198},
publisher = {ACM},
address = {New York, NY, USA},
}
-----------------NOTICE------------------
This dataset was collected for use within the DARPA Data Driven Discovery of Models (D3M) program.
ID: 1567_poker_hand
Name: 1567_poker_hand_dataset
Description: Purpose is to predict poker hands
ID: 1567_poker_hand_MIN_METADATA_dataset
Name: poker_hand
Description: Purpose is to predict poker hands
Each record is an example of a hand consisting of five playing cards drawn from a standard deck of 52. Each card is described using two attributes (suit and rank), for a total of 10 predictive attributes. There is one Class attribute that describes the "Poker Hand". The order of cards is important, which is why there are 480 possible Royal Flush hands as compared to 4 (one for each suit).
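The Royal Flush counts quoted above can be checked by enumeration. A small sketch, assuming a Royal Flush is the ranks 10, J, Q, K, A in a single suit; the suit and rank labels are illustrative and need not match the dataset's encoding:

```python
from itertools import permutations

RANKS = ("10", "J", "Q", "K", "A")
SUITS = ("hearts", "spades", "diamonds", "clubs")

# Ignoring card order, there is exactly one Royal Flush per suit.
unordered = len(SUITS)

# With card order significant, every ordering of the five cards is distinct.
ordered = sum(1 for _suit in SUITS for _ in permutations(RANKS))

print(unordered, ordered)
```

Each suit contributes 5! = 120 orderings, which gives the 480 ordered hands mentioned in the description.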
......@@ -59,11 +56,21 @@ Ordinal (0-9)
R. Cattral, F. Oppacher, D. Deugo. Evolutionary Data Mining with Automatic Rule Generalization. Recent Advances in Computers, Computing and Communications, pp.296-300, WSEAS Press, 2002.
Note: This was a slightly different dataset that had more classes, and was considerably more difficult
License: CC Public Domain Mark 1.0
License Link: https://creativecommons.org/publicdomain/mark/1.0/
Source: OpenML
Source Link: https://www.openml.org/d/1567
Citation: @misc{Dua:2019 ,author = {Dua, Dheeru and Graff, Casey},year = {2017},title = {{UCI} Machine Learning Repository},url = {http://archive.ics.uci.edu/ml},institution = {University of California, Irvine, School of Information and Computer Sciences} }
-----------------END------------------
\ No newline at end of file
License: CC Public Domain Mark 1.0 CC-BY License
License Link: https://creativecommons.org/publicdomain/mark/1.0/ http://creativecommons.org/licenses/by/4.0/
Source: OpenML OpenML
Source Link: https://www.openml.org/d/1567 http://www.openml.org/d/1567
Citation:
@article{OpenML2013,
author = {Vanschoren, Joaquin and van Rijn, Jan N. and Bischl, Bernd and Torgo, Luis},
title = {OpenML: Networked Science in Machine Learning},
journal = {SIGKDD Explorations},
volume = {15},
number = {2},
year = {2013},
pages = {49--60},
url = {http://doi.acm.org/10.1145/2641190.2641198},
doi = {10.1145/2641190.2641198},
publisher = {ACM},
address = {New York, NY, USA},
}
ID: 185_baseball_MIN_METADATA_dataset
Name: baseball
Description: **Author**: Jeffrey S. Simonoff
**Source**: [AnalCatData](http://www.stern.nyu.edu/~jsimonof/AnalCatData) - 2003
**Please cite**: Jeffrey S. Simonoff, Analyzing Categorical Data, Springer-Verlag, New York, 2003
Database of baseball players and play statistics, including 'Games_played', 'At_bats', 'Runs', 'Hits', 'Doubles', 'Triples', 'Home_runs', 'RBIs', 'Walks', 'Strikeouts', 'Batting_average', 'On_base_pct', 'Slugging_pct' and 'Fielding_ave'
Notes:
* Quotes, Single-Quotes and Backslashes were removed, Blanks replaced with Underscores
* Player is an identifier that should be ignored when modelling the data
License: CC Public Domain Mark 1.0 CC-BY License
License Link: https://creativecommons.org/publicdomain/mark/1.0/ http://creativecommons.org/licenses/by/4.0/
Source: OpenML OpenML
Source Link: https://www.openml.org/d/185 http://www.openml.org/d/185
Citation:
@article{OpenML2013,
author = {Vanschoren, Joaquin and van Rijn, Jan N. and Bischl, Bernd and Torgo, Luis},
title = {OpenML: Networked Science in Machine Learning},
journal = {SIGKDD Explorations},
volume = {15},
number = {2},
year = {2013},
pages = {49--60},
url = {http://doi.acm.org/10.1145/2641190.2641198},
doi = {10.1145/2641190.2641198},
publisher = {ACM},
address = {New York, NY, USA},
}
ID: 196_autoMpg_MIN_METADATA_dataset
Name: autoMpg
Description: **Author**:
**Source**: Unknown -
**Please cite**:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Identifier attribute deleted.
As used by Kilpatrick, D. & Cameron-Jones, M. (1998). Numeric prediction
using instance-based learning with encoding length selection. In Progress
in Connectionist-Based Information Systems. Singapore: Springer-Verlag.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
1. Title: Auto-Mpg Data
2. Sources:
(a) Origin: This dataset was taken from the StatLib library which is
maintained at Carnegie Mellon University. The dataset was
used in the 1983 American Statistical Association Exposition.
(c) Date: July 7, 1993
3. Past Usage:
- See 2b (above)
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning.
In Proceedings on the Tenth International Conference of Machine
Learning, 236-243, University of Massachusetts, Amherst. Morgan
Kaufmann.
4. Relevant Information:
This dataset is a slightly modified version of the dataset provided in
the StatLib library. In line with the use by Ross Quinlan (1993) in
predicting the attribute "mpg", 8 of the original instances were removed
because they had unknown values for the "mpg" attribute. The original
dataset is available in the file "auto-mpg.data-original".
"The data concerns city-cycle fuel consumption in miles per gallon,
to be predicted in terms of 3 multivalued discrete and 5 continuous
attributes." (Quinlan, 1993)
5. Number of Instances: 398
6. Number of Attributes: 9 including the class attribute
7. Attribute Information:
1. mpg: continuous
2. cylinders: multi-valued discrete
3. displacement: continuous
4. horsepower: continuous
5. weight: continuous
6. acceleration: continuous
7. model year: multi-valued discrete
8. origin: multi-valued discrete
9. car name: string (unique for each instance)
8. Missing Attribute Values: horsepower has 6 missing values
License: CC Public Domain Mark 1.0 CC-BY License
License Link: https://creativecommons.org/publicdomain/mark/1.0/ http://creativecommons.org/licenses/by/4.0/
Source: OpenML OpenML
Source Link: https://www.openml.org/d/196 http://www.openml.org/d/196
Citation:
@article{OpenML2013,
author = {Vanschoren, Joaquin and van Rijn, Jan N. and Bischl, Bernd and Torgo, Luis},
title = {OpenML: Networked Science in Machine Learning},
journal = {SIGKDD Explorations},
volume = {15},
number = {2},
year = {2013},
pages = {49--60},
url = {http://doi.acm.org/10.1145/2641190.2641198},
doi = {10.1145/2641190.2641198},
publisher = {ACM},
address = {New York, NY, USA},
}
ID: 22_handgeometry_MIN_METADATA_dataset
Name: Hand geometry: wrist breadth prediction dataset
Description: There are a total of 112 raw hand image files, each corresponding to the left hand of one of
the 112 different users. This has been split into 74 train images and 38 test images. The 74 training instances are
labeled with a real number (the target variable WRISTBREADTH) in unknown units.
License: Creative Commons Attribution Non Commercial 4.0 International Creative Commons Attribution-NonCommercial 4.0
License Link: https://creativecommons.org/licenses/by-nc/4.0/legalcode https://creativecommons.org/licenses/by-nc/4.0/legalcode
Source: Zenodo Zenodo
Source Link: https://zenodo.org/record/17487/export/hx#.WZHDMnonrrc https://zenodo.org/record/17487/export/hx#.WZHDMnonrrc
Citation: @misc{oscar_miguel_hurtado_2016_17487,
author = {Oscar Miguel-Hurtado and
Righard Guest and
Sarah V. Stevenage and
Greg J. Neil and
Sue Black},
title = {{Comparing Machine Learning Classifiers and
Linear/Logistic Regression to Explore the
Relationship between Hand Dimensions and
Demographic Characteristics}},
month = oct,
year = 2016,
doi = {10.5281/zenodo.17487},
url = {https://doi.org/10.5281/zenodo.17487}
}
ID: 26_radon_seed_MIN_METADATA_dataset
Name: Radon Level Prediction Dataset
Description: EPA dataset that correlates counties in MN with radon level emissions.
License: CC0 - Public Domain Dedication ALv2
License Link: https://creativecommons.org/publicdomain/zero/1.0/ https://github.com/pymc-devs/pymc3/blob/master/LICENSE
Source: Harvard Dataverse pymc example data
Source Link: https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl:1902.1/10287 https://github.com/pymc-devs/pymc3/tree/master/pymc3/examples/data
Citation: Salvatier J., Wiecki T.V., Fonnesbeck C. (2016) Probabilistic programming in Python using PyMC3. PeerJ Computer Science 2:e55 DOI: 10.7717/peerj-cs.55.
ID: 27_wordLevels_MIN_METADATA_dataset
Name: Word Level Classification Dataset
Description: This is a tabular dataset comprising about 7000 instances, split into 5600 training and 1400 test instances. Each instance has about 12 mixed categorical and float features.
License: Creative Commons Attribution 4.0 International Creative Commons Attribution 4.0
License Link: https://creativecommons.org/licenses/by/4.0/legalcode https://creativecommons.org/licenses/by/4.0/legalcode
Source: Zenodo Zenodo
Source Link: https://zenodo.org/record/12501#.WbPUhY6F2dw https://zenodo.org/record/12501#.WbPUhY6F2dw
Citation:
@misc{guzey_2014_12501,
author = {Guzey, Onur and Sohsah, Gihad and Unal, Muhammed},
title = {{Classification of word levels with usage frequency, expert opinions and machine learning}},
month = oct,
year = 2014,
doi = {10.5281/zenodo.12501},
url = {https://doi.org/10.5281/zenodo.12501}
}
-----------------NOTICE------------------
This dataset was collected for use within the DARPA Data Driven Discovery of Models (D3M) program.
ID: 299_libras_move
Name: 299_libras_move_dataset
Description: LIBRAS Movement Database
ID: 299_libras_move_MIN_METADATA_dataset
Name: libras_move
Description: LIBRAS Movement Database
LIBRAS, acronym of the Portuguese name "LIngua BRAsileira de Sinais", is the official Brazilian sign language. The dataset (movement_libras) contains 15 classes of 24 instances each, where each class refers to a hand movement type in LIBRAS. The hand movement is represented as a bidimensional curve performed by the hand over a period of time. The curves were obtained from videos of hand movements, with the LIBRAS performance from 4 different people, during 2 sessions. Each video corresponds to only one hand movement and lasts about 7 seconds. Each video corresponds to a function F in a function space, which is the continuous version of the input dataset. In the video pre-processing, a time normalization is carried out by selecting 45 frames from each video according to a uniform distribution. In each frame, the centroid pixels of the segmented objects (the hand) are found, which compose the discrete version of the curve F with 45 points. All curves are normalized in the unitary space.
In order to prepare these movements to be analysed by algorithms, we have carried out a mapping operation, that is, each curve F is mapped to a representation with 90 features representing the coordinates of movement.
Each instance represents 45 points on a bi-dimensional space, which can be plotted in an ordered way (from 1 through 45 as the X coordinate) in order to draw the path of the movement.
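The mapping between the 90 features and the 45 curve points can be sketched as follows. This is an illustration only: it assumes the coordinates are interleaved as x1, y1, x2, y2, ..., which the description here does not state, and the function name `features_to_points` is hypothetical.

```python
def features_to_points(features):
    """Turn a flat 90-value feature vector into 45 (x, y) points."""
    if len(features) != 90:
        raise ValueError("expected 90 features (45 points x 2 coordinates)")
    # Assumption: coordinates are interleaved as x1, y1, x2, y2, ...
    return list(zip(features[0::2], features[1::2]))


# Synthetic values for illustration, not taken from the dataset.
example = [i / 90 for i in range(90)]
points = features_to_points(example)
print(len(points))
print(points[0])
```

Plotting the resulting points in order, from the first to the 45th, traces the path of the movement.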
License: CC Public Domain Mark 1.0
License Link: https://creativecommons.org/publicdomain/mark/1.0/
Source: OpenML
Source Link: https://www.openml.org/d/299
Citation: @misc{Dua:2019 ,author = {Dua, Dheeru and Graff, Casey},year = {2017},title = {{UCI} Machine Learning Repository},url = {http://archive.ics.uci.edu/ml},institution = {University of California, Irvine, School of Information and Computer Sciences} }
-----------------END------------------
\ No newline at end of file
License: CC Public Domain Mark 1.0 CC-BY License
License Link: https://creativecommons.org/publicdomain/mark/1.0/ http://creativecommons.org/licenses/by/4.0/
Source: OpenML OpenML
Source Link: https://www.openml.org/d/299 http://www.openml.org/d/299
Citation:
@article{OpenML2013,
author = {Vanschoren, Joaquin and van Rijn, Jan N. and Bischl, Bernd and Torgo, Luis},
title = {OpenML: Networked Science in Machine Learning},
journal = {SIGKDD Explorations},
volume = {15},
number = {2},
year = {2013},
pages = {49--60},
url = {http://doi.acm.org/10.1145/2641190.2641198},
doi = {10.1145/2641190.2641198},
publisher = {ACM},
address = {New York, NY, USA},
}
ID: 30_personae_MIN_METADATA_dataset
Name: Personae -personality detection Dataset
Description: The Personae corpus was collected for experiments in Authorship Attribution and Personality Prediction. It consists of 145 Dutch-language essays, written by 145 different students.
License: CC Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
License Link: https://creativecommons.org/licenses/by-nc-sa/3.0/
Source: CLiPS Research Center, University of Antwerp
Source Link: https://www.clips.uantwerpen.be/datasets/personae-corpus
Citation:
-----------------NOTICE------------------
This dataset was collected for use within the DARPA Data Driven Discovery of Models (D3M) program.
ID: 313_spectrometer
Name: 313_spectrometer_dataset
Description: Part of the IRAS Low Resolution Spectrometer Database.
ID: 313_spectrometer_MIN_METADATA_dataset
Name: spectrometer
Description: Part of the IRAS Low Resolution Spectrometer Database.
The Infra-Red Astronomy Satellite (IRAS) was the first attempt to map the full sky at infra-red wavelengths. This could not be done from ground observatories because large portions of the infra-red spectrum are absorbed by the atmosphere. The primary observing program was the full high resolution sky mapping performed by scanning at 4 frequencies. The Low Resolution Observation (IRAS-LRS) program observed high intensity sources over two continuous spectral bands. This database derives from a subset of the higher quality LRS observations taken between 12h and 24h right ascension.
This database contains 531 high quality spectra derived from the IRAS-LRS database. The original data contained 100 spectral measurements in each of two overlapping bands. Of these, 44 blue band and 49 red band channels contain usable flux measurements. Only these are included here. The original spectral intensities values are compressed to 4-digits, and each spectrum includes 5 rescaling parameters. We have used the LRS specified algorithm to rescale these to units of spectral intensity (Janskys). Total intensity differences have been eliminated by normalizing each spectrum to a mean value of 5000.
This database was originally obtained for use in development and testing of our AutoClass system for Bayesian classification. We have not retained any results from this development, having concentrated our efforts on a 5425 element version of the same data. Our classifications were based upon simultaneous modeling of all 93 spectral intensities. With the larger database we were able to find classes that correspond well with known spectral types associated with particular stellar types. We also found classes that match with the spectra expected of certain stellar processes under investigation by Ames astronomers. These classes have considerably enlarged the set of stars being investigated by those researchers.
License: CC Public Domain Mark 1.0
License Link: https://creativecommons.org/publicdomain/mark/1.0/
Source: OpenML
Source Link: https://www.openml.org/d/313
Citation: @misc{Dua:2019 ,author = {Dua, Dheeru and Graff, Casey},year = {2017},title = {{UCI} Machine Learning Repository},url = {http://archive.ics.uci.edu/ml},institution = {University of California, Irvine, School of Information and Computer Sciences} }
-----------------END------------------
\ No newline at end of file
License: CC Public Domain Mark 1.0 CC-BY License
License Link: https://creativecommons.org/publicdomain/mark/1.0/ http://creativecommons.org/licenses/by/4.0/
Source: OpenML OpenML
Source Link: https://www.openml.org/d/313 http://www.openml.org/d/313
Citation:
@article{OpenML2013,
author = {Vanschoren, Joaquin and van Rijn, Jan N. and Bischl, Bernd and Torgo, Luis},
title = {OpenML: Networked Science in Machine Learning},
journal = {SIGKDD Explorations},
volume = {15},
number = {2},
year = {2013},
pages = {49--60},
url = {http://doi.acm.org/10.1145/2641190.2641198},
doi = {10.1145/2641190.2641198},
publisher = {ACM},
address = {New York, NY, USA},
}
-----------------NOTICE------------------
This dataset was collected for use within the DARPA Data Driven Discovery of Models (D3M) program.
ID: 31_urbansound_MIN_METADATA_dataset
Name: UrbanSound dataset
Description: The data consists of raw audio files. The audio files are in multiple formats.
License: CC Attribution Noncommercial License (by-nc), version 3.0 Creative Commons Attribution Noncommercial License (by-nc), version 3.0
License Link: https://creativecommons.org/licenses/by-nc/3.0/ https://creativecommons.org/licenses/by-nc/3.0/
Source: NYU and www.freesound.org NYU and www.freesound.org
Source Link: https://serv.cusp.nyu.edu/projects/urbansounddataset/ https://urbansounddataset.weebly.com/
Citation:
@inproceedings{Salamon:UrbanSound:ACMMM:14,
Address = {Orlando, FL, USA},
Author = {Salamon, J. and Jacoby, C. and Bello, J. P.},
Booktitle = {22st {ACM} International Conference on Multimedia ({ACM-MM'14})},
Month = {Nov.},
Title = {A Dataset and Taxonomy for Urban Sound Research},
Year = {2014}}
ID: 31_urbansound
Name: 31_urbansound_dataset
Description: The data consists of raw audio files. The audio files are in multiple formats.
License: CC Attribution Noncommercial License (by-nc), version 3.0
License Link: https://creativecommons.org/licenses/by-nc/3.0/
Source: NYU and www.freesound.org
Source Link: https://serv.cusp.nyu.edu/projects/urbansounddataset/
Citation: @inproceedings{Salamon:UrbanSound:ACMMM:14, Address = {Orlando, FL, USA}, Author = {Salamon, J. and Jacoby, C. and Bello, J. P.}, Booktitle = {22st {ACM} International Conference on Multimedia ({ACM-MM'14})}, Month = {Nov.}, Title = {A Dataset and Taxonomy for Urban Sound Research}, Year = {2014}}
-----------------END------------------
\ No newline at end of file
-----------------NOTICE------------------
This dataset was collected for use within the DARPA Data Driven Discovery of Models (D3M) program.
ID: 32_fma
Name: 32_fma_dataset
Description: The data consists of a small subset of raw music files. The audio files have different bit rates and lengths. All are in mp3 format.
License: Creative Commons Attribution 4.0 International License (CC BY 4.0)
License Link: https://creativecommons.org/licenses/by/4.0
Source: MIT license
Source Link: https://github.com/mdeff/fma
Citation: @inproceedings{fma_dataset, title = {FMA: A Dataset for Music Analysis}, author = {Defferrard, Micha\"el and Benzi, Kirell and Vandergheynst, Pierre and Bresson, Xavier}, booktitle = {18th International Society for Music Information Retrieval Conference}, year = {2017}, url = {https://arxiv.org/abs/1612.01840},}
-----------------END------------------
\ No newline at end of file
ID: 32_fma_MIN_METADATA_dataset
Name: fma dataset
Description: The data consists of a small subset of raw music files. The audio files have different bit rates and lengths. All are in mp3 format.
License: Creative Commons Attribution 4.0 International License (CC BY 4.0)
License Link: https://creativecommons.org/licenses/by/4.0
Source: MIT license
Source Link: https://github.com/mdeff/fma
Citation:
ID: 32_wikiqa_MIN_METADATA_dataset
Name: WikiQA: A Challenge Dataset for Open-Domain Question Answering
Description: WikiQA dataset is a publicly available set of question and sentence (QS) pairs, collected and annotated for research on open-domain question answering
License: MICROSOFT RESEARCH DATA LICENSE AGREEMENT MICROSOFT RESEARCH DATA LICENSE AGREEMENT
License Link: None https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/spim-license.rtf
Source: Microsoft Microsoft
Source Link: https://www.microsoft.com/en-us/research/publication/wikiqa-a-challenge-dataset-for-open-domain-question-answering/# https://www.microsoft.com/en-us/research/publication/wikiqa-a-challenge-dataset-for-open-domain-question-answering/#
Citation:
@inproceedings{wikiqa-a-challenge-dataset-for-open-domain-question-answering,
author = {Yang, Yi and Yih, Scott Wen-tau and Meek, Chris},
title = {WikiQA: A Challenge Dataset for Open-Domain Question Answering},
year = {2015},
month = {September},
publisher = {ACL – Association for Computational Linguistics},
url = {https://www.microsoft.com/en-us/research/publication/wikiqa-a-challenge-dataset-for-open-domain-question-answering/},
}
ID: 38_sick_MIN_METADATA_dataset
Name: sick
Description: **Author**:
**Source**: [UCI](http://archive.ics.uci.edu/ml/datasets/thyroid+disease)
**Please cite**: Thyroid disease records supplied by the Garavan Institute and J. Ross Quinlan, New South Wales Institute, Sydney, Australia. 1987.
Attribute information:
```
sick, negative. | classes
age: continuous.
sex: M, F.
on thyroxine: f, t.
query on thyroxine: f, t.
on antithyroid medication: f, t.
sick: f, t.
pregnant: f, t.
thyroid surgery: f, t.
I131 treatment: f, t.
query hypothyroid: f, t.
query hyperthyroid: f, t.
lithium: f, t.
goitre: f, t.
tumor: f, t.
hypopituitary: f, t.
psych: f, t.
TSH measured: f, t.
TSH: continuous.
T3 measured: f, t.
T3: continuous.
TT4 measured: f, t.
TT4: continuous.
T4U measured: f, t.
T4U: continuous.
FTI measured: f, t.
FTI: continuous.
TBG measured: f, t.
TBG: continuous.
referral source: WEST, STMW, SVHC, SVI, SVHD, other.
```
```
Num Instances: 3772
Num Attributes: 30
Num Continuous: 7 (Int 1 / Real 6)
Num Discrete: 23
Missing values: 6064 / 5.4%
```
```
name type enum ints real missing distinct (1)
1 'age' Int 0% 100% 0% 1 / 0% 93 / 2% 0%
2 'sex' Enum 96% 0% 0% 150 / 4% 2 / 0% 0%
3 'on thyroxine' Enum 100% 0% 0% 0 / 0% 2 / 0% 0%
4 'query on thyroxine' Enum 100% 0% 0% 0 / 0% 2 / 0% 0%
5 'on antithyroid medicati Enum 100% 0% 0% 0 / 0% 2 / 0% 0%
6 'sick' Enum 100% 0% 0% 0 / 0% 2 / 0% 0%
7 'pregnant' Enum 100% 0% 0% 0 / 0% 2 / 0% 0%
8 'thyroid surgery' Enum 100% 0% 0% 0 / 0% 2 / 0% 0%
9 'I131 treatment' Enum 100% 0% 0% 0 / 0% 2 / 0% 0%
10 'query hypothyroid' Enum 100% 0% 0% 0 / 0% 2 / 0% 0%
11 'query hyperthyroid' Enum 100% 0% 0% 0 / 0% 2 / 0% 0%
12 'lithium' Enum 100% 0% 0% 0 / 0% 2 / 0% 0%
13 'goitre' Enum 100% 0% 0% 0 / 0% 2 / 0% 0%
14 'tumor' Enum 100% 0% 0% 0 / 0% 2 / 0% 0%
15 'hypopituitary' Enum 100% 0% 0% 0 / 0% 2 / 0% 0%
16 'psych' Enum 100% 0% 0% 0 / 0% 2 / 0% 0%
17 'TSH measured' Enum 100% 0% 0% 0 / 0% 2 / 0% 0%
18 'TSH' Real 0% 11% 79% 369 / 10% 287 / 8% 2%
19 'T3 measured' Enum 100% 0% 0% 0 / 0% 2 / 0% 0%
20 'T3' Real 0% 9% 71% 769 / 20% 69 / 2% 0%
21 'TT4 measured' Enum 100% 0% 0% 0 / 0% 2 / 0% 0%
22 'TT4' Real 0% 94% 0% 231 / 6% 241 / 6% 1%
23 'T4U measured' Enum 100% 0% 0% 0 / 0% 2 / 0% 0%
24 'T4U' Real 0% 2% 87% 387 / 10% 146 / 4% 1%
25 'FTI measured' Enum 100% 0% 0% 0 / 0% 2 / 0% 0%
26 'FTI' Real 0% 90% 0% 385 / 10% 234 / 6% 2%
27 'TBG measured' Enum 100% 0% 0% 0 / 0% 1 / 0% 0%
28 'TBG' Real 0% 0% 0% 3772 /100% 0 / 0% 0%
29 'referral source' Enum 100% 0% 0% 0 / 0% 5 / 0% 0%
30 'Class' Enum 100% 0% 0% 0 / 0% 2 / 0% 0%
```
License: CC Public Domain Mark 1.0 CC-BY License
License Link: https://creativecommons.org/publicdomain/mark/1.0/ http://creativecommons.org/licenses/by/4.0/
Source: OpenML OpenML
Source Link: https://www.openml.org/d/38 http://www.openml.org/d/38
Citation:
@article{OpenML2013,
author = {Vanschoren, Joaquin and van Rijn, Jan N. and Bischl, Bernd and Torgo, Luis},
title = {OpenML: Networked Science in Machine Learning},
journal = {SIGKDD Explorations},
volume = {15},
number = {2},
year = {2013},
pages = {49--60},
url = {http://doi.acm.org/10.1145/2641190.2641198},
doi = {10.1145/2641190.2641198},
publisher = {ACM},
address = {New York, NY, USA},
}
-----------------NOTICE------------------
This dataset was collected for use within the DARPA Data Driven Discovery of Models (D3M) program.
ID: 4550_MiceProtein
Name: 4550_MiceProtein_dataset
Description: The data set consists of the expression levels of 77 proteins/protein modifications that produced detectable signals in the nuclear fraction of cortex. There are 38 control mice and 34 trisomic mice (Down syndrome), for a total of 72 mice. In the experiments, 15 measurements were registered of each protein per sample/mouse. Therefore, for control mice, there are 38x15, or 570 measurements, and for trisomic mice, there are 34x15, or 510 measurements. The dataset contains a total of 1080 measurements per protein. Each measurement can be considered as an independent sample/mouse.
ID: 4550_MiceProtein_MIN_METADATA_dataset
Name: MiceProtein
Description: The data set consists of the expression levels of 77 proteins/protein modifications that produced detectable signals in the nuclear fraction of cortex. There are 38 control mice and 34 trisomic mice (Down syndrome), for a total of 72 mice. In the experiments, 15 measurements were registered of each protein per sample/mouse. Therefore, for control mice, there are 38x15, or 570 measurements, and for trisomic mice, there are 34x15, or 510 measurements. The dataset contains a total of 1080 measurements per protein. Each measurement can be considered as an independent sample/mouse.
The eight classes of mice are described based on features such as genotype, behavior and treatment. According to genotype, mice can be control or trisomic. According to behavior, some mice have been stimulated to learn (context-shock) and others have not (shock-context) and in order to assess the effect of the drug memantine in recovering the ability to learn in trisomic mice, some mice have been injected with the drug and others have not.
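The counts in the description can be reproduced with a few lines of arithmetic. A minimal sketch using only the figures stated above; the label strings for the class factors are illustrative and are not the class names used in the dataset:

```python
from itertools import product

CONTROL_MICE = 38
TRISOMIC_MICE = 34
MEASUREMENTS_PER_MOUSE = 15

control_measurements = CONTROL_MICE * MEASUREMENTS_PER_MOUSE
trisomic_measurements = TRISOMIC_MICE * MEASUREMENTS_PER_MOUSE
total_measurements = control_measurements + trisomic_measurements

# Eight classes arise from three two-level factors.
# The label strings below are placeholders chosen for this sketch.
genotypes = ("control", "trisomic")
behaviors = ("context-shock", "shock-context")
treatments = ("memantine", "not injected")
classes = list(product(genotypes, behaviors, treatments))

print(control_measurements, trisomic_measurements, total_measurements)
print(len(classes))
```

The totals match the description: 570 control and 510 trisomic measurements per protein, 1080 overall, and 2 x 2 x 2 = 8 classes.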
......@@ -20,11 +17,21 @@ Classes:
```
The aim is to identify subsets of proteins that are discriminant between the classes.
License: CC Public Domain Mark 1.0
License Link: https://creativecommons.org/publicdomain/mark/1.0/