Datasets is SherlockML’s environment for storing large files. It is designed to prevent accidental loss or modification of important data. Before proceeding, you may want to skim through the tutorial on Accessing Data.
To access the Datasets environment, click the relevant icon in the tab on the left-hand side of the workspace. Once inside the Datasets environment, the buttons on the top-right of the page offer three options: Upload file, Create folder and Delete file. Other actions, such as moving files from Datasets to the workspace, can be performed using the SherlockML File System (SFS) Python module.
Files uploaded to Datasets in CSV or TSV format are automatically analysed with Lens, SherlockML’s data-exploration service. As we will see, reports generated by Lens can be readily accessed from the Datasets page. If, on the other hand, you would like to use this feature as a Python module, have a look at this tutorial.
Lens reports evaluate the quality of datasets, and offer immediate insight through visualisations and tabular summaries.
To move files from Datasets to the workspace, where your programs can use them, we provide a library, available for Python and R, that lets you manipulate files on Datasets.
Your Lens report will typically look like this:
Lens reports are organised in three parts, namely Columns, Correlation Matrix and Pairwise Density Plot. Each corresponds to a tab, so that you can navigate through a report as you would in your web browser.
The Columns tab is the “landing page” of the report. It lists the quantities (columns) found in the dataset, alongside their main characteristics:
- TYPE: the way the column is encoded. As in the data-analysis tool Pandas, the type can be, for example, float64 (floating point number).
- VALID: the number of non-null entries in the column.
- NULL: the number of null entries in the column. The sum of VALID and NULL is equal to the size of your dataset.
- DISTINCT: the number of distinct entries in the column. In other words, repeated identical values are counted only once.
- CATEGORICAL: if No, the column is numeric (float64); otherwise, the column is non-numeric.
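The characteristics listed above can be approximated locally with Pandas. The following is a minimal sketch, not the Lens implementation: the helper name `summarise_columns` and the sample data are hypothetical, and the CATEGORICAL rule used here (every non-numeric column counts as categorical) is a simplifying assumption.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def summarise_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with Lens-like summary fields (hypothetical helper)."""
    return pd.DataFrame({
        "TYPE": df.dtypes.astype(str),        # how the column is encoded
        "VALID": df.notna().sum(),            # number of non-null entries
        "NULL": df.isna().sum(),              # number of null entries
        "DISTINCT": df.nunique(dropna=True),  # repeated values counted once
        # Assumption: a column is categorical exactly when it is non-numeric.
        "CATEGORICAL": pd.Series(
            {name: "No" if is_numeric_dtype(df[name]) else "Yes" for name in df.columns}
        ),
    })


# Small made-up dataset with one numeric and one non-numeric column.
df = pd.DataFrame({
    "hours": [1.0, 2.5, None, 4.0, 2.5],
    "grade": ["B", "A", "A", None, "C"],
})
summary = summarise_columns(df)
print(summary)
```

For every column, VALID plus NULL equals the number of rows in the dataset, which is a quick consistency check on any such summary.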
Clicking the name of a numeric quantity will direct you to the corresponding histogram. The plot will also include an estimate of the Probability Density Function (PDF) for the quantity, represented as a solid yellow line. More precisely, Lens calculates the PDF by means of a kernel density estimation (KDE) method.
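To illustrate what a kernel density estimate is, here is a minimal sketch using NumPy only. It assumes a Gaussian kernel and Scott's rule of thumb for the bandwidth; the kernel and bandwidth that Lens actually uses are not stated here, so treat these choices, the function name `gaussian_kde_1d` and the synthetic data as assumptions.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth=None):
    """Estimate a 1D PDF on `grid` by averaging Gaussian bumps centred on each sample."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    n = samples.size
    if bandwidth is None:
        # Assumption: Scott's rule of thumb for one-dimensional data.
        bandwidth = samples.std(ddof=1) * n ** (-1 / 5)
    # Standardised distance between every grid point and every sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (n * bandwidth)


# Synthetic data: 500 draws from a standard normal distribution.
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=500)
grid = np.linspace(-5, 5, 201)
density = gaussian_kde_1d(samples, grid)

# A PDF integrates to 1; check with a simple Riemann sum over the grid.
area = float(np.sum(density) * (grid[1] - grid[0]))
print(f"approximate area under the estimated PDF: {area:.3f}")
```

Plotting `density` against `grid` on top of a normalised histogram of `samples` gives a picture comparable to the histogram with a smooth PDF line described above.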
Data scientists are often interested in correlations, as these indicate whether one quantity can be used to predict another. For example, let us assume the score of pupils in a test is highly correlated with the number of hours they studied. Then, given a pupil who has never taken the test, the number of hours they spent studying can be used to predict their score.
Lens calculates the correlation coefficient of each quantity in the dataset with all the others, and reports back a correlation matrix that summarises this information. For instance, each diagonal entry of this matrix specifies the correlation coefficient of a quantity with itself, which is equal to 1 by definition.
More technically, Lens returns the Spearman rank-order correlation coefficient matrix for your dataset.
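A Spearman correlation matrix of this kind can be reproduced with Pandas via `DataFrame.corr(method="spearman")`. The sketch below uses made-up data (the column names and values are purely illustrative) and also shows that the Spearman coefficient is the Pearson coefficient computed on the ranks of the data.

```python
import pandas as pd

# Made-up example data: study hours, test scores and an unrelated quantity.
scores = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "test_score": [35, 50, 48, 70, 82, 90],
    "shoe_size": [40, 38, 42, 39, 41, 37],
})

# Spearman rank-order correlation between every pair of columns.
corr = scores.corr(method="spearman")
print(corr.round(3))

# Equivalent definition: Pearson correlation of the ranked data.
rank_corr = scores.rank().corr(method="pearson")
max_difference = (corr - rank_corr).abs().max().max()
```

The diagonal entries are 1, the matrix is symmetric, and `hours_studied` and `test_score` are strongly correlated because their rank orders almost coincide.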
To better characterise the correlation between two quantities, it is useful to create a scatter plot. The tab Pairwise Density Plot in your Lens report displays, in a sense, all scatter plots that can be generated by examining the quantities in the dataset pairwise. Whenever a quantity is compared with itself, the scatter plot conveys no information, and thus a histogram is displayed instead.
Strictly speaking, the visualisations shown on this page are not scatter plots, but rather 2D Kernel Density Estimates (KDEs). These are approximations of the joint Probability Density Functions (PDFs) for pairs of quantities in the dataset.
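As a rough illustration of a 2D KDE, the sketch below evaluates a joint density estimate on a grid with `scipy.stats.gaussian_kde`. This is an assumption-laden example rather than a description of how Lens computes its plots: the synthetic data, the grid size and SciPy's default bandwidth are all choices made here for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic, correlated pair of quantities (illustrative only).
rng = np.random.default_rng(1)
hours = rng.uniform(0, 10, size=300)
score = 40 + 5 * hours + rng.normal(0, 5, size=300)

# gaussian_kde expects an array of shape (dimensions, samples).
kde = gaussian_kde(np.vstack([hours, score]))

# Evaluate the estimated joint PDF on a 50 x 50 grid spanning the data.
xs = np.linspace(hours.min(), hours.max(), 50)
ys = np.linspace(score.min(), score.max(), 50)
xx, yy = np.meshgrid(xs, ys)
density = kde(np.vstack([xx.ravel(), yy.ravel()])).reshape(xx.shape)

print(density.shape)
```

The resulting `density` array can be drawn as a contour or heat map, for example with Matplotlib's `contourf(xx, yy, density)`; regions of high density correspond to where points would cluster in a scatter plot.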