Datasets

Datasets is SherlockML’s environment for storing large files. It is designed to prevent accidental loss or modification of important data. Before proceeding, you may want to skim through the tutorial on Accessing Data.

To access the Datasets environment, click the relevant icon in the tab on the left-hand side of the workspace. Once inside, the buttons at the top-right of the page offer three options: Upload file, Create folder, and Delete file. Other actions, such as moving files from Datasets to the workspace, can be performed using the SherlockML File System (SFS) Python module.

Files uploaded to Datasets in CSV or TSV format are automatically analysed with Lens, SherlockML’s data-exploration service. As we will see, reports generated by Lens can be readily accessed from the Datasets page. If, on the other hand, you would like to use this feature as a Python module, have a look at this tutorial.

Lens reports evaluate the quality of datasets, and offer immediate insight through visualisations and tabular summaries.

Moving files to and from Datasets

To move files from Datasets to the workspace, where your programs can use them, we provide Python and R libraries that let you manipulate files on Datasets.

The Python library is called sherlockml.filesystem. To save you some typing, you can import it as sfs:

import sherlockml.filesystem as sfs

You can then use the commands as follows:

sfs.put('/project/test-file.csv', '/input/test-file.csv')

The various functions for manipulating files are:

ls([prefix, project_id, show_hidden, s3_client])
List contents of project SFS directory.
get(project_path, local_path[, project_id])
Copy from the project directory to the local filesystem.
put(local_path, project_path[, project_id])
Copy from the local filesystem to the project directory.
open(project_path[, mode, temp_dir])
Open a file from SherlockML filesystem for reading.
mv(source_path, destination_path[, project_id])
Move a file within the project directory.
cp(source_path, destination_path[, …])
Copy a file within the project directory.
rm(project_path[, project_id, s3_client])
Remove a file from the project directory.
etag(project_path[, project_id])
Get a unique identifier for the current version of a file.

To find a full list, take a look at the SherlockML File System page.
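As a rough sketch of how these functions can be combined, the helper below downloads a file from Datasets only when its etag differs from the one recorded at the previous download. The name fetch_if_changed and the seen_etags dictionary are hypothetical and not part of the library. The fs argument is assumed to be the sherlockml.filesystem module, or any object exposing etag and get with the signatures listed above.

```python
def fetch_if_changed(fs, project_path, local_path, seen_etags):
    """Copy project_path from Datasets to local_path if its etag changed.

    fs: object exposing etag(project_path) and get(project_path, local_path),
        assumed here to be the sherlockml.filesystem module.
    seen_etags: dict mapping project paths to the etag last downloaded
        (hypothetical bookkeeping, not provided by the library).
    Returns True if the file was downloaded, False if it was skipped.
    """
    current = fs.etag(project_path)
    if seen_etags.get(project_path) == current:
        return False  # same version as last time, nothing to copy
    fs.get(project_path, local_path)
    seen_etags[project_path] = current
    return True


# Intended usage inside a SherlockML workspace (not executed here):
#   import sherlockml.filesystem as sfs
#   seen = {}
#   fetch_if_changed(sfs, '/input/test-file.csv', '/project/test-file.csv', seen)
```

Passing the module in as an argument keeps the helper easy to exercise with a stand-in object outside SherlockML.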

The R library is called rsherlockml. As always, we load it with the library command:

library(rsherlockml)

You can then use the commands as follows:

datasets_put('/project/test-file.csv', '/input/test-file.csv')

The various functions for manipulating files are:

datasets_get(datasets_path, local_path, project_id = NULL)
Copy from the SherlockML datasets to the local filesystem.
datasets_move(source_path, destination_path, project_id = NULL)
Move a file from one location to another on SherlockML Datasets.
datasets_delete(path, project_id = NULL)
Delete a file from SherlockML Datasets.
datasets_list(prefix = "/", project_id = NULL, show_hidden = FALSE)
List files on SherlockML datasets.
datasets_copy(source_path, destination_path, project_id = NULL)
Copy a file from one location to another on SherlockML Datasets.
datasets_etag(path, project_id = NULL)
Retrieve the etag for a file on SherlockML datasets.
datasets_put(local_path, datasets_path, project_id = NULL)
Copy from the local filesystem to SherlockML datasets.

Lens reports

Within the Datasets environment, select the CSV or TSV file that you would like to explore. The dots icon on the right-hand side of the page will take you to the Lens report.

../_images/datasets_page1.png

Interpreting Lens reports

Your Lens report will typically look like this:

../_images/lens_overview.png

Lens reports are organised in three parts, namely Columns, Correlation Matrix and Pairwise Density Plot. Each corresponds to a tab, so that you can navigate through a report as you would in your web browser.

Columns

This is the “landing page” of the report. It lists the quantities (Columns) found in the dataset, alongside their main characteristics:

  • TYPE: the way the column is encoded. As in the data-analysis tool Pandas, the type can be int64 (integer number), float64 (floating point number), or object (non-numeric).
  • VALID: the number of non-null entries in the column.
  • NULL: the number of null entries in the column. The sum of VALID and NULL is equal to the size of your dataset.
  • DISTINCT: the number of distinct entries in the column. In other words, repeated identical values are counted only once.
  • CATEGORICAL: if No, the column is numeric (int64 or float64). Otherwise, the column is non-numeric (object).
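To make the VALID, NULL and DISTINCT counts concrete, here is a minimal sketch that computes them for a single column. It is an illustration, not Lens's implementation: the function name summarise_column is hypothetical, null entries are assumed to be represented by None, and DISTINCT is assumed to be counted over non-null entries only.

```python
def summarise_column(values):
    """Return VALID, NULL and DISTINCT counts for one column.

    Null entries are represented by None (an assumption of this sketch),
    and DISTINCT ignores nulls (also an assumption).
    """
    valid = [v for v in values if v is not None]
    return {
        "VALID": len(valid),
        "NULL": len(values) - len(valid),
        "DISTINCT": len(set(valid)),
    }


hours = [2.0, 3.5, None, 2.0, 5.0]
summary = summarise_column(hours)
print(summary)  # {'VALID': 4, 'NULL': 1, 'DISTINCT': 3}
```

Here VALID plus NULL equals 5, the number of rows, and the repeated value 2.0 is counted once in DISTINCT.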

Note

Clicking the name of a numeric quantity will direct you to the corresponding histogram. The plot will also include an estimate of the Probability Density Function (PDF) for the quantity, represented as a solid yellow line. More precisely, Lens calculates the PDF by means of a kernel density estimation (KDE) method.

../_images/histogram.png
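For readers unfamiliar with kernel density estimation, the sketch below shows the idea in its simplest form: each sample contributes a small bell curve, and the estimate is their average. The Gaussian kernel, the fixed bandwidth of 5.0 and the sample values are assumptions made for illustration; the kernel and bandwidth that Lens uses are not specified here.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function estimating the PDF of `samples` at a point x.

    Uses a Gaussian kernel with a fixed bandwidth (both are assumptions
    of this sketch, not a description of Lens).
    """
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per sample, centred on that sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


scores = [52.0, 55.0, 61.0, 63.0, 64.0, 70.0, 78.0]  # made-up test scores
pdf = gaussian_kde(scores, bandwidth=5.0)
print(pdf(63.0), pdf(100.0))  # density is higher near the bulk of the data
```

A larger bandwidth gives a smoother curve, while a smaller one follows the histogram more closely.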

Correlation matrix

Data scientists are often interested in correlations, as these indicate whether it is possible to make predictions. For example, let us assume the score of pupils in a test is highly correlated with the number of hours they studied. Then, given a pupil who has never taken the test, the number of hours they spent studying can be used to predict their score.

Lens calculates the correlation coefficient of each quantity in the dataset with all the others, and reports back a correlation matrix that summarises this information. For instance, each diagonal entry of this matrix specifies the correlation coefficient of a quantity with itself, which is equal to 1 by definition.

../_images/correlation_matrix.png

More technically, Lens returns the Spearman rank-order correlation coefficient matrix for your dataset.
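The Spearman coefficient is the ordinary (Pearson) correlation computed on the ranks of the values rather than on the values themselves, with tied values sharing their average rank. The following is a minimal sketch of that definition; the function names and the example data are illustrative and are not taken from Lens.

```python
import math


def average_ranks(values):
    """Rank values from 1 upwards, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of entries tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


hours = [1.0, 2.0, 3.0, 4.0, 5.0]        # made-up hours studied
scores = [10.0, 20.0, 25.0, 80.0, 81.0]  # made-up test scores
print(spearman(hours, scores))
```

Because only the ordering matters, any strictly increasing relationship yields a coefficient of 1, even when it is far from linear, as in the example data.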

Pairwise density plot

To better characterise the correlation between two quantities, it is useful to create a scatter plot. The tab Pairwise Density Plot in your Lens report displays, in a sense, all scatter plots that can be generated by examining the quantities in the dataset pairwise. Whenever a quantity is compared with itself, the scatter plot conveys no information, and thus a histogram is displayed instead.

../_images/pairwise_density.png

To be accurate, the visualisations shown on this page are not scatter plots, but rather 2D Kernel Density Estimates (KDEs). These are approximations of joint Probability Density Functions (PDFs) for pairs of quantities in the dataset.
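A 2D KDE extends the one-dimensional idea by placing a small two-dimensional bump on every (x, y) pair and averaging them. The sketch below assumes an isotropic Gaussian kernel with a single bandwidth for both axes; this choice, the function name and the data points are illustrative assumptions rather than a description of how Lens computes its plots.

```python
import math


def gaussian_kde_2d(points, bandwidth):
    """Return a function estimating the joint PDF of `points` at (x, y).

    Uses an isotropic Gaussian kernel with one bandwidth for both axes
    (an assumption of this sketch).
    """
    n = len(points)
    norm = 1.0 / (n * 2.0 * math.pi * bandwidth ** 2)

    def density(x, y):
        # Sum one 2D Gaussian bump per data point.
        return norm * sum(
            math.exp(-0.5 * ((x - px) ** 2 + (y - py) ** 2) / bandwidth ** 2)
            for px, py in points
        )

    return density


# Made-up pairs of two quantities on comparable scales.
points = [(1.0, 2.0), (2.0, 2.5), (3.0, 4.0), (4.0, 4.5), (5.0, 6.0)]
joint = gaussian_kde_2d(points, bandwidth=1.0)
print(joint(3.0, 4.0), joint(5.0, 1.0))  # high on the trend, low off it
```

Evaluating the returned function on a grid and colouring by value gives a density plot of the kind described above.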