Getting Started

This module provides cached loading of open datasets from SherlockML. To view the available datasets:

In [1]:
from sherlockml import opendata
In [2]:
opendata.ls()
Out[2]:
['/census/',
 '/census/census_by_outputarea.csv',
 '/census/census_variable_info.csv',
 '/census/outputarea_localauthority_mapping.csv',
 '/census/outputarea_lsoa_msoa_mapping.csv',
 '/census/outputarea_parliamentaryconstituency_mapping.csv',
 '/census/postcode_outputarea_mapping.csv',
 '/geojson/',
 '/geojson/local_authorities.json',
 '/geojson/lower_super_output_areas.json',
 '/geojson/middle_super_output_areas.json',
 '/geojson/output_areas.json',
 '/geojson/parliamentary_constituencies.json',
 '/higgs_boson/',
 '/higgs_boson/README.md',
 '/higgs_boson/higgs.csv',
 '/higgs_boson/higgs_test.csv',
 '/higgs_boson/higgs_train.csv',
 '/higgs_boson/higgs_validate.csv',
 '/input/',
 '/input/census/census_by_outputarea.csv',
 '/input/census/census_variable_info.csv',
 '/input/census/outputarea_localauthority_mapping.csv',
 '/input/census/outputarea_lsoa_msoa_mapping.csv',
 '/input/census/outputarea_parliamentaryconstituency_mapping.csv',
 '/input/census/postcode_outputarea_mapping.csv',
 '/input/geojson/',
 '/input/geojson/local_authorities.json',
 '/input/geojson/lower_super_output_areas.json',
 '/input/geojson/middle_super_output_areas.json',
 '/input/geojson/output_areas.json',
 '/input/geojson/parliamentary_constituencies.json']
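
The listing mixes directory entries (paths ending in `/`) with files of several types, so it can help to filter it before choosing what to load. The sketch below assumes the listing is a plain Python list of strings, as the output above suggests; `csv_files` is a hypothetical helper and not part of the module. A few paths are copied from the output so the example runs without access to SherlockML.

```python
# A few entries copied from the listing above. In a notebook this list
# would be the return value of opendata.ls().
paths = [
    '/census/',
    '/census/census_by_outputarea.csv',
    '/census/census_variable_info.csv',
    '/geojson/',
    '/geojson/local_authorities.json',
    '/higgs_boson/README.md',
    '/higgs_boson/higgs.csv',
]


def csv_files(paths, prefix='/'):
    """Hypothetical helper: return the CSV paths under `prefix`.

    Directory entries (ending in '/') and non-CSV files are skipped.
    """
    return [p for p in paths if p.startswith(prefix) and p.endswith('.csv')]


census_csvs = csv_files(paths, prefix='/census/')
all_csvs = csv_files(paths)
```

Any of the resulting paths can then be passed to `opendata.load` as shown in the next step.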

To load one of the datasets into a pandas DataFrame:

In [3]:
df = opendata.load('/census/census_by_outputarea.csv')
In [4]:
df.head()
Out[4]:
          OA  Total_Population  Total_Households  Total_Dwellings  Total_Household_Spaces  Total_Population_16_and_over  Total_Population_16_to_74  Total_Pop_No_NI_Students_16_to_74  Total_Employment_16_to_74  Total_Pop_in_Housesholds_16_and_over  ...  u158  u159  u160  u161  u162  u163  u164  u165  u166  u167
0  E00000001               194                99              115                     115                           173                        148                                148                        102                                   173  ...     6    18    57    14     9     2     2     0     0     0
1  E00000003               250               112              125                     125                           218                        199                                199                        147                                   218  ...    10    24    74    32     6     2     1     2     1     5
2  E00000005               367               217              241                     241                           337                        304                                304                        241                                   337  ...    16    37   117    52    12     7     9     3     0     4
3  E00000007               123                83              103                     103                           113                        111                                111                         86                                   113  ...     4    18    36    20     9     0     2     0     0     1
4  E00000010               102                78               79                      79                            97                         86                                 86                         59                                    97  ...    12    11    16    16     6     0     5     0     0     5

5 rows × 178 columns
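
Once loaded, the result behaves like any other pandas DataFrame. As a minimal sketch, the example below rebuilds three of the rows shown above by hand (so it runs without SherlockML) and applies two common operations: deriving a column and filtering rows. The `People_Per_Household` column name is an assumption made for illustration, and the ratio is only rough because `Total_Population` may include people who do not live in households.

```python
import pandas as pd

# Three rows and three columns copied from the df.head() output above.
sample = pd.DataFrame({
    'OA': ['E00000001', 'E00000003', 'E00000005'],
    'Total_Population': [194, 250, 367],
    'Total_Households': [99, 112, 217],
})

# Derived column (hypothetical name): a rough people-per-household ratio.
sample['People_Per_Household'] = (
    sample['Total_Population'] / sample['Total_Households']
)

# Boolean-mask filter: keep output areas with more than 200 residents.
larger = sample[sample['Total_Population'] > 200]
```

The same expressions apply unchanged to the full `df` returned by `opendata.load`.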

That’s it! The data is cached on disk so that it is not downloaded again, and it is also cached in memory for performance. If the file is updated on SherlockML, this module ensures you always have the latest version.
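
To make the caching behaviour concrete, here is a minimal sketch of a two-level cache that checks memory first, then disk, and downloads only when neither copy matches the remote version. It is an illustration of the general idea, not the module's actual implementation: the `TwoLevelCache` class, the `fetch` and `remote_version` callables, the version strings and the on-disk file naming are all assumptions.

```python
import os
import tempfile


class TwoLevelCache:
    """Hypothetical memory + disk cache keyed by a remote version string."""

    def __init__(self, cache_dir, fetch, remote_version):
        self.cache_dir = cache_dir
        self.fetch = fetch                    # assumed: path -> bytes
        self.remote_version = remote_version  # assumed: path -> str
        self.memory = {}                      # path -> (version, bytes)

    def _disk_paths(self, path):
        # Flatten the dataset path into a single file name.
        base = os.path.join(self.cache_dir, path.strip('/').replace('/', '__'))
        return base, base + '.version'

    def load(self, path):
        version = self.remote_version(path)

        # 1. In-memory copy, if it matches the current remote version.
        cached = self.memory.get(path)
        if cached is not None and cached[0] == version:
            return cached[1]

        # 2. On-disk copy, if it matches the current remote version.
        data_file, version_file = self._disk_paths(path)
        if os.path.exists(data_file) and os.path.exists(version_file):
            with open(version_file) as f:
                disk_version = f.read()
            if disk_version == version:
                with open(data_file, 'rb') as f:
                    data = f.read()
                self.memory[path] = (version, data)
                return data

        # 3. Download, then refresh both cache levels.
        data = self.fetch(path)
        with open(data_file, 'wb') as f:
            f.write(data)
        with open(version_file, 'w') as f:
            f.write(version)
        self.memory[path] = (version, data)
        return data


# Demo with a fake in-process "remote" so the sketch runs anywhere.
remote = {'/demo/data.csv': ('v1', b'a,b\n1,2\n')}
downloads = []


def fake_fetch(path):
    downloads.append(path)
    return remote[path][1]


def fake_version(path):
    return remote[path][0]


with tempfile.TemporaryDirectory() as tmp:
    cache = TwoLevelCache(tmp, fake_fetch, fake_version)
    first = cache.load('/demo/data.csv')    # downloaded
    second = cache.load('/demo/data.csv')   # served from memory
    fresh = TwoLevelCache(tmp, fake_fetch, fake_version)
    third = fresh.load('/demo/data.csv')    # served from disk
    remote['/demo/data.csv'] = ('v2', b'a,b\n3,4\n')
    fourth = cache.load('/demo/data.csv')   # remote changed: downloaded again
```

In this sketch every call still asks the remote for its current version, which is what lets a stale copy be replaced; only the file contents are skipped when nothing has changed.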