PostgreSQL

Connecting to PostgreSQL databases from the command line

The psql client is the standard tool for connecting to PostgreSQL databases from the command line. It is pre-installed on all SherlockML servers.

To connect to a PostgreSQL database, you need to know:

  • its hostname: this is usually a string like customers.mydomain.com;
  • its port: the default port for PostgreSQL databases is 5432;
  • the name of the database that you want to connect to. If this is a new server, the only database on it is likely to be named postgres;
  • your username and password for the database server (note that this is different to your SherlockML username). If you are unsure of these, you should ask your database administrator.

You can then connect to the database with:

$ psql --host HOSTNAME --port PORT --user USERNAME --dbname DATABASE_NAME

You should then be prompted for a password. If the port is the default 5432, you may omit the --port PORT argument.

For instance, I can connect to database github running on host github-scraper.cpllpj7texvd.eu-west-1.rds.amazonaws.com with username awsuser:

../../_images/postgresql_example.png

If you get stuck, run psql --help | less in a terminal to view a list of all available options.

Connecting to PostgreSQL databases from Python

To connect to PostgreSQL databases from Python, you can use the Python module psycopg2, which comes pre-installed on SherlockML Jupyter and JupyterLab servers:

import psycopg2

connection = psycopg2.connect(
    host='customers.mydomain.com',  # host on which the database is running
    database='database_name',  # name of the database to connect to
    user='username',  # username to connect with
    password='password'  # password associated with your username
)

cursor = connection.cursor()
cursor.execute('SELECT * FROM customers')
customers = list(cursor.fetchall())
print('We have {} customers'.format(len(customers)))  # This is data science!

connection.close()

Note

We close the connection to allow the database server to reclaim resources. This can be critical in a Jupyter notebook, since the kernel remains alive for a long time.

The psycopg2 object can also be used in a with statement to ensure it gets closed automatically:

import psycopg2

with psycopg2.connect(
    host='customers.mydomain.com', database='database_name',
    user='username', password='password'
) as connection:
    cursor = connection.cursor()
    cursor.execute('SELECT * FROM customers')
    customers = list(cursor.fetchall())
    print('We have {} customers'.format(len(customers)))

You can also use the read_sql() function in pandas to read the result of a query directly into a DataFrame:

import psycopg2
import pandas

with psycopg2.connect(
    host='customers.mydomain.com', database='database_name',
    user='username', password='password'
) as connection:
    df = pandas.read_sql('SELECT * FROM customers', connection)

print(df)

Note

We recommend avoiding pasting database passwords and other connection details in many notebooks in a project. Have a look at Factoring connection details into a package for a recommended strategy for managing database connection details.