PostgreSQL

Connecting to PostgreSQL databases from the command line

The psql client is the standard tool for connecting to PostgreSQL databases from the command line. It is pre-installed on all SherlockML servers.

To connect to a PostgreSQL database, you need to know:

  • its hostname: this is usually a string like customers.mydomain.com;
  • its port: the default port for PostgreSQL databases is 5432;
  • the name of the database that you want to connect to. If this is a new server, the only database on it is likely to be named postgres;
  • your username and password for the database server (note that this is different to your SherlockML username). If you are unsure of these, you should ask your database administrator.

You can then connect to the database with:

$ psql --host HOSTNAME --port PORT --user USERNAME --dbname DATABASE_NAME

You should then be prompted for a password. If the port is the default 5432, you may omit the --port PORT argument.

For instance, I can connect to database github running on host github-scraper.cpllpj7texvd.eu-west-1.rds.amazonaws.com with username awsuser:

../../_images/postgresql_example.png

If you get stuck, run psql --help | less in a terminal to view a list of all available options.

Connecting to PostgreSQL databases from Python

To connect to PostgreSQL databases from Python, you can use the Python module psycopg2, which comes pre-installed on SherlockML Jupyter and JupyterLab servers:

import psycopg2

connection = psycopg2.connect(
    host='customers.mydomain.com',  # host on which the database is running
    database='database_name',  # name of the database to connect to
    user='username',  # username to connect with
    password='password'  # password associated with your username
)

cursor = connection.cursor()
cursor.execute('SELECT * FROM customers')
customers = list(cursor.fetchall())
print('We have {} customers'.format(len(customers)))  # This is data science!

connection.close()

Note

We close the connection to allow the database server to reclaim resources. This can be critical in a Jupyter notebook, since the kernel remains alive for a long time.

The psycopg2 object can also be used in a with statement to ensure it gets closed automatically:

import psycopg2

with psycopg2.connect(
    host='customers.mydomain.com', database='database_name',
    user='username', password='password'
) as connection:
    cursor = connection.cursor()
    cursor.execute('SELECT * FROM customers')
    customers = list(cursor.fetchall())
    print('We have {} customers'.format(len(customers)))

You can also use the read_sql() function in pandas to read the result of a query directly into a DataFrame:

import psycopg2
import pandas

with psycopg2.connect(
    host='customers.mydomain.com', database='database_name',
    user='username', password='password'
) as connection:
    df = pandas.read_sql('SELECT * FROM customers', connection)

print(df)

Note

We recommend avoiding pasting database passwords and other connection details in many notebooks in a project. Have a look at Factoring connection details into a package for a recommended strategy for managing database connection details.

Connecting to PostgreSQL databases from R

To connect to PostgreSQL databases from an RStudio server, firstly install the dependencies for the database driver by running the following command:

$ sudo apt-get install unixodbc-dev odbc-postgresql

Alternately, specify unixodbc-dev and odbc-postgresql in the System section of a custom environment, and apply this environment to your RStudio server (see Custom server environments).

Once the ODBC drivers are installed, install the odbc library for R by running install.packages("odbc") in an R session. Run library(odbc) to import the library, and check which drivers are installed on the system by running odbc::odbcListDrivers().

To connect to a PostgreSQL database from R create a connection object:

connection <- DBI::dbConnect(
  odbc::odbc(),
  Driver   = "psqlodbca.so",
  Server   = "customer.mydomain.com",  # Host on which the database is running
  Database = "database_name",          # Name of the database
  UID      = "username",               # Database username
  PWD      = "password",               # Database password
  Port     = 5432                      # Database port (5432 is the PosgreSQL default)
)

Once you have a connection object, you can list the tables in the database with:

dbListTables(connection)

You can then query a table with dbSendQuery and dbFetch:

result <- dbSendQuery(connection, "SELECT * FROM customers")
dbFetch(result)

where dbSendQuery sends the query and dbFetch retrieves the results. For other kind of queries, just use dbSendQuery.