Part 3 - Building Data Services

Lesson 3: Tables

Defining Tables for SQLAlchemy

Juicebox data services don’t interact directly with the database using SQL. Instead, we write code in Python which is ‘translated’ to SQL with the help of a toolkit called SQLAlchemy.

In order for SQLAlchemy to perform as desired, we need to tell it what our data table looks like. There are two steps to accomplish this:

  • Create a Python class that inherits from SQLAlchemy’s base.
  • Define the __table__ attribute to match the columns in the database.

Let’s take a closer look at an example SQLAlchemy table, demo.census.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
# Import the specific SQLAlchemy tools we will use
from sqlalchemy import Table, Column, String, Float, ForeignKey, \
    select, join


class Census(Base):
    """
    The 'demo.census' table defined statically.  This is the preferred way
    to do things as it allows Juicebox to start up faster and be more flexible.

    A primary key is defined but will not be used.
    """

    __table__ = Table('census', Base.metadata,
                      Column('state', String(30), primary_key=True),
                      Column('sex', String(1)),
                      Column('age', Float()),
                      Column('pop2000', Float()),
                      Column('pop2008, Float()),
                      schema='demo', extend_existing=True)

For each table, you’ll create a class that inherits from Base (line 6) and define the __table__ object to look like the database table itself (lines 14 - 20). If you examine the database screenshot below, you’ll notice that each column in that database table matches one of the column objects we just created in the code above. For example, in the database, we have a pop2000 column; the SQLAlchemy code defines this column in line 18.

The demo.census table:

../../../_images/census-schema.png

Below is a sample of data from the census table. It lists population in the years 2000 and 2008 for every sex and age group in the state.

../../../_images/census-state-query-results.png

Your turn

1. In the stacks folder of your NFL app, create a file called nfl_tables.py.

Tip

You can explore the demo.nfl_contract table in PyCharm by using the Database tool window (View > Tool Windows > Database).

2. Map the demo.nfl_contract table to a class called NFLContract.

../../../_images/nfl-query-results.png

The table above shows a sample of the data from demo.nfl_contract table. Here is a column dictionary.

  • name: player’s name
  • start_date: contract starting season
  • end: contract ending season
  • pos: player’s position
  • team: contracted team
  • age: player’s age
  • dollars: contract amount
  • average: average salary per year over the life of the contract
  • gteed: amount of the contract that the player is guaranteed to make
  • per_gteed: percentage of the contract that the player is guaranteed to make
  • free_agent: season the player will become a free agent

Lesson 4: Base Service Basics

The Base Service and Recipes

We mentioned earlier that each slice in a stack gets its own data service. However, slices often share things like logic, tables, or special setup; we define all of these things in the base service which all slices on that stack will share. All slices on a single stack inherit from the base service.

The recipe abstraction for building queries is one of the most powerful features of Juicebox. We build sql queries that essentially function as a recipe for the data. Imagine a large pot: toss in all the ingredients from the shelf and out comes the data you need.

../../../_images/ingredient-cauldron.png

Let’s look at what goes into a base service.

Ingredients in Recipes

In order to create a recipe you must have ingredients. The ingredients of a Juicebox recipe are metrics, dimensions, and filters. Dimensions are ways you group the data. Metrics are ways to aggregate the values of the data. Filters are ways to explicitly include or exclude data. We combine some or all of these things to produce a SQL Query.

Any recipe is really just a way to ask a question about the data. That question gets translated to SQL and the rows returned are your answer.

Note

Let’s identify the recipe’s ingredients to answer the question “How many women lived in Ohio in the year 2000?”.

  • Metric - How many people?
  • Dimension- the year 2000
  • Filter - women only

Read more details about ingredients in the Juicebox docs.

Metrics for Aggregation

Metrics were designed to make aggregating data easy as pie. So, any time you want to know the total, average, maximum, minimum, etc you’ll just build a metric.

Let’s look back at our census example. Say you want to define a metric that would give you the total population in the year 2000. We’ll now walk through how you would accomplish that.

Note

All ingredients come from a shelf; this is just a python dictionary with the ingredients and their definitions.

Below is a metric shelf with only the metric to count the pop2000:

1
2
3
4
metric_shelf = {
    'pop2000': Metric(func.sum(Census.pop2000),
                      singular='Population 2000', format='.3s'),
}

Recall that metrics produce SQL. The first argument passed to a metric is the expression. Here, we call func.sum(), an SQLAlchemy function that translates to a SQL SUM. Most common SQL functions are available through func. The census.pop2000 argument inside of func.sum() is the pop2000 column that we created in our Census class.

If we created a recipe with only this metric the resulting SQL would be as follows, which gives you the sum of the pop2000 column for the whole table.

1
2
SELECT SUM(pop2000)
FROM demo.census;

Dimensions for Grouping

Here is a shelf with only one dimension on it called state. You’ll see that it references the state attribute of the Census object we defined earlier in nfl_tables.py.

1
2
3
4
5
# Dimensions are ways to split the data.
dimension_shelf = {
    # Simplest possible dimension, a SQLAlchemy expression and a label.
    'state': Dimension(Census.state, singular='State', plural='States'),
}

As we saw with metrics, dimensions will result in a SELECT clause in the output SQL. However, dimensions also produce an accompanying GROUP BY clause. The dimension above will result in the following SQL:

1
2
3
SELECT state
FROM demo.census
GROUP BY state

A recipe with only this dimension would produce one column with the list of all states represented in the census, like below.

../../../_images/census-state-list.png

Recipe

The recipe combines different ingredients we create to produce something useful.

For example, if we wanted to know the total population by state we could create

1
recipe = self.recipe().metrics('pop2000').dimensions('state')

We call the recipe constructor, .recipe(), and chain it with a call to .metrics(), passing the name of the metric, and calling dimension in the same manner.

.recipe() goes to the metric and dimension shelves we defined earlier and looks for the ingredients we passed in by name.

This simple recipe will produce the following SQL:

1
2
3
SELECT state, Sum(pop2000)
FROM demo.census
GROUP BY state;

Note

Notice a few advantages of recipes: they allow you to combine and reuse metrics and dimensions in different ways without having to re-write the full SQL every time. Other than simple configurations, most details about managing your data are hidden.

Base Service

As we discussed before, the base service is the data services from which all other slice-specific data services inherit. So, any initial setup that all slices will need is done here.

Hint

Most every base service contains a metric shelf and a dimension shelf.

We also usually have automatic filter keys in the base service. Dimensions defined as automatic_filter_keys will have a global filter automatically created for them.

See below the base service for our example census app.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
class CensusService(RecipeServiceBaseV3):
    # Metrics are defined as an SQLAlchemy expression, and a label.
    metric_shelf = {
        'pop2000': Metric(func.sum(Census.pop2000),
                          singular='Population 2000', format=".3s"),
        'pop2008': Metric(func.sum(Census.pop2008),
                          singular='Population 2008', format=".3s"),
        'popdiff': Metric(func.sum(Census.pop2008 - Census.pop2000),
                          singular='Population Growth',
                          format=".3s"),
        'avgage': Metric(
            func.sum(Census.pop2008 * Census.age) / func.sum(Census.pop2008),
            singular='Average Age', format=".1f"),
        # A metric using a complex expression
        'pctfemale': Metric(func.sum(case([(Census.sex == 'F',
                                            Census.pop2008)], else_=0)) /
                            func.sum(Census.pop2008), singular='% Female',
                            format=".1%"),
        # a metric using a formatter
        'pctdiff': Metric(func.sum(Census.pop2008 - Census.pop2000) / func.sum(
            Census.pop2000),
                          singular='Population Pct Change',
                          formatters=[
                              lambda x: "Change is {0:0.1%} percent".format(
                                  x)]),
    }

    # Dimensions are ways to split the data.
    dimension_shelf = {
        # Simplest possible dimension, a SQLAlchemy expression and a label.
        'state': Dimension(Census.state, singular='State', plural="States"),
        'age': Dimension(Census.age, singular='Age', plural="Ages"),
        # This will use the lookup to get display values of "M" and "F"
        'sex': LookupDimension(Census.sex, label='Sex',
                               lookup={'M': 'Menfolk', "F": "Womenfolk"},
                               singular="Gender", plural="Genders"),
        'age_bands': Dimension(case([(Census.age < 21, 'Under 21'),
                                     (Census.age < 49, '21-49')
                                     ], else_='Other'), label='Age Bands'),
    }

    # Dimension order will be used for the global filters
    automatic_filter_keys = dimension_shelf.keys()

    def __init__(self, *args, **kwargs):
        super(CensusService, self).__init__(*args, **kwargs)
        self.Session = Session

The first metric on the metric shelf is pop2000 (recall from our earlier example that this metric queried the sum of the pop2000 census); pop2008 is similar. The next four metrics are examples of different ways to look at the change in population between 2000 and 2008.

The dimension shelf defines several types of dimensions. You can see a list of available dimensions here. On the dimension shelf, the state dimension (line 31) we used earlier is defined. We also have dimensions for age and sex.

Your turn

  1. Create a file named nfl_baseservice.py.

    This is where you will construct your metric and dimension shelves.

  2. Define metrics based on what you know you need at this point.

    You can always come back to the base service later and add ingredients; this often happens when writing data services for specific slices.

    ../../../_images/nfl-key-metrics-slice.png

    Based on what we see in the option chooser above, you’ll need to define metrics for each data point there (total players, average age, etc).

    Recall that the distribution slice focuses on average salary, so you’ll need a metric for that as well.

  3. Now, create dimensions; look at the app and see what groups we need to aggregate by.

    Let’s start with name, age, team, and position.

  4. Configure your app to have global filters for team and position.

    Remember, you can use automatic_filter_keys to construct global filters from dimensions.

Now you’re ready for Part 4 - Building the Slices!