Intermediate SQL
Mona Khalil
Course Description
So you’ve learned how to aggregate and join data from tables in your database—now what? How do you manipulate, transform, and make the most sense of your data? This intermediate-level course will teach you several key functions necessary to wrangle, filter, and categorize information in a relational database, expand your SQL toolkit, and answer complex questions. You will learn the robust use of CASE statements, subqueries, and window functions—all while discovering some interesting facts about soccer using the European Soccer Database. Should there be further discussion, please contact us via email: dattran.hcmiu@gmail.com.
1 We’ll take the CASE
In this chapter, you will learn how to use the CASE WHEN statement to create categorical variables, aggregate data into a single column with multiple filtering conditions, and calculate counts and percentages.
1.1 We’ll take the CASE
1.1.1 Basic CASE statements
What is your favorite team?
The European Soccer Database contains data about 12,800 matches from 11 countries played between 2011-2015! Throughout this course, you will be shown filtered versions of the tables in this database in order to better explore their contents.
In this exercise, you will identify matches played between FC Schalke 04 and FC Bayern Munich. There are 2 teams identified in each match in the hometeam_id
and awayteam_id
columns, available to you in the filtered matches
table. ID can join to the team_api_id
column in the teams
table, but you cannot perform a join on both at the same time.
However, you can perform this operation using a CASE
statement once you’ve identified the team_api_id
associated with each team!
teams
table.
IN
, giving you the team_api_IDs
needed for the next step.
SELECT
-- Select the team long name and team API id
team_long_name,
team_api_idFROM teams
-- Only include FC Schalke 04 and FC Bayern Munich
WHERE team_long_name IN ('FC Schalke 04', 'FC Bayern Munich');
team_long_name | team_api_id |
---|---|
FC Bayern Munich | 9823 |
FC Schalke 04 | 10189 |
CASE
statement that identifies whether a match in Germany included FC Bayern Munich
, FC Schalke 04
, or neither as the home team.
CASE
statement alias, home_team
.
-- Identify the home team as Bayern Munich, Schalke 04, or neither
SELECT
CASE WHEN hometeam_id = 10189 THEN 'FC Schalke 04'
WHEN hometeam_id = 9823 THEN 'FC Bayern Munich'
ELSE 'Other' END AS home_team,
COUNT(id) AS total_matches
FROM matches
-- Group by the CASE statement alias
GROUP BY home_team;
home_team | total_matches |
---|---|
FC Bayern Munich | 136 |
FC Schalke 04 | 136 |
Other | 25707 |
Great job! You were able to identify two teams using CASE
statement instead of a join, which could only identify one at a time. Let’s learn some more about key teams in Europe!
1.1.2 CASE statements comparing column values
Barcelona is considered one of the strongest teams in Spain’s soccer league.
In this exercise, you will be creating a list of matches in the 2011/2012 season where Barcelona was the home team. You will do this using a CASE
statement that compares the values of two columns to create a new group – wins, losses, and ties.
In 3 steps, you will build a query that identifies a match’s winner, identifies the identity of the opponent, and finally filters for Barcelona as the home team. Completing a query in this order will allow you to watch your results take shape with each new piece of information.
The matches
table currently contains Barcelona’s matches from the 2011/2012 season, and has two key columns, hometeam_id
and awayteam_id
, that can be joined with the teams
table. However, you can only join teams
to one column at a time.
date
of the match and create a CASE
statement to identify matches as home wins, home losses, or ties.
SELECT
-- Select the date of the match
date,
-- Identify home wins, losses, or ties
CASE WHEN home_goal > away_goal THEN 'Home win!'
WHEN home_goal < away_goal THEN 'Home loss :('
ELSE 'Tie' END AS outcome
FROM matches;
date | outcome |
---|---|
2022-08-17 | Tie |
2022-08-16 | Tie |
2022-08-16 | Home loss :( |
2022-08-17 | Home win! |
2022-08-16 | Home loss :( |
2022-09-24 | Tie |
2022-08-16 | Tie |
2022-08-16 | Home loss :( |
2022-08-16 | Home win! |
2022-11-01 | Home win! |
teams
table team_api_id
column to the matches
table awayteam_id
. This allows us to retrieve the away team’s identity.
team_long_name
from teams
as opponent
and complete the CASE
statement from Step 1.
SELECT
date,
m.--Select the team long name column and call it 'opponent'
AS opponent,
t.team_long_name -- Complete the CASE statement with an alias
CASE WHEN m.home_goal > m.away_goal THEN 'Home win!'
WHEN m.home_goal < m.away_goal THEN 'Home loss :('
ELSE 'Tie' END AS outcome
FROM matches AS m
-- Left join teams onto matches
LEFT JOIN teams AS t
ON m.awayteam_id = t.team_api_id;
date | opponent | outcome |
---|---|---|
2022-08-17 | Beerschot AC | Tie |
2022-08-16 | Sporting Lokeren | Tie |
2022-08-16 | RSC Anderlecht | Home loss :( |
2022-08-17 | RAEC Mons | Home win! |
2022-08-16 | Standard de Liège | Home loss :( |
2022-09-24 | Club Brugge KV | Tie |
2022-08-16 | KV Kortrijk | Tie |
2022-08-16 | Royal Excel Mouscron | Home loss :( |
2022-08-16 | Sporting Charleroi | Home win! |
2022-11-01 | KV Kortrijk | Home win! |
CASE
statement as the previous steps.
8634
).
SELECT
date,
m.AS opponent,
t.team_long_name -- Complete the CASE statement with an alias
CASE WHEN m.home_goal > m.away_goal THEN 'Barcelona win!'
WHEN m.home_goal < m.away_goal THEN 'Barcelona loss :('
ELSE 'Tie' END AS outcome
FROM matches AS m
LEFT JOIN teams AS t
ON m.awayteam_id = t.team_api_id
-- Filter for Barcelona as the home team
WHERE m.hometeam_id = 8634;
date | opponent | outcome |
---|---|---|
2022-11-08 | Real Valladolid | Barcelona win! |
2022-11-23 | Getafe CF | Tie |
2022-12-06 | Valencia CF | Barcelona win! |
2022-12-13 | Real Madrid CF | Barcelona win! |
2022-01-03 | RCD Mallorca | Barcelona win! |
2022-01-17 | RC Deportivo de La Coruña | Barcelona win! |
2022-09-13 | Racing Santander | Tie |
2022-01-24 | CD Numancia | Barcelona win! |
2022-02-08 | Real Sporting de Gijón | Barcelona win! |
2022-02-21 | RCD Espanyol | Barcelona loss :( |
Great job! Did you notice how often Barcelona tends to win matches?
1.1.3 CASE statements comparing two column values part 2
Similar to the previous exercise, you will construct a query to determine the outcome of Barcelona’s matches where they played as the away team. You will learn how to combine these two queries in chapters 2 and 3.
Did their performance differ from the matches where they were the home team?
CASE
statement to identify Barcelona’s away team games (id = 8634
) as wins, losses, or ties.
teams
table team_api_id
column on the matches
table hometeam_id
column. This retrieves the identity of the home team opponent.
-- Select matches where Barcelona was the away team
SELECT
date,
m.AS opponent,
t.team_long_name CASE WHEN m.home_goal < m.away_goal THEN 'Barcelona win!'
WHEN m.home_goal > m.away_goal THEN 'Barcelona loss :('
ELSE 'Tie' END AS outcome
FROM matches AS m
-- Join teams to matches
LEFT JOIN teams AS t
ON m.hometeam_id = t.team_api_id
WHERE m.awayteam_id = 8634;
date | opponent | outcome |
---|---|---|
2022-08-31 | CD Numancia | Barcelona loss :( |
2022-11-16 | RC Recreativo | Barcelona win! |
2022-11-29 | Sevilla FC | Barcelona win! |
2022-12-21 | Villarreal CF | Barcelona win! |
2022-01-11 | CA Osasuna | Barcelona win! |
2022-02-01 | Racing Santander | Barcelona win! |
2022-02-14 | Real Betis Balompié | Tie |
2022-03-01 | Atlético Madrid | Barcelona loss :( |
2022-03-15 | UD Almería | Barcelona win! |
2022-04-04 | Real Valladolid | Barcelona win! |
Fantastic! Barcelona’s performance seems to be worse when they are the away team. We’ll explore this further in the next lesson.
1.2 More complex CASE WHEN
1.2.1 In CASE of rivalry
Barcelona and Real Madrid have been rival teams for more than 80 years. Matches between these two teams are given the name El Clásico (The Classic). In this exercise, you will query a list of matches played between these two rivals.
You will notice in Step 2 that when you have multiple logical conditions in a CASE
statement, you may quickly end up with a large number of WHEN
clauses to logically test every outcome you are interested in. It’s
important to make sure you don’t accidentally exclude key information in
your ELSE
clause.
In this exercise, you will retrieve information about matches played between Barcelona (id = 8634
) and Real Madrid (id = 8633
). Note that the query you are provided with already identifies the Clásico matches using a filter in the WHERE
clause.
CASE
statement, identifying Barcelona or Real Madrid as the home team using the hometeam_id
column.
CASE
statement in the same way, using awayteam_id
.
SELECT
date,
-- Identify the home team as Barcelona or Real Madrid
CASE WHEN hometeam_id = 8634 THEN 'FC Barcelona'
ELSE 'Real Madrid CF' END AS home,
-- Identify the away team as Barcelona or Real Madrid
CASE WHEN awayteam_id = 8634 THEN 'FC Barcelona'
ELSE 'Real Madrid CF' END AS away
FROM matches
WHERE (awayteam_id = 8634 OR hometeam_id = 8634)
AND (awayteam_id = 8633 OR hometeam_id = 8633);
date | home | away |
---|---|---|
2022-12-13 | FC Barcelona | Real Madrid CF |
2022-05-02 | Real Madrid CF | FC Barcelona |
2022-11-29 | FC Barcelona | Real Madrid CF |
2022-04-10 | Real Madrid CF | FC Barcelona |
2022-11-29 | FC Barcelona | Real Madrid CF |
2022-04-16 | Real Madrid CF | FC Barcelona |
2022-12-10 | Real Madrid CF | FC Barcelona |
2022-04-21 | FC Barcelona | Real Madrid CF |
2022-03-02 | Real Madrid CF | FC Barcelona |
2022-10-07 | FC Barcelona | Real Madrid CF |
CASE
statement identifying who won each match. Note there are 3 possible outcomes, but 5 conditions that you need to identify.
SELECT
date,
CASE WHEN hometeam_id = 8634 THEN 'FC Barcelona'
ELSE 'Real Madrid CF' END as home,
CASE WHEN awayteam_id = 8634 THEN 'FC Barcelona'
ELSE 'Real Madrid CF' END as away,
-- Identify all possible match outcomes
CASE WHEN home_goal > away_goal AND hometeam_id = 8634 THEN 'Barcelona win!'
WHEN home_goal > away_goal AND hometeam_id = 8633 THEN 'Real Madrid win!'
WHEN home_goal < away_goal AND awayteam_id = 8634 THEN 'Barcelona win!'
WHEN home_goal < away_goal AND awayteam_id = 8633 THEN 'Real Madrid win!'
ELSE 'Tie!' END as outcome
FROM matches
WHERE (awayteam_id = 8634 OR hometeam_id = 8634)
AND (awayteam_id = 8633 OR hometeam_id = 8633);
date | home | away | outcome |
---|---|---|---|
2022-12-13 | FC Barcelona | Real Madrid CF | Barcelona win! |
2022-05-02 | Real Madrid CF | FC Barcelona | Barcelona win! |
2022-11-29 | FC Barcelona | Real Madrid CF | Barcelona win! |
2022-04-10 | Real Madrid CF | FC Barcelona | Barcelona win! |
2022-11-29 | FC Barcelona | Real Madrid CF | Barcelona win! |
2022-04-16 | Real Madrid CF | FC Barcelona | Tie! |
2022-12-10 | Real Madrid CF | FC Barcelona | Barcelona win! |
2022-04-21 | FC Barcelona | Real Madrid CF | Real Madrid win! |
2022-03-02 | Real Madrid CF | FC Barcelona | Real Madrid win! |
2022-10-07 | FC Barcelona | Real Madrid CF | Tie! |
Fantastic! These two teams are fairly evenly matched!
1.2.2 Filtering your CASE statement
Let’s generate a list of matches won by Italy’s Bologna team! There are quite a few additional teams in the two tables, so a key part of generating a usable query will be using your CASE
statement as a filter in the WHERE
clause.
CASE
statements allow you to categorize data that you’re
interested in – and exclude data you’re not interested in. In order to
do this, you can use a CASE
statement as a filter in the WHERE
statement to remove output you don’t want to see.
Here is how you might set that up:
SELECT *
FROM table
WHERE
CASE WHEN a > 5 THEN 'Keep'
WHEN a <= 5 THEN 'Exclude' END = 'Keep';
In essence, you can use the CASE
statement as a filtering column like any other column in your database. The only difference is that you don’t alias the statement in WHERE
.
Bologna
’s team ID listed in the teams
table by selecting the team_long_name
and team_api_id
.
-- Select team_long_name and team_api_id from team
SELECT
team_long_name,
team_api_idFROM teams
-- Filter for team long name
WHERE team_long_name = 'Bologna';
team_long_name | team_api_id |
---|---|
Bologna | 9857 |
season
and date
that a match was played.
CASE
statement so that only Bologna’s home and away wins are identified.
-- Select the season and date columns
SELECT
season,date,
-- Identify when Bologna won a match
CASE WHEN hometeam_id = 9857
AND home_goal > away_goal
THEN 'Bologna Win'
WHEN awayteam_id = 9857
AND away_goal > home_goal
THEN 'Bologna Win'
END AS outcome
FROM matches;
season | date | outcome |
---|---|---|
2008/2009 | 2022-08-17 | NA |
2008/2009 | 2022-08-16 | NA |
2008/2009 | 2022-08-16 | NA |
2008/2009 | 2022-08-17 | NA |
2008/2009 | 2022-08-16 | NA |
2008/2009 | 2022-09-24 | NA |
2008/2009 | 2022-08-16 | NA |
2008/2009 | 2022-08-16 | NA |
2008/2009 | 2022-08-16 | NA |
2008/2009 | 2022-11-01 | NA |
home_goal
and away_goal
for each match.
CASE
statement in the WHERE
clause to filter all NULL
values generated by the statement in the previous step.
-- Select the season, date, home_goal, and away_goal columns
SELECT
season,date,
home_goal,
away_goalFROM matches
WHERE
-- Exclude games not won by Bologna
CASE WHEN hometeam_id = 9857 AND home_goal > away_goal THEN 'Bologna Win'
WHEN awayteam_id = 9857 AND away_goal > home_goal THEN 'Bologna Win'
END IS NOT NULL;
season | date | home_goal | away_goal |
---|---|---|---|
2008/2009 | 2022-08-31 | 1 | 2 |
2008/2009 | 2022-12-13 | 5 | 2 |
2008/2009 | 2022-01-18 | 1 | 2 |
2008/2009 | 2022-01-28 | 0 | 1 |
2008/2009 | 2022-03-08 | 3 | 0 |
2008/2009 | 2022-04-26 | 2 | 0 |
2008/2009 | 2022-05-17 | 2 | 1 |
2008/2009 | 2022-05-31 | 3 | 1 |
2008/2009 | 2022-10-19 | 3 | 1 |
2009/2010 | 2022-10-28 | 2 | 1 |
Good work! You now have details on every match in this database where Bologna won. We will work on adding more useful information to your queries later in this course.
1.3 Aggregate CASE WHEN
1.3.1 COUNT using CASE WHEN
Do the number of soccer matches played in a given European country differ across seasons? We will use the European Soccer Database to answer this question.
You will examine the number of matches played in 3 seasons within each
country listed in the database. This is much easier to explore with each
season’s matches in separate columns. Using the country
and unfiltered match
table, you will count the number of matches played in each country
during the 2012/2013, 2013/2014, and 2014/2015 match seasons.
CASE
statement that identifies the id
of matches played in the 2012/2013 season. Specify that you want ELSE
values to be NULL
.
CASE
statement in a COUNT
function and group the query by the country
alias.
SELECT
AS country,
c.name -- Count games from the 2012/2013 season
COUNT(CASE WHEN m.season = '2012/2013'
THEN m.id ELSE NULL END) AS matches
FROM country AS c
LEFT JOIN match AS m
ON c.id = m.country_id
-- Group by country name alias
GROUP BY country;
country | matches |
---|---|
Belgium | 240 |
England | 380 |
France | 380 |
Germany | 306 |
Italy | 380 |
Netherlands | 306 |
Poland | 240 |
Portugal | 240 |
Scotland | 228 |
Spain | 380 |
CASE WHEN
statements counting the matches played in each country across the 3 seasons.
END
your CASE
statement without an ELSE
clause.
SELECT
AS country,
c.name -- Count matches in each of the 3 seasons
COUNT(CASE WHEN m.season = '2012/2013' THEN m.id END) AS matches,
COUNT(CASE WHEN m.season = '2013/2014' THEN m.id END) AS matches,
COUNT(CASE WHEN m.season = '2014/2015' THEN m.id END) AS matches
FROM country AS c
LEFT JOIN match AS m
ON c.id = m.country_id
-- Group by country name alias
GROUP BY country;
country | matches | matches | matches |
---|---|---|---|
Belgium | 240 | 12 | 240 |
England | 380 | 380 | 380 |
France | 380 | 380 | 380 |
Germany | 306 | 306 | 306 |
Italy | 380 | 380 | 379 |
Netherlands | 306 | 306 | 306 |
Poland | 240 | 240 | 240 |
Portugal | 240 | 240 | 306 |
Scotland | 228 | 228 | 228 |
Spain | 380 | 380 | 380 |
Fantastic! The number of matches played in each season seems relatively consistent across countries. Where do you see the largest difference?
1.3.2 COUNT and CASE WHEN with multiple conditions
In R or Python, you have the ability to calculate a SUM
of logical values (i.e., TRUE
/FALSE
) directly. In SQL, you have to convert these values into 1 and 0 before calculating a sum. This can be done using a CASE
statement.
There’s one key difference when using SUM
to aggregate logical values compared to using COUNT
in the previous exercise –
Your goal here is to use the country
and match
table to determine the total number of matches won by the home team in each country during the 2012/2013, 2013/2014, and 2014/2015 seasons.
CASE
statements to “count” matches in the ‘2012/2013’
, ‘2013/2014’
, and ‘2014/2015’
seasons, respectively.
CASE
statement return a 1
for every match you want to include, and a 0
for every match to exclude.
CASE
statement in a SUM
to return the total matches played in each season.
SELECT
AS country,
c.name -- Sum the total records in each season where the home team won
SUM(CASE WHEN m.season = '2012/2013' AND m.home_goal > m.away_goal
THEN 1 ELSE 0 END) AS matches,
SUM(CASE WHEN m.season = '2013/2014' AND m.home_goal > m.away_goal
THEN 1 ELSE 0 END) AS matches,
SUM(CASE WHEN m.season = '2014/2015' AND m.home_goal > m.away_goal
THEN 1 ELSE 0 END) AS matches
FROM country AS c
LEFT JOIN match AS m
ON c.id = m.country_id
-- Group by country name alias
GROUP BY country;
country | matches | matches | matches |
---|---|---|---|
Belgium | 102 | 6 | 106 |
England | 166 | 179 | 172 |
France | 170 | 168 | 181 |
Germany | 130 | 145 | 145 |
Italy | 177 | 181 | 152 |
Netherlands | 137 | 144 | 138 |
Poland | 97 | 110 | 114 |
Portugal | 103 | 108 | 137 |
Scotland | 89 | 102 | 102 |
Spain | 189 | 179 | 171 |
Good job! You now have a second way of calculating the results from the previous exercise.
1.3.3 Calculating percent with CASE and AVG
CASE
statements will return any value you specify in your THEN
clause. This is an incredibly powerful tool for robust calculations and
data manipulation when used in conjunction with an aggregate statement.
One key task you can perform is using CASE
inside an AVG
function to calculate a percentage of information in your database.
Here’s an example of how you set that up:
AVG(CASE WHEN condition_is_met THEN 1
WHEN condition_is_not_met THEN 0 END)
With this approach, it’s important to accurately specify which records count as 0
, otherwise your calculations may not be correct!
Your task is to examine the number of wins, losses, and ties in each country. The matches
table is filtered to include all matches from the 2013/2014 and 2014/2015 seasons.
CASE
statements to COUNT
the total number of home team wins, away team wins, and ties, which will allow you to examine the total number of records.
SELECT
AS country,
c.name -- Sum the home wins, away wins, and ties in each country
COUNT(CASE WHEN m.home_goal > m.away_goal THEN m.id
END) AS home_wins,
COUNT(CASE WHEN m.home_goal < m.away_goal THEN m.id
END) AS away_wins,
COUNT(CASE WHEN m.home_goal = m.away_goal THEN m.id
END) AS ties
FROM country AS c
LEFT JOIN matches AS m
ON c.id = m.country_id
GROUP BY country;
country | home_wins | away_wins | ties |
---|---|---|---|
Belgium | 810 | 493 | 425 |
England | 1390 | 867 | 783 |
France | 1359 | 822 | 859 |
Germany | 1107 | 744 | 597 |
Italy | 1407 | 814 | 796 |
Netherlands | 1171 | 696 | 581 |
Poland | 870 | 525 | 525 |
Portugal | 908 | 611 | 533 |
Scotland | 760 | 617 | 447 |
Spain | 1485 | 851 | 704 |
CASE
statement inside AVG
.
ties_2013_2014
and ties_2014_2015
, respectively.
SELECT
AS country,
c.name -- Calculate the percentage of tied games in each season
AVG(CASE WHEN m.season= '2013/2014' AND m.home_goal = m.away_goal THEN 1
WHEN m.season= '2013/2014' AND m.home_goal != m.away_goal THEN 0
END) AS ties_2013_2014,
AVG(CASE WHEN m.season= '2014/2015' AND m.home_goal = m.away_goal THEN 1
WHEN m.season= '2014/2015' AND m.home_goal != m.away_goal THEN 0
END) AS ties_2014_2015
FROM country AS c
LEFT JOIN matches AS m
ON c.id = m.country_id
GROUP BY country;
country | ties_2013_2014 | ties_2014_2015 |
---|---|---|
Belgium | 0.1666667 | 0.2500000 |
England | 0.2052632 | 0.2447368 |
France | 0.2842105 | 0.2315789 |
Germany | 0.2091503 | 0.2679739 |
Italy | 0.2368421 | 0.3166227 |
Netherlands | 0.2745098 | 0.2385621 |
Poland | 0.3041667 | 0.2750000 |
Portugal | 0.2500000 | 0.2777778 |
Scotland | 0.2192982 | 0.1929825 |
Spain | 0.2263158 | 0.2394737 |
ROUND
function to round to 2 decimal points.
SELECT
AS country,
c.name -- Round the percentage of tied games to 2 decimal points
ROUND(AVG(CASE WHEN m.season='2013/2014' AND m.home_goal = m.away_goal THEN 1
WHEN m.season='2013/2014' AND m.home_goal != m.away_goal THEN 0
END),2) AS pct_ties_2013_2014,
ROUND(AVG(CASE WHEN m.season='2014/2015' AND m.home_goal = m.away_goal THEN 1
WHEN m.season='2014/2015' AND m.home_goal != m.away_goal THEN 0
END),2) AS pct_ties_2014_2015
FROM country AS c
LEFT JOIN matches AS m
ON c.id = m.country_id
GROUP BY country;
country | pct_ties_2013_2014 | pct_ties_2014_2015 |
---|---|---|
Belgium | 0.17 | 0.25 |
England | 0.21 | 0.24 |
France | 0.28 | 0.23 |
Germany | 0.21 | 0.27 |
Italy | 0.24 | 0.32 |
Netherlands | 0.27 | 0.24 |
Poland | 0.30 | 0.28 |
Portugal | 0.25 | 0.28 |
Scotland | 0.22 | 0.19 |
Spain | 0.23 | 0.24 |
Fantastic! That’s much easier to read. What patterns/differences are you seeing in the results? Who tends to score ties more often?
2 Short and Simple Subqueries
In this chapter, you will learn about subqueries in the SELECT, FROM, and WHERE clauses. You will gain an understanding of when subqueries are necessary to construct your dataset and where to best include them in your queries.
2.1 WHERE are the Subqueries?
2.1.1 Filtering using scalar subqueries
Subqueries are incredibly powerful for performing complex filters and transformations. You can filter data based on single, scalar values using a subquery in ways you cannot by using WHERE
statements or joins. Subqueries can also be used for more advanced
manipulation of your data set. You will likely encounter subqueries in
any real-world setting that uses relational databases.
In this exercise, you will generate a list of matches where the total goals scored (for both teams in total) is more than 3 times the average for games in the matches
table, which includes all games played in the 2013/2014 season.
?column?
in your results.
-- Select the average of home + away goals, multiplied by 3
SELECT
3 * avg(home_goal + away_goal)
FROM matches;
3 * avg(home_goal + away_goal) |
---|
8.116594 |
SELECT
-- Select the date, home goals, and away goals scored
date,
home_goal,
away_goalFROM matches
-- Filter for matches where total goals exceeds 3x the average
WHERE (home_goal + away_goal) >
SELECT 3 * AVG(home_goal + away_goal)
(FROM matches);
date | home_goal | away_goal |
---|---|---|
2022-10-29 | 4 | 5 |
2022-11-22 | 9 | 1 |
2022-01-16 | 7 | 2 |
2022-08-28 | 8 | 2 |
2022-12-29 | 7 | 3 |
2022-05-19 | 5 | 5 |
2022-12-14 | 6 | 3 |
2022-03-22 | 3 | 6 |
2022-08-30 | 3 | 6 |
2022-01-23 | 4 | 5 |
Look at those scores! This calculation would have been much more difficult without a subquery.
2.1.2 Filtering using a subquery with a list
Your goal in this exercise is to generate a list of teams that never
played a game in their home city. Using a subquery, you will generate a
list of unique hometeam_ID
values from the unfiltered match
table to exclude in the team
table’s team_api_ID
column.
In addition to filtering using a single-value (scalar) subquery, you can create a list of values in a subquery to filter data based on a complex set of conditions. This type of subquery generates a one column reference list for the main query. As long as the values in your list match a column in your main query’s table, you don’t need to use a join – even if the list is from a separate table.
WHERE
clause that retrieves all unique hometeam_ID
values from the match
table.
team_long_name
and team_short_name
from the team
table. Exclude all values from the subquery in the main query.
SELECT
-- Select the team long and short names
team_long_name,
team_short_nameFROM team
-- Exclude all values from the subquery
WHERE team_api_id NOT IN
SELECT DISTINCT hometeam_id FROM match); (
team_long_name | team_short_name |
---|
Great job! There are 52 teams without any home games recorded in this database. Why do you think that is?
2.1.3 Filtering with more complex subquery conditions
In the previous exercise, you generated a list of teams that have no
home matches listed in the soccer database using a subquery in WHERE
. Let’s do some further exploration in this database by creating a list of teams that scored 8 or more goals in a home match.
In order to do this, you will construct a subquery in the WHERE
statement with its own filtering condition.
WHERE
clause that retrieves all hometeam_ID
values from match
with a home_goal
score greater than or equal to 8.
team_long_name
and team_short_name
from the team
table. Include all values from the subquery in the main query.
SELECT
-- Select the team long and short names
team_long_name,
team_short_nameFROM team
-- Filter for teams with 8 or more home goals
WHERE team_api_id IN
SELECT hometeam_id
(FROM match
WHERE home_goal >= 8);
team_long_name | team_short_name |
---|---|
Manchester United | MUN |
Tottenham Hotspur | TOT |
Chelsea | CHE |
Southampton | SOU |
FC Bayern Munich | BMU |
PSV | PSV |
SL Benfica | BEN |
Celtic | CEL |
Real Madrid CF | REA |
FC Barcelona | BAR |
Excellent job! This is an interesting list of teams!
2.2 Subqueries in FROM
2.2.1 Joining Subqueries in FROM
The match
table in the European Soccer Database does not
contain country or team names. You can get this information by joining
it to the country
table, and use this to aggregate information, such as the number of matches played in each country.
If you’re interested in filtering data from one of these tables, you can
also create a subquery from one of the tables, and then join it to an
existing table in the database. A subquery in FROM
is an
effective way of answering detailed questions that requires filtering or
transforming data before including it in your final results.
Your goal in this exercise is to generate a subquery using the match
table, and then join that subquery to the country
table to calculate information about matches with 10 or more goals in total!
id
) from the match
table.
SELECT
-- Select the country ID and match ID
country_id, id
FROM match
-- Filter for matches with 10 or more goals in total
WHERE (home_goal + away_goal) >= 10;
country_id | id |
---|---|
1729 | 2157 |
1729 | 3093 |
1729 | 3369 |
1729 | 3566 |
4769 | 5192 |
7809 | 9211 |
13274 | 13899 |
13274 | 14224 |
19694 | 20107 |
21518 | 23444 |
country
in the main query.
name
from country
and count the id
column from match
.
SELECT
-- Select country name and the count match IDs
AS country_name,
c.name COUNT(sub.id) AS matches
FROM country AS c
-- Inner join the subquery onto country
-- Select the country id and match id columns
INNER JOIN (SELECT country_id, id
FROM match
-- Filter the subquery by matches with 10+ goals
WHERE (home_goal + away_goal) >= 10) AS sub
ON c.id = sub.country_id
GROUP BY country_name;
country_name | matches |
---|---|
England | 4 |
France | 1 |
Germany | 1 |
Netherlands | 2 |
Scotland | 1 |
Spain | 5 |
Good job! You generated a count of matches in each country where the total goals was higher than 10. That’s a lot of goals!
2.2.2 Building on Subqueries in FROM
In the previous exercise, you found that England, Netherlands, Germany and Spain were the only countries that had matches in the database where 10 or more goals were scored overall. Let’s find out some more details about those matches – when they were played, during which seasons, and how many of the goals were home versus away goals.
You’ll notice that in this exercise, the table alias is excluded for every column selected in the main query. This is because the main query is extracting data from the subquery, which is treated as a single table.
FROM
clause. Select the
country name from the country table, along with the date, the home goal,
the away goal, and the total goals columns from the match table.
total_goals
. This will be used to filter the main query.
SELECT
-- Select country, date, home, and away goals from the subquery
country,date,
home_goal,
away_goalFROM
-- Select country name, date, home_goal, away_goal, and total goals in the subquery
SELECT c.name AS country,
(date,
m.
m.home_goal,
m.away_goal,+ m.away_goal) AS total_goals
(m.home_goal FROM match AS m
LEFT JOIN country AS c
ON m.country_id = c.id) AS subquery
-- Filter by total goals scored in the main query
WHERE total_goals >= 10;
country | date | home_goal | away_goal |
---|---|---|---|
England | 2022-11-22 | 9 | 1 |
England | 2022-08-28 | 8 | 2 |
England | 2022-12-29 | 7 | 3 |
England | 2022-05-19 | 5 | 5 |
France | 2022-11-08 | 5 | 5 |
Germany | 2022-03-30 | 9 | 2 |
Netherlands | 2022-10-24 | 10 | 0 |
Netherlands | 2022-11-06 | 6 | 4 |
Scotland | 2022-05-05 | 6 | 6 |
Spain | 2022-10-30 | 7 | 3 |
Great job! You can manipulate the information you extract by placing joins in or using a subquery. These must have been exciting matches!
2.3 Subqueries in SELECT
2.3.1 Add a subquery to the SELECT clause
Subqueries in SELECT
statements generate a single value
that allow you to pass an aggregate value down a data frame. This is
useful for performing calculations on data within your database.
In the following exercise, you will construct a query that calculates the average number of goals per match in each country’s league.
home_goal
and away_goal
.
home_goal
and away_goal
. This calculates the average goals for each league.
SELECT
AS league,
l.name -- Select and round the league's total goals
ROUND(AVG(m.home_goal + m.away_goal),2) AS avg_goals,
-- Select and round the average total goals
SELECT ROUND(AVG(home_goal + away_goal),2)
(FROM match
WHERE season = '2013/2014') AS overall_avg
FROM league AS l
LEFT JOIN match AS m
ON l.country_id = m.country_id
-- Filter for the 2013/2014 season
WHERE m.season = '2013/2014'
GROUP BY l.name;
league | avg_goals | overall_avg |
---|---|---|
Belgium Jupiler League | 2.50 | 2.77 |
England Premier League | 2.77 | 2.77 |
France Ligue 1 | 2.46 | 2.77 |
Germany 1. Bundesliga | 3.16 | 2.77 |
Italy Serie A | 2.72 | 2.77 |
Netherlands Eredivisie | 3.20 | 2.77 |
Poland Ekstraklasa | 2.64 | 2.77 |
Portugal Liga ZON Sagres | 2.37 | 2.77 |
Scotland Premier League | 2.75 | 2.77 |
Spain LIGA BBVA | 2.75 | 2.77 |
Good job! With a subquery in select, you can compare each league’s average to the overall average. Which leagues tend to score higher than average?
2.3.2 Subqueries in Select for Calculations
Subqueries in SELECT
are a useful way to create calculated columns in a query. A subquery in SELECT
can be treated as a single numeric value to use in your calculations. When writing queries in SELECT
, it’s important to remember that filtering the main query does not filter the subquery – and vice versa.
In the previous exercise, you created a column to compare each league’s average total goals to the overall average goals in the 2013/2014 season. In this exercise, you will add a column that directly compares these values by subtracting the overall average from the subquery.
SELECT
-- Select the league name and average goals scored
AS league,
l.name ROUND(AVG(m.home_goal + m.away_goal),2) AS avg_goals,
-- Subtract the overall average from the league average
ROUND(AVG(m.home_goal + m.away_goal) -
SELECT AVG(home_goal + away_goal)
(FROM match
WHERE season = '2013/2014'),2) AS diff
FROM league AS l
LEFT JOIN match AS m
ON l.country_id = m.country_id
-- Only include 2013/2014 results
WHERE m.season = '2013/2014'
GROUP BY l.name;
league | avg_goals | diff |
---|---|---|
Belgium Jupiler League | 2.50 | -0.27 |
England Premier League | 2.77 | 0.00 |
France Ligue 1 | 2.46 | -0.31 |
Germany 1. Bundesliga | 3.16 | 0.39 |
Italy Serie A | 2.72 | -0.04 |
Netherlands Eredivisie | 3.20 | 0.43 |
Poland Ekstraklasa | 2.64 | -0.13 |
Portugal Liga ZON Sagres | 2.37 | -0.40 |
Scotland Premier League | 2.75 | -0.02 |
Spain LIGA BBVA | 2.75 | -0.02 |
Good job! Games in the Netherlands tend to score the highest number of goals on average in this season. Why do you think that is?
2.4 Subqueries best practices
2.4.1 ALL the subqueries EVERYWHERE
In soccer leagues, games are played at different stages.
Winning teams progress from one stage to the next, until they reach the
final stage. In each stage, the stakes become higher than the previous
one. The match
table includes data about the different stages that each match took place in.
In this lesson, you will build a final query across 3 exercises that will contain three subqueries – one in the SELECT
clause, one in the FROM
clause, and one in the WHERE
clause. In the final exercise, your query will extract data examining
the average goals scored in each stage of a match. Does the average
number of goals scored change as the stakes get higher from one stage to
the next?
SELECT
subqueries.
m.stage
column.
SELECT
-- Select the stage and average goals for each stage
m.stage,ROUND(AVG(m.home_goal + m.away_goal),2) AS avg_goals,
-- Select the average overall goals for the 2012/2013 season
ROUND((SELECT AVG(home_goal + away_goal)
FROM match
WHERE season = '2012/2013'),2) AS overall
FROM match AS m
-- Filter for the 2012/2013 season
WHERE m.season = '2012/2013'
-- Group by stage
GROUP BY m.stage;
stage | avg_goals | overall |
---|---|---|
1 | 2.68 | 2.77 |
2 | 2.65 | 2.77 |
3 | 2.83 | 2.77 |
4 | 2.80 | 2.77 |
5 | 2.61 | 2.77 |
6 | 2.78 | 2.77 |
7 | 2.69 | 2.77 |
8 | 3.09 | 2.77 |
9 | 2.70 | 2.77 |
10 | 2.96 | 2.77 |
Excellent job! You will continue building on this data set in the next exercises.
2.4.2 Add a subquery in FROM
In the previous exercise, you created a data set listing the average home and away goals in each match stage of the 2012/2013 match season.
In this next step, you will turn the main query into a subquery to extract a list of stages where the average home goals in a stage is higher than the overall average for home goals in a match.
FROM
clause subquery.
WHERE
clause that calculates the overall average home goals.
stage
and avg_goals
columns from the s
subquery into the main query.
SELECT
-- Select the stage and average goals from the subquery
s.stage,ROUND(s.avg_goals,2) AS avg_goals
FROM
-- Select the stage and average goals in 2012/2013
SELECT
(
stage,AVG(home_goal + away_goal) AS avg_goals
FROM match
WHERE season = '2012/2013'
GROUP BY stage) AS s
WHERE
-- Filter the main query using the subquery
> (SELECT AVG(home_goal + away_goal)
s.avg_goals FROM match WHERE season = '2012/2013');
stage | avg_goals |
---|---|
3 | 2.83 |
4 | 2.80 |
6 | 2.78 |
8 | 3.09 |
10 | 2.96 |
11 | 2.92 |
12 | 3.23 |
17 | 2.85 |
20 | 2.96 |
21 | 2.90 |
Excellent job! Quite a few stages were eliminated with this filter. Do you notice anything interesting?
2.4.3 Add a subquery in SELECT
In the previous exercise, you added a subquery to the FROM
statement and selected the stages where the number of average goals in a
stage exceeded the overall average number of goals in the 2012/2013
match season. In this final step, you will add a subquery in SELECT
to compare the average number of goals scored in each stage to the total.
SELECT
that yields the average goals scored in the 2012/2013 season. Name the new column overall_avg
.
FROM
that calculates the average goals scored in each stage during the 2012/2013 season.
SELECT
-- Select the stage and average goals from s
s.stage,ROUND(s.avg_goals,2) AS avg_goal,
-- Select the overall average for 2012/2013
SELECT AVG(home_goal + away_goal) FROM match WHERE season = '2012/2013') AS overall_avg
(FROM
-- Select the stage and average goals in 2012/2013 from match
SELECT
(
stage,AVG(home_goal + away_goal) AS avg_goals
FROM match
WHERE season = '2012/2013'
GROUP BY stage) AS s
WHERE
-- Filter the main query using the subquery
> (SELECT AVG(home_goal + away_goal)
s.avg_goals FROM match WHERE season = '2012/2013');
stage | avg_goal | overall_avg |
---|---|---|
3 | 2.83 | 2.772699 |
4 | 2.80 | 2.772699 |
6 | 2.78 | 2.772699 |
8 | 3.09 | 2.772699 |
10 | 2.96 | 2.772699 |
11 | 2.92 | 2.772699 |
12 | 3.23 | 2.772699 |
17 | 2.85 | 2.772699 |
20 | 2.96 | 2.772699 |
21 | 2.90 | 2.772699 |
Amazing job! You used 3 subqueries to generate a list of stages that have higher than average goals scored in matches.
4 Window Functions
You will learn about window functions and how to pass aggregate functions along a dataset. You will also learn how to calculate running totals and partitioned averages.
4.1 It’s OVER
4.1.1 The match is OVER
The OVER()
clause allows you to pass an aggregate function down a data set, similar to subqueries in SELECT
. The OVER()
clause offers significant benefits over subqueries in select – namely, your queries will run faster, and the OVER()
clause has a wide range of additional functions and clauses you can
include with it that we will cover later on in this chapter.
In this exercise, you will revise some queries from previous chapters using the OVER()
clause.
match
and country
tables.
SELECT
-- Select the id, country name, season, home, and away goals
id,
m.AS country,
c.name
m.season,
m.home_goal,
m.away_goal,-- Use a window to include the aggregate average in each row
AVG(m.home_goal + m.away_goal) OVER() AS overall_avg
FROM match AS m
LEFT JOIN country AS c ON m.country_id = c.id;
id | country | season | home_goal | away_goal | overall_avg |
---|---|---|---|---|---|
1 | Belgium | 2008/2009 | 1 | 1 | 2.705531 |
2 | Belgium | 2008/2009 | 0 | 0 | 2.705531 |
3 | Belgium | 2008/2009 | 0 | 3 | 2.705531 |
4 | Belgium | 2008/2009 | 5 | 0 | 2.705531 |
5 | Belgium | 2008/2009 | 1 | 3 | 2.705531 |
6 | Belgium | 2008/2009 | 1 | 1 | 2.705531 |
7 | Belgium | 2008/2009 | 2 | 2 | 2.705531 |
8 | Belgium | 2008/2009 | 1 | 2 | 2.705531 |
9 | Belgium | 2008/2009 | 1 | 0 | 2.705531 |
10 | Belgium | 2008/2009 | 4 | 1 | 2.705531 |
Good job! The basic window function with an OVER
clause functions just like a subquery in SELECT
.
4.1.2 What’s OVER here?
Window functions allow you to create a RANK
of information
according to any variable you want to use to sort your data. When
setting this up, you will need to specify what column/calculation you
want to use to calculate your rank. This is done by including an ORDER BY
clause inside the OVER()
clause. Below is an example:
SELECT
id,
RANK() OVER(ORDER BY home_goal) AS rank
FROM match;
In this exercise, you will create a data set of ranked matches according to which leagues, on average, score the most goals in a match.
league
and match
.
SELECT
-- Select the league name and average goals scored
AS league,
l.name AVG(m.home_goal + m.away_goal) AS avg_goals,
-- Rank each league according to the average goals
RANK() OVER(ORDER BY AVG(m.home_goal + m.away_goal)) AS league_rank
FROM league AS l
LEFT JOIN match AS m
ON l.id = m.country_id
WHERE m.season = '2011/2012'
GROUP BY l.name
-- Order the query by the rank you created
ORDER BY league_rank;
league | avg_goals | league_rank |
---|---|---|
Poland Ekstraklasa | 2.195833 | 1 |
France Ligue 1 | 2.515789 | 2 |
Italy Serie A | 2.583799 | 3 |
Switzerland Super League | 2.623457 | 4 |
Scotland Premier League | 2.635965 | 5 |
Portugal Liga ZON Sagres | 2.641667 | 6 |
Spain LIGA BBVA | 2.763158 | 7 |
England Premier League | 2.805263 | 8 |
Germany 1. Bundesliga | 2.859477 | 9 |
Belgium Jupiler League | 2.879167 | 10 |
Good job! You’ll notice that the default order is from smallest to largest value – in this case, the Polish league tends to have the fewest goals scored on average.
4.1.3 Flip OVER your results
In the last exercise, the rank generated in your query was organized from smallest to largest. By adding DESC
to your window function, you can create a rank sorted from largest to smallest.
SELECT
id,
RANK() OVER(ORDER BY home_goal DESC) AS rank
FROM match;
SELECT
-- Select the league name and average goals scored
AS league,
l.name AVG(m.home_goal + m.away_goal) AS avg_goals,
-- Rank leagues in descending order by average goals
RANK() OVER(ORDER BY AVG(m.home_goal + m.away_goal) DESC) AS league_rank
FROM league AS l
LEFT JOIN match AS m
ON l.id = m.country_id
WHERE m.season = '2011/2012'
GROUP BY l.name
-- Order the query by the rank you created
ORDER BY league_rank;
league | avg_goals | league_rank |
---|---|---|
Netherlands Eredivisie | 3.258170 | 1 |
Belgium Jupiler League | 2.879167 | 2 |
Germany 1. Bundesliga | 2.859477 | 3 |
England Premier League | 2.805263 | 4 |
Spain LIGA BBVA | 2.763158 | 5 |
Portugal Liga ZON Sagres | 2.641667 | 6 |
Scotland Premier League | 2.635965 | 7 |
Switzerland Super League | 2.623457 | 8 |
Italy Serie A | 2.583799 | 9 |
France Ligue 1 | 2.515789 | 10 |
Great job! Unlike a subquery in SELECT, your window function will apply the filter that you include in your WHERE
clause.
4.2 OVER with a PARTITION
4.2.1 PARTITION BY a column
The PARTITION BY
clause allows you to calculate separate
“windows” based on columns you want to divide your results. For example,
you can create a single column that calculates an overall average of
goals scored for each season.
In this exercise, you will be creating a data set of games played by Legia Warszawa (Warsaw League), the top ranked team in Poland, and comparing their individual game performance to the overall average for that season.
Where do you see more outliers? Are they Legia Warszawa’s home or away games?
id = 8673
.
SELECT
date,
season,
home_goal,
away_goal,CASE WHEN hometeam_id = 8673 THEN 'home'
ELSE 'away' END AS warsaw_location,
-- Calculate the average goals scored partitioned by season
AVG(home_goal) OVER(PARTITION BY season) AS season_homeavg,
AVG(away_goal) OVER(PARTITION BY season) AS season_awayavg
FROM match
-- Filter the data set for Legia Warszawa matches only
WHERE
= 8673
hometeam_id OR awayteam_id = 8673
ORDER BY (home_goal + away_goal) DESC;
date | season | home_goal | away_goal | warsaw_location | season_homeavg | season_awayavg |
---|---|---|---|---|---|---|
2022-09-14 | 2013/2014 | 3 | 5 | away | 1.766667 | 1.233333 |
2022-10-24 | 2009/2010 | 5 | 2 | home | 1.233333 | 0.700000 |
2022-05-25 | 2010/2011 | 2 | 5 | away | 1.633333 | 1.133333 |
2022-09-13 | 2014/2015 | 4 | 3 | home | 1.566667 | 1.333333 |
2022-02-25 | 2010/2011 | 3 | 3 | away | 1.633333 | 1.133333 |
2022-07-20 | 2013/2014 | 5 | 1 | home | 1.766667 | 1.233333 |
2022-11-28 | 2008/2009 | 2 | 3 | away | 1.600000 | 0.700000 |
2022-05-30 | 2008/2009 | 4 | 1 | home | 1.600000 | 0.700000 |
2022-03-26 | 2009/2010 | 2 | 3 | away | 1.233333 | 0.700000 |
2022-10-22 | 2010/2011 | 1 | 4 | away | 1.633333 | 1.133333 |
Good job! You’ve generated an ordered data set with window functions to compare each game’s score to averages for that season.
4.2.2 PARTITION BY multiple columns
The PARTITION BY
clause can be used to break out window
averages by multiple data points (columns). You can even calculate the
information you want to use to partition your data! For example, you can
calculate average goals scored by season and by country, or by the
calendar year (taken from the date column).
In this exercise, you will calculate the average number home and away goals scored Legia Warszawa, and their opponents, partitioned by the month in each season.
8673
) so that the window calculation only includes matches involving them.
SELECT
date,
season,
home_goal,
away_goal,CASE WHEN hometeam_id = 8673 THEN 'home'
ELSE 'away' END AS warsaw_location,
-- Calculate average goals partitioned by season and month
AVG(home_goal) OVER(PARTITION BY season,
cast(strftime('%m', date) as int)) AS season_mo_home,
AVG(away_goal) OVER(PARTITION BY season,
cast(strftime('%m', date) as int)) AS season_mo_away
FROM match
WHERE
= 8673
hometeam_id OR awayteam_id = 8673
ORDER BY (home_goal + away_goal) DESC;
date | season | home_goal | away_goal | warsaw_location | season_mo_home | season_mo_away |
---|---|---|---|---|---|---|
2022-09-14 | 2013/2014 | 3 | 5 | away | 2.250000 | 2.5000000 |
2022-10-24 | 2009/2010 | 5 | 2 | home | 2.500000 | 0.7500000 |
2022-05-25 | 2010/2011 | 2 | 5 | away | 2.000000 | 1.1666667 |
2022-09-13 | 2014/2015 | 4 | 3 | home | 2.000000 | 2.6666667 |
2022-02-25 | 2010/2011 | 3 | 3 | away | 3.000000 | 3.0000000 |
2022-07-20 | 2013/2014 | 5 | 1 | home | 2.500000 | 2.0000000 |
2022-05-30 | 2008/2009 | 4 | 1 | home | 2.000000 | 0.6000000 |
2022-11-28 | 2008/2009 | 2 | 3 | away | 1.833333 | 0.6666667 |
2022-03-26 | 2009/2010 | 2 | 3 | away | 1.000000 | 1.2500000 |
2022-04-02 | 2010/2011 | 2 | 3 | home | 1.600000 | 1.2000000 |
Excellent! You successfully partitioned your averages by each month in the season. In the next lesson, we’ll expand on the usefulness of information generated with window functions.
4.3 Sliding windows
4.3.1 Slide to the left
Sliding windows allow you to create running calculations between any two points in a window using functions such as PRECEDING
, FOLLOWING
, and CURRENT ROW
.
You can calculate running counts, sums, averages, and other aggregate
functions between any two points you specify in the data set.
In this exercise, you will expand on the examples discussed in the video, calculating the running total of goals scored by the FC Utrecht when they were the home team during the 2011/2012 season. Do they score more goals at the end of the season as the home or away team?
- Assessing the running total of home goals scored by FC Utrecht.
- Assessing the running average of home goals scored.
-
Ordering both the running average and running total by
date
.
SELECT
date,
home_goal,
away_goal,-- Create a running total and running average of home goals
SUM(home_goal) OVER(ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,
AVG(home_goal) OVER(ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_avg
FROM match
WHERE
= 9908
hometeam_id AND season = '2011/2012';
date | home_goal | away_goal | running_total | running_avg |
---|---|---|---|---|
2022-01-22 | 1 | 1 | 1 | 1.000000 |
2022-02-12 | 1 | 1 | 2 | 1.000000 |
2022-02-19 | 3 | 0 | 5 | 1.666667 |
2022-03-04 | 0 | 0 | 5 | 1.250000 |
2022-03-18 | 3 | 1 | 8 | 1.600000 |
2022-03-30 | 3 | 2 | 11 | 1.833333 |
2022-04-15 | 4 | 2 | 15 | 2.142857 |
2022-04-28 | 1 | 3 | 16 | 2.000000 |
2022-05-02 | 2 | 2 | 18 | 2.000000 |
2022-08-14 | 2 | 2 | 20 | 2.000000 |
Good job! Sliding windows allow for a wide variety of calculations with your database.
4.3.2 Slide to the right
Now let’s see how FC Utrecht performs when they’re the away team. You’ll notice that the total for the season is at the bottom of the data set you queried. Depending on your results, this could be pretty long, and scrolling down is not very helpful.
In this exercise, you will slightly modify the query from the previous
exercise by sorting the data set in reverse order and calculating a
backward running total from the CURRENT ROW
to the end of the data set (earliest record).
- Assessing the running total of home goals scored by FC Utrecht.
- Assessing the running average of home goals scored.
-
Ordering both the running average and running total by
date
, descending.
SELECT
-- Select the date, home goal, and away goals
date,
home_goal,
away_goal,-- Create a running total and running average of home goals
SUM(home_goal) OVER(ORDER BY date DESC
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS running_total,
AVG(home_goal) OVER(ORDER BY date DESC
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS running_avg
FROM match
WHERE
= 9908
awayteam_id AND season = '2011/2012';
date | home_goal | away_goal | running_total | running_avg |
---|---|---|---|---|
2022-12-17 | 1 | 0 | 25 | 1.470588 |
2022-11-25 | 2 | 0 | 24 | 1.500000 |
2022-11-20 | 2 | 2 | 22 | 1.466667 |
2022-10-30 | 3 | 1 | 20 | 1.428571 |
2022-10-15 | 1 | 0 | 17 | 1.307692 |
2022-09-24 | 1 | 0 | 16 | 1.333333 |
2022-09-11 | 2 | 3 | 15 | 1.363636 |
2022-08-20 | 2 | 1 | 13 | 1.300000 |
2022-08-06 | 0 | 0 | 11 | 1.222222 |
2022-05-06 | 1 | 3 | 11 | 1.375000 |
Good job! You can reverse your calculations with your window functions to best suit your needs in an analysis.
4.4 Bringing it all together
4.4.1 Setting up the home team CTE
In this course, we’ve covered ways in which you can use CASE
statements, subqueries, common table expressions, and window functions
in your queries to structure a data set that best meets your needs. For
this exercise, you will be using all of these concepts to generate a
list of matches in which Manchester United was defeated during the
2014/2015 English Premier League season.
Your first task is to create the first query that filters for matches where Manchester United played as the home team. This will become a common table expression in a later exercise.
CASE
statement that identifies each match as a win, lose, or tie for Manchester United.
WHEN
clause in the CASE
statement (equals, greater than, less than).
match
, and team_api_id
from team
.
SELECT
id,
m.
t.team_long_name,-- Identify matches as home/away wins or ties
CASE WHEN m.home_goal > m.away_goal THEN 'MU Win'
WHEN m.home_goal < m.away_goal THEN 'MU Loss'
ELSE 'Tie' END AS outcome
FROM match AS m
-- Left join team on the home team ID and team API id
LEFT JOIN team AS t
ON m.hometeam_id = t.team_api_id
WHERE
-- Filter for 2014/2015 and Manchester United as the home team
= '2014/2015'
m.season AND t.team_long_name = 'Manchester United';
id | team_long_name | outcome |
---|---|---|
4013 | Manchester United | MU Loss |
4031 | Manchester United | MU Win |
4051 | Manchester United | MU Win |
4062 | Manchester United | MU Win |
4085 | Manchester United | MU Win |
4105 | Manchester United | MU Win |
4145 | Manchester United | MU Loss |
4164 | Manchester United | MU Win |
4181 | Manchester United | MU Win |
4203 | Manchester United | MU Win |
Good job! This query you wrote will become the first common table expression.
4.4.2 Setting up the away team CTE
Great job! Now that you have a query identifying the home team in a
match, you will perform a similar set of steps to identify the away team. Just like the previous step, you will join the match
and team
tables. Each of these two queries will be declared as a Common Table Expression in the following step.
The primary difference in this query is that you will be joining the tables on awayteam_id
, and reversing the match outcomes in the CASE
statement.
When altering CASE
statement logic in your own work, you can reverse either the logical condition (i.e., home_goal > away_goal
) or the outcome in THEN
– just make sure you only reverse one of the two!
CASE
statement syntax.
awayteam_id
, and team_api_id
.
SELECT
id,
m.
t.team_long_name,-- Identify matches as home/away wins or ties
CASE WHEN m.home_goal > m.away_goal THEN 'MU Loss'
WHEN m.home_goal < m.away_goal THEN 'MU Win'
ELSE 'Tie' END AS outcome
-- Join team table to the match table
FROM match AS m
LEFT JOIN team AS t
ON m.awayteam_id = t.team_api_id
WHERE
-- Filter for 2014/2015 and Manchester United as the away team
= '2014/2015'
m.season AND t.team_long_name = 'Manchester United';
id | team_long_name | outcome |
---|---|---|
4026 | Manchester United | MU Loss |
4039 | Manchester United | MU Win |
4075 | Manchester United | MU Win |
4089 | Manchester United | Tie |
4117 | Manchester United | Tie |
4126 | Manchester United | Tie |
4136 | Manchester United | Tie |
4155 | Manchester United | MU Win |
4178 | Manchester United | Tie |
4197 | Manchester United | MU Loss |
Great job! You correctly constructed a CASE
statement to
accurately identify matches as a win or loss for Manchester United. This
will become your second common table expression.
4.4.3 Putting the CTEs together
Now that you’ve created the two subqueries identifying the home and away
team opponents, it’s time to rearrange your query with the home
and away
subqueries as Common Table Expressions (CTEs). You’ll notice that the main query includes the phrase, SELECT DISTINCT
. Without identifying only DISTINCT
matches, you will return a duplicate record for each game played.
Continue building the query to extract all matches played by Manchester United in the 2014/2015 season.
home
and away
CTEs before your main query.
LEFT JOIN
.
match
, team names from the CTEs, and home/ away goals from match
in the main query.
-- Set up the home team CTE
WITH home AS (
SELECT m.id, t.team_long_name,
CASE WHEN m.home_goal > m.away_goal THEN 'MU Win'
WHEN m.home_goal < m.away_goal THEN 'MU Loss'
ELSE 'Tie' END AS outcome
FROM match AS m
LEFT JOIN team AS t ON m.hometeam_id = t.team_api_id),
-- Set up the away team CTE
AS (
away SELECT m.id, t.team_long_name,
CASE WHEN m.home_goal > m.away_goal THEN 'MU Loss'
WHEN m.home_goal < m.away_goal THEN 'MU Win'
ELSE 'Tie' END AS outcome
FROM match AS m
LEFT JOIN team AS t ON m.awayteam_id = t.team_api_id)
-- Select team names, the date and goals
SELECT DISTINCT
date,
m.AS home_team,
home.team_long_name AS away_team,
away.team_long_name
m.home_goal, m.away_goal-- Join the CTEs onto the match table
FROM match AS m
LEFT JOIN home ON m.id = home.id
LEFT JOIN away ON m.id = away.id
WHERE m.season = '2014/2015'
AND (home.team_long_name = 'Manchester United'
OR away.team_long_name = 'Manchester United');
date | home_team | away_team | home_goal | away_goal |
---|---|---|---|---|
2022-08-16 | Manchester United | Swansea City | 1 | 2 |
2022-11-02 | Manchester City | Manchester United | 1 | 0 |
2022-11-08 | Manchester United | Crystal Palace | 1 | 0 |
2022-11-22 | Arsenal | Manchester United | 1 | 2 |
2022-11-29 | Manchester United | Hull City | 3 | 0 |
2022-12-02 | Manchester United | Stoke City | 2 | 1 |
2022-12-08 | Southampton | Manchester United | 1 | 2 |
2022-12-14 | Manchester United | Liverpool | 3 | 0 |
2022-12-20 | Aston Villa | Manchester United | 1 | 1 |
2022-12-26 | Manchester United | Newcastle United | 3 | 1 |
Fantastic! You wrote a query that extracted data from two common table expressions. You can scale up the number of CTEs as necessary, depending on your data.
4.4.4 Add a window function
Fantastic! You now have a result set that retrieves the match date, home team, away team, and the goals scored by each team. You have one final component of the question left – how badly did Manchester United lose in each match?
In order to determine this, let’s add a window function to the main query that ranks matches by the absolute value of the difference between home_goal
and away_goal
.
This allows us to directly compare the difference in scores without
having to consider whether Manchester United played as the home or away
team!
The equation is complete for you – all you need to do is properly complete the window function!
-- Set up the home team CTE
WITH home AS (
SELECT m.id, t.team_long_name,
CASE WHEN m.home_goal > m.away_goal THEN 'MU Win'
WHEN m.home_goal < m.away_goal THEN 'MU Loss'
ELSE 'Tie' END AS outcome
FROM match AS m
LEFT JOIN team AS t ON m.hometeam_id = t.team_api_id),
-- Set up the away team CTE
AS (
away SELECT m.id, t.team_long_name,
CASE WHEN m.home_goal > m.away_goal THEN 'MU Loss'
WHEN m.home_goal < m.away_goal THEN 'MU Win'
ELSE 'Tie' END AS outcome
FROM match AS m
LEFT JOIN team AS t ON m.awayteam_id = t.team_api_id)
-- Select columns and and rank the matches by goal difference
SELECT DISTINCT
date,
m.AS home_team,
home.team_long_name AS away_team,
away.team_long_name
m.home_goal, m.away_goal,RANK() OVER(ORDER BY ABS(home_goal - away_goal) DESC) as match_rank
-- Join the CTEs onto the match table
FROM match AS m
LEFT JOIN home ON m.id = home.id
LEFT JOIN AWAY ON m.id = away.id
WHERE m.season = '2014/2015'
AND ((home.team_long_name = 'Manchester United' AND home.outcome = 'MU Loss')
OR (away.team_long_name = 'Manchester United' AND away.outcome = 'MU Loss'));
date | home_team | away_team | home_goal | away_goal | match_rank |
---|---|---|---|---|---|
2022-04-26 | Everton | Manchester United | 3 | 0 | 1 |
2022-09-21 | Leicester City | Manchester United | 5 | 3 | 2 |
2022-08-16 | Manchester United | Swansea City | 1 | 2 | 3 |
2022-11-02 | Manchester City | Manchester United | 1 | 0 | 3 |
2022-01-11 | Manchester United | Southampton | 0 | 1 | 3 |
2022-02-21 | Swansea City | Manchester United | 2 | 1 | 3 |
2022-04-18 | Chelsea | Manchester United | 1 | 0 | 3 |
2022-05-02 | Manchester United | West Bromwich Albion | 0 | 1 | 3 |
Awesome! This is an interesting list of matches. You can easily modify this query for any team of your choice. Finally, if you liked this course, let your instructor know by tweeting them at @mona kay !