SQL 1: Intro to SQL and the SELECT Statement

SQL stands for Structured Query Language – see, you're halfway through the first sentence of this course, and you're already half asleep! Simply put, SQL is a way for us humans to communicate with databases. And "database" is a pretty generic term for "way for a computer to store data". There are lots of ways for a computer to store data, including (but certainly not limited to):

  • CSV, JSON, or other structured text file
  • Spreadsheet
  • Document database
  • Relational database management system (RDBMS)

That last one – relational databases or RDBMS – is the one we're going to concentrate on in this course. While there are many software applications that store data in other ways, including those listed, a vast majority of business-class software applications use some flavor of relational database to store data. These "flavors" are vast – some are open-source solutions, and some are proprietary and commercial. Examples of these are:

  • MySQL (open-source)
  • PostgreSQL (open-source)
  • SQLite (open-source)
  • IBM DB2 (closed-source, commercial)
  • Microsoft SQL Server (closed-source, commercial)
  • Oracle Database (closed-source, commercial)

And, although there are a large number of these flavors, developed by many different companies and organizations, the core of the SQL language is a standard that may be used to communicate with any of them.

Let's dive right in.

The Basics

Relational databases store data in tabular format. If you've ever used Microsoft Excel, Google Sheets, or any other spreadsheet application to store a list of anything, you've probably invented a way to organize data that is very similar to how databases do it – perhaps:

  • Your data is stored in a grid format
  • The columns of the grid represent the attributes of the thing you are storing
  • There is a row in the grid for each one of the things, and maybe that's called a record

For example, if I were to build a spreadsheet of my favorite movies, it might look something like this:

# Title Year Rating
1 Star Wars 1977 PG
2 Apollo 13 2004 PG
3 The Matrix 2007 R

Pretty straightforward. And, if I want to add more attributes to my data – say, I want to add the studio that produced the movie – I'd simply add to the columns of data.

# Title Year Rating Studio
1 Star Wars 1977 PG Lucasfilm
2 Apollo 13 2004 PG Universal
3 The Matrix 2007 R Warner Bros.

And, if I want to add more movies, I'd add to the rows of data.

# Title Year Rating Studio
1 Star Wars 1977 PG Lucasfilm
2 Apollo 13 2004 PG Universal
3 The Matrix 2007 R Warner Bros.
4 Forrest Gump 1994 PG-13 Paramount

As simple as this example is, this is precisely how data is structured in a relational database system. To offer up some definitions – the columns are known as the fields or attributes, the rows are known as records, and the rows+columns together is a table. When we have a collection of many tables, that's a database.

In practice, the main difference between a spreadsheet (e.g. Excel) and a database system is that Excel already includes a standard front-end by which an end-user (like you) is able to read and write to the data. Since the use-cases for RDBMSes are much more varied and programmable than a spreadsheet, there's no such standard for relational databases. Instead, we use SQL to talk with the database, and to pass the results on to our front-end of choice, whether that's a custom web application, a PDF report, a commercial front-end like Tableau, or whatever else might be needed.

Using SQL to read data

In this lesson, we'll be looking at ways we can use SQL to read data from a relational database. In a later lesson, we will be looking at using SQL to write to and modify a database.

The SELECT statement

We read data from a database by using the SELECT statement in SQL. Let's look at an example. Suppose we own an e-commerce store – we'd probably have a a database table used to store the products we sell. This table would likely be named Products and look something like this:

Name Price Department
Camera $299 Electronics
Sofa $699 Furniture
Dining Table $1299 Furniture
Toaster $79 Housewares

The SELECT statement allows us to ask the database questions. What products do we sell? How much does the Toaster cost? How many products cost more than $500?

It would be nice if the SQL language knew how to answer your question, in exactly the way you asked it. But, no, it's a computer, so we have to follow a simple formula so that you can ask the question in a way that SQL can understand:

  • Identify the table that holds the information you need
  • Identify the columns that hold the information you need
  • Add any other conditions
  • Then, ask the question in SQL

For instance, if my question is: what are the names of the products we sell?

  • Identify the table that holds the information you need – Products
  • Identify the columns that hold the information you need – Name
  • Add any other conditions – none
  • Then, ask the question:

SELECT Name FROM Products;

There are 3 things about this SELECT statement that I'll point out:

  • It ends with a semicolon ; – this depends on the flavor of RDBMS and the front-end being used, but many databases will require the semicolon at the end of the SQL statement, so all the examples we'll use will include it
  • The column and table names are case-sensitive; again, this will depend on the specific database system, but for illustration's sake, all examples will match the table/column name casing in the database
  • The SELECT and FROM clauses are in ALL CAPS; this is NOT a requirement in SQL, but clauses in examples will be written this way, as it is often more readable

The result that we'll receive is actually another, temporary table – i.e. one that is not part of the database structure or schema as-designed, but is simply a temporary subset of the larger table we've targeted. I like to think of this as a "👶baby table" – a much smaller table that looks a lot like the larger, original table, but only holds the data we asked for:

Name
Camera
Sofa
Dining Table
Toaster

To be clear, "👶 baby table" is not an official SQL term... it's just what I like to call it.

Now, it's up to you to decide if the results provided in this "👶 baby table" are everything you need, or if you need to modify your question in order to get more information.

Hmmm, it would nice if I could see the price too. Can I get a list of all the products we sell, along with the price?

Back to the formula:

  • Identify the table that holds the information you need – Products
  • Identify the columns that hold the information you need – Name and Price
  • Add any other conditions – none
  • Then, ask the question:

SELECT Name, Price FROM Products;

Name Price
Camera $299
Sofa $699
Dining Table $1299
Toaster $79

Sorting

Now that we've seen how to use the SELECT statement in SQL to retrieve rows of data from a database table, let's take one step toward doing something more meaningful. It's pretty rare that we'll want to get all records in a database table, in exactly the order that the records were inserted – this is what we've done thus far.

We'll often want our records in our 👶 baby table to be in a specific order – perhaps alphabetically or sorted by a numeric value. This is going to be based on how the data is ultimately consumed – e.g. in a PDF report for a stakeholder in your company, displayed on-screen for a user of a web application, and so on. Let's continue with our question and answer format for this one:

Can we order the results by price?

  • Identify the table that holds the information you need – Products
  • Identify the columns that hold the information you need – Name and Price
  • Add any other conditions – sort by Price
  • Then, ask the question:

SELECT Name, Price FROM Products ORDER BY Price;

Name Price
Toaster $79
Camera $299
Sofa $699
Dining Table $1299

As we can see, ascending order (from the least to the greatest) is the default when sorting a numeric value. (It's alphabetically from A-Z with a non-numeric value.)

Can we order by the more expensive products first?

  • Identify the table that holds the information you need – Products
  • Identify the columns that hold the information you need – Name and Price
  • Add any other conditions – sort by Price in descending order
  • Then, ask the question:

SELECT Name, Price FROM Products ORDER BY Price DESC;

Name Price
Dining Table $1299
Sofa $699
Camera $299
Toaster $79

Limiting

Another common use-case is that we want to limit the number of rows in our 👶baby table – what we've done so far is just fine for our very small table with 4 rows of data. But what if our Products table contained 10 million rows? The SELECT statements we've written thus far would be very slow, and not likely to be very helpful or meaningful to the person receiving the results. So we must often reduce our results, and that usually means refining the question we're asking.

What is the single most expensive product we sell?

  • Identify the table that holds the information you need – Products
  • Identify the columns that hold the information you need – Name and Price
  • Add any other conditions – sort by Price in descending order and only return 1 row
  • Then, ask the question:
SELECT Name, Price
FROM Products
ORDER BY Price DESC
LIMIT 1;

Same results as before – but only returns a single record, answering our question. Also note that we're separating each fragment of the SQL statement here – AKA clauses – by putting each on its own line. As our SQL becomes more complex, I've found this easier to read and comprehend – especially when first learning – so we'll do this from now on.

Lab 1: The Basics

Let's play ball! To begin, proceed to https://github.com/entr451-spring2024/sql-intro and open the project in Gitpod by clicking the Gitpod button. Then, in Gitpod, open a new Terminal session by going to the hamburger menu -> View -> Terminal. You can also try the Toggle Terminal keyboard shortcut. Slide the window containing the Terminal window so that it's larger, if you like, and begin by typing:

sqlite3 baseball.sqlite3

The database software that we'll be using in this lesson is SQLite, version 3. The sqlite3 command opens an interactive SQLite application where we can type SQL commands into the Terminal and, in this case, opens the baseball.sqlite3 database, which contains data about, well... baseball! Try it!

SELECT first_name, last_name FROM players;

Don't forget that semi-colon!  You should see a list of all the first and last names of all the baseball players in our database. You can also run a few SQLite-specific commands (i.e. not SQL, but commands that are utilities for the SQLite3 software itself). These commands start with the dot character (. ), like this:

.schema

The output you see shows you the three tables contained within our database, and the structure of the tables (we'll get more into the structure of tables later). Again, these SQLite commands are not SQL – so they do not end with the semi-colon.

You can also run SQL queries from files. Open demo.sql and you'll see the same query you just tried above. Run this query by issuing the following command from within SQLite:

.read demo.sql

This simply reads the SQL in the specified file, executes it, and displays the results on-screen.

You can also turn "column mode" and headers on, so that the results of your query will appear in a table-like format:

.mode column
.headers on

Browse the files named lab-1-*.sql – each file contains a question to be answered. See if you can write the SELECT statement necessary to answer each question, and execute each file using the .read command, as outlined above. For example, to run the first lab, write your SELECT statement in the lab-1-1.sql file, then execute it from within SQLite:

.read lab-1-1.sql

The output produced by the SELECT statement will be displayed on-screen. Note: by default, Gitpod will only display the last 1000 lines of output.

See the answers when you're done.

Filtering Data

The ability to filter data is probably the most powerful and often-used feature of SQL. For many people, gaining insights from arbitrary sets of data is often the objective of learning SQL in the first place.

We've already learned about the LIMIT clause, which simply whittles down the size of our result set – the 👶 baby table – by some number that we provide. However, we often want to specify criteria for filtering the results, rather than a set number like 10. What are our best-selling products? Which sales tactics are producing the best results this month? How many orders did we receive for the new widget this week?

For questions like these, we must use the WHERE clause. The WHERE clause allows us to add more criteria to the questions we ask. Let's jump into an example using our Products table:

Which products cost more than $500?

  • Identify the table that holds the information you need – Products
  • Identify the columns that hold the information you need – Name and Price
  • Add any other conditions –  Price is more than $500
  • Then, ask the question:

SELECT Name, Price FROM Products WHERE Price > 500;

Name Price
Dining Table $1299
Sofa $699

A WHERE clause is going to always have a column name (in this case, Price), a comparison operator (> or "greater than") and a value to compare (500). Because the column in question contains numbers, it's going to compare the Price of every row in the table against the numeric value 500 and filter the results to only include those with a price greater than that number. Let's look at what happens with a non-numeric comparison:

Which products are sold in the Furniture department?

  • Identify the table that holds the information you need – Products
  • Identify the columns that hold the information you need – Name and Price again, but we probably want Department now too
  • Add any other conditions – Department is Furniture
  • Then, ask the question:
SELECT Name, Price, Department 
FROM Products 
WHERE Department = "Furniture";
Name Price Department
Sofa $699 Furniture
Dining Table $1299 Furniture

Again, we have a column name (in this case, Department), a comparison operator (= or "exactly equals") and a value to compare ("Furniture"). Because Department is non-numeric, the value to compare must be in quotes. This is necessary to distinguish the comparison value from columns name, other SQL clauses, variables, and the like.

Separate multiple filters/conditions with the AND clause. So if we wanted the answer to:

Which products are sold in the furniture department and cost more than $500?

SELECT Name, Price, Department 
FROM Products 
WHERE Department = "Furniture"
  AND Price > 500;

Lab 2: Filtering

See the files named lab-2-*.sql for the questions, and modify each file until you achieve the desired result. See the answers when done.

A Star is Born

We've already seen how SQL queries require us to provide the columns that we want to be returned in the 👶 baby table – and how we often need to think about what columns those are when providing clean, concise output. Occasionally though, the columns don't really matter. In these cases, we can use the asterisk or star (*) to denote that we want all columns in a table. Consider a Reviews table, which contains reviews about our products. If we simply want all columns, all rows from this table, we could write a query that looks like:

SELECT * from reviews;

And the result would be everything – all columns and all rows – from that table.

Product Rating Body
Camera 4 Takes nice pictures!
Sofa 5 Comfy!
Camera 5 Best camera ever!!!
Toaster 3 Makes toast.
Product Rating Body
Camera 4 Takes nice pictures!
Sofa 5 Comfy!
Camera 5 Best camera ever!!!
Toaster 3 Makes toast.
Product Rating Body
Camera 4 Takes nice pictures!
Sofa 5 Comfy!
Camera 5 Best camera ever!!!
Toaster 3 Makes toast.

In practice, we're rarely going to use it like this, unless a table contains just a few columns, or when we don't care about concise output, like when we're testing our queries. Or, within an aggregate function.

Aggregate Functions

Often, the questions you want answered will involve math. That is, instead of simply seeing a subset of the rows in a table, we'll want to aggregate the results together in some way, e.g. adding the number of rows, summing a single column, seeing an average, etc.

How many reviews have been written?

  • Identify the table that holds the information you need – Reviews
  • Identify the columns that hold the information you need – not applicable in this case, since we're looking for an aggregate number
  • Add any other conditions – none
  • Then, ask the question:
SELECT COUNT(*) FROM Reviews;
COUNT(*)
4

There are a couple of crazy things going on here! Let's break it down.

First, instead of a column name like Name, Price, Body, etc. in our SELECT clause, we're calling a SQL function. There are many SQL functions built into SQL, but they will always contain a set of parentheses, which denote arguments to the function. In this case, we're using the COUNT function, which accepts an argument of "what columns are we counting?" Since we're simply looking for the aggregate number of rows in the table, the columns aren't relevant and we just use *.

And even though we're asking for an aggregate, that is, a single number representing the count of all the reviews, our result is still a 👶 baby table. It simply contains a single column – called COUNT(*) – and a single row. We can call several SQL functions, or a combination of SQL functions and column names, which will result in a 👶 baby table containing several columns.

Let's look at another aggregate function, where the argument does matter:

What is the average rating across all reviews?

  • Identify the table that holds the information you need – Reviews
  • Identify the columns that hold the information you need – an average Rating
  • Add any other conditions – none
  • Then, ask the question:
SELECT AVG(Rating) FROM Reviews;
AVG(Rating)
4.25

Similar as before, with our COUNT function, but this time, the argument matters. We're using the AVG function, which takes a column name as an argument – which column are we averaging?

As mentioned, there are many functions built-in to SQL. The ones that are relevant to this lesson are aggregate functions – ones that do math to aggregate information. Examples of aggregate functions include:

  • COUNT - counts the number of rows
  • AVG - average of a column value or expression
  • SUM - sum of a column value or expression
  • MIN - minimum of a column value or expression
  • MAX - maximum of a column value or expression

Grouping

Occasionally, a question like How many reviews have been written? will be enough to satisfy business requirements. However, it's more likely that your boss, client, or other stakeholder will want more specific information than simply aggregating the entire table's worth of data. We'll often want to categorize or aggregate (group) results by the value of another column. For example, a more common question might be:

How many reviews have been written per product?

  • Identify the table that holds the information you need – Reviews
  • Identify the columns that hold the information you need – the Product and the aggregate number of reviews
  • Add any other conditions – none
  • Then, ask the question:
SELECT Product, COUNT(*) 
FROM Reviews
GROUP BY Product;
Product COUNT(*)
Camera 2
Sofa 1
Toaster 1

We see that the values separated by commas in the SELECT clause, whether they are column names or SQL functions, end up being the columns in the 👶 baby table. And, because we used a GROUP BY clause, our results are aggregated (grouped) by the distinct values in the Product column.

This brings us to Rule #1A, the First Law of all SQL... when using a GROUP BY clause, the columns listed in the GROUP BY must also be in the SELECT clause. Makes sense, right? In this example, if the Product column wasn't also in the SELECT, the results would just be a list of numbers with no context:

SELECT COUNT(*) 
FROM Reviews
GROUP BY Product;
COUNT(*)
2
1
1

Not very useful, is it? And, just to be clear, it's perfectly fine to have columns in the SELECT that are not in the GROUP BY – just not the other way around.

Order of Clauses

So far, we've seen several different types of SQL clauses. In fact, we've seen enough to warrant talking about the order of these clauses. Because of the way SQL is executed, the order in which the clauses are written does matter. This may differ slightly based on the flavor/implementation of SQL being worked with (i.e. some are more forgiving than others), but generally speaking, the order should be as follows:

  • SELECT - the columns or executed functions you want in the result
  • FROM - the table(s) involved
  • WHERE - the conditions or constraints
  • GROUP BY - columns in the SELECT used to aggregate the results
  • ORDER BY - columns by which the results shall be ordered, either numerically or alphabetically

To put it all together:

What is the average rating per product, taking into account only those reviews with a rating of 3 or above, from lowest to highest average rating?

  • Identify the table that holds the information you need – Reviews
  • Identify the columns that hold the information you need – the Product and the aggregate average rating of reviews
  • Add any other conditions – rating of 3 or above
  • Then, ask the question:
SELECT Product, AVG(Rating)
FROM Reviews
WHERE Rating >= 3
GROUP BY Product
ORDER BY AVG(Rating)
Product AVG(Rating)
Toaster 3
Camera 4.5
Sofa 5

Lab 3: Aggregation

See the files named lab-3-*.sql for the questions, and modify each file until you achieve the desired result. See the answers when done.

#BabyTable