Benchmarking NaturalSQL against State of the Art LLMs
Table of Contents
- Table of Contents
- TLDR
- 1. SQL-Eval Results
- 2. NaturalSQL on Complex Questions
- Testing Defog & NaturalSQL on Compound Questions
- What is a Compound Question?
- Creating a Base Line Test Set
- Compound Questions
- SQL Tables
- Testing both SQLCoder-7B and NaturalSQL-6.7B
- **SQLCoder** (**Failed** ❌)
- **NaturalSQL** (**Passed**✅)
- **SQLCoder** (**Failed** ❌)
- **NaturalSQL** (**Failed** ❌)
- **SQLCoder** (**Failed** ❌)
- **NaturalSQL** (**Failed** ❌)
- **SQLCoder** (**Failed** ❌)
- **NaturalSQL** (**Passed** ✅)
- **SQLCoder** (**Failed** ❌)
- **NaturalSQL** (**Passed** ✅)
- Results
- Conclusion
In this benchmarking study, NaturalSQL-6.7B-v0 is evaluated against leading industry models, including GPT-4
and Defog's SQLCoder-7B
for SQL generation in complex scenarios. The study focuses on complex questions and employs the SQL-Eval
framework for a comprehensive assessment.
NaturalSQL shows remarkable proficiency in handling compound questions and outperforms larger models in several categories. This study demonstrates that with targeted fine-tuning, even smaller models can achieve state-of-the-art results in specific domains like SQL generation for complex business queries.
This study will test NaturalSQL-6.7B against SQLCoder-7B and other models on the two tests below:
- Small Sample of Complex Questions
- SQL-Eval - Programmatic SQL Model Evaluation Framework
Table of Contents #
TLDR #
NaturalSQL excels at generating sql for complex questions and beats industry leading models that are twice it's size in multiple categories. NaturalSQL is also much stronger at generating working SQL for compound / complex questions compared to other models of it's size.
1. SQL-Eval Results #
I used sql-eval
by Defog, to benchmark NaturalSQL on 200 novel instruction / sql tests that were not trained on. We can see how it compares to other models and see a breakdown of performance by category.
What is SQL-Eval? #
Our testing procedure comprises the following steps. For each question/query pair:
- We generate a SQL query (possibly from an LLM).
- We run both the "gold" query and the generated query on their respective database to obtain 2 dataframes with the results.
- We compare the 2 dataframes using an "exact" and a "subset" match.
SQL Eval evaluates the LLM on 200 questions across different categories and assesses correctness of the model.
SQL-Eval Percentage Correct #
Huge thanks to the Defog team for the open source sql-eval framework. Based on the results, we can see the very first iteration of the NaturalSQL-6.7B models, is on par with industry leading models of its size and beats gpt-3.5
and claude
models.
SQL-Eval Performance by Query Category #
Here we can see how NaturalSQL compares to other models on different SQL categories. Interestingly, NaturalSQL, matches or beats both SQLCoder-7B and SQLCoder-2-15B on 50% of the categories:
where
join
ratio
NaturalSQL-6.7B beats SQLCoder-7B in 4 out of 6 categories. However, we can see the category that really brings down the average is: date
. NaturalQuery scores the lowest at 36.0
. This will be a strong area of focus in the next iteration.
2. NaturalSQL on Complex Questions #
NaturalSQL excels at complex questions. Here is a comparison of Defog's SQLCoder-7B and ChatDB's NaturalSQL-6.7B on compound SQL questions:
Question | NaturalSQL | SQLCoder |
---|---|---|
Which employee had the highest salary in the 'Marketing' department, and what is the total budget of that department? | ✅ | ❌ |
Identify the customer who placed the highest number of orders last year, and list the names and prices of the products they ordered most frequently. | ❌ | ❌ |
Find the supplier that provided the most products currently out of stock, and show the names of their out-of-stock products. | ❌ | ❌ |
Determine the customer with the largest total order value last month, and list all the products and their quantities they ordered. | ✅ | ❌ |
Who is the employee that handled the most orders last quarter, and what is the status of their most recent order? | ✅ | ❌ |
Testing Defog & NaturalSQL on Compound Questions #
What is a Compound Question? #
Enterprise companies typically aren't wanting to only ask simple and easy questions such as:
How many products did I sell this month
?
The real power of text to SQL models like NaturalSQL is the ability to answer compound, multi-part questions like these below:
Compound Questions:
Which supplier supplied us the most products last year, show me our best selling product from them.
Which point guard had the highest points per game in March, how many steals did they average?
For the customers who ordered over 100$ worth of products, I want to know what the best selling product was
Creating a Base Line Test Set #
For the first test, I wanted to get a baseline of both NaturalSQL-6.7B and SQLCoder-7B on 5 complex questions for a random dataset.
I used ChatGPT to generate a SQL DDL with 5 tables with 10-15 columns. You can checkout my ChatGPT shared link here so you can verify these results weren't cherry picked.
Here is the prompt I used: create a 5 table sql create table sql where each table has 10-15 columns of different data types
. Pretty simple.
Compound Questions #
For the SQL tables, I asked ChatGPT to generate 5 compound questions, in which it replied:
Which employee had the highest salary in the 'Marketing' department, and what is the total budget of that department?
Identify the customer who placed the highest number of orders last year, and list the names and prices of the products they ordered most frequently.
Find the supplier that provided the most products currently out of stock, and show the names of these out-of-stock products.
Determine the customer with the largest total order value last month, and list all the products and their quantities they ordered.
Who is the employee that handled the most orders last quarter, and what is the status of their most recent order?
SQL Tables #
CREATE TABLE employees (
employeeid SERIAL PRIMARY KEY,
firstname VARCHAR(50),
lastname VARCHAR(50),
birthdate DATE,
email VARCHAR(100),
phone VARCHAR(20),
salary NUMERIC(10, 2),
hiredate TIMESTAMP,
departmentid INT,
address VARCHAR(200),
city VARCHAR(50),
state VARCHAR(50),
zipcode VARCHAR(10),
country VARCHAR(50),
isactive BOOLEAN
);
CREATE TABLE departments (
departmentid SERIAL PRIMARY KEY,
departmentname VARCHAR(100),
managerid INT,
location VARCHAR(100),
phonenumber VARCHAR(20),
budget NUMERIC(10, 2),
startdate DATE,
enddate DATE,
description TEXT,
email VARCHAR(100),
website VARCHAR(100)
);
CREATE TABLE products (
productid SERIAL PRIMARY KEY,
productname VARCHAR(100),
supplierid INT,
categoryid INT,
quantityperunit VARCHAR(50),
unitprice NUMERIC(10, 2),
unitsinstock SMALLINT,
unitsonorder SMALLINT,
reorderlevel SMALLINT,
discontinued BOOLEAN,
registered TIMESTAMP,
description TEXT
);
CREATE TABLE orders (
orderid SERIAL PRIMARY KEY,
customerid INT,
employeeid INT,
orderdate TIMESTAMP,
requireddate TIMESTAMP,
shippeddate TIMESTAMP,
shipvia INT,
freight NUMERIC(10, 2),
shipname VARCHAR(100),
shipaddress VARCHAR(200),
shipcity VARCHAR(50),
shipregion VARCHAR(50),
shippostalcode VARCHAR(10),
shipcountry VARCHAR(50),
status VARCHAR(20)
);
CREATE TABLE customers (
customerid SERIAL PRIMARY KEY,
companyname VARCHAR(100),
contactname VARCHAR(100),
contacttitle VARCHAR(50),
address VARCHAR(200),
city VARCHAR(50),
region VARCHAR(50),
postalcode VARCHAR(10),
country VARCHAR(50),
phone VARCHAR(20),
fax VARCHAR(20),
email VARCHAR(100),
website VARCHAR(100),
preferredcontactmethod VARCHAR(50),
registrationdate DATE
);
Testing both SQLCoder-7B and NaturalSQL-6.7B #
Question: Which employee had the highest salary in the 'Marketing' department, and what is the total budget of that department?
SQLCoder (Failed ❌) #
- The join condition in the
total_budget
CTE is incorrect. It joinsdepartments.departmentid
withhighest_salary.employeeid
, which are unrelated fields. The correct join should be ondepartments.departmentid
withemployees.departmentid
. - Also,
MAX(employees.salary)
should be outside theGROUP BY
clause.
WITH highest_salary AS
(SELECT employees.employeeid,
employees.firstname,
employees.lastname,
MAX(employees.salary) AS max_salary
FROM employees
JOIN departments ON employees.departmentid = departments.departmentid
WHERE departments.departmentname = 'Marketing'
GROUP BY employees.employeeid,
employees.firstname,
employees.lastname),
total_budget AS
(SELECT departments.budget
FROM departments
JOIN highest_salary ON departments.departmentid = highest_salary.employeeid)
SELECT highest_salary.firstname,
highest_salary.lastname,
total_budget.budget
FROM highest_salary,
total_budget;
NaturalSQL (Passed✅) #
_NaturalSQL is finetuned to use ILIKE
and _
often since user rarely mentions a text column name in full form.*
This query correctly retrieves the first name, last name, and salary of the employee with the highest salary in the 'Marketing' department, along with the total budget of the department. It uses ILIKE
for a flexible search, and ORDER BY
with LIMIT 1
to get the highest salary.
SELECT e.firstname,
e.lastname,
e.salary,
d.budget
FROM employees e
JOIN departments d ON e.departmentid = d.departmentid
WHERE d.departmentname ILIKE '%Marketing%'
ORDER BY e.salary DESC
LIMIT 1;
Question: Identify the customer who placed the highest number of orders last year, and list the names and prices of the products they ordered most frequently.
SQLCoder (Failed ❌) #
- Incorrect Join in
product_orders
CTE: The query attempts to join the products table with the orders table usingp.productid = o.productid
. However, theorders
table does not have aproductid
column, making this join invalid as per the provided schema. - Logical Error in
customer_product_orders
CTE: The join conditionco.customerid = pp.productid
is incorrect because it equates a customer ID with a product ID, which are unrelated entities in the schema.
WITH customer_orders AS
(SELECT o.customerid,
COUNT(*) AS num_orders
FROM orders o
WHERE o.orderdate BETWEEN (CURRENT_DATE - interval '1 year') AND CURRENT_DATE
GROUP BY o.customerid),
product_orders AS
(SELECT p.productid,
COUNT(*) AS num_orders
FROM products p
JOIN orders o ON p.productid = o.productid
WHERE o.orderdate BETWEEN (CURRENT_DATE - interval '1 year') AND CURRENT_DATE
GROUP BY p.productid),
customer_product_orders AS
(SELECT co.customerid,
pp.productid,
co.num_orders,
pp.productname,
pp.unitprice
FROM customer_orders co
JOIN product_orders pp ON co.customerid = pp.productid)
SELECT cpo.customerid,
cpo.productname,
cpo.unitprice
FROM customer_product_orders cpo
WHERE cpo.num_orders =
(SELECT MAX(num_orders)
FROM customer_product_orders)
ORDER BY cpo.num_orders DESC NULLS LAST;
NaturalSQL (Failed ❌) #
- Non-existent Table Reference: The query references a table order_details in the
FrequentlyOrderedProducts
CTE, which does not exist in the provided schema. - Incorrect Join and Column Reference: In
FrequentlyOrderedProducts
, it attempts to joinorder_details
with products onop.productid = p.productid
, which cannot be validated due to the absence oforder_details
. Also, the query assumes a quantity column inorder_details
which is not present in the schema.
WITH LastYearOrders AS
(SELECT customerid,
COUNT(*) AS order_count
FROM orders
WHERE orderdate >= (CURRENT_DATE - INTERVAL '1 year')
GROUP BY customerid
ORDER BY order_count DESC
LIMIT 1),
FrequentlyOrderedProducts AS
(SELECT op.orderid,
op.productid,
p.productname,
p.unitprice
FROM order_details op
JOIN products p ON op.productid = p.productid
JOIN LastYearOrders ly ON op.orderid = ly.orderid
ORDER BY op.orderid,
op.quantity DESC)
SELECT c.companyname,
fop.productname,
fop.unitprice
FROM FrequentlyOrderedProducts fop
JOIN customers c ON fop.customerid = c.customerid
JOIN orders o ON fop.orderid = o.orderid
WHERE o.orderdate >= (CURRENT_DATE - INTERVAL '1 year')
ORDER BY fop.orderid,
fop.quantity DESC;
Question: Find the supplier that provided the most products currently out of stock, and show the names of their out-of-stock products.
SQLCoder (Failed ❌) #
This query incorrectly groups by productname
and suppliername
, which results in a count of out-of-stock products for each product name, rather than for each supplier. The GROUP BY
should be primarily on supplierid
or suppliername
to find the supplier with the most out-of-stock products. The ORDER BY
clause, in this case, does not correctly identify the supplier with the highest count of out-of-stock products.
SELECT p.productname,
COUNT(*) AS out_of_stock_products_count,
s.suppliername
FROM products p
JOIN suppliers s ON p.supplierid = s.supplierid
WHERE p.unitsinstock = 0
GROUP BY p.productname,
s.suppliername
ORDER BY out_of_stock_products_count DESC
LIMIT 1;
NaturalSQL (Failed ❌) #
- Incorrect Use of
GROUP BY
andORDER BY
: The query fails to properly aggregate the count of out-of-stock products per supplier and thenorder by
this count indescending
order, which is necessary to determine the supplier with the most out-of-stock products.
SELECT s.supplierid,
s.companyname,
p.productname
FROM suppliers s
JOIN products p ON s.supplierid = p.supplierid
WHERE p.unitsinstock = 0
GROUP BY s.supplierid,
p.productname,
s.companyname
ORDER BY COUNT(p.productid) DESC
LIMIT 1;
Question: Determine the customer with the largest total order value last month, and list all the products and their quantities they ordered.
SQLCoder (Failed ❌) #
- Fails to aggregate and compute total order value; lacks
SUM
orGROUP BY
for calculating total order value per customer. - Incorrect date range: the query is fixed to the start of
2020
instead of focusing on 'last month', and incorrectly references an undefinedtotal_order_value
.
SELECT c.customerid,
c.companyname,
o.orderid,
p.productname,
o.quantityordered,
to_char(o.orderdate, 'DD-MM-YYYY') AS order_date
FROM customers c
JOIN orders o ON c.customerid = o.customerid
JOIN orderdetails od ON o.orderid = od.orderid
JOIN products p ON od.productid = p.productid
WHERE to_date(to_char(o.orderdate, 'MM-DD-YYYY'), 'MM-DD-YYYY') BETWEEN to_date('01-01-2020', 'MM-DD-YYYY') AND CURRENT_DATE
ORDER BY total_order_value DESC
LIMIT 1;
NaturalSQL (Passed ✅) #
Passed.
- Correctly calculates the total order value for each customer for the last month, identifying the customer with the highest value using an effective window function.
WITH LastMonthOrders AS
(SELECT o.customerid,
o.orderid,
op.productid,
op.quantity,
p.productname,
op.quantity * p.unitprice AS total_order_value
FROM orders o
JOIN order_details op ON o.orderid = op.orderid
JOIN products p ON op.productid = p.productid
WHERE o.orderdate >= date_trunc('month', CURRENT_DATE) - INTERVAL '1 month'
AND o.orderdate < date_trunc('month', CURRENT_DATE) ),
CustomerTotalOrderValue AS
(SELECT customerid,
SUM(total_order_value) AS total_value
FROM LastMonthOrders
GROUP BY customerid
ORDER BY total_value DESC
LIMIT 1)
SELECT c.customerid,
c.companyname,
lm.productid,
lm.productname,
lm.quantity
FROM CustomerTotalOrderValue ctov
JOIN LastMonthOrders lm ON ctov.customerid = lm.customerid
JOIN customers c ON c.customerid = lm.customerid;
Question: Who is the employee that handled the most orders last quarter, and what is the status of their most recent order?
SQLCoder (Failed ❌) #
- The query fails to correctly link the most recent order's status to the top employee identified in
employee_orders
. The subquery in employee_recent_order does not necessarily return the most recent order related to the employee with the most orders in the last quarter. - Schema Mismatch: The query references
o.orderstatus
in theemployee_recent_order
CTE, but there's noorderstatus
column in the orders table as per the provided schema. The correct column should be status.
WITH employee_orders AS
(SELECT o.employeeid,
COUNT(o.orderid) AS order_count
FROM orders o
JOIN employees e ON o.employeeid = e.employeeid
WHERE o.orderdate BETWEEN (CURRENT_DATE - interval '1 quarter') AND CURRENT_DATE
GROUP BY o.employeeid),
employee_recent_order AS
(SELECT e.employeeid,
e.firstname,
e.lastname,
o.orderstatus
FROM employees e
JOIN orders o ON e.employeeid = o.employeeid
WHERE o.orderdate =
(SELECT MAX(orderdate)
FROM orders
WHERE employeeid = e.employeeid))
SELECT eo.employeeid,
eo.firstname,
eo.lastname,
er.orderstatus
FROM employee_orders eo
JOIN employee_recent_order er ON eo.employeeid = er.employeeid
ORDER BY eo.order_count DESC NULLS LAST
LIMIT 1;
NaturalSQL (Passed ✅) #
Passed.
- This query correctly calculates the employee with the most orders last quarter using a well-structured CTE and appropriate date range.
- Schema Alignment: It successfully retrieves the most recent order's status for the identified employee, correctly using the status column from the orders table, aligning with the provided schema.
WITH LastQuarterOrders AS
(SELECT employeeid,
COUNT(*) AS order_count,
MAX(orderdate) AS most_recent_order_date
FROM orders
WHERE orderdate >= date_trunc('quarter', CURRENT_DATE) - INTERVAL '3 months'
AND orderdate < date_trunc('quarter', CURRENT_DATE)
GROUP BY employeeid
ORDER BY order_count DESC
LIMIT 1),
MostRecentOrderStatus AS
(SELECT o.employeeid,
o.status
FROM orders o
JOIN LastQuarterOrders lqo ON o.employeeid = lqo.employeeid
WHERE o.orderdate = lqo.most_recent_order_date )
SELECT e.firstname,
e.lastname,
mros.status AS most_recent_order_status
FROM employees e
JOIN LastQuarterOrders lqo ON e.employeeid = lqo.employeeid
JOIN MostRecentOrderStatus mros ON lqo.employeeid = mros.employeeid;
Results #
Model | Correct Answers (Out of 5) |
---|---|
NaturalSQL | 3 |
SQLCoder | 0 |
NaturalSQL has superior results on complex questions with it's strong grasp of SQL subqueries and complex questions.
Through these tests, it shows, even with a small, but high quality dataset, LLMs can produce state of the art results. NaturalSQL seems to preform very well for its size (6.7B) compared to the leading models. NaturalSQL has had finetuning on:
- Questions with business logic
- Advanced Date Queries
- Window Functions
- Ratios / Aggregations
That all help it perform really well!
Here are the links to the notebook results and the shared ChatGPT link to verify 😄
Conclusion #
In summary, the benchmarking study demonstrates two key achievements of NaturalSQL-6.7B: first, it has superior performance over SQLCoder-7B in generating SQL for compound questions, and second, it has the ability to outperform both SQLCoder-7B and SQLCoder-15B in certain specific instances. This highlights the effectiveness of NaturalSQL-6.7B, particularly in handling complex queries.
The next iteration will have focused effort on improving short comings in date
queries and improve on compound questions.