Who the hell put this product here?
Let’s dive deeper into the fascinating world of the Amazon ESCI dataset. This round, we’ll try to build a classifier that accurately assigns products to their respective top-level categories, simply based on their names.
We will evaluate two methods on speed, cost, and accuracy, so you can decide which one wins.
Data preparation
We will begin by preprocessing the original dataset. First, we select only the English-language portion. Next, we extract each product's title along with its top-level category. Finally, we take a randomized sample of 100K records from the result. For simplicity, I will omit the details of converting the original data into this format and of randomizing/cutting it. Below is what the final data looks like.
{"title": "Postta 2RCA to 2RCA Stereo Audio Cable (6 Feet) Male to Male Gold Plated Dual Shielded 2RCA Cable -Black", "category": "Electronics"}
{"title": "Manfrotto 035RL Super Clamp with 2908 Standard Stud - Replaces 2900 - Black", "category": "Electronics"}
{"title": "Polident 3 Minute Triple Mint Antibacterial Denture Cleanser Effervescent Tablets, 84 count(pack of 3)", "category": "Health & Household"}
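The sampling step is omitted above, but for readers who want to reproduce it, one simple (hypothetical, not the author's exact method) way to draw a reproducible random sample from a JSONL file that fits in memory is:

```python
import json
import random

def sample_jsonl_lines(lines, k, seed=42):
    """Draw a reproducible random sample of k JSONL records."""
    rng = random.Random(seed)
    return rng.sample(lines, k)

# Tiny in-memory illustration standing in for the real ESCI file
records = [json.dumps({"title": f"Product {n}", "category": "Electronics"})
           for n in range(1000)]
sample = sample_jsonl_lines(records, 100)
print(len(sample))  # 100
```

For the real 100K sample you would read the English split line by line and pass `k=100000`.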
Direct approach — ask GPT-4 (+tools)
Script
import json
from tqdm import tqdm
from openai import OpenAI

client = OpenAI(api_key='<your OpenAI API token>')

# Collect the distinct top-level categories present in the sample
taxonomy = []
for i in tqdm(open('slim-esci-00'), total=100000):
    i = json.loads(i)
    if i['category'] not in taxonomy:
        taxonomy.append(i['category'])
print(taxonomy)
def openai_prediction(taxonomy, title):
    content = f"""###Instructions###
Assign following product to most precise category based on the title of product.
Title: {title}"""
    function = {
        "name": "predict_category_of_product",
        "description": "Predict e-commerce category for a given product",
        "parameters": {
            "type": "object",
            "properties": {
                "category": {
                    "type": "string",
                    "enum": taxonomy,
                    "description": "The predicted categories in e-commerce webshop."
                }
            },
            "required": ["category"]
        }
    }
    completion = client.chat.completions.create(
        model="gpt-4",
        temperature=0.0,
        messages=[{"role": "user", "content": content}],
        functions=[function],
        function_call={"name": "predict_category_of_product"},
    )
    result = json.loads(completion.choices[0].message.function_call.arguments)
    return result.get('category')
o = open('slim-esci-00.matrix', 'w')
for chunk in ('slim-esci-00',):
    for i in tqdm(open(chunk), total=100000):
        i = json.loads(i)
        name = i['title']
        print(json.dumps([openai_prediction(taxonomy, name), i['category']]), file=o)
        o.flush()
o.close()
Problem 1. It’s not that fast
Problem 2. It’s expensive
So, I only managed to classify approximately 1K products, and spent $12 in the process.
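The speed problem comes partly from making the requests one at a time. Since each call is network-bound, a thread pool would let requests overlap; a minimal sketch of the pattern, using a stand-in function in place of the real `openai_prediction` call (and ignoring rate limits, which in practice require retries and backoff):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for openai_prediction(taxonomy, title); the real call spends most
# of its time waiting on the network, so threads overlap that waiting time.
def predict_stub(title):
    return "Electronics"

titles = [f"Product {n}" for n in range(10)]

# executor.map preserves input order, so predictions line up with titles
with ThreadPoolExecutor(max_workers=8) as executor:
    predictions = list(executor.map(predict_stub, titles))

print(len(predictions))  # 10
```

This would improve throughput, but not the per-request cost: at roughly $12 per 1K titles, the full 100K sample would still cost on the order of $1,200.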
What about the quality of classification?
import json
from sklearn.metrics import classification_report, accuracy_score
d = open('slim-esci-00.matrix')
d = d.read().split('\n')
d = [json.loads(i) for i in d if i]
predicted = [i[0] for i in d]
actual = [i[1] for i in d]
report = classification_report(actual, predicted)
print(report)
precision recall f1-score support
Appliances 0.50 0.67 0.57 3
Arts, Crafts & Sewing 0.80 0.91 0.85 22
Automotive 1.00 0.94 0.97 35
Baby Products 0.75 0.92 0.83 13
Beauty & Personal Care 0.89 0.92 0.91 37
Books 0.67 0.95 0.78 40
CDs & Vinyl 0.00 0.00 0.00 1
Car & Vehicle Electronics 0.00 0.00 0.00 0
Cell Phones & Accessories 0.80 0.92 0.86 26
Clothing, Shoes & Jewelry 0.95 0.96 0.95 210
Electronics 0.86 0.73 0.79 49
Gift Cards 0.00 0.00 0.00 0
Grills & Outdoor Cooking 0.00 0.00 0.00 0
Grocery & Gourmet Food 0.80 0.89 0.84 27
Handmade Products 0.00 0.00 0.00 2
Health & Household 0.82 0.70 0.76 47
Home & Kitchen 0.81 0.78 0.79 134
Hunting & Fishing 0.00 0.00 0.00 0
Industrial & Scientific 0.30 0.25 0.27 12
Kindle Store 0.00 0.00 0.00 19
Kindle eBooks 0.00 0.00 0.00 0
Kitchen & Dining 0.00 0.00 0.00 0
Lab & Scientific Products 0.00 0.00 0.00 0
Lighting Assemblies & Accessories 0.00 0.00 0.00 0
Luggage & Travel Gear 0.00 0.00 0.00 0
Movies & TV 0.88 0.88 0.88 8
Musical Instruments 0.92 0.92 0.92 13
Office Products 0.88 0.85 0.86 26
Outdoor Power Tools 0.00 0.00 0.00 0
Patio, Lawn & Garden 0.67 0.69 0.68 32
Pet Supplies 0.91 0.91 0.91 22
Posters & Prints 0.00 0.00 0.00 0
Power & Hand Tools 0.00 0.00 0.00 0
Power Tool Parts & Accessories 0.00 0.00 0.00 0
Remote & App Controlled Vehicles & Parts 0.00 0.00 0.00 0
Safety & Security 0.00 0.00 0.00 0
Shoe, Jewelry & Watch Accessories 0.00 0.00 0.00 0
Small Appliance Parts & Accessories 1.00 1.00 1.00 1
Sports & Outdoors 0.91 0.76 0.83 54
Tools & Home Improvement 0.76 0.42 0.54 52
Toys & Games 0.86 0.88 0.87 49
Video Games 1.00 0.67 0.80 6
accuracy 0.81 940
macro avg 0.45 0.44 0.44 940
weighted avg 0.83 0.81 0.81 940
We don’t have much data to analyze; however, the figures we do have indicate an accuracy of 0.81.
Can it be cheaper?
Let’s explore an alternative method that strikes a reasonable balance between cost, speed, and accuracy compared to the previous example.
To construct our classifier, we’ll require some features. The plan is to utilize OpenAI’s text-embedding-3-small embeddings for this purpose.
Building embeddings
import json
import sys
from tqdm import tqdm
from openai import OpenAI

client = OpenAI(api_key='<your api key>')

chunk = sys.argv[1]
o = open(f'{chunk}.vec', 'w')
for i in tqdm(open(f'{chunk}.txt'), total=100000):
    i = json.loads(i)
    title = i['title']
    response = client.embeddings.create(
        input=title,
        model="text-embedding-3-small"
    )
    i['vector'] = response.data[0].embedding
    print(json.dumps(i), file=o)
    o.flush()
o.close()
If our final dataset is stored in small.txt and the script above is saved as embeddings.py, start processing the data with the following command:
python embeddings.py small
At this point, you should notice the first benefit: building embeddings is three times faster than “asking the questions.”
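It could be made faster still: the embeddings endpoint accepts a list of inputs, so titles can be sent in batches instead of one per request. A sketch of the chunking logic (the API call itself is shown as a comment, since the exact batch-size limits depend on the model and token limits):

```python
def batched(items, batch_size):
    """Yield successive fixed-size chunks of a list."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

titles = [f"Product {n}" for n in range(250)]
batches = list(batched(titles, 100))
print([len(b) for b in batches])  # [100, 100, 50]

# Each batch could then be embedded in a single request, e.g. (not executed here):
# response = client.embeddings.create(input=batch, model="text-embedding-3-small")
# vectors = [item.embedding for item in response.data]  # same order as the batch
```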
Building classifier model
import sys
import pandas as pd
from joblib import dump
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

file = sys.argv[1]
data = pd.read_json(file, lines=True)
print('load done')

# Expand the embedding vectors into one column per dimension
X = pd.DataFrame(data['vector'].tolist())
Y = data['category']
print('convert done')

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
print('split done')

model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
print('fit done')

preds = model.predict(X_test)
probas = model.predict_proba(X_test)
print('predictions done')

report = classification_report(y_test, preds)
print(report)
dump(model, 'model.joblib')
Run it with
python train.py small.vec
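The script computes `predict_proba` but never uses it. One natural use (my suggestion, not part of the original post) is to treat the highest class probability as a confidence score and flag low-confidence predictions for review. A self-contained sketch on synthetic vectors standing in for the real embeddings:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for embedding vectors: two well-separated classes
X = np.vstack([rng.normal(0.0, 0.1, (50, 8)), rng.normal(1.0, 0.1, (50, 8))])
y = ["Books"] * 50 + ["Electronics"] * 50

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

probas = model.predict_proba(X)
confidence = probas.max(axis=1)   # highest class probability per row
uncertain = confidence < 0.6      # flag rows the model is unsure about

print(probas.shape)               # (100, 2): one column per class
print(int(uncertain.sum()))       # number of low-confidence predictions
```

On real data, routing the uncertain slice to a stronger (and more expensive) model such as the GPT-4 approach above would be one way to trade cost against accuracy.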
Evaluation results
precision recall f1-score support
Appliances 0.00 0.00 0.00 58
Arts, Crafts & Sewing 0.85 0.21 0.33 502
Audible Books & Originals 0.00 0.00 0.00 11
Automotive 0.81 0.75 0.78 729
Baby Products 0.96 0.23 0.37 291
Beauty & Personal Care 0.88 0.81 0.85 1040
Books 0.67 0.89 0.77 874
CDs & Vinyl 0.00 0.00 0.00 69
Car & Vehicle Electronics 0.00 0.00 0.00 1
Cell Phones & Accessories 0.92 0.83 0.87 413
Clothing, Shoes & Jewelry 0.77 0.99 0.87 4616
Collectibles & Fine Art 0.00 0.00 0.00 18
Electronics 0.76 0.85 0.80 1146
Food Service Equipment & Supplies 0.00 0.00 0.00 2
Gift Cards 0.00 0.00 0.00 6
Grills & Outdoor Cooking 0.00 0.00 0.00 2
Grocery & Gourmet Food 0.89 0.83 0.86 569
Handmade Products 0.00 0.00 0.00 44
Health & Household 0.78 0.61 0.68 1081
Heavy Duty & Commercial Vehicle Equipment 0.00 0.00 0.00 2
Home & Kitchen 0.57 0.94 0.71 2860
Hunting & Fishing 0.00 0.00 0.00 2
Industrial & Scientific 0.78 0.03 0.05 263
Kindle Store 0.82 0.15 0.26 273
Kitchen & Dining 0.00 0.00 0.00 5
Lighting Assemblies & Accessories 0.00 0.00 0.00 4
Lights & Lighting Accessories 0.00 0.00 0.00 1
Magazine Subscriptions 0.00 0.00 0.00 1
Medical Supplies & Equipment 0.00 0.00 0.00 8
Mobility & Daily Living Aids 0.00 0.00 0.00 8
Movies & TV 0.88 0.20 0.32 117
Musical Instruments 1.00 0.18 0.31 153
Office Products 0.87 0.48 0.62 516
Patio, Lawn & Garden 0.94 0.31 0.47 595
Pet Supplies 0.97 0.56 0.71 443
Power & Hand Tools 0.00 0.00 0.00 9
Power Tool Parts & Accessories 0.00 0.00 0.00 5
Remote & App Controlled Vehicle Parts 0.00 0.00 0.00 3
Remote & App Controlled Vehicles & Parts 1.00 0.25 0.40 4
Safety & Security 0.00 0.00 0.00 1
Shoe, Jewelry & Watch Accessories 0.00 0.00 0.00 2
Small Appliance Parts & Accessories 0.00 0.00 0.00 10
Software 1.00 0.14 0.25 7
Sports & Outdoors 0.77 0.46 0.57 1151
Tools & Home Improvement 0.67 0.63 0.65 1070
Toys & Games 0.80 0.64 0.71 853
Video Games 0.94 0.42 0.58 162
accuracy 0.73 20000
macro avg 0.43 0.26 0.29 20000
weighted avg 0.76 0.73 0.70 20000
So, while the accuracy is somewhat lower, the cost of the solution is dramatically reduced. The figures below come from a different run, with almost a million products processed.
YES! Seriously: 35 cents for 900K requests versus 12 dollars for 1K. Of course, there are additional costs attached, such as hosting the model (plenty of options here: SageMaker, a custom HTTP API)... and a drop in accuracy, though it's unclear what the accuracy of the first approach would be if we could apply it to the same number of samples.
There is also a significant assumption baked in: that the "coordinates" of the OpenAI embeddings represent independent features, which allows us to use a random forest to build the classifier.