Who the hell put this product here?

Andrew Kornilov
7 min read · Apr 12, 2024


Let’s dive deeper into the fascinating world of the Amazon ESCI dataset. This round, we’ll try to build a classifier that accurately assigns products to their top-level categories based on their names alone.

We will evaluate two methods on speed, cost, and accuracy, so you can decide which one wins.

Data preparation

We will begin by preprocessing the data in the original dataset. First, we select only the English-language portion. Then we extract the title of each product along with its top-level category, and take a randomized sample of 100K records from the result. For simplicity, I will omit the details of converting the original data into that format and of randomizing/cutting it; a hedged sketch follows the example below. Here is what the final data looks like.

{"title": "Postta 2RCA to 2RCA Stereo Audio Cable (6 Feet) Male to Male Gold Plated Dual Shielded 2RCA Cable -Black", "category": "Electronics"}
{"title": "Manfrotto 035RL Super Clamp with 2908 Standard Stud - Replaces 2900 - Black", "category": "Electronics"}
{"title": "Polident 3 Minute Triple Mint Antibacterial Denture Cleanser Effervescent Tablets, 84 count(pack of 3)", "category": "Health & Household"}

Direct approach — ask GPT-4 (+tools)

Script

import json

from tqdm import tqdm
from openai import OpenAI


client = OpenAI(api_key='<your OpenAI API token>')


# Collect the set of top-level categories present in the data.
taxonomy = []
for i in tqdm(open('slim-esci-00'), total=100000):
    i = json.loads(i)
    if i['category'] not in taxonomy:
        taxonomy.append(i['category'])

print(taxonomy)


def openai_prediction(taxonomy, title):
    content = f"""###Instructions###
Assign following product to most precise category based on the title of product.

Title: {title}"""

    # The enum constrains the model's answer to the known taxonomy.
    function = {
        "name": "predict_category_of_product",
        "description": "Predict e-commerce category for a given product",
        "parameters": {
            "type": "object",
            "properties": {
                "category": {
                    "type": "string",
                    "enum": taxonomy,
                    "description": "The predicted category in an e-commerce webshop."
                }
            },
            "required": [
                "category"
            ]
        }
    }
    completion = client.chat.completions.create(
        model="gpt-4",
        temperature=0.0,
        messages=[{"role": "user", "content": content}],
        functions=[function],
        function_call={"name": "predict_category_of_product"},  # force this function
    )
    result = json.loads(completion.choices[0].message.function_call.arguments)
    return result.get('category')


o = open('slim-esci-00.matrix', 'w')
for chunk in ('slim-esci-00',):
    for i in tqdm(open(chunk), total=100000):
        i = json.loads(i)
        name = i['title']
        # Write [predicted, actual] pairs for later evaluation.
        print(json.dumps([openai_prediction(taxonomy, name), i['category']]), file=o)
        o.flush()
o.close()
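
A note on the “+tools” in the heading: the functions/function_call parameters used above still work, but newer OpenAI SDKs deprecate them in favor of tools/tool_choice. An equivalent call, as a sketch reusing content and function from the script above:

completion = client.chat.completions.create(
    model="gpt-4",
    temperature=0.0,
    messages=[{"role": "user", "content": content}],
    tools=[{"type": "function", "function": function}],
    tool_choice={"type": "function", "function": {"name": "predict_category_of_product"}},
)
# With tools, the arguments live on message.tool_calls instead of function_call.
result = json.loads(completion.choices[0].message.tool_calls[0].function.arguments)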

Problem 1. It’s not that fast

Problem 2. It’s expensive

So, I only managed to classify approximately 1K products, and spent $12 on the process. At that rate (about 1.2 cents per title), the full 100K sample would cost roughly $1,200.

What about the quality of classification?

import json

from sklearn.metrics import classification_report

# Load the [predicted, actual] pairs written by the script above.
d = open('slim-esci-00.matrix')
d = d.read().split('\n')
d = [json.loads(i) for i in d if i]
predicted = [i[0] for i in d]
actual = [i[1] for i in d]

report = classification_report(actual, predicted)
print(report)
                                          precision    recall  f1-score   support

                               Appliances      0.50      0.67      0.57         3
                    Arts, Crafts & Sewing      0.80      0.91      0.85        22
                               Automotive      1.00      0.94      0.97        35
                            Baby Products      0.75      0.92      0.83        13
                   Beauty & Personal Care      0.89      0.92      0.91        37
                                    Books      0.67      0.95      0.78        40
                              CDs & Vinyl      0.00      0.00      0.00         1
                Car & Vehicle Electronics      0.00      0.00      0.00         0
                Cell Phones & Accessories      0.80      0.92      0.86        26
                Clothing, Shoes & Jewelry      0.95      0.96      0.95       210
                              Electronics      0.86      0.73      0.79        49
                               Gift Cards      0.00      0.00      0.00         0
                 Grills & Outdoor Cooking      0.00      0.00      0.00         0
                   Grocery & Gourmet Food      0.80      0.89      0.84        27
                        Handmade Products      0.00      0.00      0.00         2
                       Health & Household      0.82      0.70      0.76        47
                           Home & Kitchen      0.81      0.78      0.79       134
                        Hunting & Fishing      0.00      0.00      0.00         0
                  Industrial & Scientific      0.30      0.25      0.27        12
                             Kindle Store      0.00      0.00      0.00        19
                            Kindle eBooks      0.00      0.00      0.00         0
                         Kitchen & Dining      0.00      0.00      0.00         0
                Lab & Scientific Products      0.00      0.00      0.00         0
        Lighting Assemblies & Accessories      0.00      0.00      0.00         0
                    Luggage & Travel Gear      0.00      0.00      0.00         0
                              Movies & TV      0.88      0.88      0.88         8
                      Musical Instruments      0.92      0.92      0.92        13
                          Office Products      0.88      0.85      0.86        26
                      Outdoor Power Tools      0.00      0.00      0.00         0
                     Patio, Lawn & Garden      0.67      0.69      0.68        32
                             Pet Supplies      0.91      0.91      0.91        22
                         Posters & Prints      0.00      0.00      0.00         0
                       Power & Hand Tools      0.00      0.00      0.00         0
           Power Tool Parts & Accessories      0.00      0.00      0.00         0
 Remote & App Controlled Vehicles & Parts      0.00      0.00      0.00         0
                        Safety & Security      0.00      0.00      0.00         0
        Shoe, Jewelry & Watch Accessories      0.00      0.00      0.00         0
      Small Appliance Parts & Accessories      1.00      1.00      1.00         1
                        Sports & Outdoors      0.91      0.76      0.83        54
                 Tools & Home Improvement      0.76      0.42      0.54        52
                             Toys & Games      0.86      0.88      0.87        49
                              Video Games      1.00      0.67      0.80         6

                                 accuracy                          0.81       940
                                macro avg      0.45      0.44      0.44       940
                             weighted avg      0.83      0.81      0.81       940
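
The report doesn’t show which categories get mixed up with which. A quick sketch, reusing the actual and predicted lists from the snippet above, to list the most frequent confusions:

from collections import Counter

# Count (actual, predicted) pairs where the prediction was wrong.
confusions = Counter(
    (a, p) for a, p in zip(actual, predicted) if a != p
)
for (a, p), n in confusions.most_common(10):
    print(f'{n:3d}  {a} -> {p}')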

We don’t have a lot of data to analyze; still, the figures we do have indicate an accuracy of 0.81.

Can it be cheaper?

Let’s explore an alternative method that strikes a better balance between cost, speed, and accuracy than the previous example.

To build the classifier, we need features. The plan is to use OpenAI’s text-embedding-3-small embeddings for that purpose.

Building embeddings

import json
import sys

from tqdm import tqdm
from openai import OpenAI

client = OpenAI(api_key='<your api key>')

chunk = sys.argv[1]

o = open(f'{chunk}.vec', 'w')
for i in tqdm(open(f'{chunk}.txt'), total=100000):
    i = json.loads(i)
    title = i['title']
    # One embedding request per title; see the batched variant below.
    response = client.embeddings.create(
        input=title,
        model="text-embedding-3-small"
    )
    i['vector'] = response.data[0].embedding
    print(json.dumps(i), file=o)
    o.flush()
o.close()

If our final dataset is stored in the file small.txt and the script above is saved as embeddings.py, start processing the data with the following command:

python embeddings.py small

At this point, you should notice the first benefit: building embeddings is roughly three times faster than “asking the questions.”
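
It can be made faster still: the Embeddings API accepts a list of inputs per request, so titles can be sent in batches. A sketch (the batch size of 100 is an arbitrary assumption):

import json

from tqdm import tqdm
from openai import OpenAI

client = OpenAI(api_key='<your api key>')

def embed_batch(titles):
    # One request for the whole batch; the API returns one vector
    # per input, in the same order as the inputs.
    response = client.embeddings.create(
        input=titles,
        model="text-embedding-3-small"
    )
    return [d.embedding for d in response.data]

batch = []
o = open('small.vec', 'w')
for i in tqdm(open('small.txt'), total=100000):
    batch.append(json.loads(i))
    if len(batch) == 100:
        for item, vector in zip(batch, embed_batch([b['title'] for b in batch])):
            item['vector'] = vector
            print(json.dumps(item), file=o)
        batch = []
# Flush the last, possibly incomplete batch.
if batch:
    for item, vector in zip(batch, embed_batch([b['title'] for b in batch])):
        item['vector'] = vector
        print(json.dumps(item), file=o)
o.close()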

Building the classifier model

import sys

import pandas as pd

from joblib import dump
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report


file = sys.argv[1]

# Each line holds the original record plus its embedding vector.
data = pd.read_json(file, lines=True)
print('load done')

# Expand the embedding into one column per dimension; these are our features.
X = pd.DataFrame(data['vector'].tolist())
Y = data['category']

print('convert done')

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

print('split done')

model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

print('fit done')

preds = model.predict(X_test)
probas = model.predict_proba(X_test)  # per-class probabilities, if confidence is needed

print('predictions done')

report = classification_report(y_test, preds)
print(report)

dump(model, 'model.joblib')

Run it with

python train.py small.vec

Evaluation results


                                          precision    recall  f1-score   support

                               Appliances      0.00      0.00      0.00        58
                    Arts, Crafts & Sewing      0.85      0.21      0.33       502
                Audible Books & Originals      0.00      0.00      0.00        11
                               Automotive      0.81      0.75      0.78       729
                            Baby Products      0.96      0.23      0.37       291
                   Beauty & Personal Care      0.88      0.81      0.85      1040
                                    Books      0.67      0.89      0.77       874
                              CDs & Vinyl      0.00      0.00      0.00        69
                Car & Vehicle Electronics      0.00      0.00      0.00         1
                Cell Phones & Accessories      0.92      0.83      0.87       413
                Clothing, Shoes & Jewelry      0.77      0.99      0.87      4616
                  Collectibles & Fine Art      0.00      0.00      0.00        18
                              Electronics      0.76      0.85      0.80      1146
        Food Service Equipment & Supplies      0.00      0.00      0.00         2
                               Gift Cards      0.00      0.00      0.00         6
                 Grills & Outdoor Cooking      0.00      0.00      0.00         2
                   Grocery & Gourmet Food      0.89      0.83      0.86       569
                        Handmade Products      0.00      0.00      0.00        44
                       Health & Household      0.78      0.61      0.68      1081
Heavy Duty & Commercial Vehicle Equipment      0.00      0.00      0.00         2
                           Home & Kitchen      0.57      0.94      0.71      2860
                        Hunting & Fishing      0.00      0.00      0.00         2
                  Industrial & Scientific      0.78      0.03      0.05       263
                             Kindle Store      0.82      0.15      0.26       273
                         Kitchen & Dining      0.00      0.00      0.00         5
        Lighting Assemblies & Accessories      0.00      0.00      0.00         4
            Lights & Lighting Accessories      0.00      0.00      0.00         1
                   Magazine Subscriptions      0.00      0.00      0.00         1
             Medical Supplies & Equipment      0.00      0.00      0.00         8
             Mobility & Daily Living Aids      0.00      0.00      0.00         8
                              Movies & TV      0.88      0.20      0.32       117
                      Musical Instruments      1.00      0.18      0.31       153
                          Office Products      0.87      0.48      0.62       516
                     Patio, Lawn & Garden      0.94      0.31      0.47       595
                             Pet Supplies      0.97      0.56      0.71       443
                       Power & Hand Tools      0.00      0.00      0.00         9
           Power Tool Parts & Accessories      0.00      0.00      0.00         5
    Remote & App Controlled Vehicle Parts      0.00      0.00      0.00         3
 Remote & App Controlled Vehicles & Parts      1.00      0.25      0.40         4
                        Safety & Security      0.00      0.00      0.00         1
        Shoe, Jewelry & Watch Accessories      0.00      0.00      0.00         2
      Small Appliance Parts & Accessories      0.00      0.00      0.00        10
                                 Software      1.00      0.14      0.25         7
                        Sports & Outdoors      0.77      0.46      0.57      1151
                 Tools & Home Improvement      0.67      0.63      0.65      1070
                             Toys & Games      0.80      0.64      0.71       853
                              Video Games      0.94      0.42      0.58       162

                                 accuracy                          0.73     20000
                                macro avg      0.43      0.26      0.29     20000
                             weighted avg      0.76      0.73      0.70     20000

So, while the accuracy is a bit lower, the cost of the solution is significantly reduced. The numbers below come from a different test, with almost a million products processed.

Yes, seriously: 35 cents for 900K embedding requests versus 12 dollars for 1K GPT-4 calls. Of course, some additional costs are attached, such as hosting the model (plenty of options here: SageMaker, a custom HTTP API), and there is a drop in accuracy, though it is unclear what the accuracy of the first approach would be if we could apply it to the same number of samples.

Also, a significant assumption was made: that the “coordinates” of OpenAI embeddings represent independent features, which is what allows a random forest to be used as the classifier.
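
As a closing illustration, serving the trained model takes only a few lines: load model.joblib, embed the incoming title with the same embedding model used at training time, and predict. A sketch (wrapping it in SageMaker or an HTTP API is left as an exercise):

from joblib import load
from openai import OpenAI

client = OpenAI(api_key='<your api key>')
model = load('model.joblib')  # the RandomForestClassifier trained above

def predict_category(title):
    # Embed with the same model used at training time, then classify.
    response = client.embeddings.create(
        input=title,
        model="text-embedding-3-small"
    )
    return model.predict([response.data[0].embedding])[0]

print(predict_category('Postta 2RCA to 2RCA Stereo Audio Cable (6 Feet)'))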
