Who the hell put this product here?
Let’s dive deeper into the fascinating world of the Amazon ESCI dataset. This round, we’ll try to build a classifier that accurately assigns products to their respective top-level categories, simply based on their names.
We will evaluate two methods on speed, cost, and accuracy, so you can decide which one wins.
Data preparation
We will begin by preprocessing the original dataset. First, we select only the English-language portion. Next, we extract each product's title along with its top-level category. Finally, we take a randomized sample of 100K records from the result. For simplicity, I will omit the details of converting the original data into this format and of randomizing/cutting it. Below is what the final data looks like.
{"title": "Postta 2RCA to 2RCA Stereo Audio Cable (6 Feet) Male to Male Gold Plated Dual Shielded 2RCA Cable -Black", "category": "Electronics"}
{"title": "Manfrotto 035RL Super Clamp with 2908 Standard Stud - Replaces 2900 - Black", "category": "Electronics"}
{"title": "Polident 3 Minute Triple Mint Antibacterial Denture Cleanser Effervescent Tablets, 84 count(pack of 3)", "category": "Health & Household"}
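The sampling step is omitted above, but for readers who want to reproduce it, one simple (hypothetical, not the author's exact method) way to draw a reproducible random sample from a JSONL file that fits in memory is:

```python
import json
import random

def sample_jsonl_lines(lines, k, seed=42):
    """Draw a reproducible random sample of k JSONL records."""
    rng = random.Random(seed)
    return rng.sample(lines, k)

# Tiny in-memory illustration standing in for the real ESCI file
records = [json.dumps({"title": f"Product {n}", "category": "Electronics"})
           for n in range(1000)]
sample = sample_jsonl_lines(records, 100)
print(len(sample))  # 100
```

For the real 100K sample you would read the English split line by line and pass `k=100000`.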
Direct approach — ask GPT-4 (+tools)
Script
import json
from tqdm import tqdm
from openai import OpenAI

client = OpenAI(api_key='<your OpenAI API token>')

# Collect the distinct top-level categories present in the sample
taxonomy = []
for i in tqdm(open('slim-esci-00'), total=100000):
    i = json.loads(i)
    if i['category'] not in taxonomy:
        taxonomy.append(i['category'])
print(taxonomy)
def openai_prediction(taxonomy, title):
    content = f"""###Instructions###
Assign following product to most precise category based on the title of product.
Title: {title}"""
    function = {
        "name": "predict_category_of_product",
        "description": "Predict e-commerce category for a given product",
        "parameters": {
            "type": "object",
            "properties": {
                "category": {
                    "type": "string",
                    "enum": taxonomy,
                    "description": "The predicted categories in e-commerce webshop."
                }
            },
            "required": ["category"]
        }
    }
    completion = client.chat.completions.create(
        model="gpt-4",
        temperature=0.0,
        messages=[{"role": "user", "content": content}],
        functions=[function],
        function_call={"name": "predict_category_of_product"},
    )
    result = json.loads(completion.choices[0].message.function_call.arguments)
    return result.get('category')
o = open('slim-esci-00.matrix', 'w')
for chunk in ('slim-esci-00',):
    for i in tqdm(open(chunk), total=100000):
        i = json.loads(i)
        name = i['title']
        print(json.dumps([openai_prediction(taxonomy, name), i['category']]), file=o)
        o.flush()
o.close()
Problem 1. It’s not that fast
Problem 2. It’s expensive
So, I only managed to classify approximately 1K products, and spent $12 in the process.
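The speed problem comes partly from making the requests one at a time. Since each call is network-bound, a thread pool would let requests overlap; a minimal sketch of the pattern, using a stand-in function in place of the real `openai_prediction` call (and ignoring rate limits, which in practice require retries and backoff):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for openai_prediction(taxonomy, title); the real call spends most
# of its time waiting on the network, so threads overlap that waiting time.
def predict_stub(title):
    return "Electronics"

titles = [f"Product {n}" for n in range(10)]

# executor.map preserves input order, so predictions line up with titles
with ThreadPoolExecutor(max_workers=8) as executor:
    predictions = list(executor.map(predict_stub, titles))

print(len(predictions))  # 10
```

This would improve throughput, but not the per-request cost: at roughly $12 per 1K titles, the full 100K sample would still cost on the order of $1,200.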
What about the quality of classification?
import json
from sklearn.metrics import classification_report, accuracy_score
d = open('slim-esci-00.matrix')
d = d.read().split('\n')
d = [json.loads(i) for i in d if i]
predicted = [i[0] for i in d]
actual = [i[1] for i in d]
report = classification_report(actual, predicted)
print(report)
precision recall f1-score support
Appliances 0.50 0.67 0.57 3
Arts, Crafts & Sewing 0.80 0.91 0.85 22
Automotive 1.00 0.94 0.97 35
Baby Products 0.75 0.92 0.83 13
Beauty & Personal Care 0.89 0.92 0.91 37
Books 0.67 0.95 0.78 40
CDs & Vinyl 0.00 0.00 0.00 1
Car & Vehicle Electronics 0.00 0.00 0.00 0
Cell Phones & Accessories 0.80 0.92 0.86 26
Clothing, Shoes & Jewelry 0.95 0.96 0.95 210
Electronics 0.86 0.73 0.79 49
Gift Cards 0.00 0.00 0.00 0
Grills & Outdoor Cooking 0.00 0.00 0.00 0
Grocery & Gourmet Food 0.80 0.89 0.84 27
Handmade Products 0.00 0.00 0.00 2
Health & Household 0.82 0.70 0.76 47
Home & Kitchen 0.81 0.78 0.79 134
Hunting & Fishing 0.00 0.00 0.00 0
Industrial & Scientific 0.30 0.25 0.27 12
Kindle Store 0.00 0.00 0.00 19
Kindle eBooks 0.00 0.00 0.00 0
Kitchen & Dining 0.00 0.00 0.00 0
Lab & Scientific Products 0.00 0.00 0.00 0
Lighting Assemblies & Accessories 0.00 0.00 0.00 0
Luggage & Travel Gear 0.00 0.00 0.00 0
Movies & TV 0.88 0.88 0.88 8
Musical Instruments 0.92 0.92 0.92 13
Office Products 0.88 0.85 0.86 26
Outdoor Power Tools 0.00 0.00 0.00 0
Patio, Lawn & Garden 0.67 0.69 0.68 32
Pet Supplies 0.91 0.91 0.91 22
Posters & Prints 0.00 0.00 0.00 0
Power & Hand Tools 0.00 0.00 0.00 0
Power Tool Parts & Accessories 0.00 0.00 0.00 0
Remote & App Controlled Vehicles & Parts 0.00 0.00 0.00 0
Safety & Security 0.00 0.00 0.00 0
Shoe, Jewelry & Watch Accessories 0.00 0.00 0.00 0
Small Appliance Parts & Accessories 1.00 1.00 1.00 1
Sports & Outdoors 0.91 0.76 0.83 54
Tools & Home Improvement 0.76 0.42 0.54 52
Toys & Games 0.86 0.88 0.87 49
Video Games 1.00 0.67 0.80 6
accuracy 0.81 940
macro avg 0.45 0.44 0.44 940
weighted avg 0.83 0.81 0.81 940
We don’t have much data to analyze; however, the figures we do have indicate an accuracy of 0.81.
Can it be cheaper?
Let’s explore an alternative method that strikes a reasonable balance between cost, speed, and accuracy compared to the previous example.
To construct our classifier, we’ll require some features. The plan is to utilize OpenAI’s text-embedding-3-small embeddings for this purpose.
Building embeddings
import json
import sys
from tqdm import tqdm
from openai import OpenAI

client = OpenAI(api_key='<your api key>')

chunk = sys.argv[1]
o = open(f'{chunk}.vec', 'w')
for i in tqdm(open(f'{chunk}.txt'), total=100000):
    i = json.loads(i)
    title = i['title']
    response = client.embeddings.create(
        input=title,
        model="text-embedding-3-small"
    )
    i['vector'] = response.data[0].embedding
    print(json.dumps(i), file=o)
    o.flush()
o.close()
If our final dataset is stored in small.txt and the script above is saved as embeddings.py, start processing the data with the following command:
python embeddings.py small
At this point, you should notice the first benefit: building embeddings is three times faster than “asking the questions.”
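It could be made faster still: the embeddings endpoint accepts a list of inputs, so titles can be sent in batches instead of one per request. A sketch of the chunking logic (the API call itself is shown as a comment, since the exact batch-size limits depend on the model and token limits):

```python
def batched(items, batch_size):
    """Yield successive fixed-size chunks of a list."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

titles = [f"Product {n}" for n in range(250)]
batches = list(batched(titles, 100))
print([len(b) for b in batches])  # [100, 100, 50]

# Each batch could then be embedded in a single request, e.g. (not executed here):
# response = client.embeddings.create(input=batch, model="text-embedding-3-small")
# vectors = [item.embedding for item in response.data]  # same order as the batch
```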
Building classifier model
import sys
import pandas as pd
from joblib import dump
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

file = sys.argv[1]
data = pd.read_json(file, lines=True)
print('load done')

# Expand the embedding vectors into one column per dimension
X = pd.DataFrame(data['vector'].tolist())
Y = data['category']
print('convert done')

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
print('split done')

model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
print('fit done')

preds = model.predict(X_test)
probas = model.predict_proba(X_test)
print('predictions done')

report = classification_report(y_test, preds)
print(report)
dump(model, 'model.joblib')
Run it with
python train.py small.vec
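The script computes `predict_proba` but never uses it. One natural use (my suggestion, not part of the original post) is to treat the highest class probability as a confidence score and flag low-confidence predictions for review. A self-contained sketch on synthetic vectors standing in for the real embeddings:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for embedding vectors: two well-separated classes
X = np.vstack([rng.normal(0.0, 0.1, (50, 8)), rng.normal(1.0, 0.1, (50, 8))])
y = ["Books"] * 50 + ["Electronics"] * 50

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

probas = model.predict_proba(X)
confidence = probas.max(axis=1)   # highest class probability per row
uncertain = confidence < 0.6      # flag rows the model is unsure about

print(probas.shape)               # (100, 2): one column per class
print(int(uncertain.sum()))       # number of low-confidence predictions
```

On real data, routing the uncertain slice to a stronger (and more expensive) model such as the GPT-4 approach above would be one way to trade cost against accuracy.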
Evaluation results
precision recall f1-score support
Appliances 0.00 0.00 0.00 58
Arts, Crafts & Sewing 0.85 0.21 0.33 502
Audible Books & Originals 0.00 0.00 0.00 11
Automotive 0.81 0.75 0.78 729
Baby Products 0.96 0.23 0.37 291
Beauty & Personal Care 0.88 0.81 0.85 1040
Books 0.67 0.89 0.77 874
CDs & Vinyl 0.00 0.00 0.00 69
Car & Vehicle Electronics 0.00 0.00 0.00 1
Cell Phones & Accessories 0.92 0.83 0.87 413
Clothing, Shoes & Jewelry 0.77 0.99 0.87 4616
Collectibles & Fine Art 0.00 0.00 0.00 18
Electronics 0.76 0.85 0.80 1146
Food Service Equipment & Supplies 0.00 0.00 0.00 2
Gift Cards 0.00 0.00 0.00 6
Grills & Outdoor Cooking 0.00 0.00 0.00 2
Grocery & Gourmet Food 0.89 0.83 0.86 569
Handmade Products 0.00 0.00 0.00 44
Health & Household 0.78 0.61 0.68 1081
Heavy Duty & Commercial Vehicle Equipment 0.00 0.00 0.00 2
Home & Kitchen 0.57 0.94 0.71 2860
Hunting & Fishing 0.00 0.00 0.00 2
Industrial & Scientific 0.78 0.03 0.05 263
Kindle Store 0.82 0.15 0.26 273
Kitchen & Dining 0.00 0.00 0.00 5
Lighting Assemblies & Accessories 0.00 0.00 0.00 4
Lights & Lighting Accessories 0.00 0.00 0.00 1
Magazine Subscriptions 0.00 0.00 0.00 1
Medical Supplies & Equipment 0.00 0.00 0.00 8
Mobility & Daily Living Aids 0.00 0.00 0.00 8
Movies & TV 0.88 0.20 0.32 117
Musical Instruments 1.00 0.18 0.31 153
Office Products 0.87 0.48 0.62 516
Patio, Lawn & Garden 0.94 0.31 0.47 595
Pet Supplies 0.97 0.56 0.71 443
Power & Hand Tools 0.00 0.00 0.00 9
Power Tool Parts & Accessories 0.00 0.00 0.00 5
Remote & App Controlled Vehicle Parts 0.00 0.00 0.00 3
Remote & App Controlled Vehicles & Parts 1.00 0.25 0.40 4
Safety & Security 0.00 0.00 0.00 1
Shoe, Jewelry & Watch Accessories 0.00 0.00 0.00 2
Small Appliance Parts & Accessories 0.00 0.00 0.00 10
Software 1.00 0.14 0.25 7
Sports & Outdoors 0.77 0.46 0.57 1151
Tools & Home Improvement 0.67 0.63 0.65 1070
Toys & Games 0.80 0.64 0.71 853
Video Games 0.94 0.42 0.58 162
accuracy 0.73 20000
macro avg 0.43 0.26 0.29 20000
weighted avg 0.76 0.73 0.70 20000
So, while the accuracy is somewhat lower, the cost of the solution is dramatically reduced. The figures below come from a different run, with almost a million products processed.
YES! Seriously: 35 cents for 900K requests versus 12 dollars for 1K. Of course, there are additional costs attached, such as hosting the model (plenty of options here: SageMaker, a custom HTTP API)... and a drop in accuracy, though it's unclear what the accuracy of the first approach would be if we could apply it to the same number of samples.
There is also a significant assumption baked in: that the "coordinates" of the OpenAI embeddings represent independent features, which allows us to use a random forest to build the classifier.