Playing with Amazon’s Shopping Queries Dataset. Part I

Andrew Kornilov
4 min readJun 8, 2023


Recently Amazon published an awesome Shopping Queries Dataset, and someone extended it with Extra product metadata. So, I decided to play with it out of pure curiosity, applying to it whatever comes to my mind (at the time of writing these lines, I have no idea what the final result will be; it's going to be a series of pure experiments).

© https://www.kinderspel.net/nl/soft-blokken-alternatief-voor-lego.html

The dataset contains data in 3 languages. But, to make it simpler, I will use only products in English (Amazon managed to “invent” an imaginary locale, “us”, for them). I will work with the data from Extra product metadata, which also makes it possible to use product images.

0. “Pre-cook” the data

For the sake of reproducibility, I’ll try to make sure we all use the same data and mention the md5 hash of the initial file with the dataset.

md5sum esci.json
985941335fdbdc8f579fff66ec17e7cd esci.json

The data is provided in ndjson format, which is simply a file in which each line is a valid JSON object (so we can process it line by line, a huge benefit for us).

cat esci.json | grep '"locale":"us"' | tee en_esci.json
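The same locale filter can also be sketched in pure Python, which is handy as a sanity check. The sample records below are made up; real records carry many more fields:

```python
import json

# a tiny ndjson sample mimicking the dataset's structure (records are made up)
sample = '\n'.join([
    json.dumps({'asin': 'B000000001', 'locale': 'us'}),
    json.dumps({'asin': 'B000000002', 'locale': 'es'}),
])

# each line is an independent JSON object, so we can stream without
# loading the whole file into memory
us_records = [json.loads(line) for line in sample.splitlines()
              if json.loads(line).get('locale') == 'us']
print(len(us_records))  # 1
```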

1. Show me some numbers

cat en_esci.json | wc -l

The total number of products in the dataset is 1118658.

Let’s look at how products are distributed across top-level categories

import json

stats = {}
for i in open('en_esci.json'):
    i = json.loads(i)
    if category := i.get('category'):
        top_category = category[0]
        if top_category not in stats:
            stats[top_category] = 0
        stats[top_category] += 1

stats = dict(sorted(stats.items(), reverse=True, key=lambda x: x[1]))

print(json.dumps(stats))

And the result is:

{
"Clothing, Shoes & Jewelry":236346,
"Home & Kitchen":142662,
"Sports & Outdoors":60114,
"Electronics":57481,
"Tools & Home Improvement":57214,
"Health & Household":54318,
"Beauty & Personal Care":53629,
"Toys & Games":47377,
"Books":45469,
"Automotive":36528,
"Patio, Lawn & Garden":31459,
"Grocery & Gourmet Food":30367,
"Office Products":28642,
"Arts, Crafts & Sewing":25630,
"Cell Phones & Accessories":22079,
"Pet Supplies":21352,
"Baby Products":15218,
"Kindle Store":14365,
"Industrial & Scientific":12872,
"Musical Instruments":8384,
"Video Games":8300,
"Movies & TV":5863,
"CDs & Vinyl":3667,
"Handmade Products":2757,
"Appliances":2412,
"Collectibles & Fine Art":782,
"Audible Books & Originals":625,
"Medical Supplies & Equipment":550,
"Mobility & Daily Living Aids":473,
"Power & Hand Tools":469,
"Small Appliance Parts & Accessories":444,
"Gift Cards":321,
"Software":293,
"Kitchen & Dining":254,
"Remote & App Controlled Vehicles & Parts":155,
"Shoe, Jewelry & Watch Accessories":155,
"Magazine Subscriptions":120,
"Power Tool Parts & Accessories":113,
"Food Service Equipment & Supplies":108,
"Hunting & Fishing":104,
"Car & Vehicle Electronics":94,
"Remote & App Controlled Vehicle Parts":92,
"Outdoor Power Tools":82,
"Sports & Outdoor Recreation Accessories":77,
"Lights & Lighting Accessories":58,
"Grills & Outdoor Cooking":55,
"Lighting Assemblies & Accessories":38,
"Lab & Scientific Products":36,
"Heavy Duty & Commercial Vehicle Equipment":34,
"Instrument Accessories":33,
"Dining & Entertaining":32,
"Safety & Security":30,
"Restaurant Appliances & Equipment":27,
"Motorcycle & Powersports":26,
"Amazon Devices & Accessories":25,
"Kindle eBooks":9,
"Janitorial & Sanitation Supplies":3,
"Office Electronics":3,
"Performance Parts & Accessories":2,
"Professional Dental Supplies":1
}

So, a first conclusion:

Not all categories are represented equally in our dataset
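To put the imbalance in numbers: the top category alone accounts for roughly a fifth of all products.

```python
total = 1118658   # total number of products counted above
top = 236346      # "Clothing, Shoes & Jewelry", the largest top-level category
print(f'{top / total:.1%}')  # → 21.1%
```

Meanwhile, the bottom dozen categories contribute fewer than a hundred products each, which is worth remembering for any per-category evaluation later.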

2. Let’s grab the images

For further experimentation, I will need the images locally on my machine… I am going to collect them in an admittedly naive way (because I am eager to have them asap and don’t want to spend too much time on that part). Also, I will improve the experience with a progress bar (install the tqdm package with pip).

import json
import hashlib

from tqdm import tqdm

with open('images.txt', 'w') as output:
    for i in tqdm(open('en_esci.json'), total=1118658):
        i = json.loads(i)
        if image := i.get('image'):
            ext = image.split('.')[-1]
            # name each file after the md5 of its URL to avoid collisions
            print(f'wget "{image}" -nc -O images/{hashlib.md5(image.encode()).hexdigest()}.{ext}', file=output)

python images_list.py
100%|██████████████████████████████████████████████████████████████████████████████████████████████| 1118658/1118658 [03:13<00:00, 5781.52it/s]
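The script above only writes the wget commands to images.txt; nothing is downloaded yet. One way to actually run them, several at a time, is xargs (this assumes a GNU xargs with the -P flag; the demo below uses harmless echo commands instead of the real wget lines):

```shell
# demo: run a file of shell commands 2 at a time; swap cmds.txt for images.txt
printf 'echo one\necho two\necho three\n' > cmds.txt
xargs -P 2 -I CMD sh -c CMD < cmds.txt
```

For the real download list, `xargs -P 8 -I CMD sh -c CMD < images.txt` runs eight downloads in parallel, and wget’s -nc flag makes an interrupted run resumable.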

3. What if we put them in (vectors) space?

Let’s convert our images from binary format into vectors. We will use the popular Resnet-18 model and the library https://github.com/christiansafka/img2vec to build two vectors for each image file (an RGB and a greyscale version) and store those vectors in files named after the product id.

Please, install the required dependency

pip install img2vec_pytorch

And create/run python script:

import os
import json
import hashlib

from PIL import Image
from img2vec_pytorch import Img2Vec
from tqdm.auto import tqdm

img2vec = Img2Vec(cuda=False)  # load the Resnet-18 model once, not per image

for i in tqdm(open('en_esci.json'), total=1118658):
    i = json.loads(i)
    if image := i.get('image'):
        ext = image.split('.')[-1]
        filename = hashlib.md5(image.encode()).hexdigest()
        asin = i.get('asin')
        if not os.path.exists(f'vectors/{asin}.vec'):
            try:
                image = Image.open(f'images/{filename}.{ext}')
                rgb_image = image.convert('RGB')
                greyscale_image = image.convert('L').convert('RGB')
                # get_vec(..., tensor=True) returns a (1, 512, 1, 1) tensor;
                # flatten it into a plain list of 512 floats
                rgb_vec = img2vec.get_vec(rgb_image, tensor=True)
                rgb_vec = [v[0][0] for v in rgb_vec.tolist()[0]]
                greyscale_vec = img2vec.get_vec(greyscale_image, tensor=True)
                greyscale_vec = [v[0][0] for v in greyscale_vec.tolist()[0]]
                with open(f'vectors/{asin}.vec', 'w') as f:
                    print(json.dumps({'asin': asin, 'rgb_vector': rgb_vec, 'greyscale_vec': greyscale_vec}), file=f)
            except Exception:
                pass  # skip missing or unreadable image files

Processing all the images will take quite a long time (it took me a few days on my laptop). After it finishes, you’ll be able to see something like

cat vectors/B074GVPG51.vec | python -m json.tool | less
{
"asin": "B074GVPG51",
"rgb_vector": [
1.2417913675308228,
0.24562086164951324,
1.655302882194519,
0.34100598096847534,
0.6734368205070496,
1.2439863681793213,
1.4576066732406616,
0.8253995776176453,
1.1798666715621948,
1.7090641260147095,
1.3098657131195068,
1.2227827310562134,
.. many more ...
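As a small preview of what these vectors are for: two embeddings can be compared with cosine similarity using nothing but the standard library. The sketch below uses tiny stand-in vectors rather than real 512-dimensional Resnet-18 embeddings:

```python
import math

def cosine(a, b):
    # cosine similarity: dot product divided by the product of magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# parallel vectors are maximally similar, orthogonal ones score zero
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]
print(round(cosine(v1, v2), 4))  # 1.0
```

The same function applied to the `rgb_vector` fields of two .vec files would give an image-similarity score between products.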

That’s it for today. Next time we will try to use all the data we just collected and prepared in order to achieve some more practical results.
