The Amazon dataset can be split by product category into sub-datasets such as Books, Electronics, Movies and TV, and CDs and Vinyl. Each sub-dataset contains two kinds of information.

Product metadata:

- `asin`: product ID
- `title`: product name
- `price`: price
- `imUrl`: URL of the product image
- `related`: related products
- `salesRank`: sales rank information
- `brand`: brand
- `categories`: list of categories the product belongs to

Official example:

```
{
  "asin": "0000031852",
  "title": "Girls Ballet Tutu Zebra Hot Pink",
  "price": 3.17,
  "imUrl": "http://ecx.images-amazon.com/images/I/51fAmVkTbyL._SY300_.jpg",
  "related": {
    "also_bought": ["B00JHONN1S", "B002BZX8Z6"],
    "also_viewed": ["B002BZX8Z6", "B00JHONN1S"],
    "bought_together": ["B002BZX8Z6"]
  },
  "salesRank": {"Toys & Games": 211836},
  "brand": "Coxlures",
  "categories": [["Sports & Outdoors", "Other Sports", "Dance"]]
}
```

User rating records:

- `reviewerID`: user ID
- `asin`: product ID
- `reviewerName`: user name
- `helpful`: helpfulness rating of the review, e.g. 2/3
- `reviewText`: text of the review
- `overall`: rating
- `summary`: summary of the review
- `unixReviewTime`: time of the review (Unix timestamp)
- `reviewTime`: time of the review (raw string)

```
{
  "reviewerID": "A2SUAM1J3GNN3B",
  "asin": "0000013714",
  "reviewerName": "J. McDonald",
  "helpful": [2, 3],
  "reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!",
  "overall": 5.0,
  "summary": "Heavenly Highway Hymns",
  "unixReviewTime": 1252800000,
  "reviewTime": "09 13, 2009"
}
```

Reading the Amazon dataset:

The downloaded data comes as JSON files, which are awkward to work with directly, so this section shows how to convert a JSON file to CSV. The 2014 Amazon Electronics dataset is used as the example.

Reading the product metadata:

```python
import ast

import pandas as pd

file_path = 'meta_Electronics.json'
useless_col = ['imUrl', 'salesRank', 'related', 'title', 'description']  # fields to drop

df = {}
with open(file_path, 'r') as fin:
    for i, line in enumerate(fin):
        # One record per line; literal_eval parses the dict literal
        # without executing arbitrary code the way eval would
        d = ast.literal_eval(line)
        for s in useless_col:
            d.pop(s, None)  # drop the field if present
        df[i] = d  # key each record by its line index

df = pd.DataFrame.from_dict(df, orient='index')
df.to_csv('meta_Electronics.csv', index=False)
```

Reading the user rating records:

```python
import ast

import pandas as pd

file_path = 'Electronics_10.json'
useless_col = ['reviewerName', 'reviewText', 'unixReviewTime', 'summary']  # fields to drop

df = {}
with open(file_path, 'r') as fin:
    for i, line in enumerate(fin):
        d = ast.literal_eval(line)  # one review record per line
        for s in useless_col:
            d.pop(s, None)  # drop the field if present
        df[i] = d  # key each record by its line index

df = pd.DataFrame.from_dict(df, orient='index')
df.to_csv('Electronics_10.csv', index=False)
```
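
The `helpful` field stays in the review CSV as a `[helpful votes, total votes]` pair, and pandas writes a list cell as its string form (for example `"[2, 3]"`). Below is a minimal sketch of turning that pair into a ratio such as 2/3. It assumes you want `None` when a review has received no votes. `helpful_ratio` is a hypothetical helper name used only for illustration, not something shipped with the dataset.

```python
import ast


def helpful_ratio(helpful):
    """Return helpful votes / total votes, or None when nobody voted.

    Accepts either a list such as [2, 3] or its string form "[2, 3]",
    which is what the column looks like after a round trip through CSV.
    """
    if isinstance(helpful, str):
        helpful = ast.literal_eval(helpful)  # "[2, 3]" -> [2, 3]
    up_votes, total_votes = helpful
    if total_votes == 0:
        return None  # assumption: no votes means the ratio is undefined
    return up_votes / total_votes


if __name__ == '__main__':
    print(helpful_ratio([2, 3]))
    print(helpful_ratio("[2, 3]"))
    print(helpful_ratio([0, 0]))
```

If the CSV has been loaded back into a DataFrame, the same function could be applied column-wise, for example `df['helpful'].apply(helpful_ratio)`.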