This semester's 772 and 742 turn out to use the same dataset... it's wine for both.
https://www.kaggle.com/zynicide/wine-reviews
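One problem, though: quite a few variables (designation, for example) have a huge number of distinct values... For classification, encoding them is a pain, and leaving them unencoded doesn't work either...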
Idea 1: unique integer encoding. RM's converter can do this, but... I don't really trust it _(:з」∠)_
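For reference, a minimal sketch of what unique integer encoding does, using pandas instead of RM. The toy frame and the `designation_code` column name are made up for illustration. The codes are arbitrary labels, so a model may read an ordering into them that isn't really there, which is probably part of why it feels untrustworthy.

```python
import pandas as pd

# Toy stand-in for the wine data (values are made up for illustration).
toy = pd.DataFrame({'designation': ['Reserve', 'Estate', 'Reserve', None, 'Brut']})

# factorize assigns 0, 1, 2, ... in order of first appearance;
# missing values get the sentinel code -1.
codes, uniques = pd.factorize(toy['designation'])
toy['designation_code'] = codes

print(toy)
print(list(uniques))
```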
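Idea 2: feature hashing, which turns each variable into a set of hash values. I used sklearn's FeatureHasher, and it works a bit better than unique integer encoding... but I feel there must be something else.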
    import numpy as np
    import pandas as pd
    from sklearn.feature_extraction import FeatureHasher

    # Split the frame: categorical columns to hash vs. columns kept as they are.
    data_hash_target = data.drop(columns=['price-cat', 'points'])
    data_label = data.drop(columns=['country', 'designation', 'province', 'region_1', 'variety', 'winery'])

    hash_vector_size = 12  # hashed features per original column

    hashed_parts = []
    for col in data_hash_target.columns:
        # alternate_sign=False stands in for the non_negative=True option,
        # which was removed from newer sklearn releases.
        hasher = FeatureHasher(n_features=hash_vector_size, input_type='string', alternate_sign=False)
        # Each sample has to be a list of strings; a bare string would be
        # hashed character by character (or rejected by newer sklearn).
        samples = [[str(v)] for v in data_hash_target[col]]
        hashed_parts.append(hasher.transform(samples).toarray())

    # One block of hash_vector_size columns per original column, named fh1, fh2, ...
    res_0 = np.hstack(hashed_parts)
    fh_columns = [f'fh{i + 1}' for i in range(res_0.shape[1])]
    # Reuse the original index so the concat below lines up row by row.
    data_hash_target = pd.DataFrame(res_0, columns=fh_columns, index=data.index)

    data_hashed = pd.concat([data_hash_target, data_label], axis=1)
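Idea 3: this came out of the part in 772 where price gets binned into 5 classes. Classify (or group) designation into a smaller number of categories, e.g. reduce the original tens of thousands of designations to something like 20 classes, and then encode those. (But I have no idea how to go about it... sigh _(:з」∠)_)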
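One simple way to get there, sketched below as an assumption rather than a settled plan: keep the most frequent designations and lump everything else into a single catch-all level. The helper name `lump_rare_levels`, the `'Other'` label, and the toy values are all made up for illustration; with `n_keep=19` this would give 20 classes.

```python
import pandas as pd

def lump_rare_levels(series, n_keep=19, other_label='Other'):
    """Keep the n_keep most frequent levels and replace the rest with other_label."""
    # value_counts sorts by frequency (descending) and ignores missing values.
    top_levels = series.value_counts().index[:n_keep]
    # where() keeps values in top_levels and replaces everything else,
    # including missing values, with other_label.
    return series.where(series.isin(top_levels), other_label)

toy = pd.Series(['Reserve', 'Reserve', 'Reserve', 'Estate', 'Estate', 'Brut', 'Old Vine'])
lumped = lump_rare_levels(toy, n_keep=2)
print(lumped.value_counts())
```

This only looks at frequency, so two designations that mean the same thing but are spelled differently still end up in different levels (or both in the catch-all).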
Also, should I pull designation out on its own and cluster it, or use a model that classifies designation together with the other variables? It's giving me a headache.
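A rough sketch of the first option (clustering designation on its own), assuming the text of the designation is what gets clustered: character n-gram TF-IDF followed by KMeans. The toy strings, the n-gram range, and `n_clusters=3` are placeholders picked for illustration, not tuned values.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy designations (made up for illustration).
designations = pd.Series([
    'Reserve', 'Reserva', 'Gran Reserva',
    'Estate', 'Estate Grown', 'Estate Bottled',
    'Brut', 'Brut Rosé', 'Extra Brut',
])

# Character n-grams let similar spellings share features.
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4))
X = vectorizer.fit_transform(designations)

# Each designation gets a cluster id that can be used as a coarser category.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

print(pd.DataFrame({'designation': designations, 'cluster': cluster_ids}))
```

The second option would bring the other variables in as well, so the grouping would reflect more than how the designations are spelled.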
(Writing this down for now... in case I forget.)
Update 4.1: looking at the clusters, it really does seem that designation and winery carry the most information... maybe worth a try?