I am working on an unsupervised machine learning algorithm that studies marijuana data to suggest similar strains. I've run into a slight roadblock: the CBD to THC ratio, which is a super important data point, is hidden within the 'Description' column with no real consistency in how it is phrased. Sometimes it's 'X:Y CBD/THC ratio', sometimes it's 'a THC to CBD ratio of about X:Y', and sometimes other words are thrown in there to make it more confusing from a coding standpoint.
My current strategy is to make an if statement that searches through all of the descriptions to extract the data, but I can't figure out how to make it work. This is the base idea I'm working with.
strain_breakdown['THC/CBD Ratio'] = 0
for s in strain_data:
    if strain_data['Description'].str.contains(f'THC:CBD ratio of about {int}:{int}'):
        strain_breakdown['THC/CBD Ratio'] = int/int
Obviously, the code above doesn't work, but I'm trying to find something like this that might.
My plan is to follow this with elif statements that cover the other ways the ratio is phrased in different descriptions, and to make separate columns and if statements for THC to CBD ratios and CBD to THC ratios, but I just need to find a way to extract the numbers. Anyone got any ideas?
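One direction that might work here is regular expressions with capture groups, which can pull the two numbers out instead of only testing whether a phrase is present. The sketch below is only a starting point and makes several assumptions: the two patterns cover just the two phrasings quoted above, the helper name thc_to_cbd_ratio and the pattern names are made up for illustration, and the result is always normalized to THC divided by CBD so a single column could hold both cases.

```python
import re

# A number such as "2" or "1.5"
NUM = r"(\d+(?:\.\d+)?)"
# A label pair such as "CBD/THC", "THC:CBD" or "THC to CBD"
LABELS = r"(CBD|THC)\s*(?:/|:|to)\s*(CBD|THC)"

# Assumed phrasing 1: numbers first, e.g. "1:1 CBD/THC ratio"
NUM_FIRST = re.compile(NUM + r"\s*:\s*" + NUM + r"\s*" + LABELS, re.IGNORECASE)
# Assumed phrasing 2: labels first, e.g. "THC to CBD ratio of about 2:1"
LABEL_FIRST = re.compile(
    LABELS + r"\s+ratio\D{0,30}?" + NUM + r"\s*:\s*" + NUM, re.IGNORECASE
)

def thc_to_cbd_ratio(text):
    """Return THC divided by CBD as a float, or None if no ratio is found."""
    m = NUM_FIRST.search(text)
    if m:
        x, y, first, second = m.groups()
    else:
        m = LABEL_FIRST.search(text)
        if m is None:
            return None
        first, second, x, y = m.groups()
    first, second = first.upper(), second.upper()
    if first == second:
        return None  # e.g. "THC/THC" is not a usable ratio
    x, y = float(x), float(y)
    # The first number belongs to the first label; reorder so THC is on top.
    thc, cbd = (x, y) if first == "THC" else (y, x)
    if cbd == 0:
        return None  # avoid dividing by zero
    return thc / cbd

print(thc_to_cbd_ratio("A balanced hybrid with a 1:1 CBD/THC ratio."))
print(thc_to_cbd_ratio("Known for a THC to CBD ratio of about 2:1."))

# With pandas, assuming strain_data is a DataFrame with a 'Description' column:
# strain_data['THC/CBD Ratio'] = strain_data['Description'].apply(thc_to_cbd_ratio)
```

Because the label order is captured along with the numbers, this would avoid a separate elif branch for each direction of the ratio. Descriptions phrased in other ways would need extra patterns, and rows with no match come back as None so they can be inspected by hand.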