Here’s an easy technique that has streamlined my workflows.
Often there are things you want to track alongside your data. I work with collections of tabular data; for example, I may have seven datasets, each representing data collected on a specific day of the week, and I often want to run experiments on each one. I found it useful to create a `dataset` class to instantiate and track these collections.
```python
class dataset:
    def __init__(self, path, target, rate, label='', note=''):
        self.path = path
        self.target = target
        self.rate = rate
        self.label_path = label
        self.note = note
        self.accuracy = 0
```
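For concreteness, instantiating the seven per-day objects might look something like the sketch below. The file paths, target value, and rate are all hypothetical placeholders:

```python
class dataset:  # the class from above, repeated so this snippet runs standalone
    def __init__(self, path, target, rate, label='', note=''):
        self.path = path
        self.target = target
        self.rate = rate
        self.label_path = label
        self.note = note
        self.accuracy = 0

# One object per day of the week; every path here is made up.
days = ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']
datasets = [
    dataset(path=f'data/{day}.csv', target=1, rate=1.0,
            label=f'labels/{day}.csv', note=f'collected on {day}')
    for day in days
]
```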
For me, the key details are the path to the data, the target value (I'm counting binary events), the path to any label file, and any notes about the dataset. This class could easily be extended with any other attributes (or even methods) you need. With the elements here, the attributes capture enough information to load data, run a test, and generate accuracy metrics. For example, if you had an `evaluate()` function and a `model()`, you could write:
```python
for data in datasets:
    data.accuracy = evaluate(model(data.path), data.label_path)
```
This structure gives me an organized way to iterate through my datasets: I create a list of these objects, then run my experiments by looping over the list. In embarrassingly parallel scenarios, it's also easy to punt the function calls on these `dataset` objects over to joblib.
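As a rough sketch of that joblib handoff: since each per-dataset call is independent, `Parallel` and `delayed` can fan the work out across cores. The `run_experiment` function and its fixed return value below are stand-ins for a real `evaluate`/`model` pair, and the class is trimmed to the fields this snippet touches:

```python
from joblib import Parallel, delayed

class dataset:  # trimmed to the attributes used in this sketch
    def __init__(self, path, label=''):
        self.path = path
        self.label_path = label
        self.accuracy = 0

def run_experiment(data):
    # In practice: return evaluate(model(data.path), data.label_path)
    return 0.9  # placeholder accuracy

datasets = [dataset(f'data/day{i}.csv', f'labels/day{i}.csv') for i in range(7)]

# Each call is independent, so joblib can run them in parallel workers.
scores = Parallel(n_jobs=2)(delayed(run_experiment)(d) for d in datasets)
for d, score in zip(datasets, scores):
    d.accuracy = score
```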