More isn’t always better. And this couldn’t be truer when developing your data strategy. What data even matters? How do you know it matters? Where do data acquirers spend their time? What data do you go after next? It’s often hard to structure these questions cogently without being overwhelmed. At CircleUp, we boil things down into three components, reflective of our focus on time-series data assets: (1) the past, (2) the present, and (3) the future.
Let’s start with a few definitions. What does each of these mean to Helio?
The past: What is our back-collection strategy?
The present: What does our current data collection and ingestion strategy look like?
The future: What does our new data acquisition look like?
Somewhat intuitively, we start with the present and spend our strategic and engineering efforts on building a robust ingestion system that reliably ingests data known to be predictive of future company success. Our current ingestion system comprises two equally important components of any ML model: (1) our feature data and (2) our training data. Both ingest hundreds of sources, categorized as partnership, practitioner, or public data. The art of determining the ROI of each data asset lies in a set of questions we continuously ask ourselves. To name a few: What is the feature importance of this data asset to our primary and secondary ML models? How many brands in our CircleUp universe does this data asset affect? How ephemeral is this data? Asking the right questions enables our data ingestion team to prioritize existing data assets with an understanding of how each affects our business teams.
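To make those three questions concrete, here is a minimal sketch of how they could be combined into a single priority score. It is an illustration only, not a description of Helio's actual method: the `DataAsset` fields, the 0-to-1 scales, the weights, and the example assets and their numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataAsset:
    """A hypothetical summary of one data asset; all fields are on a 0..1 scale."""
    name: str
    feature_importance: float  # assumed: normalized importance to the ML models
    brand_coverage: float      # assumed: fraction of brands the asset affects
    ephemerality: float        # assumed: 1.0 means the data is lost if not collected now


def priority_score(asset, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of the three questions; the weights are illustrative assumptions."""
    w_importance, w_coverage, w_ephemerality = weights
    return (
        w_importance * asset.feature_importance
        + w_coverage * asset.brand_coverage
        + w_ephemerality * asset.ephemerality
    )


def rank_assets(assets, weights=(0.5, 0.3, 0.2)):
    """Return assets sorted from highest to lowest priority score."""
    return sorted(assets, key=lambda a: priority_score(a, weights), reverse=True)


# Made-up example assets, one per source category named in the post.
assets = [
    DataAsset("partner_sales_feed", 0.8, 0.4, 0.2),
    DataAsset("public_reviews", 0.5, 0.9, 0.7),
    DataAsset("practitioner_survey", 0.3, 0.2, 0.1),
]

ranked = rank_assets(assets)
for asset in ranked:
    print(f"{asset.name}: {priority_score(asset):.2f}")
```

A weighted sum is the simplest possible choice here. In practice the weights would be a judgment call, and a team might well prefer to review the three answers side by side rather than collapse them into one number.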
Second, we look to the future. What data do you go after next? This is a tricky one and includes a lengthy data evaluation process we conduct for each source we could potentially ingest. But that process is for a future post. When thinking through what data we’d initially like to approach for a sample, we try to keep things simple. Will this data asset be incremental to a known predictor of success? Or, will this data asset be an orthogonal signal motivated by a business team hypothesis? From there, we start the data source hunt and evaluation process.
Finally, we think about what we’d ideally like to back-collect. As you’d imagine, this component is easier to identify than to execute. The feature importance of our data assets is known and prioritized, as per our focus on our “present” collection and ingestion system. But back-collecting time-series data is nearly impossible. Back-collection, therefore, is the combination of both the acquisition of ephemeral data (super challenging) and the collection of time series yet to be proven predictive of company success.
Data acquisition is a juggling act that pushes our team to ruthlessly prioritize while thinking about both the short and long term. A data asset predictive of success today might not be one tomorrow (and vice versa). As a result, we find ourselves repeatedly testing and repeatedly taking stock to successfully deploy capital today and scale in the future.