Unlocking AI potential in e-commerce and online business – The Right Data

Welcome! Our “Unlocking AI potential in e-commerce and online business” series provides basic guidance on applying AI (Artificial Intelligence) in businesses that use the internet as their primary channel for delivering value to customers. In this article, we focus on how to think about data and data sources so that you get the most out of machine learning models. Enjoy the read!

Garbage in – garbage out in AI-focused projects

80 % – approximately this share of a data scientist’s time is spent preparing data for advanced analysis and machine learning, according to recent articles published by The New York Times, Harvard Business Review and Forbes.

AI is not just one particular computer program or algorithm. Instead, it’s a system of several mathematical models. The ability of this system to emulate the human decision-making process relies on the right input information. Companies often suppose that more data leads to a better model. Unfortunately, it is not that easy.

Big Data does not necessarily mean you will have a good AI model

The result depends on how well the data covers the key drivers of the problem we are trying to solve. If successful, AI can deliver savings in the tens of percent. The McKinsey Global Institute has published research that attempts to simulate the impact of AI on the world economy:

$13 trillion – the additional global economic activity that AI could deliver by 2030
20 % – the decline in cash flow from today’s levels that non-adopters might experience by 2030, assuming the same cost and revenue model as today

Are you new to AI and machine learning? If so, we recommend the first article in our “Unlocking AI potential in e-commerce and online business” series, which deals with the basic concepts.

A very common beginner’s mistake is spending most of the time on developing complex algorithms instead of brainstorming about the right data.

A simple model based on good features always beats a complex model based on poor features

Collecting data at random, with no idea of how exactly to use it, often leads to inconsistent and poorly structured data. Features extracted from data collected under such conditions are almost unusable for any kind of analysis or machine learning. Having some data is better than having no data, but without a proper business case, mathematical modeling is like looking for a needle in a haystack.

Data worth storing

When building AI for e-commerce, we often automate tasks and procedures that domain experts currently perform manually. A common way to use data is a manual analysis followed by some action in response to the new insights. Thanks to machine learning, we can replace this process with a mathematical model that acts directly on real-time and historical data.

The right data for machine learning is often that which yields similar information to what a human expert would use to solve the problem

Data and ontology

Let’s think about how this works in a real case. In our last article we introduced a fictional company called Fictional Online Fashion Store, which would like to personalize its website content in real time and offer customers relevant products. We have already illustrated how to approach the project from the management point of view. How should we handle the data selection process? How would a human expert proceed in a regular brick-and-mortar fashion store? When a new customer comes in, a salesperson quickly runs through a set of questions:

1. Is it a new or an existing customer?
2. Is it a man or a woman?
3. How old is he/she?
4. What is his/her fashion style?
5. How wealthy is he/she?
6. Does he/she view goods slowly or quickly?
7. Does he/she view goods which share a particular feature?
8. Does the current day, time or season matter? Is late evening shopping typical for some particular customer segment?
9. What does our competition offer in this area?
10. What is popular among celebrities and influencers?

These brainstorming questions form the basis for an ontology – a definition of entities, attributes and their relations in the given context.

In our case we have at least two entities – customers and products. Each customer has such attributes as shopping history, gender, age, fashion style, income and shopping behavior (questions 1-7). Similarly, each product in our fictional fashion store has such attributes as fashion style or target customer segment.
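To make the idea of an ontology more concrete, the two entities and their attributes could be written down as simple data structures. The following is only a minimal sketch: the class and field names (Customer, Product, income_category, purchase_history and so on) are our own illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Product:
    """A product entity (field names are illustrative assumptions)."""
    product_id: str
    fashion_style: str       # e.g. "casual" or "formal"
    target_segment: str      # e.g. "women 25-34"


@dataclass
class Customer:
    """A customer entity; unknown attributes stay None until estimated."""
    customer_id: str
    gender: Optional[str] = None
    age: Optional[int] = None
    fashion_style: Optional[str] = None
    income_category: Optional[str] = None
    # Relation between the two entities: a customer has purchased products.
    purchase_history: List[Product] = field(default_factory=list)

    @property
    def is_new(self) -> bool:
        # A customer with no recorded purchases is treated as new.
        return not self.purchase_history


shirt = Product("p1", fashion_style="casual", target_segment="women 25-34")
visitor = Customer("c1")
visitor.purchase_history.append(shirt)
```

Keeping unknown attributes as None makes it explicit which parts of the ontology are not yet covered by any data source.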

The values of these attributes are used by the human expert (salesperson) to recommend relevant products in the given context (questions 8-10). How do we represent such information using available data sources?

Knowledge representation

Let’s split the information mentioned earlier into several segments:

Demographics – customer gender, age and income
Shopping behavior history – customer fashion style based on his or her clothes and accessories, history of purchases in our Fictional Online Fashion Store and history of interaction with newsletters and marketing campaigns
Current shopping behavior – real-time interaction with the Fictional Online Fashion Store website (viewing products, banner clicks)
External context – shopping and fashion trends for the current season or time, what the competition offers, trends among celebrities and influencers

When building an AI model, we need to think about the significance of each information segment and how well we can cover it using available data sources.

Demographics

Demographics such as age or income can be tricky to obtain due to consumer privacy and data protection regulations. We can use a registration form, but we never know whether the customer is telling us the truth. Luckily, we can estimate this information using correlated data.

We often know the location, mobile device type and operating system, which hint at the likely income category. An iPhone owner from a big city probably spends more on fashion than the owner of a cheap phone from a rural area. In the case of an existing customer, we can combine this information with historical purchases to get a more precise estimate.

Based on the visit time and location we can guess the age group. The daily schedule of an economically active population differs from the schedule typical for kids or seniors.
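The two estimates described above can be sketched as simple rules. This is a deliberately naive illustration: the device and location labels, the working-hours window and the output categories are all hypothetical assumptions, and in practice such rules would be replaced or calibrated by a model trained on real data.

```python
from dataclasses import dataclass


@dataclass
class Visit:
    """Signals available for a single visit (labels are assumptions)."""
    device: str        # e.g. "iphone" or "budget_android"
    city_size: str     # "big_city" or "rural"
    hour: int          # local hour of the visit, 0..23
    is_weekday: bool


def income_hint(visit: Visit) -> str:
    """Rough income category from device type and location."""
    score = 0
    if visit.device == "iphone":
        score += 1
    if visit.city_size == "big_city":
        score += 1
    return ["lower", "middle", "higher"][score]


def age_group_hint(visit: Visit) -> str:
    """Rough age-group guess from the time of the visit.

    Assumption: weekday visits during typical working hours are less
    likely to come from the economically active population.
    """
    if visit.is_weekday and 9 <= visit.hour < 16:
        return "kid_or_senior"
    return "unknown"


evening_visit = Visit("iphone", "big_city", hour=20, is_weekday=True)
print(income_hint(evening_visit), age_group_hint(evening_visit))
```

Returning "unknown" rather than forcing a guess keeps the estimate honest when the signals are not informative.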

How to successfully execute AI-related projects? You can find a few tips in this article in our “Unlocking AI potential in e-commerce and online business” series!

Shopping behavior history

It is almost impossible for an online store to guess the shopping behavior history of a new customer. The only information available is usually the mobile device type and operating system. Thankfully, it’s much easier once the customer starts interacting with the website and marketing content. We can begin to create a customer profile starting with the first page view.
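A profile that starts from the first page view could look roughly like the sketch below. The CustomerProfile class, its fields and the idea of tracking the fashion style of viewed products are illustrative assumptions, not a description of any particular tracking tool.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CustomerProfile:
    """Profile that starts nearly empty and grows with each page view."""
    visitor_id: str
    device: str                      # usually known from the first request
    page_views: int = 0
    style_counts: Counter = field(default_factory=Counter)

    def record_view(self, product_style: str) -> None:
        # Update the profile with one viewed product.
        self.page_views += 1
        self.style_counts[product_style] += 1

    def dominant_style(self) -> Optional[str]:
        # No views yet means we cannot say anything about the style.
        if not self.style_counts:
            return None
        return self.style_counts.most_common(1)[0][0]


profile = CustomerProfile(visitor_id="v1", device="iphone")
for style in ["casual", "formal", "casual"]:
    profile.record_view(style)
```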

Current shopping behavior

Similarly, current shopping behavior can be captured in real time by tracking website interactions. We can look for similarities in attributes among viewed products, count the number of visited product categories or measure the time spent on product detail pages.
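The three signals mentioned above can be computed from a list of product views within a session. The following is a minimal sketch: the ProductView structure, the "key:value" attribute format and the feature names are hypothetical and would depend on how your tracking data is actually stored.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass
class ProductView:
    """One product detail view in the current session (assumed format)."""
    product_id: str
    category: str
    attributes: FrozenSet[str]      # e.g. {"color:black", "style:casual"}
    seconds_on_detail: float


def session_features(views: List[ProductView]) -> Dict[str, object]:
    """Summarize a session into features usable by a model."""
    if not views:
        return {"n_categories": 0,
                "avg_seconds_on_detail": 0.0,
                "shared_attributes": []}
    # Number of distinct product categories visited.
    n_categories = len({v.category for v in views})
    # Average time spent on a product detail page.
    avg_seconds = sum(v.seconds_on_detail for v in views) / len(views)
    # Attributes present on every viewed product.
    counts = Counter(a for v in views for a in v.attributes)
    shared = sorted(a for a, c in counts.items() if c == len(views))
    return {"n_categories": n_categories,
            "avg_seconds_on_detail": avg_seconds,
            "shared_attributes": shared}


views = [
    ProductView("p1", "jackets", frozenset({"color:black", "style:casual"}), 10.0),
    ProductView("p2", "jackets", frozenset({"color:black", "style:formal"}), 20.0),
    ProductView("p3", "shoes", frozenset({"color:black"}), 30.0),
]
features = session_features(views)
```

Requiring an attribute to appear on every viewed product is a strict definition of similarity; a softer alternative would be to report the share of views that carry each attribute.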

External context

The toughest part is representing the external context as data. There are countless things which could be relevant to a customer’s preferences and behavior. From a technical point of view, less abstract information is easier to represent. In our case, this could mean web scraping selected blogs and newspapers associated with the target demographic group and extracting particular keywords. In a similar way, we can track what selected competitors offer. A weather forecast is another potentially useful piece of context.
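Once the text of selected articles has been collected, extracting particular keywords can be as simple as counting them. The sketch below assumes the scraping step has already happened and the article texts are available as plain strings; the function name and the example keywords are our own illustrative choices.

```python
import re
from collections import Counter
from typing import Dict, Iterable


def keyword_counts(texts: Iterable[str], keywords: Iterable[str]) -> Dict[str, int]:
    """Count how often each tracked keyword appears across the texts."""
    tracked = {k.lower() for k in keywords}
    # Start every tracked keyword at zero so absent trends are visible too.
    counts = Counter({k: 0 for k in tracked})
    for text in texts:
        # Simple tokenization: lowercase runs of letters a-z.
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in tracked:
                counts[token] += 1
    return dict(counts)


articles = [
    "Denim is back: oversized denim jackets everywhere.",
    "Linen for summer.",
]
trend_counts = keyword_counts(articles, ["denim", "linen", "velvet"])
```

Counts like these can be recomputed regularly and fed into the model as features describing current fashion trends. The tokenization here only handles single English words; multi-word phrases or other languages would need a different approach.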

Now we know where to find the right data. What next?

If you want to unlock the whole potential of your data in AI, you need a system to process it, analyze it and most importantly, to use it in AI models. We will cover all these aspects in our series “Unlocking AI potential in e-commerce and online business”. Stay tuned!

About the author

My name is Ondřej Kopička and I help companies automate data analysis.

Does the volume of your data exceed the capacity of your analysts? Then we should talk.

Connect with me on LinkedIn: https://www.linkedin.com/in/ondrej-kopicka/