Listening to people also involves data: GA4 Data Import


How to Fix and Clean Ecommerce Data History with Python Using GA4 Item Data Import

 


In marketing, data helps us understand how people interact with products and services, both online and offline, and how those interactions shape and enrich their experiences. It is also thanks to numbers that we are able to map and understand people’s needs in an increasingly in-depth way, day after day. Careful data analysis therefore means giving projects back the heart of their existence, that is, people’s experiences, without which objectives and results would not speak to each other.

This is why every day we work hard to fully understand the behavioral patterns of the people who move through the products and services of the brands we collaborate with, analyzing and optimizing ever more deeply the data from which each project is born and develops.

In this article, we will explore a particular feature of Google Analytics 4 that helps us keep the data we work on clean, filter out critical issues and optimize work processes.

GA4 Data Import: what is it?

Google Analytics 4 (GA4) offers users a wide range of advanced features to analyze and monitor user behavior on their website. One of these features is “Data Import”, which allows you to load data from external sources (CRM or offline points of sale, for example) in order to combine them with online user traffic data. In this way, you can obtain more precise information and consult it more easily through a single tool.

Features and benefits of GA4 Data Import

To use the Data Import function, you just need a CSV file containing external data, which will be linked to the data collected by GA4 once loaded into the property. There are four different types of Data Import:

“Cost data”: used to import click, cost and impression data from third-party ads.

“Item data”: covers product metadata such as size, color, category, name and brand.

“User data”: used to import user metadata, such as a loyalty index collected from a CRM.

“Offline events”: custom offline events coming from sources that are not connected to the internet.

In the case of Item Data Import, this function becomes even more interesting: this type of Data Import not only lets you modify the data that will be collected in the future, but also acts retroactively on the historical data, correcting any errors recorded in the past.

An example could be a multilingual e-commerce site in which product data is collected in different languages. In this case it would be harder to analyze product performance, since for what is in fact a single product (for example “White T-shirt”) we would not find a single “Item name” entry, but as many entries as there are languages on the site (“White T-shirt”, “T-shirt blanc” and so on), making the data less usable.

This is where the GA4 Data Import function can help aggregate the data and present it in a more orderly way. For this process, however, you need a file containing product data such as the SKU, the product name and the category name, and its creation can be automated with Python.
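To make the goal more concrete, here is a purely illustrative excerpt of the kind of CSV file we will build in the following steps (the column names sku, item_name and item_cat1 are the ones used later in this article when mapping the file in GA4, and the values are made up):

<code>
sku,item_name,item_cat1
TS-001-WHT,White T-shirt,T-shirts
TS-002-BLK,Black T-shirt,T-shirts
HD-010-GRY,Grey Hoodie,Hoodies
</code>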

Step 1: Prerequisites

First of all, you need to make sure that the information you want to collect, such as the SKU, the name and the category of the product, is actually present in the pages of your e-commerce website. You can test this on a couple of product pages: open DevTools (Ctrl+Shift+I), open the “Elements” tab and search (Ctrl+F) the code for elements such as the name of a product, its SKU and its category. If they are present, then you can extract them.

You must then make sure you have Python installed, and it is recommended to create a virtual environment so as to have an isolated Python installation that contains only the libraries necessary for the correct functioning of a given application (in our case, the e-commerce crawler).
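As a minimal sketch (the environment name “ga4-crawler” is just an example), a virtual environment can be created and activated from the command prompt or terminal as follows:

On Windows:

<code>
py -m venv ga4-crawler
ga4-crawler\Scripts\activate
</code>

On macOS/Linux:

<code>
python3 -m venv ga4-crawler
source ga4-crawler/bin/activate
</code>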

Alternatively, you can use Anaconda, an open source distribution of the Python and R programming languages, for a more intuitive and simplified management of packages and virtual environments. Once Anaconda is open, simply click on “Environments”, then “Create”, assign a name to the new environment and click on it to activate it.
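If you prefer the command line, the equivalent can be done from the Anaconda Prompt (the environment name “ga4-crawler” is, again, just an example):

<code>
conda create --name ga4-crawler python
conda activate ga4-crawler
</code>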

The last preliminary step is installing the libraries needed for the application to work correctly: Scrapy, to manage the e-commerce crawling process; BeautifulSoup, to extract specific information from the pages identified by the crawler; and Pandas, for manipulating and organizing the collected data. To install them, you can run this command in the command prompt or terminal:

On Windows:

<code>
py -m pip install scrapy beautifulsoup4 pandas
</code>

On macOS/Linux:

<code>
python3 -m pip install scrapy beautifulsoup4 pandas
</code>

 

Step 2: Importing Python Libraries

After creating a new .py file, you can import the libraries you just installed with this script:

<code>
import urllib.request
import pandas as pd
from scrapy.crawler import CrawlerProcess
from bs4 import BeautifulSoup
from scrapy.spiders import SitemapSpider
</code>

 

Not only Scrapy, Pandas and BeautifulSoup will be used, but also “urllib”, a standard-library package containing several modules for working with URLs (Uniform Resource Locators); here we import its “urllib.request” module, which handles opening URLs.

Step 3: Web crawling with Python and Scrapy

The next step after importing the libraries is crawling the e-commerce site, which can be done using this script:

<code>
# Function to get the content of a page
def get_page(url):
    response = urllib.request.urlopen(urllib.request.Request(url, headers={'User-Agent': 'Mozilla'}))
    soup = BeautifulSoup(response, 'html.parser', from_encoding=response.info().get_param('charset'))
    return soup

# Function to get sitemaps from robots.txt file
def get_sitemaps(robots):
    sitemapList = []
    lines = str(robots).splitlines()
    for line in lines:
        if line.startswith('Sitemap:'):
            split = line.split(':', maxsplit=1)
            if '/en/' in split[1]:
                sitemapList.append(split[1].strip())
    return sitemapList

robots = get_page('https://www.example-e-commerce.com/robots.txt')
sitemaps = get_sitemaps(robots)

# Spider definition for product crawling
class ProductsSpider(SitemapSpider):
    name = "products"
    allowed_domains = ["example-e-commerce.com"]
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0'
    }
    sitemap_urls = sitemaps

    def parse(self, response):
        data = {
            'item_name': response.selector.xpath("//h1[@class='page-title']/span/text()").get(),
            'item_cat1': response.selector.xpath("//*[@class='items']/li[2]/a[@title]/text()").get(),
            'sku': response.selector.xpath("//*[@data-product-sku]").css("::attr(data-product-sku)").get()
        }
        yield data

# Function to run the spider, which creates a JSON file at the end
def run_spider():
    process = CrawlerProcess(settings={"FEEDS": {"products.json": {"format": "json"}}})
    process.crawl(ProductsSpider)
    process.start()

if __name__ == "__main__":
    run_spider()
</code>

 

In the code above, the “get_sitemaps(robots)” function collects, from the robots.txt file, the sitemaps that contain the pages of the site to be crawled. Since the goal is to collect only the English names of the various products, only the sitemap URLs that contain the language folder “/en/” are taken into consideration and saved in the “sitemapList” list.

The spider that will crawl the collected sitemaps is then configured, and in the “parse(self, response)” function the elements to be collected from each page are specified: the product name, the category (or categories) and the SKU. These elements are identified and extracted with XPath selectors; XPath is a language used to select specific parts of an XML or HTML document. For example, to identify the product name the h1 element is targeted with the following selector: “//h1[@class='page-title']/span/text()”, which selects the text inside the “span” tag contained in the h1 tag with the “page-title” class (the “//” at the beginning means that matches anywhere in the page are considered). The same principle is used for the category and the SKU code.

Since each site has its own structure, tags and classes, you will probably need to adapt the selectors to correctly extract the product information. You can test an XPath selector using Chrome’s DevTools (Ctrl+Shift+I): select the “Elements” tab and enter the selector in the search bar (Ctrl+F); if it matches only and exactly the element you are trying to extract, the selector is correct.
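You can also test a selector directly in Python with parsel, the selector library used by Scrapy itself (it is installed together with Scrapy). Here is a minimal sketch, using a made-up HTML fragment that mimics the product page structure assumed above:

<code>
from parsel import Selector

# Hypothetical fragment of a product page, used only to verify the XPath locally
html = '''
<h1 class="page-title"><span>White T-shirt</span></h1>
<div class="product" data-product-sku="TS-001-WHT"></div>
'''

sel = Selector(text=html)
print(sel.xpath("//h1[@class='page-title']/span/text()").get())  # White T-shirt
print(sel.xpath("//*[@data-product-sku]").css("::attr(data-product-sku)").get())  # TS-001-WHT
</code>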

The “run_spider()” function actually starts the spider, and at the end of the crawl a JSON file will be produced that will contain the collected product information. This will be useful in the next step, which is data cleaning.
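For reference, the products.json feed generated by Scrapy is a JSON array with one object per crawled page; a hypothetical excerpt (with made-up values, including a non-product page where the selectors matched nothing) might look like this:

<code>
[
{"item_name": "White T-shirt", "item_cat1": "T-shirts", "sku": "TS-001-WHT"},
{"item_name": null, "item_cat1": null, "sku": null},
{"item_name": "White T-shirt", "item_cat1": "T-shirts", "sku": "TS-001-WHT"}
]
</code>

The empty and duplicate rows are exactly what the cleaning step below takes care of.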

Step 4: Cleaning the collected data with Pandas

After collecting the data with the previous script, we can proceed with cleaning and formatting the data into a CSV file, ready for import into GA4. You will need to add the following script:

<code>
df = pd.read_json('products.json')

# Removing rows that have empty item_name values (NaN) and eliminating duplicate SKUs
df.dropna(axis=0, subset=['item_name'], inplace=True)
df.drop_duplicates(subset=['sku'], keep='first', inplace=True)

# For each row in the dataframe, remove extra spaces and line breaks within the "item_name" values
for i, row in df.iterrows():
    df.at[i, 'item_name'] = ' '.join(df.at[i, 'item_name'].splitlines()).replace('  ', ' ')

# Removing double quotes and replacing commas, which would interfere with the CSV format
df = df.replace('"', '', regex=True)
df['item_name'] = df['item_name'].str.replace(',', '.')

# Saving clean data
df.to_csv('ProductData.csv', index=False)
</code>

 

In this way, a Pandas dataframe, i.e. a table, is created from the JSON file, which contains the raw data ready to be cleaned.

A first correction is to eliminate the rows that have “NaN” as the “item_name” value (i.e. they are empty), since the spider will most likely also have visited pages that are not product pages and are of no interest to us. Another cleaning step is to eliminate rows that share the same SKU, in case the spider crawled the same product page for different countries: at the beginning only the sitemaps whose URL contains the language folder /en/ were included, but there could be English pages for different countries (such as “it/en”, “uk/en” or “us/en”), and duplicate products would be collected if the cleaning did not also consider this aspect. These operations ensure that the data in the dataframe is complete (no missing values in critical columns) and unique (no unwanted duplicates).

You can then check the information that has been collected: product name values may contain errors such as extra spaces, so the text strings in the “item_name” column are normalized by removing line breaks and replacing any double spaces with a single space.

Another precaution included in the code is the removal of double quotes (“) from the table values and the replacement of commas (,) with periods (.), because they could cause problems in the resulting CSV file (here called “ProductData.csv”), which uses commas to separate values. After a further check of the file, such as verifying that the SKUs are actually unique, you can proceed to import the data into GA4.
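As a small optional sketch of that further check, you can read the file just created back into pandas and verify that the SKUs are indeed unique before uploading it:

<code>
check = pd.read_csv('ProductData.csv')

# Stop with an explicit message if any SKU appears more than once
assert check['sku'].is_unique, 'Duplicate SKUs found in ProductData.csv'

# Quick look at the first rows before uploading to GA4
print(check.head())
</code>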

Step 5: Importing Product Data into GA4

With the data ready in CSV format, you can proceed with the import into GA4, creating a new data source through the following steps:

1 – Log in to GA4 and go to the “Admin” section of the specific property

2 – In the box titled “Data Collection and Editing”, select “Data Import”

3 – Click on the “Create Data Source” button

4 – Select “Item” as the data type, and “Manual CSV File Upload” as the import source

5 – Click on “Upload CSV”, select the file created by the Python script, then click on “Next”

6 – Here you must map the CSV fields to the dimensions used by GA4. The “id” field is selected by default and must be associated, via the drop-down menu next to it, with “sku”; likewise “Name” with “item_name” and “Category 1” with “item_cat1”. A preview of the values is also shown, so you can be sure you are matching the right CSV field to each GA4 dimension.

7 – Finally, click on “Import”

The import process will begin and will be completed once a green check icon appears under the “Status” heading. You can then proceed to check the data in the GA4 reports for your products to make sure they have been uploaded correctly. If everything went well, you will now be able to view the data in an aggregated and more organized way, such as having a single product name in a single language for each SKU.

Additional step: Periodic data update

The steps outlined so far allow for a one-time extraction of the data, but with a few more lines of code it is possible to set up a periodic extraction, in order to keep the GA4 data updated over time.

You will need to add to the imported libraries os, which provides a way to interface with the operating system, and shutil, which provides functions for managing high-level operations on files and directories, using this script:

<code>
import os, shutil
from datetime import datetime
</code>

 

Immediately after the library import lines, the following script should be added:

<code>
if os.path.exists('ProductDataNew.csv'):
    os.rename('ProductDataNew.csv', 'ProductDataOld.csv')

if os.path.exists('products.json'):
    os.remove('products.json')
</code>

 

The script checks whether the folder containing the .py file already has a file called “ProductDataNew.csv” (produced by a previous run, as we will see shortly), and if so renames it to “ProductDataOld.csv”. It also checks whether a file called “products.json” is present and, if so, deletes it. This cleans the folder of any files left over from a previous run of the application.

Another script to add right after the data cleansing and CSV file creation is the following:

<code>
if os.path.exists('ProductData.csv') and os.path.exists('ProductDataOld.csv'):
    dfCurr = pd.read_csv('ProductData.csv')
    dfPrev = pd.read_csv('ProductDataOld.csv')
    dfPrevRowCount = len(dfPrev)
    dfMerge = pd.concat([dfPrev, dfCurr])
    dfMerge.drop_duplicates(subset=['sku'], keep='first', inplace=True)
    dfMergeRowCount = len(dfMerge)
    numAddedProducts = dfMergeRowCount - dfPrevRowCount
    dfMerge.to_csv('ProductDataNew.csv', index=False)

    if numAddedProducts > 0:
        print(str(numAddedProducts) + ' new products were found and added to the file.')
        lastProductsAdded = dfMerge.tail(numAddedProducts)
        print(lastProductsAdded)
        print('Proceed to upload the file "ProductDataNew.csv" to GA4.')
    else:
        print('No new products found. It is not necessary to upload the "ProductDataNew.csv" file to GA4.')

    os.remove('ProductDataOld.csv')
    os.remove('ProductData.csv')
else:
    # First run: there is no previous file to merge, so the fresh CSV simply becomes the new file
    os.rename('ProductData.csv', 'ProductDataNew.csv')

if os.path.exists('SiteName_History-Data-Products/') == False:
    os.mkdir('SiteName_History-Data-Products')

shutil.copy2('./ProductDataNew.csv', './SiteName_History-Data-Products')
currentMonth = datetime.now().month
currentYear = datetime.now().year
currentDay = datetime.now().day
os.rename('SiteName_History-Data-Products/ProductDataNew.csv', 'SiteName_History-Data-Products/SiteName-ProductInfo_' + str(currentYear) + '-' + str(currentMonth) + '-' + str(currentDay) + '.csv')
</code>

 

This last script ensures that, if there is a file generated by a previous run of the application (“ProductDataOld.csv”), it is merged with the new CSV just created, producing a third file in which duplicate values are eliminated and in which any additions are detected. If new products were found, a message reports them; otherwise a message warns that it is not necessary to upload the new CSV file to GA4 (on the very first run, when there is no previous file yet, the freshly created CSV is simply kept as the new file). The new file is then also copied to a folder called “SiteName_History-Data-Products”, with the day, month and year of creation included in its name, so as to have a backup in case problems arise in the future.

In this way, the script can be run periodically to keep the product list updated over time, and therefore always obtain GA4 reports that are organized and easy to consult, even after new products are introduced in the e-commerce site in question.
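As an example of scheduling, on macOS/Linux the script could be run automatically with cron; the paths below are purely illustrative and assume the virtual environment created earlier (on Windows, the same can be achieved with Task Scheduler):

<code>
# Run the crawler every Monday at 6:00, using the Python interpreter of the virtual environment
0 6 * * 1 /home/user/ga4-crawler/bin/python /home/user/scripts/product_crawler.py
</code>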

28 August 2024 Gabriele Malacasa


TAG: digital marketing