forked from catchpoint/WebPageTest.agent
More corrupted values in technology detection data #29
pmeenan added a commit that referenced this issue on Dec 10, 2024
Here is a cleanup query:

DECLARE crawl_month DATE DEFAULT DATE('2024-11-01');

CREATE TEMP TABLE technologies_cleaned AS (
  WITH wappalyzer AS (
    SELECT
      name AS technology,
      category
    FROM wappalyzer.apps,
      UNNEST(categories) AS category
  ), pages AS (
    SELECT
      date,
      client,
      page,
      technologies
    FROM crawl.pages
    WHERE date = crawl_month
  ), impacted_pages AS (
    SELECT DISTINCT
      date,
      client,
      page
    FROM pages,
      UNNEST(technologies) AS tech,
      UNNEST(tech.categories) AS category
    LEFT JOIN wappalyzer
    USING (technology, category)
    WHERE wappalyzer.category IS NULL OR
      wappalyzer.technology IS NULL
  ), flattened_technologies AS (
    SELECT
      date,
      client,
      page,
      technology,
      category,
      info
    FROM pages,
      UNNEST(technologies) AS tech,
      UNNEST(tech.categories) AS category
    WHERE page IN (SELECT DISTINCT page FROM impacted_pages)
  ), whitelisted_technologies AS (
    SELECT
      date,
      client,
      page,
      f.technology,
      f.category,
      f.info
    FROM flattened_technologies f
    INNER JOIN wappalyzer
    USING (technology, category)
  ), reconstructed_technologies AS (
    SELECT
      date,
      client,
      page,
      ARRAY_AGG(STRUCT(
        technology,
        categories,
        info
      )) AS technologies
    FROM (
      SELECT
        date,
        client,
        page,
        technology,
        ARRAY_AGG(DISTINCT category IGNORE NULLS) AS categories,
        info
      FROM whitelisted_technologies
      GROUP BY date, client, page, technology, info
    )
    GROUP BY date, client, page
  )
  SELECT
    date,
    client,
    page,
    r.technologies
  FROM impacted_pages
  LEFT JOIN reconstructed_technologies r
  USING (date, client, page)
);

UPDATE crawl.pages
SET technologies = technologies_cleaned.technologies
FROM technologies_cleaned
WHERE pages.date = crawl_month AND
  pages.client = technologies_cleaned.client AND
  pages.page = technologies_cleaned.page;
After switching to _detected_technologies, we still have a few pages with unexpected values:

WITH wappalyzer AS (
  SELECT DISTINCT
    name AS technology,
    category
  FROM wappalyzer.apps,
    UNNEST(categories) AS category
)
SELECT DISTINCT
  technologies,
  page
FROM crawl.pages AS pages,
  UNNEST(technologies) AS tech,
  UNNEST(tech.categories) AS category
LEFT JOIN wappalyzer
  ON tech.technology = wappalyzer.technology
  AND category = wappalyzer.category
WHERE date = '2024-12-01'
  AND (wappalyzer.category IS NULL OR wappalyzer.technology IS NULL)

Let's try to fix these in wptagent, to be sure there is no other unexpected impact on detections. At the same time, I think it makes sense to add a technologies cleanup step when copying to production tables.
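As a rough illustration of what such a cleanup step could look like outside BigQuery, here is a minimal Python sketch of the same allowlist idea the queries use: keep only (technology, category) pairs that exist in the Wappalyzer definitions. The function names, the dict layout, and the sample values are assumptions made for illustration; this is not wptagent's actual code or data format.

```python
from typing import Any


def build_allowlist(apps: list[dict[str, Any]]) -> set[tuple[str, str]]:
    """Flatten app definitions into valid (technology, category) pairs.

    Assumes each app looks like {"name": str, "categories": [str, ...]},
    mirroring the wappalyzer.apps table used in the queries.
    """
    return {
        (app["name"], category)
        for app in apps
        for category in app.get("categories", [])
    }


def _is_valid(name: Any, category: Any, allowlist: set[tuple[str, str]]) -> bool:
    # Corrupted values may not even be strings, so check types first.
    return (
        isinstance(name, str)
        and isinstance(category, str)
        and (name, category) in allowlist
    )


def is_impacted(technologies: list[dict[str, Any]], allowlist: set[tuple[str, str]]) -> bool:
    """True if any (technology, category) pair is not in the allowlist."""
    return any(
        not _is_valid(tech.get("technology"), category, allowlist)
        for tech in technologies
        for category in tech.get("categories") or []
    )


def clean_technologies(technologies: list[dict[str, Any]], allowlist: set[tuple[str, str]]) -> list[dict[str, Any]]:
    """Drop unknown pairs; drop technologies left with no valid category."""
    cleaned = []
    for tech in technologies:
        name = tech.get("technology")
        categories: list[str] = []
        for category in tech.get("categories") or []:
            if _is_valid(name, category, allowlist) and category not in categories:
                categories.append(category)
        if categories:
            cleaned.append(
                {"technology": name, "categories": categories, "info": tech.get("info", [])}
            )
    return cleaned


# Hypothetical sample data, not taken from the crawl.
apps = [
    {"name": "WordPress", "categories": ["CMS", "Blogs"]},
    {"name": "jQuery", "categories": ["JavaScript libraries"]},
]
allowlist = build_allowlist(apps)
page_technologies = [
    {"technology": "WordPress", "categories": ["CMS", "function () {}"], "info": ["6.4"]},
    {"technology": "toString", "categories": ["undefined"], "info": []},
    {"technology": "jQuery", "categories": ["JavaScript libraries"], "info": []},
]
if is_impacted(page_technologies, allowlist):
    page_technologies = clean_technologies(page_technologies, allowlist)
```

One difference from the SQL version: a page whose technologies are all invalid ends up with an empty list here, while the LEFT JOIN in the cleanup query would leave it as NULL.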
Based on the corrupted data here is the list of pages with corrupted ca:
The detection seems to work fine. It looks like page context is messing with some built-in objects again.
Maybe we could avoid using any values that could be impacted by it.
A few cases:
undefined
One observation: in most of these cases only the values within detected_technologies have correct data (keys are also impacted). Maybe we should switch to it for the BigQuery data?
For example: