Posted: May 27, 2014

Tags: MySQL, MODX, Performance

Generating a Google Sitemap with an SQL Query by Garry Nutting, Analyzed

A Google sitemap, or more generally a sitemap.xml, is something every site should have. The current sitemap protocol is version 0.9, which is supported by Google, Microsoft and Yahoo. In essence it is an XML file of URLs, each wrapped in a `<url>` node together with metadata such as `<lastmod>`, `<changefreq>` and `<priority>`, so that search engines can crawl the site more intelligently.

This article is inspired by Garry Nutting's blog post about `Generating a Google Sitemap when you have thousands of resources`. Kudos to him for the great solution.

Garry found a terrific way of building a Google sitemap for MODX by skipping xPDO entirely, that is, bypassing the ORM layer altogether to cut memory consumption. This is not to say that xPDO objects are slow or heavy, but simply holding thousands of objects in memory will terminate the script prematurely. On top of that, the result set would need heavy iteration to be turned into XML, whereas the query returns nodes ready for use.

Why write this article then? Reading the query reveals an anti-pattern in the SELECT clause: priority and frequency are fetched with correlated subqueries instead of joins (see Wikipedia for more on correlated subqueries). When the result set is in the thousands, or even tens of thousands, this is not really a concern; the query runs reasonably fast depending on the hardware and on whether the tables fit in memory. Should the snippet still be used? A definite YES! In this article I will demonstrate how the query can be turned into a faster one and show the numbers related to performance.
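
As a generic illustration (hypothetical parent/child tables with at most one child row per parent, not the actual MODX schema): the correlated form re-evaluates the inner SELECT once per outer row, while the join form lets the optimizer resolve both tables in a single execution plan.

-- Correlated form: the inner SELECT depends on p.id and runs once per row of parent.
SELECT p.id,
       (SELECT c.value FROM child c WHERE c.parent_id = p.id) AS value
FROM parent p;

-- Join form: the same result, but both tables are handled in one plan.
SELECT p.id, c.value
FROM parent p
LEFT JOIN child c ON c.parent_id = p.id;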

The Test Bench

The tests were run on dedicated hardware: Intel Xeon E3-1225 v2 @ 3.2 GHz, 32 GB of memory, 2x 2 TB SATA in RAID 0. The test set contains 1,500,485 rows in the content table and 3,000,970 rows in the template variable table (2 rows per content row). The actual data set in modx_site_content reaches almost 6 GB with indexes, but since the content field does not need to be loaded everything stays fairly fast. The MySQL platform was Percona Server 5.6.15, with the whole data set fitting into memory.
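
One note about the setup: the queries below hint at an index named tv_cnt on the template variable table, which is not part of a stock MODX schema. Judging from the EXPLAIN output further down (key_len of 8, ref of const plus s.id), it is presumably a composite index over the template variable id and the content id, created with something along these lines:

ALTER TABLE modx_site_tmplvar_contentvalues
    ADD INDEX tv_cnt (tmplvarid, contentid);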

Original Query Slightly Modified

The original query builds friendly URLs from the `uri` field of the content table. My data set does not have usable values there, because it was populated from Dante's Divine Comedy, which would have led to multiple matching URIs. So I switched to the non-friendly URL form with ?id=. The editedon field is also changed to createdon. These are minor changes and do not affect the query much; if anything they make it slightly faster in overall throughput, as a uri would be longer than an id.

The Query

SELECT 
GROUP_CONCAT(
'<url>',
    CONCAT('<loc>http://example.com?id=',s.id,'</loc>'),
    CONCAT('<lastmod>',FROM_UNIXTIME(s.createdon, '%Y-%m-%d'),'</lastmod>'),
    IFNULL(
    CONCAT('<priority>',(
        -- priority template variable (tmplvarid = 1)
        SELECT value
        FROM modx_site_tmplvar_contentvalues
        USE INDEX (tv_cnt)
        WHERE contentid=s.id AND tmplvarid=1
        ),'</priority>'),''),
    IFNULL(
    CONCAT('<changefreq>',(
        -- frequency template variable (tmplvarid = 2)
        SELECT value
        FROM modx_site_tmplvar_contentvalues
        USE INDEX (tv_cnt)
        WHERE contentid=s.id AND tmplvarid=2
        ), '</changefreq>'), ''),
    '</url>'
SEPARATOR ''
) AS node
FROM modx_site_content AS s
WHERE s.deleted = 0 AND s.published = 1 AND s.searchable = 1 AND context_key='web'
GROUP BY s.id
ORDER BY s.id ASC 

Result Set

1.5 million rows later 
| <url><loc>http://example.com?id=1500483</loc><lastmod>2014-01-11</lastmod><priority>0.82</priority><changefreq>hourly</changefreq></url> | 
| <url><loc>http://example.com?id=1500484</loc><lastmod>2014-01-11</lastmod><priority>0.09</priority><changefreq>always</changefreq></url> | 
| <url><loc>http://example.com?id=1500485</loc><lastmod>2014-01-11</lastmod><priority>0.66</priority><changefreq>weekly</changefreq></url> | 
+--------------------------------------------------------------------------------------------------------------------------------------------+ 
1500485 rows in set (26.90 sec)

Explain

*************************** 1. row *************************** 
           id: 1 
  select_type: PRIMARY 
        table: s 
         type: ref 
possible_keys: PRIMARY,published,searchable,context_key,cache_refresh_idx 
          key: context_key 
      key_len: 302 
          ref: const 
         rows: 1500468 
        Extra: Using index condition; Using where; Using filesort 
*************************** 2. row *************************** 
           id: 3 
  select_type: DEPENDENT SUBQUERY 
        table: modx_site_tmplvar_contentvalues 
         type: ref 
possible_keys: tv_cnt 
          key: tv_cnt 
      key_len: 8 
          ref: const,luggage.s.id 
         rows: 1 
        Extra: Using index condition 
*************************** 3. row *************************** 
           id: 2 
  select_type: DEPENDENT SUBQUERY 
        table: modx_site_tmplvar_contentvalues 
         type: ref 
possible_keys: tv_cnt 
          key: tv_cnt 
      key_len: 8 
          ref: const,luggage.s.id 
         rows: 1 
        Extra: Using index condition 
3 rows in set (0.00 sec)

Nothing alarming really shows up in the result time or the explain. The explain shows what is to be expected: ~1.5M rows are fetched, and for each of them two extra rows are looked up from the second table. The problem becomes visible when profiling the query.

Query Profile
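
The profile below is trimmed to its tail end. It was presumably captured with MySQL's built-in profiling facility, which in its simplest form works roughly like this:

SET profiling = 1;
-- ... run the sitemap query ...
SHOW PROFILES;                -- lists the profiled queries with their ids
SHOW PROFILE FOR QUERY 1;     -- per-step status/duration breakdown for query id 1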

| executing          | 0.000001 | 
| Sending data       | 0.000006 | 
| executing          | 0.000001 | 
| Sending data       | 0.000007 | 
| executing          | 0.000001 | 
| Sending data       | 0.000006 | 
| executing          | 0.000001 | 
| Sending data       | 0.000008 | 
| executing          | 0.000001 | 
| Sending data       | 0.000006 | 
| executing          | 0.000001 | 
| Sending data       | 0.000007 | 
| executing          | 0.000001 | 
| Sending data       | 0.000006 | 
| executing          | 0.000001 | 
| Sending data       | 0.044913 | 
| end                | 0.000007 | 
| removing tmp table | 0.000081 | 
| end                | 0.000004 | 
| query end          | 0.000003 | 
| closing tables     | 0.000012 | 
| freeing items      | 0.000047 | 
| logging slow query | 0.000001 | 
| logging slow query | 0.000133 | 
| cleaning up        | 0.000020 | 
+--------------------+----------+

You can see the repeating executing / Sending data pairs. The 1 microsecond and 6-8 microsecond entries form a pattern of pairs that repeats for each of the 1,500,485 rows; they are the subqueries for priority and frequency, run once each for every row fetched. A rough calculation of 1.5M x 0.000006 s adds up to about 9 seconds just for fetching the frequency, and roughly the same again for the priority.

The New Query

Correlated subqueries can often be avoided, and the typical way around them is a join. Since the value could be NULL for frequency or priority, the query needs LEFT JOINs to work. That is not optimal, but neither is the near full table scan the query amounts to in this case, where no documents are deleted or unpublished.

The Query

SELECT 
    CONCAT("<url>",
        "<loc>http://example.com?id=",s.id,"</loc>",
        "<lastmod>",FROM_UNIXTIME(s.createdon, '%Y-%m-%d'),"</lastmod>",
        "<priority>",COALESCE(p.value, ""),"</priority>",
        "<changefreq>",COALESCE(f.value, ""),"</changefreq>",
    "</url>") AS node
FROM modx_site_content s
LEFT JOIN modx_site_tmplvar_contentvalues p   -- priority template variable
    ON p.contentid = s.id AND p.tmplvarid = 1
LEFT JOIN modx_site_tmplvar_contentvalues f   -- frequency template variable
    ON f.contentid = s.id AND f.tmplvarid = 2
WHERE s.context_key='web' 
    AND s.deleted = 0 
    AND s.published = 1 
    AND s.searchable = 1
ORDER BY s.id ASC;

Results

1.5 million rows later 
| <url><loc>http://example.com?id=1500483</loc><lastmod>2014-01-11</lastmod><priority>0.82</priority><changefreq>hourly</changefreq></url> | 
| <url><loc>http://example.com?id=1500484</loc><lastmod>2014-01-11</lastmod><priority>0.09</priority><changefreq>always</changefreq></url> | 
| <url><loc>http://example.com?id=1500485</loc><lastmod>2014-01-11</lastmod><priority>0.66</priority><changefreq>weekly</changefreq></url> | 
+-------------------------------------------------------------------------------------------------------------------------------------------+
1500485 rows in set (14.33 sec)

A great improvement in speed: the time is almost halved compared to the original. The reasons are straightforward: the joins are more efficient than the correlated subqueries, and since the whole result row is now a single concatenated string there is no need for GROUP BY at all.

Explain

*************************** 1. row *************************** 
           id: 1 
  select_type: SIMPLE 
        table: s 
         type: ref 
possible_keys: published,searchable,context_key 
          key: context_key 
      key_len: 302 
          ref: const 
         rows: 1500468 
        Extra: Using index condition; Using where; Using filesort 
*************************** 2. row *************************** 
           id: 1 
  select_type: SIMPLE 
        table: p 
         type: ref 
possible_keys: tmplvarid,contentid,tv_cnt 
          key: tv_cnt 
      key_len: 8 
          ref: const,luggage.s.id 
         rows: 1 
        Extra: Using where 
*************************** 3. row *************************** 
           id: 1 
  select_type: SIMPLE 
        table: f 
         type: ref 
possible_keys: tmplvarid,contentid,tv_cnt 
          key: tv_cnt 
      key_len: 8 
          ref: const,luggage.s.id 
         rows: 1 
        Extra: Using where

The explain is pretty much the same as for the original query, except that the joins show `Using where` instead of `Using index condition`. Index condition pushdown is not a bad thing in itself; it greatly helps the original query, but it is only available in MySQL 5.6 and up. What really changes is the profile shown below.

+----------------------+-----------+ 
| Status               | Duration  | 
+----------------------+-----------+ 
| starting             |  0.000116 | 
| checking permissions |  0.000004 | 
| checking permissions |  0.000001 | 
| checking permissions |  0.000003 | 
| Opening tables       |  0.000021 | 
| init                 |  0.000053 | 
| System lock          |  0.000009 | 
| optimizing           |  0.000023 | 
| statistics           |  0.000203 | 
| preparing            |  0.000038 | 
| Sorting result       |  0.000003 | 
| executing            |  0.000002 | 
| Sending data         |  0.000006 | 
| Creating sort index  | 14.327392 | 
| end                  |  0.000008 | 
| query end            |  0.000003 | 
| closing tables       |  0.000014 | 
| freeing items        |  0.000031 | 
| logging slow query   |  0.000001 | 
| logging slow query   |  0.000063 | 
| cleaning up          |  0.000048 | 
+----------------------+-----------+ 

Pretty much the whole time was spent filling the temporary table and sorting it. What if the result were not sorted? Does Google really care about the order? I would guess not, as the XML is not structured in terms of parent-child relations. Here is the same query run without sorting.

Query Redefined
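
The redefined query is simply the previous join query with the final ORDER BY clause dropped:

SELECT 
    CONCAT("<url>",
        "<loc>http://example.com?id=",s.id,"</loc>",
        "<lastmod>",FROM_UNIXTIME(s.createdon, '%Y-%m-%d'),"</lastmod>",
        "<priority>",COALESCE(p.value, ""),"</priority>",
        "<changefreq>",COALESCE(f.value, ""),"</changefreq>",
    "</url>") AS node
FROM modx_site_content s
LEFT JOIN modx_site_tmplvar_contentvalues p 
    ON p.contentid = s.id AND p.tmplvarid = 1
LEFT JOIN modx_site_tmplvar_contentvalues f 
    ON f.contentid = s.id AND f.tmplvarid = 2
WHERE s.context_key='web' 
    AND s.deleted = 0 
    AND s.published = 1 
    AND s.searchable = 1;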

1.5 million rows later 
| <url><loc>http://example.com?id=1500483</loc><lastmod>2014-01-11</lastmod><priority>0.82</priority><changefreq>hourly</changefreq></url> | 
| <url><loc>http://example.com?id=1500484</loc><lastmod>2014-01-11</lastmod><priority>0.09</priority><changefreq>always</changefreq></url> | 
| <url><loc>http://example.com?id=1500485</loc><lastmod>2014-01-11</lastmod><priority>0.66</priority><changefreq>weekly</changefreq></url> | 
+-------------------------------------------------------------------------------------------------------------------------------------------+ 
1500485 rows in set (13.28 sec)

Just a slight improvement in terms of speed. The explain does look a lot brighter though.

Explain for Query Redefined

*************************** 1. row *************************** 
           id: 1 
  select_type: SIMPLE 
        table: s 
         type: ref 
possible_keys: published,searchable,context_key 
          key: context_key 
      key_len: 302 
          ref: const 
         rows: 1500468 
        Extra: Using index condition; Using where 
*************************** 2. row *************************** 
           id: 1 
  select_type: SIMPLE 
        table: p 
         type: ref 
possible_keys: tmplvarid,contentid,tv_cnt 
          key: tv_cnt 
      key_len: 8 
          ref: const,luggage.s.id 
         rows: 1 
        Extra: Using where 
*************************** 3. row *************************** 
           id: 1 
  select_type: SIMPLE 
        table: f 
         type: ref 
possible_keys: tmplvarid,contentid,tv_cnt 
          key: tv_cnt 
      key_len: 8 
          ref: const,luggage.s.id 
         rows: 1 
        Extra: Using where

Note the missing `filesort` in the first row. This is a big advantage when the result does not fit in memory, because the whole sorting operation would otherwise be done through the file system, which we all know is a lot slower than memory. The profile also backs this approach up.
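
As a side note (a general observation, not something measured here), whether a filesort can stay in memory is governed by the session's sort buffer, so the relevant threshold can be inspected with:

SHOW VARIABLES LIKE 'sort_buffer_size';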

Profile for Query Redefined

+----------------------+-----------+ 
| Status               | Duration  | 
+----------------------+-----------+ 
| starting             |  0.000110 | 
| checking permissions |  0.000003 | 
| checking permissions |  0.000001 | 
| checking permissions |  0.000003 | 
| Opening tables       |  0.000021 | 
| init                 |  0.000051 | 
| System lock          |  0.000010 | 
| optimizing           |  0.000023 | 
| statistics           |  0.000202 | 
| preparing            |  0.000030 | 
| executing            |  0.000002 | 
| Sending data         | 13.287025 | 
| end                  |  0.000006 | 
| query end            |  0.000002 | 
| closing tables       |  0.000012 | 
| freeing items        |  0.000016 | 
| logging slow query   |  0.000001 | 
| logging slow query   |  0.000054 | 
| cleaning up          |  0.000037 | 
+----------------------+-----------+

Pretty much all of the time is spent sending data. That is the desired outcome, even if the data does not completely fit into memory.

Closure

Firstly, the experiment is not completely true to real-world use: by Google's guidelines a single sitemap file may contain at most 50,000 URLs and be at most 50 MB in size uncompressed.
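
Splitting the output into compliant files is outside the scope of the measurements above, but as a rough, hypothetical sketch the join query could be paged with LIMIT/OFFSET, advancing the offset by 50,000 for each sitemap file. Note that a deterministic ORDER BY has to come back for the paging to be stable:

SELECT 
    CONCAT("<url>",
        "<loc>http://example.com?id=",s.id,"</loc>",
        "<lastmod>",FROM_UNIXTIME(s.createdon, '%Y-%m-%d'),"</lastmod>",
        "<priority>",COALESCE(p.value, ""),"</priority>",
        "<changefreq>",COALESCE(f.value, ""),"</changefreq>",
    "</url>") AS node
FROM modx_site_content s
LEFT JOIN modx_site_tmplvar_contentvalues p 
    ON p.contentid = s.id AND p.tmplvarid = 1
LEFT JOIN modx_site_tmplvar_contentvalues f 
    ON f.contentid = s.id AND f.tmplvarid = 2
WHERE s.context_key='web' 
    AND s.deleted = 0 
    AND s.published = 1 
    AND s.searchable = 1
ORDER BY s.id ASC
LIMIT 50000 OFFSET 0;  -- next files: OFFSET 50000, OFFSET 100000, and so on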

This small experiment was meant to show how avoiding correlated subqueries can improve performance. They are handy when working through small data sets, where the row counts do not exceed the hundreds of thousands. Beyond that they can slowly build into a bottleneck in application performance, once result sets grow to vast numbers of rows or the system cannot keep up. That kind of scenario is reached on shared hosting platforms quite quickly.
