One HTML-only cron + one everything-but-HTML cron #131
Comments
TODOIf we do this,
Anything else? |
It's been mentioned previously, one downside of this sort of approach is:
This has been partially addressed by adding a 404 that says "The archive you're trying to download has not been built yet. Please try again later or consult the archives for earlier versions." Perhaps we could also mitigate this by renaming the files so they only have x.y in the filename and not x.y.z? For example, on the day we release 3.13.0, instead of getting a 404 page for the 3.13.0 PDF, users would get the 3.13.0rc2 PDF. I think this is fine. There's usually not much that's changed, and the benefit is that everyone gets the HTML and PDF files sooner (all the time, not just for releases). |
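To make the renaming idea concrete, here is a minimal sketch that maps an x.y.z archive name to its x.y equivalent. It assumes the archives are named along the lines of `python-<version>-docs-<format>.<ext>`; that pattern and the `branch_filename` helper are assumptions for illustration, not the build scripts' actual code.

```python
import re

# Assumed naming pattern, e.g. "python-3.13.0rc2-docs-pdf-a4.zip".
# The optional groups cover a micro version (".0") and a pre-release
# suffix ("a1", "b2", "rc2").
_VERSIONED = re.compile(
    r"^python-(?P<major>\d+)\.(?P<minor>\d+)"
    r"(?:\.\d+)?(?:(?:a|b|rc)\d+)?"
    r"(?P<rest>-docs-.+)$"
)


def branch_filename(name: str) -> str:
    """Return the x.y form of an archive name, or the name unchanged."""
    match = _VERSIONED.match(name)
    if match is None:
        return name
    return f"python-{match['major']}.{match['minor']}{match['rest']}"


print(branch_filename("python-3.13.0rc2-docs-pdf-a4.zip"))
print(branch_filename("python-3.13.0-docs-pdf-a4.zip"))
```

Both calls give the same x.y name, which is the point: the URL stays valid across the rc2 to final transition and simply serves whichever build is newest.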
I seem to remember that the release process includes building the docs + PDF etc, or perhaps it used to. If this is still the case, can part of the release process be to upload a copy of that archive to the docs server as well as the release server? @ned-deily did this I think for rc2? Is this something we can formalise? A |
This subject is confusingly complex for various reasons (some due to attempts to provide compatibility with older end-of-life releases), so there's a good chance that some or all of what follows is wrong but, to the best of my knowledge, it works today like this. The release process currently does produce a quick build of the untranslated docs (HTML, PDF, et al.) for all rc and final releases (but not alpha and beta releases) built from the source release git tag (i.e. At the same time, the cron jobs do their things and (try to) produce their daily/3-hour builds on the docs server that are served under various python.org URLs. The actual file names of the download files do include the version number but are served under the branch-specific directories ( A partial look at a docs server branch directory shows this:
AFAICT, there is no reason to be keeping the older files in this directory and this would be solved/mitigated, along with the sync problem alluded to above, if the download file names were changed as suggested above by @hugovk. The trick, of course, is to eliminate or minimize any compatibility issues with user expectations and with the separate release archives produced by the RMs. To the other point:
I believe that what I did for 3.13.0rc2 was to add a temporary link on the Python Documentation by Version web page to the rc2 documentation produced by the release process; normally that page does not include links to pre-release versions. Maybe there was something else on the release page, too. |
See python/cpython#124489 to alter the build process. We'll need to consider what to do with the existing A |
In terms of splitting the build into HTML and non-HTML, we have a triumvirate of patches:
The first can be merged and the builds will carry on as they are now without any change. The second will start building HTML files twice and could potentially overwrite itself when copying to A |
Thanks, that looks good to me. The only potential issue I can think of is that there might be users/scripts out there that might be periodically expecting to directly download the current built artifacts using the old URL formats. I would guess that is not common and I don't think we've ever provided any guarantees about the URLs other than linking through the
... And the links that would break, besides the above-mentioned possible scripts, would be from previously downloaded copies of the HTML-format documentation (the only artifact where the download links appear?) and from embedded copies of the HTML documentation that are provided, for example, in the python.org macOS installer. Others? So, if we did add redirects from the x.y.z to the new x.y URLs for releases where we apply this change (and are still building docs), that should solve the problem for all of those cases, I think. If it would make the redirecting easier, we could perhaps do a one-time creation of x.y symlinks for EOL releases. |
If we want to add redirects, I've opened python/psf-salt#498 as a draft. A |
I think this brings up another related issue inspired by the above discussion and a comment in the PR:
That is, like the file names and corresponding URLs, the Python version displayed in the daily document HTML and downloads is also imprecise and potentially misleading. The daily builds currently show the |
An alternative is to go the other way with less precision, and advertise the daily downloads page as for "Python 3.12" (helpfully this is also easier to achieve). The static A |
To be interpreted as «Python 3.12 as that branch looks today» ? … seems good! |
On Monday (30 September 2024) we split the server into an HTML-only and a non-HTML cron task, the former scheduled hourly and the latter daily. After some initial teething problems, we've had a successful full rebuild of the non-HTML job, hence this note. First, two tables of statistics with build times and durations: Build times (HTML only)
Build times (no HTML)
Taking the most recent 16 rebuilds for HTML-only:
These have an average (mean) time of 8344s, or 2h 19m 4s. Excluding the builds for which no work was done (5), we have 11 builds with an average of 3h 32m 3s. These numbers are significantly skewed by the Chinese languages, which take more than an hour each. Excluding the Chinese, we have 109 HTML-only builds at an average (mean) time of 4m 15s.
We haven't yet observed a full rebuild for all versions and languages, which is to be expected: a benefit of splitting the workers is that the HTML job will more frequently have no work to do. The expected time for a full rebuild of all 13 languages and 3 versions is (11 x 4m15s + 2 x 66m) x 3, or just under nine hours (8:56:15). This is a significant improvement on the status quo ante (c. 30 hours per rebuild), and will further improve dramatically when we resolve the issue with Chinese builds.
The non-HTML archive builds are currently scheduled daily, and a basic projection estimates that a full rebuild of 12 languages x 3 versions would take just under 19 hours, so we have headroom here (though faster is of course better!).
Thank you to everyone involved in making this work happen, I'll now close this issue. A |
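For anyone who wants to re-run the HTML-only projection with different inputs, here is a small sketch of the arithmetic. The figures are the ones quoted above; the function and parameter names are made up for illustration.

```python
from datetime import timedelta


def projected_full_rebuild(fast_langs, fast_each, slow_langs, slow_each, versions):
    """Projected wall time if every language of every version is rebuilt once."""
    per_version = fast_langs * fast_each + slow_langs * slow_each
    return per_version * versions


html_only = projected_full_rebuild(
    fast_langs=11,                                # non-Chinese languages
    fast_each=timedelta(minutes=4, seconds=15),   # mean HTML-only build
    slow_langs=2,                                 # zh-cn and zh-tw
    slow_each=timedelta(minutes=66),
    versions=3,
)
print(html_only)  # 8:56:15
```

Changing `slow_each` shows how much of the total depends on the Chinese builds: at the 4m 15s average, the same call would give 13 x 4m15s x 3, well under three hours.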
And thank you @AA-Turner for all your work here! Another idea to consider: a third cron for only the English HTML for the default But I agree, let's first try and figure out why the Chinese builds are so slow. By language or version:
By language and version:
|
To help python/docsbuild-scripts#169.
Current situation
Right now the docs server is taking over 40 hours to build a full set of 3.12-3.14 docs, plus 12 translations each:
List of versions/languages
3.14/zh-tw
3.14/zh-cn
3.14/uk
3.14/tr
3.14/pt-br
3.14/pl
3.14/ko
3.14/ja
3.14/it
3.14/id
3.14/fr
3.14/es
3.14/en
3.13/zh-tw
3.13/zh-cn
3.13/uk
3.13/tr
3.13/pt-br
3.13/pl
3.13/ko
3.13/ja
3.13/it
3.13/id
3.13/fr
3.13/es
3.13/en
3.12/zh-tw
3.12/zh-cn
3.12/uk
3.12/tr
3.12/pt-br
3.12/pl
3.12/ko
3.12/ja
3.12/it
3.12/id
3.12/fr
3.12/es
3.12/en
Nearly all of these include HTML, plain text, PDF, Texinfo and EPUB (Ukrainian is HTML only). HTML-only is fast to build, about 3-4 minutes. The full set of artifacts is much slower to build, between 40 minutes and two hours depending on the language, mostly due to building LaTeX for the PDFs.
What happens is:
A cron goes off at 7 minutes past the hour and starts a new full build loop.
For each language/version, we only do a build if the docs have changed since last time, or if the translation has changed since last time. This is good, there's no point rebuilding something that hasn't changed.
However, because the full loop takes over 40 hours, inevitably there have been docs or translation changes since the last time, and we get a full rebuild each time.
This results in long delays between docs being updated, not to mention the high server resources usage.
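The skip logic described above can be sketched roughly as follows. This is an illustration only, assuming each version/language pair can be summarised by a docs revision and a translation revision; `BuildState`, `needs_build` and `run_loop` are hypothetical names, not the real docsbuild-scripts code.

```python
from dataclasses import dataclass, field


@dataclass
class BuildState:
    # Maps (version, language) -> (docs_rev, translation_rev) of the last build.
    last_built: dict = field(default_factory=dict)


def needs_build(state, version, language, docs_rev, translation_rev):
    """Rebuild only if the docs or the translation changed since last time."""
    return state.last_built.get((version, language)) != (docs_rev, translation_rev)


def run_loop(state, targets, current_revs, build):
    """One pass over all targets; returns the targets that were rebuilt."""
    built = []
    for version, language in targets:
        docs_rev, translation_rev = current_revs(version, language)
        if needs_build(state, version, language, docs_rev, translation_rev):
            build(version, language)
            state.last_built[(version, language)] = (docs_rev, translation_rev)
            built.append((version, language))
    return built
```

With a fast loop, a second pass over unchanged inputs returns an empty list. With a loop that takes 40 hours, almost every revision has moved on by the next pass, so the check rarely skips anything.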
HTML vs. PDF
We have download stats for the HTML docs, but we don't have download numbers for the other artifacts to compare.
However, I'm certain the HTML is by far the most used, and there's the most benefit to getting fresh HTML up quickly.
An affordance of websites is being able to look up just the pages you need, on demand. Compare this with a PDF, which you download once and use as an offline reference. Maybe you'll re-download it later, but there's less benefit in updating often, as the copy you usually consult is an old, offline one.
Proposal
I suggest we have two cron jobs:
The current hourly job only builds HTML.
A new job builds everything else except HTML.
1. HTML only
When there are new changes, they will be built and uploaded much sooner. It will run much quicker.
It's more likely that on the next pass, some languages can be skipped because there's nothing to update this time round.
2. Everything but HTML
This will be much slower than the HTML-only job, and will take about the same time as the current loop does now.
Maybe it'll be a bit quicker due to not needing to build HTML, but maybe a bit slower because we'll sometimes be using CPU to build HTML at the same time. However, the majority of the time is spent running a LaTeX command on a single CPU, so it might not make much difference.
We also don't need to update the non-HTML as often, so its cron could be every few days?
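As a rough sketch of how the two jobs would divide the work: the format names come from the list above, while the job names, `formats_for` and the `HTML_ONLY_LANGUAGES` set are hypothetical and only there to illustrate the split.

```python
ALL_FORMATS = ("html", "text", "pdf", "texinfo", "epub")

# Per the description above, Ukrainian is only built as HTML.
HTML_ONLY_LANGUAGES = {"uk"}


def formats_for(job: str, language: str) -> tuple:
    """Return the output formats a given job builds for a language."""
    if job == "html-only":
        return ("html",)
    if job == "no-html":
        if language in HTML_ONLY_LANGUAGES:
            return ()  # nothing for the slow job to do
        return tuple(f for f in ALL_FORMATS if f != "html")
    raise ValueError(f"unknown job: {job!r}")


print(formats_for("html-only", "fr"))  # ('html',)
print(formats_for("no-html", "fr"))    # ('text', 'pdf', 'texinfo', 'epub')
```

Between them the two jobs cover every format exactly once, so nothing is built twice and nothing is dropped.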