So I've been working on putting this together for the past few months, inspired by the efforts of JackGriffin and others to find ways of building resilient archives of Unreal/UT content. This is the solution I've come up with.
To start with, it may look pretty bland, being effectively just a static list of files. There are no forums (no need for more forums when this one exists!), no comments sections on each piece of content, no user ratings, etc. I found all these things incompatible with the goal of building a stable and long-lived content archive. The point was also not to make the fanciest website in the world, but rather one that's portable and somewhat time-proof.
Rather than me simply hosting all content myself, you'll note that each download has multiple mirrors. The idea behind this is to provide some form of resiliency, so if one goes down, hopefully some of the others are still up. I still need to implement functionality to cull dead links.
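To make the multi-mirror idea concrete, here is a minimal sketch in Python of trying each mirror in turn and noting the ones that failed. It is not taken from the project's own (Java) code; `first_working_mirror`, `http_fetch`, and the example URLs are hypothetical names used only for illustration.

```python
import urllib.request
from typing import Callable, Iterable, List, Optional, Tuple


def http_fetch(url: str, timeout: float = 10.0) -> bytes:
    """Download a URL and return its body (raises OSError subclasses on failure)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def first_working_mirror(
    mirrors: Iterable[str],
    fetch: Callable[[str], bytes] = http_fetch,
) -> Tuple[Optional[str], Optional[bytes], List[str]]:
    """Try mirrors in order; return (url, data, dead) for the first success.

    `dead` lists every mirror that failed before the successful one, which is
    the kind of information a dead-link cleanup could later act on.
    Returns (None, None, dead) if no mirror responded.
    """
    dead: List[str] = []
    for url in mirrors:
        try:
            return url, fetch(url), dead
        except OSError:
            # URLError, HTTPError and socket timeouts are all OSError subclasses.
            dead.append(url)
    return None, None, dead
```

A caller would pass the list of mirror URLs for one download. A single failure doesn't prove a link is permanently dead, so culling would probably want several failed checks over time before removing anything.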
A significant goal of the project is also to ensure the content is widely distributed, so even if all the mirrors go down, there's a reasonable chance someone has the files locally somewhere. As such, built-in tooling makes it simple to create a mirror of all content in the archive and to keep that mirror up to date. You may even generate the website itself and use it locally or re-host it.
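As a rough illustration of what "updating a mirror" can mean in practice, the Python sketch below works out which files a local mirror is still missing or holds in a damaged state. It is only a sketch under assumptions: the index layout (file name mapped to a SHA-1 hex digest) and the names `sha1_of` and `files_to_fetch` are hypothetical, not the actual tooling's format or API.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha1_of(path: Path) -> str:
    """Return the SHA-1 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha1()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_to_fetch(index: Dict[str, str], mirror_dir: Path) -> List[str]:
    """List index entries that are absent locally or whose hash does not match.

    `index` maps a relative file name to its expected SHA-1 hex digest
    (an assumed layout, for illustration only).
    """
    pending = []
    for name, expected in sorted(index.items()):
        local = mirror_dir / name
        if not local.is_file() or sha1_of(local) != expected:
            pending.append(name)
    return pending
```

Running this against an existing mirror returns only the entries that need downloading, so a repeat sync doesn't have to transfer the whole archive again.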
Here are some of the things I've done and decisions I've made to achieve these goals:
- The whole thing is open source and hosted on GitHub (https://github.com/unreal-archive)
- The source code and metadata are unlicensed, meaning I do not hold copyright or ownership over it, and it effectively belongs to the world/community
- The main project itself is written in Java, primarily because I'm familiar with it, but it's also proven to be extremely backward compatible, and I don't see the JVM going away any time soon.
- The metadata is hosted in a Git repository, in YAML format. I chose this over a database, since it's plain text and in the future anyone can use it without any drivers, codecs, connectors, or whatever else. Being version controlled in Git means it's easy for anyone to clone and host in remote repositories, as well as on their own machines.
- The website itself is statically generated, meaning it doesn't rely on a hosting environment with PHP, Java, ASP, or other magical stuff. It can be dumped in a directory on any web server and work, and I've also been careful to ensure it works with local file:// paths, so if you want to use the site for personal use without web hosting, you can do that too.
- To encourage "replication" of the data, the tooling provides the ability to effectively "clone" the entire data store. I want people to download everything; the more copies of stuff we have, the better (over time, this leads to a bit of fragmentation, but I'm OK with that). A problem I have with most of the current archives is that mirroring is actively discouraged, which I don't feel is a particularly healthy approach. If you want to crawl and scrape the website as-is, that's fine, but ideally you should use the tooling.
- The project is not, and cannot be, a cache service. It is intended to host complete "release" packages, rather than single files or .uz compressed cache files.
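On the point about the static site working from file:// paths: one common way to get that property is to write every link relative to the page it appears on, rather than relative to the site root. The Python sketch below shows the idea; `relative_link` and the example page paths are hypothetical and not the generator's real code.

```python
import posixpath


def relative_link(from_page: str, to_target: str) -> str:
    """Return the href that reaches `to_target` from `from_page`.

    Both arguments are paths relative to the site's output root, using
    forward slashes, e.g. "maps/dm/index.html" and "static/style.css".
    """
    # Links resolve against the directory containing the current page.
    start = posixpath.dirname(from_page) or "."
    return posixpath.relpath(to_target, start)
```

For example, a page at `maps/dm/index.html` would reference `static/style.css` as `../../static/style.css`, which resolves the same way whether the site sits at a web server's root, in a subdirectory, or on local disk. Linking to `index.html` explicitly rather than to a bare directory also helps, since a browser opening file:// paths won't pick an index page for you.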
Here's a quick summary of the current contents:
(*the "unknown" game content are some misclassified mutators I haven't cleaned up yet)
Loaded content index with 40711 items (165.38GB) in 13.98s
Current content by Type:
> Map: 28679
> Model: 883
> MapPack: 1125
> Mutator: 1057
> Voice: 682
> Skin: 2823
Current content by Game:
> Unknown: 13
> Unreal: 3128
> Unreal Tournament: 26067
> Unreal Tournament 2004: 6041
I'm out of time to add more to this wall of text right now, but this afternoon I'll add more notes: a list of (and thanks to) the current content sources and contributions, the current metadata management process, cleanup processes, and more. In the meantime, please drop any questions you have and I'll answer them later.