Technical overview of my first solo project - Archive Finds pt. 2
This is the second part of the series; the first part introduced the idea behind Archive Finds and why it came about. In contrast to that part, this one delves deeper into the development side of the project, including the key technologies and decisions, to give an overview of how such a service functions. I'd like to describe this journey to revisit the mistakes I made and the insights I gained. And maybe to show off a product I'm proud of, just a little.
The development of this service, then nameless, started more than 3 years ago, and by now its GitHub repository is made up of more than 400 commits. Over these years and commits, the service has changed notably. That being said, the same key building blocks have been there throughout:
- The data scraping
- A database to store the scraped data
- A website to show the scraped data
A fourth addition to this “Big 3” could be a back-end, such as an API connecting the website and the database. This component was only added later on, so for the time being let's set it aside.
Scraping the data
Even though I'd had experience with web scraping before, it was clear to me that this is the “make or break” part of the system, as I wasn't even sure whether it was feasible. I knew some websites can be quite hard to scrape, and I considered the entire process error-prone and hardly scalable, since my web scraping experience came mainly from E2E testing tools, such as Selenium.
Enter Scrapy, a Python web scraping framework that fit my needs perfectly. It organizes the scraping into “spiders”. Each spider scrapes a website's static HTML content, which makes it quick and lightweight. Using the Playwright plugin, it can be extended to scrape dynamic (JavaScript-rendered) content as well, so it isn't limited to just static pages.
I decided to go with one spider per website (secondhand store), since each store is different - whether in its layout or its product structure. To manage the different types of spiders, there are 3 general configs: a plain Scrapy spider (static content only), a Playwright-enabled spider (dynamic content), and finally a spider relying on a ScrapeOps proxy provider. These configs, essentially sets of Scrapy settings, can be combined with one another, as sketched below.
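To make the idea concrete, here is a rough sketch of what a store-specific spider combining the plain config with the Playwright one could look like. This is not the project's actual code - the spider class, URLs, selectors, and settings fragments are illustrative, and the ScrapeOps proxy settings are left out entirely.

```python
import scrapy

# Reusable settings fragment enabling scrapy-playwright; a similar fragment
# would hold the ScrapeOps proxy settings (omitted here).
PLAYWRIGHT_CONFIG = {
    "DOWNLOAD_HANDLERS": {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    },
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
}


class StoreXSpider(scrapy.Spider):
    """Hypothetical spider for one secondhand store."""

    name = "store_x"
    start_urls = ["https://example-secondhand-store.com/catalogue"]
    # Combine the base settings with the Playwright fragment.
    custom_settings = {"ROBOTSTXT_OBEY": True, **PLAYWRIGHT_CONFIG}

    def start_requests(self):
        for url in self.start_urls:
            # Ask scrapy-playwright to render the page in a headless browser.
            yield scrapy.Request(url, meta={"playwright": True})

    def parse(self, response):
        # Selectors are made up; each real store needs its own.
        for card in response.css("div.product-card"):
            yield {
                "name": card.css("h2::text").get(),
                "price": card.css("span.price::text").get(),
                "url": response.urljoin(card.css("a::attr(href)").get()),
            }
```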
Apart from the spiders, the other key scraping component was the pipelines.py file, which defined the processing pipelines used for inserting scraped items into the MongoDB database and for translating item names using DeepL.
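A simplified sketch of what such a pipeline can look like follows - the collection, environment variable, and field names are made up, so treat it as an outline rather than the real pipelines.py:

```python
import os

import deepl
import pymongo


class MongoTranslatePipeline:
    """Translate item names with DeepL and insert items into MongoDB."""

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(os.environ["MONGODB_URI"])
        self.collection = self.client["archive_finds"]["items"]
        self.translator = deepl.Translator(os.environ["DEEPL_API_KEY"])

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Translate the scraped name into English before storing the item.
        item["name_en"] = self.translator.translate_text(
            item["name"], target_lang="EN-US"
        ).text
        self.collection.insert_one(dict(item))
        return item
```

Such a pipeline would then be registered under ITEM_PIPELINES in the Scrapy settings so that every yielded item passes through it.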
Scrapy helped me easily scale to a large number of scraping jobs without huge demands on hardware resources. That being said, web scraping was easily the part of the system that broke the most. Any change in a website's UI can break the selector for some item property, and with it the entire pipeline. As a result, I've often had to rework these spiders to keep them correct. What is more, without monitoring or a similar form of constant checking, even identifying these problems wasn't that easy.
The database
The gathered data is stored in a NoSQL document database - MongoDB, hosted on MongoDB Atlas.
A schema-less database was something I wanted to try, and I have to say, the schema flexibility has been convenient. But at times this flexibility led to issues: since no validation was done, anything could end up in the database, even mostly empty objects.
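In hindsight, even a tiny validation pipeline placed before the insert would have kept most of the junk out. A minimal sketch, assuming a handful of required fields (the field names are illustrative):

```python
from scrapy.exceptions import DropItem

# Hypothetical set of fields an item must have to be worth storing.
REQUIRED_FIELDS = ("name", "price", "url")


class ValidateItemPipeline:
    def process_item(self, item, spider):
        missing = [field for field in REQUIRED_FIELDS if not item.get(field)]
        if missing:
            # Drop mostly empty objects instead of letting them reach MongoDB.
            raise DropItem(f"Missing fields: {missing}")
        return item
```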
Other than that, to be frank, not much more thought went into selecting MongoDB specifically. It was free, thanks to the Atlas free tier, though that tier is limiting with its 512 MB storage cap (Source). This limit meant I often had to clear the database, which was a bit annoying, but nothing a paid tier wouldn't resolve.
At the end of the day, I can imagine easily replacing Mongo with another database, whether SQL or NoSQL, but I have never had the need to do so, which bodes well for Mongo. All in all, the database has been left virtually unchanged and thus isn't mentioned much in the rest of the article.
Presentation layer/frontend
Finally, the data presentation, the frontend. In retrospect, this is definitely the part that has evolved the most, with a number of different refactors and redesigns.
I enjoy keeping a front-end as simple as possible. In this case, that meant starting off with a simple FastAPI Python web server, which rendered static content using Jinja2 templates. This enabled me to iterate quickly in the initial phases of development, but accommodating a more advanced UI was not as straightforward. I tried different UI libraries (such as Bootstrap or Chakra UI), but rendering websites statically can only go so far. That is why, in later phases, the static application was replaced by a dynamic SPA (Single Page Application). More on this later.
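For reference, the original server-rendered setup looked roughly like this - a minimal sketch, with the route, template, and item fields made up rather than taken from the project:

```python
from fastapi import FastAPI, Request
from fastapi.templating import Jinja2Templates

app = FastAPI()
templates = Jinja2Templates(directory="templates")


@app.get("/")
async def list_items(request: Request):
    # In the real application the items would come from MongoDB.
    items = [{"name": "Vintage film camera", "price": "25 EUR"}]
    # Render templates/items.html with the items baked into the HTML.
    return templates.TemplateResponse(
        "items.html", {"request": request, "items": items}
    )
```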
Moving from local to deployment
These three components together make up a system or an application that can be run locally. Running locally has one big problem though - what if someone else would like to use this tool?
To enable others to use the application, it is necessary to deploy it - in other words, to move it from my local environment to a production one. There is a huge number of possibilities and services that can achieve this, from easy, one-click deployments to quite complicated, hands-on ones. That is why it can be interesting to see how the deployment of the scrapers and of the front-end application has evolved over time. As a note, the database has been deployed (on Atlas) from the get-go.
Starting with the Scrapy spiders - these were initially run manually, one by one, on my machine. To automate this process, I decided to rely on cron jobs to periodically invoke the spiders. Relying on my personal machine isn't ideal though: the scraping would halt if I turned the machine off or it happened to lose Internet access. Because of this, I decided to move the scraping process to a DigitalOcean droplet (virtual machine).
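The setup on the droplet boiled down to cron periodically invoking the spiders. A hedged sketch of that idea - the spider names, paths, schedule, and log location are all made up:

```python
#!/usr/bin/env python3
# Runner script that a crontab entry could invoke, for example:
#   0 */6 * * * /usr/bin/python3 /opt/scrapers/run_spiders.py >> /var/log/spiders.log 2>&1
import subprocess

SPIDERS = ["store_x", "store_y", "store_z"]

for spider in SPIDERS:
    # Run the spiders one after another so they don't compete for resources.
    subprocess.run(["scrapy", "crawl", spider], check=False)
```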
In essence, this virtual machine would keep running the Scrapy spiders 24/7, which is great! But far from perfect. There was no monitoring of the scraping jobs, and any logging had to be set up manually. Updating the spiders' code also wasn't streamlined; usually I had to send all of the code to the virtual machine over SFTP. In later phases, I came back to improve upon this deployment, but for the time being, let's look at the other component - the static Jinja2 + FastAPI application.
My initial approach to the application's deployment was a lazy one. Ideally, I was hoping to find a service offering quick deployments, so I wouldn't have to get my hands dirty. Vercel was one such possibility. I'd had decent experience with it at one of my previous jobs, but I quickly realized that it offers only beta support for Python deployments, which resulted in my deployment suddenly breaking down and meant I had to find another service. As an alternative, I tried Render - another service I had used before (during my bachelor's). And although the deployment itself was successful, the service's free tier was quite limiting. The relatively small amount of pre-allocated compute disincentivized me from inviting more people to the application and “going big” with it, since the application would crash if that were to happen. Although it may be a petty reason, I still wanted to find a better solution: something with less vendor lock-in, where I don't have to pay a (relatively high) monthly flat fee. That's how I turned to Google Cloud Run.
Google Cloud Run is a serverless container runtime, enabling you to run an entire application, in my case my FastAPI application, as a container. The container here is a Docker one, specified by a Dockerfile, making it easy to port to a different service. What is more, you only pay for the compute used, which usually meant very small charges (<$0.5/month). Although I wouldn't say it was the easiest to set up, as GCP can feel quite overwhelming, it works quite well and checks all of my boxes.
Improving on the initial product
We have looked at how the product came together and was successfully deployed. Along the way, some cracks started to show, and there are clear areas for potential improvement. In this section, I will thus cover some of the more interesting areas that I have continued to improve.
Making a hobby project collaboration-ready
One of the first big steps towards a mature, production-ready project was making sure that the code is understandable and maintainable by others as well, not just me. This only became an issue once a friend of mine, Surya, wanted to contribute to the project, and I realized it was a mess that had to be cleaned up before anyone could be onboarded.
This “cleanup” included restructuring the code and its documentation. What especially needed to be written down were the processes that were clear to me but to no one else (e.g. the clunky manual deployments, how I'd been doing self code reviews, etc.). For someone else to contribute, it was also necessary to create a roadmap of sorts, containing both fixes to be done and future features we'd like to add. The hardest aspect of this, I'd say, was prioritization: how do we know what our customers want with little to no feedback from them? Or, for instance, is it more appropriate to fix a small feature or to work on adding an exciting new one? Clearly outlining the vision of the product, including any potential issues that I'd prefer to forget about, was challenging at times and required some degree of self-discipline, but it was definitely a good experience.
Better scraping practices
Understandably, the longer my Scrapy spiders kept running, the more often I'd run into issues. This usually meant being flagged as a bot and hitting annoying 403 errors. These can be hard to resolve, especially when scraping static websites, as is normally the case with Scrapy, so it's usually wiser to try to prevent running into these errors, CAPTCHAs, etc. in the first place.
As far as I understand, bot detection depends on a number of factors - your IP address, the headers with which you make a request, behavioral patterns, fingerprinting, and others. Thus, in my scrapers, I focused on changing the IP address, changing the request headers, and finally relying on a browser rather than pure static Scrapy scraping. While changing your IP address is harder (financially more expensive, ideally in the form of residential proxies), it is, from my experience, the more reliable way to prevent 403s. On the other hand, request headers, such as the User-Agent or the cookies present, are easier to change and in some cases sufficient. Residential proxies were outside my budget, and whenever I tried public or Tor/onion proxies, I'd run into the same blocks as with my own IP address. This meant I put more focus on tinkering with request headers. Usually, I'd start by copying the request headers my browser sends when visiting the website. To add variance, I'd use ScrapeOps' Fake User-Agent API to make it seem like clients with different User-Agent headers were making the requests. Unfortunately, this almost never prevented bot detection. That being said, switching to a browser-based approach helped in some cases, although it isn't as lightweight and can be harder to deploy.
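The header tinkering is simple to express as a Scrapy downloader middleware. In the project the User-Agent strings came from ScrapeOps' Fake User-Agent API; in this sketch they are a hard-coded stand-in, and the class name is illustrative:

```python
import random

# Stand-in for User-Agent strings fetched from an external API.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]


class RotateUserAgentMiddleware:
    def process_request(self, request, spider):
        # Pick a different User-Agent for each outgoing request.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None  # let Scrapy continue handling the request as usual
```

Such a middleware would be enabled via DOWNLOADER_MIDDLEWARES in the settings; as noted above, though, headers alone rarely got me past the bot detection.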
As a note, I’ve always tried to respect the existing robots.txt file, if it was present. There were a few websites where I may have ignored the disallowed paths, but at least I tried not to overwhelm the website and respect their defined delay intervals.
Although a topic outside of the scope of this article, the legality and morality of web scraping in the times of huge AI companies scraping all of the Internet is a most interesting and relevant one.
Monitoring of spiders
As I mentioned, the deployed spiders initially had no logging or monitoring, meaning the only way to realize something was wrong was noticing that no new products were being inserted into the database.
The first step to prevent further issues was introducing simple logging on the virtual machine: the output of each cron job would be saved into a text file that I could manually search through to find out why a scraping job might have failed. Although a step in the right direction, the debugging process was still far from ideal, as I had to SSH into the virtual machine, go through a number of files, and hope I'd find the issue there.
An upgrade on this was the introduction of ScrapeOps monitoring. This was a big improvement, as it made it a lot easier for me to tell whether the spiders were working as they should. For each spider, or even each specific job, I could see the number of inserted items, warnings, errors, runtime, etc. This was helpful for getting a quick overview, but the dashboard itself felt limiting. In my experience, it was really hard to get an overview of all of the errors encountered (and not just the one or two most common ones), to search through them, and so on. That is why I later decided to opt for a different service instead.
Self-hosting the Scrapy spiders was not always ideal either. I would frequently run into spiders left hanging - not scraping, but not closing either - which would either block other spiders from starting or slowly fill up the RAM. The virtual machine also wasn't free, so out of curiosity I decided to try the free tier of Zyte, more concretely its Scrapy Cloud. This was not the first time I'd had the impulse to move away from the somewhat DIY cron jobs, whether via scrapyd or other means of deployment, but I had never made a successful switch. So I was curious how Scrapy Cloud would fare.
The deployment itself was not the smoothest. During the setup, I mainly struggled with providing secrets in the Zyte project dashboard; I felt the documentation could have been better in that regard. More recently, I've struggled with using Playwright in the spiders, as I found out that only the WebKit browser works, whereas Chrome or Firefox, to my knowledge, do not. For all of this critique, the service is quite cheap and works well. I can easily see the logs of each spider or even of each spider run, and the orchestration is very straightforward. One feature I'd wish for is some form of automatic killing of spiders left hanging for too long (which occasionally happens to me); it is frustrating to see a spider run indefinitely without being axed.
Revisiting the front-end & improving the UX
Only after the first few deployments did two approaches come to mind that would bring dynamism and a better UX to the website - relying on HTMX (a library I've enjoyed using in the past) or moving to a React SPA. As I wanted to create a visually appealing website, I went for React, mainly because of its ecosystem size.
The move to React meant changes to the FastAPI backend, as the application is now split in two: the frontend (React) and the backend (FastAPI). The FastAPI endpoints now return JSON containing the relevant data instead of HTML, and the new React application renders this data in a much nicer manner. No longer being static enabled me, for instance, to replace the pagination of product pages with an infinite scroll, which in my opinion resulted in a much smoother experience. Here, the key packages include react-query and react-infinite-scroll-component.
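On the backend side, this mostly meant endpoints returning paginated JSON that the infinite scroll keeps requesting as the user scrolls. Roughly along these lines - the endpoint, parameters, and in-memory data are illustrative, not the real API:

```python
from fastapi import FastAPI

app = FastAPI()

# Stand-in for products that would normally be queried from MongoDB.
FAKE_ITEMS = [{"name": f"Item {i}", "price": i} for i in range(200)]


@app.get("/api/items")
async def get_items(page: int = 0, page_size: int = 20):
    start = page * page_size
    items = FAKE_ITEMS[start : start + page_size]
    # has_more tells the frontend whether to fetch another page on scroll.
    return {
        "items": items,
        "page": page,
        "has_more": start + page_size < len(FAKE_ITEMS),
    }
```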
The biggest part of this refactor was the improvements to the UI - whether that is the use of colors, borders, the displaying of relevant data, or other aspects that were relatively new to me. New in the sense that it was the first time I'd had to make these design choices myself. Although I wouldn't claim the UI is perfect, it's definitely my best work.
This split also changed how the application is deployed. The FastAPI backend was still deployed to Google Cloud Run, pretty much without changes to the deployment process, whereas the React application was deployed using Firebase, as it was easy to do and kept everything within the Google Cloud universe.
Current problems and looking past them
Future-proofing the scrapers
The scraping is definitely the biggest source of bugs and inconsistencies. It has to be maintained and kept up to date so that changes to a website do not break the scraping process. That can be hard, since currently all data extraction is done using XPath or CSS selectors, meaning that if, say, the class name of the h1 element containing the item name changes, the name won't be extracted correctly anymore.
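To make the brittleness concrete, here is a tiny, self-contained example (the markup and class names are invented) of how a renamed class silently turns a scraped name into None:

```python
from scrapy.selector import Selector

OLD_HTML = '<h1 class="product-title">Levi\'s 501</h1>'
NEW_HTML = '<h1 class="pdp-heading">Levi\'s 501</h1>'  # the store renames the class

SELECTOR = "h1.product-title::text"

print(Selector(text=OLD_HTML).css(SELECTOR).get())  # "Levi's 501"
print(Selector(text=NEW_HTML).css(SELECTOR).get())  # None - the pipeline now gets an empty name
```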
This costly upkeep also makes it difficult to add new features, since I'm not sure I would be able to keep them consistent enough.
A possible solution could be relying on AI-based scraping, something I have not tried before. I’m curious if anyone reading this has some insights on this topic.
Another problem is making sure the Scrapy spiders do not get flagged as bots, which means relying on a service providing rotating residential proxies. Each request then carries an additional cost, which can make the thousands of requests made per day expensive.
Analytics - understanding how the customers react
Getting user feedback is crucial, and an indirect way of achieving this is looking at how users use and interact with the product. To do so, I have self-hosted Matomo and relied on it to see how many visitors the website has, what users usually search for, etc. This is a very basic setup, and one that could be improved to offer more detailed metrics, perhaps using something like PostHog.
Ensuring the items are up to date
An ongoing problem was ensuring that the products I had scraped are still available or, in general, up to date. This was solved by a set of new spiders, reusing the already introduced spider configs, that go through each item at a calculated interval. Assuming items usually sell quickly, it makes sense to re-scrape them soon after listing to check they are still available - and the longer an item has been listed, the less likely it is to sell. With this assumption, I made an exponential-backoff-like algorithm, with the rule that every item should still be checked at least once every 2 days.
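A minimal sketch of that schedule - the base interval and growth factor are illustrative, only the 2-day cap comes from the description above:

```python
from datetime import datetime, timedelta


def next_check_interval(listed_at: datetime, now: datetime) -> timedelta:
    """How long to wait before re-scraping an item, based on its age."""
    age_days = max((now - listed_at).days, 0)
    # Fresh items are re-checked often; the interval doubles with each day listed...
    interval = timedelta(hours=2) * (2 ** age_days)
    # ...but every item is still checked at least once every 2 days.
    return min(interval, timedelta(days=2))


if __name__ == "__main__":
    now = datetime.now()
    for days_old in (0, 1, 3, 7):
        listed_at = now - timedelta(days=days_old)
        print(days_old, next_check_interval(listed_at, now))
```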
It should be noted that this logic isn’t ideal for auctions, where the most activity and changes (e.g. number of bids, current price) occur towards their end, rather than their beginning.