URL Unshortener
In this post, I talk about what I learned from scraping random URL shortener links.
Motivation
While organizing my old files, I came across a tiny Python project I had made that takes a shortened URL and prints the URL it redirects to. It also had an option to try a random URL. Intrigued, I fixed up the code a bit and ran it overnight so that it would continuously try random links and log the live ones for later inspection.
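To give a rough idea of what such a tool looks like, here is a minimal sketch using only the Python standard library. It is not the original project's code: the function names, the 7-character alphanumeric slug, the one-second delay, and the rule that a link counts as "live" only when the final page answers without an HTTP error are all assumptions made for illustration.

```python
import random
import string
import time
import urllib.request

# Assumption: slugs are drawn from letters and digits; real shorteners may differ.
ALPHABET = string.ascii_letters + string.digits
BASE = "https://bit.ly/"


def random_short_url(base=BASE, length=7, rng=random):
    """Build a candidate short URL from a random slug (length is a guess)."""
    slug = "".join(rng.choice(ALPHABET) for _ in range(length))
    return base + slug


def resolve(short_url, timeout=10.0):
    """Return the final URL after redirects, or None if the request fails.

    urlopen follows HTTP redirects by default and raises on HTTP errors
    such as 404, so those links are treated as dead here.
    """
    try:
        with urllib.request.urlopen(short_url, timeout=timeout) as response:
            return response.geturl()
    except (OSError, ValueError):
        # OSError covers URLError/HTTPError and timeouts; ValueError covers bad URLs.
        return None


def scan(attempts, resolver=resolve, make_url=random_short_url, delay=1.0):
    """Try `attempts` random links and collect (short, final) pairs that resolved."""
    live = []
    for _ in range(attempts):
        short_url = make_url()
        final_url = resolver(short_url)
        # Keep only links that actually redirected somewhere else.
        if final_url is not None and final_url != short_url:
            live.append((short_url, final_url))
        time.sleep(delay)  # pause between requests to avoid hammering the service
    return live
```

Calling something like `scan(1000)` and writing the returned pairs to a file would approximate the overnight run. The `resolver` and `make_url` parameters exist only so the loop can be exercised without touching the network.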
General Patterns
When I went through the links the next day, the vast majority turned out to be 404 pages. However, a few common categories stood out as interesting:
- Tourist sites of various kinds.
- News sites, which was understandable, although many of them were surprisingly dead links.
- Internal links, such as a link to a specific (now-expired) API endpoint or a customer survey.
Specific Sites
More rewarding than the general patterns, which were mostly garbage, was a handful of little sites I found:
- This one is essentially an OSS version of Second Life. I have neither VR hardware nor much knowledge of the software, but I thought it would be interesting to share.
- From what I could gather, this seems to be an unmaintained list of French and Spanish Pokémon websites.
- A website to compare pharmacy prices in Canada
- A nice little website with just a snowman. This one I found by chance when I tried out a word instead of random characters: https://bit.ly/snowman
Scam Websites?
I found a couple of scam websites, which I won’t link directly.
- The first seems to generate fake PDFs of transaction slips for the State Bank of India. I couldn’t easily figure out how the website could be used to generate new PDFs, but the only search result on Google was an exact copy of my link with some of the information changed.
- The second website was more cryptic. It was a Blogger blog with thousands of posts dating back to 2013. Each post contained one image, a sometimes-relevant title, and no other content. The images were a mix of computer-generated graphics and photos that varied in location and subject, and almost none of them had reverse image search results. After considerable investigation, I realized I had overlooked the simplest conclusion: it was a spam website that simply copied random posts from Flickr and Dribbble. I’m still not sure what the purpose of the site was, though. The upload schedule was erratic at the beginning, but from mid-2015 onward it ran at almost exactly 100 posts every month. It seems to have been discontinued in mid-2019.
Conclusions
I found a few neat websites and learned a bit through this project, but I should probably get back to organizing my files. If I work on this again in the future, I’d be interested in investigating the kinds of links I would get from words or combinations of words.