Detection of .nz domains with low-content websites
Jing Qiao, Internet Researcher
Not all domain names are created equal. Not every domain in the .nz registry has a fully functional website. Nowadays it’s very common for domains to be registered but never put to real use. When browsing such a domain, you are likely to see a registrar’s web page, usually with a message that the domain is parked. You may also see a page of advertisement links, a website under construction, or even a blank page or an error message.
We agreed it would be really interesting to find out how much of the .nz namespace has substantial web content, how many domains are parked, how many host a developing website, and so on. Within the Technical Research team, we have built a web crawler that regularly scans and collects the web pages of each .nz domain. Using the collected content, we can categorise .nz domains.
The Technical Research team has been constantly engaged with CENTR (Council of European National Top-Level Domain Registries) in sharing knowledge and ideas. In recent years, CENTR registries have been very keen on building web crawlers and extracting insights leveraging web content data. They have also been working on a project categorizing domain names according to their web content, called Signs of Life.
Borrowing CENTR’s categories, we tried to realise them using pure machine learning techniques. To predict a domain’s type, instead of making a judgement through a series of detection steps, we trained a classifier on the text displayed on the web page, the layout of the website, the size of the web page, and so on. Below we introduce the classification model we have built and show what kind of insights we can get from it.
Low Content Taxonomy (LCT)
We refer to this project as Low Content Taxonomy (LCT), as it can identify different types of low content sites as opposed to high content sites. We defined the categories as in the following table. They’re similar to CENTR’s but tailored to our needs:
| category_lv1 | category_lv2 | category_lv3 | description |
| --- | --- | --- | --- |
| Content | High content | High content | Website with significant content |
| Content | Low content | Parked | Parking site of a registrar or site with an individual parked notice |
| Content | Low content | Upcoming | Website under construction or initial page of a website builder |
| Content | Low content | Not used | Blank page or website with the ‘index of’ structure |
| Content | Low content | Blocked | Website with a ‘blocked’ or ‘suspended’ note |
| Content | Low content | Abandoned | Website with an ‘expired’ note |
| No content | Errors | DNS error | No server/IP address associated with the domain |
| No content | Errors | Connection error | Connection refused by the server or timeout |
| No content | Errors | Invalid response | Uninterpretable response |
| No content | Errors | HTTP error | HTTP response codes indicating errors |
The categories under ‘No content/Errors’ come directly from the response codes returned by our crawler. Breaking down the ‘Content’ category is the harder problem, and the one we tackled with machine learning.
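The error mapping described above can be sketched as a simple rule over the crawl outcome. This is a minimal, hypothetical sketch: the field names (`dns_resolved`, `connection_error`, etc.) are illustrative, not the actual schema of our crawler.

```python
# Hypothetical sketch of mapping a crawl outcome to the
# 'No content/Errors' categories in the table above.
# Field names are illustrative, not the crawler's real schema.

def classify_error(outcome):
    """Return a category_lv3 error label, or None if a page was fetched."""
    if outcome.get("dns_resolved") is False:
        return "DNS error"          # no server/IP address associated with the domain
    if outcome.get("connection_error"):
        return "Connection error"   # connection refused by the server, or timeout
    if outcome.get("invalid_response"):
        return "Invalid response"   # uninterpretable response
    status = outcome.get("status_code")
    if status is not None and status >= 400:
        return "HTTP error"         # HTTP response codes indicating errors
    return None                     # a page was fetched; hand it to the content classifier

print(classify_error({"dns_resolved": False}))  # → DNS error
```

Only domains that fall through every rule (i.e. a page was actually fetched) proceed to the machine learning classifier described next.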
First, we curated our training data by finding a sample of domains for each ‘category_lv3’ type under ‘Content’. Then we extracted a set of features from the web scan data that we consider relevant and essential for identifying low content types. One such feature is the text on the web page: low content sites tend to use certain words around a few topics, such as ‘parked’, ‘domain’, popular registrar brands, ‘coming soon’, etc. Using TF-IDF, a commonly used NLP technique, we let the model find those important words appearing in different HTML elements, such as title, keywords, body, alt, etc., and predict the probability of a site being a low content type. In addition to the text features, we also consider the layout of a web page, which can be represented by the HTML tags used, as well as the scale of the page, such as the word count and the size of the HTML.
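The text part of this approach can be illustrated with a tiny scikit-learn sketch. The four-page dataset and the pipeline below are made up for illustration; the real model learns from many more pages and combines text from several HTML elements with layout and size features.

```python
# Illustrative sketch of the TF-IDF text features described above,
# using scikit-learn. The tiny dataset is invented; the real model
# also uses layout (HTML tags) and page-size features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: page text paired with a category_lv3 label.
pages = [
    "this domain is parked free courtesy of the registrar",
    "coming soon our website is under construction",
    "index of / parent directory",
    "welcome to our store browse products and contact us",
]
labels = ["Parked", "Upcoming", "Not used", "High content"]

# TF-IDF turns each page into a weighted bag of words/bigrams; the
# classifier then learns which terms ('parked', 'coming soon', ...)
# signal each category.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(pages, labels)

print(model.predict(["this domain is parked"])[0])
```

A probability per category is also available via `model.predict_proba`, which is how a classifier of this kind can report how confident it is that a site is a low content type.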
After iterations of experiments and improvements, we achieved a model with 98% accuracy on a test set of 232 domains, trained on a dataset of 1,000 domains. The classification report and confusion matrix below show that the model performs well across all classes.
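Reports like the two figures below can be produced with scikit-learn’s standard evaluation helpers. The labels in this sketch are invented; the real evaluation ran over the 232-domain test set.

```python
# Sketch of the evaluation step: per-class precision/recall and a
# confusion matrix over held-out predictions. Labels are invented.
from sklearn.metrics import classification_report, confusion_matrix

y_true = ["High content", "Parked", "Parked", "Upcoming", "Not used", "High content"]
y_pred = ["High content", "Parked", "Upcoming", "Upcoming", "Not used", "High content"]

print(classification_report(y_true, y_pred))

# Rows are true classes, columns are predicted classes; the diagonal
# counts correct predictions.
cm = confusion_matrix(y_true, y_pred,
                      labels=["High content", "Parked", "Upcoming", "Not used"])
print(cm)
```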
Figure 1. Classification performance report
Figure 2. Confusion matrix of LCT model
LCT on 10% registry
With web content crawled in June 2020 for a 10% sample of the .nz register, we applied the trained model to see how much of the sample ended up in each category. The result is visualised in Figure 3.
Less than half of the sample is predicted to be high content, which is surprisingly low. The rest are low content sites and domains with errors, which break down further into various categories. Among the errors, almost 15 percent of domains got a DNS error, meaning no IP address was found during DNS resolution. Additionally, under the low content category, a small number of blocked domains were detected; these are typically domains suspended by a registrar for some reason.
Figure 3. LCT on 10% registry
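Once every domain in the sample carries a predicted category, the breakdown behind a chart like this is a one-line aggregation. A minimal pandas sketch, with placeholder predictions:

```python
# Minimal sketch of aggregating per-domain predictions into the
# category percentages discussed above. Predictions are placeholders.
import pandas as pd

preds = pd.DataFrame({
    "domain": ["a.co.nz", "b.nz", "c.org.nz", "d.nz", "e.co.nz"],
    "category_lv3": ["High content", "Parked", "DNS error", "Upcoming", "High content"],
})

# Share of the sample in each category, as percentages.
shares = preds["category_lv3"].value_counts(normalize=True) * 100
print(shares.round(1))
```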
LCT across second levels
Given that different second level domains have different levels of maturity, due to historical reasons and customer preference, the quality of their web presence is very likely to differ. With this model, we’re able to look into the LCT predictions for domains under a specific second level, which helps us understand the status quo of each second level.
As an example, Figure 4 shows the aggregations for domains under the two major second levels, co.nz and .nz, with the remaining second levels grouped together. From the stats shown in the chart, we see that co.nz, as the biggest second level, had a higher percentage of high content sites than .nz and the rest together. This coincides with our perception of co.nz as the most mature and active second level. On the other side, .nz, which was opened for registration later than co.nz and is currently the second largest second level, turned out to have a much bigger percentage of low content, typically parked domains. About one-third of .nz domains were either parked or websites under construction, which tells us a large portion of the .nz second level had no web content but was held for future use or monetary purposes.
Figure 4. LCT between major second levels (10% registry)
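The per-second-level comparison is the same aggregation, grouped by second level. A hedged sketch with a row-normalised pandas crosstab, on illustrative data:

```python
# Sketch of breaking LCT predictions down by second level with a
# normalised crosstab, similar to the aggregation behind the chart.
# Data is illustrative only.
import pandas as pd

df = pd.DataFrame({
    "second_level": ["co.nz", "co.nz", ".nz", ".nz", "other"],
    "category_lv2": ["High content", "Low content", "Low content", "Low content", "High content"],
})

# Percentage of each category within each second level (rows sum to 100).
pct = pd.crosstab(df["second_level"], df["category_lv2"], normalize="index") * 100
print(pct.round(1))
```

The same `crosstab` call, swapping the grouping column, produces the local/overseas and organisation/person comparisons in the following sections.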
LCT between local and overseas registrations
The segmentation between local and overseas registrants is also very interesting. From Figure 5, we can easily tell that local registrations were much more active online than overseas registrations, suggesting that a big portion of overseas registrants were very likely speculators who didn’t have businesses or content associated with their .nz domains.
Figure 5. LCT between local and overseas registrations (10% registry)
The high percentage of errors in the overseas bar looks peculiar, so we broke the error category down further to see what’s in there. As Figure 6 shows, the majority of error types among overseas registrations were DNS errors, meaning these domains didn’t yet have an IP address set up in the DNS, and were far from being activated online.
Figure 6. Error breakdown of local and overseas registrations (10% registry)
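The drill-down itself is a filter followed by a count: restrict to domains whose lv2 category is ‘Errors’, then tabulate error types per registrant location. Columns and values in this sketch are hypothetical.

```python
# Sketch of the error drill-down: keep only 'Errors' rows, then count
# each error type per registrant location. Data is hypothetical.
import pandas as pd

df = pd.DataFrame({
    "location": ["local", "overseas", "overseas", "overseas", "local"],
    "category_lv2": ["Errors", "Errors", "Errors", "High content", "Low content"],
    "category_lv3": ["HTTP error", "DNS error", "DNS error", "High content", "Parked"],
})

errors = df[df["category_lv2"] == "Errors"]
breakdown = pd.crosstab(errors["location"], errors["category_lv3"])
print(breakdown)
```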
LCT between domains registered by organisations and people
The Technical Research team built a model that can predict whether a registrant name belongs to an organisation or a person, as detailed in this article. With that model, we are able to differentiate domains registered by organisations from those registered by people, and further explore how registrant types relate to the web presence of domains.
From Figure 7, we can see that, compared to domains registered by people, the namespace registered by organisations had a slightly higher rate of high content and a lower rate of low content, which was predominantly parked domains. This is consistent with the intuition that an organisation normally registers a domain for business and is more likely to have a website, while it’s more common for a person to register a domain purely for reservation or investment purposes.
Figure 7. LCT between domains registered by organisation and person (10% registry)
Conclusion
This article introduced the LCT project and the machine learning model that classifies the .nz namespace into a taxonomy of high/low content types, and demonstrated the model’s value through analysis of a 10 percent sample of the registry. We’ve built a workflow that automates the prediction process and can run on the whole .nz registry. Running it regularly will give us a time series of data, which will be valuable for monitoring the nature of the .nz namespace and discovering trends and anomalies.