A traversal view of the .nz space: content and technology
Sebastian Castro
We keep exploring the differences between namespaces under .nz in a series of blog posts from our Research team. In this post, we look at the content, languages and web technologies, such as content management systems, used on .nz sites.
Here’s what you’ll learn about:
Blog post 3 (you are reading it now): Web content and machine learning in .nz namespace.
So what’s in the .nz space? A brief reminder.
Within .nz, domains can be registered directly under .nz (example.nz) or under one of the fifteen subspaces, like .co.nz (trademe.co.nz) or .govt.nz (covid19.govt.nz). Not all spaces are open to registration by anyone. For example, .govt.nz is only available to government organisations.
For this story, we divide the .nz domains into four groups: .nz for domains registered directly under .nz, .co.nz for domains under .co.nz, .govt.nz for domains under .govt.nz, and other, which captures the remaining thirteen subspaces (.net.nz, .org.nz, .kiwi.nz, etc.).
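The grouping above can be sketched in a few lines of Python. This is only an illustration, not the code used for the analysis: the function name `namespace_group` is hypothetical, and it assumes the input is a registered domain name (such as trademe.co.nz), not an arbitrary hostname like www.example.nz.

```python
def namespace_group(domain: str) -> str:
    """Map a registered .nz domain to one of the four groups used in this post.

    Assumption: `domain` is a registered name, so it has exactly two labels
    (example.nz) or three labels (example.co.nz). Hostnames are not handled.
    """
    labels = domain.lower().rstrip(".").split(".")
    if len(labels) not in (2, 3) or not all(labels) or labels[-1] != "nz":
        raise ValueError(f"not a registered .nz domain: {domain!r}")
    if len(labels) == 2:
        return ".nz"          # registered directly under .nz
    second_level = labels[-2]
    if second_level == "co":
        return ".co.nz"
    if second_level == "govt":
        return ".govt.nz"
    return "other"            # any of the remaining subspaces
```

For example, `namespace_group("covid19.govt.nz")` falls in the .govt.nz group, while a name under .org.nz lands in other.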
The .co.nz domains are clearly the majority in our namespace, followed by .nz domains. Over the years, registrations directly under .nz have been gaining ground against the other group. The .govt.nz space has only around 1,000 domains, too few to be easily distinguished next to the other groups.
Web content and machine learning
Since 2019, we had collected web content for .nz domains only on an ad-hoc basis, always with small samples of no more than a few thousand domains. Fortunately, our team has developed a scalable crawler that allowed us, for the first time, to collect content for all .nz domains, over 700,000 of them.
Web content is rich and allows a lot of discovery. In this blog, we are going to show you two highlights.
Low content sites
Last year our colleague Jing Qiao published a blog post explaining our work on detecting low content sites, following the CENTR definition. In brief, we categorise each domain into one of three major groups: Errors, when fetching the web content fails for any of a variety of reasons; Low Content, when we get content but it is a parked page, an under-construction page or just an index; and High Content.
With this definition in mind, but only one data point, we provide you with this view:
We are surprised by the proportion of Errors in the .govt.nz space, as other data collections don’t show that level of failure. A future collection will clarify whether that’s a trend or a one-off error. On the same note, most .govt.nz sites either have some useful content or failed to load, with few falling into Low Content. Also worth noting, .nz domains tend to have less content and more errors than .co.nz domains.
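As a rough sketch of how such proportions could be tallied, the snippet below turns per-domain labels into per-group shares. It is an assumption about the bookkeeping only, not our classification pipeline: the function name `category_shares` is hypothetical, and any input you feed it in this form is your own data.

```python
from collections import Counter, defaultdict

def category_shares(labelled):
    """Turn (group, category) pairs into {group: {category: share}}.

    `labelled` is any iterable of pairs such as (".co.nz", "Low Content");
    shares within each group add up to 1.0.
    """
    counts = defaultdict(Counter)
    for group, category in labelled:
        counts[group][category] += 1

    shares = {}
    for group, counter in counts.items():
        total = sum(counter.values())
        shares[group] = {cat: n / total for cat, n in counter.items()}
    return shares
```

Working with shares instead of raw counts is what makes a small group like .govt.nz comparable with a large one like .co.nz.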
One of the many advancements allowed by machine learning is detecting language from a piece of text. With the web content at hand for all .nz domains, we can use this to take the first sneak peek at what languages are used on websites, as presented in Figure 3.
It’s not a surprise that most of the content is in English, even though English, unlike te reo Māori and New Zealand Sign Language, is not an official language of New Zealand by statute.
We’d like to focus on the languages with a small but still meaningful share. To start, te reo Māori is used in the .govt.nz and other spaces. There are Government sites like kauwhatareo.govt.nz, promoting educational material in te reo Māori, or tetaurawhiri.govt.nz, the Māori Language Commission, where the default language is Māori. In the other group, as we have two subspaces, .iwi.nz and .maori.nz/.māori.nz, dedicated to Tangata Whenua, it’s not unusual to find te reo Māori sites there.
German, French and Swedish usually appear on sites that represent multinationals with a .nz presence. In our experience, Chinese-language content has been growing lately, with more and more Chinese speakers publishing content in their language.
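To give a feel for what language detection from text involves, here is a deliberately crude sketch that counts common function words per language. It is not the machine learning model behind our figures: the word lists are tiny and hand-picked for illustration, the function name `guess_language` is hypothetical, and a word-counting approach like this cannot handle languages such as Chinese that do not separate words with spaces.

```python
import re

# Tiny, hand-picked lists of very common words; purely illustrative.
STOPWORDS = {
    "English": {"the", "and", "of", "to", "is", "in"},
    "Māori": {"te", "ko", "ki", "ngā", "ka", "kei"},
    "German": {"der", "die", "und", "das", "ist", "nicht"},
    "French": {"le", "la", "les", "et", "est", "des"},
}

def guess_language(text):
    """Return the language whose common words appear most often, or None."""
    # Split into runs of letters (Unicode-aware, so macrons are kept).
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(word in stops for word in words)
        for lang, stops in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Production-grade detectors are trained on far more data and typically look at character sequences instead of whole words, which is why they cope with short or mixed-language pages much better than this sketch would.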
As our web crawler captures content and other pieces of data in the process, we are using tools like Wappalyzer to identify web technologies behind websites. Just to show you a couple of findings, let’s take a look at content management systems (CMS) and e-commerce capabilities.
A few points to highlight. First, CMS identification is not possible for most sites, either because there is no CMS behind them or because it has been obscured for security reasons. Second, SilverStripe and Drupal have a higher market share in the .govt.nz space compared to the clear winner everywhere else: WordPress. With future collections, we could check whether any of these technologies gain or lose traction.
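One of the simplest signals a fingerprinting tool can use is the "generator" meta tag that many CMSs add to their pages. The sketch below checks only that single signal, using the Python standard library. It is an illustration under stated assumptions, not how Wappalyzer or our crawler works internally: the names `GeneratorMetaParser`, `KNOWN_CMS` and `detect_cms` are hypothetical, and real tools combine many more hints (headers, scripts, cookies, URL patterns).

```python
from html.parser import HTMLParser

class GeneratorMetaParser(HTMLParser):
    """Capture the content of the first <meta name="generator"> tag."""

    def __init__(self):
        super().__init__()
        self.generator = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.generator is not None:
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "generator":
            self.generator = attributes.get("content", "")

# CMS names mentioned in this post; matching is a plain substring check.
KNOWN_CMS = ("WordPress", "Drupal", "SilverStripe")

def detect_cms(html):
    """Return a known CMS name found in the generator tag, else None."""
    parser = GeneratorMetaParser()
    parser.feed(html)
    generator = (parser.generator or "").lower()
    for cms in KNOWN_CMS:
        if cms.lower() in generator:
            return cms
    return None
```

A site that strips or rewrites this tag returns `None` here, which is one reason identification fails for sites that hide their CMS.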
Now let’s look at e-commerce technologies, which enable .nz websites to offer a shopping cart and let visitors order products directly from the site.
In general, a lot of .nz websites don’t have a cart, hinting that web content is used mainly for information sharing and contact. It’s a little surprising to find shopping carts on government websites, but in places like standards.govt.nz or healthed.govt.nz, where items are sold, having a cart is helpful. In the future, we could check whether the likes of Shopify, WooCommerce or Wix are gaining popularity.
This blog series has been a walk through the data we have available for .nz, but it’s not a thorough exploration. We are sharing some of the most relevant discoveries across security, content and popularity.
- Government sites under .govt.nz are leading the way in adoption of security technologies like SPF, DKIM or CAA, but there is still a lot of work to be done.
- The combination of the high level of DNS validation and .govt.nz adoption of DNSSEC provides assurance against DNS spoofing.
- The most popular site under .nz is mega.nz.
- Over 90% of .nz domains see activity on a daily basis.
- Most of the content under .nz is in the English language, followed by Māori, German and Chinese.