Building and deploying classifiers with Airflow and Docker
Gerard Barbalich
Classifying fake online stores within .nz
InternetNZ is helping New Zealanders to harness the power of the Internet by ensuring that .nz is a safe place to do business. To play our part, the InternetNZ Research team is deploying a range of classifiers to identify malicious domains, starting with fake online stores. This post details how we are using Apache Airflow within Docker containers to deploy these machine-learning trained classifiers.
Maintaining trust in .nz
“Indian Love Story” wasn’t a romantic tale - rather, it was a fake sneaker website selling underpriced Nike Air Jordans. The domain has since been removed from the Domain Name System, but it was part of a growing number of fake online stores looking to benefit from the reputation of .nz. So in response, we built a classifier that identifies such fake online stores, which are then passed on to the New Zealand Domain Name Commission (DNC) for investigation.
Feature experimentation
The DNC gifted us a list of domains from user-reported cases to begin the training of this tool. We verified this list and combined it with a random sample from the .nz domain registry to form our initial dataset.
For inspiration on features that would best distinguish these two groups, we turned to published work from similar projects - including consulting with our colleagues at SIDN. These brainstormed features were then split into three broad categories for engineering and experimentation: domain-centered, registry-centered, and site-centered.
Domain-centered features relate to the content of the domain name, including its string length and component words. Registry-centered features relate to registrant and registry information held by the domain registry, including the number of domains a registrant holds, the tenure of a registrant, and the location of the registry. Site-centered features relate to the content on a domain, including text, images, and tags.
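To make the domain-centered category concrete, here is a minimal sketch of how such features could be derived from a domain name. The function name, the chosen features, and the naive hyphen-based word split are illustrative assumptions, not the features we actually engineered.

```python
import re


def domain_features(domain: str) -> dict:
    """Return simple string-based features for a domain name (illustrative only)."""
    # Use only the left-most label, e.g. "cheap-nike-shoes" in "cheap-nike-shoes.co.nz".
    label = domain.lower().split(".")[0]
    # Naive word split on hyphens and digits; proper word segmentation
    # would need a dictionary and is out of scope for this sketch.
    words = [w for w in re.split(r"[-\d]+", label) if w]
    return {
        "length": len(label),
        "num_hyphens": label.count("-"),
        "num_digits": sum(ch.isdigit() for ch in label),
        "words": words,
    }


features = domain_features("cheap-nike-shoes.co.nz")
print(features)
```

Registry-centered and site-centered features would need registry records and crawled page content respectively, so they are not sketched here.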
Experimentation began with domain-centered and registry-centered features. We expanded the number of features using FeatureTools, used Recursive Feature Elimination in combination with pipelines to iterate through different feature combinations, and analysed feature importance using LIME. Unfortunately, the results with these features did not meaningfully distinguish the two groups. Further experimentation showed the simplicity of site-centered features to be key - with the best results coming from Term Frequency-Inverse Document Frequency (TFIDF) analysis upon the website text itself (sourced from our Web scans).
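For readers unfamiliar with TFIDF, the sketch below computes the textbook form of the weighting (term frequency multiplied by the log of inverse document frequency) using only the standard library. The three example documents are invented, and this plain formula differs slightly from the smoothed variant that Scikit-Learn applies by default, so treat it as an illustration of the idea rather than a reproduction of our analysis.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Invented example texts standing in for scanned website content.
docs = [
    "cheap shoes free shipping",
    "council meeting minutes",
    "free public meeting",
]
scores = tfidf(docs)
print(scores[0])
```

Terms that are frequent in one document but rare across the collection ("cheap" above) receive a higher weight than terms shared by several documents ("free"), which is what makes the weighting useful for separating one kind of site from another.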
Constructing a pipeline
A test classifier pipeline was constructed using Scikit-Learn's pipeline class to join a TFIDF Vectoriser and a Random Forest classifier. Training and testing this classifier pipeline yielded initial results of:
The trained classifier pipeline was then tested upon a fresh dataset - a random sample of fresh web content from the .nz domain registry. Consequently, several hundred new domains were labelled as “fake online stores”. These domains were manually classified by a team member to check the accuracy, along with a random sample of equal size from those domains labelled as “not fake online stores”. Based on this manual classification the trained classifier pipeline showed results of:
We noted that the classifier showed a slight tendency towards false-positive classification. However, our team was happy with the results overall and decided to proceed to deployment.
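The pipeline described in this section, a TFIDF vectoriser joined to a Random Forest classifier through Scikit-Learn's Pipeline class, can be sketched roughly as follows. The toy texts, labels, step names, and hyperparameters are all assumptions made for illustration; our actual training data and settings are not shown here.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Invented toy data: 1 = fake online store, 0 = not a fake online store.
texts = [
    "cheap nike air jordans 80% off free shipping",
    "discount sneakers outlet sale buy now",
    "community library opening hours and events",
    "plumbing services in wellington contact us",
]
labels = [1, 1, 0, 0]

# The vectoriser turns raw text into TFIDF features,
# which the forest then classifies.
classifier = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
])
classifier.fit(texts, labels)

# Predict on unseen text; the same object handles vectorising and classifying.
predictions = classifier.predict(["sneakers sale free shipping", "library events"])
print(predictions)
```

Keeping both steps in one Pipeline object means the fitted vocabulary and the fitted model travel together, which is convenient when the trained pipeline is later packaged for deployment.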
Using a trained pipeline in conjunction with our Registry Augmentation Platform (RAP)
The RAP is a scalable distributed framework we have designed to collect data on domains. Our Data Engineer Asher Halliwell is leading its design, writing it in Python using the Celery framework.
It uses a microservice-centered structure, with each process exposed through a REST API, to increase process modularity. This framework allows us to queue different processes as needed for individual workflow pipelines by calling the relevant API.
For example, the trained classifier pipeline from above is just one component of a multi-stage workflow to identify fake online stores. The entire workflow (pictured below) contains three preliminary steps before the classifier pipeline is used: gathering a list of domains to investigate, collecting content from those domains, and processing that collected content. Structuring the process in this way allows the first three APIs to be called separately for a range of different processes, not just the fake online store pipeline. While the RAP and separate APIs are designed for modularity and separation, we needed a tool to organise and orchestrate workflows in one place - for this, we have used Apache Airflow.
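As a rough illustration of what "calling the relevant API" might look like from a workflow step, the sketch below builds an HTTP request for one stage. The base URL, the path layout, and the payload shape are entirely hypothetical; the real RAP endpoints are not described in this post.

```python
import json
from urllib.request import Request

# Hypothetical base URL; the real RAP endpoints are not published here.
RAP_BASE_URL = "https://rap.example.invalid/api"


def build_trigger_request(process: str, payload: dict) -> Request:
    """Build (but do not send) a POST request that would queue a RAP process."""
    body = json.dumps(payload).encode("utf-8")
    return Request(
        url=f"{RAP_BASE_URL}/{process}/trigger",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_trigger_request("webscan", {"domains": ["example.nz"]})
# urllib.request.urlopen(request) would send it; that call is omitted
# because the endpoint above is invented.
print(request.get_method(), request.full_url)
```

Because each stage sits behind its own endpoint in this picture, a different workflow could reuse the same request builder with a different process name.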
Using Airflow and Docker to automate our classifier pipelines
Using Apache Airflow and Docker to automate the running and reporting of classifiers is a popular trend within Data Science - with well-structured tutorial resources.
Airflow
Airflow is an open-source workflow management system originally developed by Airbnb, allowing the automation and scheduling of scripts or workflows.
Airflow is a sophisticated way to schedule and run Extract-Transform-Load jobs, as it allows multiple layers of dependencies and provides a simple but effective visualisation for monitoring all your jobs (detailed here). Our team previously decided that Airflow was a good fit for managing our data workflows, and we have already been using it to run PySpark in an Airflow task.
Docker
Docker is a platform for developers and sysadmins to develop, pack, deploy, and run applications within containers. Using containers rather than virtual machines cuts down on performance demands drastically while maintaining the benefits of containment.
Running Airflow within Docker
Layering Airflow with Docker gives us the benefits of both applications. First, we can automate, schedule, and monitor workflows within one system - Airflow. Second, as business needs dictate, we can adjust the process of that workflow, change its scheduled interval, or replicate and tweak it. Running Airflow within Docker maintains all of these advantages while adding the ability to contain, replicate, and re-deploy the entire process as needed.
For example, we have stored a Docker image containing the trained classifier pipeline above, as well as the Airflow Directed Acyclic Graph (DAG) and associated plugins that schedule the entire process. We can now replicate this Docker image and alter it as needed, whether to test a proposed alteration in the pipeline or to host a different classifier and associated DAG. The ease with which we can adapt and experiment with this entire pipeline was a big drawcard for our team and a large contributor to why we are using Docker.
Creating a DAG to run our classifier pipeline
Workflows within Airflow are built upon DAGs, which use operators to define the ordering and dependencies of the tasks within them. Each operator typically defines a single task, commonly acting as a trigger or a marker of status.
Trigger operators within Airflow start events, while sensor (or "status") operators verify states. We applied this trigger-and-sensor model throughout our DAG. Triggers and sensors are commonly separated because of the difference in run-time between them.
For example, within our DAG the 'Webscan Trigger' operator triggers the initiation of a webscan for a selected list of domains. The running of this webscan may take anywhere from minutes to hours depending upon the number of domains. The 'Webscan Status' operator then periodically checks the status of the webscan until it is marked as complete. These actions are separated so that the status of the webscan can be periodically determined without repeatedly triggering it.
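The trigger/status split can be illustrated without Airflow at all. In the sketch below, a stand-in service (invented for this example) takes three status checks to finish; the trigger function starts the scan once, and a separate polling function checks the status at a fixed interval. In Airflow, the polling role is played by a sensor whose `poke` method is called every `poke_interval` seconds; the class and function names here are assumptions and are not our plugin code.

```python
import time


class FakeWebscanService:
    """Stand-in for a long-running scan API; reports 'complete' on the third check."""

    def __init__(self):
        self.triggered = 0
        self._checks = 0

    def start_scan(self, domains):
        self.triggered += 1
        return "scan-1"  # invented scan id

    def status(self, scan_id):
        self._checks += 1
        return "complete" if self._checks >= 3 else "running"


def trigger(service, domains):
    """Trigger step: start the scan exactly once and hand back its id."""
    return service.start_scan(domains)


def wait_until_complete(service, scan_id, poke_interval=0.01, max_pokes=10):
    """Status step: poll until the scan is complete, without re-triggering it."""
    for pokes in range(1, max_pokes + 1):
        if service.status(scan_id) == "complete":
            return pokes
        time.sleep(poke_interval)
    raise TimeoutError(f"{scan_id} not complete after {max_pokes} checks")


service = FakeWebscanService()
scan_id = trigger(service, ["example.nz"])
pokes = wait_until_complete(service, scan_id)
print(f"triggered {service.triggered} time(s), complete after {pokes} checks")
```

However many times the status is checked, the scan itself is started only once, which is the reason for keeping the two responsibilities in separate tasks.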
Cumulatively, the Airflow-relevant files for our classifier pipeline consist of four plugin files and one DAG file. Each plugin is responsible for a separate part of the ETL pipeline and the DAG file calls those plugins. These tasks align with the stages outlined in the RAP section above, and so each plugin calls a unique API where appropriate. The tasks for this classifier pipeline coordinated by this DAG are: loading a list of target domains, triggering the Webscan upon those domains, processing the content of that Webscan, and loading the processed data and making predictions.
Here is an example of our DAG file:
from datetime import datetime, timedelta

from airflow import DAG
# Custom operators and sensors provided by our plugin files
from airflow.operators.domain_list import GetDomainList
from airflow.operators.webscan import WebscanTrigger
from airflow.sensors.webscan import WebscanStatus
from airflow.operators.webscan_process import WebscanProcessingTrigger
from airflow.sensors.webscan_process import WebscanProcessingStatus
from airflow.operators.load_and_predict import LoadAndPredict
from airflow.operators.email_operator import EmailOperator

# Arguments applied to every task in the DAG
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'provide_context': True,
    'start_date': datetime(2019, 5, 7),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

detection_dag = DAG(
    'fake_web_shop_detection',
    default_args=default_args)

# Load the list of target domains
get_domain_list = GetDomainList(task_id='get_domain_list', dag=detection_dag)

# Start the webscan, then poll until it is complete
webscan_trigger = WebscanTrigger(task_id='webscan_start', dag=detection_dag)
webscan_completed = WebscanStatus(task_id='webscan_status', dag=detection_dag, poke_interval=2)

# Start processing of the scanned content, then poll until it is complete
processing_trigger = WebscanProcessingTrigger(task_id='webscan_processing', dag=detection_dag)
processing_completed = WebscanProcessingStatus(task_id='processing_status', dag=detection_dag, poke_interval=2)

# Load the processed data and make predictions
load_and_predict = LoadAndPredict(task_id='load_and_predict', dag=detection_dag)

# Send a notification once predictions are complete
predictions_completed = EmailOperator(
    task_id='predictions_completed',
    to='#########',
    subject='weekly fake webshops predictions completed',
    dag=detection_dag)

# Task ordering: each task runs only after the one to its left has succeeded
get_domain_list >> webscan_trigger >> webscan_completed >> processing_trigger >> processing_completed >> load_and_predict >> predictions_completed
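The `>>` chain at the end of the DAG file is what sets the order of the tasks. As a toy illustration of the idea, written without Airflow, the sketch below defines a minimal `Task` class whose `>>` operator records the downstream task and returns it, which is what allows several tasks to be chained in one expression. The class and the `run_order` helper are invented for this example and only handle a straight-line chain; the task ids are the ones used in the DAG file.

```python
class Task:
    """Toy stand-in for an operator: it only remembers what runs after it."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        self.downstream.append(other)
        # Returning the right-hand task is what lets a >> b >> c chain.
        return other


def run_order(first):
    """Walk a straight-line chain of tasks and list the ids in execution order."""
    order = []
    task = first
    while task is not None:
        order.append(task.task_id)
        task = task.downstream[0] if task.downstream else None
    return order


get_domain_list = Task("get_domain_list")
webscan_start = Task("webscan_start")
webscan_status = Task("webscan_status")
webscan_processing = Task("webscan_processing")
processing_status = Task("processing_status")
load_and_predict = Task("load_and_predict")
predictions_completed = Task("predictions_completed")

(get_domain_list >> webscan_start >> webscan_status >> webscan_processing
 >> processing_status >> load_and_predict >> predictions_completed)

print(run_order(get_domain_list))
```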
Future work
Using Airflow and Docker has allowed our team to quickly and repeatedly deploy machine-learning trained classifiers. We will use the structure outlined here to deploy more classifiers, always aiming to solve interesting business problems in .nz. In future, we will be using this structure to build tools for industry classification, and for the classification of parked domains and other forms of malicious domains.