Build Your Own OSINT APIs for Pen Testers

Authored by Forrest Kasler

For penetration testers, the first step in any external assessment is the reconnaissance phase, where we tend to rely heavily on open-source intelligence (OSINT) data sources and APIs. This blog post is all about the value of mining OSINT data ourselves and shows how to index very large datasets for quick searches. We will walk through an example of how to use NodeJS and MongoDB to convert data from Rapid7’s “Project Sonar” into a powerful OSINT API that lets us search for target domains and subdomains with basic search terms.

While there are some great OSINT APIs available for finding subdomains of our targets, the free tiers of these tools are often limited by the number of API calls per day, the number of results returned, or the request rate. We are also limited by the search capabilities of each API. For example, Whoxy searches are case sensitive, and HackerTarget only allows searches based on target domains and does not support basic string searches. To get around these limitations, we can build our own OSINT APIs and index the data to meet our search needs. After all, this is open source data: if someone else has mined it, we can do the same!

Here is the error we get when we try to search HackerTarget using a simple string:

By building our own OSINT API, we can create a database that allows simple string searches and get results like this:

And even some relevant data that we would not get from other OSINT APIs:

To get started, we are going to need some open source data. The traditional way of getting open source data is to mine it ourselves. With current technology, it is actually feasible to run basic scans against every public IPv4 address on the Internet in a reasonable timeframe with the right equipment and Internet connection. The major drawback of this approach is the additional work required: obtaining permission from your ISP, coordinating scans with abuse reporters and network maintainers, documenting your reasons for scanning, providing an opt-out for organizations, and managing packet rates to avoid denial-of-service (DoS) issues. There are some great tools and resources on the legal and technical sides of the DIY data mining approach:

https://blog.rapid7.com/2013/10/30/legal-considerations-for-widespread-scanning/

https://github.com/zmap/zmap/wiki/Scanning-Best-Practices

https://en.wikipedia.org/wiki/Reserved_IP_addresses

As an alternative, Rapid7 constantly scans the Internet and provides its scan results to researchers for free. This open source initiative is called Project Sonar:

https://opendata.rapid7.com/

Each month, they update massive datasets covering forward DNS records, reverse DNS records, service banners, common TCP and UDP port scans, HTTP/S responses, certificate information, and critical vulnerability scans. How massive? The forward DNS records for IPv4 alone are about 33 GB of gzipped data and expand to roughly 200 GB of text. In order to search this data quickly, we need to index it with some useful search terms. We’ll be using MongoDB since it is specifically designed for very large datasets. In addition, here are a few other reasons why MongoDB rocks for building your own OSINT APIs:

  • It’s Free!
  • “MongoDB’s horizontal, scale-out architecture can support huge volumes of both data and traffic.” – www.mongodb.com/why-use-mongodb
  • Extremely simple query language – Prototype Quickly!
  • Represent data with flexible objects
  • You do not have to define table structure at all to get started
  • Powerful indexing capabilities built-in

First, we need to download the dataset. Given the size of the data, you will likely want to pick up an external hard drive for this step. A terabyte drive should be adequate for our needs, and you can pick one up for about $50 these days. To save time downloading and unpacking the data, we can combine the two steps by piping the download straight through gunzip:
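Something along these lines works (a minimal sketch; the exact file name changes each month, so grab the current forward DNS “A” record link from opendata.rapid7.com, and point the output at wherever your external drive is mounted):

```bash
# Stream the archive straight through gunzip so the compressed copy never hits the disk.
# Replace the URL placeholder with the current fdns_a file listed on opendata.rapid7.com.
curl -L "https://opendata.rapid7.com/<path-to-current-fdns_a-file>.json.gz" \
  | gunzip > /mnt/external/fdns_a.json
```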

Once we have the data, we need to process it a bit before loading it into a MongoDB instance. We will need to remove an irrelevant timestamp and calculate some search terms for each record. The raw data from Project Sonar looks like this:
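Each line of the forward DNS file is a standalone JSON record containing a timestamp, hostname, record type, and value, for example (values illustrative):

```json
{"timestamp":"1621339205","name":"www.example.com","type":"a","value":"93.184.216.34"}
```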

After processing, our data will be transformed into JSON objects that look like this:
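That is, one record per line with the timestamp dropped and a “terms” array of searchable words added, roughly like this (values illustrative):

```json
{"name":"www.example.com","type":"a","value":"93.184.216.34","terms":["example"]}
```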

That “terms” attribute of each record will be used by MongoDB as a “Multikey Index”. We can use this feature to pre-define search terms, sort of like hashtags, on our dataset. In our case, we slice each DNS name into separate words, remove small words (e.g., com, net, edu, gov, club, etc.), and use the rest as searchable terms for each record. NodeJS has data-piping features that let us efficiently process such a large dataset. We will use a special pipe called a “Transform Stream” to modify each record and write it out to an intermediate file:
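Here is a minimal sketch of what that Transform Stream might look like. It assumes the unpacked file is named fdns_a.json and writes the processed records to fdns_a_processed.json; the word-filtering rules are just one reasonable interpretation of the approach described above:

```javascript
// process_fdns.js -- reads the unpacked FDNS file line by line, drops the
// timestamp, adds a "terms" array, and writes the result to an intermediate
// file that mongoimport can consume.
const fs = require('fs');
const { Transform } = require('stream');

// Short or overly common labels that would clutter the index.
const STOP_WORDS = new Set(['com', 'net', 'org', 'edu', 'gov', 'club', 'www']);

class FdnsTransform extends Transform {
  constructor() {
    super();
    this.leftover = ''; // partial line carried over between chunks
  }

  processLine(line) {
    if (!line) return '';
    try {
      const record = JSON.parse(line);
      // Slice the DNS name into words and keep the interesting ones as search terms.
      const terms = [...new Set(
        record.name.split(/[.\-_]/).filter((w) => w.length > 3 && !STOP_WORDS.has(w))
      )];
      return JSON.stringify({ name: record.name, type: record.type, value: record.value, terms }) + '\n';
    } catch (err) {
      return ''; // skip malformed lines rather than aborting a multi-hour job
    }
  }

  _transform(chunk, encoding, callback) {
    const lines = (this.leftover + chunk.toString()).split('\n');
    this.leftover = lines.pop(); // the last element may be an incomplete line
    for (const line of lines) {
      const out = this.processLine(line);
      if (out) this.push(out);
    }
    callback();
  }

  _flush(callback) {
    // Process any final line that did not end with a newline.
    const out = this.processLine(this.leftover);
    if (out) this.push(out);
    callback();
  }
}

fs.createReadStream('fdns_a.json')
  .pipe(new FdnsTransform())
  .pipe(fs.createWriteStream('fdns_a_processed.json'));
```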

Once we have processed the data, we can use MongoDB’s “mongoimport” tool to pull it into our database:
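A sketch of the import command, assuming the processed file from the previous step and a database and collection named “osint” and “fdns” (any names will do):

```bash
# mongoimport treats each line of the file as one JSON document.
mongoimport --db osint --collection fdns --file fdns_a_processed.json
```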

Next, we will need to apply indexes so that we can quickly search for records. For datasets this large, we will always want to perform this step last. The additional processing required to index the data as it is being imported would slow down the process to the point that it would never finish on standard hardware.
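The indexes themselves are just a couple of shell commands once the import completes. Creating an index on the “terms” array automatically gives us the multikey index described above, and an index on “value” lets us do reverse lookups by IP (database and collection names as assumed in the import step):

```javascript
// Run from the mongo shell.
db = db.getSiblingDB('osint');      // switch to the database we imported into
db.fdns.createIndex({ terms: 1 });  // multikey index over the "terms" array
db.fdns.createIndex({ value: 1 });  // supports reverse lookups by IP
```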

While we index the database, it is important to raise our terminal’s default limit on the number of open files. MongoDB structures its data using many linked files and may need hundreds open at a time during the import and indexing steps. To increase this limit, use the ‘ulimit’ command and then run the MongoDB daemon:
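For example (the data directory path here is just an assumed mount point on the external drive):

```bash
# Raise the open-file limit for this shell session, then start the daemon.
ulimit -n 64000
mongod --dbpath /mnt/external/mongo
```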

Finally, we need to expose our MongoDB instance with a simple API. We can use just a few lines of Node to achieve both forward and reverse lookups:
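Here is a minimal sketch of such an API. It assumes Express and the official MongoDB Node driver, and reuses the database, collection, and field names from earlier in this post:

```javascript
// api.js -- a small lookup API over the indexed FDNS collection.
const express = require('express');
const { MongoClient } = require('mongodb');

const app = express();
const client = new MongoClient('mongodb://127.0.0.1:27017');

async function main() {
  await client.connect();
  const fdns = client.db('osint').collection('fdns');

  // Forward lookup: hits the multikey index on "terms",
  // e.g. GET /forward/acme returns every record tagged with "acme".
  app.get('/forward/:term', async (req, res) => {
    const results = await fdns
      .find({ terms: req.params.term.toLowerCase() })
      .limit(1000)
      .toArray();
    res.json(results);
  });

  // Reverse lookup: every hostname that resolves to a given IP.
  app.get('/reverse/:ip', async (req, res) => {
    const results = await fdns
      .find({ value: req.params.ip })
      .limit(1000)
      .toArray();
    res.json(results);
  });

  app.listen(3000, () => console.log('OSINT API listening on port 3000'));
}

main();
```

With the server running, a forward lookup is as simple as `curl http://localhost:3000/forward/acme`, and a reverse lookup is `curl http://localhost:3000/reverse/93.184.216.34`.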

And the final results:

Reverse Lookups too 🙂

By taking this same approach with the other datasets in Project Sonar, you can create an OSINT API that is similar to the core features of Shodan, but without the rate limits. You can also apply this technique to other datasets like ASN data, password breaches, and WHOIS data.

But what good is all this data if we just get a bunch of flat lists? Wouldn’t it be better if we could visualize connections between data points and actually interact with our data? In a follow-up post, we will do just that using a tool called ScopeCreep! If you want to skip the blog and go straight to the tool, you can check it out on GitHub: https://github.com/fkasler/scope_creep

