Scribbles / IPv4 CIDR aggregation script

By Jason Spencer
First published: 11 March 2013
Last update: 11 March 2013
HISTORY

This is a script that'll take a list of CIDR IPv4 network addresses and generates a list of the largest subnets that are aggregates of the input networks.

For example: 128.0.0.0/7 130.0.0.0/8 131.0.0.0/8 will aggregate to 128.0.0.0/6.

A common use of this script is to take lists of country subnets and aggregate them to be used by firewalls, such as ipset and iptables.

You can grab the script here. The --re-interval GAWK switch must be used when executing the script.

It looks like this:

# CIDR_aggregate.awk by Jason Spencer 2013
# CIDR_aggregate.awk is free software: you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the
# Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
# CIDR_aggregate.awk is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
# See http://www.gnu.org/licenses/ for a copy of the GNU General Public License

# Usage: cat cidr2.txt|gawk --re-interval -f CIDR_aggregate.gawk > foo.out

$0 ~ /^([0-9]{1,3}\.){3}[0-9]{1,3}\/[0-9]{1,2}$/ {
	n=split($0,arr,/[.\/]/);
	if((n==5)&&isBoundedBy(arr[1],0,255)&&isBoundedBy(arr[2],0,255)&&isBoundedBy(arr[3],0,255)&&isBoundedBy(arr[4],0,255)&&isBoundedBy(arr[5],1,32)) {
		binstr=substr(binconv_(arr[1],8) binconv_(arr[2],8) binconv_(arr[3],8) binconv_(arr[4],8),1,arr[5]);
		nets[length(binstr),binstr]=1;
		next;
	}
}
$1 ~ /^#/ { next; }	# skip comment line
{ complain("Invalid CIDR in line " NR); }	# contents not recognised as CIDR
END {
	if(EXITCODE) exit EXITCODE;		# if an earlier pattern called exit (from complain()) then exit is stilled called anyway
	for(len=32;len>1;len--) {
		for(key in nets) {
			nres=split(key,out,SUBSEP);
			if(nres!=2) complain("Array inconsistency detected.");
			if(out[1]==len) {
				a=substr(out[2],1,len-1);
				x = a "0"; y = a "1";
				akey = len-1 SUBSEP a; xkey = len SUBSEP x; ykey = len SUBSEP y;
				if((xkey in nets)&&(ykey in nets)) {
#					if(akey in nets) print "WARNING: About to create aggregate that already exists: " printBinaryIPnet_(a) > "/dev/stderr";
					checkNoSupers_(nets,a);
					delete nets[xkey]; delete nets[ykey];
					nets[akey]=1;
					len++;
					break;
				}
			}
		}
	}
	for(key in nets) {
		nres=split(key,out,SUBSEP);
		if(nres!=2) complain("Array inconsistency detected.");
		else { print printBinaryIPnet_(out[2]); }
	} 
}
function printBinaryIPnet_(netstr,i_,len_,ss_,out_) {
	len_=length(netstr);
	out_="";
	for(i_=len_;i_<32;i_++) netstr = netstr "0";
	for(i_=0;i_<4;i_++) {
		ss_=substr(netstr,i_*8+1,8);
		out_ = out_ binstr2num(ss_);
		if(i_!=3) out_ = out_ ".";
	}
	out_ = out_ "/" len_;
	return out_;
}
function binconv_(decstr,minlen,out_) {
	out_=decstr2binstr(decstr);
	out_=makeRepeatString("0",minlen-length(out_)) out_;
	return out_;
}
function checkNoSupers_(nets,n_,i_){
	for(i_=length(n_);i_>0;i_--) {
		key_ = i_ SUBSEP substr(n_,1,i_);
		if(key_ in nets) print "WARNING " printBinaryIPnet_(n_) " collides with " printBinaryIPnet_(substr(n_,1,i_)) > "/dev/stderr" ;
	}
}
function isBoundedBy(n,l,u){ return ((n>=l)&&(n<=u)); }
function complain(msg){ print msg > "/dev/stderr"; EXITCODE=1; exit EXITCODE; }
function binstr2num(bin,i_,n_,l_) {
	l_=length(bin);
	n_=0;
	for(i_=1;i_<=l_;i_++) {
		n_ += ((substr(bin,i_,1)=="1")?1:0);
		n_ *= 2;
	}
	return n_/2;
} 
function decstr2binstr(decstr,out_,d_,b_){
	if(decstr=="") return "0";
	out_="";
	d_=strtonum(decstr);
	off_=1;
	while(d_>0){
		out_ = ((and(d_,1)==1)?"1":"0") out_;
		off_=lshift(off_,1);
		d_=rshift(d_,1);
	}
	return out_;
}
function makeRepeatString(c,len,out_) {
	for(out_="";len>0;len--) out_ = out_ c;
	return out_;
}

You use it like this:

root@wopr:~/ipset# head ca.zone
23.16.0.0/15
23.29.192.0/19
24.36.0.0/16
24.37.0.0/16
24.38.144.0/20
24.48.0.0/17
24.48.176.0/20
24.49.224.0/19
24.50.64.0/18
24.52.192.0/18
root@wopr:~/ipset# cat ca.zone gb.zone | gawk --re-interval -f CIDR_aggregate.gawk > master.zone
root@wopr:~/ipset# wc -l ca.zone gb.zone 
  6608 ca.zone
  5483 gb.zone
 12091 total
root@wopr:~/ipset# wc -l master.zone 
7604 master.zone
root@wopr:~/ipset#

The program itself is relatively simple - the bulk of it is from the ancillary functions which are mostly for string manipulation. The program works by first scanning all the input lines and validating them as valid CIDR network addresses, and then converting them into a character string of 1s and 0s and storing those strings in an array. The array is then scanned, in descending order of address length, and the existence of the network address which has the LSB flipped is tested. If the address is found then a new address which is one bit shorter is added to the list, the two longer addresses are removed, and testing continues.

It would be possible, I suppose, to make this program much shorter (and faster) by using the bit-wise operands of GAWK and storing the network addresses as integers (GAWK even has arbitrary precision integers), however, due to the various ways that the various versions of awk and gawk handle numbers and integers it was safer to store them as text strings of 1s and 0s - for reliability and portability.
Also, I realise that I don't actually flip the LSB and test it, rather I create two new addresses from a truncated address, but it was faster this way.
Ideally the program should form a binary tree structure and traverse that, but good luck creating one of those in AWK.
Oh, and before anyone asks, the reason this isn't written in Python is because GAWK is more widespread.

Stick a fork in me. I'm done.

Version history:

Mon 11 March 2013 21:35 UTC

initial version