[Warning to non-technical followers…keep on walking. This is another obscure hyper-technical post. Also, forgive the bizarre looking images, but it’s not worth the effort to force Tumblr to show them correctly.]
So…you, like me, have spent the last few weeks playing around with various Semantic Web triple stores, trying to figure out which is best suited for whatever particular quasi-mysterious application you’re trying to build.
After an exciting but awkward and unfulfilling first time with Sesame’s native repository and a briefly passionate but now apparently fizzled relationship with Neo4j, you’ve finally found that special store that you’re ready to settle down with, at least until it stops scaling gracefully or something faster and better documented enters your field of view.
Let’s say you’ve even built a bit of application code, and have a cute little toy process running on your laptop that executes SPARQL queries against a locally hosted server.
Now what?
Well, if you’re like me, you just might want to start showing off your ugly little duckling to those friends and family who don’t know enough about technology to laugh at the inadequacy of your architectural endeavor.
So, naturally, you’re going to want to make your application publicly accessible.
Now with a normal Django or Rails web app using a MySQL database, deploying your demo is a snap using platform-as-a-service (PaaS) solutions like Heroku, which let you publicly deploy your application from git by typing in about three lines of code. Heroku has even recently added basic support for Java, so you can build your app right out of Eclipse and onto a Heroku server.
But what if you’re using some adapted open source code that builds with Ant instead of Maven? And more importantly, what if your SPARQL server needs a 25GB data file to answer queries?
Well then, my friend, you’re pretty much out of luck on the PaaS side. As far as I can tell, you have two choices:
1) Use an infrastructure-as-a-service (IaaS) provider like Amazon, which lets you spin up your own cloud-hosted servers. If you want to build a robust, scalable, secure solution, this is probably the way to go. And if/when I almost inevitably move in that direction, I will try to write a blog post explaining how to do this. But it requires quite a bit of “upfront investment” in learning how AWS works and how to create, boot and administer a Linux server that can run your code. Plus, it costs money if you want to use a non-trivial amount of computation or transfer and store a non-trivial amount of data.
2) You can do things the old-fashioned way, and turn your home computer into a web server that can handle SPARQL requests from the open internet. This has a lot of disadvantages – it’s probably insecure as all hell, your laptop has to be turned on and connected to the internet for it to work, and if you end up getting any real traffic, you’re going to be clogging up your bandwidth and CPU cycles handling SPARQL requests (plus many ISPs forbid you from running servers at home).
But rolling your own has a few trump card advantages.
First, it’s relatively easy (at least if you don’t have to figure out how to do it, which is why I’m writing this guide.)
Second, it lets you run your server with basically no additional configuration or porting or data uploading or anything – if you can run a SPARQL query against localhost, you can use your server as a remote host.
Third, it’s (nearly) free. You may elect to pay for a dynamic DNS service that costs $30/year (though there are free alternatives), but everything else uses software/services you already have.
So, here’s how to do it:
STEP 1: Make sure you have the prerequisites
In theory, you can probably make this work with just about any computer and any internet connection. But for my purposes, I’m going to assume you have a configuration approximately similar to mine, i.e.:
OSX Lion
Running a SPARQL endpoint through a Tomcat server
Verizon FiOS or similar “always-on” internet connection, via a home router
STEP 2: Set up a static IP on your laptop
For this to work, you do NOT need a static IP from your ISP (which apparently costs extra). We’re going to use a service called “Dynamic DNS” that will let the internet find your network even when your ISP changes your IP address. But you do need a static IP on your laptop so that your router can figure out what to do with incoming traffic from the internet.
Here’s how to do this on Verizon FiOS if you have a standard Actiontec router. First, open your admin panel by going to 192.168.1.1 in your browser:
Enter your username / password (the default username is “admin” and the default password is, I think, the serial number of the router.) If you can’t remember your login, you can hard-reset the router by pressing the little reset button on the back for ten seconds. This will wipe your configuration, but these routers are pretty good at automatically setting themselves back up.
Now, inside your control panel, click “My Network”, then “Network Connections.” Find the entry for your local area network (in my case “Network (Home/Office)”), and click the little edit button in the rightmost column of the table. Scroll to the bottom and click “Settings”.
Now, find the line that says “End IP Address”. By default, this is set to something like 192.168.1.255. You need to set the last number to something less than 255 to give you some address space that’s not automatically assigned to devices connecting to your router. I set this to 192.168.1.100. Click “apply”.
For some reason, you can’t just give your laptop its own IP and expect the router to talk to it. So next we need to go into the “Advanced” heading on the router control panel and select “IP Address Distribution”.
Click “Connection List”. Then at the bottom of the table of connections click “New Static Connection”.
Type in a name for your laptop, the static IP address you want to use (should be something like 192.168.1.150), and the MAC address of your laptop. On OSX, you can find the MAC address by going to “System Preferences” -> “Network” -> “Wifi” -> “Advanced” -> “Hardware”. (I’m not going to show screens with my particular MAC and network details to try to make it slightly harder to hack me.)
Go back to your router control panel and click “Apply”. Now, your laptop should attach to the router using the IP address you specified. If it doesn’t, try refreshing your IP by going to “System Preferences” -> “Network” -> “Wifi” -> “Advanced” -> “TCP/IP” and clicking “Renew DHCP Lease”. If that doesn’t work, restart your computer.
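If you want to sanity-check the address you picked before typing it into the router, here is a minimal Python sketch. It assumes a 192.168.1.0/24 home network and a DHCP pool that starts at 192.168.1.2; that start address and the function name are my own placeholders, so substitute whatever your router actually shows.

```python
import ipaddress


def is_valid_static_choice(candidate, dhcp_start, dhcp_end,
                           network="192.168.1.0/24"):
    """Return True if candidate is a usable host address outside the DHCP pool.

    All arguments are dotted-quad strings. The default network is an
    assumption based on the 192.168.1.x addresses used in this post.
    """
    net = ipaddress.ip_network(network)
    ip = ipaddress.ip_address(candidate)
    start = ipaddress.ip_address(dhcp_start)
    end = ipaddress.ip_address(dhcp_end)

    # The address has to live on the home network at all.
    if ip not in net:
        return False
    # The network and broadcast addresses cannot be given to a device.
    if ip in (net.network_address, net.broadcast_address):
        return False
    # Anything inside the DHCP pool may be handed to another device.
    return not (start <= ip <= end)


if __name__ == "__main__":
    # Hypothetical pool of .2 through .100, static address .150
    print(is_valid_static_choice("192.168.1.150", "192.168.1.2", "192.168.1.100"))
```

With the End IP Address set to 192.168.1.100 as described above, 192.168.1.150 passes the check, while something like 192.168.1.50 would not.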
STEP 3: Get a dynamic DNS provider.
You know those DNS servers on the internets that make it so that you can type www.google.com into your browser, and your computer magically starts exchanging packets with the servers at Google’s IP address, and the Google homepage magically loads?
Well, you can use that same basic technology to get around the fact that your ISP gives you an ever-changing address on the internet. The trick is a dynamic DNS service, which gives you a standard “whatever.mysite.com” URL, and automatically handles the nasty business of routing anyone who visits that URL to your router’s IP address. There are free services that do this, but they’re harder to use so I’m just using a fairly slick service called DynDNS (www.dyndns.com).
They require you to sign up for a “Pro-Trial” account which will start charging you after 14 days, but you can apparently cancel the account after a few days and still use them to route to ~5 IP addresses. They’re pretty simple to set up, but this video (http://revision3.com/systm/dyndns) covers the signup/setup process in detail, so I’ll refer you to it instead of repeating. At some point in the process, you’ll need to enter your router’s current IP address, and download a small client to your computer that will let DynDNS know if your IP address changes.
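To make the idea concrete, here is a rough Python sketch of the kind of check an update client has to make: does the hostname still point at the address your ISP has currently given you? This is not how the DynDNS client is actually implemented (I haven’t looked), and the function name is made up. You supply the current public IP yourself, for example by reading it off your router’s status page.

```python
import socket


def dns_is_stale(hostname, current_public_ip, resolve=socket.gethostbyname):
    """Return True if hostname no longer resolves to current_public_ip.

    `resolve` is injectable so the check can be exercised without a
    network connection; by default it does a real DNS lookup.
    """
    try:
        return resolve(hostname) != current_public_ip
    except socket.gaierror:
        # The name does not resolve at all, so treat the record as stale.
        return True


if __name__ == "__main__":
    # Placeholder hostname and a documentation-range IP, not real values.
    fake_lookup = lambda name: "203.0.113.7"
    print(dns_is_stale("whatever.mysite.com", "203.0.113.7", resolve=fake_lookup))
```

If this returns True for your real hostname, the update client probably isn’t running, or the change hasn’t propagated yet.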
STEP 4: Set up port forwarding on your router.
Okay, so now the internet can find your router. But your router still needs to know what to do with incoming traffic. So if someone from the webs comes and gives the secret handshake, Mr. Router needs to send them to visit me. We do this with port forwarding.
Let’s go back to our router control panel. Click “Firewall Settings”. Click “Port Forwarding”. Pick your laptop out of the dropdown menu, and select “custom port” from the other menu. It seems that at least for me, Verizon blocks incoming traffic on port 80, the default HTTP port. But that doesn’t matter since it leaves high # ports unblocked. So just enter a random high number like 60000 under port. Click add.
AND THAT’S PRETTY MUCH IT.
Now, anyone who visits “whatever.yoursite.com:60000/whatever” can access that resource on your local machine.
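A quick way to see whether the forwarding took is to try opening a TCP connection to the port. Here’s a minimal Python sketch; the function name is mine and the hostname is the same placeholder as above. Two caveats: it only succeeds if something on the laptop is actually listening on that port, and some routers won’t route your own public address back to you, so it’s probably best run from outside your home network.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


if __name__ == "__main__":
    # Placeholder hostname; 60000 is the example port from the step above.
    print(port_is_open("whatever.yoursite.com", 60000))
```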
If you actually want to run a SPARQL endpoint, there’s a bit more work to do. So,
STEP 5 (OPTIONAL): Configure Tomcat to deal with remote traffic
Most of the triple stores I’ve experimented with run as applications inside a Tomcat or Jetty servlet instance. If you don’t have one of those set up, you’re in for some not-particularly-fun work that’s way beyond the scope of this post (though you can try this post for a walkthrough of how to get started with a simple Sesame instance).
If you do have a Tomcat server running on your computer, you need one more step to actually use it as a SPARQL endpoint. Tomcat by default will run on port 8080. We need to set it to run on whatever port we forwarded earlier (e.g. 60000) so that traffic coming in on that port will hit the server and get a response.
To do this, you need to edit the “server.xml” file inside of your Tomcat installation. For me, the path to the containing folder is: /usr/local/apache-tomcat-7.0.23/conf
Inside of server.xml, look for the block that says:
<Connector port="XXXXX" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443" />
And change XXXXX to whatever port you forwarded.
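If you’d rather script the change than hand-edit the file, something like the following Python sketch should work. It is just one way to do it, using the standard library’s XML parser, and the function name is my own. Be aware that this parser throws away XML comments when it reads the file, so keep a backup of the original server.xml.

```python
import xml.etree.ElementTree as ET


def set_http_connector_port(server_xml_text, new_port):
    """Return server.xml content with the plain HTTP connector moved to new_port.

    Only <Connector> elements with protocol="HTTP/1.1" and without
    SSLEnabled="true" are touched; AJP and SSL connectors are left alone.
    """
    root = ET.fromstring(server_xml_text)
    changed = 0
    for connector in root.iter("Connector"):
        is_http = connector.get("protocol") == "HTTP/1.1"
        is_ssl = connector.get("SSLEnabled", "false").lower() == "true"
        if is_http and not is_ssl:
            connector.set("port", str(new_port))
            changed += 1
    if changed == 0:
        raise ValueError("no plain HTTP/1.1 connector found")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # A cut-down stand-in for a real server.xml, for illustration only.
    sample = (
        '<Server><Service name="Catalina">'
        '<Connector port="8080" protocol="HTTP/1.1" '
        'connectionTimeout="20000" redirectPort="8443" />'
        '</Service></Server>'
    )
    print(set_http_connector_port(sample, 60000))
```

To use it for real, you’d read the contents of conf/server.xml, pass them through the function, write the result back, and restart Tomcat so it picks up the new port.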
And that’s actually it. Now when anyone sends a url-encoded SPARQL query to “whatever.yoursite.com:60000/sparql” or whatever the appropriate URL is, the server will send back an appropriate response!
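In case “url-encoded SPARQL query” sounds mysterious, here’s a small Python sketch of what a client does. It assumes your endpoint accepts the query as a `query` parameter on a GET request, which is the usual SPARQL protocol convention, but the exact path and the result formats on offer depend on your triple store. The hostname and function names are placeholders.

```python
import urllib.parse
import urllib.request


def build_sparql_url(endpoint, query):
    """Return endpoint with the SPARQL query url-encoded as ?query=..."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query})


def run_sparql(endpoint, query, timeout=30):
    """Send the query with an HTTP GET and return the raw response body."""
    request = urllib.request.Request(
        build_sparql_url(endpoint, query),
        # Ask for JSON results; your store may support other formats too.
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Placeholder endpoint; only builds the URL, no request is sent here.
    print(build_sparql_url(
        "http://whatever.yoursite.com:60000/sparql",
        "SELECT * WHERE { ?s ?p ?o } LIMIT 10",
    ))
```

Calling `run_sparql` with the same two arguments would actually send the request and hand back whatever the server returns.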
What this is useful for:
This is actually a pretty cool result in my opinion. While it is almost certainly not a good idea to run an openly accessible SPARQL endpoint off your home network because it could easily get hacked or flooded with traffic, you CAN use this method to make an endpoint available to trusted friends and rely on “security by obscurity”. As long as you don’t actually list the access address to your endpoint anywhere, you are probably not going to get bombarded with queries.
But the cooler result is that you can combine this architecture with a REST paradigm to build a fully publicly accessible application using a framework like Rails or Django, throw that up on Heroku, and route all the SPARQL queries to your server behind the scenes. If you have an always-on broadband connection and an old laptop lying around, you can throw Linux on the laptop, set up Tomcat, copy your data over there, and use that laptop as an always-on server to support your public-facing application.
That’s obviously not a scalable solution, but it is free, and way easier than trying to set up a whole AWS infrastructure. And if you’re only getting a handful of visitors to your public site each month, even an old laptop should be able to handle the traffic reasonably well.
Anyway, this walk-through is pretty configuration specific, but I imagine the process should be at least loosely analogous on any other setup. So hopefully this post will save you the trouble of figuring out how to do all this (which is definitely the hardest part). If you have any questions / problems / suggestions for how to do this better, just leave a comment or send me an email and I’ll try to help out or update the post!