Hackwatch was my final-year project at university, as part of my Maths and Computer Science course. Essentially, Hackwatch is the name given to my software and webpage, whose purpose is to automatically seek out, detect, retrieve and categorise cases of copycat journalism.
The difference between this project and the Databaked project is that this one has to deal with large amounts of data from unknown sites, which raises a whole heap of new challenges. A lot of work and research went into data mining and web content mining.
All data has to be verified and corrected before it can be stored, to guard against injection attacks, cross-site scripting and various other methods of hacking or tampering. The software then needs to be capable of automatically detecting what is and is not considered plagiarism or syndication, and of picking up quotations that have been used without attribution, along with various other elements.
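To give a feel for the detection step, here is a minimal sketch of one common way to score how much two articles overlap: split each into overlapping runs of words ("shingles") and compare the two sets with Jaccard similarity. This is an illustration only, not necessarily the technique Hackwatch uses. The class name `ShingleSimilarity`, its methods and the shingle size are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not taken from the Hackwatch code base.
class ShingleSimilarity {

    // Break text into overlapping runs of `size` consecutive words ("shingles").
    // Case and punctuation are ignored; this simple version only keeps a-z and 0-9.
    static Set<String> shingles(String text, int size) {
        String cleaned = text.toLowerCase(Locale.ROOT)
                             .replaceAll("[^a-z0-9]+", " ")
                             .trim();
        Set<String> result = new HashSet<>();
        if (cleaned.isEmpty()) {
            return result;
        }
        String[] words = cleaned.split(" ");
        for (int i = 0; i + size <= words.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + size)));
        }
        return result;
    }

    // Jaccard similarity: shared shingles divided by all distinct shingles.
    // Returns a value between 0.0 (nothing shared) and 1.0 (identical sets).
    static double jaccard(String a, String b, int size) {
        Set<String> first = shingles(a, size);
        Set<String> second = shingles(b, size);
        if (first.isEmpty() && second.isEmpty()) {
            return 0.0; // nothing to compare
        }
        Set<String> shared = new HashSet<>(first);
        shared.retainAll(second);
        Set<String> all = new HashSet<>(first);
        all.addAll(second);
        return (double) shared.size() / all.size();
    }

    public static void main(String[] args) {
        String original = "The minister said the plan would go ahead next year.";
        String suspect = "The minister said the plan would be delayed until next year.";
        System.out.println(jaccard(original, suspect, 3));
    }
}
```

A high score between two articles from different sites would flag the pair for a closer look. Telling plagiarism apart from legitimate syndication, or spotting unattributed quotations, needs further checks on top of a raw overlap score.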
The main software was written in Java and integrated into the PHP webpage via SSH, giving the operator full control of the software from anywhere in the world. It is important to note that the Java software is not set up as an applet; instead it runs silently in the background on a daily basis, except when the operator pushes an update from the webpage. The data is all stored in a MySQL database.
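A daily background run like this can be sketched with the standard `ScheduledExecutorService`. This is only one possible arrangement, assuming the scheduling lives inside the Java process rather than in something like cron. The class name `DailyCrawlScheduler`, its methods and the run time are hypothetical.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical scheduler for illustration; not taken from the Hackwatch code base.
class DailyCrawlScheduler {

    // Time from `now` until the next occurrence of `runAt` (later today or tomorrow).
    static Duration untilNextRun(LocalDateTime now, LocalTime runAt) {
        LocalDateTime next = now.toLocalDate().atTime(runAt);
        if (!next.isAfter(now)) {
            next = next.plusDays(1);
        }
        return Duration.between(now, next);
    }

    // Run `crawl` every 24 hours, starting at the next `runAt`.
    // The same executor can be given an immediate one-off task
    // (executor.execute(crawl)) when the operator pushes an update.
    static ScheduledExecutorService start(Runnable crawl, LocalTime runAt) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        long initialDelayMillis = untilNextRun(LocalDateTime.now(), runAt).toMillis();
        executor.scheduleAtFixedRate(
                crawl,
                initialDelayMillis,
                TimeUnit.DAYS.toMillis(1),
                TimeUnit.MILLISECONDS);
        return executor;
    }
}
```

Calling `DailyCrawlScheduler.start(task, LocalTime.of(3, 0))` would run `task` at 03:00 each day. Because the interval is a fixed 24 hours, the run time drifts by an hour across daylight-saving changes, which is usually acceptable for a background job.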









