1. Introduction.
Plagiarism Detector is a standalone Windows application
designed to detect and report cases of textual plagiarism
in multiple documents. This section covers the details
of the technology, that is, "the way it works".
! It is strongly recommended that you read the following
page before using the application !
2. Technology basics.
The core of the application consists of several important
parts (the 'working modules'):
- Graphical User Interface.
- Document Manager.
- Project Manager.
- File parser.
- Lexical text analyzer.
- CAPTCHA automation core.
- Profile Selector.
- Request forming and processing core.
- Reporting core.
The main operational sequence (what you do):
- You create a project.
- You add a number of documents.
- You set the properties of the newly created project.
- You start the analysis.
- When the analysis finishes successfully, you are prompted
to open the reports folder or to load the last analyzed
document into the browser.
The detailed sequence:
1. The application works with the notion of a 'Project'.
A Project is a unit consisting of
a number of documents (possibly with different locations,
types, sizes, authors etc.) and a number of project properties.
A Project is represented by a file with the ".pd"
extension, e.g. "MyProject.pd" or "Subgroup45a.pd", located
in the Project Folder.
Project Properties are the following values:
- Project File Location - the project file
name [complete filename + path]
- Project File Name - the project file name
[only the title]
- Project Profile - a predefined set of the
following two values:
  - Chain Length - Check Chain Length (the number of words
  joined together and checked against Google in one request)
  - Chain Step - Check Chain Step (the number of words to
  skip to form a new Check Chain)
To modify the current project settings, click the "Current
Project Properties" button on the main screen.
The following screenshot shows the Current Project
Properties screen:

The following diagram illustrates how the last two parameters
influence the behavior of the application:

The two search requests that are going to be sent to Google
according to this diagram are:
- "Mike likes to eat".
- "Sunday..."
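As a minimal sketch of how Chain Length and Chain Step could turn
a text into search requests, consider the Python fragment below.
It assumes that the Chain Step counts the words skipped between
the end of one check chain and the start of the next. The function
name build_check_chains, the whitespace tokenization and the sample
sentence are illustrative assumptions, not the application's
actual code.

```python
def build_check_chains(text, chain_length, chain_step):
    """Split text into check chains of chain_length words,
    skipping chain_step words between consecutive chains.
    (Hypothetical helper; the real lexical analyzer may differ.)"""
    if chain_length < 1 or chain_step < 0:
        raise ValueError("chain_length must be >= 1 and chain_step >= 0")
    words = text.split()  # assumption: plain whitespace tokenization
    chains = []
    position = 0
    # Only full-length chains are emitted; a shorter tail is dropped.
    while position + chain_length <= len(words):
        chains.append(" ".join(words[position:position + chain_length]))
        position += chain_length + chain_step
    return chains


sample = "one two three four five six seven eight nine ten eleven twelve"
for chain in build_check_chains(sample, chain_length=4, chain_step=2):
    # Each chain is printed in quotes, like the requests listed above.
    print('"' + chain + '"')
```

With a Chain Length of 4 and a Chain Step of 2, this sample yields
two chains; the words "five six" and "eleven twelve" are never
checked.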
The idea is that each of these two values, Chain Length and
Chain Step, pulls in two different directions:
- Big Chain Length - the Plagiarism Suspicion Degree
is going to be higher. Accidental hits are practically excluded.
- Small Chain Length - the Plagiarism Suspicion Degree
is going to be low. Accidental hits are expected.
- Big Chain Step - less time required for the document
analysis. Less detailed analysis.
- Small Chain Step - more time required for the document
analysis. More thorough analysis.
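The time trade-off can be made concrete with a rough count of
search requests. The sketch below assumes one request per full
check chain, and that the Chain Step counts the words skipped
between the end of one chain and the start of the next.
count_requests is a hypothetical helper and the word counts are
made-up inputs.

```python
def count_requests(word_count, chain_length, chain_step):
    """Estimate the number of search requests for a document,
    assuming one request per full check chain (an assumption)."""
    if chain_length < 1 or chain_step < 0:
        raise ValueError("chain_length must be >= 1 and chain_step >= 0")
    if word_count < chain_length:
        return 0
    # Each chain after the first starts chain_length + chain_step words later.
    stride = chain_length + chain_step
    return (word_count - chain_length) // stride + 1


# A hypothetical 1000-word document with a fixed Chain Length of 6:
for step in (0, 4, 14):
    print(step, count_requests(1000, chain_length=6, chain_step=step))
```

Under these assumptions a larger Chain Step means fewer requests
and a faster run, at the price of unchecked words between chains.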
What is Plagiarism Suspicion Degree?
Every time Plagiarism Detector runs into
a Plagi-Hit (for more details on this see
Alive Reports), a place suspected of plagiarism
is recorded. To state definitively that it is true plagiarism,
you must check the occurrence manually. But the degree of
suspicion is higher when the word chain is longer. To
illustrate this, let's take two different examples:
"I am free" - this check chain
consists of 3 words. That is, the Chain Length is small:
its value is '3'. The number of occurrences of this word
sequence on the Internet will be fantastically high. So you
may not speak about True Plagiarism here.
"I am free after the exact midnight today!"
- this check chain consists of 8 words. That is,
the Chain Length is big: its value is '8'. The number of
occurrences of this word sequence on the Internet will
be... zero. So here you may speak about True Plagiarism:
in case this sequence was marked as a Plagi-Hit, you can be
98% sure that this passage was taken from some source on
the web.
3. Extremely Important Assessment Criteria
You may ask a logical question: how do I know
that the text is plagiarized?
The answer is the following: check the Top 5
section in the Originality Report.
Two Originality Report examples:
True Plagiarism:

Truly Original:

The explanation is pretty obvious. If a sample of 10
sentences is taken from the web and included in
the analyzed document, the application will immediately
react by increasing the Frequency Link counter.
After the Google-based analysis finishes, all the harvested
URLs are accumulated into the so-called Url Stack
and their Occurrence Frequency is counted. The Top
5 is actually the top of the Url Stack,
that is, the URLs most frequently linked with a
suspicion of plagiarism.
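The counting step can be sketched in Python with
collections.Counter. The harvested URLs below are invented
placeholders and top_sources is a hypothetical name; the sketch
only shows the frequency counting described above, not the
application's internal Url Stack.

```python
from collections import Counter


def top_sources(harvested_urls, limit=5):
    """Count how often each URL was harvested and return the
    most frequent ones as (url, frequency) pairs."""
    return Counter(harvested_urls).most_common(limit)


# Placeholder data: one URL per search hit (invented for illustration).
harvested = [
    "http://example.com/essay",
    "http://example.org/notes",
    "http://example.com/essay",
    "http://example.net/blog",
    "http://example.com/essay",
    "http://example.org/notes",
]
for url, frequency in top_sources(harvested):
    print(frequency, url)
```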
The core idea is that having even 2
(!!!) accidental links to 1 (out of billions!) document
on the web is, for all practical purposes, IMPOSSIBLE.
It is possible that some small word sequence (2-3 words)
can be found in thousands of documents on the web, but
if any other word sequence in the checked
document links to the same source... It's plagiarized!
Google will practically never return the same source for
two unrelated word sequences by accident. Still, make 3
your Plagiarism Barrier, just to be
on the safe side.
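Applying the Plagiarism Barrier amounts to a simple threshold on
the Occurrence Frequency. In this sketch the frequencies are
made-up numbers, and both the name suspicious_sources and the
inclusive comparison (a frequency of 3 already counts) are
assumptions.

```python
PLAGIARISM_BARRIER = 3  # the value suggested in the text


def suspicious_sources(url_frequencies, barrier=PLAGIARISM_BARRIER):
    """Return the (url, frequency) pairs whose frequency reaches the
    barrier, most frequent first. Reaching the barrier counts
    (an assumption)."""
    flagged = [(url, n) for url, n in url_frequencies.items() if n >= barrier]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Invented frequencies for illustration only.
frequencies = {
    "http://example.com/essay": 7,
    "http://example.org/notes": 3,
    "http://example.net/blog": 1,
}
print(suspicious_sources(frequencies))
```

Sources that reach the barrier are candidates for the manual check,
not proof of plagiarism by themselves.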