I long ago threatened to replace you with a small shell script. That time has come. This isn’t a discussion of AI, but a few simple ways to generate pseudo-random English. The first part of this project documents a relatively old trick for replacing someone with a script. The idea is to write a function, then port that function to various languages to show how easy it is to produce a bit of text in the language that someone else uses.
It should be noted that the function does not generate NEW text; it merely reuses text that a person has already written. For best results you need a lot of text written by one person: mixing and matching people, and indeed styles, will produce a very confused response.
Method
The system works from a corpus written by the target, a “length” n, and a starting phrase that is at least n characters long.
The output has to be buffered slightly: you start with the starting phrase, take its last n characters (including punctuation), search for every place that sequence occurs in the corpus, and add the character that follows each occurrence to an array.
Once all the occurrences have been found, if the array is not empty, a random letter is picked from it. This is where the corpus actually plays its part: if the writer uses particular sequences of letters more than others, the probability is affected, because the array may contain multiple instances of some letters. (If the length is 3 and the last three characters are “ th”, then the likelihood is that the array will contain something like {e, e, e, a}, so the system is likely to pick an e.)
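The lookup-and-pick step can be sketched in a few lines. This is a minimal Python sketch of the description above, not the script used for this post; the function names `followers` and `pick_next` and the sample sentence are made up for illustration.

```python
import random

def followers(corpus, tail):
    """Return every character that directly follows an occurrence of tail."""
    found = []
    pos = corpus.find(tail)
    while pos != -1:
        nxt = pos + len(tail)
        if nxt < len(corpus):  # a match at the very end has nothing after it
            found.append(corpus[nxt])
        pos = corpus.find(tail, pos + 1)  # step by one so overlaps are found
    return found

def pick_next(corpus, tail, rng=random):
    """Pick one follower at random; letters that occur more often are more likely."""
    options = followers(corpus, tail)
    if not options:
        return None  # nothing to continue with
    return rng.choice(options)

sample = "by the way the dog saw the cat on that mat"
print(followers(sample, " th"))  # ['e', 'e', 'e', 'a']
print(pick_next(sample, " th"))  # usually 'e', sometimes 'a'
```

Keeping the duplicates in the list is the whole trick: picking uniformly from a list with three e's and one a is the same as picking e with probability 3/4.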
You now append that letter to your output, and go again, still only using the last n letters. For example:
This is a sample corpus that the test can write and think about without the test failing in its demonstration of good use. Lets see what happens when the test gets given the corpus with a length of just two and a starting word of “th” and whether it will work.

Start with the phrase “th” and search the corpus for every “th”. The search is case-sensitive here, so “This” does not count. In the lists below, _ stands for a space.
Select from: { a, e, i, o, e, e, e, _, _, ”, e } Selects an “e”
Phrase now: “the”, last 2 letters “he”, search:
Select from: { _, _, n, _, _, t, r } Selects an “_” (space)
Phrase now: “the ”, last 2 letters “e ”, search:
Select from: { c, t, a, t, w, t, c } Selects a “w”
Phrase now: “the w”, last 2 letters “ w”, search:
Select from: { r, i, h, h, i, o, h, i, o } Selects an “o”
Phrase now: “the wo”, last 2 letters “wo”, and continue…
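The walkthrough can be replayed mechanically. The sketch below is a Python illustration, not the script from this post: it prints the candidate list for each two-letter tail and then applies the same picks as the walkthrough instead of choosing at random. The helper name `followers` is hypothetical, and the search is case-sensitive.

```python
corpus = (
    "This is a sample corpus that the test can write and think about "
    "without the test failing in its demonstration of good use. "
    "Lets see what happens when the test gets given the corpus with "
    "a length of just two and a starting word of “th” and whether "
    "it will work."
)

def followers(corpus, tail):
    """Characters that follow each occurrence of tail (case-sensitive)."""
    n = len(tail)
    return [corpus[i + n] for i in range(len(corpus) - n)
            if corpus.startswith(tail, i)]

phrase = "th"
for chosen in ["e", " ", "w", "o"]:  # the picks made in the walkthrough
    tail = phrase[-2:]
    options = followers(corpus, tail)
    print(repr(tail), "->", options)
    if chosen not in options:
        raise ValueError(f"{chosen!r} never follows {tail!r} in the corpus")
    phrase += chosen

print(phrase)  # the wo
```

Replacing the fixed list of picks with a random choice from `options` turns this replay into the actual generator.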
Note that the array can only be empty in one of two circumstances: 1) the starting phrase does not occur in the corpus, or 2) the only instance of the phrase is the last n letters of the corpus. Also note that the search can either respect or ignore case, and likewise punctuation. For more realism, you need punctuation.
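Both empty cases, and the effect of ignoring case, are easy to see on a toy input. This is an illustrative Python sketch; the helper `followers` and the tiny strings are assumptions chosen to make each case obvious.

```python
def followers(corpus, tail):
    """Characters that follow each occurrence of tail in corpus."""
    n = len(tail)
    return [corpus[i + n] for i in range(len(corpus) - n)
            if corpus.startswith(tail, i)]

toy = "abcabd"
print(followers(toy, "xy"))  # [] : the phrase never occurs
print(followers(toy, "bd"))  # [] : the only match is the end of the corpus
print(followers(toy, "ab"))  # ['c', 'd']

# Ignoring case: lower-case both sides before searching.
text = "The theme"
print(followers(text, "th"))          # ['e'] : "The" is skipped
print(followers(text.lower(), "th"))  # ['e', 'e']
```

Lower-casing gives the search more matches to draw from, at the cost of output that never contains capital letters.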
Now, the big question you’re all asking is “does this really work?” The script to find out is at the end of this post; in it I have adapted the formula slightly. The system takes the phrase compiled into it, and it uses newline characters to mark the end of a sentence. Here’s where some problems start coming up. The corpus should preferably be long, with a lot of variety in phrases and so on. The length of the search pattern is also important: the longer it is, the more precise and less flexible the output; the shorter it is, the more random. Above I used a length of two, which is very unsuitable because there is no connection between words: once a space is introduced, the starting letter of the next word is based only on the last letter of the previous one. Certainly lots of enhancements to this algorithm can be written.
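For readers who would rather not start from the C#, the whole loop, including the stop-at-newline rule, might look like the following in Python. Treat it as a sketch under assumptions: the function name `generate`, the `max_chars` safety cap and the two-line sample corpus are not part of the original script.

```python
import random

def generate(corpus, start, n, rng=None, max_chars=500):
    """Extend start one character at a time until a newline is picked,
    no continuation exists, or max_chars is reached (an added safety cap)."""
    if len(start) < n:
        raise ValueError("the starting phrase must be at least n characters long")
    rng = rng or random.Random()
    out = start
    while len(out) < max_chars:
        tail = out[-n:]  # only the last n characters are ever used
        options = [corpus[i + n] for i in range(len(corpus) - n)
                   if corpus.startswith(tail, i)]
        if not options:
            break  # dead end: nothing follows this tail
        ch = rng.choice(options)
        if ch in "\r\n":
            break  # a newline marks the end of the sentence
        out += ch
    return out

corpus = "the cat sat on the mat\nthe dog sat on the log\n"
print(generate(corpus, "the ", 4))
```

With a corpus this small and a length of 4, the output can only ever be a splice of the two lines, which is a miniature version of the problem discussed further down.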
But does it WORK, I hear you cry. Well, what I did (thinking I would be very smart and clever) is take all the post written to the end of the next sentence, and run it through the system with a starting phrase of “But” and a length of 3. When it hits a new line, it will end, and I will post the entire thing verbatim as it comes out of the system here (and be warned: the test sentence was repeated several times in the post as it was fed in, which will skew the results), but here we go!
But’ and slightly. (if a functuatime long phrase, bufferelation an eith of just 2 letterson AI, but write array base corpus wheith a space you take test the letten the no of sequence of 3. When by old takes is part. The is very uses this test characterson of 3. Well work.
If we take the same text and increase the length to 4 (and, of course, change the starting phrase to “But ”), then we get the following:
But does new line, it with a length, the come. The initial parting <strong>out two and whether is probability instead a script to show how how how easy it to pick to picked from it to be buffered style warned, you takes text sequence occasion of “the test can <strong> w</strong>e </strong>ork.
Note that the length of 4 shows a much more “English” set of words. The random snippets of HTML in there come from the actual design. Increasing the length to 5, 6 and 7 gives the following three paragraphs:
But does not occurs in the lenght of the array. (if the array can be written.
But does not occur, and go again, still only instance of this is a sample corpus that sequence occur, and I will skew the results) but here we go!
But does it WORK I hear you cry. Well what I did (thinking I would be very smart and clever) is take <b>all</b> the post written to the end of this project is to generate a function does not occur, and 2) the only instance of the phrase is the length of 3. When it hits a new line chars to mark the entire thing verbatim as it comes out of this post, In this I have long as the languages to show how easy it is to document a relatively old trick to replace someone with a length is 3 and this is where the corpus with a length of search:
Hopefully you will now see the problem: the length of 7 produces the clearest words, in sentences that are very clear English, but they are snatches taken verbatim from the first half of the post. With longer sequences, the average number of available options decreases. There has to be a trade-off between the number of choices available to the script and the readability of the output it produces.
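One rough way to put a number on that trade-off is to count, for each length, how many different characters can follow each sequence in the corpus. The Python sketch below does that; the name `average_options`, the choice to count distinct followers rather than all of them, and the sample text are assumptions made for illustration.

```python
from collections import defaultdict

def average_options(corpus, n):
    """Average number of distinct next characters per n-character sequence."""
    nxt = defaultdict(set)
    for i in range(len(corpus) - n):
        nxt[corpus[i:i + n]].add(corpus[i + n])
    if not nxt:
        return 0.0  # corpus too short for this length
    return sum(len(chars) for chars in nxt.values()) / len(nxt)

text = (
    "the cat sat on the mat and the dog sat on the log "
    "while the other cat thought about the other dog"
)
for n in range(1, 8):
    print(n, round(average_options(text, n), 2))
```

The closer the average gets to 1, the more often the script has exactly one way to continue, and the more its output becomes a straight copy of the corpus.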
And now for the source code:
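using System;
using System.Collections.Generic;

class Program {
    static Random r = new Random();

    static void Main(string[] args) {
        // Either \r or \n in the corpus will cause the paragraph to come to an end.
        String corpus = @"A Long String in here with newlines";
        // The starting phrase must be at least "length" characters long and must appear in the corpus.
        String startup = "A Lon";
        int length = 5;
        Console.Write(startup);
        bool alive = true;
        // Only the last "length" characters are used to search the corpus.
        string start = startup.Substring(startup.Length - length);
        while (alive) {
            try {
                char next = lookCorup(corpus, start);
                if (next == '\n' || next == '\r') {
                    alive = false;
                }
                // Slide the window along by one character.
                start = start.Substring(1) + next;
                Console.Write(next);
            } catch (Exception) {
                // No continuation was found, so stop.
                alive = false;
            }
        }
        Console.ReadKey();
    }

    // Collects every character that follows "start" in the corpus and returns one of them at random.
    private static char lookCorup(string Corpus, string start) {
        List<char> next = new List<char>();
        // Ordinal comparison matches character by character, independent of culture settings.
        int last = Corpus.IndexOf(start, 0, StringComparison.Ordinal);
        while (last != -1) {
            // A match at the very end of the corpus has no following character.
            if (last + start.Length < Corpus.Length) {
                next.Add(Corpus[last + start.Length]);
            }
            last = Corpus.IndexOf(start, last + 1, StringComparison.Ordinal);
        }
        if (next.Count > 0) {
            return next[r.Next(0, next.Count)];
        } else {
            throw new Exception("End of Line");
        }
    }
}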