n***@gmail.com
2012-09-20 18:51:28 UTC
My use of this utility would be to delete the inevitable
repeated/redundant 'packing' that we get when we
fetch HTTP texts and append them to accumulatorFile.
Each URL-fetch appends its 3 parts:
u : the URL
A : the contents
| : the one-line separator: "<><><>".
A sequence of 4 such triplets can be shown as:
uA|vB|wC|xD|
where
u,v,w,x are the 1-line URLs
A,B,C,D are the multiline text-blocks [fetched by lynx/links]
and | are the one-line-separators "<><><>".
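A minimal sketch of that accumulation step in Python, assuming
lynx is on the PATH (lynx -dump renders a page to plain text, as
the post suggests) and that the file name and separator are as
above; the function name fetch_and_append is mine:

import subprocess

SEP = "<><><>"

def fetch_and_append(url, acc_path="accumulatorFile"):
    # Render the page as plain text via lynx -dump.
    text = subprocess.run(["lynx", "-dump", url],
                          capture_output=True, text=True,
                          check=True).stdout
    with open(acc_path, "a") as acc:
        acc.write(url + "\n")        # u : the URL (one line)
        acc.write(text)              # A : the contents (multiline)
        if not text.endswith("\n"):
            acc.write("\n")
        acc.write(SEP + "\n")        # | : the one-line separator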
But now the contents A,B,C,D will contain common,
repeated/redundant text, which usually is near the
page-beginning, but must be handled anywhere in the page.
So with the garbage represented as 8, it looks like:
u8A|v8B|wC8|x8D|
The algorithm that I see is:
the human starts reading/editing accumulatorFile,
and notices some repeated/redundant garbage,
which he pastes out to FileH;
and then the program does:
scan accumulatorFile, and delete all copies of the
text-block H, except the first copy.
That shouldn't be difficult, but the following refinement
is also required (see the sketch after the example):
for matching purposes, ignore all bracketed numbers
("[" followed by digits and "]") and all spaces and tabs.
So the following 2 lines should match:
the [4] cat sat on the mat
the [27] cat sat on the mat
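A minimal sketch of that program in Python, assuming the block
in FileH is matched line-by-line after normalization (dropping
"[digits]" tags, spaces and tabs); the names normalize and
delete_repeats are mine:

import re

BRACKET_NUM = re.compile(r"\[\d+\]")   # matches [4], [27], ...

def normalize(line):
    # For matching only: drop "[digits]" tags, spaces and tabs.
    line = BRACKET_NUM.sub("", line)
    return line.replace(" ", "").replace("\t", "")

def delete_repeats(acc_path, h_path):
    block = [normalize(l) for l in
             open(h_path).read().splitlines()]
    lines = open(acc_path).read().splitlines()
    norm = [normalize(l) for l in lines]
    out, i, seen, n = [], 0, False, len(block)
    while i < len(lines):
        if n and norm[i:i+n] == block:
            if not seen:                 # keep the first copy only
                out.extend(lines[i:i+n])
                seen = True
            i += n                       # skip past the whole block
        else:
            out.append(lines[i])
            i += 1
    open(acc_path, "w").write("\n".join(out) + "\n")

Used as, e.g., delete_repeats("accumulatorFile", "FileH").
With this normalization, "the [4] cat sat on the mat" and
"the [27] cat sat on the mat" both reduce to
"thecatsatonthemat", so they match as required.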
== TIA.
PS: by just pasting the next round of URLs to a file,
for their pages to be appended to your <book>,
you can collect and MANAGE some of the good
education available.