Developing a Niche Online-Content Indexing System?
tebee writes "One of my hobbies has benefited for 20
years or so from the existence of an online index to all magazine
articles on the subject since the 1930s. It lets you list the
articles in any particular magazine or search for an article by
keyword, title or author, refining the search if necessary by
magazine and/or date. Unfortunately the firm which hosts the
index has recently pulled it from its website, citing security
worries and incompatibilities with the rest of their e-commerce
website: the heart of the system is a 20-year-old DOS program! They
have no plans to replace it as the original data is in an unknown
format. So we are talking about putting together
a team to build an open source replacement for this – probably using
PHP and MySQL. The governing body for the hobby has agreed to host
this and we are in negotiations to try and get the original data. We
hope that by volunteers crowd-sourcing the conversion, we will be
able to do what was commercially impossible." Tebee is looking for ideas about the best way to go about this, and for leads to existing approaches; read on for more.
tebee continues:
"It occurs to me that there could be
existing open-source projects that do roughly what we want to do —
maybe something indexing academic papers. But two days of trawling
through script sites and googling has not produced any results.
Remember that here we only point to the original article; we don't have the text of it online, though it has been suggested that we expand to do this. Unfortunately I think copyright considerations will prevent us from doing so, unless we can get our own version of the Google Books agreement!
So does anyone know of anything that will save us the effort of writing our system or at least provide a starting point for us to work on?"
Sphinx or Lucene (Score:3, Informative)
Or did I misunderstand the question?
Re:Sphinx or Lucene (Score:4, Informative)
Yes, you did misunderstand.
We do not have the full text of the article online; all we have is its title, author and some manually created keywords. It's necessary to have access to the physical magazine to read the content of the article, but this is a hobby (model railroading) where many clubs and individuals have vast libraries, often spanning 5 or 6 decades of monthly magazines.
All the solutions I could find seemed to be based, like those two, on indexing the text of the articles.
It would be much better if we did have the text as well, but as I said there is the minor problem of copyright. The fact that the index has been run for the last 10 years by the major (dead tree) publisher in this field has also discouraged development in this direction.
Re:Sphinx or Lucene (Score:4, Interesting)
If this isn't what you have in mind, please elaborate.
Re:Sphinx or Lucene (Score:4, Interesting)
If you have relatively little but highly structured data, running it through a general search engine like Lucene or Sphinx doesn't seem like the ideal solution, because it doesn't make it easy to do structured queries ("give me all articles in Magazine including 'foo' in the title, published between 1950 and 1966").
A bibliography indexer would probably be a better choice. Two good free ones are refbase [refbase.net] and Aigaion [aigaion.nl]. Both are targeted mainly at databases of scientific literature, though, so they might need some tweaking for this purpose.
Re: (Score:3, Interesting)
Like someone else pointed out, though, if at some point he expects to get access to the full text or even just scans of the articles, he'd better have chosen a system that can easily expand to
Re: (Score:2)
It's really not a text indexing problem, unless you are going to throw out the RDBMS and use a flat text file.
If you use a relational database, then it is a 3-table problem at most: articles, sources, and articles-to-sources. If you can join those, you have the core of a classic content management system.
From what I gather, they haven't even gotten that far. It is just a master index of articles that are available (which point to nothing in particular), so it is a 1-table problem.
For 1-table problems I gen
Re: (Score:2)
Couldn't you just scan and OCR the magazines and then use that data to compile a searchable database? You could even supply extracts from the articles where the search terms appear, similar to what Google does. By not presenting the full text of the article I think you would be on safe ground copyright wise.
It would also be a good way of archiving old paper documents which can degrade over time. I'm not sure what copyright terms are in your country but some of those magazines might be in the public domain a
Re: (Score:1)
Not being up to speed on current open source that might prevent premature wheel re-invention, my answer would be 'No'. That said, I don't see any particular trouble with the project itself. If I understand correctly, you have bare-bones bibliographic information that you want to create an online index of. The notion of PHP and MySQL seems sound, although I suspect that Perl would work as well if not better, depending on the knowledge of your volunteer talent. I expose my bias here when I point out that text an
Re: (Score:1)
Just noticed a thread on Hacker News on http://www.gotapi.com/html which might be of interest...
Re: (Score:2)
They have no plans to replace it as the original data is in an unknown format.
Well there aren't that many obvious candidates... any of these [d2ca.org] look familiar?
Re:Sphinx or Lucene (Score:4, Informative)
So let me get this straight: This is a single table? You have one table (spreadsheet), where each row represents one article. The columns would be title, author, and either five or so columns of keywords, or a single varchar column that would hold them all (comma-delimited or whatever).
Then you need the standard row_id and whatever other crufty columns creep in. If this is all you need, you can do this in Excel (har har). Or install MySQL, create the table (we'll call it mr_article_list), then write the standard php scripts to add, edit, delete, and retrieve entries.
These scripts are basically just web forms that pass through the entered values into the database. You're talking a single code page for each of the inputs, and then a page each for the output/result, or 8 pages total.
For example, the mr_add.php script (mr_ stands for model railroad) retrieves a new row_id from the db. Then it presents a web form with input fields for the title, author, and keywords. Then it does db_insert(mr_article_list, $title, $author, $keywords). Then it calls mr_add2.php, which reports either success or failure.
The edit, delete, and retrieve scripts are similarly simple. All you need is a linux box to do this, and the basic scripts could be written in two evenings (or one long one) - assuming you hired someone who does this for a living.
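For concreteness, here's a minimal sketch of that table and the INSERT that mr_add.php would run (MySQL syntax; column sizes and the sample values are just guesses):

CREATE TABLE mr_article_list (
  row_id   INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  title    VARCHAR(255) NOT NULL,
  author   VARCHAR(128) NOT NULL,
  keywords VARCHAR(255) NULL  -- comma-delimited, as described above
) ENGINE=InnoDB;

-- what mr_add.php effectively executes once the form is submitted
-- (AUTO_INCREMENT hands out the new row_id for you)
INSERT INTO mr_article_list (title, author, keywords)
VALUES ('How to weather a boxcar', 'J. Smith', 'weathering,boxcar,paint');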
Now this is where it gets interesting:
>many clubs and individuals have vast libraries often spanning 5 or 6 decades of monthly magazines
Do you want to store this information as well, so that people know who to call to get the issue? I assume this would be the real useful feature. So now you need a second table, mr_sources, which is basically a list of clubs/people, so the columns in this table would be like row_id, name, address, phone number (standard phone book shit).
Then you need a third table, mr_article_sources, which is real simple; it just matches up the rows in the article list to the rows in the source list. Its columns are simply row_id, article_row_id, source_row_id. This is a long and narrow table that cross-indexes the two shorter, fatter tables (the list of articles, and the list of sources).
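In the same sketchy MySQL style, the two extra tables might look like this (columns taken from the description above; sizes are assumptions):

CREATE TABLE mr_sources (
  row_id  INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name    VARCHAR(128) NOT NULL,
  address VARCHAR(255) NULL,
  phone   VARCHAR(32)  NULL
) ENGINE=InnoDB;

CREATE TABLE mr_article_sources (
  row_id         INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  article_row_id INT NOT NULL,
  source_row_id  INT NOT NULL,
  FOREIGN KEY (article_row_id) REFERENCES mr_article_list (row_id),
  FOREIGN KEY (source_row_id)  REFERENCES mr_sources (row_id)
) ENGINE=InnoDB;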
Example, article_id #19 is "How to shoot your electric engine off the tracks in under three seconds." Source_id #5 is Milwaukee Railroad Club, #7 is San Jose Railroad Surfers, and #9 is Bill Gates Private Book Collection. All three of them have this article. So your cross-index table would look like this:
01 19 05
02 19 07
03 19 09
When you search for article #19, it finds sources 5, 7, and 9 in the cross-index table, then queries the source table for the names and phone numbers of those three clubs (and displays them).
Finally, if you're wondering how to query three different tables at the same time, well, databases were made to do exactly this.
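For example, the lookup just described is a single query (a sketch, using the table names above):

-- who has article #19? walk the cross-index into the source table
SELECT s.name, s.phone
FROM mr_article_sources x
JOIN mr_sources s ON s.row_id = x.source_row_id
WHERE x.article_row_id = 19;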
No, no, NO! (Score:4, Insightful)
Your suggestions make sense, but suggesting to store comma-delimited plain text in a SQL table is wrong by any and all database standards & best practices. You fail to reach even first normal form.
Read http://en.wikipedia.org/wiki/Database_normalization [wikipedia.org]
You want to define a table "article_tags" or something with id, article_id, name, comment. Make the combination of article_id and name unique. (A DDL sketch follows the list below.)
* id is on auto-increase, not NULL
* article_id is a foreign key to the id of the article, not NULL
* name is the name of the tag, not NULL
* comment is an optional comment explaining the tag (for example in the mouse-over or on the site listing everything with that tag), may be NULL
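As a concrete sketch (MySQL; the column sizes are assumptions, and article_id points at whatever you named your articles table):

CREATE TABLE article_tags (
  id         INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  article_id INT NOT NULL,          -- foreign key to your articles table
  name       VARCHAR(64)  NOT NULL, -- the tag itself
  comment    VARCHAR(255) NULL,     -- optional explanation for mouse-overs
  UNIQUE KEY uq_article_tag (article_id, name)
) ENGINE=InnoDB;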
Not only is that easier to maintain in the long run (think of parsing plain text out of a VARCHAR. argh!), but all of a sudden, you have the data you _store_ available to _access_.
How many articles are tagged electric? SELECT COUNT(1) FROM article_tags WHERE name = "electric";
Give me a list of all articles relating to foo and bar? SELECT article_id FROM article_tags WHERE name = "foo" OR name = "bar";
etc pp.
If you want to go really fancy with multi-level tags, replace article_id with parent_id (referring to the id in the same table) and create a relation table as glue. If you want all upper levels to apply, throw in a transitive closure:
http://en.wikipedia.org/wiki/Transitive_closure [wikipedia.org]
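To sketch that (this is the standard closure-table pattern, with all names assumed; article_tags here becomes the glue table, holding article_id/tag_id pairs instead of the tag text):

CREATE TABLE tags (
  id        INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  parent_id INT NULL,     -- refers to tags.id; NULL for a top-level tag
  name      VARCHAR(64) NOT NULL
) ENGINE=InnoDB;

-- one row per (ancestor, descendant) pair, including each tag paired
-- with itself, so all upper levels apply automatically
CREATE TABLE tag_closure (
  ancestor_id   INT NOT NULL,
  descendant_id INT NOT NULL,
  PRIMARY KEY (ancestor_id, descendant_id)
) ENGINE=InnoDB;

-- everything tagged 'steam' or any sub-tag of it
SELECT DISTINCT at.article_id
FROM tags t
JOIN tag_closure tc ON tc.ancestor_id = t.id
JOIN article_tags at ON at.tag_id = tc.descendant_id
WHERE t.name = 'steam';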
Generally speaking, you want a table for magazines with their names, publication dates, publisher, whatnot; and only refer to them via foreign keys. Same goes for train models (which you could cross-ref via tags. Yay for clean db design!), authors, collectors, train clubs and pretty much everything else.
One last word of advice: No matter what anyone tells you: Either you use a proper framework or you _ALWAYS_ use prepared statements. You get some performance benefits, and SQL injection becomes impossible, for free! Repeat: Even if you ignore all the other tips above, you _MUST_ heed this.
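For instance, with MySQL's server-side prepared-statement syntax (PHP's PDO and mysqli expose the same mechanism from application code; the tag value is just an example):

PREPARE find_tagged FROM
  'SELECT article_id FROM article_tags WHERE name = ?';
SET @tag = 'electric';          -- user input goes into the parameter...
EXECUTE find_tagged USING @tag; -- ...never spliced into the SQL string
DEALLOCATE PREPARE find_tagged;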
http://en.wikipedia.org/wiki/SQL_injection [wikipedia.org]
Richard
PS: You are more than welcome to reply to this post once you have your DB design hammered out. I will have a look & optimize, if you want.
Re: (Score:2)
The second statement should have read
SELECT article_id FROM article_tags WHERE name = "foo" AND name = "bar"
for obvious reasons.
Re: (Score:2)
No, OR is correct here. AND doesn't find any rows because field "name" has only a single value, so 'name = "foo" AND name = "bar"' can't ever be true for any row. You want something like
SELECT article_id FROM article_tags WHERE name = "foo" INTERSECT SELECT article_id FROM article_tags WHERE name = "bar"
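Since MySQL, which the project is planning on, historically lacks INTERSECT, the same "foo AND bar" result can be had portably with GROUP BY/HAVING, for example:

SELECT article_id
FROM article_tags
WHERE name IN ('foo', 'bar')
GROUP BY article_id
HAVING COUNT(DISTINCT name) = 2;  -- must match both tags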
Re: (Score:2)
Ah, yes thanks... I should not be allowed to post before coffee...
Re: (Score:2)
I think you should still have a look at Sphinx and Lucene. You can put whatever data you want into them, in whatever schema you want (at least with Lucene, and I believe with Sphinx too). You can then easily create a UI as a front end and let the indexing engine do the hard work of slicing and dicing by your criteria. I believe the Zend Framework library has a Lucene API.
Also if you do manage to go fulltext later then it'll mean less work.
Re: (Score:1)
I don't know about Sphinx but I agree that Lucene could be a good solution, for the reasons tolan-b lists. I work on a digital library cataloging project that indexes its metadata with Lucene. We use PHP to generate the user-facing website, which queries our Lucene index via a Solr server. We do have a highly structured metadata schema and we do run queries that include things such as "give me all articles in Magazine including 'foo' in the title, published between 1950 and 1966" (which somebody in another
Re: (Score:2)
>we do run queries that include things such as "give me all articles in Magazine including 'foo' in the title, published between 1950 and 1966"
SELECT * FROM banjo_articles WHERE title LIKE "%foo%" AND date BETWEEN "1950-01-01" AND "1966-12-31"
You're bragging that your "system" has a single line of code?
I've seen selects ten or twenty lines long, with multiple joins, and joins and selects within joins. Granted it's not fast, but it works, and it takes all of an hour (or less) to write such a query.
Re: (Score:3, Interesting)
I do the same thing for tropical fish and wrote a shitload of C code. If this is an old DOS program it should port to C/UNIX really stupid easy.
Drop me a line if you want to and I'll ask you to send me some sample data. This might be really easy.
Re: (Score:2)
The standard /. IANAL applies here, but I'm pretty sure that if you have legal access to the copyrighted text (i.e. you or someone you know owns a copy of the magazine) then it is OK to create a derivative work for the purposes of searching that work. This is the loophole that Google (name your favorite search engine here) uses, and they go so far as to offer cached versions of some sites.
Lucene, or a more friendly wrapper around it like SOLR, has the option of creating a search index based on an original te
Re: (Score:2)
Check out: http://xtf.sourceforge.net/ [sourceforge.net]
I think it uses Lucene on the backend. It's designed to map metadata sources to metadata outputs via XSL templates. I talked with some of the developers recently and it sounds reasonable. If your inputs are binary then it's probably not much help, but for XML-like inputs it might give you some of the capabilities you're looking for. HTH
It would help (Score:3, Insightful)
Re: (Score:3, Funny)
I'm pretty sure porn indexing isn't niche... or a hobby. It's the true reason Google exists.
Re: (Score:2)
if you said what hobby and what index it is. Doing so would surely catch more interest from the Slashdot crowd.
Maybe it's the type of magazines that people used to read "for the articles?"
And that's precisely the type of magazine that would catch the interest of the Slashdot crowd.
Wayback (Score:4, Interesting)
Re: (Score:2)
Definitely an easy re-write.
Just going to be painful to re-enter all that data if they can't use the original binary blob.
A long time ago I had a programming assignment involving binary blobs: basically, unknown data structures within a binary. Provided they used no encryption, it should be relatively painless to extract the data. It was trivial then, and now I'm way better.
Re: (Score:2)
How will that help? As far as I understand, the pages are created on the fly, so without the "engine" behind them you won't get anything.
Re: (Score:1)
It's a DOS program that runs on the server, rather like a CGI script. Its output is a web page.
It is a bit of a throwback to the dawn of the web, when people thought up innovative ways to do things.
Re: (Score:2)
CGI works by having the server execute the program (passing data to it from STDIN or the command line) and then retrieving the page's complete HTML code from STDOUT. You can use any file that can be executed and uses STDIN/STDOUT in this manner, located in a specified location (like cgi-bin). On Windows this would be any exe, com, pif, bat or cmd file, and the extension must be there for the operating system to determine that it is an executable. On Linux you can use any file that has +x permissions, com
Developing a Niche Online-Content Indexing System? (Score:4, Insightful)
Re: (Score:1)
Wikipedia's search stinks, in my opinion. It's gotten better of late, but it's still not the Gold Standard by any stretch.
Just migrate it to VMware or KVM (Score:3, Informative)
Leverage the power of virtualization to run your legacy platform for now, and have time to come up with other solutions.
Re: (Score:3)
Leverage the power of virtualization to run your legacy platform for now, and have time to come up with other solutions.
That assumes that the original data is available to the OP. It may be that it is not.
Re: (Score:1)
> That assumes that the original data is available to the OP. It may be that it is not.
If only the article in some way made this clear.
"we are in negotiations to try and get the original data."
Oh, it does.
Re: (Score:2)
"we are in negotiations to try and get the original data."
In other words the OP does not have the data. And from the OP's reply below it may be that they never get it.
Re: (Score:2)
>Well until he does get it, any consideration of how to process it is somewhat moot.
Not quite. He was clear enough to construct a data model. This customer knows what he wants. Problem is, it will take his own efforts to fill in the gaps (in terms of getting access).
"Hi, I want you to install a refrigerator in my apartment. It needs to fit in a hole 30 inches wide by 30 inches deep."
"Will you take a refrigerator 28 inches wide by 26 inches deep?"
"Sure but....lemme talk to my landlord first."
If Zeus d
Re: (Score:1)
As of now it is not available.
We are putting pressure on the current owners to make it available, as they have suffered a certain amount of bad publicity over this, but so far to no avail. They did purchase the program for real money 10 years ago, but the fact that they are unable to run it should indicate to them it has little or no value now.
My thoughts have been along the lines of running it on some old PC hanging off an ADSL line with dynamic DNS, but virtualization may be a better idea. Does anyone of
Re: (Score:2)
Well the program and the data are two different things. At least to me they are.
All you need to do is run the program once, get a dump of the entire article list, and import it into your new MySQL table.
And running the program requires, what, DOS? Come on. Forget the web; that's out of the picture now with regards to the old, expired system. You just need ONE copy of the data, and you can re-build the web interface yourself with PHP.
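If the dump can be massaged into CSV, the import is a single statement; a sketch, where the file name and column list are hypothetical (and local_infile must be enabled):

LOAD DATA LOCAL INFILE 'article_dump.csv'
INTO TABLE mr_article_list
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(title, author, keywords);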
It sounds to me like the data is proprietary and they are being stingy w
Re: (Score:3, Interesting)
If you do get the original data, I'll volunteer to either disassemble the exe or RE the data format or preferably both. Just for the fun of it. Contact me at the /. nick over in the google mail system.
Offer to let them host a redirect if they want - interstitial advert page with a 'we have moved', and offer to redirect to that page if they are not the referrer for a certain timeframe. They get some advert money, you get the data, I have something to entertain myself with.
Gimme just the DOS program at ela
Re: (Score:2)
Mod parent up!
Nothing necessarily wrong with the DOS program anyway- if it works, why break it?
You should be able to run it pretty easily with either a virtual machine or an emulator- you can then look at extracting from it the data and migrating it to a flashier site. Sticking with the DOS program sounds like the simpler solution for now.
put the data online if you can (Score:1, Informative)
There is an annoying "business model" that drives most commercial websites for greed reasons, and spreads from them to non-commercial websites for no good reason at all except lemming effect. That is when the site has an interesting chunk of data but instead of putting it online to download, wraps a web application around it to deal it out in dribs and drabs, so that users have to keep returning, clicking ads, and so forth.
Yeah having some kind of online query interface can be useful and you should certain
Data hoarders (Score:2)
hoarding == massive replication (Score:3, Interesting)
Re: (Score:2)
Realistically, though, I doubt the database is very large (moreover, I doubt there are all that many people who would want this data). I mean, if you are indexing 50 magazines, over 100 years, with an average of 10 articles in each annual volume, that's 50k articles. Let's say each article has 200B of data; that's, what, ~10 meg uncompressed?
The binary file shouldn't be hard to read (Score:2, Informative)
Try Ruby on Rails (Score:5, Funny)
I'm sure that Ruby on Rails could have a fully functional web site made from this data in about half an hour.
The downside is that if more than two people try to access the data, it will display a whale suspended by balloons.
(Please Note: This post is a joke, and not an attempt to start a flame war).
Re:Try Ruby on Rails (Score:5, Funny)
It's data for model railroading magazine, so not only are they used to rails, they already have protocols to serialize access to shared resources and prevent collisions.
That is a data conversion project (Score:2)
You could write a custom program that scrapes the data from the website, set up so that the program can run standalone, or you could figure out what the data format is and write a program to convert it.
If you want to recreate the data from scratch, then you'd need to set up a website your group would access to enter data. That would be crowd-sourcing, but you'd probably want something specific to your needs, built with easily maintainable code.
As others have stated you could use virtualization. Inside th
Screen Scrape the Site (Score:2)
See if you can get access to the site again, and screen scrape it. That should not be too hard (search for all articles beginning with "A", then "B", etc.). Then, it should be straightforward to enter it into MySQL or your database of choice.
(It is just possible the search functionality is still there, with just the HTML being taken down. The WayBack Machine could be your friend here...)
Re: (Score:1)
If you could scrape the site, I would have done it years ago. Unfortunately the programmer built anti-scraping technology into the program to "protect his data". If you issue too many sequential requests it locks your IP out, permanently! I discovered this about 8 years ago when I was doing some manual scraping and it did it to me.
if you look at the site ( http://index.mrmag.com/ [mrmag.com] ) on the wayback machine you can see the strange error you get - it locked that out too!
Re: (Score:1)
My company has some pretty sophisticated data transformation tools that we use in forensics. You can connect with me via the /. friends system if you manage to get hold of the source data. We may be able to return it to you in something simple like CSV and then from there things should be easy.
Not promising a result but happy to at least take a look
Ask the PubMed guys (Score:3, Interesting)
Ask the guys behind PubMed:
http://www.ncbi.nlm.nih.gov/pubmed [nih.gov]
It's the database of scientific articles in the fields of medicine and biology.
NCBI has the most generous software licensing possible: the code is absolutely free, with no restrictions on distributing, changing, selling, or even closing it. All because we, the taxpayers, paid for it already.
I am surprised none of them have reacted yet; I am sure they read /.
And a thousand Mac Fanbois ... (Score:3, Funny)
Oh, the number of times that I've heard that refrain... shudder
Re: (Score:2)
Eww, the people responsible for that thing need to be led into the street and shot.
Until quite recently you could not even talk SQL to it.
Re: (Score:2)
I also have spent a long time dealing with FileMaker too and it can be a huge PITA. Be thankful you didn't have to maintain a FileMaker Pro Server or web server for many people!
It is very easy for non-tech-savvy people to build a bunch of databases and start using them, which is cool. The problem is that the databases have a very simple design and most people don't even know how to set up a relationship between two fields. They just drag and drop fields onto a form and let FileMaker figure out how to s
Re: (Score:2)
Strangely enough, though, I've only had one customer with it on Mac. The rest have been fools running the PC version under Windows... when they already have Office installed, with Access... or even an SQL Server on the network. ?!!?
If you can show me a way to publish databases to the web that's as quick and easy as FileMaker Pro, I'd love to hear about it.
Drupal, hands down. (Score:2, Interesting)
Re: (Score:2)
Mod up, seriously. Knowing what I know about Drupal + Solr, along with these fantastic examples, this is informative, truly.
Built in to MySQL (Score:1)
File format, not the implementation details (Score:3, Insightful)
Re: (Score:2)
You captured the main point of my refer/BibTeX posting better than I did. Thanks.
More than once I've had to salvage important data from obs
IMO it shouldn't be hard to re-parse the data (Score:1)
B& (Score:2)
wget has builtin recursive-fetching capabilities
Which will get the IP address of the machine running the scraper permanently banned. See the post above [slashdot.org].
if at all possible I would bypass the exe and just look at importing the raw data into a relational database like mysql
It's likely that the raw data is encrypted. Based on the comments so far, I see no reliable indication of which country tebee operates from or whether that country has a DMCA-alike.
Re: (Score:2)
It's not academic if we can show the poster some sort of very simple wiki-like CMS where people with 6 decades of back issues might volunteer to enter and edit information. If everyone were organized, 100 people could enter the data in a weekend. Allowing time to edit and refine keywords, without copying the actual content, would add some time. And the backend database could end up more valuable than the original.
Scraping the data isn't possible, getting the data looks unlikely. So you recreate it. Have pe
hyperestraier (Score:2)
Take a look at http://hyperestraier.sourceforge.net/ [sourceforge.net] ... there might be something newer by the same author, Mikio Hirabayashi
Extracting the text from whatever files you have would be a separate step.
Fancy that (Score:1, Funny)
> One of my hobbies has benefited for 20 years or so by the existence of an online index to all magazine articles on the subject since the 1930s. [...] The governing body for the hobby has agreed to host this
Huh, I didn't realize that porn had a governing body.
It's a library catalog. (Score:2)
Don't ask generic nerds -- ask library nerds : code4lib [code4lib.org]. They have a pretty active mailing list.
Also, there's oss4lib [oss4lib.org] which is specifically for open source software, but I haven't seen much activity on their list in a while, and I think most of us are on both lists. (there's also a few cataloging specific lists, but they get to be all library-sciencey, with discussions of RDA and FRBR and cataloging aggregates).
Re: (Score:2)
Denholm: It's settled. I've got a good feeling about you Jen and they need a new manager.
Jen: Fantastic! So, the people I'll be working with, what are they like?
Denholm: Standard nerds!
[Note: Not to be confused with standards nerds]
Using a Howitzer to Hunt Squirrels (Score:3, Insightful)
Lots of people here are recommending tools that are built for very large-scale projects. Based on the fact you have a DOS-based system that likely used a pretty common library for storing the data (something like c-tree, Btrieve, a dBase library or simply saving binary data using whatever language the app was written in), any RDBMS like MySQL or even SQLite would probably do the job. PHP, Python, Ruby and Perl would probably make writing the actual application a snap - and be able to handle more of a load than the DOS app could.
Here's hoping you can get the data, and that the vendor that pulled the database down realizes how important it is to marketing and reverses course.
This is the ModelRR mag. database (Score:1)
As many posters have said, it should be easy (for a programmer) to pull the data from the DB -- if you can get the original data files from Kalmbach. The data was not complex, and '80s DBs tended to have simple file formats. As many suggested, a C++, Java, Python or oth
Why PHP? (Score:1)
Dspace (Score:2)
Check out Dspace (http://www.dspace.org/). I'm by no means an expert in the area but it seems it might be what you need.
Hypercard 2.0? (Score:2)
Anyone can understand a card system: enter unique data per card and save.
Humans are good at that.
Bring them all together and you have a huge digital stack to be sorted, searched, or used as the backend to a nice simple topic interface.
Computers are great at that now.
That would help your crowd-sourcing, and if it's open source there are no MS closed-format issues later on.
DOS Data (Score:2)
If it's 20-year-old DOS, chances are that it's either Paradox or dBASE or some xBASE format, which could be easily opened with Access or even Winword.
BibTeX (Score:1)
It may not be a complete solution, but have you looked at BibTeX? BibTeX itself is only a format for nicely stating the information you have available (which magazine, article title, which pages in the magazine, authors, etc), but in the entire BibTeX ecosystem a number of indexing systems are built. Quite a lot of them are for desktop use (so you can manage your own BibTeX entries), but I'd imagine there would be some web-based system for this as well.
Sure About DOS? (Score:2)
Are you sure it's a true DOS application and not a Win32 console app? I know it is entirely possible for someone to write a CGI in DOS, but it seems really weird to me that they would use DOS, since it didn't have anything that would serve CGI, and coding a hand-rolled database format would be a lot of extra work.
If it is using Win32 it might just be accessing a DAO database without using the mdb extension, which many companies do to make it look like a proprietary format you can't just open with MS Access.
Re: (Score:1)
You're right, it is. A visit to the Wayback Machine found this page: http://web.archive.org/web/20070626092758/www.index.mrmag.com/tm.exe?tmpl=tm_info [archive.org]
In which is written:
The TM application is written in "C", and is based on an ISAM/Network database manager I wrote in the late 1980's. The code is highly portable, and versions exist for MS-DOS, Windows NT and several flavors of UNIX. I also run it on my HP Palmtop. The version running on this site is a Win32 console application.
Which just goes to show I shou
Re: ISAM/Network database manager (Score:1)
If you or someone can get me the database files (from Kalmbach?) I am willing
to try to extract useful data from them, into simple ASCII text file(s), suitable
for loading into a relational database like PostgreSQL, for free.
There is at least one misunderstanding. (Score:2)
Your first priority is to find out how the original data is stored and accessed. If, as you say, it is about 20 years old, I strongly suspect it is stored in a C-ISAM or D-ISAM database, and known code libraries are used to access it.
You should then be able to lean heavily on exi
Nail down the file format first (Score:2)
There are file formats for this. Probably there are XML languages if you like that kind of thing, but either of two older ones would serve you well I think: the refer(1) format for bibliographic databases, and the BibTeX format. At least the latter is still in
Semantic web (Score:1)
Brewster Kahle's Digital Library (Score:1)
A library catalog system is needed (Score:2)
This sounds almost exactly like a library catalog system. If the system doesn't index articles, then just treat each article as a book in a multi-volume set. I know that several open source library systems exist. Look into those.
Backup everything (Score:2)
Seriously, first step: back up *EVERYTHING*. This includes your programs and your data.
Then see if your ancient programs can be run inside a useful modern emulation environment, like DOSBox or FreeDOS. That can buy you another 10 years.
It also buys you access to the data without using your ancient hardware: you can read the backups and play with the data much more safely, to try and decode the format. Given the software's age, it's unlikely to be more sophisticated than a very simple index and tables that
Use a blog (Score:1)
Zebra is great for bibliographic data (Score:1)
CWIS Open Source Solution (Score:1)
Re: (Score:2)
Bad idea.
It's a bad idea for the same reason they don't want to host a DOS executable anymore.
Even if for some strange reason the text could not be retrieved from a binary blob (which is not likely), the application still works today.
A single command-line wildcard search would re-dump the text, which could be parsed and stored in a simple database.
Re: (Score:2)
If the individual article summaries are also made available on individual pages, let Googlebot index them and people will be able to discover relevant individual articles through Google as well. Then anyone looking for something covered by an article will discover your index, as well as which issue of the magazine they need.
Re: (Score:1)
At the moment things are a little fragmented, but we seem to be congregating on this thread: http://model-railroad-hobbyist.com/discountinued_mag_index [model-rail...bbyist.com]
I hope things will get a little more organized once we have a clear idea whether the original data is going to be available to us.
I too have a website we could use; I'm currently putting up versions of some of the things that have been suggested here. It's at http://pc-cafe.co.uk/mr [pc-cafe.co.uk]