MAILS about: Message-ID: <200203020316380679.03EE30B9@smtp.worldonline.dk> X-Mailer: Calypso Version 3.30.00.00 (4) Date: Sat, 02 Mar 2002 03:16:38 +0100 From: "Kasper Skårhøj" To: typo3feature Subject: Dev-help: Any SQL-wizards? Mime-Version: 1.0 Content-Type: text/plain; charset="ISO-8859-1" If you are an SQL wizard, you may be able to help me here. In the (coming) index searching thing, I have three main tables. - index_words which contains all the words indexed - index_pages which represents a link to a page id or external url - index_rel which links the two tables together. So searching an OR search for "content" and "management" could be done like this: SELECT STRAIGHT_JOIN [some fields here...] FROM index_words AS IW, index_rel AS IR, index_phash AS IP WHERE IR.phash = IP.phash AND IW.wid=IR.wid AND (IW.baseword = 'content' OR IW.baseword = 'management') [... and here comes some GROUP BY, ORDER BY and LIMIT] This is not a problem. Actually it works very well apparently. However performing an AND search is much harder. Because you cannot just switch OR with AND (because "baseword" cannot be two things at the same time... :-) Therefore I tought of a little trick to do it: SELECT STRAIGHT_JOIN [some fields here...] FROM index_words AS IW, index_rel AS IR, index_words AS IW2, index_rel AS IR2, index_phash AS IP WHERE IW.wid=IR.wid AND IW2.wid=IR2.wid AND IR.phash = IP.phash AND IR2.phash = IP.phash AND (IW.baseword = 'content' and IW2.baseword = 'management') [... and here comes some GROUP BY, ORDER BY and LIMIT] ... and actually I think this works, but it's very slow, probably because the internal result in MySQL becomes extremely large due to the joins. Can anyone help me? I checked out kwIndex from hotscripts and he does it like this: 1) Select the word-ids (SQL-query 1) 2) If both words were found, make another query for all linking-table entries matching the words and group by the word-id. The count(*) statement shows the number equal to the number of searchwords if they were both found. So select only records which delivers this. Then you have the document ids.... (SQL 2) However this solution will not let us: - search for parts of a word like "content%" or metaphone values. The word is matched exact! - It uses 2 SQL queries - I hope you do fine with one only... ***************************************************************************************************************** ***************************************************************************************************************** Message-ID: <200203021103320083.000F523D@smtp.worldonline.dk> References: <200203020316380679.03EE30B9@smtp.worldonline.dk> X-Mailer: Calypso Version 3.30.00.00 (4) Date: Sat, 02 Mar 2002 11:03:31 +0100 From: "Kasper Skårhøj" To: typo3-features@netfielders.de Subject: Re: [Typo3-features] Dev-help: Any SQL-wizards? Mime-Version: 1.0 Content-Type: text/plain; charset="ISO-8859-1" Content-Transfer-Encoding: quoted-printable 1) To the AND question: Maybe it's best to make a search for each word; After getting the total list of page-ids from first search, this is included as a condition in the next search, which generates a new list which is included in the next search, which... However this approach forces us to get a list of ids into PHP and include that in the next search. This will work for small sites (still more than 1000 pages though) but is not very wise in the long run (because this list could be very, very long). Then MySQL has an option of creating a temporary table which one could store the result in and then join with that table upon the next search. This makes more sense I think, but that is only MySQL 3.23+ (I run 3.22). Comments? 2) Search query syntax Any suggestions to a search query syntax. - Search for "content management" is by default AND search - But should "+content -management" be the same as "content not management" ? - What about nesting? Like "content and (management or production)". And which operator (AND or OR) has precedence anyways? Please comment. ***************************************************************************************************************** ***************************************************************************************************************** Return-Path: Delivered-To: pop3user-typo3-kasper@typo3.com Received: (qmail 18622 invoked from network); 3 Mar 2002 00:19:45 -0000 Received: from unknown (HELO netfielders.de) (194.245.114.28) by 192.168.1.4 with SMTP; 3 Mar 2002 00:19:45 -0000 Received: from host1.deltaphon.net [209.239.36.16] by mailman.k1net.de (SMTPD32-6.06 EVAL) id AAA2EDE03EC; Sun, 03 Mar 2002 01:13:22 +0100 Received: from [10.0.1.2] (pD9EB6EA1.dip.t-dialin.net [217.235.110.161]) by host1.deltaphon.net (8.10.2/8.10.2) with ESMTP id g230F5532210 for ; Sat, 2 Mar 2002 19:15:05 -0500 User-Agent: Microsoft-Entourage/10.0.0.1309 Date: Sun, 03 Mar 2002 01:15:06 +0100 Subject: Re: [Typo3-features] Dev-help: Any SQL-wizards? From: Daniel Hinderink To: Message-ID: In-Reply-To: <200203021103320083.000F523D@smtp.worldonline.dk> Mime-version: 1.0 Content-type: text/plain; charset="ISO-8859-1" Content-transfer-encoding: quoted-printable Precedence: bulk Sender: typo3-features-owner@netfielders.de Reply-To: typo3-features@netfielders.de Hi Kasper, OK, I am not an sql-wizard. Just some hints and wishes, which I hope are helpful. > > 1) To the AND question: > > Maybe it's best to make a search for each word; After getting the total list > of page-ids from first search, this is included as a condition in the next > search, which generates a new list which is included in the next search, > which... > However this approach forces us to get a list of ids into PHP and include that > in the next search. This will work for small sites (still more than 1000 pages > though) but is not very wise in the long run (because this list could be very, > very long). > Then MySQL has an option of creating a temporary table which one could store > the result in and then join with that table upon the next search. This makes > more sense I think, but that is only MySQL 3.23+ (I run 3.22). > Comments? In fact I believe there has to be an auto-indexing spider which builds a search reference table at record creation time to compare against. I thought this is what the current typo3 indexing is doing? Best would be if that would be filtering the text for indexing words against a "filler"-set to extract the noise and keep the search swift. Here is a very short bit on the fulltext in feature in mysql, is that what you are talking about? This seems to be the only way for me to dit inside mysql: It's from 3.23.23 on: http://www.mysql.com/doc/F/u/Fulltext_Search.html However it would be very nice to have a result ranking mechanism, not only going by the frequency of a match in a given record, but also by relevance, as shown by hits and by the rank in the pagetree, so that a match in a record on rootlevel +1 is shown before a record in rootlevel +2, etc. Take a look at atomz.com ->search. The administration surface has quite a few important features I would love to see in a typo3 admin search surface. I have to say a really elaborate, fast and multiformat (pdf's !) search engine is really one the single most important things for every 100+ website. > > 2) Search query syntax > Any suggestions to a search query syntax. > - Search for "content management" is by default AND search > - But should "+content -management" be the same as "content not management" ? > - What about nesting? Like "content and (management or production)". And which > operator (AND or OR) has precedence anyways? As for the syntax, simple boolean is much more widespread in use, but implied boolean (+,-) should be the prevailing statement of them. That's how it was done in harvest and is still done in most search engines. Here is a handy little comparison table of search engine syntax: http://lisweb.curtin.edu.au/staff/gwpersonal/compare.html Thanks for listening, good night, Daniel ***************************************************************************************************************** ***************************************************************************************************************** Return-Path: Delivered-To: pop3user-typo3-kasper@typo3.com Received: (qmail 4770 invoked from network); 4 Mar 2002 07:58:03 -0000 Received: from unknown (HELO netfielders.de) (194.245.114.28) by 192.168.1.4 with SMTP; 4 Mar 2002 07:58:03 -0000 Received: from av001.thyssen.com [149.211.49.30] by mailman.k1net.de (SMTPD32-6.06 EVAL) id A75E110303EC; Mon, 04 Mar 2002 08:50:54 +0100 Received: from mail.thyssen.com (unverified) by av001.thyssen.com (Content Technologies SMTPRS 4.2.10) with ESMTP id for ; Mon, 4 Mar 2002 08:52:43 +0100 Received: from srv479.thyssen.com (srv479 [149.206.183.11]) by mail.thyssen.com (8.10.0.Beta6/1.0.3) with ESMTP id g247leJ66914 for ; Mon, 4 Mar 2002 08:47:41 +0100 Received: from SRV533.thyssen.com (SRV533.thyssen.com [149.206.246.196]) by srv479.thyssen.com (8.11.1/8.11.1) with ESMTP id g247sSj20724 for ; Mon, 4 Mar 2002 08:54:28 +0100 Subject: Antwort: [Typo3-features] Dev-help: Any SQL-wizards? To: typo3-features@netfielders.de X-Mailer: Lotus Notes Release 5.0.5 September 22, 2000 Message-ID: From: Malecki@blohmvoss.thyssen.com Date: Mon, 4 Mar 2002 08:48:27 +0100 X-MIMETrack: Serialize by Router on SRV533/Server/BuV(Release 5.0.6a |January 17, 2001) at 04.03.2002 08:52:40 MIME-Version: 1.0 Content-type: text/plain; charset=iso-8859-1 Content-transfer-encoding: quoted-printable Precedence: bulk Sender: typo3-features-owner@netfielders.de Reply-To: typo3-features@netfielders.de Hi Kasper, I was working some time with Oracle and (less) with mySQL on similar problems. I think, mySQL goes the same base ways, so the same hints apply to mySQL. Basic hint is: always try to keep the driving set of the query as small as possible. The mentionerd query is like: .... WHERE IR.phash = IP.phash AND IW.wid=IR.wid AND (IW.baseword = 'content' OR IW.baseword = 'management') and probably there is a huge amount of rows matching the first line: IR.phash = IP.phash AND (this is a HHUUGGEE JJOOIINN i think, right?) which is creating the "driving set" (of matching rows) of the selection. All following "AND"'s are applied to cut-down this initial amount of rows. It's then obvious, if this initial amount of rows is small, then the succeding narrowing takes less time and resources. Suggestion: try first: (IW.baseword = 'content' OR IW.baseword = 'management') AND .... AND .... The first line of the criteria shall provide as small as possible amount of rows matching. (this looks then strange somehow, but is effective). Basically, when the tables in the database are well analysed (statistics is actual), then some optimization shall (and most likely will) be done by the query optimizer. But there is no guarantee. What the guaranteed is to don't rely on something else. A good query works good without any artificial help. BTW. If You like to see IN ADVANCE how the query will behave: call the EXPLAIN for this query. You will get then more information about how mySQL will go to process the query. I hope, this helps. Regards and God's Blesses for Your week Piotr ***************************************************************************************************************** *****************************************************************************************************************