Wikipedia:Reference desk/Archives/Computing/2016 December 8

Computing desk
Welcome to the Wikipedia Computing Reference Desk Archives
The page you are currently viewing is an archive page. While you can leave answers for any questions shown below, please ask new questions on one of the current reference desk pages.


December 8

Solving a 3-SAT problem with a constraint in linear time

Hello,

In a homework assignment for an algorithms class I was asked to solve, in linear time and using the divide-and-conquer technique, a 3-SAT problem with n Boolean variables x1,...,xn and m clauses, with the constraint that the variables in each clause are at most 20 indices apart (i.e. you can have a clause (⌐x1 ∨ x15 ∨ x3) but not (x1 ∨ ⌐x5 ∨ x22), as x1 and x22 are more than 20 indices apart). I have no idea how to divide the problem efficiently using the constraint, and would appreciate any hint or suggestion for a direction. Thanks. — Preceding unsigned comment added by 77.125.47.76 (talk) 14:06, 8 December 2016 (UTC)[reply]

Not sure I understand the problem. Let me state what I think it is, and you can correct me:
1) We have up to 20 Boolean variables, 1-n. Each variable, of course, may be T or F.
2) We need to test to find if some combo of those 20 variables, in each of m given Boolean expressions, evaluates as true.
Do I have it right ? If so, it sounds like it would take a max of 2^n × m Boolean expression evaluations, which, for the case of n = 20, would be about a million times m. That's not unreasonable for modern PCs. Do you want to abort the check on each Boolean expression after one combo is found which evaluates as T ? That's easy to do, by making the outer loop the 1 thru m loop, and then have a check inside the 1-20 inner loops that jumps to the next element in the m loop, once a T evaluation is obtained. A case statement could be used to quickly select the desired Boolean expression to test, inside the innermost loop.
As for how to structure the variable number of n loops (1-20), I'd be tempted to write 20 subroutines, each with a different number for n, and a 21st subroutine, called by each of the 20, to evaluate the selected one of the m Boolean expressions. I'm sure there's a more elegant technique, perhaps using recursive methods, but for only 20 I wouldn't bother. You could also do it all in one piece of code, with another case statement to select for the value of n, but that would be an unreasonably long chunk of code, IMHO. StuRat (talk) 15:50, 8 December 2016 (UTC) (Struck out answer to wrong Q.)[reply]
Thanks for the answer. I will try to define the problem more accurately: there are n Boolean variables (not 20; it's possible that n>>20). You get an expression composed of m clauses, each containing three of the n variables, with an "or" operator between the variables in each clause and with "and" operators between different clauses: for example, for m=15, n=30 you may get the expression (x1 ∨ x2 ∨ x3) ∧ (x7 ∨ x12 ∨ ⌐x19) ∧ (x5 ∨ ⌐x2 ∨ x3) ∧ (x1 ∨ x18 ∨ x10) ∧ (x14 ∨ x23 ∨ ⌐x6) ∧ ... (the other 15-5=10 clauses, which can contain variables x1...x30). The only restriction is that the three variables in each clause are at most 20 indices apart. The target is to find an assignment of x1,...,xn (for example x1=true, x2=false, x3=false, ..., xn=true) such that the entire expression is true, or to say there isn't one if that is the case. The instructions specifically talk about a divide-and-conquer solution, and the solution should run in linear time (not exponential in n).
OK, thanks, I understand it better now. Let me think about it and get back to you. How soon do you need an answer ? StuRat (talk) 17:31, 8 December 2016 (UTC)[reply]
Take your time (-: I would appreciate any hint or direction. 77.125.47.76 (talk) 17:38, 8 December 2016 (UTC)[reply]
One thought that comes to mind is that the only variables of interest are those which appear without a NOT in at least one clause and with a NOT in at least one other clause. (Those that appear only without a NOT, in any number of clauses, should simply be set to T, while those that appear only with a NOT, in any number of clauses, should be set to F, and those variables which don't appear in any clause needn't be set at all.) So, I'd do some preprocessing to determine this:
A) In the first run, just find the lowest and highest index values over all m clauses, called LO and HI. You might possibly want to get fancy here and compress them. For example, if your LO value is 1 and your HI value is 23, but the index values 4 and 8 are never used, you could compress the HI down to 21, by defining some pointers. This will take some time in the preprocessing step, and increase the code complexity, but could save quite a bit of time in the main part.
B) Do a loop from LO to HI, and find those values that should always be set to T, always F, and those few variables "of interest" we need to try both ways. Then you may want to do a similar compression step where you get the variables "of interest" down to the minimum continuous range.
C) Then you do the main part of the program, where you try every combo of T and F between LO_OF_INTEREST and HI_OF_INTEREST. So, if only 10 variables are "of interest", then you only need to try out 2^10, or 1024, combos, not bad at all.
I still feel I might be missing something, though, as that index range of 20 on each clause seems like it should allow for a shortcut, but I don't see it...yet. StuRat (talk) 18:39, 8 December 2016 (UTC)[reply]
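For concreteness, here is a rough Python sketch of the preprocessing described above (the clause encoding and the function name are invented for illustration: literals are signed integers, +i for xi and -i for ⌐xi, and the LO/HI compression steps are skipped). As the replies below point out, the final step is still exponential in the number of "variables of interest", so this is not a linear-time solution:

```python
from itertools import product

# A clause is a tuple of three literals; literal +i means x_i, -i means NOT x_i.
# Example instance (invented), consistent with the 20-index restriction:
clauses = [(1, 15, 3), (2, -15, 7), (-3, 2, 20)]

def solve_by_pure_literal_preprocessing(clauses):
    """Fix variables that occur only positively (True) or only negatively
    (False), then brute-force the remaining 'variables of interest'."""
    positive = {abs(l) for c in clauses for l in c if l > 0}
    negative = {abs(l) for c in clauses for l in c if l < 0}
    forced = {v: True for v in positive - negative}
    forced.update({v: False for v in negative - positive})
    of_interest = sorted(positive & negative)      # appear both ways

    def satisfied(assignment):
        return all(any(assignment[abs(l)] == (l > 0) for l in c) for c in clauses)

    # Exponential in len(of_interest) -- acceptable only when that set is small.
    for values in product([False, True], repeat=len(of_interest)):
        assignment = dict(forced)
        assignment.update(zip(of_interest, values))
        if satisfied(assignment):
            return assignment
    return None

print(solve_by_pure_literal_preprocessing(clauses))
```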
Yeah, you need to use the restriction, and your hypothetical 2^10 is not linear time. SemanticMantis (talk) 18:53, 8 December 2016 (UTC)[reply]
One method I should mention which uses the 20 but doesn't work is to have a loop from 0-20, and apply the offset to each clause so that each of those 21 values is in the LO to HI range for that particular clause, so that you can simultaneously try each combo of T and F values in every clause. The problem with this method can be demonstrated by the fact that the first variables of all the clauses must then all be set to T, or all to F, at once. Just thought I'd mention this so you don't waste any time trying this method. StuRat (talk) 14:06, 9 December 2016 (UTC)[reply]
My general hint is to rely heavily on the restriction, as that is the main point of the exercise, since presumably you've been taught about the full complexity of general 3-SAT. Recall that a linear-time algorithm may well end up with a huge constant factor, so the number of steps you're allowed per element can be enormous, as long as it doesn't grow with n. More general advice for tough problems is to treat some subcases first. E.g. if n<20, then your restriction is sort of built in, and solving that case in linear time may help you see how to do the general case. SemanticMantis (talk) 18:59, 8 December 2016 (UTC)[reply]
Thanks. We haven't seriously discussed the SAT problem, or NP problems in general, yet. In this case, does linear time mean O(m) or O(mn)? Does the solution require more than O(n) space? Does this require some sort of backtracking in order to resolve conflicting assignments of the same variable? — Preceding unsigned comment added by 77.125.47.76 (talk) 20:14, 8 December 2016 (UTC)[reply]
" does linear time mean O(m) or O(mn)?" - I don't know, what does the original problem statement say? Does it make sense to think that the number of steps should be independent of m? I've not solved this specific problem, and to be honest I'm not all that good at theory of computational complexity, but my guess is that O(mn) should be enough. If you consider m to be fixed, then you can fold that in to the constant, and then an algorithm that is O(mn) for changes in m and n is O(n) for the fixed m case. The exact original wording would probably help. Also I suspect you might be able to get some traction out of the restriction by noting that, for large n, m*20<n, but 2^(m*20) cases should be enough to check. Anyway, that's all the suggestions I have for today, here are a selection of notes and readings on 3-sat restrictions and variants that may be helpful to you as you work it out [1] [2] [3] [4], [5]. SemanticMantis (talk) 20:50, 8 December 2016 (UTC)[reply]
The point of the restriction is that each three-variable clause is now sortable by index. You can divide the clauses into low sets and high sets (some will overlap and be in both divisions). Now you have two problems, each about half the size of the original problem. It only works down to the point where there is so much overlap that you can't divide any further. 209.149.113.4 (talk) 14:01, 9 December 2016 (UTC)[reply]
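For what it's worth, here is one way the locality can be exploited in code. This is a sliding-window dynamic-programming sketch rather than an explicit divide and conquer, so it is probably not the exact construction the assignment has in mind, but it shows why the 20-index restriction buys linear time: scanning the variables in index order, a clause can be checked as soon as its highest-index variable gets a value, because all of its variables lie among the last 21 indices. Keeping one reachability bit per assignment of that window gives roughly O((n + m) · 2^20) time, i.e. linear in n and m with a very large constant. The function name and encoding below are invented for illustration:

```python
from collections import defaultdict

WINDOW = 20  # maximum index distance allowed inside a clause

def bandwidth_3sat(n, clauses):
    """clauses: list of 3-tuples of literals, +i for x_i and -i for NOT x_i,
    with (max index - min index) <= WINDOW in every clause.
    Returns True iff some assignment satisfies all clauses."""
    # Group clauses by the highest variable index they mention.
    by_max = defaultdict(list)
    for c in clauses:
        by_max[max(abs(l) for l in c)].append(c)

    # A state records the values of the last WINDOW assigned variables
    # (fewer at the start), as a frozenset of (index, value) pairs.
    states = {frozenset()}                     # nothing assigned yet
    for i in range(1, n + 1):
        new_states = set()
        for state in states:
            window = dict(state)
            for value in (False, True):
                window[i] = value
                # Every variable of a clause with max index i is in the window.
                if all(any(window[abs(l)] == (l > 0) for l in c)
                       for c in by_max[i]):
                    new_states.add(frozenset(
                        (v, b) for v, b in window.items() if v > i - WINDOW))
        states = new_states
        if not states:
            return False                       # no partial assignment survives
    return True

# Tiny invented instance: n = 5, every clause spans at most 20 indices.
print(bandwidth_3sat(5, [(1, -2, 3), (-1, 2, -4), (2, 4, 5)]))
```

If an actual satisfying assignment is needed rather than a yes/no answer, the usual trick is to keep a back-pointer from each new state to the state and value that produced it and walk the pointers backwards at the end.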
Yes, this seems related to my suggestion to suss out the "variables of interest", where they are defined with a NOT in one or more clauses and without a NOT in at least one other clause. The other variables are trivial. Still not linear time, though. StuRat (talk) 14:06, 9 December 2016 (UTC)[reply]
A note on computational complexity. I've done lots of performance benchmarking and optimization, and I don't find big O calcs to be particularly useful. I consider them basically just to be an academic calc with little application in real life. Others may disagree, but I find actual measurements with the typical data cases to be far more practical in optimizing code for performance. In many cases the big O predictions may match the real-life performance benchmarking, and in others they don't, because some type of "overhead" not included in the calcs swamps the time for the portion considered in the big O calcs. For example, I had to fix the performance of a piece of code which was retrieving data from a remote database, one row at a time, then sorting it. Nothing in the big O calcs showed a problem with the retrieval step, as it was all linear, but they indicated the problem would be in the sort portion, which was nonlinear. However, the issue was in the way the code worked, by opening the db, retrieving a row, then closing it, and repeating. The open-db call was the issue, taking too much time when multiplied by thousands of rows. Once I changed the code to leave the db open while retrieving all the rows, the performance was greatly improved, and the big O calcs did nothing to help locate the problem. StuRat (talk) 14:06, 9 December 2016 (UTC)[reply]
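To make the anecdote above concrete, the pattern being described is roughly the following (a hypothetical Python sketch: the table name, the id column and sqlite3 itself stand in for whatever remote database and schema were actually involved):

```python
import sqlite3

# The slow version from the anecdote: open and close the database per row.
def fetch_and_sort_slow(db_path, row_ids):
    rows = []
    for row_id in row_ids:
        conn = sqlite3.connect(db_path)        # connection overhead paid per row
        rows += conn.execute(
            "SELECT * FROM usage_log WHERE id = ?", (row_id,)).fetchall()
        conn.close()
    return sorted(rows)

# The fix: open once, fetch all rows, then sort. Same big-O, but the per-row
# connection overhead that actually dominated the runtime is gone.
def fetch_and_sort_fast(db_path, row_ids):
    conn = sqlite3.connect(db_path)
    try:
        placeholders = ",".join("?" * len(row_ids))
        rows = conn.execute(
            "SELECT * FROM usage_log WHERE id IN (%s)" % placeholders,
            list(row_ids)).fetchall()
    finally:
        conn.close()
    return sorted(rows)
```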
But that is a plain performance question, not computational complexity. Experiments with real-life datasets are useful and encouraged, but they cannot give you any guarantees. For that you need a theoretical analysis (which of course can at least estimate the constant overhead and whether it can be ignored). We had a case where the system used a linear list as an event queue (i.e. all future "todos" were inserted into the list, sorted by time). That worked fine for years. Then someone had the idea of organising the timeouts for a large class of objects via that queue. Again, that worked fine in small tests. Under real load, the system suddenly went to 100% CPU load and started dropping events. --Stephan Schulz (talk) 14:43, 9 December 2016 (UTC)[reply]
But my method would have involved testing it with a realistic data load (and retesting should that data load change), which would have found the problem. Big O calcs can also fail to guarantee performance or find problems like low memory or swap/paging space, which a real test will find. StuRat (talk) 14:49, 9 December 2016 (UTC)[reply]
...if you have realistic test data up front. I always tell my students that big-O analysis (or really big-Θ analysis) cannot prove that a certain algorithm will be able to deal with a problem, but it can often prove that it will not. If your design is O(n²) and your input has one million items, you are in trouble, no matter how small the constants. --Stephan Schulz (talk) 17:46, 9 December 2016 (UTC)[reply]
I would argue that a program ought not to be written if realistic test data is not available. To do so is to risk your code being unusable when completed. StuRat (talk) 16:18, 11 December 2016 (UTC)[reply]
To which I counter that a) that's the risk when doing original science (my current setting), and b) code typically lives on long past the time when the original assumptions about the environment and data have become obsolete (any setting). --Stephan Schulz (talk) 19:46, 11 December 2016 (UTC)[reply]
As for b, if my code fails after 20 years, when used against a data set far larger than it was intended for, I'd feel a lot better (and more importantly, still get paid), whereas if it fails right away I'd be depressed and probably without a job. StuRat (talk) 22:14, 11 December 2016 (UTC)[reply]

Find data matching a certain standard and calculate based on that data

In Excel, I have a chart in which each line consists of a bunch of columns, including one containing the title of a book and another the date it got used; each use gets a separate line (we're past 14,000 lines), so the same title may appear once or may appear dozens of times, and I've created a pivot table that counts the number of times that each title got used. My goal is to find the latest date when a book got used, so I'm trying to construct a formula that (1) takes a title from the pivot table, (2) finds all occurrences of that title in the sheet where the data's stored, (3) finds all the dates associated with the occurrences of that title, and (4) determines which of those dates is the latest. It has to use this process, rather than just mining a date from an array, because we constantly update this list with additional uses — even if we sort the table by title and never re-sort it, an additional use for "Fox and Socks" will require an additional line, so every single title after "Fox and Socks" in the alphabet will have to be moved.

I would be tempted to think this impossible, but it was working until earlier today. A former coworker compiled a formula that goes on the same sheet as the pivot table; for the title appearing in A19, the code is =INDEX(Tracking.xlsx!Date,SUMPRODUCT(MAX((Tracking.xlsx!Title_Trimmed=A19)*ROW(Title_Trimmed)))-ROW(Date)+1). "Title_Trimmed" and "Date" indeed are the title and date column from the sheet where all the data's stored, and the file is "Tracking.xlsx"; I've been careful to avoid typos. Until this morning, it returned the desired results, but when I supplied additional lines to the sheet where data is stored (and saved the file before going anywhere else), the formula suddenly started returning a #REF! error, and neither my current coworkers nor I can figure out how to fix it. I've found stuff like https://exceljet.net/formula/lookup-lowest-value, but they all assume that you're mining from a defined array, rather than from cells adjacent to ones that match a certain criterion.

With all this in mind, can someone identify where this formula has been corrupted, or what I can do to get the desired latest-used dates? Nyttend (talk) 16:19, 8 December 2016 (UTC)[reply]

I'm thinking you may have just hit a memory limit on the calculation abilities of Excel. My suggestion, transfer the data into a relational database, which is actually designed to handle that volume of data efficiently. StuRat (talk) 16:23, 8 December 2016 (UTC)[reply]
I doubt that's the case. I've created another pivot table that pays attention to just 10% of the table (so it's handling far less data than the latest-date formula was handling yesterday while working properly), but it's returning the same error. Nyttend (talk) 16:42, 8 December 2016 (UTC)[reply]
To find the latest date when a book got used, you can use a PivotTable to display the "maximum" date for each title. Use steps like this:
  1. Drag the title into the Row Labels section.
  2. Drag the date into the Values section.
  3. Click on the "Count of Date" item, choose Value Field Settings, then choose Max.
  4. Still in the Value Field Settings box, click the Number Format button and change to Date.
I only have Excel at work, so some of the details above may be inexact. You can use a search engine to search for something like PivotTable max of date if you need more detailed help. --Bavi H (talk) 02:34, 9 December 2016 (UTC)[reply]
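For completeness, if exporting the data (or reading the workbook directly) is an option, the same latest-date-per-title question is a short script outside Excel. Here is a sketch with pandas, assuming the sheet's columns carry the same names as the named ranges in the formula above (Title_Trimmed and Date) and that the data sits on the first sheet; adjust the names to match the actual workbook:

```python
import pandas as pd

# Read the tracking data; the file name is taken from the thread, the sheet
# index and column names are assumptions.
df = pd.read_excel("Tracking.xlsx", sheet_name=0, parse_dates=["Date"])

# Latest use per title, plus the use count the pivot table already provides.
summary = (df.groupby("Title_Trimmed")["Date"]
             .agg(latest_use="max", uses="count")
             .sort_values("latest_use", ascending=False))
print(summary.head())
```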