extraction_wizard

Differences

This shows you the differences between two versions of the page.

extraction_wizard [2022/05/26 09:35]
autobook [3.1.2 Extracting from email source]
extraction_wizard [2022/09/14 11:47] (current)
autobook
Line 1: Line 1:
 ====== Extraction Wizard ====== ====== Extraction Wizard ======
  
-The Extraction Wizard allows you to create Extraction Schemes for capturing data from a source text into the columns of a database. You can capture data from emails, cloud-systems, PDF files or any other kind of structured data.+The Extraction Wizard allows you to create Extraction Schemes for capturing data from a source text into the columns of a Database. You can capture data from emails, cloud-systems, PDF files or any other kind of structured data.
  
-You are able to define conditions on line, sub-line, word, and/or symbol level for each database column while the extraction results are shown on-the-fly in the **Test Result** field.+You are able to define conditions on line, sub-line, word, and/or symbol level for each Database column while the extraction results are shown on-the-fly in the **Test Result** field.
  
-The available options cover vastly more than you will need in typical usage. In most cases, you will need only 2 or 3 controls for each database column. In special, complex situations, you can activate the [[#Use Regex for text fields]] checkbox to inject regular expressions into your settings which open up literally unlimited possibilities.+The available options cover vastly more than you will need in typical usage. In most cases, you will need only 2 or 3 controls for each Database column. In special, complex situations, you can activate the [[#Use Regex for text fields]] checkbox to inject regular expressions into your settings which open up literally unlimited possibilities.
  
-To begin, paste a test string – e.g. a purchase order from which you want to extract information – into the large **Test String** field on the right side and toy around a bit with the options, always first picking the [[#Data Source]] (unless you only want to write a [[#Fixed Text]] into that database column), and then trimming down the text to be extracted by applying various limitations on line, sub-line, word, and/or character level until the result matches the content you want to store in a Database. To undo a selection in any of the listboxes, simply double-click it.+To begin, paste a test string – e.g. a purchase order from which you want to extract information – into the large **Test String** field on the right side and toy around a bit with the options, always first picking the [[#Data Source]] (unless you only want to write a [[#Fixed Text]] into that Database column), and then trimming down the text to be extracted by applying various limitations on line, sub-line, word, and/or character level until the result matches the content you want to store in a Database. To undo a selection in any of the listboxes, simply double-click it.
  
 You'll probably get how it works without reading the manual, although you might miss out on some less obvious functionalities and tricks. A detailed description of each control follows below. Alternatively, skip directly to the [[#Examples]] section to gain intuitive understanding. You'll probably get how it works without reading the manual, although you might miss out on some less obvious functionalities and tricks. A detailed description of each control follows below. Alternatively, skip directly to the [[#Examples]] section to gain intuitive understanding.
Line 15: Line 15:
 ⯈ Right-click into the **Schemes** listbox on the **Home** tab and select **Create new** from the context menu. ⯈ Right-click into the **Schemes** listbox on the **Home** tab and select **Create new** from the context menu.
  
-The Auto Book Dataviewer will open and show the available database column headers. Each row represents one set of column headers.+The Auto Book Dataviewer will open and show the available Database column headers. Each row represents one set of column headers.
  
 If no row contains the headers you want, add a new row and type your own headers or modify the existing ones. Don't forget to click **Save** if you want to keep your changes for later use. If no row contains the headers you want, add a new row and type your own headers or modify the existing ones. Don't forget to click **Save** if you want to keep your changes for later use.
Line 76: Line 76:
 This functionality is intended for troubleshooting, and you probably will never need it. If you're just reading the manual to learn general usage, skip ahead to the next section for now. This functionality is intended for troubleshooting, and you probably will never need it. If you're just reading the manual to learn general usage, skip ahead to the next section for now.
  
-Auto Book can decode all common email encoding schemes such as Quoted-Printable, BASE64, etc., and also parses HTML. When you press Auto Book's [[start#Hotkeys Tab|Email Source Extraction hotkey]], the selected text or clipboard's content is assumed to be an email source and decoded internally. Your Extraction Schemes are then applied to this decoded content instead of the raw text you are seeing on your screen.+Auto Book can decode all common email encoding schemes such as Quoted-Printable, BASE64, etc., and also parses HTML. When you press Auto Book's [[manual#Hotkeys Tab|Email Source Extraction hotkey]], the selected text or clipboard's content is assumed to be an email source and decoded internally. Your Extraction Schemes are then applied to this decoded content instead of the raw text you are seeing on your screen.
  
 The decoding results can also be viewed by pasting the email source into the **Test String** field and clicking the **Process Email Source** checkbox in [[#the right-side panel]], and then selecting a [[#data source]] – the only difference is that the Email Interpreter will show the decoding result of the whole email, including both the header and the text body. The decoding results can also be viewed by pasting the email source into the **Test String** field and clicking the **Process Email Source** checkbox in [[#the right-side panel]], and then selecting a [[#data source]] – the only difference is that the Email Interpreter will show the decoding result of the whole email, including both the header and the text body.
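
The decoding step described above can be illustrated with a minimal Python sketch. It is not Auto Book's implementation; the function name ''decode_transfer_encoding'' is hypothetical, and only the two encodings named in the text are handled, using Python's standard ''quopri'' and ''base64'' modules.

```python
import base64
import quopri


def decode_transfer_encoding(payload: bytes, encoding: str) -> bytes:
    """Undo a Content-Transfer-Encoding (hypothetical helper for illustration)."""
    encoding = encoding.strip().lower()
    if encoding == "quoted-printable":
        return quopri.decodestring(payload)
    if encoding == "base64":
        return base64.b64decode(payload)
    # 7bit, 8bit and binary bodies are already in their final form.
    return payload


# A Quoted-Printable body containing a euro sign encoded as UTF-8 bytes.
raw_body = b"Remuneration: 100 =E2=82=AC"
print(decode_transfer_encoding(raw_body, "quoted-printable").decode("utf-8"))
```

This is why an Extraction Scheme that works on the text shown in an email client may find nothing in the raw source: the raw body still contains sequences such as ''=E2=82=AC'' until it has been decoded.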
Line 138: Line 138:
 [{{ :autobook-v1.1-extractionwizard-singleview-datasource.png?nolink|Data Source selection}}] [{{ :autobook-v1.1-extractionwizard-singleview-datasource.png?nolink|Data Source selection}}]
  
-The choice of the data source is important only if you are going to transmit data to Auto Book via [[start#Direct transmission of Email Data to Auto Book|Direct Transmission]] or [[start#Processing the Email Source|Email Source Extraction]]. For [[start#Selecting Text Manually|Normal Clipboard Extraction]], your selection won't make a difference, because you are using only one piece of text for data extraction (the text within the clipboard); in this case, click either **(6) Clipboard/General** or any of the other options in case you want to make your scheme compatible with other methods of data transmission.+The choice of the data source is important only if you are going to use [[manual#Parameter Extraction Mode]] or [[manual#Email Source Extraction Mode]]. For [[manual#Normal Text Extraction]], your selection won't make a difference, because you are using only one piece of text for data extraction (the text within the clipboard); in this case, click either **(6) Clipboard/General** or any of the other options in case you want to make your Extraction Scheme compatible with other methods of data transmission.
  
-Otherwise, pick the source from which you want to extract the data for this database column:+Otherwise, pick the source from which you want to extract the data for this Database column:
  
 (1) Date: The email date as indicated by your email client.\\ (1) Date: The email date as indicated by your email client.\\
Line 147: Line 147:
 (4) Address: The email address of the sender of the email as indicated by your email client.\\ (4) Address: The email address of the sender of the email as indicated by your email client.\\
 (5) Text: The text body of the email as indicated by your email client.\\ (5) Text: The text body of the email as indicated by your email client.\\
-(6) Clipboard/General: Select this option only if //not// using Direct Transmission or Email Source Extraction.\\+(6) Clipboard/General: Select this option only if //not// using [[manual#Parameter Extraction Mode]] or [[manual#Email Source Extraction Mode]].\\
  
 If you're going to process the email source, these sources – Date, Sender, Subject, Address and Text (body of the email) are automatically extracted from the email source, and your Extraction Scheme settings will be applied onto these resulting sources. Thus, if you are going to extract a part of an email's subject line, for example, you don't need to worry about capturing the subject line from the email source, but only need to make the settings to define which part of this single line you need. As another example, if you select Date as your source for a certain column, you won't need to make any other settings if you're happy to capture the whole date as per email source into your Database – selecting a source without any other limiting settings means you are going to keep the whole source. If you're going to process the email source, these sources – Date, Sender, Subject, Address and Text (body of the email) are automatically extracted from the email source, and your Extraction Scheme settings will be applied onto these resulting sources. Thus, if you are going to extract a part of an email's subject line, for example, you don't need to worry about capturing the subject line from the email source, but only need to make the settings to define which part of this single line you need. As another example, if you select Date as your source for a certain column, you won't need to make any other settings if you're happy to capture the whole date as per email source into your Database – selecting a source without any other limiting settings means you are going to keep the whole source.
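
As a rough illustration of how an email source breaks down into the five sources named above, here is a sketch using Python's standard ''email'' package. The function ''split_email_source'' and the sample message are assumptions made for this example; Auto Book's own Email Interpreter may differ in its details.

```python
from email import message_from_string
from email.policy import default
from email.utils import parseaddr


def split_email_source(source: str) -> dict:
    """Split a raw email source into Date, Sender, Subject, Address and Text."""
    msg = message_from_string(source, policy=default)
    sender_name, sender_address = parseaddr(str(msg["From"]))
    # Prefer the plain-text part; fall back to HTML if there is none.
    body = msg.get_body(preferencelist=("plain", "html"))
    return {
        "Date": str(msg["Date"]),
        "Sender": sender_name,
        "Subject": str(msg["Subject"]),
        "Address": sender_address,
        "Text": body.get_content() if body is not None else "",
    }


# Hypothetical sample source, much shorter than a real email header.
SAMPLE_SOURCE = (
    "From: Jane Doe <jane@example.com>\n"
    "Date: Wed, 14 Sep 2022 11:47:00 +0000\n"
    "Subject: PO: XY123456\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "Remuneration: 100 EUR\n"
)

sources = split_email_source(SAMPLE_SOURCE)
print(sources["Subject"])
```

Each column's settings then operate on just one of these values, which is why a column using Subject as its source only needs to say which part of that single line to keep.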
Line 167: Line 167:
 This is done by entering the line number or range of numbers into the **[N]** field. The initial line – the beginning of the source text or the line where the specific text occurs – is considered line 1, the next line is line 2, and so on. To define a range of line numbers, enter, for example, <q>2-5</q> or <q>1<N<6</q> for lines from the 2nd to the 5th, <q><5</q> for lines up to the 4th, <q>>1</q> for all following lines, and so on. Negative line numbers are currently not supported, but might be implemented in a later Auto Book version. This is done by entering the line number or range of numbers into the **[N]** field. The initial line – the beginning of the source text or the line where the specific text occurs – is considered line 1, the next line is line 2, and so on. To define a range of line numbers, enter, for example, <q>2-5</q> or <q>1<N<6</q> for lines from the 2nd to the 5th, <q><5</q> for lines up to the 4th, <q>>1</q> for all following lines, and so on. Negative line numbers are currently not supported, but might be implemented in a later Auto Book version.
  
-When indicating a range of lines, note that at most one line will be inserted into your database. If you don't make any other limiting settings (e.g. via [[#Word Limitations]]), the first line of your range will be captured. If you do make other limiting settings, the first line with content that fulfills all other conditions as well will be captured. For example, if you stipulate that captured words must include the letters <q>EUR</q>, the first line of your range where such a word appears will be captured.+When indicating a range of lines, note that at most one line will be inserted into your Database. If you don't make any other limiting settings (e.g. via [[#Word Limitations]]), the first line of your range will be captured. If you do make other limiting settings, the first line with content that fulfills all other conditions as well will be captured. For example, if you stipulate that captured words must include the letters <q>EUR</q>, the first line of your range where such a word appears will be captured.
  
 To count only lines that include some content (anything other than whitespace), click **N-th non-empty line from** in the top field. Otherwise, to count all lines, click **N-th line from**. To count only lines that include some content (anything other than whitespace), click **N-th non-empty line from** in the top field. Otherwise, to count all lines, click **N-th line from**.
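
The **[N]** notations and the "at most one line" rule can be sketched in Python as follows. This is only an approximation for illustration: ''parse_line_range'' and ''pick_line'' are hypothetical names, counting always starts at the beginning of the text, and the "other limiting settings" are reduced to a plain substring check.

```python
import re
from typing import Optional, Tuple


def parse_line_range(spec: str) -> Tuple[int, Optional[int]]:
    """Turn an [N] entry into inclusive 1-based bounds; None means open-ended."""
    spec = spec.replace(" ", "")
    m = re.fullmatch(r"(\d+)", spec)
    if m:
        return int(m.group(1)), int(m.group(1))
    m = re.fullmatch(r"(\d+)-(\d+)", spec)
    if m:
        return int(m.group(1)), int(m.group(2))
    m = re.fullmatch(r"(\d+)<N<(\d+)", spec)
    if m:
        return int(m.group(1)) + 1, int(m.group(2)) - 1
    m = re.fullmatch(r"<(\d+)", spec)
    if m:
        return 1, int(m.group(1)) - 1
    m = re.fullmatch(r">(\d+)", spec)
    if m:
        return int(m.group(1)) + 1, None
    raise ValueError(f"unsupported line range: {spec!r}")


def pick_line(text: str, spec: str, non_empty_only: bool = False,
              must_contain: Optional[str] = None) -> Optional[str]:
    """Return the first line in the range that meets the extra condition."""
    first, last = parse_line_range(spec)
    lines = text.splitlines()
    if non_empty_only:
        # Mirrors "N-th non-empty line from": whitespace-only lines are not counted.
        lines = [line for line in lines if line.strip()]
    for number, line in enumerate(lines, start=1):
        if number < first:
            continue
        if last is not None and number > last:
            break
        if must_contain is None or must_contain in line:
            return line
    return None


SAMPLE_TEXT = "Hello\n\nPO: XY123456\nFee: 100 EUR\nBye"
print(pick_line(SAMPLE_TEXT, "1<N<6", must_contain="EUR"))
```

With no extra condition, ''pick_line(SAMPLE_TEXT, "2-5")'' returns the empty second line, while the same call with ''non_empty_only=True'' skips it and returns the PO line instead.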
Line 210: Line 210:
 ⯈ To capture <q>XY123456</q>, enter <q>:</q> or <q>PO:</q> (a trailing space is optional). By entering <q>PO</q>, you make sure that only lines including <q>PO:</q> will be captured, if you haven't defined the line via [[#Line Limitations]]. Otherwise, the first line including <q>:</q> will be used. ⯈ To capture <q>XY123456</q>, enter <q>:</q> or <q>PO:</q> (a trailing space is optional). By entering <q>PO</q>, you make sure that only lines including <q>PO:</q> will be captured, if you haven't defined the line via [[#Line Limitations]]. Otherwise, the first line including <q>:</q> will be used.
  
-(This example is identical to using the [[start#standard_schemes|Standard Format]] in case <q>PO</q> is also the column title.)+(This example is identical to using the [[manual#standard_schemes|Standard Format]] in case <q>PO</q> is also the column title.)
  
 == - Use Standard Format checkbox == == - Use Standard Format checkbox ==
  
-To extract text based on the [[start#standard_schemes|Standard Format]], activate the **Use Standard Format** checkbox. It's in this group of controls because it is, in effect, a kind of pre-defined sub-line limitation.+To extract text based on the [[manual#standard_schemes|Standard Format]], activate the **Use Standard Format** checkbox. It's in this group of controls because it is, in effect, a kind of pre-defined sub-line limitation.
  
 This means that you are going to extract the line part following the column header and a colon, such as <q>2022-12-31</q> from the line <q>Date: 2022-12-31</q> if <q>Date</q> is the column header. As this is a pre-defined complete column configuration, all other controls will be disabled, except [[#Fixed Text]], which you still can add before or after the extracted text. This means that you are going to extract the line part following the column header and a colon, such as <q>2022-12-31</q> from the line <q>Date: 2022-12-31</q> if <q>Date</q> is the column header. As this is a pre-defined complete column configuration, all other controls will be disabled, except [[#Fixed Text]], which you still can add before or after the extracted text.
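
In code terms, the Standard Format amounts to something like the sketch below. The function name ''extract_standard_format'' is hypothetical, and the matching rules (case-sensitive, first matching line wins, surrounding whitespace trimmed) are assumptions rather than documented Auto Book behavior.

```python
from typing import Optional


def extract_standard_format(text: str, column_header: str) -> Optional[str]:
    """Return the text following '<column header>:' on the first matching line."""
    marker = column_header + ":"
    for line in text.splitlines():
        if marker in line:
            # Keep only the part of the line behind the header and colon.
            return line.split(marker, 1)[1].strip()
    return None


print(extract_standard_format("Hello\nDate: 2022-12-31\nBye", "Date"))
```

If the column were named differently from the label in the text, the marker would have to be supplied separately, which corresponds to typing it into the **Pick text part from** field.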
Line 287: Line 287:
 The two fields in the **Fixed Text** group allow you to add fixed text, i.e. text independent of the source text, in front of or behind the extracted text. The two fields in the **Fixed Text** group allow you to add fixed text, i.e. text independent of the source text, in front of or behind the extracted text.
  
-If you want to store ONLY fixed text for this database column, simply leave all other fields empty. In this case, it doesn't matter whether you use the **Put this text in front** or **Put this text behind** field, as there is nothing to put it in front of or behind.+If you want to store ONLY fixed text for this Database column, simply leave all other fields empty. In this case, it doesn't matter whether you use the **Put this text in front** or **Put this text behind** field, as there is nothing to put it in front of or behind.
  
-The **Fixed Text** fields also accept a few commands that make it somewhat dynamic - it's called "fixed" because it doesn't depend on the source text. Simply enter each command including the tags <> into either one of the **Fixed Text** fields. When saving data to a database, these commands will be automatically replaced as detailed below:+The **Fixed Text** fields also accept a few commands that make it somewhat dynamic - it's called "fixed" because it doesn't depend on the source text. Simply enter each command including the tags <> into either one of the **Fixed Text** fields. When saving data to a Database, these commands will be automatically replaced as detailed below:
  
-|<AutoFolder>|Will be replaced with the folder path generated from the AutoFolder pattern saved with this Extraction Scheme (see section ???? below).|+|<AutoFolder>|Will be replaced with the folder path generated from the [[manual#Auto Folder]] pattern saved with this Extraction Scheme.|
 |<Time>|Will be replaced with the current time and date in the system locale format (click the **Test** button if you're unsure what this format looks like on your computer).| |<Time>|Will be replaced with the current time and date in the system locale format (click the **Test** button if you're unsure what this format looks like on your computer).|
 |<Time.Format>|Will be replaced with the current date and/or time in a user-defined format.| |<Time.Format>|Will be replaced with the current date and/or time in a user-defined format.|
  
-The Time commands will also be replaced in the **Test Result** field, but <AutoFolder> is not because the [[start#Auto Folder]] pattern has not yet been set at this point.+The Time commands will also be replaced in the **Test Result** field, but <AutoFolder> is not because the [[manual#Auto Folder]] pattern has not yet been set at this point.
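
The command replacement can be pictured with the following Python sketch. It rests on assumptions: ''expand_fixed_text'' is a hypothetical name, and only the format tokens ''yyyy'', ''MM'', ''dd'', ''HH'', ''mm'' and ''ss'' are mapped here, which may be fewer than Auto Book supports. As in the **Test Result** field, <AutoFolder> is left untouched while no pattern is available.

```python
import re
from datetime import datetime
from typing import Optional

# Assumed subset of format tokens, mapped to strftime directives.
_TOKENS = {"yyyy": "%Y", "MM": "%m", "dd": "%d", "HH": "%H", "mm": "%M", "ss": "%S"}
_TOKEN_RE = re.compile("|".join(_TOKENS))


def expand_fixed_text(fixed_text: str, now: datetime,
                      auto_folder: Optional[str] = None) -> str:
    """Replace <Time>, <Time.Format> and, if known, <AutoFolder>."""

    def replace_time(match):
        fmt = match.group(1)
        if fmt is None:
            # Plain <Time>: date and time in the locale's default format.
            return now.strftime("%c")
        escaped = fmt.replace("%", "%%")
        return now.strftime(_TOKEN_RE.sub(lambda t: _TOKENS[t.group(0)], escaped))

    result = re.sub(r"<Time(?:\.([^<>]+))?>", replace_time, fixed_text)
    if auto_folder is not None:
        result = result.replace("<AutoFolder>", auto_folder)
    return result


moment = datetime(2022, 12, 31, 9, 35)
print(expand_fixed_text("<Time.yyyy-MM-dd>", moment))
```

Passing the timestamp in as a parameter keeps the sketch testable; a real run would use ''datetime.now()''.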
  
 == - Date/Time Formats == == - Date/Time Formats ==
Line 347: Line 347:
  
 ==== - Order Email ==== ==== - Order Email ====
 +
 +<WRAP box right prewrap 270px>
 +**See this example as a Video**
 +
 +{{ Website-Sample-Complete.mp4|Example Video}}
 +</WRAP>
  
 We will show step-by-step instructions for two examples for Auto Book's primary use case, that is, extracting purchase order information from emails. In the first example that follows immediately below, we will extract this information from the email text body as it is displayed in an email client. In the second example, we will use the email source instead to get all data we want (see [[#Extracting from email source]]). We will show step-by-step instructions for two examples for Auto Book's primary use case, that is, extracting purchase order information from emails. In the first example that follows immediately below, we will extract this information from the email text body as it is displayed in an email client. In the second example, we will use the email source instead to get all data we want (see [[#Extracting from email source]]).
Line 394: Line 400:
 To confirm that your settings for each column are correct, paste the above sample email into the large **Test String** field on the right side. Whenever you complete the instructions below for one of the columns, the extraction result will be automatically displayed in the **Test Result** field below. To confirm that your settings for each column are correct, paste the above sample email into the large **Test String** field on the right side. Whenever you complete the instructions below for one of the columns, the extraction result will be automatically displayed in the **Test Result** field below.
  
-**Date**: In this sample email, the client did not indicate an order date. We could use either [[Start#Direct transmission of Email Data to Auto Book|direct data transmission]] or [[start#Processing the Email Source|email source extraction]] to capture the email's sent date. However, let's assume we want to stick with the simplest case of manual extraction via <q>CTRL+SHIFT+E</q> from the email text body only. In this case:+**Date**: In this sample email, the client did not indicate an order date. We could use either [[manual#Parameter Extraction Mode]] or [[manual#Email Source Extraction Mode]] to capture the email's sent date. However, let's assume we want to stick with the simplest case of using [[manual#Normal Extraction Mode]], whereby you will select the email text with your mouse in your email client and then press <q>CTRL+SHIFT+E</q>. In this case:
  
 ⯈ Type <q><Time.yyyy-MM-dd></q> into either one of the **Fixed Text** fields (it doesn't matter which one). ⯈ Type <q><Time.yyyy-MM-dd></q> into either one of the **Fixed Text** fields (it doesn't matter which one).
Line 413: Line 419:
 ⯈ Enter <q>Regards OR Wishes OR Greetings OR Sincerely</q> for **Specific Text**.\\ ⯈ Enter <q>Regards OR Wishes OR Greetings OR Sincerely</q> for **Specific Text**.\\
  
-This approach is somewhat fuzzy in so far as the 4 words used as alternatives for **Specific Text** also could appear in a different context somewhere else within an email. However, for typical order emails, this shouldn't happen often and if it does, the name can be edited in the [[start#Data Preview]].+This approach is somewhat fuzzy in so far as the 4 words used as alternatives for **Specific Text** also could appear in a different context somewhere else within an email. However, for typical order emails, this shouldn't happen often and if it does, the name can be edited in the [[manual#Data Preview]].
  
 Click **Next**. Click **Next**.
Line 457: Line 463:
 ⯈ Activate the [[#Use Standard Format checkbox]] in the [[#Sub-Line Limitations]] group. ⯈ Activate the [[#Use Standard Format checkbox]] in the [[#Sub-Line Limitations]] group.
  
-Nothing else needs to be done, because this piece of data conforms to the [[start#Standard Schemes|Standard Format]], since the column is called <q>Remuneration</q> and the text we want to extract follows <q>Remuneration</q> and a colon (<q>:</q>), as prescribed by the [[start#Standard Schemes|Standard Format]]. (If the column were named differently, we would have to enter <q>Remuneration:</q> into the **Pick text part from** field.)+Nothing else needs to be done, because this piece of data conforms to the [[manual#Standard Schemes|Standard Format]], since the column is called <q>Remuneration</q> and the text we want to extract follows <q>Remuneration</q> and a colon (<q>:</q>), as prescribed by the [[manual#Standard Schemes|Standard Format]]. (If the column were named differently, we would have to enter <q>Remuneration:</q> into the **Pick text part from** field.)
  
 **Saving the Extraction Scheme** **Saving the Extraction Scheme**
Line 463: Line 469:
 Only the 2 **Comments** columns are left. We won't extract anything into these and keep them as a reserve in order to add notes to our records. We can use them, for example, to track whether each record has already been invoiced, etc. Only the 2 **Comments** columns are left. We won't extract anything into these and keep them as a reserve in order to add notes to our records. We can use them, for example, to track whether each record has already been invoiced, etc.
  
-So there's nothing left to do. Click **Save** and enter a name for the newly created Extraction Scheme. Then, select the sample email text above and press the Extraction Hotkey (<q>CTRL+SHIFT+E</q> by default). Select the name of the just created Extraction Scheme in the **Schemes** panel of the [[start#Data Preview]] that appears. All desired data should pop up, ready for being added to a Database via the **Add to Database** button. If you wish, also enter an [[start#Auto Folder]] pattern based on these data and create and open a folder by activating the corresponding checkboxes.+So there's nothing left to do. Click **Save** and enter a name for the newly created Extraction Scheme. After the Extraction Scheme has been saved, click **Exit** to close the Extraction Wizard. Then, select the sample email text above and press the Extraction Hotkey (<q>CTRL+SHIFT+E</q> by default). Select the name of the just created Extraction Scheme in the **Schemes** panel of the [[manual#Data Preview]] that appears. All desired data should pop up, ready for being added to a Database via the **Add to Database** button. If you wish, also enter an [[manual#Auto Folder]] pattern based on these data and create and open a folder by activating the corresponding checkboxes.
  
 === - Extracting from email source === === - Extracting from email source ===
Line 469: Line 475:
 In this example, we will use an email only slightly modified from the example above ([[#Extracting from email text body]]). The PO number, this time, is found only in the email's subject, but not in the text body. Furthermore, instead of using the current date, this time we want to extract the email's sent date, and we also want to extract the email's sender name instead of manually entering the client name. In this example, we will use an email only slightly modified from the example above ([[#Extracting from email text body]]). The PO number, this time, is found only in the email's subject, but not in the text body. Furthermore, instead of using the current date, this time we want to extract the email's sent date, and we also want to extract the email's sender name instead of manually entering the client name.
  
-As this information is not available within the text body, we have to use either [[Start#Direct transmission of Email Data to Auto Book|direct data transmission]] or [[start#Processing the Email Source|email source extraction]], and in this example, we are going to use the latter method. (The Extraction Wizard settings for [[Start#Direct transmission of Email Data to Auto Book|direct data transmission]] would actually be identical – the only difference is that we wouldn't be using the email source.)+As this information is not available within the text body, we have to use either [[manual#Parameter Extraction Mode]] or [[manual#Email Source Extraction Mode]], and in this example, we are going to use the latter method. (The Extraction Wizard settings for [[manual#Parameter Extraction Mode]] would actually be identical – the only difference is that we wouldn't be using the email source.)
  
 Below is our sample email source. The header is slightly shortened for space reasons. If the header of your emails contains lengthy incomprehensible data salad, don't worry about it – Auto Book will simply ignore it. Below is our sample email source. The header is slightly shortened for space reasons. If the header of your emails contains lengthy incomprehensible data salad, don't worry about it – Auto Book will simply ignore it.
Line 575: Line 581:
 ⯈ Enter <q>Regards OR Wishes OR Greetings OR Sincerely</q> for **Specific Text**.\\ ⯈ Enter <q>Regards OR Wishes OR Greetings OR Sincerely</q> for **Specific Text**.\\
  
-This approach is somewhat fuzzy in so far as the 4 words used as alternatives for **Specific Text** also could appear in a different context somewhere else within an email. However, for typical order emails, this shouldn't happen often and if it does, the name can be edited in the [[start#Data Preview]].+This approach is somewhat fuzzy in so far as the 4 words used as alternatives for **Specific Text** also could appear in a different context somewhere else within an email. However, for typical order emails, this shouldn't happen often and if it does, the name can be edited in the [[manual#Data Preview]].
  
 Click **Next**. Click **Next**.
Line 610: Line 616:
 ⯈ Activate the [[#Use Standard Format checkbox]] in the [[#Sub-Line Limitations]] group. ⯈ Activate the [[#Use Standard Format checkbox]] in the [[#Sub-Line Limitations]] group.
  
-Nothing else needs to be done, because this piece of data conforms to the [[start#Standard Schemes|Standard Format]], since the column is called <q>Remuneration</q> and the text we want to extract follows <q>Remuneration</q> and a colon (<q>:</q>), as prescribed by the [[start#Standard Schemes|Standard Format]]. (If the column were named differently, we would have to enter <q>Remuneration:</q> into the **Pick text part from** field.)+Nothing else needs to be done, because this piece of data conforms to the [[manual#Standard Schemes|Standard Format]], since the column is called <q>Remuneration</q> and the text we want to extract follows <q>Remuneration</q> and a colon (<q>:</q>), as prescribed by the [[manual#Standard Schemes|Standard Format]]. (If the column were named differently, we would have to enter <q>Remuneration:</q> into the **Pick text part from** field.)
  
 **Saving the Extraction Scheme** **Saving the Extraction Scheme**
Line 616: Line 622:
 Only the 2 **Comments** columns are left. We won't extract anything into these and keep them as a reserve in order to add notes to our records. We can use them, for example, to track whether each record has already been invoiced, etc. Only the 2 **Comments** columns are left. We won't extract anything into these and keep them as a reserve in order to add notes to our records. We can use them, for example, to track whether each record has already been invoiced, etc.
  
-So there's nothing left to do. Click **Save** and enter a name for the newly created Extraction Scheme. After the Extraction Scheme has been saved, click **Exit** to close the Extraction Wizard. Then, select the sample email text above and press the Extraction Hotkey (<q>CTRL+SHIFT+E</q> by default). Select the name of the just created Extraction Scheme in the **Schemes** panel of the [[start#Data Preview]] that appears. All desired data should pop up, ready for being added to a Database via the **Add to Database** button. If you wish, also enter an [[start#Auto Folder]] pattern based on these data and create and open a folder by activating the corresponding checkboxes.+So there's nothing left to do. Click **Save** and enter a name for the newly created Extraction Scheme. After the Extraction Scheme has been saved, click **Exit** to close the Extraction Wizard. Then, select the sample email text above and press the Email Source Extraction Hotkey (<q>CTRL+WIN+E</q> by default). Select the name of the just created Extraction Scheme in the **Schemes** panel of the [[manual#Data Preview]] that appears. All desired data should pop up, ready for being added to a Database via the **Add to Database** button. If you wish, also enter an [[manual#Auto Folder]] pattern based on these data and create and open a folder by activating the corresponding checkboxes.
extraction_wizard.1653550552.txt.gz · Last modified: 2022/05/26 09:35 by autobook