Meta Integration® Model Bridge (MIMB)
"Metadata Integration" Solution

MIMB Bridge Documentation

MIMB Import Bridge from Microsoft Azure Blob Storage

Bridge Specifications

Vendor Microsoft
Tool Name Azure Blob Storage
Tool Version 1
Tool Web Site https://azure.microsoft.com/en-us/services/storage/blobs/
Supported Methodology [File System] Multi-Model, Data Store (NoSQL / Hierarchical) via Java API

BRIDGE INFORMATION
Import tool: Microsoft Azure Blob Storage 1 (https://azure.microsoft.com/en-us/services/storage/blobs/)
Import interface: [File System] Multi-Model, Data Store (NoSQL / Hierarchical) via Java API from Microsoft Azure Blob Storage
Import bridge: 'MicrosoftAzureBlobStorage' 10.1.0

BRIDGE DISCLAIMER
This bridge requires internet access to https://repo.maven.apache.org/maven2/ (and exceptionally a few other tool sites)
in order to download the necessary third party software libraries into $HOME/data/download/MIMB/
(such directory can be copied from another MIMB server with internet access).
By running this bridge, you hereby acknowledge responsibility for the license terms and any potential security vulnerabilities from these downloaded third party software libraries.

BRIDGE DOCUMENTATION
This bridge crawls a data lake implemented on the Microsoft Azure Blob Storage Service to detect (reverse engineer) metadata from all the data files (for data catalog purpose).
It is critical that the parameters are filled correctly in order to satisfy the local connection requirements on the machine that runs the bridge.

SUPPORTED FILES

This bridge supports the following file formats:
- Delimited (Flat) files such as CSV (see details below)
- Positional (Fixed Length) files typically from mainframe (see details below)
- COBOL COPYBOOK files typically from mainframe (see details below)
- Office Open XML Excel .XLSX (see details below)
- W3C XML
- JSON (JavaScript Object Notation)
- Apache Avro
- Apache Parquet
- Apache ORC

as well as the compressed versions of the above formats:
- ZIP (as a compression format, not as archive format)
- BZIP
- GZIP
- LZ4
- Snappy (as standard Snappy format, not as Hadoop native Snappy format)

DELIMITED FILES

This bridge detects (reverse engineers) the metadata from a data file of type Delimited File (also known as Flat File).
The detection of such Delimited File is not based on file extensions (such as .CSV, .PSV) but rather by sampling the file content.

The bridge can detect a header row and use it to create the field names; otherwise generic field names are created.

The bridge samples up to 1000 rows in order to automatically detect the field separators, which by default include:
',' (comma), ';' (semicolon), ':' (colon), '\t' (tab), '|' (pipe), 0x1 (Ctrl+A)
More separators can be added to the auto detection process (including double characters); see the Miscellaneous parameter.

During the sampling, the bridge also detects the file data types, such as DATE, NUMBER, STRING.
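
The separator detection described above can be illustrated with a minimal sketch. This is not the bridge's actual algorithm, which is not documented here; the function name guess_separator and the scoring rule (prefer the candidate that appears the same non-zero number of times in the most sampled rows) are assumptions made for illustration only.

```python
from collections import Counter

# Default candidates listed in this documentation.
DEFAULT_SEPARATORS = [",", ";", ":", "\t", "|", "\x01"]


def guess_separator(lines, separators=DEFAULT_SEPARATORS, max_rows=1000):
    """Return the candidate that splits the sampled rows most consistently.

    Hypothetical heuristic: for each candidate, find the most common
    per-row occurrence count; ignore candidates whose most common count
    is zero; prefer the candidate shared by the most rows, then the one
    with the higher per-row count.
    """
    sample = [ln for ln in lines[:max_rows] if ln.strip()]
    if not sample:
        return None
    best, best_score = None, (0, 0)
    for sep in separators:
        counts = Counter(ln.count(sep) for ln in sample)
        per_row, rows = counts.most_common(1)[0]
        if per_row == 0:
            continue
        score = (rows, per_row)
        if score > best_score:
            best, best_score = sep, score
    return best


rows = ["id;name;joined", "1;Ann;2019-01-02", "2;Bob, Jr.;2019-02-03"]
print(repr(guess_separator(rows)))
```

In this sample the semicolon appears twice in every row, while the comma appears in only one row, so the semicolon is selected.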

POSITIONAL FILES

This bridge creates metadata for data files of type Positional File (also known as Fixed Length File).
Such metadata cannot be automatically detected (reverse engineered) by sampling the data files (e.g. customers.dat or even just customers with no extension).
Therefore, this bridge imports a 'Positional File Definition' file, which must have the extension .positional_file_definition
(e.g. the format file customers.dat.positional_file_definition creates the metadata of a file named customers with the fields defined inside).
This is the equivalent of an RDBMS DDL for positional files. With such a long extension, this data definition file can coexist with the actual data files in each file system directory containing them.

The 'Positional File Definition' file format is defined as follows:
- Format file must start with the following header:
column name, position, width, data type, comment
- All positions must be unique and greater than or equal to 1.
a,1
b,5
- The file format is invalid when some columns have positions and others don't.
a,1
b,
c,5
- When no column has a position but every column has a width, the application assumes that the columns are ordered and calculates the positions from the widths.
a,,4 -> a,1,4
b,,25 -> b,5,25
- When positions are present, the application uses widths for documentation only.
a,1,4
b,5,25
- Types and comments are used as documentation only.
a,1,4,int
b,5,25,char[25],identifier
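
The rules above can be summarized in a short sketch. It is an illustration of the stated rules only, not the bridge's parser; the function name parse_positional_definition and the dictionary keys are hypothetical.

```python
import csv
import io

EXPECTED_HEADER = ["column name", "position", "width", "data type", "comment"]


def parse_positional_definition(text):
    """Parse a 'Positional File Definition' and resolve column positions."""
    rows = [r for r in csv.reader(io.StringIO(text), skipinitialspace=True) if r]
    if not rows or [h.strip().lower() for h in rows[0]] != EXPECTED_HEADER:
        raise ValueError("missing or unexpected header")
    columns = []
    for row in rows[1:]:
        # Pad short rows so that trailing fields are optional.
        row = [cell.strip() for cell in row] + [""] * (5 - len(row))
        columns.append({
            "name": row[0],
            "position": int(row[1]) if row[1] else None,
            "width": int(row[2]) if row[2] else None,
            "type": row[3],      # documentation only
            "comment": row[4],   # documentation only
        })
    if not columns:
        return columns
    positioned = [c for c in columns if c["position"] is not None]
    if positioned and len(positioned) != len(columns):
        raise ValueError("some columns have positions and others do not")
    if not positioned:
        if any(c["width"] is None for c in columns):
            raise ValueError("columns need either positions or widths")
        next_position = 1
        for c in columns:
            c["position"] = next_position
            next_position += c["width"]
    positions = [c["position"] for c in columns]
    if min(positions) < 1 or len(set(positions)) != len(positions):
        raise ValueError("positions must be unique and >= 1")
    return columns


definition = "column name, position, width, data type, comment\na,,4\nb,,25\n"
print([(c["name"], c["position"], c["width"]) for c in parse_positional_definition(definition)])
# prints [('a', 1, 4), ('b', 5, 25)]
```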

COBOL COPYBOOK FILES

This bridge can only import the COBOL COPYBOOK files (which contain the data definitions), therefore does not detect (reverse engineer) metadata from actual COBOL data files.
The detection of such COBOL COPYBOOK File is not based on file extensions (such as .CPY) but rather by sampling the file content.

This bridge creates a 'Physical Hierarchical Model' which reflects a truly flat, byte-position defined, record structure, which is useful for stitching to the DI/ETL processes. Therefore, the physical model has all the physical elements required to define a flat record, which is ONE table with all the elements (including multiple columns for OCCURS elements when the proper bridge parameter is set).

Note that this bridge does not currently support the COPY verb, and reports a parsing error at the line and position at which the COPY statement begins. In order to import Copybooks that contain COPY statements, create an expanded Copybook file with the included sections already in place (replacing the COPY verb). Most COBOL compilers have an option to output only the preprocessed Copybooks with the COPY and REPLACE statements expanded.

Frequently Asked Questions:
Q: Why is the default start column '6' (six) and the default end column '72' (seventy-two)?
A: The bridge parser counts columns starting at 0 (zero), rather than 1 (one). Thus, the defaults leave the standard first six columns for line numbers, next column for comment indicators, and last 8 columns (out of 80) for additional line comment information.
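
As an illustration of the answer above, the following sketch extracts the part of an 80-column copybook line that lies between the default start and end columns. It assumes that the end column is an exclusive, zero-based bound, which is the reading consistent with reserving the last 8 of 80 columns; the function name copybook_area is hypothetical and the bridge's internal handling may differ.

```python
def copybook_area(line, start_column=6, end_column=72):
    """Return the zero-based slice [start_column, end_column) of a copybook line.

    With the defaults, columns 0-5 (line numbers) and columns 72-79
    (trailing comment area) are dropped; column 6 (the comment
    indicator) is the first character kept.
    """
    padded = line.rstrip("\n").ljust(end_column)
    return padded[start_column:end_column]


# Build an 80-column line: 6-digit line number, indicator, code, 8-char trailer.
line = "000100" + " " + "01  CUSTOMER-RECORD.".ljust(65) + "CUST0001"
area = copybook_area(line)
print(repr(area[0]), repr(area[1:].rstrip()))
# prints ' ' '01  CUSTOMER-RECORD.'
```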

EXCEL (XLSX) FILES

This bridge detects (reverse engineers) the metadata from a data file of type Excel XML format (XLSX).
The detection of such Excel File is based on file extension .XLSX.

The bridge can detect a header row and use it to create the field names; otherwise generic field names are created.

The bridge samples up to 1000 rows to detect the file data types, such as DATE, NUMBER, STRING.

If an Excel file has multiple sheets, each sheet is imported as the equivalent of a file/table with the same sheet name.

The bridge uses the machine's locale to read files and allows you to specify the character set encoding that the files use.

MORE INFORMATION

Please refer to the individual parameter's tool tips for more detailed examples.


Bridge Parameters

Parameter Name Description Type Values Default Scope
Storage account An Azure storage account provides a unique namespace in the cloud to store and access your data objects in Azure Storage. STRING      
Storage access key A String that represents the Base-64-encoded 512-bit storage account access key, which is used for authentication when the storage is accessed. PASSWORD      
Root directory Set the directory containing the metadata files, or specify it using the browsing tool.

The bridge uses only the wasbs protocol to load files,
e.g. wasbs://container_01@user.blob.core.windows.net/Folder_01/Samples
REPOSITORY_SUBSET     Mandatory
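
As a small illustration of the Root directory format shown above, the following sketch assembles a wasbs URL from its parts. The function name wasbs_root and its parameter names are hypothetical; only the URL layout is taken from the example in this documentation.

```python
def wasbs_root(container, storage_account, path=""):
    """Build a wasbs:// URL of the form used by the 'Root directory' example."""
    base = f"wasbs://{container}@{storage_account}.blob.core.windows.net"
    path = path.strip("/")
    return f"{base}/{path}" if path else base


print(wasbs_root("container_01", "user", "Folder_01/Samples"))
# prints wasbs://container_01@user.blob.core.windows.net/Folder_01/Samples
```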
Include filter The include folder and file filter pattern relative to the root directory.
The pattern uses extended unix glob case-sensitive expression syntax.
Here are some common examples:
*.* - include any file at the root level
*.csv - include only csv files at the root level
**.csv - include only csv files at any level
*.{csv,gz} - include only csv or gz files at the root level
dir\*.csv - include only csv files in the 'dir' folder
dir\**.csv - include only csv files under 'dir' folder at any level
dir\**.* - include any file under 'dir' folder at any level
f.csv - include only f.csv under root level
**\f.csv - include only f.csv at any level
**dir\** - include all files under any 'dir' folder at any level
**dir1\dir2\** - include all files under any 'dir2' folder under any 'dir1' folder at any level
STRING      
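
To make the filter examples above concrete, here is a minimal sketch that translates such a pattern into a regular expression. It is an approximation for illustration, not the bridge's matcher: it assumes that '*' stays within one folder level, that '**' crosses folder levels, that '{a,b}' is an alternation, and that the backslash is the folder separator as in the examples. The function names glob_to_regex and matches are hypothetical.

```python
import re


def glob_to_regex(pattern):
    """Translate an include/exclude filter pattern into a compiled regex."""
    out = []
    i = 0
    in_group = False
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")            # any characters, across folder levels
            i += 2
            continue
        ch = pattern[i]
        if ch == "*":
            out.append(r"[^\\]*")       # any characters within one level
        elif ch == "{":
            out.append("(?:")
            in_group = True
        elif ch == "}" and in_group:
            out.append(")")
            in_group = False
        elif ch == "," and in_group:
            out.append("|")
        else:
            out.append(re.escape(ch))   # literal character, case-sensitive
        i += 1
    return re.compile("".join(out))


def matches(pattern, relative_path):
    """Return True when the whole relative path matches the pattern."""
    return glob_to_regex(pattern).fullmatch(relative_path) is not None


print(matches("*.csv", "a.csv"), matches("*.csv", "dir\\a.csv"))
# prints True False
print(matches("dir\\**.csv", "dir\\sub\\a.csv"))
# prints True
```

Edge cases such as whether '**\f.csv' also matches f.csv directly at the root level depend on the bridge's own implementation and are not settled by this sketch.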
Exclude filter The exclude folder and file filter pattern relative to the root directory.
The pattern uses the same syntax as the Include filter. See it for the syntax details and examples.
Files that match the exclude filter are skipped.
When both include and exclude filters are empty all folders and files under the Root directory are included.
When the include filter is empty and the exclude filter is not, all folders and files under the Root directory are included except those matching the exclude filter.
STRING      
Partition directories Files-based partition directories' paths.
The bridge tries to detect partitions automatically. It can take a long time when partitions have a lot of files.
You can shortcut the detection process for some or all partitions by specifying them in this parameter.
Specify the partition directory path relative to the Root directory.
Use . to specify the root directory as the partitioned directory.
Separate multiple paths with the , (or ;) character.

ETL tools can read and write to pattern-based partition directories.
For example, ETL can read all *.csv files from a folder F. The ETL bridge represents it as the '*.csv' dataset in the 'F' folder (F/*.csv).
You can instruct this bridge to generate the matching dataset by specifying its name in square brackets after the folder name, like F[*.csv].
The same is true for application-specific partitions.
For example, ETL can write files under folder F to partition sub-folders named using the 'getDate@[yyyyMMdd]' function expression.
The result is represented as the 'getDate@[yyyyMMdd]' dataset in the 'F' folder (F/getDate@[yyyyMMdd]).
Again, you can instruct this bridge to generate the matching dataset by specifying something like F/[getDate@[yyyyMMdd]].

You may specify additional info about partitioned directory internal structure, using [dataset name] and {partitioned column name} patterns for following cases:
For application partitions like:
zone/po/us/2018/00001.csv
use: zone/[po]/{region}/{year}/*.csv or
zone/[po]/{*}/{*}/*.csv
if the partition column names are not important. They will be stitched by position.

For custom application partitions like:
zone/table1/2018/data/00001.csv
zone/table1/2018/log/00001.txt
zone/table2/2018/data/00001.csv
zone/table2/2018/log/00001.txt
use: zone/*/{year}/[data]/*.csv, zone/*/{year}/[log]/*.txt

For file based partitions like:
zone/mlcs.dataset1_data_document_20190219_132315.125.csv
zone/mlcs.dataset1_data_document_20190313_232416.225.csv
zone/mlcs.dataset1_data_document_20190414_532317.535.csv
zone/mlcs.dataset2_data_document_20190211_131215.125.xml
zone/mlcs.dataset2_data_document_20190314_130316.225.xml
zone/mlcs.dataset2_data_document_20190416_132317.535.xml

use: zone/mlcs.[dataset1]_data_document_{date}.csv,zone/mlcs.[dataset2]_data_document_{date}.xml
STRING      
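
The [dataset name] and {partitioned column name} patterns above can be illustrated with a sketch that matches a file path against such a pattern and extracts the dataset names and partition column values. This only models the simple cases shown in the examples (it does not handle nested brackets such as [getDate@[yyyyMMdd]]), and it is not the bridge's implementation; the function names and the generated column names col1, col2 for {*} are hypothetical.

```python
import re

# [dataset], {column} or a bare * wildcard
TOKEN = re.compile(r"\[(?P<dataset>[^\[\]]+)\]|\{(?P<column>[^{}]+)\}|(?P<star>\*)")


def compile_partition_pattern(pattern):
    """Return (compiled regex, dataset names) for a partition pattern."""
    parts, datasets, pos, anonymous = [], [], 0, 0
    for m in TOKEN.finditer(pattern):
        parts.append(re.escape(pattern[pos:m.start()]))
        if m.group("dataset"):
            datasets.append(m.group("dataset"))
            parts.append(re.escape(m.group("dataset")))
        elif m.group("column"):
            name = m.group("column")
            if name == "*":             # unnamed partition column
                anonymous += 1
                name = f"col{anonymous}"
            parts.append(f"(?P<{name}>[^/]+)")
        else:
            parts.append("[^/]*")       # wildcard within one path segment
        pos = m.end()
    parts.append(re.escape(pattern[pos:]))
    return re.compile("".join(parts)), datasets


def match_partition(pattern, path):
    """Return {'datasets': [...], 'columns': {...}} or None when no match."""
    regex, datasets = compile_partition_pattern(pattern)
    m = regex.fullmatch(path)
    if m is None:
        return None
    return {"datasets": datasets, "columns": m.groupdict()}


print(match_partition("zone/[po]/{region}/{year}/*.csv", "zone/po/us/2018/00001.csv"))
# prints {'datasets': ['po'], 'columns': {'region': 'us', 'year': '2018'}}
```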
Partition file number Number of files to scan during the analysis of data-partitioning directories. This parameter has no effect when the 'Partition directories' parameter is specified. NUMERIC      
Incremental import Specifies whether to import only the changes made in the source or to re-import everything (as specified in other parameters).

True - import only the changes made in the source.
False - import everything (as specified in other parameters).

An internal cache is maintained for each metadata source, which contains previously imported models. If this is the first import or if the internal cache has been deleted or corrupted, the bridge will behave as if this parameter is set to 'False'.
BOOLEAN (values: False, True; default: True)
Miscellaneous Specify miscellaneous options identified with a -letter and value.

For example, -m 4G -f 100 -j -Dname=value -Xms1G

-m the maximum Java memory size whole number (e.g. -m 4G or -m 2500M ).
-v set environment variable(s) (e.g. -v var1=value -v var2="value with spaces").
-j the last option that is followed by Java command line options (e.g. -j -Dname=value -Xms1G).
-hadoop key1=val1;key2=val2 to manually set hadoop configuration options
-tps 10 maximum threads pool size
-tl 3600s processing time limit, in s (seconds), m (minutes), or h (hours);
-fl 1000 processing files count limit;
-delimited.top_rows_skip 1 number of rows to skip while processing csv files
-delimited.extra_separators ~,||,|~ comma-separated extra delimiters, each of which will be used while processing csv files
-delimited.no_header by default, the bridge automatically tries to detect headers while processing csv files (based on header column types); use this option to disable header import (e.g. to hide sensitive data)
-fresh.partition.models - use to import the latest modified files when processing partitions defined in the 'Partition directories' parameter
-subst K: C:/test - use to associate a root path part with a drive or another path.
-skip.download - use to disable dependencies downloading and use only download cache
-prescript [cmd] - runs a script command before bridge execution. Example: -prescript "script.bat"
The script must be located in the bin directory, and have .bat or .sh extension.
The script path must not include any parent directory symbol (..)
The script should return exit code 0 to indicate success, or another value to indicate failure.
-disable.partitions.autodetection - use this option to disable automatic partition detection (when the 'Partition directories' parameter is empty)
STRING      

 

Bridge Mapping

Mapping information is not available

Last updated on Thu, 7 Nov 2019 17:33:24

Copyright © Meta Integration Technology, Inc. 1997-2019 All Rights Reserved.

Meta Integration® is a registered trademark of Meta Integration Technology, Inc.
All other trademarks, trade names, service marks, and logos referenced herein belong to their respective companies.