Kỹ thuật nén số liệu mã hóa Số học

TÓM TẮT LUẬN VĂN TỐT NGHIỆP Nội dung của đồ án với đề tài Kỹ thuật nén số liệu mã hóa Số học được triển khai thành hai phần chính là phần thuyết minh và phần chương trình: * Phần thuyết minh gồm 6 chương được tổ chức như sau: Chương I: Giới thiệu thực trạng và sự cần thiết của nén số liệu. Chương II: Trình bày các kiến thức tổng quan, một số khái niệm và cấu trúc của quá trình nén số liệu. Chương III: Trình bày về các vấn đề liên quan của quá trình mã hóa. Triển khai các thuật toán với cấu trúc dữ liệu phù hợp để cài đặt trên ngôn ngữ lập trình bậc cao. Chương IV: Trình bày về cấu trúc dữ liệu và thuật toán của các mô hình từ và mô hình kí tự. Trình bày về mô hình bậc cao. Chương V: Trình bày về kết quả thực nghiệm đã tiến hành với các mô hình đã triển khai. So sánh kết quả với phương pháp nén khác. Chương VI: Kết luận, nêu những vấn đề đạt được, vấn đề chưa giải quyết và hướng phát triển của chương trình. * Phần chương trình gồm càc file nguồn và tệp thực thi. Các file nguồn viết trong môi trường C ở hệ điều hành Linux. Dựa theo thuật toán trình bày trong tài liệu [3]. MỤC LỤC TÓM TẮT LUẬN VĂN TỐT NGHIỆP 1 MỤC LỤC 2 CHƯƠNG I MỞ ĐẦU 4 CHƯƠNG II TỔNG QUAN VỀ NÉN SỐ LIỆU 6 I. GIỚI THIỆU 6 II. PHÂN LOẠI CÁC KỸ THUẬT NÉN 6 II.1. Nén tổn hao 6 II.2. Nén không tổn hao 7 III. CÁC KHÁI NIỆM LIÊN QUAN ĐẾN NÉN SỐ LIỆU 7 III.1. Sự phân bố của kí tự 7 III.2. Sự lặp lại của những kí tự 7 III.3. Độ dư thừa vị trí 7 III.4. Đơn vị đo thông tin (Entropy) và độ dài trung bình của từ mã 8 III.5. Tỷ số nén 8 IV. cấu trúc của QUÁ TRÌNH NÉN SỐ LIỆU 8 IV.1. Mô hình hoá 9 IV.1.1. Khái niệm 9 IV.1.2. Bậc của mô hình 9 IV.1.3. Phân loại mô hình 10 IV.2. Mã hoá 10 IV.2.1. Khái niệm 10 IV.2.2 Các phương pháp mã hóa . 11 CHƯƠNG III KỸ THUẬT MÃ HÓA SỐ HỌC 15 I. các vấn đề cần giải quyết 15 I.1. Mã hóa các kí tự lạ 15 I.2. Quản lý bộ nhớ 16 II. Mã hóa 16 II.1. Nguyên tắc mã hóa 16 II.2. Cấu trúc dữ liệu và giải thuật 16 III. Giải mã 19 III.1. Nguyên tắc giải mã 19 III.2. Cấu trúc dữ liệu và giải thuật 19 IV. Những hạn chế. 21 CHƯƠNG I V Xây dựng mô hình 23 I. Mô hình bậc không 23 I.1. Tính tần số xuất hiện của các kí hiệu 23 I.2. Mô hình từ 25 I.3. Mô hình kí tự . 28 II. Mô hình bậc cao 29 CHƯƠNG V chương trình và thực nghiệm 31 I. cài đặt Chương trình 31 I.1. Mã hóa 31 I.2. Mô hình 31 I.3. Các mô đun 31 I.4. Cách dùng 32 II. thực nghiệm 33 II.1. Giới thiệu 33 II.2. Kết quả thực nghiệm 33 II.2.1. Kết quả thực nghiệm với các file kiểu *.txt 33 II.2.2. Kết quả thực nghiệm với các file kiểu *.cpp 35 III. Kết luận về kết quả thực nghiệm 36 CHƯƠNG VI KẾT LUẬN 37 PHỤ LỤC Chương trình nguồn 39 I. Các file nguồn (*.c) 39 I.1. File arith.c 39 I.2. File bitio.c 44 I.3. File char.c 45 I.4. File hashtable.c 46 I.5. File main.c 50 I.6. File stats.c 57 I.7. File word.c 64 ii. các file tiêu đề (*.h) 69 II.1. File arith.h 69 II.2. File bitio.h 69 II.3. File hashtable .h 71 II.4. File main.h 72 II.5. File stats.h 73 II.6. unroll.i 73 CHƯƠNG I MỞ ĐẦU Ngày nay cùng với sự phát triển không ngừng của nền khoa học kỹ thuật thế giới là sự phát triển vượt bậc của ngành Công nghệ Thông tin nói chung và ngành Tin học nói riêng. Các hệ thống Tin học giúp ích rất nhiều trong công việc hàng ngày của các cơ quan, đơn vị tập thể hay mỗi cá nhân có liên quan. Các hệ thống Tin học còn giúp ta lưu trữ các thông tin cần thiết có liên quan đến công việc và đời sống hàng ngày. Mà ngày càng khối lượng thông tin cần lưu trữ tăng lên gấp bội chúng ta không còn không gian để lưu giữ chúng nữa. Một yêu cầu đặt ra là ta phải làm nhỏ lại không gian dành cho các thông tin đó. Với xu thế toàn cầu hóa, thì mạng Internet đã ra đời. Đây là một mạng của các mạng con trên thế giới. Qua mạng Internet ta có thể ngồi ở nhà liên lạc với các công ty, các đối tác làm ăn hay với bạn bè ở trên khắp thế giới. Qua mạng này ta có thể gửi đi và nhận về các thông tin cần thiết phục vụ cho cuộc sống hàng ngày. Khi gửi các thông tin đi mà ta để nguyên như vậy gửi đi thì tốn rất nhiều thời gian kèm theo đó là sự tốn kém về tiền bạc. Cùng với sự hoà nhập của Tin học vào các lĩnh vực đời sống thì việc xử lý các loại tập tin với các kiểu file khác nhau là điều tất yếu, và các tập tin này thường có kích thước lớn nên nhiều khi nó gây khó khăn cho công tác lưu trữ và truyền gửi. Vì vậy, khi lưu trữ hay truyền gửi người ta mong muốn giảm đến mức thấp nhất dung lượng bộ nhớ mà các tập tin này chiếm dụng để dễ tổ chức, quản lý và tiết kiệm về kinh phí. Để đáp ứng các yêu cầu nêu trên người ta đã nghĩ ra phương pháp làm cho các thông tin đó nhỏ lại nhằm chiếm dụng bộ nhớ ít hơn. Người ta gọi đó là kỹ thuật nén số liệu. Vậy nén số liệu là gì? Ở đây chúng ta có thể giới thiệu sơ qua về nén số liệu. Nén số liệu là quá trình giảm dung lượng nhớ cần thiết dành cho các tập tin mà vẫn biểu diễn cùng một lượng thông tin như trước. Có nhiều phương pháp nén khác nhau và chúng được thiết kế cho các loại dữ liệu khác nhau như hình ảnh, âm thanh, văn bản.v.v trong nội dung đồ án này chỉ bàn đến phương pháp dùng cho nén văn bản. Nén văn bản bao hàm thay đổi sự biểu diễn của tập tin so với không gian trống lấy được nhỏ hơn ở khối lượng dự trữ hoặc nhỏ hơn thời gian truyền tín hiệu, tuy nhiên nguyên bản chính của tập tin phải được khôi phục chính xác sau khi giải nén. Đã có rất nhiều phương pháp nén được phát minh và sử dụng trong nhiều năm qua. Khi thực hiện giải pháp nén số liệu chúng ta cần phải xem xét đến hai vấn đề trái ngược nhau: các thuật toán nén số liệu thực hiện trước hết phải đảm bảo giảm chi phí lưu trữ mà lại không sử dụng quá nhiều thời gian. Nguyên tắc chung của các phương pháp mã hoá đều dựa trên nhận xét loại bỏ việc lưu lại các thông tin trùng lặp. Các kỹ thuật nén không tổn hao thường được áp dụng cho các tập tin văn bản vì nó chứa các ký tự xuất hiện thường xuyên hơn các ký tự khác, và các thuật toán nén tổn hao thường áp dụng cho các tập tin ảnh vì nó có thể là các vùng đồng nhất, hay cho mô tả số của âm thanh và các ký hiệu tương tự khác. Đặc điểm của nén văn bản là dữ liệu của tệp tin gốc phải luôn luôn được khôi phục lại một cách chính xác sau khi giải nén. Một trong những phương pháp nén văn bản được biết sớm và đã thành công là nén Huffman, lần đầu tiên được phổ biến vào đầu những năm 1950. Phương pháp này tạo ra các từ mã khác nhau cho các kí hiệu đầu vào. Mã hóa Huffman được xem như một trong những phương pháp nén tốt trong nhiều thập kỷ, cho đến khi có bước đột phá về kỹ thuật nén vào cuối những năm bảy mươi là sự xuất hiện của phương pháp nén Mã hóa Số học. Mã hóa Số học ra đời vào thời gian này và được nghiên cứu phát triển bởi nhiều nhà nghiên cứu. Trong một thời gian dài từ những năm bảy mươi đến đầu những năm tám mươi, mã hóa Số học bị coi là phương pháp khó thực hiện bởi vì nó không tạo ra từng từ mã riêng lẻ cho từng ký hiệu mà chỉ tạo ra một từ mã duy nhất cho toàn bộ nguồn số liệu. Vì vậy nó không đòi hỏi số nguyên các bít để mã hóa ký hiệu. Ví dụ, ký tự s có xác suất xuất hiện là Pr[s] thì lượng thông tin chứa đựng trong nó là -logPr[s] bit (biểu thức này được định nghĩa là entropy của ký tự), giả sử xác suất xuất hiện của ký tự này là 99% thì Mã hóa Số học chỉ cần sử dụng 0,015 bit để mã hóa nhưng mã hóa Huffman phải sử dụng ít nhất là 1 bit để mã hóa ký tự này vì vậy người ta nói mã hóa Số học gần với entropy. Từ đây ta có thể nghĩ đến một tỉ số nén cao hơn cho nén văn bản khi áp dụng phương pháp mã hóa Số học. Sự công bố mã nguồn cho việc mã hóa nhiều kí hiệu được viết bởi Witten,Neal và Cleary trong tập Communication of ACM (CACM ). Với các vấn đề nêu trên và dựa vào tài liệu [3], trong đồ án này em đi sâu nghiên cứu về phương pháp mã hóa Số học. Đưa ra các mô hình khác nhau và kết quả thực tế của từng mô hình để người đọc lựa chọn. Chi tiết của các vấn đề sẽ được trình bày trong các chương tiếp theo của đồ án.

77 trang | Chia sẻ: lvcdongnoi | Lượt xem: 5606 | Lượt tải: 1

Bạn đang xem trước 20 trang tài liệu Kỹ thuật nén số liệu mã hóa Số học, để xem tài liệu hoàn chỉnh bạn click vào nút DOWNLOAD ở trên

tãûp åí mæïc bêt âãø náng cao täúc âäü .v.v.. Chæång trçnh cáön âæåüc phaït triãøn âãø phaït huy taïc duûng våïi táút caí caïc loaûi táûp tin (vàn baín, hçnh aính, ám thanh, táûp tin nhë phán, ...), vaì giao diãûn cuîng cáön caíi tiãún laûi âãø dãù daìng sæí duûng. Cáön caíi tiãún chæång trçnh âãø coï thãø chaûy âæåüc trãn caïc hãû âiãöu haình khaïc nhæ Windows NT, Windows 98 .v.v... Hiãûn nay neïn säú liãûu khäng phaíi laì mäüt váún âãö måïi meî âäúi våïi ngæåìi sæí duûng maïy tênh, båíi vç táút caí caïc maïy háöu nhæ âãöu âæåüc caìi âàût mäüt hoàûc vaìi tiãûn êch neïn cho ngæåìi sæí duûng. Tuy nhiãn âãø coï mäüt sæû hiãøu biãút vãö cå chãú laìm viãûc cuía quaï trçnh neïn thç chæa coï mäüt taìi liãûu tiãúng Viãût naìo âãö cáûp âãún. Do viãûc tiãúp cáûn caïc cäng nghãû neïn trong thåìi gian ngàõn, kinh nghiãûm láûp trçnh cuía baín thán coìn yãúu vaì coìn gàûp nhiãöu khoï khàn trong viãûc dëch taìi liãûu næåïc ngoaìi nãn khäng traïnh khoíi mäüt säú sai soït. Thãm vaìo âoï sæû hiãøu biãút cuía caï nhán vãö táút caí caïc khêa caûnh chæa tháût sáu sàõc, nãn âãö taìi chæa hoaìn chènh vaì cáön coï sæû caíi tiãún khaïc âãø náng cao hiãûu suáút. Ráút mong âæåüc sæû goïp yï cuía caïc tháöy cä, caïc anh chë âi træåïc vaì caïc baûn âãø em coï thãø hoaìn thaình âãö taìi âæåüc täút hån. PHUÛ LUÛC Chæång trçnh nguäön Caïc file nguäön (*.c) File arith.c #include #include "arith.h" #include "bitio.h" #define SHIFT_STR "Ma hoa so hoc " char *coder_desc = SHIFT_STR; #define Half ((code_value) 1 << (B_bits-1)) #define Quarter ((code_value) 1 << (B_bits-2)) static code_value in_R; static code_value in_D; static div_value in_r; static code_value out_L; static code_value out_R; static unsigned long out_bits_outstanding; #define ORIG_BIT_PLUS_FOLLOW(b) \ do \ { \ OUTPUT_BIT((b)); \ while (out_bits_outstanding > 0)\ { \ OUTPUT_BIT(!(b)); \ out_bits_outstanding--; \ } \ } while (0) # define BIT_PLUS_FOLLOW(x) ORIG_BIT_PLUS_FOLLOW(x) #define ENCODE_RENORMALISE \ do { \ while (out_R <= Quarter) \ { \ if (out_L >= Half) \ { \ BIT_PLUS_FOLLOW(1); \ out_L -= Half; \ } \ else if (out_L+out_R <= Half) \ { \ BIT_PLUS_FOLLOW(0); \ } \ else \ { \ out_bits_outstanding++; \ out_L -= Quarter; \ } \ out_L <<= 1; \ out_R <<= 1; \ } \ } while (0) #define DECODE_RENORMALISE \ do { \ while (in_R <= Quarter) \ { \ in_R <<= 1; \ ADD_NEXT_INPUT_BIT(in_D,0); \ } \ } while (0) void arithmetic_encode(freq_value low,freq_value high,freq_value total) { code_value temp; M = total << (B_bits - F_bits - 1); if (A >= M) { A -= M; temp += low; temp2 += high; } # define UNROLL_NUM B_bits - F_bits - 1 # define UNROLL_CODE \ A <<= 1; temp <<= 1; temp2 <<= 1; \ if (A >= M) { A -= M; temp += low; temp2 += high; } # include "unroll.i" out_L += temp; if (high < total) out_R = temp2 - temp; else out_R -= temp; ENCODE_RENORMALISE; if (out_bits_outstanding > MAX_BITS_OUTSTANDING) { fprintf(stderr,"Bits_outstanding limit reached - File too large\n"); exit(1); } } freq_value arithmetic_decode_target(freq_value total) { freq_value target; M = total << ( B_bits - F_bits - 1 ); if (A >= M) { A -= M; in_r++; } # define UNROLL_NUM B_bits - F_bits - 1 # define UNROLL_CODE \ A <<= 1; in_r <<= 1; \ if (A >= M) { A -= M; in_r++; } # include "unroll.i" A = in_D; target = 0; if (in_r < (1 << (B_bits - F_bits - 1))) { M = in_r << F_bits; if (A >= M) { A -= M; target++; } A <<= 1; target <<= 1; } else { M = in_r << (F_bits - 1); } if (A >= M) { A -= M; target++; } # define UNROLL_NUM F_bits - 1 # define UNROLL_CODE \ A <<= 1; target <<= 1; \ if (A >= M) { A -= M; target++; } # include "unroll.i" return (target >= total ? total-1 : target); } void arithmetic_decode(freq_value low, freq_value high, freq_value total) { code_value temp; M = in_r << F_bits; if (M & Half) { temp += low; temp2 += high; } # define UNROLL_NUM B_bits - F_bits - 1 # define UNROLL_CODE \ M <<= 1; temp <<= 1; temp2 <<= 1; \ if (M & Half) { temp += low; temp2 += high; } # include "unroll.i" in_D -= temp; if (high < total) in_R = temp2 - temp; else in_R -= temp; DECODE_RENORMALISE; } void binary_arithmetic_encode(freq_value c0, freq_value c1, int bit) { int LPS; freq_value cLPS, rLPS; if (c0 < c1) { LPS = 0; cLPS = c0; } else { LPS = 1; cLPS = c1; } code_value numerator, denominator ; denominator = (c0 + c1) << (B_bits - F_bits - 1); if (numerator >= denominator) { numerator -= denominator; rLPS += cLPS; } # define UNROLL_NUM B_bits - F_bits - 1 # define UNROLL_CODE \ numerator <<= 1; rLPS <<= 1; \ if (numerator >= denominator) \ { \ numerator -= denominator; \ rLPS += cLPS; \ } # include "unroll.i" if (bit == LPS) { out_L += out_R - rLPS; out_R = rLPS; } else { out_R -= rLPS; } ENCODE_RENORMALISE; if (out_bits_outstanding > MAX_BITS_OUTSTANDING) { fprintf(stderr,"Bits_outstanding limit reached - File too large\n"); exit(1); } } int binary_arithmetic_decode(freq_value c0, freq_value c1) { int LPS; int bit; freq_value cLPS, rLPS; if (c0 < c1) { LPS = 0; cLPS = c0; } else { LPS = 1; cLPS = c1; } code_value numerator, denominator ; denominator = (c0 + c1) << (B_bits - F_bits - 1); if (numerator >= denominator) { numerator -= denominator; rLPS += cLPS; } # define UNROLL_NUM B_bits - F_bits - 1 # define UNROLL_CODE \ numerator <<= 1; rLPS <<= 1; \ if (numerator >= denominator) \ { \ numerator -= denominator; \ rLPS += cLPS; \ } # include "unroll.i" if ((in_D) >= (in_R-rLPS)) { bit = LPS; in_D -= (in_R - rLPS); in_R = rLPS; } else { bit = (1-LPS); in_R -= rLPS; } DECODE_RENORMALISE; return(bit); } void start_encode(void) { out_L = 0; out_R = Half; out_bits_outstanding = 0; } void finish_encode(void) { int nbits, i; code_value bits; nbits = B_bits; bits = out_L; for (i = 1; i <= nbits; i++) BIT_PLUS_FOLLOW(((bits >> (nbits-i)) & 1)); } void start_decode(void) { int i; in_D = 0; in_R = Half; if (in_D >= Half) { fprintf(stderr,"Corrupt input file (start_decode())\n"); exit(1); } } void finish_decode(void) { /* No action */ } File bitio.c #include #include "bitio.h" unsigned int _bytes_input = 0; unsigned int _bytes_output = 0; int _in_buffer; unsigned char _in_bit_ptr = 0; int _in_garbage; int _out_buffer; int _out_bits_to_go; #ifndef FAST_BITIO int _bitio_tmp; #endif void startoutputtingbits(void) { _out_buffer = 0; _out_bits_to_go = BYTE_SIZE; } void startinputtingbits(void) { _in_garbage = 0; _in_bit_ptr = 0; } void doneoutputtingbits(void) { if (_out_bits_to_go != BYTE_SIZE) OUTPUT_BYTE(_out_buffer << _out_bits_to_go); _out_bits_to_go = BYTE_SIZE; } void doneinputtingbits(void) { _in_bit_ptr = 0; } int bitio_bytes_in(void) { return _bytes_input; } int bitio_bytes_out(void) { return _bytes_output; } void unget_bit(int bit) { _in_bit_ptr <<= 1; if (_in_bit_ptr == 0) _in_bit_ptr = 1; _in_buffer = _in_buffer & (_in_bit_ptr - 1); if (bit) _in_buffer |= _in_bit_ptr; } File char.c #include #include #include "bitio.h" #include "arith.h" #include "stats.h" #include "main.h" #define END_OF_MESSAGE 256 #define CHAR_CONTEXT 256 void encode_char() { int i, cur_char; context *characters; characters = create_context(CHAR_CONTEXT+1, STATIC); for (i=0; i < CHAR_CONTEXT; i++) install_symbol(characters, i); if(install_symbol(characters,END_OF_MESSAGE)== TOO_MANY_SYMBOLS) { fprintf(stderr,"TOO_MANY_SYMBOLS: " "Couldn't install initial symbols\n"); fprintf(stderr,"(Perhaps F_bits is too small?)\n"); exit(1); } startoutputtingbits(); start_encode(); while ((cur_char = INPUT_BYTE()) != EOF) { encode(characters, cur_char); } encode(characters, END_OF_MESSAGE); finish_encode(); doneoutputtingbits(); return; } void decode_char() { int i, symbol; context *characters; characters = create_context(CHAR_CONTEXT+1, STATIC); for (i=0; i < CHAR_CONTEXT; i++) install_symbol(characters, i); if(install_symbol(characters,END_OF_MESSAGE)== TOO_MANY_SYMBOLS) { fprintf(stderr, "TOO_MANY_SYMBOLS: " "Couldn't install initial symbols\n"); exit(1); } startinputtingbits(); start_decode(); for (;;) { symbol= decode(characters); if (symbol == END_OF_MESSAGE) break; OUTPUT_BYTE(symbol); } finish_decode(); doneinputtingbits(); return; } File hashtable.c #include #include #include #include "arith.h" #include "stats.h" #include "main.h" #include "hashtable.h" static bucket *get_bucket(hash_table *pTable); static unsigned char *add_string(string *pItem, hash_table *pTable); hash_table *create_table(void) { hash_table *pTable; if((pTable =(hash_table*)do_malloc(sizeof(hash_table))) == NULL) return NULL; pTable->nBits = 0; pTable->nBuckets = 0; if((pTable->pIndex=(bucket**)do_malloc(sizeof(bucket*)))== NULL) return NULL; pTable->pFreeBuckets = NULL; pTable->pIndex[0] = get_bucket(pTable); pTable->pIndex[0]->nBits = 0; pTable->pIndex[0]->nWords = 0; pTable->next_number = EXTRA_SYMBOLS; if((pTable->pStrings=(string_block*) do_malloc(sizeof(string_block)))== NULL) return NULL; pTable->pStrings->prev = NULL; if ((pTable->pNumToWords = (unsigned char **) do_malloc(ALLOCATE_BLOCK * sizeof(unsigned char **))) == NULL) return NULL; pTable->numToWordsLimit = ALLOCATE_BLOCK; return pTable; } static bucket *get_bucket(hash_table *pTable) { bucket_block *pBlock, *pNew; pBlock=pTable->pFreeBuckets; if ((pBlock == NULL) || (pBlock->nFreeBuckets == 0)) { if((pNew=(bucket_block*)do_malloc(sizeof(bucket_block)))== NULL) return NULL; pNew->prev = pBlock; pTable->pFreeBuckets = pNew; if ((pNew->pBuckets = (bucket *) do_malloc(sizeof(bucket) * ALLOCATE_BLOCK)) == NULL) return NULL; pNew->nFreeBuckets = ALLOCATE_BLOCK; pBlock = pNew; } pTable->nBuckets++; pBlock->nFreeBuckets--; return (pBlock->pBuckets + pBlock->nFreeBuckets); } int hash(int length, unsigned char *pText) { int hash, i; hash = 0; for (i = 0; i < length; i++) hash = HASH_MULT*hash + pText[i]; hash += length; hash = hash % HASH_MOD; return hash; } int lookup_word(string *pItem, hash_table *pTable) { int i, key; bucket *pBucket; key = hash(pItem->length, pItem->text); key = key & ((1 nBits) -1); pBucket = pTable->pIndex[key]; for (i = 0; inWords; i++) { if (pItem->length == (int) *(pBucket->words[i].pWord)) { if(memcmp(pItem->text,pBucket->words[i].pWord+1,pItem->length)==0) return pBucket->words[i].wordNumber; } } return pTable->next_number; } int add_word(string *pItem, hash_table *pTable) { int i, key, tail, length, nWord, nWordsOld, nWordsNew, word_no; bucket *pBucket, *pNewBucket; if (get_memory(MEM_PER_SYMBOL) == NOMEMLEFT) return NOMEMLEFT; key = hash(pItem->length, pItem->text); key = key & ((1 nBits) -1); pBucket = pTable->pIndex[key]; nWord = pBucket->nWords; if (nWord < MAXBUCKET) { pBucket->words[nWord].wordNumber=(word_no= pTable->next_number); pTable->next_number++; if((pBucket->words[nWord].pWord=add_string(pItem,pTable))==NULL) return NOMEMLEFT; pTable->pNumToWords[word_no-EXTRA_SYMBOLS] = pBucket->words[nWord].pWord; pBucket->nWords++; } else { tail = key & ((1 nBits) -1); pBucket->nBits++; if ((pNewBucket = get_bucket(pTable))==NULL) return NOMEMLEFT; pNewBucket->nBits = pBucket->nBits; nWordsOld = 0; nWordsNew = 0; for (nWord = 0; nWord nWords; nWord++) { key = hash(*(pBucket->words[nWord].pWord), (pBucket->words[nWord].pWord)+1); key = key & (1 nBits - 1)); if (key>0) { pNewBucket->words[nWordsNew] = pBucket->words[nWord]; nWordsNew++; } else { pBucket->words[nWordsOld] = pBucket->words[nWord]; nWordsOld++; } } pNewBucket->nWords = nWordsNew; pBucket->nWords = nWordsOld; if (pBucket->nBits nBits) { tail = tail | (1 nBits-1)); for (i=0; i nBits - pBucket->nBits)); i++) pTable->pIndex[(i nBits) | tail] = pNewBucket; } else { length = 1 nBits; pTable->nBits++; if ((pTable->pIndex = (bucket **) do_realloc(pTable->pIndex,length*2* sizeof(bucket*)))==NULL) return NOMEMLEFT; memcpy(pTable->pIndex + length, pTable->pIndex, length*sizeof(bucket *)); pTable->pIndex[(1nBits-1)) | tail] = pNewBucket; } return (add_word(pItem, pTable)); } if(pTable->next_number-EXTRA_SYMBOLS == pTable->numToWordsLimit) { pTable->numToWordsLimit *= GROWTH_RATE; pTable->pNumToWords = (unsigned char **) do_realloc(pTable->pNumToWords, pTable-numToWordsLimit*sizeof(char*)); if (pTable->pNumToWords == NULL) return NOMEMLEFT; } return word_no; } voidget_word(hash_table*pTable,intsymbol, unsigned char**pText,int*pLength) { *pText = pTable->pNumToWords[symbol-EXTRA_SYMBOLS]; *pLength = (int) *((unsigned char *)*pText); *pText += 1; } static unsigned char *add_string(string *pItem, hash_table *pTable) { unsigned char *pWord; string_block *pBlock, *pNew; pBlock = pTable->pStrings; if (STRINGBLOCK - pBlock->length > pItem->length+1) { pWord = pBlock->strings+pBlock->length; *pWord = pItem->length; memcpy(pWord+1, pItem->text, pItem->length); pBlock->length += pItem->length+1; return (pWord); } else { if((pNew=(string_block*)do_malloc(sizeof(string_block)))== NULL) { return NULL; } pNew->prev = pBlock; pNew->length = 0; pTable->pStrings = pNew; return (add_string(pItem, pTable)); } } void purge_table(hash_table *pTable) { string_block *pThis, *pPrev; bucket_block *pBlock, *prev; free(pTable->pNumToWords); free(pTable->pIndex); pBlock = pTable->pFreeBuckets; while (pBlock != NULL) { prev = pBlock->prev; free(pBlock); pBlock=prev; } pThis = pTable->pStrings; while (pThis != NULL) { pPrev = pThis->prev; free(pThis); pThis = pPrev; } free(pTable); } File main.c #include #include #include #include #ifdef SYSV # include # include # include #endif #include "bitio.h" #include "arith.h" #include "stats.h" #include "main.h" static void print_results(int operation, int method); static void usage(char *str1); static void encode_only(char *str1, int selected, char **argv); static int determine_model(char *argv); static int decode_magic(unsigned char *str1); static void describe_setup(int method); typedef struct { char *name; char *magic; void (*encode)(void); void (*decode)(void); int needs_mem; int purge_mem; void (*results)(int); char *desc; } ModelType; ModelType model[] = {/*name magic encoder decoder,needs_mem,purge results,description */ {"char", "\377" "23c", encode_char, decode_char, 0, 0, NULL, "Character based model" }, { "word", "\377" "23w", encode_word, decode_word,1, 1, print_results_word, "Word based model" }, { NULL, NULL, NULL, NULL, 0, 0, NULL, NULL } }; int verbose = 0; static int mbytes = DEFAULT_MEM static int total_memory = 0; static int peak_memory = 0; static int purge_counter=0; int main(int argc, char *argv[]) { int i; unsigned char tempstore[MAGICNO_LENGTH]; char version_str[VERSION_LEN+1]; int selected = -1; int method; int mem_specified = 0; int default_method; int filename_supplied = 0; char *modelstr = argv[0]; for (i = 1; i < argc; ) { if (argv[i][0] == '-') { switch(argv[i][1]) { case 'e': /* encode */ selected = ENCODE; i++; break; case 'd': /* decode */ selected = DECODE; i++; break; case 'm': encode_only("Kich thuoc bo nho", selected, argv); mem_specified = 1; i++; if (i>=argc) usage(argv[0]); mbytes = atoi(argv[i++]); break; case 'v': verbose = 1; i++; break; case 't': i++; if (i>argc) usage(argv[0]); modelstr=argv[i]; i++; break; default: usage(argv[0]); } } else { if (filename_supplied) { fprintf(stderr,"Only one filename can be specified\n"); exit(1); } if (freopen(argv[i++], "rb", stdin) == (FILE *)NULL) { fprintf(stderr, "%s: cannot read file %s\n", argv[0], argv[--i]); exit(1); } filename_supplied = 1; } } if (mbytes MAX_MBYTES) { fprintf(stderr, "memory limit must be between %d and %d\n", MIN_MBYTES, MAX_MBYTES); exit(1); } if (B_bits MAX_B_BITS) { fprintf(stderr, "number of B_bits must be between %d and %d\n", 3, MAX_B_BITS); exit(1); } if (F_bits MAX_F_BITS) { fprintf(stderr, "number of f bits must be between %d and %d\n", 1, MAX_F_BITS); exit(1); } if (selected == -1) usage(argv[0]); if (selected == ENCODE) /* ENCODE */ { if (B_bits < F_bits + 2) { fprintf(stderr,"Code bits must be at least freq bits + 2.\n"); fprintf(stderr,"(Code bits=%i,freq bits=%i)\n",B_bits, F_bits); exit(1); } method = determine_model(modelstr); if (method == -1) method = determine_model(DEFAULT_MODEL); if (method == -1) method = 0; BITIO_FWRITE(model[method].magic, 1, MAGICNO_LENGTH); BITIO_FWRITE(VERSION, 1, VERSION_LEN); OUTPUT_BYTE(F_bits); OUTPUT_BYTE(B_bits); if (model[method].needs_mem) { OUTPUT_BYTE(mbytes); } if (!model[method].needs_mem && mem_specified) { fprintf(stderr,"This compression method doesn't use the -m option\n"); usage(argv[0]); } if (verbose) describe_setup(method); model[method].encode(); /* call ENCODER */ } else /* DECODE */ { BITIO_FREAD(tempstore, 1, MAGICNO_LENGTH); method = decode_magic(tempstore); BITIO_FREAD(version_str, 1, VERSION_LEN); version_str[VERSION_LEN] = '\0'; if (strcmp(VERSION,version_str) != 0) { fprintf(stderr,"Wrong version!\n"); fprintf(stderr,"This is version %s\n",VERSION); fprintf(stderr,"Compression was with version %s\n",version_str); exit(1); } default_method = determine_model(modelstr); if (default_method == -1) default_method = method; if (method != default_method) fprintf(stderr,"Using method: %s\n",model[method].desc); F_bits = INPUT_BYTE(); B_bits = INPUT_BYTE(); { fprintf(stderr, "Differing value for frequency / code bits:\n" "This program was compiled with fixed F_bits=%i, B_bits=%i\n" "but the data file was compressed with F_bits=%i, B_bits=%i.\n", F_bits, B_bits, f1, b1); exit(1); } } if (B_bits - F_bits < 2) { fprintf(stderr,"Corrupt input file; B_bits - F_bits < 2\n"); exit(1); } if (model[method].needs_mem) { mbytes = INPUT_BYTE(); } if (verbose) describe_setup(method); model[method].decode(); /* Call DECODER */ } if (verbose) { print_results(selected, method); } return 0; } static void encode_only(char *str1, int selected, char **argv) { if (selected!=ENCODE) { fprintf(stderr, "%s is defined by the encoder, and can not be " "specified by the decoder '-e' is required on command " "line before any encode options)\n", str1); usage(argv[0]); } } static int determine_model(char *arg0) { char *progname; int i; progname = arg0; if (strrchr(progname, '/') != NULL) progname = strrchr(progname, '/') + 1; i = 0; while(model[i].name!=NULL && strcmp(progname,model[i].name)!= 0) i++; if (model[i].name == NULL) { for (i=0; model[i].name != NULL; i++) if (strstr(progname, model[i].name) != NULL) break; } if (model[i].name == NULL) return -1; return i; } static int decode_magic(unsigned char *str1) { int i = 0; while (model[i].name != NULL && memcmp(str1, model[i].magic, MAGICNO_LENGTH) != 0) i++; if (model[i].name == NULL) { fprintf(stderr, "Bad Magic Number\n"); exit(1); } return i; } static void usage(char *str1) { int default_method; char model_list[1024]; char freq_string[127]; char code_string[127]; int i; model_list[0]='\0'; strcpy(model_list, model[0].name); for (i = 1; model[i].name!=NULL; i++) { strcat(model_list, " "); strcat(model_list, model[i].name); } sprintf(freq_string,"%i ", F_bits); sprintf(code_string,"%i ", B_bits); default_method = determine_model(str1); if(default_method == -1) default_method = determine_model(DEFAULT_MODEL); if (default_method == -1) default_method = 0; fprintf(stderr, "\nCach dung: " "%s [-e [-t s] [-m n] | -d] [-v] [filename1][filename2] \n\n", str1); fprintf(stderr, "-e: Ma hoa\n" " -t s Mo hinh(%s, default = %s)\n" " -m n Gioi han bo nho(MB)khi chon mo hinh tu(%i..%i, " "default= %i)\n" "-d: Giai ma\n" "-v: Dua ra cac thong bao ve ty so nen,thoi gian thuc hien, mo" "hinh da su dung,v.v...\n\n", model_list, model[default_method].name, MIN_MBYTES, MAX_MBYTES, DEFAULT_MEM ); if (verbose) { fprintf(stderr,"Version: %s\n\n",VERSION); describe_setup(-1); } exit(1); } static void describe_setup(int method) { if (method>=0) fprintf(stderr,"Mo hinh : %s\n",model[method].desc); fprintf(stderr,"Ma hoa: %s\n",coder_desc); if (method >= 0) { fprintf(stderr,"So bit cua tu ma : %10i\n", B_bits); fprintf(stderr,"So bit cua tan so : %10i\n", F_bits); } if (method >= 0 && model[method].needs_mem) fprintf(stderr,"Gioi han bo nho : %10i Mb\n",mbytes); } static void print_results(int operation, int method) { int bytes_compressed, bytes_uncompressed; if (operation == ENCODE) { bytes_compressed = bitio_bytes_out(); bytes_uncompressed = bitio_bytes_in(); } else { bytes_compressed = bitio_bytes_in(); bytes_uncompressed = bitio_bytes_out(); } fprintf(stderr,"Kich thuoc chua nen:%10u bytes\n", bytes_uncompressed); fprintf(stderr,"Kich thuoc nen:%10u bytes\n", bytes_compressed); if (bytes_uncompressed > 0) fprintf(stderr, "Ty so nen : %10.2f%) \n", (float)bytes_compressed/bytes_uncompressed*100); #ifdef SYSV { struct tms cpu_usage; float cpu_used; times(&cpu_usage); cpu_used = ((float) cpu_usage.tms_utime) / sysconf(_SC_CLK_TCK); fprintf(stderr, "Thoi gian nen : %10.2f giay\n" cpu_used,); } #endif if (model[method].needs_mem) { if (model[method].purge_mem) { fprintf(stderr, "Memory purges : %10d time%s\n", purge_counter, (purge_counter == 1 ? "" : "s") ); } if (peak_memory > total_memory) total_memory = peak_memory; fprintf(stderr, "Peak memory used : %10.1f Kbytes\n", total_memory*1.0/1024); } if (model[method].results != NULL) model[method].results(operation); } void *do_realloc(void *ptr, size_t size) { if (((total_memory+size) / MEGABYTE) >= mbytes) return NULL; total_memory += size; return (realloc(ptr, size)); } void *do_malloc(size_t size) { if (((total_memory+size) / MEGABYTE) >= mbytes) return NULL; total_memory += size; return (malloc(size)); } int get_memory(size_t size) { if (((total_memory+size) / MEGABYTE) >= mbytes) return NOMEMLEFT; total_memory += size; return total_memory; } void purge_memory(void) { if (total_memory > peak_memory) peak_memory = total_memory; total_memory = 0; purge_counter++; } File stats.c #include #include #include "arith.h" #include "stats.h" # define Max_frequency ((freq_value) 1 << F_bits) #define MOST_PROB_AT_END #ifdef MOST_PROB_AT_END char*stats_desc ="Cumulative stats with Fenwick tree(MPS at front)"; #else char *stats_desc = "Cumulative stats with Fenwick tree"; #endif static void get_interval(context *pContext, freq_value *pLow, freq_value *pHigh, int symbol); static void halve_context(context *pContext); #define INCR_SYMBOL_PROB_ACTUAL(pContext, symbol, inc1) \ freq_value _inc = (inc1); \ freq_value *_tree = pContext->tree; \ int i=symbol; \ do { \ _tree[i] += _inc; \ i = FORW(i); \ } while (imax_length); \ pContext->total += _inc; #define INCR_SYMBOL_PROB_MPS(pContext, symbol) \ { \ if (symbol == pContext->most_freq_symbol) \ pContext->most_freq_count += _inc; \ else if (_high-_low+_inc > pContext->most_freq_count)\ { pContext->most_freq_symbol = symbol; \ pContext->most_freq_count = _high-_low+_inc; \ pContext->most_freq_pos = _low; \ } \ else if (symbol most_freq_symbol) \ pContext->most_freq_pos += _inc; \ } #ifdef MOST_PROB_AT_END # define INCR_SYMBOL_PROB(pContext, symbol, low1, high1, inc1) \ do { \ freq_value _low = low1; \ freq_value _high = high1; \ INCR_SYMBOL_PROB_ACTUAL(pContext, symbol, inc1) \ INCR_SYMBOL_PROB_MPS (pContext, symbol) \ } while (0) #else # define INCR_SYMBOL_PROB(pContext, symbol, low1, high1, inc1) \ do { INCR_SYMBOL_PROB_ACTUAL(pContext, symbol, inc1)} while (0)\ #endif #define adjust_zero_freq(pContext) \ do { freq_value diff; \ diff = ZERO_FREQ_PROB(pContext) - pContext->tree[1]; \ if (diff != 0) \ INCR_SYMBOL_PROB(pContext, 1, 0, pContext->tree[1], diff); \ } while (0) #define ZERO_FREQ_PROB(ctx) ((freq_value)ctx->nSingletons) void init_zero_freq(context *pContext) { if (pContext->type == DYNAMIC) pContext->nSingletons += pContext->incr; else pContext->nSingletons = 0; } context *create_context(int length, int type) { context *pContext; int i; int size = 1; Max_frequency = ((freq_value) 1 << F_bits); length+=2; while (size < length) size = size << 1; if(((pContext =(context *) malloc(sizeof(context))) == NULL) || ((pContext-tree =(freq_value*) malloc((size+1)*sizeof(freq_value))) == NULL)) { fprintf(stderr, "stats: not enough memory to create context\n"); exit(1); } pContext->initial_size = size; pContext->length = 1; pContext->total = 0; pContext->nSymbols = 1; pContext->type = type; pContext->max_length = size; pContext->most_freq_symbol = -1; pContext->most_freq_count = 0; pContext->most_freq_pos = 0; for (i = 0; i max_length; i++) pContext->tree[i] = 0; pContext->incr = (freq_value) 1 << F_bits; pContext->nSingletons = 0; init_zero_freq(pContext); adjust_zero_freq(pContext); return pContext; } int install_symbol(context *pContext, int symbol) { int i; freq_value low, high; symbol+=2; while (symbol >= pContext->max_length) { pContext->tree = (freq_value *) realloc(pContext->tree, pContext->max_length * 2 * sizeof(freq_value)); if (pContext->tree == NULL) { fprintf(stderr, "stats: not enough memory to expand context\n"); return NO_MEMORY; } for (i=pContext->max_length; imax_length; i++) pContext->tree[i] = 0; pContext->tree[pContext->max_length] = pContext->total; pContext->max_length <<= 1; } if (((pContext->nSymbols + 1) = Max_frequency) return TOO_MANY_SYMBOLS; if (symbol > pContext->length) pContext->length = symbol; pContext->nSymbols++; get_interval(pContext, &low, &high, symbol); INCR_SYMBOL_PROB(pContext, symbol, low, high, pContext->incr); if (pContext->type == DYNAMIC) pContext->nSingletons += pContext->incr; adjust_zero_freq(pContext); while (pContext->total > Max_frequency) halve_context(pContext); return 0; } int encode(context *pContext, int symbol) { freq_value low, high, low_w, high_w; symbol+=2; if ((symbol > 0) && (symbol max_length)) { if (pContext->most_freq_symbol == symbol) { low = pContext->most_freq_pos; high = low + pContext->most_freq_count; } else get_interval(pContext, &low, &high, symbol); } else low = high = 0; if (low == high) { if (ZERO_FREQ_PROB(pContext) == 0) { fprintf(stderr, "stats:cannot code zero-probability novel symbol"); abort(); exit(1); } symbol = 1; if (pContext->most_freq_symbol == 1) { low = pContext->most_freq_pos; high = low + pContext->most_freq_count; } else get_interval(pContext, &low, &high, symbol); } #ifdef MOST_PROB_AT_END if (symbol > pContext->most_freq_symbol) { low_w = low - pContext->most_freq_count; high_w = high - pContext->most_freq_count; } else if (symbol == pContext->most_freq_symbol) { low_w = pContext->total - pContext->most_freq_count; high_w = pContext->total; } else { low_w = low; high_w = high; } #else low_w = low; high_w = high; #endif arithmetic_encode(low_w, high_w, pContext->total); if (symbol != 1) { if (pContext->type == DYNAMIC && high-low == pContext->incr) pContext->nSingletons -= pContext->incr; INCR_SYMBOL_PROB(pContext, symbol, low, high, pContext->incr); } adjust_zero_freq(pContext); while (pContext->total > Max_frequency) halve_context(pContext); if (symbol == 1) return NOT_KNOWN; return 0; } int decode(context *pContext) { int symbol; freq_value low, high, mid, target; freq_value total = pContext->total; target = arithmetic_decode_target(total); #ifdef MOST_PROB_AT_END if (target >= total - pContext->most_freq_count) {arithmetic_decode(total - pContext->most_freq_count, total, total); symbol = pContext->most_freq_symbol; low = pContext->most_freq_pos; high = low + pContext->most_freq_count; } else { if (target >= pContext->most_freq_pos) target += pContext->most_freq_count; #endif symbol = 0; low = 0; mid = pContext->max_length >> 1; { freq_value *tree_pointer = &(pContext->tree[symbol]); while (mid > 0) { if (tree_pointer[mid] + low <= target) { low += tree_pointer[mid]; tree_pointer += mid; } mid >>= 1; } symbol = tree_pointer - pContext->tree; } symbol++; if (symbol & 1) high = low + pContext->tree[symbol]; else { int parent, sym_1; freq_value *tree = pContext->tree; parent = BACK(symbol); high = low; sym_1 = symbol - 1; do { high -= tree[sym_1]; sym_1 = BACK(sym_1); } while (sym_1 != g3parent); high += tree[symbol]; } #ifdef MOST_PROB_AT_END if (low >= pContext->most_freq_pos) arithmetic_decode(low - pContext->most_freq_count, high - pContext->most_freq_count, total); else #endif arithmetic_decode(low, high, total); #ifdef MOST_PROB_AT_END } #endif if (symbol != 1) { if (pContext->type == DYNAMIC && high-low == pContext->incr) pContext->nSingletons -= pContext->incr; INCR_SYMBOL_PROB(pContext, symbol, low, high, pContext->incr); } adjust_zero_freq(pContext); while (pContext->total > Max_frequency) halve_context(pContext); if (symbol == 1) return NOT_KNOWN; return symbol-2; } static void get_interval(context *pContext, freq_value *pLow, freq_value *pHigh, int symbol) { freq_value low, high, shared, parent; freq_value *tree = pContext->tree; high = tree[symbol]; parent = BACK(symbol); symbol--; low = 0; while (symbol != parent) { low += tree[symbol]; symbol = BACK(symbol); } shared = 0; while (symbol > 0) { shared += tree[symbol]; symbol = BACK(symbol); } *pLow = shared+low; *pHigh = shared+high; } static void halve_context(context *pContext) { int i, zero_count, temp; freq_value old_values[MAX_F_BITS], new_values[MAX_F_BITS]; freq_value sum_old, sum_new; freq_value *tree = pContext->tree; freq_value incr; pContext->incr = (pContext->incr + MIN_INCR) >> 1; if (pContext->incr incr = MIN_INCR; pContext->nSingletons = incr = pContext->incr; for (i = 1; i max_length; i++) { zero_count = 0; for (temp = i; !(temp&1); temp >>= 1) zero_count++; old_values[zero_count] = tree[i]; sum_old = 0; sum_new = 0; for (temp = zero_count-1; temp >=0; temp--) { sum_old += old_values[temp]; sum_new += new_values[temp]; } tree[i] -= sum_old; pContext->total -= (tree[i]>>1); tree[i] -= (tree[i]>>1); if (tree[i] == incr && i != 1) pContext->nSingletons += incr; tree[i] += sum_new; new_values[zero_count] = tree[i]; } if (pContext->type == STATIC) pContext->nSingletons = 0; if (pContext->most_freq_symbol!=-1) { freq_value low, high; get_interval(pContext,&low,&high,pContext-most_freq_symbol); pContext->most_freq_count = high-low; pContext->most_freq_pos = low; } adjust_zero_freq(pContext); } void purge_context(context *pContext) { int i; free(pContext->tree); if ((pContext->tree = (freq_value *) malloc((pContext->initial_size + 1)*sizeof(freq_value)))== NULL) { fprintf(stderr, "stats: not enough memory to create context\n"); exit(1); } pContext->length = 1; pContext->total = 0; pContext->nSymbols = 1; pContext->most_freq_symbol = -1; pContext->most_freq_count = 0; pContext->most_freq_pos = 0; pContext->max_length = pContext->initial_size; for (i = 0; i initial_size; i++) pContext->tree[i] = 0; pContext->incr = (freq_value) 1 << F_bits; pContext->nSingletons = 0; init_zero_freq(pContext); adjust_zero_freq(pContext); } binary_context *create_binary_context(void) { binary_context *pContext; Max_frequency = ((freq_value) 1 << F_bits); pContext = (binary_context *) malloc(sizeof(binary_context)); if (pContext == NULL) { fprintf(stderr, "stats: not enough memory to create context\n"); exit(1); } pContext->incr = (freq_value) 1 << (F_bits - 1); pContext->c0 = pContext->incr; pContext->c1 = pContext->incr; return pContext; } int binary_encode(binary_context *pContext, int bit) { binary_arithmetic_encode(pContext->c0, pContext->c1, bit); if (bit == 0) pContext->c0 += pContext->incr; else pContext->c1 += pContext->incr; if (pContext->c0 + pContext->c1 > Max_frequency) { pContext->c0 = (pContext->c0 + 1) >> 1; pContext->c1 = (pContext->c1 + 1) >> 1; pContext->incr = (pContext->incr + MIN_INCR) >> 1; } return 0; } int binary_decode(binary_context *pContext) { int bit; bit = binary_arithmetic_decode(pContext->c0, pContext->c1); if (bit == 0) pContext->c0 += pContext->incr; else pContext->c1 += pContext->incr; if (pContext->c0 + pContext->c1 > Max_frequency) { pContext->c0 = (pContext->c0 + 1) >> 1; pContext->c1 = (pContext->c1 + 1) >> 1; pContext->incr = (pContext->incr + MIN_INCR) >> 1; } return bit; } File word.c #include #include #include #include "bitio.h" #include "arith.h" #include "stats.h" #include "main.h" #include "hashtable.h" #define WORD 0 #define NON_WORD 1 #define INIT_CONTEXT 1023 #define CHAR_CONTEXT 256 #define BUFFER_SIZE 512 #define END_OF_MESSAGE 0 #define ISWORD(c)(((c >='A')&&(c = 'a')&&(c <= 'z'))|| ((c >= '0') && (c <= '9'))) static void install_symbol_safe(context *pContext, int symbol); static void init_word_model(hash_table *tables[], context *words[]); static void purge_word_model(hash_table *tables[],context *words[]); static void init_char_model(context*characters[],context*lengths[]); static void read_word(charbuffer[],int *buffer_length,int *curr_pos, string *pWord, int type); static int base_memory; static unsigned int nWords[2]; static unsigned int nDistinctWords[2]; void print_results_word(int operation) { fprintf(stderr, "\n Loai ki hieu words non-words\n"); fprintf(stderr,"So tu dau vao:%10u %10u\n",nWords[0],nWords[1]); fprintf(stderr, "So cac tu khac nhau : %10u %10u\n", nDistinctWords[0], nDistinctWords[1]); } static void install_symbol_safe(context *pContext, int symbol) { if (install_symbol(pContext, symbol) == TOO_MANY_SYMBOLS) { fprintf(stderr,"TOO_MANY_SYMBOLS error installing” “initial symbols\n"); fprintf(stderr,"(Perhaps F_bits is too small?)\n"); exit(1); } } static void init_word_model(hash_table *tables[], context *words[]) { tables[WORD] = create_table(); tables[NON_WORD] = create_table(); words[WORD] = create_context(INIT_CONTEXT, DYNAMIC); words[NON_WORD] = create_context(INIT_CONTEXT, DYNAMIC); if (tables[WORD]==NULL || tables[NON_WORD]==NULL) {fprintf(stderr,"init_word_model(): Un able to create” “wordt ables!\n"); exit(1); } install_symbol_safe(words[WORD], END_OF_MESSAGE); install_symbol_safe(words[NON_WORD], END_OF_MESSAGE); get_memory(2 * MEM_PER_SYMBOL); } static void purge_word_model(hash_table *tables[], context *words[]) { purge_context(words[WORD]); purge_context(words[NON_WORD]); purge_table(tables[WORD]); purge_table(tables[NON_WORD]); purge_memory(); get_memory(base_memory); tables[WORD] = create_table(); tables[NON_WORD] = create_table(); if (tables[WORD]==NULL || tables[NON_WORD]==NULL) { fprintf(stderr,"purge_word_model(): Unable to recreate” “word tables!\n"); exit(1); } install_symbol_safe(words[WORD], END_OF_MESSAGE); install_symbol_safe(words[NON_WORD], END_OF_MESSAGE); } static void init_char_model(context *characters[], context *lengths[]) { int i; characters[WORD] = create_context(CHAR_CONTEXT, STATIC); characters[NON_WORD] = create_context(CHAR_CONTEXT, STATIC); lengths[WORD] = create_context(MAX_WORD_LEN+1, STATIC); lengths[NON_WORD] = create_context(MAX_WORD_LEN+1, STATIC); for (i = 0; i < CHAR_CONTEXT; i++) { if (ISWORD(i)) install_symbol_safe(characters[WORD], i); else install_symbol_safe(characters[NON_WORD], i); } for (i = 0; i <= MAX_WORD_LEN; i++) { install_symbol_safe(lengths[WORD], i); install_symbol_safe(lengths[NON_WORD], i); } get_memory(2 * MAX_WORD_LEN * MEM_PER_SYMBOL); get_memory(2 * CHAR_CONTEXT * MEM_PER_SYMBOL); } void encode_word(void) { char buffer[BUFFER_SIZE]; int buffer_len, buffer_pos = 0, word_no, i, type; string curr_word; context *words[2], *characters[2], *lengths[2]; hash_table *tables[2]; init_char_model(characters, lengths); init_word_model(tables, words); base_memory = get_memory(0); buffer_len = 0; startoutputtingbits(); start_encode(); type = WORD; for (;;) { read_word(buffer, &buffer_len, &buffer_pos, &curr_word, type); if ((buffer_len == 0) && (curr_word.length == 0)) break; nWords[type]++; word_no = lookup_word(&curr_word, tables[type]); if (encode(words[type], word_no) == NOT_KNOWN) { encode(lengths[type], curr_word.length); for (i = 0; i<curr_word.length; i++) encode(characters[type], curr_word.text[i]); if((word_no = add_word(&curr_word,tables[type]))== NOMEMLEFT || install_symbol(words[type], word_no) != 0)) { if (verbose) fprintf(stderr, "Reached %s limit adding new word...purging\n", word_no == NOMEMLEFT ? "memory" : "symbol"); purge_word_model(tables, words); } nDistinctWords[type]++; } type = !type; } encode(words[type], END_OF_MESSAGE); finish_encode(); doneoutputtingbits(); } void decode_word(void) { int i, symbol, type, length; hash_table *tables[2]; context *words[2], *characters[2], *lengths[2]; string word; unsigned char *pWord; init_char_model(characters, lengths); init_word_model(tables, words); base_memory = get_memory(0); startinputtingbits(); start_decode(); type = WORD; for (;;) { symbol = decode(words[type]); if (symbol == END_OF_MESSAGE) break; nWords[type]++; if (symbol == NOT_KNOWN) { word.length = decode(lengths[type]); for (i = 0; i<word.length; i++) word.text[i] = decode(characters[type]); pWord = word.text; length = word.length; nDistinctWords[type]++; if (((symbol = add_word(&word, tables[type])) == NOMEMLEFT) || (install_symbol(words[type], symbol) != 0)) { if (verbose) fprintf(stderr,"Reached %s limit adding new word...purging\n", symbol == NOMEMLEFT ? "memory" : "symbol"); purge_word_model(tables, words); } } else get_word(tables[type], symbol, &pWord, &length); BITIO_FWRITE(pWord, length, 1); type = !type; } finish_decode(); doneinputtingbits(); } static void read_word(char buffer[], int *buffer_length, int *curr_pos, string *pWord, int type) { pWord->length = 0; while (pWord->length < MAX_WORD_LEN) { if (*buffer_length == 0) { if ((*buffer_length = BITIO_FREAD(buffer, 1, BUFFER_SIZE)) == 0) return; *curr_pos = 0; } if ((!ISWORD(buffer[*curr_pos])) ^ type) return; else { pWord->text[pWord->length] = buffer[*curr_pos]; pWord->length += 1; *curr_pos += 1; *buffer_length -= 1; } } } caïc file tiãu âãö (*.h) File arith.h #ifndef CODER_H #define CODER_H #ifndef B_BITS #define B_BITS 32 #endif #ifndef F_BITS #define F_BITS 27 #endif typedef unsigned long code_value; typedef unsigned long freq_value; typedef unsigned long div_value; #define MAX_BITS_OUTSTANDING ((unsigned long)1<<31) #define MAX_B_BITS (int)( sizeof(code_value) * 8) #define MAX_F_BITS (int)((sizeof(freq_value)*8)-1 < MAX_B_BITS - 2\ ? (sizeof(freq_value)*8)-1 : MAX_B_BITS - 2) # define B_bits B_BITS # define F_bits F_BITS extern char *coder_desc; void arithmetic_encode(freq_value l, freq_value h, freq_value t); freq_value arithmetic_decode_target(freq_value t); void arithmetic_decode(freq_value l, freq_value h, freq_value t); void binary_arithmetic_encode(freq_value c0,freq_value c1, int bit); int binary_arithmetic_decode(freq_value c0, freq_value c1); void start_encode(void); void finish_encode(void); void start_decode(void); void finish_decode(void); #endif /* ifndef arith.h */ File bitio.h #ifndef BITIO_H #define BITIO_H #define BYTE_SIZE 8 extern unsigned int _bytes_input, _bytes_output; extern int _in_buffer; extern unsigned char _in_bit_ptr; extern int _in_garbage; extern int _out_buffer; extern int _out_bits_to_go; #ifndef FAST_BITIO extern int _bitio_tmp; #endif #define OUTPUT_BIT(b) \ do { \ _out_buffer <<= 1; \ if (b) \ _out_buffer |= 1; \ _out_bits_to_go--; \ if (_out_bits_to_go == 0) \ { \ OUTPUT_BYTE(_out_buffer); \ _out_bits_to_go = BYTE_SIZE; \ _out_buffer = 0; \ } \ } while (0) #define ADD_NEXT_INPUT_BIT(v, garbage_bits) \ do { \ if (_in_bit_ptr == 0) \ { \ _in_buffer = getchar(); \ if (_in_buffer==EOF) \ { \ _in_garbage++; \ if ((_in_garbage-1)*8 >= garbage_bits) \ { \ fprintf(stderr,"Bad input file - attempted " \ "read past end of file.\n"); \ exit(1); \ } \ } \ else \ { _bytes_input++; } \ _in_bit_ptr = (1<<(BYTE_SIZE-1)); \ } \ v = (v << 1); \ if (_in_buffer & _in_bit_ptr) v++; \ _in_bit_ptr >>= 1; \ } while (0) #ifdef FAST_BITIO # define OUTPUT_BYTE(x) putchar(x) # define INPUT_BYTE() getchar() # define BITIO_FREAD(ptr,size,nitems)fread(ptr, size, nitems, stdin) #define BITIO_FWRITE(ptr,size,nitems)fwrite(ptr,size,nitems, stdout) #else # define OUTPUT_BYTE(x) ( _bytes_output++, putchar(x) ) # define INPUT_BYTE() ( _bitio_tmp = getchar(), \ _bytes_input += (_bitio_tmp == EOF ) ? 0 : 1, _bitio_tmp )\ # define BITIO_FREAD(ptr, size, nitems) \ ( _bitio_tmp = fread(ptr, size, nitems, stdin), \ _bytes_input += _bitio_tmp * size, _bitio_tmp ) \ # define BITIO_FWRITE(ptr, size, nitems) \ ( _bitio_tmp = fwrite(ptr, size, nitems, stdout), \ _bytes_output += _bitio_tmp * size, _bitio_tmp ) \ #endif void startoutputtingbits(void); void startinputtingbits(void); void doneoutputtingbits(void); void doneinputtingbits(void); int bitio_bytes_in(void); int bitio_bytes_out(void); void unget_bit(int bit); #endif /* ifndef bitio_h */ File hashtable .h #ifndef HASHTABLE_H #define HASHTABLE_H #define MAXBUCKET 10 #define MAX_WORD_LEN 32 #define STRINGBLOCK 1024 #define GROWTH_RATE 2 #define EXTRA_SYMBOLS 1 #define ALLOCATE_BLOCK 100 #define HASH_MULT 613 #define HASH_MOD 1111151 typedef struct { unsigned char length; unsigned char text[MAX_WORD_LEN]; /* } string; typedef struct string_block string_block; struct string_block { int length; unsigned char strings[STRINGBLOCK]; string_block *prev; }; typedef struct { unsigned char *pWord; int wordNumber; } word_rec; typedef struct { int nBits; int nWords; word_rec words[MAXBUCKET]; } bucket; typedef struct bucket_block bucket_block; struct bucket_block { int nFreeBuckets; bucket *pBuckets; bucket_block *prev; }; typedef struct { int nBuckets; int nBits; int next_number; bucket **pIndex; string_block *pStrings; unsigned char **pNumToWords; int numToWordsLimit; bucket_block *pFreeBuckets; } hash_table; hash_table *create_table(void); int lookup_word(string *pItem, hash_table *pTable); void get_word(hash_table *pTable, int symbol, unsigned char **pText, int *pLength); int add_word(string *pItem, hash_table *pTable); void purge_table(hash_table *pTable); #endif File main.h #ifndef MAIN_H #define VERSION " 2.0" #define VERSION_LEN 4 #define ENCODE 0 #define DECODE 1 #define DEFAULT_MODEL "char" #define NOMEMLEFT (-1) #define MEGABYTE (1 << 20) #define DEFAULT_MEM 1 #define MIN_MBYTES 1 #define MAX_MBYTES 255 #define MAGICNO_LENGTH 4 extern int verbose; void purge_memory(void); void *do_malloc(size_t size); void *do_realloc(void *ptr, size_t size); int get_memory(size_t size); void encode_word(void); void decode_word(void); void print_results_word(int); void encode_char(void); void decode_char(void); #endif /* #ifndef MAIN_H */ File stats.h #ifndef STATS_H #define STATS_H #define BACK(i) ((i) & ((i) - 1)) #define FORW(i) ((i) + ((i) & - (i))) #define NOT_KNOWN (-1) #define TOO_MANY_SYMBOLS (-1) #define NO_MEMORY (-2) #define STATIC 0 #define DYNAMIC 1 #define MEM_PER_SYMBOL (4 * sizeof(freq_value)) #define MIN_INCR 1 typedef struct { int initial_size; int max_length, length; freq_value nSingletons; int type; int nSymbols; freq_value total; freq_value *tree; freq_value incr; int most_freq_symbol; freq_value most_freq_count; freq_value most_freq_pos; } context; typedef struct { freq_value c0; freq_value c1; freq_value incr; } binary_context; extern char *stats_desc; context *create_context(int length, int type); int install_symbol(context *pTree, int symbol); int encode(context *pContext, int symbol); int decode(context *pContext); void purge_context(context *pContext); binary_context *create_binary_context(void); int binary_encode(binary_context *pContext, int bit); int binary_decode(binary_context *pContext); #endif unroll.i #if UNROLL_NUM > 0 UNROLL_CODE #endif #if UNROLL_NUM > 1 UNROLL_CODE #endif #if UNROLL_NUM > 2 UNROLL_CODE #endif #if UNROLL_NUM > 3 UNROLL_CODE #endif #if UNROLL_NUM > 4 UNROLL_CODE #endif #if UNROLL_NUM > 5 UNROLL_CODE #endif #if UNROLL_NUM > 6 UNROLL_CODE #endif #if UNROLL_NUM > 7 UNROLL_CODE #endif #if UNROLL_NUM > 8 UNROLL_CODE #endif #if UNROLL_NUM > 9 UNROLL_CODE #endif #if UNROLL_NUM > 10 UNROLL_CODE #endif #if UNROLL_NUM > 11 UNROLL_CODE #endif #if UNROLL_NUM > 12 UNROLL_CODE #endif #if UNROLL_NUM > 13 UNROLL_CODE #endif #if UNROLL_NUM > 14 UNROLL_CODE #endif #if UNROLL_NUM > 15 UNROLL_CODE #endif #if UNROLL_NUM > 16 UNROLL_CODE #endif #if UNROLL_NUM > 17 UNROLL_CODE #endif #if UNROLL_NUM > 18 UNROLL_CODE #endif #if UNROLL_NUM > 19 UNROLL_CODE #endif #if UNROLL_NUM > 20 UNROLL_CODE #endif #if UNROLL_NUM > 21 UNROLL_CODE #endif #if UNROLL_NUM > 22 UNROLL_CODE #endif #if UNROLL_NUM > 23 UNROLL_CODE #endif #if UNROLL_NUM > 24 UNROLL_CODE #endif #if UNROLL_NUM > 25 UNROLL_CODE #endif #if UNROLL_NUM > 26 UNROLL_CODE #endif #if UNROLL_NUM > 27 UNROLL_CODE #endif #if UNROLL_NUM > 28 UNROLL_CODE #endif #if UNROLL_NUM > 29 UNROLL_CODE #endif #if UNROLL_NUM > 30 UNROLL_CODE #endif #if UNROLL_NUM > 31 UNROLL_CODE #endif #if UNROLL_NUM > 32 { int __i7; for (__i7=0; __i7 < UNROLL_NUM-32; __i7++) { UNROLL_CODE } } #endif #undef UNROLL_NUM #undef UNROLL_CODE TAÌI LIÃÛU THAM KHAÍO Mark Nelson vaì Jean-Loup Gally, The Data Copression Book, M&T Books, 1996. Ian H. Written, Alistair Moffat vaì Timothy C. Bell, Magazine Gigabyte: Compressing and Indexing Documents and Images,Van Nostrand Reinhold, New York, 1994. Ian H. Written, Alistair Moffat, Radford M. Neal, "Arithmetic Coding Revisited" 1995. Gilbert Held, Thomas R. Marshall, Data and Image Compression, John Wiley & Sons Ltd, 1996. Alistair Moffat,Timothy C.Bell vaì Ian H.Witten, "Lossless Compression for Text and Images", 1997. Phaûm Vàn Áút, Kyî Thuáût Láûp Trçnh C Cå Såí vaì Náng Cao, Nhaì Xuáút Baín Khoa Hoüc Kyî Thuáût, 1995. Phan Kim Long, Cáøm nang láûp trçnh trong C++, Nhaì Xuáút Baín Âäöng Nai, 1998.

Các file đính kèm theo tài liệu này:

thuyet minh doan tn.doc
CHUONGTRINH.ZIP